High-Risk Operations
To avoid adverse impacts on ModelArts Lite Server, follow the operation guides whenever you perform high-risk operations during routine O&M.
Risky operations fall into three levels:
- High: Such operations may cause service failures, data loss, system maintenance failures, or system resource exhaustion.
- Medium: Such operations may cause security risks and reduce service reliability.
- Low: Such operations are risky operations other than those rated high or medium.
Object | Operation | Risk | Severity | Solution |
---|---|---|---|---|
OS | Upgrade or modify the OS kernel. | The driver and kernel versions may become incompatible. As a result, the OS cannot start or basic functions become unavailable. High-risk commands are involved, such as apt-get upgrade, which upgrades all software in the system, including the kernel. Run the uname -a command to view the current kernel. | High | Contact Huawei Cloud technical support before upgrading or modifying the kernel. |
OS | Switch or reset the OS. | The EVS system ID is changed. As a result, the EVS system disk cannot be scaled out, and the message "The order is expired. The capacity cannot be expanded. Renew the order." is displayed. | Low | Mount an EVS or SFS disk for capacity expansion after you switch or reset the OS. |
OS | While the cloud server service is running properly, delete the NIC route in the system or perform disruptive network operations on the NIC, such as running ifconfig down and ifconfig up. | The network service is restarted and DHCP is triggered to obtain the IP address and route again. As a result, the NIC route may be lost and the node may become unavailable. | High | Reset the OS. Ensure that your data has been backed up. |
OS | Modify kernel parameters such as net.ipv4.ip_forward. | The route forwarding function of the ECS may be affected, causing network disconnection. | Medium | Set net.ipv4.ip_forward to 1. |
OS | Enable the system firewall. | The performance of HCCL, NCCL, and multi-node multi-PU training tasks may be affected. | Low | Disable the firewall. |
OS | Change the time zone. | The node time changes, which affects services. | Medium | Restore the time zone. |
Driver and firmware | Upgrade the NPU driver or firmware. | The driver and firmware may not match, making servers unavailable and affecting services. | Medium | Reset the OS. Ensure that your data has been backed up. |
Driver and firmware | Change the GPU driver. | The driver and firmware may not match, making servers unavailable and affecting services. | Medium | Reset the OS. Ensure that your data has been backed up. |
Driver and firmware | Change the SDI PU driver. | The NIC may become unavailable, making servers unavailable and affecting services. | Medium | Reset the OS. Ensure that your data has been backed up. |
Network | Change the NIC MAC address or IP address. | A misoperation interrupts VM communication and services and affects other services. | High | Roll back the modification. If the rollback fails, reset the OS. Ensure that your data has been backed up. |
Network | Add, delete, or edit iptables rules, or restart the iptables service. | Service access requests are rejected. | High | Roll back the modification. If the rollback fails, reset the OS. Ensure that your data has been backed up. |
Built-in OS software | Upgrade, downgrade, or uninstall built-in OS software such as Python 3. | Built-in network configuration software may become abnormal. As a result, the server NIC cannot be configured and the node becomes unavailable. | High | Roll back the modification. If the rollback fails, reset the OS. Ensure that your data has been backed up. |
Directory/File | Modify key system directories and files of root or opt, such as /etc/hccn.conf and /etc/netplan/roce.yaml. | System functions may be affected, and the cloud server may become unavailable. | High | Roll back the modification. If the rollback fails, reset the OS. Ensure that your data has been backed up. |
Directory/File | Modify the permissions of directories and files. | The service may become abnormal. | High | Roll back the modification. |
Server | Perform non-query operations on the server, such as stopping or starting it, while the server instance is being provisioned, initialized, or deleted, or while disks are being added or deleted. | Operations on the cloud server may fail. | Medium | Reset the OS. Ensure that your data has been backed up. |
Server | Switch or reset the OS. | The EVS system ID is changed. As a result, the EVS system disk cannot be scaled out, and the message "The order is expired. The capacity cannot be expanded. Renew the order." is displayed. | Low | Mount an EVS or SFS disk for capacity expansion. |
Process | Run the service network restart command, or stop key system processes such as sshd and ces-agent. | Services may fail to be provisioned, and remote access to the cloud server may fail. Data collection may also fail, affecting the reporting of monitoring metrics. | High | Restart the stopped service. |
Data disk | Modify the data disk mounting mode and mount point. | Services that are using the disk may become abnormal. | Low | Ensure that the data disk is not used by any service. |
Security group | Modify the port communication protocol, allow high-risk ports such as port 22, or leave the IP address whitelist unconfigured. | The network may be attacked, affecting services of the server. | Medium | Restore the original settings. |
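
Several solutions in the table say "Roll back the modification", which is only possible if the original state was recorded first. The following Python sketch illustrates one way to do that: it records the kernel release and the current net.ipv4.ip_forward value, and saves a copy of a configuration file before it is edited. It is an illustration under stated assumptions, not an official ModelArts tool. The function names, the backup directory, and the backup file naming are hypothetical, and the sketch assumes a Linux node where /proc/sys/net/ipv4/ip_forward is readable.

```python
import platform
import shutil
import time
from pathlib import Path

# Assumption: standard Linux procfs location of the net.ipv4.ip_forward value.
IP_FORWARD_PATH = "/proc/sys/net/ipv4/ip_forward"


def read_baseline(ip_forward_path=IP_FORWARD_PATH):
    """Collect read-only facts worth recording before a risky change."""
    try:
        ip_forward = Path(ip_forward_path).read_text().strip()
    except OSError:
        ip_forward = None  # file missing or unreadable (for example, not Linux)
    return {
        "kernel_release": platform.release(),  # same value as `uname -r`
        "ip_forward": ip_forward,
    }


def backup_file(path, backup_dir):
    """Copy a file into backup_dir and return the path of the copy.

    shutil.copy2 keeps the content, permission bits, and timestamps.
    It does not keep ownership, so record that separately if it matters.
    """
    src = Path(path)
    dest_dir = Path(backup_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical naming scheme: <name>.<timestamp>.bak
    # Two backups of the same file within one second would share a name.
    dest = dest_dir / f"{src.name}.{time.strftime('%Y%m%d%H%M%S')}.bak"
    shutil.copy2(src, dest)
    return dest


def restore_file(backup_path, original_path):
    """Roll back by copying the saved file over the modified one."""
    shutil.copy2(backup_path, original_path)
```

For example, calling `backup_file("/etc/hccn.conf", "/root/config-backup")` before editing the file returns the path of the saved copy, and `restore_file(saved_copy, "/etc/hccn.conf")` puts the original content and permission bits back. The backup directory shown here is only a placeholder. A file-level rollback does not replace the full data backup that the table requires before an OS reset.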