Updated on 2025-08-18 GMT+08:00

High-Risk Operations

To avoid adverse impacts on ModelArts Lite Server, follow the operation guides when you perform high-risk operations during routine O&M.

Risky operations fall into three levels:

  • High: Such operations may cause service failures, data loss, system maintenance failures, or system resource exhaustion.
  • Medium: Such operations may introduce security risks or reduce service reliability.
  • Low: All other high-risk operations that are not rated high or medium.

Table 1 High-risk operations

Object

Operation

Risk

Severity

Solution

OS

Upgrade or modify the OS kernel.

The driver and kernel versions may become incompatible. As a result, the OS may fail to start or basic functions may be unavailable. High-risk commands include apt-get upgrade, which upgrades all software on the system, including the kernel.

Run the uname -a command to view the current kernel.

High

To perform upgrade or modification, contact Huawei Cloud technical support.

Switch or reset OS.

The ID of the EVS system disk changes. As a result, the EVS system disk cannot be scaled out, and the message "The order is expired. The capacity cannot be expanded. Renew the order." is displayed.

Low

Mount an EVS or SFS disk for capacity expansion after you switch or reset the OS.

While the cloud server is running properly, delete the NIC route in the system or perform disruptive network operations on the NIC, such as running ifconfig down and ifconfig up.

The network service will be restarted and DHCP will be triggered to obtain the IP address and route again. As a result, the NIC route may be lost and the node may be unavailable.

High

Reset the OS. Ensure that your data has been backed up.

Modify kernel parameters such as net.ipv4.ip_forward.

The route forwarding function of the ECS may be affected, causing network disconnection.

Medium

Set net.ipv4.ip_forward to 1.

Enable the system firewall.

The performance of HCCL, NCCL, and multi-node multi-PU training tasks may be affected.

Low

Disable the firewall.

Change the time zone.

The node time changes, which will affect services.

Medium

Restore the time zone.

Driver and firmware

Upgrade the NPU driver or firmware.

The driver and firmware versions may not match, which can make the server unavailable and affect services.

Medium

Reset the OS. Ensure that your data has been backed up.

Change the GPU driver.

The driver and firmware versions may not match, which can make the server unavailable and affect services.

Medium

Reset the OS. Ensure that your data has been backed up.

Change the SDI PU driver.

The NIC may become unavailable, which can make the server unavailable and affect services.

Medium

Reset the OS. Ensure that your data has been backed up.

Network

Change the NIC MAC address or IP address.

An incorrect change interrupts VM communication and services, and may affect other services.

High

Roll back the modification. If the rollback fails, reset the OS. Ensure that your data has been backed up.

Add, delete, or edit iptables rules, or restart the iptables service.

Service access requests are rejected.

High

Roll back the modification. If the rollback fails, reset the OS. Ensure that your data has been backed up.

Built-in OS software

Upgrade, downgrade, or uninstall built-in OS software such as Python 3.

Built-in network configuration software may stop working properly. As a result, the server NIC cannot be configured and the node becomes unavailable.

High

Roll back the modification. If the rollback fails, reset the OS. Ensure that your data has been backed up.

Directory/File

Modify key system directories and files of root or opt, such as /etc/hccn.conf and /etc/netplan/roce.yaml.

The system functions may be affected, and the cloud server may be unavailable.

High

Roll back the modification. If the rollback fails, reset the OS. Ensure that your data has been backed up.

Modify the permissions of directories and files.

Services may become abnormal.

High

Roll back the modification.

Server

Perform non-query operations on the server, such as stopping or starting it, while the server instance is being provisioned, initialized, or deleted, or while disks are being added or deleted.

Operations on the cloud server may fail.

Medium

Reset the OS. Ensure that your data has been backed up.

Switch or reset OS.

The ID of the EVS system disk changes. As a result, the EVS system disk cannot be scaled out, and the message "The order is expired. The capacity cannot be expanded. Renew the order." is displayed.

Low

Mount an EVS or SFS disk for capacity expansion.

Process

Run the service network restart command.

Stop key system processes, such as sshd and ces-agent.

Services may fail to be provisioned, and remote access to the cloud server may fail.

In addition, monitoring data may not be collected, which affects the reporting of monitoring metrics.

High

Restart the stopped service.

Data disk

Modify the data disk mounting mode and mount point.

Services that are using the data disk may become abnormal.

Low

Before making the change, ensure that no service is using the data disk.

Security group

Modify the port communication protocol.

Allow high-risk ports such as port 22.

Leave the IP address whitelist unconfigured.

The network may be attacked, affecting services of the server.

Medium

Restore the original security group configuration.
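
Several solutions in Table 1 rely on rolling back a modification, which is only possible if the original state was recorded first. The following is a minimal Python sketch of such a pre-change snapshot, not a ModelArts tool. It records the kernel release, reads net.ipv4.ip_forward without changing it, and copies configuration files to a backup directory. The function names and the backup directory are hypothetical. The procfs path /proc/sys/net/ipv4/ip_forward is the standard Linux location for this parameter. Reading files under /etc may require root privileges, and shutil.copy2 preserves mode bits and timestamps but not file ownership.

```python
import platform
import shutil
from pathlib import Path

# Standard Linux procfs location of net.ipv4.ip_forward (read-only use here).
IP_FORWARD = Path("/proc/sys/net/ipv4/ip_forward")


def backup_files(files, backup_dir):
    """Copy each existing file into backup_dir; return {original: backup copy}."""
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    saved = {}
    for index, name in enumerate(files):
        src = Path(name)
        if src.is_file():  # files that do not exist are skipped
            # Prefix with an index so files with the same name do not collide.
            dst = backup_dir / f"{index}_{src.name}"
            shutil.copy2(src, dst)
            saved[str(src)] = str(dst)
    return saved


def restore_files(saved):
    """Roll back by copying every backup over its original path."""
    for original, backup in saved.items():
        shutil.copy2(backup, original)


def snapshot(files, backup_dir, ip_forward_path=IP_FORWARD):
    """Record the kernel release, the ip_forward value, and file backups."""
    ip_forward_path = Path(ip_forward_path)
    ip_forward = None
    if ip_forward_path.is_file():
        ip_forward = ip_forward_path.read_text().strip()
    return {
        "kernel_release": platform.release(),  # same string as `uname -r`
        "ip_forward": ip_forward,
        "backups": backup_files(files, backup_dir),
    }
```

As a usage example, calling snapshot(["/etc/hccn.conf", "/etc/netplan/roce.yaml"], "/root/pre_change_backup") before a change returns a dictionary whose "backups" entry can later be passed to restore_files to roll the files back. The backup path in this call is an assumption; choose a location on a disk that the change will not affect.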