Updated on 2024-11-11 GMT+08:00

High-Risk Operations

Operating on ModelArts Lite Cluster resources directly from the CCE, ECS, or BMS console may cause some resource pool functions to become abnormal. Table 1 lists common risky operations.

Risky operations fall into three levels:

  • High: Such operations may cause service failures, data loss, system maintenance failures, or system resource exhaustion.
  • Medium: Such operations may cause security risks and reduce service reliability.
  • Low: Risky operations that do not fall into the high or medium level.

Table 1 Operations and risks

| Object | Operation | Risk | Severity | Solution |
|---|---|---|---|---|
| Cluster | Upgrade, modify, hibernate, or delete a cluster. | These operations may impact basic ModelArts functions, including resource pool management, node management, scaling, and driver upgrades. | High | These operations cannot be undone. |
| Node | Unsubscribe from, remove, or shut down a node, manage its taints, or switch or reinstall its OS. | These operations may impact basic ModelArts functions, including node management, scaling, and driver upgrades, and may cause data loss on local disks. | High | These operations cannot be undone. |
| Node | Modify a network security group. | This operation may impact basic ModelArts functions, including node management, scaling, and driver upgrades. | Medium | If needed, restore the original security group settings. |
| Network | Modify or delete the CIDR block associated with a cluster. | These operations impact basic ModelArts functions, including node management, scaling, and driver upgrades. | High | These operations cannot be undone. |
| Plug-in | Upgrade or uninstall the gpu-beta plug-in. | The GPU driver may be abnormal. | Medium | Roll back the version and reinstall the plug-in. |
| Plug-in | Upgrade or uninstall the huawei-npu plug-in. | The NPU driver may be abnormal. | Medium | Roll back the version and reinstall the plug-in. |
| Plug-in | Upgrade or uninstall the volcano plug-in. | Job scheduling may be abnormal. | Medium | Roll back the version and reinstall the plug-in. |
| Plug-in | Uninstall the ICAgent plug-in. | Logging and monitoring may be abnormal. | Medium | Roll back the version and reinstall the plug-in. |
| Helm | Upgrade, roll back, or uninstall os-node-agent. | Driver upgrades, fault detection, metric collection, and node O&M are abnormal. | High | Contact Huawei Cloud technical support to reinstall os-node-agent. |
| Helm | Upgrade, roll back, or uninstall rdma-sriov-dev-plugin. | The use of RDMA NICs in containers may be affected. | High | Contact Huawei Cloud technical support to reinstall rdma-sriov-dev-plugin. |
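
Teams that script changes against these resources may want a pre-flight check that consults the table before acting. The following is a minimal sketch only: the key names (such as `cluster.delete`) and the `requires_confirmation` helper are hypothetical and are not part of any ModelArts or CCE API. The severity values are transcribed from Table 1, and only a few rows are shown.

```python
from enum import Enum


class Severity(Enum):
    HIGH = "High"
    MEDIUM = "Medium"
    LOW = "Low"


# A few rows from Table 1. The dictionary keys are hypothetical short names
# chosen for this sketch; the severities come from the table.
RISK_TABLE = {
    "cluster.delete": Severity.HIGH,
    "node.reinstall_os": Severity.HIGH,
    "node.modify_security_group": Severity.MEDIUM,
    "network.modify_cidr": Severity.HIGH,
    "plugin.volcano.upgrade": Severity.MEDIUM,
    "helm.os-node-agent.uninstall": Severity.HIGH,
}


def requires_confirmation(operation: str) -> bool:
    """Return True if the operation is high severity or not listed.

    Unlisted operations are treated conservatively, because their
    impact on the resource pool is unknown.
    """
    severity = RISK_TABLE.get(operation)
    return severity is None or severity is Severity.HIGH


if __name__ == "__main__":
    for op in ("cluster.delete", "plugin.volcano.upgrade", "node.drain"):
        print(op, "needs confirmation:", requires_confirmation(op))
```

Treating unlisted operations as requiring confirmation is a design choice of this sketch, not a rule stated in the table.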