How Do I Locate and Rectify a Node Fault in a Cluster Resource Pool?
Fault Description and Handling Suggestions

For a ModelArts Lite resource pool, the node-agent component is deployed on each node in DaemonSet mode. This component detects the node status and writes the detection result to the Kubernetes NodeCondition. In addition, node fault metrics are reported to AOM by default. You can configure alarm notifications on AOM.
If a node is abnormal, use Table 1 to determine whether the node is subhealthy and to take initial corrective action. If the node is not subhealthy, contact your customer manager to initiate the repair process. If no customer manager is available, submit a service ticket.
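Because the detection results are written to the Kubernetes NodeCondition, they can be read from the node object. The following is a minimal sketch, not an official tool: it assumes the node JSON comes from `kubectl get node <name> -o json`, that node-agent condition types start with `NT_`, and that a `status` of `"True"` means the fault is present. The last assumption is not stated on this page and should be verified on a real node. The sample data and the helper name `faulty_conditions` are hypothetical.

```python
import json


def faulty_conditions(node: dict, prefix: str = "NT_") -> list[dict]:
    """Return node-agent conditions that are currently reported as faulty.

    Assumption: a condition with status "True" indicates that the fault is present.
    """
    conditions = node.get("status", {}).get("conditions", [])
    return [
        {"type": c["type"], "message": c.get("message", "")}
        for c in conditions
        if c.get("type", "").startswith(prefix) and c.get("status") == "True"
    ]


# Hypothetical sample shaped like `kubectl get node <name> -o json` output.
sample_node = json.loads("""
{
  "metadata": {"name": "node-1"},
  "status": {"conditions": [
    {"type": "Ready", "status": "True", "message": "kubelet is posting ready status"},
    {"type": "NT_NPU_CARD_LOSE", "status": "True", "message": "sample message"},
    {"type": "NT_NET_NTP_CHECK", "status": "False", "message": ""}
  ]}
}
""")

for cond in faulty_conditions(sample_node):
    print(cond["type"], cond["message"])
```

The condition types returned by this helper can then be looked up in Table 1.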
NodeCondition Type | Category | Sub-Category | Description | Detection Method | Solution |
---|---|---|---|---|---|
NT_NPU_DEVICE | NPU | Others | The NPU DCMI device is abnormal. | Check whether the NPU device is abnormal and the Ascend DCMI API returns a major or critical alarm. | The node may be subhealthy. You are advised to restart the node first. If the fault persists after the node is restarted, initiate the repair process. |
NT_NPU_NET | NPU | Link | The NPU DCMI network is abnormal. | Check for NPU network connection exceptions. | The node may be subhealthy. You are advised to restart the node first. If the fault persists after the node is restarted, initiate the repair process. |
NT_NPU_CARD_LOSE | NPU | Disconnected card | The NPU is disconnected. | Check whether the number of NPUs in the node flavor is inconsistent with the number of schedulable NPUs on the Kubernetes node. | The node may be subhealthy. You are advised to restart the node first. If the fault persists after the node is restarted, initiate the repair process. |
NT_NPU_OTHER | NPU | Others | Other NPU errors. | Check whether other NPU errors exist. Such errors usually require assistance from technical support. | Initiate the repair process. |
NT_NPU_ECC_COUNT | NPU | Graphics memory | The number of NPU ECC errors reaches the repair threshold. | Check whether the number of HBM multi-bit ECC isolation addresses on the NPU reaches 64. | Initiate the repair process. |
NT_NET_NTP_CHECK | Runtime | Others | The NTP service is abnormal. | Check whether the ntpd or chronyd service is abnormal. | Initiate the repair process. |
NT_KUBE_DISK_READONLY_CHECK | Runtime | Others | The kubelet hard disk is read-only. | Check whether the following directory is read-only: /mnt/paas/kubernetes/kubelet | Initiate the repair process. |
NT_GPU_SMI_ECC_CHECK | GPU | Graphics memory | GPU ECC error. | Run the nvidia-smi -a command and check whether Pending Page Blacklist is Yes or the multi-bit register file is greater than 0. For Ampere GPUs, check whether any of the following situations occurs: (For details, see NVIDIA GPU Memory Error Management.) The Ampere architecture has the following levels of GPU memory errors: | The node may be subhealthy. You are advised to restart the node first. If the fault persists after the node is restarted, initiate the repair process. |
NT_GPU_SMI_ERROR | GPU | Others | The nvidia-smi output contains ERR!. | Run nvidia-smi -a and check whether the output contains ERR!. Normally, such errors are caused by hardware faults, such as a faulty power supply or fan. | Initiate the repair process. |
NT_GPU_SMI_RUNTIME | GPU | Others | The nvidia-smi command times out or does not exist. | Check whether the exit code of nvidia-smi is non-zero. | Initiate the repair process. |
NT_GPU_SMI_ECC_COUNT | GPU | Graphics memory | ECC errors have occurred 64 times. | Run the nvidia-smi -a command, locate Retired Pages, and check whether the sum of Single Bit and Double Bit is greater than 64. | Initiate the repair process. |
NT_GPU_CARD_LOSE | GPU | Disconnected card | The GPU is disconnected. | Check whether the number of GPUs in the node flavor is different from any of the following values: | Initiate the repair process. |
NT_GPU_SMI_INFOROM_ERROR | GPU | Others | An infoROM alarm is generated. | Run the nvidia-smi command and check whether the output contains the alarm "infoROM is corrupted". | Initiate the repair process. |
NT_GPU_OTHER | GPU | Others | Other GPU errors. | Check whether other GPU errors exist. Normally, such errors are caused by faulty hardware. Contact technical support. | Initiate the repair process. |
NT_NET_IB_CHECK | IB | Link | The InfiniBand NIC is abnormal. | Run the ibstat command and check whether the NIC is not in the active state. | The node may be subhealthy. You are advised to restart the node first. If the fault persists after the node is restarted, initiate the repair process. |
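The Solution column of Table 1 falls into two groups: restart the node first, or initiate the repair process directly. As a rough illustration, that grouping could be encoded as a lookup. The function name `suggested_action` and the returned strings are hypothetical, and the groups only restate the table above.

```python
# Condition types for which Table 1 suggests restarting the node first.
RESTART_FIRST = {
    "NT_NPU_DEVICE", "NT_NPU_NET", "NT_NPU_CARD_LOSE",
    "NT_GPU_SMI_ECC_CHECK", "NT_NET_IB_CHECK",
}

# Condition types for which Table 1 suggests initiating the repair process directly.
REPAIR = {
    "NT_NPU_OTHER", "NT_NPU_ECC_COUNT", "NT_NET_NTP_CHECK",
    "NT_KUBE_DISK_READONLY_CHECK", "NT_GPU_SMI_ERROR", "NT_GPU_SMI_RUNTIME",
    "NT_GPU_SMI_ECC_COUNT", "NT_GPU_CARD_LOSE", "NT_GPU_SMI_INFOROM_ERROR",
    "NT_GPU_OTHER",
}


def suggested_action(condition_type: str) -> str:
    """Map a NodeCondition type to the handling suggestion listed in Table 1."""
    if condition_type in RESTART_FIRST:
        return "restart the node; initiate repair if the fault persists"
    if condition_type in REPAIR:
        return "initiate the repair process"
    return "unknown condition type"


for name in ("NT_NET_IB_CHECK", "NT_GPU_SMI_ERROR"):
    print(name, "->", suggested_action(name))
```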
Some fault modes can be detected by hardware alarm monitoring on the Huawei Cloud O&M platform. Table 2 describes the fault definitions and handling suggestions. In addition, an AOM event is reported by default when such a fault occurs. You can configure alarm notifications on AOM.
Error Code | Category | Sub-Category | Description | Detection Method | Solution |
---|---|---|---|---|---|
A050804 | Hardware fault | Hardware fault | Detected through hardware alarms. | Detected through hardware alarms. | Authorize O&M in Event Center. For details, see Authorizing O&M on the Event Center Page. |
A050202 | Runtime | Others | The Kubernetes node is not ready. | Log in to the CCE cluster and check the status of the node where the alarm is generated. | After confirming the exception, set the node to unschedulable and schedule the service pods to another node. |
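For error code A050202, setting a node to unschedulable and moving its pods is commonly done with `kubectl cordon` and `kubectl drain`. The sketch below only builds and prints the commands and does not run them. It assumes kubectl access to the CCE cluster, and the node name `node-1` and the helper `evacuation_commands` are hypothetical. The drain flags are a choice made for this example, not something this page prescribes: `--ignore-daemonsets` lets the drain proceed past DaemonSet pods such as node-agent, and `--delete-emptydir-data` discards emptyDir data on the node.

```python
import shlex


def evacuation_commands(node_name: str) -> list[list[str]]:
    """Build kubectl commands that cordon a node and then evict its pods."""
    return [
        # Mark the node unschedulable so that no new pods land on it.
        ["kubectl", "cordon", node_name],
        # Evict running pods; their controllers recreate them on other nodes.
        ["kubectl", "drain", node_name, "--ignore-daemonsets", "--delete-emptydir-data"],
    ]


for cmd in evacuation_commands("node-1"):
    print(shlex.join(cmd))
```

Pods that are not managed by a controller are not recreated elsewhere after eviction, so review the workloads on the node before draining it.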
Configuring Node Metric Alarm Notifications
By default, the node fault metric (nt_npg) is reported to AOM. You can configure notifications, such as SMS or email, on AOM.

The following steps are performed on AOM 2.0.
An nt_npg sample whose value is 2 represents an invalid value. The expression nt_npg{type="NT_NPU_CARD_LOSE"} != 2 filters out those invalid samples.
- Log in to the AOM console.
- In the navigation pane, choose Alarm Center > Alarm Rules. Then, click Create Alarm Rule.
- Set an alarm rule. (NPU disconnection is used as an example.)
  - Rule Type: Select Metric alarm rule.
  - Configuration Mode: Select PromQL.
  - Default Rule: Select Custom and enter the following information in the command input box:
    sum(nt_npg{type="NT_NPU_CARD_LOSE"} != 2) by (cluster_name, node_ip, type)
  - Alarm Rule Details > Duration: Select 1 minute, which triggers a major alarm when the rule's conditions persist continuously for one minute.
  - (Optional) Alarm Notification: To get notified of alarms by email or SMS message, configure action rules for the alarm rule. If no action rule is available, you can create one.
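To make the PromQL expression in the alarm rule easier to read, the sketch below reproduces its three steps in plain Python: select samples by the `type` label, drop samples whose value equals 2, and sum the rest per (cluster_name, node_ip, type). It does not query AOM. The sample data, label values, and the function name `evaluate_rule` are hypothetical, and this page does not define what sample values other than 2 mean.

```python
from collections import defaultdict


def evaluate_rule(samples, fault_type="NT_NPU_CARD_LOSE", invalid_value=2):
    """Mimic: sum(nt_npg{type="<fault_type>"} != 2) by (cluster_name, node_ip, type)."""
    totals = defaultdict(float)
    for labels, value in samples:
        if labels.get("type") != fault_type:
            continue  # label matcher: {type="..."}
        if value == invalid_value:
            continue  # comparison filter: != 2 drops invalid samples
        key = (labels.get("cluster_name"), labels.get("node_ip"), labels.get("type"))
        totals[key] += value  # aggregation: sum(...) by (...)
    return dict(totals)


# Hypothetical nt_npg samples as (labels, value) pairs.
samples = [
    ({"cluster_name": "pool-a", "node_ip": "192.168.0.10", "type": "NT_NPU_CARD_LOSE"}, 1),
    ({"cluster_name": "pool-a", "node_ip": "192.168.0.11", "type": "NT_NPU_CARD_LOSE"}, 2),
    ({"cluster_name": "pool-a", "node_ip": "192.168.0.10", "type": "NT_GPU_CARD_LOSE"}, 1),
]

print(evaluate_rule(samples))
```

In this sample, only the first series survives: the second is dropped as invalid and the third has a different `type` label.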
Configuring Node Event Alarm Notifications
AOM automatically reports node fault events by default. You can set up SMS or email notifications through AOM.

The following steps are performed on AOM 2.0.
- Log in to the AOM console.
- In the navigation pane, choose Alarm Center > Alarm Rules. Then, click Create Alarm Rule in the upper right corner.
- Set an alarm rule. (Error code A050804 is used as an example.)
  - Rule Type: Select Event alarm rule.
  - Event Type: Select System.
  - Event Source: Select ModelArts.
  - Monitored Object: Filter monitored objects by custom attributes. The event format is code=${Error code}.
    This example uses code=A050804, with Trigger Mode set to Immediate Trigger.
  - Alarm Mode: Select Direct alarm reporting.
  - (Optional) Alarm Notification: To get notified of alarms by email or SMS message, configure action rules for the alarm rule. If no action rule is available, you can create one.