Updated on 2025-08-22 GMT+08:00

How Do I Locate and Rectify a Node Fault in a Cluster Resource Pool?

Fault Description and Handling Suggestions

Figure 1 Troubleshooting process

For a ModelArts Lite resource pool, the node-agent component is deployed on each node in DaemonSet mode. This component detects the node status and writes the detection result to the Kubernetes NodeCondition. In addition, node fault metrics are reported to AOM by default. You can configure alarm notifications on AOM.
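
Because the detection results are written to the node's conditions, they can be read back from the Kubernetes API. The following is a minimal sketch, not an official tool. It assumes that the fault conditions use the NT_ prefix shown in Table 1 and that a status of "True" means the fault is present. The function name and the sample JSON are hypothetical; the JSON mimics a trimmed `kubectl get node <node-name> -o json` output.

```python
import json


def active_fault_conditions(node: dict) -> list[str]:
    """Return the NT_* condition types whose status is "True".

    Assumption: node-agent marks a detected fault with status "True".
    """
    conditions = node.get("status", {}).get("conditions", [])
    return [
        c["type"]
        for c in conditions
        if c.get("type", "").startswith("NT_") and c.get("status") == "True"
    ]


# Hypothetical, trimmed output of `kubectl get node <node-name> -o json`.
sample_node = json.loads("""
{
  "metadata": {"name": "node-1"},
  "status": {
    "conditions": [
      {"type": "Ready", "status": "True"},
      {"type": "NT_NPU_CARD_LOSE", "status": "True"},
      {"type": "NT_NET_NTP_CHECK", "status": "False"}
    ]
  }
}
""")

print(active_fault_conditions(sample_node))  # ['NT_NPU_CARD_LOSE']
```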

If a node is abnormal, use Table 1 to determine whether the node is subhealthy and to take the initial corrective action. If the node is not subhealthy, contact your customer manager to initiate the repair process. If no customer manager is available, submit a service ticket.

**Table 1** Node fault metrics
NodeCondition Type	Category	Sub-Category	Description	Detection Method	Solution
NT_NPU_DEVICE	NPU	Others	The NPU DCMI device is abnormal.	Check whether the NPU device is abnormal and the Ascend DCMI API returns a major or critical alarm.	The node may be subhealthy. You are advised to restart the node first. If the fault persists after the node is restarted, initiate the repair process.
NT_NPU_NET	NPU	Link	The NPU DCMI network is abnormal.	Check for NPU network connection exceptions.	The node may be subhealthy. You are advised to restart the node first. If the fault persists after the node is restarted, initiate the repair process.
NT_NPU_CARD_LOSE	NPU	Disconnected card	The NPU is disconnected.	Check whether the number of NPUs in the node flavor is inconsistent with the number of schedulable NPUs on the Kubernetes node.	The node may be subhealthy. You are advised to restart the node first. If the fault persists after the node is restarted, initiate the repair process.
NT_NPU_OTHER	NPU	Others	Other NPU errors.	Check whether other NPU errors exist. Such errors usually need to be corrected with assistance from technical support.	Initiate the repair process.
NT_NPU_ECC_COUNT	NPU	Graphics memory	The number of NPU ECC errors reaches the repair threshold.	Check whether the number of HBM multi-bit ECC isolation addresses on the NPU reaches 64.	Initiate the repair process.
NT_NET_NTP_CHECK	Runtime	Others	The NTP is abnormal.	Check whether the NTPD or Chronyd service is abnormal.	Initiate the repair process.
NT_KUBE_DISK_READONLY_CHECK	Runtime	Others	The Kubelet hard disk is read-only.	Check whether the following directory is read-only: /mnt/paas/kubernetes/kubelet	Initiate the repair process.
NT_GPU_SMI_ECC_CHECK	GPU	Graphics memory	GPU ECC error.	Run the nvidia-smi -a command and check whether Pending Page Blacklist is Yes or the multi-bit Register File count is greater than 0. For Ampere GPUs, check whether any of the following occurs: (1) an uncorrectable SRAM error; (2) Remapping Failure records; (3) Xid 95 events in dmesg. (For details, see NVIDIA GPU Memory Error Management.) The Ampere architecture has the following levels of GPU memory errors: L1: Correctable single-bit ECC errors. They do not affect running services. To check for them, run the nvidia-smi -a command and look for Volatile Correctable. L2: Uncorrectable multi-bit ECC errors. They cause running services to fail, and the process must be restarted to recover. To check for them, run the nvidia-smi -a command and look for Volatile Uncorrectable. L3: Uncontained errors that may affect other services. They require a card reset or a node reboot to clear. To check for them, look for Xid events that contain the number 95. (The Remapped Pending records are for reference only. Reset the cards when the service is idle to trigger the remapping process.) L4: Errors that require a card replacement. To check for them, check whether the SRAM Uncorrectable field is greater than 4 or the Remapped Failed field is not zero.	The node may be subhealthy. You are advised to restart the node first. If the fault persists after the node is restarted, initiate the repair process.
NT_GPU_SMI_ERROR	GPU	Others	The nvidia-smi output contains ERR.	Run nvidia-smi -a and check whether the output contains ERR!. Normally, such errors are caused by hardware faults, such as a faulty power supply or fan.	Initiate the repair process.
NT_GPU_SMI_RUNTIME	GPU	Others	The nvidia-smi command times out or does not exist.	Check whether the exit code of nvidia-smi is not 0.	Initiate the repair process.
NT_GPU_SMI_ECC_COUNT	GPU	Graphics memory	The ECC error has occurred 64 times.	Run the nvidia-smi -a command, locate Retired Pages, and check whether the sum of Single Bit and Double Bit is greater than 64.	Initiate the repair process.
NT_GPU_CARD_LOSE	GPU	Disconnected card	The GPU is disconnected.	Check whether the number of GPUs in the node flavor is different from any of the following values: (1) the number of GPUs visible to lspci; (2) the number of cards visible to nvidia-smi; (3) the number of schedulable cards on the Kubernetes node.	Initiate the repair process.
NT_GPU_SMI_INFOROM_ERROR	GPU	Others	An infoROM alarm is generated.	Run the nvidia-smi command and check whether the output contains the alarm "infoROM is corrupted".	Initiate the repair process.
NT_GPU_OTHER	GPU	Others	Other GPU errors.	Check whether other GPU errors exist. Normally, such errors are caused by faulty hardware. Contact technical support.	Initiate the repair process.
NT_NET_IB_CHECK	IB	Link	The InfiniBand NIC is abnormal.	Run the ibstat command and check whether the NIC is not in the active state.	The node may be subhealthy. You are advised to restart the node first. If the fault persists after the node is restarted, initiate the repair process.
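
The Solution column of Table 1 reduces to two cases: restart the node first, or initiate the repair process directly. The sketch below encodes that column as a lookup so that a script can suggest the next step for a given NodeCondition type. The set names, the function name, and the returned strings are illustrative assumptions; only the grouping of condition types comes from Table 1.

```python
# Condition types for which Table 1 suggests restarting the node first.
RESTART_FIRST = {
    "NT_NPU_DEVICE",
    "NT_NPU_NET",
    "NT_NPU_CARD_LOSE",
    "NT_GPU_SMI_ECC_CHECK",
    "NT_NET_IB_CHECK",
}

# Condition types for which Table 1 suggests initiating the repair process.
REPAIR_DIRECTLY = {
    "NT_NPU_OTHER",
    "NT_NPU_ECC_COUNT",
    "NT_NET_NTP_CHECK",
    "NT_KUBE_DISK_READONLY_CHECK",
    "NT_GPU_SMI_ERROR",
    "NT_GPU_SMI_RUNTIME",
    "NT_GPU_SMI_ECC_COUNT",
    "NT_GPU_CARD_LOSE",
    "NT_GPU_SMI_INFOROM_ERROR",
    "NT_GPU_OTHER",
}


def suggested_action(condition_type: str, already_restarted: bool = False) -> str:
    """Return the next step that Table 1 suggests for a condition type."""
    if condition_type in RESTART_FIRST:
        # A possibly subhealthy node is restarted once; if the fault
        # persists after the restart, the repair process is initiated.
        return "initiate repair" if already_restarted else "restart node"
    if condition_type in REPAIR_DIRECTLY:
        return "initiate repair"
    raise ValueError(f"unknown NodeCondition type: {condition_type}")


print(suggested_action("NT_NET_IB_CHECK"))                          # restart node
print(suggested_action("NT_NET_IB_CHECK", already_restarted=True))  # initiate repair
print(suggested_action("NT_GPU_CARD_LOSE"))                         # initiate repair
```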

Some fault modes can be detected by hardware alarm monitoring on the Huawei Cloud O&M platform. Table 2 describes the fault definitions and handling suggestions. In addition, an AOM event is reported by default when such a fault occurs. You can configure alarm notifications on AOM.

**Table 2** Node fault events
Error Code	Category	Sub-Category	Description	Detection Method	Solution
A050804	Hardware fault	Hardware fault	Detected through hardware alarms.	Detected through hardware alarms.	Authorize O&M in Event Center. For details, see Authorizing O&M on the Event Center Page.
A050202	Runtime	Other	Kubernetes node not ready	Log in to the CCE cluster and check the status of the node where the alarm is generated.	After confirming the exception, set the node to unschedulable and schedule the service pods to another node.
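
For A050202, one common way to set a node to unschedulable and move its pods elsewhere is `kubectl cordon` followed by `kubectl drain`. The sketch below only builds the command strings. It is an illustration under assumptions, not the procedure prescribed by this page: the function name is hypothetical, and `--ignore-daemonsets` is included because DaemonSet pods such as node-agent would otherwise block the drain.

```python
import shlex


def isolation_commands(node_name: str) -> list[str]:
    """Build kubectl commands that cordon and then drain a node."""
    quoted = shlex.quote(node_name)  # guard against unsafe shell characters
    return [
        # Mark the node as unschedulable so that no new pods land on it.
        f"kubectl cordon {quoted}",
        # Evict the running pods so that their controllers reschedule them.
        f"kubectl drain {quoted} --ignore-daemonsets",
    ]


for command in isolation_commands("node-1"):
    print(command)
```

Review the commands before running them. Pods that use emptyDir volumes or that are not managed by a controller may need additional drain options.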

Configuring Node Metric Alarm Notifications

By default, the node fault metric (nt_npg) is reported to AOM. You can configure notifications, such as SMS or email, on AOM.

The following steps are performed on AOM 2.0.

An nt_npg sample whose value is 2 represents an invalid value. The != 2 comparison in the expression nt_npg{type="NT_NPU_CARD_LOSE"} != 2 filters out those invalid samples.

Log in to the AOM console.
In the navigation pane, choose Alarm Center > Alarm Rules. Then, click Create Alarm Rule.
Set an alarm rule. (NPU disconnection is used as an example.)
- Rule Type: Select Metric alarm rule.
- Configuration Mode: Select PromQL.
- Default Rule: Select Custom and enter the following information in the command input box:
```
sum(nt_npg{type="NT_NPU_CARD_LOSE"} != 2) by (cluster_name, node_ip, type)
```
- Alarm Rule Details > Duration: Select 1 minute, which triggers a major alarm when the rule's conditions persist continuously for one minute.
- (Optional) Alarm Notification: To get notified of alarms by email or SMS message, configure action rules for the alarm rule. If no action rule is available, you can create one.
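
The NPU disconnection rule can be adapted to other fault types by changing the type label. The helper below is a small sketch that generates an expression of the same shape for any NodeCondition type. It assumes that the type label of nt_npg takes the NodeCondition Type values from Table 1, as it does for NT_NPU_CARD_LOSE. The function name is hypothetical.

```python
def node_fault_promql(condition_type: str) -> str:
    """Build a PromQL expression for one node fault type.

    Samples with the value 2 are invalid and are filtered out by `!= 2`.
    """
    return (
        f'sum(nt_npg{{type="{condition_type}"}} != 2) '
        "by (cluster_name, node_ip, type)"
    )


print(node_fault_promql("NT_GPU_CARD_LOSE"))
# sum(nt_npg{type="NT_GPU_CARD_LOSE"} != 2) by (cluster_name, node_ip, type)
```

Paste the generated expression into the command input box when creating the alarm rule.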

Configuring Node Event Alarm Notifications

By default, node fault events are reported to AOM. You can configure notifications, such as SMS or email, on AOM.

The following steps are performed on AOM 2.0.

Log in to the AOM console.
In the navigation pane, choose Alarm Center > Alarm Rules. Then, click Create Alarm Rule in the upper right corner.
Set an alarm rule (using error code A050804 as an example).
- Rule Type: Select Event alarm rule.
- Event Type: Select System.
- Event Source: Select ModelArts.
- Monitored Object: Filter monitored objects by custom attributes. The attribute format is code=${Error code}.
  This example uses code=A050804, with Trigger Mode set to Immediate Trigger.
- Alarm Mode: Select Direct alarm reporting.
- (Optional) Alarm Notification: To get notified of alarms by email or SMS message, configure action rules for the alarm rule. If no action rule is available, you can create one.