Updated on 2025-08-22 GMT+08:00

Faulty Nodes in a Standard Resource Pool

Locating Faulty Nodes

In a Standard resource pool, ModelArts adds a taint to a faulty Kubernetes node so that jobs are not scheduled to the tainted node. The following table lists the faults that can be detected. You can locate a fault by referring to its isolation code and detection method.
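For example, a minimal way to see whether a node has been tainted is to check it with kubectl (a sketch; <node-name> is a placeholder, and the exact taint key that ModelArts applies is not shown here):

  # List nodes and their overall status
  kubectl get nodes
  # Show the taints applied to a specific node
  kubectl describe node <node-name> | grep -A 3 "Taints"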

Table 1 Isolation codes

Each entry below gives the isolation code, its category and sub-category in parentheses, the fault description, and the detection method.

A050101 (GPU / Video RAM): GPU ECC error.

Detection method: Run the nvidia-smi -a command and check whether Pending Page Blacklist is Yes or whether the register file multi-bit error count is greater than 0. For Ampere GPUs, check whether any of the following occurs (a command sketch follows the list below):

  • Uncorrectable SRAM error
  • Remapping Failure records
  • Xid 95 events in dmesg

The Ampere architecture has the following levels of GPU memory errors:

  • L1: These are single-bit ECC errors that can be corrected. They do not affect the running services. To check for these errors, run the nvidia-smi -a command and look for Volatile Correctable.
  • L2: These are multi-bit ECC errors that cannot be corrected. They cause the running services to fail and require a process restart to recover. To check for these errors, run the nvidia-smi -a command and look for Volatile Uncorrectable.
  • L3: These are unsuppressed errors and may affect other services. They require a card reset or a node reboot to clear. To check for these errors, look for the Xid events that contain the number 95. (The Remapped Pending records are only for reference. You need to reset the cards when the service is idle to trigger the remapping process.)
  • L4: These are errors that require a card replacement. To check for these errors, look for the SRAM Uncorrectable field that is greater than 4 or the Remapped Failed field that is not zero.
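The following commands are a minimal sketch of these checks (field names and sections in the nvidia-smi output vary with driver version, so treat the grep patterns as assumptions):

  # Pre-Ampere GPUs: pending page blacklist status
  nvidia-smi -a | grep -i "Pending Page Blacklist"
  # Ampere GPUs: SRAM uncorrectable counters and row-remapping records
  nvidia-smi -a | grep -i -E "SRAM Uncorrectable|Remapp"
  # Xid 95 events reported by the driver
  dmesg | grep -i "Xid" | grep 95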

A050102 (GPU / Other): ERR! appears in the nvidia-smi output.

Detection method: Run nvidia-smi -a and check whether the output contains ERR!. Such errors are normally caused by hardware faults, such as a faulty power supply or fan.
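For example, a quick check might look like this (sketch):

  # Count occurrences of ERR! in the full nvidia-smi report; a non-zero count indicates a fault
  nvidia-smi -a | grep -c "ERR!"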

A050103 (GPU / Other): The nvidia-smi command times out or is not found.

Detection method: Check whether the exit code of nvidia-smi is non-zero.
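For example (sketch; the 30-second timeout is an arbitrary value):

  # A non-zero exit code indicates that nvidia-smi hung, failed, or is missing
  timeout 30 nvidia-smi > /dev/null 2>&1
  echo "nvidia-smi exit code: $?"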

A050104 (GPU / Video RAM): ECC errors have occurred 64 times.

Detection method: Run the nvidia-smi -a command, locate Retired Pages, and check whether the sum of Single Bit and Double Bit is greater than 64.
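A minimal sketch of this check, assuming the driver exposes the page retirement section (output fields may differ across driver versions):

  # Show retired page counters for all GPUs
  nvidia-smi -q -d PAGE_RETIREMENT | grep -i -E "Single Bit|Double Bit"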

A050148 (GPU / Other): An infoROM alarm is generated.

Detection method: Run the nvidia-smi command and check whether the output contains the infoROM is corrupted warning.

A050109 (GPU / Other): Other GPU errors.

Detection method: Check whether other GPU errors exist. Such errors are normally caused by faulty hardware. Contact technical support.

A050147 (IB / Link): The InfiniBand NIC is abnormal.

Detection method: Run the ibstat command and check whether the NIC is not in the active state.
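For example (sketch; a healthy port normally reports State: Active and Physical state: LinkUp):

  # Show the state of each InfiniBand port
  ibstat | grep -i -E "State|Rate"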

A050121 (NPU / Other): A driver exception is detected by NPU DCMI.

Detection method: Check whether the NPU driver environment is abnormal.

A050122 (NPU / Other): The NPU DCMI device is abnormal.

Detection method: Check whether the NPU device is abnormal and the Ascend DCMI API returns a major or critical alarm.

A050123 (NPU / Link): The NPU DCMI network is abnormal.

Detection method: Check for NPU network connection exceptions.

A050129 (NPU / Other): Other NPU errors.

Detection method: Check whether other NPU errors exist. Such errors usually need to be resolved with the assistance of technical support.

A050149 (NPU / Link): The NPU network is unstable and intermittently disconnected.

Detection method: Run the hccn_tool -i ${device_id} -link_stat -g command and check whether the network has been intermittently disconnected more than five times within 24 hours.
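A sketch that runs this check for each NPU (device IDs 0 to 7 are an assumption; adjust them to the actual number of cards on the node):

  # Query link statistics for every NPU device
  for device_id in 0 1 2 3 4 5 6 7; do
    echo "device ${device_id}:"
    hccn_tool -i ${device_id} -link_stat -g
  done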

A050951 (NPU / Video RAM): The number of NPU ECC errors reaches the repair threshold.

Detection method: Check whether the NPU's HBM Double Bit Isolated Pages Count value is greater than or equal to 64.

A050146 (Runtime / Other): The NTP service is abnormal.

Detection method: Check whether the ntpd or chronyd service is abnormal.
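For example (sketch; typically only one of the two services is installed):

  # Check whether the time synchronization service is running
  systemctl is-active ntpd chronyd
  # When chronyd is used, also check the synchronization status
  chronyc tracking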

A050202 (Runtime / Other): The node is not ready.

Detection method: Check whether the node is unavailable (see the sketch after the taint list below). An unavailable Kubernetes node may have one of the following taints:

  • node.kubernetes.io/unreachable
  • node.kubernetes.io/not-ready
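A quick way to confirm this with kubectl (sketch; <node-name> is a placeholder):

  # NotReady in the STATUS column indicates an unavailable node
  kubectl get nodes
  # Check whether the unreachable or not-ready taint is present
  kubectl describe node <node-name> | grep -E "unreachable|not-ready"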

A050203 (Runtime / Card disconnection): The number of normal AI cards does not match the actual capacity.

Detection method: Check whether a GPU or NPU card is disconnected.

A050206 (Runtime / Other): The kubelet disk is read-only.

Detection method: Check whether the /mnt/paas/kubernetes/kubelet directory is read-only.
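For example (sketch; the test file name is arbitrary):

  # Show the mount options of the directory; "ro" indicates a read-only mount
  findmnt --target /mnt/paas/kubernetes/kubelet -o TARGET,OPTIONS
  # Or try to create and remove a test file; failure with "Read-only file system" confirms the fault
  touch /mnt/paas/kubernetes/kubelet/.rw_check && rm -f /mnt/paas/kubernetes/kubelet/.rw_check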

A050801 (Node management / Node O&M): Resources are reserved.

Detection method: Check whether the node is marked as a standby node and has a taint.

A050802 (Node management / Node O&M): Unknown error.

Detection method: Check whether the node is marked with an unknown taint.

A200001 (Node management / Driver upgrade): The GPU driver is being upgraded.

Detection method: Check whether the GPU driver is being upgraded on the node.

A200002 (Node management / Driver upgrade): The NPU driver is being upgraded.

Detection method: Check whether the NPU driver is being upgraded on the node.

A200008 (Node management / Node admission): Node admission is in progress.

Detection method: Check whether node admission is in progress on the node, including basic node configuration checks and simple service verification.

A050933 (Node management / Failover): Services on the tainted node will be migrated by failover.

Detection method: Check whether the node is marked with the taint and its services are migrated by failover.

A050931 (Training toolkit / Pre-check container): A GPU error is detected in the pre-check container.

Detection method: Check whether a GPU error is detected in the pre-check container.

A050932 (Training toolkit / Pre-check container): An InfiniBand error is detected in the pre-check container.

Detection method: Check whether an InfiniBand error is detected in the pre-check container.

A050804 (Hardware fault / Hardware fault): A hardware fault is detected through hardware alarms.

Detection method: Authorize O&M in Event Center. For details, see Authorizing O&M on the Event Center Page.

Configuring Node Event Alarm Notifications

AOM automatically reports node fault events by default. You can set up SMS or email notifications through AOM.

The following steps are performed on AOM 2.0.

  1. Log in to the AOM console.
  2. In the navigation pane, choose Alarm Center > Alarm Rules. Then, click Create Alarm Rule in the upper right corner.
  3. Set an alarm rule (using error code A050804 as an example).

    • Rule Type: Select Event alarm rule.
    • Event Type: Select System.
    • Event Source: Select ModelArts.
    • Monitored Object: Filter monitored objects by custom attributes. The event format is code=${Error code}.

      This example uses code=A050804 with Trigger Mode set to Immediate Trigger.

    • Alarm Mode: Select Direct alarm reporting.
    • (Optional) Alarm Notification: To get notified of alarms by email or SMS message, configure action rules for the alarm rule. If no action rule is available, you can create one.