Faulty Nodes in a Standard Resource Pool
Locating Faulty Nodes
In a Standard resource pool, ModelArts adds a taint to a faulty Kubernetes node so that jobs are not scheduled to it. The following table lists the faults that can be detected. To locate a fault, refer to its isolation code and detection method.
Isolation Code | Category | Sub-Category | Description | Detection Method |
---|---|---|---|---|
A050101 | GPU | Video RAM | GPU ECC error. | Run the nvidia-smi -a command and check whether Pending Page Blacklist is Yes or the multi-bit register file is greater than 0. For Ampere GPUs, check whether any of the following situations occurs: The Ampere architecture has the following levels of GPU memory errors: |
A050102 | GPU | Other | The nvidia-smi output contains ERR!. | Run nvidia-smi -a and check whether the output contains ERR!. Normally, such errors are caused by hardware faults, such as a faulty power supply or fan. |
A050103 | GPU | Other | nvidia-smi times out or does not exist. | Check whether the exit code of nvidia-smi is nonzero. |
A050104 | GPU | Video RAM | The ECC error has occurred 64 times. | Run the nvidia-smi -a command, locate Retired Pages, and check whether the sum of Single Bit and Double Bit is greater than 64. |
A050148 | GPU | Other | An infoROM alarm is generated. | Run the nvidia-smi command and check whether the output contains the alarm "infoROM is corrupted". |
A050109 | GPU | Other | Other GPU errors. | Check whether other GPU errors exist. Normally, such errors are caused by faulty hardware. Contact technical support. |
A050147 | IB | Link | The InfiniBand NIC is abnormal. | Run the ibstat command and check whether the NIC is not in the active state. |
A050121 | NPU | Other | A driver exception is detected by NPU DCMI. | Check whether the NPU driver environment is abnormal. |
A050122 | NPU | Other | The NPU DCMI device is abnormal. | Check whether the NPU device is abnormal and the Ascend DCMI API returns a major or critical alarm. |
A050123 | NPU | Link | The NPU DCMI network is abnormal. | Check for NPU network connection exceptions. |
A050129 | NPU | Other | Other NPU errors. | Check whether other NPU errors exist. Such errors usually need to be resolved with the help of technical support. |
A050149 | NPU | Link | The NPU network is unstable and intermittently disconnected. | Run the hccn_tool -i ${device_id} -link_stat -g command and check whether the network port has been intermittently disconnected more than five times within 24 hours. |
A050951 | NPU | Video RAM | The number of NPU ECC errors reaches the repair threshold. | Check whether the NPU's HBM Double Bit Isolated Pages Count value is greater than or equal to 64. |
A050146 | Runtime | Other | The NTP service is abnormal. | Check whether the NTPD or Chronyd service is abnormal. |
A050202 | Runtime | Other | The node is not ready. | Check whether the node is unavailable. An unavailable Kubernetes node may have one of the following taints: |
A050203 | Runtime | Disconnected PU | The number of normal AI cards does not match the actual capacity. | Check whether the GPU or NPU is disconnected. |
A050206 | Runtime | Other | The Kubelet hard disk is read-only. | Check whether the /mnt/paas/kubernetes/kubelet directory is read-only. |
A050801 | Node management | Node O&M | Resources are reserved. | Check whether the node is marked as a standby node and has a taint. |
A050802 | Node management | Node O&M | Unknown error. | Check whether the node is marked with an unknown taint. |
A200001 | Node management | Driver upgrade | The GPU is being upgraded. | Check whether the GPU is being upgraded on the node. |
A200002 | Node management | Driver upgrade | The NPU is being upgraded. | Check whether the NPU is being upgraded on the node. |
A200008 | Node management | Node admission | The node admission check is in progress. | Check whether the admission check is in progress on the node, including the basic node configuration check and simple service verification. |
A050933 | Node management | Failover | The failover service on the tainted node will be migrated. | Check whether the node is marked with the taint and its failover service is migrated. |
A050931 | Training toolkit | Pre-check container | A GPU error is detected in the pre-check container. | Check whether a GPU error is detected in the pre-check container. |
A050932 | Training toolkit | Pre-check container | An InfiniBand error is detected in the pre-check container. | Check whether an InfiniBand error is detected in the pre-check container. |
A050804 | Hardware fault | Hardware fault | Detected through hardware alarms. | Detected through hardware alarms. Authorize O&M in Event Center. For details, see Authorizing O&M on the Event Center Page. |
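
Several of the GPU checks in the table are plain text matches on nvidia-smi output, so they can be approximated in a script. The following is a minimal sketch, not the detection logic ModelArts itself runs. The function names are hypothetical, and the field labels it searches for (for example "Single Bit ECC" under Retired Pages) are assumptions about how nvidia-smi -a prints these values, which can differ between driver versions. It does not cover the multi-bit register file check or the Ampere-specific cases for A050101.

```python
import re


def _retired_count(section, label):
    """Return the integer after 'label[ ECC] : N' in section, or 0 if absent."""
    match = re.search(rf"{label}(?: ECC)?\s*:\s*(\d+)", section)
    return int(match.group(1)) if match else 0


def classify_gpu_output(output, exit_code):
    """Map captured `nvidia-smi -a` output and its exit code to isolation codes."""
    if exit_code != 0:
        # A050103: the command timed out or is missing, so the output is not usable.
        return ["A050103"]

    codes = []
    # A050101 (partial): only the Pending Page Blacklist condition is checked here.
    if re.search(r"Pending Page Blacklist\s*:\s*Yes", output):
        codes.append("A050101")
    # A050102: any field reported as ERR!.
    if "ERR!" in output:
        codes.append("A050102")

    # A050104: sum of Single Bit and Double Bit under Retired Pages is greater than 64.
    # Assumption: both counters appear after the "Retired Pages" heading.
    start = output.find("Retired Pages")
    section = output[start:] if start != -1 else ""
    retired = _retired_count(section, "Single Bit") + _retired_count(section, "Double Bit")
    if retired > 64:
        codes.append("A050104")

    # A050148: infoROM alarm text.
    if "infoROM is corrupted" in output:
        codes.append("A050148")
    return codes
```

To use it, capture the command's standard output and exit code (for example with subprocess.run and a timeout) and pass both in. An empty list means none of these particular text checks matched; it does not rule out the other faults in the table.
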
Configuring Node Event Alarm Notifications
AOM automatically reports node fault events by default. You can set up SMS or email notifications through AOM.

The following steps are performed on AOM 2.0.
- Log in to the AOM console.
- In the navigation pane, choose Alarm Center > Alarm Rules. Then, click Create Alarm Rule in the upper right corner.
- Set an alarm rule (using error code A050804 as an example).
  - Rule Type: Select Event alarm rule.
  - Event Type: Select System.
  - Event Source: Select ModelArts.
  - Monitored Object: Filter monitored objects by custom attributes. The event format is code=${Error code}.
    This example uses code=A050804, with Trigger Mode set to Immediate Trigger.
  - Alarm Mode: Select Direct alarm reporting.
  - (Optional) Alarm Notification: To get notified of alarms by email or SMS message, configure action rules for the alarm rule. If no action rule is available, you can create one.
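
When the same rule is needed for several isolation codes, it can help to record the console settings in one place before entering them. The sketch below builds the code=${Error code} filter string and collects the values from the steps above. Both function names are hypothetical, the dictionary is a plain record for runbooks or reviews and not an AOM API payload, and the "A plus six digits" pattern is an assumption based on the codes listed on this page.

```python
import re

# Assumption: every isolation code is "A" followed by six digits, as in the table.
ISOLATION_CODE = re.compile(r"A\d{6}")


def monitored_object_filter(code):
    """Build the custom-attribute filter string in the code=${Error code} format."""
    if not ISOLATION_CODE.fullmatch(code):
        raise ValueError(f"unexpected isolation code: {code!r}")
    return f"code={code}"


def alarm_rule_settings(code):
    """Collect the console settings for one event alarm rule.

    Hypothetical record only; the keys mirror the console field names
    and are not parameters of any AOM API.
    """
    return {
        "Rule Type": "Event alarm rule",
        "Event Type": "System",
        "Event Source": "ModelArts",
        "Monitored Object": monitored_object_filter(code),
        "Trigger Mode": "Immediate Trigger",
        "Alarm Mode": "Direct alarm reporting",
    }
```

For example, alarm_rule_settings("A050804") yields the values used in the steps above, with Monitored Object set to code=A050804.
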