Updated on 2025-08-22 GMT+08:00

Faulty Nodes in a Standard Resource Pool

Locating Faulty Nodes

In a Standard resource pool, ModelArts adds a taint to a faulty Kubernetes node so that jobs are not scheduled to the tainted node. The following table lists the faults that can be detected. You can locate a fault by referring to its isolation code and detection method.
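For example, a minimal way to see whether a node has been tainted is to check it with kubectl (a sketch; <node-name> is a placeholder, and the exact taint key that ModelArts applies is not shown here):

  # List nodes and their overall status
  kubectl get nodes
  # Show the taints applied to a specific node
  kubectl describe node <node-name> | grep -A 3 "Taints"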

Table 1 Isolation codes

Each entry below gives the isolation code, its category and sub-category in parentheses, the fault description, and the detection method.

A050101 (GPU / Video RAM): GPU ECC error.

Detection method: Run the nvidia-smi -a command and check whether Pending Page Blacklist is Yes or whether the register file multi-bit error count is greater than 0. For Ampere GPUs, check whether any of the following occurs (a command sketch follows the list below):

  • Uncorrectable SRAM error
  • Remapping Failure records
  • Xid 95 events in dmesg

The Ampere architecture has the following levels of GPU memory errors:

  • L1: These are single-bit ECC errors that can be corrected. They do not affect the running services. To check for these errors, run the nvidia-smi -a command and look for Volatile Correctable.
  • L2: These are multi-bit ECC errors that cannot be corrected. They cause the running services to fail and require a process restart to recover. To check for these errors, run the nvidia-smi -a command and look for Volatile Uncorrectable.
  • L3: These are unsuppressed errors and may affect other services. They require a card reset or a node reboot to clear. To check for these errors, look for the Xid events that contain the number 95. (The Remapped Pending records are only for reference. You need to reset the cards when the service is idle to trigger the remapping process.)
  • L4: These are errors that require a card replacement. To check for these errors, look for the SRAM Uncorrectable field that is greater than 4 or the Remapped Failed field that is not zero.
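The following commands are a minimal sketch of these checks (field names and sections in the nvidia-smi output vary with driver version, so treat the grep patterns as assumptions):

  # Pre-Ampere GPUs: pending page blacklist status
  nvidia-smi -a | grep -i "Pending Page Blacklist"
  # Ampere GPUs: SRAM uncorrectable counters and row-remapping records
  nvidia-smi -a | grep -i -E "SRAM Uncorrectable|Remapp"
  # Xid 95 events reported by the driver
  dmesg | grep -i "Xid" | grep 95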

A050102 (GPU / Other): ERR! appears in the nvidia-smi output.

Detection method: Run nvidia-smi -a and check whether the output contains ERR!. Such errors are normally caused by hardware faults, such as a faulty power supply or fan.
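For example, a quick check might look like this (sketch):

  # Count occurrences of ERR! in the full nvidia-smi report; a non-zero count indicates a fault
  nvidia-smi -a | grep -c "ERR!"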

A050103 (GPU / Other): The nvidia-smi command times out or is not found.

Detection method: Check whether the exit code of nvidia-smi is non-zero.
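For example (sketch; the 30-second timeout is an arbitrary value):

  # A non-zero exit code indicates that nvidia-smi hung, failed, or is missing
  timeout 30 nvidia-smi > /dev/null 2>&1
  echo "nvidia-smi exit code: $?"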

A050104 (GPU / Video RAM): ECC errors have occurred 64 times.

Detection method: Run the nvidia-smi -a command, locate Retired Pages, and check whether the sum of Single Bit and Double Bit is greater than 64.
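A minimal sketch of this check, assuming the driver exposes the page retirement section (output fields may differ across driver versions):

  # Show retired page counters for all GPUs
  nvidia-smi -q -d PAGE_RETIREMENT | grep -i -E "Single Bit|Double Bit"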

A050148 (GPU / Other): An infoROM alarm is generated.

Detection method: Run the nvidia-smi command and check whether the output contains the infoROM is corrupted warning.

A050109 (GPU / Other): Other GPU errors.

Detection method: Check whether other GPU errors exist. Such errors are normally caused by faulty hardware. Contact technical support.

A050147 (IB / Link): The InfiniBand NIC is abnormal.

Detection method: Run the ibstat command and check whether the NIC is not in the active state.
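For example (sketch; a healthy port normally reports State: Active and Physical state: LinkUp):

  # Show the state of each InfiniBand port
  ibstat | grep -i -E "State|Rate"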

A050121 (NPU / Other): A driver exception is detected by NPU DCMI.

Detection method: Check whether the NPU driver environment is abnormal.

A050122 (NPU / Other): The NPU DCMI device is abnormal.

Detection method: Check whether the NPU device is abnormal and the Ascend DCMI API returns a major or critical alarm.

A050123 (NPU / Link): The NPU DCMI network is abnormal.

Detection method: Check for NPU network connection exceptions.

A050129 (NPU / Other): Other NPU errors.

Detection method: Check whether other NPU errors exist. Such errors usually need to be resolved with the assistance of technical support.

A050149 (NPU / Link): The NPU network is unstable and intermittently disconnected.

Detection method: Run the hccn_tool -i ${device_id} -link_stat -g command and check whether the network has been intermittently disconnected more than five times within 24 hours.
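A sketch that runs this check for each NPU (device IDs 0 to 7 are an assumption; adjust them to the actual number of cards on the node):

  # Query link statistics for every NPU device
  for device_id in 0 1 2 3 4 5 6 7; do
    echo "device ${device_id}:"
    hccn_tool -i ${device_id} -link_stat -g
  done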

A050951 (NPU / Video RAM): The number of NPU ECC errors reaches the repair threshold.

Detection method: Check whether the NPU's HBM Double Bit Isolated Pages Count value is greater than or equal to 64.

A050146 (Runtime / Other): The NTP service is abnormal.

Detection method: Check whether the ntpd or chronyd service is abnormal.
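For example (sketch; typically only one of the two services is installed):

  # Check whether the time synchronization service is running
  systemctl is-active ntpd chronyd
  # When chronyd is used, also check the synchronization status
  chronyc tracking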

A050202 (Runtime / Other): The node is not ready.

Detection method: Check whether the node is unavailable (see the sketch after the taint list below). An unavailable Kubernetes node may have one of the following taints:

  • node.kubernetes.io/unreachable
  • node.kubernetes.io/not-ready
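A quick way to confirm this with kubectl (sketch; <node-name> is a placeholder):

  # NotReady in the STATUS column indicates an unavailable node
  kubectl get nodes
  # Check whether the unreachable or not-ready taint is present
  kubectl describe node <node-name> | grep -E "unreachable|not-ready"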

A050203 (Runtime / Card disconnection): The number of normal AI cards does not match the actual capacity.

Detection method: Check whether a GPU or NPU card is disconnected.

A050206 (Runtime / Other): The kubelet disk is read-only.

Detection method: Check whether the /mnt/paas/kubernetes/kubelet directory is read-only.
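For example (sketch; the test file name is arbitrary):

  # Show the mount options of the directory; "ro" indicates a read-only mount
  findmnt --target /mnt/paas/kubernetes/kubelet -o TARGET,OPTIONS
  # Or try to create and remove a test file; failure with "Read-only file system" confirms the fault
  touch /mnt/paas/kubernetes/kubelet/.rw_check && rm -f /mnt/paas/kubernetes/kubelet/.rw_check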

A050801 (Node management / Node O&M): Resources are reserved.

Detection method: Check whether the node is marked as a standby node and has a taint.

A050802 (Node management / Node O&M): Unknown error.

Detection method: Check whether the node is marked with an unknown taint.

A200001 (Node management / Driver upgrade): The GPU driver is being upgraded.

Detection method: Check whether the GPU driver is being upgraded on the node.

A200002 (Node management / Driver upgrade): The NPU driver is being upgraded.

Detection method: Check whether the NPU driver is being upgraded on the node.

A200008 (Node management / Node admission): Node admission is in progress.

Detection method: Check whether node admission is in progress on the node, including basic node configuration checks and simple service verification.

A050933 (Node management / Failover): Services on the tainted node will be migrated by failover.

Detection method: Check whether the node is marked with the taint and its services are migrated by failover.

A050931 (Training toolkit / Pre-check container): A GPU error is detected in the pre-check container.

Detection method: Check whether a GPU error is detected in the pre-check container.

A050932 (Training toolkit / Pre-check container): An InfiniBand error is detected in the pre-check container.

Detection method: Check whether an InfiniBand error is detected in the pre-check container.

A050804 (Hardware fault / Hardware fault): A hardware fault is detected through hardware alarms.

Detection method: Authorize O&M in Event Center. For details, see Authorizing O&M on the Event Center Page.

Configuring Node Event Alarm Notifications

AOM automatically reports node fault events by default. You can set up SMS or email notifications through AOM.

The following steps are performed on AOM 2.0.

  1. Log in to the AOM console.
  2. In the navigation pane, choose Alarm Center > Alarm Rules. Then, click Create Alarm Rule in the upper right corner.
  3. Set an alarm rule (using error code A050804 as an example).

    • Rule Type: Select Event alarm rule.
    • Event Type: Select System.
    • Event Source: Select ModelArts.
    • Monitored Object: Filter monitored objects by custom attributes. The event format is code=${Error code}.

      This example uses code=A050804 with Trigger Mode set to Immediate Trigger.

    • Alarm Mode: Select Direct alarm reporting.
    • (Optional) Alarm Notification: To get notified of alarms by email or SMS message, configure action rules for the alarm rule. If no action rule is available, you can create one.