Updated on 2025-08-22 GMT+08:00

How Do I Locate and Rectify a Node Fault in a Cluster Resource Pool?

Fault Description and Handling Suggestions

Figure 1 Troubleshooting process

For a ModelArts Lite resource pool, the node-agent component is deployed on each node as a DaemonSet. It detects the node status and writes the detection result to the Kubernetes NodeCondition. Node fault metrics are also reported to AOM by default, and you can configure alarm notifications on AOM.

If a node is abnormal, use Table 1 to determine whether the node is subhealthy and, if it is, take the initial recovery action listed in the Solution column. If the node is not subhealthy, contact your customer manager to initiate a repair process. If no customer manager is available, submit a service ticket.
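Because node-agent writes its results to the Kubernetes NodeCondition, the fault conditions on a node can be read from the node object. The following is a minimal sketch, not part of ModelArts. It assumes that the node JSON comes from a command such as kubectl get node <node-name> -o json, that fault conditions use the NT_ prefix listed in Table 1, and that a status of "True" marks an active fault. The function name active_fault_conditions and the sample data are hypothetical.

```python
import json


def active_fault_conditions(node: dict, prefix: str = "NT_") -> list[str]:
    """Return NodeCondition types with the given prefix whose status is "True".

    Assumption: node-agent marks an active fault with status "True".
    """
    conditions = node.get("status", {}).get("conditions", [])
    return [
        c["type"]
        for c in conditions
        if c.get("type", "").startswith(prefix) and c.get("status") == "True"
    ]


# Hypothetical node object, trimmed to the fields the function reads.
sample_node = json.loads("""
{
  "status": {
    "conditions": [
      {"type": "Ready", "status": "True"},
      {"type": "NT_NPU_CARD_LOSE", "status": "True"},
      {"type": "NT_NET_NTP_CHECK", "status": "False"}
    ]
  }
}
""")

print(active_fault_conditions(sample_node))
```

Each returned type can then be looked up in Table 1 to find the detection method and the suggested solution.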

Table 1 Node fault metrics

NodeCondition Type

Category

Sub-Category

Description

Detection Method

Solution

NT_NPU_DEVICE

NPU

Others

The NPU DCMI device is abnormal.

Check whether the NPU device is abnormal and the Ascend DCMI API returns a major or critical alarm.

The node may be subhealthy. You are advised to restart the node first. If the fault persists after the node is restarted, initiate the repair process.

NT_NPU_NET

NPU

Link

The NPU DCMI network is abnormal.

Check for NPU network connection exceptions.

The node may be subhealthy. You are advised to restart the node first. If the fault persists after the node is restarted, initiate the repair process.

NT_NPU_CARD_LOSE

NPU

Disconnected card

The NPU is disconnected.

Check whether the number of NPUs in the node flavor is inconsistent with the number of schedulable NPUs on the Kubernetes node.

The node may be subhealthy. You are advised to restart the node first. If the fault persists after the node is restarted, initiate the repair process.

NT_NPU_OTHER

NPU

Others

Other NPU errors.

Check whether other NPU errors exist. Such errors usually need to be corrected with the assistance of technical support.

Initiate the repair process.

NT_NPU_ECC_COUNT

NPU

Graphics memory

The number of NPU ECC errors reaches the repair threshold.

Check whether the number of HBM multi-bit ECC isolated addresses on the NPU reaches 64.

Initiate the repair process.

NT_NET_NTP_CHECK

Runtime

Others

The NTP is abnormal.

Check whether the NTPD or Chronyd service is abnormal.

Initiate the repair process.

NT_KUBE_DISK_READONLY_CHECK

Runtime

Others

The Kubelet hard disk is read-only.

Check whether the following directory is read-only:

/mnt/paas/kubernetes/kubelet

Initiate the repair process.

NT_GPU_SMI_ECC_CHECK

GPU

Graphics memory

GPU ECC error.

Run the nvidia-smi -a command and check whether Pending Page Blacklist is Yes or the multi-bit Register File error count is greater than 0. For Ampere GPUs, check whether any of the following situations occurs:

  • Uncorrectable SRAM error
  • Remapping Failure records
  • Xid 95 events in dmesg

(For details, see NVIDIA GPU Memory Error Management.)

The Ampere architecture has the following levels of GPU memory errors:

  • L1: These are single-bit ECC errors that can be corrected. They do not affect the running services. To check for these errors, run the nvidia-smi -a command and look for Volatile Correctable.
  • L2: These are multi-bit ECC errors that cannot be corrected. They cause the running services to fail and require a process restart to recover. To check for these errors, run the nvidia-smi -a command and look for Volatile Uncorrectable.
  • L3: These are uncontained errors and may affect other services. They require a card reset or a node reboot to clear. To check for these errors, look for Xid events that contain the number 95. (The Remapped Pending records are only for reference. You need to reset the cards when the service is idle to trigger the remapping process.)
  • L4: These are errors that require a card replacement. To check for these errors, check whether the SRAM Uncorrectable field is greater than 4 or the Remapped Failed field is not zero.

The node may be subhealthy. You are advised to restart the node first. If the fault persists after the node is restarted, initiate the repair process.

NT_GPU_SMI_ERROR

GPU

Others

The nvidia-smi output contains ERR!.

Run the nvidia-smi -a command and check whether the output contains ERR!. Normally, such errors are caused by hardware faults, such as a faulty power supply or fan.

Initiate the repair process.

NT_GPU_SMI_RUNTIME

GPU

Others

The nvidia-smi command times out or does not exist.

Check whether the exit code of nvidia-smi is not 0.

Initiate the repair process.

NT_GPU_SMI_ECC_COUNT

GPU

Graphics memory

The number of GPU ECC errors reaches the repair threshold.

Run the nvidia-smi -a command, locate Retired Pages, and check whether the sum of Single Bit and Double Bit is greater than 64.

Initiate the repair process.

NT_GPU_CARD_LOSE

GPU

Disconnected card

The GPU is disconnected.

Check whether the number of GPUs in the node flavor is different from any of the following values:

  1. Number of GPUs visible to lspci
  2. Number of visible nvidia-smi cards
  3. Number of schedulable Kubernetes cards

Initiate the repair process.

NT_GPU_SMI_INFOROM_ERROR

GPU

Others

An infoROM alarm is generated.

Run the nvidia-smi command and check whether the output contains the alarm "infoROM is corrupted".

Initiate the repair process.

NT_GPU_OTHER

GPU

Others

Other GPU errors.

Check whether other GPU errors exist. Normally, such errors are caused by faulty hardware. Contact technical support.

Initiate the repair process.

NT_NET_IB_CHECK

IB

Link

The InfiniBand NIC is abnormal.

Run the ibstat command and check whether the NIC is not in the active state.

The node may be subhealthy. You are advised to restart the node first. If the fault persists after the node is restarted, initiate the repair process.
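The Solution column of Table 1 reduces to two actions: restart the node first, or initiate the repair process directly. The following sketch encodes that column as a lookup, which may be convenient when triaging alarms in a script. The constant names, the function name suggested_action, and the returned strings are hypothetical. Only the grouping of condition types is taken from Table 1.

```python
# Condition types whose Table 1 solution is to restart the node first
# (the node may be subhealthy).
RESTART_FIRST = frozenset({
    "NT_NPU_DEVICE",
    "NT_NPU_NET",
    "NT_NPU_CARD_LOSE",
    "NT_GPU_SMI_ECC_CHECK",
    "NT_NET_IB_CHECK",
})

# Condition types whose Table 1 solution is to initiate the repair process.
REPAIR_ONLY = frozenset({
    "NT_NPU_OTHER",
    "NT_NPU_ECC_COUNT",
    "NT_NET_NTP_CHECK",
    "NT_KUBE_DISK_READONLY_CHECK",
    "NT_GPU_SMI_ERROR",
    "NT_GPU_SMI_RUNTIME",
    "NT_GPU_SMI_ECC_COUNT",
    "NT_GPU_CARD_LOSE",
    "NT_GPU_SMI_INFOROM_ERROR",
    "NT_GPU_OTHER",
})


def suggested_action(condition_type: str) -> str:
    """Map a NodeCondition type to the action suggested in Table 1."""
    if condition_type in RESTART_FIRST:
        return "restart node, then repair if the fault persists"
    if condition_type in REPAIR_ONLY:
        return "initiate repair process"
    return "unknown condition type"


for cond in ("NT_NET_IB_CHECK", "NT_GPU_CARD_LOSE"):
    print(cond, "->", suggested_action(cond))
```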

Some fault modes can be detected by hardware alarm monitoring on the Huawei Cloud O&M platform. Table 2 describes the fault definitions and handling suggestions. In addition, an AOM event is reported by default when such a fault occurs. You can configure alarm notifications on AOM.

Table 2 Node fault events

Error Code

Category

Sub-Category

Description

Detection Method

Solution

A050804

Hardware fault

Hardware fault

Detected through hardware alarms.

Detected through hardware alarms.

Authorize O&M in Event Center. For details, see Authorizing O&M on the Event Center Page.

A050202

Runtime

Other

Kubernetes node not ready

Log in to the CCE cluster and check the status of the node where the alarm is generated.

After confirming the exception, set the node to unschedulable and schedule the service pods to another node.
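For error code A050202, the handling suggestion is to check the node status, set the node to unschedulable, and move the service pods to another node. The following sketch only builds the kubectl command lines that are commonly used for these steps and does not run them. Using kubectl drain to move the pods is an assumption, not a documented ModelArts procedure, and the node name is a hypothetical placeholder. Review the drain options against your workloads before running anything.

```python
import shlex


def isolate_node_commands(node_name: str) -> list[list[str]]:
    """Build kubectl commands to inspect, cordon, and drain a node.

    The commands are returned, not executed, so they can be reviewed first.
    """
    return [
        # Confirm the node status (for example, NotReady).
        ["kubectl", "get", "node", node_name],
        # Mark the node as unschedulable.
        ["kubectl", "cordon", node_name],
        # Evict pods so that their controllers recreate them elsewhere.
        # DaemonSet pods such as node-agent are left in place.
        ["kubectl", "drain", node_name, "--ignore-daemonsets"],
    ]


# Hypothetical node name used for illustration only.
for cmd in isolate_node_commands("192.168.0.10"):
    print(shlex.join(cmd))
```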

Configuring Node Metric Alarm Notifications

By default, the node fault metric (nt_npg) is reported to AOM. You can configure notifications such as SMS or email on AOM.

The following steps are performed on AOM 2.0.

An nt_npg sample whose value is 2 is invalid. The comparison != 2 in the expression nt_npg{type="NT_NPU_CARD_LOSE"} != 2 filters out those invalid samples.

  1. Log in to the AOM console.
  2. In the navigation pane, choose Alarm Center > Alarm Rules. Then, click Create Alarm Rule.
  3. Set an alarm rule. (NPU disconnection is used as an example.)

    • Rule Type: Select Metric alarm rule.
    • Configuration Mode: Select PromQL.
    • Default Rule: Select Custom and enter the following information in the command input box:
      sum(nt_npg{type="NT_NPU_CARD_LOSE"} != 2) by (cluster_name, node_ip, type)
    • Alarm Rule Details > Duration: Select 1 minute. A major alarm is triggered when the rule's condition holds continuously for one minute.
    • (Optional) Alarm Notification: To get notified of alarms by email or SMS message, configure action rules for the alarm rule. If no action rule is available, you can create one.
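The NPU disconnection rule can be adapted to the other NodeCondition types in Table 1 by changing the type label. The following sketch generates the PromQL expression for a given type. It assumes that every condition type in Table 1 is reported through the same nt_npg metric with the same labels, which this document confirms only for NT_NPU_CARD_LOSE. The function name node_fault_expr is hypothetical. The spacing in the generated expression is normalized, which does not change its meaning in PromQL.

```python
def node_fault_expr(condition_type: str, invalid_value: int = 2) -> str:
    """Build the alarm expression for one NodeCondition type.

    Samples equal to invalid_value are dropped before aggregation.
    """
    selector = f'nt_npg{{type="{condition_type}"}}'
    return (
        f"sum({selector} != {invalid_value}) "
        "by (cluster_name, node_ip, type)"
    )


print(node_fault_expr("NT_NPU_CARD_LOSE"))
print(node_fault_expr("NT_NET_IB_CHECK"))
```

Paste the printed expression into the command input box when Configuration Mode is set to PromQL.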

Configuring Node Event Alarm Notifications

By default, node fault events are reported to AOM. You can configure notifications such as SMS or email on AOM.

The following steps are performed on AOM 2.0.

  1. Log in to the AOM console.
  2. In the navigation pane, choose Alarm Center > Alarm Rules. Then, click Create Alarm Rule in the upper right corner.
  3. Set an alarm rule (using error code A050804 as an example).

    • Rule Type: Select Event alarm rule.
    • Event Type: Select System.
    • Event Source: Select ModelArts.
    • Monitored Object: Filter monitored objects by custom attributes. The event format is code=${Error code}.

      This example uses code=A050804, with Trigger Mode set to Immediate Trigger.

    • Alarm Mode: Select Direct alarm reporting.
    • (Optional) Alarm Notification: To get notified of alarms by email or SMS message, configure action rules for the alarm rule. If no action rule is available, you can create one.