Updated on 2025-11-27 GMT+08:00

Using Cloud Eye to Monitor Lite Server NPU Events

Scenario

You can use Cloud Eye to centrally collect key events and cloud resource operational events. When an event occurs, you will receive an alarm. Lite Server mainly supports BMS and ECS events. The table below lists NPU-related events. For details about other events, see Events Supported by Event Monitoring.

Constraints

  • Event reporting to Cloud Eye requires the Cloud Eye Agent plugin, which is subject to strict resource usage limits. When its resource usage exceeds the threshold, the Agent's circuit breaker is triggered. For details about the resource usage, see Cloud Eye Server Monitoring.
  • The monitoring agent has been fully tested in the public images provided by Lite Server. If you use your own image, test the agent before deploying the image in the production environment to prevent information errors.

Prerequisites

The Cloud Eye Agent has been installed on the Lite Server. For details about how to check whether the Cloud Eye Agent is installed and how to install it, see Installing Cloud Eye Agent Monitoring Plug-ins.
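
As a rough illustration of how an installed agent version relates to the "Supported Cloud Eye Agent Version" column in Table 1, the following Python sketch checks a version string against a list of exact versions plus an "or later" minimum. The function names are hypothetical, and how you obtain the installed version string is left open; none of this is part of the Cloud Eye Agent itself.

```python
from typing import Iterable, Optional, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a dotted version such as '2.7.5.9' into a tuple of integers."""
    return tuple(int(part) for part in text.strip().split("."))


def is_supported(
    installed: str,
    exact: Iterable[str] = (),
    or_later: Optional[str] = None,
) -> bool:
    """Return True if `installed` is one of the exact versions,
    or is greater than or equal to the `or_later` minimum."""
    version = parse_version(installed)
    if any(version == parse_version(item) for item in exact):
        return True
    return or_later is not None and version >= parse_version(or_later)


# Version list copied from a Table 1 row that reads
# "2.7.4.3, 2.7.5.3, 2.7.5.4, 2.7.5.9 or later".
EXACT = ("2.7.4.3", "2.7.5.3", "2.7.5.4")
MINIMUM = "2.7.5.9"

for candidate in ("2.7.5.4", "2.7.5.5", "2.7.10.1"):
    print(candidate, is_supported(candidate, exact=EXACT, or_later=MINIMUM))
```

Comparing integer tuples rather than raw strings avoids ordering mistakes such as treating "2.7.10.1" as older than "2.7.5.9".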

Event Source

BMS/ECS

Event Namespace

SYS.BMS/SYS.ECS

Event List

Table 1 Supported Fault Events (BMS/ECS)

Event

Event ID

Event Severity

Description

Solution

Impact

Supported Model

Supported Cloud Eye Agent Version

NPU: device not found by npu-smi info

NPUSMICardNotFound

Major

The Ascend driver is faulty or the NPU is disconnected.

Contact O&M engineers.

The NPU cannot be used normally.

Snt3P

Snt3PD

Snt9b

Snt9b23

telescope:

2.7.4.3

2.7.5.3

2.7.5.4

2.7.5.9 or later

NPU: PCIe link error

PCIeErrorFound

Major

The lspci command output shows that the NPU is in the rev ff state.

Contact O&M engineers.

The NPU cannot be used normally.

Snt3P

Snt3PD

Snt9b

Snt9b23

telescope:

2.7.4.3

2.7.5.3

2.7.5.4

2.7.5.9 or later

NPU: device not found by lspci

LspciCardNotFound

Major

The NPU is disconnected.

Contact O&M engineers.

The NPU cannot be used normally.

Snt3P

Snt3PD

Snt9b

Snt9b23

telescope:

2.7.4.3

2.7.5.3

2.7.5.4

2.7.5.9 or later

NPU: overtemperature

TemperatureOverUpperLimit

Major

The temperature of DDR or software is too high.

Stop services, restart the system, check the heat dissipation system, and reset the device.

The instance may be powered off and devices may not be found.

Snt3P

Snt3PD

Snt9b

Snt9b23

telescope:

2.7.4.3

2.7.5.3

2.7.5.4

2.7.5.9 or later

NPU: uncorrectable ECC error

UncorrectableEccErrorCount

Major

There are uncorrectable ECC errors on the NPU.

If services are affected, replace the NPU with another one.

Services may be interrupted.

Snt3P

Snt3PD

telescope:

2.7.4.3

2.7.5.3

2.7.5.4

2.7.5.9 or later

NPU: request for instance restart

RebootVirtualMachine

Suggestion

A fault occurs and the instance needs to be restarted.

Collect the fault information, and restart the instance.

Services may be interrupted.

Snt3P

Snt3PD

Snt9b

Snt9b23

telescope:

2.7.4.3

2.7.5.3

2.7.5.4

2.7.5.9 or later

NPU: request for SoC reset

ResetSOC

Suggestion

A fault occurs and the SoC needs to be reset.

Collect the fault information, and reset the SoC.

Services may be interrupted.

Snt3P

Snt3PD

Snt9b

Snt9b23

telescope:

2.7.4.3

2.7.5.3

2.7.5.4

2.7.5.9 or later

NPU: request for restart AI process

RestartAIProcess

Suggestion

A fault occurs and the AI process needs to be restarted.

Collect the fault information, and restart the AI process.

The current AI task will be interrupted.

Snt3P

Snt3PD

Snt9b

Snt9b23

telescope:

2.7.4.3

2.7.5.3

2.7.5.4

2.7.5.9 or later

NPU: error codes

NPUErrorCodeWarning

Major

A large number of NPU error codes indicating major or higher-level errors are returned. You can further locate the faults based on the error codes.

Locate the faults according to the Black Box Error Code Information List and Health Management Error Definition.

Services may be interrupted.

Snt3P

Snt3PD

Snt9b

Snt9b23

telescope:

2.7.4.3

2.7.5.3

2.7.5.4

2.7.5.9 or later

Multiple NPU HBM ECC errors

NpuHbmMultiEccInfo

Suggestion

There are NPU HBM ECC errors.

This event is only a reference for other events. You do not need to handle it separately.

This event is only a reference for other events. You do not need to handle it separately.

Snt9b

Snt9b23

telescope:

2.7.5.9 or later

GPU: invalid RoCE NIC configuration

GpuRoceNicConfigIncorrect

Major

The RoCE NIC configuration of the GPU is invalid.

Contact O&M engineers.

The parameter plane network is abnormal, preventing the execution of the multi-node task.

GPU

telescope:

2.7.5.9 or later

ReadOnly issues in OS

ReadOnlyFileSystem

Critical

The file system %s is read-only.

Check the disk health status.

Files on the affected file system cannot be written or modified.

-

telescope:

2.7.5.3

2.7.5.9 or later

NPU: driver and firmware not matching

NpuDriverFirmwareMismatch

Major

The NPU's driver and firmware do not match.

Obtain the matched version from the Ascend official website and reinstall it.

NPUs cannot be used.

Snt3P

Snt3PD

Snt9b

Snt9b23

telescope:

2.7.5.3

2.7.5.9 or later

NPU: RoCE NIC down

RoCELinkStatusDown

Major

The RoCE link of NPU %d is down.

Check the NPU RoCE network port status.

The NPU NIC is unavailable.

Snt9b

Snt9b23

telescope:

2.7.5.3

2.7.5.9 or later

NPU: RoCE NIC health status abnormal

RoCEHealthStatusError

Major

The RoCE network health status of NPU %d is abnormal.

Check the health status of the NPU RoCE NIC.

The NPU NIC is unavailable.

Snt9b

Snt9b23

telescope:

2.7.5.3

2.7.5.9 or later

NPU: RoCE NIC configuration file /etc/hccn.conf not exist

HccnConfNotExisted

Major

The RoCE NIC configuration file /etc/hccn.conf does not exist.

Check the /etc/hccn.conf NIC configuration file.

The RoCE NIC is unavailable.

Snt9b

Snt9b23

telescope:

2.7.5.3

2.7.5.9 or later

GPU: basic components abnormal

GpuEnvironmentSystem

Major

The nvidia-smi command is abnormal.

Check whether the GPU driver is normal.

The GPU driver is unavailable.

GPU

telescope:

2.7.5.3

2.7.5.9 or later

Major

The nvidia-fabricmanager version is inconsistent with the GPU driver version.

Check the GPU driver version and nvidia-fabricmanager version.

The nvidia-fabricmanager cannot work properly, affecting GPU usage.

Major

The container plugin nvidia-container-toolkit is not installed.

Install the container plugin nvidia-container-toolkit.

GPUs cannot be attached to Docker containers.

Local disk mounting inspection

MountDiskSystem

Major

The /etc/fstab file contains invalid UUIDs.

Ensure that the UUIDs in the /etc/fstab configuration file are correct. Otherwise, the server may fail to be restarted.

The disk mounting process fails, preventing the server from restarting.

-

telescope:

2.7.5.3

2.7.5.9 or later

GPU: incorrectly configured dynamic route for Ant series server

GpuRouteConfigError

Major

The dynamic route of the NIC %s of an Ant series server is not configured or is incorrectly configured. CMD [ip route]: %s | CMD [ip route show table all]: %s.

Configure the RoCE NIC route correctly.

The NPU network communication is abnormal.

GPU

telescope:

2.7.5.3

2.7.5.9 or later

NPU: RoCE port not split

RoCEUdpConfigError

Major

The RoCE UDP port is not split.

Check the RoCE UDP port configuration on the NPU.

The communication performance of NPUs is affected.

Snt9b

Snt9b23

telescope:

2.7.5.9 or later

Warning of automatic system kernel upgrade

KernelUpgradeWarning

Major

Warning of automatic system kernel upgrade. Old version: %s; new version: %s.

System kernel upgrade may cause AI software exceptions. Check the system update logs and prevent the server from restarting.

The AI software may be unavailable.

Snt3P

Snt3PD

Snt9b

Snt9b23

telescope:

2.7.5.3

2.7.5.9 or later

NPU environment command detection

NpuToolsWarning

Major

The hccn_tool is unavailable.

Check if the NPU driver is normal.

The IP address and gateway of the RoCE NIC cannot be configured.

Snt9b

Snt9b23

telescope:

2.7.5.3

2.7.5.9 or later

Major

The npu-smi is unavailable.

Check if the NPU driver is normal.

NPUs cannot be used.

Snt3P

Snt3PD

Snt9b

Snt9b23

telescope:

2.7.5.3

2.7.5.9 or later

Major

The ascend-dmi is unavailable.

Check if ToolBox is properly installed.

The ascend-dmi cannot be used for performance analysis.

Snt9b

Snt9b23

telescope:

2.7.5.3

2.7.5.9 or later

NPU: L1 switch port partial failure

NpuL1SwitchPortPartialFunctionFailure

Major

Some functions of the NPU's L1 1520 switch port fail.

Transfer this issue to the Ascend or hardware team for handling.

Services may be interrupted.

Snt9b23

telescope:

2.7.5.9 or later

lqdcmi:

2.1.0 or later

NPU: L1 switch fault

NpuL1SwitchFault

Major

There are faults in the L1 1520 switch of the NPU.

Transfer this issue to the Ascend or hardware team for handling.

Services may be interrupted.

Snt9b23

telescope:

2.7.5.9 or later

lqdcmi:

2.1.0 or later

NPU: Unmatched RoCE IP address

NpuRoceIPAddressMismatch

Major

The actual IP address of the RoCE NIC is inconsistent with the IP address in the hccn.conf configuration file.

Contact O&M engineers.

The parameter plane network is abnormal, preventing the execution of the multi-node task.

Snt9b

Snt9b23

telescope:

2.7.5.9 or later

Server: Component info collection

SysComponentInfo

Suggestion

The event collects server component details, including the driver and firmware versions, OS kernel, image name, ces-agent version, and SFS Turbo Client+ version.

None

None

Snt3PD

Snt9b

Snt9b23

-

NPU: PFC status abnormal

NpuPfcStatusWarning

Major

The event occurred because the fifth PFC priority queue on the NPU was not configured with a value of 1.

Run hccn_tool -i [device_id] -pfc -s bitmap 0,0,0,0,1,0,0,0 or contact AI Compute Service engineers.

RoCE communication is abnormal, and services may be interrupted.

Snt9b23

-

NPU: TLS certificate status abnormal

NpuTlsStatusWarning

Major

The certificate configurations of different NPUs are inconsistent.

It is recommended that TLS be either enabled on all NPUs or disabled on all NPUs.

hccn_tool -i <device_id> -tls -s enable 0

hccn_tool -i <device_id> -tls -s enable 1

HCCL communication is abnormal, and services may be interrupted.

Snt9b23

-

NPU: HCCS health status abnormal

NpuHccsHealthWarning

Major

The event occurred because the NPU HCCS health status is abnormal (not OK).

Transfer this issue to the Ascend or hardware team for handling.

This event causes data synchronization between NPUs to fail. As a result, the training task is interrupted or the inference result is incorrect.

Snt9b23

-

SDI NIC: Status abnormal

SdiCheckWarning

Major

The event occurred because the SDI NIC was neither UP nor RUNNING.

  1. Check the driver or contact the vendor to restore the NIC.
  2. Check the network configuration file /etc/sysconfig/network-scripts/ifcfg-[xxx] (xxx indicates the NIC name), ensure that ONBOOT is set to yes, and restart the network service using systemctl restart NetworkManager.
  3. Check whether the NIC is physically damaged or the power supply is faulty.

Services may be interrupted.

Snt9b

Snt9b23

-
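
Step 2 of the SDI NIC solution above asks you to confirm that ONBOOT is set to yes in the NIC configuration file. A minimal Python sketch of that check follows. It works on the text of an ifcfg file and assumes the usual KEY=value layout; the function name and the sample content are hypothetical, and a key that is absent is reported as not enabled.

```python
def onboot_enabled(ifcfg_text: str) -> bool:
    """Return True if the last ONBOOT assignment in the text is 'yes'."""
    enabled = False
    for raw_line in ifcfg_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        key, separator, value = line.partition("=")
        if separator and key.strip() == "ONBOOT":
            # Drop optional surrounding quotes before comparing.
            enabled = value.strip().strip("\"'").lower() == "yes"
    return enabled


# Hypothetical file content, for illustration only.
sample = """\
# example NIC configuration
DEVICE=eth0
BOOTPROTO=dhcp
ONBOOT="no"
"""

print(onboot_enabled(sample))
print(onboot_enabled(sample.replace('"no"', "yes")))
```

To run the check against a real NIC, read the file named in step 2 and pass its content to the function. The sketch only reports the setting; restarting the network service is still a manual step.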