Updated on 2026-06-01 GMT+08:00

Using Cloud Eye to Monitor NPU Events of Lite Servers

Description

You can use Cloud Eye to centrally collect key events and cloud resource operational events. When an event occurs, you will receive an alarm. Lite Servers mainly support BMS and ECS events. The table below lists NPU-related events. For details about other events, see Events Supported by Event Monitoring.

Constraints

  • Event reporting to Cloud Eye requires the Cloud Eye Agent plugin, which has strict resource usage limits. When its resource usage exceeds the threshold, the Agent's circuit breaker is triggered. For details about the resource usage, see Cloud Eye Server Monitoring.
  • The monitoring agent has been fully tested in the public images provided by Lite Servers. If you use your own image, test the agent before deploying the image in the production environment to prevent incorrect monitoring information.

Prerequisites

The Cloud Eye Agent has been installed on the Lite Server. For details about how to check whether the Cloud Eye Agent is installed and how to install it, see Installing Cloud Eye Agent Monitoring Plug-ins.
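
Table 1 also lists, for each event, the telescope versions that report it, usually as a few specific versions plus a minimum such as "2.7.5.9 or later". The following Python sketch shows one way to check an installed version string against such an entry. The function names are hypothetical and the installed version must be obtained separately; nothing here is part of the Cloud Eye Agent itself.

```python
def parse_version(text):
    """Turn a dotted version such as '2.7.5.9' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def is_supported(installed, exact_versions, minimum):
    """Return True if `installed` is one of the listed versions or is at
    least `minimum` (the 'or later' entry)."""
    version = parse_version(installed)
    listed = {parse_version(v) for v in exact_versions}
    return version in listed or version >= parse_version(minimum)


# Entry copied from the NPUSMICardNotFound row of Table 1.
EXACT = ["2.7.4.3", "2.7.5.3", "2.7.5.4"]
MINIMUM = "2.7.5.9"

if __name__ == "__main__":
    for candidate in ["2.7.4.3", "2.7.5.5", "2.7.5.9", "2.7.10.0"]:
        print(candidate, is_supported(candidate, EXACT, MINIMUM))
```

Versions are compared as tuples of integers rather than as strings, so that a version such as 2.7.10.0 is treated as later than 2.7.5.9.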

Event Source

BMS/ECS

Event Namespace

SYS.BMS/SYS.ECS/service.ModelArts

Event List

Table 1 Fault events supported by Lite Servers (BMS/ECS)

| Event | Event ID | Event Severity | Description | Solution | Impact | Supported Model | Supported Cloud Eye Agent Version |
|---|---|---|---|---|---|---|---|
| NPU: device not found by npu-smi info | NPUSMICardNotFound | Major | The Ascend driver is faulty or the NPU is disconnected. | Reference: ModelArts Troubleshooting > "Lite Server" > "Handling Suggestions for NPUSMICardNotFound Events" | The NPU cannot be used normally. | Snt3P, Snt3PD, Snt9b, Snt9b23 | telescope: 2.7.4.3, 2.7.5.3, 2.7.5.4, 2.7.5.9 or later |
| NPU: PCIe link error | PCIeErrorFound | Major | The lspci command output shows that the NPU is in the rev ff state. | Reference: ModelArts Troubleshooting > "Lite Server" > "Handling Suggestions for PCIeErrorFound Events" | The NPU cannot be used normally. | Snt3P, Snt3PD, Snt9b, Snt9b23 | telescope: 2.7.4.3, 2.7.5.3, 2.7.5.4, 2.7.5.9 or later |
| NPU: device not found by lspci | LspciCardNotFound | Major | The NPU is disconnected. | Reference: ModelArts Troubleshooting > "Lite Server" > "Handling Suggestions for LspciCardNotFound Events" | The NPU cannot be used normally. | Snt3P, Snt3PD, Snt9b, Snt9b23 | telescope: 2.7.4.3, 2.7.5.3, 2.7.5.4, 2.7.5.9 or later |
| NPU: overtemperature | TemperatureOverUpperLimit | Major | The temperature of DDR or software is too high. | Reference: ModelArts Troubleshooting > "Lite Server" > "Handling Suggestions for NPUErrorCodeWarning Events" | The instance may be powered off and devices may not be found. | Snt3P, Snt3PD, Snt9b, Snt9b23 | telescope: 2.7.4.3, 2.7.5.3, 2.7.5.4, 2.7.5.9 or later |
| NPU: uncorrectable ECC error | UncorrectableEccErrorCount | Major | There are uncorrectable ECC errors on the NPU. | Reference: ModelArts Troubleshooting > "Lite Server" > "Handling Suggestions for NPUErrorCodeWarning Events" | Services may be interrupted. | Snt3P, Snt3PD | telescope: 2.7.4.3, 2.7.5.3, 2.7.5.4, 2.7.5.9 or later |
| NPU: request for instance restart | RebootVirtualMachine | Suggestion | A fault occurs and the instance needs to be restarted. | Reference: ModelArts Troubleshooting > "Lite Server" > "Handling Suggestions for NPUErrorCodeWarning Events" | Services may be interrupted. | Snt3P, Snt3PD, Snt9b, Snt9b23 | telescope: 2.7.4.3, 2.7.5.3, 2.7.5.4, 2.7.5.9 or later |
| NPU: request for SoC reset | ResetSOC | Suggestion | A fault occurs and the SoC needs to be reset. | Reference: ModelArts Troubleshooting > "Lite Server" > "Handling Suggestions for NPUErrorCodeWarning Events" | Services may be interrupted. | Snt3P, Snt3PD, Snt9b, Snt9b23 | telescope: 2.7.4.3, 2.7.5.3, 2.7.5.4, 2.7.5.9 or later |
| NPU: request for restart AI process | RestartAIProcess | Suggestion | A fault occurs and the AI process needs to be restarted. | Reference: ModelArts Troubleshooting > "Lite Server" > "Handling Suggestions for NPUErrorCodeWarning Events" | The current AI task will be interrupted. | Snt3P, Snt3PD, Snt9b, Snt9b23 | telescope: 2.7.4.3, 2.7.5.3, 2.7.5.4, 2.7.5.9 or later |
| NPU: error codes | NPUErrorCodeWarning | Major | A large number of NPU error codes indicating major or higher-level errors are returned. You can further locate the faults based on the error codes. | Reference: ModelArts Troubleshooting > "Lite Server" > "Handling Suggestions for NPUErrorCodeWarning Events" | Services may be interrupted. | Snt3P, Snt3PD, Snt9b, Snt9b23 | telescope: 2.7.4.3, 2.7.5.3, 2.7.5.4, 2.7.5.9 or later |
| Multiple NPU HBM ECC errors | NpuHbmMultiEccInfo | Suggestion | There are NPU HBM ECC errors. | This event is only a reference for other events. You do not need to handle it separately. | This event is only a reference for other events. You do not need to handle it separately. | Snt9b, Snt9b23 | telescope: 2.7.5.9 or later |
| GPU: invalid RoCE NIC configuration | GpuRoceNicConfigIncorrect | Major | The RoCE NIC configuration of the GPU server is invalid. | Reference: ModelArts Troubleshooting > "Lite Server" > "Handling Suggestions for GpuRoceNicConfigIncorrect Events" | The parameter plane network is abnormal, preventing the execution of the multi-node task. | GPU | telescope: 2.7.5.9 or later |
| ReadOnly issues in OS | ReadOnlyFileSystem | Critical | The file system %s is read-only. | Reference: ModelArts Troubleshooting > "Lite Server" > "Handling Suggestions for ReadOnlyFileSystem Events" | Files on the file system cannot be written or modified. | - | telescope: 2.7.5.3, 2.7.5.9 or later |
| NPU: driver and firmware not matching | NpuDriverFirmwareMismatch | Major | The NPU's driver and firmware do not match. | Reference: ModelArts Troubleshooting > "Lite Server" > "Handling Suggestions for NpuDriverFirmwareMismatch Events" | NPUs cannot be used. | Snt3P, Snt3PD, Snt9b, Snt9b23 | telescope: 2.7.5.3, 2.7.5.9 or later |
| NPU: RoCE NIC down | RoCELinkStatusDown | Major | The RoCE link of NPU %d is down. | Reference: ModelArts Troubleshooting > "Lite Server" > "Handling Suggestions for RoCELinkStatusDown Events" | The NPU NIC is unavailable. | Snt9b, Snt9b23 | telescope: 2.7.5.3, 2.7.5.9 or later |
| NPU: RoCE NIC health status abnormal | RoCEHealthStatusError | Major | The RoCE network health status of NPU %d is abnormal. | Reference: ModelArts Troubleshooting > "Lite Server" > "Handling Suggestions for RoCEHealthStatusError Events" | The NPU NIC is unavailable. | Snt9b, Snt9b23 | telescope: 2.7.5.3, 2.7.5.9 or later |
| NPU: RoCE NIC configuration file /etc/hccn.conf not exist | HccnConfNotExisted | Major | The RoCE NIC configuration file /etc/hccn.conf does not exist. | Reference: ModelArts Troubleshooting > "Lite Server" > "Handling Suggestions for HccnConfNotExisted Events" | The RoCE NIC is unavailable. | Snt9b, Snt9b23 | telescope: 2.7.5.3, 2.7.5.9 or later |
| GPU: basic components abnormal | GpuEnvironmentSystem | Major | The nvidia-smi command is abnormal. | 1. Check whether the driver version is compatible with the GPU. Look up the GPU model and driver version on the NVIDIA official website: https://www.nvidia.cn/drivers/lookup/ 2. Run lsmod \| grep nvidia to check whether the NVIDIA kernel module is loaded. If not, run sudo modprobe nvidia to load it manually. 3. Reinstall the GPU driver. 4. If the fault persists, submit a service ticket and contact the O&M engineer. | The GPU driver is unavailable. | GPU | telescope: 2.7.5.3, 2.7.5.9 or later |
| GPU: basic components abnormal | GpuEnvironmentSystem | Major | The nvidia-fabricmanager version is inconsistent with the GPU driver version. | Check whether the GPU driver version matches the nvidia-fabricmanager version. | The nvidia-fabricmanager cannot work properly, affecting GPU usage. | GPU | telescope: 2.7.5.3, 2.7.5.9 or later |
| GPU: basic components abnormal | GpuEnvironmentSystem | Major | The container plugin nvidia-container-toolkit is not installed. | Install and configure the NVIDIA Container Toolkit by referring to its installation guide: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html | Docker cannot attach GPUs. | GPU | telescope: 2.7.5.3, 2.7.5.9 or later |
| Local disk attachment inspection | MountDiskSystem | Major | The /etc/fstab file contains invalid UUIDs. | Reference: ModelArts Troubleshooting > "Lite Server" > "Handling Suggestions for MountDiskSystem Events" | The disk attachment error causes the server to restart abnormally. | - | telescope: 2.7.5.3, 2.7.5.9 or later |
| GPU: incorrectly configured dynamic route for Ant series server | GpuRouteConfigError | Major | The dynamic route of the NIC %s of an Ant series server is not configured or is incorrectly configured. CMD [ip route]: %s \| CMD [ip route show table all]: %s. | Submit a service ticket and contact the O&M engineer. | The NPU network communication is abnormal. | GPU | telescope: 2.7.5.3, 2.7.5.9 or later |
| NPU: RoCE port not split | RoCEUdpConfigError | Major | The RoCE UDP port is not split. | Reference: ModelArts Troubleshooting > "Lite Server" > "Handling Suggestions for RoCEUdpConfigError Events" | The communication performance of NPUs is affected. | Snt9b, Snt9b23 | telescope: 2.7.5.9 or later |
| Warning of automatic system kernel upgrade | KernelUpgradeWarning | Major | Warning of automatic system kernel upgrade. Old version: %s; new version: %s. | Reference: ModelArts Troubleshooting > "Lite Server" > "Handling Suggestions for KernelUpgradeWarning Events" | The AI software may be unavailable. | Snt3P, Snt3PD, Snt9b, Snt9b23 | telescope: 2.7.5.3, 2.7.5.9 or later |
| NPU environment command detection | NpuToolsWarning | Major | The hccn_tool is unavailable. | Reference: ModelArts Troubleshooting > "Lite Server" > "Handling Suggestions for NpuToolsWarning Events" | The IP address and gateway of the RoCE NIC cannot be configured. | Snt9b, Snt9b23 | telescope: 2.7.5.3, 2.7.5.9 or later |
| NPU environment command detection | NpuToolsWarning | Major | The npu-smi is unavailable. | Reference: ModelArts Troubleshooting > "Lite Server" > "Handling Suggestions for NpuToolsWarning Events" | NPUs cannot be used. | Snt3P, Snt3PD, Snt9b, Snt9b23 | telescope: 2.7.5.3, 2.7.5.9 or later |
| NPU environment command detection | NpuToolsWarning | Major | The ascend-dmi is unavailable. | Reference: ModelArts Troubleshooting > "Lite Server" > "Handling Suggestions for NpuToolsWarning Events" | The ascend-dmi cannot be used for performance analysis. | Snt9b, Snt9b23 | telescope: 2.7.5.3, 2.7.5.9 or later |
| NPU: L1 switch port partial failure | NpuL1SwitchPortPartialFunctionFailure | Major | Some functions of the NPU's L1 1520 switch port fail. | Reference: ModelArts Troubleshooting > "Lite Server" > "Handling Suggestions for NpuL1SwitchPortPartialFunctionFailure Events" | Services may be interrupted. | Snt9b23 | telescope: 2.7.5.9 or later; lqdcmi: 2.1.0 or later |
| NPU: L1 switch fault | NpuL1SwitchFault | Major | There are faults in the L1 1520 switch of the NPU. | Reference: ModelArts Troubleshooting > "Lite Server" > "Handling Suggestions for NpuL1SwitchFault Events" | Services may be interrupted. | Snt9b23 | telescope: 2.7.5.9 or later; lqdcmi: 2.1.0 or later |
| NPU: Unmatched RoCE IP address | NpuRoceIPAddressMismatch | Major | The actual IP address of the RoCE NIC is inconsistent with the IP address in the hccn.conf configuration file. | Reference: ModelArts Troubleshooting > "Lite Server" > "Handling Suggestions for NpuRoceIPAddressMismatch Events" | The parameter plane network is abnormal, preventing the execution of the multi-node task. | Snt9b, Snt9b23 | telescope: 2.7.5.9 or later |
| VRD target version differs from the current version | InconsistentNpuVrdVersion | Major | VRD target version differs from the current version. | Upgrade the VRD by powering off the server, waiting 5 minutes, and then powering it back on. | The event has no short-term impacts, but it may affect NPU stability over time. | Snt9b, Snt9b23 | - |
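
Two of the checks described in Table 1 can be reproduced by hand when triaging an event: PCIeErrorFound is raised when lspci shows the NPU in the rev ff state, and the GpuEnvironmentSystem solution asks whether the NVIDIA kernel module appears in the lsmod output. The Python sketch below parses the text output of those two commands. It is an illustration under assumptions, not the Agent's own detection logic: the function names are hypothetical, and the keyword used to pick out NPU lines from the lspci output is a placeholder that must be adapted to your hardware.

```python
import subprocess


def run(command):
    """Run a command and return its stdout as text ('' if it is missing)."""
    try:
        result = subprocess.run(command, capture_output=True, text=True, check=False)
    except FileNotFoundError:
        return ""
    return result.stdout


def devices_in_rev_ff(lspci_output, keyword):
    """Return lspci lines that mention `keyword` and end in '(rev ff)'."""
    return [
        line.strip()
        for line in lspci_output.splitlines()
        if keyword.lower() in line.lower() and line.rstrip().endswith("(rev ff)")
    ]


def nvidia_module_loaded(lsmod_output):
    """Return True if a module named exactly 'nvidia' appears in lsmod output."""
    for line in lsmod_output.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if fields and fields[0] == "nvidia":
            return True
    return False


if __name__ == "__main__":
    # "accelerators" is an assumed keyword; adjust it to match your NPU lines.
    print(devices_in_rev_ff(run(["lspci"]), "accelerators"))
    print(nvidia_module_loaded(run(["lsmod"])))
```

The module check is stricter than lsmod | grep nvidia: it matches only a module named nvidia in the first column, whereas grep also matches related modules and the "Used by" column.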