Updated on 2025-08-01 GMT+08:00

Using Cloud Eye to Monitor NPU Resources of a Single Lite Server Node

Scenario

Cloud Eye is used to monitor Lite Server. This section describes how to connect a Lite Server node to Cloud Eye so that you can monitor its resources and events.

Constraints

  • The Agent plug-in is required for monitoring and is subject to strict resource usage limits. If its resource usage exceeds the threshold, the Agent circuit breaker is triggered. For details about the resource usage, see Cloud Eye Server Monitoring.
  • If you run an NPU stress test with Ascend-dmi, some NPU metric data may be lost.
  • The monitoring agent has been fully tested in the public images provided by Lite Server. If you use your own image, test the agent before deploying the image in the production environment to prevent incorrect monitoring information.

Overview

For details, see Bare Metal Server (BMS) Server Monitoring. In addition to the images listed in the document, Ubuntu 20.04 is also supported.

The sampling period of monitoring metrics is 1 minute. Do not change it, or monitoring may not work properly. The current monitoring metrics cover the CPU, memory, disk, and network. Accelerator card metrics can be collected once the accelerator card driver is installed on the host.

The NPU metric collection function depends on the Linux system tool lspci. Some events depend on the blkid and grub2-editenv system tools. Ensure that these tools are installed and work properly.

Tool

Check Method

Installation Method

lspci

Run lspci in a shell to list the PCI devices in the system. The following shows an example:

$ sudo lspci
00:00.0 PCI bridge: Huawei Technologies Co., Ltd. HiSilicon PCIe Root Port with Gen4 (rev 21)
00:08.0 PCI bridge: Huawei Technologies Co., Ltd. HiSilicon PCIe Root Port with Gen4 (rev 21)
00:10.0 PCI bridge: Huawei Technologies Co., Ltd. HiSilicon PCIe Root Port with Gen4 (rev 21)

lspci displays PCI device information. It is part of the pciutils software package, which most Linux distributions install by default. If lspci is not installed, use the package manager to install pciutils.

Run the following commands in Debian/Ubuntu:

sudo apt-get update
sudo apt-get install pciutils

Run the following command in Red Hat/CentOS/EulerOS:

sudo yum install pciutils

blkid

Run blkid in a shell to list the block devices in the system. The following shows an example:

$ sudo blkid
/dev/sda1: UUID="123e4567-e89b-12d3-a456-426614174000" TYPE="vfat" PARTUUID="56789abc-def0-1234-5678-9abcd3f2c0a1"
/dev/sda2: UUID="a1b2c3d4-e5f6-789a-bcde-f0123456789a" TYPE="swap" PARTUUID="edcba98-7654-3210-fedc-ba9876543210"
/dev/sda3: UUID="01234567-89ab-cdef-0123-456789abcdef" TYPE="ext4" PARTUUID="fedcba09-8765-4321-fedc-ba0987654321"

blkid displays block device attributes in Linux. It is part of the util-linux software package, which most Linux distributions install by default. If blkid is not installed, use the package manager to install util-linux.

Run the following commands in Debian/Ubuntu:

sudo apt-get update
sudo apt-get install util-linux

Run the following command in Red Hat/CentOS/EulerOS:

sudo yum install util-linux

grub2-editenv (required only for Red Hat, CentOS, and EulerOS)

Run grub2-editenv list in a shell to display the GRUB environment variables. The following shows an example:

$ sudo grub2-editenv list
timeout=5
default=0
saved_entry=Red Hat Enterprise Linux Server, with Linux 4.18.0-305.el8.x86_64

grub2-editenv is part of GRUB2 and is used to manage GRUB environment variables. GRUB2 is installed by default in most Linux distributions. If grub2-editenv is not installed, use the package manager to install it.

Run the following commands in Debian/Ubuntu:

sudo apt-get update
sudo apt-get install grub2

Run the following command in Red Hat/CentOS/EulerOS:

sudo yum install grub2
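The checks in the table above can be combined into one pre-flight step. The following is a minimal sketch, not part of the agent: check_tools is a hypothetical helper name, and it only verifies that each tool can be found on the PATH, not that its output is correct.

```shell
#!/usr/bin/env bash
# Pre-flight check for the system tools that the monitoring agent relies on.
# Prints one line per tool and returns the number of missing tools (0 = all found).
check_tools() {
    local missing=0 tool
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "OK       $tool"
        else
            echo "MISSING  $tool"
            missing=$((missing + 1))
        fi
    done
    return "$missing"
}

# Example call (grub2-editenv is required only for Red Hat, CentOS, and EulerOS):
# check_tools lspci blkid grub2-editenv
```

Run it with the same privileges as the examples above. On some systems these tools are installed in /usr/sbin, which may not be on the PATH of an unprivileged user.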

Installing CES Agent Monitoring Plug-ins

OS-level, proactive, and fine-grained server monitoring is provided after the Agent is installed on the ECS or BMS.

  1. Create an agency for Cloud Eye. For details, see Creating a User and Granting Permissions. If you have enabled Cloud Eye host monitoring authorization when creating the server, skip this step.
  2. Currently, one-click monitoring installation is not supported on the Cloud Eye page. You need to log in to the server and run the following commands to install and configure the agent. For details about how to install the agent in other regions, see Installing the Agent on a Linux Server.

    cd /usr/local && curl -k -O https://obs.cn-north-4.myhuaweicloud.com/uniagent-cn-north-4/script/agent_install.sh && bash agent_install.sh

    If the following information is displayed, the installation is successful.

    Figure 1 Installation succeeded

  3. View the monitoring items on the Cloud Eye page. Accelerator card monitoring items are available only after the accelerator card driver is installed on the host.

    Figure 2 Monitoring page

    The monitoring plug-in is now installed. You can view the collected metrics on the UI or configure alarms based on the metric values.

Metric Namespace

AGT.ECS and SERVICE.BMS
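As an illustration of how these namespaces are used, the sketch below builds a request URL for the Cloud Eye metric data query API (GET /V1.0/{project_id}/metric-data). build_metric_url is a hypothetical helper. The endpoint, the namespace that applies to a given server, and the format of the npu dimension value are assumptions, so check them against the Cloud Eye API reference before relying on them.

```shell
#!/usr/bin/env bash
# Build a Cloud Eye "query metric data" URL for one NPU metric.
# Arguments: endpoint project_id namespace metric instance_id npu from_ms to_ms
# from_ms and to_ms are UNIX timestamps in milliseconds.
# period=1 requests raw samples; filter=average is only meaningful for rolled-up periods.
build_metric_url() {
    local endpoint=$1 project_id=$2 namespace=$3 metric=$4
    local instance_id=$5 npu=$6 from_ms=$7 to_ms=$8
    printf 'https://%s/V1.0/%s/metric-data' "$endpoint" "$project_id"
    printf '?namespace=%s&metric_name=%s' "$namespace" "$metric"
    # Dimension names come from the Dimension column of Table 1; the values are placeholders.
    printf '&dim.0=instance_id,%s&dim.1=npu,%s' "$instance_id" "$npu"
    printf '&from=%s&to=%s&period=1&filter=average\n' "$from_ms" "$to_ms"
}

# Example call (all values are placeholders that you must supply):
# url=$(build_metric_url ces.cn-north-4.myhuaweicloud.com "$PROJECT_ID" AGT.ECS \
#       npu_temperature "$INSTANCE_ID" "$NPU_ID" "$FROM_MS" "$TO_MS")
# curl -H "X-Auth-Token: $TOKEN" "$url"
```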

Lite Server Monitoring Metrics

Table 1 lists only NPU-related metrics. For other metrics, see Metrics Supported by the Agent.

Table 1 NPU metrics

No.

Category

Metric

Display Name

Description

Unit

Conversion Rule

Value Range

Dimension

Supported Model

Supported Versions

1

Overall

npu_device_health

NPU Health Status

Health status of the NPU

-

N/A

0: normal

1: minor alarm

2: major alarm

3: critical alarm

instance_id, npu

Snt3P

300IDuo

Snt9b

Snt9b23

telescope:

2.7.4.3

2.7.5.3

2.7.5.4

2.7.5.9 or later

2

npu_driver_health

NPU Driver Health Status

Health status of the NPU driver

-

N/A

0: normal

3: critical alarm

instance_id, npu

3

npu_power

NPU Power

NPU power

W

N/A

>0

instance_id, npu

4

npu_temperature

NPU Temperature

NPU temperature

°C

N/A

Natural number

instance_id, npu

5

npu_voltage

NPU Voltage

NPU voltage

V

N/A

Natural number

instance_id, npu

6

HBM

npu_util_rate_hbm

NPU HBM Usage

HBM usage of the NPU

%

N/A

0%–100%

instance_id, npu

Snt9b

Snt9b23

telescope:

2.7.4.3

2.7.5.3

2.7.5.4

2.7.5.9 or later

7

npu_hbm_freq

HBM Frequency

NPU HBM frequency

MHz

N/A

>0

instance_id, npu

8

npu_freq_hbm

HBM Frequency

NPU HBM frequency

MHz

N/A

>0

instance_id, npu

9

npu_hbm_usage

HBM Usage

NPU HBM usage

MB

N/A

≥0

instance_id, npu

10

npu_hbm_temperature

HBM Temperature

NPU HBM temperature

°C

N/A

Natural number

instance_id, npu

11

npu_hbm_bandwidth_util

HBM Bandwidth Usage

NPU HBM bandwidth usage

%

N/A

0%–100%

instance_id, npu

12

npu_util_rate_hbm_bw

HBM Bandwidth Usage

NPU HBM bandwidth usage

%

N/A

0%–100%

instance_id, npu

13

npu_hbm_mem_capacity

NPU HBM Memory Capacity

HBM memory capacity of the NPU

MB

N/A

≥0

instance_id, npu

14

npu_hbm_ecc_enable

HBM ECC Status

NPU HBM ECC status

-

N/A

0: ECC detection is disabled.

1: ECC detection is enabled.

instance_id, npu

15

npu_hbm_single_bit_error_cnt

Single-bit Errors on HBM

Current number of single-bit errors on the NPU HBM

count

N/A

≥0

instance_id, npu

16

npu_hbm_double_bit_error_cnt

Double-bit Errors on HBM

Current number of double-bit errors on the NPU HBM

count

N/A

≥0

instance_id, npu

17

npu_hbm_total_single_bit_error_cnt

Single-bit Errors in HBM Lifecycle

Number of single-bit errors in the NPU HBM lifecycle

count

N/A

≥0

instance_id, npu

18

npu_hbm_total_double_bit_error_cnt

Double-bit Errors in HBM Lifecycle

Number of double-bit errors in the NPU HBM lifecycle

count

N/A

≥0

instance_id, npu

19

npu_hbm_single_bit_isolated_pages_cnt

Isolated NPU Memory Pages with HBM Single-bit Errors

Number of isolated NPU memory pages with HBM single-bit errors

count

N/A

≥0

instance_id, npu

20

npu_hbm_double_bit_isolated_pages_cnt

Isolated NPU Memory Pages with HBM Multi-bit Errors

Number of isolated NPU memory pages with HBM double-bit errors

count

N/A

≥0

instance_id, npu

21

DDR

npu_usage_mem

Used NPU Memory

Used NPU memory

MB

N/A

≥0

instance_id, npu

Snt3P

300IDuo

telescope:

2.7.4.3

2.7.5.3

2.7.5.4

2.7.5.9 or later

22

npu_util_rate_mem

NPU Memory Usage

NPU memory usage

%

N/A

0%–100%

instance_id, npu

23

npu_freq_mem

NPU Memory Frequency

NPU memory frequency

MHz

N/A

>0

instance_id, npu

24

npu_util_rate_mem_bandwidth

NPU Memory Bandwidth Usage

NPU memory bandwidth usage

%

N/A

0%–100%

instance_id, npu

25

npu_sbe

NPU Single-bit Errors

Number of single-bit errors on the NPU

count

N/A

≥0

instance_id, npu

26

npu_dbe

NPU Double-bit Errors

Number of double-bit errors on the NPU

count

N/A

≥0

instance_id, npu

27

AI Core

npu_freq_ai_core

AI Core Frequency of the NPU

AI core frequency of the NPU

MHz

N/A

>0

instance_id, npu

Snt3P

300IDuo

Snt9b

Snt9b23

telescope:

2.7.4.3

2.7.5.3

2.7.5.4

2.7.5.9 or later

28

npu_freq_ai_core_rated

Rated Frequency of the NPU AI Core

Rated frequency of the NPU AI core

MHz

N/A

>0

instance_id, npu

29

npu_util_rate_ai_core

AI Core Usage of the NPU

AI core usage of the NPU

%

N/A

0%–100%

instance_id, npu

30

AI Vector

npu_util_rate_vector_core

NPU Vector Core Usage

NPU Vector Core Usage

%

N/A

0%–100%

instance_id, npu

Snt3P

300IDuo

Snt9b

Snt9b23

telescope:

2.7.5.9 or later

31

AI CPU

npu_aicpu_num

Number of AI CPUs of the NPU

Number of AI CPUs of the NPU

count

N/A

≥0

instance_id, npu

Snt3P

300IDuo

Snt9b

Snt9b23

telescope:

2.7.4.3

2.7.5.3

2.7.5.4

2.7.5.9 or later

32

npu_util_rate_ai_cpu

NPU AI CPU Usage

AI CPU usage of the NPU

%

N/A

0%–100%

instance_id, npu

33

npu_aicpu_avg_util_rate

Average AI CPU Usage of the NPU

Average AI CPU usage of the NPU

%

N/A

0%–100%

instance_id, npu

34

npu_aicpu_max_freq

Maximum AI CPU Frequency of the NPU

Maximum AI CPU frequency of the NPU

MHz

N/A

>0

instance_id, npu

35

npu_aicpu_cur_freq

AI CPU Frequency of the NPU

AI CPU frequency of the NPU

MHz

N/A

>0

instance_id, npu

36

CTRL CPU

npu_util_rate_ctrl_cpu

Control CPU Usage of the NPU

Control CPU usage of the NPU

%

N/A

0%–100%

instance_id, npu

Snt3P

300IDuo

Snt9b

Snt9b23

telescope:

2.7.4.3

2.7.5.3

2.7.5.4

2.7.5.9 or later

37

npu_freq_ctrl_cpu

Control CPU Frequency of the NPU

Control CPU frequency of the NPU

MHz

N/A

>0

instance_id, npu

38

PCIe link

npu_link_cap_speed

Max. NPU Link Speed

Maximum link speed of the NPU

GT/s

N/A

≥0

instance_id, npu

310P

300IDuo

Snt9b

Snt9b23

telescope:

2.7.4.3

2.7.5.3

2.7.5.4

2.7.5.9 or later

39

npu_link_cap_width

Max. NPU Link Width

Maximum link width of the NPU

count

N/A

≥0

instance_id, npu

40

npu_link_status_speed

NPU Link Speed

Link speed of the NPU

GT/s

N/A

≥0

instance_id, npu

41

npu_link_status_width

NPU Link Width

Link width of the NPU

count

N/A

≥0

instance_id, npu

42

RoCE network

npu_device_network_health

NPU Network Health Status

Connectivity of the IP address of the RoCE NIC on the NPU

-

N/A

0: The network health status is normal.

Other values: The network status is abnormal.

instance_id, npu

Snt9b

Snt9b23

telescope:

2.7.4.3

2.7.5.3

2.7.5.4

2.7.5.9 or later

43

npu_network_port_link_status

NPU Network Port Link Status

Link status of the NPU network port

-

N/A

0: up

1: down

instance_id, npu

44

npu_roce_tx_rate

NPU NIC Uplink Rate

Uplink rate of the NPU NIC

MB/s

N/A

≥0

instance_id, npu

45

npu_roce_rx_rate

NPU NIC Downlink Rate

Downlink rate of the NPU NIC

MB/s

N/A

≥0

instance_id, npu

46

npu_mac_tx_mac_pause_num

PAUSE Frames Sent from MAC

Total number of PAUSE frames sent from the MAC address corresponding to the NPU

count

N/A

≥0

instance_id, npu

47

npu_mac_rx_mac_pause_num

PAUSE Frames Received by MAC

Total number of PAUSE frames received by the MAC address corresponding to the NPU

count

N/A

≥0

instance_id, npu

48

npu_mac_tx_pfc_pkt_num

PFC Frames Sent from MAC

Total number of PFC frames sent from the MAC address corresponding to the NPU

count

N/A

≥0

instance_id, npu

49

npu_mac_rx_pfc_pkt_num

PFC Frames Received by MAC

Total number of PFC frames received by the MAC address corresponding to the NPU

count

N/A

≥0

instance_id, npu

50

npu_mac_tx_bad_pkt_num

Bad Packets Sent from MAC

Total number of bad packets sent from the MAC address corresponding to the NPU

count

N/A

≥0

instance_id, npu

51

npu_mac_rx_bad_pkt_num

Bad Packets Received by MAC

Total number of bad packets received by the MAC address corresponding to the NPU

count

N/A

≥0

instance_id, npu

52

npu_roce_tx_err_pkt_num

Bad Packets Sent by RoCE

Total number of bad packets sent by the RoCE NIC on the NPU

count

N/A

≥0

instance_id, npu

53

npu_roce_rx_err_pkt_num

Bad Packets Received by RoCE

Total number of bad packets received by the RoCE NIC on the NPU

count

N/A

≥0

instance_id, npu

54

npu_roce_tx_all_pkt_num

Packets Transmitted by NPU RoCE

The number of packets transmitted by the NPU's RoCE.

count

N/A

≥0

instance_id, npu

telescope:

2.7.5.9 or later

55

npu_roce_rx_all_pkt_num

Packets Received by NPU RoCE

The number of packets received by the NPU's RoCE.

count

N/A

≥0

instance_id, npu

56

npu_roce_new_pkt_rty_num

Packets Retransmitted by NPU RoCE

The number of packets retransmitted by the NPU's RoCE.

count

N/A

≥0

instance_id, npu

57

npu_roce_out_of_order_num

Abnormal PSN Packets Received by NPU RoCE

The number of packets received by the NPU's RoCE whose PSN is greater than the expected PSN, plus the number of packets with a duplicate PSN. Out-of-order or lost packets trigger retransmission.

count

N/A

≥0

instance_id, npu

58

npu_roce_rx_cnp_pkt_num

CNP Packets Received by NPU RoCE

The number of CNP packets received by the NPU's RoCE.

count

N/A

≥0

instance_id, npu

59

npu_roce_tx_cnp_pkt_num

CNP Packets Transmitted by NPU RoCE

The number of CNP packets transmitted by the NPU's RoCE.

count

N/A

≥0

instance_id, npu

60

RoCE optical module

npu_opt_temperature

NPU Optical Module Temperature

NPU optical module temperature

°C

N/A

Natural number

instance_id, npu

Snt9b

Snt9b23

telescope:

2.7.4.3

2.7.5.3

2.7.5.4

2.7.5.9 or later

61

npu_opt_temperature_high_thres

Upper Limit of the NPU Optical Module Temperature

Upper limit of the NPU optical module temperature

°C

N/A

Natural number

instance_id, npu

62

npu_opt_temperature_low_thres

Lower Limit of the NPU Optical Module Temperature

Lower limit of the NPU optical module temperature

°C

N/A

Natural number

instance_id, npu

63

npu_opt_voltage

NPU Optical Module Voltage

NPU optical module voltage

mV

N/A

Natural number

instance_id, npu

64

npu_opt_voltage_high_thres

Upper Limit of the NPU Optical Module Voltage

Upper limit of the NPU optical module voltage

mV

N/A

Natural number

instance_id, npu

65

npu_opt_voltage_low_thres

Lower Limit of the NPU Optical Module Voltage

Lower limit of the NPU optical module voltage

mV

N/A

Natural number

instance_id, npu

66

npu_opt_tx_power_lane0

TX Power of the NPU Optical Module in Channel 0

Transmit power of the NPU optical module in channel 0

mW

N/A

≥0

instance_id, npu

67

npu_opt_tx_power_lane1

TX Power of the NPU Optical Module in Channel 1

Transmit power of the NPU optical module in channel 1

mW

N/A

≥0

instance_id, npu

68

npu_opt_tx_power_lane2

TX Power of the NPU Optical Module in Channel 2

Transmit power of the NPU optical module in channel 2

mW

N/A

≥0

instance_id, npu

69

npu_opt_tx_power_lane3

TX Power of the NPU Optical Module in Channel 3

Transmit power of the NPU optical module in channel 3

mW

N/A

≥0

instance_id, npu

70

npu_opt_rx_power_lane0

RX Power of the NPU Optical Module in Channel 0

Receive power of the NPU optical module in channel 0

mW

N/A

≥0

instance_id, npu

71

npu_opt_rx_power_lane1

RX Power of the NPU Optical Module in Channel 1

Receive power of the NPU optical module in channel 1

mW

N/A

≥0

instance_id, npu

72

npu_opt_rx_power_lane2

RX Power of the NPU Optical Module in Channel 2

Receive power of the NPU optical module in channel 2

mW

N/A

≥0

instance_id, npu

73

npu_opt_rx_power_lane3

RX Power of the NPU Optical Module in Channel 3

Receive power of the NPU optical module in channel 3

mW

N/A

≥0

instance_id, npu

74

npu_opt_tx_bias_lane0

TX Bias Current of the NPU Optical Module in Channel 0

Transmitted bias current of the NPU optical module in channel 0

mA

N/A

≥0

instance_id, npu

75

npu_opt_tx_bias_lane1

TX Bias Current of the NPU Optical Module in Channel 1

Transmitted bias current of the NPU optical module in channel 1

mA

N/A

≥0

instance_id, npu

76

npu_opt_tx_bias_lane2

TX Bias Current of the NPU Optical Module in Channel 2

Transmitted bias current of the NPU optical module in channel 2

mA

N/A

≥0

instance_id, npu

77

npu_opt_tx_bias_lane3

TX Bias Current of the NPU Optical Module in Channel 3

Transmitted bias current of the NPU optical module in channel 3

mA

N/A

≥0

instance_id, npu

78

npu_opt_tx_los

TX Los of the NPU Optical Module

TX loss-of-signal (LOS) flag of the NPU optical module

count

N/A

≥0

instance_id, npu

79

npu_opt_rx_los

RX Los of the NPU Optical Module

RX loss-of-signal (LOS) flag of the NPU optical module

count

N/A

≥0

instance_id, npu

80

npu_opt_media_snr_lane0

NPU Optical Module Channel 0 Optical SNR

The signal-to-noise ratio (SNR) on the media (optical) side of channel 0 in the NPU optical module

dB

N/A

Natural number

instance_id, npu

telescope:

2.7.5.9 or later

81

npu_opt_media_snr_lane1

NPU Optical Module Channel 1 Optical SNR

The signal-to-noise ratio (SNR) on the media (optical) side of channel 1 in the NPU optical module

dB

N/A

Natural number

instance_id, npu

82

npu_opt_media_snr_lane2

NPU Optical Module Channel 2 Optical SNR

The signal-to-noise ratio (SNR) on the media (optical) side of channel 2 in the NPU optical module

dB

N/A

Natural number

instance_id, npu

83

npu_opt_media_snr_lane3

NPU Optical Module Channel 3 Optical SNR

The signal-to-noise ratio (SNR) on the media (optical) side of channel 3 in the NPU optical module

dB

N/A

Natural number

instance_id, npu

84

HCCS Lane mode

npu_macro1_0lane_max_consec_sec

Maximum Duration of NPU Macro1 in Lane 0 Mode

The maximum time NPU Macro1 operates in Lane 0 mode during a detection period

s

N/A

≥0

instance_id, npu

Snt9b

Snt9b23

telescope:

2.7.5.9 or later

85

npu_macro2_0lane_max_consec_sec

Maximum Duration of NPU Macro2 in Lane 0 Mode

The maximum time NPU Macro2 operates in Lane 0 mode during a detection period

s

N/A

≥0

instance_id, npu

86

npu_macro3_0lane_max_consec_sec

Maximum Duration of NPU Macro3 in Lane 0 Mode

The maximum time NPU Macro3 operates in Lane 0 mode during a detection period

s

N/A

≥0

instance_id, npu

87

npu_macro4_0lane_max_consec_sec

Maximum Duration of NPU Macro4 in Lane 0 Mode

The maximum time NPU Macro4 operates in Lane 0 mode during a detection period

s

N/A

≥0

instance_id, npu

88

npu_macro5_0lane_max_consec_sec

Maximum Duration of NPU Macro5 in Lane 0 Mode

The maximum time NPU Macro5 operates in Lane 0 mode during a detection period

s

N/A

≥0

instance_id, npu

89

npu_macro6_0lane_max_consec_sec

Maximum Duration of NPU Macro6 in Lane 0 Mode

The maximum time NPU Macro6 operates in Lane 0 mode during a detection period

s

N/A

≥0

instance_id, npu

90

npu_macro7_0lane_max_consec_sec

Maximum Duration of NPU Macro7 in Lane 0 Mode

The maximum time NPU Macro7 operates in Lane 0 mode during a detection period

s

N/A

≥0

instance_id, npu

91

npu_macro1_0lane_total_sec

Total Duration of NPU Macro1 in Lane 0 Mode

The total time NPU Macro1 operates in Lane 0 mode during a detection period

s

N/A

≥0

instance_id, npu

92

npu_macro2_0lane_total_sec

Total Duration of NPU Macro2 in Lane 0 Mode

The total time NPU Macro2 operates in Lane 0 mode during a detection period

s

N/A

≥0

instance_id, npu

93

npu_macro3_0lane_total_sec

Total Duration of NPU Macro3 in Lane 0 Mode

The total time NPU Macro3 operates in Lane 0 mode during a detection period

s

N/A

≥0

instance_id, npu

94

npu_macro4_0lane_total_sec

Total Duration of NPU Macro4 in Lane 0 Mode

The total time NPU Macro4 operates in Lane 0 mode during a detection period

s

N/A

≥0

instance_id, npu

95

npu_macro5_0lane_total_sec

Total Duration of NPU Macro5 in Lane 0 Mode

The total time NPU Macro5 operates in Lane 0 mode during a detection period

s

N/A

≥0

instance_id, npu

96

npu_macro6_0lane_total_sec

Total Duration of NPU Macro6 in Lane 0 Mode

The total time NPU Macro6 operates in Lane 0 mode during a detection period

s

N/A

≥0

instance_id, npu

97

npu_macro7_0lane_total_sec

Total Duration of NPU Macro7 in Lane 0 Mode

The total time NPU Macro7 operates in Lane 0 mode during a detection period

s

N/A

≥0

instance_id, npu

98

HCCS Serdes SNR

npu_macro1_serdes_lane0_snr

NPU Macro1 SerDes Lane 0 SNR

The SNR for SerDes Lane 0 in NPU Macro1

dB

N/A

Natural number

instance_id, npu

Snt9b

Snt9b23

telescope:

2.7.5.9 or later

99

npu_macro1_serdes_lane1_snr

NPU Macro1 SerDes Lane 1 SNR

The SNR for SerDes Lane 1 in NPU Macro1

dB

N/A

Natural number

instance_id, npu

100

npu_macro1_serdes_lane2_snr

NPU Macro1 SerDes Lane 2 SNR

The SNR for SerDes Lane 2 in NPU Macro1

dB

N/A

Natural number

instance_id, npu

101

npu_macro1_serdes_lane3_snr

NPU Macro1 SerDes Lane 3 SNR

The SNR for SerDes Lane 3 in NPU Macro1

dB

N/A

Natural number

instance_id, npu

102

npu_macro2_serdes_lane0_snr

NPU Macro2 SerDes Lane 0 SNR

The SNR for SerDes Lane 0 in NPU Macro2

dB

N/A

Natural number

instance_id, npu

103

npu_macro2_serdes_lane1_snr

NPU Macro2 SerDes Lane 1 SNR

The SNR for SerDes Lane 1 in NPU Macro2

dB

N/A

Natural number

instance_id, npu

104

npu_macro2_serdes_lane2_snr

NPU Macro2 SerDes Lane 2 SNR

The SNR for SerDes Lane 2 in NPU Macro2

dB

N/A

Natural number

instance_id, npu

105

npu_macro2_serdes_lane3_snr

NPU Macro2 SerDes Lane 3 SNR

The SNR for SerDes Lane 3 in NPU Macro2

dB

N/A

Natural number

instance_id, npu

106

npu_macro3_serdes_lane0_snr

NPU Macro3 SerDes Lane 0 SNR

The SNR for SerDes Lane 0 in NPU Macro3

dB

N/A

Natural number

instance_id, npu

107

npu_macro3_serdes_lane1_snr

NPU Macro3 SerDes Lane 1 SNR

The SNR for SerDes Lane 1 in NPU Macro3

dB

N/A

Natural number

instance_id, npu

108

npu_macro3_serdes_lane2_snr

NPU Macro3 SerDes Lane 2 SNR

The SNR for SerDes Lane 2 in NPU Macro3

dB

N/A

Natural number

instance_id, npu

109

npu_macro3_serdes_lane3_snr

NPU Macro3 SerDes Lane 3 SNR

The SNR for SerDes Lane 3 in NPU Macro3

dB

N/A

Natural number

instance_id, npu

110

npu_macro4_serdes_lane0_snr

NPU Macro4 SerDes Lane 0 SNR

The SNR for SerDes Lane 0 in NPU Macro4

dB

N/A

Natural number

instance_id, npu

111

npu_macro4_serdes_lane1_snr

NPU Macro4 SerDes Lane 1 SNR

The SNR for SerDes Lane 1 in NPU Macro4

dB

N/A

Natural number

instance_id, npu

112

npu_macro4_serdes_lane2_snr

NPU Macro4 SerDes Lane 2 SNR

The SNR for SerDes Lane 2 in NPU Macro4

dB

N/A

Natural number

instance_id, npu

113

npu_macro4_serdes_lane3_snr

NPU Macro4 SerDes Lane 3 SNR

The SNR for SerDes Lane 3 in NPU Macro4

dB

N/A

Natural number

instance_id, npu

114

npu_macro5_serdes_lane0_snr

NPU Macro5 SerDes Lane 0 SNR

The SNR for SerDes Lane 0 in NPU Macro5

dB

N/A

Natural number

instance_id, npu

115

npu_macro5_serdes_lane1_snr

NPU Macro5 SerDes Lane 1 SNR

The SNR for SerDes Lane 1 in NPU Macro5

dB

N/A

Natural number

instance_id, npu

116

npu_macro5_serdes_lane2_snr

NPU Macro5 SerDes Lane 2 SNR

The SNR for SerDes Lane 2 in NPU Macro5

dB

N/A

Natural number

instance_id, npu

117

npu_macro5_serdes_lane3_snr

NPU Macro5 SerDes Lane 3 SNR

The SNR for SerDes Lane 3 in NPU Macro5

dB

N/A

Natural number

instance_id, npu

118

npu_macro6_serdes_lane0_snr

NPU Macro6 SerDes Lane 0 SNR

The SNR for SerDes Lane 0 in NPU Macro6

dB

N/A

Natural number

instance_id, npu

119

npu_macro6_serdes_lane1_snr

NPU Macro6 SerDes Lane 1 SNR

The SNR for SerDes Lane 1 in NPU Macro6

dB

N/A

Natural number

instance_id, npu

120

npu_macro6_serdes_lane2_snr

NPU Macro6 SerDes Lane 2 SNR

The SNR for SerDes Lane 2 in NPU Macro6

dB

N/A

Natural number

instance_id, npu

121

npu_macro6_serdes_lane3_snr

NPU Macro6 SerDes Lane 3 SNR

The SNR for SerDes Lane 3 in NPU Macro6

dB

N/A

Natural number

instance_id, npu

122

npu_macro7_serdes_lane0_snr

NPU Macro7 SerDes Lane 0 SNR

The SNR for SerDes Lane 0 in NPU Macro7

dB

N/A

Natural number

instance_id, npu

123

npu_macro7_serdes_lane1_snr

NPU Macro7 SerDes Lane 1 SNR

The SNR for SerDes Lane 1 in NPU Macro7

dB

N/A

Natural number

instance_id, npu

124

npu_macro7_serdes_lane2_snr

NPU Macro7 SerDes Lane 2 SNR

The SNR for SerDes Lane 2 in NPU Macro7

dB

N/A

Natural number

instance_id, npu

125

npu_macro7_serdes_lane3_snr

NPU Macro7 SerDes Lane 3 SNR

The SNR for SerDes Lane 3 in NPU Macro7

dB

N/A

Natural number

instance_id, npu

126

HCCS packet statistics

npu_macro1_rx_cnt

Packets Received by NPU Macro1

The number of packets received by NPU Macro1 during a detection period

count

N/A

≥0

instance_id, npu

Snt9b

Snt9b23

telescope:

2.7.5.9 or later

127

npu_macro2_rx_cnt

Packets Received by NPU Macro2

The number of packets received by NPU Macro2 during a detection period

count

N/A

≥0

instance_id, npu

128

npu_macro3_rx_cnt

Packets Received by NPU Macro3

The number of packets received by NPU Macro3 during a detection period

count

N/A

≥0

instance_id, npu

129

npu_macro4_rx_cnt

Packets Received by NPU Macro4

The number of packets received by NPU Macro4 during a detection period

count

N/A

≥0

instance_id, npu

130

npu_macro5_rx_cnt

Packets Received by NPU Macro5

The number of packets received by NPU Macro5 during a detection period

count

N/A

≥0

instance_id, npu

131

npu_macro6_rx_cnt

Packets Received by NPU Macro6

The number of packets received by NPU Macro6 during a detection period

count

N/A

≥0

instance_id, npu

132

npu_macro7_rx_cnt

Packets Received by NPU Macro7

The number of packets received by NPU Macro7 during a detection period

count

N/A

≥0

instance_id, npu

133

npu_macro1_tx_cnt

Packets Sent by NPU Macro1

The number of packets sent by NPU Macro1 during a detection period

count

N/A

≥0

instance_id, npu

134

npu_macro2_tx_cnt

Packets Sent by NPU Macro2

The number of packets sent by NPU Macro2 during a detection period

count

N/A

≥0

instance_id, npu

135

npu_macro3_tx_cnt

Packets Sent by NPU Macro3

The number of packets sent by NPU Macro3 during a detection period

count

N/A

≥0

instance_id, npu

136

npu_macro4_tx_cnt

Packets Sent by NPU Macro4

The number of packets sent by NPU Macro4 during a detection period

count

N/A

≥0

instance_id, npu

137

npu_macro5_tx_cnt

Packets Sent by NPU Macro5

The number of packets sent by NPU Macro5 during a detection period

count

N/A

≥0

instance_id, npu

138

npu_macro6_tx_cnt

Packets Sent by NPU Macro6

The number of packets sent by NPU Macro6 during a detection period

count

N/A

≥0

instance_id, npu

139

npu_macro7_tx_cnt

Packets Sent by NPU Macro7

The number of packets sent by NPU Macro7 during a detection period

count

N/A

≥0

instance_id, npu

140

HCCS retransmission statistics

npu_macro1_retry_cnt

Packets Retransmitted by NPU Macro1

The number of packets retransmitted by NPU Macro1 during a detection period

count

N/A

≥0

instance_id, npu

Snt9b

Snt9b23

telescope:

2.7.5.9 or later

141

npu_macro2_retry_cnt

Packets Retransmitted by NPU Macro2

The number of packets retransmitted by NPU Macro2 during a detection period

count

N/A

≥0

instance_id, npu

142

npu_macro3_retry_cnt

Packets Retransmitted by NPU Macro3

The number of packets retransmitted by NPU Macro3 during a detection period

count

N/A

≥0

instance_id, npu

143

npu_macro4_retry_cnt

Packets Retransmitted by NPU Macro4

The number of packets retransmitted by NPU Macro4 during a detection period

count

N/A

≥0

instance_id, npu

144

npu_macro5_retry_cnt

Packets Retransmitted by NPU Macro5

The number of packets retransmitted by NPU Macro5 during a detection period

count

N/A

≥0

instance_id, npu

145

npu_macro6_retry_cnt

Packets Retransmitted by NPU Macro6

The number of packets retransmitted by NPU Macro6 during a detection period

count

N/A

≥0

instance_id, npu

146

npu_macro7_retry_cnt

Packets Retransmitted by NPU Macro7

The number of packets retransmitted by NPU Macro7 during a detection period

count

N/A

≥0

instance_id, npu

147

HCCS error packet statistics

npu_macro1_crc_error_cnt

Invalid Packets Received by NPU Macro1

The number of invalid CRC packets received by NPU Macro1 during a detection period

count

N/A

≥0

instance_id, npu

Snt9b

Snt9b23

telescope:

2.7.5.9 or later

148

npu_macro2_crc_error_cnt

Invalid Packets Received by NPU Macro2

The number of invalid CRC packets received by NPU Macro2 during a detection period

count

N/A

≥0

instance_id, npu

149

npu_macro3_crc_error_cnt

Invalid Packets Received by NPU Macro3

The number of invalid CRC packets received by NPU Macro3 during a detection period

count

N/A

≥0

instance_id, npu

150

npu_macro4_crc_error_cnt

Invalid Packets Received by NPU Macro4

The number of invalid CRC packets received by NPU Macro4 during a detection period

count

N/A

≥0

instance_id, npu

151

npu_macro5_crc_error_cnt

Invalid Packets Received by NPU Macro5

The number of invalid CRC packets received by NPU Macro5 during a detection period

count

N/A

≥0

instance_id, npu

152

npu_macro6_crc_error_cnt

Invalid Packets Received by NPU Macro6

The number of invalid CRC packets received by NPU Macro6 during a detection period

count

N/A

≥0

instance_id, npu

153

npu_macro7_crc_error_cnt

Invalid Packets Received by NPU Macro7

The number of invalid CRC packets received by NPU Macro7 during a detection period

count

N/A

≥0

instance_id, npu

154

npu_macro1_crc_error_rate

NPU Macro1 BER

The percentage of invalid CRC packets received by NPU Macro1 during a detection period

%

N/A

≥0

instance_id, npu

155

npu_macro2_crc_error_rate

NPU Macro2 BER

The percentage of invalid CRC packets received by NPU Macro2 during a detection period

%

N/A

≥0

instance_id, npu

156

npu_macro3_crc_error_rate

NPU Macro3 BER

The percentage of invalid CRC packets received by NPU Macro3 during a detection period

%

N/A

≥0

instance_id, npu

157

npu_macro4_crc_error_rate

NPU Macro4 BER

The percentage of invalid CRC packets received by NPU Macro4 during a detection period

%

N/A

≥0

instance_id, npu

158

npu_macro5_crc_error_rate

NPU Macro5 BER

The percentage of invalid CRC packets received by NPU Macro5 during a detection period

%

N/A

≥0

instance_id, npu

159

npu_macro6_crc_error_rate

NPU Macro6 BER

The percentage of invalid CRC packets received by NPU Macro6 during a detection period

%

N/A

≥0

instance_id, npu

160

npu_macro7_crc_error_rate

NPU Macro7 BER

The percentage of invalid CRC packets received by NPU Macro7 during a detection period

%

N/A

≥0

instance_id, npu
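When values from Table 1 are consumed by scripts or alarm handlers, the numeric health codes need to be translated back into their meaning. The following is a minimal sketch for npu_device_health that uses the value range from row 1. npu_health_label is a hypothetical helper name, and any value outside 0 to 3 is reported as unknown rather than guessed.

```shell
#!/usr/bin/env bash
# Translate an npu_device_health value into the label given in Table 1.
npu_health_label() {
    case "$1" in
        0) echo "normal" ;;
        1) echo "minor alarm" ;;
        2) echo "major alarm" ;;
        3) echo "critical alarm" ;;
        *) echo "unknown ($1)" ;;
    esac
}

# Example call:
# npu_health_label 2
```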

Supported Events

You can use Cloud Eye to centrally collect key events and cloud resource operational events. When an event occurs, you receive an alarm. Lite Server mainly supports BMS and ECS events. The table below lists NPU-related events. For details about other events, see Events Supported by Event Monitoring.
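One of the events in Table 2, PCIeErrorFound, is described as the NPU appearing in the rev ff state in the lspci output. The sketch below shows how that condition could be checked by hand. find_rev_ff is a hypothetical helper, and it matches any PCI device with revision ff, so narrowing the result to NPU devices is left to you. It illustrates the symptom and is not the agent's own detection logic.

```shell
#!/usr/bin/env bash
# Read lspci output on stdin and print the lines whose revision field is "ff".
# Returns 0 if at least one such line is found and 1 otherwise (grep semantics).
find_rev_ff() {
    grep -F '(rev ff)'
}

# Example call:
# sudo lspci | find_rev_ff
```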

Table 2 Events supported by Lite Server

Event Source

Namespace

Event

Event ID

Event Severity

Description

Solution

Impact

Supported Model

Supported Versions

BMS/ECS

SYS.BMS/SYS.ECS

NPU: device not found by npu-smi info

NPUSMICardNotFound

Major

The Ascend driver is faulty or the NPU is disconnected.

Contact O&M engineers.

The NPU cannot be used normally.

Snt3P

300IDuo

Snt9b

Snt9b23

telescope:

2.7.4.3

2.7.5.3

2.7.5.4

2.7.5.9 or later

NPU: PCIe link error

PCIeErrorFound

Major

The lspci command output shows that the NPU is in the rev ff state.

Contact O&M engineers.

The NPU cannot be used normally.

Snt3P

300IDuo

Snt9b

Snt9b23

telescope:

2.7.4.3

2.7.5.3

2.7.5.4

2.7.5.9 or later

NPU: device not found by lspci

LspciCardNotFound

Major

The NPU is disconnected.

Contact O&M engineers.

The NPU cannot be used normally.

Snt3P

300IDuo

Snt9b

Snt9b23

telescope:

2.7.4.3

2.7.5.3

2.7.5.4

2.7.5.9 or later

NPU: overtemperature

TemperatureOverUpperLimit

Major

The temperature of DDR or software is too high.

Stop services, restart the system, check the heat dissipation system, and reset the device.

The instance may be powered off and devices may not be found.

Snt3P

300IDuo

Snt9b

Snt9b23

telescope:

2.7.4.3

2.7.5.3

2.7.5.4

2.7.5.9 or later

NPU: uncorrectable ECC error

UncorrectableEccErrorWarning

Major

There are uncorrectable ECC errors on the NPU.

If services are affected, replace the NPU with another one.

Services may be interrupted.

Snt3P

300IDuo

telescope:

2.7.4.3

2.7.5.3

2.7.5.4

2.7.5.9 or later

NPU: request for instance restart

RebootVirtualMachine

Suggestion

A fault occurs and the instance needs to be restarted.

Collect the fault information, and restart the instance.

Services may be interrupted.

Snt3P

300IDuo

Snt9b

Snt9b23

telescope:

2.7.4.3

2.7.5.3

2.7.5.4

2.7.5.9 or later

NPU: request for SoC reset

ResetSOC

Suggestion

A fault occurs and the SoC needs to be reset.

Collect the fault information, and reset the SoC.

Services may be interrupted.

Snt3P

300IDuo

Snt9b

Snt9b23

telescope:

2.7.4.3

2.7.5.3

2.7.5.4

2.7.5.9 or later

NPU: request for AI process restart

RestartAIProcess

Suggestion

A fault occurs and the AI process needs to be restarted.

Collect the fault information, and restart the AI process.

The current AI task will be interrupted.

Snt3P

300IDuo

Snt9b

Snt9b23

telescope:

2.7.4.3

2.7.5.3

2.7.5.4

2.7.5.9 or later

NPU: error codes

NPUErrorCodeWarning

Major

A large number of NPU error codes indicating major or higher-level errors are returned. You can further locate the faults based on the error codes.

Locate the faults according to the Black Box Error Code Information List and Health Management Error Definition.

Services may be interrupted.

Snt3P

300IDuo

Snt9b

Snt9b23

telescope:

2.7.4.3

2.7.5.3

2.7.5.4

2.7.5.9 or later

Multiple NPU HBM ECC errors

NpuHbmMultiEccInfo

Suggestion

There are NPU HBM ECC errors.

This event is only a reference for other events. You do not need to handle it separately.

This event is only a reference for other events. You do not need to handle it separately.

Snt9b

Snt9b23

telescope:

2.7.5.9 or later

GPU: invalid RoCE NIC configuration

GpuRoceNicConfigIncorrect

Major

GPU: invalid RoCE NIC configuration

Contact O&M engineers.

The parameter plane network is abnormal, preventing multi-node tasks from running.

GPU

telescope:

2.7.5.9 or later

ReadOnly issues in OS

ReadOnlyFileSystem

Critical

The file system %s is read-only.

Check the disk health status.

Files cannot be written or modified.

-

telescope:

2.7.5.3

2.7.5.9 or later

NPU: driver and firmware not matching

NpuDriverFirmwareMismatch

Major

The NPU's driver and firmware do not match.

Obtain the matched version from the Ascend official website and reinstall it.

NPUs cannot be used.

Snt3P

300IDuo

Snt9b

Snt9b23

telescope:

2.7.5.3

2.7.5.9 or later

NPU: Docker container environment check

NpuContainerEnvSystem

Major

Docker is unavailable.

Check whether Docker is running properly.

Docker cannot be used.

-

telescope:

2.7.5.3

2.7.5.9 or later

Major

The container plug-in Ascend-Docker-Runtime is not installed.

Install the container plug-in Ascend-Docker-Runtime. Otherwise, the container cannot use Ascend cards.

NPUs cannot be mounted to Docker containers.

Snt3P

300IDuo

Snt9b

Snt9b23

telescope:

2.7.5.3

2.7.5.9 or later

Major

IP forwarding is not enabled in the OS.

Check the net.ipv4.ip_forward configuration in the /etc/sysctl.conf file.

Docker containers experience network communication issues.

-

telescope:

2.7.5.3

2.7.5.9 or later

Major

The shared memory of the container is too small.

The default shared memory size is 64 MB. Change it as needed using either of the following methods:

Method 1: Modify the default-shm-size field in the /etc/docker/daemon.json configuration file.

Method 2: Use the --shm-size parameter in the docker run command to set the shared memory size of a container.

Distributed training may fail due to insufficient shared memory.

-

telescope:

2.7.5.3

2.7.5.9 or later

NPU: RoCE NIC down

RoCELinkStatusDown

Major

The RoCE link of NPU %d is down.

Check the NPU RoCE network port status.

The NPU NIC is unavailable.

Snt9b

Snt9b23

telescope:

2.7.5.3

2.7.5.9 or later

NPU: RoCE NIC health status abnormal

RoCEHealthStatusError

Major

The RoCE network health status of NPU %d is abnormal.

Check the health status of the NPU RoCE NIC.

The NPU NIC is unavailable.

Snt9b

Snt9b23

telescope:

2.7.5.3

2.7.5.9 or later

NPU: RoCE NIC configuration file /etc/hccn.conf not exist

HccnConfNotExisted

Major

The RoCE NIC configuration file /etc/hccn.conf does not exist.

Check the /etc/hccn.conf NIC configuration file.

The RoCE NIC is unavailable.

Snt9b

Snt9b23

telescope:

2.7.5.3

2.7.5.9 or later

GPU: basic components abnormal

GpuEnvironmentSystem

Major

The nvidia-smi command is abnormal.

Check whether the GPU driver is normal.

The GPU driver is unavailable.

GPU

telescope:

2.7.5.3

2.7.5.9 or later

Major

The nvidia-fabricmanager version is inconsistent with the GPU driver version.

Check the GPU driver version and nvidia-fabricmanager version.

The nvidia-fabricmanager cannot work properly, affecting GPU usage.

Major

The container plug-in nvidia-container-toolkit is not installed.

Install the container plug-in nvidia-container-toolkit.

GPUs cannot be attached to Docker containers.

Local disk mounting inspection

MountDiskSystem

Major

The /etc/fstab file contains invalid UUIDs.

Ensure that the UUIDs in the /etc/fstab configuration file are correct. Otherwise, the server may fail to be restarted.

The disk mounting process fails, preventing the server from restarting.

-

telescope:

2.7.5.3

2.7.5.9 or later

GPU: incorrectly configured dynamic route for Ant series server

GpuRouteConfigError

Major

The dynamic route of the NIC %s of an Ant series server is not configured or is incorrectly configured. CMD [ip route]: %s | CMD [ip route show table all]: %s.

Configure the RoCE NIC route correctly.

The GPU network communication is abnormal.

GPU

telescope:

2.7.5.3

2.7.5.9 or later

NPU: RoCE port not split

RoCEUdpConfigError

Major

The RoCE UDP port is not split.

Check the RoCE UDP port configuration on the NPU.

The communication performance of NPUs is affected.

Snt9b

Snt9b23

telescope:

2.7.5.9 or later

Warning of automatic system kernel upgrade

KernelUpgradeWarning

Major

Warning of automatic system kernel upgrade. Old version: %s; new version: %s.

A system kernel upgrade may cause AI software exceptions. Check the system update logs and prevent the server from restarting.

The AI software may be unavailable.

Snt3P

300IDuo

Snt9b

Snt9b23

telescope:

2.7.5.3

2.7.5.9 or later

NPU environment command detection

NpuToolsWarning

Major

The hccn_tool is unavailable.

Check if the NPU driver is normal.

The IP address and gateway of the RoCE NIC cannot be configured.

Snt9b

Snt9b23

telescope:

2.7.5.3

2.7.5.9 or later

Major

The npu-smi is unavailable.

Check if the NPU driver is normal.

NPUs cannot be used.

Snt3P

300IDuo

Snt9b

Snt9b23

telescope:

2.7.5.3

2.7.5.9 or later

Major

The ascend-dmi is unavailable.

Check if ToolBox is properly installed.

The ascend-dmi cannot be used for performance analysis.

Snt9b

Snt9b23

telescope:

2.7.5.3

2.7.5.9 or later

NPU: L1 switch port partial failure

NpuL1SwitchPortPartialFunctionFailure

Major

Some functions of the NPU's L1 1520 switch port fail.

Transfer this issue to the Ascend or hardware team for handling.

Services may be interrupted.

Snt9b23

telescope:

2.7.5.9 or later

lqdcmi:

2.1.0 or later

NPU: L1 switch fault

NpuL1SwitchFault

Major

There are faults in the L1 1520 switch of the NPU.

Transfer this issue to the Ascend or hardware team for handling.

Services may be interrupted.

Snt9b23

telescope:

2.7.5.9 or later

lqdcmi:

2.1.0 or later

NPU: unmatched RoCE IP address

NpuRoceIPAddressMismatch

Major

The actual IP address of the RoCE NIC is inconsistent with the IP address in the hccn.conf configuration file.

Contact O&M engineers.

The parameter plane network is abnormal, preventing multi-node tasks from running.

Snt9b

Snt9b23

telescope:

2.7.5.9 or later
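
Table 2 describes the PCIeErrorFound event as the lspci command output showing an NPU in the rev ff state. The sketch below illustrates how that condition could be checked manually against captured lspci output. It is not the agent's implementation: the function name is hypothetical, and the second line of the sample output (device address and description) is invented for illustration.

```python
import re
from typing import List

# lspci appends the revision in parentheses at the end of each device line.
REV_FF = re.compile(r"\(rev ff\)\s*$")


def devices_in_rev_ff(lspci_output: str) -> List[str]:
    """Return the PCI addresses of devices that lspci reports as (rev ff)."""
    addresses = []
    for line in lspci_output.splitlines():
        line = line.strip()
        if line and REV_FF.search(line):
            # The PCI address is the first whitespace-separated field.
            addresses.append(line.split()[0])
    return addresses


# First line is taken from the lspci example earlier in this section.
# Second line is a hypothetical NPU entry in the rev ff state.
sample = """\
00:00.0 PCI bridge: Huawei Technologies Co., Ltd. HiSilicon PCIe Root Port with Gen4 (rev 21)
c1:00.0 Processing accelerators: Huawei Technologies Co., Ltd. Device d802 (rev ff)
"""

if __name__ == "__main__":
    print(devices_in_rev_ff(sample))
```

To try this on a real node, the output of sudo lspci could be saved to a file and passed to the function as a string. An empty result means that no device is reported in the rev ff state.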