Using Cloud Eye to Monitor NPU Resources of a Single Lite Server Node
Scenario
Cloud Eye is required to monitor Lite Server. This section describes how to connect Lite Server to Cloud Eye to monitor its resources and events.
Constraints
- Monitoring requires the Agent plug-in, which is subject to strict resource usage limits. When its resource usage exceeds the threshold, the Agent circuit breaker is triggered. For details about the resource usage, see Cloud Eye Server Monitoring.
- If you run the NPU pressure test command using Ascend-dmi, some NPU metric data may be lost.
- The monitoring agent has been fully tested in the public images provided by Lite Server. If you use your own image, test the agent before deploying the image in the production environment to prevent incorrect monitoring information.
Overview
For details, see Bare Metal Server (BMS) Server Monitoring. In addition to the images listed in the document, Ubuntu 20.04 is also supported.
The sampling period of monitoring metrics is 1 minute. Do not change it; otherwise, monitoring may not work properly. The current monitoring metrics cover the CPU, memory, disk, and network. Accelerator card metrics are collected only after the accelerator card driver is installed on the host.
The NPU metric collection function depends on the Linux system tool lspci. Some events depend on the blkid and grub2-editenv system tools. Ensure that these tools are installed and work properly.
Tool |
Check Method |
Installation Method |
---|---|---|
lspci |
Run lspci in the shell environment. The PCI device in the system can be queried. The following shows an example: $ sudo lspci 00:00.0 PCI bridge: Huawei Technologies Co., Ltd. HiSilicon PCIe Root Port with Gen4 (rev 21) 00:08.0 PCI bridge: Huawei Technologies Co., Ltd. HiSilicon PCIe Root Port with Gen4 (rev 21) 00:10.0 PCI bridge: Huawei Technologies Co., Ltd. HiSilicon PCIe Root Port with Gen4 (rev 21) |
lspci is a tool used to display PCI device information. It is usually included in the pciutils software package. This software package is installed by default in most Linux versions. Generally, lspci is pre-installed. If lspci is not installed, you can use the package manager to install pciutils. Run the following commands in Debian/Ubuntu: sudo apt-get update sudo apt-get install pciutils Run the following command in Red Hat/CentOS/EulerOS: sudo yum install pciutils |
blkid |
Run blkid in the shell environment. The block device in the system can be queried. The following shows an example: $ sudo blkid /dev/sda1: UUID="123e4567-e89b-12d3-a456-426614174000" TYPE="vfat" PARTUUID="56789abc-def0-1234-5678-9abcd3f2c0a1" /dev/sda2: UUID="a1b2c3d4-e5f6-789a-bcde-f0123456789a" TYPE="swap" PARTUUID="edcba98-7654-3210-fedc-ba9876543210" /dev/sda3: UUID="01234567-89ab-cdef-0123-456789abcdef" TYPE="ext4" PARTUUID="fedcba09-8765-4321-fedc-ba0987654321" |
blkid is a tool used to display block device attributes in Linux. It is usually included in the util-linux software package. This software package is installed by default in most Linux versions. Generally, blkid is pre-installed. If blkid is not installed, you can use the package manager to install util-linux. Run the following commands in Debian/Ubuntu: sudo apt-get update sudo apt-get install util-linux Run the following command in Red Hat/CentOS/EulerOS: sudo yum install util-linux |
grub2-editenv (required only for Red Hat, CentOS, and EulerOS) |
Run grub2-editenv list in the shell environment. The GRUB environment variables can be queried. The following shows an example: $ sudo grub2-editenv list timeout=5 default=0 saved_entry=Red Hat Enterprise Linux Server, with Linux 4.18.0-305.el8.x86_64 |
grub2-editenv is part of GRUB2 and is used to manage GRUB environment variables. GRUB2 is installed by default in most Linux versions. Generally, grub2-editenv is pre-installed. If grub2-editenv is not installed, you can use the package manager to install it. Run the following commands in Debian/Ubuntu: sudo apt-get update sudo apt-get install grub2 Run the following command in Red Hat/CentOS/EulerOS: sudo yum install grub2 |
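The checks in the table above can be run in one pass. The following is a minimal sketch, not part of the Agent: the function name missing_tools is hypothetical, and it only tests whether each tool can be found on PATH with the shell built-in command -v, not whether the tool produces correct output.

```shell
# Print the name of every given tool that cannot be found on PATH.
# Arguments: one or more tool names.
# Prints nothing when all tools are available.
missing_tools() {
    local tool
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
        fi
    done
}
```

For example, on Red Hat, CentOS, or EulerOS, `missing_tools lspci blkid grub2-editenv` prints nothing when all three tools are available; on other distributions, omit grub2-editenv. The tools may be installed in a directory such as /usr/sbin that is not on every user's PATH, so run the check as the same user (for example, root) that runs the commands in the table.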
Installing CES Agent Monitoring Plug-ins
OS-level, proactive, and fine-grained server monitoring is provided after the Agent is installed on the ECS or BMS.
- Create an agency for Cloud Eye. For details, see Creating a User and Granting Permissions. If you have enabled Cloud Eye host monitoring authorization when creating the server, skip this step.
- Currently, one-click monitoring installation is not supported on the Cloud Eye page. You need to log in to the server and run the following commands to install and configure the agent. For details about how to install the agent in other regions, see Installing the Agent on a Linux Server.
cd /usr/local && curl -k -O https://obs.cn-north-4.myhuaweicloud.com/uniagent-cn-north-4/script/agent_install.sh && bash agent_install.sh
If the following information is displayed, the installation is successful.
Figure 1 Installation succeeded
- View the monitoring items on the Cloud Eye page. Accelerator card monitoring items are available only after the accelerator card driver is installed on the host.
Figure 2 Monitoring page
The monitoring plug-in is now installed. You can view the collected metrics on the UI or configure alarms based on the metric values.
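The one-line installation command in the steps above can also be split into checked steps, so that a failed or empty download is reported before anything is executed. The following is a sketch under stated assumptions, not the official installer: the function name install_agent and its parameters are hypothetical, and -f, -sS, and -o are standard curl options added here for error handling. The -k option is kept only because the documented command uses it.

```shell
# Download the installer, verify that the download produced a non-empty file,
# and only then run it.
# Arguments: $1 = installer URL, $2 = working directory (default: /usr/local).
install_agent() {
    local url="$1"
    local workdir="${2:-/usr/local}"
    local script="$workdir/agent_install.sh"

    # -k skips TLS certificate verification, as in the documented command.
    # -f makes curl fail on HTTP errors instead of saving an error page.
    if ! curl -k -f -sS -o "$script" "$url"; then
        echo "download failed: $url" >&2
        return 1
    fi
    if [ ! -s "$script" ]; then
        echo "downloaded installer is empty: $script" >&2
        return 1
    fi
    # Run the installer from the working directory, as the one-line command does.
    (cd "$workdir" && bash "$script")
}
```

With the URL from this section, the call would be `install_agent https://obs.cn-north-4.myhuaweicloud.com/uniagent-cn-north-4/script/agent_install.sh`. For other regions, use the URL given in Installing the Agent on a Linux Server.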
Metric Namespace
AGT.ECS and SERVICE.BMS
Lite Server Monitoring Metrics
Table 1 only displays NPU-related metrics. For other metrics, see Metrics Supported by the Agent.
No. |
Category |
Metric |
Display Name |
Description |
Unit |
Conversion Rule |
Value Range |
Dimension |
Supported Model |
Supported Versions |
---|---|---|---|---|---|---|---|---|---|---|
1 |
Overall |
npu_device_health |
NPU Health Status |
Health status of the NPU |
- |
N/A |
0: normal 1: minor alarm 2: major alarm 3: critical alarm |
instance_id, npu |
Snt3P 300IDuo Snt9b Snt9b23 |
telescope: 2.7.4.3 2.7.5.3 2.7.5.4 2.7.5.9 or later |
2 |
npu_driver_health |
NPU Driver Health Status |
Health status of the NPU driver |
- |
N/A |
0: normal 3: critical alarm |
instance_id, npu |
|||
3 |
npu_power |
NPU Power |
NPU power |
W |
N/A |
>0 |
instance_id, npu |
|||
4 |
npu_temperature |
NPU Temperature |
NPU temperature |
°C |
N/A |
Natural number |
instance_id, npu |
|||
5 |
npu_voltage |
NPU Voltage |
NPU voltage |
V |
N/A |
Natural number |
instance_id, npu |
|||
6 |
HBM |
npu_util_rate_hbm |
NPU HBM Usage |
HBM usage of the NPU |
% |
N/A |
0%–100% |
instance_id, npu |
Snt9b Snt9b23 |
telescope: 2.7.4.3 2.7.5.3 2.7.5.4 2.7.5.9 or later |
7 |
npu_hbm_freq |
HBM Frequency |
NPU HBM frequency |
MHz |
N/A |
>0 |
instance_id, npu |
|||
8 |
npu_freq_hbm |
HBM Frequency |
NPU HBM frequency |
MHz |
N/A |
>0 |
instance_id, npu |
|||
9 |
npu_hbm_usage |
HBM Usage |
NPU HBM usage |
MB |
N/A |
≥0 |
instance_id, npu |
|||
10 |
npu_hbm_temperature |
HBM Temperature |
NPU HBM temperature |
°C |
N/A |
Natural number |
instance_id, npu |
|||
11 |
npu_hbm_bandwidth_util |
HBM Bandwidth Usage |
NPU HBM bandwidth usage |
% |
N/A |
0%–100% |
instance_id, npu |
|||
12 |
npu_util_rate_hbm_bw |
HBM Bandwidth Usage |
NPU HBM bandwidth usage |
% |
N/A |
0%–100% |
instance_id, npu |
|||
13 |
npu_hbm_mem_capacity |
NPU HBM Memory Capacity |
HBM memory capacity of the NPU |
MB |
N/A |
≥0 |
instance_id, npu |
|||
14 |
npu_hbm_ecc_enable |
HBM ECC Status |
NPU HBM ECC status |
- |
N/A |
0: ECC detection is disabled. 1: ECC detection is enabled. |
instance_id, npu |
|||
15 |
npu_hbm_single_bit_error_cnt |
Single-bit Errors on HBM |
Current number of single-bit errors on the NPU HBM |
count |
N/A |
≥0 |
instance_id, npu |
|||
16 |
npu_hbm_double_bit_error_cnt |
Double-bit Errors on HBM |
Current number of double-bit errors on the NPU HBM |
count |
N/A |
≥0 |
instance_id, npu |
|||
17 |
npu_hbm_total_single_bit_error_cnt |
Single-bit Errors in HBM Lifecycle |
Number of single-bit errors in the NPU HBM lifecycle |
count |
N/A |
≥0 |
instance_id, npu |
|||
18 |
npu_hbm_total_double_bit_error_cnt |
Double-bit Errors in HBM Lifecycle |
Number of double-bit errors in the NPU HBM lifecycle |
count |
N/A |
≥0 |
instance_id, npu |
|||
19 |
npu_hbm_single_bit_isolated_pages_cnt |
Isolated NPU Memory Pages with HBM Single-bit Errors |
Number of isolated NPU memory pages with HBM single-bit errors |
count |
N/A |
≥0 |
instance_id, npu |
|||
20 |
npu_hbm_double_bit_isolated_pages_cnt |
Isolated NPU Memory Pages with HBM Multi-bit Errors |
Number of isolated NPU memory pages with HBM double-bit errors |
count |
N/A |
≥0 |
instance_id, npu |
|||
21 |
DDR |
npu_usage_mem |
Used NPU Memory |
Used NPU memory |
MB |
N/A |
≥0 |
instance_id, npu |
Snt3P 300IDuo |
telescope: 2.7.4.3 2.7.5.3 2.7.5.4 2.7.5.9 or later |
22 |
npu_util_rate_mem |
NPU Memory Usage |
NPU memory usage |
% |
N/A |
0%–100% |
instance_id, npu |
|||
23 |
npu_freq_mem |
NPU Memory Frequency |
NPU memory frequency |
MHz |
N/A |
>0 |
instance_id, npu |
|||
24 |
npu_util_rate_mem_bandwidth |
NPU Memory Bandwidth Usage |
NPU memory bandwidth usage |
% |
N/A |
0%–100% |
instance_id, npu |
|||
25 |
npu_sbe |
NPU Single-bit Errors |
Number of single-bit errors on the NPU |
count |
N/A |
≥0 |
instance_id, npu |
|||
26 |
npu_dbe |
NPU Double-bit Errors |
Number of double-bit errors on the NPU |
count |
N/A |
≥0 |
instance_id, npu |
|||
27 |
AI Core |
npu_freq_ai_core |
AI Core Frequency of the NPU |
AI core frequency of the NPU |
MHz |
N/A |
>0 |
instance_id, npu |
Snt3P 300IDuo Snt9b Snt9b23 |
telescope: 2.7.4.3 2.7.5.3 2.7.5.4 2.7.5.9 or later |
28 |
npu_freq_ai_core_rated |
Rated Frequency of the NPU AI Core |
Rated frequency of the NPU AI core |
MHz |
N/A |
>0 |
instance_id, npu |
|||
29 |
npu_util_rate_ai_core |
AI Core Usage of the NPU |
AI core usage of the NPU |
% |
N/A |
0%–100% |
instance_id, npu |
|||
30 |
AI Vector |
npu_util_rate_vector_core |
NPU Vector Core Usage |
Vector core usage of the NPU |
% |
N/A |
0%–100% |
instance_id, npu |
Snt3P 300IDuo Snt9b Snt9b23 |
telescope: 2.7.5.9 or later |
31 |
AI CPU |
npu_aicpu_num |
Number of AI CPUs of the NPU |
Number of AI CPUs of the NPU |
count |
N/A |
≥0 |
instance_id, npu |
Snt3P 300IDuo Snt9b Snt9b23 |
telescope: 2.7.4.3 2.7.5.3 2.7.5.4 2.7.5.9 or later |
32 |
npu_util_rate_ai_cpu |
NPU AI CPU Usage |
AI CPU usage of the NPU |
% |
N/A |
0%–100% |
instance_id, npu |
|||
33 |
npu_aicpu_avg_util_rate |
Average AI CPU Usage of the NPU |
Average AI CPU usage of the NPU |
% |
N/A |
0%–100% |
instance_id, npu |
|||
34 |
npu_aicpu_max_freq |
Maximum AI CPU Frequency of the NPU |
Maximum AI CPU frequency of the NPU |
MHz |
N/A |
>0 |
instance_id, npu |
|||
35 |
npu_aicpu_cur_freq |
AI CPU Frequency of the NPU |
AI CPU frequency of the NPU |
MHz |
N/A |
>0 |
instance_id, npu |
|||
36 |
CTRL CPU |
npu_util_rate_ctrl_cpu |
Control CPU Usage of the NPU |
Control CPU usage of the NPU |
% |
N/A |
0%–100% |
instance_id, npu |
Snt3P 300IDuo Snt9b Snt9b23 |
telescope: 2.7.4.3 2.7.5.3 2.7.5.4 2.7.5.9 or later |
37 |
npu_freq_ctrl_cpu |
Control CPU Frequency of the NPU |
Control CPU frequency of the NPU |
MHz |
N/A |
>0 |
instance_id, npu |
|||
38 |
PCIe link |
npu_link_cap_speed |
Max. NPU Link Speed |
Maximum link speed of the NPU |
GT/s |
N/A |
≥0 |
instance_id, npu |
310P 300IDuo Snt9b Snt9b23 |
telescope: 2.7.4.3 2.7.5.3 2.7.5.4 2.7.5.9 or later |
39 |
npu_link_cap_width |
Max. NPU Link Width |
Maximum link width of the NPU |
count |
N/A |
≥0 |
instance_id, npu |
|||
40 |
npu_link_status_speed |
NPU Link Speed |
Link speed of the NPU |
GT/s |
N/A |
≥0 |
instance_id, npu |
|||
41 |
npu_link_status_width |
NPU Link Width |
Link width of the NPU |
count |
N/A |
≥0 |
instance_id, npu |
|||
42 |
RoCE network |
npu_device_network_health |
NPU Network Health Status |
Connectivity of the IP address of the RoCE NIC on the NPU |
- |
N/A |
0: The network health status is normal. Other values: The network status is abnormal. |
instance_id, npu |
Snt9b Snt9b23 |
telescope: 2.7.4.3 2.7.5.3 2.7.5.4 2.7.5.9 or later |
43 |
npu_network_port_link_status |
NPU Network Port Link Status |
Link status of the NPU network port |
- |
N/A |
0: up 1: down |
instance_id, npu |
|||
44 |
npu_roce_tx_rate |
NPU NIC Uplink Rate |
Uplink rate of the NPU NIC |
MB/s |
N/A |
≥0 |
instance_id, npu |
|||
45 |
npu_roce_rx_rate |
NPU NIC Downlink Rate |
Downlink rate of the NPU NIC |
MB/s |
N/A |
≥0 |
instance_id, npu |
|||
46 |
npu_mac_tx_mac_pause_num |
PAUSE Frames Sent from MAC |
Total number of PAUSE frames sent from the MAC address corresponding to the NPU |
count |
N/A |
≥0 |
instance_id, npu |
|||
47 |
npu_mac_rx_mac_pause_num |
PAUSE Frames Received by MAC |
Total number of PAUSE frames received by the MAC address corresponding to the NPU |
count |
N/A |
≥0 |
instance_id, npu |
|||
48 |
npu_mac_tx_pfc_pkt_num |
PFC Frames Sent from MAC |
Total number of PFC frames sent from the MAC address corresponding to the NPU |
count |
N/A |
≥0 |
instance_id, npu |
|||
49 |
npu_mac_rx_pfc_pkt_num |
PFC Frames Received by MAC |
Total number of PFC frames received by the MAC address corresponding to the NPU |
count |
N/A |
≥0 |
instance_id, npu |
|||
50 |
npu_mac_tx_bad_pkt_num |
Bad Packets Sent from MAC |
Total number of bad packets sent from the MAC address corresponding to the NPU |
count |
N/A |
≥0 |
instance_id, npu |
|||
51 |
npu_mac_rx_bad_pkt_num |
Bad Packets Received by MAC |
Total number of bad packets received by the MAC address corresponding to the NPU |
count |
N/A |
≥0 |
instance_id, npu |
|||
52 |
npu_roce_tx_err_pkt_num |
Bad Packets Sent by RoCE |
Total number of bad packets sent by the RoCE NIC on the NPU |
count |
N/A |
≥0 |
instance_id, npu |
|||
53 |
npu_roce_rx_err_pkt_num |
Bad Packets Received by RoCE |
Total number of bad packets received by the RoCE NIC on the NPU |
count |
N/A |
≥0 |
instance_id, npu |
|||
54 |
npu_roce_tx_all_pkt_num |
Packets Transmitted by NPU RoCE |
The number of packets transmitted by the NPU's RoCE. |
count |
N/A |
≥0 |
instance_id, npu |
telescope: 2.7.5.9 or later |
||
55 |
npu_roce_rx_all_pkt_num |
Packets Received by NPU RoCE |
The number of packets received by the NPU's RoCE. |
count |
N/A |
≥0 |
instance_id, npu |
|||
56 |
npu_roce_new_pkt_rty_num |
Packets Retransmitted by NPU RoCE |
The number of packets retransmitted by the NPU's RoCE. |
count |
N/A |
≥0 |
instance_id, npu |
|||
57 |
npu_roce_out_of_order_num |
Abnormal PSN Packets Received by NPU RoCE |
Number of packets received by the NPU's RoCE whose PSN is greater than the expected PSN or is a duplicate PSN. If packets are out of order or lost, retransmission is triggered. |
count |
N/A |
≥0 |
instance_id, npu |
|||
58 |
npu_roce_rx_cnp_pkt_num |
CNP Packets Received by NPU RoCE |
The number of CNP packets received by the NPU's RoCE. |
count |
N/A |
≥0 |
instance_id, npu |
|||
59 |
npu_roce_tx_cnp_pkt_num |
CNP Packets Transmitted by NPU RoCE |
The number of CNP packets transmitted by the NPU's RoCE. |
count |
N/A |
≥0 |
instance_id, npu |
|||
60 |
RoCE optical module |
npu_opt_temperature |
NPU Optical Module Temperature |
NPU optical module temperature |
°C |
N/A |
Natural number |
instance_id, npu |
Snt9b Snt9b23 |
telescope: 2.7.4.3 2.7.5.3 2.7.5.4 2.7.5.9 or later |
61 |
npu_opt_temperature_high_thres |
Upper Limit of the NPU Optical Module Temperature |
Upper limit of the NPU optical module temperature |
°C |
N/A |
Natural number |
instance_id, npu |
|||
62 |
npu_opt_temperature_low_thres |
Lower Limit of the NPU Optical Module Temperature |
Lower limit of the NPU optical module temperature |
°C |
N/A |
Natural number |
instance_id, npu |
|||
63 |
npu_opt_voltage |
NPU Optical Module Voltage |
NPU optical module voltage |
mV |
N/A |
Natural number |
instance_id, npu |
|||
64 |
npu_opt_voltage_high_thres |
Upper Limit of the NPU Optical Module Voltage |
Upper limit of the NPU optical module voltage |
mV |
N/A |
Natural number |
instance_id, npu |
|||
65 |
npu_opt_voltage_low_thres |
Lower Limit of the NPU Optical Module Voltage |
Lower limit of the NPU optical module voltage |
mV |
N/A |
Natural number |
instance_id, npu |
|||
66 |
npu_opt_tx_power_lane0 |
TX Power of the NPU Optical Module in Channel 0 |
Transmit power of the NPU optical module in channel 0 |
mW |
N/A |
≥0 |
instance_id, npu |
|||
67 |
npu_opt_tx_power_lane1 |
TX Power of the NPU Optical Module in Channel 1 |
Transmit power of the NPU optical module in channel 1 |
mW |
N/A |
≥0 |
instance_id, npu |
|||
68 |
npu_opt_tx_power_lane2 |
TX Power of the NPU Optical Module in Channel 2 |
Transmit power of the NPU optical module in channel 2 |
mW |
N/A |
≥0 |
instance_id, npu |
|||
69 |
npu_opt_tx_power_lane3 |
TX Power of the NPU Optical Module in Channel 3 |
Transmit power of the NPU optical module in channel 3 |
mW |
N/A |
≥0 |
instance_id, npu |
|||
70 |
npu_opt_rx_power_lane0 |
RX Power of the NPU Optical Module in Channel 0 |
Receive power of the NPU optical module in channel 0 |
mW |
N/A |
≥0 |
instance_id, npu |
|||
71 |
npu_opt_rx_power_lane1 |
RX Power of the NPU Optical Module in Channel 1 |
Receive power of the NPU optical module in channel 1 |
mW |
N/A |
≥0 |
instance_id, npu |
|||
72 |
npu_opt_rx_power_lane2 |
RX Power of the NPU Optical Module in Channel 2 |
Receive power of the NPU optical module in channel 2 |
mW |
N/A |
≥0 |
instance_id, npu |
|||
73 |
npu_opt_rx_power_lane3 |
RX Power of the NPU Optical Module in Channel 3 |
Receive power of the NPU optical module in channel 3 |
mW |
N/A |
≥0 |
instance_id, npu |
|||
74 |
npu_opt_tx_bias_lane0 |
TX Bias Current of the NPU Optical Module in Channel 0 |
Transmit bias current of the NPU optical module in channel 0 |
mA |
N/A |
≥0 |
instance_id, npu |
|||
75 |
npu_opt_tx_bias_lane1 |
TX Bias Current of the NPU Optical Module in Channel 1 |
Transmit bias current of the NPU optical module in channel 1 |
mA |
N/A |
≥0 |
instance_id, npu |
|||
76 |
npu_opt_tx_bias_lane2 |
TX Bias Current of the NPU Optical Module in Channel 2 |
Transmit bias current of the NPU optical module in channel 2 |
mA |
N/A |
≥0 |
instance_id, npu |
|||
77 |
npu_opt_tx_bias_lane3 |
TX Bias Current of the NPU Optical Module in Channel 3 |
Transmit bias current of the NPU optical module in channel 3 |
mA |
N/A |
≥0 |
instance_id, npu |
|||
78 |
npu_opt_tx_los |
TX Los of the NPU Optical Module |
TX LOS (loss of signal) flag of the NPU optical module |
count |
N/A |
≥0 |
instance_id, npu |
|||
79 |
npu_opt_rx_los |
RX Los of the NPU Optical Module |
RX LOS (loss of signal) flag of the NPU optical module |
count |
N/A |
≥0 |
instance_id, npu |
|||
80 |
npu_opt_media_snr_lane0 |
NPU Optical Module Channel 0 Optical SNR |
The signal-to-noise ratio (SNR) on the media (optical) side of channel 0 in the NPU optical module |
dB |
N/A |
Natural number |
instance_id, npu |
telescope: 2.7.5.9 or later |
||
81 |
npu_opt_media_snr_lane1 |
NPU Optical Module Channel 1 Optical SNR |
The signal-to-noise ratio (SNR) on the media (optical) side of channel 1 in the NPU optical module |
dB |
N/A |
Natural number |
instance_id, npu |
|||
82 |
npu_opt_media_snr_lane2 |
NPU Optical Module Channel 2 Optical SNR |
The signal-to-noise ratio (SNR) on the media (optical) side of channel 2 in the NPU optical module |
dB |
N/A |
Natural number |
instance_id, npu |
|||
83 |
npu_opt_media_snr_lane3 |
NPU Optical Module Channel 3 Optical SNR |
The signal-to-noise ratio (SNR) on the media (optical) side of channel 3 in the NPU optical module |
dB |
N/A |
Natural number |
instance_id, npu |
|||
84 |
HCCS Lane mode |
npu_macro1_0lane_max_consec_sec |
Maximum Duration of NPU Macro1 in Lane 0 Mode |
The maximum time NPU Macro1 operates in Lane 0 mode during a detection period |
s |
N/A |
≥0 |
instance_id, npu |
Snt9b Snt9b23 |
telescope: 2.7.5.9 or later |
85 |
npu_macro2_0lane_max_consec_sec |
Maximum Duration of NPU Macro2 in Lane 0 Mode |
The maximum time NPU Macro2 operates in Lane 0 mode during a detection period |
s |
N/A |
≥0 |
instance_id, npu |
|||
86 |
npu_macro3_0lane_max_consec_sec |
Maximum Duration of NPU Macro3 in Lane 0 Mode |
The maximum time NPU Macro3 operates in Lane 0 mode during a detection period |
s |
N/A |
≥0 |
instance_id, npu |
|||
87 |
npu_macro4_0lane_max_consec_sec |
Maximum Duration of NPU Macro4 in Lane 0 Mode |
The maximum time NPU Macro4 operates in Lane 0 mode during a detection period |
s |
N/A |
≥0 |
instance_id, npu |
|||
88 |
npu_macro5_0lane_max_consec_sec |
Maximum Duration of NPU Macro5 in Lane 0 Mode |
The maximum time NPU Macro5 operates in Lane 0 mode during a detection period |
s |
N/A |
≥0 |
instance_id, npu |
|||
89 |
npu_macro6_0lane_max_consec_sec |
Maximum Duration of NPU Macro6 in Lane 0 Mode |
The maximum time NPU Macro6 operates in Lane 0 mode during a detection period |
s |
N/A |
≥0 |
instance_id, npu |
|||
90 |
npu_macro7_0lane_max_consec_sec |
Maximum Duration of NPU Macro7 in Lane 0 Mode |
The maximum time NPU Macro7 operates in Lane 0 mode during a detection period |
s |
N/A |
≥0 |
instance_id, npu |
|||
91 |
npu_macro1_0lane_total_sec |
Total Duration of NPU Macro1 in Lane 0 Mode |
The total time NPU Macro1 operates in Lane 0 mode during a detection period |
s |
N/A |
≥0 |
instance_id, npu |
|||
92 |
npu_macro2_0lane_total_sec |
Total Duration of NPU Macro2 in Lane 0 Mode |
The total time NPU Macro2 operates in Lane 0 mode during a detection period |
s |
N/A |
≥0 |
instance_id, npu |
|||
93 |
npu_macro3_0lane_total_sec |
Total Duration of NPU Macro3 in Lane 0 Mode |
The total time NPU Macro3 operates in Lane 0 mode during a detection period |
s |
N/A |
≥0 |
instance_id, npu |
|||
94 |
npu_macro4_0lane_total_sec |
Total Duration of NPU Macro4 in Lane 0 Mode |
The total time NPU Macro4 operates in Lane 0 mode during a detection period |
s |
N/A |
≥0 |
instance_id, npu |
|||
95 |
npu_macro5_0lane_total_sec |
Total Duration of NPU Macro5 in Lane 0 Mode |
The total time NPU Macro5 operates in Lane 0 mode during a detection period |
s |
N/A |
≥0 |
instance_id, npu |
|||
96 |
npu_macro6_0lane_total_sec |
Total Duration of NPU Macro6 in Lane 0 Mode |
The total time NPU Macro6 operates in Lane 0 mode during a detection period |
s |
N/A |
≥0 |
instance_id, npu |
|||
97 |
npu_macro7_0lane_total_sec |
Total Duration of NPU Macro7 in Lane 0 Mode |
The total time NPU Macro7 operates in Lane 0 mode during a detection period |
s |
N/A |
≥0 |
instance_id, npu |
|||
98 |
HCCS Serdes SNR |
npu_macro1_serdes_lane0_snr |
NPU Macro1 SerDes Lane 0 SNR |
The SNR for SerDes Lane 0 in NPU Macro1 |
dB |
N/A |
Natural number |
instance_id, npu |
Snt9b Snt9b23 |
telescope: 2.7.5.9 or later |
99 |
npu_macro1_serdes_lane1_snr |
NPU Macro1 SerDes Lane 1 SNR |
The SNR for SerDes Lane 1 in NPU Macro1 |
dB |
N/A |
Natural number |
instance_id, npu |
|||
100 |
npu_macro1_serdes_lane2_snr |
NPU Macro1 SerDes Lane 2 SNR |
The SNR for SerDes Lane 2 in NPU Macro1 |
dB |
N/A |
Natural number |
instance_id, npu |
|||
101 |
npu_macro1_serdes_lane3_snr |
NPU Macro1 SerDes Lane 3 SNR |
The SNR for SerDes Lane 3 in NPU Macro1 |
dB |
N/A |
Natural number |
instance_id, npu |
|||
102 |
npu_macro2_serdes_lane0_snr |
NPU Macro2 SerDes Lane 0 SNR |
The SNR for SerDes Lane 0 in NPU Macro2 |
dB |
N/A |
Natural number |
instance_id, npu |
|||
103 |
npu_macro2_serdes_lane1_snr |
NPU Macro2 SerDes Lane 1 SNR |
The SNR for SerDes Lane 1 in NPU Macro2 |
dB |
N/A |
Natural number |
instance_id, npu |
|||
104 |
npu_macro2_serdes_lane2_snr |
NPU Macro2 SerDes Lane 2 SNR |
The SNR for SerDes Lane 2 in NPU Macro2 |
dB |
N/A |
Natural number |
instance_id, npu |
|||
105 |
npu_macro2_serdes_lane3_snr |
NPU Macro2 SerDes Lane 3 SNR |
The SNR for SerDes Lane 3 in NPU Macro2 |
dB |
N/A |
Natural number |
instance_id, npu |
|||
106 |
npu_macro3_serdes_lane0_snr |
NPU Macro3 SerDes Lane 0 SNR |
The SNR for SerDes Lane 0 in NPU Macro3 |
dB |
N/A |
Natural number |
instance_id, npu |
|||
107 |
npu_macro3_serdes_lane1_snr |
NPU Macro3 SerDes Lane 1 SNR |
The SNR for SerDes Lane 1 in NPU Macro3 |
dB |
N/A |
Natural number |
instance_id, npu |
|||
108 |
npu_macro3_serdes_lane2_snr |
NPU Macro3 SerDes Lane 2 SNR |
The SNR for SerDes Lane 2 in NPU Macro3 |
dB |
N/A |
Natural number |
instance_id, npu |
|||
109 |
npu_macro3_serdes_lane3_snr |
NPU Macro3 SerDes Lane 3 SNR |
The SNR for SerDes Lane 3 in NPU Macro3 |
dB |
N/A |
Natural number |
instance_id, npu |
|||
110 |
npu_macro4_serdes_lane0_snr |
NPU Macro4 SerDes Lane 0 SNR |
The SNR for SerDes Lane 0 in NPU Macro4 |
dB |
N/A |
Natural number |
instance_id, npu |
|||
111 |
npu_macro4_serdes_lane1_snr |
NPU Macro4 SerDes Lane 1 SNR |
The SNR for SerDes Lane 1 in NPU Macro4 |
dB |
N/A |
Natural number |
instance_id, npu |
|||
112 |
npu_macro4_serdes_lane2_snr |
NPU Macro4 SerDes Lane 2 SNR |
The SNR for SerDes Lane 2 in NPU Macro4 |
dB |
N/A |
Natural number |
instance_id, npu |
|||
113 |
npu_macro4_serdes_lane3_snr |
NPU Macro4 SerDes Lane 3 SNR |
The SNR for SerDes Lane 3 in NPU Macro4 |
dB |
N/A |
Natural number |
instance_id, npu |
|||
114 |
npu_macro5_serdes_lane0_snr |
NPU Macro5 SerDes Lane 0 SNR |
The SNR for SerDes Lane 0 in NPU Macro5 |
dB |
N/A |
Natural number |
instance_id, npu |
|||
115 |
npu_macro5_serdes_lane1_snr |
NPU Macro5 SerDes Lane 1 SNR |
The SNR for SerDes Lane 1 in NPU Macro5 |
dB |
N/A |
Natural number |
instance_id, npu |
|||
116 |
npu_macro5_serdes_lane2_snr |
NPU Macro5 SerDes Lane 2 SNR |
The SNR for SerDes Lane 2 in NPU Macro5 |
dB |
N/A |
Natural number |
instance_id, npu |
|||
117 |
npu_macro5_serdes_lane3_snr |
NPU Macro5 SerDes Lane 3 SNR |
The SNR for SerDes Lane 3 in NPU Macro5 |
dB |
N/A |
Natural number |
instance_id, npu |
|||
118 |
npu_macro6_serdes_lane0_snr |
NPU Macro6 SerDes Lane 0 SNR |
The SNR for SerDes Lane 0 in NPU Macro6 |
dB |
N/A |
Natural number |
instance_id, npu |
|||
119 |
npu_macro6_serdes_lane1_snr |
NPU Macro6 SerDes Lane 1 SNR |
The SNR for SerDes Lane 1 in NPU Macro6 |
dB |
N/A |
Natural number |
instance_id, npu |
|||
120 |
npu_macro6_serdes_lane2_snr |
NPU Macro6 SerDes Lane 2 SNR |
The SNR for SerDes Lane 2 in NPU Macro6 |
dB |
N/A |
Natural number |
instance_id, npu |
|||
121 |
npu_macro6_serdes_lane3_snr |
NPU Macro6 SerDes Lane 3 SNR |
The SNR for SerDes Lane 3 in NPU Macro6 |
dB |
N/A |
Natural number |
instance_id, npu |
|||
122 |
npu_macro7_serdes_lane0_snr |
NPU Macro7 SerDes Lane 0 SNR |
The SNR for SerDes Lane 0 in NPU Macro7 |
dB |
N/A |
Natural number |
instance_id, npu |
|||
123 |
npu_macro7_serdes_lane1_snr |
NPU Macro7 SerDes Lane 1 SNR |
The SNR for SerDes Lane 1 in NPU Macro7 |
dB |
N/A |
Natural number |
instance_id, npu |
|||
124 |
npu_macro7_serdes_lane2_snr |
NPU Macro7 SerDes Lane 2 SNR |
The SNR for SerDes Lane 2 in NPU Macro7 |
dB |
N/A |
Natural number |
instance_id, npu |
|||
125 |
npu_macro7_serdes_lane3_snr |
NPU Macro7 SerDes Lane 3 SNR |
The SNR for SerDes Lane 3 in NPU Macro7 |
dB |
N/A |
Natural number |
instance_id, npu |
|||
126 |
HCCS packet statistics |
npu_macro1_rx_cnt |
Packets Received by NPU Macro1 |
The number of packets received by NPU Macro1 during a detection period |
count |
N/A |
≥0 |
instance_id, npu |
Snt9b Snt9b23 |
telescope: 2.7.5.9 or later |
127 |
npu_macro2_rx_cnt |
Packets Received by NPU Macro2 |
The number of packets received by NPU Macro2 during a detection period |
count |
N/A |
≥0 |
instance_id, npu |
|||
128 |
npu_macro3_rx_cnt |
Packets Received by NPU Macro3 |
The number of packets received by NPU Macro3 during a detection period |
count |
N/A |
≥0 |
instance_id, npu |
|||
129 |
npu_macro4_rx_cnt |
Packets Received by NPU Macro4 |
The number of packets received by NPU Macro4 during a detection period |
count |
N/A |
≥0 |
instance_id, npu |
|||
130 |
npu_macro5_rx_cnt |
Packets Received by NPU Macro5 |
The number of packets received by NPU Macro5 during a detection period |
count |
N/A |
≥0 |
instance_id, npu |
|||
131 |
npu_macro6_rx_cnt |
Packets Received by NPU Macro6 |
The number of packets received by NPU Macro6 during a detection period |
count |
N/A |
≥0 |
instance_id, npu |
|||
132 |
npu_macro7_rx_cnt |
Packets Received by NPU Macro7 |
The number of packets received by NPU Macro7 during a detection period |
count |
N/A |
≥0 |
instance_id, npu |
|||
133 |
npu_macro1_tx_cnt |
Packets Sent by NPU Macro1 |
The number of packets sent by NPU Macro1 during a detection period |
count |
N/A |
≥0 |
instance_id, npu |
|||
134 |
npu_macro2_tx_cnt |
Packets Sent by NPU Macro2 |
The number of packets sent by NPU Macro2 during a detection period |
count |
N/A |
≥0 |
instance_id, npu |
|||
135 |
npu_macro3_tx_cnt |
Packets Sent by NPU Macro3 |
The number of packets sent by NPU Macro3 during a detection period |
count |
N/A |
≥0 |
instance_id, npu |
|||
136 |
npu_macro4_tx_cnt |
Packets Sent by NPU Macro4 |
The number of packets sent by NPU Macro4 during a detection period |
count |
N/A |
≥0 |
instance_id, npu |
|||
137 |
npu_macro5_tx_cnt |
Packets Sent by NPU Macro5 |
The number of packets sent by NPU Macro5 during a detection period |
count |
N/A |
≥0 |
instance_id, npu |
|||
138 |
npu_macro6_tx_cnt |
Packets Sent by NPU Macro6 |
The number of packets sent by NPU Macro6 during a detection period |
count |
N/A |
≥0 |
instance_id, npu |
|||
139 |
npu_macro7_tx_cnt |
Packets Sent by NPU Macro7 |
The number of packets sent by NPU Macro7 during a detection period |
count |
N/A |
≥0 |
instance_id, npu |
|||
140 |
HCCS retransmission statistics |
npu_macro1_retry_cnt |
Packets Retransmitted by NPU Macro1 |
The number of packets retransmitted by NPU Macro1 during a detection period |
count |
N/A |
≥0 |
instance_id, npu |
Snt9b Snt9b23 |
telescope: 2.7.5.9 or later |
141 |
npu_macro2_retry_cnt |
Packets Retransmitted by NPU Macro2 |
The number of packets retransmitted by NPU Macro2 during a detection period |
count |
N/A |
≥0 |
instance_id, npu |
|||
142 |
npu_macro3_retry_cnt |
Packets Retransmitted by NPU Macro3 |
The number of packets retransmitted by NPU Macro3 during a detection period |
count |
N/A |
≥0 |
instance_id, npu |
|||
143 |
npu_macro4_retry_cnt |
Packets Retransmitted by NPU Macro4 |
The number of packets retransmitted by NPU Macro4 during a detection period |
count |
N/A |
≥0 |
instance_id, npu |
|||
144 |
npu_macro5_retry_cnt |
Packets Retransmitted by NPU Macro5 |
The number of packets retransmitted by NPU Macro5 during a detection period |
count |
N/A |
≥0 |
instance_id, npu |
|||
145 |
npu_macro6_retry_cnt |
Packets Retransmitted by NPU Macro6 |
The number of packets retransmitted by NPU Macro6 during a detection period |
count |
N/A |
≥0 |
instance_id, npu |
|||
146 |
npu_macro7_retry_cnt |
Packets Retransmitted by NPU Macro7 |
The number of packets retransmitted by NPU Macro7 during a detection period |
count |
N/A |
≥0 |
instance_id, npu |
|||
147 |
HCCS error packet statistics |
npu_macro1_crc_error_cnt |
Invalid Packets Received by NPU Macro1 |
The number of invalid CRC packets received by NPU Macro1 during a detection period |
count |
N/A |
≥0 |
instance_id, npu |
Snt9b Snt9b23 |
telescope: 2.7.5.9 or later |
148 |
npu_macro2_crc_error_cnt |
Invalid Packets Received by NPU Macro2 |
The number of invalid CRC packets received by NPU Macro2 during a detection period |
count |
N/A |
≥0 |
instance_id, npu |
|||
149 |
npu_macro3_crc_error_cnt |
Invalid Packets Received by NPU Macro3 |
The number of invalid CRC packets received by NPU Macro3 during a detection period |
count |
N/A |
≥0 |
instance_id, npu |
|||
150 |
npu_macro4_crc_error_cnt |
Invalid Packets Received by NPU Macro4 |
The number of invalid CRC packets received by NPU Macro4 during a detection period |
count |
N/A |
≥0 |
instance_id, npu |
|||
151 |
npu_macro5_crc_error_cnt |
Invalid Packets Received by NPU Macro5 |
The number of invalid CRC packets received by NPU Macro5 during a detection period |
count |
N/A |
≥0 |
instance_id, npu |
|||
152 |
npu_macro6_crc_error_cnt |
Invalid Packets Received by NPU Macro6 |
The number of invalid CRC packets received by NPU Macro6 during a detection period |
count |
N/A |
≥0 |
instance_id, npu |
|||
153 |
npu_macro7_crc_error_cnt |
Invalid Packets Received by NPU Macro7 |
The number of invalid CRC packets received by NPU Macro7 during a detection period |
count |
N/A |
≥0 |
instance_id, npu |
|||
154 |
npu_macro1_crc_error_rate |
NPU Macro1 BER |
The percentage of invalid CRC packets received by NPU Macro1 during a detection period |
% |
N/A |
≥0 |
instance_id, npu |
|||
155 |
npu_macro2_crc_error_rate |
NPU Macro2 BER |
The percentage of invalid CRC packets received by NPU Macro2 during a detection period |
% |
N/A |
≥0 |
instance_id, npu |
|||
156 |
npu_macro3_crc_error_rate |
NPU Macro3 BER |
The percentage of invalid CRC packets received by NPU Macro3 during a detection period |
% |
N/A |
≥0 |
instance_id, npu |
|||
157 |
npu_macro4_crc_error_rate |
NPU Macro4 BER |
The percentage of invalid CRC packets received by NPU Macro4 during a detection period |
% |
N/A |
≥0 |
instance_id, npu |
|||
158 |
npu_macro5_crc_error_rate |
NPU Macro5 BER |
The percentage of invalid CRC packets received by NPU Macro5 during a detection period |
% |
N/A |
≥0 |
instance_id, npu |
|||
159 |
npu_macro6_crc_error_rate |
NPU Macro6 BER |
The percentage of invalid CRC packets received by NPU Macro6 during a detection period |
% |
N/A |
≥0 |
instance_id, npu |
|||
160 |
npu_macro7_crc_error_rate |
NPU Macro7 BER |
The percentage of invalid CRC packets received by NPU Macro7 during a detection period |
% |
N/A |
≥0 |
instance_id, npu |
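As a rough illustration of how the HCCS error counters above might be consumed, the following Python sketch picks out the macros that reported invalid CRC packets in one detection period. It assumes the metric values have already been fetched into a plain dictionary keyed by metric name; that dictionary shape and the function name `macros_with_crc_errors` are hypothetical and are not part of Cloud Eye.

```python
import re

# Matches the per-macro CRC counter names listed in the table (Macro1 to Macro7).
_CRC_CNT = re.compile(r"^npu_macro([1-7])_crc_error_cnt$")


def macros_with_crc_errors(sample):
    """Return the macro numbers whose CRC error count is above zero.

    `sample` is a hypothetical mapping of metric name -> value for one
    NPU and one detection period. Other metrics in the mapping are ignored.
    """
    hits = []
    for name, value in sample.items():
        match = _CRC_CNT.match(name)
        if match and value > 0:
            hits.append(int(match.group(1)))
    return sorted(hits)


if __name__ == "__main__":
    example = {
        "npu_macro1_crc_error_cnt": 0,
        "npu_macro3_crc_error_cnt": 5,
        "npu_macro5_retry_cnt": 9,
    }
    print(macros_with_crc_errors(example))
```

The retry counters are skipped on purpose here, because a retransmission does not by itself mean that a CRC error was recorded.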
Supported Events
You can use Cloud Eye to centrally collect key events and cloud resource operational events. When an event occurs, you will receive an alarm. Lite Server mainly supports BMS and ECS events. The table below lists the NPU-related events. For details about other events, see Events Supported by Event Monitoring.
Event Source |
Namespace |
Event |
Event ID |
Event Severity |
Description |
Solution |
Impact |
Supported Model |
Supported Versions |
---|---|---|---|---|---|---|---|---|---|
BMS/ECS |
SYS.BMS/SYS.ECS |
NPU: device not found by npu-smi info |
NPUSMICardNotFound |
Major |
The Ascend driver is faulty or the NPU is disconnected. |
Contact O&M engineers. |
The NPU cannot be used normally. |
Snt3P 300IDuo Snt9b Snt9b23 |
telescope: 2.7.4.3 2.7.5.3 2.7.5.4 2.7.5.9 or later |
NPU: PCIe link error |
PCIeErrorFound |
Major |
The lspci command output shows that the NPU is in the rev ff state. |
Contact O&M engineers. |
The NPU cannot be used normally. |
Snt3P 300IDuo Snt9b Snt9b23 |
telescope: 2.7.4.3 2.7.5.3 2.7.5.4 2.7.5.9 or later |
||
NPU: device not found by lspci |
LspciCardNotFound |
Major |
The NPU is disconnected. |
Contact O&M engineers. |
The NPU cannot be used normally. |
Snt3P 300IDuo Snt9b Snt9b23 |
telescope: 2.7.4.3 2.7.5.3 2.7.5.4 2.7.5.9 or later |
||
NPU: overtemperature |
TemperatureOverUpperLimit |
Major |
The temperature of DDR or software is too high. |
Stop services, restart the system, check the heat dissipation system, and reset the device. |
The instance may be powered off and devices may not be found. |
Snt3P 300IDuo Snt9b Snt9b23 |
telescope: 2.7.4.3 2.7.5.3 2.7.5.4 2.7.5.9 or later |
||
NPU: uncorrectable ECC error |
UncorrectableEccErrorWarning |
Major |
There are uncorrectable ECC errors on the NPU. |
If services are affected, replace the NPU with another one. |
Services may be interrupted. |
Snt3P 300IDuo |
telescope: 2.7.4.3 2.7.5.3 2.7.5.4 2.7.5.9 or later |
||
NPU: request for instance restart |
RebootVirtualMachine |
Suggestion |
A fault occurs and the instance needs to be restarted. |
Collect the fault information, and restart the instance. |
Services may be interrupted. |
Snt3P 300IDuo Snt9b Snt9b23 |
telescope: 2.7.4.3 2.7.5.3 2.7.5.4 2.7.5.9 or later |
||
NPU: request for SoC reset |
ResetSOC |
Suggestion |
A fault occurs and the SoC needs to be reset. |
Collect the fault information, and reset the SoC. |
Services may be interrupted. |
Snt3P 300IDuo Snt9b Snt9b23 |
telescope: 2.7.4.3 2.7.5.3 2.7.5.4 2.7.5.9 or later |
||
NPU: request for AI process restart |
RestartAIProcess |
Suggestion |
A fault occurs and the AI process needs to be restarted. |
Collect the fault information, and restart the AI process. |
The current AI task will be interrupted. |
Snt3P 300IDuo Snt9b Snt9b23 |
telescope: 2.7.4.3 2.7.5.3 2.7.5.4 2.7.5.9 or later |
||
NPU: error codes |
NPUErrorCodeWarning |
Major |
A large number of NPU error codes indicating major or higher-level errors are returned. You can further locate the faults based on the error codes. |
Locate the faults according to the Black Box Error Code Information List and Health Management Error Definition. |
Services may be interrupted. |
Snt3P 300IDuo Snt9b Snt9b23 |
telescope: 2.7.4.3 2.7.5.3 2.7.5.4 2.7.5.9 or later |
||
Multiple NPU HBM ECC errors |
NpuHbmMultiEccInfo |
Suggestion |
There are NPU HBM ECC errors. |
This event is only a reference for other events. You do not need to handle it separately. |
This event is only a reference for other events. You do not need to handle it separately. |
Snt9b Snt9b23 |
telescope: 2.7.5.9 or later |
||
GPU: invalid RoCE NIC configuration |
GpuRoceNicConfigIncorrect |
Major |
The RoCE NIC configuration of the GPU is invalid. |
Contact O&M engineers. |
The parameter plane network is abnormal, preventing the execution of the multi-node task. |
GPU |
telescope: 2.7.5.9 or later |
||
ReadOnly issues in OS |
ReadOnlyFileSystem |
Critical |
The file system %s is read-only. |
Check the disk health status. |
Files cannot be written or modified. |
- |
telescope: 2.7.5.3 2.7.5.9 or later |
||
NPU: driver and firmware not matching |
NpuDriverFirmwareMismatch |
Major |
The NPU's driver and firmware do not match. |
Obtain the matched version from the Ascend official website and reinstall it. |
NPUs cannot be used. |
Snt3P 300IDuo Snt9b Snt9b23 |
telescope: 2.7.5.3 2.7.5.9 or later |
||
NPU: Docker container environment check |
NpuContainerEnvSystem |
Major |
Docker unavailable |
Check if the Docker software is normal. |
Docker cannot be used. |
- |
telescope: 2.7.5.3 2.7.5.9 or later |
||
Major |
The container plug-in Ascend-Docker-Runtime is not installed. |
Install the container plug-in Ascend-Docker-Runtime. Otherwise, the container cannot use Ascend cards. |
NPUs cannot be mounted to Docker containers. |
Snt3P 300IDuo Snt9b Snt9b23 |
telescope: 2.7.5.3 2.7.5.9 or later |
||||
Major |
IP forwarding is not enabled in the OS. |
Check the net.ipv4.ip_forward configuration in the /etc/sysctl.conf file. |
Docker containers experience network communication issues. |
- |
telescope: 2.7.5.3 2.7.5.9 or later |
||||
Major |
The shared memory of the container is too small. |
The default shared memory is 64 MB, which can be modified as needed. Method 1: Modify the default-shm-size field in the /etc/docker/daemon.json configuration file. Method 2: Use the --shm-size parameter in the docker run command to set the shared memory size of a container. |
Distributed training may fail due to insufficient shared memory. |
- |
telescope: 2.7.5.3 2.7.5.9 or later |
||||
NPU: RoCE NIC down |
RoCELinkStatusDown |
Major |
The RoCE link of NPU %d is down. |
Check the NPU RoCE network port status. |
The NPU NIC is unavailable. |
Snt9b Snt9b23 |
telescope: 2.7.5.3 2.7.5.9 or later |
||
NPU: RoCE NIC health status abnormal |
RoCEHealthStatusError |
Major |
The RoCE network health status of NPU %d is abnormal. |
Check the health status of the NPU RoCE NIC. |
The NPU NIC is unavailable. |
Snt9b Snt9b23 |
telescope: 2.7.5.3 2.7.5.9 or later |
||
NPU: RoCE NIC configuration file /etc/hccn.conf does not exist |
HccnConfNotExisted |
Major |
The RoCE NIC configuration file /etc/hccn.conf does not exist. |
Check the /etc/hccn.conf NIC configuration file. |
The RoCE NIC is unavailable. |
Snt9b Snt9b23 |
telescope: 2.7.5.3 2.7.5.9 or later |
||
GPU: basic components abnormal |
GpuEnvironmentSystem |
Major |
The nvidia-smi command is abnormal. |
Check whether the GPU driver is normal. |
The GPU driver is unavailable. |
GPU |
telescope: 2.7.5.3 2.7.5.9 or later |
||
Major |
The nvidia-fabricmanager version is inconsistent with the GPU driver version. |
Check the GPU driver version and nvidia-fabricmanager version. |
The nvidia-fabricmanager cannot work properly, affecting GPU usage. |
||||||
Major |
The container plug-in nvidia-container-toolkit is not installed. |
Install the container plug-in nvidia-container-toolkit. |
GPUs cannot be attached to Docker containers. |
||||||
Local disk mounting inspection |
MountDiskSystem |
Major |
The /etc/fstab file contains invalid UUIDs. |
Ensure that the UUIDs in the /etc/fstab configuration file are correct. Otherwise, the server may fail to be restarted. |
The disk mounting process fails, preventing the server from restarting. |
- |
telescope: 2.7.5.3 2.7.5.9 or later |
||
GPU: incorrectly configured dynamic route for Ant series server |
GpuRouteConfigError |
Major |
The dynamic route of the NIC %s of an Ant series server is not configured or is incorrectly configured. CMD [ip route]: %s | CMD [ip route show table all]: %s. |
Configure the RoCE NIC route correctly. |
The NPU network communication is abnormal. |
GPU |
telescope: 2.7.5.3 2.7.5.9 or later |
||
NPU: RoCE port not split |
RoCEUdpConfigError |
Major |
The RoCE UDP port is not split. |
Check the RoCE UDP port configuration on the NPU. |
The communication performance of NPUs is affected. |
Snt9b Snt9b23 |
telescope: 2.7.5.9 or later |
||
Warning of automatic system kernel upgrade |
KernelUpgradeWarning |
Major |
Warning of automatic system kernel upgrade. Old version: %s; new version: %s. |
System kernel upgrade may cause AI software exceptions. Check the system update logs and prevent the server from restarting. |
The AI software may be unavailable. |
Snt3P 300IDuo Snt9b Snt9b23 |
telescope: 2.7.5.3 2.7.5.9 or later |
||
NPU environment command detection |
NpuToolsWarning |
Major |
The hccn_tool is unavailable. |
Check if the NPU driver is normal. |
The IP address and gateway of the RoCE NIC cannot be configured. |
Snt9b Snt9b23 |
telescope: 2.7.5.3 2.7.5.9 or later |
||
Major |
The npu-smi is unavailable. |
Check if the NPU driver is normal. |
NPUs cannot be used. |
Snt3P 300IDuo Snt9b Snt9b23 |
telescope: 2.7.5.3 2.7.5.9 or later |
||||
Major |
The ascend-dmi is unavailable. |
Check if ToolBox is properly installed. |
The ascend-dmi cannot be used for performance analysis. |
Snt9b Snt9b23 |
telescope: 2.7.5.3 2.7.5.9 or later |
||||
NPU: L1 switch port partial failure |
NpuL1SwitchPortPartialFunctionFailure |
Major |
Some functions of the NPU's L1 1520 switch port fail. |
Transfer this issue to the Ascend or hardware team for handling. |
Services may be interrupted. |
Snt9b23 |
telescope: 2.7.5.9 or later; lqdcmi: 2.1.0 or later |
||
NPU: L1 switch fault |
NpuL1SwitchFault |
Major |
There are faults in the L1 1520 switch of the NPU. |
Transfer this issue to the Ascend or hardware team for handling. |
Services may be interrupted. |
Snt9b23 |
telescope: 2.7.5.9 or later; lqdcmi: 2.1.0 or later |
||
NPU: Unmatched RoCE IP address |
NpuRoceIPAddressMismatch |
Major |
The actual IP address of the RoCE NIC is inconsistent with the IP address in the hccn.conf configuration file. |
Contact O&M engineers. |
The parameter plane network is abnormal, preventing the execution of the multi-node task. |
Snt9b Snt9b23 |
telescope: 2.7.5.9 or later |
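Some of the conditions in the table can be checked by hand when you troubleshoot an event. The Python sketch below mirrors three of them: the rev ff state in lspci output, the net.ipv4.ip_forward setting in /etc/sysctl.conf, and the default-shm-size field in /etc/docker/daemon.json. It is a minimal illustration, not the logic used by the monitoring agent. The function names are hypothetical, the functions work on text that you pass in rather than on the live system, and the `1G` size in the example is an arbitrary value.

```python
import json


def find_rev_ff(lspci_output):
    """Return the lspci lines that show a device in the rev ff state.

    The filter is not limited to NPUs; any PCI device reported with
    "(rev ff)" is returned.
    """
    return [line for line in lspci_output.splitlines() if "(rev ff)" in line]


def ip_forward_enabled(sysctl_conf_text):
    """Return True if the last net.ipv4.ip_forward entry in the text is 1."""
    enabled = False
    for raw in sysctl_conf_text.splitlines():
        line = raw.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith(("#", ";")):
            continue
        key, sep, value = line.partition("=")
        if sep and key.strip() == "net.ipv4.ip_forward":
            enabled = value.strip() == "1"
    return enabled


def with_default_shm_size(daemon_json_text, size):
    """Return daemon.json text with the default-shm-size field set to `size`.

    Existing fields are kept. Empty input is treated as an empty configuration.
    """
    config = json.loads(daemon_json_text) if daemon_json_text.strip() else {}
    config["default-shm-size"] = size
    return json.dumps(config, indent=2)


if __name__ == "__main__":
    print(ip_forward_enabled("net.ipv4.ip_forward = 1"))
    print(with_default_shm_size("", "1G"))  # "1G" is an illustrative size
```

The value in /etc/sysctl.conf is only the persistent setting; the value currently in effect can be read from /proc/sys/net/ipv4/ip_forward. A change to /etc/docker/daemon.json typically takes effect only after the Docker daemon is restarted.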