Using CES to Monitor Lite Server Resources
Scenario
Cloud Eye Service (CES) can monitor Lite Server resources. This section describes how to interconnect Lite Server with CES so that you can monitor Lite Server resources and events.
Overview
For details about the supported images, see BMS Overview. In addition to the images listed in that document, Ubuntu 20.04 is also supported.
Monitoring metrics are sampled every minute. The metrics currently cover the CPU, memory, disks, and network. After the accelerator card driver is installed on the host, the accelerator-related metrics can also be collected. Table 1 lists only the NPU-related metrics. For other metrics, see Metrics Supported by the Agent.
Table 1 NPU monitoring metrics

| No. | Category | Metric | Display Name | Description | Unit | Value Range | Dimension | Supported Model |
|---|---|---|---|---|---|---|---|---|
| 1 | Overall | npu_device_health | NPU Health Status | Health status of the NPU | - | 0: normal; 1: minor alarm; 2: major alarm; 3: critical alarm | instance_id, npu | Snt3P 300IDuo Snt9B Snt9C |
| 2 | | npu_driver_health | NPU Driver Health Status | Health status of the NPU driver | - | 0: normal; 3: critical alarm | instance_id, npu | |
| 3 | | npu_power | NPU Power | NPU power | W | >0 | instance_id, npu | |
| 4 | | npu_temperature | NPU Temperature | NPU temperature | °C | Natural number | instance_id, npu | |
| 5 | | npu_voltage | NPU Voltage | NPU voltage | V | Natural number | instance_id, npu | |
| 6 | HBM | npu_util_rate_hbm | NPU HBM Usage | HBM usage of the NPU | % | 0%–100% | instance_id, npu | Snt9B Snt9C |
| 7 | | npu_hbm_freq | HBM Frequency | NPU HBM frequency | MHz | >0 | instance_id, npu | |
| 8 | | npu_hbm_usage | HBM Usage | NPU HBM usage | MB | ≥0 | instance_id, npu | |
| 9 | | npu_hbm_temperature | HBM Temperature | NPU HBM temperature | °C | Natural number | instance_id, npu | |
| 10 | | npu_hbm_bandwidth_util | HBM Bandwidth Usage | NPU HBM bandwidth usage | % | 0%–100% | instance_id, npu | |
| 11 | | npu_hbm_mem_capacity | NPU HBM Memory Capacity | HBM memory capacity of the NPU | MB | ≥0 | instance_id, npu | |
| 12 | | npu_hbm_ecc_enable | HBM ECC Status | NPU HBM ECC status | - | 0: ECC detection is disabled. 1: ECC detection is enabled. | instance_id, npu | |
| 13 | | npu_hbm_single_bit_error_cnt | Single-bit Errors on HBM | Current number of single-bit errors on the NPU HBM | count | ≥0 | instance_id, npu | |
| 14 | | npu_hbm_double_bit_error_cnt | Double-bit Errors on HBM | Current number of double-bit errors on the NPU HBM | count | ≥0 | instance_id, npu | |
| 15 | | npu_hbm_total_single_bit_error_cnt | Single-bit Errors in HBM Lifecycle | Number of single-bit errors in the NPU HBM lifecycle | count | ≥0 | instance_id, npu | |
| 16 | | npu_hbm_total_double_bit_error_cnt | Double-bit Errors in HBM Lifecycle | Number of double-bit errors in the NPU HBM lifecycle | count | ≥0 | instance_id, npu | |
| 17 | | npu_hbm_single_bit_isolated_pages_cnt | Isolated NPU Memory Pages with HBM Single-bit Errors | Number of isolated NPU memory pages with HBM single-bit errors | count | ≥0 | instance_id, npu | |
| 18 | | npu_hbm_double_bit_isolated_pages_cnt | Isolated NPU Memory Pages with HBM Multi-bit Errors | Number of isolated NPU memory pages with HBM double-bit errors | count | ≥0 | instance_id, npu | |
| 19 | DDR | npu_usage_mem | Used NPU Memory | Used NPU memory | MB | ≥0 | instance_id, npu | Snt3P 300IDuo |
| 20 | | npu_util_rate_mem | NPU Memory Usage | NPU memory usage | % | 0%–100% | instance_id, npu | |
| 21 | | npu_freq_mem | NPU Memory Frequency | NPU memory frequency | MHz | >0 | instance_id, npu | |
| 22 | | npu_util_rate_mem_bandwidth | NPU Memory Bandwidth Usage | NPU memory bandwidth usage | % | 0%–100% | instance_id, npu | |
| 23 | | npu_sbe | NPU Single-bit Errors | Number of single-bit errors on the NPU | count | ≥0 | instance_id, npu | |
| 24 | | npu_dbe | NPU Double-bit Errors | Number of double-bit errors on the NPU | count | ≥0 | instance_id, npu | |
| 25 | AI Core | npu_freq_ai_core | AI Core Frequency of the NPU | AI core frequency of the NPU | MHz | >0 | instance_id, npu | Snt3P 300IDuo Snt9B Snt9C |
| 26 | | npu_freq_ai_core_rated | Rated Frequency of the NPU AI Core | Rated frequency of the NPU AI core | MHz | >0 | instance_id, npu | |
| 27 | | npu_util_rate_ai_core | AI Core Usage of the NPU | AI core usage of the NPU | % | 0%–100% | instance_id, npu | |
| 28 | AI CPU | npu_aicpu_num | AI CPUs of the NPU | Number of AI CPUs of the NPU | count | ≥0 | instance_id, npu | Snt3P 300IDuo Snt9B Snt9C |
| 29 | | npu_util_rate_ai_cpu | AI CPU Usage of the NPU | AI CPU usage of the NPU | % | 0%–100% | instance_id, npu | |
| 30 | | npu_aicpu_avg_util_rate | Average AI CPU Usage of the NPU | Average AI CPU usage of the NPU | % | 0%–100% | instance_id, npu | |
| 31 | | npu_aicpu_max_freq | Maximum AI CPU Frequency of the NPU | Maximum AI CPU frequency of the NPU | MHz | >0 | instance_id, npu | |
| 32 | | npu_aicpu_cur_freq | AI CPU Frequency of the NPU | AI CPU frequency of the NPU | MHz | >0 | instance_id, npu | |
| 33 | CTRL CPU | npu_util_rate_ctrl_cpu | Control CPU Usage of the NPU | Control CPU usage of the NPU | % | 0%–100% | instance_id, npu | Snt3P 300IDuo Snt9B Snt9C |
| 34 | | npu_freq_ctrl_cpu | Control CPU Frequency of the NPU | Control CPU frequency of the NPU | MHz | >0 | instance_id, npu | |
| 35 | PCIe link | npu_link_cap_speed | Max. NPU Link Speed | Maximum link speed of the NPU | GT/s | ≥0 | instance_id, npu | 310P 300IDuo Snt9B Snt9C |
| 36 | | npu_link_cap_width | Max. NPU Link Width | Maximum link width of the NPU | count | ≥0 | instance_id, npu | |
| 37 | | npu_link_status_speed | NPU Link Speed | Link speed of the NPU | GT/s | ≥0 | instance_id, npu | |
| 38 | | npu_link_status_width | NPU Link Width | Link width of the NPU | count | ≥0 | instance_id, npu | |
| 39 | RoCE network | npu_device_network_health | NPU Network Health Status | Connectivity of the IP address of the RoCE NIC on the NPU | - | 0: The network health status is normal. Other values: The network status is abnormal. | instance_id, npu | Snt9B Snt9C |
| 40 | | npu_network_port_link_status | NPU Network Port Link Status | Link status of the NPU network port | - | 0: up; 1: down | instance_id, npu | |
| 41 | | npu_roce_tx_rate | NPU NIC Uplink Rate | Uplink rate of the NPU NIC | MB/s | ≥0 | instance_id, npu | |
| 42 | | npu_roce_rx_rate | NPU NIC Downlink Rate | Downlink rate of the NPU NIC | MB/s | ≥0 | instance_id, npu | |
| 43 | | npu_mac_tx_mac_pause_num | PAUSE Frames Sent from MAC | Total number of PAUSE frames sent from the MAC address corresponding to the NPU | count | ≥0 | instance_id, npu | |
| 44 | | npu_mac_rx_mac_pause_num | PAUSE Frames Received by MAC | Total number of PAUSE frames received by the MAC address corresponding to the NPU | count | ≥0 | instance_id, npu | |
| 45 | | npu_mac_tx_pfc_pkt_num | PFC Frames Sent from MAC | Total number of PFC frames sent from the MAC address corresponding to the NPU | count | ≥0 | instance_id, npu | |
| 46 | | npu_mac_rx_pfc_pkt_num | PFC Frames Received by MAC | Total number of PFC frames received by the MAC address corresponding to the NPU | count | ≥0 | instance_id, npu | |
| 47 | | npu_mac_tx_bad_pkt_num | Bad Packets Sent from MAC | Total number of bad packets sent from the MAC address corresponding to the NPU | count | ≥0 | instance_id, npu | |
| 48 | | npu_mac_rx_bad_pkt_num | Bad Packets Received by MAC | Total number of bad packets received by the MAC address corresponding to the NPU | count | ≥0 | instance_id, npu | |
| 49 | | npu_roce_tx_err_pkt_num | Bad Packets Sent by RoCE | Total number of bad packets sent by the RoCE NIC on the NPU | count | ≥0 | instance_id, npu | |
| 50 | | npu_roce_rx_err_pkt_num | Bad Packets Received by RoCE | Total number of bad packets received by the RoCE NIC on the NPU | count | ≥0 | instance_id, npu | |
| 51 | RoCE optical module | npu_opt_temperature | NPU Optical Module Temperature | NPU optical module temperature | °C | Natural number | instance_id, npu | Snt9B Snt9C |
| 52 | | npu_opt_temperature_high_thres | Upper Limit of the NPU Optical Module Temperature | Upper limit of the NPU optical module temperature | °C | Natural number | instance_id, npu | |
| 53 | | npu_opt_temperature_low_thres | Lower Limit of the NPU Optical Module Temperature | Lower limit of the NPU optical module temperature | °C | Natural number | instance_id, npu | |
| 54 | | npu_opt_voltage | NPU Optical Module Voltage | NPU optical module voltage | mV | Natural number | instance_id, npu | |
| 55 | | npu_opt_voltage_high_thres | Upper Limit of the NPU Optical Module Voltage | Upper limit of the NPU optical module voltage | mV | Natural number | instance_id, npu | |
| 56 | | npu_opt_voltage_low_thres | Lower Limit of the NPU Optical Module Voltage | Lower limit of the NPU optical module voltage | mV | Natural number | instance_id, npu | |
| 57 | | npu_opt_tx_power_lane0 | TX Power of the NPU Optical Module in Channel 0 | Transmit power of the NPU optical module in channel 0 | mW | ≥0 | instance_id, npu | |
| 58 | | npu_opt_tx_power_lane1 | TX Power of the NPU Optical Module in Channel 1 | Transmit power of the NPU optical module in channel 1 | mW | ≥0 | instance_id, npu | |
| 59 | | npu_opt_tx_power_lane2 | TX Power of the NPU Optical Module in Channel 2 | Transmit power of the NPU optical module in channel 2 | mW | ≥0 | instance_id, npu | |
| 60 | | npu_opt_tx_power_lane3 | TX Power of the NPU Optical Module in Channel 3 | Transmit power of the NPU optical module in channel 3 | mW | ≥0 | instance_id, npu | |
| 61 | | npu_opt_rx_power_lane0 | RX Power of the NPU Optical Module in Channel 0 | Receive power of the NPU optical module in channel 0 | mW | ≥0 | instance_id, npu | |
| 62 | | npu_opt_rx_power_lane1 | RX Power of the NPU Optical Module in Channel 1 | Receive power of the NPU optical module in channel 1 | mW | ≥0 | instance_id, npu | |
| 63 | | npu_opt_rx_power_lane2 | RX Power of the NPU Optical Module in Channel 2 | Receive power of the NPU optical module in channel 2 | mW | ≥0 | instance_id, npu | |
| 64 | | npu_opt_rx_power_lane3 | RX Power of the NPU Optical Module in Channel 3 | Receive power of the NPU optical module in channel 3 | mW | ≥0 | instance_id, npu | |
| 65 | | npu_opt_tx_bias_lane0 | TX Bias Current of the NPU Optical Module in Channel 0 | Transmitted bias current of the NPU optical module in channel 0 | mA | ≥0 | instance_id, npu | |
| 66 | | npu_opt_tx_bias_lane1 | TX Bias Current of the NPU Optical Module in Channel 1 | Transmitted bias current of the NPU optical module in channel 1 | mA | ≥0 | instance_id, npu | |
| 67 | | npu_opt_tx_bias_lane2 | TX Bias Current of the NPU Optical Module in Channel 2 | Transmitted bias current of the NPU optical module in channel 2 | mA | ≥0 | instance_id, npu | |
| 68 | | npu_opt_tx_bias_lane3 | TX Bias Current of the NPU Optical Module in Channel 3 | Transmitted bias current of the NPU optical module in channel 3 | mA | ≥0 | instance_id, npu | |
| 69 | | npu_opt_tx_los | TX Los of the NPU Optical Module | TX Los flag of the NPU optical module | count | ≥0 | instance_id, npu | |
| 70 | | npu_opt_rx_los | RX Los of the NPU Optical Module | RX Los flag of the NPU optical module | count | ≥0 | instance_id, npu | |
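If a metric from Table 1 is not reported, you can first check on the server whether the accelerator card driver exposes the underlying values. The following is a minimal local spot check using npu-smi; it assumes the Ascend driver is installed, and the query options (-t ...) are assumptions that may differ between driver versions.

```bash
# Confirm the driver can enumerate the NPUs (the same command the
# "NPU: device not found by npu-smi info" event below is based on).
npu-smi info

# Query individual values for NPU 0. The -t query types are assumptions
# and may vary by driver version.
npu-smi info -t health -i 0   # health status (npu_device_health)
npu-smi info -t temp -i 0     # chip temperature (npu_temperature)
npu-smi info -t power -i 0    # power consumption (npu_power)
npu-smi info -t usages -i 0   # AI core/memory/HBM usage
```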
Supported Events
You can use CES to centrally collect key events and cloud resource operation events. When an event occurs, an alarm is reported. Lite Server mainly supports BMS events, which are listed in the following table.
| Event Source | Namespace | Event | Event ID | Event Severity | Description | Solution | Impact | Supported Model |
|---|---|---|---|---|---|---|---|---|
| BMS | SYS.BMS | NPU: device not found by npu-smi info | NPUSMICardNotFound | Major | The Ascend driver is faulty or the NPU is disconnected. | Contact O&M engineers. | The NPU cannot be used normally. | Snt3P 300IDuo Snt9B Snt9C |
| | | NPU: PCIe link error | PCIeErrorFound | Major | The lspci command output shows that the NPU is in the rev ff state. | Contact O&M engineers. | The NPU cannot be used normally. | Snt3P 300IDuo Snt9B Snt9C |
| | | NPU: device not found by lspci | LspciCardNotFound | Major | The NPU is disconnected. | Contact O&M engineers. | The NPU cannot be used normally. | Snt3P 300IDuo Snt9B Snt9C |
| | | NPU: overtemperature | TemperatureOverUpperLimit | Major | The temperature of DDR or software is too high. | Stop services, restart the BMS, check the heat dissipation system, and reset the devices. | The instance may be powered off and devices may not be found. | Snt3P 300IDuo Snt9B Snt9C |
| | | NPU: uncorrectable ECC error | UncorrectableEccErrorWarning | Major | There are uncorrectable ECC errors on the NPU. | If services are affected, replace the NPU with another one. | Services may be interrupted. | Snt3P 300IDuo |
| | | NPU: request for instance restart | RebootVirtualMachine | Suggestion | A fault occurs and the BMS needs to be restarted. | Collect the fault information, and restart the BMS. | Services may be interrupted. | Snt3P 300IDuo Snt9B Snt9C |
| | | NPU: request for SoC reset | ResetSOC | Suggestion | A fault occurs and the SoC needs to be reset. | Collect the fault information, and reset the SoC. | Services may be interrupted. | Snt3P 300IDuo Snt9B Snt9C |
| | | NPU: request for restart AI process | RestartAIProcess | Suggestion | A fault occurs and the AI process needs to be restarted. | Collect the fault information, and restart the AI process. | The current AI task will be interrupted. | Snt3P 300IDuo Snt9B Snt9C |
| | | NPU: error codes | NPUErrorCodeWarning | Major | A large number of NPU error codes indicating major or higher-level errors are returned. You can further locate the faults based on the error codes. | Locate the faults according to the Black Box Error Code Information List and Health Management Error Definition. | Services may be interrupted. | Snt3P 300IDuo Snt9B Snt9C |
| | | Multiple NPU HBM ECC errors | NpuHbmMultiEccInfo | Suggestion | There are NPU HBM ECC errors. | This event is only a reference for other events. You do not need to handle it separately. | This event is only a reference for other events. You do not need to handle it separately. | Snt9B Snt9C |
| | | GPU: invalid RoCE NIC configuration | GpuRoceNicConfigIncorrect | Major | GPU: invalid RoCE NIC configuration | Contact O&M engineers. | The parameter plane network is abnormal, preventing the execution of the multi-node task. | GPU |
| | | ReadOnly issues in OS | ReadOnlyFileSystem | Critical | The file system %s is read-only. | Check the disk health status. | The files cannot be written or operated. | - |
| | | NPU: driver and firmware not matching | NpuDriverFirmwareMismatch | Major | The NPU's driver and firmware do not match. | Obtain the matched version from the Ascend official website and reinstall it. | NPUs cannot be used. | Snt3P 300IDuo Snt9B Snt9C |
| | | NPU: Docker container environment check | NpuContainerEnvSystem | Major | Docker unavailable | Check if the Docker software is normal. | Docker cannot be used. | - |
| | | | | Major | The container plug-in Ascend-Docker-Runtime is not installed. | Install the container plug-in Ascend-Docker-Runtime. Otherwise, the container cannot use Ascend cards. | NPUs cannot be mounted to Docker containers. | Snt3P 300IDuo Snt9B Snt9C |
| | | | | Major | IP forwarding is not enabled in the OS. | Check the net.ipv4.ip_forward configuration in the /etc/sysctl.conf file. | Docker containers experience network communication issues. | - |
| | | | | Major | The shared memory of the container is too small. | The default shared memory is 64 MB, which can be modified as needed. Method 1: Modify the default-shm-size field in the /etc/docker/daemon.json configuration file. Method 2: Use the --shm-size parameter in the docker run command to set the shared memory size of a container. | Distributed training failed due to insufficient shared memory. | - |
| | | NPU: RoCE NIC down | RoCELinkStatusDown | Major | The RoCE link of NPU card %d is down. | Check the NPU RoCE network port status. | The NPU NIC is unavailable. | Snt9B Snt9C |
| | | NPU: RoCE NIC health status abnormal | RoCEHealthStatusError | Major | The RoCE network health status of NPU %d is abnormal. | Check the health status of the NPU RoCE NIC. | The NPU NIC is unavailable. | Snt9B Snt9C |
| | | NPU: RoCE NIC configuration file /etc/hccn.conf not exist | HccnConfNotExisted | Major | The RoCE NIC configuration file /etc/hccn.conf does not exist. | Check the /etc/hccn.conf NIC configuration file. | The RoCE NIC is unavailable. | Snt9B Snt9C |
| | | GPU: basic components abnormal | GpuEnvironmentSystem | Major | The nvidia-smi command is abnormal. | Check if the GPU driver is normal. | The GPU driver is unavailable. | GPU |
| | | | | Major | The nvidia-fabricmanager version is inconsistent with the GPU driver version. | Check the GPU driver version and nvidia-fabricmanager version. | The nvidia-fabricmanager cannot work properly, affecting GPU usage. | |
| | | | | Major | The container plug-in nvidia-container-toolkit is not installed. | Install the container plug-in nvidia-container-toolkit. | GPUs cannot be mounted to Docker containers. | |
| | | Local disk mounting inspection | MountDiskSystem | Major | The /etc/fstab file contains invalid UUIDs. | Ensure that the UUIDs in the /etc/fstab configuration file are correct. Otherwise, the server may fail to be restarted. | The disk mounting process fails, preventing the server from restarting. | - |
| | | GPU: incorrectly configured dynamic route for Ant series servers | GpuRouteConfigError | Major | The dynamic route of the NIC %s of an Ant series server is not configured or is incorrectly configured. CMD [ip route]: %s \| CMD [ip route show table all]: %s. | Configure the RoCE NIC route correctly. | The NPU network communication is abnormal. | GPU |
| | | NPU: RoCE port not split | RoCEUdpConfigError | Major | The RoCE UDP port is not split. | Check the RoCE UDP port configuration on the NPU. | The communication performance of NPUs is affected. | Snt9B Snt9C |
| | | Warning of automatic system kernel upgrade | KernelUpgradeWarning | Major | Warning of automatic system kernel upgrade. Old version: %s; new version: %s. | System kernel upgrade may cause AI software exceptions. Check the system update logs and prevent the server from restarting. | The AI software may be unavailable. | Snt3P 300IDuo Snt9B Snt9C |
| | | NPU environment command detection | NpuToolsWarning | Major | The hccn_tool is unavailable. | Check if the NPU driver is normal. | The IP address and gateway of the RoCE NIC cannot be configured. | Snt9B Snt9C |
| | | | | Major | The npu-smi is unavailable. | Check if the NPU driver is normal. | NPUs cannot be used. | Snt3P 300IDuo Snt9B Snt9C |
| | | | | Major | The ascend-dmi is unavailable. | Check if ToolBox is properly installed. | ascend-dmi cannot be used for performance analysis. | Snt9B Snt9C |
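Several of the events above correspond to conditions that can be verified directly on the server. The following sketch spot-checks a few of them; the file paths and the sysctl key come from the table above, while the script itself is only an illustration, not an official diagnostic tool.

```bash
#!/bin/bash
# Illustrative spot checks for a few of the events listed above.

# NpuToolsWarning: npu-smi and hccn_tool should be available.
command -v npu-smi   >/dev/null || echo "npu-smi not found (check the NPU driver)"
command -v hccn_tool >/dev/null || echo "hccn_tool not found (check the NPU driver)"

# HccnConfNotExisted: the RoCE NIC configuration file must exist.
[ -f /etc/hccn.conf ] || echo "/etc/hccn.conf is missing"

# NpuContainerEnvSystem: IP forwarding must be enabled for container networking.
[ "$(sysctl -n net.ipv4.ip_forward)" = "1" ] || echo "net.ipv4.ip_forward is disabled"

# NpuContainerEnvSystem: the default 64 MB of container shared memory is often
# too small for distributed training. Either set default-shm-size in
# /etc/docker/daemon.json or pass --shm-size to docker run, for example:
#   docker run --shm-size=16g ...   (16g is an example value, not a recommendation)
grep -q default-shm-size /etc/docker/daemon.json 2>/dev/null \
  || echo "default-shm-size is not set in /etc/docker/daemon.json"
```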
Installing CES Agent Monitoring Plug-ins
- Create an agency for CES. For details, see Creating a User and Granting Permissions.
- Currently, one-click monitoring installation is not supported on the CES page. You need to log in to the server and run the following command to install and configure the agent. For details about how to install the agent in other regions, see Installing the Agent on a Linux Server.
cd /usr/local && curl -k -O https://obs.cn-north-4.myhuaweicloud.com/uniagent-cn-north-4/script/agent_install.sh && bash agent_install.sh
If the following information is displayed, the installation is successful.
Figure 1 Installation succeeded
- On the Cloud Eye console, choose Service Monitoring > Bare Metal Server to view the monitoring items. Accelerator card monitoring items are only available after the accelerator card driver is installed on the host.
Figure 2 Monitoring page
The monitoring plug-in is now installed. You can view the collected metrics on the UI or configure alarms based on the metric values.
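In addition to the console, the collected metric values can be read through the CES metric query API. The sketch below is an assumption-laden example using curl: the endpoint, project ID, token, namespace, and dimension values are placeholders that you must replace with the values shown for your server on the Cloud Eye console (this document only confirms the metric names and the instance_id and npu dimensions).

```bash
# Query the last hour of npu_util_rate_hbm for one NPU through the CES API.
# All values below are placeholders/assumptions; take the namespace and the
# dimension order from the metric details shown on the Cloud Eye console.
ENDPOINT="https://ces.cn-north-4.myhuaweicloud.com"   # region endpoint (example)
PROJECT_ID="<project_id>"
TOKEN="<IAM_token>"
NAMESPACE="<namespace_shown_on_console>"
TO=$(($(date +%s) * 1000))      # CES timestamps are in milliseconds
FROM=$((TO - 3600000))          # one hour ago

curl -s -H "X-Auth-Token: ${TOKEN}" \
  "${ENDPOINT}/V1.0/${PROJECT_ID}/metric-data?namespace=${NAMESPACE}&metric_name=npu_util_rate_hbm&dim.0=instance_id,<bms_instance_id>&dim.1=npu,<npu_id>&from=${FROM}&to=${TO}&period=300&filter=average"
```

If the request is rejected, check the namespace and dimension values on the console first; those query parameters are the part most likely to differ from this sketch.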