Using CES to Monitor Lite Server Resources

Scenario

Lite Server resources can be monitored through Cloud Eye Service (CES). This section describes how to connect Lite Server to CES so that you can monitor its resources and events.

Overview

For details about how CES monitors bare metal servers, including the supported images, see BMS Overview. In addition to the images listed in that document, Ubuntu 20.04 is also supported.

Metrics are sampled every 1 minute. The current monitoring metrics cover CPU, memory, disk, and network. Accelerator card metrics can be collected only after the accelerator card driver is installed on the host. Table 1 lists only the NPU-related metrics. For other metrics, see Metrics Supported by the Agent.

**Table 1** NPU metrics
No.	Category	Metric	Display Name	Description	Unit	Value Range	Dimension	Supported Model
1	Overall	npu_device_health	NPU Health Status	Health status of the NPU	-	0: normal 1: minor alarm 2: major alarm 3: critical alarm	instance_id, npu	Snt3P 300IDuo Snt9B Snt9C
2		npu_driver_health	NPU Driver Health Status	Health status of the NPU driver	-	0: normal 3: critical alarm	instance_id, npu
3		npu_power	NPU Power	NPU power	W	>0	instance_id, npu
4		npu_temperature	NPU Temperature	NPU temperature	°C	Natural number	instance_id, npu
5		npu_voltage	NPU Voltage	NPU voltage	V	Natural number	instance_id, npu
6	HBM	npu_util_rate_hbm	NPU HBM Usage	HBM usage of the NPU	%	0%–100%	instance_id, npu	Snt9B Snt9C
7		npu_hbm_freq	HBM Frequency	NPU HBM frequency	MHz	>0	instance_id, npu
8		npu_hbm_usage	HBM Usage	NPU HBM usage	MB	≥0	instance_id, npu
9		npu_hbm_temperature	HBM Temperature	NPU HBM temperature	°C	Natural number	instance_id, npu
10		npu_hbm_bandwidth_util	HBM Bandwidth Usage	NPU HBM bandwidth usage	%	0%–100%	instance_id, npu
11		npu_hbm_mem_capacity	NPU HBM Memory Capacity	HBM memory capacity of the NPU	MB	≥0	instance_id, npu
12		npu_hbm_ecc_enable	HBM ECC Status	NPU HBM ECC status	-	0: ECC detection is disabled. 1: ECC detection is enabled.	instance_id, npu
13		npu_hbm_single_bit_error_cnt	Single-bit Errors on HBM	Current number of single-bit errors on the NPU HBM	count	≥0	instance_id, npu
14		npu_hbm_double_bit_error_cnt	Double-bit Errors on HBM	Current number of double-bit errors on the NPU HBM	count	≥0	instance_id, npu
15		npu_hbm_total_single_bit_error_cnt	Single-bit Errors in HBM Lifecycle	Number of single-bit errors in the NPU HBM lifecycle	count	≥0	instance_id, npu
16		npu_hbm_total_double_bit_error_cnt	Double-bit Errors in HBM Lifecycle	Number of double-bit errors in the NPU HBM lifecycle	count	≥0	instance_id, npu
17		npu_hbm_single_bit_isolated_pages_cnt	Isolated NPU Memory Pages with HBM Single-bit Errors	Number of isolated NPU memory pages with HBM single-bit errors	count	≥0	instance_id, npu
18		npu_hbm_double_bit_isolated_pages_cnt	Isolated NPU Memory Pages with HBM Multi-bit Errors	Number of isolated NPU memory pages with HBM double-bit errors	count	≥0	instance_id, npu
19	DDR	npu_usage_mem	Used NPU Memory	Used NPU memory	MB	≥0	instance_id, npu	Snt3P 300IDuo
20		npu_util_rate_mem	NPU Memory Usage	NPU memory usage	%	0%–100%	instance_id, npu
21		npu_freq_mem	NPU Memory Frequency	NPU memory frequency	MHz	>0	instance_id, npu
22		npu_util_rate_mem_bandwidth	NPU Memory Bandwidth Usage	NPU memory bandwidth usage	%	0%–100%	instance_id, npu
23		npu_sbe	NPU Single-bit Errors	Number of single-bit errors on the NPU	count	≥0	instance_id, npu
24		npu_dbe	NPU Double-bit Errors	Number of double-bit errors on the NPU	count	≥0	instance_id, npu
25	AI Core	npu_freq_ai_core	AI Core Frequency of the NPU	AI core frequency of the NPU	MHz	>0	instance_id, npu	Snt3P 300IDuo Snt9B Snt9C
26		npu_freq_ai_core_rated	Rated Frequency of the NPU AI Core	Rated frequency of the NPU AI core	MHz	>0	instance_id, npu
27		npu_util_rate_ai_core	AI Core Usage of the NPU	AI core usage of the NPU	%	0%–100%	instance_id, npu
28	AI CPU	npu_aicpu_num	AI CPUs of the NPU	Number of AI CPUs of the NPU	count	≥0	instance_id, npu	Snt3P 300IDuo Snt9B Snt9C
29		npu_util_rate_ai_cpu	AI CPU Usage of the NPU	AI CPU usage of the NPU	%	0%–100%	instance_id, npu
30		npu_aicpu_avg_util_rate	Average AI CPU Usage of the NPU	Average AI CPU usage of the NPU	%	0%–100%	instance_id, npu
31		npu_aicpu_max_freq	Maximum AI CPU Frequency of the NPU	Maximum AI CPU frequency of the NPU	MHz	>0	instance_id, npu
32		npu_aicpu_cur_freq	AI CPU Frequency of the NPU	AI CPU frequency of the NPU	MHz	>0	instance_id, npu
33	CTRL CPU	npu_util_rate_ctrl_cpu	Control CPU Usage of the NPU	Control CPU usage of the NPU	%	0%–100%	instance_id, npu	Snt3P 300IDuo Snt9B Snt9C
34	CTRL CPU	npu_freq_ctrl_cpu	Control CPU Frequency of the NPU	Control CPU frequency of the NPU	MHz	>0	instance_id, npu	Snt3P 300IDuo Snt9B Snt9C
35	PCIe link	npu_link_cap_speed	Max. NPU Link Speed	Maximum link speed of the NPU	GT/s	≥0	instance_id, npu	Snt3P 300IDuo Snt9B Snt9C
36		npu_link_cap_width	Max. NPU Link Width	Maximum link width of the NPU	count	≥0	instance_id, npu
37		npu_link_status_speed	NPU Link Speed	Link speed of the NPU	GT/s	≥0	instance_id, npu
38		npu_link_status_width	NPU Link Width	Link width of the NPU	count	≥0	instance_id, npu
39	RoCE network	npu_device_network_health	NPU Network Health Status	Connectivity of the IP address of the RoCE NIC on the NPU	-	0: The network health status is normal. Other values: The network status is abnormal.	instance_id, npu	Snt9B Snt9C
40		npu_network_port_link_status	NPU Network Port Link Status	Link status of the NPU network port	-	0: up 1: down	instance_id, npu
41		npu_roce_tx_rate	NPU NIC Uplink Rate	Uplink rate of the NPU NIC	MB/s	≥0	instance_id, npu
42		npu_roce_rx_rate	NPU NIC Downlink Rate	Downlink rate of the NPU NIC	MB/s	≥0	instance_id, npu
43		npu_mac_tx_mac_pause_num	PAUSE Frames Sent from MAC	Total number of PAUSE frames sent from the MAC address corresponding to the NPU	count	≥0	instance_id, npu
44		npu_mac_rx_mac_pause_num	PAUSE Frames Received by MAC	Total number of PAUSE frames received by the MAC address corresponding to the NPU	count	≥0	instance_id, npu
45		npu_mac_tx_pfc_pkt_num	PFC Frames Sent from MAC	Total number of PFC frames sent from the MAC address corresponding to the NPU	count	≥0	instance_id, npu
46		npu_mac_rx_pfc_pkt_num	PFC Frames Received by MAC	Total number of PFC frames received by the MAC address corresponding to the NPU	count	≥0	instance_id, npu
47		npu_mac_tx_bad_pkt_num	Bad Packets Sent from MAC	Total number of bad packets sent from the MAC address corresponding to the NPU	count	≥0	instance_id, npu
48		npu_mac_rx_bad_pkt_num	Bad Packets Received by MAC	Total number of bad packets received by the MAC address corresponding to the NPU	count	≥0	instance_id, npu
49		npu_roce_tx_err_pkt_num	Bad Packets Sent by RoCE	Total number of bad packets sent by the RoCE NIC on the NPU	count	≥0	instance_id, npu
50		npu_roce_rx_err_pkt_num	Bad Packets Received by RoCE	Total number of bad packets received by the RoCE NIC on the NPU	count	≥0	instance_id, npu
51	RoCE optical module	npu_opt_temperature	NPU Optical Module Temperature	NPU optical module temperature	°C	Natural number	instance_id, npu	Snt9B Snt9C
52		npu_opt_temperature_high_thres	Upper Limit of the NPU Optical Module Temperature	Upper limit of the NPU optical module temperature	°C	Natural number	instance_id, npu
53		npu_opt_temperature_low_thres	Lower Limit of the NPU Optical Module Temperature	Lower limit of the NPU optical module temperature	°C	Natural number	instance_id, npu
54		npu_opt_voltage	NPU Optical Module Voltage	NPU optical module voltage	mV	Natural number	instance_id, npu
55		npu_opt_voltage_high_thres	Upper Limit of the NPU Optical Module Voltage	Upper limit of the NPU optical module voltage	mV	Natural number	instance_id, npu
56		npu_opt_voltage_low_thres	Lower Limit of the NPU Optical Module Voltage	Lower limit of the NPU optical module voltage	mV	Natural number	instance_id, npu
57		npu_opt_tx_power_lane0	TX Power of the NPU Optical Module in Channel 0	Transmit power of the NPU optical module in channel 0	mW	≥0	instance_id, npu
58		npu_opt_tx_power_lane1	TX Power of the NPU Optical Module in Channel 1	Transmit power of the NPU optical module in channel 1	mW	≥0	instance_id, npu
59		npu_opt_tx_power_lane2	TX Power of the NPU Optical Module in Channel 2	Transmit power of the NPU optical module in channel 2	mW	≥0	instance_id, npu
60		npu_opt_tx_power_lane3	TX Power of the NPU Optical Module in Channel 3	Transmit power of the NPU optical module in channel 3	mW	≥0	instance_id, npu
61		npu_opt_rx_power_lane0	RX Power of the NPU Optical Module in Channel 0	Receive power of the NPU optical module in channel 0	mW	≥0	instance_id, npu
62		npu_opt_rx_power_lane1	RX Power of the NPU Optical Module in Channel 1	Receive power of the NPU optical module in channel 1	mW	≥0	instance_id, npu
63		npu_opt_rx_power_lane2	RX Power of the NPU Optical Module in Channel 2	Receive power of the NPU optical module in channel 2	mW	≥0	instance_id, npu
64		npu_opt_rx_power_lane3	RX Power of the NPU Optical Module in Channel 3	Receive power of the NPU optical module in channel 3	mW	≥0	instance_id, npu
65		npu_opt_tx_bias_lane0	TX Bias Current of the NPU Optical Module in Channel 0	Transmit bias current of the NPU optical module in channel 0	mA	≥0	instance_id, npu
66		npu_opt_tx_bias_lane1	TX Bias Current of the NPU Optical Module in Channel 1	Transmit bias current of the NPU optical module in channel 1	mA	≥0	instance_id, npu
67		npu_opt_tx_bias_lane2	TX Bias Current of the NPU Optical Module in Channel 2	Transmit bias current of the NPU optical module in channel 2	mA	≥0	instance_id, npu
68		npu_opt_tx_bias_lane3	TX Bias Current of the NPU Optical Module in Channel 3	Transmit bias current of the NPU optical module in channel 3	mA	≥0	instance_id, npu
69		npu_opt_tx_los	TX Los of the NPU Optical Module	TX loss of signal (LOS) flag of the NPU optical module	count	≥0	instance_id, npu
70		npu_opt_rx_los	RX Los of the NPU Optical Module	RX loss of signal (LOS) flag of the NPU optical module	count	≥0	instance_id, npu
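
The value ranges in Table 1 can be turned into simple checks when you process collected metric values yourself. The following is a minimal sketch, not part of CES or the agent: the function names `npu_health_label` and `within_thresholds` are hypothetical, and the sample readings are made up for illustration. Only the value meanings (0 to 3 for npu_device_health) and the metric names come from Table 1.

```shell
#!/usr/bin/env bash
# Sketch: interpret a few Table 1 metric values.
# Function names are hypothetical; only the value meanings come from Table 1.

# Map an npu_device_health value (0-3) to its meaning in Table 1.
npu_health_label() {
    case "$1" in
        0) echo "normal" ;;
        1) echo "minor alarm" ;;
        2) echo "major alarm" ;;
        3) echo "critical alarm" ;;
        *) echo "unknown" ;;
    esac
}

# Succeed if an integer reading lies within [low, high], for example
# npu_opt_temperature checked against npu_opt_temperature_low_thres
# and npu_opt_temperature_high_thres.
within_thresholds() {
    local value="$1" low="$2" high="$3"
    [ "$value" -ge "$low" ] && [ "$value" -le "$high" ]
}

# Example with made-up sample readings (not real monitoring data).
echo "npu_device_health=2 -> $(npu_health_label 2)"
if within_thresholds 45 0 70; then
    echo "optical module temperature within limits"
else
    echo "optical module temperature out of limits"
fi
```

The same range check applies to the optical module voltage metrics, which also report a reading together with upper and lower limits.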

Supported Events

You can use CES to centrally collect key events and cloud resource operational events. When an event occurs, you will receive an alarm. Lite Server mainly supports events from BMS. For details, see the following table.

**Table 2** Events supported by Lite Server
Event Source	Namespace	Event	Event ID	Event Severity	Description	Solution	Impact	Supported Model
BMS	SYS.BMS	NPU: device not found by npu-smi info	NPUSMICardNotFound	Major	The Ascend driver is faulty or the NPU is disconnected.	Contact O&M engineers.	The NPU cannot be used normally.	Snt3P 300IDuo Snt9B Snt9C
		NPU: PCIe link error	PCIeErrorFound	Major	The lspci command output shows that the NPU is in the rev ff state.	Contact O&M engineers.	The NPU cannot be used normally.	Snt3P 300IDuo Snt9B Snt9C
		NPU: device not found by lspci	LspciCardNotFound	Major	The NPU is disconnected.	Contact O&M engineers.	The NPU cannot be used normally.	Snt3P 300IDuo Snt9B Snt9C
		NPU: overtemperature	TemperatureOverUpperLimit	Major	The temperature of DDR or software is too high.	Stop services, restart the BMS, check the heat dissipation system, and reset the devices.	The instance may be powered off and devices may not be found.	Snt3P 300IDuo Snt9B Snt9C
		NPU: uncorrectable ECC error	UncorrectableEccErrorWarning	Major	There are uncorrectable ECC errors on the NPU.	If services are affected, replace the NPU with another one.	Services may be interrupted.	Snt3P 300IDuo
		NPU: request for instance restart	RebootVirtualMachine	Suggestion	A fault occurs and the BMS needs to be restarted.	Collect the fault information, and restart the BMS.	Services may be interrupted.	Snt3P 300IDuo Snt9B Snt9C
		NPU: request for SoC reset	ResetSOC	Suggestion	A fault occurs and the SoC needs to be reset.	Collect the fault information, and reset the SoC.	Services may be interrupted.	Snt3P 300IDuo Snt9B Snt9C
		NPU: request for restart AI process	RestartAIProcess	Suggestion	A fault occurs and the AI process needs to be restarted.	Collect the fault information, and restart the AI process.	The current AI task will be interrupted.	Snt3P 300IDuo Snt9B Snt9C
		NPU: error codes	NPUErrorCodeWarning	Major	A large number of NPU error codes indicating major or higher-level errors are returned. You can further locate the faults based on the error codes.	Locate the faults according to the Black Box Error Code Information List and Health Management Error Definition.	Services may be interrupted.	Snt3P 300IDuo Snt9B Snt9C
		Multiple NPU HBM ECC errors	NpuHbmMultiEccInfo	Suggestion	There are NPU HBM ECC errors.	This event is only a reference for other events. You do not need to handle it separately.	This event is only a reference for other events. You do not need to handle it separately.	Snt9B Snt9C
		GPU: invalid RoCE NIC configuration	GpuRoceNicConfigIncorrect	Major	GPU: invalid RoCE NIC configuration	Contact O&M engineers.	The parameter plane network is abnormal, preventing the execution of the multi-node task.	GPU
		ReadOnly issues in OS	ReadOnlyFileSystem	Critical	The file system %s is read-only.	Check the disk health status.	Files cannot be written or modified.	-
		NPU: driver and firmware not matching	NpuDriverFirmwareMismatch	Major	The NPU's driver and firmware do not match.	Obtain the matched version from the Ascend official website and reinstall it.	NPUs cannot be used.	Snt3P 300IDuo Snt9B Snt9C
		NPU: Docker container environment check	NpuContainerEnvSystem	Major	Docker unavailable	Check if the Docker software is normal.	Docker cannot be used.	-
				Major	The container plug-in Ascend-Docker-Runtime is not installed.	Install the container plug-in Ascend-Docker-Runtime. Otherwise, the container cannot use Ascend cards.	NPUs cannot be mounted to Docker containers.	Snt3P 300IDuo Snt9B Snt9C
				Major	IP forwarding is not enabled in the OS.	Check the net.ipv4.ip_forward configuration in the /etc/sysctl.conf file.	Docker containers experience network communication issues.	-
				Major	The shared memory of the container is too small.	The default shared memory is 64 MB, which can be modified as needed. Method 1: Modify the default-shm-size field in the /etc/docker/daemon.json configuration file. Method 2: Use the --shm-size parameter in the docker run command to set the shared memory size of a container.	Distributed training failed due to insufficient shared memory.	-
		NPU: RoCE NIC down	RoCELinkStatusDown	Major	The RoCE link of NPU card %d is down.	Check the NPU RoCE network port status.	The NPU NIC is unavailable.	Snt9B Snt9C
		NPU: RoCE NIC health status abnormal	RoCEHealthStatusError	Major	The RoCE network health status of NPU %d is abnormal.	Check the health status of the NPU RoCE NIC.	The NPU NIC is unavailable.	Snt9B Snt9C
		NPU: RoCE NIC configuration file /etc/hccn.conf not exist	HccnConfNotExisted	Major	The RoCE NIC configuration file /etc/hccn.conf does not exist.	Check the /etc/hccn.conf NIC configuration file.	The RoCE NIC is unavailable.	Snt9B Snt9C
		GPU: basic components abnormal	GpuEnvironmentSystem	Major	The nvidia-smi command is abnormal.	Check if the GPU driver is normal.	The GPU driver is unavailable.	GPU
				Major	The nvidia-fabricmanager version is inconsistent with the GPU driver version.	Check the GPU driver version and nvidia-fabricmanager version.	The nvidia-fabricmanager cannot work properly, affecting GPU usage.
				Major	The container plug-in nvidia-container-toolkit is not installed.	Install the container plug-in nvidia-container-toolkit.	GPUs cannot be mounted to Docker containers.
		Local disk mounting inspection	MountDiskSystem	Major	The /etc/fstab file contains invalid UUIDs.	Ensure that the UUIDs in the /etc/fstab configuration file are correct. Otherwise, the server may fail to be restarted.	The disk mounting process fails, preventing the server from restarting.	-
		GPU: incorrectly configured dynamic route for Ant series servers	GpuRouteConfigError	Major	The dynamic route of the NIC %s of an Ant series server is not configured or is incorrectly configured. CMD [ip route]: %s | CMD [ip route show table all]: %s.	Configure the RoCE NIC route correctly.	The NPU network communication is abnormal.	GPU
		NPU: RoCE port not split	RoCEUdpConfigError	Major	The RoCE UDP port is not split.	Check the RoCE UDP port configuration on the NPU.	The communication performance of NPUs is affected.	Snt9B Snt9C
		Warning of automatic system kernel upgrade	KernelUpgradeWarning	Major	Warning of automatic system kernel upgrade. Old version: %s; new version: %s.	System kernel upgrade may cause AI software exceptions. Check the system update logs and prevent the server from restarting.	The AI software may be unavailable.	Snt3P 300IDuo Snt9B Snt9C
		NPU environment command detection	NpuToolsWarning	Major	The hccn_tool is unavailable.	Check if the NPU driver is normal.	The IP address and gateway of the RoCE NIC cannot be configured.	Snt9B Snt9C
				Major	The npu-smi is unavailable.	Check if the NPU driver is normal.	NPUs cannot be used.	Snt3P 300IDuo Snt9B Snt9C
				Major	The ascend-dmi is unavailable.	Check if ToolBox is properly installed.	ascend-dmi cannot be used for performance analysis.	Snt9B Snt9C
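
Two of the NpuContainerEnvSystem conditions in Table 2 (IP forwarding and container shared memory) refer to settings in local configuration files, so they can be inspected before the event is reported. The following is a minimal sketch under stated assumptions: the helper names `ip_forward_setting` and `default_shm_size` are hypothetical and are not part of CES, Docker, or any Ascend tool. The file paths and the `net.ipv4.ip_forward` and `default-shm-size` keys are the ones named in the table.

```shell
#!/usr/bin/env bash
# Sketch of local checks for two NpuContainerEnvSystem conditions in Table 2.
# Function names are hypothetical and not part of any CES or Ascend tool.

# Print the last net.ipv4.ip_forward value set in a sysctl-style file
# (default: /etc/sysctl.conf); print nothing if the key is absent.
ip_forward_setting() {
    local file="${1:-/etc/sysctl.conf}"
    [ -r "$file" ] || return 0
    sed -n 's/^[[:space:]]*net\.ipv4\.ip_forward[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$file" | tail -n 1
}

# Print the default-shm-size value from a daemon.json-style file
# (default: /etc/docker/daemon.json); print nothing if the field is absent.
default_shm_size() {
    local file="${1:-/etc/docker/daemon.json}"
    [ -r "$file" ] || return 0
    sed -n 's/.*"default-shm-size"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "$file" | head -n 1
}

echo "net.ipv4.ip_forward in /etc/sysctl.conf: $(ip_forward_setting)"
echo "default-shm-size in /etc/docker/daemon.json: $(default_shm_size)"
```

An empty result means the key is not set in that file, so the system default applies (64 MB for shared memory, according to the table). For a single container, Method 2 in the table corresponds to passing a size to `docker run`, for example `--shm-size=8g`; the 8g value here is only an illustration.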

Installing CES Agent Monitoring Plug-ins

Create an agency for CES. For details, see Creating a User and Granting Permissions.
One-click agent installation from the CES console is currently not supported. Log in to the server and run the following command to install and configure the agent. The command uses the cn-north-4 region. For details about how to install the agent in other regions, see Installing the Agent on a Linux Server.
```
cd /usr/local && curl -k -O https://obs.cn-north-4.myhuaweicloud.com/uniagent-cn-north-4/script/agent_install.sh && bash agent_install.sh
```
If the following information is displayed, the installation is successful.

Figure 1 Installation succeeded
On the Cloud Eye console, choose Service Monitoring > Bare Metal Server to view the monitoring items. Accelerator card monitoring items are only available after the accelerator card driver is installed on the host.

Figure 2 Monitoring page

The monitoring plug-in is now installed. You can view the collected metrics on the Cloud Eye console or configure alarms based on the metric values.
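
The one-line installation command in this section can also be split into separate steps so that a failed download does not go unnoticed. The following is a minimal sketch, not the official installer wrapper: the helper names `agent_script_url` and `install_agent` are hypothetical, and the assumption that other regions follow the same URL pattern as cn-north-4 is not confirmed by this section. Check Installing the Agent on a Linux Server for the actual URL of your region.

```shell
#!/usr/bin/env bash
# Sketch: the one-line install command split into checked steps.
# ASSUMPTION: other regions follow the same URL pattern as cn-north-4;
# confirm the real URL in "Installing the Agent on a Linux Server".

# Build the installer URL for a region ID (hypothetical helper).
agent_script_url() {
    local region="$1"
    echo "https://obs.${region}.myhuaweicloud.com/uniagent-${region}/script/agent_install.sh"
}

# Download the installer into /usr/local and run it, stopping on the first failure.
# curl -k is kept from the original command; it skips TLS certificate verification.
# curl -f makes the download fail on an HTTP error instead of saving an error page.
install_agent() {
    local region="${1:-cn-north-4}"
    cd /usr/local || return 1
    curl -k -f -O "$(agent_script_url "$region")" || return 1
    bash agent_install.sh
}

# Print the URL only; call install_agent explicitly to perform the installation.
agent_script_url cn-north-4
```

Running the script only prints the download URL. To install, call `install_agent` (optionally with a region ID) from a shell that has permission to write to /usr/local.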