Using CES to Monitor Lite Server Resources
Scenario
Cloud Eye Service (CES) can monitor Lite Server resources. This section describes how to interconnect Lite Server with CES so that you can monitor Lite Server resources and events.
Overview
For details about the supported images, see BMS Overview. In addition to the images listed in that document, Ubuntu 20.04 is also supported.
Monitoring metrics are sampled every minute. The metrics currently cover the CPU, memory, disks, and network. After the accelerator card driver is installed on the host, the accelerator-related metrics can also be collected. Table 1 lists only the NPU-related metrics. For other metrics, see Metrics Supported by the Agent.
Table 1 NPU monitoring metrics

| No. | Category | Metric | Display Name | Description | Unit | Value Range | Dimension | Supported Model |
|---|---|---|---|---|---|---|---|---|
| 1 | Overall | npu_device_health | NPU Health Status | Health status of the NPU | - | 0: normal; 1: minor alarm; 2: major alarm; 3: critical alarm | instance_id, npu | Snt3P 300IDuo Snt9B Snt9C |
| 2 | | npu_driver_health | NPU Driver Health Status | Health status of the NPU driver | - | 0: normal; 3: critical alarm | instance_id, npu | |
| 3 | | npu_power | NPU Power | NPU power | W | >0 | instance_id, npu | |
| 4 | | npu_temperature | NPU Temperature | NPU temperature | °C | Natural number | instance_id, npu | |
| 5 | | npu_voltage | NPU Voltage | NPU voltage | V | Natural number | instance_id, npu | |
| 6 | HBM | npu_util_rate_hbm | NPU HBM Usage | HBM usage of the NPU | % | 0%–100% | instance_id, npu | Snt9B Snt9C |
| 7 | | npu_hbm_freq | HBM Frequency | NPU HBM frequency | MHz | >0 | instance_id, npu | |
| 8 | | npu_hbm_usage | HBM Usage | NPU HBM usage | MB | ≥0 | instance_id, npu | |
| 9 | | npu_hbm_temperature | HBM Temperature | NPU HBM temperature | °C | Natural number | instance_id, npu | |
| 10 | | npu_hbm_bandwidth_util | HBM Bandwidth Usage | NPU HBM bandwidth usage | % | 0%–100% | instance_id, npu | |
| 11 | | npu_hbm_mem_capacity | NPU HBM Memory Capacity | HBM memory capacity of the NPU | MB | ≥0 | instance_id, npu | |
| 12 | | npu_hbm_ecc_enable | HBM ECC Status | NPU HBM ECC status | - | 0: ECC detection is disabled. 1: ECC detection is enabled. | instance_id, npu | |
| 13 | | npu_hbm_single_bit_error_cnt | Single-bit Errors on HBM | Current number of single-bit errors on the NPU HBM | count | ≥0 | instance_id, npu | |
| 14 | | npu_hbm_double_bit_error_cnt | Double-bit Errors on HBM | Current number of double-bit errors on the NPU HBM | count | ≥0 | instance_id, npu | |
| 15 | | npu_hbm_total_single_bit_error_cnt | Single-bit Errors in HBM Lifecycle | Number of single-bit errors in the NPU HBM lifecycle | count | ≥0 | instance_id, npu | |
| 16 | | npu_hbm_total_double_bit_error_cnt | Double-bit Errors in HBM Lifecycle | Number of double-bit errors in the NPU HBM lifecycle | count | ≥0 | instance_id, npu | |
| 17 | | npu_hbm_single_bit_isolated_pages_cnt | Isolated NPU Memory Pages with HBM Single-bit Errors | Number of isolated NPU memory pages with HBM single-bit errors | count | ≥0 | instance_id, npu | |
| 18 | | npu_hbm_double_bit_isolated_pages_cnt | Isolated NPU Memory Pages with HBM Multi-bit Errors | Number of isolated NPU memory pages with HBM double-bit errors | count | ≥0 | instance_id, npu | |
| 19 | DDR | npu_usage_mem | Used NPU Memory | Used NPU memory | MB | ≥0 | instance_id, npu | Snt3P 300IDuo |
| 20 | | npu_util_rate_mem | NPU Memory Usage | NPU memory usage | % | 0%–100% | instance_id, npu | |
| 21 | | npu_freq_mem | NPU Memory Frequency | NPU memory frequency | MHz | >0 | instance_id, npu | |
| 22 | | npu_util_rate_mem_bandwidth | NPU Memory Bandwidth Usage | NPU memory bandwidth usage | % | 0%–100% | instance_id, npu | |
| 23 | | npu_sbe | NPU Single-bit Errors | Number of single-bit errors on the NPU | count | ≥0 | instance_id, npu | |
| 24 | | npu_dbe | NPU Double-bit Errors | Number of double-bit errors on the NPU | count | ≥0 | instance_id, npu | |
| 25 | AI Core | npu_freq_ai_core | AI Core Frequency of the NPU | AI core frequency of the NPU | MHz | >0 | instance_id, npu | Snt3P 300IDuo Snt9B Snt9C |
| 26 | | npu_freq_ai_core_rated | Rated Frequency of the NPU AI Core | Rated frequency of the NPU AI core | MHz | >0 | instance_id, npu | |
| 27 | | npu_util_rate_ai_core | AI Core Usage of the NPU | AI core usage of the NPU | % | 0%–100% | instance_id, npu | |
| 28 | AI CPU | npu_aicpu_num | AI CPUs of the NPU | Number of AI CPUs of the NPU | count | ≥0 | instance_id, npu | Snt3P 300IDuo Snt9B Snt9C |
| 29 | | npu_util_rate_ai_cpu | AI CPU Usage of the NPU | AI CPU usage of the NPU | % | 0%–100% | instance_id, npu | |
| 30 | | npu_aicpu_avg_util_rate | Average AI CPU Usage of the NPU | Average AI CPU usage of the NPU | % | 0%–100% | instance_id, npu | |
| 31 | | npu_aicpu_max_freq | Maximum AI CPU Frequency of the NPU | Maximum AI CPU frequency of the NPU | MHz | >0 | instance_id, npu | |
| 32 | | npu_aicpu_cur_freq | AI CPU Frequency of the NPU | AI CPU frequency of the NPU | MHz | >0 | instance_id, npu | |
| 33 | CTRL CPU | npu_util_rate_ctrl_cpu | Control CPU Usage of the NPU | Control CPU usage of the NPU | % | 0%–100% | instance_id, npu | Snt3P 300IDuo Snt9B Snt9C |
| 34 | | npu_freq_ctrl_cpu | Control CPU Frequency of the NPU | Control CPU frequency of the NPU | MHz | >0 | instance_id, npu | |
| 35 | PCIe link | npu_link_cap_speed | Max. NPU Link Speed | Maximum link speed of the NPU | GT/s | ≥0 | instance_id, npu | 310P 300IDuo Snt9B Snt9C |
| 36 | | npu_link_cap_width | Max. NPU Link Width | Maximum link width of the NPU | count | ≥0 | instance_id, npu | |
| 37 | | npu_link_status_speed | NPU Link Speed | Link speed of the NPU | GT/s | ≥0 | instance_id, npu | |
| 38 | | npu_link_status_width | NPU Link Width | Link width of the NPU | count | ≥0 | instance_id, npu | |
| 39 | RoCE network | npu_device_network_health | NPU Network Health Status | Connectivity of the IP address of the RoCE NIC on the NPU | - | 0: The network health status is normal. Other values: The network status is abnormal. | instance_id, npu | Snt9B Snt9C |
| 40 | | npu_network_port_link_status | NPU Network Port Link Status | Link status of the NPU network port | - | 0: up; 1: down | instance_id, npu | |
| 41 | | npu_roce_tx_rate | NPU NIC Uplink Rate | Uplink rate of the NPU NIC | MB/s | ≥0 | instance_id, npu | |
| 42 | | npu_roce_rx_rate | NPU NIC Downlink Rate | Downlink rate of the NPU NIC | MB/s | ≥0 | instance_id, npu | |
| 43 | | npu_mac_tx_mac_pause_num | PAUSE Frames Sent from MAC | Total number of PAUSE frames sent from the MAC address corresponding to the NPU | count | ≥0 | instance_id, npu | |
| 44 | | npu_mac_rx_mac_pause_num | PAUSE Frames Received by MAC | Total number of PAUSE frames received by the MAC address corresponding to the NPU | count | ≥0 | instance_id, npu | |
| 45 | | npu_mac_tx_pfc_pkt_num | PFC Frames Sent from MAC | Total number of PFC frames sent from the MAC address corresponding to the NPU | count | ≥0 | instance_id, npu | |
| 46 | | npu_mac_rx_pfc_pkt_num | PFC Frames Received by MAC | Total number of PFC frames received by the MAC address corresponding to the NPU | count | ≥0 | instance_id, npu | |
| 47 | | npu_mac_tx_bad_pkt_num | Bad Packets Sent from MAC | Total number of bad packets sent from the MAC address corresponding to the NPU | count | ≥0 | instance_id, npu | |
| 48 | | npu_mac_rx_bad_pkt_num | Bad Packets Received by MAC | Total number of bad packets received by the MAC address corresponding to the NPU | count | ≥0 | instance_id, npu | |
| 49 | | npu_roce_tx_err_pkt_num | Bad Packets Sent by RoCE | Total number of bad packets sent by the RoCE NIC on the NPU | count | ≥0 | instance_id, npu | |
| 50 | | npu_roce_rx_err_pkt_num | Bad Packets Received by RoCE | Total number of bad packets received by the RoCE NIC on the NPU | count | ≥0 | instance_id, npu | |
| 51 | RoCE optical module | npu_opt_temperature | NPU Optical Module Temperature | NPU optical module temperature | °C | Natural number | instance_id, npu | Snt9B Snt9C |
| 52 | | npu_opt_temperature_high_thres | Upper Limit of the NPU Optical Module Temperature | Upper limit of the NPU optical module temperature | °C | Natural number | instance_id, npu | |
| 53 | | npu_opt_temperature_low_thres | Lower Limit of the NPU Optical Module Temperature | Lower limit of the NPU optical module temperature | °C | Natural number | instance_id, npu | |
| 54 | | npu_opt_voltage | NPU Optical Module Voltage | NPU optical module voltage | mV | Natural number | instance_id, npu | |
| 55 | | npu_opt_voltage_high_thres | Upper Limit of the NPU Optical Module Voltage | Upper limit of the NPU optical module voltage | mV | Natural number | instance_id, npu | |
| 56 | | npu_opt_voltage_low_thres | Lower Limit of the NPU Optical Module Voltage | Lower limit of the NPU optical module voltage | mV | Natural number | instance_id, npu | |
| 57 | | npu_opt_tx_power_lane0 | TX Power of the NPU Optical Module in Channel 0 | Transmit power of the NPU optical module in channel 0 | mW | ≥0 | instance_id, npu | |
| 58 | | npu_opt_tx_power_lane1 | TX Power of the NPU Optical Module in Channel 1 | Transmit power of the NPU optical module in channel 1 | mW | ≥0 | instance_id, npu | |
| 59 | | npu_opt_tx_power_lane2 | TX Power of the NPU Optical Module in Channel 2 | Transmit power of the NPU optical module in channel 2 | mW | ≥0 | instance_id, npu | |
| 60 | | npu_opt_tx_power_lane3 | TX Power of the NPU Optical Module in Channel 3 | Transmit power of the NPU optical module in channel 3 | mW | ≥0 | instance_id, npu | |
| 61 | | npu_opt_rx_power_lane0 | RX Power of the NPU Optical Module in Channel 0 | Receive power of the NPU optical module in channel 0 | mW | ≥0 | instance_id, npu | |
| 62 | | npu_opt_rx_power_lane1 | RX Power of the NPU Optical Module in Channel 1 | Receive power of the NPU optical module in channel 1 | mW | ≥0 | instance_id, npu | |
| 63 | | npu_opt_rx_power_lane2 | RX Power of the NPU Optical Module in Channel 2 | Receive power of the NPU optical module in channel 2 | mW | ≥0 | instance_id, npu | |
| 64 | | npu_opt_rx_power_lane3 | RX Power of the NPU Optical Module in Channel 3 | Receive power of the NPU optical module in channel 3 | mW | ≥0 | instance_id, npu | |
| 65 | | npu_opt_tx_bias_lane0 | TX Bias Current of the NPU Optical Module in Channel 0 | Transmitted bias current of the NPU optical module in channel 0 | mA | ≥0 | instance_id, npu | |
| 66 | | npu_opt_tx_bias_lane1 | TX Bias Current of the NPU Optical Module in Channel 1 | Transmitted bias current of the NPU optical module in channel 1 | mA | ≥0 | instance_id, npu | |
| 67 | | npu_opt_tx_bias_lane2 | TX Bias Current of the NPU Optical Module in Channel 2 | Transmitted bias current of the NPU optical module in channel 2 | mA | ≥0 | instance_id, npu | |
| 68 | | npu_opt_tx_bias_lane3 | TX Bias Current of the NPU Optical Module in Channel 3 | Transmitted bias current of the NPU optical module in channel 3 | mA | ≥0 | instance_id, npu | |
| 69 | | npu_opt_tx_los | TX Los of the NPU Optical Module | TX Los flag of the NPU optical module | count | ≥0 | instance_id, npu | |
| 70 | | npu_opt_rx_los | RX Los of the NPU Optical Module | RX Los flag of the NPU optical module | count | ≥0 | instance_id, npu | |
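If a metric from Table 1 is not reported, you can first check on the server whether the accelerator card driver exposes the underlying values. The following is a minimal local spot check using npu-smi; it assumes the Ascend driver is installed, and the query options (-t ...) are assumptions that may differ between driver versions.

```bash
# Confirm the driver can enumerate the NPUs (the same command the
# "NPU: device not found by npu-smi info" event below is based on).
npu-smi info

# Query individual values for NPU 0. The -t query types are assumptions
# and may vary by driver version.
npu-smi info -t health -i 0   # health status (npu_device_health)
npu-smi info -t temp -i 0     # chip temperature (npu_temperature)
npu-smi info -t power -i 0    # power consumption (npu_power)
npu-smi info -t usages -i 0   # AI core/memory/HBM usage
```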
Supported Events
You can use CES to centrally collect key events and cloud resource operation events. When an event occurs, an alarm is reported. Lite Server mainly supports BMS events, which are listed in the following table.
| Event Source | Namespace | Event | Event ID | Event Severity | Description | Solution | Impact | Supported Model |
|---|---|---|---|---|---|---|---|---|
| BMS | SYS.BMS | NPU: device not found by npu-smi info | NPUSMICardNotFound | Major | The Ascend driver is faulty or the NPU is disconnected. | Contact O&M engineers. | The NPU cannot be used normally. | Snt3P 300IDuo Snt9B Snt9C |
| | | NPU: PCIe link error | PCIeErrorFound | Major | The lspci command output shows that the NPU is in the rev ff state. | Contact O&M engineers. | The NPU cannot be used normally. | Snt3P 300IDuo Snt9B Snt9C |
| | | NPU: device not found by lspci | LspciCardNotFound | Major | The NPU is disconnected. | Contact O&M engineers. | The NPU cannot be used normally. | Snt3P 300IDuo Snt9B Snt9C |
| | | NPU: overtemperature | TemperatureOverUpperLimit | Major | The temperature of DDR or software is too high. | Stop services, restart the BMS, check the heat dissipation system, and reset the devices. | The instance may be powered off and devices may not be found. | Snt3P 300IDuo Snt9B Snt9C |
| | | NPU: uncorrectable ECC error | UncorrectableEccErrorWarning | Major | There are uncorrectable ECC errors on the NPU. | If services are affected, replace the NPU with another one. | Services may be interrupted. | Snt3P 300IDuo |
| | | NPU: request for instance restart | RebootVirtualMachine | Suggestion | A fault occurs and the BMS needs to be restarted. | Collect the fault information, and restart the BMS. | Services may be interrupted. | Snt3P 300IDuo Snt9B Snt9C |
| | | NPU: request for SoC reset | ResetSOC | Suggestion | A fault occurs and the SoC needs to be reset. | Collect the fault information, and reset the SoC. | Services may be interrupted. | Snt3P 300IDuo Snt9B Snt9C |
| | | NPU: request for restart AI process | RestartAIProcess | Suggestion | A fault occurs and the AI process needs to be restarted. | Collect the fault information, and restart the AI process. | The current AI task will be interrupted. | Snt3P 300IDuo Snt9B Snt9C |
| | | NPU: error codes | NPUErrorCodeWarning | Major | A large number of NPU error codes indicating major or higher-level errors are returned. You can further locate the faults based on the error codes. | Locate the faults according to the Black Box Error Code Information List and Health Management Error Definition. | Services may be interrupted. | Snt3P 300IDuo Snt9B Snt9C |
| | | Multiple NPU HBM ECC errors | NpuHbmMultiEccInfo | Suggestion | There are NPU HBM ECC errors. | This event is only a reference for other events. You do not need to handle it separately. | This event is only a reference for other events. You do not need to handle it separately. | Snt9B Snt9C |
| | | GPU: invalid RoCE NIC configuration | GpuRoceNicConfigIncorrect | Major | GPU: invalid RoCE NIC configuration | Contact O&M engineers. | The parameter plane network is abnormal, preventing the execution of the multi-node task. | GPU |
| | | ReadOnly issues in OS | ReadOnlyFileSystem | Critical | The file system %s is read-only. | Check the disk health status. | The files cannot be written or operated. | - |
| | | NPU: driver and firmware not matching | NpuDriverFirmwareMismatch | Major | The NPU's driver and firmware do not match. | Obtain the matched version from the Ascend official website and reinstall it. | NPUs cannot be used. | Snt3P 300IDuo Snt9B Snt9C |
| | | NPU: Docker container environment check | NpuContainerEnvSystem | Major | Docker unavailable | Check if the Docker software is normal. | Docker cannot be used. | - |
| | | | | Major | The container plug-in Ascend-Docker-Runtime is not installed. | Install the container plug-in Ascend-Docker-Runtime. Otherwise, the container cannot use Ascend cards. | NPUs cannot be mounted to Docker containers. | Snt3P 300IDuo Snt9B Snt9C |
| | | | | Major | IP forwarding is not enabled in the OS. | Check the net.ipv4.ip_forward configuration in the /etc/sysctl.conf file. | Docker containers experience network communication issues. | - |
| | | | | Major | The shared memory of the container is too small. | The default shared memory is 64 MB, which can be modified as needed. Method 1: Modify the default-shm-size field in the /etc/docker/daemon.json configuration file. Method 2: Use the --shm-size parameter in the docker run command to set the shared memory size of a container. | Distributed training failed due to insufficient shared memory. | - |
| | | NPU: RoCE NIC down | RoCELinkStatusDown | Major | The RoCE link of NPU card %d is down. | Check the NPU RoCE network port status. | The NPU NIC is unavailable. | Snt9B Snt9C |
| | | NPU: RoCE NIC health status abnormal | RoCEHealthStatusError | Major | The RoCE network health status of NPU %d is abnormal. | Check the health status of the NPU RoCE NIC. | The NPU NIC is unavailable. | Snt9B Snt9C |
| | | NPU: RoCE NIC configuration file /etc/hccn.conf not exist | HccnConfNotExisted | Major | The RoCE NIC configuration file /etc/hccn.conf does not exist. | Check the /etc/hccn.conf NIC configuration file. | The RoCE NIC is unavailable. | Snt9B Snt9C |
| | | GPU: basic components abnormal | GpuEnvironmentSystem | Major | The nvidia-smi command is abnormal. | Check if the GPU driver is normal. | The GPU driver is unavailable. | GPU |
| | | | | Major | The nvidia-fabricmanager version is inconsistent with the GPU driver version. | Check the GPU driver version and nvidia-fabricmanager version. | The nvidia-fabricmanager cannot work properly, affecting GPU usage. | |
| | | | | Major | The container plug-in nvidia-container-toolkit is not installed. | Install the container plug-in nvidia-container-toolkit. | GPUs cannot be mounted to Docker containers. | |
| | | Local disk mounting inspection | MountDiskSystem | Major | The /etc/fstab file contains invalid UUIDs. | Ensure that the UUIDs in the /etc/fstab configuration file are correct. Otherwise, the server may fail to be restarted. | The disk mounting process fails, preventing the server from restarting. | - |
| | | GPU: incorrectly configured dynamic route for Ant series servers | GpuRouteConfigError | Major | The dynamic route of the NIC %s of an Ant series server is not configured or is incorrectly configured. CMD [ip route]: %s \| CMD [ip route show table all]: %s. | Configure the RoCE NIC route correctly. | The NPU network communication is abnormal. | GPU |
| | | NPU: RoCE port not split | RoCEUdpConfigError | Major | The RoCE UDP port is not split. | Check the RoCE UDP port configuration on the NPU. | The communication performance of NPUs is affected. | Snt9B Snt9C |
| | | Warning of automatic system kernel upgrade | KernelUpgradeWarning | Major | Warning of automatic system kernel upgrade. Old version: %s; new version: %s. | System kernel upgrade may cause AI software exceptions. Check the system update logs and prevent the server from restarting. | The AI software may be unavailable. | Snt3P 300IDuo Snt9B Snt9C |
| | | NPU environment command detection | NpuToolsWarning | Major | The hccn_tool is unavailable. | Check if the NPU driver is normal. | The IP address and gateway of the RoCE NIC cannot be configured. | Snt9B Snt9C |
| | | | | Major | The npu-smi is unavailable. | Check if the NPU driver is normal. | NPUs cannot be used. | Snt3P 300IDuo Snt9B Snt9C |
| | | | | Major | The ascend-dmi is unavailable. | Check if ToolBox is properly installed. | ascend-dmi cannot be used for performance analysis. | Snt9B Snt9C |
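Several of the events above correspond to conditions that can be verified directly on the server. The following sketch spot-checks a few of them; the file paths and the sysctl key come from the table above, while the script itself is only an illustration, not an official diagnostic tool.

```bash
#!/bin/bash
# Illustrative spot checks for a few of the events listed above.

# NpuToolsWarning: npu-smi and hccn_tool should be available.
command -v npu-smi   >/dev/null || echo "npu-smi not found (check the NPU driver)"
command -v hccn_tool >/dev/null || echo "hccn_tool not found (check the NPU driver)"

# HccnConfNotExisted: the RoCE NIC configuration file must exist.
[ -f /etc/hccn.conf ] || echo "/etc/hccn.conf is missing"

# NpuContainerEnvSystem: IP forwarding must be enabled for container networking.
[ "$(sysctl -n net.ipv4.ip_forward)" = "1" ] || echo "net.ipv4.ip_forward is disabled"

# NpuContainerEnvSystem: the default 64 MB of container shared memory is often
# too small for distributed training. Either set default-shm-size in
# /etc/docker/daemon.json or pass --shm-size to docker run, for example:
#   docker run --shm-size=16g ...   (16g is an example value, not a recommendation)
grep -q default-shm-size /etc/docker/daemon.json 2>/dev/null \
  || echo "default-shm-size is not set in /etc/docker/daemon.json"
```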
Installing CES Agent Monitoring Plug-ins
- Create an agency for CES. For details, see Creating a User and Granting Permissions.
- Currently, one-click monitoring installation is not supported on the CES page. You need to log in to the server and run the following command to install and configure the agent. For details about how to install the agent in other regions, see Installing the Agent on a Linux Server.
cd /usr/local && curl -k -O https://obs.cn-north-4.myhuaweicloud.com/uniagent-cn-north-4/script/agent_install.sh && bash agent_install.sh
If the following information is displayed, the installation is successful.
Figure 1 Installation succeeded
- On the Cloud Eye console, choose Service Monitoring > Bare Metal Server to view the monitoring items. Accelerator card monitoring items are only available after the accelerator card driver is installed on the host.
Figure 2 Monitoring page
The monitoring plug-in is now installed. You can view the collected metrics on the UI or configure alarms based on the metric values.
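In addition to the console, the collected metric values can be read through the CES metric query API. The sketch below is an assumption-laden example using curl: the endpoint, project ID, token, namespace, and dimension values are placeholders that you must replace with the values shown for your server on the Cloud Eye console (this document only confirms the metric names and the instance_id and npu dimensions).

```bash
# Query the last hour of npu_util_rate_hbm for one NPU through the CES API.
# All values below are placeholders/assumptions; take the namespace and the
# dimension order from the metric details shown on the Cloud Eye console.
ENDPOINT="https://ces.cn-north-4.myhuaweicloud.com"   # region endpoint (example)
PROJECT_ID="<project_id>"
TOKEN="<IAM_token>"
NAMESPACE="<namespace_shown_on_console>"
TO=$(($(date +%s) * 1000))      # CES timestamps are in milliseconds
FROM=$((TO - 3600000))          # one hour ago

curl -s -H "X-Auth-Token: ${TOKEN}" \
  "${ENDPOINT}/V1.0/${PROJECT_ID}/metric-data?namespace=${NAMESPACE}&metric_name=npu_util_rate_hbm&dim.0=instance_id,<bms_instance_id>&dim.1=npu,<npu_id>&from=${FROM}&to=${TO}&period=300&filter=average"
```

If the request is rejected, check the namespace and dimension values on the console first; those query parameters are the part most likely to differ from this sketch.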