Updated on 2024-11-19 GMT+08:00

Using CES to Monitor Lite Server Resources

Scenario

This section describes how to configure metric monitoring for Lite Server resources using the monitoring solution provided by Huawei Cloud BMS and Cloud Eye Service (CES). Once configured, you can view metrics for CPU, CPU load, memory, disk, disk I/O, file system, NIC, software RAID, and processes.

About BMS Monitoring

For details, see BMS Overview. In addition to the images listed in that document, Ubuntu 20.04 is supported.

Metrics are sampled every minute. Currently, CPU, memory, disk, and network metrics are collected. After the accelerator card driver is installed on the host, the metrics listed in Table 1 can also be collected.

Table 1 Metrics

| Metric | Name | Description | Unit | Dimensions |
| --- | --- | --- | --- | --- |
| gpu_status | GPU Health Status | Overall measurement of the GPU health. 0 indicates the GPU is healthy. 1 indicates the GPU is subhealthy. 2 indicates the GPU is faulty. | - | instance_id, gpu |
| gpu_utilization | GPU Usage | GPU computing power usage | % | instance_id, gpu |
| memory_utilization | GPU Memory Usage | GPU memory usage | % | instance_id, gpu |
| gpu_performance | GPU Performance Status | Performance status of the GPU | - | instance_id, gpu |
| encoder_utilization | Encoding Usage | GPU encoding capability usage | % | instance_id, gpu |
| decoder_utilization | Decoding Usage | GPU decoding capability usage | % | instance_id, gpu |
| volatile_correctable | Volatile Correctable ECC Errors | Number of correctable ECC errors since the GPU was last reset. The value is reset to 0 each time the GPU is reset. | Number | instance_id, gpu |
| volatile_uncorrectable | Volatile Uncorrectable ECC Errors | Number of uncorrectable ECC errors since the GPU was last reset. The value is reset to 0 each time the GPU is reset. | Number | instance_id, gpu |
| aggregate_correctable | Aggregate Correctable ECC Errors | Number of correctable ECC errors on the GPU | Number | instance_id, gpu |
| aggregate_uncorrectable | Aggregate Uncorrectable ECC Errors | Number of uncorrectable ECC errors on the GPU | Number | instance_id, gpu |
| retired_page_single_bit | Retired Page Single Bit Errors | Number of retired page single bit errors, which indicates the number of single-bit pages blocked by the graphics card | Number | instance_id, gpu |
| retired_page_double_bit | Retired Page Double Bit Errors | Number of retired page double bit errors, which indicates the number of double-bit pages blocked by the graphics card | Number | instance_id, gpu |
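Because the accelerator metrics are collected only after the accelerator card driver is installed, it can help to confirm on the host that a driver tool is present before looking for these metrics. The following is a minimal sketch, not part of the official procedure. It assumes an NVIDIA GPU whose driver provides the `nvidia-smi` command; the helper name `first_available_command` is hypothetical, and the command list can be extended for other accelerator vendors.

```shell
#!/usr/bin/env bash
# Sketch: report whether an accelerator driver command-line tool is on PATH.

# Hypothetical helper: prints the first command from the argument list that
# is found on PATH and returns 0; returns 1 if none of them is found.
first_available_command() {
    local cmd
    for cmd in "$@"; do
        if command -v "$cmd" >/dev/null 2>&1; then
            printf '%s\n' "$cmd"
            return 0
        fi
    done
    return 1
}

# Assumption: the presence of nvidia-smi indicates that the NVIDIA driver
# is installed. Add other vendor tools to the list if needed.
if tool="$(first_available_command nvidia-smi)"; then
    echo "Accelerator driver tool found: $tool"
else
    echo "No accelerator driver tool found; accelerator metrics are likely unavailable."
fi
```

A found tool only shows that the driver package is present; whether the metrics appear in CES still depends on the agent being installed and running.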

Installing the Monitoring Plug-in

  1. Create an agency for CES. For details, see Creating a User and Granting Permissions.
  2. One-click installation of the monitoring agent is currently not supported on the CES page. Log in to the server and run the following command to install and configure the agent. The command downloads the installation script from the cn-north-4 region. For details about how to install the agent in other regions, see Installing the Agent on a Linux Server.

    cd /usr/local && curl -k -O https://obs.cn-north-4.myhuaweicloud.com/uniagent-cn-north-4/script/agent_install.sh && bash agent_install.sh

    If the following information is displayed, the installation is successful.

    Figure 1 Installation succeeded

  3. View the monitoring items on the CES page. Accelerator card monitoring items are available only after the accelerator card driver is installed on the host.

    Figure 2 Monitoring page

    The monitoring plug-in is now installed. You can view the collected metrics on the UI or configure alarms based on the metric values.
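The one-line install command in step 2 can also be split into separate download, check, and run stages, which makes failures easier to spot. The sketch below is one possible way to do this under stated assumptions, not the official installer workflow. The URL and the /usr/local working directory come from the command in step 2; the function names `installer_is_sane` and `install_ces_agent` are hypothetical. The check only confirms that the downloaded file is non-empty and parses as shell; it does not verify the file's authenticity.

```shell
#!/usr/bin/env bash
# Sketch: download the CES agent installer, sanity-check it, then run it.
set -euo pipefail

# URL and directory taken from the install command in step 2 (cn-north-4).
AGENT_URL="https://obs.cn-north-4.myhuaweicloud.com/uniagent-cn-north-4/script/agent_install.sh"
WORK_DIR="/usr/local"

# Hypothetical helper: succeeds only if the file is non-empty and passes a
# syntax-only parse (bash -n reads the script without executing it).
installer_is_sane() {
    local file="$1"
    [ -s "$file" ] && bash -n "$file"
}

# Hypothetical wrapper around the original one-line command.
install_ces_agent() {
    cd "$WORK_DIR"
    # -f makes curl fail on HTTP errors instead of saving an error page.
    # -k (skip certificate verification) and -O mirror the original command.
    curl -f -k -O "$AGENT_URL"
    if ! installer_is_sane agent_install.sh; then
        echo "Downloaded installer is empty or not valid shell; aborting." >&2
        return 1
    fi
    bash agent_install.sh
}
```

To use the sketch, save it to a file, source it in a root shell on the server, and call `install_ces_agent`. For other regions, replace `AGENT_URL` with the URL given in Installing the Agent on a Linux Server.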