Using CES to Monitor Lite Server Resources
Scenario
This section describes how to set up metric monitoring for a Lite Server using the solution provided by Huawei Cloud Bare Metal Server (BMS) and Cloud Eye Service (CES). Once monitoring is configured, you can view metrics for CPU, CPU load, memory, disks, disk I/O, file systems, NICs, software RAID, and processes.
About BMS Monitoring
For details, see BMS Overview. In addition to the images listed in that document, Ubuntu 20.04 is supported.
Metrics are sampled every 1 minute. The metrics currently collected cover CPU, memory, disk, and network. After the accelerator card driver is installed on the host, the metrics listed in the following table can also be collected.
Metric | Name | Description | Unit | Dimensions |
---|---|---|---|---|
gpu_status | GPU Health Status | Overall measurement of GPU health. 0 indicates that the GPU is healthy, 1 that it is subhealthy, and 2 that it is faulty. | - | instance_id, gpu |
gpu_utilization | GPU Usage | GPU computing power usage | % | instance_id, gpu |
memory_utilization | GPU Memory Usage | GPU memory usage | % | instance_id, gpu |
gpu_performance | GPU Performance Status | Performance status of the GPU | - | instance_id, gpu |
encoder_utilization | Encoding Usage | GPU encoding capability usage | % | instance_id, gpu |
decoder_utilization | Decoding Usage | GPU decoding capability usage | % | instance_id, gpu |
volatile_correctable | Volatile Correctable ECC Errors | Number of correctable ECC errors since the GPU was last reset. The value returns to 0 each time the GPU is reset. | Number | instance_id, gpu |
volatile_uncorrectable | Volatile Uncorrectable ECC Errors | Number of uncorrectable ECC errors since the GPU was last reset. The value returns to 0 each time the GPU is reset. | Number | instance_id, gpu |
aggregate_correctable | Aggregate Correctable ECC Errors | Number of correctable ECC errors on the GPU | Number | instance_id, gpu |
aggregate_uncorrectable | Aggregate Uncorrectable ECC Errors | Number of uncorrectable ECC errors on the GPU | Number | instance_id, gpu |
retired_page_single_bit | Retired Page Single Bit Errors | Number of retired page single-bit errors, which indicates the number of single-bit pages blocked by the graphics card | Number | instance_id, gpu |
retired_page_double_bit | Retired Page Double Bit Errors | Number of retired page double-bit errors, which indicates the number of double-bit pages blocked by the graphics card | Number | instance_id, gpu |
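The gpu_status codes in the table are easier to read as labels when metric values are post-processed in a script. The following is a minimal sketch of that mapping. The function name gpu_status_label is hypothetical and is not part of CES or the monitoring agent; only the three code meanings come from the table.

```shell
# Hypothetical helper (not part of CES or the agent):
# translate a gpu_status value into the label given in the table.
gpu_status_label() {
  case "$1" in
    0) echo "healthy" ;;
    1) echo "subhealthy" ;;
    2) echo "faulty" ;;
    *) echo "unknown" ;;   # any value the table does not define
  esac
}
```

For example, `gpu_status_label 1` prints `subhealthy`. Values outside 0 to 2 are reported as `unknown` rather than guessed.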
Installing the Monitoring Plug-in
- Create an agency for CES. For details, see Creating a User and Granting Permissions.
- One-click installation of the monitoring agent is currently not supported on the CES page. Log in to the server and run the following command to install and configure the agent. The URL in the command is for the cn-north-4 region. For details about how to install the agent in other regions, see Installing the Agent on a Linux Server.
cd /usr/local && curl -k -O https://obs.cn-north-4.myhuaweicloud.com/uniagent-cn-north-4/script/agent_install.sh && bash agent_install.sh
If the following information is displayed, the installation is successful.
Figure 1 Installation succeeded
- View the monitoring items on the CES page. Accelerator card monitoring items are available only after the accelerator card driver is installed on the host.
Figure 2 Monitoring page
The monitoring plug-in is now installed. You can view the collected metrics on the UI or configure alarms based on the metric values.
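An alarm based on a metric value comes down to comparing a sampled value against a threshold. The sketch below imitates that comparison locally for a percentage metric such as gpu_utilization. It is only an illustration: the function name alarm_state and the threshold of 90 are assumptions, not CES defaults, and real alarm rules are configured in CES itself.

```shell
# Hypothetical helper (not a CES command): report "alarm" when a metric value
# is strictly greater than a threshold, otherwise "ok".
# $1: sampled metric value in percent, $2: threshold in percent.
alarm_state() {
  # awk handles decimal values, which plain shell integer tests do not.
  awk -v v="$1" -v t="$2" 'BEGIN { if (v + 0 > t + 0) print "alarm"; else print "ok" }'
}

# Example: a gpu_utilization sample of 95.5% against an assumed 90% threshold.
alarm_state 95.5 90
```

A value equal to the threshold is treated as `ok` in this sketch; whether the boundary should trigger depends on the comparison operator chosen in the actual alarm rule.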