Using DCGM to Monitor Lite Server Resources
Scenario
This section describes how to configure monitoring with Data Center GPU Manager (DCGM). DCGM is a suite of tools for managing and monitoring Linux-based NVIDIA GPU clusters at scale. Its capabilities include proactive health monitoring, diagnostics, system validation, policy management, power and clock management, configuration management, and auditing.
Prerequisites
The NVIDIA driver, CUDA, and fabric-manager software packages have been installed on the BMS.
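One quick way to confirm the prerequisites is to check that the expected command-line tools are on the PATH. The following is a minimal sketch. The helper name missing_cmds is hypothetical, and the binary names suggested in the usage note are assumptions that depend on how the packages were installed.

```shell
#!/bin/sh
# Print the name of every command in the argument list that is not on PATH.
# Returns 0 when all commands are found, 1 otherwise.
missing_cmds() {
    status=0
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "$cmd"
            status=1
        fi
    done
    return "$status"
}
```

For example, missing_cmds nvidia-smi nvcc nv-fabricmanager prints any of the three tools that cannot be found. A tool that is installed outside the PATH (nvcc often is) is also reported, so treat the result as a hint rather than proof.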
Step 1 Installing Docker
Install the latest Docker using the official script.
curl https://get.docker.com | sh
sudo systemctl --now enable docker
Step 2 Installing the NVIDIA Container Tools
Set the repository address and GPG key.
distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
   && curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - \
   && curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
Install nvidia-docker2.
sudo apt-get update \
   && sudo apt-get install -y nvidia-docker2
Modify the /etc/docker/daemon.json file as follows:
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
Restart the Docker daemon.
sudo systemctl restart docker
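Before moving on, it can help to confirm that the runtime setting was written correctly. The following is a minimal sketch. The helper name has_nvidia_default_runtime is hypothetical, and it does a plain text match rather than parsing the JSON.

```shell
#!/bin/sh
# Succeed if the given daemon.json sets "default-runtime" to "nvidia".
# Whitespace around the colon is tolerated; this is a text match, not a JSON parse.
has_nvidia_default_runtime() {
    grep -Eq '"default-runtime"[[:space:]]*:[[:space:]]*"nvidia"' "$1"
}
```

For example, has_nvidia_default_runtime /etc/docker/daemon.json && echo "nvidia runtime configured". After the restart, docker info --format '{{.DefaultRuntime}}' should also report nvidia if the daemon picked up the change.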
Step 3 Running DCGM-Exporter
Run DCGM-Exporter in Docker mode.
DCGM_EXPORTER_VERSION=3.1.7-3.1.4 && \
docker run -d --rm \
   --gpus all \
   --net host \
   --cap-add SYS_ADMIN \
   nvcr.io/nvidia/k8s/dcgm-exporter:${DCGM_EXPORTER_VERSION}-ubuntu20.04 \
   -f /etc/dcgm-exporter/dcp-metrics-included.csv
This command uses the default DCGM-Exporter metric collection configuration file, /etc/dcgm-exporter/dcp-metrics-included.csv. For details about the collected metrics, see the dcgm-exporter documentation. If the default metrics do not meet your requirements, you can build a custom image or mount a custom metrics file into the container.
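The mount option can be sketched as follows, assuming the dcgm-exporter CSV format of one metric per line (DCGM field, Prometheus metric type, help text). The function names, the host file location, and the in-container file name custom-metrics.csv are hypothetical. The three metrics are the ones shown in the sample output of this section.

```shell
#!/bin/sh
# Write a small custom metrics list in the dcgm-exporter CSV format.
# $1: path of the file to create on the host.
write_custom_metrics() {
    cat > "$1" <<'EOF'
# Format: DCGM field, Prometheus metric type, help message
DCGM_FI_DEV_SM_CLOCK, gauge, SM clock frequency (in MHz).
DCGM_FI_DEV_MEM_CLOCK, gauge, Memory clock frequency (in MHz).
DCGM_FI_DEV_MEMORY_TEMP, gauge, Memory temperature (in C).
EOF
}

# Start DCGM-Exporter with the host file mounted read-only into the container.
# $1: absolute path of the metrics file on the host.
run_exporter_with_metrics() {
    DCGM_EXPORTER_VERSION=3.1.7-3.1.4
    docker run -d --rm \
        --gpus all \
        --net host \
        --cap-add SYS_ADMIN \
        -v "$1":/etc/dcgm-exporter/custom-metrics.csv:ro \
        "nvcr.io/nvidia/k8s/dcgm-exporter:${DCGM_EXPORTER_VERSION}-ubuntu20.04" \
        -f /etc/dcgm-exporter/custom-metrics.csv
}
```

Calling write_custom_metrics with a host path and then run_exporter_with_metrics with the same path starts the exporter with only the listed metrics. Mounting a file avoids rebuilding the image each time the metric list changes.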
Wait for about 1 minute and run the following command to obtain the GPU metrics:
curl localhost:9400/metrics
The output is as follows:
# HELP DCGM_FI_DEV_SM_CLOCK SM clock frequency (in MHz).
# TYPE DCGM_FI_DEV_SM_CLOCK gauge
# HELP DCGM_FI_DEV_MEM_CLOCK Memory clock frequency (in MHz).
# TYPE DCGM_FI_DEV_MEM_CLOCK gauge
# HELP DCGM_FI_DEV_MEMORY_TEMP Memory temperature (in C).
# TYPE DCGM_FI_DEV_MEMORY_TEMP gauge
...
DCGM_FI_DEV_SM_CLOCK{gpu="0", UUID="GPU-6ad7ea4c-5517-05e7-0b54-7554cb4374d3"} 1
DCGM_FI_DEV_MEM_CLOCK{gpu="0", UUID="GPU-6ad7ea4c-5517-05e7-0b54-7554cb4374d3"} 4
DCGM_FI_DEV_MEMORY_TEMP{gpu="0", UUID="GPU-6ad7ea4c-5517-05e7-0b54-7554cb4374d3"} 9223372036854578794
...
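Instead of waiting a fixed time and reading the raw output, the endpoint can be polled until it responds, and a single metric can then be extracted from the text. The following is a minimal sketch. The helper names wait_until and metric_values are hypothetical, and the parser assumes the label layout shown in the sample output above.

```shell
#!/bin/sh
# Retry a command until it succeeds.
# $1: maximum attempts, $2: seconds to sleep between attempts, rest: the command.
wait_until() {
    tries=$1
    delay=$2
    shift 2
    while [ "$tries" -gt 0 ]; do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        tries=$((tries - 1))
        if [ "$tries" -gt 0 ]; then
            sleep "$delay"
        fi
    done
    return 1
}

# Read Prometheus text-format metrics on stdin and print "<gpu> <value>"
# for every sample of the metric named in $1. HELP/TYPE lines are skipped
# because they start with "#", not with the metric name.
metric_values() {
    awk -v name="$1" '
        index($0, name "{") == 1 {
            gpu = "?"
            if (match($0, /gpu="[^"]*"/)) {
                # Strip the leading gpu=" (5 chars) and the trailing quote.
                gpu = substr($0, RSTART + 5, RLENGTH - 6)
            }
            print gpu, $NF
        }'
}
```

For example, wait_until 30 2 curl -sf localhost:9400/metrics retries for up to about one minute, after which curl -s localhost:9400/metrics | metric_values DCGM_FI_DEV_SM_CLOCK prints one line per GPU.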
Step 4 Installing Prometheus
Create the prometheus.yml file in the /usr/local/prometheus directory. The file content is as follows:
global:
  scrape_interval: 15s # Collection interval
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['xx.xx.xx.xx:9400'] # Port for obtaining DCGM-Exporter metrics. Replace xx.xx.xx.xx with the IP address of the node where DCGM-Exporter resides.
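When the same setup is repeated on several nodes, the file can be generated with the exporter address filled in. The following is a minimal sketch. The function name write_prometheus_config is hypothetical, and the address in the usage note is a placeholder.

```shell
#!/bin/sh
# Write a minimal prometheus.yml that scrapes one DCGM-Exporter endpoint.
# $1: IP address of the DCGM-Exporter node, $2: output file path.
write_prometheus_config() {
    cat > "$2" <<EOF
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['$1:9400']
EOF
}
```

For example, write_prometheus_config 192.0.2.10 /usr/local/prometheus/prometheus.yml writes the configuration for an exporter at the placeholder address 192.0.2.10.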
Run Prometheus.
docker run -d \
   -p 9090:9090 \
   -v /usr/local/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml \
   prom/prometheus
This section covers only basic Prometheus usage. For more advanced requirements, see the official Prometheus documentation.
Step 5 Installing Grafana
Run the latest Grafana.
docker run -d -p 3000:3000 grafana/grafana-oss
On the BMS page, open the security group configuration of the node where Grafana is located and add an inbound rule to allow external access to ports 3000 and 9090.
Enter xx.xx.xx.xx:3000 in the address box of the browser to log in to Grafana. The default username and password are both admin. On the configuration management page, add a data source and set the type to Prometheus.
Note: xx.xx.xx.xx is the IP address of the host machine where Grafana is located.
Enter the Prometheus IP address and port number in the HTTP URL text box and click Save & Test.
The metric monitoring solution is installed, as shown in the following figure.
This section covers only basic Grafana usage. For more advanced requirements, see the official Grafana documentation.
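As an alternative to adding the data source in the UI, Grafana can load it from a provisioning file at startup. The following is a minimal sketch, assuming the default provisioning directory /etc/grafana/provisioning of the official image. The function name write_grafana_datasource and the host path in the usage note are hypothetical.

```shell
#!/bin/sh
# Write a Grafana data source provisioning file pointing at Prometheus.
# $1: Prometheus base URL (for example http://xx.xx.xx.xx:9090), $2: output file path.
write_grafana_datasource() {
    cat > "$2" <<EOF
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: $1
    isDefault: true
EOF
}
```

After writing the file to a host path such as /usr/local/grafana/prometheus.yaml, add -v /usr/local/grafana/prometheus.yaml:/etc/grafana/provisioning/datasources/prometheus.yaml to the docker run command for Grafana so that the data source exists when the container starts.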