Using DCGM to Monitor GPU Resources of Lite Servers

Description

This section describes how to set up DCGM-based monitoring of GPU resources on Lite Servers.

NVIDIA Data Center GPU Manager (DCGM) is a suite of tools for managing and monitoring GPUs in large-scale Linux clusters. Its capabilities include proactive health monitoring, diagnostics, system validation, policy management, power and clock management, configuration management, and auditing.

Constraints

Only GPU resources can be monitored.

Prerequisites

The GPU driver, CUDA, and fabric-manager software packages have been installed on the BMS.
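The prerequisites can be checked quickly before you continue. The following is a minimal sketch, not an official check: the helper name check_commands is hypothetical, and the binary names in the usage note (nvidia-smi, nvcc, nv-fabricmanager) are assumptions about a typical installation that may differ on your server.

```shell
# Report which of the given commands are missing from PATH.
# Prints one "missing: <name>" line per absent command.
# Returns 0 if all commands are found, 1 otherwise.
check_commands() {
    local missing=0 cmd
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "missing: $cmd"
            missing=1
        fi
    done
    return "$missing"
}
```

For example, run check_commands nvidia-smi nvcc nv-fabricmanager. If nvcc is reported as missing although CUDA is installed, its directory (often /usr/local/cuda/bin) may simply not be on PATH.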

Step 1: Installing Docker

Install the latest Docker using the official script.

curl https://get.docker.com | sh
sudo systemctl --now enable docker

Step 2: Installing the Container Toolkit

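Add the NVIDIA repository and its GPG key. The following commands use apt and therefore apply to Debian-based distributions such as Ubuntu.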

distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
   && curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - \
   && curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list

Install nvidia-docker2.

sudo apt-get update \
   && sudo apt-get install -y nvidia-docker2

Edit the /etc/docker/daemon.json file so that nvidia is the default runtime:

{
   "default-runtime": "nvidia",
   "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
      }
   }
}

Restart the Docker daemon.

sudo systemctl restart docker
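After the restart, it can be useful to confirm that the daemon actually picked up the new default runtime. The following sketch relies on the DefaultRuntime field of docker info; the helper name check_default_runtime is hypothetical.

```shell
# Compare the Docker daemon's default runtime with the expected one.
# Returns 0 on match, 1 on mismatch, 2 if the daemon cannot be queried.
check_default_runtime() {
    local expected="${1:-nvidia}" actual
    actual="$(docker info --format '{{.DefaultRuntime}}')" || return 2
    if [ "$actual" = "$expected" ]; then
        echo "OK: default runtime is $actual"
    else
        echo "Mismatch: default runtime is '$actual', expected '$expected'" >&2
        return 1
    fi
}
```

Run check_default_runtime nvidia. If your user is not allowed to access the Docker socket, run the function from a root shell.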

Step 3: Running DCGM-Exporter

Run DCGM-Exporter as a Docker container.

DCGM_EXPORTER_VERSION=3.1.7-3.1.4 && \
docker run -d --rm \
   --gpus all \
   --net host \
   --cap-add SYS_ADMIN \
   nvcr.io/nvidia/k8s/dcgm-exporter:${DCGM_EXPORTER_VERSION}-ubuntu20.04 \
   -f /etc/dcgm-exporter/dcp-metrics-included.csv

This command uses the metric collection configuration file /etc/dcgm-exporter/dcp-metrics-included.csv provided by DCGM-Exporter. For details about the collected metrics, see the dcgm-exporter documentation. If these metrics do not meet your requirements, build a custom image or mount a custom configuration file into the container.
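One way to collect a different set of metrics without building an image is to bind-mount your own metric list into the container. The following is a sketch under stated assumptions: the function names, the host file path you pass in, and the in-container name custom-metrics.csv are hypothetical, and the four fields listed are only examples of the CSV format (DCGM field, Prometheus metric type, help text).

```shell
# Write a small metric list in the dcgm-exporter CSV format to the given path.
write_custom_metrics() {
    cat > "$1" <<'EOF'
# DCGM field, Prometheus metric type, help message
DCGM_FI_DEV_SM_CLOCK, gauge, SM clock frequency (in MHz).
DCGM_FI_DEV_MEM_CLOCK, gauge, Memory clock frequency (in MHz).
DCGM_FI_DEV_GPU_TEMP, gauge, GPU temperature (in C).
DCGM_FI_DEV_GPU_UTIL, gauge, GPU utilization (in %).
EOF
}

# Start DCGM-Exporter with the custom list mounted read-only.
# $1: absolute host path of the CSV file; $2: exporter version (optional).
run_exporter_with_custom_metrics() {
    local file="$1" version="${2:-3.1.7-3.1.4}"
    docker run -d --rm \
        --gpus all \
        --net host \
        --cap-add SYS_ADMIN \
        -v "$file:/etc/dcgm-exporter/custom-metrics.csv:ro" \
        "nvcr.io/nvidia/k8s/dcgm-exporter:${version}-ubuntu20.04" \
        -f /etc/dcgm-exporter/custom-metrics.csv
}
```

For example, run write_custom_metrics /root/dcgm-custom-metrics.csv followed by run_exporter_with_custom_metrics /root/dcgm-custom-metrics.csv. Stop any DCGM-Exporter container that is already running first, because both would try to listen on port 9400 of the host network.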

Wait for about 1 minute and run the following command to obtain the GPU metrics:

curl localhost:9400/metrics

Information similar to the following is displayed:

# HELP DCGM_FI_DEV_SM_CLOCK SM clock frequency (in MHz).
# TYPE DCGM_FI_DEV_SM_CLOCK gauge
# HELP DCGM_FI_DEV_MEM_CLOCK Memory clock frequency (in MHz).
# TYPE DCGM_FI_DEV_MEM_CLOCK gauge
# HELP DCGM_FI_DEV_MEMORY_TEMP Memory temperature (in C).
# TYPE DCGM_FI_DEV_MEMORY_TEMP gauge
...
DCGM_FI_DEV_SM_CLOCK{gpu="0", UUID="GPU-6ad7ea4c-5517-05e7-0b54-7554cb4374d3"} 1
DCGM_FI_DEV_MEM_CLOCK{gpu="0", UUID="GPU-6ad7ea4c-5517-05e7-0b54-7554cb4374d3"} 4
DCGM_FI_DEV_MEMORY_TEMP{gpu="0", UUID="GPU-6ad7ea4c-5517-05e7-0b54-7554cb4374d3"} 9223372036854578794
...
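The full output is long, so it can help to extract a single metric. The following sketch filters the exporter output with awk; the function name gpu_metric is hypothetical, and the parsing assumes each sample line ends with the value (no trailing timestamp), as in the output shown above.

```shell
# Print "<gpu index> <value>" for every sample of one metric.
# Reads Prometheus text exposition format on stdin and skips
# the "# HELP" and "# TYPE" comment lines.
gpu_metric() {
    awk -v metric="$1" '
        /^#/ { next }
        index($0, metric "{") == 1 {
            gpu = "?"
            # Extract the value of the gpu="..." label, if present.
            if (match($0, /gpu="[^"]*"/)) {
                gpu = substr($0, RSTART + 5, RLENGTH - 6)
            }
            print gpu, $NF
        }'
}
```

For example, curl -s localhost:9400/metrics | gpu_metric DCGM_FI_DEV_SM_CLOCK prints one line per GPU containing the GPU index and the current SM clock value.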

Step 4: Installing Prometheus

Create the prometheus.yml file in the /usr/local/prometheus directory. The file content is as follows:

global:
  scrape_interval: 15s # Collection interval
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
    - targets: ['xx.xx.xx.xx:9400'] # Port for obtaining DCGM-Exporter metrics. Replace xx.xx.xx.xx with the IP address of the node where DCGM-Exporter resides.
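A YAML indentation mistake in this file prevents Prometheus from starting, so checking the file first can save time. The following sketch runs promtool, which is included in the prom/prometheus image, against the file; the function name check_prometheus_config is hypothetical.

```shell
# Validate a Prometheus configuration file with promtool from the
# prom/prometheus image. $1: absolute host path of prometheus.yml.
check_prometheus_config() {
    local config="${1:-/usr/local/prometheus/prometheus.yml}"
    docker run --rm \
        --entrypoint promtool \
        -v "$config:/etc/prometheus/prometheus.yml:ro" \
        prom/prometheus \
        check config /etc/prometheus/prometheus.yml
}
```

Run check_prometheus_config without arguments to check the file at the path used in this step. A non-zero exit status indicates that promtool rejected the file.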

Run Prometheus.

docker run -d \
    -p 9090:9090 \
    -v /usr/local/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml \
    prom/prometheus
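Once the container is running and at least one scrape interval has passed, you can confirm that Prometheus is receiving the GPU metrics by querying its HTTP API. The following is a minimal sketch; the function name prom_query is hypothetical, and port 9090 matches the port mapping used above.

```shell
# Run an instant PromQL query against the Prometheus HTTP API
# and print the raw JSON response.
# $1: host where Prometheus runs; $2: PromQL expression.
prom_query() {
    local host="$1" query="$2"
    curl -sG "http://${host}:9090/api/v1/query" \
        --data-urlencode "query=${query}"
}
```

For example, prom_query localhost DCGM_FI_DEV_SM_CLOCK should return a JSON document with "status":"success" and one result per GPU. An empty result list usually means that the target address in prometheus.yml is not reachable from the Prometheus container.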

This step covers only the basic use of Prometheus. For more advanced requirements, see the official Prometheus documentation.

Step 5: Installing Grafana

Run the latest Grafana image.

docker run -d -p 3000:3000 grafana/grafana-oss

On the BMS page, open the security group configuration of the node where Grafana is located and add an inbound rule to allow external access to ports 3000 and 9090.

Enter xx.xx.xx.xx:3000 in the address bar of a browser to log in to Grafana. The default username and password are both admin. On the configuration management page, add a data source, set its type to Prometheus, and set its URL to the address of the Prometheus server (port 9090 in this example).

Note: xx.xx.xx.xx is the IP address of the host machine where Grafana is located.

Figure 1 Prometheus
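As an alternative to the web page, the data source can be created through the Grafana HTTP API. The following is a sketch under stated assumptions: the function names are hypothetical, the data source name "Prometheus" is an arbitrary choice, and the default admin/admin credentials are assumed to be unchanged.

```shell
# Build the JSON body for a Prometheus data source.
# $1: URL of the Prometheus server, for example http://xx.xx.xx.xx:9090
datasource_payload() {
    printf '{"name":"Prometheus","type":"prometheus","url":"%s","access":"proxy","isDefault":true}\n' "$1"
}

# Create the data source through the Grafana HTTP API.
# $1: host where Grafana runs; $2: URL of the Prometheus server.
add_prometheus_datasource() {
    local grafana_host="$1" prometheus_url="$2"
    curl -s -u admin:admin \
        -H 'Content-Type: application/json' \
        -X POST "http://${grafana_host}:3000/api/datasources" \
        -d "$(datasource_payload "$prometheus_url")"
}
```

For example, run add_prometheus_datasource xx.xx.xx.xx http://xx.xx.xx.xx:9090 with the IP addresses of your Grafana and Prometheus nodes. If you have already changed the admin password, adjust the credentials accordingly.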