Updated on 2024-05-31 GMT+08:00

Using dcgm-exporter to Monitor GPU Metrics

Application Scenarios

If a cluster contains GPU nodes, you may need to know the GPU resources used by GPU applications, such as GPU usage, memory usage, temperature, and power consumption. You can configure auto scaling policies or set alarm rules based on the obtained GPU metrics. This section walks you through how to observe GPU resource usage using open-source Prometheus and DCGM Exporter. For more details about DCGM Exporter, see DCGM Exporter.

Prerequisites

  • You have created a cluster that contains GPU nodes, and GPU-related workloads are running in the cluster.
  • The CCE AI Suite (NVIDIA GPU) and Cloud Native Cluster Monitoring add-ons have been installed in the cluster.
    • CCE AI Suite (NVIDIA GPU) is a device management add-on that supports GPUs in containers. To use GPU nodes in the cluster, this add-on must be installed. Select and install the corresponding GPU driver based on the GPU type and CUDA version.
    • Cloud Native Cluster Monitoring monitors the cluster metrics. During the installation, you can interconnect this add-on with Grafana to gain a better observability of your cluster.
      • Set the deployment mode of Cloud Native Cluster Monitoring to the server mode.
      • The configuration for interconnecting with Grafana is supported only by Cloud Native Cluster Monitoring versions earlier than 3.9.0. If the add-on version is 3.9.0 or later and Grafana is required, install the Grafana add-on separately.

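To confirm that the prerequisites are met, you can run the following commands from any host where kubectl has access to the cluster. This is only an optional check: CCE adds the accelerator label to GPU nodes (the same label used for node affinity later in this section), and the namespace of the monitoring components may differ in your cluster.

  kubectl get nodes -l accelerator    # List the GPU nodes in the cluster.
  kubectl get pods -n monitoring      # Check that the monitoring components are running.
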
Collecting GPU Monitoring Metrics

This section describes how to deploy dcgm-exporter in the cluster to collect GPU metrics and expose them through port 9400.

  1. Log in to a node that has been bound with an EIP.
  2. Pull the dcgm-exporter image to the local host. The image address comes from the DCGM official example. For details, see https://github.com/NVIDIA/dcgm-exporter/blob/main/dcgm-exporter.yaml.

    docker pull nvcr.io/nvidia/k8s/dcgm-exporter:3.0.4-3.0.0-ubuntu20.04

  3. Push the dcgm-exporter image to SWR.

    1. (Optional) Log in to the SWR console, choose Organization Management in the navigation pane, and click Create Organization in the upper right corner to create an organization.

      Skip this step if you already have an organization.

    2. In the navigation pane, choose My Images and click Upload Through Client. On the displayed page, click Generate a temporary login command and copy the command.
    3. Run the login command copied in the previous step on the cluster node. If the login is successful, the message "Login Succeeded" is displayed.
    4. Add a tag to the dcgm-exporter image.

      docker tag {Image name 1:Tag 1} {Image repository address}/{Organization name}/{Image name 2:Tag 2}

      • {Image name 1:Tag 1}: name and tag of the local image to be uploaded.
      • {Image repository address}: the domain name at the end of the login command generated in 2. You can also obtain it on the SWR console.
      • {Organization name}: name of the organization created in 1.
      • {Image name 2:Tag 2}: desired image name and tag to be displayed on the SWR console.

      The following is an example:

      docker tag nvcr.io/nvidia/k8s/dcgm-exporter:3.0.4-3.0.0-ubuntu20.04 swr.cn-east-3.myhuaweicloud.com/container/dcgm-exporter:3.0.4-3.0.0-ubuntu20.04
    5. Push the image to the image repository.

      docker push {Image repository address}/{Organization name}/{Image name 2:Tag 2}

      The following is an example:

      docker push swr.cn-east-3.myhuaweicloud.com/container/dcgm-exporter:3.0.4-3.0.0-ubuntu20.04

      The following information will be returned upon a successful push:

      489a396b91d1: Pushed 
      ... 
      c3f11d77a5de: Pushed 
      3.0.4-3.0.0-ubuntu20.04: digest: sha256:bd2b1a73025*** size: 2414
    6. To view the pushed image, go to the SWR console and refresh the My Images page.

  4. Deploy dcgm-exporter.

    When deploying dcgm-exporter on CCE, add specific configurations to monitor GPU information. The detailed YAML file is as follows, and the key configurations are explained in the comments.
    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: "dcgm-exporter"
      namespace: "monitoring"      # Select a namespace as required.
      labels:
        app.kubernetes.io/name: "dcgm-exporter"
        app.kubernetes.io/version: "3.0.0"
    spec:
      updateStrategy:
        type: RollingUpdate
      selector:
        matchLabels:
          app.kubernetes.io/name: "dcgm-exporter"
          app.kubernetes.io/version: "3.0.0"
      template:
        metadata:
          labels:
            app.kubernetes.io/name: "dcgm-exporter"
            app.kubernetes.io/version: "3.0.0"
          name: "dcgm-exporter"
        spec:
          containers:
          - image: "swr.cn-east-3.myhuaweicloud.com/container/dcgm-exporter:3.0.4-3.0.0-ubuntu20.04"   # The SWR image address of dcgm-exporter. The address is the image address in 5.
            env:
            - name: "DCGM_EXPORTER_LISTEN"                   # Service port number
              value: ":9400"
            - name: "DCGM_EXPORTER_KUBERNETES"               # Maps GPU metrics to Kubernetes pods.
              value: "true"
            - name: "DCGM_EXPORTER_KUBERNETES_GPU_ID_TYPE"   # GPU ID type. The value can be uid or device-name.
              value: "device-name"
            name: "dcgm-exporter"
            ports:
            - name: "metrics"
              containerPort: 9400
            resources:      # Request and limit resources as required.
              limits:
                cpu: '200m'
                memory: '256Mi'
              requests:
                cpu: 100m
                memory: 128Mi
            securityContext:      # Enable the privileged mode for the dcgm-exporter container.
              privileged: true
              runAsNonRoot: false
              runAsUser: 0
            volumeMounts:
            - name: "pod-gpu-resources"
              readOnly: true
              mountPath: "/var/lib/kubelet/pod-resources"
            - name: "nvidia-install-dir-host"      # The environment variables configured in the dcgm-exporter image depend on the files in the /usr/local/nvidia directory of the container.
              readOnly: true
              mountPath: "/usr/local/nvidia"
          volumes:
          - name: "pod-gpu-resources"
            hostPath:
              path: "/var/lib/kubelet/pod-resources"
          - name: "nvidia-install-dir-host"       # The directory where the GPU driver is installed.
            hostPath:
              path: "/opt/cloud/cce/nvidia"       # If the GPU add-on version is 2.0.0 or later, replace the driver installation directory with /usr/local/nvidia.
          affinity:       # Label generated when CCE creates GPU nodes. You can set node affinity for this component based on this label.
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                - matchExpressions:
                  - key: accelerator
                    operator: Exists
    ---
    kind: Service
    apiVersion: v1
    metadata:
      annotations:     # The following annotations enable Prometheus to automatically discover and extract metrics data.
        prometheus.io/port: "9400"
        prometheus.io/scrape: "true"
      name: "dcgm-exporter"
      namespace: "monitoring"      # Select a namespace as required.
      labels:
        app.kubernetes.io/name: "dcgm-exporter"
        app.kubernetes.io/version: "3.0.0"
    spec:
      selector:
        app.kubernetes.io/name: "dcgm-exporter"
        app.kubernetes.io/version: "3.0.0"
      ports:
      - name: "metrics"
        port: 9400
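
    After the YAML file is prepared, deploy dcgm-exporter in the cluster. The following commands are a minimal example that assumes the preceding content has been saved to a local file named dcgm-exporter.yaml (an example file name); adjust the file name and namespace as required.

    kubectl create namespace monitoring           # Skip this command if the namespace already exists.
    kubectl apply -f dcgm-exporter.yaml
    kubectl get ds dcgm-exporter -n monitoring    # Check the DaemonSet status.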

  5. Monitor application GPU metrics.

    1. Run the following command to check whether the dcgm-exporter is running properly:
      kubectl get po -n monitoring -owide

      Information similar to the following is displayed:

      # kubectl get po -n monitoring -owide
      NAME                                        READY   STATUS    RESTARTS   AGE   IP             NODE           NOMINATED NODE   READINESS GATES
      alertmanager-alertmanager-0                 0/2     Pending   0          19m   <none>         <none>         <none>           <none>
      custom-metrics-apiserver-5bb67f4b99-grxhq   1/1     Running   0          19m   172.16.0.6     192.168.0.73   <none>           <none>
      dcgm-exporter-hkr77                         1/1     Running   0          17m   172.16.0.11    192.168.0.73   <none>           <none>
      grafana-785cdcd47-9jlgr                     1/1     Running   0          19m   172.16.0.9     192.168.0.73   <none>           <none>
      kube-state-metrics-647b6585b8-6l2zm         1/1     Running   0          19m   172.16.0.8     192.168.0.73   <none>           <none>
      node-exporter-xvk82                         1/1     Running   0          19m   192.168.0.73   192.168.0.73   <none>           <none>
      prometheus-operator-5ff8744d5f-mhbqv        1/1     Running   0          19m   172.16.0.7     192.168.0.73   <none>           <none>
      prometheus-server-0                         2/2     Running   0          19m   172.16.0.10    192.168.0.73   <none>           <none>
    2. Call the dcgm-exporter API to check the collected GPU metrics of applications.
      In the following command, 172.16.0.11 is the pod IP address of dcgm-exporter obtained in the previous step.
      curl 172.16.0.11:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL
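      If the pod IP address cannot be accessed directly, you can also query the metrics through the dcgm-exporter Service defined in the preceding YAML file by using kubectl port forwarding. This is an optional check; adjust the namespace if the component was deployed elsewhere.
      kubectl -n monitoring port-forward svc/dcgm-exporter 9400:9400 &
      curl localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL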

  6. View metric monitoring information on the Prometheus page.

    After Prometheus and the related add-ons are installed, a ClusterIP Service is created by default. To allow access from outside the cluster, create a NodePort or LoadBalancer Service. For details, see Monitoring Custom Metrics Using Prometheus.
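
    For example, you can change the type of the existing Prometheus ClusterIP Service to NodePort. The following commands are only a sketch: the Service name (prometheus in this example) and namespace may differ in your cluster, so check them first.

    kubectl get svc -n monitoring                                                    # Check the name of the Prometheus Service.
    kubectl -n monitoring patch svc prometheus -p '{"spec":{"type":"NodePort"}}'     # Replace prometheus with the actual Service name.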

    On the Prometheus page, you can view the GPU usage and other related metrics of the GPU nodes. For more GPU metrics, see Observable Metrics.

  7. Log in to the Grafana page to view GPU information.

    If you have installed Grafana, you can import the NVIDIA DCGM Exporter dashboard to display GPU metrics.

    For details, see Manage dashboards.

Observable Metrics

The following table lists some observable GPU metrics. For details about more metrics, see Field Identifiers.

Table 1 Usage

| Metric Name               | Metric Type | Unit | Description   |
|---------------------------|-------------|------|---------------|
| DCGM_FI_DEV_GPU_UTIL      | Gauge       | %    | GPU usage     |
| DCGM_FI_DEV_MEM_COPY_UTIL | Gauge       | %    | Memory usage  |
| DCGM_FI_DEV_ENC_UTIL      | Gauge       | %    | Encoder usage |
| DCGM_FI_DEV_DEC_UTIL      | Gauge       | %    | Decoder usage |

Table 2 Memory

| Metric Name         | Metric Type | Unit | Description                                                                                 |
|---------------------|-------------|------|---------------------------------------------------------------------------------------------|
| DCGM_FI_DEV_FB_FREE | Gauge       | MB   | Remaining frame buffer memory. The frame buffer is also known as VRAM.                      |
| DCGM_FI_DEV_FB_USED | Gauge       | MB   | Used frame buffer memory. The value is the same as the memory usage reported by nvidia-smi. |

Table 3 Temperature and power

| Metric Name             | Metric Type | Unit | Description                            |
|-------------------------|-------------|------|----------------------------------------|
| DCGM_FI_DEV_GPU_TEMP    | Gauge       | °C   | Current GPU temperature of the device  |
| DCGM_FI_DEV_POWER_USAGE | Gauge       | W    | Power usage of the device              |
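
Based on these metrics, you can set alarm rules as described in Application Scenarios. The following PrometheusRule resource is a minimal sketch that fires an alarm when GPU usage stays above 90% for five minutes. It assumes that the Prometheus Operator in the cluster discovers PrometheusRule resources in the monitoring namespace; the rule name, threshold, and duration are examples only and should be adjusted as required.

  apiVersion: monitoring.coreos.com/v1
  kind: PrometheusRule
  metadata:
    name: gpu-alert-rules          # Example name.
    namespace: monitoring          # Select a namespace as required.
  spec:
    groups:
    - name: gpu.rules
      rules:
      - alert: HighGpuUtilization
        expr: DCGM_FI_DEV_GPU_UTIL > 90     # GPU usage higher than 90%.
        for: 5m                             # The condition must persist for 5 minutes.
        labels:
          severity: warning
        annotations:
          description: "GPU usage has exceeded 90% for 5 minutes."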