Updated on 2024-08-16 GMT+08:00

Monitoring GPU Metrics

You can use Prometheus and Grafana to observe GPU metrics. This section uses Prometheus as an example to describe how to view the GPU memory usage of a cluster.

The process is as follows:

  1. Accessing Prometheus

    (Optional) Bind a LoadBalancer Service to Prometheus so that Prometheus can be accessed from external networks.

  2. Monitoring GPU Metrics

    After a GPU workload is deployed in the cluster, GPU metrics will be automatically reported.

  3. Accessing Grafana

    View Prometheus monitoring data on Grafana, a visualization panel.

Prerequisites

The Prometheus add-on has been installed in the cluster.

Accessing Prometheus

After the Prometheus add-on is installed, the Prometheus server runs as a StatefulSet in the monitoring namespace, and you can deploy workloads and Services in the cluster.

You can create a public network LoadBalancer Service so that Prometheus can be accessed from an external network.

  1. Log in to the CCE console and click the name of the cluster with Prometheus installed to access the cluster console. In the navigation pane, choose Services & Ingresses.
  2. Click Create from YAML in the upper right corner to create a public network LoadBalancer Service.

    apiVersion: v1
    kind: Service
    metadata:
      name: prom-lb     # Service name, which is customizable.
      namespace: monitoring
      labels:
        app: prometheus
        component: server
      annotations:
        kubernetes.io/elb.id: 038ff***     # Replace it with the ID of the public network load balancer in the VPC that the cluster belongs to.
    spec:
      ports:
        - name: cce-service-0
          protocol: TCP
          port: 88             # Service port, which is customizable.
          targetPort: 9090     # Default Prometheus port. Retain the default value.
      selector:                # The label selector can be adjusted based on the label of a Prometheus server instance.
        app.kubernetes.io/name: prometheus
        prometheus: server
      type: LoadBalancer

  3. After the Service is created, visit <Public IP address of the load balancer>:<Service port> in a browser to access Prometheus.

    Figure 1 Accessing Prometheus

  4. Choose Status > Targets to view the targets monitored by Prometheus.

    Figure 2 Viewing monitored targets
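Besides the web UI, the Service created above also exposes the standard Prometheus HTTP API (/api/v1/query), which is convenient for scripted checks. The sketch below builds an instant-query URL and parses the standard vector response; the base URL is a placeholder for your load balancer address with the Service port from the example (88), and the per-card "gpu" label name is an assumption that may differ in your deployment.

```python
import json
import urllib.parse
import urllib.request

# Placeholder: replace with <Public IP address of the load balancer>:<Service port>.
PROMETHEUS_URL = "http://<load-balancer-public-ip>:88"

def build_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API (/api/v1/query)."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})

def parse_instant_query(response_body: str) -> dict:
    """Map each series' 'gpu' label (assumed label name) to its sampled value."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload}")
    return {
        sample["metric"].get("gpu", "unknown"): float(sample["value"][1])
        for sample in payload["data"]["result"]
    }

# Example: fetch per-card GPU memory usage (uncomment once the Service is reachable).
# url = build_query_url(PROMETHEUS_URL, "cce_gpu_memory_utilization")
# with urllib.request.urlopen(url) as resp:
#     print(parse_instant_query(resp.read().decode()))
```

The same helper works for any metric in the tables below by swapping the PromQL expression.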

Monitoring GPU Metrics

Create a GPU workload. After the workload runs properly, access Prometheus and view GPU metrics on the Graph page.

Figure 3 Viewing GPU metrics
Table 1 Basic GPU monitoring metrics

| Type | Metric | Monitoring Level | Description |
| --- | --- | --- | --- |
| Utilization | cce_gpu_utilization | GPU cards | GPU compute usage |
| | cce_gpu_memory_utilization | GPU cards | GPU memory usage |
| | cce_gpu_encoder_utilization | GPU cards | GPU encoding usage |
| | cce_gpu_decoder_utilization | GPU cards | GPU decoding usage |
| | cce_gpu_utilization_process | GPU processes | GPU compute usage of each process |
| | cce_gpu_memory_utilization_process | GPU processes | GPU memory usage of each process |
| | cce_gpu_encoder_utilization_process | GPU processes | GPU encoding usage of each process |
| | cce_gpu_decoder_utilization_process | GPU processes | GPU decoding usage of each process |
| Memory | cce_gpu_memory_used | GPU cards | Used GPU memory |
| | cce_gpu_memory_total | GPU cards | Total GPU memory |
| | cce_gpu_memory_free | GPU cards | Free GPU memory |
| | cce_gpu_bar1_memory_used | GPU cards | Used GPU BAR1 memory |
| | cce_gpu_bar1_memory_total | GPU cards | Total GPU BAR1 memory |
| Frequency | cce_gpu_clock | GPU cards | GPU clock frequency |
| | cce_gpu_memory_clock | GPU cards | GPU memory frequency |
| | cce_gpu_graphics_clock | GPU cards | GPU graphics frequency |
| | cce_gpu_video_clock | GPU cards | GPU video processor frequency |
| Physical status | cce_gpu_temperature | GPU cards | GPU temperature |
| | cce_gpu_power_usage | GPU cards | GPU power usage |
| | cce_gpu_total_energy_consumption | GPU cards | Total GPU energy consumption |
| Bandwidth | cce_gpu_pcie_link_bandwidth | GPU cards | GPU PCIe bandwidth |
| | cce_gpu_nvlink_bandwidth | GPU cards | GPU NVLink bandwidth |
| | cce_gpu_pcie_throughput_rx | GPU cards | GPU PCIe RX bandwidth |
| | cce_gpu_pcie_throughput_tx | GPU cards | GPU PCIe TX bandwidth |
| | cce_gpu_nvlink_utilization_counter_rx | GPU cards | GPU NVLink RX bandwidth |
| | cce_gpu_nvlink_utilization_counter_tx | GPU cards | GPU NVLink TX bandwidth |
| Memory isolation pages | cce_gpu_retired_pages_sbe | GPU cards | Number of isolated GPU memory pages with single-bit errors |
| | cce_gpu_retired_pages_dbe | GPU cards | Number of isolated GPU memory pages with double-bit errors |
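The used/total pairs in Table 1 are typically combined in PromQL, for example cce_gpu_memory_used / cce_gpu_memory_total * 100 for a per-card memory usage percentage. The sketch below mirrors that arithmetic client-side on samples already fetched (keyed by GPU index); the sample values are hypothetical, and since the ratio is unit-agnostic, no assumption about the metrics' units is needed.

```python
def gpu_memory_percent(used: dict, total: dict) -> dict:
    """Join per-card used/total samples (keyed by GPU index) into usage
    percentages, mirroring cce_gpu_memory_used / cce_gpu_memory_total * 100."""
    return {
        gpu: round(100.0 * used[gpu] / total[gpu], 2)
        for gpu in used
        if gpu in total and total[gpu] > 0  # skip cards missing a total sample
    }

# Hypothetical samples for two 16 GiB cards, one 6 GiB used and one idle:
print(gpu_memory_percent(
    {"0": 6_442_450_944, "1": 0},
    {"0": 17_179_869_184, "1": 17_179_869_184},
))  # → {'0': 37.5, '1': 0.0}
```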

Table 2 xGPU metrics

| Metric | Monitoring Level | Description |
| --- | --- | --- |
| xgpu_memory_total | GPU processes | Total xGPU memory |
| xgpu_memory_used | GPU processes | Used xGPU memory |
| xgpu_core_percentage_total | GPU processes | Total xGPU cores |
| xgpu_core_percentage_used | GPU processes | Used xGPU cores |
| gpu_schedule_policy | GPU cards | xGPU scheduling policy. 0: xGPU memory is isolated and cores are shared. 1: Both xGPU memory and cores are isolated. 2: Default mode, indicating that the card has not been allocated to any xGPU device. |
| xgpu_device_health | GPU cards | Health status of an xGPU device. 0: The xGPU device is healthy. 1: The xGPU device is not healthy. |
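The last two metrics in Table 2 are coded values rather than quantities, so alerting or reporting scripts usually translate them back to the meanings above. A minimal sketch of that decoding follows; the function and variable names are illustrative, not part of any CCE API.

```python
# Code-to-meaning mappings taken from Table 2.
SCHEDULE_POLICIES = {
    0: "xGPU memory isolated, cores shared",
    1: "xGPU memory and cores both isolated",
    2: "default mode (card not allocated to any xGPU device)",
}

def describe_xgpu(policy_code: int, health_code: int) -> str:
    """Render gpu_schedule_policy and xgpu_device_health samples as a summary."""
    policy = SCHEDULE_POLICIES.get(int(policy_code), f"unknown policy {policy_code}")
    health = "healthy" if int(health_code) == 0 else "not healthy"
    return f"policy: {policy}; device: {health}"

print(describe_xgpu(1, 0))  # → policy: xGPU memory and cores both isolated; device: healthy
```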

Accessing Grafana

The Prometheus add-on comes with Grafana, an open-source visualization tool, preinstalled and interconnected. You can create a public network LoadBalancer Service so that Grafana can be accessed from the public network and Prometheus monitoring data can be viewed on its dashboards.


  1. Log in to the CCE console and click the name of the cluster with Prometheus installed to access the cluster console. In the navigation pane, choose Services & Ingresses.
  2. Click Create from YAML in the upper right corner to create a public network LoadBalancer Service for Grafana.

    apiVersion: v1
    kind: Service
    metadata:
      name: grafana-lb     # Service name, which is customizable
      namespace: monitoring
      labels:
        app: grafana
      annotations:
        kubernetes.io/elb.id: 038ff***     # Replace it with the ID of the public network load balancer in the VPC to which the cluster belongs.
    spec:
      ports:
        - name: cce-service-0
          protocol: TCP
          port: 80     # Service port, which is customizable
          targetPort: 3000     # Default Grafana port. Retain the default value.
      selector:
        app: grafana
      type: LoadBalancer

  3. After the Service is created, visit <Public IP address of the load balancer>:<Service port> in a browser to access Grafana, and select a proper dashboard to view xGPU resources.

    Figure 4 Viewing xGPU resources