Monitoring GPU Metrics
You can use Prometheus and Grafana to observe GPU metrics. This section uses Prometheus as an example to describe how to view the GPU metrics of a cluster, such as GPU memory usage.
The process is as follows:
- Accessing Prometheus
(Optional) Bind a LoadBalancer Service to Prometheus so that Prometheus can be accessed from external networks.
- Monitoring GPU Metrics
After a GPU workload is deployed in the cluster, GPU metrics will be automatically reported.
- Accessing Grafana
View Prometheus monitoring data on Grafana, a visualization panel.
Prerequisites
- The Cloud Native Cluster Monitoring add-on has been installed in the cluster.
- The CCE AI Suite (NVIDIA GPU) add-on has been installed in the cluster, and the add-on version is 2.0.10 or later.
- To monitor GPU virtualization metrics, ensure Volcano Scheduler has been installed in the cluster and the add-on version is 1.10.5 or later.
Accessing Prometheus
After the Prometheus add-on is installed, the Prometheus server is deployed as a StatefulSet in the monitoring namespace. You can then create a public network LoadBalancer Service so that Prometheus can be accessed from an external network.
- Log in to the CCE console and click the name of the cluster with Prometheus installed to access the cluster console. In the navigation pane, choose Services & Ingresses.
- Click Create from YAML in the upper right corner to create a public network LoadBalancer Service.
```yaml
apiVersion: v1
kind: Service
metadata:
  name: prom-lb                       # Service name, which is customizable.
  namespace: monitoring
  labels:
    app: prometheus
    component: server
  annotations:
    kubernetes.io/elb.id: 038ff***    # Replace it with the ID of the public network load balancer in the VPC that the cluster belongs to.
spec:
  ports:
    - name: cce-service-0
      protocol: TCP
      port: 88                        # Service port, which is customizable.
      targetPort: 9090                # Default Prometheus port. Retain the default value.
  selector:                           # The label selector can be adjusted based on the labels of the Prometheus server pods.
    app.kubernetes.io/name: prometheus
    prometheus: server
  type: LoadBalancer
```
- After the Service is created, visit *Public IP address of the load balancer*:*Service port* to access Prometheus.
Figure 1 Accessing Prometheus
- Choose Status > Targets to view the targets monitored by Prometheus.
Figure 2 Viewing monitored targets
Monitoring GPU Metrics
Create a GPU workload, as shown in the example below. After the workload runs properly, access Prometheus and view GPU metrics on the Graph page; for example, enter cce_gpu_memory_utilization in the expression box to view GPU memory usage. The metrics that can be queried are listed in the table that follows the example.
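The following is a minimal sketch of such a GPU workload, assuming the standard nvidia.com/gpu resource name exposed by the CCE AI Suite (NVIDIA GPU) add-on. The workload name, image, and resource amount are placeholders; adapt them to your environment.

```yaml
# Minimal example GPU Deployment (names, image, and sizes are placeholders).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-test
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpu-test
  template:
    metadata:
      labels:
        app: gpu-test
    spec:
      containers:
        - name: cuda-container
          image: nvidia/cuda:11.4.3-base-ubuntu20.04   # Replace with an image available in your environment.
          command: ["sleep", "infinity"]               # Keep the container running for demonstration purposes.
          resources:
            limits:
              nvidia.com/gpu: 1                        # Request one whole GPU card.
```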
| Type | Metric | Monitoring Level | Description |
| --- | --- | --- | --- |
| Utilization | cce_gpu_utilization | GPU cards | GPU compute usage |
| Utilization | cce_gpu_memory_utilization | GPU cards | GPU memory usage |
| Utilization | cce_gpu_encoder_utilization | GPU cards | GPU encoding usage |
| Utilization | cce_gpu_decoder_utilization | GPU cards | GPU decoding usage |
| Utilization | cce_gpu_utilization_process | GPU processes | GPU compute usage of each process |
| Utilization | cce_gpu_memory_utilization_process | GPU processes | GPU memory usage of each process |
| Utilization | cce_gpu_encoder_utilization_process | GPU processes | GPU encoding usage of each process |
| Utilization | cce_gpu_decoder_utilization_process | GPU processes | GPU decoding usage of each process |
| Memory | cce_gpu_memory_used | GPU cards | Used GPU memory |
| Memory | cce_gpu_memory_total | GPU cards | Total GPU memory |
| Memory | cce_gpu_memory_free | GPU cards | Free GPU memory |
| Memory | cce_gpu_bar1_memory_used | GPU cards | Used GPU BAR1 memory |
| Memory | cce_gpu_bar1_memory_total | GPU cards | Total GPU BAR1 memory |
| Frequency | cce_gpu_clock | GPU cards | GPU clock frequency |
| Frequency | cce_gpu_memory_clock | GPU cards | GPU memory frequency |
| Frequency | cce_gpu_graphics_clock | GPU cards | GPU graphics frequency |
| Frequency | cce_gpu_video_clock | GPU cards | GPU video processor frequency |
| Physical status | cce_gpu_temperature | GPU cards | GPU temperature |
| Physical status | cce_gpu_power_usage | GPU cards | GPU power usage |
| Physical status | cce_gpu_total_energy_consumption | GPU cards | Total GPU energy consumption |
| Bandwidth | cce_gpu_pcie_link_bandwidth | GPU cards | GPU PCIe bandwidth |
| Bandwidth | cce_gpu_nvlink_bandwidth | GPU cards | GPU NVLink bandwidth |
| Bandwidth | cce_gpu_pcie_throughput_rx | GPU cards | GPU PCIe RX bandwidth |
| Bandwidth | cce_gpu_pcie_throughput_tx | GPU cards | GPU PCIe TX bandwidth |
| Bandwidth | cce_gpu_nvlink_utilization_counter_rx | GPU cards | GPU NVLink RX bandwidth |
| Bandwidth | cce_gpu_nvlink_utilization_counter_tx | GPU cards | GPU NVLink TX bandwidth |
| Memory isolation page | cce_gpu_retired_pages_sbe | GPU cards | Number of isolated GPU memory pages with single-bit errors |
| Memory isolation page | cce_gpu_retired_pages_dbe | GPU cards | Number of isolated GPU memory pages with double-bit errors |
If GPU virtualization (xGPU) is used in the cluster, the following metrics are also reported:

| Metric | Monitoring Level | Description |
| --- | --- | --- |
| xgpu_memory_total | GPU processes | Total xGPU memory |
| xgpu_memory_used | GPU processes | Used xGPU memory |
| xgpu_core_percentage_total | GPU processes | Total xGPU cores |
| xgpu_core_percentage_used | GPU processes | Used xGPU cores |
| gpu_schedule_policy | GPU cards | xGPU scheduling policy |
| xgpu_device_health | GPU cards | Health status of an xGPU device |
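xGPU metrics are reported only for workloads that request virtualized GPU resources. Below is a minimal sketch of such a workload; the resource names volcano.sh/gpu-mem.128Mi and volcano.sh/gpu-core.percentage are assumptions based on typical CCE GPU virtualization setups and may differ depending on your add-on version, so verify the resource names exposed in your cluster before using them.

```yaml
# Minimal sketch of a workload requesting virtualized GPU (xGPU) resources.
# The resource names below are assumptions; verify them against your cluster configuration.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: xgpu-test
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: xgpu-test
  template:
    metadata:
      labels:
        app: xgpu-test
    spec:
      containers:
        - name: cuda-container
          image: nvidia/cuda:11.4.3-base-ubuntu20.04   # Replace with an image available in your environment.
          command: ["sleep", "infinity"]
          resources:
            limits:
              volcano.sh/gpu-mem.128Mi: 40             # Assumed resource name: 40 x 128 MiB = 5 GiB of GPU memory.
              volcano.sh/gpu-core.percentage: 25       # Assumed resource name: 25% of one GPU's compute.
```

After such a workload is running, the xgpu_memory_used and xgpu_core_percentage_used metrics in the table above reflect its per-process consumption.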
Accessing Grafana
The Prometheus add-on also installs and interconnects Grafana, an open-source visualization tool. You can create a public network LoadBalancer Service so that Grafana can be accessed from the public network, then open Grafana through that address and select a proper dashboard to view the aggregated Prometheus monitoring data.
- Log in to the CCE console and click the name of the cluster with Prometheus installed to access the cluster console. In the navigation pane, choose Services & Ingresses.
- Click Create from YAML in the upper right corner to create a public network LoadBalancer Service for Grafana.
```yaml
apiVersion: v1
kind: Service
metadata:
  name: grafana-lb                    # Service name, which is customizable.
  namespace: monitoring
  labels:
    app: grafana
  annotations:
    kubernetes.io/elb.id: 038ff***    # Replace it with the ID of the public network load balancer in the VPC that the cluster belongs to.
spec:
  ports:
    - name: cce-service-0
      protocol: TCP
      port: 80                        # Service port, which is customizable.
      targetPort: 3000                # Default Grafana port. Retain the default value.
  selector:
    app: grafana
  type: LoadBalancer
```
- After the Service is created, visit *Public IP address of the load balancer*:*Service port* to access Grafana, and select a proper dashboard to view xGPU resources.
Figure 4 Viewing xGPU resources