Monitoring GPU Metrics
You can use Prometheus and Grafana to observe GPU metrics. This section uses Prometheus as an example to describe how to view the GPU metrics of a cluster, such as GPU memory usage.
The process is as follows:
- Accessing Prometheus
(Optional) Bind a LoadBalancer Service to Prometheus so that Prometheus can be accessed from external networks.
- Monitoring GPU Metrics
After a GPU workload is deployed in the cluster, GPU metrics will be automatically reported.
- Accessing Grafana
View Prometheus monitoring data on Grafana, a visualization panel.
Prerequisites
- The Cloud Native Cluster Monitoring add-on has been installed in the cluster.
- The CCE AI Suite (NVIDIA GPU) add-on has been installed in the cluster, and the add-on version is 2.0.10 or later.
- To monitor GPU virtualization metrics, ensure Volcano Scheduler has been installed in the cluster and the add-on version is 1.10.5 or later.
Accessing Prometheus
After the Prometheus add-on is installed, the Prometheus server is deployed as a StatefulSet in the monitoring namespace. You can then create a public network LoadBalancer Service so that Prometheus can be accessed from an external network.
- Log in to the CCE console and click the name of the cluster with Prometheus installed to access the cluster console. In the navigation pane, choose Services & Ingresses.
- Click Create from YAML in the upper right corner to create a public network LoadBalancer Service.
```yaml
apiVersion: v1
kind: Service
metadata:
  name: prom-lb                       # Service name, which is customizable.
  namespace: monitoring
  labels:
    app: prometheus
    component: server
  annotations:
    kubernetes.io/elb.id: 038ff***    # Replace it with the ID of the public network load balancer in the VPC that the cluster belongs to.
spec:
  ports:
    - name: cce-service-0
      protocol: TCP
      port: 88                        # Service port, which is customizable.
      targetPort: 9090                # Default Prometheus port. Retain the default value.
  selector:                           # The label selector can be adjusted based on the labels of the Prometheus server pods.
    app.kubernetes.io/name: prometheus
    prometheus: server
  type: LoadBalancer
```
- After the Service is created, visit *Public IP address of the load balancer*:*Service port* to access Prometheus.
Figure 1 Accessing Prometheus
- Choose Status > Targets to view the targets monitored by Prometheus.
Figure 2 Viewing monitored targets
Monitoring GPU Metrics
Create a GPU workload, as shown in the example below. After the workload runs properly, access Prometheus and view GPU metrics on the Graph page; for example, enter cce_gpu_memory_utilization in the expression box to view GPU memory usage. The metrics that can be queried are listed in the table that follows the example.
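The following is a minimal sketch of such a GPU workload, assuming the standard nvidia.com/gpu resource name exposed by the CCE AI Suite (NVIDIA GPU) add-on. The workload name, image, and resource amount are placeholders; adapt them to your environment.

```yaml
# Minimal example GPU Deployment (names, image, and sizes are placeholders).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-test
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpu-test
  template:
    metadata:
      labels:
        app: gpu-test
    spec:
      containers:
        - name: cuda-container
          image: nvidia/cuda:11.4.3-base-ubuntu20.04   # Replace with an image available in your environment.
          command: ["sleep", "infinity"]               # Keep the container running for demonstration purposes.
          resources:
            limits:
              nvidia.com/gpu: 1                        # Request one whole GPU card.
```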
| Type | Metric | Monitoring Level | Description |
| --- | --- | --- | --- |
| Utilization | cce_gpu_utilization | GPU cards | GPU compute usage |
| Utilization | cce_gpu_memory_utilization | GPU cards | GPU memory usage |
| Utilization | cce_gpu_encoder_utilization | GPU cards | GPU encoding usage |
| Utilization | cce_gpu_decoder_utilization | GPU cards | GPU decoding usage |
| Utilization | cce_gpu_utilization_process | GPU processes | GPU compute usage of each process |
| Utilization | cce_gpu_memory_utilization_process | GPU processes | GPU memory usage of each process |
| Utilization | cce_gpu_encoder_utilization_process | GPU processes | GPU encoding usage of each process |
| Utilization | cce_gpu_decoder_utilization_process | GPU processes | GPU decoding usage of each process |
| Memory | cce_gpu_memory_used | GPU cards | Used GPU memory |
| Memory | cce_gpu_memory_total | GPU cards | Total GPU memory |
| Memory | cce_gpu_memory_free | GPU cards | Free GPU memory |
| Memory | cce_gpu_bar1_memory_used | GPU cards | Used GPU BAR1 memory |
| Memory | cce_gpu_bar1_memory_total | GPU cards | Total GPU BAR1 memory |
| Frequency | cce_gpu_clock | GPU cards | GPU clock frequency |
| Frequency | cce_gpu_memory_clock | GPU cards | GPU memory frequency |
| Frequency | cce_gpu_graphics_clock | GPU cards | GPU graphics frequency |
| Frequency | cce_gpu_video_clock | GPU cards | GPU video processor frequency |
| Physical status | cce_gpu_temperature | GPU cards | GPU temperature |
| Physical status | cce_gpu_power_usage | GPU cards | GPU power usage |
| Physical status | cce_gpu_total_energy_consumption | GPU cards | Total GPU energy consumption |
| Bandwidth | cce_gpu_pcie_link_bandwidth | GPU cards | GPU PCIe bandwidth |
| Bandwidth | cce_gpu_nvlink_bandwidth | GPU cards | GPU NVLink bandwidth |
| Bandwidth | cce_gpu_pcie_throughput_rx | GPU cards | GPU PCIe RX bandwidth |
| Bandwidth | cce_gpu_pcie_throughput_tx | GPU cards | GPU PCIe TX bandwidth |
| Bandwidth | cce_gpu_nvlink_utilization_counter_rx | GPU cards | GPU NVLink RX bandwidth |
| Bandwidth | cce_gpu_nvlink_utilization_counter_tx | GPU cards | GPU NVLink TX bandwidth |
| Memory isolation page | cce_gpu_retired_pages_sbe | GPU cards | Number of isolated GPU memory pages with single-bit errors |
| Memory isolation page | cce_gpu_retired_pages_dbe | GPU cards | Number of isolated GPU memory pages with double-bit errors |
If GPU virtualization (xGPU) is used in the cluster, the following metrics are also reported:

| Metric | Monitoring Level | Description |
| --- | --- | --- |
| xgpu_memory_total | GPU processes | Total xGPU memory |
| xgpu_memory_used | GPU processes | Used xGPU memory |
| xgpu_core_percentage_total | GPU processes | Total xGPU cores |
| xgpu_core_percentage_used | GPU processes | Used xGPU cores |
| gpu_schedule_policy | GPU cards | xGPU scheduling policy |
| xgpu_device_health | GPU cards | Health status of an xGPU device |
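xGPU metrics are reported only for workloads that request virtualized GPU resources. Below is a minimal sketch of such a workload; the resource names volcano.sh/gpu-mem.128Mi and volcano.sh/gpu-core.percentage are assumptions based on typical CCE GPU virtualization setups and may differ depending on your add-on version, so verify the resource names exposed in your cluster before using them.

```yaml
# Minimal sketch of a workload requesting virtualized GPU (xGPU) resources.
# The resource names below are assumptions; verify them against your cluster configuration.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: xgpu-test
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: xgpu-test
  template:
    metadata:
      labels:
        app: xgpu-test
    spec:
      containers:
        - name: cuda-container
          image: nvidia/cuda:11.4.3-base-ubuntu20.04   # Replace with an image available in your environment.
          command: ["sleep", "infinity"]
          resources:
            limits:
              volcano.sh/gpu-mem.128Mi: 40             # Assumed resource name: 40 x 128 MiB = 5 GiB of GPU memory.
              volcano.sh/gpu-core.percentage: 25       # Assumed resource name: 25% of one GPU's compute.
```

After such a workload is running, the xgpu_memory_used and xgpu_core_percentage_used metrics in the table above reflect its per-process consumption.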
Accessing Grafana
The Prometheus add-on also installs and interconnects Grafana, an open-source visualization tool. You can create a public network LoadBalancer Service so that Grafana can be accessed from the public network, then open Grafana through that address and select a proper dashboard to view the aggregated Prometheus monitoring data.
- Log in to the CCE console and click the name of the cluster with Prometheus installed to access the cluster console. In the navigation pane, choose Services & Ingresses.
- Click Create from YAML in the upper right corner to create a public network LoadBalancer Service for Grafana.
```yaml
apiVersion: v1
kind: Service
metadata:
  name: grafana-lb                    # Service name, which is customizable.
  namespace: monitoring
  labels:
    app: grafana
  annotations:
    kubernetes.io/elb.id: 038ff***    # Replace it with the ID of the public network load balancer in the VPC that the cluster belongs to.
spec:
  ports:
    - name: cce-service-0
      protocol: TCP
      port: 80                        # Service port, which is customizable.
      targetPort: 3000                # Default Grafana port. Retain the default value.
  selector:
    app: grafana
  type: LoadBalancer
```
- After the Service is created, visit *Public IP address of the load balancer*:*Service port* to access Grafana, and select a proper dashboard to view xGPU resources.
Figure 4 Viewing xGPU resources