Monitoring GPU Metrics
You can use Prometheus and Grafana to observe GPU metrics. This section uses Prometheus as an example to describe how to view the GPU metrics of a cluster, such as GPU memory usage.
The process is as follows:
- Accessing Prometheus
(Optional) Bind a LoadBalancer Service to Prometheus so that Prometheus can be accessed from external networks.
- Monitoring GPU Metrics
After a GPU workload is deployed in the cluster, GPU metrics will be automatically reported.
- Accessing Grafana
View Prometheus monitoring data on Grafana, a visualization tool.
Prerequisites
- The Cloud Native Cluster Monitoring add-on has been installed in the cluster.
- The CCE AI Suite (NVIDIA GPU) add-on has been installed in the cluster, and the add-on version is 2.0.10 or later.
- To monitor GPU virtualization metrics, ensure Volcano Scheduler has been installed in the cluster and the add-on version is 1.10.5 or later.
Accessing Prometheus
After the Prometheus add-on is installed, its workloads and Services are deployed automatically. The Prometheus server runs as a StatefulSet in the monitoring namespace.
You can create a public network LoadBalancer Service so that Prometheus can be accessed from an external network.
- Log in to the CCE console and click the name of the cluster with Prometheus installed to access the cluster console. In the navigation pane, choose Services & Ingresses.
- Click Create from YAML in the upper right corner to create a public network LoadBalancer Service.
apiVersion: v1
kind: Service
metadata:
  name: prom-lb               # Service name, which is customizable.
  namespace: monitoring
  labels:
    app: prometheus
    component: server
  annotations:
    kubernetes.io/elb.id: 038ff***   # Replace it with the ID of the public network load balancer in the VPC that the cluster belongs to.
spec:
  ports:
    - name: cce-service-0
      protocol: TCP
      port: 88                # Service port, which is customizable.
      targetPort: 9090        # Default Prometheus port. Retain the default value.
  selector:                   # The label selector can be adjusted based on the labels of the Prometheus server instance.
    app.kubernetes.io/name: prometheus
    prometheus: server
  type: LoadBalancer
- After the Service is created, visit Public IP address of the load balancer:Service port to access Prometheus.
Figure 1 Accessing Prometheus
- Choose Status > Targets to view the targets monitored by Prometheus.
Figure 2 Viewing monitored targets
Monitoring GPU Metrics
Create a GPU workload. After the workload runs properly, access Prometheus and view GPU metrics on the Graph page.
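For example, the following Deployment is a minimal sketch of such a GPU workload. The workload name, namespace, and image are illustrative, and it assumes the cluster has schedulable NVIDIA GPU nodes; it requests one GPU card through the nvidia.com/gpu resource, which is enough for GPU metrics to be reported for the pod.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-test                   # Workload name, which is customizable.
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpu-test
  template:
    metadata:
      labels:
        app: gpu-test
    spec:
      containers:
        - name: cuda-container
          image: nvidia/cuda:11.8.0-base-ubuntu22.04   # Example image. Replace it with your own GPU application image.
          command: ["sleep", "infinity"]               # Keeps the example container running.
          resources:
            limits:
              nvidia.com/gpu: 1                        # Requests one GPU card.

After the pod is running, enter a metric name from the table below (for example, cce_gpu_memory_used) on the Prometheus Graph page to plot it.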
Type | Metric | Monitoring Level | Description
---|---|---|---
Utilization | cce_gpu_utilization | GPU cards | GPU compute usage
 | cce_gpu_memory_utilization | GPU cards | GPU memory usage
 | cce_gpu_encoder_utilization | GPU cards | GPU encoding usage
 | cce_gpu_decoder_utilization | GPU cards | GPU decoding usage
 | cce_gpu_utilization_process | GPU processes | GPU compute usage of each process
 | cce_gpu_memory_utilization_process | GPU processes | GPU memory usage of each process
 | cce_gpu_encoder_utilization_process | GPU processes | GPU encoding usage of each process
 | cce_gpu_decoder_utilization_process | GPU processes | GPU decoding usage of each process
Memory | cce_gpu_memory_used | GPU cards | Used GPU memory
 | cce_gpu_memory_total | GPU cards | Total GPU memory
 | cce_gpu_memory_free | GPU cards | Free GPU memory
 | cce_gpu_bar1_memory_used | GPU cards | Used GPU BAR1 memory
 | cce_gpu_bar1_memory_total | GPU cards | Total GPU BAR1 memory
Frequency | cce_gpu_clock | GPU cards | GPU clock frequency
 | cce_gpu_memory_clock | GPU cards | GPU memory frequency
 | cce_gpu_graphics_clock | GPU cards | GPU graphics frequency
 | cce_gpu_video_clock | GPU cards | GPU video processor frequency
Physical status | cce_gpu_temperature | GPU cards | GPU temperature
 | cce_gpu_power_usage | GPU cards | GPU power usage
 | cce_gpu_total_energy_consumption | GPU cards | Total GPU energy consumption
Bandwidth | cce_gpu_pcie_link_bandwidth | GPU cards | GPU PCIe bandwidth
 | cce_gpu_nvlink_bandwidth | GPU cards | GPU NVLink bandwidth
 | cce_gpu_pcie_throughput_rx | GPU cards | GPU PCIe RX bandwidth
 | cce_gpu_pcie_throughput_tx | GPU cards | GPU PCIe TX bandwidth
 | cce_gpu_nvlink_utilization_counter_rx | GPU cards | GPU NVLink RX bandwidth
 | cce_gpu_nvlink_utilization_counter_tx | GPU cards | GPU NVLink TX bandwidth
Memory isolation pages | cce_gpu_retired_pages_sbe | GPU cards | Number of isolated GPU memory pages with single-bit errors
 | cce_gpu_retired_pages_dbe | GPU cards | Number of isolated GPU memory pages with double-bit errors
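The per-card metrics above can also be combined with PromQL, for example to aggregate usage per scrape target. The following recording rules are a minimal sketch in the standard Prometheus rule-file format, shown only to illustrate such queries; the group and rule names are made up, and whether and how rule files are loaded depends on how the Prometheus add-on is configured in your cluster.

groups:
  - name: gpu-aggregates            # Hypothetical rule group name.
    rules:
      - record: instance:cce_gpu_memory_used:sum
        expr: sum by (instance) (cce_gpu_memory_used)    # Total used GPU memory per scrape target.
      - record: instance:cce_gpu_utilization:avg
        expr: avg by (instance) (cce_gpu_utilization)    # Average GPU compute usage per scrape target.

The same expressions can also be entered directly on the Graph page without defining any rules.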
If GPU virtualization (xGPU) is used in the cluster, the following metrics are also reported:

Metric | Monitoring Level | Description
---|---|---
xgpu_memory_total | GPU processes | Total xGPU memory
xgpu_memory_used | GPU processes | Used xGPU memory
xgpu_core_percentage_total | GPU processes | Total xGPU cores
xgpu_core_percentage_used | GPU processes | Used xGPU cores
gpu_schedule_policy | GPU cards | xGPU scheduling policy
xgpu_device_health | GPU cards | Health status of an xGPU device
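When GPU virtualization is used, a workload requests xGPU resources instead of whole cards, and the xgpu_* metrics above are then reported for it. The following Pod spec is an illustrative sketch only: the extended resource names volcano.sh/gpu-mem.128Mi and volcano.sh/gpu-core.percentage are assumptions based on CCE GPU virtualization and may differ across add-on versions, so verify the resource names exposed by your cluster before using them.

apiVersion: v1
kind: Pod
metadata:
  name: xgpu-test                   # Pod name, which is customizable.
  namespace: default
spec:
  schedulerName: volcano            # GPU virtualization requires the Volcano scheduler (see Prerequisites); whether it must be set explicitly depends on the cluster's default scheduler.
  containers:
    - name: cuda-container
      image: nvidia/cuda:11.8.0-base-ubuntu22.04   # Example image. Replace it with your own GPU application image.
      command: ["sleep", "infinity"]
      resources:
        limits:
          volcano.sh/gpu-mem.128Mi: 16         # Assumed resource name: GPU memory in 128 MiB units (16 x 128 MiB = 2 GiB).
          volcano.sh/gpu-core.percentage: 25   # Assumed resource name: 25% of one GPU card's compute.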
Accessing Grafana
The Prometheus add-on installs Grafana (an open-source visualization tool) and interconnects it with Prometheus. You can create a public network LoadBalancer Service so that Grafana can be accessed from the public network, and then select a proper dashboard in Grafana to view the aggregated Prometheus monitoring data.
- Log in to the CCE console and click the name of the cluster with Prometheus installed to access the cluster console. In the navigation pane, choose Services & Ingresses.
- Click Create from YAML in the upper right corner to create a public network LoadBalancer Service for Grafana.
apiVersion: v1
kind: Service
metadata:
  name: grafana-lb            # Service name, which is customizable.
  namespace: monitoring
  labels:
    app: grafana
  annotations:
    kubernetes.io/elb.id: 038ff***   # Replace it with the ID of the public network load balancer in the VPC to which the cluster belongs.
spec:
  ports:
    - name: cce-service-0
      protocol: TCP
      port: 80                # Service port, which is customizable.
      targetPort: 3000        # Default Grafana port. Retain the default value.
  selector:
    app: grafana
  type: LoadBalancer
- After the Service is created, visit Public IP address of the load balancer:Service port to access Grafana and select a proper dashboard to view xGPU resources.
Figure 4 Viewing xGPU resources