GPU Metrics
The CCE AI Suite (NVIDIA GPU) add-on exposes GPU monitoring metrics, giving you additional GPU observability options. This section describes the metrics the add-on provides.
GPU Metrics Provided by CCE
| Category | Metric | Type | Unit | Monitoring Level | Description |
|---|---|---|---|---|---|
| Utilization | cce_gpu_utilization | Gauge | % | GPU cards | GPU compute usage |
| | cce_gpu_memory_utilization | Gauge | % | GPU cards | GPU memory usage |
| | cce_gpu_encoder_utilization | Gauge | % | GPU cards | GPU encoding usage |
| | cce_gpu_decoder_utilization | Gauge | % | GPU cards | GPU decoding usage |
| | cce_gpu_utilization_process | Gauge | % | GPU processes | GPU compute usage of each process |
| | cce_gpu_memory_utilization_process | Gauge | % | GPU processes | GPU memory usage of each process |
| | cce_gpu_encoder_utilization_process | Gauge | % | GPU processes | GPU encoding usage of each process |
| | cce_gpu_decoder_utilization_process | Gauge | % | GPU processes | GPU decoding usage of each process |
| Memory | cce_gpu_memory_used | Gauge | Byte | GPU cards | Used GPU memory. NOTE: If the NVIDIA driver version is 510 or later, this value may be inaccurate in full GPU mode. |
| | cce_gpu_memory_total | Gauge | Byte | GPU cards | Total GPU memory |
| | cce_gpu_memory_free | Gauge | Byte | GPU cards | Free GPU memory |
| | cce_gpu_bar1_memory_used | Gauge | Byte | GPU cards | Used GPU BAR1 memory |
| | cce_gpu_bar1_memory_total | Gauge | Byte | GPU cards | Total GPU BAR1 memory |
| Frequency | cce_gpu_clock | Gauge | MHz | GPU cards | GPU clock frequency |
| | cce_gpu_memory_clock | Gauge | MHz | GPU cards | GPU memory clock frequency |
| | cce_gpu_graphics_clock | Gauge | MHz | GPU cards | GPU graphics clock frequency |
| | cce_gpu_video_clock | Gauge | MHz | GPU cards | GPU video processor clock frequency |
| Physical status | cce_gpu_temperature | Gauge | °C | GPU cards | GPU temperature |
| | cce_gpu_power_usage | Gauge | Milliwatt | GPU cards | GPU power draw |
| | cce_gpu_total_energy_consumption | Gauge | Millijoule | GPU cards | Total GPU energy consumption |
| Bandwidth | cce_gpu_pcie_link_bandwidth | Gauge | bit | GPU cards | GPU PCIe link bandwidth |
| | cce_gpu_nvlink_bandwidth | Gauge | Gbit/s | GPU cards | GPU NVLink bandwidth |
| | cce_gpu_pcie_throughput_rx | Gauge | KB/s | GPU cards | GPU PCIe RX bandwidth |
| | cce_gpu_pcie_throughput_tx | Gauge | KB/s | GPU cards | GPU PCIe TX bandwidth |
| | cce_gpu_nvlink_utilization_counter_rx | Gauge | KB/s | GPU cards | GPU NVLink RX bandwidth |
| | cce_gpu_nvlink_utilization_counter_tx | Gauge | KB/s | GPU cards | GPU NVLink TX bandwidth |
| Memory page isolation | cce_gpu_retired_pages_sbe | Gauge | N/A | GPU cards | Number of isolated GPU memory pages with single-bit errors |
| | cce_gpu_retired_pages_dbe | Gauge | N/A | GPU cards | Number of isolated GPU memory pages with double-bit errors |
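The add-on exposes these metrics for Prometheus-compatible scrapers. A minimal sketch of pulling per-card utilization values out of scraped text (the sample payload and its `gpu` label are illustrative assumptions; real label sets depend on the add-on version):

```python
# Sketch: extract one gauge metric from Prometheus text-format output.
# SAMPLE is an illustrative payload, not the add-on's exact output.
import re

SAMPLE = """\
cce_gpu_utilization{gpu="0"} 72
cce_gpu_utilization{gpu="1"} 15
cce_gpu_memory_utilization{gpu="0"} 80
"""

def parse_gauge(text, metric):
    """Return {label_string: float_value} for one gauge metric."""
    pattern = re.compile(
        r'^%s\{([^}]*)\}\s+([0-9.eE+-]+)$' % re.escape(metric), re.M)
    return {labels: float(value) for labels, value in pattern.findall(text)}

util = parse_gauge(SAMPLE, "cce_gpu_utilization")
# util == {'gpu="0"': 72.0, 'gpu="1"': 15.0}
```

In practice the text would come from the add-on's metrics endpoint rather than a hard-coded string; most deployments let Prometheus scrape it directly instead of parsing by hand.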
| Metric | Type | Unit | Monitoring Level | Description |
|---|---|---|---|---|
| xgpu_memory_total | Gauge | Byte | GPU processes | Total xGPU memory |
| xgpu_memory_used | Gauge | Byte | GPU processes | Used xGPU memory |
| xgpu_core_percentage_total | Gauge | % | GPU processes | Total xGPU cores |
| xgpu_core_percentage_used | Gauge | % | GPU processes | Used xGPU cores |
| gpu_schedule_policy | Gauge | N/A | GPU cards | xGPU scheduling policy |
| xgpu_device_health | Gauge | N/A | GPU cards | xGPU device health |

- To use the metrics listed in Table 3, ensure that the CCE AI Suite (NVIDIA GPU) add-on is version 2.1.30, 2.7.46, or later. If you need these metrics, upgrade the add-on promptly.
- Cloud Native Cluster Monitoring does not collect GPU pod monitoring metrics automatically. To view this data in the monitoring center, configure Cloud Native Cluster Monitoring to collect the necessary metrics by referring to "Monitoring" > "Collecting GPU Pod Monitoring Metrics and Setting Up a Grafana Dashboard" in Best Practices.
- If the NVIDIA driver version is 510 or later, the gpu_pod_memory_used value may be inaccurate in full GPU mode. The details are as follows:
  - In CCE AI Suite (NVIDIA GPU) versions earlier than 2.7.60 or 2.1.44, the gpu_pod_memory_used value may be approximately 250 MB higher than the actual usage. The discrepancy is the memory reserved by the system for the GPU driver or firmware.
  - In CCE AI Suite (NVIDIA GPU) 2.7.60, 2.1.44, or later, the gpu_pod_memory_used value may be about 100 KB higher than the actual usage.
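On affected add-on versions, dashboards sometimes compensate for the reserved-memory offset before displaying per-pod usage. A hedged sketch of that adjustment (the ~250 MB figure comes from the note above; the function name and the version flag are illustrative, not part of the add-on):

```python
# Illustrative correction for the gpu_pod_memory_used offset described
# above: add-on versions earlier than 2.7.60/2.1.44 may over-report by
# roughly 250 MB of driver/firmware reserved memory in full GPU mode.
DRIVER_RESERVED_BYTES = 250 * 1024 * 1024  # approximate, per the note above

def adjusted_pod_memory_used(reported_bytes, addon_is_pre_fix):
    """Subtract the approximate reserved-memory offset on affected versions."""
    if addon_is_pre_fix:
        return max(0, reported_bytes - DRIVER_RESERVED_BYTES)
    return reported_bytes

# A pod reporting 1 GiB on an affected add-on version:
print(adjusted_pod_memory_used(1 * 1024**3, True))  # 811597824
```

Because the offset is only approximate, treat the corrected value as an estimate; upgrading the add-on is the reliable fix.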
| Metric | Type | Unit | Monitoring Level | Description |
|---|---|---|---|---|
| gpu_pod_core_percentage_total | Gauge | % | GPU processes | GPU compute allocated by a GPU card to GPU workloads, as a percentage of the entire card's compute. For example, 30% means that 30% of the card's compute is dedicated to GPU virtualization workloads. |
| gpu_pod_core_percentage_used | Gauge | % | GPU processes | GPU compute actually used by the GPU workloads, as a percentage of the entire card's compute. For example, 30% means the workloads are actively using 30% of the card's compute. |
| gpu_pod_memory_total | Gauge | Byte | GPU processes | GPU memory allocated by a GPU card to the GPU workloads, in bytes. |
| gpu_pod_memory_used | Gauge | Byte | GPU processes | GPU memory used by the GPU workloads, in bytes. |