Updated on 2025-09-05 GMT+08:00

GPU Metrics

The CCE AI Suite (NVIDIA GPU) add-on exposes GPU monitoring metrics for GPU cards, xGPU devices, and GPU pods. This section describes these metrics.

GPU Metrics Provided by CCE

Table 1 Basic GPU monitoring metrics

Category

Metric

Type

Unit

Monitoring Level

Description

Utilization

cce_gpu_utilization

Gauge

%

GPU cards

GPU compute usage

cce_gpu_memory_utilization

Gauge

%

GPU cards

GPU memory usage

cce_gpu_encoder_utilization

Gauge

%

GPU cards

GPU encoding usage

cce_gpu_decoder_utilization

Gauge

%

GPU cards

GPU decoding usage

cce_gpu_utilization_process

Gauge

%

GPU processes

GPU compute usage of each process

cce_gpu_memory_utilization_process

Gauge

%

GPU processes

GPU memory usage of each process

cce_gpu_encoder_utilization_process

Gauge

%

GPU processes

GPU encoding usage of each process

cce_gpu_decoder_utilization_process

Gauge

%

GPU processes

GPU decoding usage of each process

Memory

cce_gpu_memory_used

Gauge

Byte

GPU cards

Used GPU memory

NOTE:
If the NVIDIA driver version is 510 or later, the cce_gpu_memory_used value may be inaccurate in full GPU mode:
  • In CCE AI Suite (NVIDIA GPU) versions earlier than 2.7.60 or 2.1.44, the cce_gpu_memory_used value may be approximately 250 MB higher than the actual usage. The difference is the memory reserved by the system for the GPU driver or firmware.
  • In CCE AI Suite (NVIDIA GPU) versions 2.7.60, 2.1.44, or later, the cce_gpu_memory_used value may be about 100 KB higher than the actual usage.

cce_gpu_memory_total

Gauge

Byte

GPU cards

Total GPU memory

cce_gpu_memory_free

Gauge

Byte

GPU cards

Free GPU memory

cce_gpu_bar1_memory_used

Gauge

Byte

GPU cards

Used GPU BAR1 memory

cce_gpu_bar1_memory_total

Gauge

Byte

GPU cards

Total GPU BAR1 memory

Frequency

cce_gpu_clock

Gauge

MHz

GPU cards

GPU clock frequency

cce_gpu_memory_clock

Gauge

MHz

GPU cards

GPU memory clock frequency

cce_gpu_graphics_clock

Gauge

MHz

GPU cards

GPU graphics clock frequency

cce_gpu_video_clock

Gauge

MHz

GPU cards

GPU video processor frequency

Physical status

cce_gpu_temperature

Gauge

°C

GPU cards

GPU temperature

cce_gpu_power_usage

Gauge

Milliwatt

GPU cards

GPU power usage

cce_gpu_total_energy_consumption

Gauge

Millijoule

GPU cards

Total GPU energy consumption

Bandwidth

cce_gpu_pcie_link_bandwidth

Gauge

bit/s

GPU cards

GPU PCIe bandwidth

cce_gpu_nvlink_bandwidth

Gauge

Gbit/s

GPU cards

GPU NVLink bandwidth

cce_gpu_pcie_throughput_rx

Gauge

KB/s

GPU cards

GPU PCIe RX bandwidth

cce_gpu_pcie_throughput_tx

Gauge

KB/s

GPU cards

GPU PCIe TX bandwidth

cce_gpu_nvlink_utilization_counter_rx

Gauge

KB/s

GPU cards

GPU NVLink RX bandwidth

cce_gpu_nvlink_utilization_counter_tx

Gauge

KB/s

GPU cards

GPU NVLink TX bandwidth

Retired memory pages

cce_gpu_retired_pages_sbe

Gauge

N/A

GPU cards

Number of retired (isolated) GPU memory pages with single-bit errors

cce_gpu_retired_pages_dbe

Gauge

N/A

GPU cards

Number of retired (isolated) GPU memory pages with double-bit errors
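
The units in Table 1 (bytes, milliwatts, and millijoules) usually need converting before they are shown on a dashboard. The following Python sketch illustrates such conversions. The function names are hypothetical and are not part of the add-on. The memory ratio is derived from cce_gpu_memory_used and cce_gpu_memory_total and is not claimed to match cce_gpu_memory_utilization.

```python
def bytes_to_gib(value_bytes: float) -> float:
    """Convert a byte-valued metric (for example, cce_gpu_memory_total) to GiB."""
    return value_bytes / 1024 ** 3


def memory_used_percent(used_bytes: float, total_bytes: float) -> float:
    """Share of GPU memory in use, derived from cce_gpu_memory_used and
    cce_gpu_memory_total. This is a derived ratio, not an add-on metric."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * used_bytes / total_bytes


def milliwatts_to_watts(value_mw: float) -> float:
    """Convert cce_gpu_power_usage (milliwatts) to watts."""
    return value_mw / 1000.0


def millijoules_to_kwh(value_mj: float) -> float:
    """Convert cce_gpu_total_energy_consumption (millijoules) to kWh.
    1 kWh = 3.6e6 J = 3.6e9 mJ."""
    return value_mj / 3.6e9


if __name__ == "__main__":
    # Illustrative sample values only, not measurements.
    print(memory_used_percent(4 * 1024 ** 3, 16 * 1024 ** 3))
    print(milliwatts_to_watts(250000))
```

Note that the note on cce_gpu_memory_used still applies: the derived ratio inherits any offset present in the underlying metric.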

Table 2 xGPU monitoring metrics

Metric

Type

Unit

Monitoring Level

Description

xgpu_memory_total

Gauge

Byte

GPU processes

Total xGPU memory

xgpu_memory_used

Gauge

Byte

GPU processes

Used xGPU memory

xgpu_core_percentage_total

Gauge

%

GPU processes

Compute allocated to the xGPU device, as a percentage of the compute of the entire GPU card

xgpu_core_percentage_used

Gauge

%

GPU processes

Compute used by the xGPU device, as a percentage of the compute of the entire GPU card

gpu_schedule_policy

Gauge

N/A

GPU cards

xGPU scheduling policy. Options:

  • 0: xGPU memory is isolated and cores are shared.
  • 1: Both xGPU memory and cores are isolated.
  • 2: Default mode. The card is not currently used by any xGPU device for allocation.

xgpu_device_health

Gauge

N/A

GPU cards

xGPU device health. Options:

  • 0: The xGPU device is healthy.
  • 1: The xGPU device is unhealthy.
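
The gpu_schedule_policy and xgpu_device_health metrics in Table 2 report numeric codes rather than measurements. As a minimal sketch, the codes can be mapped to readable labels as follows. The function and dictionary names are hypothetical, and the handling of unlisted values is an assumption rather than documented behavior.

```python
# Code-to-label mappings taken from the option lists in Table 2.
SCHEDULE_POLICY_LABELS = {
    0: "xGPU memory isolated, cores shared",
    1: "xGPU memory and cores isolated",
    2: "default mode (card not used by any xGPU device)",
}

DEVICE_HEALTH_LABELS = {
    0: "healthy",
    1: "unhealthy",
}


def describe_schedule_policy(value: float) -> str:
    """Return a label for a gpu_schedule_policy sample.
    Gauge samples are typically floats, so the value is cast to int."""
    return SCHEDULE_POLICY_LABELS.get(int(value), f"unknown policy value {value}")


def describe_device_health(value: float) -> str:
    """Return a label for an xgpu_device_health sample."""
    return DEVICE_HEALTH_LABELS.get(int(value), f"unknown health value {value}")


if __name__ == "__main__":
    print(describe_schedule_policy(1.0))
    print(describe_device_health(0.0))
```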

NOTE:
  • To use the metrics listed in Table 3, ensure that the version of the CCE AI Suite (NVIDIA GPU) add-on is 2.1.30, 2.7.46, or later. If you require these metrics, upgrade the add-on.
  • Cloud Native Cluster Monitoring does not automatically collect GPU pod monitoring metrics. To view relevant data in the monitoring center, configure Cloud Native Cluster Monitoring to collect the necessary metrics by referring to "Monitoring" > "Collecting GPU Pod Monitoring Metrics and Setting Up a Grafana Dashboard" in Best Practices.
  • If the NVIDIA driver version is 510 or later, the gpu_pod_memory_used value may be inaccurate in full GPU mode:
    • In CCE AI Suite (NVIDIA GPU) versions earlier than 2.7.60 or 2.1.44, the gpu_pod_memory_used value may be approximately 250 MB higher than the actual usage. The difference is the memory reserved by the system for the GPU driver or firmware.
    • In CCE AI Suite (NVIDIA GPU) versions 2.7.60, 2.1.44, or later, the gpu_pod_memory_used value may be about 100 KB higher than the actual usage.

Table 3 GPU pod monitoring metrics

Metric

Type

Unit

Monitoring Level

Description

gpu_pod_core_percentage_total

Gauge

%

GPU processes

GPU compute allocated by a GPU card to GPU workloads, as a percentage of the compute of the entire GPU card. For example, a value of 30% means that 30% of the GPU card's compute is dedicated to the GPU workloads.

  • When GPU virtualization is disabled, the compute of the entire GPU card is used, resulting in a metric value of 100%.
  • When GPU virtualization is enabled, this metric aligns with the xgpu_core_percentage_total value.

gpu_pod_core_percentage_used

Gauge

%

GPU processes

GPU compute used by the GPU workloads, as a percentage of the compute of the entire GPU card. For example, a value of 30% means that the GPU workloads are actively using 30% of the GPU card's compute.

  • When GPU virtualization is disabled, this metric aligns with the cce_gpu_utilization value.
  • When GPU virtualization is enabled, this metric aligns with the xgpu_core_percentage_used value.

gpu_pod_memory_total

Gauge

Byte

GPU processes

GPU memory allocated by a GPU card to the GPU workloads, in bytes.

  • When GPU virtualization is disabled, this metric aligns with the cce_gpu_memory_total value.
  • When GPU virtualization is enabled, this metric aligns with the xgpu_memory_total value × 1024 × 1024.

gpu_pod_memory_used

Gauge

Byte

GPU processes

GPU memory used by the GPU workloads, in bytes.

  • When GPU virtualization is disabled, this metric aligns with the cce_gpu_memory_used value.
  • When GPU virtualization is enabled, this metric aligns with the xgpu_memory_used value × 1024 × 1024.
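
The relationships in Table 3 between the GPU pod metrics and the underlying card or xGPU metrics can be sketched in Python as follows. This is an illustration under stated assumptions, not the add-on's implementation: the function name and the dictionary-based input are hypothetical, and the × 1024 × 1024 factor is applied exactly as stated in the table rows.

```python
from typing import Dict

# Factor stated in Table 3 for the xGPU memory metrics.
MEMORY_FACTOR = 1024 * 1024


def expected_pod_metrics(samples: Dict[str, float],
                         virtualization_enabled: bool) -> Dict[str, float]:
    """Return the values the gpu_pod_* metrics are expected to align with.

    `samples` maps metric names from Table 1 or Table 2 to their
    current values (hypothetical input format).
    """
    if virtualization_enabled:
        # With GPU virtualization, pod metrics follow the xGPU metrics.
        return {
            "gpu_pod_core_percentage_total": samples["xgpu_core_percentage_total"],
            "gpu_pod_core_percentage_used": samples["xgpu_core_percentage_used"],
            "gpu_pod_memory_total": samples["xgpu_memory_total"] * MEMORY_FACTOR,
            "gpu_pod_memory_used": samples["xgpu_memory_used"] * MEMORY_FACTOR,
        }
    # Without GPU virtualization, the entire card is used.
    return {
        "gpu_pod_core_percentage_total": 100.0,
        "gpu_pod_core_percentage_used": samples["cce_gpu_utilization"],
        "gpu_pod_memory_total": samples["cce_gpu_memory_total"],
        "gpu_pod_memory_used": samples["cce_gpu_memory_used"],
    }


if __name__ == "__main__":
    # Illustrative sample values only, not measurements.
    xgpu = {
        "xgpu_core_percentage_total": 30.0,
        "xgpu_core_percentage_used": 12.0,
        "xgpu_memory_total": 4096.0,
        "xgpu_memory_used": 1024.0,
    }
    print(expected_pod_metrics(xgpu, virtualization_enabled=True))
```

A comparison of these expected values against the collected gpu_pod_* samples can help confirm that metric collection is configured as intended.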