Updated on 2025-07-17 GMT+08:00

GPU Metrics

The CCE AI Suite (NVIDIA GPU) add-on provides GPU monitoring metrics and integrates DCGM-Exporter for additional GPU observability. To use DCGM-Exporter, install version 2.7.32 or later of the add-on. This section describes the metrics provided by CCE AI Suite (NVIDIA GPU).

Billing

GPU metrics are custom metrics. If you report them to AOM, you are billed on a pay-per-use basis. To avoid unexpected fees, review Pricing Details carefully before enabling this function.

GPU Metrics Provided by CCE

Table 1 Basic GPU monitoring metrics

Category

Metric

Type

Unit

Monitoring Level

Description

Utilization

cce_gpu_utilization

Gauge

%

GPU cards

GPU compute usage

cce_gpu_memory_utilization

Gauge

%

GPU cards

GPU memory usage

cce_gpu_encoder_utilization

Gauge

%

GPU cards

GPU encoding usage

cce_gpu_decoder_utilization

Gauge

%

GPU cards

GPU decoding usage

cce_gpu_utilization_process

Gauge

%

GPU processes

GPU compute usage of each process

cce_gpu_memory_utilization_process

Gauge

%

GPU processes

GPU memory usage of each process

cce_gpu_encoder_utilization_process

Gauge

%

GPU processes

GPU encoding usage of each process

cce_gpu_decoder_utilization_process

Gauge

%

GPU processes

GPU decoding usage of each process

Memory

cce_gpu_memory_used

Gauge

Byte

GPU cards

Used GPU memory

NOTE:
If the NVIDIA driver version is 510 or later, the cce_gpu_memory_used value may be inaccurate in full GPU mode. The details are as follows:
  • In CCE AI Suite (NVIDIA GPU) of a version earlier than 2.7.60 or 2.1.44, the cce_gpu_memory_used value might be approximately 250 MB higher than the actual usage. This discrepancy reflects the memory reserved by the system for the GPU driver or firmware.
  • In CCE AI Suite (NVIDIA GPU) of version 2.7.60, 2.1.44, or later, the cce_gpu_memory_used value may be about 100 KB higher than the actual value.

cce_gpu_memory_total

Gauge

Byte

GPU cards

Total GPU memory

cce_gpu_memory_free

Gauge

Byte

GPU cards

Idle GPU memory

cce_gpu_bar1_memory_used

Gauge

Byte

GPU cards

Used GPU BAR1 memory

cce_gpu_bar1_memory_total

Gauge

Byte

GPU cards

Total GPU BAR1 memory

Frequency

cce_gpu_clock

Gauge

MHz

GPU cards

GPU clock frequency

cce_gpu_memory_clock

Gauge

MHz

GPU cards

GPU memory clock frequency

cce_gpu_graphics_clock

Gauge

MHz

GPU cards

GPU graphics clock frequency

cce_gpu_video_clock

Gauge

MHz

GPU cards

GPU video processor frequency

Physical status

cce_gpu_temperature

Gauge

°C

GPU cards

GPU temperature

cce_gpu_power_usage

Gauge

Milliwatt

GPU cards

GPU power usage

cce_gpu_total_energy_consumption

Gauge

Millijoule

GPU cards

Total GPU energy consumption

Bandwidth

cce_gpu_pcie_link_bandwidth

Gauge

bit

GPU cards

GPU PCIe bandwidth

cce_gpu_nvlink_bandwidth

Gauge

Gbit/s

GPU cards

GPU NVLink bandwidth

cce_gpu_pcie_throughput_rx

Gauge

KB/s

GPU cards

GPU PCIe RX bandwidth

cce_gpu_pcie_throughput_tx

Gauge

KB/s

GPU cards

GPU PCIe TX bandwidth

cce_gpu_nvlink_utilization_counter_rx

Gauge

KB/s

GPU cards

GPU NVLink RX bandwidth

cce_gpu_nvlink_utilization_counter_tx

Gauge

KB/s

GPU cards

GPU NVLink TX bandwidth

Memory isolation page

cce_gpu_retired_pages_sbe

Gauge

N/A

GPU cards

Number of isolated GPU memory pages with single-bit errors

cce_gpu_retired_pages_dbe

Gauge

N/A

GPU cards

Number of isolated GPU memory pages with double-bit errors
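The metrics in Table 1 are reported in base units (bytes, milliwatts, and millijoules), which can be awkward to read on a dashboard. The following is a minimal sketch of converting raw samples to friendlier units. The helper names and the sample values are hypothetical and are not part of the add-on.

```python
# Hypothetical raw samples keyed by metric name (the values are made up).
raw_samples = {
    "cce_gpu_memory_used": 8 * 1024**3,          # bytes
    "cce_gpu_memory_total": 16 * 1024**3,        # bytes
    "cce_gpu_power_usage": 250_000,              # milliwatts
    "cce_gpu_total_energy_consumption": 3.6e9,   # millijoules
}


def bytes_to_gib(value: float) -> float:
    """Convert bytes to gibibytes (1 GiB = 1024^3 bytes)."""
    return value / 1024**3


def milliwatts_to_watts(value: float) -> float:
    """Convert milliwatts to watts."""
    return value / 1000


def millijoules_to_kwh(value: float) -> float:
    """Convert millijoules to kilowatt-hours (1 kWh = 3.6e9 mJ)."""
    return value / 3.6e9


print(bytes_to_gib(raw_samples["cce_gpu_memory_used"]))                    # 8.0
print(milliwatts_to_watts(raw_samples["cce_gpu_power_usage"]))             # 250.0
print(millijoules_to_kwh(raw_samples["cce_gpu_total_energy_consumption"])) # 1.0
```

Keeping the conversions in small named functions makes it harder to mix up the units when CCE metrics (milliwatts) are compared with DCGM metrics (watts).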

Table 2 xGPU monitoring metrics

Metric

Type

Unit

Monitoring Level

Description

xgpu_memory_total

Gauge

Byte

GPU processes

Total xGPU memory

xgpu_memory_used

Gauge

Byte

GPU processes

Used xGPU memory

xgpu_core_percentage_total

Gauge

%

GPU processes

Total xGPU cores

xgpu_core_percentage_used

Gauge

%

GPU processes

Used xGPU cores

gpu_schedule_policy

Gauge

N/A

GPU cards

xGPU scheduling policy. Options:

  • 0: xGPU memory is isolated and cores are shared.
  • 1: Both xGPU memory and cores are isolated.
  • 2: Default mode. The card is not used by any xGPU device for allocation.

xgpu_device_health

Gauge

N/A

GPU cards

xGPU device health. Options:

  • 0: The xGPU device is healthy.
  • 1: The xGPU device is unhealthy.

NOTE:
  • To use the metrics listed in Table 3, ensure that the version of the CCE AI Suite (NVIDIA GPU) add-on is 2.1.30, 2.7.46, or later. If you require these metrics, promptly upgrade the add-on.
  • Cloud Native Cluster Monitoring does not automatically collect GPU pod monitoring metrics. To view relevant data in the monitoring center, configure Cloud Native Cluster Monitoring to collect necessary metrics by referring to "Monitoring" > "Collecting GPU Pod Monitoring Metrics and Setting Up a Grafana Dashboard" in Best Practices.
  • If the NVIDIA driver version is 510 or later, the gpu_pod_memory_used value may be inaccurate in full GPU mode. The details are as follows:
    • In CCE AI Suite (NVIDIA GPU) of a version earlier than 2.7.60 or 2.1.44, the gpu_pod_memory_used value might be approximately 250 MB higher than the actual usage. This discrepancy reflects the memory reserved by the system for the GPU driver or firmware.
    • In CCE AI Suite (NVIDIA GPU) of version 2.7.60, 2.1.44, or later, the gpu_pod_memory_used value may be about 100 KB higher than the actual value.

Table 3 GPU pod monitoring metrics

Metric

Type

Unit

Monitoring Level

Description

gpu_pod_core_percentage_total

Gauge

%

GPU processes

GPU compute allocated by a GPU card to GPU workloads, measured as a percentage of the compute of an entire GPU card. For example, a value of 30% means that 30% of the GPU card's compute is allocated to the GPU workloads.

  • When GPU virtualization is disabled, the compute of the entire GPU card is used, resulting in a metric value of 100%.
  • When GPU virtualization is enabled, this metric aligns with the xgpu_core_percentage_total value.

gpu_pod_core_percentage_used

Gauge

%

GPU processes

Used GPU compute, that is, the GPU compute used by the GPU workloads, measured as a percentage of the compute of an entire GPU card. For example, a value of 30% means that the GPU workloads are actively using 30% of the GPU card's compute.

  • When GPU virtualization is disabled, this metric aligns with the cce_gpu_utilization value.
  • When GPU virtualization is enabled, this metric aligns with the xgpu_core_percentage_used value.

gpu_pod_memory_total

Gauge

Byte

GPU processes

GPU memory allocated by a GPU card to the GPU workloads. It is measured in bytes.

  • When GPU virtualization is disabled, this metric aligns with the cce_gpu_memory_total value.
  • When GPU virtualization is enabled, this metric aligns with the xgpu_memory_total value × 1024 × 1024.

gpu_pod_memory_used

Gauge

Byte

GPU processes

Used GPU memory, that is, the GPU memory used by the GPU workloads. It is measured in bytes.

  • When GPU virtualization is disabled, this metric aligns with the cce_gpu_memory_used value.
  • When GPU virtualization is enabled, this metric aligns with the xgpu_memory_used value × 1024 × 1024.

GPU Metrics Provided by DCGM

Table 4 Utilization

Metric

Type

Unit

Description

DCGM_FI_DEV_GPU_UTIL

Gauge

%

GPU utilization, that is, the percentage of time during which one or more kernel functions were running on the GPU within a sampling period (1s or 1/6s, depending on the GPU model).

This metric shows only whether kernel functions occupied the GPU. It does not show how much of the GPU's capacity they used.

DCGM_FI_DEV_MEM_COPY_UTIL

Gauge

%

GPU memory bandwidth utilization of a measured object

For example, the maximum memory bandwidth of NVIDIA GPU V100 is 900 GB/s. If the current memory bandwidth is 450 GB/s, the memory bandwidth usage is 50%.

DCGM_FI_DEV_ENC_UTIL

Gauge

%

GPU encoder utilization of a measured object

DCGM_FI_DEV_DEC_UTIL

Gauge

%

GPU decoder utilization of a measured object
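The memory bandwidth example given for DCGM_FI_DEV_MEM_COPY_UTIL in Table 4 is a simple ratio. The following minimal sketch reproduces that arithmetic; the function name is hypothetical, and DCGM reports the percentage directly rather than computing it this way.

```python
def memory_bandwidth_utilization(current_gb_per_s: float, max_gb_per_s: float) -> float:
    """Return the current memory bandwidth as a percentage of the maximum."""
    if max_gb_per_s <= 0:
        raise ValueError("max_gb_per_s must be positive")
    return 100.0 * current_gb_per_s / max_gb_per_s


# Values from the example above: 450 GB/s on a GPU whose maximum is 900 GB/s.
print(memory_bandwidth_utilization(450, 900))  # 50.0
```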

Table 5 Memory

Metric

Type

Unit

Description

DCGM_FI_DEV_FB_FREE

Gauge

MB

Amount of remaining GPU memory

DCGM_FI_DEV_FB_USED

Gauge

MB

Amount of used GPU memory

The value is the same as the value of Memory-Usage in the nvidia-smi command.

Table 6 Profiling

Metric

Type

Unit

Description

DCGM_FI_PROF_GR_ENGINE_ACTIVE

Gauge

%

Percentage of time during which the graphics or compute engine is active within a period.

This is an average value across all graphics or compute engines.

A graphics or compute engine is active when a graphics or compute context is bound to it and the graphics or compute pipe is busy.

DCGM_FI_PROF_SM_ACTIVE

Gauge

%

Fraction of time during which at least one warp (thread bundle) is active on an SM within a period.

This is an average value across all SMs and is insensitive to the number of threads in each block.

A warp is active once it has been scheduled and allocated resources. An active warp may be computing or may be stalled (for example, waiting for a memory request).

A value below 0.5 usually indicates that the GPU is not used efficiently. For efficient use, the value should be greater than 0.8.

For example, a GPU has N SMs:

  • A kernel function uses N thread blocks to run on all SMs in a period. In this case, the value is 1 (100%).
  • A kernel function runs N/5 thread blocks in a period. In this case, the value is 0.2.
  • A kernel function uses N thread blocks and runs only 1/5 of cycles in a period. In this case, the value is 0.2.

DCGM_FI_PROF_SM_OCCUPANCY

Gauge

%

Ratio of the number of warps (thread bundles) resident on an SM to the maximum number of warps that the SM can hold within a period.

This is an average value across all SMs within a period.

A higher value does not necessarily mean more efficient GPU usage. Only for workloads limited by GPU memory bandwidth (see DCGM_FI_PROF_DRAM_ACTIVE) does a higher value indicate more efficient GPU usage.

DCGM_FI_PROF_PIPE_TENSOR_ACTIVE

Gauge

%

Fraction of cycles during which the tensor (HMMA/IMMA) pipe is active.

This is an average value within a period, not an instantaneous value.

A higher value indicates a higher utilization of tensor cores.

A value of 1 (100%) is equivalent to issuing a tensor instruction every other cycle for the entire period (one instruction is completed in two cycles).

If the value is 0.2 (20%), the possible causes are as follows:

  • During the entire period, 20% of the SM tensor cores run at 100% utilization.
  • During the entire period, all SM tensor cores run at 20% utilization.
  • During 1/5 of the entire period, all SM tensor cores run at 100% utilization.
  • Other combinations

DCGM_FI_PROF_PIPE_FP64_ACTIVE

Gauge

%

Fraction of cycles during which the FP64 (double precision) pipe is active.

This is an average value within a period, not an instantaneous value.

A larger value indicates a higher usage of FP64 cores.

A value of 1 (100%) is equivalent to executing an FP64 instruction every fourth cycle (on Volta cards, for example) for the entire period.

If the value is 0.2 (20%), the possible causes are as follows:

  • During the entire period, 20% of the SM FP64 cores run at 100% utilization.
  • During the entire period, all SM FP64 cores run at 20% utilization.
  • During 1/5 of the entire period, all SM FP64 cores run at 100% utilization.
  • Other combinations

DCGM_FI_PROF_PIPE_FP32_ACTIVE

Gauge

%

Fraction of cycles during which the fused multiply-add (FMA) pipe is active. Multiply-add applies to FP32 (single precision) and integers.

This is an average value within a period, not an instantaneous value.

A larger value indicates a higher usage of FP32 cores.

A value of 1 (100%) is equivalent to executing an FP32 instruction every other cycle (on Volta cards, for example) for the entire period.

If the value is 0.2 (20%), the possible causes are as follows:

  • During the entire period, 20% of the SM FP32 cores run at 100% utilization.
  • During the entire period, all SM FP32 cores run at 20% utilization.
  • During 1/5 of the entire period, all SM FP32 cores run at 100% utilization.
  • Other combinations

DCGM_FI_PROF_PIPE_FP16_ACTIVE

Gauge

%

Fraction of cycles during which the FP16 (half-precision) pipe is active.

This is an average value within a period, not an instantaneous value.

A larger value indicates a higher usage of FP16 cores.

A value of 1 (100%) is equivalent to executing an FP16 instruction every other cycle (on Volta cards, for example) for the entire period.

If the value is 0.2 (20%), the possible causes are as follows:

  • During the entire period, 20% of the SM FP16 cores run at 100% utilization.
  • During the entire period, all SM FP16 cores run at 20% utilization.
  • During 1/5 of the entire period, all SM FP16 cores run at 100% utilization.
  • Other combinations

DCGM_FI_PROF_DRAM_ACTIVE

Gauge

%

Fraction of cycles during which data is sent to or received from device memory.

This is an average value within a period, not an instantaneous value.

A higher value indicates a higher utilization of device memory.

A value of 1 (100%) is equivalent to executing a DRAM instruction in every cycle throughout the period. In practice, a peak of about 0.8 (80%) is the maximum achievable.

A value of 0.2 (20%) means that data is read from or written to device memory in 20% of the cycles within the period.

DCGM_FI_PROF_PCIE_TX_BYTES

DCGM_FI_PROF_PCIE_RX_BYTES

Counter

Byte/s

Rate of data transmitted or received over the PCIe bus, including both the protocol header and the data payload.

This is an average value within a period, not an instantaneous value. For example, if 1 GB of data is transmitted within 1 second, the rate is 1 GB/s regardless of whether the data is transmitted at a constant rate or in bursts.

Theoretically, the maximum PCIe Gen3 bandwidth is 985 MB/s per lane.

DCGM_FI_PROF_NVLINK_RX_BYTES

DCGM_FI_PROF_NVLINK_TX_BYTES

Counter

Byte/s

Rate at which data is transmitted or received through NVLink, excluding the protocol header.

This is an average value within a period, not an instantaneous value. For example, if 1 GB of data is transmitted within 1 second, the rate is 1 GB/s regardless of whether the data is transmitted at a constant rate or in bursts.

Theoretically, the maximum NVLink Gen2 bandwidth is 25 GB/s per link in each direction.
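Two of the worked examples in Table 6 can be checked with a short calculation. The sketch below models DCGM_FI_PROF_SM_ACTIVE as the mean of per-SM active fractions, and compares a measured PCIe rate with the theoretical PCIe Gen3 maximum of 985 MB/s quoted in the table. The function names, the 80-SM GPU, and the x16 link width are assumptions for illustration only; DCGM derives these values from hardware counters, not from code like this.

```python
from statistics import fmean

# Theoretical PCIe Gen3 maximum per lane (channel), as quoted in Table 6.
PCIE_GEN3_MB_PER_S_PER_LANE = 985


def sm_active(per_sm_fractions):
    """Average, over all SMs, of the fraction of time each SM had an active thread bundle (warp)."""
    return fmean(per_sm_fractions)


def pcie_gen3_utilization(rate_bytes_per_s, lanes=16):
    """Measured rate as a fraction of the theoretical PCIe Gen3 maximum for the given lane count."""
    max_bytes_per_s = PCIE_GEN3_MB_PER_S_PER_LANE * 1_000_000 * lanes
    return rate_bytes_per_s / max_bytes_per_s


n = 80  # hypothetical number of SMs

# N/5 of the SMs are busy for the whole period, and the rest are idle.
one_fifth_of_sms = [1.0] * (n // 5) + [0.0] * (n - n // 5)
# All N SMs are busy, but only for 1/5 of the period.
one_fifth_of_time = [0.2] * n

print(round(sm_active(one_fifth_of_sms), 3))     # 0.2
print(round(sm_active(one_fifth_of_time), 3))    # 0.2
print(round(pcie_gen3_utilization(7.88e9), 3))   # 0.5
```

Both SM scenarios produce the same value, which is why a reading of 0.2 alone cannot tell you whether few SMs are busy all the time or all SMs are busy for a short time.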

Table 7 Frequency (clock)

Metric

Type

Unit

Description

DCGM_FI_DEV_SM_CLOCK

Gauge

MHz

SM clock for the device

DCGM_FI_DEV_MEM_CLOCK

Gauge

MHz

Memory clock for the device

DCGM_FI_DEV_APP_SM_CLOCK

Gauge

MHz

SM application clocks

DCGM_FI_DEV_APP_MEM_CLOCK

Gauge

MHz

Memory application clocks

DCGM_FI_DEV_CLOCK_THROTTLE_REASONS

Gauge

N/A

Bitmask of the reasons why the clock is currently throttled
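DCGM_FI_DEV_CLOCK_THROTTLE_REASONS is reported as an integer bitmask rather than as a frequency, so it has to be decoded before it can be read. The following is a minimal sketch of such a decoder. The bit values follow the NVML clock throttle reason constants as commonly documented; treat them as an assumption and verify them against the nvml.h header of your driver version. The class and function names are hypothetical.

```python
from enum import IntFlag


class ThrottleReason(IntFlag):
    """Assumed bit layout, mirroring the nvmlClocksThrottleReason* constants."""
    GPU_IDLE = 0x1
    APPLICATIONS_CLOCKS_SETTING = 0x2
    SW_POWER_CAP = 0x4
    HW_SLOWDOWN = 0x8
    SYNC_BOOST = 0x10
    SW_THERMAL_SLOWDOWN = 0x20
    HW_THERMAL_SLOWDOWN = 0x40
    HW_POWER_BRAKE_SLOWDOWN = 0x80
    DISPLAY_CLOCK_SETTING = 0x100


def decode_throttle_reasons(value):
    """Return the names of all throttle reasons whose bit is set in the sample value."""
    mask = int(value)  # Prometheus samples are floats, so convert first.
    return [reason.name for reason in ThrottleReason if mask & reason]


print(decode_throttle_reasons(0x44))  # ['SW_POWER_CAP', 'HW_THERMAL_SLOWDOWN']
print(decode_throttle_reasons(0))     # []
```

A value of 0 means that no throttle reason is active. Several bits can be set at the same time.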

Table 8 XID errors and violations

Metric

Type

Unit

Description

DCGM_FI_DEV_XID_ERRORS

Gauge

N/A

Value of the most recent XID error

DCGM_FI_DEV_POWER_VIOLATION

Counter

μs

Accumulated time during which the GPU was throttled because of the power limit.

DCGM_FI_DEV_THERMAL_VIOLATION

Counter

μs

Accumulated time during which the GPU was throttled because of the thermal limit.

DCGM_FI_DEV_SYNC_BOOST_VIOLATION

Counter

μs

Accumulated time during which the GPU was throttled because of the synchronous boost limit.

DCGM_FI_DEV_BOARD_LIMIT_VIOLATION

Counter

μs

Accumulated time during which the GPU was throttled because of the board limit.

DCGM_FI_DEV_LOW_UTIL_VIOLATION

Counter

μs

Accumulated time during which the GPU was throttled because of the low utilization limit.

DCGM_FI_DEV_RELIABILITY_VIOLATION

Counter

μs

Accumulated time during which the GPU was throttled because of the reliability limit.
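The violation metrics in Table 8 are counters measured in microseconds, so two samples are needed to tell how much of a time window was affected. The following minimal sketch assumes that each counter accumulates the time spent in violation. The function name and the sample values are hypothetical.

```python
def throttled_fraction(start_us: float, end_us: float, window_seconds: float) -> float:
    """Fraction of the window spent in violation, from two counter samples in microseconds."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    delta_us = end_us - start_us
    if delta_us < 0:
        # The counter went backwards, so assume that it was reset within the window.
        delta_us = end_us
    return min(delta_us / (window_seconds * 1_000_000), 1.0)


# Hypothetical samples taken 60 seconds apart: 3 seconds of additional violation time.
print(throttled_fraction(1_000_000, 4_000_000, 60))  # 0.05
```

A result of 0.05 means that the GPU was limited for about 5% of the window, which is usually easier to alert on than the raw counter value.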

Table 9 BAR1

Metric

Type

Unit

Description

DCGM_FI_DEV_BAR1_USED

Gauge

MB

Used BAR1 memory

DCGM_FI_DEV_BAR1_FREE

Gauge

MB

Free BAR1 memory

Table 10 Temperature and power

Metric

Type

Unit

Description

DCGM_FI_DEV_MEMORY_TEMP

Gauge

°C

Memory temperature

DCGM_FI_DEV_GPU_TEMP

Gauge

°C

GPU temperature

DCGM_FI_DEV_POWER_USAGE

Gauge

Watt

GPU power usage

DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION

Counter

Millijoule

Total energy consumed since the driver was last loaded
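Because DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION is a counter in millijoules, the average power over an interval can be derived from two samples. The following is a minimal sketch of that derivation. The function name and the sample values are hypothetical.

```python
def average_power_watts(energy_start_mj: float, energy_end_mj: float, interval_seconds: float) -> float:
    """Average power in watts between two energy counter samples given in millijoules."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    delta_joules = (energy_end_mj - energy_start_mj) / 1000  # mJ -> J
    return delta_joules / interval_seconds                   # J/s = W


# Hypothetical samples taken 60 seconds apart: 15,000 J consumed in between.
print(average_power_watts(1_000_000, 16_000_000, 60))  # 250.0
```

The result can be compared with DCGM_FI_DEV_POWER_USAGE, which reports power in watts, to check whether a momentary reading is representative of the whole interval.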

Table 11 Retired pages

Metric

Type

Unit

Description

DCGM_FI_DEV_RETIRED_SBE

Gauge

N/A

Number of pages retired due to single-bit errors

DCGM_FI_DEV_RETIRED_DBE

Gauge

N/A

Number of pages retired due to double-bit errors

For details about other DCGM metrics, see Field Identifiers.