GPU Metrics

The CCE AI Suite (NVIDIA GPU) add-on provides GPU monitoring metrics and integrates DCGM-Exporter for additional GPU observability. To use DCGM-Exporter, install version 2.7.32 or later of the add-on. This section describes the metrics provided by CCE AI Suite (NVIDIA GPU), including those exported by DCGM-Exporter.

Billing

GPU metrics are custom metrics. If you report them to AOM, you are billed on a pay-per-use basis. To avoid unexpected fees, review Pricing Details before enabling this function.

GPU Metrics Provided by CCE

**Table 1** Basic GPU monitoring metrics
Category	Metric	Type	Unit	Monitoring Level	Description
Utilization	cce_gpu_utilization	Gauge	%	GPU cards	GPU compute usage
	cce_gpu_memory_utilization	Gauge	%	GPU cards	GPU memory usage
	cce_gpu_encoder_utilization	Gauge	%	GPU cards	GPU encoding usage
	cce_gpu_decoder_utilization	Gauge	%	GPU cards	GPU decoding usage
	cce_gpu_utilization_process	Gauge	%	GPU processes	GPU compute usage of each process
	cce_gpu_memory_utilization_process	Gauge	%	GPU processes	GPU memory usage of each process
	cce_gpu_encoder_utilization_process	Gauge	%	GPU processes	GPU encoding usage of each process
	cce_gpu_decoder_utilization_process	Gauge	%	GPU processes	GPU decoding usage of each process
Memory	cce_gpu_memory_used	Gauge	Byte	GPU cards	Used GPU memory. NOTE: If the NVIDIA driver version is 510 or later, the cce_gpu_memory_used value may be inaccurate in full GPU mode: (1) In CCE AI Suite (NVIDIA GPU) versions earlier than 2.7.60 or 2.1.44, the value may be about 250 MB higher than the actual usage. The difference is the memory that the system reserves for the GPU driver or firmware. (2) In CCE AI Suite (NVIDIA GPU) version 2.7.60, 2.1.44, or later, the value may be about 100 KB higher than the actual usage.
	cce_gpu_memory_total	Gauge	Byte	GPU cards	Total GPU memory
	cce_gpu_memory_free	Gauge	Byte	GPU cards	Free GPU memory
	cce_gpu_bar1_memory_used	Gauge	Byte	GPU cards	Used GPU BAR1 memory
	cce_gpu_bar1_memory_total	Gauge	Byte	GPU cards	Total GPU BAR1 memory
Frequency	cce_gpu_clock	Gauge	MHz	GPU cards	GPU clock frequency
	cce_gpu_memory_clock	Gauge	MHz	GPU cards	GPU memory clock frequency
	cce_gpu_graphics_clock	Gauge	MHz	GPU cards	GPU graphics clock frequency
	cce_gpu_video_clock	Gauge	MHz	GPU cards	GPU video processor frequency
Physical status	cce_gpu_temperature	Gauge	°C	GPU cards	GPU temperature
	cce_gpu_power_usage	Gauge	Milliwatt	GPU cards	GPU power
	cce_gpu_total_energy_consumption	Gauge	Millijoule	GPU cards	Total GPU energy consumption
Bandwidth	cce_gpu_pcie_link_bandwidth	Gauge	bit/s	GPU cards	GPU PCIe bandwidth
	cce_gpu_nvlink_bandwidth	Gauge	Gbit/s	GPU cards	GPU NVLink bandwidth
	cce_gpu_pcie_throughput_rx	Gauge	KB/s	GPU cards	GPU PCIe RX bandwidth
	cce_gpu_pcie_throughput_tx	Gauge	KB/s	GPU cards	GPU PCIe TX bandwidth
	cce_gpu_nvlink_utilization_counter_rx	Gauge	KB/s	GPU cards	GPU NVLink RX bandwidth
	cce_gpu_nvlink_utilization_counter_tx	Gauge	KB/s	GPU cards	GPU NVLink TX bandwidth
Retired memory pages	cce_gpu_retired_pages_sbe	Gauge	N/A	GPU cards	Number of retired (isolated) GPU memory pages with single-bit errors
	cce_gpu_retired_pages_dbe	Gauge	N/A	GPU cards	Number of retired (isolated) GPU memory pages with double-bit errors
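
As a rough illustration of how the units in Table 1 might be handled by a metrics consumer, the following Python sketch parses a few lines of Prometheus-style text and converts the raw values into more readable units. The sample text, the `gpu_index` label, and the helper names are hypothetical. Only the metric names and their units come from the table.

```python
import re

# Hypothetical scrape output for a single GPU; the "gpu_index" label is an assumption.
SAMPLE = """\
cce_gpu_utilization{gpu_index="0"} 37
cce_gpu_memory_used{gpu_index="0"} 8589934592
cce_gpu_memory_total{gpu_index="0"} 17179869184
cce_gpu_power_usage{gpu_index="0"} 152340
"""

# Matches "name{labels} value" or "name value"; labels are captured but not used here.
LINE_RE = re.compile(
    r"^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)$"
)

def parse_samples(text):
    """Return {metric_name: value}. Labels are ignored, so this suits one GPU only."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if match:
            values[match.group("name")] = float(match.group("value"))
    return values

def summarize(values):
    """Convert Table 1 units: bytes to GiB, milliwatts to watts."""
    used = values["cce_gpu_memory_used"]
    total = values["cce_gpu_memory_total"]
    return {
        "compute_util_percent": values["cce_gpu_utilization"],
        "memory_used_gib": used / 1024**3,
        "memory_used_percent": 100.0 * used / total,
        "power_watts": values["cce_gpu_power_usage"] / 1000.0,
    }

summary = summarize(parse_samples(SAMPLE))
print(summary)
```

For a node with several GPUs, the parsed values would need to be keyed by label set as well as by metric name.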

**Table 2** xGPU monitoring metrics
Metric	Type	Unit	Monitoring Level	Description
xgpu_memory_total	Gauge	Byte	GPU processes	Total xGPU memory
xgpu_memory_used	Gauge	Byte	GPU processes	Used xGPU memory
xgpu_core_percentage_total	Gauge	%	GPU processes	Total xGPU cores
xgpu_core_percentage_used	Gauge	%	GPU processes	Used xGPU cores
gpu_schedule_policy	Gauge	N/A	GPU cards	xGPU scheduling policy. Options: 0 (xGPU memory is isolated and cores are shared); 1 (both xGPU memory and cores are isolated); 2 (default mode: the card is not used by any xGPU device for allocation).
xgpu_device_health	Gauge	N/A	GPU cards	xGPU device health. Options: 0 (the xGPU device is healthy); 1 (the xGPU device is unhealthy).
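
The numeric codes of gpu_schedule_policy and xgpu_device_health in Table 2 are easier to read on a dashboard once they are mapped to text. The following Python sketch shows one possible mapping. The function names and the label wording are illustrative. Only the code values and their meanings come from the table.

```python
# Code values taken from Table 2; the wording of each label is illustrative.
SCHEDULE_POLICY = {
    0: "memory isolated, cores shared",
    1: "memory and cores isolated",
    2: "default mode (card not used for xGPU allocation)",
}
DEVICE_HEALTH = {0: "healthy", 1: "unhealthy"}

def describe_policy(value):
    """Map a gpu_schedule_policy sample to text; unknown codes are reported as such."""
    return SCHEDULE_POLICY.get(int(value), f"unknown policy {value}")

def describe_health(value):
    """Map an xgpu_device_health sample to text."""
    return DEVICE_HEALTH.get(int(value), f"unknown health code {value}")

def xgpu_memory_percent(used, total):
    """xgpu_memory_used as a percentage of xgpu_memory_total, or None if total is 0."""
    if total <= 0:
        return None
    return 100.0 * used / total

print(describe_policy(1.0))
print(describe_health(0))
print(xgpu_memory_percent(2048, 8192))
```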

The metrics listed in Table 3 require CCE AI Suite (NVIDIA GPU) version 2.1.30, 2.7.46, or later. If you need these metrics, upgrade the add-on first.
Cloud Native Cluster Monitoring does not automatically collect GPU pod monitoring metrics. To view the data in the monitoring center, configure Cloud Native Cluster Monitoring to collect the necessary metrics by referring to "Monitoring" > "Collecting GPU Pod Monitoring Metrics and Setting Up a Grafana Dashboard" in Best Practices.
If the NVIDIA driver version is 510 or later, the gpu_pod_memory_used value may be inaccurate in full GPU mode. The details are as follows:
- In CCE AI Suite (NVIDIA GPU) versions earlier than 2.7.60 or 2.1.44, the gpu_pod_memory_used value may be about 250 MB higher than the actual usage. The difference is the memory that the system reserves for the GPU driver or firmware.
- In CCE AI Suite (NVIDIA GPU) version 2.7.60, 2.1.44, or later, the gpu_pod_memory_used value may be about 100 KB higher than the actual usage.
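
The version conditions in the notes above can be checked programmatically. The following Python sketch is one way to do so, under several assumptions that are not stated in this section: the add-on follows two release lines (2.1.x and 2.7.x), any release line newer than 2.7 includes the changes, and "MB" and "KB" are binary units. All function and constant names are hypothetical.

```python
def parse_version(text):
    """Turn a version string such as '2.7.46' into the tuple (2, 7, 46)."""
    return tuple(int(part) for part in text.strip().split("."))

# Minimum versions per release line, taken from the notes above.
POD_METRICS_MIN = {(2, 1): (2, 1, 30), (2, 7): (2, 7, 46)}
SMALL_OFFSET_MIN = {(2, 1): (2, 1, 44), (2, 7): (2, 7, 60)}

def meets_minimum(version, minimums):
    parsed = parse_version(version)
    line = parsed[:2]
    if line in minimums:
        return parsed >= minimums[line]
    # Assumption: only release lines newer than the newest listed one qualify.
    return line > max(minimums)

def supports_pod_metrics(addon_version):
    """True if the add-on version should expose the Table 3 metrics."""
    return meets_minimum(addon_version, POD_METRICS_MIN)

def approx_memory_offset_bytes(addon_version, driver_major):
    """Approximate amount by which gpu_pod_memory_used may exceed real usage
    in full GPU mode. Returns None when no offset is documented (driver < 510)."""
    if driver_major < 510:
        return None
    if meets_minimum(addon_version, SMALL_OFFSET_MIN):
        return 100 * 1024          # about 100 KB (binary units assumed)
    return 250 * 1024 * 1024       # about 250 MB (binary units assumed)

print(supports_pod_metrics("2.7.46"))
print(approx_memory_offset_bytes("2.7.32", 535))
```

Because the offsets are approximate, a check like this is better suited to annotating a dashboard than to correcting the reported values.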

**Table 3** GPU pod monitoring metrics
Metric	Type	Unit	Monitoring Level	Description
gpu_pod_core_percentage_total	Gauge	%	GPU processes	GPU compute allocated by a GPU card to GPU workloads. It is measured in percentages relative to the compute of an entire GPU card. For example, a setting of 30% means that 30% of the GPU card's compute is dedicated to processing GPU virtualization workloads. When GPU virtualization is disabled, the compute of the entire GPU card is used, resulting in a metric value of 100%. When GPU virtualization is enabled, this metric aligns with the xgpu_core_percentage_total value.
gpu_pod_core_percentage_used	Gauge	%	GPU processes	Used GPU compute, that is, the GPU compute used by the GPU workloads. It is measured in percentages relative to the compute of an entire GPU card. For example, a value of 30% means that the GPU workloads are actively using 30% of the GPU card's compute. When GPU virtualization is disabled, this metric aligns with the cce_gpu_utilization value. When GPU virtualization is enabled, this metric aligns with the xgpu_core_percentage_used value.
gpu_pod_memory_total	Gauge	Byte	GPU processes	GPU memory allocated by a GPU card to the GPU workloads. It is measured in bytes. When GPU virtualization is disabled, this metric aligns with the cce_gpu_memory_total value. When GPU virtualization is enabled, this metric aligns with the xgpu_memory_total value × 1024 × 1024.
gpu_pod_memory_used	Gauge	Byte	GPU processes	Used GPU memory, that is, the GPU memory used by the GPU workloads. It is measured in bytes. When GPU virtualization is disabled, this metric aligns with the cce_gpu_memory_used value. When GPU virtualization is enabled, this metric aligns with the xgpu_memory_used value × 1024 × 1024.
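
The relationships described in Table 3 can be written as small helper functions, which may help when cross-checking the pod metrics against the card and xGPU metrics. This is a minimal sketch with hypothetical function names. The × 1024 × 1024 factor is taken directly from the table descriptions.

```python
MIB = 1024 * 1024

def expected_pod_memory_bytes(virtualization_enabled, cce_gpu_memory_bytes, xgpu_memory_value):
    """Value that gpu_pod_memory_total (or gpu_pod_memory_used) should align with.

    With GPU virtualization enabled, Table 3 gives xgpu value x 1024 x 1024.
    Otherwise the card-level cce_gpu_memory_* value applies unchanged.
    """
    if virtualization_enabled:
        return xgpu_memory_value * MIB
    return cce_gpu_memory_bytes

def expected_pod_core_percentage_total(virtualization_enabled, xgpu_core_percentage_total):
    """Without GPU virtualization the whole card is used, so the value is 100%."""
    return xgpu_core_percentage_total if virtualization_enabled else 100.0

# A 16 GiB card, with a hypothetical xGPU allocation reported as 4096.
print(expected_pod_memory_bytes(True, 17179869184, 4096))
print(expected_pod_memory_bytes(False, 17179869184, 4096))
print(expected_pod_core_percentage_total(True, 30.0))
```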

GPU Metrics Provided by DCGM

**Table 4** Utilization
Metric	Type	Unit	Description
DCGM_FI_DEV_GPU_UTIL	Gauge	%	GPU utilization, that is, the percentage of time within a sampling period (1 s or 1/6 s, depending on the GPU model) during which one or more kernel functions are active. This metric shows only that kernel functions were running on the GPU, not how much of the GPU they used.
DCGM_FI_DEV_MEM_COPY_UTIL	Gauge	%	GPU memory bandwidth utilization of a measured object. For example, the maximum memory bandwidth of an NVIDIA V100 GPU is 900 GB/s. If the current memory bandwidth is 450 GB/s, the memory bandwidth utilization is 50%.
DCGM_FI_DEV_ENC_UTIL	Gauge	%	GPU encoder utilization of a measured object
DCGM_FI_DEV_DEC_UTIL	Gauge	%	GPU decoder utilization of a measured object
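
The V100 example given for DCGM_FI_DEV_MEM_COPY_UTIL can be reproduced with a one-line calculation. The function name below is illustrative.

```python
def bandwidth_utilization_percent(current_gb_per_s, peak_gb_per_s):
    """Memory bandwidth in use as a percentage of the peak bandwidth."""
    if peak_gb_per_s <= 0:
        raise ValueError("peak bandwidth must be positive")
    return 100.0 * current_gb_per_s / peak_gb_per_s

# Figures from the table: 450 GB/s in use on a GPU with a 900 GB/s peak.
v100_utilization = bandwidth_utilization_percent(450, 900)
print(v100_utilization)  # 50.0
```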

**Table 5** Memory
Metric	Type	Unit	Description
DCGM_FI_DEV_FB_FREE	Gauge	MB	Amount of remaining GPU memory
DCGM_FI_DEV_FB_USED	Gauge	MB	Amount of used GPU memory. The value is the same as the Memory-Usage value reported by the nvidia-smi command.
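
The two values in Table 5 can be combined into a usage percentage, or converted to bytes for comparison with the byte-based CCE metrics. The sketch below assumes that used plus free memory approximates the total (memory reserved by the driver is not accounted for) and that "MB" is a binary unit by default. Both are assumptions, and the function names are hypothetical.

```python
def fb_used_percent(fb_used_mb, fb_free_mb):
    """Approximate memory usage from DCGM_FI_DEV_FB_USED and DCGM_FI_DEV_FB_FREE.

    Assumes used + free is close to the total; reserved memory is ignored.
    """
    total = fb_used_mb + fb_free_mb
    return 100.0 * fb_used_mb / total if total > 0 else 0.0

def mb_to_bytes(value_mb, binary=True):
    """Convert MB to bytes; binary=True treats 1 MB as 1024 x 1024 bytes (assumption)."""
    factor = 1024 * 1024 if binary else 1000 * 1000
    return int(value_mb * factor)

print(fb_used_percent(4096, 12288))
print(mb_to_bytes(4096))
```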

**Table 6** Profiling
Metric	Type	Unit	Description
DCGM_FI_PROF_GR_ENGINE_ACTIVE	Gauge	%	Percentage of time within a period during which the graphics or compute engine is active. This is an average over all graphics and compute engines. A graphics or compute engine is active when a graphics or compute context is bound to a thread and the context is busy.
DCGM_FI_PROF_SM_ACTIVE	Gauge	%	Fraction of time within a period during which at least one warp is active on an SM. This is an average over all SMs and is insensitive to the number of threads in each block. A warp is active once it has been scheduled and allocated resources. An active warp may be computing or not computing (for example, waiting for a memory request). A value below 0.5 indicates that the GPU is not used efficiently. The value should be greater than 0.8. For example, on a GPU with N SMs: (1) A kernel function that uses N thread blocks and runs on all SMs for the entire period yields a value of 1 (100%). (2) A kernel function that runs N/5 thread blocks for the entire period yields a value of 0.2. (3) A kernel function that uses N thread blocks but runs for only 1/5 of the cycles in the period yields a value of 0.2.
DCGM_FI_PROF_SM_OCCUPANCY	Gauge	%	Ratio of the number of warps resident on an SM to the maximum number of warps that the SM can hold, averaged over all SMs within a period. A high value does not necessarily mean efficient GPU usage. Only for workloads that are limited by GPU memory bandwidth (see DCGM_FI_PROF_DRAM_ACTIVE) does a higher value indicate more efficient GPU usage.
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE	Gauge	%	Fraction of cycles during which the tensor (HMMA/IMMA) pipe is active. This is an average value within a period, not an instantaneous value. A higher value indicates higher utilization of tensor cores. A value of 1 (100%) indicates that a tensor instruction is issued every other cycle throughout the period (one instruction completes in two cycles). A value of 0.2 (20%) can result from any of the following: (1) 20% of the SM tensor cores run at 100% utilization for the entire period. (2) All SM tensor cores run at 20% utilization for the entire period. (3) All SM tensor cores run at 100% utilization for 1/5 of the period. (4) Other combinations.
DCGM_FI_PROF_PIPE_FP64_ACTIVE	Gauge	%	Fraction of cycles during which the FP64 (double-precision) pipe is active. This is an average value within a period, not an instantaneous value. A higher value indicates higher utilization of FP64 cores. A value of 1 (100%) indicates that an FP64 instruction is executed every four cycles (for example, on Volta cards) throughout the period. A value of 0.2 (20%) can result from any of the following: (1) 20% of the SM FP64 cores run at 100% utilization for the entire period. (2) All SM FP64 cores run at 20% utilization for the entire period. (3) All SM FP64 cores run at 100% utilization for 1/5 of the period. (4) Other combinations.
DCGM_FI_PROF_PIPE_FP32_ACTIVE	Gauge	%	Fraction of cycles during which the fused multiply-add (FMA) pipe is active. Multiply-add applies to FP32 (single-precision) and integer operations. This is an average value within a period, not an instantaneous value. A higher value indicates higher utilization of FP32 cores. A value of 1 (100%) indicates that an FP32 instruction is executed every two cycles (for example, on Volta cards) throughout the period. A value of 0.2 (20%) can result from any of the following: (1) 20% of the SM FP32 cores run at 100% utilization for the entire period. (2) All SM FP32 cores run at 20% utilization for the entire period. (3) All SM FP32 cores run at 100% utilization for 1/5 of the period. (4) Other combinations.
DCGM_FI_PROF_PIPE_FP16_ACTIVE	Gauge	%	Fraction of cycles during which the FP16 (half-precision) pipe is active. This is an average value within a period, not an instantaneous value. A higher value indicates higher utilization of FP16 cores. A value of 1 (100%) indicates that an FP16 instruction is executed every two cycles (for example, on Volta cards) throughout the period. A value of 0.2 (20%) can result from any of the following: (1) 20% of the SM FP16 cores run at 100% utilization for the entire period. (2) All SM FP16 cores run at 20% utilization for the entire period. (3) All SM FP16 cores run at 100% utilization for 1/5 of the period. (4) Other combinations.
DCGM_FI_PROF_DRAM_ACTIVE	Gauge	%	Fraction of cycles during which data is sent to or received from device memory. This is an average value within a period, not an instantaneous value. A higher value indicates higher utilization of device memory. A value of 1 (100%) would mean that a DRAM instruction is executed in every cycle throughout the period, although in practice the achievable peak is about 0.8 (80%). A value of 0.2 (20%) means that 20% of the cycles within the period involve reading from or writing to device memory.
DCGM_FI_PROF_PCIE_TX_BYTES, DCGM_FI_PROF_PCIE_RX_BYTES	Counter	Byte/s	Rate of data transmitted or received over the PCIe bus, including both the protocol headers and the data payloads. The rate is averaged over a period and is not an instantaneous value. For example, if 1 GB of data is transmitted within 1 second, the rate is 1 GB/s regardless of whether the data is transmitted at a constant rate or in bursts. The theoretical maximum PCIe Gen3 bandwidth is 985 MB/s per lane.
DCGM_FI_PROF_NVLINK_RX_BYTES, DCGM_FI_PROF_NVLINK_TX_BYTES	Counter	Byte/s	Rate of data transmitted or received over NVLink, excluding the protocol headers. The rate is averaged over a period and is not an instantaneous value. For example, if 1 GB of data is transmitted within 1 second, the rate is 1 GB/s regardless of whether the data is transmitted at a constant rate or in bursts. The theoretical maximum NVLink Gen2 bandwidth is 25 GB/s per link in each direction.
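
Several profiling metrics in Table 6 are averages over both time and hardware units, which is why quite different workloads can produce the same value. The Python sketch below illustrates this with the 0.2 (20%) example, applies the 0.5 and 0.8 guidance given for DCGM_FI_PROF_SM_ACTIVE, and computes the theoretical PCIe Gen3 peak for a given lane count. It treats the values as fractions between 0 and 1, as the descriptions do. The function names, the 10-unit and 5-slice layout, and the category labels are illustrative only.

```python
import math

def mean_activity(samples):
    """samples[t][u] is the activity (0..1) of hardware unit u during time slice t."""
    flat = [value for time_slice in samples for value in time_slice]
    return sum(flat) / len(flat)

UNITS, SLICES = 10, 5

# 20% of the units fully busy for the whole period.
some_units_busy = [[1.0] * 2 + [0.0] * 8 for _ in range(SLICES)]
# All units 20% busy for the whole period.
all_units_partly_busy = [[0.2] * UNITS for _ in range(SLICES)]
# All units fully busy during 1/5 of the period.
all_units_busy_briefly = [[1.0] * UNITS] + [[0.0] * UNITS for _ in range(SLICES - 1)]

for scenario in (some_units_busy, all_units_partly_busy, all_units_busy_briefly):
    print(round(mean_activity(scenario), 3))  # each scenario averages to 0.2

def classify_sm_active(value):
    """Apply the guidance from the table: below 0.5 is inefficient, above 0.8 is the target."""
    if value < 0.5:
        return "inefficient"
    if value > 0.8:
        return "target met"
    return "below target"

def pcie_gen3_peak_mb_per_s(lanes):
    """Theoretical PCIe Gen3 peak: 985 MB/s per lane, as stated in the table."""
    return 985 * lanes

print(classify_sm_active(0.3), classify_sm_active(0.9))
print(pcie_gen3_peak_mb_per_s(16))
```

Because the three scenarios are indistinguishable from the averaged value alone, it is usually worth reading these metrics together with DCGM_FI_PROF_SM_ACTIVE and DCGM_FI_PROF_SM_OCCUPANCY.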

**Table 7** Frequency (clock)
Metric	Type	Unit	Description
DCGM_FI_DEV_SM_CLOCK	Gauge	MHz	SM clock for the device
DCGM_FI_DEV_MEM_CLOCK	Gauge	MHz	Memory clock for the device
DCGM_FI_DEV_APP_SM_CLOCK	Gauge	MHz	SM application clocks
DCGM_FI_DEV_APP_MEM_CLOCK	Gauge	MHz	Memory application clocks
DCGM_FI_DEV_CLOCK_THROTTLE_REASONS	Gauge	N/A	Bitmask of the reasons why the clock is currently throttled
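
DCGM_FI_DEV_CLOCK_THROTTLE_REASONS is most useful once its value is decoded into individual reasons. The sketch below assumes that the value is a bitmask that follows the clocks throttle reason constants in the NVML header. That assumption, the bit values, and the short names are not taken from this section, so verify them against the NVML documentation for your driver version.

```python
# Assumed bit values, following the NVML clocks throttle reason constants.
THROTTLE_REASONS = {
    0x001: "gpu_idle",
    0x002: "applications_clocks_setting",
    0x004: "sw_power_cap",
    0x008: "hw_slowdown",
    0x010: "sync_boost",
    0x020: "sw_thermal_slowdown",
    0x040: "hw_thermal_slowdown",
    0x080: "hw_power_brake_slowdown",
    0x100: "display_clock_setting",
}

def decode_throttle_reasons(mask):
    """Return the names of all reason bits set in the metric value."""
    mask = int(mask)
    return [name for bit, name in THROTTLE_REASONS.items() if mask & bit]

print(decode_throttle_reasons(0))      # no throttling reported
print(decode_throttle_reasons(0x24))   # software power cap and software thermal slowdown
```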

**Table 8** XID errors and violations
Metric	Type	Unit	Description
DCGM_FI_DEV_XID_ERRORS	Gauge	N/A	Error code of the most recent XID error that occurred within a period
DCGM_FI_DEV_POWER_VIOLATION	Counter	μs	Violation caused by the power limit. The value is the cumulative duration of the violation.
DCGM_FI_DEV_THERMAL_VIOLATION	Counter	μs	Violation caused by the thermal limit. The value is the cumulative duration of the violation.
DCGM_FI_DEV_SYNC_BOOST_VIOLATION	Counter	μs	Violation caused by the synchronous boost limit. The value is the cumulative duration of the violation.
DCGM_FI_DEV_BOARD_LIMIT_VIOLATION	Counter	μs	Violation caused by the board limit. The value is the cumulative duration of the violation.
DCGM_FI_DEV_LOW_UTIL_VIOLATION	Counter	μs	Violation caused by the low utilization limit. The value is the cumulative duration of the violation.
DCGM_FI_DEV_RELIABILITY_VIOLATION	Counter	μs	Violation caused by the reliability limit. The value is the cumulative duration of the violation.
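
The violation counters in Table 8 are usually read as a rate of change rather than as absolute values. The sketch below estimates the fraction of a scrape interval that a GPU spent under a limit. It assumes that each counter accumulates violation time in microseconds, and the function name and reset handling are illustrative.

```python
def throttled_fraction(prev_us, curr_us, interval_s):
    """Fraction of the interval (0..1) spent in violation between two samples.

    prev_us and curr_us are two readings of the same counter in microseconds,
    taken interval_s seconds apart.
    """
    if interval_s <= 0:
        raise ValueError("interval must be positive")
    delta_us = curr_us - prev_us
    if delta_us < 0:
        # The counter went backwards, for example after a driver reload.
        # Treat the current reading as the time accumulated since the reset.
        delta_us = curr_us
    return min(1.0, delta_us / (interval_s * 1_000_000))

# 3 seconds of power-limit violation during a 30-second scrape interval.
print(throttled_fraction(1_000_000, 4_000_000, 30))
```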

**Table 9** BAR1
Metric	Type	Unit	Description
DCGM_FI_DEV_BAR1_USED	Gauge	MB	Used BAR1 memory
DCGM_FI_DEV_BAR1_FREE	Gauge	MB	Free BAR1 memory

**Table 10** Temperature and power
Metric	Type	Unit	Description
DCGM_FI_DEV_MEMORY_TEMP	Gauge	°C	Memory temperature
DCGM_FI_DEV_GPU_TEMP	Gauge	°C	GPU temperature
DCGM_FI_DEV_POWER_USAGE	Gauge	Watt	GPU power
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION	Counter	Millijoule	Total energy consumed since the driver was last loaded
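
Because DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION is a counter in millijoules, the average power over an interval can be derived from two samples and compared with DCGM_FI_DEV_POWER_USAGE, which is reported in watts. The following sketch uses hypothetical function names and sample values.

```python
def average_power_watts(prev_mj, curr_mj, interval_s):
    """Average power between two energy samples taken interval_s seconds apart.

    1 W = 1 J/s, and the counter is in millijoules, hence the division by 1000.
    """
    if interval_s <= 0:
        raise ValueError("interval must be positive")
    delta_mj = curr_mj - prev_mj
    if delta_mj < 0:
        raise ValueError("counter decreased; the driver may have been reloaded")
    return delta_mj / 1000.0 / interval_s

def milliwatts_to_watts(value_mw):
    """Convert cce_gpu_power_usage (milliwatts) to watts for comparison."""
    return value_mw / 1000.0

# 3,000,000 mJ consumed over 15 seconds corresponds to an average of 200 W.
print(average_power_watts(1_000_000, 4_000_000, 15))
print(milliwatts_to_watts(200_000))
```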

**Table 11** Retired pages
Metric	Type	Unit	Description
DCGM_FI_DEV_RETIRED_SBE	Gauge	N/A	Number of pages retired due to single-bit errors
DCGM_FI_DEV_RETIRED_DBE	Gauge	N/A	Number of pages retired due to double-bit errors