GPU Metrics

The CCE AI Suite (NVIDIA GPU) add-on provides GPU monitoring metrics that extend GPU observability in a cluster. This section describes those metrics.

GPU Metrics Provided by CCE

**Table 1** Basic GPU monitoring metrics
Category	Metric	Type	Unit	Monitoring Level	Description
Utilization	cce_gpu_utilization	Gauge	%	GPU cards	GPU compute usage
	cce_gpu_memory_utilization	Gauge	%	GPU cards	GPU memory usage
	cce_gpu_encoder_utilization	Gauge	%	GPU cards	GPU encoding usage
	cce_gpu_decoder_utilization	Gauge	%	GPU cards	GPU decoding usage
	cce_gpu_utilization_process	Gauge	%	GPU processes	GPU compute usage of each process
	cce_gpu_memory_utilization_process	Gauge	%	GPU processes	GPU memory usage of each process
	cce_gpu_encoder_utilization_process	Gauge	%	GPU processes	GPU encoding usage of each process
	cce_gpu_decoder_utilization_process	Gauge	%	GPU processes	GPU decoding usage of each process
Memory	cce_gpu_memory_used	Gauge	Byte	GPU cards	Used GPU memory. NOTE: If the NVIDIA driver version is 510 or later, the cce_gpu_memory_used value may be inaccurate in full GPU mode: (1) With CCE AI Suite (NVIDIA GPU) versions earlier than 2.1.44 or 2.7.60, the value may be about 250 MB higher than the actual usage. This discrepancy reflects the memory reserved by the system for the GPU driver or firmware. (2) With CCE AI Suite (NVIDIA GPU) version 2.1.44, 2.7.60, or later, the value may be about 100 KB higher than the actual usage.
	cce_gpu_memory_total	Gauge	Byte	GPU cards	Total GPU memory
	cce_gpu_memory_free	Gauge	Byte	GPU cards	Free GPU memory
	cce_gpu_bar1_memory_used	Gauge	Byte	GPU cards	Used GPU BAR1 memory
	cce_gpu_bar1_memory_total	Gauge	Byte	GPU cards	Total GPU BAR1 memory
Frequency	cce_gpu_clock	Gauge	MHz	GPU cards	GPU clock frequency
	cce_gpu_memory_clock	Gauge	MHz	GPU cards	GPU memory clock frequency
	cce_gpu_graphics_clock	Gauge	MHz	GPU cards	GPU graphics clock frequency
	cce_gpu_video_clock	Gauge	MHz	GPU cards	GPU video processor frequency
Physical status	cce_gpu_temperature	Gauge	°C	GPU cards	GPU temperature
	cce_gpu_power_usage	Gauge	Milliwatt	GPU cards	GPU power usage
	cce_gpu_total_energy_consumption	Gauge	Millijoule	GPU cards	Total GPU energy consumption
Bandwidth	cce_gpu_pcie_link_bandwidth	Gauge	bit	GPU cards	GPU PCIe bandwidth
	cce_gpu_nvlink_bandwidth	Gauge	Gbit/s	GPU cards	GPU NVLink bandwidth
	cce_gpu_pcie_throughput_rx	Gauge	KB/s	GPU cards	GPU PCIe RX bandwidth
	cce_gpu_pcie_throughput_tx	Gauge	KB/s	GPU cards	GPU PCIe TX bandwidth
	cce_gpu_nvlink_utilization_counter_rx	Gauge	KB/s	GPU cards	GPU NVLink RX bandwidth
	cce_gpu_nvlink_utilization_counter_tx	Gauge	KB/s	GPU cards	GPU NVLink TX bandwidth
Retired memory pages	cce_gpu_retired_pages_sbe	Gauge	N/A	GPU cards	Number of GPU memory pages retired because of single-bit errors
	cce_gpu_retired_pages_dbe	Gauge	N/A	GPU cards	Number of GPU memory pages retired because of double-bit errors
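
The raw units in Table 1 (bytes, milliwatts, millijoules) are usually converted before they are displayed. The following Python sketch illustrates that arithmetic. The function names are hypothetical helpers and are not part of the add-on; the inputs are assumed to be samples already read from the corresponding metrics.

```python
def memory_used_percent(used_bytes: float, total_bytes: float) -> float:
    """Percentage of GPU memory in use.

    used_bytes:  a sample of cce_gpu_memory_used (bytes)
    total_bytes: a sample of cce_gpu_memory_total (bytes)
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * used_bytes / total_bytes


def milliwatts_to_watts(milliwatts: float) -> float:
    """Convert a cce_gpu_power_usage sample from milliwatts to watts."""
    return milliwatts / 1000.0


def millijoules_to_kwh(millijoules: float) -> float:
    """Convert a cce_gpu_total_energy_consumption sample to kilowatt-hours.

    1 J = 1000 mJ and 1 kWh = 3,600,000 J.
    """
    joules = millijoules / 1000.0
    return joules / 3_600_000.0


if __name__ == "__main__":
    # Illustrative values only; they are not measurements.
    print(memory_used_percent(8 * 1024**3, 16 * 1024**3))
    print(milliwatts_to_watts(250_000))
    print(millijoules_to_kwh(3.6e9))
```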

**Table 2** xGPU monitoring metrics
Metric	Type	Unit	Monitoring Level	Description
xgpu_memory_total	Gauge	Byte	GPU processes	Total xGPU memory
xgpu_memory_used	Gauge	Byte	GPU processes	Used xGPU memory
xgpu_core_percentage_total	Gauge	%	GPU processes	Total xGPU cores
xgpu_core_percentage_used	Gauge	%	GPU processes	Used xGPU cores
gpu_schedule_policy	Gauge	N/A	GPU cards	xGPU scheduling policy. Options: 0: xGPU memory is isolated and cores are shared. 1: Both xGPU memory and cores are isolated. 2: Default mode. The card is not currently used by any xGPU device for allocation.
xgpu_device_health	Gauge	N/A	GPU cards	xGPU device health. Options: 0: The xGPU device is healthy. 1: The xGPU device is unhealthy.
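
The gpu_schedule_policy and xgpu_device_health gauges carry numeric codes and not measurements. A minimal Python sketch of decoding them into readable labels follows. The dictionaries restate the options in Table 2; the function names and label strings are hypothetical, and values outside the documented set are reported as unknown, not guessed.

```python
# Codes documented for gpu_schedule_policy in Table 2.
SCHEDULE_POLICY = {
    0: "memory isolated, cores shared",
    1: "memory and cores isolated",
    2: "default (card not used for xGPU allocation)",
}

# Codes documented for xgpu_device_health in Table 2.
XGPU_DEVICE_HEALTH = {
    0: "healthy",
    1: "unhealthy",
}


def describe_schedule_policy(value: float) -> str:
    """Map a gpu_schedule_policy sample to a label."""
    return SCHEDULE_POLICY.get(int(value), f"unknown policy ({value})")


def describe_device_health(value: float) -> str:
    """Map an xgpu_device_health sample to a label."""
    return XGPU_DEVICE_HEALTH.get(int(value), f"unknown health ({value})")


if __name__ == "__main__":
    # Gauge samples are typically delivered as floats.
    print(describe_schedule_policy(1.0))
    print(describe_device_health(0.0))
```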

The metrics listed in Table 3 require CCE AI Suite (NVIDIA GPU) version 2.1.30, 2.7.46, or later. If you need these metrics and are running an earlier version, upgrade the add-on.
Cloud Native Cluster Monitoring does not automatically collect GPU pod monitoring metrics. To view relevant data in the monitoring center, configure Cloud Native Cluster Monitoring to collect necessary metrics by referring to "Monitoring" > "Collecting GPU Pod Monitoring Metrics and Setting Up a Grafana Dashboard" in Best Practices.
If the NVIDIA driver version is 510 or later, the gpu_pod_memory_used value may be inaccurate in full GPU mode:
- With CCE AI Suite (NVIDIA GPU) versions earlier than 2.1.44 or 2.7.60, the gpu_pod_memory_used value may be about 250 MB higher than the actual usage. This discrepancy reflects the memory reserved by the system for the GPU driver or firmware.
- With CCE AI Suite (NVIDIA GPU) version 2.1.44, 2.7.60, or later, the gpu_pod_memory_used value may be about 100 KB higher than the actual usage.
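
The version thresholds above can be sketched as a small lookup. This Python example assumes that the add-on is released on two lines, 2.1.x and 2.7.x, and that each threshold applies only within its own line; the text does not say how other release lines behave, so the sketch rejects them. The function names are hypothetical, and the result applies only to full GPU mode with NVIDIA driver 510 or later.

```python
# Patch-level thresholds per release line, taken from the notes above.
# Assumption: each threshold is compared only within its own release line.
REDUCED_OVERHEAD_PATCH = {(2, 1): 44, (2, 7): 60}


def parse_version(version: str) -> tuple:
    """Parse 'MAJOR.MINOR.PATCH' into a tuple of three integers."""
    parts = tuple(int(p) for p in version.split("."))
    if len(parts) != 3:
        raise ValueError(f"expected MAJOR.MINOR.PATCH, got {version!r}")
    return parts


def memory_used_overestimate(addon_version: str) -> str:
    """Return the approximate overestimate of the memory-used metrics."""
    major, minor, patch = parse_version(addon_version)
    if (major, minor) not in REDUCED_OVERHEAD_PATCH:
        # The source gives no threshold for other release lines.
        raise ValueError(f"no documented threshold for {major}.{minor}.x")
    threshold = REDUCED_OVERHEAD_PATCH[(major, minor)]
    return "about 100 KB" if patch >= threshold else "about 250 MB"


if __name__ == "__main__":
    for v in ("2.1.30", "2.1.44", "2.7.59", "2.7.60"):
        print(v, memory_used_overestimate(v))
```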

**Table 3** GPU pod monitoring metrics
Metric	Type	Unit	Monitoring Level	Description
gpu_pod_core_percentage_total	Gauge	%	GPU processes	GPU compute allocated by a GPU card to GPU workloads. It is measured as a percentage of the compute of an entire GPU card. For example, a value of 30% means that 30% of the GPU card's compute is dedicated to processing GPU virtualization workloads. When GPU virtualization is disabled, the compute of the entire GPU card is used, resulting in a metric value of 100%. When GPU virtualization is enabled, this metric aligns with the xgpu_core_percentage_total value.
gpu_pod_core_percentage_used	Gauge	%	GPU processes	Used GPU compute, that is, the GPU compute used by the GPU workloads. It is measured as a percentage of the compute of an entire GPU card. For example, a value of 30% means that the GPU workloads are actively using 30% of the GPU card's compute. When GPU virtualization is disabled, this metric aligns with the cce_gpu_utilization value. When GPU virtualization is enabled, this metric aligns with the xgpu_core_percentage_used value.
gpu_pod_memory_total	Gauge	Byte	GPU processes	GPU memory allocated by a GPU card to the GPU workloads. It is measured in bytes. When GPU virtualization is disabled, this metric aligns with the cce_gpu_memory_total value. When GPU virtualization is enabled, this metric aligns with the xgpu_memory_total value × 1024 × 1024.
gpu_pod_memory_used	Gauge	Byte	GPU processes	Used GPU memory, that is, the GPU memory used by the GPU workloads. It is measured in bytes. When GPU virtualization is disabled, this metric aligns with the cce_gpu_memory_used value. When GPU virtualization is enabled, this metric aligns with the xgpu_memory_used value × 1024 × 1024.
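
The alignment rules for the pod memory metrics can be sketched as follows. This Python example follows the descriptions in Table 3 literally, including the × 1024 × 1024 factor applied to the xGPU values when GPU virtualization is enabled. The function and parameter names are hypothetical, and the inputs are assumed to be samples already read from the named metrics.

```python
XGPU_FACTOR = 1024 * 1024  # multiplier stated in Table 3 for xGPU memory values


def expected_pod_memory_total(virtualization_enabled: bool,
                              cce_gpu_memory_total: float,
                              xgpu_memory_total: float) -> float:
    """Value that gpu_pod_memory_total is described as aligning with."""
    if virtualization_enabled:
        return xgpu_memory_total * XGPU_FACTOR
    return cce_gpu_memory_total


def expected_pod_memory_used(virtualization_enabled: bool,
                             cce_gpu_memory_used: float,
                             xgpu_memory_used: float) -> float:
    """Value that gpu_pod_memory_used is described as aligning with."""
    if virtualization_enabled:
        return xgpu_memory_used * XGPU_FACTOR
    return cce_gpu_memory_used


if __name__ == "__main__":
    # Illustrative values only; they are not measurements.
    print(expected_pod_memory_total(True, 0, 4096))
    print(expected_pod_memory_used(False, 2 * 1024**3, 0))
```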