GPU Metrics
The CCE AI Suite (NVIDIA GPU) add-on provides GPU monitoring metrics and integrates DCGM-Exporter for additional GPU observability. To use DCGM-Exporter, install version 2.7.32 or later of the add-on. This section describes the metrics provided by CCE AI Suite (NVIDIA GPU).
Billing
GPU metrics are custom metrics. If you report them to AOM, you will be billed on a pay-per-use basis. To avoid unexpected fees, review Pricing Details carefully before enabling this function.
GPU Metrics Provided by CCE
Category | Metric | Type | Unit | Monitoring Level | Description
---|---|---|---|---|---
Utilization | cce_gpu_utilization | Gauge | % | GPU cards | GPU compute usage
 | cce_gpu_memory_utilization | Gauge | % | GPU cards | GPU memory usage
 | cce_gpu_encoder_utilization | Gauge | % | GPU cards | GPU encoding usage
 | cce_gpu_decoder_utilization | Gauge | % | GPU cards | GPU decoding usage
 | cce_gpu_utilization_process | Gauge | % | GPU processes | GPU compute usage of each process
 | cce_gpu_memory_utilization_process | Gauge | % | GPU processes | GPU memory usage of each process
 | cce_gpu_encoder_utilization_process | Gauge | % | GPU processes | GPU encoding usage of each process
 | cce_gpu_decoder_utilization_process | Gauge | % | GPU processes | GPU decoding usage of each process
Memory | cce_gpu_memory_used | Gauge | Byte | GPU cards | Used GPU memory. NOTE: If the NVIDIA driver version is 510 or later, this value may be inaccurate in full GPU mode.
 | cce_gpu_memory_total | Gauge | Byte | GPU cards | Total GPU memory
 | cce_gpu_memory_free | Gauge | Byte | GPU cards | Free GPU memory
 | cce_gpu_bar1_memory_used | Gauge | Byte | GPU cards | Used GPU BAR1 memory
 | cce_gpu_bar1_memory_total | Gauge | Byte | GPU cards | Total GPU BAR1 memory
Frequency | cce_gpu_clock | Gauge | MHz | GPU cards | GPU clock frequency
 | cce_gpu_memory_clock | Gauge | MHz | GPU cards | GPU memory clock frequency
 | cce_gpu_graphics_clock | Gauge | MHz | GPU cards | GPU graphics clock frequency
 | cce_gpu_video_clock | Gauge | MHz | GPU cards | GPU video processor clock frequency
Physical status | cce_gpu_temperature | Gauge | °C | GPU cards | GPU temperature
 | cce_gpu_power_usage | Gauge | Milliwatt | GPU cards | GPU power usage
 | cce_gpu_total_energy_consumption | Gauge | Millijoule | GPU cards | Total GPU energy consumption
Bandwidth | cce_gpu_pcie_link_bandwidth | Gauge | bit/s | GPU cards | GPU PCIe link bandwidth
 | cce_gpu_nvlink_bandwidth | Gauge | Gbit/s | GPU cards | GPU NVLink bandwidth
 | cce_gpu_pcie_throughput_rx | Gauge | KB/s | GPU cards | GPU PCIe RX throughput
 | cce_gpu_pcie_throughput_tx | Gauge | KB/s | GPU cards | GPU PCIe TX throughput
 | cce_gpu_nvlink_utilization_counter_rx | Gauge | KB/s | GPU cards | GPU NVLink RX throughput
 | cce_gpu_nvlink_utilization_counter_tx | Gauge | KB/s | GPU cards | GPU NVLink TX throughput
Memory isolation page | cce_gpu_retired_pages_sbe | Gauge | N/A | GPU cards | Number of GPU memory pages isolated due to single-bit errors
 | cce_gpu_retired_pages_dbe | Gauge | N/A | GPU cards | Number of GPU memory pages isolated due to double-bit errors
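These metrics are exposed in Prometheus exposition format, so they can be read through any Prometheus-compatible query endpoint once collection is configured. The following is a minimal sketch, not an official tool: the endpoint URL is a placeholder, and it assumes cce_gpu_utilization is already being scraped in your monitoring setup.

```python
# Minimal sketch: run a PromQL instant query against a
# Prometheus-compatible endpoint for the cce_gpu_utilization metric.
# PROM_URL is a hypothetical placeholder; replace it with your own
# Prometheus or AOM-compatible query endpoint.
import json
import urllib.parse
import urllib.request

PROM_URL = "http://prometheus.example.com:9090"  # hypothetical endpoint

def instant_query(expr: str) -> list:
    """Run a PromQL instant query and return the result vector."""
    url = f"{PROM_URL}/api/v1/query?" + urllib.parse.urlencode({"query": expr})
    with urllib.request.urlopen(url) as resp:
        body = json.load(resp)
    if body.get("status") != "success":
        raise RuntimeError(f"query failed: {body}")
    return body["data"]["result"]

# GPU compute usage (%) per GPU card, as reported by the add-on.
for series in instant_query("cce_gpu_utilization"):
    print(series["metric"], "=", series["value"][1], "%")
```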
xGPU Metrics

Metric | Type | Unit | Monitoring Level | Description
---|---|---|---|---
xgpu_memory_total | Gauge | Byte | GPU processes | Total xGPU memory
xgpu_memory_used | Gauge | Byte | GPU processes | Used xGPU memory
xgpu_core_percentage_total | Gauge | % | GPU processes | Total xGPU cores
xgpu_core_percentage_used | Gauge | % | GPU processes | Used xGPU cores
gpu_schedule_policy | Gauge | N/A | GPU cards | xGPU scheduling policy
xgpu_device_health | Gauge | N/A | GPU cards | xGPU device health status

- To use the metrics listed in the following table, ensure that the CCE AI Suite (NVIDIA GPU) add-on is of version 2.1.30, 2.7.46, or later. If you require these metrics, upgrade the add-on promptly.
- Cloud Native Cluster Monitoring does not collect GPU pod monitoring metrics automatically. To view the data in the monitoring center, configure Cloud Native Cluster Monitoring to collect the required metrics by referring to "Monitoring" > "Collecting GPU Pod Monitoring Metrics and Setting Up a Grafana Dashboard" in Best Practices.
- If the NVIDIA driver version is 510 or later, the gpu_pod_memory_used value may be inaccurate in full GPU mode:
  - In CCE AI Suite (NVIDIA GPU) versions earlier than 2.7.60 or 2.1.44, the value may be about 250 MB higher than the actual usage. The difference is the memory the system reserves for the GPU driver or firmware.
  - In CCE AI Suite (NVIDIA GPU) 2.7.60, 2.1.44, or later, the value may be about 100 KB higher than the actual value.
GPU Pod Monitoring Metrics

Metric | Type | Unit | Monitoring Level | Description
---|---|---|---|---
gpu_pod_core_percentage_total | Gauge | % | GPU processes | GPU compute allocated by a GPU card to GPU workloads, as a percentage of the entire card's compute. For example, 30% means that 30% of the card's compute is dedicated to GPU virtualization workloads.
gpu_pod_core_percentage_used | Gauge | % | GPU processes | GPU compute actually used by the GPU workloads, as a percentage of the entire card's compute. For example, 30% means that the workloads are actively using 30% of the card's compute.
gpu_pod_memory_total | Gauge | Byte | GPU processes | GPU memory allocated by a GPU card to the GPU workloads, in bytes.
gpu_pod_memory_used | Gauge | Byte | GPU processes | GPU memory used by the GPU workloads, in bytes.
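Because each pod-level metric comes in a used/total pair, usage ratios can be derived with simple PromQL arithmetic. Below is a sketch reusing the instant_query() helper from the earlier example; it assumes the used and total series carry matching label sets, which depends on how your collector is configured.

```python
# Sketch: derive per-workload usage ratios from the pod-level metrics
# above. Assumes instant_query() from the earlier example is in scope
# and that used/total series share labels (a configuration assumption).
queries = {
    "GPU memory usage (%)": "gpu_pod_memory_used / gpu_pod_memory_total * 100",
    "GPU compute usage (%)": "gpu_pod_core_percentage_used / gpu_pod_core_percentage_total * 100",
}
for name, expr in queries.items():
    for series in instant_query(expr):
        print(name, series["metric"], round(float(series["value"][1]), 1))
```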
GPU Metrics Provided by DCGM
Utilization

Metric | Type | Unit | Description
---|---|---|---
DCGM_FI_DEV_GPU_UTIL | Gauge | % | GPU utilization: the fraction of time during which one or more kernels were active within a sample period (1s or 1/6s, depending on the GPU model). It only shows that the GPU was in use; it does not show how intensively the GPU was used.
DCGM_FI_DEV_MEM_COPY_UTIL | Gauge | % | GPU memory bandwidth utilization of a measured object. For example, the maximum memory bandwidth of an NVIDIA V100 GPU is 900 GB/s; if the current memory bandwidth is 450 GB/s, the utilization is 50%.
DCGM_FI_DEV_ENC_UTIL | Gauge | % | GPU encoder utilization of a measured object
DCGM_FI_DEV_DEC_UTIL | Gauge | % | GPU decoder utilization of a measured object
Memory

Metric | Type | Unit | Description
---|---|---|---
DCGM_FI_DEV_FB_FREE | Gauge | MB | Amount of free GPU memory
DCGM_FI_DEV_FB_USED | Gauge | MB | Amount of used GPU memory. The value matches Memory-Usage in the nvidia-smi output.
Profiling

Metric | Type | Unit | Description
---|---|---|---
DCGM_FI_PROF_GR_ENGINE_ACTIVE | Gauge | % | Fraction of time within a period during which the graphics or compute engine was active, averaged over all graphics/compute engines. An engine is active when a graphics or compute context is bound to a thread and that context is busy.
DCGM_FI_PROF_SM_ACTIVE | Gauge | % | Fraction of time within a period during which at least one warp (thread bundle) was active on an SM, averaged over all SMs; it is insensitive to the number of threads per block. A warp is active from the time it is scheduled and allocated resources, whether it is computing or stalled (for example, waiting on a memory request). A value below 0.5 indicates inefficient GPU usage; the value should be above 0.8. See the worked example after this table.
DCGM_FI_PROF_SM_OCCUPANCY | Gauge | % | Ratio of the number of warps resident on an SM to the maximum number of warps the SM can host, averaged over all SMs within a period. A high value does not by itself mean high GPU usage; higher occupancy indicates more efficient GPU usage only for memory bandwidth-bound workloads (see DCGM_FI_PROF_DRAM_ACTIVE).
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE | Gauge | % | Fraction of cycles during which the tensor (HMMA/IMMA) pipe was active; an average over a period, not an instantaneous value. A higher value indicates higher tensor core utilization. A value of 1 (100%) means a tensor instruction was issued every other cycle throughout the period (each instruction completes in two cycles); a value of 0.2 (20%) means the tensor pipe was active in only 20% of the cycles.
DCGM_FI_PROF_PIPE_FP64_ACTIVE | Gauge | % | Fraction of cycles during which the FP64 (double-precision) pipe was active; an average over a period, not an instantaneous value. A higher value indicates higher FP64 core utilization. A value of 1 (100%) means an FP64 instruction was executed every four cycles (for example, on Volta cards) throughout the period; a value of 0.2 (20%) means the pipe was active in only 20% of the cycles.
DCGM_FI_PROF_PIPE_FP32_ACTIVE | Gauge | % | Fraction of cycles during which the fused multiply-add (FMA) pipe was active; multiply-add covers FP32 (single precision) and integers. An average over a period, not an instantaneous value. A higher value indicates higher FP32 core utilization. A value of 1 (100%) means an FP32 instruction was executed every two cycles (for example, on Volta cards) throughout the period; a value of 0.2 (20%) means the pipe was active in only 20% of the cycles.
DCGM_FI_PROF_PIPE_FP16_ACTIVE | Gauge | % | Fraction of cycles during which the FP16 (half-precision) pipe was active; an average over a period, not an instantaneous value. A higher value indicates higher FP16 core utilization. A value of 1 (100%) means an FP16 instruction was executed every two cycles (for example, on Volta cards) throughout the period; a value of 0.2 (20%) means the pipe was active in only 20% of the cycles.
DCGM_FI_PROF_DRAM_ACTIVE | Gauge | % | Fraction of cycles during which data was sent to or received from device memory; an average over a period, not an instantaneous value. A higher value indicates higher device memory utilization. A value of 1 (100%) means a DRAM instruction was executed every cycle throughout the period (in practice, a peak of around 0.8 (80%) is the maximum achievable); a value of 0.2 (20%) means 20% of the cycles were spent reading from or writing to device memory.
DCGM_FI_PROF_PCIE_TX_BYTES / DCGM_FI_PROF_PCIE_RX_BYTES | Counter | Byte/s | Rate of data transmitted or received over the PCIe bus, including protocol headers and data payloads; averaged over the period, not instantaneous. For example, if 1 GB of data is transferred within 1 second, the rate is 1 GB/s regardless of whether the transfer was steady or bursty. The theoretical maximum PCIe Gen3 bandwidth is 985 MB/s per lane.
DCGM_FI_PROF_NVLINK_RX_BYTES / DCGM_FI_PROF_NVLINK_TX_BYTES | Counter | Byte/s | Rate at which data is transmitted or received over NVLink, excluding protocol headers; averaged over the period, not instantaneous. For example, if 1 GB of data is transferred within 1 second, the rate is 1 GB/s regardless of whether the transfer was steady or bursty. The theoretical maximum NVLink Gen2 bandwidth is 25 GB/s per link in each direction.
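To make the DCGM_FI_PROF_SM_ACTIVE definition concrete, here is a worked example. It is a sketch under the simplifying assumption that each thread block keeps exactly one SM busy for the kernel's entire runtime; real occupancy depends on the kernel and hardware.

```latex
% Worked example for DCGM_FI_PROF_SM_ACTIVE on a GPU with N SMs.
% T is the sampling period; t_i is the time SM i had at least one
% active warp during that period.
\[
\mathrm{SM\_ACTIVE} \;=\; \frac{1}{N}\sum_{i=1}^{N}\frac{t_i}{T}
\]
% Illustrative values (assumption: one block occupies one SM):
%   - a kernel occupying all N SMs for the whole period  -> 1.0
%   - a kernel occupying N/2 SMs for the whole period    -> 0.5
%   - a kernel occupying all N SMs for 20% of the period -> 0.2
```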
Frequency

Metric | Type | Unit | Description
---|---|---|---
DCGM_FI_DEV_SM_CLOCK | Gauge | MHz | SM clock of the device
DCGM_FI_DEV_MEM_CLOCK | Gauge | MHz | Memory clock of the device
DCGM_FI_DEV_APP_SM_CLOCK | Gauge | MHz | SM application clock
DCGM_FI_DEV_APP_MEM_CLOCK | Gauge | MHz | Memory application clock
DCGM_FI_DEV_CLOCK_THROTTLE_REASONS | Gauge | N/A | Reason why the clock is currently throttled
Violations

Metric | Type | Unit | Description
---|---|---|---
DCGM_FI_DEV_XID_ERRORS | Gauge | N/A | Value of the last XID error that occurred within a period
DCGM_FI_DEV_POWER_VIOLATION | Counter | μs | Violation caused by the power limit. The value is the throttling duration, in microseconds.
DCGM_FI_DEV_THERMAL_VIOLATION | Counter | μs | Violation caused by the thermal limit. The value is the throttling duration, in microseconds.
DCGM_FI_DEV_SYNC_BOOST_VIOLATION | Counter | μs | Violation caused by the synchronous boost limit. The value is the throttling duration, in microseconds.
DCGM_FI_DEV_BOARD_LIMIT_VIOLATION | Counter | μs | Violation caused by the board limit. The value is the throttling duration, in microseconds.
DCGM_FI_DEV_LOW_UTIL_VIOLATION | Counter | μs | Violation caused by the low utilization limit. The value is the throttling duration, in microseconds.
DCGM_FI_DEV_RELIABILITY_VIOLATION | Counter | μs | Violation caused by the reliability limit. The value is the throttling duration, in microseconds.
BAR1 Memory

Metric | Type | Unit | Description
---|---|---|---
DCGM_FI_DEV_BAR1_USED | Gauge | MB | Used BAR1 memory
DCGM_FI_DEV_BAR1_FREE | Gauge | MB | Free BAR1 memory
Physical Status

Metric | Type | Unit | Description
---|---|---|---
DCGM_FI_DEV_MEMORY_TEMP | Gauge | °C | Memory temperature
DCGM_FI_DEV_GPU_TEMP | Gauge | °C | GPU temperature
DCGM_FI_DEV_POWER_USAGE | Gauge | Watt | GPU power usage
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION | Counter | Millijoule | Total energy consumed since the driver was loaded
Memory Isolation Page

Metric | Type | Unit | Description
---|---|---|---
DCGM_FI_DEV_RETIRED_SBE | Gauge | N/A | Number of pages retired due to single-bit errors
DCGM_FI_DEV_RETIRED_DBE | Gauge | N/A | Number of pages retired due to double-bit errors
For details about more DCGM metrics, see Field Identifiers.