GPU Metrics

Updated on 2025-02-18 GMT+08:00

The CCE AI Suite (NVIDIA GPU) add-on provides GPU monitoring metrics and integrates DCGM-Exporter. To use DCGM-Exporter, make sure you have version 2.7.32 or later of the add-on installed. This add-on offers additional GPU observability options. This section describes the metrics provided by CCE AI Suite (NVIDIA GPU).

Billing

GPU metrics are custom metrics. If you report them to AOM, you are billed on a pay-per-use basis. To avoid unexpected fees, review Pricing Details before enabling this function.

GPU Metrics Provided by CCE

NOTE:

With CCE AI Suite (NVIDIA GPU) add-on version 2.1.24, 2.7.40, or later, the used xGPU compute, used xGPU memory, and total xGPU memory can be read in addition to the basic GPU metrics.

  • The cce_gpu_memory_total metric supports the collection of xgpu_memory_total data.
  • The cce_gpu_memory_used metric supports the collection of xgpu_memory_used data.
  • The cce_gpu_utilization metric supports the collection of xgpu_core_percentage_used data.

When CCE shows GPU metric data, it also shows xGPU metric data. xGPU metric data is identified by the gpu_index label in the format {gpu_index="M|N"}, where M is the index of the GPU (gpu_index) and N is the index of the xGPU on that GPU (xgpu_index). You can use the gpu_index label to select xGPU metrics. For example:

cce_gpu_memory_used{gpu_index="0|1"} 16000

This indicates that the xGPU whose xgpu_index is 1 on GPU 0 has used 16,000 bytes of memory.

If you do not need xGPU metrics, you can filter them out with a regular expression that matches only gpu_index values containing no "|". For example:

cce_gpu_memory_used{gpu_index=~"[^|]+"}

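As a rough illustration of the gpu_index format described above, the following Python sketch splits a label value into its GPU and xGPU parts and mimics the regular-expression filter. The function names are hypothetical and not part of any CCE tooling. The sketch uses re.fullmatch because Prometheus label matchers are anchored to the whole label value, and it uses the pattern [^|]+ so that multi-digit GPU indexes are also accepted.

```python
import re
from typing import Optional, Tuple


def parse_gpu_index(value: str) -> Tuple[int, Optional[int]]:
    """Split a gpu_index label value into (gpu_index, xgpu_index).

    "0|1" yields GPU 0 and xGPU 1; "0" yields GPU 0 and no xGPU part.
    """
    gpu, sep, xgpu = value.partition("|")
    return int(gpu), (int(xgpu) if sep else None)


def is_physical_gpu(value: str) -> bool:
    """Return True when the label value contains no "|" (not an xGPU series)."""
    return re.fullmatch(r"[^|]+", value) is not None


# Hypothetical label values as they might appear on cce_gpu_memory_used.
labels = ["0", "0|0", "0|1", "1"]
physical_only = [v for v in labels if is_physical_gpu(v)]
```

Here physical_only keeps only the card-level series, which is the same selection the regular expression makes in a query.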
Table 1 Basic GPU monitoring metrics

| Category | Metric | Type | Unit | Monitoring Level | Description |
| --- | --- | --- | --- | --- | --- |
| Utilization | cce_gpu_utilization | Gauge | % | GPU cards | GPU compute usage |
| Utilization | cce_gpu_memory_utilization | Gauge | % | GPU cards | GPU memory usage |
| Utilization | cce_gpu_encoder_utilization | Gauge | % | GPU cards | GPU encoding usage |
| Utilization | cce_gpu_decoder_utilization | Gauge | % | GPU cards | GPU decoding usage |
| Utilization | cce_gpu_utilization_process | Gauge | % | GPU processes | GPU compute usage of each process |
| Utilization | cce_gpu_memory_utilization_process | Gauge | % | GPU processes | GPU memory usage of each process |
| Utilization | cce_gpu_encoder_utilization_process | Gauge | % | GPU processes | GPU encoding usage of each process |
| Utilization | cce_gpu_decoder_utilization_process | Gauge | % | GPU processes | GPU decoding usage of each process |
| Memory | cce_gpu_memory_used | Gauge | Byte | GPU cards | Used GPU memory |
| Memory | cce_gpu_memory_total | Gauge | Byte | GPU cards | Total GPU memory |
| Memory | cce_gpu_memory_free | Gauge | Byte | GPU cards | Free GPU memory |
| Memory | cce_gpu_bar1_memory_used | Gauge | Byte | GPU cards | Used GPU BAR1 memory |
| Memory | cce_gpu_bar1_memory_total | Gauge | Byte | GPU cards | Total GPU BAR1 memory |
| Frequency | cce_gpu_clock | Gauge | MHz | GPU cards | GPU clock frequency |
| Frequency | cce_gpu_memory_clock | Gauge | MHz | GPU cards | GPU memory clock frequency |
| Frequency | cce_gpu_graphics_clock | Gauge | MHz | GPU cards | GPU graphics clock frequency |
| Frequency | cce_gpu_video_clock | Gauge | MHz | GPU cards | GPU video processor frequency |
| Physical status | cce_gpu_temperature | Gauge | °C | GPU cards | GPU temperature |
| Physical status | cce_gpu_power_usage | Gauge | Milliwatt | GPU cards | GPU power |
| Physical status | cce_gpu_total_energy_consumption | Gauge | Millijoule | GPU cards | Total GPU energy consumption |
| Bandwidth | cce_gpu_pcie_link_bandwidth | Gauge | bit | GPU cards | GPU PCIe bandwidth |
| Bandwidth | cce_gpu_nvlink_bandwidth | Gauge | Gbit/s | GPU cards | GPU NVLink bandwidth |
| Bandwidth | cce_gpu_pcie_throughput_rx | Gauge | KB/s | GPU cards | GPU PCIe RX bandwidth |
| Bandwidth | cce_gpu_pcie_throughput_tx | Gauge | KB/s | GPU cards | GPU PCIe TX bandwidth |
| Bandwidth | cce_gpu_nvlink_utilization_counter_rx | Gauge | KB/s | GPU cards | GPU NVLink RX bandwidth |
| Bandwidth | cce_gpu_nvlink_utilization_counter_tx | Gauge | KB/s | GPU cards | GPU NVLink TX bandwidth |
| Memory page isolation | cce_gpu_retired_pages_sbe | Gauge | N/A | GPU cards | Number of isolated GPU memory pages with single-bit errors |
| Memory page isolation | cce_gpu_retired_pages_dbe | Gauge | N/A | GPU cards | Number of isolated GPU memory pages with double-bit errors |

NOTE:

If the add-on version is 2.1.24, 2.7.40, or later, cce_gpu_utilization, cce_gpu_memory_used, and cce_gpu_memory_total can also be used to obtain the corresponding xGPU metrics (xgpu_core_percentage_used, xgpu_memory_used, and xgpu_memory_total, respectively).

You can use the gpu_index label of these metrics to obtain the xGPU metrics. For example, gpu_index="0|0" indicates GPU 0 and xGPU 0.

Table 2 xGPU monitoring metrics

| Metric | Type | Unit | Monitoring Level | Description |
| --- | --- | --- | --- | --- |
| xgpu_memory_total | Gauge | Byte | GPU processes | Total xGPU memory |
| xgpu_memory_used | Gauge | Byte | GPU processes | Used xGPU memory |
| xgpu_core_percentage_total | Gauge | % | GPU processes | Total xGPU cores |
| xgpu_core_percentage_used | Gauge | % | GPU processes | Used xGPU cores |
| gpu_schedule_policy | Gauge | N/A | GPU cards | xGPU scheduling policy. Options: 0 (xGPU memory is isolated and cores are shared), 1 (both xGPU memory and cores are isolated), 2 (default mode, indicating that the card is not used by any xGPU device for allocation) |
| xgpu_device_health | Gauge | N/A | GPU cards | xGPU device health. Options: 0 (the xGPU device is healthy), 1 (the xGPU device is unhealthy) |
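The numeric values of gpu_schedule_policy and xgpu_device_health are easier to read once mapped back to the options listed above. The following is a minimal sketch of such a mapping. The dictionary and function names are hypothetical, and the sample values are assumed to arrive as floats, as Prometheus samples do.

```python
# Option values taken from the table above; the wording of each label is ours.
SCHEDULE_POLICIES = {
    0: "memory isolated, cores shared",
    1: "memory and cores isolated",
    2: "default mode (card not used for xGPU allocation)",
}
DEVICE_HEALTH = {0: "healthy", 1: "unhealthy"}


def describe_xgpu_state(policy: float, health: float) -> str:
    """Turn the two gauge values into a short human-readable summary."""
    policy_text = SCHEDULE_POLICIES.get(int(policy), "unknown policy")
    health_text = DEVICE_HEALTH.get(int(health), "unknown health")
    return f"policy: {policy_text}; device: {health_text}"


summary = describe_xgpu_state(1.0, 0.0)
```

Unknown values fall back to an "unknown" label rather than raising, on the assumption that later add-on versions could introduce additional options.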

GPU Metrics Provided by DCGM

Table 3 Utilization

| Metric | Type | Unit | Description |
| --- | --- | --- | --- |
| DCGM_FI_DEV_GPU_UTIL | Gauge | % | GPU utilization: the percentage of time in a sampling period (1s or 1/6s, depending on the GPU model) during which one or more kernel functions were running. It shows only that the GPU was in use by kernel functions, not how heavily it was used. |
| DCGM_FI_DEV_MEM_COPY_UTIL | Gauge | % | GPU memory bandwidth utilization. For example, the maximum memory bandwidth of an NVIDIA V100 GPU is 900 GB/s. If the current memory bandwidth is 450 GB/s, the memory bandwidth utilization is 50%. |
| DCGM_FI_DEV_ENC_UTIL | Gauge | % | GPU encoder utilization |
| DCGM_FI_DEV_DEC_UTIL | Gauge | % | GPU decoder utilization |
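The memory bandwidth example for DCGM_FI_DEV_MEM_COPY_UTIL is a simple ratio of current to peak bandwidth. The following sketch reproduces that arithmetic with the V100 figures quoted above. The function name is hypothetical, and DCGM computes the metric internally rather than through code like this.

```python
def bandwidth_utilization_percent(current_gb_s: float, peak_gb_s: float) -> float:
    """Return current bandwidth as a percentage of the peak bandwidth."""
    if peak_gb_s <= 0:
        raise ValueError("peak bandwidth must be positive")
    return 100.0 * current_gb_s / peak_gb_s


# Figures from the V100 example: 450 GB/s used out of a 900 GB/s peak.
v100_utilization = bandwidth_utilization_percent(450, 900)  # 50.0
```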

Table 4 Memory

| Metric | Type | Unit | Description |
| --- | --- | --- | --- |
| DCGM_FI_DEV_FB_FREE | Gauge | MB | Amount of free GPU memory |
| DCGM_FI_DEV_FB_USED | Gauge | MB | Amount of used GPU memory. The value is the same as the Memory-Usage value reported by the nvidia-smi command. |

Table 5 Profiling

Metric

Type

Unit

Description

DCGM_FI_PROF_GR_ENGINE_ACTIVE

Gauge

%

Fraction of time during which the graphics or compute engine is active within a period.

This is an average over all graphics and compute engines.

A graphics or compute engine is active when a graphics or compute context is bound to it and that context is busy.

DCGM_FI_PROF_SM_ACTIVE

Gauge

%

Fraction of time during which at least one thread bundle (warp) is active on an SM within a period.

This is an average over all SMs and is insensitive to the number of threads in each block.

A thread bundle is active once it has been scheduled and allocated resources. It may be computing or in a non-computing state (for example, waiting for a memory request).

A value below 0.5 indicates that the GPU is not being used efficiently. Aim for a value above 0.8.

For example, on a GPU with N SMs:

  • A kernel function runs N thread blocks on all SMs for the entire period. In this case, the value is 1 (100%).
  • A kernel function runs N/5 thread blocks for the entire period. In this case, the value is 0.2.
  • A kernel function runs N thread blocks but for only 1/5 of the cycles in the period. In this case, the value is 0.2.

DCGM_FI_PROF_SM_OCCUPANCY

Gauge

%

Ratio of the number of thread bundles (warps) resident on an SM to the maximum number of thread bundles that the SM can hold, within a period.

This is an average over all SMs within a period.

A higher value does not necessarily mean more efficient GPU usage. Only for workloads that are limited by GPU memory bandwidth (see DCGM_FI_PROF_DRAM_ACTIVE) does a higher value indicate more efficient GPU usage.

DCGM_FI_PROF_PIPE_TENSOR_ACTIVE

Gauge

%

Fraction of cycles during which the tensor (HMMA/IMMA) pipe is active.

This is an average value within a period, not an instantaneous value.

A higher value indicates a higher utilization of tensor cores.

A value of 1 (100%) is equivalent to issuing a tensor instruction every other cycle throughout the entire period (one instruction completes in two cycles).

If the value is 0.2 (20%), the possible causes are as follows:

  • During the entire period, 20% of the SM tensor cores run at 100% utilization.
  • During the entire period, all SM tensor cores run at 20% utilization.
  • During 1/5 of the entire period, all SM tensor cores run at 100% utilization.
  • Other combinations

DCGM_FI_PROF_PIPE_FP64_ACTIVE

Gauge

%

Fraction of cycles during which the FP64 (double precision) pipe is active.

This is an average value within a period, not an instantaneous value.

A larger value indicates a higher usage of FP64 cores.

A value of 1 (100%) indicates that an FP64 instruction is executed every four cycles (on Volta cards, for example) throughout the period.

If the value is 0.2 (20%), the possible causes are as follows:

  • During the entire period, 20% of the SM FP64 cores run at 100% utilization.
  • During the entire period, all SM FP64 cores run at 20% utilization.
  • During 1/5 of the entire period, all SM FP64 cores run at 100% utilization.
  • Other combinations

DCGM_FI_PROF_PIPE_FP32_ACTIVE

Gauge

%

Fraction of cycles during which the fused multiply-add (FMA) pipe is active. Multiply-add applies to FP32 (single precision) and integers.

This is an average value within a period, not an instantaneous value.

A larger value indicates a higher usage of FP32 cores.

A value of 1 (100%) indicates that an FP32 instruction is executed every two cycles (on Volta cards, for example) throughout the period.

If the value is 0.2 (20%), the possible causes are as follows:

  • During the entire period, 20% of the SM FP32 cores run at 100% utilization.
  • During the entire period, all SM FP32 cores run at 20% utilization.
  • During 1/5 of the entire period, all SM FP32 cores run at 100% utilization.
  • Other combinations

DCGM_FI_PROF_PIPE_FP16_ACTIVE

Gauge

%

Fraction of cycles during which the FP16 (half-precision) pipe is active.

This is an average value within a period, not an instantaneous value.

A larger value indicates a higher usage of FP16 cores.

A value of 1 (100%) indicates that an FP16 instruction is executed every two cycles (on Volta cards, for example) throughout the period.

If the value is 0.2 (20%), the possible causes are as follows:

  • During the entire period, 20% of the SM FP16 cores run at 100% utilization.
  • During the entire period, all SM FP16 cores run at 20% utilization.
  • During 1/5 of the entire period, all SM FP16 cores run at 100% utilization.
  • Other combinations

DCGM_FI_PROF_DRAM_ACTIVE

Gauge

%

Memory bandwidth utilization: the fraction of cycles during which data is sent to or received from device memory.

This is an average value within a period, not an instantaneous value.

A higher value indicates a higher utilization of device memory.

A value of 1 (100%) indicates that a DRAM instruction is executed in every cycle throughout the period. In practice, a peak of around 0.8 (80%) is the maximum achievable.

A value of 0.2 (20%) means that 20% of the cycles in the period involve reading from or writing to device memory.

DCGM_FI_PROF_PCIE_TX_BYTES

DCGM_FI_PROF_PCIE_RX_BYTES

Counter

Byte/s

Rate of data transmitted or received over the PCIe bus, including both the protocol headers and the data payloads.

The rate is an average over the period, not an instantaneous value. For example, if 1 GB of data is transmitted within 1 second, the transmission rate is 1 GB/s regardless of whether the data is transmitted at a constant rate or in bursts. The theoretical maximum PCIe Gen3 bandwidth is 985 MB/s per lane.

DCGM_FI_PROF_NVLINK_RX_BYTES

DCGM_FI_PROF_NVLINK_TX_BYTES

Counter

Byte/s

Rate at which data is transmitted or received through NVLink, excluding the protocol headers.

The rate is an average over the period, not an instantaneous value. For example, if 1 GB of data is transmitted within 1 second, the transmission rate is 1 GB/s regardless of whether the data is transmitted at a constant rate or in bursts. The theoretical maximum NVLink Gen2 bandwidth is 25 GB/s per link in each direction.
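Several profiling metrics in Table 5 are averages over all SMs and over the period, which is why quite different workloads can produce the same value. The sketch below illustrates this for the DCGM_FI_PROF_SM_ACTIVE examples and also scales the quoted per-lane PCIe Gen3 figure to a full link. The function names, the SM count of 10, and the x16 link width are assumptions chosen for illustration. DCGM derives these values from hardware counters, not from code like this.

```python
from statistics import fmean
from typing import Sequence

# Theoretical PCIe Gen3 figure quoted in the table above, in MB/s per lane.
PCIE_GEN3_MB_PER_S_PER_LANE = 985


def mean_activity(per_sm_activity: Sequence[float]) -> float:
    """Average per-SM activity fractions (each between 0 and 1) over all SMs."""
    return fmean(per_sm_activity)


def pcie_gen3_peak_mb_per_s(lanes: int) -> int:
    """Theoretical peak bandwidth of a PCIe Gen3 link with the given lane count."""
    return PCIE_GEN3_MB_PER_S_PER_LANE * lanes


n = 10  # hypothetical number of SMs

# Every SM busy for the whole period.
all_busy = mean_activity([1.0] * n)
# Only N/5 of the SMs busy for the whole period.
fifth_of_sms = mean_activity([1.0] * (n // 5) + [0.0] * (n - n // 5))
# Every SM busy, but for only 1/5 of the cycles.
fifth_of_cycles = mean_activity([0.2] * n)

x16_peak = pcie_gen3_peak_mb_per_s(16)  # 15760 MB/s for an assumed x16 link
```

The last two cases both average to about 0.2, so a single reading cannot distinguish "few SMs fully busy" from "all SMs briefly busy". Comparing DCGM_FI_PROF_PCIE_TX_BYTES or DCGM_FI_PROF_PCIE_RX_BYTES against x16_peak gives a rough upper bound only, since real links do not reach the theoretical maximum.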

Table 6 Frequency (clock)

| Metric | Type | Unit | Description |
| --- | --- | --- | --- |
| DCGM_FI_DEV_SM_CLOCK | Gauge | MHz | SM clock for the device |
| DCGM_FI_DEV_MEM_CLOCK | Gauge | MHz | Memory clock for the device |
| DCGM_FI_DEV_APP_SM_CLOCK | Gauge | MHz | SM application clocks |
| DCGM_FI_DEV_APP_MEM_CLOCK | Gauge | MHz | Memory application clocks |
| DCGM_FI_DEV_CLOCK_THROTTLE_REASONS | Gauge | N/A | The reasons why the clock is throttled, reported as a bitmask |
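DCGM reports DCGM_FI_DEV_CLOCK_THROTTLE_REASONS as a bitmask, so one value can carry several reasons at once. The following sketch decodes such a value. The bit assignments are an assumption based on the NVML clocks-throttle-reason constants that DCGM mirrors, and the function name is hypothetical. Verify the values against the DCGM headers for your version before relying on them.

```python
from typing import List

# Assumed bit values, following the NVML throttle-reason constants.
THROTTLE_REASONS = {
    0x1: "GPU idle",
    0x2: "application clocks setting",
    0x4: "software power cap",
    0x8: "hardware slowdown",
    0x10: "sync boost",
    0x20: "software thermal slowdown",
    0x40: "hardware thermal slowdown",
    0x80: "hardware power brake slowdown",
    0x100: "display clocks setting",
}


def decode_throttle_reasons(value: float) -> List[str]:
    """Return the names of all reasons whose bit is set in the metric value."""
    mask = int(value)
    return [name for bit, name in THROTTLE_REASONS.items() if mask & bit]


# 0x44 sets the 0x4 and 0x40 bits.
reasons = decode_throttle_reasons(0x44)
```

A value of 0 decodes to an empty list, meaning no throttle reason is active.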

Table 7 XID errors and violations

| Metric | Type | Unit | Description |
| --- | --- | --- | --- |
| DCGM_FI_DEV_XID_ERRORS | Gauge | N/A | The last XID error that occurred in a period of time |
| DCGM_FI_DEV_POWER_VIOLATION | Counter | μs | A violation caused by the power limit. The value is the duration of the violation. |
| DCGM_FI_DEV_THERMAL_VIOLATION | Counter | μs | A violation caused by the thermal limit. The value is the duration of the violation. |
| DCGM_FI_DEV_SYNC_BOOST_VIOLATION | Counter | μs | A violation caused by the synchronous boost limit. The value is the duration of the violation. |
| DCGM_FI_DEV_BOARD_LIMIT_VIOLATION | Counter | μs | A violation caused by the board limit. The value is the duration of the violation. |
| DCGM_FI_DEV_LOW_UTIL_VIOLATION | Counter | μs | A violation caused by the low utilization limit. The value is the duration of the violation. |
| DCGM_FI_DEV_RELIABILITY_VIOLATION | Counter | μs | A violation caused by the reliability limit. The value is the duration of the violation. |
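Because the violation metrics are counters in microseconds, the difference between two samples can be turned into the share of time a GPU spent in violation. The sketch below shows one way to do that. It assumes the counter accumulates time spent in violation and treats a decrease as a counter reset. The function name and sample numbers are hypothetical.

```python
def violation_fraction(prev_us: float, curr_us: float, interval_s: float) -> float:
    """Fraction of the interval spent in violation, from two counter samples.

    prev_us and curr_us are counter readings in microseconds taken
    interval_s seconds apart.
    """
    if interval_s <= 0:
        raise ValueError("interval must be positive")
    # A lower current value is treated as a counter reset (an assumption).
    delta_us = curr_us - prev_us if curr_us >= prev_us else curr_us
    return delta_us / (interval_s * 1_000_000)


# Hypothetical samples 30 s apart: 3 s of power-limit violation in between.
fraction = violation_fraction(1_000_000, 4_000_000, 30)
```

In this example the GPU spent about a tenth of the interval in violation.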

Table 8 BAR1

| Metric | Type | Unit | Description |
| --- | --- | --- | --- |
| DCGM_FI_DEV_BAR1_USED | Gauge | MB | Used BAR1 memory |
| DCGM_FI_DEV_BAR1_FREE | Gauge | MB | Free BAR1 memory |

Table 9 Temperature and power

| Metric | Type | Unit | Description |
| --- | --- | --- | --- |
| DCGM_FI_DEV_MEMORY_TEMP | Gauge | °C | Memory temperature |
| DCGM_FI_DEV_GPU_TEMP | Gauge | °C | GPU temperature |
| DCGM_FI_DEV_POWER_USAGE | Gauge | Watt | GPU power |
| DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION | Counter | Millijoule | Energy consumed since the driver was loaded |
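The energy counter and the power gauges use different units (millijoules, watts, and milliwatts for cce_gpu_power_usage), so a conversion step helps when comparing them. The following sketch derives average power from two energy samples and converts milliwatts to watts. The function names and sample numbers are hypothetical.

```python
def average_power_watts(prev_mj: float, curr_mj: float, interval_s: float) -> float:
    """Average power over the interval from two energy readings in millijoules."""
    if interval_s <= 0:
        raise ValueError("interval must be positive")
    joules = (curr_mj - prev_mj) / 1000.0  # 1 J = 1000 mJ
    return joules / interval_s  # 1 W = 1 J/s


def milliwatts_to_watts(value_mw: float) -> float:
    """Convert a cce_gpu_power_usage reading (milliwatts) to watts."""
    return value_mw / 1000.0


# Hypothetical samples 10 s apart: 3,000,000 mJ consumed in between.
avg_power = average_power_watts(0, 3_000_000, 10)  # 300.0 W
card_power = milliwatts_to_watts(250_000)  # 250.0 W
```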

Table 10 Retired pages

| Metric | Type | Unit | Description |
| --- | --- | --- | --- |
| DCGM_FI_DEV_RETIRED_SBE | Gauge | N/A | Number of pages retired due to single-bit errors |
| DCGM_FI_DEV_RETIRED_DBE | Gauge | N/A | Number of pages retired due to double-bit errors |

For details about more DCGM metrics, see Field Identifiers.