Updated on 2024-06-26 GMT+08:00

GPU View

GPU resource metrics measure GPU performance and usage, including GPU compute usage, temperature, and GPU memory, so that you can monitor GPU resources more effectively.

Metric Description

Figure 1 GPU metrics
Table 1 GPU metrics

| Metric | Unit | Description |
| --- | --- | --- |
| Cluster - GPU Memory Usage | % | GPU memory usage of the cluster. Formula: Used GPU memory of the cluster/Total GPU memory of the cluster |
| Cluster - GPU Compute Usage | % | GPU compute usage of the cluster. Formula: Used GPU compute of the cluster/Total GPU compute of the cluster |
| Node - Used GPU Memory | byte | GPU memory used by the node. |
| Node - GPU Compute Usage | % | GPU compute usage of each node. Formula: Total GPU compute used by containers on the node/Total GPU compute of the node |
| Node - GPU Memory Usage | % | GPU memory usage of each node. Formula: Total GPU memory used by containers on the node/Total GPU memory of the node |
| GPU - Used GPU Memory | byte | GPU memory usage of each GPU. Formula: Total used GPU memory of containers on the GPU/Total GPU memory of the GPU |
| GPU - GPU Compute Usage | % | GPU compute usage of each graphics card. Formula: Total GPU compute used by containers on the graphics card/Total GPU compute of the graphics card |
| GPU - Temperature | °C | Temperature of each GPU. |
| GPU - Memory Clock | Hz | Memory clock of each GPU. |
| GPU - PCIe Bandwidth | byte/s | PCIe bandwidth of each GPU. |
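
The usage formulas in Table 1 all share the same shape: a used amount divided by a total amount, aggregated at cluster, node, or GPU level. The following is a minimal Python sketch of that arithmetic for GPU memory. The `GpuSample` type, its field names, and the helper functions are hypothetical and are not part of any CCE API. The sketch also assumes that the percentage is the ratio multiplied by 100.

```python
from dataclasses import dataclass


@dataclass
class GpuSample:
    """One GPU's memory reading. Field names are hypothetical, not CCE names."""
    node: str
    used_bytes: float
    total_bytes: float


def usage_percent(used: float, total: float) -> float:
    """Return used/total as a percentage, or 0.0 when total is not positive."""
    if total <= 0:
        return 0.0
    return used / total * 100.0


def cluster_memory_usage(samples: list[GpuSample]) -> float:
    """Cluster level: used GPU memory of the cluster / total GPU memory of the cluster."""
    used = sum(s.used_bytes for s in samples)
    total = sum(s.total_bytes for s in samples)
    return usage_percent(used, total)


def node_memory_usage(samples: list[GpuSample]) -> dict[str, float]:
    """Node level: used GPU memory on each node / total GPU memory of that node."""
    used: dict[str, float] = {}
    total: dict[str, float] = {}
    for s in samples:
        used[s.node] = used.get(s.node, 0.0) + s.used_bytes
        total[s.node] = total.get(s.node, 0.0) + s.total_bytes
    return {node: usage_percent(used[node], total[node]) for node in used}
```

Summing used and total before dividing weights each GPU by its memory size. Averaging per-GPU percentages instead would give a different result when GPUs have different memory capacities.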

Metric List

The following table lists the metrics used in the GPU view.
Table 2 Metric description

| Metric | Type | Description |
| --- | --- | --- |
| cce_gpu_gpu_utilization | Gauge | GPU compute usage. |
| cce_gpu_memory_utilization | Gauge | GPU memory usage. |
| cce_gpu_memory_used | Gauge | Used GPU memory. |
| cce_gpu_memory_total | Gauge | Total GPU memory. |
| cce_gpu_memory_free | Gauge | Free GPU memory. |
| cce_gpu_memory_clock | Gauge | The speed at which the GPU memory operates. |
| cce_gpu_gpu_temperature | Gauge | GPU temperature. |
| cce_gpu_pcie_link_bandwidth | Gauge | GPU PCIe bandwidth. |
| cce_gpu_pcie_throughput_rx | Gauge | GPU PCIe RX bandwidth. |
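
If these gauges are stored in a Prometheus-compatible backend (an assumption; this page does not state where the metrics are stored), they could be read through the standard Prometheus instant-query endpoint, `/api/v1/query`. The following Python sketch builds such a query URL and extracts the value from a response. The metric names come from Table 2. The PromQL aggregations, the base URL, and the helper function names are illustrative assumptions, and the expressions are not necessarily the ones the GPU view itself uses.

```python
from typing import Optional
from urllib.parse import urlencode

# Metric names are from Table 2; these aggregations are illustrative assumptions.
CLUSTER_MEMORY_USAGE = "sum(cce_gpu_memory_used) / sum(cce_gpu_memory_total) * 100"
MAX_GPU_TEMPERATURE = "max(cce_gpu_gpu_temperature)"


def instant_query_url(base_url: str, promql: str) -> str:
    """Build a URL for the Prometheus instant-query endpoint (/api/v1/query)."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def first_value(response: dict) -> Optional[float]:
    """Return the first sample value of an instant-query JSON response, if any."""
    if response.get("status") != "success":
        return None
    result = response.get("data", {}).get("result", [])
    if not result:
        return None
    # Each vector element carries "value": [unix_timestamp, "value-as-string"].
    return float(result[0]["value"][1])
```

For example, `instant_query_url("http://prometheus.example:9090", CLUSTER_MEMORY_USAGE)` returns a URL that can be fetched with any HTTP client. The host name here is a placeholder.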