Updated on 2024-11-14 GMT+08:00

Basic Metrics: ModelArts Metrics

This section describes the ModelArts metrics reported to AOM through the Agent.

Table 1 Metrics reported by ModelArts to AOM through the Agent

Category

Metric

Metric Name

Description

Value Range

Unit

CPU

ma_container_cpu_util

CPU Usage

CPU usage of a measured object

0–100

%

ma_container_cpu_used_core

Used CPU Cores

Number of CPU cores used by a measured object

≥ 0

Cores

ma_container_cpu_limit_core

Total CPU Cores

Total number of CPU cores requested for a measured object

≥ 1

Cores

Memory

ma_container_memory_capacity_megabytes

Memory

Total physical memory requested for a measured object

≥ 0

MB

ma_container_memory_util

Physical Memory Usage

Used physical memory as a percentage of the total physical memory requested for a measured object

0–100

%

ma_container_memory_used_megabytes

Used Physical Memory

Physical memory used by a measured object, reported from container_memory_working_set_bytes (the current working set). The working set consists of active anonymous pages plus active file-backed cache pages, and is ≤ container_memory_usage_bytes.

≥ 0

MB

Storage I/O

ma_container_disk_read_kilobytes

Disk Read Rate

Volume of data read from a disk per second

≥ 0

KB/s

ma_container_disk_write_kilobytes

Disk Write Rate

Volume of data written to a disk per second

≥ 0

KB/s

GPU memory

ma_container_gpu_mem_total_megabytes

GPU Memory Capacity

Total GPU memory of a training job

> 0

MB

ma_container_gpu_mem_util

GPU Memory Usage

Used GPU memory as a percentage of the total GPU memory

0–100

%

ma_container_gpu_mem_used_megabytes

Used GPU Memory

GPU memory used by a measured object

≥ 0

MB

GPU

ma_container_gpu_util

GPU Usage

GPU usage of a measured object

0–100

%

ma_container_gpu_mem_copy_util

GPU Memory Bandwidth Usage

GPU memory bandwidth usage of a measured object. For example, the maximum memory bandwidth of an NVIDIA V100 GPU is 900 GB/s. If the current memory bandwidth is 450 GB/s, the memory bandwidth usage is 50%.

0–100

%
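The V100 example in the description above can be reproduced with a short calculation. This is an illustrative sketch only: the function name is hypothetical, and it is not how the Agent itself computes ma_container_gpu_mem_copy_util.

```python
def bandwidth_util_percent(current_gb_s: float, peak_gb_s: float) -> float:
    """Return current memory bandwidth as a percentage of the peak bandwidth."""
    if peak_gb_s <= 0:
        raise ValueError("peak_gb_s must be positive")
    return current_gb_s / peak_gb_s * 100.0


# Figures from the description: 450 GB/s measured on a 900 GB/s V100.
print(bandwidth_util_percent(450, 900))  # 50.0
```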

ma_container_gpu_enc_util

GPU Encoder Usage

GPU encoder usage of a measured object

0–100

%

ma_container_gpu_dec_util

GPU Decoder Usage

GPU decoder usage of a measured object

0–100

%

DCGM_FI_DEV_GPU_TEMP

GPU Temperature

GPU temperature

> 0

°C

DCGM_FI_DEV_POWER_USAGE

GPU Power

GPU power

> 0

W

DCGM_FI_DEV_MEMORY_TEMP

Memory Temperature

Memory temperature

> 0

°C

DCGM_FI_PROF_GR_ENGINE_ACTIVE

Graphics Engine Activity

Fraction of the time within a period during which the graphics or compute engine is active, averaged over all graphics and compute engines. An engine is active when a graphics or compute context is bound to a thread and that context is busy.

0–1.0

Percentage (fraction)

DCGM_FI_PROF_SM_OCCUPANCY

SM Occupancy

Ratio of the number of warps (thread bundles) resident on an SM to the maximum number of warps that can reside on the SM within a period.

This is an average value of all SMs within a period.

A high value does not necessarily mean efficient GPU usage. Only for workloads limited by GPU memory bandwidth (see DCGM_FI_PROF_DRAM_ACTIVE) does a higher value indicate more efficient GPU usage.

0–1.0

Percentage (fraction)

DCGM_FI_PROF_PIPE_TENSOR_ACTIVE

Tensor Activity

Fraction of the period during which the tensor (HMMA/IMMA) pipe is active.

This is an average value within a period, not an instantaneous value.

A higher value indicates a higher utilization of tensor cores.

A value of 1 (100%) indicates that a tensor instruction is issued every other cycle throughout the period (one instruction completes in two cycles).

If the value is 0.2 (20%), the possible causes are as follows:

During the entire period, 20% of the SM tensor cores run at 100% utilization.

During the entire period, all SM tensor cores run at 20% utilization.

During 1/5 of the entire period, all SM tensor cores run at 100% utilization.

Other combinations

0–1.0

Percentage (fraction)
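The three scenarios listed in the description all average out to the same reported value. The sketch below assumes a simplified model, not taken from DCGM itself, in which the metric is the mean activity over every SM and every equal-length time slice. The SM count, slice count, and function name are hypothetical.

```python
from statistics import fmean

NUM_SMS = 10     # hypothetical SM count
NUM_SLICES = 5   # hypothetical number of equal time slices in one period


def pipe_active(samples):
    """Average activity over every SM and every time slice.

    samples[s][t] is the activity (0.0 to 1.0) of SM s during slice t.
    """
    return fmean([v for per_sm in samples for v in per_sm])


# 20% of the SMs run at 100% for the whole period.
some_sms_busy = [
    [1.0] * NUM_SLICES if sm < NUM_SMS // 5 else [0.0] * NUM_SLICES
    for sm in range(NUM_SMS)
]
# All SMs run at 20% for the whole period.
all_sms_light = [[0.2] * NUM_SLICES for _ in range(NUM_SMS)]
# All SMs run at 100% during 1/5 of the period.
all_sms_burst = [[1.0] + [0.0] * (NUM_SLICES - 1) for _ in range(NUM_SMS)]

for label, case in [
    ("some SMs busy", some_sms_busy),
    ("all SMs light", all_sms_light),
    ("all SMs burst", all_sms_burst),
]:
    print(f"{label}: {pipe_active(case):.2f}")
```

Because the three cases are indistinguishable in the averaged value, the metric alone cannot tell which pattern a workload follows.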

DCGM_FI_PROF_DRAM_ACTIVE

Memory BW Utilization

Fraction of the cycles within a period during which data is sent to or received from the device memory.

This is an average value within a period, not an instantaneous value.

A higher value indicates a higher utilization of device memory.

A value of 1 (100%) indicates that a DRAM instruction is executed every cycle throughout the period (in practice, the achievable peak is about 0.8).

A value of 0.2 (20%) indicates that data is read from or written to the device memory during 20% of the cycles within a period.

0–1.0

Percentage (fraction)
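Because the description notes a practical peak of about 0.8, it can be useful to read DCGM_FI_PROF_DRAM_ACTIVE relative to that peak rather than to 1.0. The helper below is a hypothetical sketch: the 0.8 constant is the approximate figure from the description, not an exact hardware limit, and the function name is illustrative.

```python
PRACTICAL_PEAK = 0.8  # approximate achievable maximum quoted in the description


def dram_active_vs_practical_peak(value: float,
                                  practical_peak: float = PRACTICAL_PEAK) -> float:
    """Express a DRAM activity reading as a share of the practical peak, capped at 1.0."""
    if not 0.0 <= value <= 1.0:
        raise ValueError("value must be within 0-1.0")
    if practical_peak <= 0:
        raise ValueError("practical_peak must be positive")
    return min(1.0, value / practical_peak)


# A reading of 0.4 is half of the assumed practical peak.
print(dram_active_vs_practical_peak(0.4))  # 0.5
```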

DCGM_FI_PROF_PIPE_FP16_ACTIVE

FP16 Engine Activity

Fraction of the period during which the FP16 (half-precision) pipe is active.

This is an average value within a period, not an instantaneous value.

A larger value indicates a higher usage of FP16 cores.

A value of 1 (100%) indicates that an FP16 instruction is executed every other cycle throughout the period (on Volta cards, for example).

If the value is 0.2 (20%), the possible causes are as follows:

During the entire period, 20% of the SM FP16 cores run at 100% utilization.

During the entire period, all SM FP16 cores run at 20% utilization.

During 1/5 of the entire period, all SM FP16 cores run at 100% utilization.

Other combinations

0–1.0

Percentage (fraction)

DCGM_FI_PROF_PIPE_FP32_ACTIVE

FP32 Engine Activity

Fraction of the period during which the fused multiply-add (FMA) pipe is active. Multiply-add applies to FP32 (single precision) and integers.

This is an average value within a period, not an instantaneous value.

A larger value indicates a higher usage of FP32 cores.

A value of 1 (100%) indicates that an FP32 instruction is executed every other cycle throughout the period (on Volta cards, for example).

If the value is 0.2 (20%), the possible causes are as follows:

During the entire period, 20% of the SM FP32 cores run at 100% utilization.

During the entire period, all SM FP32 cores run at 20% utilization.

During 1/5 of the entire period, all SM FP32 cores run at 100% utilization.

Other combinations

0–1.0

Percentage (fraction)

DCGM_FI_PROF_PIPE_FP64_ACTIVE

FP64 Engine Activity

Fraction of the period during which the FP64 (double precision) pipe is active.

This is an average value within a period, not an instantaneous value.

A larger value indicates a higher usage of FP64 cores.

A value of 1 (100%) indicates that an FP64 instruction is executed every fourth cycle throughout the period (on Volta cards, for example).

If the value is 0.2 (20%), the possible causes are as follows:

During the entire period, 20% of the SM FP64 cores run at 100% utilization.

During the entire period, all SM FP64 cores run at 20% utilization.

During 1/5 of the entire period, all SM FP64 cores run at 100% utilization.

Other combinations

0–1.0

Percentage (fraction)

DCGM_FI_PROF_SM_ACTIVE

SM Activity

Fraction of the time during which at least one warp (thread bundle) is active on an SM within a period.

This is an average value of all SMs and is insensitive to the number of threads in each block.

A warp is active once it has been scheduled and allocated resources. It may be computing or in a non-computing state (for example, waiting for a memory request).

A value below 0.5 indicates that the GPU is not being used efficiently. Aim for a value above 0.8.

For example, a GPU has N SMs:

A kernel function uses N thread blocks and runs on all SMs for the entire period. In this case, the value is 1 (100%).

A kernel function uses N/5 thread blocks and runs for the entire period. In this case, the value is 0.2.

A kernel function uses N thread blocks but runs for only 1/5 of the cycles in a period. In this case, the value is 0.2.

0–1.0

Percentage (fraction)
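The two thresholds mentioned in the description (0.5 and 0.8) can be turned into a simple check when reviewing DCGM_FI_PROF_SM_ACTIVE readings. This is a minimal sketch: the function name and the three labels are hypothetical, and only the thresholds come from the description.

```python
def classify_sm_activity(value: float) -> str:
    """Label an SM activity reading using the 0.5 and 0.8 thresholds."""
    if not 0.0 <= value <= 1.0:
        raise ValueError("value must be within 0-1.0")
    if value < 0.5:
        return "inefficient"      # below 0.5: GPU not efficiently used
    if value > 0.8:
        return "target met"       # above 0.8: recommended level
    return "below target"         # between the two thresholds


for reading in (0.2, 0.65, 0.9):
    print(reading, classify_sm_activity(reading))
```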

DCGM_FI_PROF_PCIE_TX_BYTES

DCGM_FI_PROF_PCIE_RX_BYTES

PCIe Bandwidth

Rate of data transmitted or received over the PCIe bus, including the protocol header and data payload.

This is an average value within a period, not an instantaneous value.

For example, if 1 GB of data is transmitted within 1 second, the transmission rate is 1 GB/s regardless of whether the data is transmitted at a constant rate or in bursts. Theoretically, the maximum PCIe Gen3 bandwidth is 985 MB/s per lane.

≥ 0

Bytes/s
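To put a PCIe reading in context, it can be compared with the theoretical Gen3 maximum built from the 985 MB/s figure quoted in the description. The sketch below assumes an x16 link and 1 MB = 10^6 bytes. Both are assumptions, as are the function and constant names, so adjust them to the actual hardware.

```python
# 985 MB/s from the description; assumes 1 MB = 10**6 bytes.
PCIE_GEN3_BYTES_PER_SECOND_PER_LANE = 985 * 10**6


def pcie_gen3_fraction(bytes_per_second: float, lanes: int = 16) -> float:
    """Return a TX or RX reading as a fraction of the theoretical Gen3 maximum."""
    if lanes <= 0:
        raise ValueError("lanes must be positive")
    if bytes_per_second < 0:
        raise ValueError("bytes_per_second must be non-negative")
    return bytes_per_second / (PCIE_GEN3_BYTES_PER_SECOND_PER_LANE * lanes)


# Hypothetical reading of 7.88 GB/s on an assumed x16 link.
print(pcie_gen3_fraction(7_880_000_000))  # 0.5
```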

DCGM_FI_PROF_NVLINK_RX_BYTES

DCGM_FI_PROF_NVLINK_TX_BYTES

NVLink Bandwidth

Rate at which data is transmitted or received through NVLink, excluding the protocol header.

This is an average value within a period, not an instantaneous value.

For example, if 1 GB of data is transmitted within 1 second, the transmission rate is 1 GB/s regardless of whether the data is transmitted at a constant rate or in bursts. Theoretically, the maximum NVLink Gen2 bandwidth is 25 GB/s per link in each direction.

≥ 0

Bytes/s

Network I/O

ma_container_network_receive_bytes

Downlink Rate (BPS)

Inbound traffic rate of a measured object

≥ 0

Bytes/s

ma_container_network_receive_packets

Downlink Rate (PPS)

Number of data packets received by a NIC per second

≥ 0

Packets/s

ma_container_network_receive_error_packets

Downlink Error Rate

Number of error packets received by a NIC per second

≥ 0

Count/s

ma_container_network_transmit_bytes

Uplink Rate (BPS)

Outbound traffic rate of a measured object

≥ 0

Bytes/s

ma_container_network_transmit_error_packets

Uplink Error Rate

Number of error packets sent by a NIC per second

≥ 0

Count/s

ma_container_network_transmit_packets

Uplink Rate (PPS)

Number of data packets sent by a NIC per second

≥ 0

Packets/s

NPU

ma_container_npu_util

NPU Usage

NPU usage of a measured object

0–100

%

ma_container_npu_memory_util

NPU Memory Usage

Used NPU memory as a percentage of the total NPU memory

0–100

%

ma_container_npu_memory_used_megabytes

Used NPU Memory

NPU memory used by a measured object

≥ 0

MB

ma_container_npu_memory_total_megabytes

Total NPU Memory

Total NPU memory of a measured object

≥ 0

MB