Basic Metrics: ModelArts Metrics
This section describes the ModelArts metrics reported to AOM through the Agent.
Category |
Metric |
Metric Name |
Description |
Value Range |
Unit |
---|---|---|---|---|---|
CPU |
ma_container_cpu_util |
CPU Usage |
CPU usage of a measured object |
0–100 |
% |
ma_container_cpu_used_core |
Used CPU Cores |
Number of CPU cores used by a measured object |
≥ 0 |
Cores |
|
ma_container_cpu_limit_core |
Total CPU Cores |
Total number of CPU cores requested for a measured object |
≥ 1 |
Cores |
|
Memory |
ma_container_memory_capacity_megabytes |
Memory |
Total physical memory requested for a measured object |
≥ 0 |
MB |
ma_container_memory_util |
Physical Memory Usage |
Used physical memory as a percentage of the total physical memory requested for a measured object |
0–100 |
% |
|
ma_container_memory_used_megabytes |
Used Physical Memory |
Physical memory used by a measured object, reported from container_memory_working_set_bytes (the current working set). Working-set memory = active anonymous pages and cache, plus file-backed pages; it is always ≤ container_memory_usage_bytes. |
≥ 0 |
MB |
|
Storage I/O |
ma_container_disk_read_kilobytes |
Disk Read Rate |
Volume of data read from a disk per second |
≥ 0 |
KB/s |
ma_container_disk_write_kilobytes |
Disk Write Rate |
Volume of data written to a disk per second |
≥ 0 |
KB/s |
|
GPU memory |
ma_container_gpu_mem_total_megabytes |
GPU Memory Capacity |
Total GPU memory of a training job |
> 0 |
MB |
ma_container_gpu_mem_util |
GPU Memory Usage |
Used GPU memory as a percentage of the total GPU memory |
0–100 |
% |
|
ma_container_gpu_mem_used_megabytes |
Used GPU Memory |
GPU memory used by a measured object |
≥ 0 |
MB |
|
GPU |
ma_container_gpu_util |
GPU Usage |
GPU usage of a measured object |
0–100 |
% |
ma_container_gpu_mem_copy_util |
GPU Memory Bandwidth Usage |
GPU memory bandwidth usage of a measured object. For example, the maximum memory bandwidth of an NVIDIA V100 GPU is 900 GB/s. If the current memory bandwidth is 450 GB/s, the memory bandwidth usage is 50%. |
0–100 |
% |
|
ma_container_gpu_enc_util |
GPU Encoder Usage |
GPU encoder usage of a measured object |
0–100 |
% |
|
ma_container_gpu_dec_util |
GPU Decoder Usage |
GPU decoder usage of a measured object |
0–100 |
% |
|
DCGM_FI_DEV_GPU_TEMP |
GPU Temperature |
GPU temperature |
> 0 |
°C |
|
DCGM_FI_DEV_POWER_USAGE |
GPU Power |
GPU power |
> 0 |
W |
|
DCGM_FI_DEV_MEMORY_TEMP |
Memory Temperature |
GPU memory temperature |
> 0 |
°C |
|
DCGM_FI_PROF_GR_ENGINE_ACTIVE |
Graphics Engine Activity |
Fraction of time within a period during which the graphics or compute engine is active, averaged over all graphics and compute engines. An engine counts as active when a graphics or compute context is bound to it and that context is busy. |
0–1.0 |
Percentage (fraction) |
|
DCGM_FI_PROF_SM_OCCUPANCY |
SM Occupancy |
Ratio of the number of warps (thread bundles) resident on an SM to the maximum number of warps the SM can hold, averaged over all SMs within a period. A high value does not necessarily mean high GPU usage. Only for workloads limited by GPU memory bandwidth (see DCGM_FI_PROF_DRAM_ACTIVE) does a higher value indicate more efficient GPU usage. |
0–1.0 |
Percentage (fraction) |
|
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE |
Tensor Activity |
Fraction of the period during which the tensor (HMMA/IMMA) pipe is active. This is an average over the period, not an instantaneous value. A higher value indicates higher utilization of the tensor cores. A value of 1 (100%) means that a tensor instruction is issued every instruction cycle throughout the period (one instruction completes in two cycles). A value of 0.2 (20%) can result from any of the following: 20% of the SM tensor cores running at 100% utilization for the entire period; all SM tensor cores running at 20% utilization for the entire period; all SM tensor cores running at 100% utilization for 1/5 of the period; or some other combination. |
0–1.0 |
Percentage (fraction) |
|
DCGM_FI_PROF_DRAM_ACTIVE |
Memory BW Utilization |
Fraction of cycles within a period during which data is sent to or received from device memory. This is an average over the period, not an instantaneous value. A higher value indicates higher device memory utilization. A value of 1 (100%) means that a DRAM instruction is executed every cycle throughout the period; in practice, the achievable peak is about 0.8 (80%). A value of 0.2 (20%) means that data is read from or written to device memory during 20% of the cycles in the period. |
0–1.0 |
Percentage (fraction) |
|
DCGM_FI_PROF_PIPE_FP16_ACTIVE |
FP16 Engine Activity |
Fraction of the period during which the FP16 (half-precision) pipe is active. This is an average over the period, not an instantaneous value. A higher value indicates higher utilization of the FP16 cores. A value of 1 (100%) means that an FP16 instruction is executed every two cycles (on Volta cards, for example) throughout the period. A value of 0.2 (20%) can result from any of the following: 20% of the SM FP16 cores running at 100% utilization for the entire period; all SM FP16 cores running at 20% utilization for the entire period; all SM FP16 cores running at 100% utilization for 1/5 of the period; or some other combination. |
0–1.0 |
Percentage (fraction) |
|
DCGM_FI_PROF_PIPE_FP32_ACTIVE |
FP32 Engine Activity |
Fraction of the period during which the fused multiply-add (FMA) pipe is active. Multiply-add applies to FP32 (single-precision) and integer operations. This is an average over the period, not an instantaneous value. A higher value indicates higher utilization of the FP32 cores. A value of 1 (100%) means that an FP32 instruction is executed every two cycles (on Volta cards, for example) throughout the period. A value of 0.2 (20%) can result from any of the following: 20% of the SM FP32 cores running at 100% utilization for the entire period; all SM FP32 cores running at 20% utilization for the entire period; all SM FP32 cores running at 100% utilization for 1/5 of the period; or some other combination. |
0–1.0 |
Percentage (fraction) |
|
DCGM_FI_PROF_PIPE_FP64_ACTIVE |
FP64 Engine Activity |
Fraction of the period during which the FP64 (double-precision) pipe is active. This is an average over the period, not an instantaneous value. A higher value indicates higher utilization of the FP64 cores. A value of 1 (100%) means that an FP64 instruction is executed every four cycles (on Volta cards, for example) throughout the period. A value of 0.2 (20%) can result from any of the following: 20% of the SM FP64 cores running at 100% utilization for the entire period; all SM FP64 cores running at 20% utilization for the entire period; all SM FP64 cores running at 100% utilization for 1/5 of the period; or some other combination. |
0–1.0 |
Percentage (fraction) |
|
DCGM_FI_PROF_SM_ACTIVE |
SM Activity |
Fraction of time within a period during which at least one warp (thread bundle) is active on an SM, averaged over all SMs. The value is insensitive to the number of threads per block. A warp is active once it has been scheduled and allocated resources; it may be computing or not (for example, waiting on a memory request). A value below 0.5 indicates inefficient GPU use; the value should be greater than 0.8. For example, on a GPU with N SMs: a kernel that uses N thread blocks and runs on all SMs for the entire period gives a value of 1 (100%); a kernel that runs N/5 thread blocks for the entire period gives 0.2; a kernel that uses N thread blocks but runs for only 1/5 of the cycles in the period gives 0.2. |
0–1.0 |
Percentage (fraction) |
|
DCGM_FI_PROF_PCIE_TX_BYTES DCGM_FI_PROF_PCIE_RX_BYTES |
PCIe Bandwidth |
Rate of data transmitted or received over the PCIe bus, including both protocol headers and data payload. The rate is averaged over a period and is not an instantaneous value. For example, if 1 GB of data is transmitted within 1 second, the rate is 1 GB/s regardless of whether the data is sent at a constant rate or in bursts. The theoretical maximum PCIe Gen3 bandwidth is 985 MB/s per lane. |
≥ 0 |
Bytes/s |
|
DCGM_FI_PROF_NVLINK_RX_BYTES DCGM_FI_PROF_NVLINK_TX_BYTES |
NVLink Bandwidth |
Rate of data transmitted or received over NVLink, excluding protocol headers. The rate is averaged over a period and is not an instantaneous value. For example, if 1 GB of data is transmitted within 1 second, the rate is 1 GB/s regardless of whether the data is sent at a constant rate or in bursts. The theoretical maximum NVLink Gen2 bandwidth is 25 GB/s per link in each direction. |
≥ 0 |
Bytes/s |
|
Network I/O |
ma_container_network_receive_bytes |
Downlink Rate (BPS) |
Inbound traffic rate of a measured object |
≥ 0 |
Bytes/s |
ma_container_network_receive_packets |
Downlink Rate (PPS) |
Number of data packets received by a NIC per second |
≥ 0 |
Packets/s |
|
ma_container_network_receive_error_packets |
Downlink Error Rate |
Number of error packets received by a NIC per second |
≥ 0 |
Count/s |
|
ma_container_network_transmit_bytes |
Uplink Rate (BPS) |
Outbound traffic rate of a measured object |
≥ 0 |
Bytes/s |
|
ma_container_network_transmit_error_packets |
Uplink Error Rate |
Number of error packets sent by a NIC per second |
≥ 0 |
Count/s |
|
ma_container_network_transmit_packets |
Uplink Rate (PPS) |
Number of data packets sent by a NIC per second |
≥ 0 |
Packets/s |
|
NPU |
ma_container_npu_util |
NPU Usage |
NPU usage of a measured object |
0–100 |
% |
ma_container_npu_memory_util |
NPU Memory Usage |
Used NPU memory as a percentage of the total NPU memory |
0–100 |
% |
|
ma_container_npu_memory_used_megabytes |
Used NPU Memory |
NPU memory used by a measured object |
≥ 0 |
MB |
|
ma_container_npu_memory_total_megabytes |
Total NPU Memory |
Total NPU memory of a measured object |
≥ 0 |
MB |
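
The value ranges and worked examples in the table can be checked with a few lines of arithmetic. The following Python sketch is illustrative only. The function names (usage_percent, fraction_to_percent, average_rate, sm_activity_hint) are hypothetical and are not part of ModelArts, AOM, or DCGM, and the inputs are the sample figures quoted in the table rather than live metric data.

```python
"""Illustrative helpers for reading the metric values described in the table.

All names below are hypothetical; none of them is a ModelArts, AOM, or DCGM API.
"""


def usage_percent(used: float, total: float) -> float:
    """Usage as the *_util metrics define it: used / total, as a 0-100 percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * used / total


def fraction_to_percent(value: float) -> float:
    """Convert a DCGM_FI_PROF_* value (0-1.0 fraction) to a 0-100 percentage."""
    if not 0.0 <= value <= 1.0:
        raise ValueError("expected a fraction between 0 and 1.0")
    return value * 100.0


def average_rate(total_bytes: float, period_seconds: float) -> float:
    """Average transfer rate over a period in bytes/s (constant or bursty traffic)."""
    if period_seconds <= 0:
        raise ValueError("period_seconds must be positive")
    return total_bytes / period_seconds


def sm_activity_hint(sm_active: float) -> str:
    """Apply the table's rule of thumb for DCGM_FI_PROF_SM_ACTIVE."""
    if sm_active < 0.5:
        return "inefficient"      # below 0.5: GPUs are not efficiently used
    if sm_active > 0.8:
        return "meets target"     # the value should be greater than 0.8
    return "below target"         # between the two thresholds


if __name__ == "__main__":
    # V100 example from the table: 450 GB/s out of a 900 GB/s maximum.
    print(usage_percent(450, 900))
    # A tensor activity of 0.2 expressed as a percentage.
    print(fraction_to_percent(0.2))
    # 1 GB transmitted within 1 second, in bytes per second.
    print(average_rate(1_000_000_000, 1))
    # Rule-of-thumb reading of an SM activity value.
    print(sm_activity_hint(0.3))
```

The same usage_percent calculation applies to any of the usage metrics that are defined as used capacity over total capacity, such as ma_container_memory_util or ma_container_npu_memory_util.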