Basic Metrics: ModelArts Metrics
This section describes the ModelArts metrics reported to AOM through the Agent.
Category | Metric | Metric Name | Description | Value Range | Unit |
---|---|---|---|---|---|
CPU | ma_container_cpu_util | CPU Usage | CPU usage of a measured object | 0–100 | % |
CPU | ma_container_cpu_used_core | Used CPU Cores | Number of CPU cores used by a measured object | ≥ 0 | Cores |
CPU | ma_container_cpu_limit_core | Total CPU Cores | Total number of CPU cores requested for a measured object | ≥ 1 | Cores |
Memory | ma_container_memory_capacity_megabytes | Memory | Total physical memory requested for a measured object | ≥ 0 | MB |
Memory | ma_container_memory_util | Physical Memory Usage | Percentage of used physical memory relative to the total physical memory requested for a measured object | 0–100 | % |
Memory | ma_container_memory_used_megabytes | Used Physical Memory | Physical memory used by a measured object, reported as the current working set (container_memory_working_set_bytes). The working set consists of active anonymous memory, cache, and file-backed pages, and is ≤ container_memory_usage_bytes. | ≥ 0 | MB |
Storage I/O | ma_container_disk_read_kilobytes | Disk Read Rate | Volume of data read from a disk per second | ≥ 0 | KB/s |
Storage I/O | ma_container_disk_write_kilobytes | Disk Write Rate | Volume of data written to a disk per second | ≥ 0 | KB/s |
GPU memory | ma_container_gpu_mem_total_megabytes | GPU Memory Capacity | Total GPU memory of a training job | > 0 | MB |
GPU memory | ma_container_gpu_mem_util | GPU Memory Usage | Percentage of used GPU memory relative to the total GPU memory | 0–100 | % |
GPU memory | ma_container_gpu_mem_used_megabytes | Used GPU Memory | GPU memory used by a measured object | ≥ 0 | MB |
GPU | ma_container_gpu_util | GPU Usage | GPU usage of a measured object | 0–100 | % |
GPU | ma_container_gpu_mem_copy_util | GPU Memory Bandwidth Usage | GPU memory bandwidth usage of a measured object. For example, the maximum memory bandwidth of an NVIDIA V100 GPU is 900 GB/s; if the current memory bandwidth is 450 GB/s, the memory bandwidth usage is 50%. | 0–100 | % |
GPU | ma_container_gpu_enc_util | GPU Encoder Usage | GPU encoder usage of a measured object | 0–100 | % |
GPU | ma_container_gpu_dec_util | GPU Decoder Usage | GPU decoder usage of a measured object | 0–100 | % |
GPU | DCGM_FI_DEV_GPU_TEMP | GPU Temperature | GPU temperature | > 0 | °C |
GPU | DCGM_FI_DEV_POWER_USAGE | GPU Power | GPU power usage | > 0 | W |
GPU | DCGM_FI_DEV_MEMORY_TEMP | Memory Temperature | GPU memory temperature | > 0 | °C |
GPU | DCGM_FI_PROF_GR_ENGINE_ACTIVE | Graphics Engine Activity | Fraction of time during which the graphics or compute engine is active within a period, averaged over all graphics and compute engines. An engine is active when a graphics or compute context is bound to a thread and that context is busy. | 0–1.0 | Fraction (1.0 = 100%) |
GPU | DCGM_FI_PROF_SM_OCCUPANCY | SM Occupancy | Ratio of the number of warps resident on an SM to the maximum number of warps that can reside on the SM, averaged over all SMs within a period. A high value does not necessarily mean high GPU usage. Only for workloads limited by GPU memory bandwidth (see DCGM_FI_PROF_DRAM_ACTIVE) does a higher value indicate more efficient GPU usage. | 0–1.0 | Fraction (1.0 = 100%) |
GPU | DCGM_FI_PROF_PIPE_TENSOR_ACTIVE | Tensor Activity | Fraction of the period during which the tensor (HMMA/IMMA) pipe is active. This is an average over the period, not an instantaneous value. A higher value indicates higher utilization of tensor cores. A value of 1 (100%) corresponds to issuing a tensor instruction every other cycle for the entire period (one instruction completes in two cycles). A value of 0.2 (20%) can result from any of the following: (1) 20% of the SM tensor cores run at 100% utilization for the entire period; (2) all SM tensor cores run at 20% utilization for the entire period; (3) all SM tensor cores run at 100% utilization for 1/5 of the period; (4) other combinations. | 0–1.0 | Fraction (1.0 = 100%) |
GPU | DCGM_FI_PROF_DRAM_ACTIVE | Memory BW Utilization | Fraction of cycles during which data is sent to or received from device memory within a period. This is an average over the period, not an instantaneous value. A higher value indicates higher utilization of device memory. A value of 1 (100%) corresponds to a DRAM instruction executed every cycle throughout the period (in practice, the achievable peak is about 0.8). A value of 0.2 (20%) indicates that data is read from or written to device memory during 20% of the cycles in the period. | 0–1.0 | Fraction (1.0 = 100%) |
GPU | DCGM_FI_PROF_PIPE_FP16_ACTIVE | FP16 Engine Activity | Fraction of the period during which the FP16 (half-precision) pipe is active. This is an average over the period, not an instantaneous value. A higher value indicates higher usage of FP16 cores. A value of 1 (100%) corresponds to an FP16 instruction executed every two cycles (for example, on Volta cards) for the entire period. A value of 0.2 (20%) can result from any of the following: (1) 20% of the SM FP16 cores run at 100% utilization for the entire period; (2) all SM FP16 cores run at 20% utilization for the entire period; (3) all SM FP16 cores run at 100% utilization for 1/5 of the period; (4) other combinations. | 0–1.0 | Fraction (1.0 = 100%) |
GPU | DCGM_FI_PROF_PIPE_FP32_ACTIVE | FP32 Engine Activity | Fraction of the period during which the fused multiply-add (FMA) pipe is active. Multiply-add applies to FP32 (single-precision) and integer operations. This is an average over the period, not an instantaneous value. A higher value indicates higher usage of FP32 cores. A value of 1 (100%) corresponds to an FP32 instruction executed every two cycles (for example, on Volta cards) for the entire period. A value of 0.2 (20%) can result from any of the following: (1) 20% of the SM FP32 cores run at 100% utilization for the entire period; (2) all SM FP32 cores run at 20% utilization for the entire period; (3) all SM FP32 cores run at 100% utilization for 1/5 of the period; (4) other combinations. | 0–1.0 | Fraction (1.0 = 100%) |
GPU | DCGM_FI_PROF_PIPE_FP64_ACTIVE | FP64 Engine Activity | Fraction of the period during which the FP64 (double-precision) pipe is active. This is an average over the period, not an instantaneous value. A higher value indicates higher usage of FP64 cores. A value of 1 (100%) corresponds to an FP64 instruction executed every four cycles (for example, on Volta cards) for the entire period. A value of 0.2 (20%) can result from any of the following: (1) 20% of the SM FP64 cores run at 100% utilization for the entire period; (2) all SM FP64 cores run at 20% utilization for the entire period; (3) all SM FP64 cores run at 100% utilization for 1/5 of the period; (4) other combinations. | 0–1.0 | Fraction (1.0 = 100%) |
GPU | DCGM_FI_PROF_SM_ACTIVE | SM Activity | Fraction of time during which at least one warp is active on an SM within a period, averaged over all SMs. The value is insensitive to the number of threads in each block. A warp is active once it has been scheduled and allocated resources; it may be computing or in a non-computing state (for example, waiting for a memory request). A value below 0.5 indicates that GPUs are not used efficiently; the value should be greater than 0.8. For example, on a GPU with N SMs: (1) a kernel function that runs N thread blocks on all SMs for the entire period yields 1 (100%); (2) a kernel function that runs N/5 thread blocks for the entire period yields 0.2; (3) a kernel function that runs N thread blocks for only 1/5 of the cycles in the period yields 0.2. | 0–1.0 | Fraction (1.0 = 100%) |
GPU | DCGM_FI_PROF_PCIE_TX_BYTES, DCGM_FI_PROF_PCIE_RX_BYTES | PCIe Bandwidth | Rate of data transmitted or received over the PCIe bus, including both the protocol header and the data payload. This is an average over the period, not an instantaneous value. For example, if 1 GB of data is transmitted within 1 second, the rate is 1 GB/s regardless of whether the data is transmitted at a constant rate or in bursts. The theoretical maximum PCIe Gen3 bandwidth is 985 MB/s per lane. | ≥ 0 | Bytes/s |
GPU | DCGM_FI_PROF_NVLINK_RX_BYTES, DCGM_FI_PROF_NVLINK_TX_BYTES | NVLink Bandwidth | Rate of data transmitted or received through NVLink, excluding the protocol header. This is an average over the period, not an instantaneous value. For example, if 1 GB of data is transmitted within 1 second, the rate is 1 GB/s regardless of whether the data is transmitted at a constant rate or in bursts. The theoretical maximum NVLink Gen2 bandwidth is 25 GB/s per link in each direction. | ≥ 0 | Bytes/s |
Network I/O | ma_container_network_receive_bytes | Downlink Rate (BPS) | Inbound traffic rate of a measured object | ≥ 0 | Bytes/s |
Network I/O | ma_container_network_receive_packets | Downlink Rate (PPS) | Number of data packets received by a NIC per second | ≥ 0 | Packets/s |
Network I/O | ma_container_network_receive_error_packets | Downlink Error Rate | Number of error packets received by a NIC per second | ≥ 0 | Count/s |
Network I/O | ma_container_network_transmit_bytes | Uplink Rate (BPS) | Outbound traffic rate of a measured object | ≥ 0 | Bytes/s |
Network I/O | ma_container_network_transmit_error_packets | Uplink Error Rate | Number of error packets sent by a NIC per second | ≥ 0 | Count/s |
Network I/O | ma_container_network_transmit_packets | Uplink Rate (PPS) | Number of data packets sent by a NIC per second | ≥ 0 | Packets/s |
NPU | ma_container_npu_util | NPU Usage | NPU usage of a measured object | 0–100 | % |
NPU | ma_container_npu_memory_util | NPU Memory Usage | Percentage of used NPU memory relative to the total NPU memory | 0–100 | % |
NPU | ma_container_npu_memory_used_megabytes | Used NPU Memory | NPU memory used by a measured object | ≥ 0 | MB |
NPU | ma_container_npu_memory_total_megabytes | Total NPU Memory | Total NPU memory of a measured object | ≥ 0 | MB |
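
The arithmetic behind several descriptions in the table can be sketched in a few lines of Python. This is a minimal illustration rather than part of ModelArts or AOM: the function names and the `lanes` parameter are hypothetical, and the constants (985 MB/s per PCIe Gen3 lane, the 0.5 and 0.8 SM activity thresholds, the 900 GB/s V100 example) are taken from the table itself. The sketch assumes MB means 10^6 bytes.

```python
import math

# Theoretical PCIe Gen3 maximum quoted in the table, assuming 1 MB = 10**6 bytes.
PCIE_GEN3_BYTES_PER_LANE = 985 * 1_000_000


def fraction_to_percent(value: float) -> float:
    """Convert a DCGM profiling fraction (0-1.0) to a percentage (0-100)."""
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"expected a fraction in [0, 1], got {value}")
    return value * 100.0


def bandwidth_util_percent(current: float, maximum: float) -> float:
    """Bandwidth usage as a percentage; both arguments use the same unit."""
    if maximum <= 0:
        raise ValueError("maximum bandwidth must be positive")
    return current / maximum * 100.0


def pipe_activity(core_fraction: float, time_fraction: float,
                  utilization: float) -> float:
    """Averaged pipe activity when a share of cores runs for a share of the
    period at a given utilization (all three arguments are fractions)."""
    return core_fraction * time_fraction * utilization


def classify_sm_activity(sm_active: float) -> str:
    """Apply the table's guidance: below 0.5 is inefficient, above 0.8 is the target."""
    if sm_active < 0.5:
        return "inefficient"
    if sm_active > 0.8:
        return "efficient"
    return "below target"


def pcie_gen3_max_bytes_per_s(lanes: int) -> int:
    """Theoretical PCIe Gen3 ceiling for a link with the given number of lanes."""
    return PCIE_GEN3_BYTES_PER_LANE * lanes


if __name__ == "__main__":
    # V100 example from the table: 450 GB/s out of 900 GB/s.
    print(bandwidth_util_percent(450, 900))
    # The three scenarios that all average to a pipe activity of 0.2.
    print(pipe_activity(0.2, 1.0, 1.0),
          pipe_activity(1.0, 1.0, 0.2),
          pipe_activity(1.0, 0.2, 1.0))
    print(classify_sm_activity(0.35))
    print(math.isclose(fraction_to_percent(0.2), 20.0))
    print(pcie_gen3_max_bytes_per_s(16))
```

The three `pipe_activity` calls mirror the scenarios listed for the tensor, FP16, FP32, and FP64 metrics. Because each metric is averaged over the period, a reading of 0.2 cannot by itself distinguish between them.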