Viewing the Resource Usage of a Training Job
Description
In the Monitoring tab of the training job details page, you can view the CPU, GPU, and NPU usage of the job as a whole or of a single node.
Notes and Constraints
You can view the monitoring data of the entire training period. For details, see Table 1.
Resource usage data of a training job is retained for 30 days and then automatically deleted.
Operations
- On the ModelArts console, choose Model Training > Training Jobs from the navigation pane.
- In the training job list, click the name of the target job to go to the training job details page.
- Click the Monitoring tab to view the resource usage of the training job. You can monitor resources in the following dimensions. Table 1 describes the metrics.
- Job monitoring: Monitors the overall resource usage of the current training job. You can select the last 15 minutes, last 30 minutes, last 1 hour, last 6 hours, last 12 hours, or last 24 hours, or specify a period.
- Task monitoring: Monitors the resource usage of the training job's nodes. You can select the last 15 minutes, last 30 minutes, last 1 hour, last 6 hours, last 12 hours, or last 24 hours, or specify a period.
Table 1 Training job monitoring metrics

| Category | Metric | Parameter | Description | Unit | Value Range | Presentation Form |
|---|---|---|---|---|---|---|
| CPU | CPU Usage | ma_container_cpu_util | CPU usage of a measured object | % | 0%–100% | Line chart |
| CPU | Used Cores | ma_container_cpu_used_core | Number of CPU cores used by a measured object | Core | ≥ 0 | Bar chart |
| Memory | Physical Memory Usage | ma_container_memory_util | Percentage of used physical memory to total physical memory | % | 0%–100% | Line chart |
| Memory | Used Physical Memory | ma_container_memory_used_meg | Physical memory used by a measured object (container_memory_working_set_bytes of the current working set; working-set memory equals active anonymous pages plus cache and file-backed pages, and is ≤ container_memory_usage_bytes) | MB | ≥ 0 | Bar chart |
| GPU | GPU Usage | ma_container_gpu_util | GPU usage of a measured object | % | 0%–100% | Line chart |
| GPU | GPU Memory Usage | ma_container_gpu_mem_util | Percentage of used GPU memory to total GPU memory | % | 0%–100% | Line chart |
| GPU | Used GPU Memory | ma_container_gpu_mem_used_megabytes | GPU memory used by a measured object | MB | ≥ 0 | Bar chart |
| NPU | NPU Usage | ma_container_npu_ai_core_util | AI Core usage of Ascend AI processors | % | 0%–100% | Line chart |
| NPU | NPU Memory Usage | ma_container_npu_memory_util | Percentage of used NPU memory to total NPU memory (to be replaced by ma_container_npu_ddr_memory_util for the Snt3 series and ma_container_npu_hbm_util for the Snt9 series) | % | 0%–100% | Line chart |
| NPU | Used NPU Memory | ma_container_npu_memory_used_megabytes | NPU memory used by a measured object (to be replaced by ma_container_npu_ddr_memory_usage_bytes for the Snt3 series and ma_container_npu_hbm_usage_bytes for the Snt9 series) | MB | ≥ 0 | Bar chart |
| Network | Network Downlink Rate | ma_container_network_receive_bytes | Inbound traffic rate of a measured object | Bytes/s | ≥ 0 | Line chart |
| Network | Network Uplink Rate | ma_container_network_transmit_bytes | Outbound traffic rate of a measured object | Bytes/s | ≥ 0 | Line chart |
| Disk | Disk Read Rate | ma_container_disk_read_kilobytes | Volume of data read from a disk per second | KB/s | ≥ 0 | Line chart |
| Disk | Disk Write Rate | ma_container_disk_write_kilobytes | Volume of data written to a disk per second | KB/s | ≥ 0 | Line chart |
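The working-set figure behind Used Physical Memory is the container's total memory usage minus inactive file-cache pages, which is why it never exceeds container_memory_usage_bytes. Below is a minimal sketch of that relationship, assuming a cgroup v1 memory controller mounted at /sys/fs/cgroup/memory (paths and file names differ under cgroup v2):

```python
# Hedged sketch: derive a working-set value the way cAdvisor does for
# container_memory_working_set_bytes (total usage minus inactive file cache).
# Assumes a cgroup v1 memory controller at /sys/fs/cgroup/memory.
from pathlib import Path

CGROUP = Path("/sys/fs/cgroup/memory")  # assumption: cgroup v1 mount point

def working_set_bytes() -> int:
    usage = int((CGROUP / "memory.usage_in_bytes").read_text())
    stats = dict(
        line.split() for line in (CGROUP / "memory.stat").read_text().splitlines()
    )
    inactive_file = int(stats.get("total_inactive_file", 0))
    # The working set never goes below zero and is always <= usage,
    # matching the "<= container_memory_usage_bytes" note in Table 1.
    return max(usage - inactive_file, 0)

print(f"{working_set_bytes() / 2**20:.1f} MB")  # report in MB, as in Table 1
```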
For more metrics, see Viewing Monitoring Metrics of a Training Job.
Alarms of Job Resource Usage
You can view the job resource usage on the training job list page. If the average GPU/NPU usage of the job's worker-0 instance is lower than 50%, an alarm is displayed in the training job list.

Job resource usage here covers only GPU and NPU resources. The average GPU/NPU usage of a job's worker-0 instance is calculated by summing the usage of each GPU/NPU accelerator card at every sampling point of the worker-0 instance and averaging the result.
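As a quick illustration of that rule, the sketch below averages per-card utilization samples and flags the sub-50% case. The sample layout and function names are assumptions for illustration; ModelArts performs this calculation server-side.

```python
# Minimal sketch of the worker-0 usage alarm described above.
# samples[t][c] = usage (%) of accelerator card c at sampling point t.
from statistics import mean

def average_accelerator_usage(samples: list[list[float]]) -> float:
    # Pool every per-card reading across all sampling points, then average.
    return mean(value for point in samples for value in point)

def low_usage_alarm(samples: list[list[float]], threshold: float = 50.0) -> bool:
    # Alarm when worker-0's average GPU/NPU usage drops below the threshold.
    return average_accelerator_usage(samples) < threshold

# Two sampling points on a 2-card worker-0 instance: average is 35.0%.
print(low_usage_alarm([[30.0, 45.0], [40.0, 25.0]]))  # True -> alarm shown
```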
Improving Job Resource Utilization
- Increasing the value of batch_size increases GPU and NPU usage. Choose the largest batch size that does not cause a memory overflow; see the first sketch after this list.
- If reading a batch of data takes longer than the GPUs or NPUs need to compute on it, GPU or NPU usage may fluctuate. In this case, optimize the performance of data reading and data augmentation, for example by reading data in parallel or using tools such as the NVIDIA Data Loading Library (DALI) to speed up augmentation; see the second sketch after this list.
- If a model is large and saved frequently, GPU or NPU usage is affected. In this case, save models less frequently. Similarly, make sure that other non-GPU/NPU operations, such as log printing and training metric saving, do not stall the training loop for long.
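The first sketch probes for a workable batch size, written in PyTorch with a placeholder model and input shape (both are assumptions for illustration): double the batch size until CUDA reports out of memory, then keep the last size that fit.

```python
# Hedged sketch: find the largest batch size that fits in GPU memory
# by doubling until an out-of-memory error occurs.
import torch
import torch.nn as nn

model = nn.Linear(1024, 1024).cuda()  # placeholder model for illustration

def max_safe_batch_size(start: int = 8, limit: int = 4096) -> int:
    batch, last_ok = start, 0
    while batch <= limit:
        try:
            x = torch.randn(batch, 1024, device="cuda")  # placeholder input
            model(x).sum().backward()                    # forward + backward
            model.zero_grad(set_to_none=True)
            last_ok = batch
            batch *= 2  # it fit: try twice the size
        except RuntimeError as err:  # CUDA OOM surfaces as a RuntimeError
            if "out of memory" not in str(err):
                raise
            torch.cuda.empty_cache()
            break
    return last_ok

print(max_safe_batch_size())
```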
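The second sketch shows parallel data reading with PyTorch's DataLoader, using a placeholder dataset (DALI would replace the CPU-side augmentation with a GPU pipeline; all names below are illustrative):

```python
# Hedged sketch: overlap data loading with computation so the GPU/NPU
# is not left idle waiting for the next batch.
import torch
from torch.utils.data import DataLoader, Dataset

class RandomImages(Dataset):  # placeholder dataset for illustration
    def __len__(self):
        return 10_000
    def __getitem__(self, idx):
        return torch.randn(3, 224, 224), idx % 10

loader = DataLoader(
    RandomImages(),
    batch_size=64,
    num_workers=4,            # parallel worker processes read/augment batches
    pin_memory=True,          # page-locked host memory speeds host-to-GPU copies
    prefetch_factor=2,        # each worker keeps 2 batches ready ahead of time
    persistent_workers=True,  # avoid re-forking workers every epoch
)

for images, labels in loader:
    pass  # the training step would consume the prefetched batch here
```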