Updated on 2025-08-18 GMT+08:00

Viewing the Resource Usage of a Training Job

Description

In the Monitoring tab of the training job details page, you can view the CPU, GPU, and NPU usage for the job or a single node.

Notes and Constraints

You can view the monitoring data of the entire training period. For details, see Table 1.

Resource usage data of a training job is retained for 30 days and then automatically deleted.

Operations

  1. On the ModelArts console, choose Model Training > Training Jobs in the navigation pane.
  2. In the training job list, click the name of the target job to go to the training job details page.
  3. Click the Monitoring tab to view the resource usage of the training job. You can monitor resources in the following dimensions. Table 1 describes the metrics.
    • Job monitoring: Monitors the overall resource usage of the current training job. You can select the last 15 minutes, last 30 minutes, last 1 hour, last 6 hours, last 12 hours, or last 24 hours, or specify a period.
    • Task monitoring: Monitors the resource usage of the training job's nodes. You can select the last 15 minutes, last 30 minutes, last 1 hour, last 6 hours, last 12 hours, or last 24 hours, or specify a period.
    Table 1 Training job monitoring metrics

    | Category | Metric | Parameter | Description | Unit | Value Range | Presentation Form |
    | --- | --- | --- | --- | --- | --- | --- |
    | CPU | CPU Usage | ma_container_cpu_util | CPU usage of a measured object | % | 0%–100% | Line chart |
    | CPU | Used Cores | ma_container_cpu_used_core | Number of CPU cores used by a measured object | Core | ≥ 0 | Bar chart |
    | Memory | Physical Memory Usage | ma_container_memory_util | Percentage of the used physical memory to the total physical memory | % | 0%–100% | Line chart |
    | Memory | Used Physical Memory | ma_container_memory_used_meg | Physical memory used by a measured object (container_memory_working_set_bytes of the current working set; working-set memory = active anonymous pages and cache plus file-backed pages, which is ≤ container_memory_usage_bytes) | MB | ≥ 0 | Bar chart |
    | GPU | GPU Usage | ma_container_gpu_util | GPU usage of a measured object | % | 0%–100% | Line chart |
    | GPU | GPU Memory Usage | ma_container_gpu_mem_util | Percentage of the used GPU memory to the total GPU memory | % | 0%–100% | Line chart |
    | GPU | Used GPU Memory | ma_container_gpu_mem_used_megabytes | GPU memory used by a measured object | MB | ≥ 0 | Bar chart |
    | NPU | NPU Usage | ma_container_npu_ai_core_util | AI core usage of Ascend AI processors | % | 0%–100% | Line chart |
    | NPU | NPU Memory Usage | ma_container_npu_memory_util | Percentage of the used NPU memory to the total NPU memory (to be replaced by ma_container_npu_ddr_memory_util for the Snt3 series and ma_container_npu_hbm_util for the Snt9 series) | % | 0%–100% | Line chart |
    | NPU | Used NPU Memory | ma_container_npu_memory_used_megabytes | NPU memory used by a measured object (to be replaced by ma_container_npu_ddr_memory_usage_bytes for the Snt3 series and ma_container_npu_hbm_usage_bytes for the Snt9 series) | MB | ≥ 0 | Bar chart |
    | Network | Network Downlink Rate | ma_container_network_receive_bytes | Inbound traffic rate of a measured object | Bytes/s | ≥ 0 | Line chart |
    | Network | Network Uplink Rate | ma_container_network_transmit_bytes | Outbound traffic rate of a measured object | Bytes/s | ≥ 0 | Line chart |
    | Disk | Disk Read Rate | ma_container_disk_read_kilobytes | Volume of data read from a disk per second | KB/s | ≥ 0 | Line chart |
    | Disk | Disk Write Rate | ma_container_disk_write_kilobytes | Volume of data written to a disk per second | KB/s | ≥ 0 | Line chart |

    For more metrics, see Viewing Monitoring Metrics of a Training Job.
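
The working-set definition behind ma_container_memory_used_meg in Table 1 follows the usual cAdvisor convention: bytes charged to the container minus inactive file-backed pages. A minimal sketch of that calculation, assuming the code runs inside a container on a cgroup v1 host (under cgroup v2 the files are memory.current and a differently laid-out memory.stat):

```python
# Approximate container_memory_working_set_bytes from cgroup v1 files.
# Hypothetical standalone script, not part of ModelArts itself.

CGROUP_MEM = "/sys/fs/cgroup/memory"

def read_usage_bytes() -> int:
    # Total bytes charged to the container (container_memory_usage_bytes).
    with open(f"{CGROUP_MEM}/memory.usage_in_bytes") as f:
        return int(f.read())

def read_total_inactive_file() -> int:
    # Inactive file-backed page bytes, reclaimable under memory pressure.
    with open(f"{CGROUP_MEM}/memory.stat") as f:
        for line in f:
            key, value = line.split()
            if key == "total_inactive_file":
                return int(value)
    return 0

def working_set_megabytes() -> float:
    # Working set = usage minus inactive file pages, floored at zero,
    # reported in the same MB unit as ma_container_memory_used_meg.
    working_set = read_usage_bytes() - read_total_inactive_file()
    return max(working_set, 0) / (1024 * 1024)

if __name__ == "__main__":
    print(f"working set: {working_set_megabytes():.1f} MB")
```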

Alarms of Job Resource Usage

You can view the job resource usage on the training job list page. If the average GPU/NPU usage of the job's worker-0 instance is lower than 50%, an alarm is displayed in the training job list.

Figure 1 Job resource usage in the job list

The job resource usage shown here covers only GPU and NPU resources. The average GPU/NPU usage of a job's worker-0 instance is calculated by collecting the usage of every GPU/NPU accelerator card at every monitored time point of the worker-0 instance and averaging those samples.
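
As a minimal sketch of that calculation (the utilization series and the 50% check are illustrative; the console computes the average from its own monitoring samples):

```python
# Average GPU/NPU usage of a worker-0 instance: pool the per-card
# utilization samples from every time point and take the mean.

def average_accelerator_util(samples_per_point: list[list[float]]) -> float:
    """samples_per_point[t][c] = utilization (%) of card c at time point t."""
    all_samples = [util for point in samples_per_point for util in point]
    return sum(all_samples) / len(all_samples)

# Illustrative series: three time points, two accelerator cards each.
series = [[35.0, 40.0], [30.0, 45.0], [50.0, 38.0]]
average = average_accelerator_util(series)
print(f"worker-0 average usage: {average:.1f}%")
if average < 50.0:  # the alarm threshold described above
    print("a low-usage alarm would appear in the job list")
```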

Improving Job Resource Utilization

  • Increasing batch_size increases GPU and NPU usage. Choose the largest batch size that does not cause a memory overflow.
  • If reading a batch of data takes longer than the GPUs or NPUs take to compute on it, GPU or NPU usage may fluctuate. In this case, optimize the performance of data reading and data augmentation, for example, by reading data in parallel or using tools such as the NVIDIA Data Loading Library (DALI) to speed up augmentation (see the sketch after this list).
  • If a model is large and saved frequently, GPU or NPU usage drops while checkpoints are being written. Save checkpoints less often. Similarly, keep other non-GPU/NPU operations, such as log printing and training metric saving, from blocking the training loop for long.
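
A minimal PyTorch sketch of the last two points. The dataset, model, batch size, and checkpoint interval are illustrative assumptions, not ModelArts defaults; for heavier augmentation pipelines, DALI would take the place of the DataLoader:

```python
# Illustrative PyTorch training loop: parallel data loading plus
# infrequent checkpointing, so the accelerator is not left idle.
import torch
from torch.utils.data import DataLoader, TensorDataset

CHECKPOINT_EVERY = 10  # epochs between saves (assumption, not a ModelArts default)

def main() -> None:
    # Synthetic stand-in for a real dataset.
    dataset = TensorDataset(torch.randn(4096, 32), torch.randint(0, 10, (4096,)))
    loader = DataLoader(
        dataset,
        batch_size=256,   # pick the largest size that fits in device memory
        num_workers=4,    # read and augment batches in parallel with compute
        pin_memory=True,  # speeds up host-to-device copies
    )

    model = torch.nn.Linear(32, 10)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.CrossEntropyLoss()

    for epoch in range(30):
        for inputs, labels in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(inputs), labels)
            loss.backward()
            optimizer.step()
        # Save rarely: writing a large checkpoint stalls the training loop.
        if (epoch + 1) % CHECKPOINT_EVERY == 0:
            torch.save(model.state_dict(), f"checkpoint_{epoch + 1}.pt")

if __name__ == "__main__":
    main()
```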