Viewing the Resource Usage of a Training Job
Operations
- On the ModelArts console, choose Training Management > Training Jobs from the navigation pane.
- In the training job list, click the name of the target job to go to the training job details page.
- On the training job details page, click the Resource Usages tab to view the resource usage of the compute nodes. Data from at most the last three days can be displayed. When the resource usage window is opened, the data is loaded and then refreshed periodically.
Operation 1: If a training job uses multiple compute nodes, choose a node from the drop-down list to view its metrics.
Operation 2: Click cpuUsage, gpuMemUsage, gpuUtil, memUsage, npuMemUsage, or npuUtil to show or hide the usage chart of that metric.
Operation 3: Hover the cursor over the chart to view the usage at a specific time.
Figure 1 Resource Usages
Table 1 Parameters

| Parameter   | Description      |
|-------------|------------------|
| cpuUsage    | CPU usage        |
| gpuMemUsage | GPU memory usage |
| gpuUtil     | GPU usage        |
| memUsage    | Memory usage     |
| npuMemUsage | NPU memory usage |
| npuUtil     | NPU usage        |
Alarms of Job Resource Usage
You can view the job resource usage on the training job list page. If the average GPU/NPU usage of the job's worker-0 instance is lower than 50%, an alarm is displayed in the training job list.
The job resource usage here covers only GPU and NPU resources. The average GPU/NPU usage of a job's worker-0 instance is calculated as follows: collect the usage of each GPU/NPU accelerator card on the worker-0 instance at each sampling time point, then average all of these values, as illustrated in the sketch below.
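The following is a minimal sketch of that calculation, assuming the per-card utilization samples are already available as a list of readings per time point. The data layout and variable names are illustrative only; they are not part of the ModelArts monitoring API.

```python
# Minimal sketch: averaging per-card GPU/NPU utilization for a worker-0 instance.
# The sample data below is illustrative, not the ModelArts monitoring API.
from statistics import mean

# Each inner list holds the utilization (%) of every accelerator card on
# worker-0 at one sampling time point.
samples = [
    [72.0, 68.0, 70.0, 66.0],  # t0
    [40.0, 38.0, 41.0, 39.0],  # t1
    [55.0, 57.0, 54.0, 56.0],  # t2
]

# Gather the usage of every card at every time point, then take the mean.
all_readings = [util for point in samples for util in point]
average_usage = mean(all_readings)

print(f"Average worker-0 accelerator usage: {average_usage:.1f}%")
if average_usage < 50:
    print("Below 50%: an alarm would be shown in the training job list.")
```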
Improving Job Resource Utilization
- Increasing the value of batch_size increases GPU and NPU usage. Choose the largest batch size that does not cause a memory overflow.
- If reading a batch of data takes longer than computing it on the GPUs or NPUs, GPU or NPU usage may fluctuate. In this case, optimize the performance of data reading and data augmentation, for example, by reading data in parallel or using tools such as the NVIDIA Data Loading Library (DALI) to speed up data augmentation. See the sketch after this list.
- If a model is large and saved frequently, GPU or NPU usage is affected. In this case, avoid saving the model too often. Similarly, make sure that other non-GPU/NPU operations, such as printing logs and saving training metrics, do not take up too much training time.
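The following is a minimal sketch of the parallel data reading and periodic checkpointing tips above, assuming a PyTorch training script; PyTorch, the model, and the checkpoint path are assumptions for illustration, so adapt the idea to your own framework and job.

```python
# Minimal sketch, assuming PyTorch: parallel data loading plus periodic
# (rather than per-step) checkpoint saving to keep accelerators busy.
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy dataset standing in for a real training dataset.
dataset = TensorDataset(torch.randn(1024, 3, 32, 32),
                        torch.randint(0, 10, (1024,)))

# Read data in parallel so the accelerator is not idle waiting for input.
loader = DataLoader(
    dataset,
    batch_size=256,      # largest batch size that still fits in memory
    num_workers=4,       # parallel data-loading worker processes
    pin_memory=True,     # faster host-to-device copies
)

model = torch.nn.Sequential(torch.nn.Flatten(),
                            torch.nn.Linear(3 * 32 * 32, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.CrossEntropyLoss()

SAVE_EVERY_N_STEPS = 500   # save checkpoints periodically, not every step
step = 0
for images, labels in loader:
    optimizer.zero_grad()
    loss = loss_fn(model(images), labels)
    loss.backward()
    optimizer.step()
    step += 1
    if step % SAVE_EVERY_N_STEPS == 0:
        torch.save(model.state_dict(), "checkpoint.pth")
```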