Viewing the Resource Usage of a Training Job

Operations

You can view the resource usage of a compute node in the Resource Usages window. Data from at most the last three days can be displayed. When the Resource Usages window is opened, the data is loaded and then refreshed periodically.

Operation 1: If a training job uses multiple compute nodes, choose a node from the drop-down list box to view its metrics.

Operation 2: Click cpuUsage, gpuMemUsage, gpuUtil, memUsage, npuMemUsage, or npuUtil to show or hide the usage chart of the parameter.

Operation 3: Hover the cursor over the chart to view the usage at a specific time.

Figure 1 Resource Usages

**Table 1** Parameters
Parameter	Description
cpuUsage	CPU usage
gpuMemUsage	GPU memory usage
gpuUtil	GPU usage
memUsage	Memory usage
npuMemUsage	NPU memory usage
npuUtil	NPU usage

Alarms of Job Resource Usage

You can view the job resource usage on the training job list page. If the average GPU/NPU usage of a job is lower than 50%, an alarm is displayed in the training job list.

Figure 2 Job resource usage in the job list

The job resource usage here covers only GPU and NPU resources. The average GPU/NPU usage of a job is calculated as follows: Sum the usage of each GPU/NPU accelerator card at each time point of the job and calculate the average value. If a job uses multiple compute nodes, sum the usage across all compute nodes and then calculate the average usage of the job.
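The calculation described above can be sketched roughly as follows. This is an illustration only, not the platform's actual implementation: the nested dictionary layout, the function names, and the assumption that every sample carries equal weight are all hypothetical. The 50% threshold is the one stated in this section.

```python
# Hypothetical sketch: average GPU/NPU usage of a job across cards and nodes.
# Assumption: every (node, card, time point) sample is weighted equally.

ALARM_THRESHOLD = 50.0  # percent, as stated in this section


def average_accelerator_usage(samples_by_node):
    """Return the mean usage (percent) over all nodes, cards, and time points.

    samples_by_node: {node_name: {card_id: [usage_percent, ...]}}
    """
    all_samples = [
        usage
        for cards in samples_by_node.values()
        for series in cards.values()
        for usage in series
    ]
    if not all_samples:
        raise ValueError("no usage samples available")
    return sum(all_samples) / len(all_samples)


def should_raise_alarm(samples_by_node, threshold=ALARM_THRESHOLD):
    """True if the job's average accelerator usage is below the threshold."""
    return average_accelerator_usage(samples_by_node) < threshold


# Example: two nodes, three cards in total, two time points per card.
example = {
    "node-0": {"gpu0": [80, 60], "gpu1": [40, 20]},
    "node-1": {"gpu0": [30, 10]},
}
```

With the example data, the six samples sum to 240, so the average is 40% and an alarm would be shown under this sketch.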

Improving Job Resource Utilization

Increasing the value of batch_size increases GPU and NPU usage. Choose a batch size that does not cause a memory overflow.
If reading a batch of data takes longer than the GPUs or NPUs take to compute on a batch, GPU or NPU usage may fluctuate. In this case, optimize the performance of data reading and data augmentation. For example, read data in parallel or use tools such as NVIDIA Data Loading Library (DALI) to speed up data augmentation.
If a model is large and saved frequently, GPU or NPU usage is affected. In this case, save the model less often. Similarly, make sure that other non-GPU/NPU operations, such as log printing and training metric saving, do not take up too much of the training time.
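Two of the suggestions above, overlapping data reading with computation and saving the model less often, can be sketched with the Python standard library alone. The helper names prefetch and should_save_checkpoint, the buffer size, and the save interval are hypothetical and chosen for illustration; a real training script would typically rely on the data loading and checkpointing facilities of its framework.

```python
import queue
import threading


def prefetch(batches, buffer_size=2):
    """Yield items from `batches` while a background thread reads ahead.

    The reader thread fills a bounded queue, so the next batch can be
    loaded while the current one is being processed.
    """
    q = queue.Queue(maxsize=buffer_size)
    sentinel = object()  # marks the end of the stream

    def reader():
        try:
            for batch in batches:
                q.put(batch)
        finally:
            # Always signal the end so the consumer does not wait forever.
            q.put(sentinel)

    threading.Thread(target=reader, daemon=True).start()
    while True:
        item = q.get()
        if item is sentinel:
            return
        yield item


def should_save_checkpoint(step, save_every):
    """True only on every `save_every`-th step (never on step 0)."""
    return step > 0 and step % save_every == 0


# Example loop: batches are prefetched, and a save happens every 2 steps.
saved_steps = []
for step, batch in enumerate(prefetch(iter(range(5))), start=1):
    # ... compute on `batch` here ...
    if should_save_checkpoint(step, save_every=2):
        saved_steps.append(step)  # placeholder for saving the model
```

In this sketch, an error raised while reading ends the stream early rather than propagating to the training loop, and a consumer that stops early leaves the daemon reader thread blocked; production code would handle both cases explicitly.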
