Viewing the Resource Usage of a Training Job
Operations
You can view the resource usage of a compute node in the Resource Usages window. Data from at most the last three days is displayed. While the window is open, the data is loaded and then refreshed periodically.
Operation 1: If a training job uses multiple compute nodes, choose a node from the drop-down list box to view its metrics.
Operation 2: Click cpuUsage, gpuMemUsage, gpuUtil, memUsage, npuMemUsage, or npuUtil to show or hide the usage chart of that parameter.
Operation 3: Hover the cursor over the chart to view the usage at a specific time.
Parameter | Description |
---|---|
cpuUsage | CPU usage |
gpuMemUsage | GPU memory usage |
gpuUtil | GPU usage |
memUsage | Memory usage |
npuMemUsage | NPU memory usage |
npuUtil | NPU usage |
Alarms of Job Resource Usage
You can view the job resource usage on the training job list page. If the average GPU/NPU usage of a job is lower than 50%, an alarm is displayed in the training job list.
Job resource usage here covers only GPU and NPU resources. The average GPU/NPU usage of a job is calculated as follows: collect the usage of every GPU/NPU accelerator card at each time point of the job and take the average. If a job uses multiple compute nodes, collect the usage of the cards on all compute nodes and then compute a single average for the job.
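The calculation described above can be illustrated with a minimal Python sketch. It is not the service's actual implementation: the input layout (a dictionary of nodes, cards, and sampled usage percentages), the function names, and the choice of a flat mean over every sample are all assumptions made for illustration. Only the 50% alarm threshold comes from the text.

```python
from statistics import mean

# Alarm threshold in percent, taken from the description above.
ALARM_THRESHOLD = 50.0


def average_accelerator_usage(samples):
    """Return the mean usage over every card sample of every node.

    `samples` is a hypothetical layout:
    {node_name: {card_id: [usage_percent_at_t0, usage_percent_at_t1, ...]}}
    """
    all_values = [
        value
        for cards in samples.values()    # every compute node
        for series in cards.values()     # every GPU/NPU card on the node
        for value in series              # every sampled time point
    ]
    if not all_values:
        raise ValueError("no usage samples available")
    return mean(all_values)


def should_alarm(samples, threshold=ALARM_THRESHOLD):
    """True when the job-level average is below the threshold."""
    return average_accelerator_usage(samples) < threshold


# Made-up sample data: two nodes, two cards each, two time points.
example = {
    "node-0": {"gpu0": [80.0, 60.0], "gpu1": [40.0, 20.0]},
    "node-1": {"gpu0": [30.0, 50.0], "gpu1": [10.0, 30.0]},
}
print(average_accelerator_usage(example))  # 40.0
print(should_alarm(example))               # True
```

In this made-up example one card is busy most of the time, yet the job-level average is 40% because the remaining cards are mostly idle, so the job would be flagged.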
Improving Job Resource Utilization
- Increasing the value of batch_size generally increases GPU and NPU usage. Choose the largest batch size that does not cause a memory overflow.
- If the time for reading data in a batch is longer than the time for GPUs or NPUs to calculate data in a batch, GPU or NPU usage may fluctuate. In this case, optimize the performance of data reading and data augmentation. For example, read data in parallel or use tools such as NVIDIA Data Loading Library (DALI) to improve the data augmentation speed.
- If a model is large and is saved frequently, GPU or NPU usage drops while the model is being saved. In this case, save the model less often. Similarly, make sure that other non-GPU/NPU operations, such as log printing and training metric saving, do not take up too much of the training time.
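Two of the suggestions above, reading data ahead of the accelerator and saving the model less often, can be sketched in framework-independent Python. This is only an illustration under stated assumptions: `prefetch` and `should_checkpoint` are hypothetical helper names, the buffer size and save interval are arbitrary, and a real training script would more likely rely on the data-loading and checkpointing facilities of its own framework.

```python
import queue
import threading


def prefetch(batch_iter, buffer_size=4):
    """Yield batches while a background thread reads ahead.

    The reader thread fills a bounded queue, so loading the next batch
    overlaps with the computation on the current one.
    """
    buf = queue.Queue(maxsize=buffer_size)
    sentinel = object()  # marks the end of the input

    def producer():
        try:
            for batch in batch_iter:
                buf.put(batch)  # blocks while the buffer is full
        finally:
            buf.put(sentinel)   # always signal the end, even on error

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buf.get()
        if item is sentinel:
            return
        yield item


def should_checkpoint(step, every=1000):
    """Save only every `every` steps instead of after each step."""
    return step > 0 and step % every == 0


# Usage with stand-in data: integers play the role of batches.
for step, batch in enumerate(prefetch(iter(range(10)), buffer_size=2), start=1):
    # ... the training step on `batch` would run here ...
    if should_checkpoint(step, every=5):
        print(f"saving model at step {step}")
```

The bounded queue keeps memory use predictable: the reader never gets more than `buffer_size` batches ahead of the training loop.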