Viewing the Resource Usage of a Training Job
Operations
- On the ModelArts console, choose Training Management > Training Jobs from the navigation pane.
- In the training job list, click the name of the target job to go to the training job details page.
- On the training job details page, click the Resource Usages tab to view the resource usage of the compute nodes. Data from at most the last three days can be displayed. When the resource usage window is opened, the data is loaded and then refreshed periodically.
Operation 1: If a training job uses multiple compute nodes, choose a node from the drop-down list to view its metrics.
Operation 2: Click cpuUsage, gpuMemUsage, gpuUtil, memUsage, npuMemUsage, or npuUtil to show or hide the usage chart of that metric.
Operation 3: Hover the cursor over the chart to view the usage at a specific time.
Figure 1 Resource Usages
Table 1 Parameters

| Parameter   | Description      |
|-------------|------------------|
| cpuUsage    | CPU usage        |
| gpuMemUsage | GPU memory usage |
| gpuUtil     | GPU usage        |
| memUsage    | Memory usage     |
| npuMemUsage | NPU memory usage |
| npuUtil     | NPU usage        |
Alarms of Job Resource Usage
You can view the job resource usage on the training job list page. If the average GPU/NPU usage of the job's worker-0 instance is lower than 50%, an alarm is displayed in the training job list.
The job resource usage here covers only GPU and NPU resources. The average GPU/NPU usage of a job's worker-0 instance is calculated as follows: collect the usage of each GPU/NPU accelerator card on the worker-0 instance at each sampling time point, then average all of these values, as illustrated in the sketch below.
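The following is a minimal sketch of that calculation, assuming the per-card utilization samples are already available as a list of readings per time point. The data layout and variable names are illustrative only; they are not part of the ModelArts monitoring API.

```python
# Minimal sketch: averaging per-card GPU/NPU utilization for a worker-0 instance.
# The sample data below is illustrative, not the ModelArts monitoring API.
from statistics import mean

# Each inner list holds the utilization (%) of every accelerator card on
# worker-0 at one sampling time point.
samples = [
    [72.0, 68.0, 70.0, 66.0],  # t0
    [40.0, 38.0, 41.0, 39.0],  # t1
    [55.0, 57.0, 54.0, 56.0],  # t2
]

# Gather the usage of every card at every time point, then take the mean.
all_readings = [util for point in samples for util in point]
average_usage = mean(all_readings)

print(f"Average worker-0 accelerator usage: {average_usage:.1f}%")
if average_usage < 50:
    print("Below 50%: an alarm would be shown in the training job list.")
```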
Improving Job Resource Utilization
- Increasing the value of batch_size increases GPU and NPU usage. Choose the largest batch size that does not cause a memory overflow.
- If reading a batch of data takes longer than computing it on the GPUs or NPUs, GPU or NPU usage may fluctuate. In this case, optimize the performance of data reading and data augmentation, for example, by reading data in parallel or using tools such as the NVIDIA Data Loading Library (DALI) to speed up data augmentation. See the sketch after this list.
- If a model is large and saved frequently, GPU or NPU usage is affected. In this case, avoid saving the model too often. Similarly, make sure that other non-GPU/NPU operations, such as printing logs and saving training metrics, do not take up too much training time.
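The following is a minimal sketch of the parallel data reading and periodic checkpointing tips above, assuming a PyTorch training script; PyTorch, the model, and the checkpoint path are assumptions for illustration, so adapt the idea to your own framework and job.

```python
# Minimal sketch, assuming PyTorch: parallel data loading plus periodic
# (rather than per-step) checkpoint saving to keep accelerators busy.
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy dataset standing in for a real training dataset.
dataset = TensorDataset(torch.randn(1024, 3, 32, 32),
                        torch.randint(0, 10, (1024,)))

# Read data in parallel so the accelerator is not idle waiting for input.
loader = DataLoader(
    dataset,
    batch_size=256,      # largest batch size that still fits in memory
    num_workers=4,       # parallel data-loading worker processes
    pin_memory=True,     # faster host-to-device copies
)

model = torch.nn.Sequential(torch.nn.Flatten(),
                            torch.nn.Linear(3 * 32 * 32, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.CrossEntropyLoss()

SAVE_EVERY_N_STEPS = 500   # save checkpoints periodically, not every step
step = 0
for images, labels in loader:
    optimizer.zero_grad()
    loss = loss_fn(model(images), labels)
    loss.backward()
    optimizer.step()
    step += 1
    if step % SAVE_EVERY_N_STEPS == 0:
        torch.save(model.state_dict(), "checkpoint.pth")
```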