Updated on 2024-05-07 GMT+08:00

Viewing the Resource Usage of a Training Job

Operations

  1. On the ModelArts console, choose Training Management > Training Jobs from the navigation pane.
  2. In the training job list, click the name of the target job to go to the training job details page.
  3. On the training job details page, click the Resource Usages tab to view the resource usage of the compute nodes. Data from at most the last three days can be displayed. When the Resource Usages tab is first opened, the data is loaded and then refreshed periodically.

    Operation 1: If a training job uses multiple compute nodes, choose a node from the drop-down list box to view its metrics.

    Operation 2: Click cpuUsage, gpuMemUsage, gpuUtil, memUsage, npuMemUsage, or npuUtil to show or hide the usage chart for that metric.

    Operation 3: Hover the cursor over the chart to view the usage at a specific time.

    Figure 1 Resource Usages
    Table 1 Parameters

    Parameter       Description
    cpuUsage        CPU usage
    gpuMemUsage     GPU memory usage
    gpuUtil         GPU usage
    memUsage        Memory usage
    npuMemUsage     NPU memory usage
    npuUtil         NPU usage

Alarms of Job Resource Usage

You can view the job resource usage on the training job list page. If the average GPU/NPU usage of the job's worker-0 instance is lower than 50%, an alarm is displayed in the training job list.

Figure 2 Job resource usage in the job list

The job resource usage here involves only GPU and NPU resources. The average GPU/NPU usage of a job's worker-0 instance is calculated as follows: sum the usage of every GPU/NPU accelerator card at every sampling point of the worker-0 instance, then divide by the total number of samples to obtain the average value.
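The averaging rule above can be sketched as follows. This is a minimal illustration of the arithmetic only; the function name and the sample data are hypothetical, not a ModelArts API.

```python
# Hedged sketch: average per-card utilization across all cards and all
# sampling points of a worker-0 instance. Names and values are illustrative.

def average_accelerator_usage(samples: dict[str, list[float]]) -> float:
    """Sum the usage (%) of every card at every time point, then average."""
    all_points = [u for card in samples.values() for u in card]
    return sum(all_points) / len(all_points)

# Two hypothetical cards sampled at three time points on worker-0:
usage = average_accelerator_usage({
    "gpu0": [80.0, 60.0, 40.0],
    "gpu1": [20.0, 40.0, 60.0],
})
print(usage)  # 50.0 -> exactly at the 50% threshold, so no low-usage alarm
```

With this rule, one busy card cannot mask an idle one: the idle card's samples pull the job-wide average down toward the alarm threshold.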

Improving Job Resource Utilization

  • Increasing the value of batch_size increases GPU and NPU usage. Choose the largest batch size that does not cause a memory overflow.
  • If the time for reading data in a batch is longer than the time for GPUs or NPUs to calculate data in a batch, GPU or NPU usage may fluctuate. In this case, optimize the performance of data reading and data augmentation. For example, read data in parallel or use tools such as NVIDIA Data Loading Library (DALI) to improve the data augmentation speed.
  • If a model is large and checkpoints are saved frequently, GPU or NPU usage is affected. In this case, save the model less frequently. Similarly, ensure that other non-GPU/NPU operations, such as log printing and training metric saving, do not take up too much of the training time.
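The parallel-data-reading advice above can be sketched with a background prefetch thread: the next batch is loaded while the accelerator computes on the current one, so the device is not left idle waiting for I/O. This is a minimal stdlib sketch; `load_batch` and `train_step` are hypothetical stand-ins for real data reading and GPU/NPU compute (in practice you would use your framework's parallel loader or a tool such as NVIDIA DALI).

```python
# Hedged sketch: overlap data loading with compute via a prefetch thread,
# so slow batch reads do not stall the GPU/NPU between training steps.
from concurrent.futures import ThreadPoolExecutor

def load_batch(i):
    # Hypothetical stand-in for slow disk reads / data augmentation.
    return list(range(i * 4, i * 4 + 4))

def train_step(batch):
    # Hypothetical stand-in for GPU/NPU compute on one batch.
    return sum(batch)

def train(num_batches):
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(load_batch, 0)              # prefetch first batch
        for i in range(num_batches):
            batch = future.result()                      # blocks only if loader lags
            if i + 1 < num_batches:
                future = pool.submit(load_batch, i + 1)  # prefetch next batch
            results.append(train_step(batch))            # compute overlaps next load
    return results

print(train(3))  # [6, 22, 38]
```

If `train_step` takes at least as long as `load_batch`, the loader stays ahead and the compute loop never waits on I/O, which keeps accelerator usage steady instead of fluctuating.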