Updated on 2025-08-18 GMT+08:00

Viewing the Resource Usage of a Training Job

Description

In the Monitoring tab of the training job details page, you can view the CPU, GPU, and NPU usage for the job or a single node.

Notes and Constraints

You can view the monitoring data of the entire training period. For details, see Table 1.

Resource usage data of a training job is retained for 30 days and then automatically deleted.

Operations

  1. On the ModelArts console, choose Model Training > Training Jobs in the navigation pane.
  2. In the training job list, click the name of the target job to go to the training job details page.
  3. Click the Monitoring tab to view the resource usage of the training job. You can monitor resources in the following dimensions. Table 1 describes the metrics.
    • Job monitoring: Monitors the overall resource usage of the current training job. You can select the last 15 minutes, last 30 minutes, last 1 hour, last 6 hours, last 12 hours, or last 24 hours, or specify a period.
    • Task monitoring: Monitors the resource usage of the training job's nodes. You can select the last 15 minutes, last 30 minutes, last 1 hour, last 6 hours, last 12 hours, or last 24 hours, or specify a period.
    Table 1 Training job monitoring metrics

    | Category | Metric | Parameter | Description | Unit | Value Range | Presentation Form |
    | --- | --- | --- | --- | --- | --- | --- |
    | CPU | CPU Usage | ma_container_cpu_util | CPU usage of a measured object | % | 0%–100% | Line chart |
    | CPU | Used Cores | ma_container_cpu_used_core | Number of CPU cores used by a measured object | Core | ≥ 0 | Bar chart |
    | Memory | Physical Memory Usage | ma_container_memory_util | Percentage of the used physical memory to the total physical memory | % | 0%–100% | Line chart |
    | Memory | Used Physical Memory | ma_container_memory_used_meg | Physical memory used by a measured object (container_memory_working_set_bytes of the current working set; working-set memory = active anonymous pages and cache plus file-backed pages, which is ≤ container_memory_usage_bytes) | MB | ≥ 0 | Bar chart |
    | GPU | GPU Usage | ma_container_gpu_util | GPU usage of a measured object | % | 0%–100% | Line chart |
    | GPU | GPU Memory Usage | ma_container_gpu_mem_util | Percentage of the used GPU memory to the total GPU memory | % | 0%–100% | Line chart |
    | GPU | Used GPU Memory | ma_container_gpu_mem_used_megabytes | GPU memory used by a measured object | MB | ≥ 0 | Bar chart |
    | NPU | NPU Usage | ma_container_npu_ai_core_util | AI core usage of Ascend AI processors | % | 0%–100% | Line chart |
    | NPU | NPU Memory Usage | ma_container_npu_memory_util | Percentage of the used NPU memory to the total NPU memory (to be replaced by ma_container_npu_ddr_memory_util for the Snt3 series and ma_container_npu_hbm_util for the Snt9 series) | % | 0%–100% | Line chart |
    | NPU | Used NPU Memory | ma_container_npu_memory_used_megabytes | NPU memory used by a measured object (to be replaced by ma_container_npu_ddr_memory_usage_bytes for the Snt3 series and ma_container_npu_hbm_usage_bytes for the Snt9 series) | MB | ≥ 0 | Bar chart |
    | Network | Network Downlink Rate | ma_container_network_receive_bytes | Inbound traffic rate of a measured object | Bytes/s | ≥ 0 | Line chart |
    | Network | Network Uplink Rate | ma_container_network_transmit_bytes | Outbound traffic rate of a measured object | Bytes/s | ≥ 0 | Line chart |
    | Disk | Disk Read Rate | ma_container_disk_read_kilobytes | Volume of data read from a disk per second | KB/s | ≥ 0 | Line chart |
    | Disk | Disk Write Rate | ma_container_disk_write_kilobytes | Volume of data written to a disk per second | KB/s | ≥ 0 | Line chart |

    For more metrics, see Viewing Monitoring Metrics of a Training Job.
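
The working-set definition behind ma_container_memory_used_meg in Table 1 follows the usual cAdvisor convention: bytes charged to the container minus inactive file-backed pages. A minimal sketch of that calculation, assuming the code runs inside a container on a cgroup v1 host (under cgroup v2 the files are memory.current and a differently laid-out memory.stat):

```python
# Approximate container_memory_working_set_bytes from cgroup v1 files.
# Hypothetical standalone script, not part of ModelArts itself.

CGROUP_MEM = "/sys/fs/cgroup/memory"

def read_usage_bytes() -> int:
    # Total bytes charged to the container (container_memory_usage_bytes).
    with open(f"{CGROUP_MEM}/memory.usage_in_bytes") as f:
        return int(f.read())

def read_total_inactive_file() -> int:
    # Inactive file-backed page bytes, reclaimable under memory pressure.
    with open(f"{CGROUP_MEM}/memory.stat") as f:
        for line in f:
            key, value = line.split()
            if key == "total_inactive_file":
                return int(value)
    return 0

def working_set_megabytes() -> float:
    # Working set = usage minus inactive file pages, floored at zero,
    # reported in the same MB unit as ma_container_memory_used_meg.
    working_set = read_usage_bytes() - read_total_inactive_file()
    return max(working_set, 0) / (1024 * 1024)

if __name__ == "__main__":
    print(f"working set: {working_set_megabytes():.1f} MB")
```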

Alarms of Job Resource Usage

You can view the job resource usage on the training job list page. If the average GPU/NPU usage of the job's worker-0 instance is lower than 50%, an alarm is displayed in the training job list.

Figure 1 Job resource usage in the job list

The job resource usage shown here covers only GPU and NPU resources. The average GPU/NPU usage of a job's worker-0 instance is calculated by collecting the usage of every GPU/NPU accelerator card at every monitored time point of the worker-0 instance and averaging those samples.
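
As a minimal sketch of that calculation (the utilization series and the 50% check are illustrative; the console computes the average from its own monitoring samples):

```python
# Average GPU/NPU usage of a worker-0 instance: pool the per-card
# utilization samples from every time point and take the mean.

def average_accelerator_util(samples_per_point: list[list[float]]) -> float:
    """samples_per_point[t][c] = utilization (%) of card c at time point t."""
    all_samples = [util for point in samples_per_point for util in point]
    return sum(all_samples) / len(all_samples)

# Illustrative series: three time points, two accelerator cards each.
series = [[35.0, 40.0], [30.0, 45.0], [50.0, 38.0]]
average = average_accelerator_util(series)
print(f"worker-0 average usage: {average:.1f}%")
if average < 50.0:  # the alarm threshold described above
    print("a low-usage alarm would appear in the job list")
```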

Improving Job Resource Utilization

  • Increasing batch_size increases GPU and NPU usage. Choose the largest batch size that does not cause a memory overflow.
  • If reading a batch of data takes longer than the GPUs or NPUs take to compute on it, GPU or NPU usage may fluctuate. In this case, optimize the performance of data reading and data augmentation, for example, by reading data in parallel or using tools such as the NVIDIA Data Loading Library (DALI) to speed up augmentation (see the sketch after this list).
  • If a model is large and saved frequently, GPU or NPU usage drops while checkpoints are being written. Save checkpoints less often. Similarly, keep other non-GPU/NPU operations, such as log printing and training metric saving, from blocking the training loop for long.
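
A minimal PyTorch sketch of the last two points. The dataset, model, batch size, and checkpoint interval are illustrative assumptions, not ModelArts defaults; for heavier augmentation pipelines, DALI would take the place of the DataLoader:

```python
# Illustrative PyTorch training loop: parallel data loading plus
# infrequent checkpointing, so the accelerator is not left idle.
import torch
from torch.utils.data import DataLoader, TensorDataset

CHECKPOINT_EVERY = 10  # epochs between saves (assumption, not a ModelArts default)

def main() -> None:
    # Synthetic stand-in for a real dataset.
    dataset = TensorDataset(torch.randn(4096, 32), torch.randint(0, 10, (4096,)))
    loader = DataLoader(
        dataset,
        batch_size=256,   # pick the largest size that fits in device memory
        num_workers=4,    # read and augment batches in parallel with compute
        pin_memory=True,  # speeds up host-to-device copies
    )

    model = torch.nn.Linear(32, 10)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.CrossEntropyLoss()

    for epoch in range(30):
        for inputs, labels in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(inputs), labels)
            loss.backward()
            optimizer.step()
        # Save rarely: writing a large checkpoint stalls the training loop.
        if (epoch + 1) % CHECKPOINT_EVERY == 0:
            torch.save(model.state_dict(), f"checkpoint_{epoch + 1}.pt")

if __name__ == "__main__":
    main()
```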