Updated on 2025-11-18 GMT+08:00

Training Dashboard Monitoring

Description

ModelArts lets you monitor training jobs in real time while they run. The ModelArts console also provides a monitoring dashboard and data export. Go to Model Training > Training Jobs > Monitoring Dashboard to see the training job overview, health monitoring, and job monitoring data. Click Export to save specific monitoring data for local analysis. Together, these views help you track training progress and resource usage.

Constraints

The training dashboard shows monitoring data from the past year only.

Training Job Overview

The training job overview shows the total number of jobs, current resource requests, and job states, giving you a quick understanding of the training progress and resource usage.

| Metric | Description |
| --- | --- |
| Total Jobs | The total number of training jobs in the current workspace of the account, showing the overall scale of jobs. |
| Resource Usage (PUs) | The total number of accelerator cards requested by all currently running training jobs, indicating real-time resource demand. |
| Jobs in Each State | The number of jobs in each state (such as Queuing, Running, Completed, and Abnormal/Failed), used to monitor job health and distribution. |
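The three overview metrics can be reproduced locally from a list of job records. The following is a minimal sketch only: the record fields (`name`, `state`, `cards`) and the sample values are hypothetical and do not reflect the actual export schema.

```python
from collections import Counter

# Hypothetical job records; the field names are assumptions, not the export schema.
jobs = [
    {"name": "job-a", "state": "Running", "cards": 8},
    {"name": "job-b", "state": "Running", "cards": 16},
    {"name": "job-c", "state": "Queuing", "cards": 8},
    {"name": "job-d", "state": "Completed", "cards": 4},
    {"name": "job-e", "state": "Failed", "cards": 8},
]


def overview(jobs):
    """Return (total jobs, cards requested by running jobs, job count per state)."""
    total_jobs = len(jobs)
    # Only currently running jobs count toward resource usage.
    cards_in_use = sum(j["cards"] for j in jobs if j["state"] == "Running")
    jobs_per_state = dict(Counter(j["state"] for j in jobs))
    return total_jobs, cards_in_use, jobs_per_state


total, cards, per_state = overview(jobs)
print(total, cards, per_state)
```

Counting only jobs in the Running state follows the description of Resource Usage (PUs), which covers currently running jobs.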

Health Monitoring

Health monitoring tracks the stability and reliability of training jobs. It quantifies job execution results and the system's fault tolerance, which provides evidence for O&M decisions.

| Metric | Description |
| --- | --- |
| Success Rate | The percentage of jobs successfully completed during the statistical period. |
| Fault Recovery Trigger Rate | The percentage of jobs with fault tolerance and recovery enabled during the statistical period. |
| Job Recovery Success Rate | The percentage of abnormal or interrupted jobs using accelerator cards that were successfully restarted during the statistical period. |
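Each health metric is a ratio expressed as a percentage. The sketch below shows the arithmetic under stated assumptions: the counts are hypothetical, and the choice of denominator (all jobs in the period for the first two rates, abnormal or interrupted jobs for the third) is inferred from the descriptions above rather than confirmed by the console.

```python
def percentage(part, whole):
    """Return part / whole as a percentage, or None if the denominator is zero."""
    if whole == 0:
        return None
    return 100.0 * part / whole


# Hypothetical counts for one statistical period (not real ModelArts data).
total_jobs = 40             # all jobs in the period (assumed denominator)
completed_jobs = 34         # jobs that completed successfully
recovery_enabled_jobs = 10  # jobs with fault tolerance and recovery enabled
interrupted_jobs = 5        # abnormal or interrupted jobs
recovered_jobs = 4          # interrupted jobs that were restarted successfully

success_rate = percentage(completed_jobs, total_jobs)
trigger_rate = percentage(recovery_enabled_jobs, total_jobs)
recovery_success_rate = percentage(recovered_jobs, interrupted_jobs)
print(success_rate, trigger_rate, recovery_success_rate)
```

Returning `None` for an empty denominator keeps a period with no interrupted jobs from being reported as a 0% recovery rate.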

Job Monitoring

Job monitoring covers job faults and resource consumption. You can view this data for the past week, the past 30 days, or a custom time range.
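When analyzing exported data locally, the same three time ranges can be applied as a simple filter. This is a sketch only: the preset names and the helper functions `time_window` and `in_window` are hypothetical and are not part of ModelArts.

```python
from datetime import datetime, timedelta

# Hypothetical preset names mirroring the dashboard's fixed time ranges.
PRESETS = {
    "past_week": timedelta(days=7),
    "past_30_days": timedelta(days=30),
}


def time_window(now, preset=None, custom=None):
    """Return (start, end) for a preset name or a custom (start, end) pair."""
    if custom is not None:
        start, end = custom
        if start > end:
            raise ValueError("custom range start must not be after its end")
        return start, end
    return now - PRESETS[preset], now


def in_window(timestamp, window):
    """Check whether a timestamp falls inside an inclusive (start, end) window."""
    start, end = window
    return start <= timestamp <= end


now = datetime(2025, 11, 18, 12, 0)
week = time_window(now, preset="past_week")
print(in_window(datetime(2025, 11, 15), week))  # inside the past week
print(in_window(datetime(2025, 11, 1), week))   # outside the past week
```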

Job Fault Monitoring

Table 1 Job fault monitoring

| Metric | Description |
| --- | --- |
| Job Failure Rate | The percentage of jobs that failed during the statistical period. |
| Job Recovery Success Rate | The percentage of abnormal or interrupted jobs using accelerator cards that were successfully restarted during the statistical period. |
| Job Recovery Duration | The average time taken to restart a failed or interrupted job that uses accelerator cards during the statistical period. |
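Job Recovery Duration is an average over individual recovery events. The sketch below illustrates that calculation, assuming each event is recorded as a pair of timestamps (when the job was interrupted and when it was running again). The event layout and the sample values are hypothetical.

```python
from datetime import datetime, timedelta
from statistics import mean

# Hypothetical recovery events: (interrupted at, running again at).
recoveries = [
    (datetime(2025, 11, 1, 10, 0), datetime(2025, 11, 1, 10, 6)),
    (datetime(2025, 11, 3, 22, 30), datetime(2025, 11, 3, 22, 40)),
    (datetime(2025, 11, 7, 4, 15), datetime(2025, 11, 7, 4, 23)),
]


def average_recovery_duration(events):
    """Return the mean time between interruption and restart, or None if no events."""
    if not events:
        return None
    seconds = mean(
        (resumed - interrupted).total_seconds() for interrupted, resumed in events
    )
    return timedelta(seconds=seconds)


avg = average_recovery_duration(recoveries)
print(avg)
```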

Job Resource Consumption

Table 2 Job resource consumption

| Metric | Description |
| --- | --- |
| Resource Consumption Trend | The compute resources requested by training jobs over time. The data can be filtered by NPU or GPU. |
| Top Jobs by Resource Consumption | A list of training jobs that requested the most compute resources during the statistical period. The data can be filtered by NPU, GPU, or CPU. |
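Both resource metrics are aggregations over the same request records: the trend groups by time, and the ranking groups by job. The sketch below assumes hypothetical daily records of the form (date, job name, resource type, cards requested) and treats the sum of requested cards as the consumption measure. Neither the record layout nor that measure is confirmed by the dashboard.

```python
from collections import defaultdict

# Hypothetical daily request records: (date, job name, resource type, cards requested).
records = [
    ("2025-11-01", "job-a", "NPU", 8),
    ("2025-11-01", "job-b", "GPU", 4),
    ("2025-11-02", "job-a", "NPU", 8),
    ("2025-11-02", "job-c", "NPU", 16),
    ("2025-11-03", "job-c", "NPU", 16),
]


def consumption_trend(records, resource_type):
    """Sum requested cards per day for one resource type, ordered by date."""
    per_day = defaultdict(int)
    for day, _job, kind, cards in records:
        if kind == resource_type:
            per_day[day] += cards
    return dict(sorted(per_day.items()))


def top_jobs(records, resource_type, n=2):
    """Return the n jobs that requested the most cards of one resource type."""
    per_job = defaultdict(int)
    for _day, job, kind, cards in records:
        if kind == resource_type:
            per_job[job] += cards
    ranked = sorted(per_job.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


npu_trend = consumption_trend(records, "NPU")
npu_top = top_jobs(records, "NPU")
print(npu_trend)
print(npu_top)
```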