Training Dashboard Monitoring
Description
ModelArts lets you monitor training jobs in real time while they run. The ModelArts console also provides a monitoring dashboard with data export. Choose Model Training > Training Jobs > Monitoring Dashboard to view the overview, health monitoring, and job monitoring data for your training jobs. Click Export to save selected monitoring data for local analysis. Use these views to track training progress and resource usage.
Constraints
You can view monitoring data from the past year on the training dashboard.
Training Job Overview
The training job overview shows the total number of jobs, current resource requests, and job states, giving you a quick understanding of the training progress and resource usage.
| Metric | Description |
|---|---|
| Total Jobs | The total number of all training jobs under the current workspace for the account, showing the overall scale of jobs. |
| Resource Usage (PUs) | The total number of accelerator cards requested by all currently running training jobs, indicating real-time resource demand. |
| Jobs in each state | The number of jobs in different statuses (such as Queuing, Running, Completed, Abnormal/Failed), used to monitor job health and distribution. |
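The overview metrics above can be reproduced from a list of job records, for example when checking exported data locally. The following is a minimal sketch, assuming a hypothetical record shape with `state` and `accelerator_cards` fields. These names are illustrative only and are not part of the ModelArts API or export format.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Job:
    # Hypothetical record shape; field names are assumptions, not ModelArts fields.
    name: str
    state: str              # e.g. "Queuing", "Running", "Completed", "Failed"
    accelerator_cards: int  # accelerator cards requested by the job


def overview(jobs):
    """Return (total jobs, cards requested by running jobs, job count per state)."""
    total = len(jobs)
    # Only jobs that are currently running count toward real-time resource demand.
    cards_in_use = sum(j.accelerator_cards for j in jobs if j.state == "Running")
    per_state = Counter(j.state for j in jobs)
    return total, cards_in_use, per_state


jobs = [
    Job("job-a", "Running", 8),
    Job("job-b", "Running", 16),
    Job("job-c", "Queuing", 8),
    Job("job-d", "Completed", 4),
    Job("job-e", "Failed", 8),
]
total, cards_in_use, per_state = overview(jobs)
print(total, cards_in_use, dict(per_state))
```

In this sample, the queued, completed, and failed jobs count toward the total and the per-state breakdown but not toward the accelerator card figure.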
Health Monitoring
Health monitoring tracks the stability and reliability of training jobs. It quantifies job execution results and the system's fault tolerance, giving O&M teams data to base decisions on.
| Metric | Description |
|---|---|
| Success Rate | The percentage of jobs successfully completed during the statistical period. |
| Fault Recovery Trigger Rate | The percentage of jobs with fault tolerance and recovery enabled during the statistical period. |
| Job Recovery Success Rate | The percentage of abnormal or interrupted jobs that were successfully restarted and used accelerator cards during the statistical period. |
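As a rough illustration of how these three percentages relate to one another, the sketch below computes them from hypothetical job records. The field names (`state`, `recovery_enabled`, `interrupted`, `recovered`) and the choice of denominators are assumptions based on the descriptions above. The exact formulas used by the dashboard are not stated here.

```python
def percentage(part, whole):
    """Return part/whole as a percentage, or None when the denominator is zero."""
    return None if whole == 0 else round(100.0 * part / whole, 2)


def health_metrics(jobs):
    """Compute the three health metrics for jobs in one statistical period."""
    total = len(jobs)
    completed = sum(1 for j in jobs if j["state"] == "Completed")
    recovery_enabled = sum(1 for j in jobs if j["recovery_enabled"])
    interrupted = sum(1 for j in jobs if j["interrupted"])
    recovered = sum(1 for j in jobs if j["interrupted"] and j["recovered"])
    return {
        "success_rate": percentage(completed, total),
        "fault_recovery_trigger_rate": percentage(recovery_enabled, total),
        # Assumption: the denominator is the number of abnormal or interrupted jobs.
        "job_recovery_success_rate": percentage(recovered, interrupted),
    }


# Hypothetical sample data, not real dashboard output.
jobs = [
    {"state": "Completed", "recovery_enabled": True, "interrupted": True, "recovered": True},
    {"state": "Completed", "recovery_enabled": False, "interrupted": False, "recovered": False},
    {"state": "Failed", "recovery_enabled": True, "interrupted": True, "recovered": False},
    {"state": "Completed", "recovery_enabled": True, "interrupted": False, "recovered": False},
]
metrics = health_metrics(jobs)
print(metrics)
```

Returning `None` for an empty denominator keeps a period with no interrupted jobs from being reported as a 0% recovery rate.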
Job Monitoring
Job monitoring covers job faults and resource consumption. You can view this data for the past week, the past 30 days, or a custom time range.
Job Fault Monitoring
| Metric | Description |
|---|---|
| Job Failure Rate | The percentage of jobs that failed during the statistical period. |
| Job Recovery Success Rate | The percentage of abnormal or interrupted jobs that were successfully restarted and used accelerator cards during the statistical period. |
| Job Recovery Duration | The average time taken to restart a failed or interrupted job with accelerator cards used during the statistical period. |
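The effect of the statistical period on these metrics can be sketched as follows. This is an illustrative example only: the fields `finished_at`, `state`, and `recovery_seconds` are hypothetical, and assigning a job to a period by its finish time is an assumption, since the timestamp the dashboard uses is not stated here.

```python
from datetime import datetime, timedelta
from statistics import mean


def fault_metrics(jobs, start, end):
    """Return (failure rate in %, average recovery time in seconds) for [start, end)."""
    in_window = [j for j in jobs if start <= j["finished_at"] < end]
    if not in_window:
        return None, None
    failed = sum(1 for j in in_window if j["state"] == "Failed")
    failure_rate = round(100.0 * failed / len(in_window), 2)
    # Only jobs that actually went through a recovery have a duration.
    durations = [j["recovery_seconds"] for j in in_window
                 if j["recovery_seconds"] is not None]
    avg_recovery = mean(durations) if durations else None
    return failure_rate, avg_recovery


# Hypothetical sample data; the last job falls outside the 7-day window.
end = datetime(2024, 6, 30)
start = end - timedelta(days=7)
jobs = [
    {"finished_at": datetime(2024, 6, 25), "state": "Failed", "recovery_seconds": None},
    {"finished_at": datetime(2024, 6, 26), "state": "Completed", "recovery_seconds": 120},
    {"finished_at": datetime(2024, 6, 28), "state": "Completed", "recovery_seconds": 180},
    {"finished_at": datetime(2024, 6, 29), "state": "Completed", "recovery_seconds": None},
    {"finished_at": datetime(2024, 6, 10), "state": "Failed", "recovery_seconds": 600},
]
failure_rate, avg_recovery = fault_metrics(jobs, start, end)
print(failure_rate, avg_recovery)
```

Widening the window to 30 days would include the fifth job and change both values, which is why the same job set can show different figures under different time ranges.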
Job Resource Consumption
| Metric | Description |
|---|---|
| Resource Consumption Trend | The compute resources requested by training jobs over time. The data can be filtered by NPU or GPU. |
| Top Jobs by Resource Consumption | A list of training jobs that requested the most compute resources during the statistical period. The data can be filtered by NPU, GPU, or CPU. |
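A ranking like Top Jobs by Resource Consumption can be approximated locally by filtering on resource type and sorting by the requested amount. The sketch below assumes hypothetical fields `resource_type` and `requested`. The unit of `requested` and the number of jobs the dashboard lists are not specified here, so both are placeholders.

```python
def top_jobs(jobs, resource_type, n=3):
    """Return up to n jobs of the given resource type, largest request first."""
    matching = [j for j in jobs if j["resource_type"] == resource_type]
    return sorted(matching, key=lambda j: j["requested"], reverse=True)[:n]


# Hypothetical sample data; "requested" is in arbitrary units.
jobs = [
    {"name": "job-a", "resource_type": "NPU", "requested": 64},
    {"name": "job-b", "resource_type": "GPU", "requested": 32},
    {"name": "job-c", "resource_type": "NPU", "requested": 128},
    {"name": "job-d", "resource_type": "CPU", "requested": 16},
    {"name": "job-e", "resource_type": "NPU", "requested": 8},
]
top_npu = [j["name"] for j in top_jobs(jobs, "NPU", n=2)]
print(top_npu)
```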