Updated on 2026-04-28 GMT+08:00

Intelligent O&M

Description

ModelArts monitors training jobs in real time for smooth operations. The training job details page includes intelligent O&M tools for easy monitoring and maintenance.

If a job ends in the Failed or Terminated state, choose a diagnosis tool under Intelligent O&M. The tools differ in scope and in how long they take to run, so select the one that matches your needs.

Prerequisites

Fault monitoring requires that you enable Auto Restart when creating a training job.

Code error detection requires that you enable Auto Restart when creating a training job.

Performance monitoring requires that you enable Performance Monitoring and Diagnosis when creating a training job.
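The prerequisites above amount to a lookup from monitoring type to the setting that must be enabled when the job is created. The following is a minimal Python sketch of that lookup. The names REQUIRED_SETTING and protected_monitoring_types are illustrative assumptions and are not part of any ModelArts SDK or API.

```python
# Hypothetical lookup transcribed from the prerequisites above.
# These names are illustrative only; this is not a ModelArts API.
REQUIRED_SETTING = {
    "fault monitoring": "Auto Restart",
    "code error detection": "Auto Restart",
    "performance monitoring": "Performance Monitoring and Diagnosis",
}


def protected_monitoring_types(enabled_settings):
    """Return the monitoring types covered by the settings enabled on a job."""
    enabled = set(enabled_settings)
    return sorted(
        monitoring_type
        for monitoring_type, setting in REQUIRED_SETTING.items()
        if setting in enabled
    )
```

For example, a job created with only Auto Restart enabled would be covered by fault monitoring and code error detection, but not by performance monitoring.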

Real-Time Monitoring

Real-time monitoring in intelligent O&M is categorized into fault monitoring, code error detection, and performance monitoring.

After you enable the corresponding prerequisite features for a training job, the intelligent O&M system monitors the job in real time. The interface displays one of the following status indicators: Unprotected, No risk, Low risk, Medium risk, High risk, or Monitoring. If an anomaly is detected, the system reports a risk level and a detailed monitoring report so that you can handle abnormal jobs promptly.

Table 1 Real-time monitoring comparison

| Real-Time Monitoring Type | Description | Anomaly Risk Level |
|---|---|---|
| Fault monitoring | When a fault occurs, the system automatically triggers restart and recovery to ensure the high availability of the training job. | Medium, High |
| Code error detection | Performs real-time monitoring of the current training job code. When an anomaly appears, a diagnostic report is automatically generated to assist in troubleshooting. | Low, Medium, High |
| Performance monitoring | Monitors the performance metrics of the training job in real time. When metrics become abnormal, a monitoring report is automatically generated to assist in troubleshooting. | Medium |
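
If you track anomalies from several jobs outside the console, the risk levels in Table 1 can be treated as an ordered scale. The following Python sketch sorts reported anomalies by severity and rejects combinations that Table 1 does not list. The Risk enum, the POSSIBLE_RISK mapping, and the triage function are hypothetical names introduced for illustration. They do not correspond to a ModelArts API.

```python
from enum import IntEnum


class Risk(IntEnum):
    """Anomaly risk levels from Table 1, ordered by severity (assumed order)."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


# Risk levels that each monitoring type can report, transcribed from Table 1.
POSSIBLE_RISK = {
    "fault monitoring": {Risk.MEDIUM, Risk.HIGH},
    "code error detection": {Risk.LOW, Risk.MEDIUM, Risk.HIGH},
    "performance monitoring": {Risk.MEDIUM},
}


def triage(anomalies):
    """Sort (monitoring_type, risk) pairs so that the most severe come first.

    Raises ValueError for a pair that Table 1 does not list.
    """
    for monitoring_type, risk in anomalies:
        if risk not in POSSIBLE_RISK.get(monitoring_type, set()):
            raise ValueError(f"{monitoring_type} does not report {risk.name}")
    return sorted(anomalies, key=lambda pair: pair[1], reverse=True)
```

Using an IntEnum keeps the comparison explicit: a High risk fault sorts ahead of a Medium risk performance anomaly without any string handling.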

Diagnosis Tools

Two diagnosis tools are currently supported: performance analysis and standard diagnosis.

Table 2 Diagnosis tools

| Diagnosis Tool | Description |
|---|---|
| Performance analysis | Designed for performance degradation issues such as unexpected step durations or imbalanced resource utilization. It provides real-time observation of step-time curves and supports manual collection of profiling data to generate visualized analysis results. |
| Standard diagnosis | Primarily detects training job environment information, job events, job logs, and device logs to identify runtime environment issues, code anomalies, and hardware failures. The diagnosis duration is positively correlated with the job cluster scale and log file size. |

Click Diagnose on the right side of the corresponding tool to perform a diagnosis on the training job.

Once completed, you can view the diagnosis report.

Viewing Monitoring Reports

When an anomaly is detected during real-time monitoring, the system generates an anomaly detection report. Click View Report to see the detailed diagnosis results.

  • Code error detection report

    The results include the faulty device involved in the event, key logs, fault category, faulty component, faulty module, and recommended solutions.

Viewing Diagnosis Reports

After a diagnosis is complete, you can click View Report under the Intelligent O&M tab on the training job details page. Alternatively, you can find the corresponding job in the list under OM Management > Log Diagnosis to view details.

The diagnosis report shows the basic information and the results of the diagnosis task.

  • Basic information

    Includes job ID, diagnosis duration, creation time, update time, creator, description, resource type, and training job ID.

  • Diagnosis results

    Includes a description of the symptoms, a description of the fault, and recommended solutions for fault handling.
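
If you copy report details into your own tracking system, the fields listed above can be held in a simple record. The following Python sketch is one possible layout. The DiagnosisReport class, its field names and types, and the summarize helper are assumptions made for illustration. The console does not necessarily expose the report in this shape.

```python
from dataclasses import dataclass, field


@dataclass
class DiagnosisReport:
    """Hypothetical record of a diagnosis report; field types are assumed."""
    # Basic information
    job_id: str
    diagnosis_duration: str
    creation_time: str
    update_time: str
    creator: str
    description: str
    resource_type: str
    training_job_id: str
    # Diagnosis results
    symptoms: str = ""
    fault: str = ""
    recommended_solutions: list = field(default_factory=list)


def summarize(report):
    """Return a short text summary: the fault, then one line per solution."""
    fault = report.fault or "no fault described"
    lines = [f"Training job {report.training_job_id}: {fault}"]
    lines += [f"  - {solution}" for solution in report.recommended_solutions]
    return "\n".join(lines)
```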