Intelligent O&M
Description
ModelArts monitors training jobs in real time for smooth operations. The training job details page includes intelligent O&M tools for easy monitoring and maintenance.
If a job ends in the Failed or Terminated state, choose a diagnosis tool under Intelligent O&M. The tools differ in scope and in how long a diagnosis takes, so select the one that matches the problem you are investigating.
Prerequisites
Fault monitoring requires that you enable Auto Restart when creating a training job.
Code error detection requires that you enable Auto Restart when creating a training job.
Performance monitoring requires that you enable Performance Monitoring and Diagnosis when creating a training job.
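The prerequisites above amount to a small lookup from monitoring type to the feature that must be enabled at job creation. The following is a minimal sketch of that lookup, for illustration only: `PREREQUISITES` and `available_monitoring` are hypothetical names and are not part of any ModelArts SDK or API.

```python
# Illustrative only: these names are hypothetical, not a ModelArts API.
# Each monitoring type maps to the feature that must be enabled
# when the training job is created.
PREREQUISITES = {
    "fault monitoring": "Auto Restart",
    "code error detection": "Auto Restart",
    "performance monitoring": "Performance Monitoring and Diagnosis",
}


def available_monitoring(enabled_features):
    """Return the monitoring types whose prerequisite feature is enabled."""
    enabled = set(enabled_features)
    return sorted(
        kind for kind, feature in PREREQUISITES.items() if feature in enabled
    )


if __name__ == "__main__":
    # A job created with only Auto Restart enabled.
    print(available_monitoring(["Auto Restart"]))
```

In this sketch, enabling Auto Restart alone covers both fault monitoring and code error detection, while performance monitoring needs its own feature.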
Real-Time Monitoring
Real-time monitoring in intelligent O&M is categorized into fault monitoring, code error detection, and performance monitoring.
Once you enable the corresponding prerequisite features for a training job, the intelligent O&M system monitors the job in real time. The interface displays one of the following status indicators: Unprotected, No risk, Low risk, Medium risk, High risk, or Monitoring. If an anomaly is detected, the system reports a risk level and provides a detailed monitoring report so that you can handle abnormal jobs promptly.
| Real-Time Monitoring Type | Description | Anomaly Risk Level |
|---|---|---|
| Fault monitoring | When a fault occurs, the system automatically triggers restart and recovery to keep the training job highly available. | Medium, High |
| Code error detection | Monitors the code of the current training job in real time. When an anomaly occurs, a diagnostic report is generated automatically to help with troubleshooting. | Low, Medium, High |
| Performance monitoring | Monitors the performance metrics of the training job in real time. When metrics become abnormal, a monitoring report is generated automatically to help with troubleshooting. | Medium |
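The table can be read as a constraint on which risk levels each monitoring type can report. The sketch below encodes that constraint and picks the most severe anomaly from a list. It is an illustration under stated assumptions: `Risk`, `POSSIBLE_RISKS`, and `most_severe` are hypothetical names, the ordering Low < Medium < High is assumed, and nothing here reflects how ModelArts exposes this data.

```python
from enum import IntEnum


class Risk(IntEnum):
    """Anomaly risk levels; the Low < Medium < High ordering is an assumption."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


# Risk levels each monitoring type can report, taken from the table above.
POSSIBLE_RISKS = {
    "fault monitoring": {Risk.MEDIUM, Risk.HIGH},
    "code error detection": {Risk.LOW, Risk.MEDIUM, Risk.HIGH},
    "performance monitoring": {Risk.MEDIUM},
}


def most_severe(anomalies):
    """Return the (monitoring_type, Risk) pair with the highest risk, or None.

    Raises ValueError if a pair uses a risk level that the table does not
    list for that monitoring type.
    """
    checked = []
    for kind, risk in anomalies:
        if risk not in POSSIBLE_RISKS[kind]:
            raise ValueError(f"{kind} does not report {risk.name} risk")
        checked.append((kind, risk))
    return max(checked, key=lambda item: item[1], default=None)


if __name__ == "__main__":
    reported = [
        ("performance monitoring", Risk.MEDIUM),
        ("fault monitoring", Risk.HIGH),
    ]
    print(most_severe(reported))
```

A triage order like this is one possible way to decide which monitoring report to open first when several anomalies are shown for the same job.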
Diagnosis Tools
Two diagnosis tools are currently supported: performance analysis and standard diagnosis.
| Diagnosis Tool | Description |
|---|---|
| Performance analysis | Designed for performance degradation issues such as unexpected step durations or imbalanced resource utilization. It provides real-time observation of step-time curves and supports manual collection of profiling data to generate visualized analysis results. |
| Standard diagnosis | Checks the training job's environment information, job events, job logs, and device logs to identify runtime environment issues, code anomalies, and hardware failures. The diagnosis takes longer as the job cluster scale and the log file size increase. |
Click Diagnose on the right side of the corresponding tool to perform a diagnosis on the training job.
Once the diagnosis is complete, you can view the diagnosis report.
Viewing Monitoring Reports
When an anomaly is detected during real-time monitoring, an anomaly detection report will be generated. Click View Report to see detailed diagnosis results.
Viewing Diagnosis Reports
After a diagnosis is complete, you can click View Report under the Intelligent O&M tab on the training job details page. Alternatively, you can find the corresponding job in the list under OM Management > Log Diagnosis to view details.
The diagnosis report shows the basic information and the results of the diagnosis task.