Locating Faults by Analyzing Training Logs
If you encounter an issue while a ModelArts training job is running, check the logs first. In most cases, the error information reported in the logs is enough to locate the issue.
If a training job fails, ModelArts automatically identifies the failure cause and displays a message on the log page. The message consists of possible causes, recommended solutions, and error logs (marked in red).
ModelArts provides possible causes (for reference only) and solutions for some common training faults only; not all faults can be identified. For a distributed job, the log page displays the analysis result of the current node only. To determine why the job failed, check the analysis results of every node used by the job.
To rectify common training faults, perform the following steps:
- Rectify the fault based on the analysis and suggestions provided on the log page.
  - Solution 1: A troubleshooting document is provided for you to follow.
  - Solution 2: Rebuild the training job and run it again.
- If the fault persists, analyze the error information in the logs to locate and rectify the fault.
- If the provided solutions cannot rectify your fault, you can submit a service ticket for technical support.
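When you analyze the logs yourself, especially for a distributed job where every node has to be checked, a small script can help surface the suspicious lines. The following is a minimal sketch, not a ModelArts feature. It assumes that you have already downloaded the logs to a local directory with one `*.log` file per node. The keyword list and the names `find_error_lines` and `scan_log_dir` are hypothetical and should be adapted to what your training framework actually prints.

```python
import re
from pathlib import Path

# Assumed keywords that often accompany failures. This is a plain
# substring match, so harmless lines such as "error_rate=0.1" also match.
ERROR_PATTERN = re.compile(
    r"traceback|error|exception|out of memory", re.IGNORECASE
)


def find_error_lines(log_text, context=2):
    """Return (line_number, snippet) pairs for lines matching ERROR_PATTERN.

    line_number is 1-based. snippet holds the matching line plus up to
    `context` lines before and after it.
    """
    lines = log_text.splitlines()
    hits = []
    for i, line in enumerate(lines):
        if ERROR_PATTERN.search(line):
            start = max(0, i - context)
            end = min(len(lines), i + context + 1)
            hits.append((i + 1, "\n".join(lines[start:end])))
    return hits


def scan_log_dir(log_dir):
    """Scan every *.log file in log_dir (assumed layout: one file per node).

    Returns a dict that maps each file name to its list of hits.
    """
    results = {}
    for path in sorted(Path(log_dir).glob("*.log")):
        text = path.read_text(encoding="utf-8", errors="replace")
        results[path.name] = find_error_lines(text)
    return results
```

Calling `scan_log_dir` on the download directory returns one entry per log file. A node with an empty list printed nothing that matches the assumed keywords, which does not prove that the node is healthy, so treat the result as a starting point for reading the full logs.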