Locating Faults by Analyzing Training Logs
If a ModelArts training job runs into an issue, check the logs first. In most cases, the error information reported in the logs is enough to locate the issue.
If a training job fails, ModelArts automatically identifies the failure cause and displays a message on the log page. The message consists of possible causes, recommended solutions, and error logs (marked in red).
![](https://support.huaweicloud.com/eu/develop-modelarts/figure/en-us_image_0000001906698432.png)
ModelArts provides possible causes (for reference only) and solutions for some common training faults. Not all faults can be identified. For a distributed job, the log page displays only the analysis result of the current node. To determine the failure cause of a distributed training job, check the analysis results of all nodes used by the job.
To rectify common training faults, perform the following steps:
- Rectify the fault based on the analysis and suggestions provided on the log page.
  - Solution 1: Follow the troubleshooting document that is provided.
  - Solution 2: Rebuild the training job and run it again.
- If the fault persists, analyze the error information in the logs to locate and rectify the fault.
- If the provided solutions cannot rectify the fault, submit a service ticket for technical support.
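When you analyze the logs yourself, especially for a distributed job with one log per node, a small script can help surface the relevant lines. The following is a minimal sketch, not a ModelArts tool: it assumes the logs of all nodes have already been downloaded to a local directory as `*.log` files, and both the function names and the error patterns are illustrative assumptions rather than anything defined by ModelArts.

```python
import re
from pathlib import Path

# Patterns that often mark failures in Python training logs.
# This list is an assumption for illustration; adjust it to your framework.
ERROR_PATTERN = re.compile(
    r"Traceback \(most recent call last\)|\bERROR\b|\w*Error\b|out of memory"
)


def find_error_lines(log_text, context=2):
    """Return (line_number, snippet) for each line that matches ERROR_PATTERN.

    line_number is 1-based; snippet includes `context` lines before and after.
    """
    lines = log_text.splitlines()
    hits = []
    for i, line in enumerate(lines):
        if ERROR_PATTERN.search(line):
            start = max(0, i - context)
            end = min(len(lines), i + context + 1)
            hits.append((i + 1, "\n".join(lines[start:end])))
    return hits


def scan_node_logs(log_dir, pattern="*.log"):
    """Scan every matching log file in log_dir; map file name -> error hits."""
    results = {}
    for path in sorted(Path(log_dir).glob(pattern)):
        text = path.read_text(encoding="utf-8", errors="replace")
        results[path.name] = find_error_lines(text)
    return results
```

For example, calling `scan_node_logs("./logs")` on a hypothetical local directory returns a dictionary keyed by log file name, so you can see which nodes reported errors and compare them before deciding which fault occurred first.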