Updated on 2024-05-07 GMT+08:00

Locating Faults by Analyzing Training Logs

If an issue occurs while a ModelArts training job is running, check the logs first. In most cases, the error information reported in the logs is enough to locate the issue.

If a training job fails, ModelArts automatically analyzes the failure and displays a message on the log page. The message consists of possible causes, recommended solutions, and the error logs (marked in red).

Figure 1 Identifying training faults

ModelArts provides possible causes (for reference only) and solutions for some common training faults. Not all faults can be identified. For a distributed job, the log page displays only the analysis result of the current node. To determine the failure cause of a distributed training job, check the analysis results of every node used by the job.

To rectify common training faults, perform the following steps:

  1. Rectify the fault based on the analysis and suggestions provided on the log page.
    • Solution 1: Follow the troubleshooting document provided on the log page.
    • Solution 2: Rebuild the training job and run it again.
  2. If the fault persists, analyze the error information in the logs to locate and rectify the fault.
  3. If none of the provided solutions rectifies the fault, submit a service ticket for technical support.
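
For step 2, and for distributed jobs where every node's logs need checking, a small script can speed up the first pass. The following is a minimal sketch, not a ModelArts feature. It assumes the logs have already been downloaded to a local directory with one file per node, and that error lines contain common keywords. The function name `find_error_lines`, the `.log` suffix, the keyword pattern, and file names such as `worker-1.log` are all hypothetical; adjust them to your framework and log layout.

```python
import re
from pathlib import Path

# Assumed keywords only; extend this pattern for the errors you expect.
ERROR_PATTERN = re.compile(r"Traceback|Error|Exception|out of memory", re.IGNORECASE)


def find_error_lines(log_dir, pattern=ERROR_PATTERN, suffix=".log"):
    """Scan every log file in log_dir and collect the lines that match pattern.

    Returns a dict that maps each file name to a list of
    (line number, line text) tuples. Files without matches are omitted.
    """
    hits = {}
    for path in sorted(Path(log_dir).glob(f"*{suffix}")):
        matches = []
        # errors="replace" keeps the scan going if a log contains invalid bytes.
        with path.open(encoding="utf-8", errors="replace") as log_file:
            for number, line in enumerate(log_file, start=1):
                if pattern.search(line):
                    matches.append((number, line.rstrip("\n")))
        if matches:
            hits[path.name] = matches
    return hits
```

Calling `find_error_lines("logs")` on a directory of per-node log files returns only the files that contain suspicious lines, which shows which nodes to inspect first. A keyword match is only a starting point: read the surrounding lines in the original log to confirm the actual cause.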