Stopped Training Job Process
Symptom
The training job process is stopped and the logs are interrupted.
Possible Causes
- CPU soft lock
Decompressing a large number of files may cause a CPU soft lock and restart the node. To avoid this, pause the decompression at regular intervals by calling a sleep method. For example, pause for 1 second after every 10,000 files are decompressed.
- Storage limitation
Keep data disk usage within the limit of the resource specifications. For the data disk size of each specification, see What Are Sizes of the /cache Directories for Different Resource Specifications in the Training Environment?
- CPU overload
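The first two causes can be guarded against in the code that unpacks the dataset. The following is a minimal sketch, not an official implementation: it assumes the archive is a ZIP file and uses Python's standard zipfile module. The function names extract_with_pauses and has_free_space, and the parameter names, are hypothetical. The defaults mirror the example above (pause for 1 second after every 10,000 files).

```python
import shutil
import time
import zipfile


def has_free_space(path, required_bytes):
    """Return True if the filesystem holding `path` has at least `required_bytes` free."""
    return shutil.disk_usage(path).free >= required_bytes


def extract_with_pauses(archive_path, dest_dir, batch_size=10_000, pause_seconds=1.0):
    """Extract a ZIP archive into an existing directory, pausing between batches.

    Returns the number of archive members extracted.
    """
    extracted = 0
    with zipfile.ZipFile(archive_path) as archive:
        members = archive.infolist()
        # Sum of uncompressed sizes: a rough lower bound on the space needed.
        required = sum(info.file_size for info in members)
        if not has_free_space(dest_dir, required):
            raise OSError(f"Not enough free space in {dest_dir} for {required} bytes")
        for info in members:
            archive.extract(info, dest_dir)
            extracted += 1
            if extracted % batch_size == 0:
                # Give the CPU a break after each full batch.
                time.sleep(pause_seconds)
    return extracted
```

For other archive formats the same pattern applies: extract member by member, count, and sleep after each batch. The space check uses uncompressed sizes only, so treat it as an early warning rather than a guarantee.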
Troubleshooting
The error information indicates that the fault is caused by the user code.
You can use either of the following methods to locate the fault:
- Debug the code online (available only for non-distributed code).
  - Apply for a development environment (notebook) instance with the same specifications as the training job.
  - Debug the user code in the notebook and find the faulty code snippet.
  - Search for the key code snippet and the exit code in a search engine to find a solution.
- Locate the fault based on the training logs.
  - Identify the faulty code snippet based on the logs.
  - Add print statements around the faulty code snippet to obtain more detailed log information.
  - Run the training job again to locate the faulty code snippet.
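The log-based method can be sketched as follows. This is an illustrative example only: the helper run_logged and the logger name train_debug are hypothetical, and it relies solely on Python's standard logging module. It wraps a suspect step so that the log records when the step starts, when it finishes, and the full traceback if it raises. If the process stops partway, the last "start" line without a matching "finished" line points to the step to inspect.

```python
import logging
import sys

# Write to stdout so the messages appear in the training job log.
logging.basicConfig(
    stream=sys.stdout,
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)
logger = logging.getLogger("train_debug")  # hypothetical logger name


def run_logged(step_name, func, *args, **kwargs):
    """Run func(*args, **kwargs), logging its start, end, and any exception."""
    logger.info("start %s", step_name)
    try:
        result = func(*args, **kwargs)
    except Exception:
        # logger.exception records the message together with the traceback.
        logger.exception("step %s failed", step_name)
        raise
    logger.info("finished %s", step_name)
    return result
```

A call such as run_logged("load_data", load_data, data_dir) would replace a direct call to load_data(data_dir), where load_data and data_dir stand in for names in your own training script.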