Running a Job Failed Due to Persistently Rising Memory Usage
Updated on 2024-01-26 GMT+08:00
Symptom
A training job is in the Failed state.
Possible Causes
The memory usage of the training job keeps rising, which eventually causes the job to fail.
Solution
- Check the logs and monitoring data of the training job for out-of-memory (OOM) errors (see the log-scanning sketch after this list).
- Check whether the training script contains code that keeps consuming memory without releasing it (see the memory-growth sketch after this list).
  - If yes, optimize that code, then submit the training job again.
  - If no, either upgrade the resource specifications allocated to the training job or contact technical support.
- Restart the training job. Use CloudShell to log in to the training container and watch the memory metrics to check whether the memory usage spikes (see the monitoring sketch after this list).
  - If yes, check the training job logs generated around the time of the spike and rework the relevant code logic to reduce memory consumption.
  - If no, either upgrade the resource specifications allocated to the training job or contact technical support.
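The exact OOM message depends on the framework and runtime, but a quick way to confirm the symptom is to scan the training log for common signatures. The snippet below is a minimal sketch: the log path and the pattern list are assumptions to adapt to your own job, not fixed ModelArts values.

```python
import re
from pathlib import Path

# Hypothetical log path; replace with the log file your training job actually writes.
LOG_FILE = Path("/home/ma-user/modelarts/log/train.log")

# Assumed OOM signatures seen across frameworks and the Linux OOM killer.
OOM_PATTERNS = [
    r"CUDA out of memory",   # PyTorch GPU OOM
    r"Out of memory",        # generic / TensorFlow
    r"std::bad_alloc",       # C++ allocation failure
    r"Killed",               # process terminated by the kernel OOM killer
]

pattern = re.compile("|".join(OOM_PATTERNS), re.IGNORECASE)

# Print every log line that matches one of the signatures, with its line number.
for lineno, line in enumerate(LOG_FILE.read_text(errors="ignore").splitlines(), 1):
    if pattern.search(line):
        print(f"{lineno}: {line.strip()}")
```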
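A common source of steadily rising memory is a training loop that keeps references to per-step objects so they can never be freed. The sketch below assumes a PyTorch job with placeholder names (model, loader, optimizer, loss_fn) and contrasts a leaky pattern with a fix; other frameworks have analogous pitfalls.

```python
def train_epoch(model, loader, optimizer, loss_fn):
    """One epoch of a typical PyTorch training loop (placeholder names)."""
    step_losses = []
    for inputs, targets in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        optimizer.step()

        # Leaky pattern: appending the loss tensor keeps the whole autograd
        # graph of this step alive, so memory grows with every iteration.
        #   step_losses.append(loss)

        # Fix: store a plain Python float so the graph can be released.
        step_losses.append(loss.item())
    return sum(step_losses) / len(step_losses)
```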
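Once logged in to the container through CloudShell, memory can be sampled periodically so that a spike can be matched against the training log timestamps. The loop below is a sketch that assumes the psutil package is available in the container (install it if it is not); TRAIN_CMD_HINT is a hypothetical placeholder for a substring of your training command.

```python
import time

import psutil  # assumed to be installed in the container

TRAIN_CMD_HINT = "train.py"  # placeholder: substring of the training command line

# Print container-wide and per-process memory every 10 seconds (stop with Ctrl+C).
while True:
    vm = psutil.virtual_memory()
    print(f"total={vm.total >> 20} MiB used={vm.used >> 20} MiB ({vm.percent}%)")
    for proc in psutil.process_iter(["pid", "cmdline", "memory_info"]):
        cmdline = " ".join(proc.info["cmdline"] or [])
        mem = proc.info["memory_info"]
        if TRAIN_CMD_HINT in cmdline and mem is not None:
            print(f"  pid={proc.info['pid']} rss={mem.rss >> 20} MiB")
    time.sleep(10)
```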