Training Job Failed Due to OOM
Updated on 2024-04-30 GMT+08:00
Symptom
If a training job failed due to out of memory (OOM), the possible symptoms are as follows (a sketch for confirming GPU memory pressure is provided after the list):
- Error code 137 is returned.
- The log file contains error information with the keyword "killed".
Figure 1 Error log
- Error message "RuntimeError: CUDA out of memory." is displayed in logs.
Figure 2 Error log
- Error message "Dst tensor is not initialized" is displayed in TensorFlow logs.
Possible Causes
The possible causes are as follows:
- GPU memory is insufficient.
- OOM occurred on certain nodes. This issue is typically caused by a node fault.
Solution
- Modify hyperparameter settings and release unnecessary tensors:
  - Modify network parameters, such as batch_size, hide_layer, and cell_nums.
  - Release unnecessary tensors, for example (see the sketch after this list):
    del tmp_tensor
    torch.cuda.empty_cache()
- Use a local PyCharm instance to remotely access the notebook for debugging.
- If the fault persists, submit a service ticket to locate the fault or, if necessary, isolate the affected node.
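The following is a minimal sketch of the first two adjustments, assuming a PyTorch training script; the model, dataset, and batch size are illustrative placeholders rather than values from the failed job.
    import torch
    from torch import nn
    from torch.utils.data import DataLoader, TensorDataset

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = nn.Linear(1024, 10).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    criterion = nn.CrossEntropyLoss()

    dataset = TensorDataset(torch.randn(4096, 1024), torch.randint(0, 10, (4096,)))
    # Reducing batch_size is usually the quickest way to lower peak GPU memory usage.
    loader = DataLoader(dataset, batch_size=16)

    for inputs, labels in loader:
        inputs, labels = inputs.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(inputs), labels)
        loss.backward()
        optimizer.step()
        # Drop references to tensors that are no longer needed, then return cached
        # blocks to the GPU (the del/empty_cache step described above).
        del inputs, labels, loss
        torch.cuda.empty_cache()
Note that calling torch.cuda.empty_cache() on every iteration adds overhead; in practice it is usually sufficient to release tensors and empty the cache only when memory pressure is observed.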
Summary and Suggestions
Before creating a training job, use the ModelArts development environment to debug the training code and eliminate as many code migration errors as possible.
- Use the online notebook environment for debugging. For details, see Using JupyterLab to Develop a Model.
- Use the local IDE (PyCharm or VS Code) to access the cloud environment for debugging. For details, see Using the Local IDE to Develop a Model.
Parent topic: Hard Faults Due to Space Limit