
Training Job Failed Due to OOM

Updated on 2024-06-11 GMT+08:00

Symptom

If a training job fails due to an out-of-memory (OOM) error, the possible symptoms are as follows:
  1. Error code 137 is returned.
  2. The log file contains error information with the keyword "killed".
  3. Error message "RuntimeError: CUDA out of memory." is displayed in logs.
  4. Error message "Dst tensor is not initialized" is displayed in TensorFlow logs.
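
If the job uses PyTorch, logging GPU memory statistics around the failing step can help confirm that memory pressure is the cause. The helper below is a minimal sketch assuming a CUDA device is available; the name log_gpu_memory is illustrative and not part of ModelArts or PyTorch.

    import torch

    def log_gpu_memory(device=0):
        # Report current, cached, and peak GPU memory usage for one device, in MiB.
        allocated = torch.cuda.memory_allocated(device) / 1024 ** 2
        reserved = torch.cuda.memory_reserved(device) / 1024 ** 2
        peak = torch.cuda.max_memory_allocated(device) / 1024 ** 2
        print(f"allocated={allocated:.1f} MiB, reserved={reserved:.1f} MiB, peak={peak:.1f} MiB")

Calling this helper once per epoch, or just before the step that fails, makes a steadily growing allocated value easy to spot in the training log.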

Possible Causes

The possible causes are as follows:

  • GPU memory is insufficient.
  • OOM occurred on certain nodes. This issue is typically caused by a node fault.

Solution

  1. Adjust hyperparameters and release unnecessary tensors to reduce GPU memory usage (see the sketch after this list).
    1. Modify network parameters, such as batch_size, hide_layer, and cell_nums.
    2. Release unnecessary tensors.
      # Drop the reference to a tensor that is no longer needed, then return
      # the cached memory to the GPU so other allocations can reuse it.
      del tmp_tensor
      torch.cuda.empty_cache()
  2. Use a local PyCharm instance to remotely access the notebook environment for debugging.
  3. If the fault persists, submit a service ticket to have the fault located and, if necessary, the affected node isolated.
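
The following minimal sketch illustrates step 1 for a PyTorch training job. The model, dataset, and the batch_size value of 32 are placeholders chosen for illustration; only the smaller batch_size, the del statement, and torch.cuda.empty_cache() correspond to the steps above.

    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader, TensorDataset

    # Placeholder model and data; replace with your own network and dataset.
    model = nn.Linear(1024, 10).cuda()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()
    dataset = TensorDataset(torch.randn(4096, 1024), torch.randint(0, 10, (4096,)))

    # Step 1.1: a smaller batch_size lowers the peak GPU memory used per step.
    loader = DataLoader(dataset, batch_size=32, shuffle=True)

    for inputs, labels in loader:
        inputs, labels = inputs.cuda(), labels.cuda()
        optimizer.zero_grad()
        logits = model(inputs)
        loss = loss_fn(logits, labels)
        loss.backward()
        optimizer.step()

        # Step 1.2: drop references to intermediate tensors and return cached
        # memory to the device once they are no longer needed.
        del logits, loss
        torch.cuda.empty_cache()

Note that calling torch.cuda.empty_cache() on every iteration can slow training; it is usually sufficient to call it at points where large temporary tensors have just been released.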

Summary and Suggestions

Before creating a training job, debug the training code in the ModelArts development environment to eliminate as many code migration errors as possible.
