Updated on 2024-04-30 GMT+08:00

Training Job Failed Due to OOM

Symptom

If a training job failed due to out of memory (OOM), the possible symptoms are as follows:
  1. Error code 137 is returned. This exit code means the process was terminated by SIGKILL, which is typically sent by the operating system's OOM killer.
  2. The log file contains error information with the keyword killed.
  3. Error message "RuntimeError: CUDA out of memory." is displayed in the logs (see the sketch after this list).
  4. Error message "Dst tensor is not initialized" is displayed in TensorFlow logs.
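
The sketch below shows how the CUDA OOM error in symptom 3 typically surfaces in PyTorch and how it can be caught so that the log records the cause clearly. The model and batch objects are placeholders supplied by your own training code, not ModelArts APIs.

  import torch

  def run_step(model, batch, device="cuda"):
      # model and batch are placeholder objects from your own training code.
      try:
          loss = model(batch.to(device)).mean()  # forward pass that may exhaust GPU memory
          loss.backward()
          return loss.item()
      except RuntimeError as err:
          # PyTorch includes this phrase in the message when GPU memory runs out.
          if "out of memory" in str(err):
              print("CUDA OOM detected: reduce batch_size or release unused tensors")
              torch.cuda.empty_cache()  # free cached blocks before aborting or retrying
          raise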

Possible Causes

The possible causes are as follows:

  • GPU memory is insufficient.
  • OOM occurred on certain nodes. This issue is typically caused by a fault on those nodes.

Solution

  1. Reduce GPU memory usage by adjusting hyperparameters and releasing unnecessary tensors (see the sketch after this list).
    1. Reduce memory-related network parameters, such as batch_size, hide_layer, and cell_nums.
    2. Release unnecessary tensors, for example:
      del tmp_tensor            # drop the last Python reference so the tensor can be freed
      torch.cuda.empty_cache()  # release cached GPU memory held by the allocator
  2. Use a local PyCharm instance to remotely access the notebook for debugging.
  3. If the fault persists, submit a service ticket to locate the fault or, if necessary, isolate the affected node.
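
The following sketch illustrates step 1 in PyTorch: the batch size is reduced when building the DataLoader, and intermediate tensors are released after each step. The names BATCH_SIZE, model, dataset, and optimizer are placeholders for this sketch, not ModelArts APIs.

  import torch
  from torch.utils.data import DataLoader

  BATCH_SIZE = 16  # placeholder value; halve it if CUDA OOM persists

  def train_one_epoch(model, dataset, optimizer, device="cuda"):
      loader = DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=True)
      model.to(device)
      for inputs, targets in loader:
          inputs, targets = inputs.to(device), targets.to(device)
          optimizer.zero_grad()
          loss = torch.nn.functional.cross_entropy(model(inputs), targets)
          loss.backward()
          optimizer.step()
          # Drop references to large intermediate tensors before the next batch.
          del inputs, targets, loss
      torch.cuda.empty_cache()  # release cached GPU memory held by the allocator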

Summary and Suggestions

Before creating a training job, debug the training code in the ModelArts development environment to eliminate as many code migration errors as possible.