Training Job Failed Due to OOM

Symptom

If a training job fails due to out-of-memory (OOM) errors, the possible symptoms are as follows:
  1. Error code 137 is returned. (Exit code 137 is 128 + 9, meaning the process was terminated by SIGKILL, which the Linux OOM killer sends when a process exhausts memory.)
  2. The log file contains error information with the keyword "killed".
    Figure 1 Error log
  3. Error message "RuntimeError: CUDA out of memory." is displayed in logs.
    Figure 2 Error log
  4. Error message "Dst tensor is not initialized" is displayed in TensorFlow logs.
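
For reference, symptom 3 is the exception PyTorch raises when a CUDA allocation fails. The following minimal sketch reproduces it on a GPU machine; the allocation size is arbitrary and only needs to exceed the available GPU memory.

  import torch

  # Deliberately request far more GPU memory than is available (about 4 TiB
  # of float32 here) to trigger the allocation failure from symptom 3.
  try:
      x = torch.empty((1 << 40,), device="cuda")
  except RuntimeError as e:
      print(e)  # "CUDA out of memory. Tried to allocate ..."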

Possible Causes

The possible causes are as follows:

  • GPU memory is insufficient.
  • OOM occurs only on certain nodes, which typically indicates a fault on those nodes.

Solution

  1. Modify hyperparameter settings to reduce GPU memory usage and release unnecessary tensors (see the sketch after this list).
    1. Adjust network parameters, such as batch_size, hide_layer, and cell_nums.
    2. Release tensors that are no longer needed. For example:
      import torch  # needed if not already imported

      del tmp_tensor  # drop the last reference so the memory can be freed
      torch.cuda.empty_cache()  # return cached blocks to the GPU driver
  2. Use PyCharm on your local PC to remotely access the notebook for debugging.
  3. If the fault persists, submit a service ticket to have the fault located and, if necessary, the affected node isolated.
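
The following is a minimal sketch of how the tensor-release logic in step 1 can be applied inside a PyTorch training loop. The names model, loader, optimizer, and loss_fn are hypothetical placeholders, and skipping a batch on OOM is only one possible recovery policy; lowering batch_size is usually the more reliable fix.

  import torch

  def train_one_epoch(model, loader, optimizer, loss_fn, device="cuda"):
      for inputs, targets in loader:
          inputs, targets = inputs.to(device), targets.to(device)
          optimizer.zero_grad()
          try:
              loss = loss_fn(model(inputs), targets)
              loss.backward()
              optimizer.step()
          except RuntimeError as e:
              # Re-raise anything that is not a CUDA OOM error.
              if "out of memory" not in str(e):
                  raise
              # Drop references to this batch's tensors, then return cached
              # blocks to the GPU so later batches can allocate them.
              del inputs, targets
              torch.cuda.empty_cache()
              print("CUDA OOM on this batch; consider lowering batch_size")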

Summary and Suggestions

Before creating a training job, debug the training code in the ModelArts development environment so that most errors are eliminated before the code is migrated to a training job.