Updated on 2024-12-30 GMT+08:00

Error Message "cuda runtime error (10) : invalid device ordinal at xxx" Displayed in Logs

Symptom

A training job fails, and the logs contain the error message "cuda runtime error (10) : invalid device ordinal at xxx".

Figure 1 Error log

Possible Causes

The possible causes are as follows:

  • The device ID used in the code does not match the GPUs that CUDA_VISIBLE_DEVICES makes available for the job specifications. For instance, if the selected specifications provide four GPUs (IDs 0, 1, 2, and 3) but the code calls tensor.to(device="cuda:7"), the operation targets GPU 7, which is outside the range of available GPU IDs.
  • A damaged GPU on a resource node can cause fewer GPUs to be detected than the selected specifications provide.
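
The first cause can be illustrated with a minimal sketch that uses only the Python standard library. The helper names visible_gpu_count and is_valid_ordinal are hypothetical, and the parsing is a simplification that assumes every entry in CUDA_VISIBLE_DEVICES refers to a working GPU. It relies on the CUDA behavior that a process numbers its visible GPUs from 0.

```python
import os


def visible_gpu_count(env_value):
    """Count the entries in a CUDA_VISIBLE_DEVICES-style string.

    Simplification (assumption): every entry refers to a valid GPU.
    Returns None when the variable is unset, in which case all GPUs
    on the node are visible and this string cannot tell the count.
    """
    if env_value is None:
        return None
    entries = [e.strip() for e in env_value.split(",") if e.strip()]
    return len(entries)


def is_valid_ordinal(ordinal, gpu_count):
    """Inside the process, visible GPUs are numbered 0 .. gpu_count - 1."""
    return 0 <= ordinal < gpu_count


# Example with the four-GPU specification described above.
count = visible_gpu_count("0,1,2,3")
print(count)                       # 4
print(is_valid_ordinal(3, count))  # True
print(is_valid_ordinal(7, count))  # False: "cuda:7" is an invalid device ordinal

# In a real job, the value would typically come from the environment:
current = os.environ.get("CUDA_VISIBLE_DEVICES")
```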

Solution

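  1. Perform CUDA operations only on GPUs that CUDA_VISIBLE_DEVICES makes visible. Within the process, visible GPUs are numbered from 0, so a valid device ID ranges from 0 to the number of visible GPUs minus 1.
  2. If a GPU on a resource node is damaged, contact technical support.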
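
Both steps can be turned into a startup check, sketched below under stated assumptions. The function names (detected_gpu_count, check_gpu_spec, checked_device) are hypothetical and not part of ModelArts or PyTorch. The only PyTorch call used is torch.cuda.device_count(), and the expected GPU count is assumed to be known from the job specifications you selected.

```python
def detected_gpu_count():
    """Return the number of GPUs PyTorch can see, or 0 if PyTorch is not installed."""
    try:
        import torch
    except ImportError:
        return 0
    return torch.cuda.device_count()


def check_gpu_spec(detected, expected):
    """Raise if fewer GPUs are detected than the job specifications provide."""
    if detected < expected:
        raise RuntimeError(
            f"Expected {expected} GPUs but detected {detected}; "
            "a GPU on the node may be damaged."
        )


def checked_device(index, detected):
    """Return a device string only if the ordinal is valid for this process."""
    if detected == 0:
        raise ValueError("No GPUs are visible to this process.")
    if not 0 <= index < detected:
        raise ValueError(
            f"Invalid device ordinal {index}: "
            f"valid ordinals are 0 to {detected - 1}."
        )
    return f"cuda:{index}"


# Demonstration with fixed numbers (a four-GPU specification).
check_gpu_spec(detected=4, expected=4)  # passes silently
print(checked_device(3, 4))             # cuda:3
```

In a training script, you might call check_gpu_spec(detected_gpu_count(), 4) once at startup and then pass checked_device(i, detected_gpu_count()) to tensor.to(...), so that a wrong ID or a missing GPU fails early with a readable message.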

Summary and Suggestions

Debug your training code in the ModelArts development environment before creating a job.