Updated on 2025-08-22 GMT+08:00

Error Message "cuda runtime error (10) : invalid device ordinal at xxx" Is Displayed in Logs

Symptom

A training job fails, and the following error is printed in the logs:

RuntimeError: cuda runtime error (10) : invalid device ordinal at xxx
Figure 1 Error log

Possible Causes

The issue may arise due to the following reasons:

  • The CUDA_VISIBLE_DEVICES setting does not align with the job specifications. For instance, if you select specifications with four GPUs (IDs 0, 1, 2, and 3) but the code calls tensor.to(device="cuda:7"), the operation targets GPU 7, which is outside the range of available GPU IDs.
  • Damaged GPUs on a resource node can cause fewer GPUs to be detected than the selected specifications provide.
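To tell the two causes apart, it can help to log what the training process actually sees at startup. The following is a minimal sketch, assuming a PyTorch training script; the helper names is_valid_ordinal and describe_mismatch and the expected count of 4 are hypothetical and only mirror the four-GPU example above.

```python
import os


def is_valid_ordinal(ordinal: int, device_count: int) -> bool:
    """Return True if `ordinal` names one of the `device_count` visible GPUs."""
    return 0 <= ordinal < device_count


def describe_mismatch(expected: int, detected: int) -> str:
    """Summarize how the detected GPU count compares with the job specifications."""
    if detected < expected:
        return f"only {detected} of {expected} GPUs detected"
    return f"{detected} GPUs detected (expected {expected})"


try:
    import torch  # optional import so the helpers above also work without PyTorch
    detected = torch.cuda.device_count()
except ImportError:
    detected = 0  # assumption: treat a missing PyTorch install as "no GPUs visible"

# Hypothetical values: 4 GPUs expected, ordinal 7 requested by the code.
print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES", "<unset>"))
print(describe_mismatch(expected=4, detected=detected))
print("cuda:7 valid?", is_valid_ordinal(7, detected))
```

If the detected count matches the specifications but the requested ordinal is invalid, the device index in the code is the likely cause. If fewer GPUs are detected than were selected, the node itself is the more likely suspect.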

Solution

  1. Perform CUDA operations only on GPUs whose IDs are specified by CUDA_VISIBLE_DEVICES.
  2. If a GPU on a resource node is damaged, contact technical support.
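One way to apply the first step is to validate a device index before using it, so that a bad index fails with a readable message instead of the CUDA runtime error. Note that inside a process, CUDA renumbers the visible GPUs starting from 0, so with CUDA_VISIBLE_DEVICES=4,5,6,7 the code addresses them as cuda:0 to cuda:3. The sketch below assumes PyTorch; resolve_device is a hypothetical helper, not part of PyTorch or ModelArts.

```python
def resolve_device(requested: int, device_count: int) -> str:
    """Return 'cuda:<requested>' if that ordinal exists, otherwise raise a clear error."""
    if device_count <= 0:
        raise RuntimeError("no CUDA devices are visible to this process")
    if not 0 <= requested < device_count:
        raise ValueError(
            f"cuda:{requested} is out of range; "
            f"valid ordinals are 0 to {device_count - 1}"
        )
    return f"cuda:{requested}"


try:
    import torch  # optional import so the helper above can be reused without PyTorch

    count = torch.cuda.device_count()
    if count > 0:
        # Ordinal 0 always exists when at least one GPU is visible.
        device = torch.device(resolve_device(0, count))
        x = torch.zeros(1).to(device)
        print("tensor placed on", x.device)
    else:
        print("no GPUs visible; skipping the CUDA example")
except ImportError:
    print("PyTorch is not installed; skipping the CUDA example")
```

Raising an error rather than silently falling back to another GPU keeps a misconfigured job from running on the wrong device without anyone noticing.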

Summary and Suggestions

Before creating a training job, use the ModelArts development environment to debug your training code and minimize migration errors.