Updated on 2024-06-11 GMT+08:00

Error Message "cuda runtime error (10) : invalid device ordinal at xxx" Displayed in Logs

Symptom

A training job failed, and the following error is displayed in the logs.

Figure 1 Error log

Possible Causes

The possible causes are as follows:

The CUDA_VISIBLE_DEVICES setting does not match the job specifications. For example, you select specifications with four GPUs, so the available GPU IDs are 0, 1, 2, and 3. However, a CUDA operation such as tensor.to(device="cuda:7") places tensors on GPU 7, which is outside the range of available GPU IDs.
A GPU on the resource node is damaged. In this case, the number of GPUs that can be detected is less than the number in the selected specifications, so a CUDA operation on a GPU with a specified ID fails because that ID no longer exists.
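
To tell the two causes apart, it can help to log how many GPUs the job's environment actually lists before any CUDA call runs. The following is a minimal sketch that uses only the Python standard library; the function name count_visible_devices is hypothetical, and it only counts the entries in CUDA_VISIBLE_DEVICES without checking whether each GPU is healthy.

```python
import os


def count_visible_devices(env=None):
    """Count the entries listed in CUDA_VISIBLE_DEVICES.

    Returns None when the variable is unset, in which case the CUDA
    runtime exposes every GPU it can detect on the node.
    """
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None
    # Assumption: entries are comma-separated; blank entries are ignored.
    return len([entry for entry in raw.split(",") if entry.strip()])


if __name__ == "__main__":
    count = count_visible_devices()
    if count is None:
        print("CUDA_VISIBLE_DEVICES is not set")
    else:
        # Inside the process, valid device IDs run from 0 to count - 1.
        print(f"CUDA_VISIBLE_DEVICES lists {count} GPU(s)")
```

If the count matches the selected specifications but the framework reports fewer GPUs, a damaged GPU is the more likely cause.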

Solution

Perform CUDA operations only on the GPUs with IDs specified by CUDA_VISIBLE_DEVICES. For example, in a job with four GPUs, use only GPU IDs 0 to 3.
If a GPU on a resource node is damaged, contact technical support.
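
One way to avoid a hard-coded GPU ID is to validate the requested ID against the number of GPUs that are actually available before moving tensors. The sketch below illustrates the idea; resolve_device is a hypothetical helper, not part of ModelArts or PyTorch, and the available count is passed in as a plain integer so the helper itself has no framework dependency.

```python
def resolve_device(requested_index, available_count):
    """Return a device string for requested_index, or raise if it is out of range.

    available_count is the number of GPUs the process can use. With PyTorch,
    this would typically come from torch.cuda.device_count().
    """
    if available_count <= 0:
        # No usable GPU: fall back to the CPU instead of failing in CUDA.
        return "cpu"
    if 0 <= requested_index < available_count:
        return f"cuda:{requested_index}"
    raise ValueError(
        f"GPU ID {requested_index} is invalid: only IDs 0 to "
        f"{available_count - 1} are available"
    )


if __name__ == "__main__":
    # A job with four GPUs: ID 3 is valid, ID 7 is not.
    print(resolve_device(3, 4))
    try:
        resolve_device(7, 4)
    except ValueError as err:
        print(err)
```

In training code, the returned string could then be passed to a call such as tensor.to(device=...), so that an out-of-range ID fails early with a readable message instead of a CUDA runtime error.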

Summary and Suggestions

Before creating a training job, use the ModelArts development environment to debug the training code so that as many code migration errors as possible are eliminated.
