Error Message "cuda runtime error (10) : invalid device ordinal at xxx" Displayed in Logs
Updated on 2024-12-30 GMT+08:00
Symptom
The training job fails, and the following error is displayed in the logs.
Figure 1 Error log
Possible Causes
The possible causes are as follows:
- The device ID used in the code does not match the GPUs exposed by CUDA_VISIBLE_DEVICES for the job specifications. For instance, if the job specifications provide four GPUs (IDs 0, 1, 2, and 3) but the code calls tensor.to(device="cuda:7"), the operation targets GPU 7, which is outside the range of available GPU IDs.
- A GPU on the resource node is damaged, so fewer GPUs are detected than the selected specifications provide.
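To tell the two causes apart, it can help to log what the process actually sees at startup. The following is a minimal sketch, not a ModelArts API: describe_gpu_visibility is a hypothetical helper, and the detected count is passed in as a plain number. In PyTorch code (assumed here because the example above uses tensor.to), that number would typically come from torch.cuda.device_count().

```python
import os


def describe_gpu_visibility(expected_gpus, detected_gpus, env=None):
    """Build one log line comparing the expected and detected GPU counts.

    expected_gpus: number of GPUs in the selected job specifications.
    detected_gpus: number of GPUs the framework reports (assumption: in
                   PyTorch this would be torch.cuda.device_count()).
    env:           mapping to read CUDA_VISIBLE_DEVICES from; defaults to
                   the process environment.
    """
    env = os.environ if env is None else env
    visible = env.get("CUDA_VISIBLE_DEVICES", "<unset>")
    if detected_gpus < expected_gpus:
        status = "fewer GPUs detected than expected"
    else:
        status = "ok"
    return (
        f"CUDA_VISIBLE_DEVICES={visible} "
        f"expected={expected_gpus} detected={detected_gpus} status={status}"
    )


# Example with made-up values: four GPUs selected, only three detected.
print(describe_gpu_visibility(4, 3, {"CUDA_VISIBLE_DEVICES": "0,1,2,3"}))
```

If the status reports fewer GPUs than expected, the second cause is the more likely one. If the counts match, check the device IDs used in the code.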
Solution
- Perform CUDA operations only on the GPUs made visible by CUDA_VISIBLE_DEVICES. Inside the process, valid device IDs range from 0 to the number of visible GPUs minus 1.
- If a GPU on a resource node is damaged, contact technical support.
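One way to catch an invalid device ID before it reaches CUDA is to validate the index against the number of visible GPUs. The sketch below is illustrative only: resolve_device is a hypothetical helper, and visible_count is assumed to be supplied by the caller, for example from torch.cuda.device_count() in PyTorch code.

```python
def resolve_device(requested_index, visible_count):
    """Return a device string such as "cuda:2" if the index is valid.

    requested_index: GPU index the training code wants to use.
    visible_count:   number of GPUs visible to this process (assumption:
                     obtained from the framework by the caller).
    """
    if visible_count <= 0:
        raise RuntimeError("no GPU is visible to this process")
    if not 0 <= requested_index < visible_count:
        raise ValueError(
            f"cuda:{requested_index} is out of range; "
            f"valid devices are cuda:0 to cuda:{visible_count - 1}"
        )
    return f"cuda:{requested_index}"


# Example with made-up values: a job with four visible GPUs.
print(resolve_device(3, 4))
try:
    resolve_device(7, 4)
except ValueError as err:
    print(err)
```

The returned string can then be passed to calls such as tensor.to(device=...), so a wrong index fails with a readable message instead of the CUDA runtime error.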
Summary and Suggestions
Debug your training code in the ModelArts development environment before creating a job.
- Use the online notebook environment. For details, see JupyterLab Overview and Common Operations.
- Use a local IDE (PyCharm or VS Code) to access the cloud environment. For details, see Operation Process in a Local IDE.