Error Message "cuda runtime error (10) : invalid device ordinal at xxx" Displayed in Logs
Symptom
A training job failed, and the following error is displayed in the logs:
Figure 1 Error log: cuda runtime error (10) : invalid device ordinal at xxx
Possible Causes
The possible causes are as follows:
- The CUDA_VISIBLE_DEVICES setting does not comply with the job specifications. For example, you select a flavor with four GPUs, so the available GPU IDs are 0, 1, 2, and 3. However, a CUDA operation such as tensor.to(device="cuda:7") places a tensor on GPU 7, which is beyond the available GPU IDs (see the sketch after this list).
- A GPU on the resource node is faulty, so fewer GPUs are detected than the selected flavor provides, and a CUDA operation that targets a specific GPU ID therefore fails.
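The sketch below is a minimal PyTorch example of the first cause. It assumes a four-GPU flavor where CUDA_VISIBLE_DEVICES exposes GPUs 0, 1, 2, and 3; the specific index 7 is only an illustration of an out-of-range ID.

```python
import torch

# Assumption: the job flavor provides 4 GPUs, so CUDA_VISIBLE_DEVICES="0,1,2,3"
# and the valid device indices inside the process are 0 to 3.
print(torch.cuda.device_count())   # prints 4 on such a node

x = torch.randn(2, 2)
x = x.to(device="cuda:7")          # index 7 does not exist -> "invalid device ordinal"
```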
Solution
- Perform CUDA operations only on the GPUs whose IDs are exposed through CUDA_VISIBLE_DEVICES (see the device-selection sketch after this list).
- If a GPU on a resource node is damaged, contact technical support.
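The following sketch shows one defensive way to choose a device so that training code never targets an ID outside the range exposed by CUDA_VISIBLE_DEVICES. pick_device is a hypothetical helper written for illustration, not a ModelArts or PyTorch API.

```python
import torch

def pick_device(requested_index: int = 0) -> torch.device:
    """Hypothetical helper: return a usable device, never an out-of-range GPU ID.

    Indices refer to the GPUs exposed through CUDA_VISIBLE_DEVICES, so valid
    values range from 0 to torch.cuda.device_count() - 1.
    """
    if not torch.cuda.is_available():
        return torch.device("cpu")
    if requested_index >= torch.cuda.device_count():
        # Fall back to GPU 0 instead of triggering "invalid device ordinal".
        requested_index = 0
    return torch.device(f"cuda:{requested_index}")

device = pick_device(7)                  # on a 4-GPU node this falls back to cuda:0
model = torch.nn.Linear(8, 2).to(device)
```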
Summary and Suggestions
Before creating a training job, debug the training code in the ModelArts development environment so that most code migration errors are caught in advance.
- Use the online notebook environment for debugging. For details, see JupyterLab Overview and Common Operations.
- Use a local IDE (PyCharm or VS Code) to access the cloud environment for debugging. For details, see Operation Process in a Local IDE.