Reinstalled CUDA Version Does Not Match the One in the Target Image
Updated on 2024-04-11 GMT+08:00
Symptom
After the engine is reinstalled, or a new CUDA package is compiled on top of the existing image, one of the following errors is reported:
1. "RuntimeError: cuda runtime error (11) : invalid argument at /pytorch/aten/src/THC/THCCachingHostAllocator.cpp:278"
2. "libcudart.so.9.0: cannot open shared object file: No such file or directory"
3. "Make sure the device specification refers to a valid device. The requested device appears to be a GPU, but CUDA is not enabled"
Possible Causes
The newly installed package was built for a CUDA version that does not match the CUDA version in the image.
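For the second error in particular, one way to confirm the mismatch is to check whether the CUDA runtime library named in the message can be opened inside the image. The following is a minimal sketch, assuming Python is available in the image. The helper name can_load is hypothetical, and libcudart.so.9.0 is simply the name taken from the error message above, so substitute the one you actually see.

```python
import ctypes


def can_load(library_name: str) -> bool:
    """Return True if the dynamic loader can open the named shared library."""
    try:
        ctypes.CDLL(library_name)
        return True
    except OSError:
        # Raised when the library (or one of its dependencies) cannot be found.
        return False


if __name__ == "__main__":
    # Library name copied from the error message; replace with the one you see.
    name = "libcudart.so.9.0"
    print(name, "loadable:", can_load(name))
```

If the library is reported as not loadable, the image most likely ships a different CUDA runtime from the one the package was built against.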
Solution
Use PyCharm on your local PC to connect remotely to a notebook instance, then debug and reinstall the package there.
- Remotely log in to the notebook instance created from the selected image and run nvcc -V to obtain the CUDA version of the image.
- Reinstall PyTorch, choosing a build whose CUDA version matches the one obtained in the previous step.
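The two steps above can be checked with a short script. This is a minimal sketch, not an official ModelArts tool. The function names parse_nvcc_release and versions_match are hypothetical, and the strict major.minor comparison is an assumption about what "matches" means here. It reads the CUDA release from nvcc -V and compares it with torch.version.cuda, which is the CUDA version the installed PyTorch build was compiled with.

```python
import re
import shutil
import subprocess
from typing import Optional


def parse_nvcc_release(nvcc_output: str) -> Optional[str]:
    """Extract the 'major.minor' CUDA release from the output of `nvcc -V`."""
    match = re.search(r"release\s+(\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


def versions_match(image_cuda: Optional[str], torch_cuda: Optional[str]) -> bool:
    """Return True only when both versions are known and major.minor agree."""
    if not image_cuda or not torch_cuda:
        return False
    return image_cuda.split(".")[:2] == torch_cuda.split(".")[:2]


if __name__ == "__main__":
    image_cuda = None
    if shutil.which("nvcc"):
        result = subprocess.run(["nvcc", "-V"], capture_output=True, text=True)
        image_cuda = parse_nvcc_release(result.stdout)

    try:
        import torch
        torch_cuda = torch.version.cuda  # None for CPU-only builds
    except ImportError:
        torch_cuda = None

    print("CUDA in image (nvcc):", image_cuda)
    print("CUDA PyTorch was built with:", torch_cuda)
    print("Match:", versions_match(image_cuda, torch_cuda))
```

Run the script in the notebook instance after reinstalling. If the last line prints False, install a PyTorch build for the CUDA version that nvcc reports.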
Summary and Suggestions
Before creating a training job, debug the training code in the ModelArts development environment to eliminate as many code migration errors as possible.
- Use the online notebook environment for debugging. For details, see JupyterLab Overview and Common Operations.
- Use a local IDE (PyCharm or VS Code) to access the cloud environment for debugging. For details, see Operation Process in a Local IDE.