Reinstalled CUDA Version Does Not Match the One in the Target Image

Symptom

After the AI engine is reinstalled in an existing image, or a package that depends on CUDA is compiled on top of that image, one of the following errors is reported:

1. "RuntimeError: cuda runtime error (11) : invalid argument at /pytorch/aten/src/THC/THCCachingHostAllocator.cpp:278"
2. "libcudart.so.9.0: cannot open shared object file: No such file or directory"
3. "Make sure the device specification refers to a valid device. The requested device appears to be a GPU, but CUDA is not enabled"

Possible Causes

The possible cause is as follows:

The CUDA version of the newly installed package does not match the CUDA version in the image.
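
As an illustration of how the mismatch shows up, the second error above names the CUDA runtime library the package was linked against. The following minimal sketch extracts that version suffix from a log line. The function name required_cudart_version is hypothetical, and the sketch assumes the log contains the library name in the form libcudart.so.X.Y. For newer CUDA toolkits the suffix may carry only the major version.

```python
import re


def required_cudart_version(log_line):
    """Return the version suffix of the libcudart.so named in a log line, or None."""
    # Matches, for example, "libcudart.so.9.0" and captures "9.0".
    match = re.search(r"libcudart\.so\.(\d+(?:\.\d+)*)", log_line)
    return match.group(1) if match else None


# Sample line modeled on the error shown in the Symptom section.
sample = "libcudart.so.9.0: cannot open shared object file: No such file or directory"
print(required_cudart_version(sample))
```

If the printed version differs from the CUDA version installed in the image, the package was built for a different CUDA release.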

Solution

Use PyCharm on your local PC to remotely access the notebook instance, then check the CUDA version and reinstall the package:

1. Remotely log in to the selected image and run nvcc -V to obtain the CUDA version of the image.
2. Reinstall PyTorch. Ensure that the PyTorch build targets the CUDA version obtained in the previous step.
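
The check in the steps above can be sketched in Python as follows. This is a rough illustration, not an official ModelArts tool. It assumes that nvcc is on the PATH inside the image and that its output contains a string such as "release 10.2". The helper names (parse_nvcc_release, image_cuda_version, torch_cuda_version, versions_match) are hypothetical. torch.version.cuda is the CUDA version that the installed PyTorch build was compiled against.

```python
import re
import subprocess


def parse_nvcc_release(nvcc_output):
    """Extract the 'X.Y' release number from the text printed by nvcc -V."""
    match = re.search(r"release\s+(\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


def image_cuda_version():
    """Run nvcc -V in the current environment; return None if it is unavailable."""
    try:
        result = subprocess.run(
            ["nvcc", "-V"], capture_output=True, text=True, check=True
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return parse_nvcc_release(result.stdout)


def torch_cuda_version():
    """Return the CUDA version PyTorch was built for, or None if unavailable."""
    try:
        import torch  # Imported lazily so the script still runs without PyTorch.
    except (ImportError, OSError):
        return None
    return torch.version.cuda  # None for CPU-only builds.


def versions_match(image_version, torch_version):
    """Compare the major.minor parts of two version strings."""
    if image_version is None or torch_version is None:
        return False
    return image_version.split(".")[:2] == torch_version.split(".")[:2]


if __name__ == "__main__":
    image_version = image_cuda_version()
    torch_version = torch_cuda_version()
    print("CUDA version in image:", image_version)
    print("CUDA version PyTorch was built for:", torch_version)
    print("Match:", versions_match(image_version, torch_version))
```

Run the script in the remote notebook environment after reinstalling. If the result is not a match, reinstall a PyTorch build that targets the CUDA version of the image.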

Summary and Suggestions

Before creating a training job, use the ModelArts development environment to debug your training code and minimize migration errors. You can do this in either of the following ways:

Use the notebook environment for online debugging. For details, see Using JupyterLab to Develop Models.
Use a local IDE (PyCharm or VS Code) to access the cloud environment for debugging. For details, see Using a Local IDE to Develop Models.
