Failed to Find the .so File During Training
Symptom
During the execution of a ModelArts training job, the following error message is displayed in the log and the training fails:
libcudart.so.9.0: cannot open shared object file: No such file or directory
Possible Cause
The CUDA version of the .so file generated during compilation is different from that of the training job.
Solution
If the CUDA version in the compilation environment differs from that in the training environment, an error occurs when the training job runs. For example, this error occurs if a .so file compiled in the TensorFlow 1.13 development environment (CUDA 10) is used in the TensorFlow 1.12 training environment (CUDA 9.0).
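If you are not sure which CUDA runtime your compiled .so file expects, you can inspect its dynamic dependencies, for example with ldd. The following is a minimal sketch; the library path is only a placeholder and must be replaced with the path to your own .so file.
import os
# Placeholder path: replace /home/work/user_op.so with your compiled .so file.
os.system("ldd /home/work/user_op.so | grep libcudart")
The output shows which libcudart.so.x.y version the library was linked against and whether it can be resolved in the current environment.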
To resolve this issue, perform the following operations:
1. Add the following command before executing the training code to check whether the .so file can be found in the training environment. If the .so file is found, go to 2. Otherwise, go to 3. (A combined version of the checks in 1 and 3 is sketched after this list.)
import os; os.system("find /usr -name '*libcudart.so*'")
2. Configure the LD_LIBRARY_PATH environment variable and submit the training job again.
For example, if the .so file is stored in /usr/local/cuda/lib64, configure LD_LIBRARY_PATH as follows:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64
3. Run the following command to check whether the CUDA version of the training environment supports the .so file:
os.system("cat /usr/local/cuda/version.txt")
- If it does, obtain the required .so file from an external source (for example, download it) and configure LD_LIBRARY_PATH as described in 2.
- If it does not, switch to an AI engine whose CUDA version matches the .so file and submit the training job again. Alternatively, create the training job with a custom image. For details, see Using a Custom Image to Train Models.
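The checks in 1 and 3 can also be combined into a short snippet placed at the very beginning of the training script, so that both the location of libcudart and the CUDA version of the training environment appear in the training log. This is an illustrative sketch only; adjust the search path if your image installs CUDA elsewhere.
import os

# Step 1: locate libcudart in the training image.
found = os.popen("find /usr -name '*libcudart.so*' 2>/dev/null").read().split()
print("libcudart files found:", found if found else "none")

# Step 3: print the CUDA version of the training environment.
print(os.popen("cat /usr/local/cuda/version.txt 2>/dev/null").read().strip())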