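Error "Unexpected error from cudaGetDeviceCount" When Using torch
Symptom
When a GPU-enabled script is run in a Notebook, it fails with an incompatibility error, even though the version reported by nvcc --version appears to be compatible.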
    import torch
    import sys
    print('A', sys.version)
    print('B', torch.__version__)
    print('C', torch.cuda.is_available())
    print('D', torch.backends.cudnn.enabled)
    device = torch.device('cuda')
    print('E', torch.cuda.get_device_properties(device))
    print('F', torch.tensor([1.0, 2.0]).cuda())
The following error is reported:
    Traceback (most recent call last):
      File "test.py", line 8, in <module>
        print('E', torch.cuda.get_device_properties(device))
      File "/opt/conda/lib/python3.7/site-packages/torch/cuda/__init__.py", line 356, in get_device_properties
        _lazy_init()  # will define _get_device_properties
      File "/opt/conda/lib/python3.7/site-packages/torch/cuda/__init__.py", line 214, in _lazy_init
        torch._C._cuda_init()
    RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 803: system has unsupported display driver / cuda driver combination
Solution
- First, check whether the CUDA and torch versions are compatible.
    # CUDA version
    nvcc --version
    # Driver version (nvidia-smi)
    nvidia-smi
    # torch version (confirm which conda environment's python the user is running)
    python -c "import torch;print(torch.__version__)"
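Compatible versions are listed on the PyTorch website: https://pytorch.org/get-started/previous-versions/
- If multiple CUDA versions are installed in the environment, check the order of the CUDA directories in LD_LIBRARY_PATH. The order may need to be adjusted manually.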
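The checks above can also be gathered in one place. The following is a minimal sketch, not a definitive diagnostic: the helper names run_quiet and collect_versions are hypothetical, and it assumes nvidia-smi and nvcc are on PATH (a missing tool is reported as None). It additionally reads torch.version.cuda, the CUDA version the installed torch build was compiled against, which can differ from the toolkit version that nvcc reports. None of these calls initialize CUDA, so the script should run even when the error above occurs.

```python
import shutil
import subprocess


def run_quiet(cmd):
    """Return the stripped stdout of cmd, or None if the tool is missing or fails."""
    if shutil.which(cmd[0]) is None:
        return None
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.SubprocessError):
        return None
    return result.stdout.strip() if result.returncode == 0 else None


def collect_versions():
    """Collect torch, driver, and toolkit versions without initializing CUDA."""
    info = {"torch": None, "torch_built_for_cuda": None}
    try:
        import torch
        info["torch"] = torch.__version__
        # CUDA version this torch build was compiled against (None on CPU-only builds)
        info["torch_built_for_cuda"] = torch.version.cuda
    except (ImportError, OSError):
        pass  # torch is not importable in this interpreter
    # Driver version as reported by nvidia-smi (one line per GPU)
    info["driver"] = run_quiet(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"]
    )
    # Full nvcc banner, which includes the toolkit release
    info["nvcc"] = run_quiet(["nvcc", "--version"])
    return info


if __name__ == "__main__":
    for key, value in collect_versions().items():
        print(f"{key}: {value}")
```

Run it with the same python that the Notebook kernel uses, so that the torch it imports is the one that raised the error.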
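Example: suppose torch is compatible only with cuda-9.1, and the current value is LD_LIBRARY_PATH=/usr/local/cuda-11.8/lib64:/usr/local/cuda-9.1/lib64. Directories listed earlier are searched first, so the cuda-11.8 libraries take precedence.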
Adjust the priority manually by running: export LD_LIBRARY_PATH=/usr/local/cuda-9.1/lib64:$LD_LIBRARY_PATH
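The reordering in this example can be sketched as a small helper. This is an illustration only: the function name prioritize is hypothetical, and the cuda-9.1 path is taken from the example, not from any real environment. Unlike the plain export command, it also drops the duplicate entry that prepending would leave behind. It prints an export line instead of modifying os.environ, because the variable needs to be set in the shell before python starts for the dynamic loader to pick it up.

```python
import os


def prioritize(path_value, preferred):
    """Move `preferred` to the front of a colon-separated path list.

    Empty entries and duplicates are dropped; the relative order of the
    remaining entries is preserved.
    """
    ordered = [preferred]
    for entry in path_value.split(":"):
        if entry and entry not in ordered:
            ordered.append(entry)
    return ":".join(ordered)


if __name__ == "__main__":
    current = os.environ.get("LD_LIBRARY_PATH", "")
    # Path from the example above; replace it with the CUDA version torch needs.
    print("export LD_LIBRARY_PATH=" + prioritize(current, "/usr/local/cuda-9.1/lib64"))
```

Paste the printed line into the shell (or the Notebook terminal) and restart the Python process so the new search order takes effect.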