Updated: 2024-12-30 GMT+08:00

torch reports "Unexpected error from cudaGetDeviceCount"

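Symptom

When a GPU script is run in a Notebook, it fails with a driver/CUDA incompatibility error, even though a check with nvcc --version suggests that the versions are compatible.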

import torch
import sys
print('A', sys.version)
print('B', torch.__version__)
print('C', torch.cuda.is_available())
print('D', torch.backends.cudnn.enabled)
device = torch.device('cuda')
print('E', torch.cuda.get_device_properties(device))
print('F', torch.tensor([1.0, 2.0]).cuda())

The following error is reported:

Traceback (most recent call last):
  File "test.py", line 8, in <module>
    print('E', torch.cuda.get_device_properties(device))
  File "/opt/conda/lib/python3.7/site-packages/torch/cuda/__init__.py", line 356, in get_device_properties
    _lazy_init() # will define _get_device_properties
  File "/opt/conda/lib/python3.7/site-packages/torch/cuda/__init__.py", line 214, in _lazy_init
    torch._C._cuda_init()
RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 803: system has unsupported display driver / cuda driver combination

Solution

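  1. First check whether the CUDA and torch versions are compatible.
    # CUDA toolkit version
    nvcc --version
    # Driver version and the highest CUDA version that the driver supports
    nvidia-smi
    
    # torch version (make sure this is the python of the conda environment the user actually runs)
    python -c "import torch;print(torch.__version__)"

    Compatible versions are listed on the PyTorch website: https://pytorch.org/get-started/previous-versions/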

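  2. If several CUDA versions are installed in the environment, check the order of the CUDA directories in LD_LIBRARY_PATH and adjust it manually if needed.

    Example: torch is compatible only with cuda-9.1, but the query returns LD_LIBRARY_PATH=/usr/local/cuda-11.8/lib64:/usr/local/cuda-9.1/lib64

    In this case, move cuda-9.1 to the front by running export LD_LIBRARY_PATH=/usr/local/cuda-9.1/lib64:$LD_LIBRARY_PATH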
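
The two checks above can be sketched in Python. This is only a minimal illustration, not part of the ModelArts tooling: the helper names parse_nvcc_release and prioritize are hypothetical, the regular expression assumes the usual "release X.Y" wording in the nvcc --version output, and the paths are the ones from the example in step 2.

```python
import re


def parse_nvcc_release(nvcc_output):
    """Return the CUDA toolkit release (such as '11.8') found in the text
    printed by `nvcc --version`, or None if no release string is present.
    Assumes the output contains a fragment like 'release 11.8'."""
    match = re.search(r"release (\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


def prioritize(path_value, preferred):
    """Return a colon-separated search path with `preferred` moved to the
    front. Empty entries and duplicates of `preferred` are dropped; the
    relative order of the remaining entries is kept."""
    entries = [p for p in path_value.split(":") if p]
    rest = [p for p in entries if p != preferred]
    return ":".join([preferred] + rest)


# Sample text in the shape nvcc prints (illustrative, not captured output).
sample = "Cuda compilation tools, release 11.8, V11.8.89"
print(parse_nvcc_release(sample))
# prints 11.8

# The LD_LIBRARY_PATH value from the example in step 2.
current = "/usr/local/cuda-11.8/lib64:/usr/local/cuda-9.1/lib64"
print(prioritize(current, "/usr/local/cuda-9.1/lib64"))
# prints /usr/local/cuda-9.1/lib64:/usr/local/cuda-11.8/lib64
```

The reordered value is meant to be exported in the shell before Python is started, because the dynamic loader reads LD_LIBRARY_PATH when the process starts; changing it from inside an already running interpreter is unlikely to affect libraries that process loads.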