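torch Reports "Unexpected error from cudaGetDeviceCount"

Symptom

Running a GPU script in a Notebook fails with an incompatibility error, even though the check with nvcc --version suggests that the versions are compatible.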

import torch
import sys
print('A', sys.version)
print('B', torch.__version__)
print('C', torch.cuda.is_available())
print('D', torch.backends.cudnn.enabled)
device = torch.device('cuda')
print('E', torch.cuda.get_device_properties(device))
print('F', torch.tensor([1.0, 2.0]).cuda())

The following error is reported:

Traceback (most recent call last):
  File "test.py", line 8, in <module>
    print('E', torch.cuda.get_device_properties(device))
  File "/opt/conda/lib/python3.7/site-packages/torch/cuda/__init__.py", line 356, in get_device_properties
    _lazy_init() # will define _get_device_properties
  File "/opt/conda/lib/python3.7/site-packages/torch/cuda/__init__.py", line 214, in _lazy_init
    torch._C._cuda_init()
RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 803: system has unsupported display driver / cuda driver combination

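Solution

First check whether the CUDA and torch versions are compatible.

# CUDA toolkit version
nvcc --version
# Driver version and the highest CUDA version that the driver supports
nvidia-smi

# torch version (first confirm which conda environment's python the user is running)
python -c "import torch;print(torch.__version__)"

Compatible versions are listed on the PyTorch website: https://pytorch.org/get-started/previous-versions/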
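
The same check can be sketched in Python so that all the numbers to compare appear in one place. This is a minimal sketch, not an official tool: the helper names parse_nvcc_release and torch_build_info are hypothetical, and it assumes that nvcc prints a line containing "release X.Y". torch.version.cuda is the CUDA version that the installed torch build was compiled against, which is the value to look up on the page above.

```python
import re
import shutil
import subprocess


def parse_nvcc_release(text):
    """Return the (major, minor) CUDA release found in `nvcc --version` output, or None."""
    match = re.search(r"release\s+(\d+)\.(\d+)", text)
    return (int(match.group(1)), int(match.group(2))) if match else None


def torch_build_info():
    """Return torch's version and the CUDA version it was built against (None if torch is missing)."""
    try:
        import torch
    except ImportError:
        return {"torch": None, "built_with_cuda": None}
    return {"torch": torch.__version__, "built_with_cuda": torch.version.cuda}


if __name__ == "__main__":
    print("torch build:", torch_build_info())
    nvcc = shutil.which("nvcc")  # None when nvcc is not on PATH
    if nvcc:
        out = subprocess.run([nvcc, "--version"], capture_output=True, text=True).stdout
        print("nvcc release:", parse_nvcc_release(out))
    else:
        print("nvcc not found on PATH")
```

Run it with the same python interpreter that the failing script uses; otherwise it may report the torch build from a different conda environment.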

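If multiple CUDA versions are installed in the environment, check the order of the CUDA directories in LD_LIBRARY_PATH and adjust it manually if needed, because the directory listed first takes precedence.
Example: torch is compatible only with cuda-9.1, but the query returns LD_LIBRARY_PATH=/usr/local/cuda-11.8/lib64:/usr/local/cuda-9.1/lib64

In this case, move cuda-9.1 to the front by running export LD_LIBRARY_PATH=/usr/local/cuda-9.1/lib64:$LD_LIBRARY_PATH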
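
To preview what the reordered value should look like before exporting it, the reordering can be sketched as a small Python function. The function name prioritize is hypothetical, and the paths are the ones from the example above. Unlike the plain export, this sketch also drops the duplicate entry that prepending would leave behind.

```python
def prioritize(path_value, preferred, sep=":"):
    """Return path_value with `preferred` moved to the front and listed only once."""
    entries = [p for p in path_value.split(sep) if p]
    rest = [p for p in entries if p != preferred]
    return sep.join([preferred] + rest)


current = "/usr/local/cuda-11.8/lib64:/usr/local/cuda-9.1/lib64"
print(prioritize(current, "/usr/local/cuda-9.1/lib64"))
# prints /usr/local/cuda-9.1/lib64:/usr/local/cuda-11.8/lib64
```

The dynamic loader reads LD_LIBRARY_PATH when a process starts, so the new value needs to be exported in the shell (or set in the Notebook kernel's startup environment) before python is launched. Assigning it to os.environ inside an already running script is not expected to change which CUDA libraries that script loads.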
