Why Is the GPU Driver Unavailable?
Symptom
When you run the nvidia-smi command to check GPU usage, the following error is displayed:
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
Possible Causes
The system kernel was upgraded, and the GPU driver, which was installed for the earlier kernel, is unavailable in the new kernel.
Troubleshooting
Run the command for your OS on the server to find the kernel version for which the driver is installed:
- CentOS: find /usr/lib/modules -name nvidia.ko
- Ubuntu: find /lib/modules -name nvidia.ko
For example, run the preceding command in CentOS. If the command output shown in Figure 2 is displayed, the GPU driver is installed on the 3.10.0-957.5.1.el7.x86_64 kernel.
Run the uname -r command to check the current kernel version. The command output shown in Figure 3 indicates that the current kernel version is 3.10.0-1160.24.1.el7.x86_64.
In this example, the kernel version for which the driver is installed differs from the current kernel version, so the driver is unavailable.
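The comparison above can be sketched as a small shell helper. This is a minimal sketch, not part of the original procedure: the function names driver_kernels and check_driver_kernel are hypothetical, and it assumes the driver module file is named nvidia.ko, as in the find commands above.

```shell
#!/bin/sh
# Hypothetical helpers (names are illustrative, not from any official tool).

# List the kernel versions under a modules directory that contain nvidia.ko.
# $1: modules root (/usr/lib/modules on CentOS, /lib/modules on Ubuntu)
driver_kernels() {
    # find prints ./<kernel-version>/...; field 2 is the kernel version.
    (cd "$1" 2>/dev/null && find . -name nvidia.ko) | cut -d/ -f2 | sort -u
}

# Report whether a kernel version has the driver module installed.
# $1: modules root, $2: kernel version (defaults to the running kernel)
check_driver_kernel() {
    current=${2:-$(uname -r)}
    if driver_kernels "$1" | grep -qxF "$current"; then
        echo "match: nvidia.ko found for kernel $current"
    else
        echo "mismatch: no nvidia.ko for kernel $current"
        return 1
    fi
}
```

For example, running check_driver_kernel /usr/lib/modules on CentOS (or check_driver_kernel /lib/modules on Ubuntu) reports a mismatch when the running kernel has no driver module, which corresponds to the situation described above.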
Solution
- Method 1: Restart the ECS and select the kernel version used when the GPU driver was installed.
- In the ECS list, locate the row that contains the target ECS and click Remote Login in the Operation column. In the displayed dialog box, click Log In in the Other Login Modes area.
- Click Ctrl+Alt+Del in the upper part of the remote login panel to restart the ECS.
- Refresh the page immediately and press the up or down arrow key to interrupt the automatic startup. Then, select the kernel version used when the GPU driver was installed and press Enter to start the system. The GPU driver is available after the system starts with that kernel.
- Method 2: Reinstall the driver based on the new kernel version.
- Uninstall the driver.
- a. Run the nvidia-uninstall command to uninstall the driver.
If the system displays a message indicating that the command does not exist, go to b.
- b. Run the whereis nvidia command to query the version of the driver installed on the ECS.
Figure 4 Installed driver version
Download the driver package of that version from the NVIDIA official website. (This package is required for both uninstalling and reinstalling the driver.)
For example, if the driver version is nvidia-396.44, run the sh NVIDIA-Linux-x86_64-396.44.run --uninstall command to uninstall the driver.
- Reinstall the driver.
For details, see Installing a Driver and Toolkit.
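The uninstall step of Method 2 can be sketched as follows. This is a minimal sketch under stated assumptions: the function names runfile_for_version and uninstall_driver are hypothetical, the package file name follows the NVIDIA-Linux-x86_64-396.44.run pattern from the example above, and the package is assumed to have been downloaded to a directory you pass in.

```shell
#!/bin/sh
# Hypothetical helpers (names are illustrative, not from any official tool).

# Build the driver package file name for a version such as 396.44.
runfile_for_version() {
    echo "NVIDIA-Linux-x86_64-$1.run"
}

# Uninstall the driver: use nvidia-uninstall if it exists (step a),
# otherwise fall back to the downloaded package of the same version (step b).
# $1: installed driver version, $2: directory that holds the .run package
uninstall_driver() {
    if command -v nvidia-uninstall >/dev/null 2>&1; then
        nvidia-uninstall
    else
        runfile="$2/$(runfile_for_version "$1")"
        if [ ! -f "$runfile" ]; then
            echo "missing $runfile; download it first" >&2
            return 1
        fi
        sh "$runfile" --uninstall
    fi
}
```

For example, uninstall_driver 396.44 /root would run nvidia-uninstall if the command is present and otherwise run the package in /root with the --uninstall option. The directory /root here is only a placeholder.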