Updated on 2022-07-15 GMT+08:00

Why Is the GPU Driver Unavailable?

Symptom

When you run the nvidia-smi command to check GPU usage, the following error is displayed:

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
Figure 1 GPU driver unavailable

Possible Causes

The system kernel has been upgraded. The GPU driver was installed for the previous kernel version, so it is unavailable in the new kernel.

Troubleshooting

Run the corresponding command on the server to check the version of the kernel where the driver is installed:

  • CentOS: find /usr/lib/modules -name nvidia.ko
  • Ubuntu: find /lib/modules -name nvidia.ko

For example, run the preceding command in CentOS. If the command output shown in Figure 2 is displayed, the GPU driver is installed on the 3.10.0-957.5.1.el7.x86_64 kernel.

Figure 2 Version of the kernel where the driver is installed

Run the uname -r command. The command output shown in Figure 3 indicates that the current kernel version is 3.10.0-1160.24.1.el7.x86_64.

Figure 3 Current kernel version

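In this example, the version of the kernel where the driver is installed (3.10.0-957.5.1.el7.x86_64) is different from the current kernel version (3.10.0-1160.24.1.el7.x86_64), so the driver is unavailable in the current kernel.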
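
The comparison above can be sketched as a small shell script. This is only an illustration, not part of the official procedure: the function names driver_kernels and check_driver_kernel are hypothetical, and the script assumes the module file is named nvidia.ko and sits somewhere under a directory named after the kernel version, as in the find commands above.

```shell
#!/bin/sh
# Hypothetical helpers, a sketch of the manual check described above.

# Print each kernel version under a modules root that contains nvidia.ko.
# $1: modules root, for example /usr/lib/modules (CentOS) or /lib/modules (Ubuntu)
driver_kernels() {
    find "$1" -name nvidia.ko 2>/dev/null | while IFS= read -r path; do
        rel=${path#"$1"/}      # strip the modules root from the path
        echo "${rel%%/*}"      # the first remaining component is the kernel version
    done | sort -u
}

# Compare the kernels that have the driver with a given kernel version.
# $1: modules root, $2: kernel version to check (normally the output of uname -r)
# Returns 0 on a match, 1 on a mismatch, 2 if no nvidia.ko was found.
check_driver_kernel() {
    kernels=$(driver_kernels "$1")
    if [ -z "$kernels" ]; then
        echo "no nvidia.ko found under $1"
        return 2
    fi
    if printf '%s\n' "$kernels" | grep -qxF -e "$2"; then
        echo "driver is installed for kernel $2"
        return 0
    fi
    echo "driver is installed for: $kernels"
    echo "current kernel is: $2"
    return 1
}
```

On CentOS, for example, the check could be started with check_driver_kernel /usr/lib/modules "$(uname -r)". A return value of 1 corresponds to the mismatch described in this section.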

Solution

  • Method 1: Restart the ECS and select the kernel version used when the GPU driver was installed.
    1. In the ECS list, locate the row that contains the target ECS and click Remote Login in the Operation column. In the displayed dialog box, click Log In in the Other Login Modes area.
    2. Click Ctrl+Alt+Del in the upper part of the remote login panel to restart the ECS.
    3. Refresh the page quickly. When the boot menu appears, press the up or down arrow key to stop the ECS from automatically booting the default kernel. Then, select the kernel version used when the GPU driver was installed and press Enter to enter the system. The GPU driver is available in the selected kernel version.
  • Method 2: Reinstall the driver based on the new kernel version.
    1. Uninstall the driver.
      • a. Run the nvidia-uninstall command to uninstall the driver.

        If the system displays a message indicating that the command does not exist, go to b.

      • b. Run the whereis nvidia command to query the version of the driver installed on the ECS.
        Figure 4 Installed driver version

        Download the driver package of the same version from the official NVIDIA website. (This package is required both to uninstall and to reinstall the driver.)

        For example, if the driver version is nvidia-396.44, run the sh NVIDIA-Linux-x86_64-396.44.run --uninstall command to uninstall the driver.

    2. Reinstall the driver.

      For details, see Installing a Driver and Toolkit.
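
The choice between steps a and b in Method 2 can be sketched as follows. This is a hedged illustration only: the function name uninstall_command is hypothetical, the script prints the command instead of running it, and it assumes the version is given in the nvidia-396.44 form used in the example above and that the matching .run package is in the current directory.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) the uninstall command for Method 2, step 1.
# $1: installed driver version, for example nvidia-396.44
uninstall_command() {
    if command -v nvidia-uninstall >/dev/null 2>&1; then
        # Step a: the uninstaller shipped with the driver is on the PATH.
        echo "nvidia-uninstall"
    else
        # Step b: fall back to the .run package of the same version.
        version=${1#nvidia-}   # drop the leading "nvidia-" if present
        echo "sh NVIDIA-Linux-x86_64-${version}.run --uninstall"
    fi
}
```

Printing the command rather than executing it lets you review it before uninstalling the driver on the ECS.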