Updated on 2022-07-15 GMT+08:00

How Do I Troubleshoot GPU Start Failures Caused by NULL Pointer Dereference on NVIDIA?

Symptom

A GPU instance fails to start. The system log contains the message "Unable to handle kernel NULL pointer dereference at 0000000000000008", as shown in Figure 1.

Figure 1 NVIDIA driver NULL pointer access
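
If a copy of the system log has been saved to a file, the symptom can be confirmed with a quick search. The following is a minimal sketch, not an official procedure: the function name has_null_deref and the file name console.log are hypothetical, and the match is case-insensitive on the assumption that the capitalization of the message can differ between kernels.

```shell
#!/bin/sh
# Hypothetical helper: returns 0 if the given log file contains the
# NULL pointer dereference message described in the Symptom section.
# The search ignores case because the exact capitalization may vary.
has_null_deref() {
    grep -qi 'unable to handle kernel NULL pointer dereference' "$1"
}
```

For example, has_null_deref console.log && echo "NULL pointer dereference found" prints the message only when the log contains the error.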

Possible Causes

The NVIDIA GPU driver installed on the instance is malfunctioning.

Solution

  1. Uninstall the driver.
    • Method 1: Run the nvidia-uninstall command to uninstall the driver.

      If the system reports that the command does not exist, use Method 2.

    • Method 2: Run the whereis nvidia command to query the version of the driver installed on the ECS.
      Figure 2 Installed driver version

      Download the driver package of that same version from the official NVIDIA website. The package is needed both to uninstall the driver and to reinstall it.

      For example, if the driver version is nvidia-396.44, run the sh NVIDIA-Linux-x86_64-396.44.run --uninstall command to uninstall the driver.

  2. Reinstall the driver.

    For details, see Installing a Driver and Toolkit.
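
The uninstall step above can be sketched as a small shell script. This is an illustration under stated assumptions rather than an official tool: the function names are hypothetical, the whereis output is assumed to contain a path component of the form nvidia-<version> (as in the nvidia-396.44 example), and the matching .run package is assumed to have been downloaded to the current directory already. Run it as root.

```shell
#!/bin/sh
# Hypothetical helpers; none of these functions are part of the NVIDIA tooling.

# Extract a version such as 396.44 from text containing "nvidia-396.44".
# Assumption: the whereis output includes a path named nvidia-<version>.
extract_driver_version() {
    printf '%s\n' "$1" |
        sed -n 's/.*nvidia-\([0-9][0-9]*\(\.[0-9][0-9]*\)*\).*/\1/p' |
        head -n 1
}

# Build the installer file name for a version, following the x86_64
# naming used in the example (NVIDIA-Linux-x86_64-396.44.run).
installer_name() {
    printf 'NVIDIA-Linux-x86_64-%s.run\n' "$1"
}

# Try Method 1 first; fall back to Method 2 if nvidia-uninstall is missing.
uninstall_driver() {
    if command -v nvidia-uninstall >/dev/null 2>&1; then
        nvidia-uninstall
        return
    fi
    version=$(extract_driver_version "$(whereis nvidia)")
    if [ -z "$version" ]; then
        echo "Could not determine the driver version from whereis output." >&2
        return 1
    fi
    installer=$(installer_name "$version")
    if [ ! -f "$installer" ]; then
        echo "Download $installer from the NVIDIA website first." >&2
        return 1
    fi
    sh "$installer" --uninstall
}
```

Calling uninstall_driver performs the uninstallation; nothing runs until the function is invoked. After it completes, reinstall the driver as described in step 2.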