How Do I Troubleshoot GPU Start Failures Caused by NULL Pointer Dereference on NVIDIA?
Symptom
A GPU instance fails to start. The system log shows the message "Unable to handle kernel NULL pointer dereference at 0000000000000008", as shown in Figure 1.
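If you have saved a copy of the system log as a text file, the symptom can be confirmed with a short search. The following is a minimal sketch, not an official procedure: LOG_FILE and its default path ./system.log are hypothetical placeholders, and has_null_deref is a helper name chosen for illustration.

```shell
#!/bin/sh
# Sketch: look for the error signature in a saved copy of the system log.
# LOG_FILE is a hypothetical path; point it at wherever the log was saved.
LOG_FILE="${LOG_FILE:-./system.log}"

has_null_deref() {
    # grep -q: no output, exit status 0 on a match.
    # grep -F: treat the pattern as a fixed string, not a regex.
    grep -qF "Unable to handle kernel NULL pointer dereference" "$1"
}

if [ -f "$LOG_FILE" ] && has_null_deref "$LOG_FILE"; then
    echo "NULL pointer dereference found in $LOG_FILE"
else
    echo "Signature not found in $LOG_FILE"
fi
```

The search matches only the fixed part of the message, because the address at the end may differ between systems.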
Possible Causes
The NVIDIA GPU driver installed on the instance is in an abnormal state.
Solution
- Uninstall the driver.
  - Method 1: Run the nvidia-uninstall command to uninstall the driver.
    If the system reports that the command does not exist, use Method 2.
  - Method 2: Run the whereis nvidia command to query the version of the driver installed on the ECS.
    Figure 2 Installed driver version
    Download the driver package of that same version from the NVIDIA official website. (The package is needed both to uninstall and to reinstall the driver.)
    For example, if the driver version is nvidia-396.44, run the sh NVIDIA-Linux-x86_64-396.44.run --uninstall command to uninstall the driver.
- Reinstall the driver.
For details, see Installing a Driver and Toolkit.
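The choice between the two uninstall methods can be sketched as a small script. This is an illustration under stated assumptions, not the vendor's tooling: it only prints the command it would suggest and does not uninstall anything. DRIVER_NAME, run_file_name, and uninstall_cmd are names introduced here for illustration, and nvidia-396.44 is the example version from this page rather than a detected value.

```shell
#!/bin/sh
# Sketch: suggest an uninstall command; nothing is actually uninstalled.
# DRIVER_NAME is the driver name read from the `whereis nvidia` output.
# nvidia-396.44 is the example used on this page, not a detected value.
DRIVER_NAME="${DRIVER_NAME:-nvidia-396.44}"

run_file_name() {
    # Strip the "nvidia-" prefix to get the bare version (396.44),
    # then build the installer file name for that version.
    version="${1#nvidia-}"
    echo "NVIDIA-Linux-x86_64-${version}.run"
}

uninstall_cmd() {
    if command -v nvidia-uninstall >/dev/null 2>&1; then
        # Method 1: the uninstaller is available on the PATH.
        echo "nvidia-uninstall"
    else
        # Method 2: fall back to the .run package of the same version.
        echo "sh $(run_file_name "$DRIVER_NAME") --uninstall"
    fi
}

echo "Suggested command: $(uninstall_cmd)"
```

To use it with a different version, set DRIVER_NAME before running the script. Method 2 still requires that the matching .run package has been downloaded to the current directory.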