Updated on 2024-08-15 GMT+08:00

Why Is the GPU Display Abnormal?

Symptom

The following issues occur when the nvidia-smi command is run to check GPU usage.

  • On the device with one GPU, the following information is displayed:
    No devices were found
  • On the device with multiple GPUs, the output shows fewer GPUs than are installed.

    However, the output of the lspci | grep -i nvidia command shows the correct number of GPUs.
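
The comparison described above can be sketched as a short script. This is an illustrative sketch, not an official tool: the helper names count_pci_gpus and count_driver_gpus are hypothetical, and the script assumes that GPUs appear in the lspci output as "3D controller" or "VGA compatible controller" entries and that nvidia-smi -L prints one "GPU N: ..." line per detected GPU.

```shell
#!/bin/sh
# Sketch: compare the GPU count on the PCI bus with the count the driver reports.
# Helper names are hypothetical; the matching patterns are assumptions.

# Count PCI entries that look like NVIDIA GPUs (reads lspci output on stdin).
count_pci_gpus() {
    grep -i nvidia | grep -ciE 'vga compatible controller|3d controller' || true
}

# Count GPUs listed by the driver (reads `nvidia-smi -L` output on stdin).
count_driver_gpus() {
    grep -c '^GPU [0-9]' || true
}

# Run the comparison only when both tools are available on this machine.
if command -v lspci >/dev/null 2>&1 && command -v nvidia-smi >/dev/null 2>&1; then
    pci=$(lspci | count_pci_gpus)
    drv=$(nvidia-smi -L 2>/dev/null | count_driver_gpus)
    echo "GPUs on PCI bus: $pci, GPUs visible to the driver: $drv"
    if [ "$drv" -lt "$pci" ]; then
        echo "The driver sees fewer GPUs than the PCI bus reports."
    fi
fi
```

If the second number is lower than the first, the symptom matches this article and the steps in the Solution section apply.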

Solution

  1. Check whether the ECS uses NVIDIA Tesla T4 GPUs, for example, an ECS of the PI2 or G6 flavor.
  2. Check the system log /var/log/messages for driver-related errors.
    • If the error message "Failed to copy vbios to system memory" is displayed, a possible cause is frequent loading and unloading of the driver. You are advised to enable the driver's persistence mode to keep the driver loaded.
      Figure 1 System logs
      1. Run the following command to enable the driver's persistence mode:

        nvidia-smi -pm 1

      2. Run the following command to open and edit the /etc/rc.local file:

        vim /etc/rc.local

      3. To enable persistence mode automatically at startup, add the nvidia-smi -pm 1 command to the /etc/rc.local file.
      4. Press Esc, enter :wq, and press Enter to save the settings and exit.
      5. Run the following command to make the rc.local file executable so that it runs at startup:

        chmod +x /etc/rc.d/rc.local

    • If "Failed to copy vbios to system memory" is not displayed, go to the next step.
  3. Check whether the ECS uses Tesla driver version 510.xx.xx.
    • If yes, the driver version may be incompatible with the image. You are advised to change the driver version. For details, see GPU Driver.
    • If no, go to the next step.
  4. Restart the ECS and run the nvidia-smi command to check whether the GPUs are displayed properly.

    If the fault persists, contact customer service.
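
The log check in step 2 and the driver version check in step 3 can be sketched as follows. This is a minimal sketch under stated assumptions: the helper names has_vbios_error and is_510_driver and the LOG_FILE variable are hypothetical, the system log is assumed to be /var/log/messages (set LOG_FILE to override), and the driver version is read with modinfo, which assumes the nvidia kernel module is installed.

```shell
#!/bin/sh
# Sketch: automate the log check (step 2) and the driver version check (step 3).
# Helper names and LOG_FILE are hypothetical; adjust the log path for your image.

# Succeed if the log text on stdin contains the vbios copy error.
has_vbios_error() {
    grep -q 'Failed to copy vbios to system memory'
}

# Succeed if the driver version passed as $1 belongs to the 510 branch.
is_510_driver() {
    case "$1" in
        510.*) return 0 ;;
        *) return 1 ;;
    esac
}

LOG_FILE=${LOG_FILE:-/var/log/messages}

# Step 2: look for the vbios error in the system log, if it is readable.
if [ -r "$LOG_FILE" ] && has_vbios_error < "$LOG_FILE"; then
    echo "vbios copy error found: consider enabling persistence mode (nvidia-smi -pm 1)."
fi

# Step 3: read the installed driver version from the kernel module metadata.
if command -v modinfo >/dev/null 2>&1; then
    version=$(modinfo -F version nvidia 2>/dev/null || true)
    if [ -n "$version" ] && is_510_driver "$version"; then
        echo "Driver $version is in the 510 branch: consider changing the driver version."
    fi
fi
```

The script only reports findings. Enabling persistence mode, changing the driver version, and restarting the ECS remain manual steps as described above.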