Updated on 2025-07-30 GMT+08:00

What Do I Do If ECC Error "double bit ecc error" Occurs and There Are No Retired Pages Shown in the nvidia-smi -q Command Output?

Possible Causes

The GPU memory may be faulty. A double-bit ECC error is a memory error that ECC can detect but cannot correct.

Impact

Applications that use the GPU may be affected.

Solution

Run the nvidia-smi command to view the graphics card information.

  • In the command output, if the number of ECC errors in the Volatile Uncorr. ECC column is greater than 0, run the nvidia-smi -q -i ${gpu_id} command to view the details of that graphics card.
  • In the command output, if the number of ECC errors in the Volatile Uncorr. ECC column is 0, run the nvidia-smi -q command to view the details of all graphics cards.
  • If Pending Page Blacklist is No and the double bit ecc error occurs frequently, perform the following steps to check whether the graphics card needs to be replaced:
    1. Run the nvidia-smi -r command to reset the GPU.
    2. Run the nvidia-smi --query-retired-pages=gpu_name,gpu_bus_id,gpu_serial,retired_pages.cause,retired_pages.timestamp --format=csv command. If double bit ecc is reported five consecutive times, contact technical support to replace the graphics card. Alternatively, reset the GPU and check whether the services recover. If they do, the graphics card can still be used.
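
The check in step 2 can be scripted. The following is a minimal sketch in Python, not an official tool. It assumes that the CSV output starts with one header line, that columns appear in the order queried, and that the cause column contains the text "Double Bit ECC" for double-bit errors. The function names and the DBE_THRESHOLD constant are hypothetical. The sketch counts the total number of double-bit entries per GPU, which is a simplification of "five consecutive times".

```python
import csv
import io
import subprocess

# Fields queried in step 2 of the article, in the same order.
QUERY_FIELDS = ("gpu_name,gpu_bus_id,gpu_serial,"
                "retired_pages.cause,retired_pages.timestamp")
# Hypothetical constant: the article's "five times" rule of thumb.
DBE_THRESHOLD = 5


def query_retired_pages():
    """Run the command from step 2 and return its raw CSV output."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-retired-pages={QUERY_FIELDS}", "--format=csv"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def count_double_bit_pages(csv_text):
    """Map each gpu_bus_id to its number of double-bit ECC entries.

    Assumes the first non-empty line is a header and that columns
    follow the order of QUERY_FIELDS.
    """
    fields = QUERY_FIELDS.split(",")
    bus_idx = fields.index("gpu_bus_id")
    cause_idx = fields.index("retired_pages.cause")

    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    rows = [row for row in reader if row]
    counts = {}
    for row in rows[1:]:  # skip the header line
        if len(row) <= max(bus_idx, cause_idx):
            continue  # ignore malformed or truncated lines
        bus_id = row[bus_idx].strip()
        counts.setdefault(bus_id, 0)
        # Assumption: double-bit causes contain the text "Double Bit".
        if "double bit" in row[cause_idx].lower():
            counts[bus_id] += 1
    return counts


def gpus_to_escalate(counts, threshold=DBE_THRESHOLD):
    """Return the bus IDs whose double-bit count reaches the threshold."""
    return sorted(bus for bus, n in counts.items() if n >= threshold)
```

On the ECS, calling gpus_to_escalate(count_double_bit_pages(query_retired_pages())) would return the bus IDs of the graphics cards that may need to be reported to technical support. An empty list means that no card reached the threshold, in which case resetting the GPU and checking whether the services recover remains the next step.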