What Do I Do If ECC Error "double bit ecc error" Occurs and There Are No Retired Pages Shown in the nvidia-smi -q Command Output?
Updated on 2025-07-30 GMT+08:00
Possible Causes
Errors may occur in the GPU memory.
Impact
GPU-related applications may be affected.
Solution
Run the nvidia-smi command to view the graphics card information.
- In the command output, if the number of ECC errors in the Volatile Uncorr. ECC column is greater than 0, run the nvidia-smi -q -i {gpu_id} command to view the details of that graphics card. Replace {gpu_id} with the identifier of the affected GPU, for example its index in the nvidia-smi output.
- In the command output, if the number of ECC errors in the Volatile Uncorr. ECC column is 0, run the nvidia-smi -q command to view the details of all graphics cards.
- If Pending Page Blacklist is No in the details and the double bit ecc error still occurs frequently, check whether the graphics card can be replaced.
- Run the nvidia-smi -r command to reset the GPU.
- Run the nvidia-smi --query-retired-pages=gpu_name,gpu_bus_id,gpu_serial,retired_pages.cause,retired_pages.timestamp --format=csv command to list the retired pages and the cause of each retirement. If double bit ecc occurs five consecutive times, contact technical support to replace the graphics card. Alternatively, reset the GPU and check whether the services recover. If they do, the graphics card can still be used.
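The checks above can be partly scripted. The following is a minimal Python sketch, not an official tool. It assumes you capture the text output of the two nvidia-smi queries yourself and pass it to the parsers. The function names, the use of the ecc.errors.uncorrected.volatile.total query field in place of reading the Volatile Uncorr. ECC column by eye, and the interpretation of "five consecutive times" as five or more double-bit entries per GPU are all assumptions made for illustration.

```python
import csv
import io

# Commands to run on the ECS. The parsers below only process their text output.
ECC_QUERY = ("nvidia-smi --query-gpu=index,ecc.errors.uncorrected.volatile.total "
             "--format=csv,noheader")
RETIRED_QUERY = ("nvidia-smi --query-retired-pages=gpu_name,gpu_bus_id,gpu_serial,"
                 "retired_pages.cause,retired_pages.timestamp --format=csv")

# Assumption: treat five or more double-bit entries as the replacement threshold.
REPLACE_THRESHOLD = 5


def gpus_with_uncorrected_errors(csv_text):
    """Return {gpu_index: count} for GPUs whose volatile uncorrected count is > 0."""
    result = {}
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        index, count = row[0].strip(), row[1].strip()
        # Non-numeric values (for example when ECC is unsupported) are ignored.
        if count.isdigit() and int(count) > 0:
            result[index] = int(count)
    return result


def double_bit_retirements(csv_text):
    """Count retired-page rows per GPU bus ID whose cause mentions a double-bit error."""
    counts = {}
    reader = csv.DictReader(io.StringIO(csv_text), skipinitialspace=True)
    for row in reader:
        cause = (row.get("retired_pages.cause") or "").lower()
        if "double bit" in cause:
            bus_id = (row.get("gpu_bus_id") or "").strip()
            counts[bus_id] = counts.get(bus_id, 0) + 1
    return counts


def gpus_needing_support(counts, threshold=REPLACE_THRESHOLD):
    """Return the bus IDs whose double-bit count reaches the threshold, sorted."""
    return sorted(bus for bus, n in counts.items() if n >= threshold)
```

To use the sketch, run ECC_QUERY and RETIRED_QUERY on the ECS, pass the captured text to gpus_with_uncorrected_errors and double_bit_retirements, and then pass the resulting counts to gpus_needing_support. An empty result from double_bit_retirements corresponds to the situation in this topic, where errors are reported but no retired pages are listed.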