What Do I Do If There Are Retired Pages?
Updated on 2025-07-30 GMT+08:00
Symptom
- The service becomes abnormal after being scheduled to a GPU node, and becomes normal when it is scheduled to another node.
- The GPU memory usage of an ECS suddenly decreases.
Checking Whether There Are Retired Pages
- Run the following command to check whether there are ECC errors in the graphics card:
- In the command output in 1, check the number of ECC errors in the Volatile Uncorr. ECC column. If the number is greater than 0, run the following command to check whether there are retired pages.
nvidia-smi -q -i ${gpu_id} -d PAGE_RETIREMENT
If Pending Page Blacklist is No in the command output, there are no retired pages.
- If the number of ECC errors in the Volatile Uncorr. ECC column of the command output in 1 is 0, run the following command to check whether there are retired pages on any GPU.
nvidia-smi -q -d PAGE_RETIREMENT
- If Pending Page Blacklist is Yes in the command output in 3, there are retired pages. Then, reload the driver to retire the pages.
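The Pending Page Blacklist check in the steps above can be scripted. The following is a minimal sketch, not an official tool: the function name pending_retirement is hypothetical, and it assumes the field is printed as "Pending Page Blacklist : Yes" or "Pending Page Blacklist : No", as described in this section. It reads the command output from standard input, so it does not call nvidia-smi itself.

```shell
#!/bin/sh
# Hypothetical helper (not part of nvidia-smi): reads the output of
# `nvidia-smi -q -d PAGE_RETIREMENT` on stdin and prints "yes" if any GPU
# reports "Pending Page Blacklist : Yes", otherwise prints "no".
# Assumption: the field label and Yes/No values match those named in this section.
pending_retirement() {
    if grep -Eq 'Pending Page Blacklist[[:space:]]*:[[:space:]]*Yes'; then
        echo yes
    else
        echo no
    fi
}

# Usage on a GPU node (requires the NVIDIA driver to be installed):
#   nvidia-smi -q -d PAGE_RETIREMENT | pending_retirement
```

If the helper prints yes, at least one GPU has pages pending retirement and the driver needs to be reloaded, as described in the last step above.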
Solution
- Method 1:
- Method 2: