What Do I Do If There Are Retired Pages?
Symptom
- The service becomes abnormal after being scheduled to a GPU node, and becomes normal when it is scheduled to another node.
- The GPU memory usage of an ECS suddenly decreases.
Checking Whether There Are Retired Pages
1. Run the following command to check whether the GPU has ECC errors:
nvidia-smi
2. In the command output of step 1, check the number of ECC errors in the Volatile Uncorr. ECC column. If the number is greater than 0, run the following command to check whether that GPU has retired pages. Replace ${gpu_id} with the ID of the GPU.
nvidia-smi -q -i ${gpu_id} -d PAGE_RETIREMENT
If Pending Page Blacklist is No in the command output, no pages are pending retirement.
3. If the number of ECC errors in the Volatile Uncorr. ECC column in the command output of step 1 is 0, run the following command to check all GPUs for retired pages:
nvidia-smi -q -d PAGE_RETIREMENT
4. If Pending Page Blacklist is Yes in the command output of step 3, pages are pending retirement. Reload the driver to complete the retirement.
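The check above can also be scripted. The following is a minimal sketch, not an official tool. It assumes that the output of nvidia-smi -q -d PAGE_RETIREMENT starts each per-GPU section with a line of the form "GPU <bus ID>" and reports the Pending Page Blacklist field as described in this section. The function name list_pending_gpus is hypothetical, and field labels may differ between driver versions.

```shell
#!/bin/sh
# Sketch: print the bus ID of every GPU whose "Pending Page Blacklist" is Yes.
# Input: the text output of `nvidia-smi -q -d PAGE_RETIREMENT` on stdin.
# Assumption: each per-GPU section begins with a line "GPU <bus ID>".
list_pending_gpus() {
    awk '
        # Remember the bus ID of the GPU section currently being read.
        $1 == "GPU" && NF == 2 { gpu = $2 }

        # The field label and its value are separated by a colon.
        /Pending Page Blacklist/ {
            n = split($0, parts, ":")
            value = parts[n]
            gsub(/[ \t\r]/, "", value)   # strip surrounding whitespace
            if (value == "Yes") print gpu
        }
    '
}

# Usage on a GPU node (requires the NVIDIA driver):
#   nvidia-smi -q -d PAGE_RETIREMENT | list_pending_gpus
```

If the function prints nothing, no GPU reports pending pages. Any bus ID it prints identifies a GPU that needs a driver reload, as described in step 4.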

Solution
- Method 1:
- Method 2: