Updated on 2025-07-30 GMT+08:00

What Do I Do If There Are Retired Pages?

Symptom

  • The service becomes abnormal after being scheduled to a GPU node, and becomes normal when it is scheduled to another node.
  • The GPU memory usage of an ECS suddenly decreases.

Checking Whether There Are Retired Pages

  1. Run the following command to check whether any GPU has ECC errors:

    nvidia-smi

  2. In the command output from step 1, check the number of ECC errors in the Volatile Uncorr. ECC column. If the number is greater than 0, run the following command, replacing ${gpu_id} with the index of the affected GPU, to check whether that GPU has retired pages:

    nvidia-smi -q -i ${gpu_id} -d PAGE_RETIREMENT

    If Pending Page Blacklist is No in the command output, there are no retired pages.

  3. If the number of ECC errors in the Volatile Uncorr. ECC column in the command output from step 1 is 0, run the following command to check all GPUs for retired pages:

    nvidia-smi -q -d PAGE_RETIREMENT

  4. If Pending Page Blacklist is Yes in the command output from step 2 or step 3, there are pages pending retirement. In this case, reload the driver to retire the pages by using one of the methods in Solution.
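
The checks in steps 1 to 4 can also be run in a single pass. The following is a minimal sketch, not an official tool. It assumes that the installed driver supports the ecc.errors.uncorrected.volatile.total and retired_pages.pending query fields (run nvidia-smi --help-query-gpu to confirm). The function names flag_gpus and check_gpus are hypothetical.

```shell
#!/bin/sh
# Sketch only: report GPUs that have pending page retirement or
# uncorrectable ECC errors.

# Hypothetical helper. Reads CSV rows "index, ecc_count, pending" from stdin
# and prints one line for each GPU that needs attention.
flag_gpus() {
  # nvidia-smi puts a space after each comma, so strip spaces first.
  tr -d ' ' | while IFS=',' read -r idx ecc pending; do
    [ -n "$idx" ] || continue
    if [ "$pending" = "Yes" ]; then
      echo "GPU $idx: pages pending retirement"
      continue
    fi
    case "$ecc" in
      ''|*[!0-9]*) ;;  # not a number, for example when ECC is not supported
      0) ;;            # no uncorrectable errors
      *) echo "GPU $idx: $ecc uncorrectable ECC errors, nothing pending" ;;
    esac
  done
}

# Assumption: both query fields below are available on this driver version.
check_gpus() {
  nvidia-smi \
    --query-gpu=index,ecc.errors.uncorrected.volatile.total,retired_pages.pending \
    --format=csv,noheader | flag_gpus
}
```

Calling check_gpus on the GPU node prints nothing if no GPU needs attention. The parsing is kept in a separate function so that it can be tried on saved command output without access to a GPU.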

Solution

  • Method 1:
    1. Run the following command to check which processes are using the GPU, and then stop all of them:

      nvidia-smi

    2. Run the following command to reset the GPU:

      nvidia-smi -r

    3. Run the following command to check whether there are retired pages:

      nvidia-smi -q -d PAGE_RETIREMENT

      If Pending Page Blacklist is No, there are no retired pages.

  • Method 2:
    1. Run the following command to restart the ECS:

      reboot

    2. Run the following command to check whether there are retired pages:

      nvidia-smi -q -d PAGE_RETIREMENT

      If Pending Page Blacklist is No, there are no retired pages.
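
Method 1 could be scripted for a single GPU roughly as follows. This is a sketch under stated assumptions, not a definitive procedure: gpu_id and the function names are hypothetical, the script must run as root, only compute processes are listed (graphics processes would have to be stopped separately), and the check relies on the Pending Page Blacklist field appearing in the report as "Pending Page Blacklist : Yes" or "Pending Page Blacklist : No".

```shell
#!/bin/sh
# Sketch of Method 1 for one GPU. Run as root.

# Print the PIDs of compute processes on the given GPU, one per line.
list_gpu_pids() {
  nvidia-smi -i "$1" --query-compute-apps=pid --format=csv,noheader | tr -d ' '
}

# Succeed if the PAGE_RETIREMENT report on stdin shows a pending retirement.
has_pending() {
  grep -Eq 'Pending Page Blacklist[[:space:]]*:[[:space:]]*Yes'
}

# Hypothetical wrapper: stop GPU processes, reset the GPU, then check again.
reset_gpu() {
  gpu_id="$1"
  for pid in $(list_gpu_pids "$gpu_id"); do
    kill "$pid"   # graceful stop; escalate manually if a process survives
  done
  nvidia-smi -r -i "$gpu_id" || return 1
  if nvidia-smi -q -i "$gpu_id" -d PAGE_RETIREMENT | has_pending; then
    echo "GPU $gpu_id: pages still pending, restart the ECS (Method 2)"
    return 1
  fi
  echo "GPU $gpu_id: no pending page retirement"
}
```

For example, reset_gpu 0 handles the GPU with index 0. If the reset fails or pages are still pending afterwards, restart the ECS as described in Method 2.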