Updated on 2025-07-30 GMT+08:00

What Do I Do If the nvidia-smi Command Output Shows Overheated GPUs?

Possible Causes

The graphics card is not dissipating heat properly, or its fan is damaged.

Impact

An overheated graphics card can slow down or interrupt the services running on it.

Solution

Run the nvidia-smi command and check the fan status in the output.

  • If the fan speed is 0, the fan may be damaged. Stop the service and migrate it. After the migration is complete, collect fault information as described in Fault Information Collection and contact technical support to check whether the hardware is faulty.

  • If ERR! is displayed in the command output, the graphics card may be overheated. Stop the service, wait until the graphics card cools down, and run the nvidia-smi command again to check whether ERR! is still displayed.
    • If the command output is back to normal, adjust the service so that it limits the maximum power of the graphics card.
    • If ERR! is still displayed, collect fault information as described in Fault Information Collection and contact technical support.
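
The checks above can also be scripted. The following is a minimal sketch, not an official tool. It assumes that nvidia-smi is on the PATH and that the query fields index, fan.speed, and temperature.gpu are available on your driver version. The function names classify and check_gpus and the status labels are hypothetical. In query mode an unreadable sensor may be reported with a different string than the ERR! shown in the default table, so the sketch treats any non-numeric value as unreadable. Graphics cards without an onboard fan may also report a non-numeric fan speed.

```python
import subprocess

# Assumed query: one CSV line per GPU, for example "0, 45, 62".
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,fan.speed,temperature.gpu",
    "--format=csv,noheader,nounits",
]


def classify(line):
    """Classify one CSV line of the form 'index, fan.speed, temperature.gpu'."""
    index, fan, temp = [field.strip() for field in line.split(",")]
    # A non-numeric field means the sensor could not be read
    # (or, for the fan, that the card has no fan of its own).
    if not fan.isdigit() or not temp.isdigit():
        return index, "unreadable"
    # A fan speed of 0 suggests that the fan may be damaged.
    if int(fan) == 0:
        return index, "fan-stopped"
    return index, "ok"


def check_gpus():
    """Run nvidia-smi once and classify every GPU it reports."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return [classify(line) for line in result.stdout.splitlines() if line.strip()]
```

Calling check_gpus() on the affected server returns one (index, status) pair per graphics card. A fan-stopped status corresponds to the first case above, and an unreadable status corresponds to the ERR! case. If the output returns to normal after the card cools down, one way to limit the maximum power is the nvidia-smi -pl option, which takes a value in watts and usually requires administrator privileges. Choose a value within the range supported by your card.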