Updated on 2025-07-30 GMT+08:00

What Do I Do If nvidia-smi Command Output Shows the SRAM ECC Error (V100 GPUs)?

Possible Causes

ECC errors may have occurred in the GPU's on-chip SRAM, such as the register file or the L1 or L2 cache.

Impact

GPU-related applications may be affected.

Solution

Run the nvidia-smi command to view the graphics card information.

  • If Volatile Uncorr. ECC in the command output shows ECC errors, run the nvidia-smi -q -i gpu_id command to view the details of the affected graphics card. Replace gpu_id with the ID of the graphics card that reports the errors.
  • If Volatile Uncorr. ECC in the command output shows no ECC errors, run the nvidia-smi -q command to view the details of all graphics cards.
  • As shown in the figure, if only the Device Memory value under Single Bit increases, in either the Volatile or the Aggregate section, no action is required.
  • If the sum of Register File, L1 Cache, L2 Cache, Texture Memory, Texture Shared, and CBU under Single Bit and Double Bit is greater than 0, in either the Volatile or the Aggregate section, collect fault information by referring to Fault Information Collection and contact technical support.
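
The check described in the last two items can be sketched as a small script. This is only an illustration, not an official tool: the function names (sram_ecc_total, needs_support) and the SAMPLE text are hypothetical, and the parser assumes the indented "ECC Errors" layout that nvidia-smi -q typically prints for V100 GPUs. Verify the layout against the output of your own driver version before relying on it.

```python
# Counters that the procedure above treats as SRAM-related.
SRAM_FIELDS = {
    "Register File", "L1 Cache", "L2 Cache",
    "Texture Memory", "Texture Shared", "CBU",
}

# Hypothetical, shortened excerpt of `nvidia-smi -q` output (assumed layout).
SAMPLE = """\
    ECC Errors
        Volatile
            Single Bit
                Device Memory         : 3
                Register File         : 0
                L2 Cache              : 1
                Total                 : 4
            Double Bit
                Device Memory         : 0
                CBU                   : 0
                Total                 : 0
        Aggregate
            Single Bit
                Device Memory         : 3
                L1 Cache              : 0
                Total                 : 3
            Double Bit
                Texture Memory        : N/A
                CBU                   : 2
                Total                 : 2
    Retired Pages
        Single Bit ECC                : 0
"""


def sram_ecc_total(report: str) -> int:
    """Sum all SRAM-related counters found inside "ECC Errors" sections.

    Counters are non-negative, so this total is greater than 0 exactly when
    the Volatile or the Aggregate sum is greater than 0. If the report covers
    several GPUs, the counters of all of them are added together.
    """
    total = 0
    in_ecc = False
    ecc_indent = 0
    for line in report.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        indent = len(line) - len(line.lstrip())
        if stripped == "ECC Errors":
            in_ecc, ecc_indent = True, indent
            continue
        # A line indented no deeper than the header ends the section.
        if in_ecc and indent <= ecc_indent:
            in_ecc = False
        if in_ecc and ":" in stripped:
            name, _, value = stripped.partition(":")
            name, value = name.strip(), value.strip()
            # Skip non-numeric values such as "N/A".
            if name in SRAM_FIELDS and value.isdigit():
                total += int(value)
    return total


def needs_support(report: str) -> bool:
    """True if any SRAM-related ECC counter is non-zero."""
    return sram_ecc_total(report) > 0


if __name__ == "__main__":
    if needs_support(SAMPLE):
        print("SRAM ECC errors found: collect fault information and contact technical support.")
    else:
        print("No SRAM ECC errors found: no action is required.")
```

To run the check against a real graphics card, the SAMPLE text could be replaced with the captured output of nvidia-smi -q -i gpu_id, for example by redirecting the command output to a file and reading that file in the script.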