GPU Fault Handling

Prerequisites

Cloud Native Logging has been installed in the cluster so that GPU events can be reported to AOM.

GPU Isolation Event

When a GPU malfunctions, the system automatically isolates the faulty GPU. For details, see Table 1.

**Table 1** GPU isolation events

| Event Cause | Error Details | Description | Result |
| --- | --- | --- | --- |
| GPUMemoryError | Device=%s, UUID=%s, SN=%s has failed remapped rows; The device will go unhealthy. | Failed to obtain the number of remapped rows using NVML. | The faulty GPU device is isolated. |
| GPUMemoryError | Device=%s, UUID=%s, SN=%s has more than 60 retired pages caused by both multiple single bit ecc error and double bit ecc error, DBE error number: %d, SBE error number: %d; The device will go unhealthy. | The total number of DBE errors and SBE errors of the GPU device is greater than 60. | The faulty GPU device is isolated. |
| GPUMemoryError | Device=%s, UUID=%s, SN=%s has more than 4 SRAM uncorrectable ecc errors count; The device will go unhealthy. | The number of uncorrectable ECC errors of the GPU device is greater than 4. | The faulty GPU device is isolated. |
| GPUXidError | Failed to determine gpu device uuid for Xid=%d; Marking all devices as unhealthy. | Failed to obtain the UUID using NVML. | All GPU devices on the faulty GPU node are isolated. |
| GPUXidError | Xid=%d on Device=%s, UUID=%s, SN=%s, the device will go unhealthy. | A GPU Xid error occurred. The affected Xids are 74 and 79. | The faulty GPU device is isolated. |
| GPUHealthWarning | Device=%s, UUID=%s, SN=%s failed to get fan state. | The fan on the GPU device is not running properly. | The affected GPU device is not isolated. |
| GPUHealthWarning | Device=%s, UUID=%s, SN=%s failed to get power state. | Failed to obtain the power state of the GPU device. | The affected GPU device is not isolated. |
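The isolation conditions in Table 1 can be summarized as a small decision helper. The following is a minimal sketch and not the device plugin's actual logic: the function name `isolation_result` and its arguments are hypothetical, and it encodes only the numeric conditions listed in the table (more than 60 DBE plus SBE errors, more than 4 SRAM uncorrectable ECC errors, or Xid 74 or 79).

```shell
# Hypothetical helper that mirrors the numeric isolation conditions in Table 1.
# Usage: isolation_result DBE_COUNT SBE_COUNT SRAM_UNCORRECTABLE [XID]
isolation_result() {
    dbe=$1
    sbe=$2
    sram_uncorrectable=$3
    xid=${4:-0}   # 0 means "no Xid reported" in this sketch

    if [ $((dbe + sbe)) -gt 60 ]; then
        echo "isolated"            # too many retired pages (DBE + SBE)
    elif [ "$sram_uncorrectable" -gt 4 ]; then
        echo "isolated"            # too many SRAM uncorrectable ECC errors
    elif [ "$xid" -eq 74 ] || [ "$xid" -eq 79 ]; then
        echo "isolated"            # Xids that Table 1 lists as affected
    else
        echo "not isolated"
    fi
}
```

Both thresholds are strict: a combined DBE and SBE count of exactly 60, or exactly 4 SRAM uncorrectable errors, does not meet the condition in this sketch.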

Fault Locating

Failed to obtain the number of remapped rows using NVML.
The GPU driver or GPU device is faulty. Contact customer service based on the type of the node (ECS) where the GPU device resides.

The total number of DBE errors and SBE errors of the GPU device is too high.
The GPU driver or GPU device is faulty. Contact customer service based on the type of the node (ECS) where the GPU device resides.

The GPU device has uncorrectable ECC errors.
1. Log in to the node where the GPU isolation event occurred.
2. Go to the /usr/local/nvidia/bin directory and run the nvidia-smi -q command.
  If the nvidia-smi command is unavailable or fails to run, the GPU driver may not be installed. Reinstall the GPU driver and try again.
3. Check the ECC Errors section in the command output.
  - Correctable Error: Such an error will not interrupt services or trigger GPU isolation.
  - Uncorrectable Error: Such an error will interrupt services and trigger GPU isolation.
4. If there are uncorrectable errors, perform the following operations to rectify the fault:
  1. Configure taints on the target node to evict the existing workloads from the node.
  2. Restart the target node.
  3. If the fault persists, collect the output of the nvidia-smi -q command and contact customer service based on the type of the node (ECS) where the GPU device resides.
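The ECC check above can also be scripted. The sketch below makes several assumptions: it uses the nvidia-smi query field `ecc.errors.uncorrected.volatile.total` (a CSV view of the counters, rather than the full `-q` report named in the steps), and the function name and the taint key `gpu-fault` are hypothetical. GPUs that report a non-numeric value such as `[N/A]` are skipped.

```shell
# Print the index of every GPU whose volatile uncorrectable ECC count is
# greater than zero.
# Input (stdin): CSV rows in the form "index, uncorrectable_count".
gpus_with_uncorrectable_ecc() {
    awk -F', *' '$2 ~ /^[0-9]+$/ && $2 + 0 > 0 { print $1 }'
}

# Example usage on the affected node (not executed here):
#   cd /usr/local/nvidia/bin
#   ./nvidia-smi --query-gpu=index,ecc.errors.uncorrected.volatile.total \
#       --format=csv,noheader | gpus_with_uncorrectable_ecc
#
# If any index is printed, taint the node so that existing workloads are
# evicted. The key "gpu-fault" is a made-up example; use your own key.
#   kubectl taint nodes <node-name> gpu-fault=true:NoExecute
```

Volatile counters are used here because they reset when the driver is reloaded, so a non-zero value after the restart suggests that the fault persists.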
Failed to obtain the UUID using NVML.
1. Log in to the node where the GPU isolation event occurred.
2. Go to the /usr/local/nvidia/bin directory.
3. Run the nvidia-smi command and check the device ID (shown in the Bus-Id column) in the command output, for example, 00:0D.0.
  If the nvidia-smi command is unavailable or fails to run, the GPU driver may not be installed. Reinstall the GPU driver and try again.
4. Run the lspci | grep NVIDIA command and check the device ID in the command output.
5. Compare the two results. If they do not match, contact customer service based on the type of the node (ECS) where the GPU device resides.
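The two commands can print the same PCI address in different forms (with or without a domain prefix, and in different letter case), so a direct string comparison may report a false mismatch. The following sketch normalizes both forms before comparing. The function name `normalize_bdf` and the temporary file names are hypothetical, and the `pci.bus_id` query field is used here in place of reading the table printed by plain nvidia-smi.

```shell
# Reduce a PCI address to lowercase "bus:device.function" and drop any domain
# prefix, so that addresses printed with and without a domain compare equal.
normalize_bdf() {
    printf '%s\n' "$1" |
        tr '[:upper:]' '[:lower:]' |
        sed -E 's/^[0-9a-f]+:([0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f])$/\1/'
}

# Example usage on the affected node (not executed here):
#   /usr/local/nvidia/bin/nvidia-smi --query-gpu=pci.bus_id --format=csv,noheader |
#       while read -r id; do normalize_bdf "$id"; done | sort > /tmp/smi_ids
#   lspci | grep -i NVIDIA | awk '{ print $1 }' |
#       while read -r id; do normalize_bdf "$id"; done | sort > /tmp/pci_ids
#   diff /tmp/smi_ids /tmp/pci_ids && echo "device IDs match"
```

lspci may also list NVIDIA functions that are not GPUs, so review any reported difference manually before concluding that the IDs do not match.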
An Xid error occurred on the GPU device.
1. Log in to the node where the GPU isolation event occurred.
2. Run the dmesg -T | grep -i NVRM command and check the command output.
3. If a message in the "Xid(PCI:0000:00:0x): xx" format is displayed, record the Xid error code and identify its cause based on NVIDIA Xid Errors. Then provide the error message and the identified cause to customer service based on the type of the node (ECS) where the GPU device resides.
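To collect the error codes from a long kernel log, the Xid messages can be reduced to one "PCI address, Xid code" pair per line. This is a minimal sketch that assumes the messages follow the "Xid(PCI:0000:00:0x): xx" format quoted above, with or without a space after "Xid"; the function name `extract_xids` is hypothetical.

```shell
# Read kernel log lines on stdin and print "PCI_ADDRESS XID_CODE" for every
# line that contains an Xid message. Lines without an Xid are dropped.
extract_xids() {
    sed -nE 's/.*Xid ?\(PCI:([^)]*)\): *([0-9]+).*/\1 \2/p'
}

# Example usage on the affected node (not executed here):
#   dmesg -T | grep -i NVRM | extract_xids | sort | uniq -c
```

The codes in the second column can then be looked up in NVIDIA Xid Errors. According to Table 1, Xids 74 and 79 are the ones that cause the GPU device to be isolated.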
The available memory of xGPU devices is far less than the physical GPU memory.
1. Log in to the xGPU node.
2. Run the /usr/local/nvidia/bin/nvidia-smi command to obtain the physical GPU memory of the target GPU and record its serial number.
3. Run the cat /proc/xgpu/{GPU serial number}/meminfo command to obtain the available xGPU memory. Replace {GPU serial number} with the serial number obtained in the preceding step.
4. Compare the physical GPU memory with the available xGPU memory.
  
  The GPU vendor's driver occupies some physical GPU memory, typically about 300 MiB. This is normal. For example, on Tesla T4 GPUs with NVIDIA driver 510.47.03, the driver occupies 280 MiB of GPU memory by default. The value varies with the driver version. For example, the 535 series driver occupies more memory than the 470 series driver.
  
  If the available xGPU memory is far less than the physical GPU memory, some containers that were not provisioned using GPU virtualization are occupying GPU memory.
5. In this case, remove the GPU workloads from the target node through the CCE console or by using kubectl.
6. Run the rmmod xgpu_km command to unload the GPU virtualization kernel module.
7. Delete the nvidia-gpu-device-plugin pods on the target node through the CCE console or by using kubectl.
8. After the nvidia-gpu-device-plugin pods are rebuilt, perform steps 2 and 3 again to verify the result.
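The comparison in step 4 is simple arithmetic and can be sketched as follows. The function name `check_xgpu_gap` is hypothetical, and the default allowance of 300 MiB is an assumption taken from the approximate driver overhead mentioned above; pass a larger third argument for driver versions that occupy more memory. Both memory values are passed in by hand, in MiB.

```shell
# Compare physical GPU memory with available xGPU memory (both in MiB).
# Usage: check_xgpu_gap PHYSICAL_MIB AVAILABLE_MIB [ALLOWANCE_MIB]
check_xgpu_gap() {
    physical_mib=$1
    available_mib=$2
    allowance_mib=${3:-300}   # assumed driver overhead; adjust per driver version

    gap=$((physical_mib - available_mib))
    if [ "$gap" -le "$allowance_mib" ]; then
        echo "ok: gap ${gap} MiB"
    else
        echo "investigate: gap ${gap} MiB exceeds ${allowance_mib} MiB"
    fi
}

# Example usage with illustrative numbers (not measured values):
#   check_xgpu_gap 15360 15080
#
# The serial number and physical memory can be listed together with:
#   /usr/local/nvidia/bin/nvidia-smi --query-gpu=serial,memory.total \
#       --format=csv,noheader,nounits
```

An "investigate" result only indicates that the gap is larger than the allowance; what counts as "far less" still depends on the GPU model and driver version.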