Updated on 2025-09-05 GMT+08:00

GPU Fault Handling

In a Kubernetes environment, managing GPU resources is complex, and diagnosing and recovering from faults can be challenging and costly. When a GPU becomes faulty, the CCE cluster can quickly report an event and isolate the faulty GPU. This isolation ensures that other functional GPUs continue to provide services with minimal disruption. This section describes common GPU events, isolation outcomes, and solutions to help you respond to faults quickly, minimize downtime, and maintain service continuity and high performance.

Prerequisites

  • The cluster version must be v1.27 or later.
  • Cloud Native Log Collection has been installed in the cluster so that GPU events can be synchronously reported to AOM.

CCE AI Suite (NVIDIA GPU) Exception Event Reporting and Isolation

When a GPU malfunctions, CCE AI Suite (NVIDIA GPU) reports an exception event and CCE isolates the affected GPU device based on the event. Kubernetes then cannot allocate the isolated GPU, making the GPU resource temporarily unavailable. After the fault is rectified, the GPU resumes normal operation. Table 1 lists CCE AI Suite (NVIDIA GPU) exception events and isolation results.
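
You can verify whether a GPU has been isolated and review the reported events with kubectl. The following is a minimal check, assuming the add-on registers GPUs as the nvidia.com/gpu extended resource and that the event reasons match the Event Cause column in Table 1; <node-name> is a placeholder for the affected node:

  # Compare the node's total and schedulable GPU counts; an isolated GPU
  # reduces Allocatable below Capacity for nvidia.com/gpu.
  kubectl describe node <node-name> | grep -A 10 -E "Capacity|Allocatable"

  # List GPU exception events reported for the node.
  kubectl get events --all-namespaces --field-selector involvedObject.name=<node-name> | grep -E "GPUMemoryError|GPUXidError|GPUHealthWarning|GPUNvmlError"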

Table 1 CCE AI Suite (NVIDIA GPU) exception events

Event Cause: GPUMemoryError
Error Details: GPUServerId=xxx; MemoryErrorType=FailedRemappedRows; Device=xxx; UUID=xxx; SN=xxx; Message=The device has failed remapped rows, it will go unhealthy.
Description: Failed to obtain the number of remapped rows in NVML: Row remapping is used to handle memory faults. If row remapping fails, GPU memory access will be affected.
Result: The faulty GPU device is isolated.
Solution: See "Failed to obtain the number of remapped rows in NVML" in Fault Locating.

Event Cause: GPUMemoryError
Error Details: GPUServerId=xxx; MemoryErrorType=OverThresholdRetiredPages; Device=xxx; UUID=xxx; SN=xxx; Message=The device has more than 60 retired pages caused by both multiple single bit ecc error and double bit ecc error, DBE error number: xxx, SBE error number: xxx, it will go unhealthy.
Description: High DBEs and SBEs in a GPU: When the total number of double bit errors (DBEs) and single bit errors (SBEs) in a GPU exceeds the preset threshold (60), CCE determines that a memory fault has occurred on the device.
Result: The faulty GPU device is isolated.
Solution: See "The total number of DBEs and SBEs of a GPU is high" in Fault Locating.

Event Cause: GPUMemoryError
Error Details: GPUServerId=xxx; MemoryErrorType=OverLimitSramEccErrors; Device=xxx; UUID=xxx; SN=xxx; Message=The device has more than xxx SRAM uncorrectable ecc errors count, it will go unhealthy.
Description: Uncorrectable ECC errors in a GPU: When uncorrectable ECC errors occur in SRAM and the number of errors exceeds four, the GPU is considered to be abnormal.
Result: The faulty GPU device is isolated.
Solution: See "Uncorrectable ECC errors have been detected in a GPU" in Fault Locating.

Event Cause: GPUMemoryError
Error Details: GPUServerId=xxx; MemoryErrorType=PendingRetiredPages; Device=xxx; UUID=xxx; SN=xxx; Message=The device has pending retired pages.
Description: Pages to be isolated in a GPU: CCE automatically checks for pages that need to be isolated in a GPU. If such pages are found, memory inconsistencies or errors may occur.
Result: The affected GPU device is not isolated.
Solution: See "There are pages to be isolated in a GPU" in Fault Locating.

Event Cause: GPUMemoryError
Error Details: GPUServerId=xxx; MemoryErrorType=InfoROMCorrupted; Device=xxx; UUID=xxx; SN=xxx; Message=InfoROM is corrupted.
Description: Damaged GPU InfoROM: The InfoROM is a storage area on the GPU that holds critical configuration and status information. If the InfoROM is damaged, the GPU may not function correctly.
Result: The affected GPU device is not isolated.
Solution: See "GPU InfoROM is damaged" in Fault Locating.

Event Cause: GPUXidError
Error Details: GPUServerId=xxx; XidErrorType=GetUuidError; Xid=xxx; Message=Failed to determine gpu device uuid for Xid xxx, all devices will go unhealthy.
Description: Failed to obtain a GPU UUID from NVML: When parsing an Xid event reported by the driver, the NVIDIA Management Library (NVML) cannot retrieve the UUID of the affected GPU. This issue is not caused by an Xid error but may be due to driver, hardware, or permission problems.
Result: All GPU devices on the faulty node are isolated.
Solution: See "Failed to obtain the UUID using NVML" in Fault Locating.

Event Cause: GPUXidError
Error Details: GPUServerId=xxx; XidErrorType=FatalXidError; Xid=xxx; Device=xxx; UUID=xxx; SN=xxx; Message=The device will go unhealthy.
Description: Critical Xid error in a GPU: A critical Xid error, such as Xid 74 or 79, indicates a severe hardware issue that requires fault isolation.
Result: The faulty GPU device is isolated.
Solution: See "An Xid error has occurred in a GPU" in Fault Locating.

Event Cause: GPUXidError
Error Details: GPUServerId=xxx; XidErrorType=ApplicationXidError; Xid=xxx; Message=Event could be caused by an application error, not a device error.
Description: Xid error caused by applications on a GPU: Such Xid errors (including Xid 13, 31, 43, 45, 68, and 137) are not caused by the GPU itself, so the GPU does not need to be isolated.
Result: The affected GPU device is not isolated.
Solution: See "An Xid error has occurred in a GPU" in Fault Locating.

Event Cause: GPUXidError
Error Details: GPUServerId=xxx; XidErrorType=OtherXidError; Xid=xxx; Device=xxx; UUID=xxx; SN=xxx; Message=The device may be unhealthy.
Description: Other common Xid errors in a GPU: Such Xid errors may be caused by issues with the GPU device or driver, or they may indicate error correction information.
Result: The affected GPU device is not isolated.
Solution: See "An Xid error has occurred in a GPU" in Fault Locating.

Event Cause: GPUHealthWarning
Error Details: GPUServerId=xxx; HealthWarningType=GetFanStateError; Device=xxx; UUID=xxx; SN=xxx; Message=The device failed to get fan state.
Description: Malfunctioning GPU fan: The fan state of the GPU cannot be obtained. The fan may be faulty.
Result: The affected GPU device is not isolated.
Solution: None

Event Cause: GPUHealthWarning
Error Details: GPUServerId=xxx; HealthWarningType=GetPowerStateError; Device=xxx; UUID=xxx; SN=xxx; Message=The device failed to get power state.
Description: Malfunctioning GPU power supply: The power state of the GPU cannot be obtained. The power supply may be faulty.
Result: The affected GPU device is not isolated.
Solution: None

Event Cause: GPUNvmlError
Error Details: GPUServerId=xxx; NvmlErrorType=GetDeviceHandleError; Device=xxx; UUID=xxx; SN=xxx; Message=The device cannot be reached through NVML, it may be unhealthy.
Description: GPU failed to communicate with NVML: CCE cannot access the specified GPU through NVML. This issue typically indicates that the GPU device is unhealthy.
Result: The affected GPU device is not isolated.
Solution: See "GPU failed to communicate with NVML" in Fault Locating.

Fault Locating

  • Failed to obtain the number of remapped rows in NVML.

    The GPU driver or GPU device malfunctions. Contact customer service based on the type of the node (ECS or BMS) where the GPU device resides.

  • The total number of DBEs and SBEs of a GPU is high.

    The GPU driver or GPU device malfunctions. Contact customer service based on the type of the node (ECS or BMS) where the GPU device resides.

  • Uncorrectable ECC errors have been detected in a GPU.
    1. Log in to the node where the GPU isolation event occurred.
    2. Go to the /usr/local/nvidia/bin directory and run the nvidia-smi -q command.

      If the nvidia-smi command is unavailable or fails to be executed, the GPU driver may not be installed correctly. Reinstall the GPU driver and try again.

    3. Check the ECC ERROR in the command output.
      • Correctable Error: Such an error will not interrupt services or trigger GPU isolation.
      • Uncorrectable Error: Such an error will interrupt services and trigger GPU isolation.
    4. If there are uncorrectable errors, perform the following operations to rectify the fault (example commands are provided after this list):
      1. Configure taints on the target node to evict the existing service load from the node.
      2. Restart the target node.
      3. If the fault persists, collect the output of the nvidia-smi -q command and contact customer service based on the type of the node (ECS or BMS) where the GPU device resides.
  • There are pages to be isolated in a GPU.

    The GPU driver or GPU device may be malfunctioning. Check as follows (an example query is provided after this list):

    1. Log in to the node where the GPU isolation event occurred.
    2. Go to the /usr/local/nvidia/bin directory and run the nvidia-smi -i <target gpu> -q -d PAGE_RETIREMENT command.

      If the nvidia-smi command is unavailable or fails to be executed, the GPU driver may not be installed correctly. Reinstall the GPU driver and try again.

    3. Check the Pending Page Blacklist field in the command output. If it indicates pending pages (for example, the value is Yes), there are pages to be isolated. Try the following methods to resolve this issue:
      1. Configure taints on the target node to evict the existing service load from the node.
      2. Reconnect to the GPU and start a new program on the GPU.

        If the reconnected GPU does not work, reset it and restart the node. If the fault persists, contact customer service based on the type of the node (ECS or BMS) where the GPU device resides.

  • GPU InfoROM is damaged.

    The GPU device malfunctions. Contact customer service based on the type of the node (ECS or BMS) where the GPU device resides.

  • Failed to obtain the UUID using NVML.
    1. Log in to the node where the GPU isolation event occurred.
    2. Access /usr/local/nvidia/bin.
    3. Run the nvidia-smi command and check the device ID in the command output, for example, 00:0D.0.

      If the nvidia-smi command is unavailable or fails to be executed, the GPU driver may not be installed correctly. Reinstall the GPU driver and try again.

    4. Run the lspci | grep NVIDIA command and check the device ID in the command output.
    5. Compare the two results (example commands are provided after this list). If they do not match, contact customer service based on the type of the node (ECS or BMS) where the GPU device is located.
  • An Xid error has occurred in a GPU.

    Select a troubleshooting method based on the Xid error code. You can obtain the code from the NVIDIA driver messages in the kernel log, as described in Common GPU Xid Errors.

    • Critical Xid errors: The GPU device malfunctions. Contact customer service based on the type of the node (ECS or BMS) where the GPU device resides.
    • Application-caused Xid errors: Identify possible fault causes based on Table 2 and verify the functionality of related applications. If the fault persists, restart the application to restore the GPU.
    • Other common Xid errors: Identify possible fault causes based on Table 3 and try to recover the GPU accordingly. If the fault persists, contact customer service based on the type of the node (ECS or BMS) where the GPU device resides.
  • GPU failed to communicate with NVML.

    If a GPU fails to communicate with NVML, possible causes are as follows:

    • The GPU UUID does not match the valid GPU in the system.
    • The GPU device is not properly powered, which prevents it from functioning correctly.
    • The GPU device goes offline unexpectedly. As a result, it is inaccessible.
    • The running GPU device is interrupted, which may be caused by a hardware fault or driver issue.
    • The driver detects an unknown hardware or software error that cannot be classified.

    Contact customer service based on the type of the node (ECS or BMS) where the GPU device resides.

  • Other common error: The available memory of virtual GPU devices is far less than the physical GPU memory.

    Compare the physical GPU memory with the available virtualized GPU memory to check whether containers that were not provisioned using GPU virtualization are using the GPU memory. To do so, perform the following operations (example commands are provided after this list):

    1. Log in to the virtual GPU node.
    2. Run the /usr/local/nvidia/bin/nvidia-smi command to obtain the physical GPU memory of the target GPU and record its serial number.
    3. Run the cat /proc/xgpu/{GPU serial number}/meminfo command to obtain the available virtualized GPU memory. Replace {GPU serial number} with the one obtained in 2.
    4. Compare the obtained GPU memory.

      The GPU vendor's driver occupies a certain amount of physical GPU memory, typically about 300 MiB. This is normal. For example, if Tesla T4 GPUs run with NVIDIA driver 510.47.03, the driver occupies 280 MiB of GPU memory by default. The exact value varies with the driver version; the 535 series driver, for instance, occupies more memory than the 470 series driver.

      If the available virtualized GPU memory is far less than the physical GPU memory, some containers that are not provisioned using GPU virtualization occupy the GPU memory.

    5. Evict GPU workloads from the target node using the CCE console or kubectl command.
    6. Run the rmmod xgpu_km command to remove the GPU virtualization kernel module.
    7. Delete nvidia-gpu-device-plugin pods from the target node using the CCE console or kubectl command.
    8. After the nvidia-gpu-device-plugin pods are rebuilt, perform steps 2 and 3 again to verify the result.
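
Example commands for the ECC check described in "Uncorrectable ECC errors have been detected in a GPU". The node name and taint key are placeholders; adjust them to your environment:

  # Query the ECC error counters of all GPUs on the node.
  /usr/local/nvidia/bin/nvidia-smi -q -d ECC

  # If uncorrectable errors are reported, taint and drain the node so that
  # service workloads are evicted before the restart.
  kubectl taint nodes <node-name> gpu-fault=ecc:NoSchedule
  kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data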
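
Example of the page retirement query referenced in "There are pages to be isolated in a GPU", assuming GPU 0 is the device to check:

  # Show retired pages and whether any page retirements are still pending.
  /usr/local/nvidia/bin/nvidia-smi -i 0 -q -d PAGE_RETIREMENT

  # A value of Yes in the Pending Page Blacklist field means that pages are
  # still waiting to be isolated.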
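
Example commands for the device comparison referenced in "Failed to obtain the UUID using NVML". The two outputs should list the same PCI device IDs:

  # PCI bus ID and UUID of each GPU as reported by the driver through NVML.
  /usr/local/nvidia/bin/nvidia-smi --query-gpu=index,pci.bus_id,uuid --format=csv,noheader

  # NVIDIA devices visible to the operating system on the PCI bus.
  lspci | grep -i nvidia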
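
Example commands for the virtual GPU memory comparison described in the last item above. The GPU serial number (0), node name, and device plugin pod name are placeholders, and the kube-system namespace is an assumption; adjust them to your environment:

  # Physical memory of GPU 0 as reported by the driver.
  /usr/local/nvidia/bin/nvidia-smi -i 0 --query-gpu=memory.total --format=csv

  # Memory available to virtualized GPU containers on the same GPU.
  cat /proc/xgpu/0/meminfo

  # If the gap is far larger than the driver overhead (about 300 MiB), evict the
  # GPU workloads, remove the virtualization module, and rebuild the device
  # plugin pod.
  kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
  rmmod xgpu_km
  kubectl delete pod -n kube-system <nvidia-gpu-device-plugin-pod>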

Common GPU Xid Errors

Xid errors are reported by the NVIDIA driver. By capturing and analyzing these error codes, you can accurately identify and resolve GPU hardware or driver issues. The following describes common Xid errors. For details, see NVIDIA Xid Errors.
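
On the affected node, the Xid code can usually be captured from the kernel log written by the NVIDIA driver, for example:

  # Search the kernel ring buffer for Xid messages reported by the NVIDIA driver.
  dmesg -T | grep -i "NVRM: Xid"

  # On nodes that use systemd-journald, the kernel log can also be queried with:
  journalctl -k | grep -i "NVRM: Xid"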

  • Xid errors caused by user programs: These errors may be caused by program faults or resource management issues. For details, see Table 2.
    Table 2 Xid errors (user programs)

    Xid 13: The GR: SW Notify Error message indicates a GPU engine error, typically caused by an out-of-bounds access (for example, exceeding array limits) but could also stem from illegal instructions, registers, or rare hardware faults.

    Xid 31: The FIFO: MMU Error message indicates that the GPU encountered an error while processing memory access or command queues. This is typically due to unauthorized application access to memory but may also result from rare driver or hardware faults.

    Xid 43: The Reset Channel Verif Error message may be caused by a user application error.

    Xid 45: The OS: Preemptive Channel Removal message indicates that a task running on the GPU has been forcibly terminated. This typically occurs when the user program is terminated, the GPU is reset, or a system signal (such as Control-C or SIGKILL) is received.

    Xid 68: The NVDEC0 Exception message indicates that an error occurred during video decoding, which may be caused by hardware, driver, or user program issues.

    Xid 137: The NVLink FLA privilege error message indicates that a fault has been reported by the remote MMU, typically due to an application-level bug. However, it can also be caused by driver or hardware issues.

  • Other Xid errors: These errors are typically caused by GPU hardware, driver, or system configuration faults. For details, see Table 3.
    Table 3 Other Xid errors

    Xid 32: The PBDMA Error message primarily indicates issues related to the quality of the PCI-E bus, which can affect the proper functioning of the GPU device.

    Xid 48: The DBE (Double Bit Error) ECC Error message indicates that an uncorrectable ECC error has been detected on the GPU. To handle this error, reset the GPU.

    Xid 63: The ECC Page Retirement or Row Remapping message indicates that NVIDIA's self-correction mechanism has detected a GPU memory hardware error and has either retired the faulty memory page or remapped it to prevent further issues. NVIDIA also records the retirement and remapping information in the InfoROM, so no error is reported if the same problem occurs again. Typically, this error does not affect normal functioning of the GPU.

    Xid 64: The triggering scenario of this Xid is similar to that of Xid 63. Xid 63 indicates that the retirement and remapping information has been recorded in the InfoROM, so no similar error is reported for the same memory area. Xid 64 indicates a failure in recording ECC page retirement or row remapping information to the InfoROM; if the same issue occurs again later, the error is reported again because the information was not recorded properly. Typically, this error does not affect normal functioning of the GPU.

    Xid 74: The NVLink Error message indicates a connectivity issue between GPUs or between GPUs and an NVSwitch over NVLink. This error can significantly impact the normal functioning of the GPU.

    Xid 79: The GPU has fallen off the bus message indicates that the GPU is no longer accessible over its PCI Express connection. This is typically due to a hardware fault.

    Xid 93: The Non-fatal violation of provisioned InfoROM wear limit message indicates that the number of write operations to the GPU's InfoROM (a persistent storage area that holds device configuration and status information) is approaching or has exceeded a predefined wear limit. Typically, this error does not affect normal functioning of the GPU.

    Xid 94: The Contained ECC error occurred message indicates that the GPU has contained uncorrectable ECC errors to prevent them from spreading to the entire system. Typically, this error does not affect normal functioning of the GPU.

    Xid 95: The Uncontained ECC error occurred message indicates that the GPU failed to contain uncorrectable ECC errors. As a result, all applications running on the GPU are affected.

    Xid 110: The Security fault error message indicates a hardware fault.

    Xid 119: The GSP RPC Timeout message indicates that the GPU System Processor (GSP) timed out. If the fault persists, restart the GPU or node.

    Xid 120: The GSP Error message indicates that an error has occurred in the GPU System Processor (GSP). If the fault persists, restart the GPU or node.

    Xid 121: The C2C Link corrected error message indicates that an error occurred on the chip-to-chip (C2C) NVLink connection, but the error has been corrected. Restart the GPU at your earliest convenience to ensure a long-term stable connection.

    Xid 140: The ECC unrecovered error message indicates that the GPU driver has detected an uncorrectable error in the GPU memory. This type of error typically requires a GPU reset.