Pod Eviction Upon a GPU Fault
A GPU fault occurs when a GPU cannot reliably deliver computing services due to issues such as hardware damage, driver anomalies, communication failures, or environmental factors. These issues can lead to pod startup failures, interrupted tasks, and wasted resources, particularly in AI and high-performance computing workloads. To mitigate these risks, CCE provides a pod eviction strategy for GPU faults. With this strategy, when CCE AI Suite (NVIDIA GPU) detects a qualifying GPU fault, it evicts pods that carry a specific label. You can then decide whether to recreate or reschedule the evicted pods, which ensures service continuity and improves system stability.
Prerequisites
- A CCE standard or Turbo cluster of v1.27 or later is available.
- An NVIDIA GPU node is running properly in the cluster. (A quick CLI check is sketched after this list.)
- The CCE AI Suite (NVIDIA GPU) add-on has been installed in the cluster. The add-on version must meet the following requirements:
- If the cluster version is v1.27, the add-on version must be 2.2.1 or later.
- If the cluster version is v1.28 or later, the add-on version must be 2.8.1 or later.
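To confirm that the GPU node is ready and exposes GPU resources, you can optionally run a quick check from the CLI. This is a minimal sketch; it assumes kubectl access is already configured and that the node advertises the nvidia.com/gpu resource used in the workload example later on this page.

```bash
# List nodes and confirm the GPU node is Ready.
kubectl get nodes

# Check that the node advertises the nvidia.com/gpu resource.
# Replace <node-name> with the name of your GPU node.
kubectl describe node <node-name> | grep -i "nvidia.com/gpu"
```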
Notes and Constraints
- After pod eviction upon a GPU fault is enabled, eviction is triggered only when the CCE AI Suite (NVIDIA GPU) add-on detects a specific type of GPU fault. CCE's response to such faults depends on the isolation policy associated with each fault type, which directly influences how pods are rescheduled. For details about the GPU fault types that support automatic pod eviction and the isolation policies for each fault type, see Table 1.
- If the detected fault leads to GPU isolation, the faulty GPU (or the entire GPU node, depending on the isolation policy) is marked as unavailable. In this case, the evicted pods will not be rescheduled to the same GPU until it recovers, ensuring service continuity.
- If the detected fault does not lead to GPU isolation, the evicted pods may be reassigned to the same GPU in subsequent scheduling cycles.
- Pod eviction upon a GPU fault applies only to pods that use an entire GPU card or a shared GPU. It does not apply to pods that use GPU virtualization.
Process
| Procedure | Description |
|---|---|
| Step 1: Enable Pod Eviction Upon a GPU Fault | Enable pod eviction upon a GPU fault in the device-plugin component of the CCE AI Suite (NVIDIA GPU) add-on. |
| Step 2: Deploy a Workload with the Specified Label | After pod eviction upon a GPU fault is enabled, CCE automatically evicts affected pods that carry the designated label when a qualifying GPU fault is detected. Add the required label to the pod specification during workload deployment so that CCE can identify eligible pods and trigger eviction when a fault occurs. |
| Step 3: Check Whether the Pod Was Evicted | If a GPU fault occurs, check the events of the node hosting the faulty GPU to confirm whether the affected pod was evicted. |
| Other Operations (optional) | Configure which Xid errors are treated as critical and which are application-induced based on your service requirements. The two lists must not share any values. A critical Xid error causes CCE to isolate the affected GPU; an application-induced Xid error does not. By default, critical Xid errors are 74 and 79, and application-induced Xid errors are 13, 31, 43, 45, 68, and 137. |
Step 1: Enable Pod Eviction Upon a GPU Fault
Enable pod eviction upon a GPU fault in the device-plugin component of the CCE AI Suite (NVIDIA GPU) add-on.
- Log in to the CCE console and click the cluster name to access the cluster console. The Overview page is displayed.
- In the navigation pane, choose Add-ons. In the right pane, find the CCE AI Suite (NVIDIA GPU) add-on and click Edit.
- In the window that slides out from the right, click Edit YAML. Set enable_pod_eviction_on_gpu_error to true to enable pod eviction upon a GPU fault.
```yaml
...
custom:
  annotations: {}
  compatible_with_legacy_api: false
  component_schedulername: kube-scheduler
  disable_mount_path_v1: false
  disable_nvidia_gsp: true
  driver_mount_paths: bin,lib64
  enable_fault_isolation: true
  enable_health_monitoring: true
  enable_metrics_monitoring: true
  enable_simple_lib64_mount: true
  enable_xgpu: false
  enable_xgpu_burst: false
  gpu_driver_config: {}
  health_check_xids_v2: 74,79    # The values must be different from those designated as application-induced Xid errors.
  install_nvidia_peermem: false
  is_driver_from_nvidia: true
  enable_pod_eviction_on_gpu_error: true
...
```
- enable_pod_eviction_on_gpu_error controls whether pod eviction upon a GPU fault is enabled. The options are as follows:
  - true: Pod eviction upon a GPU fault is enabled.
  - false: (Default) Pod eviction upon a GPU fault is disabled.
- After completing the configuration, click OK in the lower right corner of the page. CCE AI Suite (NVIDIA GPU) is then automatically upgraded. Once the add-on status changes to Running, pod eviction upon a GPU fault takes effect.
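Optionally, you can verify from the CLI that the device-plugin pods restarted after the add-on upgrade and therefore picked up the new configuration. This is a sketch only; it assumes the device-plugin pods run in the kube-system namespace and contain "nvidia" in their names, which may differ depending on the add-on version.

```bash
# List the GPU device-plugin pods and check their age (a recent age indicates a restart).
# Assumption: the pods run in kube-system and include "nvidia" in their names.
kubectl get pods -n kube-system -o wide | grep -i nvidia
```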
Step 2: Deploy a Workload with the Specified Label
After pod eviction upon a GPU fault is enabled, CCE automatically evicts affected pods that carry the designated label when a qualifying GPU fault is detected. Add the required label to the pod specification during workload deployment so that CCE can identify eligible pods and trigger eviction when a fault occurs.
- Log in to the CCE console and click the cluster name to access the cluster console. In the navigation pane, choose Workloads. In the upper right corner of the displayed page, click Create Workload.
- In the Container Settings area, click Basic Info, set GPU Quota to GPU card or Shared, and configure the related parameters.
- GPU card: An entire physical GPU will be exclusively allocated to the target pod.
- Shared: Multiple pods can share a physical GPU, including its compute and memory resources.
Figure 1 Using a GPU card
- In the Advanced Settings area, click Labels and Annotations and add the evict_on_gpu_error=true label. When the add-on detects a GPU fault, it evicts the pods with this label from the faulty GPU.
Figure 2 Adding the required label
- Configure other parameters by referring to Creating a Workload. Then, click Create Workload in the lower right corner. When the workload status changes to Running, the workload has been created.
A job is used as an example below. For other workload types, add the same label in the same field.
- Log in to the client and access the cluster using kubectl (Accessing a Cluster Using kubectl).
- Create a YAML file for creating a job with the specified label.
```bash
vim k8-job.yaml
```
Add the evict_on_gpu_error: "true" label to the workload. The file content is as follows:
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: k8-job
spec:
  completions: 2
  parallelism: 2
  template:
    metadata:
      # Manually add the specific label to the workload.
      labels:
        evict_on_gpu_error: "true"    # When the add-on detects a GPU fault, it evicts the pods with this label from the faulty GPU.
    spec:
      restartPolicy: Never
      schedulerName: volcano
      containers:
      - name: k8-job
        image: pytorch:latest    # Replace it with the needed image.
        imagePullPolicy: IfNotPresent
        command: ["/bin/bash", "-c", "python /etc/scripts/run.py"]    # Replace it with the needed application command.
        resources:
          limits:
            nvidia.com/gpu: 1
          requests:
            nvidia.com/gpu: 1
```
- Create the workload.
```bash
kubectl apply -f k8-job.yaml
```
If information similar to the following is displayed, the workload has been created:
```
job.batch/k8-job created
```
- Check whether the pods have been created.
```bash
kubectl get pod -n default
```
If information similar to the following is displayed and the pod statuses are all Running, the pods have been created:
```
NAME            READY   STATUS    RESTARTS   AGE
k8-job1-jdjbb   1/1     Running   0          28s
k8-job1-pr5m4   1/1     Running   0          28s
```
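To confirm that the eviction label was applied, you can also list only the pods that carry it. This uses the evict_on_gpu_error label from the example above.

```bash
# List pods in the default namespace that carry the eviction label.
kubectl get pod -n default -l evict_on_gpu_error=true
```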
Step 3: Check Whether the Pod Was Evicted
If a GPU fault occurs, check the events of the node hosting the faulty GPU to confirm whether the affected pod was evicted.
- Log in to the CCE console and click the cluster name to access the cluster console. The Overview page is displayed.
- In the navigation pane, choose Nodes. In the right pane, click the Nodes tab.
- Locate the row containing the node and click View Events in the Operation column. If an event similar to the one shown in the figure below is displayed, the pod has been evicted.
Figure 3 Pod eviction event
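If you prefer the CLI, you can also look for eviction records in the cluster events. This is a sketch only; the exact event reason and message depend on the add-on version, so the filter below simply matches the word "evict".

```bash
# List recent events across all namespaces and filter for eviction records.
# Assumption: the eviction event reason or message contains "evict"; exact wording may vary.
kubectl get events -A --sort-by=.lastTimestamp | grep -i evict
```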
Other Operations
By default, critical Xid errors are 74 and 79, and application-induced Xid errors are 13, 31, 43, 45, 68, and 137. You can configure which Xid errors are treated as critical and which are application-induced based on your service requirements. The values specified for these two types of Xid errors must be different. If an error is identified as critical, CCE will isolate the affected GPU. If the error is determined to be application-induced, the GPU will not be isolated.
These operations are optional.
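For example, to change which Xid errors are treated as critical, edit the add-on YAML shown in Step 1 and adjust health_check_xids_v2. The fragment below is a sketch only: Xid 63 is added purely for illustration, and the separate list of application-induced Xid errors (not shown here) must not contain any of the same values.

```yaml
# Sketch: fragment of the add-on's custom configuration from Step 1.
custom:
  health_check_xids_v2: 74,79,63    # Critical Xid errors: defaults (74, 79) plus Xid 63 added for illustration only.
                                    # These values must not overlap with the application-induced Xid error list.
```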