Updated on 2025-06-18 GMT+08:00

Why Do a Large Number of Pods Fail to Be Executed After a Workload That Uses Even Scheduling on Virtual GPUs Is Created?

Symptom

After a workload that uses even scheduling on virtual GPUs is created, a large number of GPU pods fail to be executed. The following is an example:

kubectl get pods

Information similar to the following is displayed:

NAME                         READY   STATUS                     RESTARTS   AGE
test-586cf9464c-54g4s        0/1     UnexpectedAdmissionError    0         57s
test-586cf9464c-58n6d        0/1     UnexpectedAdmissionError    0         10s
test1-689cf9462f-5bzcv       0/1     UnexpectedAdmissionError    0         58s
...

Possible Cause

For even scheduling on virtual GPUs, the cluster version must be compatible with the CCE AI Suite (NVIDIA GPU) add-on version. Compatibility details are as follows:
  • If the add-on version is 2.1.41 or later, the cluster version must be v1.27.16-r20, v1.28.15-r10, v1.29.10-r10, v1.30.6-r10, v1.31.4-r0, or later.
  • If the add-on version is 2.7.57 or later, the cluster version must be v1.28.15-r10, v1.29.10-r10, v1.30.6-r10, v1.31.4-r0, or later.
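
To check whether your environment meets these requirements, you can run the following commands. The Server Version field in the kubectl version output shows the cluster version. The add-on version is typically displayed on the Add-ons page of the CCE console; the grep pattern below is only an assumption for locating the add-on pods and may need to be adjusted.

kubectl version  # The Server Version field shows the cluster version.
kubectl get pods -n kube-system | grep -i nvidia  # List the GPU add-on pods (the pod names may differ).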

If the cluster version is incompatible but the nodes still have available GPU virtualization resources, pods are scheduled normally. Once those resources are exhausted, however, kubelet cannot correctly handle even scheduling on virtual GPUs. Pods that fail admission accumulate on the nodes, which causes memory leaks and large-scale pod execution failures.
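
To confirm that kubelet rejected the pods at admission time, check the events of an affected pod. The pod name below is taken from the example output in Symptom; replace it with an actual pod name in your cluster.

kubectl describe pod test-586cf9464c-54g4s  # The Events section shows why kubelet rejected the pod.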

Solution

  • To continue using even scheduling on virtual GPUs, delete the workload that failed to be executed and upgrade the cluster to a compatible version. For details, see Cluster Upgrade Overview. After the cluster is upgraded, create the workload again. It can then be scheduled properly.
  • If you no longer need even scheduling on virtual GPUs, restart kubelet on the affected nodes to restore normal scheduling. The procedure is as follows:
    1. Check the nodes where the faulty pods are located and record the node IP addresses. You will need these IP addresses to restart kubelet on the nodes in 3.
      kubectl get pod -l volcano.sh/gpu-num -o wide

      Information similar to the following is displayed:

      NAME                        READY   STATUS                      RESTARTS   AGE      IP              NODE
      test-586cf9464c-54g4s       0/1     UnexpectedAdmissionError    0          5m57s    <none>          11.84.252.4
      test-586cf9464c-58n6d       0/1     UnexpectedAdmissionError    0          5m10s    172.19.0.24     11.84.252.4
      test-586cf9464c-5bzcv       0/1     UnexpectedAdmissionError    0          5m58s    <none>          11.84.252.4
      test-586cf9464c-6bb5d       0/1     UnexpectedAdmissionError    0          6m15s    <none>          11.84.252.4
      test-586cf9464c-6r2bq       0/1     UnexpectedAdmissionError    0          5m11s    <none>          11.84.252.4
      test-586cf9464c-6rcpl       0/1     UnexpectedAdmissionError    0          6m11s    172.19.0.21     11.84.252.4
      ...
    2. Delete the workload that uses even scheduling on virtual GPUs. In the following command, replace deployment with the actual workload type and test with the actual workload name.
      kubectl delete deployment test  # Delete the test Deployment.

      Information similar to the following is displayed:

      deployment.apps/test deleted
    3. Log in to each node recorded in 1 and restart kubelet.
      systemctl restart kubelet

      Enter the password if prompted. If no error message is displayed, kubelet has been restarted. After the restart, pods that do not use even scheduling on virtual GPUs and previously failed to be scheduled can be scheduled properly, as shown in the verification example below.
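
      The following commands are an example of how you can verify the result. Run systemctl on the node and kubectl on any host that has access to the cluster.

      systemctl status kubelet  # Confirm that kubelet is in the active (running) state.
      kubectl get pods -A -o wide | grep -i UnexpectedAdmissionError  # Confirm that no pods remain in this state.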