Why Do a Large Number of Pods Fail to Be Executed After a Workload That Uses Even Scheduling on Virtual GPUs Is Created?
Symptom
After a workload that uses even scheduling on virtual GPUs is created, a large number of GPU pods fail to be executed. The following is an example:
kubectl get pods
Information similar to the following is displayed:
NAME                     READY   STATUS                     RESTARTS   AGE
test-586cf9464c-54g4s    0/1     UnexpectedAdmissionError   0          57s
test-586cf9464c-58n6d    0/1     UnexpectedAdmissionError   0          10s
test1-689cf9462f-5bzcv   0/1     UnexpectedAdmissionError   0          58s
...
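To confirm why admission failed, you can describe one of the affected pods. The pod name below is taken from the example output above and is only illustrative:
kubectl describe pod test-586cf9464c-54g4s
The Status and Events sections of the output show the reason for the UnexpectedAdmissionError.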
Possible Cause
Even scheduling on virtual GPUs requires that the cluster version match the version of the GPU add-on in use:
- If the add-on version is 2.1.41 or later, the cluster version must be v1.27.16-r20, v1.28.15-r10, v1.29.10-r10, v1.30.6-r10, v1.31.4-r0, or later.
- If the add-on version is 2.7.57 or later, the cluster version must be v1.28.15-r10, v1.29.10-r10, v1.30.6-r10, v1.31.4-r0, or later.
When the cluster version is incompatible but GPU virtualization resources are still available on the nodes, pods are scheduled normally. Once those resources are exhausted, however, kubelet cannot process even scheduling on virtual GPUs correctly. Pods that fail to be scheduled then accumulate, causing a memory leak and batch pod execution failures.
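If you are unsure which cluster version you are running, you can query it from the API server. This is only a quick check; the patch suffix (for example, -r10) of your cluster may also be shown in the console:
kubectl version
The Server Version field in the output indicates the cluster version.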
Solution
- To continue using even scheduling on virtual GPUs, delete the workload that failed to be executed and upgrade the cluster to the required version. For details, see Cluster Upgrade Overview. After the cluster is upgraded, create the workload again. Then, the workload can be scheduled properly.
- If you do not need even scheduling on virtual GPUs, restart the kubelet component on the affected nodes to restore them. The procedure is as follows:
- Check the nodes that have faulty pods and record the node IP addresses. You will need these IP addresses to restore the kubelet component on the nodes.
kubectl get pod -l volcano.sh/gpu-num -owide
Information similar to the following is displayed:
NAME                    READY   STATUS                     RESTARTS   AGE     IP            NODE
test-586cf9464c-54g4s   0/1     UnexpectedAdmissionError   0          5m57s   <none>        11.84.252.4
test-586cf9464c-58n6d   0/1     UnexpectedAdmissionError   0          5m10s   172.19.0.24   11.84.252.4
test-586cf9464c-5bzcv   0/1     UnexpectedAdmissionError   0          5m58s   <none>        11.84.252.4
test-586cf9464c-6bb5d   0/1     UnexpectedAdmissionError   0          6m15s   <none>        11.84.252.4
test-586cf9464c-6r2bq   0/1     UnexpectedAdmissionError   0          5m11s   <none>        11.84.252.4
test-586cf9464c-6rcpl   0/1     UnexpectedAdmissionError   0          6m11s   172.19.0.21   11.84.252.4
...
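If the label selector also matches healthy pods, you can narrow the output to the failed ones. This is a sketch; pods with UnexpectedAdmissionError are reported in the Failed phase:
kubectl get pod -l volcano.sh/gpu-num --field-selector status.phase=Failed -o wide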
- Delete the workload that uses even scheduling on virtual GPUs. In the following command, replace deployment with the corresponding workload type and test with the corresponding workload name.
kubectl delete deployment test # Delete the test Deployment.
Information similar to the following is displayed:
deployment.apps/test deleted
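Deleting the owner workload normally removes its pods as well. If any failed pods remain afterwards, they can be cleaned up in bulk. This is a sketch; run it in the namespace that contains the failed pods:
kubectl delete pod --field-selector status.phase=Failed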
- Log in to the nodes recorded in the first step one by one and restart kubelet.
systemctl restart kubelet
Enter the node password if prompted. If no error message is displayed, kubelet has been restarted successfully. After the restart, pods that do not use even scheduling on virtual GPUs and previously failed to be scheduled can be scheduled properly.
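To confirm that kubelet is running again, you can optionally check its service status on the node:
systemctl status kubelet
The Active field in the output should show active (running).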