How Do I Rectify Failures When the NVIDIA Driver Is Used to Start Containers on GPU Nodes?
Did a Resource Scheduling Failure Event Occur on a Cluster Node?
Symptom
A node is running properly and has GPU resources. However, the following error information is displayed:
0/9 nodes are available: 9 insufficient nvidia.com/gpu
Fault Locating
- Check whether the NVIDIA label is attached to the node.

- Check whether the NVIDIA driver is running properly.
Log in to the node where the add-on is running and view the driver installation log, which is stored in either /opt/cloud/cce/nvidia/nvidia_installer.log or /usr/local/nvidia/nvidia-installer.log.
View the standard output logs of the NVIDIA container.
Obtain the container ID by running the following command:
crictl ps -a | grep nvidia
View logs by running the following command:
crictl logs <container-ID>
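The two crictl steps above can be combined into one pass. The following is a minimal sketch, not the documented procedure: the helper name nvidia_container_ids is hypothetical, and it assumes that the container ID is the first column of the crictl ps -a output and that the NVIDIA containers have "nvidia" somewhere in their row.

```shell
#!/bin/sh
# Hypothetical helper: read `crictl ps -a` output on stdin and print the
# container ID (assumed to be the first column) of every row that mentions
# "nvidia", case-insensitively.
nvidia_container_ids() {
  awk 'tolower($0) ~ /nvidia/ { print $1 }'
}

# Run the log collection only where crictl is installed (that is, on the node).
if command -v crictl >/dev/null 2>&1; then
  crictl ps -a | nvidia_container_ids | while read -r id; do
    echo "=== logs for container $id ==="
    # Show only the last 50 lines of each container's output.
    crictl logs "$id" 2>&1 | tail -n 50
  done
fi
```

For the label check in the first step, kubectl get nodes --show-labels lists the labels on every node, so the output can be searched for the NVIDIA label expected by the add-on.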
What Should I Do If the NVIDIA Version Reported by a Service and the CUDA Version Do Not Match?
- Check the CUDA version in the service container, preferably by using the official CUDA query method:
cat /usr/local/cuda/version.txt
- Run nvidia-smi on the host node to check the maximum CUDA version supported by the NVIDIA driver, and then check whether the CUDA version inside the container falls within the range supported by that driver.
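The comparison in the two steps above can be scripted. The following is a rough sketch under stated assumptions: the function names are hypothetical, version.txt is assumed to contain a line such as "CUDA Version 10.1.243", the nvidia-smi header is assumed to contain "CUDA Version: X.Y", and sort -V (GNU coreutils) is assumed to be available. Some newer CUDA images may not ship version.txt, in which case the container version has to be obtained another way.

```shell
#!/bin/sh
# Hypothetical helper: read the contents of version.txt on stdin and print
# the CUDA version as major.minor (for example, 10.1).
container_cuda() {
  sed -n 's/^CUDA Version *\([0-9][0-9.]*\).*/\1/p' | head -n 1 | cut -d. -f1,2
}

# Hypothetical helper: read nvidia-smi output on stdin and print the maximum
# CUDA version shown in the header line (for example, 12.2).
driver_max_cuda() {
  sed -n 's/.*CUDA Version: *\([0-9][0-9.]*\).*/\1/p' | head -n 1
}

# Return 0 if version $1 is less than or equal to version $2.
version_le() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Example usage, with the two values collected by hand:
#   in the container:  cat /usr/local/cuda/version.txt | container_cuda
#   on the host node:  nvidia-smi | driver_max_cuda
#   then:              version_le 10.1 12.2 && echo "supported by the driver"
```

Only the major and minor numbers are compared because nvidia-smi reports the driver's maximum CUDA version in that form.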
