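The device-plugin component reports hardware resource status in a CCE cluster. In GPU scenarios, the GPU resources available on a node are reported by nvidia-gpu-device-plugin in the kube-system namespace. If a node reports resources inaccurately or devices fail to mount, check device-plugin first.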
Run the following command to check the status of the device-plugin component:
kubectl get po -A -owide|grep nvidia
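- If the device-plugin Pod on the abnormal node is not in the Running state, submit a service ticket to contact technical support.
- If the device-plugin Pod on the abnormal node is in the Running state, run the following command to check the device-plugin logs for errors (replace the Pod name with the one on the abnormal node):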
kubectl logs -n kube-system nvidia-gpu-device-plugin-9xmhr
If the output contains "gpu driver wasn't ready. will re-check", go to step 2 and check whether /usr/local/nvidia/bin/nvidia-smi or /opt/cloud/cce/nvidia/bin/nvidia-smi exists in the driver installation directory.
...
I0527 11:29:06.420714 3336959 nvidia_gpu.go:76] device-plugin started
I0527 11:29:06.521884 3336959 nodeinformer.go:124] "nodeInformer started"
I0527 11:29:06.521964 3336959 nvidia_gpu.go:262] "gpu driver wasn't ready. will re-check in %s" 5s="(MISSING)"
I0527 11:29:11.524882 3336959 nvidia_gpu.go:262] "gpu driver wasn't ready. will re-check in %s" 5s="(MISSING)"
...
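The log check and the file check described above can be sketched as two small shell helpers. This is only an illustrative sketch: the function names `driver_not_ready` and `find_nvidia_smi` are hypothetical and are not part of CCE. The Pod name in the usage comment is the example from above, and the file check is assumed to run on the affected node itself.

```shell
#!/bin/sh
# Hypothetical helpers illustrating the checks described above.

# Returns 0 if the log text on stdin contains the "driver not ready" message.
driver_not_ready() {
    grep -q "gpu driver wasn't ready"
}

# Prints the first existing nvidia-smi path and returns 0; returns 1 if none exists.
# With no arguments, it checks the two driver installation paths named above.
find_nvidia_smi() {
    if [ "$#" -eq 0 ]; then
        set -- /usr/local/nvidia/bin/nvidia-smi /opt/cloud/cce/nvidia/bin/nvidia-smi
    fi
    for path in "$@"; do
        if [ -f "$path" ]; then
            echo "$path"
            return 0
        fi
    done
    return 1
}

# Example usage (replace the Pod name with the one on the abnormal node):
#   kubectl logs -n kube-system nvidia-gpu-device-plugin-9xmhr | driver_not_ready \
#       && echo "driver not ready; check nvidia-smi on the node"
#   find_nvidia_smi || echo "nvidia-smi not found in either directory"
```

If `driver_not_ready` succeeds and `find_nvidia_smi` fails on the node, neither nvidia-smi file exists, which is the condition that step 2 checks for.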