What Can I Do If Certain Alarms Are Displayed in the GPU Node Events After the CCE AI Suite (NVIDIA GPU) Add-on Is Upgraded?
Symptom
After the CCE AI Suite (NVIDIA GPU) add-on is upgraded, the following alarms are displayed when you view the GPU node events:
- Alarm 1
Event name: XGPUKmodNeedUpgrade
Kubernetes event: "GPU serverid: xxx, info: XGPU kmod on node xx.xx.xx.xx needs upgrade"
- Alarm 2
Event name: XGPUKmodAbnormal
Kubernetes event: "XGPU kmod on node %s is abnormal"
Possible Cause
- Alarm 1: The GPU virtualization workloads on the GPU nodes were not drained before the CCE AI Suite (NVIDIA GPU) add-on was upgraded. As a result, the xGPU kmod upgrade was skipped, leaving the xGPU kmod version mismatched with the upgraded add-on.
- Alarm 2: The xGPU kmod upgrade failed during the CCE AI Suite (NVIDIA GPU) add-on upgrade.
These alarms do not affect existing services, but they may prevent new features or bug fixes in the upgraded add-on from taking effect. Address them promptly to ensure full add-on functionality.
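The mapping from each alarm to its likely cause can be sketched as a small Python helper that scans node events and reports the two alarms described above. This is an illustrative sketch, not part of the add-on: the function names classify_xgpu_events and fetch_node_events are hypothetical, and the code assumes the event name appears in the "reason" field of the Kubernetes event, which may differ in your cluster.

```python
import json
import subprocess

# Event names and causes are taken from the Symptom and Possible Cause
# sections above.
XGPU_ALARM_CAUSES = {
    "XGPUKmodNeedUpgrade": (
        "xGPU kmod upgrade was skipped because GPU virtualization "
        "workloads were not drained before the add-on upgrade"
    ),
    "XGPUKmodAbnormal": "xGPU kmod upgrade failed during the add-on upgrade",
}


def classify_xgpu_events(events):
    """Return (node, event name, likely cause) for each known xGPU alarm."""
    findings = []
    for event in events:
        # Assumption: the event name is stored in the event's "reason" field.
        reason = event.get("reason", "")
        if reason in XGPU_ALARM_CAUSES:
            node = event.get("involvedObject", {}).get("name", "unknown")
            findings.append((node, reason, XGPU_ALARM_CAUSES[reason]))
    return findings


def fetch_node_events():
    """Read node events with kubectl (requires access to the cluster)."""
    result = subprocess.run(
        [
            "kubectl", "get", "events", "--all-namespaces",
            "--field-selector", "involvedObject.kind=Node",
            "-o", "json",
        ],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout).get("items", [])
```

On a machine with kubectl configured for the cluster, calling classify_xgpu_events(fetch_node_events()) would list the affected nodes. If the alarms are not found under "reason", matching on the event message text shown in the Symptom section is an alternative.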