What Can I Do If Certain Alarms Are Displayed in the GPU Node Events After the CCE AI Suite (NVIDIA GPU) Add-on Is Upgraded?

Symptom

After the CCE AI Suite (NVIDIA GPU) add-on is upgraded, the following alarms are displayed when you view the GPU node events:

Alarm 1
Event name: XGPUKmodNeedUpgrade

Kubernetes event: "GPU serverid: xxx, info: XGPU kmod on node xx.xx.xx.xx needs upgrade"

Alarm 2
Event name: XGPUKmodAbnormal

Kubernetes event: "XGPU kmod on node %s is abnormal"
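
To pick these two alarms out of a larger set of cluster events, a small filter can help. The following is a minimal sketch, not an official tool. It assumes the event names above appear in the `reason` field of standard Kubernetes Event objects (this page does not confirm that mapping) and that the events have been exported as JSON, for example with `kubectl get events -A -o json`. The function name `find_xgpu_alarms` and the sample data, including the node name `gpu-node-1`, are hypothetical.

```python
import json

# Event names taken from this page. Matching them against the Kubernetes
# event "reason" field is an assumption.
XGPU_ALARM_REASONS = {"XGPUKmodNeedUpgrade", "XGPUKmodAbnormal"}


def find_xgpu_alarms(event_list):
    """Return (reason, object name, message) for each matching event."""
    alarms = []
    for event in event_list.get("items", []):
        reason = event.get("reason")
        if reason in XGPU_ALARM_REASONS:
            involved = event.get("involvedObject", {})
            alarms.append((reason, involved.get("name", ""), event.get("message", "")))
    return alarms


# Hypothetical sample shaped like the JSON output of `kubectl get events`.
sample = json.loads("""
{
  "items": [
    {"reason": "XGPUKmodNeedUpgrade",
     "involvedObject": {"kind": "Node", "name": "gpu-node-1"},
     "message": "GPU serverid: xxx, info: XGPU kmod on node xx.xx.xx.xx needs upgrade"},
    {"reason": "NodeReady",
     "involvedObject": {"kind": "Node", "name": "gpu-node-1"},
     "message": "Node gpu-node-1 status is now: NodeReady"},
    {"reason": "XGPUKmodAbnormal",
     "involvedObject": {"kind": "Node", "name": "gpu-node-1"},
     "message": "XGPU kmod on node xx.xx.xx.xx is abnormal"}
  ]
}
""")

for reason, name, message in find_xgpu_alarms(sample):
    print(f"{reason}\t{name}\t{message}")
```

To run the filter against a real cluster, replace `sample` with the parsed JSON of the exported events.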

Possible Cause

Alarm 1: The GPU virtualization workloads on the GPU nodes were not drained before the CCE AI Suite (NVIDIA GPU) add-on was upgraded, so the xGPU kmod upgrade was skipped. As a result, the xGPU kmod version does not match the upgraded add-on.
Alarm 2: The xGPU kmod upgrade failed during the CCE AI Suite (NVIDIA GPU) add-on upgrade.

These alarms do not affect existing services, but new features or bug fixes introduced in the upgraded add-on may not take effect until the alarms are cleared. Address these alarms promptly to ensure full add-on functionality.

Solution

Drain GPU virtualization workloads on the GPU nodes one by one and restart nvidia-gpu-device-plugin on each related node. For details, see How Can I Drain a GPU Node After Upgrading or Rolling Back the CCE AI Suite (NVIDIA GPU) Add-on?

