Updated on 2025-07-17 GMT+08:00

What Can I Do If Certain Alarms Are Displayed in the GPU Node Events After the CCE AI Suite (NVIDIA GPU) Add-on Is Upgraded?

Symptom

After the CCE AI Suite (NVIDIA GPU) add-on is upgraded, the following alarms are displayed when you view the GPU node events:

  • Alarm 1

    Event name: XGPUKmodNeedUpgrade

    Kubernetes event: "GPU serverid: xxx, info: XGPU kmod on node xx.xx.xx.xx needs upgrade"

  • Alarm 2

    Event name: XGPUKmodAbnormal

    Kubernetes event: "XGPU kmod on node xx.xx.xx.xx is abnormal"

Possible Cause

  • Alarm 1: The GPU virtualization workloads on the GPU nodes were not drained before the CCE AI Suite (NVIDIA GPU) add-on was upgraded. As a result, the xGPU kmod upgrade was skipped, leaving a version mismatch between the xGPU kmod and the upgraded add-on.
  • Alarm 2: The xGPU kmod upgrade failed during the CCE AI Suite (NVIDIA GPU) add-on upgrade.

These alarms do not affect existing services, but new features or bug fixes introduced in the upgraded add-on may not take effect until the alarms are resolved. You are advised to handle them promptly so that the add-on is fully functional.
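
To check which nodes are reporting either alarm, the two event names can be matched against an exported list of cluster events. The following Python sketch is illustrative only. It assumes that the event name shown on the console is stored in the reason field of the Kubernetes Event object, and that the node name is stored in involvedObject.name. Neither assumption is confirmed by this FAQ. The function name find_xgpu_alarms and the XGPU_ALARMS table are hypothetical helpers, not part of the add-on.

```python
import json

# Event names from the alarms above, mapped to the cause described for each.
XGPU_ALARMS = {
    "XGPUKmodNeedUpgrade": "xGPU kmod upgrade skipped: GPU virtualization workloads were not drained",
    "XGPUKmodAbnormal": "xGPU kmod upgrade failed during the add-on upgrade",
}


def find_xgpu_alarms(event_list_json: str) -> list[dict]:
    """Return the xGPU alarm events found in a Kubernetes EventList JSON document."""
    items = json.loads(event_list_json).get("items", [])
    found = []
    for event in items:
        # Assumption: the alarm's event name is stored in the Event `reason` field.
        reason = event.get("reason", "")
        if reason in XGPU_ALARMS:
            found.append({
                "reason": reason,
                # Assumption: for node events, involvedObject.name is the node name.
                "node": event.get("involvedObject", {}).get("name", ""),
                "message": event.get("message", ""),
                "likely_cause": XGPU_ALARMS[reason],
            })
    return found
```

The input could be the output of kubectl get events -A -o json saved to a file and read into a string. Each returned entry pairs the reporting node with the likely cause listed above, which helps decide whether workloads need to be drained (Alarm 1) or the failed kmod upgrade needs to be investigated (Alarm 2).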