Updated on 2025-08-18 GMT+08:00

AI Suite (NV GPU)

Description

AI Suite (NV GPU) is a device management plug-in that makes GPUs available to containers. It must be installed before GPU nodes in a cluster can be used.

Constraints

  • When you create a dedicated resource pool, this plug-in is automatically installed only when the instance specification type is set to GPU.
  • Do not upgrade the GPU driver until the plug-in upgrade is complete. Otherwise, the driver upgrade may be suspended or may fail.

Verifying the Add-on

After the add-on is installed, run the nvidia-smi command on the GPU node and in a container that has been allocated GPU resources to verify that the GPU device and driver are available.

  • GPU node:
    • If the add-on version is earlier than 2.0.0, run the following command:
      cd /opt/cloud/cce/nvidia/bin && ./nvidia-smi
    • If the add-on version is 2.0.0 or later, run the following command:
      cd /usr/local/nvidia/bin && ./nvidia-smi
  • Container:
    • If the cluster version is v1.27 or earlier, run the following command:
      cd /usr/local/nvidia/bin && ./nvidia-smi
    • If the cluster version is v1.28 or later, run the following command:
      cd /usr/bin && ./nvidia-smi

If GPU information is returned, the device is available and the add-on has been installed.
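
The version rules above can be sketched as a small shell helper. This is only an illustration: the function names (version_ge, node_smi_dir, container_smi_dir) are hypothetical and not part of the add-on, and the version comparison assumes GNU sort with the -V option is available.

```shell
#!/usr/bin/env bash
# Sketch: map a version to the nvidia-smi directory listed in this section.
# All function names here are hypothetical helpers, not part of the add-on.

# Succeeds when version $1 is greater than or equal to version $2.
# Assumes GNU `sort -V` (version sort) is available.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Directory on the GPU node, chosen by add-on version (for example 2.7.63).
node_smi_dir() {
  if version_ge "$1" "2.0.0"; then
    echo /usr/local/nvidia/bin
  else
    echo /opt/cloud/cce/nvidia/bin
  fi
}

# Directory inside the container, chosen by cluster version (for example v1.28).
# A leading "v" is dropped before comparing.
container_smi_dir() {
  if version_ge "${1#v}" "1.28"; then
    echo /usr/bin
  else
    echo /usr/local/nvidia/bin
  fi
}
```

For example, on a GPU node with add-on version 2.7.63, `cd "$(node_smi_dir 2.7.63)" && ./nvidia-smi` runs the same command as the manual step for version 2.0.0 or later.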

Components

Table 1 Plug-in components

Component | Description | Resource Type
nvidia-driver-installer | A workload that installs the NV GPU driver on a node. It consumes resources only during installation and none after the installation is finished. | DaemonSet
hce20-nvidia-driver-installer | A workload that installs the NV GPU driver on a node running HCE 2.0. It consumes resources only during installation and none after the installation is finished. | DaemonSet
ubuntu22-nvidia-driver-installer | A workload that installs the NV GPU driver on a node running Ubuntu 22.04. It consumes resources only during installation and none after the installation is finished. | DaemonSet
nvidia-gpu-device-plugin | A Kubernetes device plugin that provides NV GPU heterogeneous compute to containers. | DaemonSet
nvidia-operator | A component that provides NV GPU node management capabilities for clusters. | Deployment
dcgm-exporter | A component that collects GPU metrics from DCGM. It is installed only when DCGM-Exporter is enabled. | DaemonSet
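
As a rough way to check which of these components are present in a cluster, the following sketch filters a workload listing by the component names in Table 1. It assumes the components appear as workloads under exactly these names and that kubectl is configured for the cluster. Neither assumption is confirmed here, and the helper name filter_gpu_components is hypothetical. The namespace is not stated in this section, so the usage line queries all namespaces.

```shell
#!/usr/bin/env bash
# Sketch: pick the plug-in components out of a workload listing.
# Component names come from Table 1; the helper name is hypothetical.
gpu_components='nvidia-driver-installer|nvidia-gpu-device-plugin|nvidia-operator|dcgm-exporter'

# Read a listing on stdin and keep only lines naming one of the components.
# The first pattern also matches the hce20- and ubuntu22- installer variants.
filter_gpu_components() {
  grep -E "$gpu_components"
}

# Usage on a machine with cluster access (not executed by this sketch):
#   kubectl get daemonsets,deployments --all-namespaces | filter_gpu_components
```

An empty result would suggest that the components are missing or run under different names.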

Change History

Table 2 Release history

Plug-in Version | New Feature
2.7.63 | Fixed security vulnerabilities.
2.7.42 | Added the NV 535.216.03 driver to support xGPUs.
2.6.4 | Updated the isolation logic of GPUs.
2.0.72 | Updated the isolation logic of GPUs.
2.0.48 | Fixed an issue that occurred during driver installation.
2.0.44 | Added support for NV driver 535; allowed non-root users to use xGPUs; optimized the startup logic.
2.0.14 | Added xGPU device monitoring; added compatibility between the nvidia.com/gpu and volcano.sh/gpu-* APIs.
1.2.29 | Adapted to Ubuntu 22.04; optimized the automatic mounting of the GPU driver directory.
1.2.24 | Added GPU driver version configuration for node pools; added GPU metric collection.
1.2.20 | Set the plug-in alias to gpu.
1.2.15 | Adapted to CCE v1.23 clusters.