Installing the NVIDIA DRA Driver
To overcome the limitations of traditional device plug-ins in complex parameter passing and dynamic GPU resource allocation, Kubernetes 1.26 introduced dynamic resource allocation (DRA). DRA supports fine-grained, on-demand GPU resource scheduling through native APIs. The NVIDIA DRA Driver builds on DRA to dynamically share and virtualize GPU resources across pods, significantly improving utilization and reducing hardware costs in LLM training and multi-tenant inference scenarios.
Background
In early Kubernetes versions, GPU resource scheduling depended entirely on device plug-ins. With the rise of foundation model training, multi-tenant inference, and GPU virtualization, traditional coarse-grained allocation can no longer meet complex service requirements. It faces the following bottlenecks:
- Limited parameter expression: Device plug-ins can process only simple integer resource requests, for example, nvidia.com/gpu: 2. The traditional YAML cannot directly and accurately express complex requirements, such as "request two H100 GPUs interconnected through NVLink" or "request a MIG slice with a specific GPU memory size."
- Inflexible static configuration: Take MIG as an example. The GPU partitioning layout must be statically configured during node initialization and cannot be adjusted at runtime. When a service needs different resource specifications, the node must be restarted. This makes true on-demand resource allocation impossible.
- Decoupled scheduling and allocation: Device plug-ins use a pre-allocation model. The scheduler can only obtain the coarse-grained information about the number of GPUs, but cannot obtain key details such as the topology structure. As a result, the scheduling policies are decoupled from the actual device capabilities, limiting resource utilization efficiency and task performance optimization.
- Unreliable resource cleanup: Due to the lack of strict resource lifecycle management, when a pod exits abnormally, GPU resources may not be completely released. This can affect the cluster stability and resource utilization.
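The first bottleneck can be made concrete with a minimal example. In a device-plugin request, only an integer count can be expressed, and all finer-grained requirements stay invisible to the scheduler. The pod name below is illustrative:

```yaml
# Traditional device-plugin request: only an integer GPU count can be declared.
# Requirements such as GPU model, NVLink topology, or MIG slice size cannot be
# expressed here and are invisible to the scheduler.
apiVersion: v1
kind: Pod
metadata:
  name: legacy-gpu-pod
spec:
  containers:
  - name: ctr
    image: ubuntu:22.04
    resources:
      limits:
        nvidia.com/gpu: 2   # "2 GPUs", nothing more specific
```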
Solution
To overcome these bottlenecks, Kubernetes 1.26 and later versions introduce DRA. Rather than replacing device plug-ins, it complements them with more refined, flexible solutions for complex scenarios. The NVIDIA DRA Driver, built on this framework, elevates GPUs from simple countable resources to descriptive, configurable objects. Its main features are as follows:
- Flexible device filtering: Common Expression Language (CEL) can be used to filter device attributes in a fine-grained manner.
- Resource sharing: Multiple containers or pods can securely share the same device resources by referencing the corresponding resource declaration.
- Centralized device management: Device drivers and cluster administrators can manage devices by device class in a unified manner. Device classes can be tailored to hardware capabilities and application requirements. For example, you can define a cost-optimized device class for general-purpose workloads and a high-performance device class for demanding tasks, meeting the specific requirements of different application scenarios.
- Simplified pod resource request management: With DRA, application O&M personnel do not need to explicitly specify detailed device specifications when creating pods. Instead, they only need to reference a pre-configured resource request for pods. The system automatically allocates the corresponding devices to the pods based on the request. This greatly improves development and management efficiency.
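The flexible device filtering described above can be sketched with a CEL selector in a ResourceClaimTemplate. This is a minimal illustration, not an authoritative example: the attribute name (productName) and the matched value are assumptions, and the attributes actually published depend on your driver version:

```yaml
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: h100-gpu
spec:
  spec:
    devices:
      requests:
      - name: gpu
        exactly:
          deviceClassName: gpu.nvidia.com
          allocationMode: ExactCount
          count: 1
          selectors:
          - cel:
              # Assumed attribute name; check the attributes actually exposed in
              # your cluster with: kubectl get resourceslices -o yaml
              expression: device.attributes["gpu.nvidia.com"].productName.matches("H100")
```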
Prerequisites
- A cluster of v1.34 or later is available.
- The needed images are ready.
Download the following images in advance on a PC that can access the Internet and has Docker installed.
- Download the images.
docker pull nvcr.io/nvidia/k8s-dra-driver-gpu:v25.8.0
docker pull ubuntu:22.04
- Push the downloaded images to the SWR image repository to ensure that all nodes in the Kubernetes cluster can pull them.
For details about how to push an image, see Pushing an Image.
Notes and Constraints
The GPU node must use containerd.
Procedure
- Log in to the CCE console and click the cluster name to access the cluster console.
- Install the CCE AI Suite (NVIDIA GPU) add-on.
- In the navigation pane, choose Add-ons. In the right pane, find the CCE AI Suite (NVIDIA GPU) add-on and click Install.
- In the window that slides out from the right, configure the parameters referring to the settings below.
- Add-on Versions: Select 2.12.0 or later. The 2.12.0 version is used as an example.
- Default Cluster Driver: Select 570.86.15 or later. The 570.86.15 version is used as an example.
For more driver versions, see CCE AI Suite (NVIDIA GPU).
- Click Installation Using YAML, change the value of dra_mode in spec.values.custom to true, and click Submit.

- Check whether the driver has been installed.
- On the Add-ons page, view details about the installed CCE AI Suite (NVIDIA GPU) add-on.
Check whether the status of nvidia-driver-installer on the node is Running.

- Check whether the GPU driver has been properly loaded.
nvidia-smi
If the GPU information, such as the model, is displayed, the driver has been installed.

- (Optional) Prepare the CDI.
View the patch version of the cluster on the Overview page. If the patch version is v1.34.2-r0 or later, skip this step.
- Edit the containerd configuration file.
vi /etc/containerd/config.toml
- Add the following parameters to [plugins."io.containerd.grpc.v1.cri"]:
enable_cdi = true
cdi_spec_dirs = ["/etc/cdi", "/var/run/cdi"]
- Restart containerd to apply the changes.
systemctl restart containerd
- Check the containerd logs and ensure that CDI has been enabled.
cat /var/log/cce/containerd/containerd.log | grep -i cdi
If information similar to the following is displayed, CDI has been enabled:
EnableCDI:true CDISpecDirs:[/etc/cdi /var/run/cdi]
- Install NVIDIA DRA Driver.
Perform the following operations on a node:
- Install Helm. Helm 3.19.3 is used as an example.
curl -O https://get.helm.sh/helm-v3.19.3-linux-amd64.tar.gz
tar xvf helm-v3.19.3-linux-amd64.tar.gz
cp ./linux-amd64/helm /usr/local/bin/
helm version
If information similar to the following is displayed, the tool has been installed:
version.BuildInfo{Version:"v3.19.3", GitCommit:"0707f566a3f4ced24009ef14d67fe0ce69db****", GitTreeState:"clean", GoVersion:"go1.24.10"}
- Obtain the Helm template package.
helm fetch https://helm.ngc.nvidia.com/nvidia/charts/nvidia-dra-driver-gpu-25.8.0.tgz
tar xvf nvidia-dra-driver-gpu-25.8.0.tgz
cd nvidia-dra-driver-gpu
- Create a namespace.
kubectl create namespace nvidia-dra
- Install nvidia-dra-driver.
helm install nvidia-dra . --namespace nvidia-dra \
  --set resources.computeDomains.enabled=false \
  --set gpuResourcesEnabledOverride=true \
  --set image.repository="<SWR-image-address>" \
  --set image.tag="v25.8.0" \
  --set "kubeletPlugin.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms[0].matchExpressions[0].key=accelerator" \
  --set "kubeletPlugin.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms[0].matchExpressions[0].operator=Exists" \
  --set "kubeletPlugin.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms[0].matchExpressions[1].key=kubernetes.io/role" \
  --set "kubeletPlugin.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms[0].matchExpressions[1].operator=NotIn" \
  --set "kubeletPlugin.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms[0].matchExpressions[1].values[0]=virtual-kubelet" \
  --set "kubeletPlugin.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms[0].matchExpressions[1].values[1]=edge" \
  --set nvidiaDriverRoot="/usr/local/nvidia"
Modify the following parameters as needed:
- image.repository: Replace it with the address of k8s-dra-driver-gpu pushed to SWR.
- image.tag: Change the value based on the actual image tag, for example, v25.8.0.
- nvidiaDriverRoot: If an NVIDIA driver is not installed using the CCE AI Suite (NVIDIA GPU), you need to specify the actual driver installation path.
The expected output is as follows:
NAME: nvidia-dra
LAST DEPLOYED: xxx xxx xxx xx:xx:xx xxxx
NAMESPACE: nvidia-dra
STATUS: deployed
REVISION: 1
TEST SUITE: None
- Check whether device classes have been generated.
kubectl get deviceclass
The expected output is as follows:
NAME             AGE
gpu.nvidia.com   74s
mig.nvidia.com   74s
- Check whether resource slices have been generated.
kubectl get resourceslices
The expected output is as follows:
NAME                                 NODE            DRIVER           POOL            AGE
192.168.**.**-gpu.nvidia.com-5rrfv   192.168.**.**   gpu.nvidia.com   192.168.**.**   71s
- Check whether the NVIDIA DRA Driver works properly.
- Create a file named gpu-pod.yaml. You can name the file as needed.
vi gpu-pod.yaml
An example of the file content is as follows:
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: single-gpu
spec:
  spec:
    devices:
      requests:
      - name: gpu
        exactly:
          allocationMode: ExactCount
          deviceClassName: gpu.nvidia.com
          count: 1
---
apiVersion: v1
kind: Pod
metadata:
  name: pod1
  labels:
    app: pod
spec:
  containers:
  - name: ctr
    image: {{Ubuntu-image-downloaded-in-the-prerequisites}}:22.04
    command: ["bash", "-c"]
    args: ["nvidia-smi -L; trap 'exit 0' TERM; sleep 9999 & wait"]
    resources:
      claims:
      - name: gpu
  resourceClaims:
  - name: gpu
    resourceClaimTemplateName: single-gpu
- Create the pod.
kubectl create -f gpu-pod.yaml
- Check the pod status and verify that a GPU has been allocated.
kubectl get pod -owide
You can see that pod1 is in the Running state.
- View the pod logs and check whether the GPU has been mounted.
kubectl logs pod1
- View the GPU information.
kubectl exec -it pod1 -- /bin/bash
nvidia-smi
If the GPU information is returned, the device has been mounted and the NVIDIA DRA Driver is working properly.
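The resource-sharing feature described in the Solution section can also be verified with this setup: multiple containers in one pod can reference the same claim and securely share one GPU. A minimal sketch, assuming the single-gpu ResourceClaimTemplate has already been created; the pod and claim names are illustrative, and in practice the image would be the Ubuntu image pushed to SWR:

```yaml
# Both containers reference the same claim, so they share a single GPU.
apiVersion: v1
kind: Pod
metadata:
  name: shared-gpu-pod
spec:
  containers:
  - name: ctr1
    image: ubuntu:22.04
    command: ["bash", "-c", "nvidia-smi -L; sleep infinity"]
    resources:
      claims:
      - name: shared-gpu
  - name: ctr2
    image: ubuntu:22.04
    command: ["bash", "-c", "nvidia-smi -L; sleep infinity"]
    resources:
      claims:
      - name: shared-gpu
  resourceClaims:
  - name: shared-gpu
    resourceClaimTemplateName: single-gpu
```

Running `kubectl logs shared-gpu-pod -c ctr1` and `-c ctr2` should show the same GPU UUID in both containers.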
