Installing the NVIDIA DRA Driver
To overcome the limitations of traditional device plug-ins in complex parameter passing and dynamic GPU resource allocation, Kubernetes 1.26 introduced dynamic resource allocation (DRA). DRA supports fine-grained, on-demand GPU resource scheduling through native APIs. The NVIDIA DRA Driver builds on DRA to dynamically share and virtualize GPU resources across pods, significantly improving utilization and reducing hardware costs in LLM training and multi-tenant inference scenarios.
Background
In early Kubernetes versions, GPU resource scheduling depended entirely on device plug-ins. With the rise of foundation model training, multi-tenant inference, and GPU virtualization, traditional coarse-grained allocation can no longer meet complex service requirements. It faces the following bottlenecks:
- Limited parameter expression: Device plug-ins can process only simple integer resource requests, for example, nvidia.com/gpu: 2. The traditional YAML cannot directly and accurately express complex requirements, such as "request two H100 GPUs interconnected through NVLink" or "request a MIG slice with a specific GPU memory size."
- Inflexible static configuration: Take MIG as an example. The GPU partitioning layout must be statically configured during node initialization and cannot be adjusted at runtime. When a service needs different resource specifications, the node must be restarted. This makes true on-demand resource allocation impossible.
- Decoupled scheduling and allocation: Device plug-ins use a pre-allocation model. The scheduler can only obtain the coarse-grained information about the number of GPUs, but cannot obtain key details such as the topology structure. As a result, the scheduling policies are decoupled from the actual device capabilities, limiting resource utilization efficiency and task performance optimization.
- Unreliable resource cleanup: Due to the lack of strict resource lifecycle management, when a pod exits abnormally, GPU resources may not be completely released. This can affect the cluster stability and resource utilization.
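The first bottleneck can be made concrete with a minimal example. In a device-plugin request, only an integer count can be expressed, and all finer-grained requirements stay invisible to the scheduler. The pod name below is illustrative:

```yaml
# Traditional device-plugin request: only an integer GPU count can be declared.
# Requirements such as GPU model, NVLink topology, or MIG slice size cannot be
# expressed here and are invisible to the scheduler.
apiVersion: v1
kind: Pod
metadata:
  name: legacy-gpu-pod
spec:
  containers:
  - name: ctr
    image: ubuntu:22.04
    resources:
      limits:
        nvidia.com/gpu: 2   # "2 GPUs", nothing more specific
```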
Solution
To overcome these bottlenecks, Kubernetes 1.26 and later versions introduce DRA. Rather than replacing device plug-ins, it complements them with more refined, flexible solutions for complex scenarios. The NVIDIA DRA Driver, built on this framework, elevates GPUs from simple countable resources to descriptive, configurable objects. Its main features are as follows:
- Flexible device filtering: Common Expression Language (CEL) can be used to filter device attributes in a fine-grained manner.
- Resource sharing: Multiple containers or pods can securely share the same device resources by referencing the corresponding resource declaration.
- Centralized device management: Device drivers and cluster administrators can manage devices by device class in a unified manner. Device classes can be tailored to hardware capabilities and application requirements. For example, you can define a cost-optimized device class for general-purpose workloads and a high-performance device class for demanding tasks, meeting the specific requirements of different application scenarios.
- Simplified pod resource request management: With DRA, application O&M personnel do not need to explicitly specify detailed device specifications when creating pods. Instead, they only need to reference a pre-configured resource request for pods. The system automatically allocates the corresponding devices to the pods based on the request. This greatly improves development and management efficiency.
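The flexible device filtering described above can be sketched with a CEL selector in a ResourceClaimTemplate. This is a minimal illustration, not an authoritative example: the attribute name (productName) and the matched value are assumptions, and the attributes actually published depend on your driver version:

```yaml
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: h100-gpu
spec:
  spec:
    devices:
      requests:
      - name: gpu
        exactly:
          deviceClassName: gpu.nvidia.com
          allocationMode: ExactCount
          count: 1
          selectors:
          - cel:
              # Assumed attribute name; check the attributes actually exposed in
              # your cluster with: kubectl get resourceslices -o yaml
              expression: device.attributes["gpu.nvidia.com"].productName.matches("H100")
```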
Prerequisites
- A cluster of v1.34 or later is available.
- The needed images are ready.
Download the following images in advance on a PC that can access the Internet and has Docker installed.
- Download the images.
docker pull nvcr.io/nvidia/k8s-dra-driver-gpu:v25.8.0
docker pull ubuntu:22.04
- Push the downloaded images to the SWR image repository to ensure that all nodes in the Kubernetes cluster can pull them.
For details about how to push an image, see Pushing an Image.
Notes and Constraints
The GPU node must use containerd.
Procedure
- Log in to the CCE console and click the cluster name to access the cluster console.
- Install the CCE AI Suite (NVIDIA GPU) add-on.
- In the navigation pane, choose Add-ons. In the right pane, find the CCE AI Suite (NVIDIA GPU) add-on and click Install.
- In the window that slides out from the right, configure the parameters referring to the settings below.
- Add-on Versions: Select 2.12.0 or later. The 2.12.0 version is used as an example.
- Default Cluster Driver: Select 570.86.15 or later. The 570.86.15 version is used as an example.
For more driver versions, see CCE AI Suite (NVIDIA GPU).
- Click Installation Using YAML, change the value of dra_mode in spec.values.custom to true, and click Submit.

- Check whether the driver has been installed.
- On the Add-ons page, view details about the installed CCE AI Suite (NVIDIA GPU) add-on.
Check whether the status of nvidia-driver-installer on the node is Running.

- Check whether the GPU driver has been properly loaded.
nvidia-smi
If the GPU information, such as the model, is displayed, the driver has been installed.

- (Optional) Prepare the CDI.
View the patch version of the cluster on the Overview page. If the patch version is v1.34.2-r0 or later, skip this step.
- Edit the containerd configuration file.
vi /etc/containerd/config.toml
- Add the following parameters to [plugins."io.containerd.grpc.v1.cri"]:
enable_cdi = true
cdi_spec_dirs = ["/etc/cdi", "/var/run/cdi"]
- Restart containerd to apply the changes.
systemctl restart containerd
- Check the containerd logs and ensure that CDI has been enabled.
cat /var/log/cce/containerd/containerd.log | grep -i cdi
If information similar to the following is displayed, CDI has been enabled:
EnableCDI:true CDISpecDirs:[/etc/cdi /var/run/cdi]
- Install NVIDIA DRA Driver.
Perform the following operations on a node:
- Install Helm. Helm 3.19.3 is used as an example.
curl -O https://get.helm.sh/helm-v3.19.3-linux-amd64.tar.gz
tar xvf helm-v3.19.3-linux-amd64.tar.gz
cp ./linux-amd64/helm /usr/local/bin/
helm version
If information similar to the following is displayed, the tool has been installed:
version.BuildInfo{Version:"v3.19.3", GitCommit:"0707f566a3f4ced24009ef14d67fe0ce69db****", GitTreeState:"clean", GoVersion:"go1.24.10"}
- Obtain the Helm template package.
helm fetch https://helm.ngc.nvidia.com/nvidia/charts/nvidia-dra-driver-gpu-25.8.0.tgz
tar xvf nvidia-dra-driver-gpu-25.8.0.tgz
cd nvidia-dra-driver-gpu
- Create a namespace.
kubectl create namespace nvidia-dra
- Install nvidia-dra-driver.
helm install nvidia-dra . --namespace nvidia-dra \
  --set resources.computeDomains.enabled=false \
  --set gpuResourcesEnabledOverride=true \
  --set image.repository="<SWR-image-address>" \
  --set image.tag="v25.8.0" \
  --set "kubeletPlugin.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms[0].matchExpressions[0].key=accelerator" \
  --set "kubeletPlugin.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms[0].matchExpressions[0].operator=Exists" \
  --set "kubeletPlugin.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms[0].matchExpressions[1].key=kubernetes.io/role" \
  --set "kubeletPlugin.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms[0].matchExpressions[1].operator=NotIn" \
  --set "kubeletPlugin.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms[0].matchExpressions[1].values[0]=virtual-kubelet" \
  --set "kubeletPlugin.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms[0].matchExpressions[1].values[1]=edge" \
  --set nvidiaDriverRoot="/usr/local/nvidia"
Modify the following parameters as needed:
- image.repository: Replace it with the address of k8s-dra-driver-gpu pushed to SWR.
- image.tag: Change the value based on the actual image tag, for example, v25.8.0.
- nvidiaDriverRoot: If an NVIDIA driver is not installed using the CCE AI Suite (NVIDIA GPU), you need to specify the actual driver installation path.
The expected output is as follows:
NAME: nvidia-dra
LAST DEPLOYED: xxx xxx xxx xx:xx:xx xxxx
NAMESPACE: nvidia-dra
STATUS: deployed
REVISION: 1
TEST SUITE: None
- Check whether device classes have been generated.
kubectl get deviceclass
The expected output is as follows:
NAME             AGE
gpu.nvidia.com   74s
mig.nvidia.com   74s
- Check whether resource slices have been generated.
kubectl get resourceslices
The expected output is as follows:
NAME                                 NODE            DRIVER           POOL            AGE
192.168.**.**-gpu.nvidia.com-5rrfv   192.168.**.**   gpu.nvidia.com   192.168.**.**   71s
- Check whether the NVIDIA DRA Driver works properly.
- Create a file named gpu-pod.yaml. You can name the file as needed.
vi gpu-pod.yaml
An example of the file content is as follows:
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: single-gpu
spec:
  spec:
    devices:
      requests:
      - name: gpu
        exactly:
          allocationMode: ExactCount
          deviceClassName: gpu.nvidia.com
          count: 1
---
apiVersion: v1
kind: Pod
metadata:
  name: pod1
  labels:
    app: pod
spec:
  containers:
  - name: ctr
    image: {{Ubuntu-image-downloaded-in-the-prerequisites}}:22.04
    command: ["bash", "-c"]
    args: ["nvidia-smi -L; trap 'exit 0' TERM; sleep 9999 & wait"]
    resources:
      claims:
      - name: gpu
  resourceClaims:
  - name: gpu
    resourceClaimTemplateName: single-gpu
- Create the pod.
kubectl create -f gpu-pod.yaml
- Check the pod status and verify that a GPU has been allocated.
kubectl get pod -owide
You can see that pod1 is in the Running state.
- View the pod logs and check whether the GPU has been mounted.
kubectl logs pod1
- View the GPU information.
kubectl exec -it pod1 -- /bin/bash
nvidia-smi
If the GPU information is returned, the device has been mounted and the NVIDIA DRA Driver is working properly.
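The resource-sharing feature described in the Solution section can also be verified with this setup: multiple containers in one pod can reference the same claim and securely share one GPU. A minimal sketch, assuming the single-gpu ResourceClaimTemplate has already been created; the pod and claim names are illustrative, and in practice the image would be the Ubuntu image pushed to SWR:

```yaml
# Both containers reference the same claim, so they share a single GPU.
apiVersion: v1
kind: Pod
metadata:
  name: shared-gpu-pod
spec:
  containers:
  - name: ctr1
    image: ubuntu:22.04
    command: ["bash", "-c", "nvidia-smi -L; sleep infinity"]
    resources:
      claims:
      - name: shared-gpu
  - name: ctr2
    image: ubuntu:22.04
    command: ["bash", "-c", "nvidia-smi -L; sleep infinity"]
    resources:
      claims:
      - name: shared-gpu
  resourceClaims:
  - name: shared-gpu
    resourceClaimTemplateName: single-gpu
```

Running `kubectl logs shared-gpu-pod -c ctr1` and `-c ctr2` should show the same GPU UUID in both containers.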
