Manual NPU Virtualization
In CCE, manual NPU virtualization enables node-level segmentation of NPUs, giving you manual control over how the resources of each NPU are allocated. This approach offers greater flexibility but requires more complex configuration, so it is best suited for scenarios that demand precise NPU resource allocation, such as services requiring dedicated compute or strict isolation guarantees.
Prerequisites
- There are NPU chips that support virtualization in the cluster. For details about the product types, see Supported NPU Chip Types.
- The CCE AI Suite (Ascend NPU) add-on of v2.1.15 or later has been installed in the cluster. For details, see CCE AI Suite (Ascend NPU).
- An NPU driver has been installed on the NPU nodes, and the driver version is 23.0.1 or later. To upgrade a driver, perform the following operations:
- To upgrade a driver, ensure that the NPU firmware is available on the node. Reinstalling the driver will restart the node. You are advised to drain the node before installing the driver. For details, see Draining a Node. VMs do not support firmware upgrade.
- To install the driver for all users in the OS during a driver upgrade, add the --install-for-all parameter, for example: ./Ascend-hdk-310p-npu-driver_x.x.x_linux-{arch}.run --full --install-for-all.
- If a driver upgrade fails, see "What Can I Do If an NPU Driver Fails to Be Upgraded?" in FAQs > "Chart and Add-on".
- Uninstall the original NPU driver. For details, see Uninstalling the NPU Driver.
- Go to Firmware and Drivers, select the corresponding product model, and download the driver installation package (in .run format) of 23.0.1 or later.
- Read Before You Start to learn about the restrictions and requirements for NPU driver installation, and install the driver by referring to Installing the Driver (.run). A consolidated command sketch of the upgrade flow follows this list.
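The driver upgrade flow above can be combined into a short command sketch. This is a minimal, non-authoritative example: the node name and the driver package name are placeholders, and the exact uninstall procedure depends on how the original driver was installed (see Uninstalling the NPU Driver).
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data        # drain the node first; driver installation restarts it
# Uninstall the original NPU driver on the node by following Uninstalling the NPU Driver.
chmod +x Ascend-hdk-310p-npu-driver_x.x.x_linux-{arch}.run                  # package downloaded from Firmware and Drivers
./Ascend-hdk-310p-npu-driver_x.x.x_linux-{arch}.run --full --install-for-all    # install the driver for all OS users
reboot                                                                       # restart the node so that the new driver takes effect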
Notes and Constraints
- CCE has verified NPU virtualization only on Ascend Snt3P3.
- Ascend AI products are delivered with virtualization templates, and NPUs can only be virtualized based on these predefined template specifications. For details, see Virtualization Templates. An NPU chip can be virtualized into multiple vNPUs using different templates, and the virtual instances can be flexibly combined as required, but the total resources used by all vNPUs on a chip cannot exceed the physical resources of that chip (see the worked example below). For the recommended specifications, see Virtual Instance Specifications.
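The following sketch illustrates the resource constraint with a combination that is illustrative only. It assumes an Atlas inference chip with eight AI Cores and reuses NPU ID 104 and chip ID 0 from the example in Step 1; the create-vnpu command itself is described in Step 1.
# Two vir02 instances and one vir04 instance use 2 + 2 + 4 = 8 AI Cores, exactly the physical limit of the chip.
npu-smi set -t create-vnpu -i 104 -c 0 -f vir02
npu-smi set -t create-vnpu -i 104 -c 0 -f vir02
npu-smi set -t create-vnpu -i 104 -c 0 -f vir04
# Creating one more vNPU (even a vir01 with a single AI Core) would exceed the physical limit and fail.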
Step 1: Create a vNPU
CCE standard and Turbo clusters allow you to create vNPUs as needed.
- Log in to the target node and access the cluster using kubectl. For details, see Accessing a Cluster Using kubectl.
- Obtain the basic information of the node. The information obtained by running the following command includes the NPU driver version, chip model, and resource usage, which serve as a basis for vNPU specification planning:
npu-smi info
The command output shows that the driver version is 24.1.rc2.3, there are two NPUs, each with one NPU chip, and the device IDs of the NPUs are 104 and 112.
+--------------------------------------------------------------------------------------------------------+
| npu-smi 24.1.rc2.3                               Version: 24.1.rc2.3                                    |
+-------------------------------+-----------------+------------------------------------------------------+
| NPU     Name                  | Health          | Power(W)     Temp(C)           Hugepages-Usage(page)  |
| Chip    Device                | Bus-Id          | AICore(%)    Memory-Usage(MB)                         |
+===============================+=================+======================================================+
| 104     xxx                   | OK              | NA           58                0    / 0               |
| 0       0                     | 0000:00:0D.0    | 0            1782 / 21527                            |
+===============================+=================+======================================================+
| 112     xxx                   | OK              | NA           53                0    / 0               |
| 0       1                     | 0000:00:0E.0    | 0            1786 / 21527                            |
+===============================+=================+======================================================+
+-------------------------------+-----------------+------------------------------------------------------+
| NPU     Chip                  | Process id      | Process name             | Process memory(MB)        |
+===============================+=================+======================================================+
| No running processes found in NPU 104                                                                   |
+===============================+=================+======================================================+
| No running processes found in NPU 112                                                                   |
+===============================+=================+======================================================+
- Check the virtualization templates supported by the current node and their respective resource specifications. You can split vNPUs using one or more templates.
npu-smi info -t template-info
The following information is displayed. vir01, vir02, vir02_1c, and other similar names are virtualization templates. The available templates vary by product. The following example is for reference only.
+------------------------------------------------------------------------------------------+
|NPU instance template info is:                                                            |
|Name                AICORE    Memory    AICPU    VPC     VENC    JPEGD                    |
|                              GB                 PNGD    VDEC    JPEGE                    |
|==========================================================================================|
|vir01               1         3         1        1       0       2                        |
|                                                 0       1       1                        |
+------------------------------------------------------------------------------------------+
|vir02               2         6         2        3       1       4                        |
|                                                 0       3       2                        |
+------------------------------------------------------------------------------------------+
|vir02_1c            2         6         1        3       0       4                        |
|                                                 0       3       2                        |
+------------------------------------------------------------------------------------------+
|vir04               4         12        4        6       2       8                        |
|                                                 0       6       4                        |
+------------------------------------------------------------------------------------------+
|vir04_3c            4         12        3        6       1       8                        |
|                                                 0       6       4                        |
+------------------------------------------------------------------------------------------+
|vir04_3c_ndvpp      4         12        3        0       0       0                        |
|                                                 0       0       0                        |
+------------------------------------------------------------------------------------------+
|vir04_4c_dvpp       4         12        4        12      3       16                       |
|                                                 0       12      8                        |
+------------------------------------------------------------------------------------------+
- Run the npu-smi set -t create-vnpu -i <id> -c <chip_id> -f <vnpu_config> [-v <vnpu_id>] [-g <vgroup_id>] command to create a vNPU.
npu-smi set -t create-vnpu -i 104 -c 0 -f vir02
Table 1 Parameters in this command

Parameter      Example Value    Description
id             104              Device ID (the NPU ID). How to obtain: run the npu-smi info -l command to obtain the NPU ID.
chip_id        0                ID of the NPU chip. How to obtain: run the npu-smi info -m command to obtain the chip ID.
vnpu_config    vir02            Name of the virtualization template. For details, see 3.
vnpu_id        -                (Optional) ID of the vNPU to be created.
vgroup_id      -                (Optional) ID of the virtual resource group (vGroup). The value ranges from 0 to 3. This parameter is only available for Atlas inference products. For details about vGroup, see Virtualization Modes.
If information similar to the following is displayed, the vNPU has been created:
Status : OK
Message : Create vnpu success
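If you need to control the vNPU ID or the vGroup explicitly, the optional parameters from Table 1 can be appended. This is a hedged sketch only: the vNPU ID 101 and vGroup 0 are illustrative values, and the accepted value ranges depend on your product and driver version.
npu-smi set -t create-vnpu -i 104 -c 0 -f vir01 -v 101 -g 0      # create a vir01 vNPU with an explicit vNPU ID and vGroup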
- Enable vNPU configuration recovery. Once this function is enabled, the system saves the vNPU configuration when the node is restarted, so the vNPUs remain valid after a restart.
- Run the following command to enable vNPU configuration recovery (changing 1 to 0 in the command disables it):
npu-smi set -t vnpu-cfg-recover -d 1
Information similar to the following is displayed:
Status : OK
Message : The VNPU config recover mode Enable is set successfully.
- Run the following command to check whether the vNPU configuration recovery is enabled:
npu-smi info -t vnpu-cfg-recover
If information similar to the following is displayed, the function has been enabled:
VNPU config recover mode : Enable
- Run the following command to view the created vNPU and the remaining resources in the NPU chip. In this command, 104 is the NPU ID and 0 is the chip ID; replace them as required.
npu-smi info -t info-vnpu -i 104 -c 0
The command output shows that one vNPU with ID 100 has been created from the vir02 template. The remaining resources (such as AI Cores and memory) on the NPU chip plus the resources allocated to the vir02 vNPU add up to the total physical resources of the chip: during NPU virtualization, the sum of each type of resource used by all vNPUs on an NPU chip cannot exceed that chip's physical resources.
+-------------------------------------------------------------------------------+
| NPU resource static info as follow:                                           |
| Format:Free/Total    NA: Currently, query is not supported.                   |
| AICORE    Memory    AICPU    VPC     VENC    VDEC    JPEGD    JPEGE    PNGD   |
|           GB                                                                  |
|===============================================================================|
| 6/8       15/21     5/7      9/12    2/3     9/12    12/16    6/8      NA/NA  |
+-------------------------------------------------------------------------------+
| Total number of vnpu: 1                                                       |
+-------------------------------------------------------------------------------+
| Vnpu ID     | Vgroup ID    | Container ID    | Status    | Template Name      |
+-------------------------------------------------------------------------------+
| 100         | 0            | 000000000000    | 0         | vir02              |
+-------------------------------------------------------------------------------+
Step 2: Restart the Component and Check Resource Reporting
After a vNPU is created, restart the huawei-npu-device-plugin component on the node to report NPU resources to Kubernetes.
- Run the following command to query all the pods for running the huawei-npu-device-plugin component:
kubectl get pods -A -o wide | grep huawei-npu-device-plugin
Below is the command output. The last two IP addresses on each line show the pod IP and the node on which that huawei-npu-device-plugin pod runs. Delete the pod running on the node where the vNPU was created to restart the component on that node; a single-command sketch is provided at the end of this step. In this example, the pod on the node whose IP address is 192.168.2.27 is deleted.
kube-system   huawei-npu-device-plugin-8lq64   1/1   Running   2 (4d7h ago)   4d8h   192.168.0.9     192.168.0.9     <none>   <none>
kube-system   huawei-npu-device-plugin-khkvr   1/1   Running   0              4d8h   192.168.0.131   192.168.0.131   <none>   <none>
kube-system   huawei-npu-device-plugin-rltx4   1/1   Running   0              4d8h   192.168.7.56    192.168.7.56    <none>   <none>
kube-system   huawei-npu-device-plugin-t9vxx   1/1   Running   1 (4d8h ago)   4d8h   192.168.0.72    192.168.0.72    <none>   <none>
kube-system   huawei-npu-device-plugin-c6x7    1/1   Running   0              3d2h   192.168.2.27    192.168.2.27    <none>   <none>
- Run the following command to delete the pod:
kubectl delete pod -n kube-system huawei-npu-device-plugin-c6x7
If information similar to the following is displayed, the pod has been deleted:
pod "huawei-npu-device-plugin-c6x7" deleted
- Run the following command to query the reported vNPU resources. After NPU virtualization, only the created vNPUs on a virtualized chip are reported as available; the chip's remaining resources are not reported to Kubernetes for use.
kubectl describe node 192.168.2.27
The command output shows that both the number of NPUs and the number of vNPUs are 1, indicating that one NPU chip has been virtualized and the other has not.
... ...
Capacity:
  cpu:                        32
  ephemeral-storage:          102683576Ki
  huawei.com/ascend-310:      1        # The number of NPUs
  huawei.com/ascend-310-2c:   1        # The number of vNPUs
  hugepages-1Gi:              0
  hugepages-2Mi:              0
  localssd:                   0
  localvolume:                0
  memory:                     131480656Ki
  pods:                       110
Allocatable:
  cpu:                        31850m
  ephemeral-storage:          94633183485
  huawei.com/ascend-310:      1        # The number of NPUs
  huawei.com/ascend-310-2c:   1        # The number of vNPUs
  hugepages-1Gi:              0
  hugepages-2Mi:              0
  localssd:                   0
  localvolume:                0
  memory:                     126616656Ki
  pods:                       110
... ...
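The query-and-delete sequence in this step can also be collapsed into a single command. This is a minimal sketch that assumes the vNPU was created on the node whose IP address is 192.168.2.27, as in the example; it only combines kubectl, grep, and awk, and the DaemonSet recreates the deleted pod automatically.
# Restart the huawei-npu-device-plugin pod on one node (the node IP is an example value).
kubectl -n kube-system delete pod $(kubectl -n kube-system get pods -o wide | grep huawei-npu-device-plugin | grep 192.168.2.27 | awk '{print $1}')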
Step 3: Use the Created vNPU
After vNPUs are created, you can specify vNPU resources for workloads in a YAML manifest to manage and configure resources flexibly. If you want to use Volcano Scheduler, its version must be 1.12.1 or later.
- Create a workload and request vNPU resources using the vir02 template.
- Create a YAML file named vnpu-worker.
vim vnpu-worker.yaml
Containers can request either NPU or vNPU resources, but not both at the same time.
Before using a vNPU, ensure that it has been created. If a vNPU is not created, an error is reported, for example, "0/2 nodes are available: 2 Insufficient huawei.com/ascend-310-2c".
kind: Deployment
apiVersion: apps/v1
metadata:
  name: vnpu-test
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vnpu-test
  template:
    metadata:
      labels:
        app: vnpu-test
    spec:
      schedulerName: kube-scheduler    # If the workload requires Volcano Scheduler, install the add-on and ensure that the add-on version is v1.12.1 or later.
      containers:
        - name: container-0
          image: nginx:latest
          resources:
            limits:
              cpu: 250m
              huawei.com/ascend-310-2c: '1'    # The number of vNPUs to be requested. The value is fixed at 1.
              memory: 512Mi
            requests:
              cpu: 250m
              huawei.com/ascend-310-2c: '1'    # The value is fixed at 1.
              memory: 512Mi
- A container can request only one vNPU; the vNPU quantity in both requests and limits is fixed at 1.
- The vNPU must be created on the node in advance, and there must be sufficient resources. If the vNPU resources are insufficient, an error message similar to "0/2 nodes are available: 2 Insufficient huawei.com/ascend-310-2c." is displayed.
- huawei.com/ascend-310-2c is the name of the requested vNPU resource. The vNPU name varies depending on the product and template; refer to the following table to obtain it, or query the node's allocatable resources as shown in the sketch after the table.
Table 2 vNPU names in different products

Product Type                                  Virtualization Template    vNPU Name
Atlas inference series (eight AI Cores)       vir01                      huawei.com/ascend-310-1c
                                              vir02                      huawei.com/ascend-310-2c
                                              vir02_1c                   huawei.com/ascend-310-2c.1cpu
                                              vir04                      huawei.com/ascend-310-4c
                                              vir04_3c                   huawei.com/ascend-310-4c.3cpu
                                              vir04_3c_ndvpp             huawei.com/ascend-310-4c.3cpu.ndvpp
                                              vir04_4c_dvpp              huawei.com/ascend-310-4c.4cpu.dvpp
Ascend training series (30 or 32 AI Cores)    vir16                      huawei.com/ascend-1980-16c
                                              vir08                      huawei.com/ascend-1980-8c
                                              vir04                      huawei.com/ascend-1980-4c
                                              vir02                      huawei.com/ascend-1980-2c
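If you are unsure which vNPU resource names a node actually exposes, you can also query the node directly. This is a simple sketch that reuses the example node 192.168.2.27 and standard kubectl output filtering.
# Show only the Ascend NPU/vNPU resource lines from the node description.
kubectl describe node 192.168.2.27 | grep huawei.com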
- Run the following command to create the workload:
kubectl apply -f vnpu-worker.yaml
Information similar to the following is displayed:
deployment/vnpu-test created
- Run the following command to check whether the pod is running:
kubectl get pod | grep vnpu-test
If the following information is displayed, the pod for the workload is running normally:
vnpu-test-6658cd795b-rx76t 1/1 Running 0 59m
- Run the following command to enter the container:
kubectl -n default exec -it vnpu-test-6658cd795b-rx76t -c container-0 -- /bin/bash
- Check whether the vNPU is mounted to the container.
- Run the following command to set the search path of the driver's shared libraries through an environment variable, which ensures that NPU-related applications can load the required libraries:
export LD_LIBRARY_PATH=/usr/local/HiAI/driver/lib64:/usr/local/Ascend/driver/lib64/common:/usr/local/Ascend/driver/lib64/driver:/usr/local/Ascend/driver/lib64
- Run the following command to view the vNPU mounted to the container:
npu-smi info
The command output shows that a vNPU on NPU 104 has been mounted to the container; its resources (for example, the reduced memory capacity) match the vir02 template.
+--------------------------------------------------------------------------------------------------------+
| npu-smi 24.1.rc2.3                               Version: 24.1.rc2.3                                    |
+-------------------------------+-----------------+------------------------------------------------------+
| NPU     Name                  | Health          | Power(W)     Temp(C)           Hugepages-Usage(page)  |
| Chip    Device                | Bus-Id          | AICore(%)    Memory-Usage(MB)                         |
+===============================+=================+======================================================+
| 104     xxx                   | OK              | NA           54                0    / 0               |
| 0       0                     | 0000:00:0D.0    | 0            445  / 5381                             |
+===============================+=================+======================================================+
+-------------------------------+-----------------+------------------------------------------------------+
| NPU     Chip                  | Process id      | Process name             | Process memory(MB)        |
+===============================+=================+======================================================+
| No running processes found in NPU 104                                                                   |
+===============================+=================+======================================================+
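The check above can also be run non-interactively from outside the container, which is convenient for scripting. This is a minimal sketch that reuses the pod name, container name, and library paths from this example.
# Set the library search path and run npu-smi inside the container in one step.
kubectl -n default exec vnpu-test-6658cd795b-rx76t -c container-0 -- sh -c 'export LD_LIBRARY_PATH=/usr/local/HiAI/driver/lib64:/usr/local/Ascend/driver/lib64/common:/usr/local/Ascend/driver/lib64/driver:/usr/local/Ascend/driver/lib64; npu-smi info'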
Step 4: Destroy the Created vNPU
Destroy a vNPU when it is no longer needed to release its resources. Before destroying a vNPU, ensure that no job is using it; otherwise, the destroy operation fails.
- Run the npu-smi set -t destroy-vnpu -i <id> -c <chip_id> -v <vnpu_id> command to destroy the vNPU.
npu-smi set -t destroy-vnpu -i 104 -c 0 -v 100
- If information similar to the following is displayed, the command is executed successfully:
Status : OK
Message : Destroy vnpu 100 success
- If the following information is displayed, there are jobs using the vNPU to be destroyed. Ensure that no job is using the vNPU to be destroyed and run the command again.
destroy vnpu 100 failed.
Usage: npu-smi set -t destroy-vnpu [Options...]
Options:
       -i %d    Card ID
       -c %d    Chip ID
       -v %d    Vnpu ID
- Restart the huawei-npu-device-plugin component on the corresponding node to report the information to Kubernetes. To do so, take the following steps:
- Run the following command to query all the pods for running the huawei-npu-device-plugin component:
kubectl get pods -A -o wide | grep huawei-npu-device-plugin
Below is the command output. The last two IP addresses on each line show the pod IP and the node on which that huawei-npu-device-plugin pod runs. Delete the pod running on the node where the vNPU was destroyed to restart the component on that node. In this example, the pod on the node whose IP address is 192.168.2.27 is deleted.
kube-system   huawei-npu-device-plugin-8lq64   1/1   Running   2 (4d7h ago)   4d8h   192.168.0.9     192.168.0.9     <none>   <none>
kube-system   huawei-npu-device-plugin-khkvr   1/1   Running   0              4d8h   192.168.0.131   192.168.0.131   <none>   <none>
kube-system   huawei-npu-device-plugin-rltx4   1/1   Running   0              4d8h   192.168.7.56    192.168.7.56    <none>   <none>
kube-system   huawei-npu-device-plugin-t9vxx   1/1   Running   1 (4d8h ago)   4d8h   192.168.0.72    192.168.0.72    <none>   <none>
kube-system   huawei-npu-device-plugin-tcmck   1/1   Running   0              3d2h   192.168.2.27    192.168.2.27    <none>   <none>
- Run the following command to delete the pod:
kubectl delete pod -n kube-system huawei-npu-device-plugin-tcmck
If information similar to the following is displayed, the pod has been deleted.
pod "huawei-npu-device-plugin-tcmck" deleted
- Run the following command to check whether the vNPU has been destroyed. If the number of NPUs is restored, the vNPU has been destroyed.
kubectl describe node 192.168.2.27
The command output shows that the number of NPUs is restored to 2 and the number of vNPUs is 0, indicating that the vNPU has been destroyed.
... ...
Capacity:
  cpu:                        32
  ephemeral-storage:          102683576Ki
  huawei.com/ascend-310:      2
  huawei.com/ascend-310-2c:   0
  hugepages-1Gi:              0
  hugepages-2Mi:              0
  localssd:                   0
  localvolume:                0
  memory:                     131480656Ki
  pods:                       110
Allocatable:
  cpu:                        31850m
  ephemeral-storage:          94633183485
  huawei.com/ascend-310:      2
  huawei.com/ascend-310-2c:   0
  hugepages-1Gi:              0
  hugepages-2Mi:              0
  localssd:                   0
  localvolume:                0
  memory:                     126616656Ki
  pods:                       110
... ...
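You can also verify on the node itself that no vNPU remains. This is a minimal sketch that reuses NPU ID 104 and chip ID 0 from the earlier examples.
# Query the vNPU information again. "Total number of vnpu: 0" and fully free resources indicate that the vNPU has been destroyed.
npu-smi info -t info-vnpu -i 104 -c 0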