Updated on 2025-08-19 GMT+08:00

CCE AI Suite (Ascend NPU)

Introduction

The CCE AI Suite (Ascend NPU) add-on supports and manages Huawei NPUs in containers. It provides functions such as automatic driver installation, device registration and scheduling, performance monitoring, and virtualized resource management. With this add-on, you can implement automatic deployment, refined scheduling, visualized monitoring, and virtualization of NPUs for diverse heterogeneous computing requirements. It is typically used in:

  • AI model training: It supports multi-NPU parallelism and refined resource scheduling to improve the efficiency and stability of large-scale model training.
  • AI inference and real-time services: It supports low-latency, on-demand computing to ensure real-time performance and service availability of inference tasks.
  • Resource isolation environments: It supports vNPU management to divide resources by granularity for compute isolation and quota control in multi-user scenarios.
  • Training task monitoring and resource optimization: It supports metric collection and visual analysis to monitor training task performance and optimize resource usage.

After this add-on is installed, you can create Ascend-accelerated nodes to quickly and efficiently run inference and image recognition workloads.

Notes and Constraints

  • To use Ascend-accelerated nodes in a cluster, the CCE AI Suite (Ascend NPU) add-on must be installed.
  • After an AI-accelerated node is migrated, the node will be reset. If the automatic driver installation function is enabled (supported only by the add-on of version 1.2.5 or later) and the driver corresponding to the NPU node model is selected for the CCE AI Suite (Ascend NPU) add-on of the destination cluster, the NPU driver will be automatically installed after the node migration.
    • If the add-on version is earlier than 1.2.8, you need to manually restart the node after the driver is installed for the driver to take effect.
    • If the add-on version is 1.2.8 or later, the node automatically restarts after the driver is installed, and the driver takes effect after the restart.
  • If the post-installation script of a node pool contains an NPU driver installation command, and automatic driver installation is enabled with the correct NPU driver model selected, the script and the npu-driver-installer pod will execute their installation commands concurrently during node pool scale-out. This can result in an unexpected driver version or installation failures. Therefore, when driver selection is enabled for huawei-npu, do not scale out a node pool whose Post-installation Command installs the NPU driver, and do not add such a command when creating a node pool.

Installing the Add-on

  1. Log in to the CCE console and click the cluster name to access the cluster console.
  2. In the navigation pane, choose Add-ons. In the right pane, find the CCE AI Suite (Ascend NPU) add-on and click Install.
  3. In the Metric-based Observation area, enable Use NPU-Exporter to Observe NPU Metrics. After this function is enabled, NPU-Exporter will be deployed as a DaemonSet on each NPU node.

  4. Determine whether to enable Auto Driver Installation (supported only when the add-on version is 1.2.5 or later).

    • Enabled: You can specify the driver version based on the NPU model for easier driver maintenance.
      After this function is enabled, the add-on automatically installs the driver of the specified version. By default, the recommended driver is used. You can also select Path to a custom driver from the drop-down list and enter a driver address.
      • The add-on installs the driver based on the driver version selected for the specified model. Such installation is only for nodes with no NPU driver installed. Nodes with an NPU driver installed remain unchanged. If you change the driver version when upgrading the add-on or updating add-on parameters, such change takes effect only on the nodes with no NPU driver installed.
      • After the driver is successfully installed, the node automatically restarts. To prevent service loss during the restart, you are advised to drain the node in advance and then install or upgrade the driver. For details about how to drain a node, see Draining a Node. For details about how to verify the driver installation, see How to Check Whether the NPU Driver Has Been Installed on a Node.
      • Uninstalling the add-on does not automatically delete the installed NPU driver. For details about how to uninstall the NPU driver, see Uninstalling the NPU Driver.
    • Disabled: Driver versions are decided by the system, and the drivers cannot be maintained using the add-on. When you add an NPU node on the console, the system adds the command to install an NPU driver (version and type decided by the system) and automatically restarts the node after the driver installation is complete. Adding an NPU node in another way, such as using an API, requires you to add the driver installation command to Post-installation Command.
    • The following table lists the supported NPUs and OS specifications.
      Table 1 Specification adaptation

        NPU Type: Snt3 (ascend-snt3)
        Supported OS: EulerOS 2.5 x86, CentOS 7.6 x86, EulerOS 2.9 x86, and EulerOS 2.8 Arm

      NOTE:

      The Snt3 Arm model supports up to EulerOS 2.8 Arm, which has now reached EOS. For details, see EOS Plan.

      CCE standard and Turbo clusters of v1.28 and later versions do not support EulerOS 2.8 Arm. To use NPUs in such clusters, select compatible NPUs by referring to Mappings Between Cluster Versions and OS Versions and Software Versions Required by Different Models. For details about the purchase process, see Lite Cluster Usage Process.

  5. Click Install.
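Once the add-on is installed, a quick way to confirm it is healthy is to check that its DaemonSets (npu-driver-installer and huawei-npu-device-plugin) report all pods ready. The sketch below parses simulated `kubectl get daemonset` output with simplified columns; on a real cluster, pipe the actual command output instead (for example, `kubectl get ds -n kube-system`, assuming the add-on components run in that namespace).

```shell
# Simulated 'kubectl get ds' output with simplified columns (NAME DESIRED CURRENT READY);
# replace this sample text with the real command output on a live cluster.
sample='NAME                       DESIRED   CURRENT   READY
npu-driver-installer       3         3         3
huawei-npu-device-plugin   3         3         3'

# A DaemonSet is healthy when its READY count matches its DESIRED count.
result=$(echo "$sample" | awk '
NR > 1 && $2 != $4 { bad = 1; print $1 " not ready" }
END { if (!bad) print "all add-on DaemonSets ready" }')
echo "$result"
```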

Components

Table 2 Add-on components

  • npu-driver-installer (DaemonSet): Installs the NPU driver on NPU nodes.
  • huawei-npu-device-plugin (DaemonSet): Allows containers to use Huawei NPUs.
  • npu-exporter (DaemonSet): Monitors and collects NPU metric data. You need to manually enable this component; after it is enabled, it runs on each NPU node.
  • ascend-vnpu-manager (DaemonSet): Enables node pool–level NPU virtualization and allows CCE to create vNPUs, facilitating more efficient utilization of NPU resources. After installing the add-on, go to Settings, click the Heterogeneous Resources tab, and enable NPU virtualization. The add-on will then deploy this component on the required nodes. For details, see NPU Virtualization.
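With huawei-npu-device-plugin running, NPUs are reported to Kubernetes as extended resources that pods can request. The following is a minimal pod sketch; the resource name huawei.com/ascend-310 and the image are assumptions for illustration only — check the resources actually reported by your nodes (for example, with kubectl describe node) before using it.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: npu-demo
spec:
  containers:
    - name: app
      image: your-inference-image        # placeholder image
      resources:
        requests:
          huawei.com/ascend-310: 1       # assumed extended-resource name; confirm on your nodes
        limits:
          huawei.com/ascend-310: 1
```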

NPU Metrics

In CCE standard and Turbo clusters of v1.25 or later, the NPU metrics listed below are exposed through the device plugin and can be collected, reported, and displayed on AOM.

  • cce_npu_memory_total (monitored per NPU card): Total NPU memory.
  • cce_npu_memory_used (monitored per NPU card): Used NPU memory.
  • cce_npu_utilization (monitored per NPU card): NPU compute usage.
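These metrics are emitted in Prometheus text format, so they can be processed with standard tooling. The snippet below derives a memory-usage percentage from two sample values; the metric names come from the table above, while the label set (`npu="0"`) and the sample numbers are illustrative assumptions — on a real node you would scrape the npu-exporter endpoint instead.

```shell
# Simulated exporter samples (Prometheus text format); on a real node, scrape the
# npu-exporter endpoint instead, e.g.: curl -s http://<node-ip>:<port>/metrics
metrics='cce_npu_memory_total{npu="0"} 32768
cce_npu_memory_used{npu="0"} 1024'

# Derive memory usage (%) as used / total * 100.
usage=$(echo "$metrics" | awk '
/^cce_npu_memory_total/ { total = $2 }
/^cce_npu_memory_used/  { used = $2 }
END { printf "%.1f", used / total * 100 }')
echo "memory used: ${usage}%"
```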

How to Check Whether the NPU Driver Has Been Installed on a Node

After the driver is successfully installed, restart the node for the driver to take effect. Until the restart, the driver does not work and NPU resources are unavailable. To check whether the driver is installed, perform the following operations:

  1. On the Add-ons page, click CCE AI Suite (Ascend NPU).

  2. Verify that the npu-driver-installer pod on the target node is in the Running state.

    If the node is restarted before the NPU driver is installed, the driver installation may fail, and a message is displayed on the Nodes page indicating that the driver is not ready. In this case, uninstall the NPU driver from the node and restart the npu-driver-installer pod to reinstall the NPU driver. After confirming that the driver is installed, restart the node. For details about how to uninstall the driver, see Uninstalling the NPU Driver.
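Driver presence can also be spot-checked on the node itself. The sketch below assumes that npu-smi ships with the Ascend driver and lands on the PATH after installation; treat it as a convenience check, not a replacement for the console check above.

```shell
# npu-smi is installed together with the Ascend NPU driver, so its presence
# is a reasonable proxy for a completed driver installation.
driver_installed() {
  command -v npu-smi >/dev/null 2>&1
}

if driver_installed; then
  echo "NPU driver present; run 'npu-smi info' to list the cards"
else
  echo "NPU driver not found on this node"
fi
```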

Uninstalling the NPU Driver

To uninstall the NPU driver, find the run package used during the last installation, run the bash {run package path} --uninstall command, and restart the node as prompted. The detailed steps are as follows:

  1. Log in to the node where the NPU driver needs to be uninstalled and find the /var/log/ascend_seclog/operation.log file.
  2. If the /var/log/ascend_seclog/operation.log file can be found, view the driver installation log to find the driver installation record.

    If the /var/log/ascend_seclog/operation.log file cannot be found, the driver may be installed using the npu_x86_latest.run or npu_arm_latest.run driver combined package. You can confirm this by checking whether the /usr/local/HiAI/driver/ directory exists.

    The combined package of the NPU driver is stored in the /root/d310_driver directory, and other driver installation packages are stored in the /root/npu-drivers directory.

  3. After finding the driver installation package, run the bash {run package path} --uninstall command to uninstall the driver. The following uses Ascend310-hdk-npu-driver_6.0.rc1_linux-x86-64.run as an example:

    bash /root/npu-drivers/Ascend310-hdk-npu-driver_6.0.rc1_linux-x86-64.run --uninstall

  4. Restart the node as prompted. (The installation and uninstallation of the current NPU driver take effect only after the node is restarted.)
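Steps 1 to 3 above can be sketched as a small helper that reads the operation log and prints the run package recorded by the latest installation. The default log path comes from this document; the parsing rule (last `.run` name in the log) is an assumption, so verify the result before uninstalling anything.

```shell
# Print the driver run package recorded in the operation log (last .run entry).
find_driver_pkg() {
  log="${1:-/var/log/ascend_seclog/operation.log}"
  if [ -f "$log" ]; then
    # The last .run file mentioned in the log corresponds to the latest installation.
    grep -o '[^ ]*\.run' "$log" | tail -n 1
  else
    echo "operation.log not found; the driver was likely installed from a combined package" >&2
    return 1
  fi
}

# Example (verify the package name first; path assumed from this document):
#   pkg=$(find_driver_pkg) && bash "/root/npu-drivers/$pkg" --uninstall
```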

Release History

Table 3 CCE AI Suite (Ascend NPU) add-on

Add-on Version (Supported Cluster Versions): New Features

  • 2.1.63 (v1.25, v1.27, v1.28, v1.29, v1.30, v1.31, v1.32): CCE clusters v1.32 are supported.
  • 2.1.53 (v1.25, v1.27, v1.28, v1.29, v1.30, v1.31): Fixed security vulnerabilities.
  • 2.1.46 (v1.21, v1.23, v1.25, v1.27, v1.28, v1.29, v1.30, v1.31): CCE clusters v1.31 are supported.
  • 2.1.23 (v1.21, v1.23, v1.25, v1.27, v1.28, v1.29, v1.30): Fixed some issues.
  • 2.1.22 (v1.21, v1.23, v1.25, v1.27, v1.28, v1.29, v1.30): Fixed display issues on some pages; hypernodes can be obtained; NPU topology can be reported; resolved log printing issues.
  • 2.1.14 (v1.21, v1.23, v1.25, v1.27, v1.28, v1.29, v1.30): Fixed some issues.
  • 2.1.7 (v1.21, v1.23, v1.25, v1.27, v1.28, v1.29): Resolved the issue that npu-smi failed to be automatically mounted to a service container.
  • 2.1.5 (v1.21, v1.23, v1.25, v1.27, v1.28, v1.29): CCE clusters v1.29 are supported; added silent fault codes.
  • 2.0.9 (v1.21, v1.23, v1.25, v1.27, v1.28): Fixed the issue that process-level fault recovery and annotation adding to workloads occasionally failed.
  • 2.0.5 (v1.21, v1.23, v1.25, v1.27, v1.28): CCE clusters v1.28 are supported; supported liveness probes; Ascend drivers can be automatically mounted to service containers.
  • 1.2.14 (v1.19, v1.21, v1.23, v1.25, v1.27): Supported NPU monitoring.
  • 1.2.6 (v1.19, v1.21, v1.23, v1.25): Supported automatic installation of NPU drivers.
  • 1.2.5 (v1.19, v1.21, v1.23, v1.25): Supported automatic installation of NPU drivers.
  • 1.2.4 (v1.19, v1.21, v1.23, v1.25): CCE clusters v1.25 are supported.
  • 1.2.2 (v1.19, v1.21, v1.23): CCE clusters v1.23 are supported.
  • 1.2.1 (v1.19, v1.21, v1.23): CCE clusters v1.23 are supported.
  • 1.1.8 (v1.15, v1.17, v1.19, v1.21): CCE clusters v1.21 are supported.
  • 1.1.2 (v1.15, v1.17, v1.19): Added the default seccomp profile.
  • 1.1.1 (v1.15, v1.17, v1.19): CCE clusters v1.15 are supported.
  • 1.1.0 (v1.17, v1.19): CCE clusters v1.19 are supported.
  • 1.0.8 (v1.13, v1.15, v1.17): Adapted to the Snt3 C75 drivers.
  • 1.0.6 (v1.13, v1.15, v1.17): Supported the C75 drivers.
  • 1.0.5 (v1.13, v1.15, v1.17): Allowed containers to use Huawei NPUs.
  • 1.0.3 (v1.13, v1.15, v1.17): Allowed containers to use Huawei NPUs.