CCE AI Suite (Ascend NPU)
Introduction
The CCE AI Suite (Ascend NPU) add-on supports and manages Huawei NPUs in containers. It provides functions such as automatic driver installation, device registration and scheduling, performance monitoring, and virtualized resource management. With this add-on, you can implement automatic deployment, refined scheduling, visualized monitoring, and virtualization of NPUs for diverse heterogeneous computing requirements. It is typically used in:
- AI model training: It supports multi-NPU parallelism and refined resource scheduling to improve the efficiency and stability of large-scale model training.
- AI inference and real-time services: It supports low-latency, on-demand computing to ensure real-time performance and service availability of inference tasks.
- Resource isolation environments: It supports vNPU management to divide resources by granularity for compute isolation and quota control in multi-user scenarios.
- Training task monitoring and resource optimization: It supports metric collection and visual analysis to monitor training task performance and optimize resource usage.
After this add-on is installed, you can create Ascend-accelerated nodes to run inference and image recognition workloads quickly and efficiently.
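Once the add-on components are running, a workload requests NPUs through Kubernetes extended resources. The following is a minimal sketch: the resource name huawei.com/ascend-310 and the container image are assumptions (resource names vary by NPU model), so first check the names actually reported by your nodes.

# List the NPU resource names the node reports (names vary by NPU model).
kubectl describe node <npu-node-name> | grep -i ascend
# Minimal Pod requesting one NPU; replace the resource name with the one
# reported by your node, and the image with your own.
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: npu-test
spec:
  containers:
  - name: app
    image: swr.example.com/ai/inference:latest
    resources:
      limits:
        huawei.com/ascend-310: 1
EOF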
Notes and Constraints
- To use Ascend-accelerated nodes in a cluster, the CCE AI Suite (Ascend NPU) add-on must be installed.
- After an AI-accelerated node is migrated, the node will be reset. If the automatic driver installation function is enabled (supported only by the add-on of version 1.2.5 or later) and the driver corresponding to the NPU node model is selected for the CCE AI Suite (Ascend NPU) add-on of the destination cluster, the NPU driver will be automatically installed after the node migration.
- If the add-on version is earlier than 1.2.8, you need to manually restart the node after the driver installation for the driver to take effect.
- If the add-on version is 1.2.8 or later, the node is automatically restarted after the driver is installed, and the driver takes effect after the restart.
- If the post-installation script of a node pool contains an NPU driver installation command while automatic driver installation is enabled and the correct NPU driver model is selected, the driver installation commands from the script and from the npu-driver-installer pod run concurrently during node pool scale-out. This can result in an unexpected driver version or installation failures. Therefore, when the driver selection function is enabled for huawei-npu, you are advised not to scale out a node pool whose Post-installation Command installs the NPU driver, and not to configure such a Post-installation Command when creating a node pool.
Installing the Add-on
- Log in to the CCE console and click the cluster name to access the cluster console.
- In the navigation pane, choose Add-ons. In the right pane, find the CCE AI Suite (Ascend NPU) add-on and click Install.
- In the Metric-based Observation area, enable Use NPU-Exporter to Observe NPU Metrics. After this function is enabled, NPU-Exporter will be deployed on the NPU nodes as a DaemonSet. When using this component, pay attention to the following points:
- NPU-Exporter is supported when the add-on version is 2.1.55 or later. Additionally, this component requires the NPU driver of version 24.x or later. For details, see the steps for installing NPU drivers.
- After NPU-Exporter is enabled, if you need to report the collected NPU monitoring data to AOM, see Comprehensive Monitoring of NPU Metrics.
- Determine whether to enable Auto Driver Installation (supported only when the add-on version is 1.2.5 or later).
- Enabled: You can specify the driver version based on the NPU model for easier driver maintenance.
After this function is enabled, the add-on automatically installs the driver of the specified version. By default, the recommended driver is used. You can also select Path to a custom driver from the drop-down list and enter a driver address.
- The add-on installs the driver based on the driver version selected for the specified model. Such installation is only for nodes with no NPU driver installed. Nodes with an NPU driver installed remain unchanged. If you change the driver version when upgrading the add-on or updating add-on parameters, such change takes effect only on the nodes with no NPU driver installed.
- After the driver is successfully installed, the node automatically restarts. To prevent service loss during the restart, you are advised to drain the node in advance and then install or upgrade the driver (a command sketch follows these steps). For details about how to drain a node, see Draining a Node. For details about how to verify the driver installation, see How to Check Whether the NPU Driver Has Been Installed on a Node.
- Uninstalling the add-on does not automatically delete the installed NPU driver. For details about how to uninstall the NPU driver, see Uninstalling the NPU Driver.
- Disabled: Driver versions are decided by the system, and the drivers cannot be maintained using the add-on. When you add an NPU node on the console, the system adds the command to install an NPU driver (version and type decided by the system) and automatically restarts the node after the driver installation is complete. Adding an NPU node in another way, such as using an API, requires you to add the driver installation command to Post-installation Command.
- The following table lists the supported NPU models and OSs.

Table 1 Specification adaptation

| NPU Type | Supported OS |
|---|---|
| Snt3 (ascend-snt3) | EulerOS 2.5 x86, CentOS 7.6 x86, EulerOS 2.9 x86, and EulerOS 2.8 Arm |

NOTE: The Snt3 Arm model supports up to EulerOS 2.8 Arm, which has now reached EOS. For details, see EOS Plan. CCE standard and Turbo clusters of v1.28 and later do not support EulerOS 2.8 Arm. To use NPUs in such clusters, select compatible NPUs by referring to Mappings Between Cluster Versions and OS Versions and Software Versions Required by Different Models. For details about the purchase process, see Lite Cluster Usage Process.
- Click Install.
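Because installing or upgrading the driver restarts the node, drain the node first as recommended above. A minimal sketch, assuming the add-on components run in the kube-system namespace (an assumption; adjust to where your cluster deploys them):

# Drain the node before the driver-triggered restart to avoid service loss.
kubectl drain <npu-node-name> --ignore-daemonsets --delete-emptydir-data
# Confirm the driver installer pod on the node is running.
kubectl get pods -n kube-system -o wide | grep npu-driver-installer
# After the node restarts with the driver installed, allow scheduling again.
kubectl uncordon <npu-node-name>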
Components
| Component | Description | Resource Type |
|---|---|---|
| npu-driver-installer | Installs the NPU driver on NPU nodes. | DaemonSet |
| huawei-npu-device-plugin | Allows containers to use Huawei NPUs. | DaemonSet |
| npu-exporter | Monitors and collects NPU metric data. This component must be enabled manually; once enabled, it runs as a DaemonSet on each NPU node. | DaemonSet |
| ascend-vnpu-manager | Enables node pool-level NPU virtualization and allows CCE to create vNPUs, improving NPU resource utilization. After installing the add-on, go to Settings, click the Heterogeneous Resources tab, and enable NPU virtualization; the add-on then deploys this component on the required nodes. For details, see NPU Virtualization. | DaemonSet |
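All of these components run as DaemonSets, so their state can be inspected per node. A quick check, again assuming the kube-system namespace (an assumption; adjust as needed):

# List the add-on DaemonSets and confirm DESIRED equals READY on NPU nodes.
kubectl get daemonset -n kube-system | grep -E 'npu-driver-installer|huawei-npu-device-plugin|npu-exporter|ascend-vnpu-manager'
# Confirm that the device plugin has registered NPU resources on a node.
kubectl get node <npu-node-name> -o jsonpath='{.status.allocatable}'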
NPU Metrics
In CCE standard and Turbo clusters of v1.25 or later, the NPU metrics listed in the table below are exposed through device-plugin. They can be collected, reported, and displayed on AOM.
| Metric | Monitoring Level | Remarks |
|---|---|---|
| cce_npu_memory_total | NPU card | Total NPU memory |
| cce_npu_memory_used | NPU card | Used NPU memory |
| cce_npu_utilization | NPU card | NPU compute usage |
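AOM is the documented reporting target, but if the same metrics are also scraped into a Prometheus-compatible backend, they can be queried directly. A sketch, where the server address is a placeholder:

# Hypothetical Prometheus-compatible endpoint; replace with your own.
PROM=http://prometheus.example.com:9090
# Per-card memory usage ratio, derived from the two memory metrics above.
curl -sG "$PROM/api/v1/query" --data-urlencode 'query=cce_npu_memory_used / cce_npu_memory_total'
# Average compute usage across all NPU cards.
curl -sG "$PROM/api/v1/query" --data-urlencode 'query=avg(cce_npu_utilization)'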
How to Check Whether the NPU Driver Has Been Installed on a Node
After confirming that the driver has been installed, restart the node for the driver to take effect. Otherwise, the driver does not work and NPU resources are unavailable. To check whether the driver has been installed, perform the following operations:
- On the Add-ons page, click CCE AI Suite (Ascend NPU).
- Verify that the npu-driver-installer pod on the node is in the Running state.
If the node is restarted before the NPU driver is installed, the driver installation may fail, and a message is displayed on the Nodes page indicating that the driver is not ready. In this case, uninstall the NPU driver from the node and restart the npu-driver-installer pod to reinstall the NPU driver. After confirming that the driver is installed, restart the node. For details about how to uninstall the driver, see Uninstalling the NPU Driver.
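You can also verify the driver directly on the node. npu-smi ships with the Ascend driver; the version file path below is the common default install location and may differ with your driver package:

# Query the NPU cards; this fails if the driver is not installed or not in effect.
npu-smi info
# Check the installed driver version (path assumes the default install location).
cat /usr/local/Ascend/driver/version.info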
Uninstalling the NPU Driver
To uninstall the driver, log in to the node, obtain the driver operation records from the /var/log/ascend_seclog/operation.log file, and find the driver run package used in the last installation. If the log file does not exist, the driver was installed using the npu_x86_latest.run or npu_arm_latest.run combined driver package. After locating the driver installation package, run the bash {run package name} --uninstall command to uninstall the driver and restart the node as prompted.
- Log in to the node where the NPU driver needs to be uninstalled and find the /var/log/ascend_seclog/operation.log file.
- If the /var/log/ascend_seclog/operation.log file can be found, view the driver installation log to find the driver installation record.
If the /var/log/ascend_seclog/operation.log file cannot be found, the driver may be installed using the npu_x86_latest.run or npu_arm_latest.run driver combined package. You can confirm this by checking whether the /usr/local/HiAI/driver/ directory exists.
The combined package of the NPU driver is stored in the /root/d310_driver directory, and other driver installation packages are stored in the /root/npu-drivers directory.
- After finding the driver installation package, run the bash {run package path} --uninstall command to uninstall the driver. The following uses Ascend310-hdk-npu-driver_6.0.rc1_linux-x86-64.run as an example:
bash /root/npu-drivers/Ascend310-hdk-npu-driver_6.0.rc1_linux-x86-64.run --uninstall
- Restart the node as prompted. (The installation and uninstallation of the current NPU driver take effect only after the node is restarted.)
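Put together, the uninstallation flow looks like the sketch below. The grep pattern is an assumption about the operation log format; inspect the file directly if it does not match.

# Find the run package used in the last installation.
grep -i install /var/log/ascend_seclog/operation.log
# Uninstall with the package found above (example package name from this section).
bash /root/npu-drivers/Ascend310-hdk-npu-driver_6.0.rc1_linux-x86-64.run --uninstall
# The uninstallation takes effect only after the node is restarted.
reboot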
Release History
| Add-on Version | Supported Cluster Version | New Feature |
|---|---|---|
| 2.1.63 | v1.25, v1.27, v1.28, v1.29, v1.30, v1.31, v1.32 | CCE clusters v1.32 are supported. |
| 2.1.53 | v1.25, v1.27, v1.28, v1.29, v1.30, v1.31 | Fixed security vulnerabilities. |
| 2.1.46 | v1.21, v1.23, v1.25, v1.27, v1.28, v1.29, v1.30, v1.31 | CCE clusters v1.31 are supported. |
| 2.1.23 | v1.21, v1.23, v1.25, v1.27, v1.28, v1.29, v1.30 | Fixed some issues. |
| 2.1.22 | v1.21, v1.23, v1.25, v1.27, v1.28, v1.29, v1.30 | |
| 2.1.14 | v1.21, v1.23, v1.25, v1.27, v1.28, v1.29, v1.30 | Fixed some issues. |
| 2.1.7 | v1.21, v1.23, v1.25, v1.27, v1.28, v1.29 | Resolved the issue that npu-smi fails to be automatically mounted to a service container. |
| 2.1.5 | v1.21, v1.23, v1.25, v1.27, v1.28, v1.29 | |
| 2.0.9 | v1.21, v1.23, v1.25, v1.27, v1.28 | Fixed the issue that process-level fault recovery and adding annotations to workloads occasionally fail. |
| 2.0.5 | v1.21, v1.23, v1.25, v1.27, v1.28 | |
| 1.2.14 | v1.19, v1.21, v1.23, v1.25, v1.27 | Supported NPU monitoring. |
| 1.2.6 | v1.19, v1.21, v1.23, v1.25 | Supported automatic installation of NPU drivers. |
| 1.2.5 | v1.19, v1.21, v1.23, v1.25 | Supported automatic installation of NPU drivers. |
| 1.2.4 | v1.19, v1.21, v1.23, v1.25 | CCE clusters v1.25 are supported. |
| 1.2.2 | v1.19, v1.21, v1.23 | CCE clusters v1.23 are supported. |
| 1.2.1 | v1.19, v1.21, v1.23 | CCE clusters v1.23 are supported. |
| 1.1.8 | v1.15, v1.17, v1.19, v1.21 | CCE clusters v1.21 are supported. |
| 1.1.2 | v1.15, v1.17, v1.19 | Added the default seccomp profile. |
| 1.1.1 | v1.15, v1.17, v1.19 | CCE clusters v1.15 are supported. |
| 1.1.0 | v1.17, v1.19 | CCE clusters v1.19 are supported. |
| 1.0.8 | v1.13, v1.15, v1.17 | Adapted to the Snt3 C75 drivers. |
| 1.0.6 | v1.13, v1.15, v1.17 | Supported the C75 drivers. |
| 1.0.5 | v1.13, v1.15, v1.17 | Allowed containers to use Huawei NPUs. |
| 1.0.3 | v1.13, v1.15, v1.17 | Allowed containers to use Huawei NPUs. |