CCE AI Suite (Ascend NPU)
Introduction
CCE AI Suite (Ascend NPU) supports and manages NPUs in containers.
After this add-on is installed, you can create Ascend-accelerated nodes to run inference and image recognition workloads quickly and efficiently.
Notes and Constraints
- To use Ascend-accelerated nodes in a cluster, the Ascend NPU add-on must be installed.
- Migrating an AI-accelerated node resets the node. After the migration, manually reinstall the NPU driver.
Installing the Add-on
- Log in to the CCE console and click the cluster name to access the cluster console. In the navigation pane, choose Add-ons, locate CCE AI Suite (Ascend NPU) on the right, and click Install.
- On the Install Add-on page, configure the specifications as needed. You can adjust the number of add-on instances and resource quotas as required.
- Determine whether to automatically install the driver (supported only when the add-on version is 1.2.5 or later).
- Enabled: You can specify the driver version based on the NPU model for easier driver maintenance.
After automatic driver installation is enabled, the add-on installs the driver based on the specified driver version. By default, the recommended driver is used. You can also select Custom driver from the drop-down list and enter a driver address.
- The add-on installs the driver version selected for the specified NPU model, but only on nodes that have no NPU driver installed. Nodes that already have an NPU driver remain unchanged. If you change the driver version when upgrading the add-on or updating add-on parameters, the change likewise takes effect only on nodes with no NPU driver installed.
- After the driver is successfully installed, restart the node for the driver to take effect. For details about how to check whether the driver is successfully installed, see How to Check Whether the NPU Driver Has Been Installed on a Node.
- Uninstalling the add-on does not automatically delete the installed NPU driver. For details about how to uninstall the NPU driver, see Uninstalling the NPU Driver.
- Disabled: Driver versions are decided by the system, and the drivers cannot be maintained using the add-on. When you add an NPU node on the console, the console adds the command to install the NPU driver (version and type decided by the system) and automatically restarts the node after the installation is complete. Adding an NPU node in another way, such as using an API, requires you to add the driver installation command to Post-installation Command.
- The supported NPU types and OSs are as follows:
NPU Type | Supported OS |
---|---|
D310 | EulerOS 2.5 x86, CentOS 7.6 x86, EulerOS 2.9 x86, and EulerOS 2.8 arm |
- Click Install.
Components
Component | Description | Resource Type |
---|---|---|
npu-driver-installer | Installs the NPU driver on NPU nodes. | DaemonSet |
huawei-npu-device-plugin | Allows containers to use Huawei NPU devices. | DaemonSet |
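Once the add-on is installed, the two DaemonSets listed in the table can be checked from the command line. The following is a minimal sketch rather than an official procedure. It assumes that kubectl is configured for the cluster and that the add-on's DaemonSets run in the kube-system namespace (an assumption; the namespace is not stated on this page). The helper name daemonsets_not_ready is hypothetical. It reads the output of kubectl get daemonset --no-headers and prints either component whose READY count differs from its DESIRED count.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `kubectl get daemonset --no-headers` output on stdin
# (columns: NAME DESIRED CURRENT READY ...) and prints the add-on DaemonSets
# whose READY count does not match DESIRED.
daemonsets_not_ready() {
  awk '($1 == "npu-driver-installer" || $1 == "huawei-npu-device-plugin") && $2 != $4 {
         print $1 ": " $4 "/" $2 " ready"
       }'
}

# Example usage (the kube-system namespace is an assumption):
#   kubectl get daemonset -n kube-system --no-headers | daemonsets_not_ready
```

No output means that both DaemonSets report all desired pods as ready, or that neither DaemonSet exists in the queried namespace, so it is worth confirming the namespace first.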
NPU Metrics
Metric | Monitoring Level | Remarks |
---|---|---|
cce_npu_memory_total | NPU cards | Total NPU memory |
cce_npu_memory_used | NPU cards | NPU memory usage |
cce_npu_utilization | NPU cards | NPU compute usage |
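The two memory metrics can be combined into a usage percentage. The sketch below is illustrative only: the helper name npu_memory_percent is hypothetical, and it assumes that cce_npu_memory_used and cce_npu_memory_total are reported in the same unit. The commented curl call further assumes that the metrics are collected by a Prometheus-compatible system reachable at a placeholder PROM_URL, and that both series carry identical labels so that the division matches them card by card.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints NPU memory usage as a percentage, given
# cce_npu_memory_used and cce_npu_memory_total values in the same unit.
npu_memory_percent() {
  local used="$1" total="$2"
  awk -v u="$used" -v t="$total" 'BEGIN {
    if (t + 0 <= 0) exit 1      # avoid dividing by a zero or missing total
    printf "%.1f\n", 100 * u / t
  }'
}

# Example usage with sample values:
#   npu_memory_percent 2048 8192    # prints 25.0
#
# The same ratio as a PromQL expression sent to a Prometheus-compatible
# query API (PROM_URL is a placeholder for your monitoring endpoint):
#   curl -sG "$PROM_URL/api/v1/query" \
#     --data-urlencode 'query=100 * cce_npu_memory_used / cce_npu_memory_total'
```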
How to Check Whether the NPU Driver Has Been Installed on a Node
After confirming that the driver is installed, restart the node. Until the node is restarted, the driver does not take effect and NPU resources are unavailable. To check whether the driver is installed, perform the following operations:
- On the Add-ons page, click CCE AI Suite (Ascend NPU).
- Verify that the node where npu-driver-installer is deployed is in the Running state.
If the node is restarted before the NPU driver is installed, the driver installation may fail and a message is displayed on the Nodes page of the cluster indicating that the Ascend driver is not ready. In this case, uninstall the NPU driver from the node and restart the npu-driver-installer pod to reinstall the NPU driver. After confirming that the driver is installed, restart the node. For details about how to uninstall the driver, see Uninstalling the NPU Driver.
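The same check can be approximated with kubectl. This is a minimal sketch, assuming that kubectl is configured for the cluster and that the npu-driver-installer pods run in the kube-system namespace (an assumption; the namespace is not stated on this page). The helper name installer_pods_not_running is hypothetical. It reads the output of kubectl get pods --no-headers and prints installer pods whose status is not Running.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `kubectl get pods --no-headers` output on stdin
# (columns: NAME READY STATUS ...) and prints npu-driver-installer pods whose
# STATUS is not Running.
installer_pods_not_running() {
  awk '$1 ~ /^npu-driver-installer/ && $3 != "Running" { print $1, $3 }'
}

# Example usage (the kube-system namespace is an assumption):
#   kubectl get pods -n kube-system --no-headers | installer_pods_not_running
#
# To restart an installer pod, delete it and let the DaemonSet recreate it
# (replace <pod-name> with a name printed above):
#   kubectl delete pod -n kube-system <pod-name>
```

Deleting the pod relies on standard DaemonSet behavior: the controller schedules a replacement pod on the same node, which then retries the driver installation.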
Uninstalling the NPU Driver
Log in to the node, check the driver operation records in the /var/log/ascend_seclog/operation.log file, and find the driver run package used in the last installation. If the log file does not exist, the driver may have been installed using the npu_x86_latest.run or npu_arm_latest.run combined driver package. After finding the driver installation package, run the bash {run package name} --uninstall command to uninstall the driver, and then restart the node as prompted.
- Log in to the node where the NPU driver needs to be uninstalled and find the /var/log/ascend_seclog/operation.log file.
- If the /var/log/ascend_seclog/operation.log file can be found, view the driver installation log to find the driver installation record.
If the /var/log/ascend_seclog/operation.log file cannot be found, the driver may have been installed using the npu_x86_latest.run or npu_arm_latest.run combined driver package. You can confirm this by checking whether the /usr/local/HiAI/driver/ directory exists.
The combined package of the NPU driver is stored in the /root/d310_driver directory, and other driver installation packages are stored in the /root/npu-drivers directory.
- After finding the driver installation package, run the bash {run package path} --uninstall command to uninstall the driver. The following uses Ascend310-hdk-npu-driver_6.0.rc1_linux-x86-64.run as an example:
bash /root/npu-drivers/Ascend310-hdk-npu-driver_6.0.rc1_linux-x86-64.run --uninstall
- Restart the node as prompted. (The installation and uninstallation of the current NPU driver take effect only after the node is restarted.)
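The lookup steps above can be sketched as two small shell helpers. This is an illustration under stated assumptions, not part of the official procedure: the paths come from the steps above, while the function names detect_install_source and list_run_packages and their optional arguments are hypothetical. The helpers only inspect the file system and do not uninstall anything.

```shell
#!/usr/bin/env bash
# Hypothetical helpers that mirror the lookup steps above. Paths come from the
# procedure; the function names and optional arguments are illustrative.

# Reports how the driver was probably installed.
detect_install_source() {
  local log_file="${1:-/var/log/ascend_seclog/operation.log}"
  local hiai_dir="${2:-/usr/local/HiAI/driver}"
  if [ -f "$log_file" ]; then
    echo "log"        # inspect the log for the run package used last
  elif [ -d "$hiai_dir" ]; then
    echo "combined"   # likely npu_x86_latest.run or npu_arm_latest.run
  else
    echo "unknown"
  fi
}

# Lists *.run packages found directly in the given directories.
list_run_packages() {
  local dir
  for dir in "$@"; do
    [ -d "$dir" ] || continue
    find "$dir" -maxdepth 1 -type f -name '*.run' | sort
  done
}

# Example usage on the node:
#   detect_install_source
#   list_run_packages /root/npu-drivers /root/d310_driver
```

After identifying the package, run the uninstall command manually as described in the steps, so that the package to remove is always confirmed by a person.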
Change History
Add-on Version | Supported Cluster Version | New Feature |
---|---|---|
2.1.23 | v1.21, v1.23, v1.25, v1.27, v1.28, v1.29, v1.30 | Fixed some issues. |
2.1.22 | v1.21, v1.23, v1.25, v1.27, v1.28, v1.29, v1.30 | |
2.1.14 | v1.21, v1.23, v1.25, v1.27, v1.28, v1.29, v1.30 | Fixed some issues. |
2.1.7 | v1.21, v1.23, v1.25, v1.27, v1.28, v1.29 | Fixed some issues. |
2.1.5 | v1.21, v1.23, v1.25, v1.27, v1.28, v1.29 | |
2.0.9 | v1.21, v1.23, v1.25, v1.27, v1.28 | Fixed an issue where process-level fault recovery and adding annotations to workloads occasionally failed. |
2.0.5 | v1.21, v1.23, v1.25, v1.27, v1.28 | |
1.2.14 | v1.19, v1.21, v1.23, v1.25, v1.27 | Supported NPU monitoring. |
1.2.6 | v1.19, v1.21, v1.23, v1.25 | Supports automatic installation of NPU drivers. |
1.2.5 | v1.19, v1.21, v1.23, v1.25 | Supports automatic installation of NPU drivers. |
1.2.4 | v1.19, v1.21, v1.23, v1.25 | CCE clusters 1.25 are supported. |
1.2.2 | v1.19, v1.21, v1.23 | CCE clusters 1.23 are supported. |
1.2.1 | v1.19, v1.21, v1.23 | CCE clusters 1.23 are supported. |
1.1.8 | v1.15, v1.17, v1.19, v1.21 | CCE clusters 1.21 are supported. |
1.1.2 | v1.15, v1.17, v1.19 | Adds the default seccomp profile. |
1.1.1 | v1.15, v1.17, v1.19 | CCE clusters 1.15 are supported. |
1.1.0 | v1.17, v1.19 | CCE clusters 1.19 are supported. |
1.0.8 | v1.13, v1.15, v1.17 | Adapts to the D310 C75 driver. |
1.0.6 | v1.13, v1.15, v1.17 | Supports the Ascend C75 driver. |
1.0.5 | v1.13, v1.15, v1.17 | Allows containers to use Huawei NPU add-ons. |
1.0.3 | v1.13, v1.15, v1.17 | Allows containers to use Huawei NPU add-ons. |