Updated on 2024-05-09 GMT+08:00

CCE AI Suite (Ascend NPU)

Introduction

Ascend NPU is a device management add-on that supports Huawei NPUs in containers.

After the add-on is installed, you can create Ascend-accelerated nodes to run inference and image recognition workloads quickly and efficiently.

Constraints

  • To use Ascend-accelerated nodes in a cluster, the Ascend NPU add-on must be installed.
  • After an AI-accelerated node is migrated, the node is reset, and you must manually reinstall the NPU driver.

Installing the Add-on

  1. Log in to the CCE console and click the cluster name to access the cluster console. Choose Add-ons in the navigation pane, locate CCE AI Suite (Ascend NPU) on the right, and click Install.
  2. Set NPU parameters. The add-on uses the following parameters by default. These defaults suit most scenarios and require no changes.

    {
    	"check_frequency_failed_threshold": 100,
    	"check_frequency_fall_times": 3,
    	"check_frequency_gate": false,
    	"check_frequency_recover_threshold": 100,
    	"check_frequency_rise_times": 2,
    	"container_path": "/usr/local/HiAI_unused",
    	"host_path": "/usr/local/HiAI_unused"
    }

  3. Click Install.

Components

Table 1 Add-on components

Component             Description                                      Resource Type
--------------------  -----------------------------------------------  -------------
npu-driver-installer  Used for installing an NPU driver on NPU nodes.  DaemonSet

How to Check Whether the NPU Driver Has Been Installed on a Node

The NPU driver takes effect only after the node is restarted. Confirm that the driver has been installed, and then restart the node. Otherwise, the driver does not take effect and NPU resources are unavailable. To check whether the driver is installed, perform the following operations:

  1. On the Add-ons page, click CCE AI Suite (Ascend NPU).

  2. Verify that the node where npu-driver-installer is deployed is in the Running state.

    If the node is restarted before the NPU driver is installed, the driver installation may fail and a message is displayed on the Nodes page of the cluster indicating that the Ascend driver is not ready. In this case, uninstall the NPU driver from the node and restart the npu-driver-installer pod to reinstall the NPU driver. After confirming that the driver is installed, restart the node. For details about how to uninstall the driver, see Uninstalling the NPU Driver.
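
The check in step 2 can also be sketched from the command line. This is a minimal sketch, not a documented procedure: it assumes kubectl access to the cluster, that the Running state refers to the npu-driver-installer pods, and that the pod names begin with npu-driver-installer. The function name check_installer_pods and the kube-system namespace in the usage note are hypothetical.

```shell
# Reads `kubectl get pods` output on stdin (columns: NAME READY STATUS ...).
# Prints every npu-driver-installer pod whose STATUS is not "Running".
# Returns 0 only if at least one such pod was listed and all are Running.
# Assumption: pod names start with "npu-driver-installer".
check_installer_pods() {
    awk '
        $1 ~ /^npu-driver-installer/ {
            seen = 1
            if ($3 != "Running") {
                print $1 " is " $3
                bad = 1
            }
        }
        END { exit ((seen && !bad) ? 0 : 1) }
    '
}
```

For example, assuming the add-on pods run in the kube-system namespace: kubectl get pods -n kube-system -o wide | check_installer_pods && echo "all installer pods are Running". Restart the node only after this check passes.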

Uninstalling the NPU Driver

Log in to the node, obtain the driver operation records in the /var/log/ascend_seclog/operation.log file, and find the driver run package used in the last installation. If the log file does not exist, the driver was probably installed using the npu_x86_latest.run or npu_arm_latest.run driver combined package. After finding the driver installation package, run the bash {run package name} --uninstall command to uninstall the driver and restart the node as prompted.

  1. Log in to the node where the NPU driver needs to be uninstalled and find the /var/log/ascend_seclog/operation.log file.
  2. If the /var/log/ascend_seclog/operation.log file can be found, view the driver installation log to find the driver installation record.

    If the /var/log/ascend_seclog/operation.log file cannot be found, the driver may have been installed using the npu_x86_latest.run or npu_arm_latest.run driver combined package. You can confirm this by checking whether the /usr/local/HiAI/driver/ directory exists.

    The combined package of the NPU driver is stored in the /root/d310_driver directory, and other driver installation packages are stored in the /root/npu-drivers directory.

  3. After finding the driver installation package, run the bash {run package path} --uninstall command to uninstall the driver. The following uses Ascend310-hdk-npu-driver_6.0.rc1_linux-x86-64.run as an example:

    bash /root/npu-drivers/Ascend310-hdk-npu-driver_6.0.rc1_linux-x86-64.run --uninstall

  4. Restart the node as prompted. (The installation and uninstallation of the current NPU driver take effect only after the node is restarted.)
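
Steps 1 and 2 can be sketched as a small helper that lists candidate run packages. This is a minimal sketch under stated assumptions: the function name find_npu_run_packages is hypothetical, the packages are assumed to carry a .run suffix as in the example above, and the helper does not parse operation.log because its format is not described here. It only lists files and uninstalls nothing.

```shell
# Lists candidate NPU driver run packages on the node.
# Arguments default to the paths named in the steps above; they are
# parameters only so the helper can be tried against other locations.
find_npu_run_packages() {
    local log_file="${1:-/var/log/ascend_seclog/operation.log}"
    local combined_dir="${2:-/root/d310_driver}"
    local other_dir="${3:-/root/npu-drivers}"

    if [ -f "$log_file" ]; then
        # The log exists: read it to see which package was installed last,
        # then pick the matching file from the candidates printed below.
        echo "Operation log found: $log_file" >&2
        ls "$combined_dir"/*.run "$other_dir"/*.run 2>/dev/null || true
    else
        # No log: the driver was probably installed from a combined package.
        echo "No operation log; looking for a combined package" >&2
        ls "$combined_dir"/npu_x86_latest.run \
           "$combined_dir"/npu_arm_latest.run 2>/dev/null || true
    fi
}
```

After identifying the package, uninstall the driver with bash {run package path} --uninstall as described in step 3, and then restart the node.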

Change History

Table 2 Release history

Add-on Version  Supported Cluster Version                 New Feature
--------------  ----------------------------------------  ------------------------------------------------
2.1.5           v1.21, v1.23, v1.25, v1.27, v1.28, v1.29  • CCE 1.29 clusters are supported.
                                                          • Added silent fault codes.
2.0.9           v1.21, v1.23, v1.25, v1.27, v1.28         Fixed occasional failures of process-level fault
                                                          recovery and of adding annotations to workloads.
2.0.5           v1.21, v1.23, v1.25, v1.27, v1.28         • CCE 1.28 clusters are supported.
                                                          • Liveness probes are supported.
1.2.14          v1.19, v1.21, v1.23, v1.25, v1.27         NPU monitoring is supported.
1.2.6           v1.19, v1.21, v1.23, v1.25                Supports automatic installation of NPU drivers.
1.2.5           v1.19, v1.21, v1.23, v1.25                Supports automatic installation of NPU drivers.
1.2.4           v1.19, v1.21, v1.23, v1.25                CCE 1.25 clusters are supported.
1.2.2           v1.19, v1.21, v1.23                       CCE 1.23 clusters are supported.
1.2.1           v1.19, v1.21, v1.23                       CCE 1.23 clusters are supported.
1.1.8           v1.15, v1.17, v1.19, v1.21                CCE 1.21 clusters are supported.
1.1.2           v1.15, v1.17, v1.19                       Adds the default seccomp profile.
1.1.1           v1.15, v1.17, v1.19                       CCE 1.15 clusters are supported.
1.1.0           v1.17, v1.19                              CCE 1.19 clusters are supported.
1.0.8           v1.13, v1.15, v1.17                       Adapts to the D310 C75 driver.
1.0.6           v1.13, v1.15, v1.17                       Supports the Ascend C75 driver.
1.0.5           v1.13, v1.15, v1.17                       Allows containers to use Huawei NPU add-ons.
1.0.3           v1.13, v1.15, v1.17                       Allows containers to use Huawei NPU add-ons.