Updated on 2025-11-04 GMT+08:00

Preparing the Inference Environment

Prerequisites

  • You have prepared Lite Server resources. For details, see Resource Specifications. Snt9b or Snt9b23 resources are recommended.
  • The container can access the public network, because git clone requires internet access during installation.

Step 1: Checking the Environment

  1. Log in to the server via SSH and check the NPU status. Obtain the NPU device information:
    npu-smi info                    # Run this command on each instance node to view the NPU status.
    npu-smi info -l | grep Total    # Run this command on each instance node to view the total number of NPUs and check whether these NPUs have been mounted.
    npu-smi info -t board -i 1 | egrep -i "software|firmware"   # Check the driver and firmware versions.

    If an error occurs, the NPU devices on the server may not be properly installed, or the NPU image may be mounted to another container. Install the firmware and driver or release the mounted NPUs.

    Table 1 lists the driver version requirements; a quick version check is sketched after the table. If the requirements are not met, upgrade the driver by referring to Configuring the Software Environment on the NPU Server.

    Table 1 HDK firmware & driver

    NPU                   HDK Firmware & Driver
    Snt9b23 (standard)    25.2.1
    Snt9b (standard)      25.2.1
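
    The following is a minimal check sketch for comparing the installed driver version with the requirement in Table 1. It assumes that the npu-smi board output contains a "Software Version" field (as queried by the egrep command above); adjust the field name and the required version if your output differs.

    required="25.2.1"    # Required version from Table 1.
    current=$(npu-smi info -t board -i 1 | grep -i "Software Version" | awk -F: '{gsub(/[ \t]/, "", $2); print $2}')
    echo "Installed driver version: ${current} (required: ${required})"   # Upgrade the driver if the versions differ.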

  2. Check whether Docker is installed.
    docker -v   # Check whether Docker is installed.

    If Docker is not installed, run this command:

    yum install -y docker-engine.aarch64 docker-engine-selinux.noarch docker-runc.aarch64
  3. Configure IP forwarding for network access from within containers. Run the following command to check the value of net.ipv4.ip_forward. Skip this step if the value is 1.
    sysctl -p | grep net.ipv4.ip_forward
    If the value is not 1, configure IP forwarding:
    sed -i 's/net\.ipv4\.ip_forward=0/net\.ipv4\.ip_forward=1/g' /etc/sysctl.conf
    sysctl -p | grep net.ipv4.ip_forward
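
    Note that sysctl -p only prints parameters that are set in /etc/sysctl.conf. If net.ipv4.ip_forward is not listed there, you can also check the runtime value directly:

    sysctl net.ipv4.ip_forward            # Prints the current kernel value, for example "net.ipv4.ip_forward = 1".
    cat /proc/sys/net/ipv4/ip_forward     # Equivalent check; prints 1 when IP forwarding is enabled.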

Step 2: Obtaining the Base Image

Use official images to deploy inference services. For details about the image path {image_url}, see Table 2.

docker pull {image_url}
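
For example (the image path below is illustrative; use the actual value from Table 2):

docker pull swr.example-repo.com/pytorch_ascend:pytorch_2.5.1-cann_8.1.rc2-py_3.11-hce_2.0.2503-aarch64-snt9b23
docker images   # Confirm that the image has been pulled.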

Step 3: Uploading the Code Package and Weight File

  1. Upload the inference code package AscendCloud-LLM-xxx.zip and operator package AscendCloud-OPP-xxx.zip to the host. For details about how to obtain the packages, see Table 3.
  2. Upload the weight file to the Lite Server. The weight file must be in Hugging Face format. For details about how to obtain the open-source weight file, see Supported Models.
  3. Save the weight file in a folder of your choice on the disk, and check that the disk has sufficient free space using this sample command (a sample upload sequence is shown after this list):
    df -h
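
The following is a minimal upload sketch; the server address, target paths, and model directory name are placeholders to replace with your actual values:

# Run on your local machine: upload the code and operator packages to the host.
scp AscendCloud-LLM-xxx.zip AscendCloud-OPP-xxx.zip root@${server_ip}:/home/AscendCloud/
# Upload the Hugging Face-format weight directory to the Lite Server.
scp -r ./${model_name} root@${server_ip}:${dir}/${model_name}
# On the server: check the disk space of the target directory.
df -h ${dir}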

Step 4: Creating an Inference Image

Run the script below to decompress the packages and build the inference image. Ensure that the environment is connected to the network and supports git clone.
# Move the packages to the build directory and decompress them in sequence to build a production image.
unzip AscendCloud-*.zip -d AscendCloud && cd AscendCloud \
&& unzip AscendCloud-OPP-*.zip && unzip AscendCloud-OPP-*-torch-*-py*.zip -d ./AscendCloud-OPP \
&& unzip AscendCloud-LLM*.zip -d ./AscendCloud-LLM \
&& cp AscendCloud-LLM/llm_inference/ascend_vllm/Dockerfile . \
&& docker build -t ${image} --build-arg BASE_IMAGE=${base_image} .

Parameters:

  • ${base_image}: base image name, that is, {image_url}. The value is obtained from Table 2.
  • ${image}: name of the newly created image. You can customize this name, but it must follow the format "image_name:image_tag". For example:

    pytorch_ascend:pytorch_2.5.1-cann_8.1.rc2-py_3.11-hce_2.0.2503-aarch64-snt9b23

After the execution is complete, the image required for inference is generated. Run docker images to find the generated image.
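
For example, assuming the illustrative image name above:

docker images | grep pytorch_ascend   # The newly built image should be listed with the tag you specified.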

Step 5: Starting the Container

Before starting the container, modify the parameters in ${} according to the parameter description below. If the container fails to start, Docker prints a corresponding error message. If it starts successfully, the container ID is printed and no error is reported. You can verify the container status as shown after the parameter list.

docker run -itd \
--device=/dev/davinci0 \
--device=/dev/davinci1 \
--device=/dev/davinci2 \
--device=/dev/davinci3 \
--device=/dev/davinci4 \
--device=/dev/davinci5 \
--device=/dev/davinci6 \
--device=/dev/davinci7 \
-v /etc/localtime:/etc/localtime  \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
-v /etc/ascend_install.info:/etc/ascend_install.info \
--device=/dev/davinci_manager \
--device=/dev/devmm_svm \
--device=/dev/hisi_hdc \
-v /var/log/npu/:/usr/slog \
-v /usr/local/sbin/npu-smi:/usr/local/sbin/npu-smi \
-v /sys/fs/cgroup:/sys/fs/cgroup:ro \
-v ${dir}:${container_model_path} \
--net=host \
--name ${container_name} \
${image_id} \
/bin/bash

Parameters:

  • --device=/dev/davinci0, ..., --device=/dev/davinci7: mounts NPUs. In the example, eight NPUs (davinci0 to davinci7) are mounted.
  • -v ${dir}:${container_model_path}: host directory to be mounted to the container.
    • The /home/ma-user directory cannot be mounted to the container. This directory is the home directory of user ma-user. If a host directory is mounted to /home/ma-user, it conflicts with the base image when the container starts, making the base image unavailable.
    • Both the driver and npu-smi must be mounted to the container.
    • Do not mount the same NPU to multiple containers. An NPU that is already mounted to one container cannot be used by containers started later.
  • --name ${container_name}: container name, which is used when you access the container. You can define a container name.
  • ${image_id}: ID of the Docker image, that is, the ID of the new image generated in Step 4: Creating an Inference Image. You can run the docker images command on the host to query the ID.
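
As mentioned above, you can verify that the container started successfully. A minimal check, assuming the container name you chose:

docker ps | grep ${container_name}    # The container should be listed with an "Up" status.
docker logs ${container_name}         # Inspect the startup logs if the container is not running.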

Step 6: Entering the Container

  1. Access the container.
    docker exec -it -u ma-user ${container_name} /bin/bash
  2. Evaluate inference resources. Run the following command in the container to retrieve the number of available NPUs:
    npu-smi info       # Before starting the inference service, ensure that the NPUs are not occupied, the required ports are free, and no residual processes remain.

    If an error occurs, the NPU devices on the server may not be properly installed, or the NPU image may be mounted to another container. Install the firmware and driver or release the mounted NPUs.

    If the driver version requirements are not met, upgrade the driver by referring to Configuring the Software Environment on the NPU Server. After the container is started, the default port number is 8080.

  3. Specify which NPUs in the container should be used. For example, to use the first NPU in the container, set the variable to 0:
    export ASCEND_RT_VISIBLE_DEVICES=0

    If multiple NPUs are needed for the service, arrange them sequentially according to their numbering in the container. For example, if the first and second NPUs are used, set it to 0,1:

    export ASCEND_RT_VISIBLE_DEVICES=0,1
    Run the npu-smi info command to determine the order of the NPUs in the container. For example, if two NPUs are detected and you want to use the first and second ones, set export ASCEND_RT_VISIBLE_DEVICES=0,1. Be careful: the values to set are the sequential positions in the container (0, 1, ...), not necessarily the device IDs (such as 4 or 5) shown in the output.
    Figure 1 Query result

    For details about how to start an inference service, see Starting an LLM-powered Inference Service.