Preparing the Inference Environment
Prerequisites
- You have prepared Lite Server resources. For details, see Resource Specifications. Snt9b or Snt9b23 resources are recommended.
- The container can access the public network. git clone requires internet access during installation (see the sample check after this list).
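A minimal way to confirm outbound access on the host (the same check can be repeated inside the container later) is to probe a well-known endpoint. The URL is only an illustration; any reachable site or your internal mirror works as well:
curl -sI https://github.com | head -n 1 # An HTTP status line in the output indicates that outbound access works.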
Step 1: Checking the Environment
- Log in to the server via SSH and check the NPU status. Obtain the NPU device information:
npu-smi info # Run this command on each instance node to view the NPU status.
npu-smi info -l | grep Total # Run this command on each instance node to view the total number of NPUs and check whether they are all mounted.
npu-smi info -t board -i 1 | egrep -i "software|firmware" # Check the driver and firmware versions.
If an error occurs, the NPU devices on the server may not be properly installed, or the NPU devices may already be mounted to another container. Install the firmware and driver, or release the mounted NPUs.
Table 1 lists the driver version requirements. If the requirements are not met, upgrade the driver by referring to Configuring the Software Environment on the NPU Server.
- Check whether Docker is installed.
docker -v # Check whether Docker is installed.
If Docker is not installed, run this command:
yum install -y docker-engine.aarch64 docker-engine-selinux.noarch docker-runc.aarch64
- Configure IP forwarding for network access inside containers. Run the following command to check the value of net.ipv4.ip_forward. Skip this step if the value is 1.
sysctl -p | grep net.ipv4.ip_forward
If the value is not 1, configure IP forwarding:
sed -i 's/net\.ipv4\.ip_forward=0/net\.ipv4\.ip_forward=1/g' /etc/sysctl.conf
sysctl -p | grep net.ipv4.ip_forward
Step 2: Obtaining the Base Image
Use official images to deploy inference services. For details about the image path {image_url}, see Table 2.
docker pull {image_url}
Step 3: Uploading the Code Package and Weight File
- Upload the inference code package AscendCloud-LLM-xxx.zip and operator package AscendCloud-OPP-xxx.zip to the host. For details about how to obtain the packages, see Table 3.
- Upload the weight file to the Lite Server. The weight file must be in Hugging Face format. For details about how to obtain the open-source weight file, see Supported Models.
- Save the weight file in a folder of your choice on the disk and check that the disk has enough space for it, using this sample command (sample upload commands follow this step):
df -h
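The following commands are only a sketch of the upload and verification; the server address, the model name, and the target directories are placeholders that you should replace with your own values:
scp AscendCloud-LLM-xxx.zip AscendCloud-OPP-xxx.zip root@{server_ip}:/root/AscendCloud/ # Upload the code and operator packages (example path).
scp -r ./Qwen2-7B root@{server_ip}:/data/models/Qwen2-7B # Upload the Hugging Face weight folder (example model name and path).
ls -lh /data/models/Qwen2-7B # Run on the server to confirm that the files arrived and have the expected sizes.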
Step 4: Creating an Inference Image
# Move the package to the build directory and decompress the package in sequence to build a production image.
unzip AscendCloud-*.zip -d AscendCloud && cd AscendCloud \
&& unzip AscendCloud-OPP-*.zip && unzip AscendCloud-OPP-*-torch-*-py*.zip -d ./AscendCloud-OPP \
&& unzip AscendCloud-LLM*.zip -d ./AscendCloud-LLM \
&& cp AscendCloud-LLM/llm_inference/ascend_vllm/Dockerfile . \
&& docker build -t ${image} --build-arg BASE_IMAGE=$base_image .
Parameters:
- ${base_image}: base image name, that is, {image_url} obtained from Table 2.
- ${image}: name of the newly created image. You can customize this name, but it must follow the "image_name:image_tag" format. For example:
pytorch_ascend:pytorch_2.5.1-cann_8.1.rc2-py_3.11-hce_2.0.2503-aarch64-snt9b23
After the execution is complete, the image required for inference is generated. Run docker images to find the generated image.
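For example, you can filter the local image list to confirm that the build succeeded; the image name below is the example name from this section:
docker images | grep pytorch_ascend # The newly built inference image should appear in the output.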
Step 5: Starting the Container
Before starting the container, replace the parameters in ${} according to the parameter description below. If the container fails to start, Docker prints a corresponding error message; if it starts successfully, the command returns the new container ID without any error (a verification sketch follows the parameter list).
docker run -itd \
--device=/dev/davinci0 \
--device=/dev/davinci1 \
--device=/dev/davinci2 \
--device=/dev/davinci3 \
--device=/dev/davinci4 \
--device=/dev/davinci5 \
--device=/dev/davinci6 \
--device=/dev/davinci7 \
-v /etc/localtime:/etc/localtime \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
-v /etc/ascend_install.info:/etc/ascend_install.info \
--device=/dev/davinci_manager \
--device=/dev/devmm_svm \
--device=/dev/hisi_hdc \
-v /var/log/npu/:/usr/slog \
-v /usr/local/sbin/npu-smi:/usr/local/sbin/npu-smi \
-v /sys/fs/cgroup:/sys/fs/cgroup:ro \
-v ${dir}:${container_model_path} \
--net=host \
--name ${container_name} \
${image_id} \
/bin/bash
Parameters:
- --device=/dev/davinci0, ..., --device=/dev/davinci7: mounts NPUs. In the example, eight NPUs (davinci0 to davinci7) are mounted.
- -v ${dir}:${container_model_path}: host directory to be mounted to the container.
- The /home/ma-user directory cannot be mounted to the container because it is the home directory of user ma-user. If a host directory is mounted to /home/ma-user, the container conflicts with the base image at start-up, making the base image unavailable.
- Both the driver and npu-smi must be mounted to the container.
- Do not mount the same NPU to multiple containers. An NPU that is already mounted to one container cannot be used by other containers.
- --name ${container_name}: container name, which is used when you access the container. You can define a container name.
- ${image_id}: ID of the new image generated in Step 4: Creating an Inference Image. Run the docker images command on the host to query the ID.
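A quick verification sketch using the placeholders from this section; run these commands on the host after docker run:
docker images -q ${image} # Query the image ID to use as ${image_id}.
docker ps | grep ${container_name} # The container should be listed with the Up status.
docker logs ${container_name} # Check the start-up output if the container is not running.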
Step 6: Entering the Container
- Access the container.
docker exec -it -u ma-user ${container_name} /bin/bash
- Evaluate inference resources. Run the following command in the container to retrieve the number of available NPUs:
npu-smi info # Before starting the inference service, ensure that the NPUs are not occupied, the required ports are free, and no related processes are left over.
If an error occurs, the NPU devices on the server may not be properly installed, or the NPU devices may already be mounted to another container. Install the firmware and driver, or release the mounted NPUs.
If the driver version requirements in Table 1 are not met, upgrade the driver by referring to Configuring the Software Environment on the NPU Server. After the container is started, the default port number is 8080.
- Specify which NPUs in the container the service should use. For example, to use the first NPU in the container, set the variable to 0:
export ASCEND_RT_VISIBLE_DEVICES=0
If the service requires multiple NPUs, list them in the order in which they are numbered in the container. For example, to use the first and second NPUs, set the variable to 0,1:
export ASCEND_RT_VISIBLE_DEVICES=0,1
Run the npu-smi info command to determine the order of the NPUs in the container. For example, if two NPUs are detected and you want to use the first and second ones, set export ASCEND_RT_VISIBLE_DEVICES=0,1. Note that the indices refer to the order of the NPUs inside the container (starting from 0), not to the physical device IDs such as 4 or 5.
Figure 1 Query result
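As a quick sanity check inside the container before starting the service, you can confirm the setting; the indices are just the example values from this step:
echo $ASCEND_RT_VISIBLE_DEVICES # Should print the indices you exported, for example 0,1.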
For details about how to start an inference service, see Starting an LLM-powered Inference Service.