Preparing an Image
Step 1: Checking the Environment
- Log in to the server via SSH and check the NPUs. Obtain the NPU device information:
npu-smi info                                                  # Run this command on each instance node to view the NPU status.
npu-smi info -l | grep Total                                  # Run this command on each instance node to view the total number of NPUs and check whether they have been mounted.
npu-smi info -t board -i 1 | egrep -i "software|firmware"     # Check the driver and firmware versions.
If an error occurs, the NPU devices on the server may not be properly installed, or the NPUs may already be mounted to another container. Install the firmware and driver, or release the mounted NPUs.
For details about the driver version requirements, see Table 4. If the driver version does not meet the requirements, upgrade the driver by installing the firmware and driver.
- Check whether Docker is installed.
docker -v # Check whether Docker is installed.
If Docker is not installed, run this command:
yum install -y docker-engine.aarch64 docker-engine-selinux.noarch docker-runc.aarch64
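After installation, you can optionally confirm that the Docker daemon is running before continuing. These are standard systemd and Docker commands rather than steps taken from this guide:
systemctl enable docker   # Start Docker automatically on boot.
systemctl start docker    # Start the Docker daemon now.
docker info               # Verify that the daemon responds.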
- Configure IP forwarding for intra-container network access. Run the following command to check the value of net.ipv4.ip_forward. If the value is 1, skip this step.
sysctl -p | grep net.ipv4.ip_forward
If the value is not 1, configure IP forwarding:
sed -i 's/net\.ipv4\.ip_forward=0/net\.ipv4\.ip_forward=1/g' /etc/sysctl.conf
sysctl -p | grep net.ipv4.ip_forward
If the configuration item is not found, append it to the configuration file:
sed -i '$a\net.ipv4.ip_forward=1' /etc/sysctl.conf
sysctl -p | grep net.ipv4.ip_forward
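To apply the setting to the running kernel immediately, in addition to persisting it in /etc/sysctl.conf, standard sysctl usage is:
sysctl -w net.ipv4.ip_forward=1   # Takes effect immediately on the running system.
sysctl net.ipv4.ip_forward        # Confirm that the current value is 1.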
- Check whether the super pod IDs are the same.
for i in {0..7}; do npu-smi info -i $i -c 0 -t spod-info; done    # Check the super pod IDs.
Figure 1 Checking the super pod IDs
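If you prefer a compact check, the loop output can be reduced to the distinct IDs. The field name filtered below is an assumption about the npu-smi output format and may need adjusting for your driver version:
for i in {0..7}; do npu-smi info -i $i -c 0 -t spod-info; done | grep -i "super" | sort -u   # More than one distinct line indicates mismatched super pod IDs.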
Step 2: Obtaining the Base Image
Use official images to deploy training. For details about the image path {image_url}, see Table 4.
docker pull {image_url}
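After the pull completes, you can confirm that the image is available locally; the filter string below is only an example and should match the repository in your image path:
docker images | grep atelier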
Step 3: Creating a Training Image
Go to the folder (see key training files in the AscendCloud-LLM code package in Software Package Structure) containing the Dockerfile in the decompressed code directory and build the training image using the Dockerfile.
The installation requires cloning the Git repository online. Make sure your server has internet access.
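One quick way to verify connectivity before building is shown below; the URL is only an example and is not part of this guide:
curl -I https://github.com     # An HTTP response indicates outbound HTTPS access works; a timeout means the git clone steps in the build will fail.
echo $http_proxy $https_proxy  # If a proxy is required, these variables (or the --build-arg proxy options described below) must be set.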
- Go to the code directory containing the Dockerfile. ${work_dir} indicates the host directory storing the decompressed AscendCloud-LLM code. Change the directory based on your needs.
cd ${work_dir}/llm_train/AscendFactory
- Create an image.
docker build --build-arg install_type=xxx -t <image_name> .
If you cannot access the public network, set up a proxy and pass the proxy address through additional --build-arg parameters:
docker build --build-arg "https_proxy=http://xxx.xxx.xxx.xxx" --build-arg "http_proxy=http://xxx.xxx.xxx.xxx" --network=host --build-arg install_type=xxx -t <image_name> .
<image_name>: Custom image name. Example: pytorch_2_3_ascend:20241106
install_type: installation type. The value can be mindspeed-llm, llamafactory, verl, mindspeed-rl, or mindspeed-mm.
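For example, to build a LLaMA-Factory training image using the image name from the example above (the tag is illustrative only):
docker build --build-arg install_type=llamafactory -t pytorch_2_3_ascend:20241106 .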
- When you build an image with a Dockerfile, the default working directory for the code is ${work_dir}/llm_train/AscendFactory. You can either run the container image to edit the code directly in this directory or rebuild the image.
- Make sure the image name in the Dockerfile matches the one in Table 4 of this document before building the image. Update it if needed.
# Modify the following content:
FROM swr.cn-southwest-2.myhuaweicloud.com/atelier/xxx
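If you prefer to script this change instead of editing the file by hand, a sed one-liner such as the following can rewrite the FROM line. It assumes the Dockerfile contains a single FROM line, and the placeholder path must be replaced with the exact image from Table 4:
sed -i 's|^FROM .*|FROM swr.cn-southwest-2.myhuaweicloud.com/atelier/xxx|' Dockerfile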
Step 4: Starting the Container Image
- Before starting the container image, modify the parameters in ${} based on the parameter descriptions below. Add or modify parameters as needed. The following are sample commands for starting a container.
export work_dir="Custom working directory on the host"                   # Host directory to be mounted to the container. If SFS is mounted, the SFS mount directory can be used.
export container_work_dir="Custom working directory mounted to the container"
export container_name="Custom container name"
export image_name="Image name"
docker run -itd \
    --device=/dev/davinci_manager \
    --device=/dev/devmm_svm \
    --device=/dev/hisi_hdc \
    -e ASCEND_VISIBLE_DEVICES=0-15 \
    -v /usr/local/sbin/npu-smi:/usr/local/sbin/npu-smi \
    -v /usr/local/dcmi:/usr/local/dcmi \
    -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
    --cpus 320 \
    --memory 2048g \
    --shm-size 1024g \
    --net=host \
    -v ${work_dir}:${container_work_dir} \
    --name ${container_name} \
    $image_name \
    /bin/bash
Parameters:
- --name ${container_name}: container name, which is used when you access the container. You can define a container name, for example, ascendspeed.
- -v ${work_dir}:${container_work_dir}: host directory to be mounted to the container. The host and container use different file systems. work_dir indicates the working directory on the host. The directory stores files such as code and data required for training. container_work_dir indicates the directory to be mounted to the container. The two paths can be the same.
- The /home/ma-user directory cannot be mounted to the container. This directory is the home directory of the ma-user user.
- Both the driver and npu-smi must be mounted to the container.
- Do not assign one NPU to multiple containers. An NPU that is already assigned to a container cannot be used by containers started later.
- ${image_name} indicates the ID of the Docker image, which can be queried by running the docker images command on the host.
- --shm-size: shared memory size, which is used for communication between multiple processes. Converting large model weight files requires a large amount of shared memory, so set this parameter to 1,024 GB or larger.
- --cpus: number of CPU cores on the host. Generally, set this parameter to 192 for Snt9b servers and 320 for Snt9b23 servers.
- -e ASCEND_VISIBLE_DEVICES=0-7: NPU IDs. Generally, keep 0-7 for Snt9b servers and change it to 0-15 for Snt9b23 servers.
- --memory: Generally, set this parameter to 1024g for Snt9b servers and 2048g for Snt9b23 servers (the Snt9b variants of these flags are summarized after this list).
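For reference, an Snt9b (8-NPU) server would typically replace the Snt9b23 values in the sample command above with the smaller values described in this list; the rest of the docker run command is unchanged:
-e ASCEND_VISIBLE_DEVICES=0-7 --cpus 192 --memory 1024g --shm-size 1024g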
- Access the container using the container name. The default user is ma-user when the container is started.
docker exec -it ${container_name} bash
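Once inside the container, a quick check that the NPUs and the mounted npu-smi tool are visible (this relies on the -v mounts in the startup command above):
npu-smi info    # Should list the NPUs assigned through ASCEND_VISIBLE_DEVICES.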