Configuring the Software Environment on the NPU Server
Scenario
This section describes how to configure the environment on NPU-based Lite Servers, including disk merging and mounting, and Docker installation. Table 1 lists the configuration items.
The latest Lite Servers come with most settings preconfigured, so you can skip the corresponding steps.
Configuration Precautions
Before the configuration, pay attention to the following:
- During the first installation, once you have configured the basic information such as storage, firmware, driver, and network access, avoid changing it afterward.
- For developers who need to develop on a BMS, start an independent Docker container as your personal development environment. The Snt9b BMS provides eight PUs of compute resources, which multiple users can share for development and debugging. To avoid usage conflicts, allocate PUs to each user beforehand, and have each user develop in their own Docker container.
- ModelArts provides standard base container images, in which the basic MindSpore/PyTorch framework and the development and debugging tool chain are preset. You can use the image directly. Alternatively, you can use your own service images or images provided by AscendHub. If the software version preset in the image does not meet your requirements, you can install and replace it.
- Use the exposed SSH port to connect to the container in remote development mode (VSCode SSH Remote or Xshell) for development. You can mount your storage directory to the container to store code and data.
Configuring Server SSH Connection Timeout
- Log in to the Lite Server using SSH and check the timeout configuration.
echo $TMOUT
- If the value is 300, the SSH connection will be closed after 5 minutes of inactivity. You can configure the parameter to set a longer timeout interval. If the value is 0, skip this step. Run the following commands to configure the parameter:
vim /etc/profile
# At the end of the file, change the value of TMOUT from 300 to 0. The value 0 indicates that idle connections are not disconnected.
export TMOUT=0
- Run the following command for the configuration to take effect on the current terminal:
TMOUT=0
Running the export TMOUT=0 command sets the idle timeout of the SSH session to 0, so the connection will not be automatically disconnected due to idleness. For security purposes, SSH connections may otherwise be automatically disconnected after a period of inactivity. If you are running a task that requires a long-lived connection, run this command to prevent such disconnection.
You can run the TMOUT=0 command in the current terminal session, or add export TMOUT=0 to the /etc/profile file so that new sessions of all users are not disconnected due to idleness.
Do not configure TMOUT=0 in the production environment or on a public server, as it will bring certain security risks.
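The profile edit above can be scripted so it is safe to re-run. The following sketch works on a temporary copy of the profile file so it can be dry-run anywhere; point PROFILE at /etc/profile to apply it for real. The assumption that TMOUT is set via an `export TMOUT=` line matches the file shown earlier.

```shell
# Idempotently set TMOUT=0 in a profile file.
# Demonstrated on a temp copy; set PROFILE=/etc/profile to apply it for real.
PROFILE=$(mktemp)
echo 'export TMOUT=300' > "$PROFILE"

if grep -q '^export TMOUT=' "$PROFILE"; then
    # An export line already exists: rewrite it in place.
    sed -i 's/^export TMOUT=.*/export TMOUT=0/' "$PROFILE"
else
    # No export line yet: append one.
    echo 'export TMOUT=0' >> "$PROFILE"
fi

grep '^export TMOUT=' "$PROFILE"
```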
Merging and Mounting Disks
After you enable Lite Server resources, there may be multiple unmounted NVMe disks on the server. Before configuring the environment, you need to merge and mount the disks. Perform this operation first; otherwise, data you have already stored may be overwritten.
- Run lsblk to check whether there are three 7 TB disks that are not mounted.
As shown in Figure 1, nvme0n1, nvme1n1, and nvme2n1 are not mounted. As shown in Figure 2, the MOUNTPOINT column indicates the directory where each disk is mounted. If the disks are already mounted, skip this step and create a directory in /home.
- Edit the disk mounting script create_disk_partitions.sh. This script mounts /dev/nvme0n1 to /home for developers to create their own home directories, and mounts nvme1n1 and nvme2n1 to /docker for containers. If /docker does not have enough space, the root directory may be fully occupied when multiple users share the same Lite Server and create multiple container instances.
vim create_disk_partitions.sh
The following content shows the create_disk_partitions.sh script, which can be directly used without modification:
# ============================================================================
# Mount the nvme0n1 local disk to the /home directory.
# Combine the nvme1n1 and nvme2n1 local disks into a logical volume and mount
# it to the /docker directory. Set automatic mounting upon system startup.
# ============================================================================
set -e

# Mount nvme0n1 to the user directory.
mkfs -t xfs /dev/nvme0n1
mkdir -p /tmp/home
cp -r /home/* /tmp/home/
mount /dev/nvme0n1 /home
mv /tmp/home/* /home/
rm -rf /tmp/home

# Mount nvme1n1 and nvme2n1 to the /docker directory.
pvcreate /dev/nvme1n1
pvcreate /dev/nvme2n1
vgcreate nvme_group /dev/nvme1n1 /dev/nvme2n1
lvcreate -l 100%VG -n docker_data nvme_group
mkfs -t xfs /dev/nvme_group/docker_data
mkdir /docker
mount /dev/nvme_group/docker_data /docker

# Migrate Docker files to the new /docker directory.
systemctl stop docker
mv /var/lib/docker/* /docker
sed -i '/"default-runtime"/i\ "data-root": "/docker",' /etc/docker/daemon.json
systemctl start docker

# Enable automatic mounting upon system startup.
uuid=`blkid -o value -s UUID /dev/nvme_group/docker_data` && echo UUID=${uuid} /docker xfs defaults,nofail 0 0 >> /etc/fstab
uuid=`blkid -o value -s UUID /dev/nvme0n1` && echo UUID=${uuid} /home xfs defaults,nofail 0 0 >> /etc/fstab
mount -a
df -h

- Run the create_disk_partitions.sh script.
sh create_disk_partitions.sh
- After the configuration, run the df -h command to view the information about the mounted disks.
Figure 3 Viewing mounted disks
- After the disks are merged and mounted, you can create and name your own working directory in /home.
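After the script runs, it is worth confirming that both mount points received UUID-based fstab entries with the nofail option. Below is a minimal check, shown against a sample fstab string so it can be dry-run anywhere; replace the sample with the real content of /etc/fstab (the UUIDs here are placeholders).

```shell
# Sample fstab lines of the shape created by create_disk_partitions.sh.
FSTAB='UUID=1111-2222 /home xfs defaults,nofail 0 0
UUID=3333-4444 /docker xfs defaults,nofail 0 0'

for mp in /home /docker; do
    if echo "$FSTAB" | grep -qE "^UUID=[^ ]+ ${mp} xfs .*nofail"; then
        echo "${mp}: fstab entry present"
    else
        echo "${mp}: fstab entry MISSING"
    fi
done
```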
Installing the Driver and Firmware
- Check whether the npu-smi tool can be used properly. The firmware and driver installation can continue only when the npu-smi tool can be used properly. Run the command below. If the output is the same as that shown in Figure 4, the npu-smi tool is normal.
npu-smi info
If the command output is incomplete compared with the figure below (for example, an error is reported, or only the upper part of the output is displayed without the process information), restore the npu-smi tool and install the new firmware and driver versions. To do so, submit a service ticket to contact Huawei Cloud technical support.
- View the environment information. Run the following command to view the current firmware and driver versions:
npu-smi info -t board -i 1 | egrep -i "software|firmware"
Figure 5 Viewing the firmware and driver versions
firmware indicates the firmware version, and software indicates the driver version.
If the current versions do not meet your requirements and need to be changed, see the subsequent operations.
- Check the OS version and whether the architecture is AArch64 or x86_64, and obtain the firmware and driver packages from the official website. The firmware package is Ascend-hdk-<model>-npu-firmware_<version>.run and the driver package is Ascend-hdk-<model>-npu-driver_<version>_linux-aarch64.run. Only Huawei engineers and channel users have the permission to download the commercial version. For details, see the download link.
arch
cat /etc/os-release
Figure 6 Viewing the OS version and architecture
The following uses the packages that adapt to EulerOS 2.0 (SP10) and AArch64 as an example.
- Install the driver and firmware.
The installation sequence is vital.
- Initial installation: In scenarios where no driver is installed on a hardware device before delivery, or the installed driver and firmware have been uninstalled, install the driver first and then the firmware.
- Overwrite installation: In scenarios where the driver and firmware have been installed on a hardware device and you need to install them again, install the firmware first and then the driver.
Generally, the firmware and driver are pre-installed on Snt9b and Snt9b23 servers before delivery. So in this case, overwrite installation is used.
If the firmware and driver to be installed are of a lower version, ensure that npu-smi functions, and install the packages without the need to uninstall the existing versions.
Installation commands:
- Install the firmware and then restart the server.
chmod 700 *.run
# Replace with the actual package name.
./Ascend-hdk-<model>-npu-firmware_<version>.run --full
reboot
- Install the driver. Enter y as prompted.
# Replace with the actual package name.
./Ascend-hdk-<model>-npu-driver_<version>_linux-aarch64.run --full --install-for-all
- (Optional) Determine whether to restart the system as prompted. If yes, run the following command. If not, skip this step.
reboot
- After the installation, check the firmware and driver versions. If the output is normal, the installation is successful.
npu-smi info -t board -i 1 | egrep -i "software|firmware"
Figure 7 Checking the firmware and driver versions
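If you need the two versions in a script (for example, to gate an upgrade), they can be extracted from the npu-smi output. The sample lines below are illustrative stand-ins, not actual device output; the exact field names and spacing may differ on your hardware.

```shell
# Parse the driver (Software) and firmware versions from sample output of:
#   npu-smi info -t board -i 1 | egrep -i "software|firmware"
SAMPLE='Software Version                : 23.0.rc2
Firmware Version                : 6.4.12.1.241'

driver=$(echo "$SAMPLE" | awk -F': ' '/Software Version/ {print $2}')
firmware=$(echo "$SAMPLE" | awk -F': ' '/Firmware Version/ {print $2}')
echo "driver=${driver} firmware=${firmware}"
```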
Installing the Docker Environment
ModelArts Lite Server's public images already include Docker. To install Docker manually, follow these steps:
- Run the command below to check whether Docker has been installed. If Docker has been installed, skip this step. Figure 8 shows that Docker has been installed.
docker -v
If Docker is not installed, run the following command to install it:
yum install -y docker-engine.aarch64 docker-engine-selinux.noarch docker-runc.aarch64
After the installation is complete, run the docker -v command again to check whether Docker is installed.
- Configure IP forwarding for network access in containers.
Run the following command to check the value of net.ipv4.ip_forward. Skip this step if the value is 1.
sysctl -p | grep net.ipv4.ip_forward
If the value is not 1, run the following commands to configure IP forwarding:
sed -i 's/net\.ipv4\.ip_forward=0/net\.ipv4\.ip_forward=1/g' /etc/sysctl.conf
sysctl -p | grep net.ipv4.ip_forward
- Check whether Ascend-docker-runtime has been installed and configured in the environment.
docker info |grep Runtime
If the runtime is ascend in the output, the installation and configuration are complete. In this case, skip this step.
Figure 9 Querying Ascend-docker-runtime
If Ascend-docker-runtime is not installed, click here to install it. The software package is a Docker plugin provided by AI Compute Service. During Docker runtime, the driver paths of AI Compute Service can be automatically mounted to the container, so you do not need to specify --device during container startup. After the package is downloaded, upload it to the Lite Server and install it.
chmod 700 *.run
./Ascend-docker-runtime_<version>_linux-aarch64.run --install
For details, see Docker Runtime User Guide.
- Set the newly mounted disk as the path used by Docker containers.
Edit the /etc/docker/daemon.json file. If the file does not exist, create it.
vim /etc/docker/daemon.json
Add the two configurations shown in the following figure. To keep the JSON valid, add a comma (,) at the end of the insecure-registries line. data-root specifies the path where Docker data is stored. default-shm-size specifies the default shared memory size during container startup; the default value is 64 MB. Increase it if distributed training fails due to insufficient shared memory.
Figure 10 Configuring Docker
Save the configuration and run the following command to restart Docker for the configuration to take effect:
systemctl daemon-reload && systemctl restart docker
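For reference, a complete daemon.json after the edits in this section might look like the sketch below. The registry address and the ascend runtime path are assumptions for illustration only; keep the values your server already has. Writing the file to a temporary path first lets you validate the JSON before restarting Docker, since a malformed daemon.json prevents Docker from starting.

```shell
# Write a sample daemon.json to a temp file and validate it before use.
TMP_JSON=$(mktemp)
cat > "$TMP_JSON" <<'EOF'
{
    "insecure-registries": ["swr.example.myhuaweicloud.com"],
    "data-root": "/docker",
    "default-shm-size": "8G",
    "default-runtime": "ascend",
    "runtimes": {
        "ascend": {
            "path": "/usr/local/Ascend/Ascend-Docker-Runtime/ascend-docker-runtime",
            "runtimeArgs": []
        }
    }
}
EOF
# Validate that the file is well-formed JSON.
python3 -m json.tool "$TMP_JSON" > /dev/null && echo "daemon.json OK"
```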
Installing the pip Source
- Check whether pip has been installed and whether the access to the pip source is normal. If yes, skip this step.
pip install numpy
- If pip is not installed, run the following commands:
python -m ensurepip --upgrade
ln -s /usr/bin/pip3 /usr/bin/pip
- Configure the pip source.
mkdir -p ~/.pip
vim ~/.pip/pip.conf
Add the following information to the ~/.pip/pip.conf file:
[global]
index-url = http://mirrors.myhuaweicloud.com/pypi/web/simple
format = columns
[install]
trusted-host=mirrors.myhuaweicloud.com
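A quick way to confirm the file is well-formed is to parse it as INI. This sketch writes the same content to a temporary file and reads it back with Python's configparser; the temp-file path is only for the dry run, while the real file is ~/.pip/pip.conf.

```shell
# Write the pip.conf content to a temp file and parse it back.
CONF=$(mktemp)
cat > "$CONF" <<'EOF'
[global]
index-url = http://mirrors.myhuaweicloud.com/pypi/web/simple
format = columns
[install]
trusted-host = mirrors.myhuaweicloud.com
EOF
python3 - "$CONF" <<'PY'
import configparser, sys
cp = configparser.ConfigParser()
cp.read(sys.argv[1])
print("index-url =", cp["global"]["index-url"])
print("trusted-host =", cp["install"]["trusted-host"])
PY
```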
Testing the RoCE Network
The following RoCE network test only applies to Snt9b servers. For details about how to test the RoCE network of Snt9b23 servers, see Lite Server Node Fault Diagnosis.
- Install CANN Toolkit.
Check whether CANN Toolkit has been installed on the server. If the version number is displayed, it has been installed.
cat /usr/local/Ascend/ascend-toolkit/latest/aarch64-linux/ascend_toolkit_install.info
If it is not installed, obtain the software package from the official website. Common users can download the community edition. The commercial edition is restricted to Huawei engineers and channel users; download it from here.
Install CANN Toolkit. Replace the package name.
chmod 700 *.run
./Ascend-cann-toolkit_6.3.RC2_linux-aarch64.run --full --install-for-all
- Install mpich-3.2.1.tar.gz.
Click here to download the package and run the following commands to install it:
mkdir -p /home/mpich
mv /root/mpich-3.2.1.tar.gz /home/
cd /home/
tar -zxvf mpich-3.2.1.tar.gz
cd /home/mpich-3.2.1
./configure --prefix=/home/mpich --disable-fortran
make && make install
- Set environment variables and compile the HCCL operator.
export PATH=/home/mpich/bin:$PATH
export LD_LIBRARY_PATH=/home/mpich/lib/:/usr/local/Ascend/ascend-toolkit/latest/lib64:$LD_LIBRARY_PATH
cd /usr/local/Ascend/ascend-toolkit/latest/tools/hccl_test
make MPI_HOME=/home/mpich ASCEND_DIR=/usr/local/Ascend/ascend-toolkit/latest
After the operator is compiled, the following information is displayed.
Figure 11 Compiled operator
- Perform all_reduce_test in the single-node scenario.
For single-node single-PU, run the following command:
mpirun -n 1 ./bin/all_reduce_test -b 8 -e 1024M -f 2 -p 8
For single-node multi-PU, run the following command:
mpirun -n 8 ./bin/all_reduce_test -b 8 -e 1024M -f 2 -p 8
Figure 12 all_reduce_test
- Test the bandwidth of multi-node RoCE NICs.
- Check the Ascend RoCE IP address.
cat /etc/hccn.conf
Figure 13 Viewing Ascend RoCE IP address
- Perform the RoCE test.
In session 1, run the following commands on the receive end. The -i parameter specifies the PU ID.
hccn_tool -i 7 -roce_test reset
hccn_tool -i 7 -roce_test ib_send_bw -s 4096000 -n 1000 -tcp
In session 2, run the following commands on the sending end. The -i parameter specifies the PU ID, and the IP address at the end is that of the receive end.
cd /usr/local/Ascend/ascend-toolkit/latest/tools/hccl_test
hccn_tool -i 0 -roce_test reset
hccn_tool -i 0 -roce_test ib_send_bw -s 4096000 -n 1000 address 192.168.100.18 -tcp
The following figure shows the RoCE test result.
Figure 14 RoCE test result (receive end)
Figure 15 RoCE test result (sending end)
- If the RoCE bandwidth test has been started for a NIC, the following error message is displayed when the task is started again.
Figure 16 Error
Run the following command to stop the roce_test task and then start the task:
hccn_tool -i 7 -roce_test reset
Run the following command to query the NIC status:
for i in {0..7};do hccn_tool -i ${i} -link -g;done
Run the following command to check the IP address connectivity of the NIC on a common node:
for i in $(seq 0 7);do hccn_tool -i $i -net_health -g;done
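When the status loop above runs across all eight NICs, a one-line summary is often easier to read than eight raw outputs. The sketch below counts UP links from sample outputs; the exact "link status: UP" wording is an assumption about the hccn_tool output format, so adjust the pattern to match your device.

```shell
# Summarize link status for 8 NICs. SAMPLE stands in for the concatenated
# output of: for i in {0..7}; do hccn_tool -i ${i} -link -g; done
SAMPLE='link status: UP
link status: UP
link status: DOWN
link status: UP
link status: UP
link status: UP
link status: UP
link status: UP'

up=$(echo "$SAMPLE" | grep -c 'UP')
total=$(echo "$SAMPLE" | wc -l)
echo "${up}/${total} links UP"
```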
Creating a Containerized Custom Debugging Environment
You can start a Docker container on the physical machine (PM) for development. You can use your own service images or the base images provided by ModelArts, including Ascend+PyTorch and Ascend+MindSpore images.
- Prepare a service base image.
- Choose an image based on your environment.
# Container image matching Snt9b. The following shows an example.
docker pull swr.<region-code>.myhuaweicloud.com/atelier/<image-name>:<image-tag>
- Start the container image. If multiple users and containers are sharing a machine, allocate the PUs beforehand. Do not use PUs occupied by other containers.
# Start the container. Specify the container name and image information. ASCEND_VISIBLE_DEVICES specifies the PUs used by the container; for example, 0-1,3 indicates PUs 0, 1, and 3. A hyphen (-) specifies a range.
# -v /home:/home_host mounts the /home directory of the host to the /home_host directory of the container. Store code and data in this mounted directory for persistence.
docker run -itd --cap-add=SYS_PTRACE -e ASCEND_VISIBLE_DEVICES=0 -v /home:/home_host -p 51234:22 -u=0 --name <custom-container-name> <SWR-address-of-the-image-pulled-in-the-preceding-step> /bin/bash
- Access the container.
docker exec -ti <custom-container-name-in-the-last-command> bash
- Access the Conda environment.
source /home/ma-user/.bashrc
cd ~
- View the information of available PUs in the container.
npu-smi info
If the following error message is displayed, the PU specified by ASCEND_VISIBLE_DEVICES during container startup is occupied by another container. In this case, select another PU and restart the new container.
Figure 17 Error
- After you run npu-smi info and the output is normal, run the commands below to test the container environment. If the output is normal, the container environment is available.
- PyTorch image test:
python3 -c "import torch;import torch_npu; a = torch.randn(3, 4).npu(); print(a + a);"
- MindSpore image test:
# The run_check program of MindSpore does not adapt to Snt9b. Configure two environment variables first.
unset MS_GE_TRAIN
unset MS_ENABLE_GE
python -c "import mindspore;mindspore.set_context(device_target='Ascend');mindspore.run_check()"
# Restore the environment variables after the test for actual training.
export MS_GE_TRAIN=1
export MS_ENABLE_GE=1
Figure 18 Accessing the Conda environment and performing a test
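Before starting a container on a shared server, it helps to double-check exactly which PUs an ASCEND_VISIBLE_DEVICES value selects. This small sketch expands a spec such as 0-1,3 (hyphens for ranges, commas between items, as described in the startup step above) into individual PU IDs:

```shell
# Expand an ASCEND_VISIBLE_DEVICES spec into individual PU IDs.
spec="0-1,3"
ids=""
for part in $(echo "$spec" | tr ',' ' '); do
    case "$part" in
        *-*) ids="$ids $(seq -s ' ' "${part%-*}" "${part#*-}")" ;;  # expand a range
        *)   ids="$ids $part" ;;                                    # single PU
    esac
done
echo "PUs:${ids}"
```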
- (Optional) Configure SSH access for the container.
If you need to use the VS Code or SSH tool to directly connect to the container for development, perform the following operations:
- After accessing the container, run the SSH startup command to start the SSH service.
ssh-keygen -A
/usr/sbin/sshd
# Check whether SSH is started.
ps -ef | grep ssh
- Set a password for user root as prompted.
passwd
Figure 19 Setting a password for user root
- Run the exit command to exit the container and perform the SSH test on the host.
ssh root@<host-IP-address> -p 51234 # 51234 is the mapped port number from the docker run command.
Figure 20 Performing the SSH test
If the error message "Host key verification failed" is displayed when you perform the SSH container test on the host machine, delete the ~/.ssh/known_hosts file from the host machine and try again.
- Use VS Code SSH to connect to the container environment.
If you have not used VS Code SSH, install the VS Code environment and Remote-SSH plugin by referring to Step1 Manually Connecting to a Notebook Instance Through VS Code.
Open the VS Code terminal and run the following command to generate a key pair on the local computer. If you already have a key pair, skip this step.
ssh-keygen -t rsa
Add the public key to the authorization file of the remote server. Replace the server IP address and container port number.
cat ~/.ssh/id_rsa.pub | ssh root@<server-IP-address> -p <container-port-number> "mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys"
Open the Remote-SSH configuration file of VS Code and add the following SSH configuration items. Replace the server IP address and container port number.
Host Snt9b-dev
    HostName <server-IP-address>
    User root
    Port <SSH-port-number-of-the-container>
    IdentityFile ~/.ssh/id_rsa
    StrictHostKeyChecking no
    UserKnownHostsFile /dev/null
    ForwardAgent yes
Note: This configuration uses key-based login. To use a password instead, delete the IdentityFile line and enter the password as prompted during the connection.
After the connection, install the Python plugin. For details, see Install the Python Plug-in in the Cloud Development Environment.
- (Optional) Install CANN Toolkit.
CANN Toolkit has been installed in the preset images provided by ModelArts. If you need to use another version or use your own image that is not preset with CANN Toolkit, see the following operations.
- Check whether CANN Toolkit has been installed in the container. If the version number is displayed, it has been installed.
cat /usr/local/Ascend/ascend-toolkit/latest/aarch64-linux/ascend_toolkit_install.info
- If it is not installed or needs to be upgraded, obtain the software package from the official website. Common users can download the community edition. The commercial edition is restricted to Huawei engineers and channel users; download it from here.
Install CANN Toolkit. Replace the package name.
chmod 700 *.run
./Ascend-cann-toolkit_6.3.RC2_linux-aarch64.run --full --install-for-all
- If it has been installed but needs to be upgraded, run the following command. Replace the package name.
chmod 700 *.run
./Ascend-cann-toolkit_6.3.RC2_linux-aarch64.run --upgrade --install-for-all
- (Optional) Install MindSpore Lite.
MindSpore Lite has been installed in the preset image. If you need to use another version or use your own image that is not preset with MindSpore Lite, see the following operations.
- Check whether MindSpore Lite has been installed in the container. If the software information and version are displayed, it has been installed.
pip show mindspore-lite
- If it is not installed, download the .whl and .tar.gz packages from the official website. Replace the package names in the following commands.
pip install mindspore_lite-2.1.0-cp37-cp37m-linux_aarch64.whl
mkdir -p /usr/local/mindspore-lite
tar -zxvf mindspore-lite-2.1.0-linux-aarch64.tar.gz -C /usr/local/mindspore-lite --strip-components 1
- Configure the pip source.
The pip source has been configured in the preset image provided by ModelArts. To use your own service images, configure it by referring to Installing the pip Source.
- Configure a Yum repository.
- Configure the Yum repository in Huawei EulerOS.
# Back up the existing EulerOS.repo file and create a new one in the /etc/yum.repos.d/ directory.
cd /etc/yum.repos.d/
mv EulerOS.repo EulerOS.repo.bak
vim EulerOS.repo

# Configure the EulerOS.repo file based on the EulerOS version and system architecture. EulerOS 2.10 is used as an example.
[base]
name=EulerOS-2.0SP10 base
baseurl=https://mirrors.huaweicloud.com/euler/2.10/os/aarch64/
enabled=1
gpgcheck=1
gpgkey=https://mirrors.huaweicloud.com/euler/2.10/os/RPM-GPG-KEY-EulerOS

# Clear the existing Yum cache.
yum clean all
# Generate a new Yum cache.
yum makecache
# Perform a test.
yum update --allowerasing --skip-broken --nobest
- Configure a Yum repository in Huawei Cloud EulerOS.
# Download the new hce.repo file to the /etc/yum.repos.d/ directory.
wget -O /etc/yum.repos.d/hce.repo https://mirrors.huaweicloud.com/artifactory/os-conf/hce/hce.repo
# Clear the existing Yum cache.
yum clean all
# Generate a new Yum cache.
yum makecache
# Perform a test.
yum update --allowerasing --skip-broken --nobest
- To use the git clone and git lfs commands to download large models, perform the following operations:
- The Euler source does not provide the git-lfs package, so you need to download it manually. Enter the following address in the address box of the browser, download the git-lfs package, and upload it to the /home directory on the server. This directory is mounted to the /home_host directory of the container when the container is started, so the package can be used directly in the container.
https://github.com/git-lfs/git-lfs/releases/download/v3.2.0/git-lfs-linux-arm64-v3.2.0.tar.gz
- Go to the container and run the git-lfs installation commands.
cd /home_host
tar -zxvf git-lfs-linux-arm64-v3.2.0.tar.gz
cd git-lfs-3.2.0
sh install.sh
- Disable SSL verification for Git configuration.
git config --global http.sslVerify false
- The following commands use code in diffusers as an example. Replace the development directory.
# Clone the diffusers source code. You can specify a branch with the -b parameter. Replace the development directory.
cd /home_host/<user-directory>
mkdir sd
cd sd
git clone https://github.com/huggingface/diffusers.git -b v0.11.1-patch
Run git clone to download the model from Hugging Face. The following uses a stable-diffusion (SD) model as an example.
If the error "SSL_ERROR_SYSCALL" is reported during the download, try again. The download may take several hours due to network restrictions and large file sizes. If the download still fails after multiple retries, download the large files from the website and upload them to your personal development directory in /home on the server. To skip large files during the download, set GIT_LFS_SKIP_SMUDGE to 1.
git lfs install
git clone https://huggingface.co/runwayml/stable-diffusion-v1-5 -b onnx
Figure 21 Downloaded code
- If a container is used or shared by multiple users, you should restrict the container from accessing the OpenStack management address (169.254.169.254) to prevent host machine metadata acquisition. For details, see Forbidding Containers to Obtain Host Machine Metadata.
- Save the image in the container environment.
After the environment is configured, you can develop and debug the service code. To prevent the environment from being lost after the host is restarted, run the following commands to save the configured environment as a new image:
# Check the ID of the container to be saved as an image.
docker ps
# Save the image.
docker commit <container-ID> <custom-image-name>:<custom-image-tag>
# View the saved image.
docker images
# If you need to share the image with others in other environments, save it as a TAR file. This command takes a long time. After it finishes, view the file by running the ls command.
docker save -o <custom-name>.tar <image-name>:<image-tag>
# Load the file on another host. After it is loaded, you can view the image.
docker load --input <custom-name>.tar
For details about how to migrate services to Ascend for development and debugging, see the related documents.