Configuring the Software Environment on the NPU Server
Scenario
This section describes how to configure the environment on NPU-based Lite Servers, including disk merging and mounting, and Docker installation. Table 1 lists the configuration items.
The latest Lite Servers come with most settings preconfigured, so you can skip the corresponding steps.
Configuration Precautions
Before the configuration, pay attention to the following:
- During the first installation, once you have configured the basic information such as storage, firmware, driver, and network access, avoid changing it afterward.
- For developers who need to develop on a BMS, start an independent Docker container as your personal development environment. The Snt9b BMS provides eight PUs of compute resources, which multiple users can share for development and debugging. To avoid usage conflicts, allocate PUs to each user beforehand, and have each user develop in their own Docker container.
- ModelArts provides standard base container images, in which the basic MindSpore/PyTorch framework and the development and debugging tool chain are preset. You can use the image directly. Alternatively, you can use your own service images or images provided by AscendHub. If the software version preset in the image does not meet your requirements, you can install and replace it.
- Use the exposed SSH port to connect to the container in remote development mode (VSCode SSH Remote or Xshell) for development. You can mount your storage directory to the container to store code and data.
Configuring Server SSH Connection Timeout
- Log in to the Lite Server using SSH and check the timeout configuration.
echo $TMOUT
- If the value is 300, the SSH connection will be closed after 5 minutes of inactivity. You can configure the parameter to set a longer timeout interval. If the value is 0, skip this step. Run the following commands to configure the parameter:
vim /etc/profile
# At the end of the file, change the value of TMOUT from 300 to 0. The value 0 indicates that idle connections are not disconnected.
export TMOUT=0
- Run the following command for the configuration to take effect on the current terminal:
TMOUT=0
Running the export TMOUT=0 command sets the idle timeout of the SSH session to 0, so the connection will not be automatically disconnected due to idleness. For security purposes, SSH connections may otherwise be automatically disconnected after a period of inactivity. If you are running a task that requires a long-lived connection, run this command to prevent such disconnection.
You can run the TMOUT=0 command in the current terminal session, or add export TMOUT=0 to the /etc/profile file so that new sessions of all users are not disconnected due to idleness.
Do not configure TMOUT=0 in the production environment or on a public server, as it will bring certain security risks.
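The profile edit above can be scripted so it is safe to re-run. The following sketch works on a temporary copy of the profile file so it can be dry-run anywhere; point PROFILE at /etc/profile to apply it for real. The assumption that TMOUT is set via an `export TMOUT=` line matches the file shown earlier.

```shell
# Idempotently set TMOUT=0 in a profile file.
# Demonstrated on a temp copy; set PROFILE=/etc/profile to apply it for real.
PROFILE=$(mktemp)
echo 'export TMOUT=300' > "$PROFILE"

if grep -q '^export TMOUT=' "$PROFILE"; then
    # An export line already exists: rewrite it in place.
    sed -i 's/^export TMOUT=.*/export TMOUT=0/' "$PROFILE"
else
    # No export line yet: append one.
    echo 'export TMOUT=0' >> "$PROFILE"
fi

grep '^export TMOUT=' "$PROFILE"
```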
Merging and Mounting Disks
After you enable Lite Server resources, there may be multiple unmounted NVMe disks on the server. Before configuring the environment, you need to merge and mount the disks. Perform this operation first; otherwise, data you have already stored may be overwritten.
- Run lsblk to check whether there are three 7 TB disks that are not mounted.
As shown in Figure 1, nvme0n1, nvme1n1, and nvme2n1 are not mounted. As shown in Figure 2, the MOUNTPOINT column indicates the directory where each disk is mounted. If the disks are already mounted, skip this step and create a directory in /home.
- Edit the disk mounting script create_disk_partitions.sh. This script mounts /dev/nvme0n1 to /home for developers to create their own home directories, and mounts nvme1n1 and nvme2n1 to /docker for containers. If /docker does not have enough space, the root directory may be fully occupied when multiple users share the same Lite Server and create multiple container instances.
vim create_disk_partitions.sh
The following content shows the create_disk_partitions.sh script, which can be directly used without modification:
# ============================================================================
# Mount the nvme0n1 local disk to the /home directory.
# Combine the nvme1n1 and nvme2n1 local disks into a logical volume and mount
# it to the /docker directory. Set automatic mounting upon system startup.
# ============================================================================
set -e

# Mount nvme0n1 to the user directory.
mkfs -t xfs /dev/nvme0n1
mkdir -p /tmp/home
cp -r /home/* /tmp/home/
mount /dev/nvme0n1 /home
mv /tmp/home/* /home/
rm -rf /tmp/home

# Mount nvme1n1 and nvme2n1 to the /docker directory.
pvcreate /dev/nvme1n1
pvcreate /dev/nvme2n1
vgcreate nvme_group /dev/nvme1n1 /dev/nvme2n1
lvcreate -l 100%VG -n docker_data nvme_group
mkfs -t xfs /dev/nvme_group/docker_data
mkdir /docker
mount /dev/nvme_group/docker_data /docker

# Migrate Docker files to the new /docker directory.
systemctl stop docker
mv /var/lib/docker/* /docker
sed -i '/"default-runtime"/i\ "data-root": "/docker",' /etc/docker/daemon.json
systemctl start docker

# Enable automatic mounting upon system startup.
uuid=`blkid -o value -s UUID /dev/nvme_group/docker_data` && echo UUID=${uuid} /docker xfs defaults,nofail 0 0 >> /etc/fstab
uuid=`blkid -o value -s UUID /dev/nvme0n1` && echo UUID=${uuid} /home xfs defaults,nofail 0 0 >> /etc/fstab
mount -a
df -h

- Run the create_disk_partitions.sh script.
sh create_disk_partitions.sh
- After the configuration, run the df -h command to view the information about the mounted disks.
Figure 3 Viewing mounted disks
- After the disks are merged and mounted, you can create and name your own working directory in /home.
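After the script runs, it is worth confirming that both mount points received UUID-based fstab entries with the nofail option. Below is a minimal check, shown against a sample fstab string so it can be dry-run anywhere; replace the sample with the real content of /etc/fstab (the UUIDs here are placeholders).

```shell
# Sample fstab lines of the shape created by create_disk_partitions.sh.
FSTAB='UUID=1111-2222 /home xfs defaults,nofail 0 0
UUID=3333-4444 /docker xfs defaults,nofail 0 0'

for mp in /home /docker; do
    if echo "$FSTAB" | grep -qE "^UUID=[^ ]+ ${mp} xfs .*nofail"; then
        echo "${mp}: fstab entry present"
    else
        echo "${mp}: fstab entry MISSING"
    fi
done
```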
Installing the Driver and Firmware
- Check whether the npu-smi tool can be used properly. The firmware and driver installation can continue only when the npu-smi tool can be used properly. Run the command below. If the output is the same as that shown in Figure 4, the npu-smi tool is normal.
npu-smi info
If the command output is incomplete compared with the figure below (for example, an error is reported, or only the upper part of the output is displayed without the process information), restore the npu-smi tool and install the new firmware and driver versions. To do so, submit a service ticket to contact Huawei Cloud technical support.
- View the environment information. Run the following command to view the current firmware and driver versions:
npu-smi info -t board -i 1 | egrep -i "software|firmware"
Figure 5 Viewing the firmware and driver versions
firmware indicates the firmware version, and software indicates the driver version.
If the current versions do not meet your requirements and need to be changed, see the subsequent operations.
- Check the OS version and whether the architecture is AArch64 or x86_64, and obtain the firmware and driver packages from the official website. The firmware package is Ascend-hdk-<model>-npu-firmware_<version>.run and the driver package is Ascend-hdk-<model>-npu-driver_<version>_linux-aarch64.run. Only Huawei engineers and channel users have the permission to download the commercial version. For details, see the download link.
arch
cat /etc/os-release
Figure 6 Viewing the OS version and architecture
The following uses the packages that adapt to EulerOS 2.0 (SP10) and AArch64 as an example.
- Install the driver and firmware.
The installation sequence is vital.
- Initial installation: In scenarios where no driver is installed on a hardware device before delivery, or the installed driver and firmware have been uninstalled, install the driver first and then the firmware.
- Overwrite installation: In scenarios where the driver and firmware have been installed on a hardware device and you need to install them again, install the firmware first and then the driver.
Generally, the firmware and driver are pre-installed on Snt9b and Snt9b23 servers before delivery. So in this case, overwrite installation is used.
If the firmware and driver to be installed are of a lower version, ensure that npu-smi functions, and install the packages without the need to uninstall the existing versions.
Installation commands:
- Install the firmware and then restart the server.
chmod 700 *.run
# Replace with the actual package name.
./Ascend-hdk-<model>-npu-firmware_<version>.run --full
reboot
- Install the driver. Enter y as prompted.
# Replace with the actual package name.
./Ascend-hdk-<model>-npu-driver_<version>_linux-aarch64.run --full --install-for-all
- (Optional) Determine whether to restart the system as prompted. If yes, run the following command. If not, skip this step.
reboot
- After the installation, check the firmware and driver versions. If the output is normal, the installation is successful.
npu-smi info -t board -i 1 | egrep -i "software|firmware"
Figure 7 Checking the firmware and driver versions
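If you need the two versions in a script (for example, to gate an upgrade), they can be extracted from the npu-smi output. The sample lines below are illustrative stand-ins, not actual device output; the exact field names and spacing may differ on your hardware.

```shell
# Parse the driver (Software) and firmware versions from sample output of:
#   npu-smi info -t board -i 1 | egrep -i "software|firmware"
SAMPLE='Software Version                : 23.0.rc2
Firmware Version                : 6.4.12.1.241'

driver=$(echo "$SAMPLE" | awk -F': ' '/Software Version/ {print $2}')
firmware=$(echo "$SAMPLE" | awk -F': ' '/Firmware Version/ {print $2}')
echo "driver=${driver} firmware=${firmware}"
```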
Installing the Docker Environment
ModelArts Lite Server's public images already include Docker. To install Docker manually, follow these steps:
- Run the command below to check whether Docker has been installed. If Docker has been installed, skip this step. Figure 8 shows that Docker has been installed.
docker -v
If Docker is not installed, run the following command to install it:
yum install -y docker-engine.aarch64 docker-engine-selinux.noarch docker-runc.aarch64
After the installation is complete, run the docker -v command again to check whether Docker is installed.
- Configure IP forwarding for network access in containers.
Run the following command to check the value of net.ipv4.ip_forward. Skip this step if the value is 1.
sysctl -p | grep net.ipv4.ip_forward
If the value is not 1, run the following commands to configure IP forwarding:
sed -i 's/net\.ipv4\.ip_forward=0/net\.ipv4\.ip_forward=1/g' /etc/sysctl.conf
sysctl -p | grep net.ipv4.ip_forward
- Check whether Ascend-docker-runtime has been installed and configured in the environment.
docker info |grep Runtime
If the runtime is ascend in the output, the installation and configuration are complete. In this case, skip this step.
Figure 9 Querying Ascend-docker-runtime
If Ascend-docker-runtime is not installed, click here to install it. The software package is a Docker plugin provided by AI Compute Service. During Docker runtime, the driver paths of AI Compute Service can be automatically mounted to the container, so you do not need to specify --device during container startup. After the package is downloaded, upload it to the Lite Server and install it.
chmod 700 *.run
./Ascend-docker-runtime_<version>_linux-aarch64.run --install
For details, see Docker Runtime User Guide.
- Set the newly mounted disk as the path used by Docker containers.
Edit the /etc/docker/daemon.json file. If the file does not exist, create it.
vim /etc/docker/daemon.json
Add the two configurations shown in the following figure. To keep the JSON valid, add a comma (,) at the end of the insecure-registries line. data-root specifies the path where Docker data is stored. default-shm-size specifies the default shared memory size during container startup; the default value is 64 MB. Increase it if distributed training fails due to insufficient shared memory.
Figure 10 Configuring Docker
Save the configuration and run the following command to restart Docker for the configuration to take effect:
systemctl daemon-reload && systemctl restart docker
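For reference, a complete daemon.json after the edits in this section might look like the sketch below. The registry address and the ascend runtime path are assumptions for illustration only; keep the values your server already has. Writing the file to a temporary path first lets you validate the JSON before restarting Docker, since a malformed daemon.json prevents Docker from starting.

```shell
# Write a sample daemon.json to a temp file and validate it before use.
TMP_JSON=$(mktemp)
cat > "$TMP_JSON" <<'EOF'
{
    "insecure-registries": ["swr.example.myhuaweicloud.com"],
    "data-root": "/docker",
    "default-shm-size": "8G",
    "default-runtime": "ascend",
    "runtimes": {
        "ascend": {
            "path": "/usr/local/Ascend/Ascend-Docker-Runtime/ascend-docker-runtime",
            "runtimeArgs": []
        }
    }
}
EOF
# Validate that the file is well-formed JSON.
python3 -m json.tool "$TMP_JSON" > /dev/null && echo "daemon.json OK"
```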
Installing the pip Source
- Check whether pip has been installed and whether the access to the pip source is normal. If yes, skip this step.
pip install numpy
- If pip is not installed, run the following commands:
python -m ensurepip --upgrade
ln -s /usr/bin/pip3 /usr/bin/pip
- Configure the pip source.
mkdir -p ~/.pip
vim ~/.pip/pip.conf
Add the following information to the ~/.pip/pip.conf file:
[global]
index-url = http://mirrors.myhuaweicloud.com/pypi/web/simple
format = columns
[install]
trusted-host=mirrors.myhuaweicloud.com
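A quick way to confirm the file is well-formed is to parse it as INI. This sketch writes the same content to a temporary file and reads it back with Python's configparser; the temp-file path is only for the dry run, while the real file is ~/.pip/pip.conf.

```shell
# Write the pip.conf content to a temp file and parse it back.
CONF=$(mktemp)
cat > "$CONF" <<'EOF'
[global]
index-url = http://mirrors.myhuaweicloud.com/pypi/web/simple
format = columns
[install]
trusted-host = mirrors.myhuaweicloud.com
EOF
python3 - "$CONF" <<'PY'
import configparser, sys
cp = configparser.ConfigParser()
cp.read(sys.argv[1])
print("index-url =", cp["global"]["index-url"])
print("trusted-host =", cp["install"]["trusted-host"])
PY
```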
Testing the RoCE Network
The following RoCE network test only applies to Snt9b servers. For details about how to test the RoCE network of Snt9b23 servers, see Lite Server Node Fault Diagnosis.
- Install CANN Toolkit.
Check whether CANN Toolkit has been installed on the server. If the version number is displayed, it has been installed.
cat /usr/local/Ascend/ascend-toolkit/latest/aarch64-linux/ascend_toolkit_install.info
If it is not installed, obtain the software package from the official website. Common users can download the community edition. The commercial edition is restricted to Huawei engineers and channel users; download it from here.
Install CANN Toolkit. Replace the package name.
chmod 700 *.run
./Ascend-cann-toolkit_6.3.RC2_linux-aarch64.run --full --install-for-all
- Install mpich-3.2.1.tar.gz.
Click here to download the package and run the following commands to install it:
mkdir -p /home/mpich
mv /root/mpich-3.2.1.tar.gz /home/
cd /home/
tar -zxvf mpich-3.2.1.tar.gz
cd /home/mpich-3.2.1
./configure --prefix=/home/mpich --disable-fortran
make && make install
- Set environment variables and compile the HCCL operator.
export PATH=/home/mpich/bin:$PATH
export LD_LIBRARY_PATH=/home/mpich/lib/:/usr/local/Ascend/ascend-toolkit/latest/lib64:$LD_LIBRARY_PATH
cd /usr/local/Ascend/ascend-toolkit/latest/tools/hccl_test
make MPI_HOME=/home/mpich ASCEND_DIR=/usr/local/Ascend/ascend-toolkit/latest
After the operator is compiled, the following information is displayed.
Figure 11 Compiled operator
- Perform all_reduce_test in the single-node scenario.
For single-node single-PU, run the following command:
mpirun -n 1 ./bin/all_reduce_test -b 8 -e 1024M -f 2 -p 8
For single-node multi-PU, run the following command:
mpirun -n 8 ./bin/all_reduce_test -b 8 -e 1024M -f 2 -p 8
Figure 12 all_reduce_test
- Test the bandwidth of multi-node RoCE NICs.
- Check the Ascend RoCE IP address.
cat /etc/hccn.conf
Figure 13 Viewing Ascend RoCE IP address
- Perform the RoCE test.
In session 1, run the following commands on the receive end. The -i parameter specifies the PU ID.
hccn_tool -i 7 -roce_test reset
hccn_tool -i 7 -roce_test ib_send_bw -s 4096000 -n 1000 -tcp
In session 2, run the following commands on the sending end. The -i parameter specifies the PU ID, and the IP address at the end is that of the receive end.
cd /usr/local/Ascend/ascend-toolkit/latest/tools/hccl_test
hccn_tool -i 0 -roce_test reset
hccn_tool -i 0 -roce_test ib_send_bw -s 4096000 -n 1000 address 192.168.100.18 -tcp
The following figure shows the RoCE test result.
Figure 14 RoCE test result (receive end)
Figure 15 RoCE test result (sending end)
- If the RoCE bandwidth test has been started for a NIC, the following error message is displayed when the task is started again.
Figure 16 Error
Run the following command to stop the roce_test task and then start the task:
hccn_tool -i 7 -roce_test reset
Run the following command to query the NIC status:
for i in {0..7};do hccn_tool -i ${i} -link -g;done
Run the following command to check the IP address connectivity of the NIC on a common node:
for i in $(seq 0 7);do hccn_tool -i $i -net_health -g;done
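When the status loop above runs across all eight NICs, a one-line summary is often easier to read than eight raw outputs. The sketch below counts UP links from sample outputs; the exact "link status: UP" wording is an assumption about the hccn_tool output format, so adjust the pattern to match your device.

```shell
# Summarize link status for 8 NICs. SAMPLE stands in for the concatenated
# output of: for i in {0..7}; do hccn_tool -i ${i} -link -g; done
SAMPLE='link status: UP
link status: UP
link status: DOWN
link status: UP
link status: UP
link status: UP
link status: UP
link status: UP'

up=$(echo "$SAMPLE" | grep -c 'UP')
total=$(echo "$SAMPLE" | wc -l)
echo "${up}/${total} links UP"
```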
Creating a Containerized Custom Debugging Environment
You can start a Docker container on the physical machine (PM) for development. You can use your own service images or the base images provided by ModelArts, including Ascend+PyTorch and Ascend+MindSpore images.
- Prepare a service base image.
- Choose an image based on your environment.
# Container image matching Snt9b. The following shows an example.
docker pull swr.<region-code>.myhuaweicloud.com/atelier/<image-name>:<image-tag>
- Start the container image. If multiple users and containers are sharing a machine, allocate the PUs beforehand. Do not use PUs occupied by other containers.
# Start the container. Specify the container name and image information. ASCEND_VISIBLE_DEVICES specifies the PUs used by the container; for example, 0-1,3 indicates PUs 0, 1, and 3. A hyphen (-) specifies a range.
# -v /home:/home_host mounts the /home directory of the host to the /home_host directory of the container. Store code and data in this mounted directory for persistence.
docker run -itd --cap-add=SYS_PTRACE -e ASCEND_VISIBLE_DEVICES=0 -v /home:/home_host -p 51234:22 -u=0 --name <custom-container-name> <SWR-address-of-the-image-pulled-in-the-preceding-step> /bin/bash
- Access the container.
docker exec -ti <custom-container-name-in-the-last-command> bash
- Access the Conda environment.
source /home/ma-user/.bashrc
cd ~
- View the information of available PUs in the container.
npu-smi info
If the following error message is displayed, the PU specified by ASCEND_VISIBLE_DEVICES during container startup is occupied by another container. In this case, select another PU and restart the new container.
Figure 17 Error
- After you run npu-smi info and the output is normal, run the commands below to test the container environment. If the output is normal, the container environment is available.
- PyTorch image test:
python3 -c "import torch;import torch_npu; a = torch.randn(3, 4).npu(); print(a + a);"
- MindSpore image test:
# The run_check program of MindSpore does not adapt to Snt9b. Configure two environment variables first.
unset MS_GE_TRAIN
unset MS_ENABLE_GE
python -c "import mindspore;mindspore.set_context(device_target='Ascend');mindspore.run_check()"
# Restore the environment variables after the test for actual training.
export MS_GE_TRAIN=1
export MS_ENABLE_GE=1
Figure 18 Accessing the Conda environment and performing a test
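Before starting a container on a shared server, it helps to double-check exactly which PUs an ASCEND_VISIBLE_DEVICES value selects. This small sketch expands a spec such as 0-1,3 (hyphens for ranges, commas between items, as described in the startup step above) into individual PU IDs:

```shell
# Expand an ASCEND_VISIBLE_DEVICES spec into individual PU IDs.
spec="0-1,3"
ids=""
for part in $(echo "$spec" | tr ',' ' '); do
    case "$part" in
        *-*) ids="$ids $(seq -s ' ' "${part%-*}" "${part#*-}")" ;;  # expand a range
        *)   ids="$ids $part" ;;                                    # single PU
    esac
done
echo "PUs:${ids}"
```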
- (Optional) Configure SSH access for the container.
If you need to use the VS Code or SSH tool to directly connect to the container for development, perform the following operations:
- After accessing the container, run the SSH startup command to start the SSH service.
ssh-keygen -A
/usr/sbin/sshd
# Check whether SSH is started.
ps -ef | grep ssh
- Set a password for user root as prompted.
passwd
Figure 19 Setting a password for user root
- Run the exit command to exit the container and perform the SSH test on the host.
ssh root@<host-IP-address> -p 51234 # 51234 is the mapped port number from the docker run command.
Figure 20 Performing the SSH test
If the error message "Host key verification failed" is displayed when you perform the SSH container test on the host machine, delete the ~/.ssh/known_hosts file from the host machine and try again.
- Use VS Code SSH to connect to the container environment.
If you have not used VS Code SSH, install the VS Code environment and Remote-SSH plugin by referring to Step1 Manually Connecting to a Notebook Instance Through VS Code.
Open the VS Code terminal and run the following command to generate a key pair on the local computer. If you already have a key pair, skip this step.
ssh-keygen -t rsa
Add the public key to the authorization file of the remote server. Replace the server IP address and container port number.
cat ~/.ssh/id_rsa.pub | ssh root@<server-IP-address> -p <container-port-number> "mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys"
Open the Remote-SSH configuration file of VS Code and add the following SSH configuration items. Replace the server IP address and container port number.
Host Snt9b-dev
    HostName <server-IP-address>
    User root
    Port <SSH-port-number-of-the-container>
    IdentityFile ~/.ssh/id_rsa
    StrictHostKeyChecking no
    UserKnownHostsFile /dev/null
    ForwardAgent yes
Note: This configuration uses key-based login. To use a password instead, delete the IdentityFile line and enter the password as prompted during the connection.
After the connection, install the Python plugin. For details, see Install the Python Plug-in in the Cloud Development Environment.
- (Optional) Install CANN Toolkit.
CANN Toolkit has been installed in the preset images provided by ModelArts. If you need to use another version or use your own image that is not preset with CANN Toolkit, see the following operations.
- Check whether CANN Toolkit has been installed in the container. If the version number is displayed, it has been installed.
cat /usr/local/Ascend/ascend-toolkit/latest/aarch64-linux/ascend_toolkit_install.info
- If it is not installed or needs to be upgraded, obtain the software package from the official website. Common users can download the community edition. The commercial edition is restricted to Huawei engineers and channel users; download it from here.
Install CANN Toolkit. Replace the package name.
chmod 700 *.run
./Ascend-cann-toolkit_6.3.RC2_linux-aarch64.run --full --install-for-all
- If it has been installed but needs to be upgraded, run the following command. Replace the package name.
chmod 700 *.run
./Ascend-cann-toolkit_6.3.RC2_linux-aarch64.run --upgrade --install-for-all
- (Optional) Install MindSpore Lite.
MindSpore Lite has been installed in the preset image. If you need to use another version or use your own image that is not preset with MindSpore Lite, see the following operations.
- Check whether MindSpore Lite has been installed in the container. If the software information and version are displayed, it has been installed.
pip show mindspore-lite
- If it is not installed, download the .whl and .tar.gz packages from the official website. Replace the package names in the following commands.
pip install mindspore_lite-2.1.0-cp37-cp37m-linux_aarch64.whl
mkdir -p /usr/local/mindspore-lite
tar -zxvf mindspore-lite-2.1.0-linux-aarch64.tar.gz -C /usr/local/mindspore-lite --strip-components 1
- Configure the pip source.
The pip source has been configured in the preset image provided by ModelArts. To use your own service images, configure it by referring to Installing the pip Source.
- Configure a Yum repository.
- Configure the Yum repository in Huawei EulerOS.
# Back up the existing EulerOS.repo file and create a new one in the /etc/yum.repos.d/ directory.
cd /etc/yum.repos.d/
mv EulerOS.repo EulerOS.repo.bak
vim EulerOS.repo

# Configure the EulerOS.repo file based on the EulerOS version and system architecture. EulerOS 2.10 is used as an example.
[base]
name=EulerOS-2.0SP10 base
baseurl=https://mirrors.huaweicloud.com/euler/2.10/os/aarch64/
enabled=1
gpgcheck=1
gpgkey=https://mirrors.huaweicloud.com/euler/2.10/os/RPM-GPG-KEY-EulerOS

# Clear the existing Yum cache.
yum clean all
# Generate a new Yum cache.
yum makecache
# Perform a test.
yum update --allowerasing --skip-broken --nobest
- Configure a Yum repository in Huawei Cloud EulerOS.
# Download the new hce.repo file to the /etc/yum.repos.d/ directory.
wget -O /etc/yum.repos.d/hce.repo https://mirrors.huaweicloud.com/artifactory/os-conf/hce/hce.repo
# Clear the existing Yum cache.
yum clean all
# Generate a new Yum cache.
yum makecache
# Perform a test.
yum update --allowerasing --skip-broken --nobest
- To use the git clone and git lfs commands to download large models, perform the following operations:
- The Euler source does not provide the git-lfs package, so you need to download it manually. Enter the following address in the address box of the browser, download the git-lfs package, and upload it to the /home directory on the server. This directory is mounted to the /home_host directory of the container when the container is started, so the package can be used directly in the container.
https://github.com/git-lfs/git-lfs/releases/download/v3.2.0/git-lfs-linux-arm64-v3.2.0.tar.gz
- Go to the container and run the git-lfs installation commands.
cd /home_host
tar -zxvf git-lfs-linux-arm64-v3.2.0.tar.gz
cd git-lfs-3.2.0
sh install.sh
- Disable SSL verification for Git configuration.
git config --global http.sslVerify false
- The following commands use code in diffusers as an example. Replace the development directory.
# Clone the diffusers source code. You can specify a branch with the -b parameter. Replace the development directory.
cd /home_host/<user-directory>
mkdir sd
cd sd
git clone https://github.com/huggingface/diffusers.git -b v0.11.1-patch
Run git clone to download the model from Hugging Face. The following uses a stable-diffusion (SD) model as an example.
If the error "SSL_ERROR_SYSCALL" is reported during the download, try again. The download may take several hours due to network restrictions and large file sizes. If the download still fails after multiple retries, download the large files from the website and upload them to your personal development directory in /home on the server. To skip large files during the download, set GIT_LFS_SKIP_SMUDGE to 1.
git lfs install
git clone https://huggingface.co/runwayml/stable-diffusion-v1-5 -b onnx
Figure 21 Downloaded code
- If a container is used or shared by multiple users, you should restrict the container from accessing the OpenStack management address (169.254.169.254) to prevent host machine metadata acquisition. For details, see Forbidding Containers to Obtain Host Machine Metadata.
- Save the image in the container environment.
After the environment is configured, you can develop and debug the service code. To prevent the environment from being lost after the host is restarted, run the following commands to save the configured environment as a new image:
# Check the ID of the container to be saved as an image.
docker ps
# Save the image.
docker commit <container-ID> <custom-image-name>:<custom-image-tag>
# View the saved image.
docker images
# If you need to share the image with others in other environments, save it as a TAR file. This command takes a long time. After it finishes, view the file by running the ls command.
docker save -o <custom-name>.tar <image-name>:<image-tag>
# Load the file on another host. After it is loaded, you can view the image.
docker load --input <custom-name>.tar
For details about how to migrate services to Ascend for development and debugging, see the related documents.