
Configuring the Software Environment on the NPU Server

Scenario

This section describes how to configure the environment on NPU-based Lite Servers, including merging and mounting disks and installing Docker. Table 1 lists the configuration items.

The latest Lite Servers come with most of these settings preconfigured, so you can skip the corresponding steps.

Table 1 Physical machine environment configuration items

Configuration Item                                      Application Scope
Configuring Server SSH Connection Timeout               All server models
Merging and Mounting Disks                              All server models
Installing the Driver and Firmware                      NPU servers only
Installing the Docker Environment                       All server models
Installing the pip Source                               All server models
Testing the RoCE Network                                Snt9b servers only
Creating a Containerized Custom Debugging Environment   NPU servers only

Configuration Precautions

Before the configuration, pay attention to the following:

  • During the first installation, once you have configured basic settings such as storage, firmware, driver, and network access, avoid changing them afterward.
  • For developers who need to develop on a BMS, start an independent Docker container as your personal development environment. The Snt9b BMS provides eight PUs, which multiple users can share for development and debugging. To avoid conflicts, allocate PUs to each user beforehand, and have each user develop in their own Docker container.
  • ModelArts provides standard base container images, in which the basic MindSpore/PyTorch framework and the development and debugging tool chain are preset. You can use these images directly, or use your own service images or images provided by AscendHub. If the software version preset in an image does not meet your requirements, you can install or replace it.
  • Use the exposed SSH port to connect to the container in remote development mode (VS Code Remote-SSH or Xshell). You can mount your storage directory to the container to store code and data.

Configuring Server SSH Connection Timeout

  1. Log in to the Lite Server using SSH and check the timeout configuration.
    echo $TMOUT
  2. If it is set to 300, the connection is closed after 5 minutes of inactivity; you can configure a longer timeout interval if needed. If it is set to 0, skip this step. Run the following commands to change the setting:
    vim /etc/profile 
    # Change the value of TMOUT from 300 to 0 at the end of the file. The value 0 indicates that the idle connection is not disconnected.
    export TMOUT=0
  3. Run the following command for the configuration to take effect on the current terminal:
    TMOUT=0

    Running export TMOUT=0 sets the idle timeout of SSH sessions on the Linux server to 0, meaning connections are never disconnected for being idle. For security purposes, SSH connections may be automatically disconnected after a period of inactivity. If you are performing a task that requires a long-lived connection, run this command to prevent idle disconnection.

    You can run the TMOUT=0 command in the current terminal session or add export TMOUT=0 to the /etc/profile file. In this way, new sessions of all users will not be disconnected due to idleness.

    Do not configure TMOUT=0 in the production environment or on a public server, as it introduces security risks.
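    If you prefer a non-interactive edit, the profile change can also be scripted. This is a minimal sed sketch, assuming /etc/profile sets the value on a line such as export TMOUT=300:

    # Set the idle timeout to 0 in /etc/profile (assumes an existing "export TMOUT=..." line)
    sed -i 's/^export TMOUT=.*/export TMOUT=0/' /etc/profile
    # Apply the change to the current shell
    source /etc/profile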

Merging and Mounting Disks

After you enable Lite Server resources, the server may contain multiple unmounted NVMe disks. Merge and mount these disks before configuring the environment, and do so first so that data you store later is not overwritten.

  1. Run lsblk to check whether there are three 7 TB disks that are not mounted.
    As shown in Figure 1, nvme0n1, nvme1n1, and nvme2n1 are not mounted.
    Figure 1 Unmounted disks
    As shown in Figure 2, the MOUNTPOINT column indicates the directory where a disk is mounted. If the disks are already mounted, skip this step and create a working directory in /home.
    Figure 2 Mounted disks
  2. Edit the disk mounting script create_disk_partitions.sh. This script mounts /dev/nvme0n1 to /home for developers to create their own home directories, and mounts nvme1n1 and nvme2n1 to /docker for container storage. If container data were kept under the root directory instead, the root partition could be fully occupied when multiple users share the same Lite Server and create multiple container instances.
    vim create_disk_partitions.sh

    The following content shows the create_disk_partitions.sh script, which can be directly used without modification:

    # ============================================================================
    # Mount the nvme0n1 local disk to the /home directory.
    # Combine the nvme1n1 and nvme2n1 local disks as a logical volume and mount them to the /docker directory. Set automatic mounting upon system startup.
    # ============================================================================
    set -e
    # Mount nvme0n1 to the user directory.
    mkfs -t xfs /dev/nvme0n1
    mkdir -p /tmp/home
    cp -r /home/* /tmp/home/
    mount /dev/nvme0n1 /home
    mv /tmp/home/* /home/
    rm -rf  /tmp/home
    # Mount nvme1n1 and nvme2n1 to the /docker directory.
    pvcreate /dev/nvme1n1
    pvcreate /dev/nvme2n1
    vgcreate nvme_group  /dev/nvme1n1 /dev/nvme2n1
    lvcreate -l 100%VG -n docker_data nvme_group
    mkfs -t xfs /dev/nvme_group/docker_data
    mkdir -p /docker
    mount /dev/nvme_group/docker_data /docker
    # Migrate Docker files to the new /docker directory.
    systemctl stop docker
    mv /var/lib/docker/* /docker
    sed -i '/"default-runtime"/i\        "data-root":     "/docker",' /etc/docker/daemon.json
    systemctl start docker
    # Enable automatic mounting upon system startup.
    uuid=`blkid -o value -s UUID /dev/nvme_group/docker_data` && echo UUID=${uuid} /docker xfs defaults,nofail 0 0 >> /etc/fstab
    uuid=`blkid -o value -s UUID /dev/nvme0n1` && echo UUID=${uuid} /home xfs defaults,nofail 0 0 >> /etc/fstab
    mount -a
    df -h
  3. Run the create_disk_partitions.sh script.
    sh create_disk_partitions.sh
  4. After the configuration, run the df -h command to view the information about the mounted disks.
    Figure 3 Viewing mounted disks
  5. After the disks are merged and mounted, you can create a working directory and name it in /home.
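    After running the script, a quick verification sketch can confirm the layout; these commands only re-run the checks already described in this section:

    # Confirm the disks, mount points, and fstab entries created by create_disk_partitions.sh
    lsblk
    df -h | grep -E '/home|/docker'
    grep -E '/home|/docker' /etc/fstab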

Installing the Driver and Firmware

  1. Check whether the npu-smi tool works properly; the firmware and driver installation can continue only if it does. Run the command below. If the output is the same as that shown in Figure 4, the npu-smi tool is normal.
    npu-smi info

    If the command output is incomplete (for example, an error is reported, or only the upper part of the output is displayed without the process information below it), restore the npu-smi tool and install the new firmware and driver versions. To do so, submit a service ticket to contact Huawei Cloud technical support.

    Figure 4 Checking the npu-smi tool
  2. View the environment information. Run the following command to view the current firmware and driver versions:
    npu-smi info -t board -i 1 | egrep -i "software|firmware"
    Figure 5 Viewing the firmware and driver versions

    firmware indicates the firmware version, and software indicates the driver version.

    If the current versions do not meet your requirements and need to be changed, see the subsequent operations.

  3. View the OS version, check whether the architecture is AArch64 or x86_64, and obtain the firmware and driver packages from the official website. The firmware package is Ascend-hdk-<model>-npu-firmware_<version>.run and the driver package is Ascend-hdk-<model>-npu-driver_<version>_linux-aarch64.run. Only Huawei engineers and channel users have permission to download the commercial version. For details, see the download link.
    arch
    cat /etc/os-release
    Figure 6 Viewing the OS version and architecture

    The following uses the packages that adapt to EulerOS 2.0 (SP10) and AArch64 as an example.

  4. Install the driver and firmware.
    The installation sequence is vital.
    1. Initial installation: In scenarios where no driver is installed on a hardware device before delivery, or the installed driver and firmware on the hardware device have been uninstalled, you need to install the driver and then firmware.
    2. Overwrite installation: In scenarios where the driver and firmware have been installed on a hardware device and you need to install them again, install the firmware first and then the driver.

    Generally, the firmware and driver are preinstalled on Snt9b and Snt9b23 servers before delivery, so overwrite installation applies in this case.

    If the firmware and driver to be installed are of a lower version than the current ones, ensure that npu-smi works, then install the packages directly; you do not need to uninstall the existing versions.

    Installation commands:
    1. Install the firmware and then restart the server.
      chmod 700 *.run
      # Replace with the actual package name
      ./Ascend-hdk-<model>-npu-firmware_<version>.run --full
      reboot
    2. Install the driver. Enter y as prompted.
      # Replace with the actual package name
      ./Ascend-hdk-<model>-npu-driver_<version>_linux-aarch64.run --full --install-for-all
    3. (Optional) Determine whether to restart the system as prompted. If yes, run the following command. If not, skip this step.
      reboot
    4. After the installation, check the firmware and driver versions. If the output is normal, the installation is successful.
      npu-smi info -t board -i 1 | egrep -i "software|firmware"
      Figure 7 Checking the firmware and driver versions
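    The installation-order rule above can be checked with a small script. This is a hedged sketch: it assumes that a working npu-smi indicates an existing driver installation, as described in the overwrite scenario.

    # Decide the installation order: overwrite (firmware first) vs. initial (driver first)
    if npu-smi info > /dev/null 2>&1; then
        echo "Driver detected: use overwrite order (firmware, reboot, then driver)."
    else
        echo "No working driver: use initial order (driver first, then firmware)."
    fi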

Installing the Docker Environment

ModelArts Lite Server's public images already include Docker. To install Docker manually, follow these steps:

  1. Run the command below to check whether Docker has been installed. If Docker has been installed, skip this step. Figure 8 shows that Docker has been installed.
    docker -v
    Figure 8 Viewing the Docker version
    If Docker is not installed, run this command to install it:
    yum install -y docker-engine.aarch64 docker-engine-selinux.noarch docker-runc.aarch64

    After the installation is complete, run the docker -v command again to check whether Docker is installed.

  2. Configure IP forwarding for network access in containers.
    Run the following command to check the value of net.ipv4.ip_forward. Skip this step if the value is 1.
    sysctl -p | grep net.ipv4.ip_forward
    If the value is not 1, run the following command to configure IP forwarding:
    sed -i 's/net\.ipv4\.ip_forward=0/net\.ipv4\.ip_forward=1/g' /etc/sysctl.conf 
    sysctl -p | grep net.ipv4.ip_forward
  3. Check whether Ascend-docker-runtime has been installed and configured in the environment.
    docker info |grep Runtime
    If the runtime is ascend in the output, the installation and configuration are complete. In this case, skip this step.
    Figure 9 Querying Ascend-docker-runtime
    If Ascend-docker-runtime is not installed, click here to download it. The software package is a Docker plugin provided by AI Compute Service: during Docker runtime, the driver paths of AI Compute Service are automatically mounted into the container, so you do not need to specify --device when starting a container. After the package is downloaded, upload it to the Lite Server and install it.
    chmod 700 *.run
    ./Ascend-docker-runtime_<version>_linux-aarch64.run --install

    For details, see Docker Runtime User Guide.

  4. Set the newly mounted disk as the path used by Docker containers.
    Edit the /etc/docker/daemon.json file. If the file does not exist, create it.
    vim /etc/docker/daemon.json
    Add the two configuration items shown in the following figure. To keep the JSON valid, add a comma (,) at the end of the insecure-registries line. data-root indicates the path where Docker data is stored. default-shm-size indicates the default shared memory size during container startup; the default is 64 MB, and you can increase it if distributed training fails due to insufficient shared memory.
    Figure 10 Configuring Docker
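    For reference, a minimal sketch of the resulting /etc/docker/daemon.json. The registry address and shared memory size are illustrative assumptions; keep any existing keys (such as default-runtime) unchanged:

    {
        "insecure-registries": ["<your-registry-address>"],
        "data-root": "/docker",
        "default-shm-size": "1G"
    }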

    Save the configuration and run the following command to restart Docker for the configuration to take effect:
    systemctl daemon-reload && systemctl restart docker

Installing the pip Source

  1. Check whether pip has been installed and whether the access to the pip source is normal. If yes, skip this step.
    pip install numpy
  2. If pip is not installed, run the following commands:
    python -m ensurepip --upgrade
    ln -s /usr/bin/pip3 /usr/bin/pip
  3. Configure the pip source.
    mkdir -p ~/.pip
    vim ~/.pip/pip.conf
    Add the following information to the ~/.pip/pip.conf file:
    [global]
    index-url = http://mirrors.myhuaweicloud.com/pypi/web/simple
    format = columns
    [install]
    trusted-host=mirrors.myhuaweicloud.com
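    To confirm that the new source takes effect, a quick check sketch (pip config list requires pip 10 or later):

    # Show the active pip configuration and test an installation against the new source
    pip config list
    pip install numpy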

Testing the RoCE Network

The following RoCE network test only applies to Snt9b servers. For details about how to test the RoCE network of Snt9b23 servers, see Lite Server Node Fault Diagnosis.

  1. Install CANN Toolkit.
    Check whether CANN Toolkit has been installed on the server. If the version number is displayed, it has been installed.
    cat /usr/local/Ascend/ascend-toolkit/latest/aarch64-linux/ascend_toolkit_install.info

    If it is not installed, obtain the software package from the official website. Common users can download the community edition. The commercial edition can be downloaded only by Huawei engineers and channel users; download it from here.

    Install CANN Toolkit. Replace the package name.
    chmod 700 *.run
    ./Ascend-cann-toolkit_6.3.RC2_linux-aarch64.run --full --install-for-all
  2. Install mpich-3.2.1.tar.gz.
    Click here to download the package and run the following commands to install it:
    mkdir -p /home/mpich
    mv /root/mpich-3.2.1.tar.gz /home/
    cd /home/;tar -zxvf mpich-3.2.1.tar.gz
    cd /home/mpich-3.2.1
    ./configure --prefix=/home/mpich --disable-fortran
    make && make install
  3. Set environment variables and compile the HCCL operator.
    export PATH=/home/mpich/bin:$PATH
    cd /usr/local/Ascend/ascend-toolkit/latest/tools/hccl_test
    export LD_LIBRARY_PATH=/home/mpich/lib/:/usr/local/Ascend/ascend-toolkit/latest/lib64:$LD_LIBRARY_PATH
    make MPI_HOME=/home/mpich ASCEND_DIR=/usr/local/Ascend/ascend-toolkit/latest
    After the operator is compiled, the following information is displayed.
    Figure 11 Compiled operator
  4. Perform all_reduce_test in the single-node scenario.
    Go to the hccl_test directory.
    cd /usr/local/Ascend/ascend-toolkit/latest/tools/hccl_test
    For single-node single-PU, run the following command:
    mpirun -n 1 ./bin/all_reduce_test -b 8 -e 1024M -f 2 -p 8 
    For single-node multi-PU, run the following command:
    mpirun -n 8 ./bin/all_reduce_test -b 8 -e 1024M -f 2 -p 8
    Figure 12 all_reduce_test

  5. Test the bandwidth of multi-node RoCE NICs.
    1. Check the Ascend RoCE IP address. You can also query each NIC individually, as shown in the sketch at the end of this section.
      cat /etc/hccn.conf
      Figure 13 Viewing Ascend RoCE IP address

    2. Perform the RoCE test.

      In session 1, run the following commands on the receive end. The -i parameter specifies the NIC (PU) ID.

      hccn_tool -i 7 -roce_test reset
      hccn_tool -i 7 -roce_test ib_send_bw -s 4096000 -n 1000 -tcp

      In session 2, run the following commands on the sending end. The -i parameter specifies the NIC (PU) ID, and the IP address of the receive end is specified after address.

      cd /usr/local/Ascend/ascend-toolkit/latest/tools/hccl_test
      hccn_tool -i 0 -roce_test reset
      hccn_tool -i 0 -roce_test ib_send_bw -s 4096000 -n 1000 address 192.168.100.18 -tcp

      The following figure shows the RoCE test result.

      Figure 14 RoCE test result (receive end)
      Figure 15 RoCE test result (sending end)

  6. If a RoCE bandwidth test is already running on a NIC, the following error message is displayed when another task is started.
    Figure 16 Error

    Run the following command to stop the roce_test task and then start the task:

    hccn_tool -i 7 -roce_test reset
    Run the following command to query the NIC status:
    for i in {0..7};do hccn_tool -i ${i} -link -g;done
    Run the following command to check the IP address connectivity of the NIC on a common node:
    for i in $(seq 0 7);do hccn_tool -i $i -net_health -g;done
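As a supplement to step 5.1, the RoCE IP address of each NIC can also be queried per port. This sketch assumes hccn_tool's -ip -g query option and eight NICs numbered 0-7:

    # Print the RoCE IP address of every NIC
    for i in {0..7};do hccn_tool -i ${i} -ip -g;done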

Creating a Containerized Custom Debugging Environment

You can start your own Docker container on the physical machine (PM) for development. You can use your own service images or the base images provided by ModelArts, including Ascend+PyTorch and Ascend+MindSpore images.

  1. Prepare a service base image.
    1. Choose an image based on your environment.
      # Container image matching Snt9b. The following shows an example.
      docker pull swr.<region-code>.myhuaweicloud.com/atelier/<image-name>:<image-tag>
    2. Start the container image. If multiple users and containers share a machine, allocate the PUs beforehand and do not use PUs occupied by other containers (see the variant sketch after this step).
      #  Start the container. Specify the container name and image information. ASCEND_VISIBLE_DEVICES indicates the PUs the container uses; for example, 0-1,3 indicates that PUs 0, 1, and 3 are used, where the hyphen (-) specifies a range.
      # -v /home:/home_host indicates mounting the home directory of the host to the home_host directory of the container. Use this mounting directory in the container to store code and data for persistent storage.
      docker run -itd --cap-add=SYS_PTRACE -e ASCEND_VISIBLE_DEVICES=0  -v /home:/home_host -p 51234:22 -u=0 --name <custom-container-name>  <SWR-address-of-the-image-pulled-in-the-preceding-step>  /bin/bash
    3. Access the container.
      docker exec -ti <custom-container-name-in-the-last-command> bash
    4. Access the Conda environment.
      source /home/ma-user/.bashrc
      cd ~
    5. View the information of available PUs in the container.
      npu-smi info
      If the following error message is displayed, the PU specified by ASCEND_VISIBLE_DEVICES during container startup is occupied by another container. In this case, select another PU and restart the new container.
      Figure 17 Error
    6. After you run npu-smi info and the output is normal, run the commands below to test the container environment. If the output is normal, the container environment is available.
      • PyTorch image test:
        python3 -c "import torch;import torch_npu; a = torch.randn(3, 4).npu(); print(a + a);"
      • MindSpore image test:
        # The run_check program of MindSpore does not adapt to Snt9b. Unset two environment variables first.
        unset MS_GE_TRAIN 
        unset MS_ENABLE_GE
        python -c "import mindspore;mindspore.set_context(device_target='Ascend');mindspore.run_check()"
        # Restore the environment variables after the test for actual training.
        export MS_GE_TRAIN=1
        export MS_ENABLE_GE=1
      Figure 18 Accessing the Conda environment and performing a test
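    If you need to allocate another PU range to a second user on the same machine, a variant of the preceding docker run command is shown below. The container name, host port, and PU range are illustrative:

      # Allocate PUs 4-7 to a second container and expose SSH on a different host port
      docker run -itd --cap-add=SYS_PTRACE -e ASCEND_VISIBLE_DEVICES=4-7 -v /home:/home_host -p 51235:22 -u=0 --name <second-container-name> <SWR-address-of-the-image> /bin/bash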
  2. (Optional) Configure SSH access for the container.

    If you need to use the VS Code or SSH tool to directly connect to the container for development, perform the following operations:

    1. After accessing the container, run the SSH startup command to start the SSH service.
      ssh-keygen  -A
      /usr/sbin/sshd
      #  Check whether SSH is started.
      ps -ef |grep ssh
    2. Set a password for user root as prompted.
      passwd
      Figure 19 Setting a password for user root
    3. Run the exit command to exit the container and perform the SSH test on the host.
      # Replace 51234 with the port mapped in the docker run command
      ssh root@<host-IP-address> -p 51234
      Figure 20 Perform the SSH test.

      If the error message "Host key verification failed" is displayed when you perform the SSH container test on the host machine, delete the ~/.ssh/known_hosts file on the host machine and try again, or remove only the stale entry as shown in the sketch at the end of this step.

    4. Use VS Code SSH to connect to the container environment.

      If you have not used VS Code SSH, install the VS Code environment and Remote-SSH plugin by referring to Step1 Manually Connecting to a Notebook Instance Through VS Code.

      Open VSCode Terminal and run the following command to generate a key pair on the local computer. If you already have a key pair, skip this step.
      ssh-keygen -t rsa
      Add the public key to the authorization file of the remote server. Replace the server IP address and container port number.
      cat ~/.ssh/id_rsa.pub | ssh root@<server-IP-address> -p <container-port-number> "mkdir -p ~/.ssh && cat >>  ~/.ssh/authorized_keys"
      Open the Remote-SSH configuration file of VSCode and add SSH configuration items. Replace the server IP address and container port number.
      Host Snt9b-dev
          HostName <server-IP-address>
          User root
          Port <SSH-port-number-of-the-container>
          IdentityFile ~/.ssh/id_rsa
          StrictHostKeyChecking no
          UserKnownHostsFile /dev/null
          ForwardAgent yes

      Note: This configuration uses key-based login. To log in with a password instead, delete the IdentityFile line and enter the password as prompted during connection.

      After the connection, install the Python plugin. For details, see Install the Python Plug-in in the Cloud Development Environment.
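    As an alternative to deleting the whole known_hosts file mentioned in the SSH test above, you can remove only the stale entry for the container's host and port:

      # Remove the cached host key for the mapped SSH port (replace host and port with your own)
      ssh-keygen -R "[<host-IP-address>]:51234"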

  3. (Optional) Install CANN Toolkit.

    CANN Toolkit has been installed in the preset images provided by ModelArts. If you need to use another version or use your own image that is not preset with CANN Toolkit, see the following operations.

    1. Check whether CANN Toolkit has been installed in the container. If the version number is displayed, it has been installed.
      cat /usr/local/Ascend/ascend-toolkit/latest/aarch64-linux/ascend_toolkit_install.info
    2. If it is not installed or needs to be upgraded, obtain the software package from the official website. Common users can download the community edition. The commercial edition can be downloaded only by Huawei engineers and channel users; download it from here.
      Install CANN Toolkit. Replace the package name.
      chmod 700 *.run
      ./Ascend-cann-toolkit_6.3.RC2_linux-aarch64.run --full --install-for-all
    3. If it has been installed but needs to be upgraded, run the following command. Replace the package name.
      chmod 700 *.run
      ./Ascend-cann-toolkit_6.3.RC2_linux-aarch64.run --upgrade --install-for-all
  4. (Optional) Install MindSpore Lite.

    MindSpore Lite has been installed in the preset image. If you need to use another version or use your own image that is not preset with MindSpore Lite, see the following operations.

    1. Check whether MindSpore Lite has been installed in the container. If the software information and version are displayed, it has been installed.
      pip show mindspore-lite
    2. If it is not installed, download the .whl and .tar.gz packages from the official website and install them. Replace the package names.
      pip install mindspore_lite-2.1.0-cp37-cp37m-linux_aarch64.whl
      mkdir -p /usr/local/mindspore-lite
      tar -zxvf mindspore-lite-2.1.0-linux-aarch64.tar.gz -C /usr/local/mindspore-lite --strip-components 1
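    If your workload loads the MindSpore Lite runtime libraries from the tarball extracted above, add them to the loader path. This is a hedged sketch: the exact lib subdirectory depends on the package layout, so verify it first.

      # Check the extracted layout, then export the matching lib directory
      ls /usr/local/mindspore-lite
      export LD_LIBRARY_PATH=/usr/local/mindspore-lite/lib:$LD_LIBRARY_PATH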
  5. Configure the pip source.

    The pip source has been configured in the preset image provided by ModelArts. To use your own service images, configure it by referring to Installing the pip Source.

  6. Configure a Yum repository.
    • Configure the Yum repository in Huawei EulerOS.
      # Back up the existing EulerOS.repo file in the /etc/yum.repos.d/ directory and create a new one.
      cd /etc/yum.repos.d/
      mv EulerOS.repo EulerOS.repo.bak
      vim EulerOS.repo
      # Configure the EulerOS.repo file based on the EulerOS version and system architecture. EulerOS 2.10 is used as an example.
      [base]
      name=EulerOS-2.0SP10 base
      baseurl=https://mirrors.huaweicloud.com/euler/2.10/os/aarch64/
      enabled=1
      gpgcheck=1
      gpgkey=https://mirrors.huaweicloud.com/euler/2.10/os/RPM-GPG-KEY-EulerOS
      # Clear the existing Yum cache.
      yum clean all
      # Generate a new Yum cache.
      yum makecache
      # Perform a test.
      yum update --allowerasing --skip-broken --nobest
    • Configure a Yum repository in Huawei Cloud EulerOS.
      # Download the new hce.repo file to the /etc/yum.repos.d/ directory.
      wget -O /etc/yum.repos.d/hce.repo https://mirrors.huaweicloud.com/artifactory/os-conf/hce/hce.repo
      # Clear the existing Yum cache.
      yum clean all
      # Generate a new Yum cache.
      yum makecache
      # Perform a test.
      yum update --allowerasing --skip-broken --nobest
  7. To use git clone and git lfs commands to download large models, see the following operations:

    1. The Euler source does not provide the git-lfs package, so install it manually. Enter the following address in the address box of the browser, download the git-lfs package, and upload it to the /home directory on the server. This directory is mounted to the /home_host directory of the container when the container is started, so the package can be used directly in the container.
      https://github.com/git-lfs/git-lfs/releases/download/v3.2.0/git-lfs-linux-arm64-v3.2.0.tar.gz
    2. Go to the container and run the git-lfs installation commands.
      cd /home_host
      tar -zxvf git-lfs-linux-arm64-v3.2.0.tar.gz
      cd git-lfs-3.2.0
      sh install.sh
    3. Disable SSL verification for Git configuration.
      git config --global http.sslVerify false
    4. The following commands use code in diffusers as an example. Replace the development directory.
      # Clone the diffusers source code. The -b option can specify a branch. Replace the development directory.
      cd /home_host/<user-directory>
      mkdir sd
      cd sd
      git clone https://github.com/huggingface/diffusers.git -b v0.11.1-patch

      Run git clone to download the model on Hugging Face. The following uses a stable-diffusion (SD) model as an example.

      If error "SSL_ERROR_SYSCALL" is reported during the download, try again. The download may take several hours due to network restrictions and large file size. If the download still fails after multiple retries, download the large file from the website and upload it to the personal development directory in /home on the server. To skip the large files during download, set GIT_LFS_SKIP_SMUDGE to 1.
      git lfs install 
      git clone https://huggingface.co/runwayml/stable-diffusion-v1-5 -b onnx
      Figure 21 Downloaded code
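      As mentioned above, GIT_LFS_SKIP_SMUDGE=1 lets you clone without the large files and fetch them selectively later. A sketch (the --include filter pattern is illustrative):

      # Clone without downloading LFS files, then pull only the files you need
      GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/runwayml/stable-diffusion-v1-5 -b onnx
      cd stable-diffusion-v1-5
      git lfs pull --include="*.onnx"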
  8. If a container is used or shared by multiple users, restrict it from accessing the OpenStack management address (169.254.169.254) to prevent it from obtaining host machine metadata. For details, see Forbidding Containers to Obtain Host Machine Metadata.
  9. Save the image in the container environment.

    After the environment is configured, you can develop and debug the service code. To prevent the environment from being lost after the host is restarted, run the following commands to save the configured environment as a new image:

    # Check the ID of the container to be saved as an image.
    docker ps  
    # Save the image.
    docker commit <container-ID>  <custom-image-name>:<custom-image-tag>  
    # View the saved image.
    docker images  
    # If you need to share the image with others in other environments, save the image as a TAR file. This command takes a long time. You can view the file by running the ls command after it is saved.
    docker save -o <custom-name>.tar <image-name>:<image-tag>  
    # Load the file on other hosts. After the file is loaded, you can view the image.
    docker load --input <custom-name>.tar

    For details about how to migrate services to Ascend for development and debugging, see the related documents.