Configuring the Software Environment on the GPU Server

Updated on 2025-02-13 GMT+08:00

Scenario

This section describes how to configure the software environment on a GPU BMS, including installing the NVIDIA driver and CUDA toolkit. The pre-installed software varies between GPU preset images; to see what each image contains, refer to Mapping Between Compute Resources and Image Versions. The sections below cover the common installation procedures. Read the parts that match the software you need to install.

The typical configuration scenarios are listed below. Refer to the related sections for quick configuration.

Installing an NVIDIA Driver

  1. Visit the NVIDIA official website.
  2. Select a driver based on your GPU model and the required CUDA version. The Ant8 specifications are used as an example here.

    Figure 1 Selecting a driver

    The website displays the matching driver version and provides a download link. Download the installer:
    wget https://cn.download.nvidia.com/tesla/470.182.03/NVIDIA-Linux-x86_64-470.182.03.run

  3. Make the installer executable.

    chmod +x NVIDIA-Linux-x86_64-470.182.03.run

  4. Run the installation file.

    ./NVIDIA-Linux-x86_64-470.182.03.run

    The NVIDIA driver is now installed.
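The installer file name embeds the driver version. If later steps (for example, installing a matching nvidia-fabricmanager) need that version, a small sketch like the following can extract it; the file name below is simply the example used above.

```shell
# Extract the driver version from the runfile name (example name from above).
runfile="NVIDIA-Linux-x86_64-470.182.03.run"
version=$(echo "$runfile" | sed 's/^NVIDIA-Linux-x86_64-\(.*\)\.run$/\1/')
echo "$version"   # prints 470.182.03
```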

Installing a CUDA Toolkit

In the preceding example, the NVIDIA driver was selected based on CUDA 12.0, so CUDA 12.0 is installed by default here.

  1. Visit CUDA Toolkit.
  2. After you set the OS, architecture, distribution, version, and installation type, an installation command is generated. Copy the command and run it.

    Figure 2 Settings

    The generated installation commands are as follows:

    wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin
    sudo mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600
    wget https://developer.download.nvidia.com/compute/cuda/12.1.1/local_installers/cuda-repo-ubuntu2004-12-1-local_12.1.1-530.30.02-1_amd64.deb
    sudo dpkg -i cuda-repo-ubuntu2004-12-1-local_12.1.1-530.30.02-1_amd64.deb
    sudo cp /var/cuda-repo-ubuntu2004-12-1-local/cuda-*-keyring.gpg /usr/share/keyrings/
    sudo apt-get update
    sudo apt-get -y install cuda
    NOTE:

    To obtain CUDA of earlier versions, see CUDA Toolkit Archive.
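Each CUDA runfile installer used in this guide bundles a specific driver version (visible in the installer file names, e.g. cuda_11.7.1_515.65.01_linux.run), and a driver at least that new is needed for the toolkit to work. The following is a minimal sketch of that compatibility check, using only versions that appear in this document:

```python
# Driver versions bundled with the CUDA installers used in this guide,
# taken from the installer file names shown in the steps.
BUNDLED_DRIVER = {
    "11.4.4": "470.82.01",
    "11.7.1": "515.65.01",
    "12.1.1": "530.30.02",
}

def as_tuple(version: str) -> tuple:
    """Turn '515.105.01' into (515, 105, 1) for numeric comparison."""
    return tuple(int(part) for part in version.split("."))

def driver_ok(installed_driver: str, cuda_version: str) -> bool:
    """Check that the installed driver is at least the bundled one."""
    return as_tuple(installed_driver) >= as_tuple(BUNDLED_DRIVER[cuda_version])

print(driver_ok("515.105.01", "11.7.1"))  # True
print(driver_ok("470.182.03", "11.7.1"))  # False
```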

Installing Docker

Docker is not pre-installed in some Vnt1 BMS preset images. To install Docker, perform the following operations:

  1. Install Docker.

    curl https://get.docker.com | sh && sudo systemctl --now enable docker

  2. Install the NVIDIA container toolkit (nvidia-container-toolkit).

    distribution=$(. /etc/os-release; echo $ID$VERSION_ID) \
      && curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
      && curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
        sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
        sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
    apt-get update
    apt-get install -y nvidia-container-toolkit
    nvidia-ctk runtime configure --runtime=docker
    systemctl restart docker

  3. Verify the Docker environment.

    The following uses PyTorch 2.0 as an example. The image used here is large, so pulling it may take a while.

    docker run -ti --runtime=nvidia --gpus all pytorch/pytorch:2.0.0-cuda11.7-cudnn8-devel bash
    Figure 3 Image pulled
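The $distribution variable in step 2 is built from the ID and VERSION_ID fields of /etc/os-release. The sketch below simulates that with hard-coded sample values instead of sourcing the real file:

```shell
# Simulated /etc/os-release fields; on a real Ubuntu 20.04 host these
# would come from `. /etc/os-release`.
ID=ubuntu
VERSION_ID=20.04
distribution="$ID$VERSION_ID"
echo "$distribution"   # prints ubuntu20.04
```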

Installing nvidia-fabricmanager

Ant GPUs support NVLink and NVSwitch. If you use a node with multiple GPUs, install an nvidia-fabricmanager version that matches your driver version to enable interconnection between GPUs. Otherwise, GPU pods may be unusable.

NOTE:

The nvidia-fabricmanager version must match the NVIDIA driver version.

The following uses version 515.105.01 as an example.

version=515.105.01
main_version=$(echo $version | awk -F '.' '{print $1}')
apt-get update
apt-get -y install nvidia-fabricmanager-${main_version}=${version}-*

Verify the driver installation result. Start the fabricmanager service and check whether the status is RUNNING.

nvidia-smi -pm 1
nvidia-smi
systemctl enable nvidia-fabricmanager
systemctl start nvidia-fabricmanager
systemctl status nvidia-fabricmanager
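Since the nvidia-fabricmanager version must match the driver version exactly, a quick sanity check can compare the two strings before enabling the service. The versions below are hard-coded for illustration; on a real host they would come from nvidia-smi and dpkg-query, as sketched in the comments:

```shell
# On a real host (not run here):
#   driver_version=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1)
#   fm_pkg_version=$(dpkg-query -W -f='${Version}' nvidia-fabricmanager-515)
driver_version="515.105.01"
fm_pkg_version="515.105.01-1"       # Debian package versions carry a -<revision> suffix
fm_version="${fm_pkg_version%%-*}"  # strip the packaging revision
if [ "$driver_version" = "$fm_version" ]; then
  echo "versions match"
else
  echo "version mismatch: driver=$driver_version fabricmanager=$fm_version"
fi
```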

Installing NVIDIA 515 and CUDA 11.7 on a GP Vnt1 BMS in EulerOS 2.9

This section describes how to install NVIDIA 515.105.01 and CUDA 11.7.1 on a GP Vnt1 BMS in EulerOS 2.9.

  1. Install the NVIDIA driver.

    wget https://us.download.nvidia.com/tesla/515.105.01/NVIDIA-Linux-x86_64-515.105.01.run
    chmod 700 NVIDIA-Linux-x86_64-515.105.01.run
    
    yum install -y elfutils-libelf-devel
    ./NVIDIA-Linux-x86_64-515.105.01.run --kernel-source-path=/usr/src/kernels/4.18.0-147.5.1.6.h998.eulerosv2r9.x86_64
    NOTE:

    By default, the Vnt1 BMS uses the Yum repository at http://repo.huaweicloud.com, which is reachable out of the box. If running the yum update command reports a software package conflict, remove the conflicting package with the yum remove <package> command.

    The NVIDIA driver installer is a binary and requires the libelf library, provided by the elfutils-libelf-devel development package. libelf provides a set of C functions for reading, modifying, and creating ELF files, which the NVIDIA driver needs to parse the currently running kernel and related information.

    During the installation, select OK or YES as prompted. After the installation, run the reboot command to restart the server. Log in to the server again and run the following command to view the GPU information:

     nvidia-smi -pm 1    # This command takes a while to complete. Persistence mode is enabled to optimize GPU performance on the Linux instance.
     nvidia-smi

  2. Install CUDA.

    wget https://developer.download.nvidia.com/compute/cuda/11.7.1/local_installers/cuda_11.7.1_515.65.01_linux.run
    chmod 700 cuda_11.7.1_515.65.01_linux.run
    ./cuda_11.7.1_515.65.01_linux.run --toolkit --samples --silent

    Check the installation result.

    /usr/local/cuda/bin/nvcc -V

  3. Install PyTorch 2.0 and verify CUDA.

    PyTorch 2.0 requires Python 3.10, so install and configure a Miniconda environment first.

    1. Install miniconda and create the alpha environment.
      wget https://repo.anaconda.com/miniconda/Miniconda3-py310_23.1.0-1-Linux-x86_64.sh
      chmod 750 Miniconda3-py310_23.1.0-1-Linux-x86_64.sh
      bash Miniconda3-py310_23.1.0-1-Linux-x86_64.sh -b -p /home/miniconda
      export PATH=/home/miniconda/bin:$PATH
      conda create --quiet --yes -n alpha python=3.10
    2. Install PyTorch 2.0 and verify the CUDA status.
      Install PyTorch 2.0 in the alpha environment and use the Tsinghua PIP source.
      source activate alpha
      pip install torch==2.0 -i https://pypi.tuna.tsinghua.edu.cn/simple
      python
      Verify the installation status of PyTorch and CUDA. If the output is True, the installation is successful.
      import torch
      print(torch.cuda.is_available())

Installing NVIDIA 470 and CUDA 11.4 on a GP Vnt1 BMS in Ubuntu 18.04

This section describes how to install NVIDIA 470 and CUDA 11.4 on a GP Vnt1 BMS in Ubuntu 18.04.

  1. Install the NVIDIA driver.

    apt-get update
    sudo apt-get install nvidia-driver-470

  2. Install CUDA.

    wget https://developer.download.nvidia.com/compute/cuda/11.4.4/local_installers/cuda_11.4.4_470.82.01_linux.run
    chmod +x cuda_11.4.4_470.82.01_linux.run
    ./cuda_11.4.4_470.82.01_linux.run --toolkit --samples --silent

  3. Verify the NVIDIA installation result.

    nvidia-smi -pm 1
    nvidia-smi
    /usr/local/cuda/bin/nvcc -V

  4. Install PyTorch 2.0 and verify CUDA.

    PyTorch 2.0 requires Python 3.10, so install and configure a Miniconda environment first.

    1. Install miniconda and create the alpha environment.
      wget https://repo.anaconda.com/miniconda/Miniconda3-py310_23.1.0-1-Linux-x86_64.sh
      chmod 750 Miniconda3-py310_23.1.0-1-Linux-x86_64.sh
      bash Miniconda3-py310_23.1.0-1-Linux-x86_64.sh -b -p /home/miniconda
      export PATH=/home/miniconda/bin:$PATH
      conda create --quiet --yes -n alpha python=3.10
    2. Install PyTorch 2.0 and verify the CUDA status.
      Install PyTorch 2.0 in the alpha environment and use the Tsinghua PIP source.
      source activate alpha
      conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
      python
      Verify the installation status of PyTorch and CUDA. If the output is True, the installation is successful.
      import torch
      print(torch.cuda.is_available())
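The nvcc -V check in step 3 prints a version banner. If a script needs the release number programmatically, it can be pulled out with a regular expression; the sample output below is illustrative, not captured from a real run:

```python
import re

# Illustrative `nvcc -V` output (the exact banner varies by release).
sample = """nvcc: NVIDIA (R) Cuda compiler driver
Cuda compilation tools, release 11.4, V11.4.x"""

match = re.search(r"release (\d+\.\d+)", sample)
print(match.group(1))  # prints 11.4
```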

Installing NVIDIA 515 and CUDA 11.7 on a GP Vnt1 BMS in Ubuntu 18.04

This section describes how to install NVIDIA 515 and CUDA 11.7 on a GP Vnt1 BMS in Ubuntu 18.04.

  1. Install the NVIDIA driver.

    wget https://us.download.nvidia.com/tesla/515.105.01/NVIDIA-Linux-x86_64-515.105.01.run
    chmod +x NVIDIA-Linux-x86_64-515.105.01.run
    ./NVIDIA-Linux-x86_64-515.105.01.run

  2. Install CUDA.

    wget https://developer.download.nvidia.com/compute/cuda/11.7.1/local_installers/cuda_11.7.1_515.65.01_linux.run
    chmod +x cuda_11.7.1_515.65.01_linux.run
    ./cuda_11.7.1_515.65.01_linux.run --toolkit --samples --silent

  3. Install Docker.

    curl https://get.docker.com | sh && sudo systemctl --now enable docker

  4. Install the NVIDIA container toolkit (nvidia-container-toolkit).

    distribution=$(. /etc/os-release; echo $ID$VERSION_ID) \
      && curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
      && curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
        sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
        sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
    apt-get update
    apt-get install -y nvidia-container-toolkit
    nvidia-ctk runtime configure --runtime=docker
    systemctl restart docker

  5. Verify the Docker environment.

    The following uses PyTorch 2.0 as an example. The image used here is large, so pulling it may take a while.

    docker run -ti --runtime=nvidia --gpus all pytorch/pytorch:2.0.0-cuda11.7-cudnn8-devel bash
    Figure 4 Image pulled

Installing NVIDIA 515 and CUDA 11.7 on a GP Ant8 BMS in Ubuntu 20.04

This section describes how to install NVIDIA driver 515, CUDA 11.7, and nvidia-fabricmanager 515 on a GP Ant8 BMS (Ubuntu 20.04), and how to run the nccl-test benchmark.

  1. Replace the APT source.

    sudo sed -i "s@http://.*archive.ubuntu.com@http://repo.huaweicloud.com@g" /etc/apt/sources.list
    sudo sed -i "s@http://.*security.ubuntu.com@http://repo.huaweicloud.com@g" /etc/apt/sources.list
    sudo apt update

  2. Install the NVIDIA driver.

    wget https://us.download.nvidia.com/tesla/515.105.01/NVIDIA-Linux-x86_64-515.105.01.run
    chmod +x NVIDIA-Linux-x86_64-515.105.01.run
    ./NVIDIA-Linux-x86_64-515.105.01.run

  3. Install CUDA.

    # Install the .run package.
    wget https://developer.download.nvidia.com/compute/cuda/11.7.0/local_installers/cuda_11.7.0_515.43.04_linux.run
    chmod +x cuda_11.7.0_515.43.04_linux.run
    ./cuda_11.7.0_515.43.04_linux.run --toolkit --samples --silent

  4. Install NCCL.

    NOTE:

    The following uses CUDA 11.7 as an example. Install NCCL.

    wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-keyring_1.0-1_all.deb
    sudo dpkg -i cuda-keyring_1.0-1_all.deb
    sudo apt update
    sudo apt install libnccl2=2.14.3-1+cuda11.7 libnccl-dev=2.14.3-1+cuda11.7

    The following is displayed after NCCL is installed.

    Figure 5 Viewing NCCL

  5. Install nvidia-fabricmanager.

    NOTE:

    The nvidia-fabricmanager version must match the NVIDIA driver version.

    version=515.105.01
    main_version=$(echo $version | awk -F '.' '{print $1}')
    apt-get update
    apt-get -y install nvidia-fabricmanager-${main_version}=${version}-*

    Verify the driver installation result. Start the fabricmanager service and check whether the status is RUNNING.

    nvidia-smi -pm 1
    nvidia-smi
    systemctl enable nvidia-fabricmanager
    systemctl start nvidia-fabricmanager
    systemctl status nvidia-fabricmanager

  6. Install nv-peer-memory.

    git clone https://github.com/Mellanox/nv_peer_memory.git
    cd ./nv_peer_memory
    ./build_module.sh
    cd /tmp
    tar xzf /tmp/nvidia-peer-memory_1.3.orig.tar.gz
    cd nvidia-peer-memory-1.3
    dpkg-buildpackage -us -uc
    dpkg -i ../nvidia-peer-memory-dkms_1.2-0_all.deb

    nv_peer_mem works as a Linux kernel module. Run the lsmod | grep peer command to check whether nv_peer_mem is loaded into the kernel.

    NOTE:
    • If the git clone command fails to pull the code, adjust the git configuration:
      git config --global core.compression -1
      export GIT_SSL_NO_VERIFY=1
      git config --global http.sslVerify false
      git config --global http.postBuffer 10524288000
      git config --global http.lowSpeedLimit 1000
      git config --global http.lowSpeedTime 1800
    • If nv-peer-memory is not displayed after the installation, the InfiniBand driver version may be too old. In this case, upgrade the InfiniBand driver.
      wget https://content.mellanox.com/ofed/MLNX_OFED-5.4-3.6.8.1/MLNX_OFED_LINUX-5.4-3.6.8.1-ubuntu20.04-x86_64.tgz
      tar -zxvf MLNX_OFED_LINUX-5.4-3.6.8.1-ubuntu20.04-x86_64.tgz
      cd MLNX_OFED_LINUX-5.4-3.6.8.1-ubuntu20.04-x86_64
      apt-get install -y python3 gcc quilt build-essential bzip2 dh-python pkg-config dh-autoreconf python3-distutils debhelper make
      ./mlnxofedinstall --add-kernel-support
    • For details about how to install a later version, see Linux InfiniBand Drivers. For example, install the latest version, which is MLNX_OFED-5.8-2.0.3.0.
      wget https://content.mellanox.com/ofed/MLNX_OFED-5.8-2.0.3.0/MLNX_OFED_LINUX-5.8-2.0.3.0-ubuntu20.04-x86_64.tgz
      tar -zxvf MLNX_OFED_LINUX-5.8-2.0.3.0-ubuntu20.04-x86_64.tgz
      cd MLNX_OFED_LINUX-5.8-2.0.3.0-ubuntu20.04-x86_64
      apt-get install -y python3 gcc quilt build-essential bzip2 dh-python pkg-config dh-autoreconf python3-distutils debhelper make
      ./mlnxofedinstall --add-kernel-support
    • After nv_peer_mem is installed, view its status.
      /etc/init.d/nv_peer_mem status

      If the file does not exist, it may not have been copied during installation. In this case, copy it manually:

      cp /tmp/nvidia-peer-memory-1.3/nv_peer_mem.conf  /etc/infiniband/
      cp /tmp/nvidia-peer-memory-1.3/debian/tmp/etc/init.d/nv_peer_mem   /etc/init.d/ 

  7. Configure environment variables.

    NOTE:

    The MPI path must match the installed Open MPI version. Run the ls /usr/mpi/gcc/ command to check it.

    # Add to ~/.bashrc.
    export LD_LIBRARY_PATH=/usr/local/cuda/lib:/usr/local/cuda/lib64:/usr/include/nccl.h:/usr/mpi/gcc/openmpi-4.1.2a1/lib:$LD_LIBRARY_PATH
    export PATH=$PATH:/usr/local/cuda/bin:/usr/mpi/gcc/openmpi-4.1.2a1/bin

  8. Install and compile nccl-test.

    cd /root
    git clone https://github.com/NVIDIA/nccl-tests.git
    cd ./nccl-tests
    make  MPI=1 MPI_HOME=/usr/mpi/gcc/openmpi-4.1.2a1 -j 8
    NOTE:

    The MPI=1 parameter must be set during compilation; otherwise, tests across multiple devices cannot be performed.

    The MPI path must match the installed Open MPI version. Run the ls /usr/mpi/gcc/ command to check it.

  9. Run the nccl-test benchmark.

    • Single-server test:
      /root/nccl-tests/build/all_reduce_perf -b 8 -e 1024M -f 2 -g 8
    • Multi-server test (replace the value after btl_tcp_if_include with the name of the active NIC):
      mpirun --allow-run-as-root --hostfile hostfile -mca btl_tcp_if_include eth0 -mca btl_openib_allow_ib true -x NCCL_DEBUG=INFO -x NCCL_IB_GID_INDEX=3 -x NCCL_IB_TC=128 -x NCCL_ALGO=RING -x NCCL_IB_HCA=^mlx5_bond_0 -x LD_LIBRARY_PATH  /root/nccl-tests/build/all_reduce_perf -b 8 -e 11g -f 2 -g 8

      hostfile format:

      # Format: <private IP address of the host> slots=<number of processes on the node>
      192.168.20.1 slots=1
      192.168.20.2 slots=1

      NCCL environment variables:

      • NCCL_IB_GID_INDEX=3: enables data packets to be transmitted through the queue 4 of switches, which is RoCE-compliant.
      • NCCL_IB_TC=128: enables RoCEv2. RoCEv1 is used by default, but RoCEv1 does not support congestion control on switches, which may lead to packet loss. In addition, newer switches do not support RoCEv1 at all, so it can fail outright.
      • NCCL_ALGO=RING: The bus bandwidth of nccl_test is calculated based on the ring algorithm.

        The calculation formulas are as follows: Bus bandwidth = Algorithm bandwidth x 2(N-1)/N, where Algorithm bandwidth = Data volume/Time.

        The ring algorithm must be used. The formulas are different for the tree algorithm.

        The bus bandwidth reported under the tree algorithm reflects its performance acceleration relative to the ring algorithm: the tree algorithm reduces the total computation time, so the bus bandwidth computed from the formula is inflated. In theory the tree algorithm outperforms the ring algorithm, but it places higher demands on the network and its results can be unstable. Because the tree algorithm completes all-reduce with less data traffic, it is unsuitable for performance testing: two nodes with an actual bandwidth of 100 GB/s may report 110 GB/s or even 130 GB/s. With NCCL_ALGO=RING set, the results are stable for two or more nodes.

        NOTE:
        During the test, password-free SSH login is required between the node where the mpirun command is executed and the nodes in the hostfile. To set up SSH password-free login, perform the following steps:
        1. Generate a pair of public and private keys on the local client.
          ssh-keygen

          After the preceding command is executed, id_rsa.pub (public key) and id_rsa (private key) are created in the .ssh folder of the user's home directory. To view them:

          cd ~/.ssh
        2. Upload the public key to the server.
          For example, if the username is root and the server address is 192.168.222.213:
          ssh-copy-id -i ~/.ssh/id_rsa.pub root@192.168.222.213

          View the id_rsa.pub (public key) content.

          cd ~/.ssh
          vim authorized_keys
        3. Test password-free login.

          Connect to the remote server from the client through SSH. You should be able to log in without entering a password.

          ssh root@192.168.222.213
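The bus-bandwidth formula given in the NCCL environment variable notes above can be sketched as a small calculation. The numbers below are illustrative, not measured results:

```python
def algorithm_bandwidth(data_bytes: float, seconds: float) -> float:
    """Algorithm bandwidth = data volume / time, returned in GB/s."""
    return data_bytes / seconds / 1e9

def bus_bandwidth(alg_bw: float, n_ranks: int) -> float:
    """Bus bandwidth = algorithm bandwidth x 2(N-1)/N (ring all-reduce)."""
    return alg_bw * 2 * (n_ranks - 1) / n_ranks

# Illustrative example: 1 GiB reduced across 8 ranks in 20 ms.
alg_bw = algorithm_bandwidth(1 * 1024**3, 0.02)
print(round(alg_bw, 2))                    # ~53.69 GB/s
print(round(bus_bandwidth(alg_bw, 8), 2))  # ~93.95 GB/s
```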
