Updated on 2024-11-19 GMT+08:00

Configuring the Software Environment on the GPU Server

Scenario

This section describes how to configure the software environment on a GPU BMS, including installing the NVIDIA driver and CUDA toolkit. The pre-installed software varies by GPU preset image; you can check what is installed by referring to Mapping Between Compute Resources and Image Versions. The following describes the common software installation procedures. Read the parts that cover the software you need to install.

The following are typical configuration scenarios. Refer to the corresponding section for quick configuration.

Installing an NVIDIA Driver

  1. Visit the NVIDIA official website.
  2. The Ant8 specifications are used as an example. Select a driver based on the Ant8 details and the required CUDA version.

    Figure 1 Selecting a driver

    The matching driver version is displayed automatically. Download it by running the following command:
    wget https://cn.download.nvidia.com/tesla/470.182.03/NVIDIA-Linux-x86_64-470.182.03.run

  3. Assign permissions.

    chmod +x NVIDIA-Linux-x86_64-470.182.03.run

  4. Run the installation file.

    ./NVIDIA-Linux-x86_64-470.182.03.run

    The NVIDIA driver is now installed.
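
    After the installer finishes, you can optionally confirm that the kernel module is loaded and check the reported versions. A minimal check, assuming the installation completed without errors:

    lsmod | grep nvidia    # The NVIDIA kernel modules should be listed
    nvidia-smi             # Displays the GPUs, driver version, and highest supported CUDA version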

Installing a CUDA Toolkit

The NVIDIA driver installed above was selected based on CUDA 12.0, so CUDA 12.0 is installed by default in this example.

  1. Visit CUDA Toolkit.
  2. After you set the OS, architecture, distribution, version, and installation type, an installation command is generated. Copy the command and run it.

    Figure 2 Settings

    The generated installation commands are as follows:

    wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin
    sudo mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600
    wget https://developer.download.nvidia.com/compute/cuda/12.1.1/local_installers/cuda-repo-ubuntu2004-12-1-local_12.1.1-530.30.02-1_amd64.deb
    sudo dpkg -i cuda-repo-ubuntu2004-12-1-local_12.1.1-530.30.02-1_amd64.deb
    sudo cp /var/cuda-repo-ubuntu2004-12-1-local/cuda-*-keyring.gpg /usr/share/keyrings/
    sudo apt-get update
    sudo apt-get -y install cuda

    To obtain an earlier CUDA version, see the CUDA Toolkit Archive.
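
    To confirm that the toolkit is usable, you can check the nvcc version after the installation. A minimal check, assuming the default installation path /usr/local/cuda:

    /usr/local/cuda/bin/nvcc -V    # Prints the CUDA compiler (nvcc) version
    ls -l /usr/local/cuda          # Symbolic link pointing to the installed CUDA version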

Installing Docker

Docker is not pre-installed in some Vnt1 BMS preset images. To install Docker, perform the following operations:

  1. Install Docker.

    curl https://get.docker.com | sh && sudo systemctl --now enable docker

  2. Install the NVIDIA container plug-in.

    distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
    && curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
    && curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list |
     sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
    apt-get update
    apt-get install -y nvidia-container-toolkit
    nvidia-ctk runtime configure --runtime=docker
    systemctl restart docker

  3. Check that the Docker environment works correctly.

    The following uses PyTorch 2.0 as an example. The image is large, so pulling it may take a while.

    docker run -ti --runtime=nvidia --gpus all pytorch/pytorch:2.0.0-cuda11.7-cudnn8-devel bash
    Figure 3 Image pulled
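
    After the container starts, you can verify that the GPUs are visible to both the NVIDIA container runtime and PyTorch. A minimal check, run inside the container shell (assuming the image was pulled successfully):

    nvidia-smi    # GPUs made available by the NVIDIA container runtime
    python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"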

Installing nvidia-fabricmanager

Ant GPUs support NVLink and NVSwitch. If you use a node with multiple GPUs, install the nvidia-fabricmanager version that matches your driver version so that the GPUs can interconnect. Otherwise, GPU pods may be unavailable.

The nvidia-fabricmanager version must be the same as the nvidia driver version.

The following uses version 515.105.01 as an example.

version=515.105.01
main_version=$(echo $version | awk -F '.' '{print $1}')
apt-get update
apt-get -y install nvidia-fabricmanager-${main_version}=${version}-*

Verify the driver installation result. Start the fabricmanager service and check whether the status is RUNNING.

nvidia-smi -pm 1
nvidia-smi
systemctl enable nvidia-fabricmanager
systemctl start nvidia-fabricmanager
systemctl status nvidia-fabricmanager
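
If you are unsure which driver version is installed, you can derive the matching package version from the running driver instead of hard-coding it. A minimal sketch, assuming nvidia-smi is already working:

version=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
main_version=$(echo $version | awk -F '.' '{print $1}')
apt-get update
apt-get -y install nvidia-fabricmanager-${main_version}=${version}-*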

Installing NVIDIA 515 and CUDA 11.7 on a GP Vnt1 BMS in EulerOS 2.9

This section describes how to install NVIDIA 515.105.01 and CUDA 11.7.1 on a GP Vnt1 BMS in EulerOS 2.9.

  1. Install the NVIDIA driver.

    wget https://us.download.nvidia.com/tesla/515.105.01/NVIDIA-Linux-x86_64-515.105.01.run
    chmod 700 NVIDIA-Linux-x86_64-515.105.01.run
    
    yum install -y elfutils-libelf-devel
    ./NVIDIA-Linux-x86_64-515.105.01.run --kernel-source-path=/usr/src/kernels/4.18.0-147.5.1.6.h998.eulerosv2r9.x86_64

    By default, the Vnt1 BMS uses the Yum repository at http://repo.huaweicloud.com, which is reachable. If a software package conflict is reported when you run the yum update command, remove the conflicting package by running yum remove xxx.

    The NVIDIA driver is a binary installer and requires the libelf library provided by the elfutils-libelf-devel development package. libelf provides a set of C functions for reading, modifying, and creating ELF files, which the NVIDIA driver needs in order to parse the currently running kernel and related information.

    During the installation, select OK or YES as prompted. After the installation, run the reboot command to restart the server. Log in to the server again and run the following commands to view the GPU information:

     nvidia-smi -pm 1    # This command takes a while to complete. It enables persistence mode, which optimizes GPU performance on the Linux instance.
     nvidia-smi

  2. Install CUDA.

    wget https://developer.download.nvidia.com/compute/cuda/11.7.1/local_installers/cuda_11.7.1_515.65.01_linux.run
    chmod 700 cuda_11.7.1_515.65.01_linux.run
    ./cuda_11.7.1_515.65.01_linux.run --toolkit --samples --silent

    Check the installation result.

    /usr/local/cuda/bin/nvcc -V

  3. Install PyTorch 2.0 and verify CUDA.

    To install PyTorch 2.0, Python 3.10 is required, and the miniconda environment needs to be installed and configured.

    1. Install miniconda and create the alpha environment.
      wget https://repo.anaconda.com/miniconda/Miniconda3-py310_23.1.0-1-Linux-x86_64.sh
      chmod 750 Miniconda3-py310_23.1.0-1-Linux-x86_64.sh
      bash Miniconda3-py310_23.1.0-1-Linux-x86_64.sh -b -p /home/miniconda
      export PATH=/home/miniconda/bin:$PATH
      conda create --quiet --yes -n alpha python=3.10
    2. Install PyTorch 2.0 and verify the CUDA status.
      Install PyTorch 2.0 in the alpha environment and use the Tsinghua PIP source.
      source activate alpha
      pip install torch==2.0 -i https://pypi.tuna.tsinghua.edu.cn/simple
      python
      Verify the installation status of PyTorch and CUDA. If the output is True, the installation is successful.
      import torch
      print(torch.cuda.is_available())
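
      For a more detailed check, you can also print the CUDA version PyTorch was built with and the detected GPUs. A minimal sketch, run in the same alpha environment (the exact device name depends on the GPU model):

      python -c "import torch; print(torch.__version__, torch.version.cuda)"
      python -c "import torch; print(torch.cuda.device_count(), torch.cuda.get_device_name(0))"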

Installing NVIDIA 470 and CUDA 11.4 on a GP Vnt1 BMS in Ubuntu 18.04

This section describes how to install NVIDIA 470 and CUDA 11.4 on a GP Vnt1 BMS in Ubuntu 18.04.

  1. Install the NVIDIA driver.

    apt-get update
    sudo apt-get install nvidia-driver-470

  2. Install CUDA.

    wget https://developer.download.nvidia.com/compute/cuda/11.4.4/local_installers/cuda_11.4.4_470.82.01_linux.run
    chmod +x cuda_11.4.4_470.82.01_linux.run
    ./cuda_11.4.4_470.82.01_linux.run --toolkit --samples --silent

  3. Verify the NVIDIA installation result.

    nvidia-smi -pm 1
    nvidia-smi
    /usr/local/cuda/bin/nvcc -V

  4. Install PyTorch 2.0 and verify CUDA.

    To install PyTorch 2.0, Python 3.10 is required, and the miniconda environment needs to be installed and configured.

    1. Install miniconda and create the alpha environment.
      wget https://repo.anaconda.com/miniconda/Miniconda3-py310_23.1.0-1-Linux-x86_64.sh
      chmod 750 Miniconda3-py310_23.1.0-1-Linux-x86_64.sh
      bash Miniconda3-py310_23.1.0-1-Linux-x86_64.sh -b -p /home/miniconda
      export PATH=/home/miniconda/bin:$PATH
      conda create --quiet --yes -n alpha python=3.10
    2. Install PyTorch 2.0 and verify the CUDA status.
      Install PyTorch 2.0 in the alpha environment using conda.
      source activate alpha
      conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
      python
      Verify the installation status of PyTorch and CUDA. If the output is True, the installation is successful.
      import torch
      print(torch.cuda.is_available())
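
    Because the driver in this scenario is installed from the Ubuntu repository (step 1), an unattended upgrade may later replace it with a different driver branch and break the CUDA setup. An optional sketch to pin the nvidia-driver-470 package installed above:

    sudo apt-mark hold nvidia-driver-470    # Prevent the driver from being upgraded automatically
    apt-mark showhold                       # Confirm that the hold is in place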

Installing NVIDIA 515 and CUDA 11.7 on a GP Vnt1 BMS in Ubuntu 18.04

This section describes how to install NVIDIA 515 and CUDA 11.7 on a GP Vnt1 BMS in Ubuntu 18.04.

  1. Install the NVIDIA driver.

    wget https://us.download.nvidia.com/tesla/515.105.01/NVIDIA-Linux-x86_64-515.105.01.run
    chmod +x NVIDIA-Linux-x86_64-515.105.01.run
    ./NVIDIA-Linux-x86_64-515.105.01.run

  2. Install CUDA.

    wget https://developer.download.nvidia.com/compute/cuda/11.7.1/local_installers/cuda_11.7.1_515.65.01_linux.run
    chmod +x cuda_11.7.1_515.65.01_linux.run
    ./cuda_11.7.1_515.65.01_linux.run --toolkit --samples --silent

  3. Install Docker.

    curl https://get.docker.com | sh && sudo systemctl --now enable docker

  4. Install the NVIDIA container plug-in.

    distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
    && curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
    && curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list |
     sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
    apt-get update
    apt-get install -y nvidia-container-toolkit
    nvidia-ctk runtime configure --runtime=docker
    systemctl restart docker

  5. Check that the Docker environment works correctly.

    The following uses PyTorch 2.0 as an example. The image is large, so pulling it may take a while.

    docker run -ti --runtime=nvidia --gpus all pytorch/pytorch:2.0.0-cuda11.7-cudnn8-devel bash
    Figure 4 Image pulled
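
    This scenario does not list a separate verification step. You can confirm the driver and toolkit on the host with the same checks used in the other scenarios, assuming CUDA was installed to the default /usr/local/cuda path:

    nvidia-smi -pm 1
    nvidia-smi
    /usr/local/cuda/bin/nvcc -V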

Installing NVIDIA 515 and CUDA 11.7 on a GP Ant8 BMS in Ubuntu 20.04

This section describes how to install NVIDIA driver 515, CUDA 11.7, and nvidia-fabricmanager 515 on GP Ant8 BMS (Ubuntu 20.04) and perform the nccl-test test.

  1. Replace the APT source.

    sudo sed -i "s@http://.*archive.ubuntu.com@http://repo.huaweicloud.com@g" /etc/apt/sources.list
    sudo sed -i "s@http://.*security.ubuntu.com@http://repo.huaweicloud.com@g" /etc/apt/sources.list
    sudo apt update

  2. Install the NVIDIA driver.

    wget https://us.download.nvidia.com/tesla/515.105.01/NVIDIA-Linux-x86_64-515.105.01.run
    chmod +x NVIDIA-Linux-x86_64-515.105.01.run
    ./NVIDIA-Linux-x86_64-515.105.01.run

  3. Install CUDA.

    # Install the .run package.
    wget https://developer.download.nvidia.com/compute/cuda/11.7.0/local_installers/cuda_11.7.0_515.43.04_linux.run
    chmod +x cuda_11.7.0_515.43.04_linux.run
    ./cuda_11.7.0_515.43.04_linux.run --toolkit --samples --silent

  4. Install NCCL.

    The following installs NCCL for CUDA 11.7 as an example.

    wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-keyring_1.0-1_all.deb
    sudo dpkg -i cuda-keyring_1.0-1_all.deb
    sudo apt update
    sudo apt install libnccl2=2.14.3-1+cuda11.7 libnccl-dev=2.14.3-1+cuda11.7

    The following is displayed after NCCL is installed.

    Figure 5 Viewing NCCL

  5. Install nvidia-fabricmanager.

    The nvidia-fabricmanager version must be the same as the nvidia driver version.

    version=515.105.01
    main_version=$(echo $version | awk -F '.' '{print $1}')
    apt-get update
    apt-get -y install nvidia-fabricmanager-${main_version}=${version}-*

    Verify the driver installation result. Start the fabricmanager service and check whether the status is RUNNING.

    nvidia-smi -pm 1
    nvidia-smi
    systemctl enable nvidia-fabricmanager
    systemctl start nvidia-fabricmanager
    systemctl status nvidia-fabricmanager

  6. Install nv-peer-memory.

    git clone https://github.com/Mellanox/nv_peer_memory.git
    cd ./nv_peer_memory
    ./build_module.sh
    cd /tmp
    tar xzf /tmp/nvidia-peer-memory_1.3.orig.tar.gz
    cd nvidia-peer-memory-1.3
    dpkg-buildpackage -us -uc
    dpkg -i ../nvidia-peer-memory-dkms_1.2-0_all.deb

    nv_peer_mem runs as a Linux kernel module. Run the lsmod | grep peer command to check whether nv_peer_mem has been loaded into the kernel.

    • If the code cannot be pulled by running the git clone command, configure git as follows:
      git config --global core.compression -1
      export GIT_SSL_NO_VERIFY=1
      git config --global http.sslVerify false
      git config --global http.postBuffer 10524288000
      git config --global http.lowSpeedLimit 1000
      git config --global http.lowSpeedTime 1800
    • If nv-peer-memory is not displayed after the installation, the InfiniBand driver version may be too old. In this case, upgrade the InfiniBand driver.
      wget https://content.mellanox.com/ofed/MLNX_OFED-5.4-3.6.8.1/MLNX_OFED_LINUX-5.4-3.6.8.1-ubuntu20.04-x86_64.tgz
      tar -zxvf MLNX_OFED_LINUX-5.4-3.6.8.1-ubuntu20.04-x86_64.tgz
      cd MLNX_OFED_LINUX-5.4-3.6.8.1-ubuntu20.04-x86_64
      apt-get install -y python3 gcc quilt build-essential bzip2 dh-python pkg-config dh-autoreconf python3-distutils debhelper make
      ./mlnxofedinstall --add-kernel-support
    • For details about installing a later version, see Linux InfiniBand Drivers. For example, to install the latest version, MLNX_OFED-5.8-2.0.3.0:
      wget https://content.mellanox.com/ofed/MLNX_OFED-5.8-2.0.3.0/MLNX_OFED_LINUX-5.8-2.0.3.0-ubuntu20.04-x86_64.tgz
      tar -zxvf MLNX_OFED_LINUX-5.8-2.0.3.0-ubuntu20.04-x86_64.tgz
      cd MLNX_OFED_LINUX-5.8-2.0.3.0-ubuntu20.04-x86_64
      apt-get install -y python3 gcc quilt build-essential bzip2 dh-python pkg-config dh-autoreconf python3-distutils debhelper make
      ./mlnxofedinstall --add-kernel-support
    • After nv_peer_mem is installed, view its status.
      /etc/init.d/nv_peer_mem status

      If the file does not exist, it may not have been copied during the installation. In this case, copy it manually:

      cp /tmp/nvidia-peer-memory-1.3/nv_peer_mem.conf  /etc/infiniband/
      cp /tmp/nvidia-peer-memory-1.3/debian/tmp/etc/init.d/nv_peer_mem   /etc/init.d/ 

  7. Configure environment variables.

    The Open MPI version in the path must match the version actually installed on the server. You can run the ls /usr/mpi/gcc/ command to check it.

    # Add to ~/.bashrc.
    export LD_LIBRARY_PATH=/usr/local/cuda/lib:/usr/local/cuda/lib64:/usr/include/nccl.h:/usr/mpi/gcc/openmpi-4.1.2a1/lib:$LD_LIBRARY_PATH
    export PATH=$PATH:/usr/local/cuda/bin:/usr/mpi/gcc/openmpi-4.1.2a1/bin

  8. Install and compile nccl-test.

    cd /root
    git clone https://github.com/NVIDIA/nccl-tests.git
    cd ./nccl-tests
    make  MPI=1 MPI_HOME=/usr/mpi/gcc/openmpi-4.1.2a1 -j 8

    The MPI=1 parameter must be added during compilation. Otherwise, tests across multiple nodes cannot be performed.

    The Open MPI version in MPI_HOME must match the version actually installed on the server. You can run the ls /usr/mpi/gcc/ command to check it.

  9. Perform the nccl-test test.

    • Single-server test:
      /root/nccl-tests/build/all_reduce_perf -b 8 -e 1024M -f 2 -g 8
    • Multi-server test (replace the value after btl_tcp_if_include with the name of the active NIC):
      mpirun --allow-run-as-root --hostfile hostfile -mca btl_tcp_if_include eth0 -mca btl_openib_allow_ib true -x NCCL_DEBUG=INFO -x NCCL_IB_GID_INDEX=3 -x NCCL_IB_TC=128 -x NCCL_ALGO=RING -x NCCL_IB_HCA=^mlx5_bond_0 -x LD_LIBRARY_PATH  /root/nccl-tests/build/all_reduce_perf -b 8 -e 11g -f 2 -g 8

      hostfile format:

      #Private IP address of the host  Number of processes on a single node
      192.168.20.1 slots=1
      192.168.20.2 slots=1

      NCCL environment variables:

      • NCCL_IB_GID_INDEX=3: enables data packets to be transmitted through queue 4 of the switch, which is RoCE-compliant.
      • NCCL_IB_TC=128: enables RoCEv2. RoCEv1 is enabled by default, but it does not support congestion control on switches, which may lead to packet loss. In addition, newer switches do not support RoCEv1, so RoCEv1 may fail.
      • NCCL_ALGO=RING: The bus bandwidth reported by nccl-test is calculated based on the ring algorithm.

        The formulas are as follows: Bus bandwidth = Algorithm bandwidth x 2(N-1)/N, Algorithm bandwidth = Data volume/Time. For example, with N = 8, an algorithm bandwidth of 80 GB/s corresponds to a bus bandwidth of 80 x 2 x (8-1)/8 = 140 GB/s.

        These formulas hold only for the ring algorithm; the formulas for the tree algorithm are different.

        The bus bandwidth computed for the tree algorithm reflects its acceleration relative to the ring algorithm: the tree algorithm completes the allreduce with less data traffic and in less total time, so the bus bandwidth derived from the formula is inflated. In theory the tree algorithm is better than the ring algorithm, but it places higher demands on the network and its results can be unstable, so it is not suitable for performance testing. For example, the actual bandwidth between two nodes may be 100 GB/s, while the tested speed can reach 110 GB/s or even 130 GB/s. After NCCL_ALGO=RING is set, the measured speed is stable with two or more nodes.

        During the test, password-free SSH login is required between the node where the mpirun command is executed and the nodes in the hostfile. To set up SSH password-free login, perform the following steps:
        1. Generate a pair of public and private keys on the local client.
          ssh-keygen

          After the preceding command is executed, id_rsa.pub (public key) and id_rsa (private key) are created in the .ssh folder in the user directory. View the public key and private key.

          cd ~/.ssh
        2. Upload the public key to the server.
          In the following example, the username is root and the server address is 192.168.222.213.
          ssh-copy-id -i ~/.ssh/id_rsa.pub root@192.168.222.213

          View the authorized_keys file to confirm that the id_rsa.pub (public key) content has been added.

          cd ~/.ssh
          vim authorized_keys
        3. Test password-free login.

          Connect from the client to the remote server through SSH. If the configuration is correct, you can log in without entering a password.

          ssh root@192.168.222.213
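
          To check non-interactively that key-based login works for each node in the hostfile (mpirun fails if any node still prompts for a password), you can use SSH batch mode. A minimal sketch, using the example address above:

          ssh -o BatchMode=yes root@192.168.222.213 hostname    # Fails instead of prompting if key-based login is not configured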