Running a Multi-Node Multi-PU Training Job on ModelArts Standard
Process
- Preparations
- Purchasing Service Resources (VPC, SFS, OBS, SWR, and ECS)
- Assigning Permissions
- Creating a Dedicated Resource Pool (connecting to the VPC)
- Mounting an SFS Turbo File System to an ECS
- Granting the Read Permission to ModelArts Users on the ECS
- Installing and Configuring the OBS CLI
- (Optional) Configuring Workspaces
- Model training
Building and Debugging an Image Locally
In this section, the environment is built by packaging a conda environment. You can also install the conda environment dependencies directly using pip install or conda install.

- The container image should be smaller than 15 GB. For details, see Constraints on Custom Images of the Training Framework.
- Build the image based on releases from the official open-source website, for example, PyTorch.
- Build the image in layers. Keep each layer under 1 GB and under 100,000 files, and put the layers that change least often first. For example, build the OS, CUDA driver, Python, PyTorch, and other dependency packages in sequence.
- If the training data and code change frequently, do not store them in the container image; otherwise, you would have to rebuild the image frequently.
- Containers already meet isolation requirements. Do not create additional conda environments inside a container.
- Export the conda environment.
- Start the offline container image.
# run on terminal
docker run -ti ${your_image:tag}
- Obtain pytorch.tar.gz.
# run on container
# Create a conda environment named pytorch based on the target base environment.
conda create --name pytorch --clone base
pip install conda-pack
# Pack pytorch env to generate pytorch.tar.gz.
conda pack -n pytorch -o pytorch.tar.gz
- Upload the package to a local path.
# run on terminal
docker cp ${your_container_id}:/xxx/xxx/pytorch.tar.gz .
- Upload pytorch.tar.gz to OBS and set it to public read. During image creation, wget is used to download pytorch.tar.gz, which is then decompressed and deleted.
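A minimal upload sketch using obsutil (assuming obsutil has been installed and configured as described in Installing and Configuring the OBS CLI; ${bucket_name} and ${folder_name} are placeholders):
# run on terminal
# Upload the packed conda environment to OBS.
./obsutil cp pytorch.tar.gz obs://${bucket_name}/${folder_name}/pytorch.tar.gz
# Then set the object to public read (for example, on the OBS console under Object Properties)
# so that wget can download it anonymously during image creation.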
- Create an image.
Use either the official Ubuntu 18.04 image or an NVIDIA image with the CUDA driver as the base image. Both are available on the official Docker Hub website.
To create the image, do the following: install the required apt packages and drivers, configure the ma-user user, import the conda environment, and configure the notebook dependencies.
- Creating images with a Dockerfile is recommended. A Dockerfile keeps the build traceable and archivable and avoids redundant or leftover content in the image.
- To reduce the final image size, delete intermediate files such as TAR packages when building each layer. For details about how to clear the cache, see conda clean.
- Refer to the following example.
Dockerfile example:
FROM nvidia/cuda:11.3.1-cudnn8-devel-ubuntu18.04

USER root

# section1: add user ma-user whose uid is 1000 and user group ma-group whose gid is 100.
# If uid 1000 and gid 100 already exist but are not ma-user:ma-group, the code below removes them first.
RUN default_user=$(getent passwd 1000 | awk -F ':' '{print $1}') || echo "uid: 1000 does not exist" && \
    default_group=$(getent group 100 | awk -F ':' '{print $1}') || echo "gid: 100 does not exist" && \
    if [ ! -z ${default_group} ] && [ ${default_group} != "ma-group" ]; then \
        groupdel -f ${default_group}; \
        groupadd -g 100 ma-group; \
    fi && \
    if [ -z ${default_group} ]; then \
        groupadd -g 100 ma-group; \
    fi && \
    if [ ! -z ${default_user} ] && [ ${default_user} != "ma-user" ]; then \
        userdel -r ${default_user}; \
        useradd -d /home/ma-user -m -u 1000 -g 100 -s /bin/bash ma-user; \
        chmod -R 750 /home/ma-user; \
    fi && \
    if [ -z ${default_user} ]; then \
        useradd -d /home/ma-user -m -u 1000 -g 100 -s /bin/bash ma-user; \
        chmod -R 750 /home/ma-user; \
    fi && \
    # set bash as default
    rm /bin/sh && ln -s /bin/bash /bin/sh

# section2: config apt source and install needed tools.
RUN sed -i "s@http://.*archive.ubuntu.com@http://repo.huaweicloud.com@g" /etc/apt/sources.list && \
    sed -i "s@http://.*security.ubuntu.com@http://repo.huaweicloud.com@g" /etc/apt/sources.list && \
    apt-get update && \
    apt-get install -y ca-certificates curl ffmpeg git libgl1-mesa-glx libglib2.0-0 libibverbs-dev libjpeg-dev libpng-dev libsm6 libxext6 libxrender-dev ninja-build screen sudo vim wget zip && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

USER ma-user

# section3: install miniconda and rebuild the conda env
RUN mkdir -p /home/ma-user/work/ && cd /home/ma-user/work/ && \
    wget https://repo.anaconda.com/miniconda/Miniconda3-py37_4.12.0-Linux-x86_64.sh && \
    chmod 777 Miniconda3-py37_4.12.0-Linux-x86_64.sh && \
    bash Miniconda3-py37_4.12.0-Linux-x86_64.sh -bfp /home/ma-user/anaconda3 && \
    wget https://${bucket_name}.obs.cn-north-4.myhuaweicloud.com/${folder_name}/pytorch.tar.gz && \
    mkdir -p /home/ma-user/anaconda3/envs/pytorch && \
    tar -xzf pytorch.tar.gz -C /home/ma-user/anaconda3/envs/pytorch && \
    source /home/ma-user/anaconda3/envs/pytorch/bin/activate && conda-unpack && \
    /home/ma-user/anaconda3/bin/conda init bash && \
    rm -rf /home/ma-user/work/*

ENV PATH=/home/ma-user/anaconda3/envs/pytorch/bin:$PATH

# section4: settings of Jupyter Notebook for the pytorch env
RUN source /home/ma-user/anaconda3/envs/pytorch/bin/activate && \
    pip install ipykernel==6.7.0 --trusted-host https://repo.huaweicloud.com -i https://repo.huaweicloud.com/repository/pypi/simple && \
    ipython kernel install --user --env PATH /home/ma-user/anaconda3/envs/pytorch/bin:$PATH --name=pytorch && \
    rm -rf /home/ma-user/.local/share/jupyter/kernels/pytorch/logo-* && \
    rm -rf ~/.cache/pip/* && \
    echo 'export PATH=$PATH:/home/ma-user/.local/bin' >> /home/ma-user/.bashrc && \
    echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/nvidia/lib64' >> /home/ma-user/.bashrc && \
    echo 'conda activate pytorch' >> /home/ma-user/.bashrc

ENV DEFAULT_CONDA_ENV_NAME=pytorch
Replace https://${bucket_name}.obs.cn-north-4.myhuaweicloud.com/${folder_name}/pytorch.tar.gz in the Dockerfile with the OBS path of the pytorch.tar.gz file uploaded earlier (the file must be set to public read).
Go to the Dockerfile directory and run the following commands to create an image:
# Use the cd command to navigate to the directory that contains the Dockerfile and run the build command.
# docker build -t ${image_name}:${image_version} .
# For example:
docker build -t pytorch-1.13-cuda11.3-cudnn8-ubuntu18.04:v1 .
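After the build completes, you can check the image size against the 15 GB limit mentioned above, for example:
# List local images and check the SIZE column of the newly built image.
docker images | grep pytorch-1.13-cuda11.3-cudnn8-ubuntu18.04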
- Debug an image.
It is recommended that you use a Dockerfile to fold the changes made during debugging back into the official image build process and then test the rebuilt image again.
- Ensure that the corresponding script, code, and process are running properly on the Linux server.
If it fails to run, debug it in the container first and then build the container image.
- Ensure that the image file is located correctly and that you have the required file permission.
Before training, verify that the custom dependency packages work, that the required packages appear in pip list, and that the container uses the expected Python interpreter. (If multiple Python installations exist in the container image, set the Python path environment variable explicitly.)
- Test the training boot script.
- Data copy and verification
Generally, the image does not contain training data and code. You need to copy the required files to the image after starting the image. To avoid running out of disk space, store data, code, and intermediate data in the /cache directory. It is recommended that the Linux server have sufficient memory (more than 8 GB) and hard disk (more than 100 GB).
The following command enables file interaction between Docker and Linux:
docker cp data/ 39c9ceedb1f6:/cache/
Once you have prepared the data, run the training script and verify that the training starts correctly. Generally, the boot script is as follows:
cd /cache/code/
python start_train.py
To troubleshoot the training process, you can access the logs and errors in the container instance and adjust the code and environment variables accordingly.
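If the training does not start as expected, you can first confirm the interpreter and dependencies inside the running container, as mentioned above. A minimal check, assuming the container ID 39c9ceedb1f6 used in the examples in this section:
# Confirm the Python interpreter and key packages used inside the container.
docker exec -ti 39c9ceedb1f6 bash -c "which python && python --version && pip list | grep -i torch"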
- Preset script testing
The run.sh script is typically used to copy data and code from OBS to containers, and to copy output results from containers to OBS. For details about how to build run.sh, see Running a Training Job on ModelArts Standard.
You can edit and iterate the script in the container instance if the preset script does not produce the desired result.
- Dedicated pool scenario
Mounting SFS in dedicated pools allows you to import code and data without worrying about OBS operations.
You can either mount the SFS directory to the /mnt/sfs_turbo directory of the debugging node, or sync the directory content with the SFS disk.
To start a container instance during debugging, use the -v parameter to mount a directory from the host machine to the container environment.
docker run -ti -d -v /mnt/sfs_turbo:/sfs my_deeplearning_image:v1
The command above mounts the /mnt/sfs_turbo directory of the host machine to the /sfs directory of the container. Any changes in the corresponding directories of the host machine and container are synchronized in real time.
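To confirm that the mount is working, a quick check (replace ${your_container_id} with the ID of the container started above):
# Create a test file on the host and verify that it is visible inside the container.
touch /mnt/sfs_turbo/mount_test.txt
docker exec ${your_container_id} ls /sfs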
- To locate faults, check the logs for the training image, and check the API response for the inference image.
Run the following command to view all stdout logs output by the container:
docker logs -f 39c9ceedb1f6
For an inference image, some logs are written inside the container, so you need to access the container to view them. Check the logs for errors both when the container starts and when the API is called.
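For example, to open a shell in the container and follow a log file (the log path below is a placeholder; use the path your service actually writes to):
# Open an interactive shell inside the running container.
docker exec -ti 39c9ceedb1f6 bash
# Follow the service log inside the container (placeholder path).
tail -f /path/to/service.log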
- If the owner or user group of some files is inconsistent, run the following command as the root user on the host machine to correct it.
docker exec -u root:root 39c9ceedb1f6 bash -c "chown -R ma-user:ma-user /cache"
- You can fix errors found during debugging directly in the container instance and then run the commit command to save the changes.
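A minimal sketch of saving the modified container as a new image (the container ID and image name follow the examples in this section; v2 is an assumed new tag):
# Save the current state of the debugged container as a new image version.
docker commit 39c9ceedb1f6 my_deeplearning_image:v2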
Uploading an Image
Uploading an image through the client means running Docker commands on the machine where the container engine client is installed to push the image to an SWR image repository.
If your container engine client is an ECS or CCE node, you can push an image over two types of networks.
- If your client and the image repository are in the same region, you can push an image over private networks.
- If your client and the image repository are in different regions, you can push an image over public networks and the client needs to be bound to an EIP.

- Each image layer uploaded through the client cannot be larger than 10 GB.
- Your container engine client version must be 1.11.2 or later.
- Access SWR.
- Log in to the SWR console.
- Click Create Organization in the upper right corner and enter a custom organization name. In the subsequent commands, replace ${organization_name} with the actual organization name.
- In the navigation pane on the left, choose Dashboard and click Generate Login Command in the upper right corner. On the displayed page, click the copy icon to copy the login command.
- The validity period of the generated login command is 24 hours. To obtain a long-term valid login command, see Obtaining a Login Command with Long-Term Validity. After you obtain a long-term valid login command, your temporary login commands will still be valid as long as they are in their validity periods.
- The domain name at the end of the login command is the image repository address. Record the address for later use.
- Run the login command on the machine where the container engine is installed.
The message Login Succeeded will be displayed upon a successful login.
- Run the following command on the device where the container engine is installed to tag the image:
docker tag [Image name 1:tag 1] [Image repository address]/[Organization name]/[Image name 2:tag 2]
- [Image name 1:tag 1]: Replace it with the actual name and tag of the image to be uploaded.
- [Image repository address]: You can query the address on the SWR console, that is, the domain name at the end of the login command in 1.c.
- [Organization name]: Replace it with the name of the organization created.
- [Image name 2:tag 2]: Replace it with the desired image name and tag.
Example:
docker tag ${image_name}:${image_version} swr.cn-north-4.myhuaweicloud.com/${organization_name}/${image_name}:${image_version}
- Run the following command to push the image to the image repository:
docker push [Image repository address]/[Organization name]/[Image name 2:tag 2]
Example:
docker push swr.cn-north-4.myhuaweicloud.com/${organization_name}/${image_name}:${image_version}
To view the pushed image, go to the SWR console and refresh the My Images page.
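To verify the upload from another machine that has logged in to SWR, you can pull the image back using the same address used for tagging:
docker pull swr.cn-north-4.myhuaweicloud.com/${organization_name}/${image_name}:${image_version}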
Uploading Data to OBS
- A common OBS bucket has been created. For details, see Creating a Bucket.
- obsutil has been installed. For details, see Installing and Configuring the OBS CLI.
- For details about the data transmission principle between OBS and the training container, see Running a Training Job on ModelArts Standard.
- Visit the official ImageNet website and download the ImageNet-21K dataset from http://image-net.org/.
- After converting the format, download the annotation files: ILSVRC2021winner21k_whole_map_train.txt and ILSVRC2021winner21k_whole_map_val.txt.
- Upload the downloaded files to the imagenet21k_whole folder in the OBS bucket. For details about how to upload files to an OBS bucket, see Uploading Data and Algorithms to OBS.
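A minimal upload sketch with obsutil (assuming obsutil has been configured as described in Installing and Configuring the OBS CLI; ${bucket_name} is a placeholder, and the destination prefix should be adjusted so that the files land in the imagenet21k_whole folder):
# Recursively upload the dataset folder, including the annotation files, to the OBS bucket.
./obsutil cp -r -f ./imagenet21k_whole obs://${bucket_name}/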
Uploading the Algorithm to SFS
- Download the Swin-Transformer code.
git clone --recursive https://github.com/microsoft/Swin-Transformer.git
- Comment out t_mul=1 in line 27 of the lr_scheduler.py file.
- Comment out print("ERROR IMG LOADED: ", path) in line 28 of the imagenet22k_dataset.py file in the data folder.
- Change prefix = 'ILSVRC2011fall_whole' to prefix = 'ILSVRC2021winner21k_whole' in line 112 of the build.py file in the data folder.
- Create the requirements.txt file in the Swin-Transformer directory to specify the Python dependencies.
# The content of the requirements.txt file is as follows:
timm==0.4.12
termcolor==1.1.0
yacs==0.1.8
- Prepare the OBS paths required in the run.sh file.
- Prepare the URL for sharing the ImageNet dataset.
Select the imagenet21k_whole dataset folder to be shared, click the share button, select the validity period of the share URL, enter the access code, for example, 123456, click Copy Link, and record the URL.
- Prepare the obsutil_linux_amd64.tar.gz share URL.
Download obsutil_linux_amd64.tar.gz by referring to Downloading and Installing obsutil, upload it to the OBS bucket, and set it to public read. Click Object Properties and copy the URL.
Sample link:
https://${bucket_name}.obs.cn-north-4.myhuaweicloud.com/${folder_name}/obsutil_linux_amd64.tar.gz
- In the Swin-Transformer directory, create the run.sh boot script.
- Replace SRC_DATA_PATH=${URL for sharing the ImageNet dataset in OBS} in the script with the share URL of the imagenet21k_whole folder in the previous step.
- Replace https://${bucket_name}.obs.cn-north-4.myhuaweicloud.com/${folder_name}/obsutil_linux_amd64.tar.gz in the script with the OBS path of obsutil_linux_amd64.tar.gz in the previous step (the file must be set to public read).
Boot script for single-node single-PU training:
# Create a run.sh script in the home directory of the code. The script content is as follows:
#!/bin/bash

# Download data from OBS to the local SSD.
DIS_DATA_PATH=/cache
SRC_DATA_PATH=${URL for sharing the ImageNet dataset in OBS}
OBSUTIL_PATH=https://${bucket_name}.obs.cn-north-4.myhuaweicloud.com/${folder_name}/obsutil_linux_amd64.tar.gz

mkdir -p $DIS_DATA_PATH && cd $DIS_DATA_PATH && wget $OBSUTIL_PATH && tar -xzvf obsutil_linux_amd64.tar.gz && $DIS_DATA_PATH/obsutil_linux_amd64*/obsutil share-cp $SRC_DATA_PATH $DIS_DATA_PATH/ -ac=123456 -r -f -j 256 && cd -

IMAGE_DATA_PATH=$DIS_DATA_PATH/imagenet21k_whole
# Path for storing the model weights and training configurations during model training
OUTPUT_PATH=/cache/output

MASTER_PORT="6061"
/home/ma-user/anaconda3/envs/pytorch/bin/python -m torch.distributed.launch --nproc_per_node=1 --master_addr localhost --master_port=$MASTER_PORT main.py --data-path $IMAGE_DATA_PATH --output $OUTPUT_PATH --cfg ./configs/swin/swin_base_patch4_window7_224_22k.yaml --local_rank 0
Boot script for multi-node multi-PU training:
# Create a run.sh script.
#!/bin/bash

# Download data from OBS to the local SSD.
DIS_DATA_PATH=/cache
SRC_DATA_PATH=${URL for sharing the ImageNet dataset in OBS}
OBSUTIL_PATH=https://${bucket_name}.obs.cn-north-4.myhuaweicloud.com/${folder_name}/obsutil_linux_amd64.tar.gz

mkdir -p $DIS_DATA_PATH && cd $DIS_DATA_PATH && wget $OBSUTIL_PATH && tar -xzvf obsutil_linux_amd64.tar.gz && $DIS_DATA_PATH/obsutil_linux_amd64*/obsutil share-cp $SRC_DATA_PATH $DIS_DATA_PATH/ -ac=123456 -r -f -j 256 && cd -

IMAGE_DATA_PATH=$DIS_DATA_PATH/imagenet21k_whole
# Path for storing the model weights and training configurations during model training
OUTPUT_PATH=/cache/output

MASTER_ADDR=$(echo ${VC_WORKER_HOSTS} | cut -d "," -f 1)
MASTER_PORT="6060"
NNODES="$VC_WORKER_NUM"
NODE_RANK="$VC_TASK_INDEX"
NGPUS_PER_NODE="$MA_NUM_GPUS"
/home/ma-user/anaconda3/envs/pytorch/bin/python -m torch.distributed.launch --nnodes=$NNODES --node_rank=$NODE_RANK --nproc_per_node=$NGPUS_PER_NODE --master_addr $MASTER_ADDR --master_port=$MASTER_PORT main.py --data-path $IMAGE_DATA_PATH --output=$OUTPUT_PATH --cfg ./configs/swin/swin_base_patch4_window7_224_22k.yaml
- You are advised to verify the single-node single-PU boot script first. After it runs properly, switch to the multi-node multi-PU boot script.
- VC_WORKER_HOSTS, VC_WORKER_NUM, VC_TASK_INDEX, and MA_NUM_GPUS in run.sh for multi-node multi-PU training are environment variables preset in the ModelArts training container. For details about environment variables of a training container, see Viewing Environment Variables of a Training Container.
- OUTPUT_PATH in run.sh is the path for storing intermediate results such as model weights and training configurations during training. If config.TRAIN.AUTO_RESUME in the training script is set to True (default value), the latest model weights in the OUTPUT_PATH directory will be automatically loaded during training.
- Upload the code folder to OBS using obsutil, and then transfer it from OBS to the target directory in SFS.
- In SFS, set the owner of the Swin-Transformer code folder to ma-user.
chown -R ma-user:ma-group Swin-Transformer
- Run the following command to remove \r from the shell script:
cd Swin-Transformer
sed -i 's/\r//' run.sh
Shell scripts written in Windows have \r\n as the line ending, but Linux uses \n as the line ending. This means that Linux treats \r as part of the script and shows the error message "$'\r': command not found" when running it. To fix this, you need to remove \r from the shell script.
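To confirm that the carriage returns are gone, you can inspect the script; cat -A prints a carriage return as ^M at the end of a line:
# If the output still shows ^M at line ends, the \r characters have not been removed yet.
cat -A run.sh | head -n 5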
Debugging Code with Notebook
A notebook instance includes a /cache directory limited to 500 GB. Exceeding this limit causes the instance to restart. Since the ImageNet dataset surpasses 500 GB, consider using offline resources or a smaller portion for debugging within a notebook instance. (For details about debugging in a notebook instance, see Debugging Code with Notebook.)
Creating a Multi-Node Multi-PU Training Job
- Log in to the ModelArts console and check whether access authorization has been configured for your account. For details, see Configuring Agency Authorization for ModelArts with One Click. If you have been authorized using access keys, clear the authorization and configure agency authorization.
- In the navigation pane on the left, choose Model Training > Training Jobs. The training job list is displayed by default.
- On the Create Training Job page, configure the following parameters:
- Algorithm Type: Custom algorithm
- Boot Mode: Custom image
- Image: Select the uploaded custom image.
- Boot Command:
cd /home/ma-user/work/code/Swin-Transformer && /home/ma-user/anaconda3/envs/pytorch/bin/pip install -r requirements.txt && /bin/sh run.sh
- Auto Restart: Enable this function and set Restarts. In this way, when a node is faulty, the platform automatically restarts the job and isolates the faulty node.
If this function is enabled, you need to configure the output path in the run.sh script. In this way, when a fault occurs, the training can be continued based on the model weight saved in the previous round, minimizing resource waste.
- Resource Pool: On the Dedicated Resource Pool tab, select a dedicated resource pool with GPU specifications.
- Specifications: Select the required GPU specifications.
- Compute Nodes: Select the number of required nodes.
- SFS Turbo: Add the mount configuration and select the SFS name. The cloud mount path is /home/ma-user/work.
To ensure that the code path and boot command are the same as those for notebook debugging, set the cloud mount path to /home/ma-user/work.
- Click Submit. On the information confirmation page, check the parameters, and click OK.
- Wait until the training job is created.
After you submit the job creation request, the system will automatically perform operations on the backend, such as downloading the container image and code directory and running the boot command. A training job requires a certain period of time for running. The duration ranges from dozens of minutes to several hours, depending on the service logic and selected resources.