Example of Starting PyTorch DDP Training Based on a Training Job
This topic describes three methods of using a training job to start PyTorch DDP training and provides their sample code.
- Use a PyTorch preset image and run the mp.spawn command.
- Use a custom image and run the torch.distributed.launch command.
- Use a custom image and run the torch.distributed.run command.
Creating a Training Job
- Use a PyTorch preset image and run the mp.spawn command.
Set Boot Mode to Preset image, select a PyTorch preset image, set Code Directory to the OBS path where the code folder is stored, and select main.py as the boot file.
Figure 1 Creating a training job
If you select a single-node resource flavor (multiple GPUs on one compute node), specify the world_size and rank hyperparameters when creating the training job. If you select a multi-node resource flavor (more than one compute node in a training job), you do not need to set these hyperparameters; ModelArts injects world_size and rank automatically.
Figure 2 Hyperparameters
- Use a custom image and run the torch.distributed.launch command.
Set Boot Mode to Custom image, select a custom image, and set Code Directory to the OBS path where the code folder is stored. The boot command is as follows:
bash ${MA_JOB_DIR}/code/torchlaunch.sh
Figure 3 Creating a training job
- Use a custom image and run the torch.distributed.run command.
Set Boot Mode to Custom image, select a custom image, and set Code Directory to the OBS path where the code folder is stored. The boot command is as follows:
bash ${MA_JOB_DIR}/code/torchrun.sh
Figure 4 Creating a training job
Code Examples
Upload the following files to an OBS bucket:
code                # Root directory of the code
├─torch_ddp.py      # Code file for PyTorch DDP training
├─main.py           # Boot file for starting training using the PyTorch preset image and the mp.spawn command
├─torchlaunch.sh    # Boot file for starting training using the custom image and the torch.distributed.launch command
└─torchrun.sh       # Boot file for starting training using the custom image and the torch.distributed.run command
torch_ddp.py
import os

import torch
import torch.distributed as dist
import torch.nn as nn
import torch.optim as optim
from torch.nn.parallel import DistributedDataParallel as DDP


# Start training by running mp.spawn.
def init_from_arg(local_rank, base_rank, world_size, init_method):
    rank = base_rank + local_rank
    dist.init_process_group("nccl", rank=rank, init_method=init_method, world_size=world_size)
    ddp_train(local_rank)


# Start training by running torch.distributed.launch or torch.distributed.run.
def init_from_env():
    dist.init_process_group(backend='nccl', init_method='env://')
    local_rank = int(os.environ["LOCAL_RANK"])
    ddp_train(local_rank)


def cleanup():
    dist.destroy_process_group()


class ToyModel(nn.Module):
    def __init__(self):
        super(ToyModel, self).__init__()
        self.net1 = nn.Linear(10, 10)
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(10, 5)

    def forward(self, x):
        return self.net2(self.relu(self.net1(x)))


def ddp_train(device_id):
    # Create the model and move it to the GPU with the given device ID.
    model = ToyModel().to(device_id)
    ddp_model = DDP(model, device_ids=[device_id])

    loss_fn = nn.MSELoss()
    optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)

    optimizer.zero_grad()
    outputs = ddp_model(torch.randn(20, 10))
    labels = torch.randn(20, 5).to(device_id)
    loss_fn(outputs, labels).backward()
    optimizer.step()

    cleanup()


if __name__ == "__main__":
    init_from_env()
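To sanity-check torch_ddp.py on a single node outside ModelArts, it can be started with torchrun, which sets the LOCAL_RANK environment variable that init_from_env() reads. This is a minimal sketch assuming one node with at least one NVIDIA GPU, NCCL, and PyTorch 1.10 or later; adjust --nproc_per_node to the local GPU count:
# Single-node smoke test; torchrun exports LOCAL_RANK, RANK, and WORLD_SIZE for init_from_env().
torchrun --nnodes=1 --node_rank=0 --nproc_per_node=1 torch_ddp.py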
main.py
import argparse

import torch
import torch.multiprocessing as mp

parser = argparse.ArgumentParser(description='ddp demo args')
parser.add_argument('--world_size', type=int, required=True)
parser.add_argument('--rank', type=int, required=True)
parser.add_argument('--init_method', type=str, required=True)
args, unknown = parser.parse_known_args()

if __name__ == "__main__":
    n_gpus = torch.cuda.device_count()
    world_size = n_gpus * args.world_size  # Total number of processes: GPUs per node x number of nodes.
    base_rank = n_gpus * args.rank         # Global rank of the first process on this node.
    # Call the start function in the DDP sample code.
    from torch_ddp import init_from_arg
    mp.spawn(init_from_arg,
             args=(base_rank, world_size, args.init_method),
             nprocs=n_gpus,
             join=True)
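Note that the world_size and rank arguments passed to main.py count nodes, not processes: the script multiplies both by the local GPU count before calling mp.spawn, so each spawned process receives a unique global rank. For illustration only, a manual two-node launch might look as follows; the IP address and port are placeholders, and on ModelArts these hyperparameters are filled in as described above:
# Node 0 (assumed rendezvous host; endpoint values are illustrative).
python main.py --world_size=2 --rank=0 --init_method=tcp://192.168.0.10:6060
# Node 1
python main.py --world_size=2 --rank=1 --init_method=tcp://192.168.0.10:6060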
torchlaunch.sh
#!/bin/bash
# Default system environment variables. Do not modify them.
MASTER_HOST="$VC_WORKER_HOSTS"
MASTER_ADDR="${VC_WORKER_HOSTS%%,*}"
MASTER_PORT="6060"
JOB_ID="1234"
NNODES="$MA_NUM_HOSTS"
NODE_RANK="$VC_TASK_INDEX"
NGPUS_PER_NODE="$MA_NUM_GPUS"

# Custom environment variables to specify the Python script and parameters.
PYTHON_SCRIPT=${MA_JOB_DIR}/code/torch_ddp.py
PYTHON_ARGS=""

CMD="python -m torch.distributed.launch \
    --nnodes=$NNODES \
    --node_rank=$NODE_RANK \
    --nproc_per_node=$NGPUS_PER_NODE \
    --master_addr $MASTER_ADDR \
    --master_port=$MASTER_PORT \
    --use_env \
    $PYTHON_SCRIPT \
    $PYTHON_ARGS
    "
echo $CMD
$CMD
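Because the script passes --use_env, torch.distributed.launch exports the local rank through the LOCAL_RANK environment variable instead of appending a --local_rank argument, which is what init_from_env() in torch_ddp.py expects. As a rough illustration, on a hypothetical two-node job with 8 GPUs per node, the command echoed on the first node would expand to something like the following (address and values illustrative):
python -m torch.distributed.launch --nnodes=2 --node_rank=0 --nproc_per_node=8 \
    --master_addr 192.168.0.10 --master_port=6060 --use_env \
    ${MA_JOB_DIR}/code/torch_ddp.py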
torchrun.sh
Note: In PyTorch 2.1, you must set rdzv_backend to static: --rdzv_backend=static.
#!/bin/bash
# Default system environment variables. Do not modify them.
MASTER_HOST="$VC_WORKER_HOSTS"
MASTER_ADDR="${VC_WORKER_HOSTS%%,*}"
MASTER_PORT="6060"
JOB_ID="1234"
NNODES="$MA_NUM_HOSTS"
NODE_RANK="$VC_TASK_INDEX"
NGPUS_PER_NODE="$MA_NUM_GPUS"

# Custom environment variables to specify the Python script and parameters.
PYTHON_SCRIPT=${MA_JOB_DIR}/code/torch_ddp.py
PYTHON_ARGS=""

if [[ $NODE_RANK == 0 ]]; then
    EXT_ARGS="--rdzv_conf=is_host=1"
else
    EXT_ARGS=""
fi

CMD="python -m torch.distributed.run \
    --nnodes=$NNODES \
    --node_rank=$NODE_RANK \
    $EXT_ARGS \
    --nproc_per_node=$NGPUS_PER_NODE \
    --rdzv_id=$JOB_ID \
    --rdzv_backend=c10d \
    --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
    $PYTHON_SCRIPT \
    $PYTHON_ARGS
    "
echo $CMD
$CMD
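Applying the PyTorch 2.1 note above amounts to swapping the rendezvous backend in CMD from c10d to static. The sketch below assumes the other variables stay as defined in the script; the c10d-specific is_host setting in EXT_ARGS is dropped, since it does not apply to the static backend:
# PyTorch 2.1 variant of CMD (sketch; other variables as defined above).
CMD="python -m torch.distributed.run \
    --nnodes=$NNODES \
    --node_rank=$NODE_RANK \
    --nproc_per_node=$NGPUS_PER_NODE \
    --rdzv_id=$JOB_ID \
    --rdzv_backend=static \
    --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
    $PYTHON_SCRIPT \
    $PYTHON_ARGS
    "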