Starting a Preset Image's Boot File

ModelArts Standard offers multiple AI images for model training, which can be adapted by modifying their boot commands.

This section describes how to modify the boot file when creating a training job using different preset images.

Ascend-Powered-Engine Boot Principles

Ascend-Powered-Engine is a unique engine that combines multiple AI frameworks, runtime environments, and boot modes tailored to Ascend accelerator chips.

Major Snt9 Ascend accelerators run on Arm-backed CPU servers, which means the upper-layer Docker images are Arm images. The NVIDIA CUDA (unified computing architecture) compute library is installed in images for GPU scenarios. The Huawei CANN (heterogeneous computing architecture) compute library is installed in images powered by the Ascend-Powered-Engine, adapting to the Ascend driver.

After a training job is submitted, ModelArts Standard automatically runs the boot file.

The default boot mode of the boot file of the Ascend-Powered-Engine framework is as follows:

The number of times the boot file runs for each training job depends on the number of cards used. When a job is running, the boot file is executed once for each card. For example, in a single-node job with one card, the boot file runs once. In a single-node job with eight cards, it runs eight times. So, do not listen on ports in the boot file.

The following environment variables are automatically configured in the boot file:

RANK_TABLE_FILE: path of the rank table file.
ASCEND_DEVICE_ID: logical device ID. For example, for single-card training, the value is always 0.
RANK_ID: logical (sequential) number of a device in a training job.
RANK_SIZE: Set this parameter based on the number of devices in the rank table file. For example, the value is 4 for 4 snt9b devices.

To ensure the boot file runs only once, check the ASCEND_DEVICE_ID value. If it is 0, execute the logic; otherwise, exit directly.

For details about the example code file mindspore-verification.py of the Ascend-Powered-Engine framework, see Training the mindspore-verification.py File.

The command for starting Ascend-Powered-Engine in standalone mode is the same as that in distributed mode.

The Ascend-Powered-Engine framework offers multiple boot modes. By default, the boot file is executed based on RANK_TABLE_FILE. You can also configure the MA_RUN_METHOD environment variable to run the boot file using alternative methods. The MA_RUN_METHOD environment variable can be set to torchrun and msrun.

If MA_RUN_METHOD is set torchrun, ModelArts Standard uses the torchrun command to run the boot file.

The PyTorch version must be 1.11.0 or later.
- For single-node jobs, ModelArts Standard uses these commands to start the boot file:
```
torchrun --standalone --nnodes=${MA_NUM_HOSTS} --nproc_per_node=${MA_NUM_GPUS} ${MA_EXTRA_TORCHRUN_PARAMS} "Boot file" {arg1} {arg2} ...
```
- For multi-node jobs, ModelArts Standard uses these commands to start the boot file:
```
torchrun --nnodes=${MA_NUM_HOSTS} --nproc_per_node=${MA_NUM_GPUS} --node_rank=${VC_TASK_INDEX} --master_addr={master_addr} --master_port=${MA_TORCHRUN_MASTER_PORT} --rdzv_id={ma_job_name} --rdzv_backend=static ${MA_EXTRA_TORCHRUN_PARAMS} "Boot file" {arg1} {arg2} ...
```
Parameters:
- standalone: identifier of a single-node job.
- nnodes: number of task nodes.
- nproc_per_node: number of main processes started by each task node. Set this parameter to the number of NPUs allocated to the task.
- node_rank: task rank, which is used for multi-task distributed training.
- master_addr: address of the main task (rank 0). Set it to the communication domain name of worker-0.
- master_port: port used for communication during distributed training on the main task (rank 0). The default value is 18888. When a master_port conflict occurs, you can modify the port configuration by configuring the MA_TORCHRUN_MASTER_PORT environment variable.
- rdzv_id: Rendezvous ID. Set it to a value with the training job ID.
- rdzv_backend: Rendezvous backend, which is fixed at static. That is, master_addr and master_port are used instead of Rendezvous. Additionally, you can configure the MA_EXTRA_TORCHRUN_PARAMS environment variable to add additional torchrun command parameters or overwrite the preset torchrun command parameters. The following is an example of configuring the rdzv_conf parameter in the torchrun command:
```
"environments": {
"MA_RUN_METHOD": "torchrun",
"MA_EXTRA_TORCHRUN_PARAMS": "--rdzv_conf=timeout=7200"
}
```
If the RuntimeError: Socket Timeout error occurs during the distributed process group initialization using torchrun, you can add the following environment variables to create a training job again to view initialization details and further locate the fault.
- LOGLEVEL=INFO
- TORCH_CPP_LOG_LEVEL=INFO
- TORCH_DISTRIBUTED_DEBUG=DETAIL
The RuntimeError: Socket Timeout error is caused by a significant time discrepancy between tasks when running the torchrun command. The time discrepancy is usually caused by initialization tasks, like downloading the training data and checkpoint read/write, which happen before the torchrun command is run. If the time taken to complete these initialization tasks varies significantly, a Socket Timeout error may occur. When this error happens, check the time difference between the torchrun execution points for each task. If the time difference is too large, optimize the initialization process before running the torchrun command to ensure a reasonable time gap.
If MA_RUN_METHOD is set msrun, ModelArts Standard uses the msrun command to run the boot file.

The MindSpore version must be 2.3.0 or later.

This solution supports dynamic networking and rank table file-based networking. If you set the environment variable MS_RANKTABLE_ENABLE to True, msrun reads the rank table file for networking. Otherwise, dynamic networking is used by default.

msrun uses these commands to start the boot file:
```
msrun --worker_num=${msrun_worker_num} --local_worker_num=${MA_NUM_GPUS} --master_addr=${msrun_master_addr} --node_rank=${VC_TASK_INDEX} --master_port=${msrun_master_port} --log_dir=${msrun_log_dir} --join=True --cluster_time_out=${MSRUN_CLUSTER_TIME_OUT} --rank_table_file=${msrun_rank_table_file} "Boot file" {arg1} {arg2} ...
```
Parameters:
- worker_num: total number of processes, which is also equivalent to the number of cards involved, as each card initiates a process.
- local_worker_num: number of processes on the current node, which is also the number of cards used by the current node.
- master_addr: IP address of the node where the msrun scheduling process is located. This parameter does not need to be configured for single-node jobs.
- master_port: port number of the msrun scheduling process.
- node_rank: ID of the current node.
- log_dir: log output directory of msrun and all processes.
- join: whether the msrun process still exists after the training process is started. The default value is True, indicating that the msrun process exits after all processes exit.
- cluster_time_out: timeout interval of the cluster networking. The default value is 600s. The value can be controlled by the MSRUN_CLUSTER_TIME_OUT environment variable.
- rank_table_file: address of the rank table file. If the environment variable MS_RANKTABLE_ENABLE is set to True, this parameter is added during startup.

PyTorch-GPU Boot Principles

For single-node multi-card scenarios, the platform adds the --init_method "tcp://<ip>:<port>" parameter to the boot file.

For multi-node multi-card scenarios, the platform adds the --init_method "tcp://<ip>:<port>" --rank <rank_id> --world_size <node_num> parameter to the boot file.

The preceding parameters must be parsed in the boot file.

For details about the code example of the PyTorch-GPU framework, see "Method 1" in Example: Creating a DDP Distributed Training Job (PyTorch + GPU).

TensorFlow-GPU Boot Principles

For a single-node job, ModelArts starts a training container that exclusively uses the resources on the node.

For a multi-node job, ModelArts starts a parameter server and a worker on the same node. It allocates parameter server and worker tasks in a 1:1 ratio. For example, in a two-node job, two parameter servers and two workers are allocated. ModelArts also injects the following parameters into the boot file:

--task_index <VC_TASK_INDEX> --ps_hosts <TF_PS_HOSTS> --worker_hosts <TF_WORKER_HOSTS> --job_name <MA_TASK_NAME>

The following parameters must be parsed in the boot file.

VC_TASK_INDEX: task serial number, for example, 0, 1, or 2.
TF_PS_HOSTS: addresses of parameter server nodes, for example, [xx-ps-0.xx:TCP_PORT,xx-ps-1.xx:TCP_PORT]. The value of TCP_PORT is a random port ranging from 5,000 to 10,000.
TF_WORKER_HOSTS: addresses of worker nodes, for example, [xx-worker-0.xx:TCP_PORT,xx-worker-1.xx:TCP_PORT]. The value of TCP_PORT is a random port ranging from 5,000 to 10,000.
MA_TASK_NAME: task name, which can be ps or worker.

For details, see the example code file mnist.py (single-node) of the TensorFlow-GPU framework.

Horovod/MPI/MindSpore-GPU

ModelArts uses mpirun to run boot files for Horovod, MPI, or MindSpore-GPU. To use a preset engine in ModelArts Standard, simply edit the boot file (training script). ModelArts Standard automatically builds the mpirun command and training job cluster. The platform does not add extra parameters to the boot file.

Example of pytorch_synthetic_benchmark.py:

import argparse
import torch.backends.cudnn as cudnn
import torch.nn.functional as F
import torch.optim as optim
import torch.utils.data.distributed
from torchvision import models
import horovod.torch as hvd
import timeit
import numpy as np

# Benchmark settings
parser = argparse.ArgumentParser(description='PyTorch Synthetic Benchmark',
                                 formatter_class=argparse.ArgumentDefaultsHelpFormatter)
parser.add_argument('--fp16-allreduce', action='store_true', default=False,
                    help='use fp16 compression during allreduce')

parser.add_argument('--model', type=str, default='resnet50',
                    help='model to benchmark')
parser.add_argument('--batch-size', type=int, default=32,
                    help='input batch size')

parser.add_argument('--num-warmup-batches', type=int, default=10,
                    help='number of warm-up batches that don\'t count towards benchmark')
parser.add_argument('--num-batches-per-iter', type=int, default=10,
                    help='number of batches per benchmark iteration')
parser.add_argument('--num-iters', type=int, default=10,
                    help='number of benchmark iterations')

parser.add_argument('--no-cuda', action='store_true', default=False,
                    help='disables CUDA training')

parser.add_argument('--use-adasum', action='store_true', default=False,
                    help='use adasum algorithm to do reduction')

args = parser.parse_args()
args.cuda = not args.no_cuda and torch.cuda.is_available()

hvd.init()

if args.cuda:
    # Horovod: pin GPU to local rank.
    torch.cuda.set_device(hvd.local_rank())

cudnn.benchmark = True

# Set up standard model.
model = getattr(models, args.model)()

# By default, Adasum doesn't need scaling up learning rate.
lr_scaler = hvd.size() if not args.use_adasum else 1

if args.cuda:
    # Move model to GPU.
    model.cuda()
    # If using GPU Adasum allreduce, scale learning rate by local_size.
    if args.use_adasum and hvd.nccl_built():
        lr_scaler = hvd.local_size()

optimizer = optim.SGD(model.parameters(), lr=0.01 * lr_scaler)

# Horovod: (optional) compression algorithm.
compression = hvd.Compression.fp16 if args.fp16_allreduce else hvd.Compression.none

# Horovod: wrap optimizer with DistributedOptimizer.
optimizer = hvd.DistributedOptimizer(optimizer,
                                     named_parameters=model.named_parameters(),
                                     compression=compression,
                                     op=hvd.Adasum if args.use_adasum else hvd.Average)

# Horovod: broadcast parameters & optimizer state.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

# Set up fixed fake data
data = torch.randn(args.batch_size, 3, 224, 224)
target = torch.LongTensor(args.batch_size).random_() % 1000
if args.cuda:
    data, target = data.cuda(), target.cuda()


def benchmark_step():
    optimizer.zero_grad()
    output = model(data)
    loss = F.cross_entropy(output, target)
    loss.backward()
    optimizer.step()


def log(s, nl=True):
    if hvd.rank() != 0:
        return
    print(s, end='\n' if nl else '')


log('Model: %s' % args.model)
log('Batch size: %d' % args.batch_size)
device = 'GPU' if args.cuda else 'CPU'
log('Number of %ss: %d' % (device, hvd.size()))

# Warm-up
log('Running warmup...')
timeit.timeit(benchmark_step, number=args.num_warmup_batches)

# Benchmark
log('Running benchmark...')
img_secs = []
for x in range(args.num_iters):
    time = timeit.timeit(benchmark_step, number=args.num_batches_per_iter)
    img_sec = args.batch_size * args.num_batches_per_iter / time
    log('Iter #%d: %.1f img/sec per %s' % (x, img_sec, device))
    img_secs.append(img_sec)

# Results
img_sec_mean = np.mean(img_secs)
img_sec_conf = 1.96 * np.std(img_secs)
log('Img/sec per %s: %.1f +-%.1f' % (device, img_sec_mean, img_sec_conf))
log('Total img/sec on %d %s(s): %.1f +-%.1f' %
    (hvd.size(), device, hvd.size() * img_sec_mean, hvd.size() * img_sec_conf))

run_mpi.sh is as follows:

#!/bin/bash
MY_HOME=/home/ma-user

MY_SSHD_PORT=${MY_SSHD_PORT:-"36666"}

MY_MPI_BTL_TCP_IF=${MY_MPI_BTL_TCP_IF:-"eth0,bond0"}

MY_TASK_INDEX=${MA_TASK_INDEX:-${VC_TASK_INDEX:-${VK_TASK_INDEX}}}

MY_MPI_SLOTS=${MY_MPI_SLOTS:-"${MA_NUM_GPUS}"}

MY_MPI_TUNE_FILE="${MY_HOME}/env_for_user_process"

if [ -z ${MY_MPI_SLOTS} ]; then
    echo "[run_mpi] MY_MPI_SLOTS is empty, set it be 1"
    MY_MPI_SLOTS="1"
fi

printf "MY_HOME: ${MY_HOME}\nMY_SSHD_PORT: ${MY_SSHD_PORT}\nMY_MPI_BTL_TCP_IF: ${MY_MPI_BTL_TCP_IF}\nMY_TASK_INDEX: ${MY_TASK_INDEX}\nMY_MPI_SLOTS: ${MY_MPI_SLOTS}\n"

env | grep -E '^MA_|SHARED_|^S3_|^PATH|^VC_WORKER_|^SCC|^CRED' | grep -v '=$' > ${MY_MPI_TUNE_FILE}
# add -x to each line
sed -i 's/^/-x /' ${MY_MPI_TUNE_FILE}

sed -i "s|{{MY_SSHD_PORT}}|${MY_SSHD_PORT}|g" ${MY_HOME}/etc/ssh/sshd_config

# start sshd service
bash -c "$(which sshd) -f ${MY_HOME}/etc/ssh/sshd_config"

# confirm the sshd is up
netstat -anp | grep LIS | grep ${MY_SSHD_PORT}

if [ $MY_TASK_INDEX -eq 0 ]; then
    # generate the hostfile of mpi
    for ((i=0; i<$MA_NUM_HOSTS; i++))
    do
        eval hostname=${MA_VJ_NAME}-${MA_TASK_NAME}-${i}.${MA_VJ_NAME}
        echo "[run_mpi] hostname: ${hostname}"

        ip=""
        while [ -z "$ip" ]; do
            ip=$(ping -c 1 ${hostname} | grep "PING" | sed -E 's/PING .* .([0-9.]+). .*/\1/g')
            sleep 1
        done
        echo "[run_mpi] resolved ip: ${ip}"

        # test the sshd is up
        while :
        do
            if [ cat < /dev/null >/dev/tcp/${ip}/${MY_SSHD_PORT} ]; then
                break
            fi
            sleep 1
        done

        echo "[run_mpi] the sshd of ip ${ip} is up"

        echo "${ip} slots=$MY_MPI_SLOTS" >> ${MY_HOME}/hostfile
    done

    printf "[run_mpi] hostfile:\n`cat ${MY_HOME}/hostfile`\n"
fi

RET_CODE=0

if [ $MY_TASK_INDEX -eq 0 ]; then

    echo "[run_mpi] start exec command time: "$(date +"%Y-%m-%d-%H:%M:%S")

    np=$(( ${MA_NUM_HOSTS} * ${MY_MPI_SLOTS} ))

    echo "[run_mpi] command: mpirun -np ${np} -hostfile ${MY_HOME}/hostfile -mca plm_rsh_args \"-p ${MY_SSHD_PORT}\" -tune ${MY_MPI_TUNE_FILE} ... $@"

    # execute mpirun at worker-0
    # mpirun
    mpirun \
        -np ${np} \
        -hostfile ${MY_HOME}/hostfile \
        -mca plm_rsh_args "-p ${MY_SSHD_PORT}" \
        -tune ${MY_MPI_TUNE_FILE} \
        -bind-to none -map-by slot \
        -x NCCL_DEBUG=INFO -x NCCL_SOCKET_IFNAME=${MY_MPI_BTL_TCP_IF} -x NCCL_SOCKET_FAMILY=AF_INET \
        -x HOROVOD_MPI_THREADS_DISABLE=1 \
        -x LD_LIBRARY_PATH \
        -mca pml ob1 -mca btl ^openib -mca plm_rsh_no_tree_spawn true \
        "$@"

    RET_CODE=$?

    if [ $RET_CODE -ne 0 ]; then
        echo "[run_mpi] exec command failed, exited with $RET_CODE"
    else
        echo "[run_mpi] exec command successfully, exited with $RET_CODE"
    fi

    # stop 1...N worker by killing the sleep proc
    sed -i '1d' ${MY_HOME}/hostfile
    if [ `cat ${MY_HOME}/hostfile | wc -l` -ne 0 ]; then
        echo "[run_mpi] stop 1 to (N - 1) worker by killing the sleep proc"

        sed -i 's/${MY_MPI_SLOTS}/1/g' ${MY_HOME}/hostfile
        printf "[run_mpi] hostfile:\n`cat ${MY_HOME}/hostfile`\n"

        mpirun \
        --hostfile ${MY_HOME}/hostfile \
        --mca btl_tcp_if_include ${MY_MPI_BTL_TCP_IF} \
        --mca plm_rsh_args "-p ${MY_SSHD_PORT}" \
        -x PATH -x LD_LIBRARY_PATH \
        pkill sleep \
        > /dev/null 2>&1
    fi

    echo "[run_mpi] exit time: "$(date +"%Y-%m-%d-%H:%M:%S")
else
    echo "[run_mpi] the training log is in worker-0"
    sleep 365d
    echo "[run_mpi] exit time: "$(date +"%Y-%m-%d-%H:%M:%S")
fi

exit $RET_CODE

Parent topic: Preparing Model Training Code

Previous topic: Preparing Model Training Code

Next topic: Developing Code for Training Using a Preset Image