Updated on 2025-08-18 GMT+08:00

Starting Training Using a Preset Image's Boot File

ModelArts Standard provides preset images for popular AI engines, each tailored to a specific scenario and feature set. To train models efficiently, modify the boot command to match the requirements of the image you choose.

ModelArts provides four preset images for creating training jobs.

Table 1 Preset images

Scenario: NPU
Preset image: Ascend-Powered-Engine
Description: A set of AI images, runtime environments, and boot modes tailored for Huawei Cloud's Ascend AI accelerator chips. There are three boot modes:
  • Method 1: Using a rank table file (RTF) to start a training job
  • Method 2: Using the torchrun command to start a training job
  • Method 3: Using the msrun command to start a training job
Advantages and suggestions: Select this image if you use Huawei Cloud NPUs; it improves model training efficiency.
  • RTF: general scenarios.
  • torchrun: use with PyTorch and the Ascend Extension for PyTorch.
  • msrun: available after MindSpore is installed. This mode works independently of external libraries or configuration files, includes disaster recovery, and maintains high security.

Scenario: GPU
Preset image: PyTorch-GPU
Description: PyTorch-GPU preset image.
Advantages and suggestions: Flexible and easy to use; well suited to research and quick development.

Scenario: GPU
Preset image: TensorFlow-GPU
Description: TensorFlow-GPU preset image.
Advantages and suggestions: Excels at deploying models in production environments and optimizing performance, making it ideal for enterprise applications.

Scenario: GPU
Preset image: Horovod/MPI/MindSpore-GPU
Description: ModelArts uses mpirun to run boot files for Horovod, MPI, or MindSpore-GPU.
Advantages and suggestions: Requires the OpenMPI library. Choose it only if you are familiar with OpenMPI.

This section describes how to modify the boot file when creating a training job using different preset images.

Ascend-Powered-Engine

Ascend-Powered-Engine is an engine unique to ModelArts that combines an AI framework, a runtime environment, and a boot mode tailored to Ascend accelerator chips. It is neither a conventional AI framework such as PyTorch or TensorFlow nor a parallel execution framework such as MPI; it serves a distinct purpose.

Snt9 Ascend accelerators run in Arm CPU environments, so their Docker images are Arm images. Instead of the CUDA (Compute Unified Device Architecture) compute library used in GPU setups, Ascend-Powered-Engine images include Huawei's CANN (heterogeneous computing architecture) compute library, which supports AI tasks and works with the Ascend drivers.

After a training job is submitted, ModelArts Standard automatically runs the boot file. When using the Ascend-Powered-Engine framework, both single-node and distributed jobs start with the same command.

The Ascend-Powered-Engine framework offers three boot modes. By default, the boot file is executed based on the rank table file (RANK_TABLE_FILE). You can also configure the MA_RUN_METHOD environment variable to run the boot file using an alternative method; the supported values are torchrun and msrun.

  • Method 1: Using a rank table file (RTF) to start a training job

    If the environment variable MA_RUN_METHOD is not configured, ModelArts Standard uses a rank table file to start the boot file of the training job by default.

    The number of times the boot file runs for each training job depends on the number of PUs used. When a job is running, the boot file is executed once for each PU. For example, in a single-node job with one PU, the boot file runs once. In a single-node job with eight PUs, it runs eight times. So, do not listen on ports in the boot file.

    The following environment variables are automatically configured in the boot file:

    • RANK_TABLE_FILE: rank table file path.
    • ASCEND_DEVICE_ID: logical device ID. For example, for single-PU training, the value is always 0.
    • RANK_ID: logical (sequential) number of a device in a training job.
    • RANK_SIZE: number of devices specified in the rank table file. For example, the value is 4 when four Snt9b devices are used.

    To ensure the boot file logic runs only once, check the ASCEND_DEVICE_ID value: if it is 0, execute the logic; otherwise, exit directly. A minimal sketch is provided after this list of methods.

    To enable ranktable dynamic routing for training network acceleration, add the environment variable ROUTE_PLAN=true. For details, see Enabling Dynamic Route Acceleration for Training Jobs.

    For details about the example code file mindspore-verification.py of the Ascend-Powered-Engine framework, see Training the mindspore-verification.py File.

  • Method 2: Using the torchrun command to start a training job
    If the environment variable MA_RUN_METHOD is set to torchrun, ModelArts Standard uses the torchrun command to run the boot file.

    The PyTorch version must be 1.11.0 or later.

    • For single-node jobs, ModelArts Standard uses these commands to start the boot file:
      torchrun --standalone --nnodes=${MA_NUM_HOSTS} --nproc_per_node=${MA_NUM_GPUS} ${MA_EXTRA_TORCHRUN_PARAMS} "Boot file" {arg1} {arg2} ...
    • For multi-node jobs, ModelArts Standard uses these commands to start the boot file:
      torchrun --nnodes=${MA_NUM_HOSTS} --nproc_per_node=${MA_NUM_GPUS} --node_rank=${VC_TASK_INDEX} --master_addr={master_addr} --master_port=${MA_TORCHRUN_MASTER_PORT} --rdzv_id={ma_job_name} --rdzv_backend=static ${MA_EXTRA_TORCHRUN_PARAMS} "Boot file" {arg1} {arg2} ...

    Parameters:

    Table 2 Parameters for starting a training job using the torchrun command

    • standalone: Identifier of a single-node job.
    • nnodes: Number of task nodes.
    • nproc_per_node: Number of main processes started by each task node. Set this parameter to the number of NPUs allocated to the task.
    • node_rank: Task rank, which is used for multi-task distributed training.
    • master_addr: Address of the main task (rank 0). Set it to the communication domain name of worker-0.
    • master_port: Port used for communication during distributed training on the main task (rank 0). The default value is 18888. If a master_port conflict occurs, change the port by configuring the MA_TORCHRUN_MASTER_PORT environment variable.
    • rdzv_id: Rendezvous ID. Set it to a value that contains the training job ID.
    • rdzv_backend: Rendezvous backend, which is fixed at static; that is, master_addr and master_port are used instead of the rendezvous service.

    • Additionally, you can configure the MA_EXTRA_TORCHRUN_PARAMS environment variable to add extra torchrun parameters or override the preset ones. The following example sets the rdzv_conf parameter of the torchrun command:

      "environments": {
          "MA_RUN_METHOD": "torchrun",
          "MA_EXTRA_TORCHRUN_PARAMS": "--rdzv_conf=timeout=7200"
      }

      If the RuntimeError: Socket Timeout error occurs during the distributed process group initialization using torchrun, see question 1 in FAQs to further locate the fault.

  • Method 3: Using the msrun command to start a training job
    If the environment variable MA_RUN_METHOD is set to msrun, ModelArts Standard uses the msrun command to run the boot file.

    The MindSpore version must be 2.3.0 or later.

    This solution supports dynamic networking and rank table file-based networking. If you set the environment variable MS_RANKTABLE_ENABLE to True, msrun reads the rank table file for networking. Otherwise, dynamic networking is used by default.

    msrun uses these commands to start the boot file:

    msrun --worker_num=${msrun_worker_num} --local_worker_num=${MA_NUM_GPUS} --master_addr=${msrun_master_addr} --node_rank=${VC_TASK_INDEX} --master_port=${msrun_master_port} --log_dir=${msrun_log_dir} --join=True --cluster_time_out=${MSRUN_CLUSTER_TIME_OUT} --rank_table_file=${msrun_rank_table_file} "Boot file" {arg1} {arg2} ...

    Parameters:

    Table 3 Parameters for starting a training job using the msrun command

    • worker_num: Total number of processes, which equals the number of PUs involved because each PU starts one process.
    • local_worker_num: Number of processes on the current node, which is the number of PUs used by the current node.
    • master_addr: IP address of the node where the msrun scheduling process is located. This parameter does not need to be configured for single-node jobs.
    • master_port: Port number of the msrun scheduling process.
    • node_rank: Rank of the current node, which is used for multi-node distributed training.
    • log_dir: Log output directory of msrun and all processes.
    • join: Specifies whether the msrun process remains after the training processes are started. The default value is True, indicating that the msrun process exits after all processes exit.
    • cluster_time_out: Timeout interval of cluster networking. The default value is 600s. The value can be controlled by the MSRUN_CLUSTER_TIME_OUT environment variable.
    • rank_table_file: Path of the rank table file. This parameter is added during startup only if the environment variable MS_RANKTABLE_ENABLE is set to True.
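
The following is a minimal sketch of the run-once check described in Method 1: it reads the automatically configured environment variables and exits immediately on every device whose logical ID is not 0. The placeholder where the one-time logic goes is hypothetical; replace it with your own preprocessing or setup code.

import os
import sys


def main():
    # Environment variables injected automatically in rank table file (RTF) boot mode.
    rank_table_file = os.environ.get("RANK_TABLE_FILE")
    ascend_device_id = int(os.environ.get("ASCEND_DEVICE_ID", "0"))
    rank_id = os.environ.get("RANK_ID")
    rank_size = os.environ.get("RANK_SIZE")
    print(f"RANK_TABLE_FILE={rank_table_file}, ASCEND_DEVICE_ID={ascend_device_id}, "
          f"RANK_ID={rank_id}, RANK_SIZE={rank_size}")

    # The boot file is executed once per PU. Keep one-time logic on the device
    # whose logical ID is 0 and exit directly on all other devices.
    if ascend_device_id != 0:
        sys.exit(0)

    # ... place the logic that must run only once here (hypothetical placeholder) ...


if __name__ == "__main__":
    main()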

PyTorch-GPU

For single-node multi-PU scenarios, the platform adds the --init_method "tcp://<ip>:<port>" parameter to the boot file.

For multi-node multi-PU scenarios, the platform adds the --init_method "tcp://<ip>:<port>", --rank <rank_id>, and --world_size <node_num> parameters to the boot file.

The preceding parameters must be parsed in the boot file.
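
The following is a minimal sketch of how a boot file might parse these injected parameters with argparse; the default values are illustrative assumptions, not platform guarantees.

import argparse

parser = argparse.ArgumentParser()
# Injected by the platform for single-node multi-PU jobs.
parser.add_argument('--init_method', type=str, default=None,
                    help='tcp://<ip>:<port> address for initializing the process group')
# Additionally injected for multi-node multi-PU jobs.
parser.add_argument('--rank', type=int, default=0, help='rank of the current node')
parser.add_argument('--world_size', type=int, default=1, help='number of nodes')
args, unknown = parser.parse_known_args()

print(f"init_method={args.init_method}, rank={args.rank}, world_size={args.world_size}")
# These values would typically be passed on to torch.distributed.init_process_group();
# see the referenced DDP example for the full training logic.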

For details about the code example of the PyTorch-GPU framework, see "Method 1" in Example: Creating a DDP Distributed Training Job (PyTorch + GPU).

TensorFlow-GPU

For a single-node job, ModelArts starts a training container that exclusively uses the resources on the node.

For a multi-node job, ModelArts starts a parameter server and a worker on the same node. It allocates parameter server and worker tasks in a 1:1 ratio. For example, in a two-node job, two parameter servers and two workers are allocated. ModelArts also injects the following parameters into the boot file:

--task_index <VC_TASK_INDEX> --ps_hosts <TF_PS_HOSTS> --worker_hosts <TF_WORKER_HOSTS> --job_name <MA_TASK_NAME> 

The following parameters must be parsed in the boot file.

Table 4 Parameters

• VC_TASK_INDEX: Task serial number, for example, 0, 1, or 2.
• TF_PS_HOSTS: Addresses of parameter server nodes, for example, [xx-ps-0.xx:TCP_PORT,xx-ps-1.xx:TCP_PORT]. TCP_PORT is a random port ranging from 5000 to 10000.
• TF_WORKER_HOSTS: Addresses of worker nodes, for example, [xx-worker-0.xx:TCP_PORT,xx-worker-1.xx:TCP_PORT]. TCP_PORT is a random port ranging from 5000 to 10000.
• MA_TASK_NAME: Task name, which can be ps or worker.
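
The following is a minimal sketch of parsing these injected parameters in a boot file; the default values are illustrative assumptions.

import argparse

parser = argparse.ArgumentParser()
# Parameters injected by ModelArts for multi-node TensorFlow-GPU jobs.
parser.add_argument('--task_index', type=int, default=0, help='serial number of this task')
parser.add_argument('--ps_hosts', type=str, default='', help='comma-separated parameter server addresses')
parser.add_argument('--worker_hosts', type=str, default='', help='comma-separated worker addresses')
parser.add_argument('--job_name', type=str, default='worker', help='ps or worker')
args, unknown = parser.parse_known_args()

ps_hosts = args.ps_hosts.split(',') if args.ps_hosts else []
worker_hosts = args.worker_hosts.split(',') if args.worker_hosts else []
print(f"job_name={args.job_name}, task_index={args.task_index}")
# ps_hosts, worker_hosts, job_name, and task_index can then be used to build the
# cluster definition (for example, a tf.train.ClusterSpec) required by the training code.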

Horovod/MPI/MindSpore-GPU

ModelArts uses mpirun to run training boot files for Horovod, MPI, or MindSpore-GPU. To use a preset engine in ModelArts Standard, simply edit the boot file (training script). ModelArts Standard automatically builds the mpirun command and training job cluster. The platform does not add extra parameters to the boot file.

Example of pytorch_synthetic_benchmark.py:

import argparse
import torch.backends.cudnn as cudnn
import torch.nn.functional as F
import torch.optim as optim
import torch.utils.data.distributed
from torchvision import models
import horovod.torch as hvd
import timeit
import numpy as np

# Benchmark settings
parser = argparse.ArgumentParser(description='PyTorch Synthetic Benchmark',
                                 formatter_class=argparse.ArgumentDefaultsHelpFormatter)
parser.add_argument('--fp16-allreduce', action='store_true', default=False,
                    help='use fp16 compression during allreduce')

parser.add_argument('--model', type=str, default='resnet50',
                    help='model to benchmark')
parser.add_argument('--batch-size', type=int, default=32,
                    help='input batch size')

parser.add_argument('--num-warmup-batches', type=int, default=10,
                    help='number of warm-up batches that don\'t count towards benchmark')
parser.add_argument('--num-batches-per-iter', type=int, default=10,
                    help='number of batches per benchmark iteration')
parser.add_argument('--num-iters', type=int, default=10,
                    help='number of benchmark iterations')

parser.add_argument('--no-cuda', action='store_true', default=False,
                    help='disables CUDA training')

parser.add_argument('--use-adasum', action='store_true', default=False,
                    help='use adasum algorithm to do reduction')

args = parser.parse_args()
args.cuda = not args.no_cuda and torch.cuda.is_available()

hvd.init()

if args.cuda:
    # Horovod: pin GPU to local rank.
    torch.cuda.set_device(hvd.local_rank())

cudnn.benchmark = True

# Set up standard model.
model = getattr(models, args.model)()

# By default, Adasum doesn't need scaling up learning rate.
lr_scaler = hvd.size() if not args.use_adasum else 1

if args.cuda:
    # Move model to GPU.
    model.cuda()
    # If using GPU Adasum allreduce, scale learning rate by local_size.
    if args.use_adasum and hvd.nccl_built():
        lr_scaler = hvd.local_size()

optimizer = optim.SGD(model.parameters(), lr=0.01 * lr_scaler)

# Horovod: (optional) compression algorithm.
compression = hvd.Compression.fp16 if args.fp16_allreduce else hvd.Compression.none

# Horovod: wrap optimizer with DistributedOptimizer.
optimizer = hvd.DistributedOptimizer(optimizer,
                                     named_parameters=model.named_parameters(),
                                     compression=compression,
                                     op=hvd.Adasum if args.use_adasum else hvd.Average)

# Horovod: broadcast parameters & optimizer state.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

# Set up fixed fake data
data = torch.randn(args.batch_size, 3, 224, 224)
target = torch.LongTensor(args.batch_size).random_() % 1000
if args.cuda:
    data, target = data.cuda(), target.cuda()


def benchmark_step():
    optimizer.zero_grad()
    output = model(data)
    loss = F.cross_entropy(output, target)
    loss.backward()
    optimizer.step()


def log(s, nl=True):
    if hvd.rank() != 0:
        return
    print(s, end='\n' if nl else '')


log('Model: %s' % args.model)
log('Batch size: %d' % args.batch_size)
device = 'GPU' if args.cuda else 'CPU'
log('Number of %ss: %d' % (device, hvd.size()))

# Warm-up
log('Running warmup...')
timeit.timeit(benchmark_step, number=args.num_warmup_batches)

# Benchmark
log('Running benchmark...')
img_secs = []
for x in range(args.num_iters):
    time = timeit.timeit(benchmark_step, number=args.num_batches_per_iter)
    img_sec = args.batch_size * args.num_batches_per_iter / time
    log('Iter #%d: %.1f img/sec per %s' % (x, img_sec, device))
    img_secs.append(img_sec)

# Results
img_sec_mean = np.mean(img_secs)
img_sec_conf = 1.96 * np.std(img_secs)
log('Img/sec per %s: %.1f +-%.1f' % (device, img_sec_mean, img_sec_conf))
log('Total img/sec on %d %s(s): %.1f +-%.1f' %
    (hvd.size(), device, hvd.size() * img_sec_mean, hvd.size() * img_sec_conf))

run_mpi.sh is as follows:

#!/bin/bash
MY_HOME=/home/ma-user

MY_SSHD_PORT=${MY_SSHD_PORT:-"36666"}

MY_MPI_BTL_TCP_IF=${MY_MPI_BTL_TCP_IF:-"eth0,bond0"}

MY_TASK_INDEX=${MA_TASK_INDEX:-${VC_TASK_INDEX:-${VK_TASK_INDEX}}}

MY_MPI_SLOTS=${MY_MPI_SLOTS:-"${MA_NUM_GPUS}"}

MY_MPI_TUNE_FILE="${MY_HOME}/env_for_user_process"

if [ -z "${MY_MPI_SLOTS}" ]; then
    echo "[run_mpi] MY_MPI_SLOTS is empty, set it to 1"
    MY_MPI_SLOTS="1"
fi

printf "MY_HOME: ${MY_HOME}\nMY_SSHD_PORT: ${MY_SSHD_PORT}\nMY_MPI_BTL_TCP_IF: ${MY_MPI_BTL_TCP_IF}\nMY_TASK_INDEX: ${MY_TASK_INDEX}\nMY_MPI_SLOTS: ${MY_MPI_SLOTS}\n"

env | grep -E '^MA_|SHARED_|^S3_|^PATH|^VC_WORKER_|^SCC|^CRED' | grep -v '=$' > ${MY_MPI_TUNE_FILE}
# add -x to each line
sed -i 's/^/-x /' ${MY_MPI_TUNE_FILE}

sed -i "s|{{MY_SSHD_PORT}}|${MY_SSHD_PORT}|g" ${MY_HOME}/etc/ssh/sshd_config

# start sshd service
bash -c "$(which sshd) -f ${MY_HOME}/etc/ssh/sshd_config"

# confirm the sshd is up
netstat -anp | grep LIS | grep ${MY_SSHD_PORT}

if [ $MY_TASK_INDEX -eq 0 ]; then
    # generate the hostfile of mpi
    for ((i=0; i<$MA_NUM_HOSTS; i++))
    do
        eval hostname=${MA_VJ_NAME}-${MA_TASK_NAME}-${i}.${MA_VJ_NAME}
        echo "[run_mpi] hostname: ${hostname}"

        ip=""
        while [ -z "$ip" ]; do
            ip=$(ping -c 1 ${hostname} | grep "PING" | sed -E 's/PING .* .([0-9.]+). .*/\1/g')
            sleep 1
        done
        echo "[run_mpi] resolved ip: ${ip}"

        # test the sshd is up
        while :
        do
            # check whether the sshd port on ${ip} is reachable via bash's /dev/tcp
            if cat < /dev/null > /dev/tcp/${ip}/${MY_SSHD_PORT}; then
                break
            fi
            sleep 1
        done

        echo "[run_mpi] the sshd of ip ${ip} is up"

        echo "${ip} slots=$MY_MPI_SLOTS" >> ${MY_HOME}/hostfile
    done

    printf "[run_mpi] hostfile:\n`cat ${MY_HOME}/hostfile`\n"
fi

RET_CODE=0

if [ $MY_TASK_INDEX -eq 0 ]; then

    echo "[run_mpi] start exec command time: "$(date +"%Y-%m-%d-%H:%M:%S")

    np=$(( ${MA_NUM_HOSTS} * ${MY_MPI_SLOTS} ))

    echo "[run_mpi] command: mpirun -np ${np} -hostfile ${MY_HOME}/hostfile -mca plm_rsh_args \"-p ${MY_SSHD_PORT}\" -tune ${MY_MPI_TUNE_FILE} ... $@"

    # execute mpirun at worker-0
    # mpirun
    mpirun \
        -np ${np} \
        -hostfile ${MY_HOME}/hostfile \
        -mca plm_rsh_args "-p ${MY_SSHD_PORT}" \
        -tune ${MY_MPI_TUNE_FILE} \
        -bind-to none -map-by slot \
        -x NCCL_DEBUG=INFO -x NCCL_SOCKET_IFNAME=${MY_MPI_BTL_TCP_IF} -x NCCL_SOCKET_FAMILY=AF_INET \
        -x HOROVOD_MPI_THREADS_DISABLE=1 \
        -x LD_LIBRARY_PATH \
        -mca pml ob1 -mca btl ^openib -mca plm_rsh_no_tree_spawn true \
        "$@"

    RET_CODE=$?

    if [ $RET_CODE -ne 0 ]; then
        echo "[run_mpi] exec command failed, exited with $RET_CODE"
    else
        echo "[run_mpi] exec command successfully, exited with $RET_CODE"
    fi

    # stop 1...N worker by killing the sleep proc
    sed -i '1d' ${MY_HOME}/hostfile
    if [ `cat ${MY_HOME}/hostfile | wc -l` -ne 0 ]; then
        echo "[run_mpi] stop 1 to (N - 1) worker by killing the sleep proc"

        # use double quotes so ${MY_MPI_SLOTS} is expanded; reduce each remaining host to a single slot
        sed -i "s/slots=${MY_MPI_SLOTS}/slots=1/g" ${MY_HOME}/hostfile
        printf "[run_mpi] hostfile:\n`cat ${MY_HOME}/hostfile`\n"

        mpirun \
        --hostfile ${MY_HOME}/hostfile \
        --mca btl_tcp_if_include ${MY_MPI_BTL_TCP_IF} \
        --mca plm_rsh_args "-p ${MY_SSHD_PORT}" \
        -x PATH -x LD_LIBRARY_PATH \
        pkill sleep \
        > /dev/null 2>&1
    fi

    echo "[run_mpi] exit time: "$(date +"%Y-%m-%d-%H:%M:%S")
else
    echo "[run_mpi] the training log is in worker-0"
    sleep 365d
    echo "[run_mpi] exit time: "$(date +"%Y-%m-%d-%H:%M:%S")
fi

exit $RET_CODE
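
Because run_mpi.sh forwards its arguments ("$@") to mpirun, the boot command for this example would typically invoke the wrapper with the training command appended, for example, bash run_mpi.sh python pytorch_synthetic_benchmark.py, with the paths adjusted to match your code directory.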

FAQs

  1. What Should I Do If RuntimeError: Socket Timeout Is Displayed During Distributed Process Group Initialization Using torchrun?

    If the RuntimeError: Socket Timeout error occurs during distributed process group initialization using torchrun, add the following environment variables and create the training job again to view the initialization details and further locate the fault (see the configuration example at the end of this question).

    • LOGLEVEL=INFO
    • TORCH_CPP_LOG_LEVEL=INFO
    • TORCH_DISTRIBUTED_DEBUG=DETAIL

    The RuntimeError: Socket Timeout error is caused by a significant time discrepancy between tasks when running the torchrun command. The time discrepancy is caused by initialization tasks, like downloading the training data and checkpoint read/write, which happen before the torchrun command is run. If the time taken to complete these initialization tasks varies significantly, a Socket Timeout error may occur. When this error happens, check the time difference between the torchrun execution points for each task. If the time difference is too large, optimize the initialization process before running the torchrun command to ensure a reasonable time gap.
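
    For example, the variables can be added using the same environments format as the torchrun example earlier in this section:

      "environments": {
          "LOGLEVEL": "INFO",
          "TORCH_CPP_LOG_LEVEL": "INFO",
          "TORCH_DISTRIBUTED_DEBUG": "DETAIL"
      }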