Starting Training Using a Preset Image's Boot File
ModelArts Standard offers preset images for popular AI engines, each tailored to a specific scenario. To train models efficiently, modify the boot command according to the requirements of the image you choose.
ModelArts provides four preset images for creating training jobs.
Table 1 Preset images for creating training jobs

Scenario | Preset Image | Description | Advantages and Suggestions
---|---|---|---
NPU | Ascend-Powered-Engine | A set of AI images, runtime environments, and boot modes tailored for Huawei Cloud AI accelerator chips. There are three boot modes: rank table file (general scenarios), torchrun (PyTorch plus the Ascend Extension for PyTorch), and msrun (available after MindSpore is installed; works independently of external libraries or configuration files, includes disaster recovery, and maintains high security). | Select this option if you use Huawei Cloud NPUs. It can improve model training efficiency.
GPU | PyTorch-GPU | PyTorch-GPU preset image. | Flexible and easy to use; works well for research and quick development.
GPU | TensorFlow-GPU | TensorFlow-GPU preset image. | Excels at deploying models in production environments and optimizing performance, making it ideal for enterprise applications.
GPU | Horovod/MPI/MindSpore-GPU | ModelArts uses mpirun to run boot files for Horovod, MPI, or MindSpore-GPU. | Requires the OpenMPI library. Choose it only if you are familiar with OpenMPI.
This section describes how to modify the boot file when creating a training job using different preset images.
Ascend-Powered-Engine
Ascend-Powered-Engine is a unique engine that combines an AI framework, runtime environment, and boot mode tailored to Ascend accelerator chips. Unlike conventional AI frameworks like PyTorch or TensorFlow, or parallel execution frameworks such as MPI, it serves a distinct purpose.
Snt9 Ascend accelerators run on Arm CPU environments, which means their Docker images are Arm images. Ascend-Powered-Engine images include Huawei's CANN (heterogeneous computing architecture) compute library instead of the CUDA (Compute Unified Device Architecture) compute library used in GPU setups. CANN supports AI tasks and works with Ascend drivers.
After a training job is submitted, ModelArts Standard automatically runs the boot file. When using the Ascend-Powered-Engine framework, both single-node and distributed jobs start with the same command.
The Ascend-Powered-Engine framework offers three boot modes. By default, the boot file is executed based on the rank table file (RANK_TABLE_FILE). You can also set the MA_RUN_METHOD environment variable to run the boot file with one of two alternative options: torchrun or msrun.
- Method 1: Using an RTF file to start a training job
If the environment variable MA_RUN_METHOD is not configured, ModelArts Standard uses a rank table file to start the boot file of the training job by default.
The number of times the boot file runs for each training job depends on the number of PUs used. When a job is running, the boot file is executed once for each PU. For example, in a single-node job with one PU, the boot file runs once. In a single-node job with eight PUs, it runs eight times. So, do not listen on ports in the boot file.
The following environment variables are automatically configured in the boot file:
- RANK_TABLE_FILE: rank table file path.
- ASCEND_DEVICE_ID: logical device ID. For example, for single-PU training, the value is always 0.
- RANK_ID: logical (sequential) number of a device in a training job.
- RANK_SIZE: number of devices in the rank table file. For example, the value is 4 for 4 Snt9b devices.
To ensure the boot file runs only once, check the ASCEND_DEVICE_ID value. If it is 0, execute the logic; otherwise, exit directly.
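A minimal sketch of this check in a Python boot file (illustrative; it assumes the one-off logic lives in a main() function):

import os
import sys

def main():
    # One-off logic that should run only once per node goes here.
    ...

if __name__ == "__main__":
    # ModelArts starts the boot file once per PU and sets ASCEND_DEVICE_ID
    # for each process; only the process with logical device ID 0 proceeds.
    if os.environ.get("ASCEND_DEVICE_ID", "0") != "0":
        sys.exit(0)
    main()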
To enable ranktable dynamic routing for training network acceleration, add the environment variable ROUTE_PLAN=true. For details, see Enabling Dynamic Route Acceleration for Training Jobs.
For details about the example code file mindspore-verification.py of the Ascend-Powered-Engine framework, see Training the mindspore-verification.py File.
- Method 2: Using the torchrun command to start a training job
If the environment variable MA_RUN_METHOD is set to torchrun, ModelArts Standard uses the torchrun command to run the boot file.
The PyTorch version must be 1.11.0 or later.
- For single-node jobs, ModelArts Standard uses the following command to start the boot file:
torchrun --standalone --nnodes=${MA_NUM_HOSTS} --nproc_per_node=${MA_NUM_GPUS} ${MA_EXTRA_TORCHRUN_PARAMS} "Boot file" {arg1} {arg2} ...
- For multi-node jobs, ModelArts Standard uses the following command to start the boot file:
torchrun --nnodes=${MA_NUM_HOSTS} --nproc_per_node=${MA_NUM_GPUS} --node_rank=${VC_TASK_INDEX} --master_addr={master_addr} --master_port=${MA_TORCHRUN_MASTER_PORT} --rdzv_id={ma_job_name} --rdzv_backend=static ${MA_EXTRA_TORCHRUN_PARAMS} "Boot file" {arg1} {arg2} ...
Parameters:
Table 2 Parameters for starting a training job using the torchrun command

Parameter | Description
---|---
standalone | Identifier of a single-node job.
nnodes | Number of task nodes.
nproc_per_node | Number of main processes started by each task node. Set this parameter to the number of NPUs allocated to the task.
node_rank | Task rank, which is used for multi-task distributed training.
master_addr | Address of the main task (rank 0). Set it to the communication domain name of worker-0.
master_port | Port used for communication during distributed training on the main task (rank 0). The default value is 18888. If a master_port conflict occurs, change the port by configuring the MA_TORCHRUN_MASTER_PORT environment variable.
rdzv_id | Rendezvous ID. Set it to a value based on the training job ID.
rdzv_backend | Rendezvous backend, which is fixed at static. That is, master_addr and master_port are used instead of the Rendezvous service.
- Additionally, you can configure the MA_EXTRA_TORCHRUN_PARAMS environment variable to add torchrun command parameters or overwrite the preset ones. The following example configures the rdzv_conf parameter of the torchrun command:
"environments": { "MA_RUN_METHOD": "torchrun", "MA_EXTRA_TORCHRUN_PARAMS": "--rdzv_conf=timeout=7200" }
If the RuntimeError: Socket Timeout error occurs during distributed process group initialization using torchrun, see the FAQ at the end of this section to further locate the fault.
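For reference, the boot file started by torchrun is an ordinary PyTorch script; torchrun launches one process per NPU and passes the rank information through environment variables. The following is a minimal sketch, assuming the image provides the Ascend Extension for PyTorch (torch_npu) with the hccl backend:

import os
import torch
import torch.distributed as dist
import torch_npu  # assumption: registers the NPU device and the hccl backend

def main():
    # torchrun sets RANK, WORLD_SIZE, and LOCAL_RANK for every process it starts.
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    local_rank = int(os.environ["LOCAL_RANK"])

    torch.npu.set_device(local_rank)
    dist.init_process_group(backend="hccl")

    print(f"initialized rank {rank}/{world_size} on local NPU {local_rank}")
    # ... build the model, wrap it with DistributedDataParallel, and train ...

    dist.destroy_process_group()

if __name__ == "__main__":
    main()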
- Method 3: Using the msrun command to start a training job
If the environment variable MA_RUN_METHOD is set to msrun, ModelArts Standard uses the msrun command to run the boot file.
The MindSpore version must be 2.3.0 or later.
This solution supports dynamic networking and rank table file-based networking. If you set the environment variable MS_RANKTABLE_ENABLE to True, msrun reads the rank table file for networking. Otherwise, dynamic networking is used by default.
ModelArts Standard uses the following msrun command to start the boot file:
msrun --worker_num=${msrun_worker_num} --local_worker_num=${MA_NUM_GPUS} --master_addr=${msrun_master_addr} --node_rank=${VC_TASK_INDEX} --master_port=${msrun_master_port} --log_dir=${msrun_log_dir} --join=True --cluster_time_out=${MSRUN_CLUSTER_TIME_OUT} --rank_table_file=${msrun_rank_table_file} "Boot file" {arg1} {arg2} ...
Parameters:
Table 3 Parameters for starting a training job using the msrun command

Parameter | Description
---|---
worker_num | Total number of processes, which is also the number of PUs involved, as each PU starts one process.
local_worker_num | Number of processes on the current node, which is also the number of PUs used by the current node.
master_addr | IP address of the node where the msrun scheduling process is located. This parameter does not need to be configured for single-node jobs.
master_port | Port number of the msrun scheduling process.
node_rank | Rank (index) of the current node in the cluster.
log_dir | Log output directory of msrun and all processes.
join | Specifies whether msrun waits for the training processes it starts. The default value is True, indicating that msrun exits only after all training processes exit.
cluster_time_out | Timeout interval for cluster networking. The default value is 600s. The value can be controlled by the MSRUN_CLUSTER_TIME_OUT environment variable.
rank_table_file | Path of the rank table file. This parameter is added during startup only if the environment variable MS_RANKTABLE_ENABLE is set to True.
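For reference, the boot file started by msrun is an ordinary MindSpore script; msrun exports the cluster networking information, so the script only needs to initialize communication. The following is a minimal sketch, assuming MindSpore 2.3.0 or later on Ascend:

import mindspore as ms
from mindspore.communication import init, get_rank, get_group_size

# msrun provides the networking details (master address, port, ranks)
# through environment variables, so init() can join the cluster directly.
ms.set_context(device_target="Ascend")
init()

print(f"rank {get_rank()} of {get_group_size()} is ready")
# ... define the network, configure the parallel context, and train ...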
PyTorch-GPU
For single-node multi-PU scenarios, the platform adds the --init_method "tcp://<ip>:<port>" parameter to the boot file.
For multi-node multi-PU scenarios, the platform adds the --init_method "tcp://<ip>:<port>" --rank <rank_id> --world_size <node_num> parameters to the boot file.
The preceding parameters must be parsed in the boot file.
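A minimal parsing sketch (illustrative; the injected parameter names are listed above, and the parsed values are then used to initialize torch.distributed as shown in the linked example):

import argparse

parser = argparse.ArgumentParser()
# Injected for single-node and multi-node multi-PU jobs.
parser.add_argument("--init_method", type=str, default=None)   # tcp://<ip>:<port>
# Injected in addition for multi-node multi-PU jobs.
parser.add_argument("--rank", type=int, default=0)             # node rank
parser.add_argument("--world_size", type=int, default=1)       # number of nodes
args, unparsed = parser.parse_known_args()

# args.init_method, args.rank, and args.world_size are then combined with the
# number of GPUs per node to build the arguments for
# torch.distributed.init_process_group().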
For details about the code example of the PyTorch-GPU framework, see "Method 1" in Example: Creating a DDP Distributed Training Job (PyTorch + GPU).
TensorFlow-GPU
For a single-node job, ModelArts starts a training container that exclusively uses the resources on the node.
For a multi-node job, ModelArts starts a parameter server and a worker on the same node. It allocates parameter server and worker tasks in a 1:1 ratio. For example, in a two-node job, two parameter servers and two workers are allocated. ModelArts also injects the following parameters into the boot file:
--task_index <VC_TASK_INDEX> --ps_hosts <TF_PS_HOSTS> --worker_hosts <TF_WORKER_HOSTS> --job_name <MA_TASK_NAME>
The following parameters must be parsed in the boot file.
Parameter | Description
---|---
VC_TASK_INDEX | Task serial number, for example, 0, 1, or 2.
TF_PS_HOSTS | Addresses of parameter server nodes, for example, [xx-ps-0.xx:TCP_PORT,xx-ps-1.xx:TCP_PORT]. TCP_PORT is a random port ranging from 5000 to 10000.
TF_WORKER_HOSTS | Addresses of worker nodes, for example, [xx-worker-0.xx:TCP_PORT,xx-worker-1.xx:TCP_PORT]. TCP_PORT is a random port ranging from 5000 to 10000.
MA_TASK_NAME | Task name, which can be ps or worker.
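A minimal sketch of parsing these parameters in a TensorFlow 1.x-style parameter server job (illustrative; the actual training loop depends on your code):

import argparse
import tensorflow as tf

parser = argparse.ArgumentParser()
parser.add_argument("--task_index", type=int, default=0)        # VC_TASK_INDEX
parser.add_argument("--ps_hosts", type=str, default="")         # TF_PS_HOSTS
parser.add_argument("--worker_hosts", type=str, default="")     # TF_WORKER_HOSTS
parser.add_argument("--job_name", type=str, default="worker")   # MA_TASK_NAME: ps or worker
args, _ = parser.parse_known_args()

# The host lists are comma-separated "host:port" addresses.
cluster = tf.train.ClusterSpec({
    "ps": args.ps_hosts.strip("[]").split(","),
    "worker": args.worker_hosts.strip("[]").split(","),
})
server = tf.train.Server(cluster, job_name=args.job_name, task_index=args.task_index)

if args.job_name == "ps":
    server.join()  # parameter servers only host variables
# Workers build the graph and run the training loop against server.target.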
Horovod/MPI/MindSpore-GPU
ModelArts uses mpirun to run training boot files for Horovod, MPI, or MindSpore-GPU. To use a preset engine in ModelArts Standard, simply edit the boot file (training script). ModelArts Standard automatically builds the mpirun command and training job cluster. The platform does not add extra parameters to the boot file.
Example of pytorch_synthetic_benchmark.py:
import argparse
import torch.backends.cudnn as cudnn
import torch.nn.functional as F
import torch.optim as optim
import torch.utils.data.distributed
from torchvision import models
import horovod.torch as hvd
import timeit
import numpy as np

# Benchmark settings
parser = argparse.ArgumentParser(description='PyTorch Synthetic Benchmark',
                                 formatter_class=argparse.ArgumentDefaultsHelpFormatter)
parser.add_argument('--fp16-allreduce', action='store_true', default=False,
                    help='use fp16 compression during allreduce')
parser.add_argument('--model', type=str, default='resnet50',
                    help='model to benchmark')
parser.add_argument('--batch-size', type=int, default=32,
                    help='input batch size')
parser.add_argument('--num-warmup-batches', type=int, default=10,
                    help='number of warm-up batches that don\'t count towards benchmark')
parser.add_argument('--num-batches-per-iter', type=int, default=10,
                    help='number of batches per benchmark iteration')
parser.add_argument('--num-iters', type=int, default=10,
                    help='number of benchmark iterations')
parser.add_argument('--no-cuda', action='store_true', default=False,
                    help='disables CUDA training')
parser.add_argument('--use-adasum', action='store_true', default=False,
                    help='use adasum algorithm to do reduction')

args = parser.parse_args()
args.cuda = not args.no_cuda and torch.cuda.is_available()

hvd.init()

if args.cuda:
    # Horovod: pin GPU to local rank.
    torch.cuda.set_device(hvd.local_rank())

cudnn.benchmark = True

# Set up standard model.
model = getattr(models, args.model)()

# By default, Adasum doesn't need scaling up learning rate.
lr_scaler = hvd.size() if not args.use_adasum else 1

if args.cuda:
    # Move model to GPU.
    model.cuda()
    # If using GPU Adasum allreduce, scale learning rate by local_size.
    if args.use_adasum and hvd.nccl_built():
        lr_scaler = hvd.local_size()

optimizer = optim.SGD(model.parameters(), lr=0.01 * lr_scaler)

# Horovod: (optional) compression algorithm.
compression = hvd.Compression.fp16 if args.fp16_allreduce else hvd.Compression.none

# Horovod: wrap optimizer with DistributedOptimizer.
optimizer = hvd.DistributedOptimizer(optimizer,
                                     named_parameters=model.named_parameters(),
                                     compression=compression,
                                     op=hvd.Adasum if args.use_adasum else hvd.Average)

# Horovod: broadcast parameters & optimizer state.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

# Set up fixed fake data
data = torch.randn(args.batch_size, 3, 224, 224)
target = torch.LongTensor(args.batch_size).random_() % 1000
if args.cuda:
    data, target = data.cuda(), target.cuda()


def benchmark_step():
    optimizer.zero_grad()
    output = model(data)
    loss = F.cross_entropy(output, target)
    loss.backward()
    optimizer.step()


def log(s, nl=True):
    if hvd.rank() != 0:
        return
    print(s, end='\n' if nl else '')


log('Model: %s' % args.model)
log('Batch size: %d' % args.batch_size)
device = 'GPU' if args.cuda else 'CPU'
log('Number of %ss: %d' % (device, hvd.size()))

# Warm-up
log('Running warmup...')
timeit.timeit(benchmark_step, number=args.num_warmup_batches)

# Benchmark
log('Running benchmark...')
img_secs = []
for x in range(args.num_iters):
    time = timeit.timeit(benchmark_step, number=args.num_batches_per_iter)
    img_sec = args.batch_size * args.num_batches_per_iter / time
    log('Iter #%d: %.1f img/sec per %s' % (x, img_sec, device))
    img_secs.append(img_sec)

# Results
img_sec_mean = np.mean(img_secs)
img_sec_conf = 1.96 * np.std(img_secs)
log('Img/sec per %s: %.1f +-%.1f' % (device, img_sec_mean, img_sec_conf))
log('Total img/sec on %d %s(s): %.1f +-%.1f' %
    (hvd.size(), device, hvd.size() * img_sec_mean, hvd.size() * img_sec_conf))
run_mpi.sh is as follows:
#!/bin/bash
MY_HOME=/home/ma-user

MY_SSHD_PORT=${MY_SSHD_PORT:-"36666"}

MY_MPI_BTL_TCP_IF=${MY_MPI_BTL_TCP_IF:-"eth0,bond0"}

MY_TASK_INDEX=${MA_TASK_INDEX:-${VC_TASK_INDEX:-${VK_TASK_INDEX}}}

MY_MPI_SLOTS=${MY_MPI_SLOTS:-"${MA_NUM_GPUS}"}

MY_MPI_TUNE_FILE="${MY_HOME}/env_for_user_process"

if [ -z ${MY_MPI_SLOTS} ]; then
    echo "[run_mpi] MY_MPI_SLOTS is empty, set it be 1"
    MY_MPI_SLOTS="1"
fi

printf "MY_HOME: ${MY_HOME}\nMY_SSHD_PORT: ${MY_SSHD_PORT}\nMY_MPI_BTL_TCP_IF: ${MY_MPI_BTL_TCP_IF}\nMY_TASK_INDEX: ${MY_TASK_INDEX}\nMY_MPI_SLOTS: ${MY_MPI_SLOTS}\n"

env | grep -E '^MA_|SHARED_|^S3_|^PATH|^VC_WORKER_|^SCC|^CRED' | grep -v '=$' > ${MY_MPI_TUNE_FILE}
# add -x to each line
sed -i 's/^/-x /' ${MY_MPI_TUNE_FILE}

sed -i "s|{{MY_SSHD_PORT}}|${MY_SSHD_PORT}|g" ${MY_HOME}/etc/ssh/sshd_config

# start sshd service
bash -c "$(which sshd) -f ${MY_HOME}/etc/ssh/sshd_config"

# confirm the sshd is up
netstat -anp | grep LIS | grep ${MY_SSHD_PORT}

if [ $MY_TASK_INDEX -eq 0 ]; then
    # generate the hostfile of mpi
    for ((i=0; i<$MA_NUM_HOSTS; i++))
    do
        eval hostname=${MA_VJ_NAME}-${MA_TASK_NAME}-${i}.${MA_VJ_NAME}
        echo "[run_mpi] hostname: ${hostname}"

        ip=""
        while [ -z "$ip" ]; do
            ip=$(ping -c 1 ${hostname} | grep "PING" | sed -E 's/PING .* .([0-9.]+). .*/\1/g')
            sleep 1
        done
        echo "[run_mpi] resolved ip: ${ip}"

        # test the sshd is up
        while :
        do
            if [ cat < /dev/null >/dev/tcp/${ip}/${MY_SSHD_PORT} ]; then
                break
            fi
            sleep 1
        done

        echo "[run_mpi] the sshd of ip ${ip} is up"

        echo "${ip} slots=$MY_MPI_SLOTS" >> ${MY_HOME}/hostfile
    done

    printf "[run_mpi] hostfile:\n`cat ${MY_HOME}/hostfile`\n"
fi

RET_CODE=0

if [ $MY_TASK_INDEX -eq 0 ]; then

    echo "[run_mpi] start exec command time: "$(date +"%Y-%m-%d-%H:%M:%S")

    np=$(( ${MA_NUM_HOSTS} * ${MY_MPI_SLOTS} ))

    echo "[run_mpi] command: mpirun -np ${np} -hostfile ${MY_HOME}/hostfile -mca plm_rsh_args \"-p ${MY_SSHD_PORT}\" -tune ${MY_MPI_TUNE_FILE} ... $@"

    # execute mpirun at worker-0
    # mpirun
    mpirun \
        -np ${np} \
        -hostfile ${MY_HOME}/hostfile \
        -mca plm_rsh_args "-p ${MY_SSHD_PORT}" \
        -tune ${MY_MPI_TUNE_FILE} \
        -bind-to none -map-by slot \
        -x NCCL_DEBUG=INFO -x NCCL_SOCKET_IFNAME=${MY_MPI_BTL_TCP_IF} -x NCCL_SOCKET_FAMILY=AF_INET \
        -x HOROVOD_MPI_THREADS_DISABLE=1 \
        -x LD_LIBRARY_PATH \
        -mca pml ob1 -mca btl ^openib -mca plm_rsh_no_tree_spawn true \
        "$@"

    RET_CODE=$?

    if [ $RET_CODE -ne 0 ]; then
        echo "[run_mpi] exec command failed, exited with $RET_CODE"
    else
        echo "[run_mpi] exec command successfully, exited with $RET_CODE"
    fi

    # stop 1...N worker by killing the sleep proc
    sed -i '1d' ${MY_HOME}/hostfile
    if [ `cat ${MY_HOME}/hostfile | wc -l` -ne 0 ]; then
        echo "[run_mpi] stop 1 to (N - 1) worker by killing the sleep proc"

        sed -i 's/${MY_MPI_SLOTS}/1/g' ${MY_HOME}/hostfile
        printf "[run_mpi] hostfile:\n`cat ${MY_HOME}/hostfile`\n"

        mpirun \
            --hostfile ${MY_HOME}/hostfile \
            --mca btl_tcp_if_include ${MY_MPI_BTL_TCP_IF} \
            --mca plm_rsh_args "-p ${MY_SSHD_PORT}" \
            -x PATH -x LD_LIBRARY_PATH \
            pkill sleep \
            > /dev/null 2>&1
    fi

    echo "[run_mpi] exit time: "$(date +"%Y-%m-%d-%H:%M:%S")
else
    echo "[run_mpi] the training log is in worker-0"
    sleep 365d
    echo "[run_mpi] exit time: "$(date +"%Y-%m-%d-%H:%M:%S")
fi

exit $RET_CODE
FAQs
- What Should I Do If RuntimeError: Socket Timeout Is Displayed During Distributed Process Group Initialization Using torchrun?
If the RuntimeError: Socket Timeout error occurs during the distributed process group initialization using torchrun, you can add the following environment variables to create a training job again to view initialization details and further locate the fault.
- LOGLEVEL=INFO
- TORCH_CPP_LOG_LEVEL=INFO
- TORCH_DISTRIBUTED_DEBUG=DETAIL
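These variables can be set in the training job's environment variable configuration, in the same format as the MA_EXTRA_TORCHRUN_PARAMS example shown earlier:

"environments": { "LOGLEVEL": "INFO", "TORCH_CPP_LOG_LEVEL": "INFO", "TORCH_DISTRIBUTED_DEBUG": "DETAIL" }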
The RuntimeError: Socket Timeout error is caused by a significant time discrepancy between tasks when running the torchrun command. The time discrepancy is caused by initialization tasks, like downloading the training data and checkpoint read/write, which happen before the torchrun command is run. If the time taken to complete these initialization tasks varies significantly, a Socket Timeout error may occur. When this error happens, check the time difference between the torchrun execution points for each task. If the time difference is too large, optimize the initialization process before running the torchrun command to ensure a reasonable time gap.