Starting a Preset Image's Boot File

Updated on 2024-12-26 GMT+08:00

ModelArts Standard offers multiple AI images for model training, which can be adapted by modifying their boot commands.

This section describes how to modify the boot file when creating a training job using different preset images.

Ascend-Powered-Engine Boot Principles

Ascend-Powered-Engine is a unique engine that combines multiple AI frameworks, runtime environments, and boot modes tailored to Ascend accelerator chips.

Mainstream Snt9 Ascend accelerators run on servers with Arm CPUs, so the upper-layer Docker images are Arm images. Just as the NVIDIA CUDA (Compute Unified Device Architecture) compute library is installed in images for GPU scenarios, the Huawei CANN (heterogeneous computing architecture) compute library, which is adapted to the Ascend driver, is installed in images powered by the Ascend-Powered-Engine.

After a training job is submitted, ModelArts Standard automatically runs the boot file.

By default, the Ascend-Powered-Engine framework starts the boot file as follows:

The number of times the boot file runs for each training job depends on the number of cards used: while a job is running, the boot file is executed once per card. For example, the boot file runs once for a single-node single-card job and eight times for a single-node eight-card job. For this reason, do not listen on ports in the boot file.

The following environment variables are automatically configured in the boot file:

  • RANK_TABLE_FILE: path of the rank table file.
  • ASCEND_DEVICE_ID: logical device ID. For single-card training, the value is always 0.
  • RANK_ID: logical (sequential) number of a device in a training job.
  • RANK_SIZE: number of devices in the rank table file. For example, the value is 4 for four snt9b devices.

To ensure the boot file runs only once, check the ASCEND_DEVICE_ID value. If it is 0, execute the logic; otherwise, exit directly.
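
A minimal sketch of this check in a Python boot file (the one-time logic shown is only a hypothetical placeholder):

import os
import sys

# The boot file is started once per card; only the process whose
# ASCEND_DEVICE_ID is 0 should run the one-time logic.
if os.environ.get("ASCEND_DEVICE_ID", "0") != "0":
    sys.exit(0)

# One-time logic goes here, for example preparing data or writing summary logs.
print("Running one-time logic on logical device 0")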

For details about the example code file mindspore-verification.py of the Ascend-Powered-Engine framework, see Training the mindspore-verification.py File.

The command for starting Ascend-Powered-Engine in standalone mode is the same as that in distributed mode.

The Ascend-Powered-Engine framework offers multiple boot modes. By default, the boot file is executed based on RANK_TABLE_FILE. You can also set the MA_RUN_METHOD environment variable to run the boot file in a different way; MA_RUN_METHOD can be set to torchrun or msrun.

  • If MA_RUN_METHOD is set to torchrun, ModelArts Standard uses the torchrun command to run the boot file. A minimal boot-file sketch for this mode is provided at the end of this list.
    NOTE:

    The PyTorch version must be 1.11.0 or later.

    • For single-node jobs, ModelArts Standard uses the following command to start the boot file:
      torchrun --standalone --nnodes=${MA_NUM_HOSTS} --nproc_per_node=${MA_NUM_GPUS} ${MA_EXTRA_TORCHRUN_PARAMS} "Boot file" {arg1} {arg2} ...
    • For multi-node jobs, ModelArts Standard uses the following command to start the boot file:
      torchrun --nnodes=${MA_NUM_HOSTS} --nproc_per_node=${MA_NUM_GPUS} --node_rank=${VC_TASK_INDEX} --master_addr={master_addr} --master_port=${MA_TORCHRUN_MASTER_PORT} --rdzv_id={ma_job_name} --rdzv_backend=static ${MA_EXTRA_TORCHRUN_PARAMS} "Boot file" {arg1} {arg2} ...

    Parameters:

    • standalone: identifier of a single-node job.
    • nnodes: number of task nodes.
    • nproc_per_node: number of main processes started by each task node. Set this parameter to the number of NPUs allocated to the task.
    • node_rank: task rank, which is used for multi-task distributed training.
    • master_addr: address of the main task (rank 0). Set it to the communication domain name of worker-0.
    • master_port: port used for communication on the main task (rank 0) during distributed training. The default value is 18888. If a port conflict occurs, change the port by setting the MA_TORCHRUN_MASTER_PORT environment variable.
    • rdzv_id: rendezvous ID, set to a value that contains the training job ID.
    • rdzv_backend: rendezvous backend, which is fixed at static; that is, master_addr and master_port are used instead of the rendezvous service. In addition, you can configure the MA_EXTRA_TORCHRUN_PARAMS environment variable to add torchrun parameters or overwrite the preset ones. The following example configures the rdzv_conf parameter in the torchrun command:
      "environments": {
      "MA_RUN_METHOD": "torchrun",
      "MA_EXTRA_TORCHRUN_PARAMS": "--rdzv_conf=timeout=7200"
      }
    NOTE:

    If the RuntimeError: Socket Timeout error occurs during distributed process group initialization with torchrun, add the following environment variables and create the training job again to view the initialization details and locate the fault.

    • LOGLEVEL=INFO
    • TORCH_CPP_LOG_LEVEL=INFO
    • TORCH_DISTRIBUTED_DEBUG=DETAIL

    The RuntimeError: Socket Timeout error is caused by a large time discrepancy between tasks when they run the torchrun command. The discrepancy usually comes from initialization work, such as downloading the training data or reading and writing checkpoints, that happens before torchrun is run. If the time this work takes varies significantly across tasks, a Socket Timeout error may occur. When this error happens, check the difference between the torchrun execution times of the tasks; if it is too large, optimize the initialization work that precedes torchrun so that all tasks reach it at roughly the same time.

  • If MA_RUN_METHOD is set to msrun, ModelArts Standard uses the msrun command to run the boot file.
    NOTE:

    The MindSpore version must be 2.3.0 or later.

    This solution supports dynamic networking and rank table file-based networking. If you set the environment variable MS_RANKTABLE_ENABLE to True, msrun reads the rank table file for networking. Otherwise, dynamic networking is used by default.

    In this mode, ModelArts Standard uses the following msrun command to start the boot file:

    msrun --worker_num=${msrun_worker_num} --local_worker_num=${MA_NUM_GPUS} --master_addr=${msrun_master_addr} --node_rank=${VC_TASK_INDEX} --master_port=${msrun_master_port} --log_dir=${msrun_log_dir} --join=True --cluster_time_out=${MSRUN_CLUSTER_TIME_OUT} --rank_table_file=${msrun_rank_table_file} "Boot file" {arg1} {arg2} ...

    Parameters:

    • worker_num: total number of processes, which equals the total number of cards, as each card starts one process.
    • local_worker_num: number of processes on the current node, which is also the number of cards used by the current node.
    • master_addr: IP address of the node where the msrun scheduling process is located. This parameter does not need to be configured for single-node jobs.
    • master_port: port number of the msrun scheduling process.
    • node_rank: ID of the current node.
    • log_dir: log output directory of msrun and all processes.
    • join: whether the msrun process keeps running after the training processes are started. The default value is True, indicating that the msrun process exits only after all training processes exit.
    • cluster_time_out: timeout interval of the cluster networking. The default value is 600s. The value can be controlled by the MSRUN_CLUSTER_TIME_OUT environment variable.
    • rank_table_file: path of the rank table file. This parameter is added during startup only if the environment variable MS_RANKTABLE_ENABLE is set to True.
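
For the torchrun mode described above, the boot file does not build the torchrun command itself; it only needs to read the rank information that torchrun exports to each process it starts. The following is a minimal sketch, assuming a PyTorch boot file; the backend choice is an assumption, not a platform requirement:

import os
import torch.distributed as dist

# torchrun exports these variables to every process it launches.
rank = int(os.environ["RANK"])
local_rank = int(os.environ["LOCAL_RANK"])
world_size = int(os.environ["WORLD_SIZE"])

# Initialize the default process group from the environment variables.
# "hccl" is assumed here for Ascend NPUs and requires torch_npu;
# use "nccl" or "gloo" in other environments.
dist.init_process_group(backend="hccl", init_method="env://")

print(f"rank {rank}/{world_size} (local rank {local_rank}) initialized")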

PyTorch-GPU Boot Principles

For single-node multi-card scenarios, the platform adds the --init_method "tcp://<ip>:<port>" parameter to the boot file.

For multi-node multi-card scenarios, the platform adds the --init_method "tcp://<ip>:<port>" --rank <rank_id> --world_size <node_num> parameters to the boot file.

The preceding parameters must be parsed in the boot file.
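
A minimal sketch of parsing these injected parameters in a Python boot file; passing them on to torch.distributed afterwards is a typical use, not something the platform enforces:

import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--init_method', type=str, default=None)  # tcp://<ip>:<port>, added by the platform
parser.add_argument('--rank', type=int, default=0)            # added for multi-node jobs only
parser.add_argument('--world_size', type=int, default=1)      # added for multi-node jobs only
args, unparsed = parser.parse_known_args()

# The parsed values are typically passed to
# torch.distributed.init_process_group(init_method=args.init_method, ...).
print(args.init_method, args.rank, args.world_size)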

For details about the code example of the PyTorch-GPU framework, see "Method 1" in Example: Creating a DDP Distributed Training Job (PyTorch + GPU).

TensorFlow-GPU Boot Principles

For a single-node job, ModelArts starts a training container that exclusively uses the resources on the node.

For a multi-node job, ModelArts starts a parameter server and a worker on the same node. It allocates parameter server and worker tasks in a 1:1 ratio. For example, in a two-node job, two parameter servers and two workers are allocated. ModelArts also injects the following parameters into the boot file:

--task_index <VC_TASK_INDEX> --ps_hosts <TF_PS_HOSTS> --worker_hosts <TF_WORKER_HOSTS> --job_name <MA_TASK_NAME> 

The boot file must parse the following parameters; a minimal parsing sketch follows the list.

  • VC_TASK_INDEX: task serial number, for example, 0, 1, or 2.
  • TF_PS_HOSTS: addresses of parameter server nodes, for example, [xx-ps-0.xx:TCP_PORT,xx-ps-1.xx:TCP_PORT]. The value of TCP_PORT is a random port ranging from 5,000 to 10,000.
  • TF_WORKER_HOSTS: addresses of worker nodes, for example, [xx-worker-0.xx:TCP_PORT,xx-worker-1.xx:TCP_PORT]. The value of TCP_PORT is a random port ranging from 5,000 to 10,000.
  • MA_TASK_NAME: task name, which can be ps or worker.
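
A minimal sketch of parsing these parameters in a Python boot file; how the host lists are then used (for example, to build a tf.train.ClusterSpec) depends on your training code and is only suggested in the comments:

import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--task_index', type=int, default=0)
parser.add_argument('--ps_hosts', type=str, default='')
parser.add_argument('--worker_hosts', type=str, default='')
parser.add_argument('--job_name', type=str, default='worker')  # "ps" or "worker"
args, unparsed = parser.parse_known_args()

# The host lists are comma-separated host:port entries.
ps_hosts = args.ps_hosts.strip('[]').split(',') if args.ps_hosts else []
worker_hosts = args.worker_hosts.strip('[]').split(',') if args.worker_hosts else []

# Typical use: build a cluster definition such as
# tf.train.ClusterSpec({'ps': ps_hosts, 'worker': worker_hosts}) and start
# either a parameter server or a worker based on args.job_name.
print(args.job_name, args.task_index, ps_hosts, worker_hosts)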

For details, see the example code file mnist.py (single-node) of the TensorFlow-GPU framework.

Horovod/MPI/MindSpore-GPU

ModelArts uses mpirun to run boot files for Horovod, MPI, or MindSpore-GPU. To use a preset engine in ModelArts Standard, simply edit the boot file (training script). ModelArts Standard automatically builds the mpirun command and training job cluster. The platform does not add extra parameters to the boot file.

Example of pytorch_synthetic_benchmark.py:

import argparse
import torch.backends.cudnn as cudnn
import torch.nn.functional as F
import torch.optim as optim
import torch.utils.data.distributed
from torchvision import models
import horovod.torch as hvd
import timeit
import numpy as np

# Benchmark settings
parser = argparse.ArgumentParser(description='PyTorch Synthetic Benchmark',
                                 formatter_class=argparse.ArgumentDefaultsHelpFormatter)
parser.add_argument('--fp16-allreduce', action='store_true', default=False,
                    help='use fp16 compression during allreduce')

parser.add_argument('--model', type=str, default='resnet50',
                    help='model to benchmark')
parser.add_argument('--batch-size', type=int, default=32,
                    help='input batch size')

parser.add_argument('--num-warmup-batches', type=int, default=10,
                    help='number of warm-up batches that don\'t count towards benchmark')
parser.add_argument('--num-batches-per-iter', type=int, default=10,
                    help='number of batches per benchmark iteration')
parser.add_argument('--num-iters', type=int, default=10,
                    help='number of benchmark iterations')

parser.add_argument('--no-cuda', action='store_true', default=False,
                    help='disables CUDA training')

parser.add_argument('--use-adasum', action='store_true', default=False,
                    help='use adasum algorithm to do reduction')

args = parser.parse_args()
args.cuda = not args.no_cuda and torch.cuda.is_available()

hvd.init()

if args.cuda:
    # Horovod: pin GPU to local rank.
    torch.cuda.set_device(hvd.local_rank())

cudnn.benchmark = True

# Set up standard model.
model = getattr(models, args.model)()

# By default, Adasum doesn't need scaling up learning rate.
lr_scaler = hvd.size() if not args.use_adasum else 1

if args.cuda:
    # Move model to GPU.
    model.cuda()
    # If using GPU Adasum allreduce, scale learning rate by local_size.
    if args.use_adasum and hvd.nccl_built():
        lr_scaler = hvd.local_size()

optimizer = optim.SGD(model.parameters(), lr=0.01 * lr_scaler)

# Horovod: (optional) compression algorithm.
compression = hvd.Compression.fp16 if args.fp16_allreduce else hvd.Compression.none

# Horovod: wrap optimizer with DistributedOptimizer.
optimizer = hvd.DistributedOptimizer(optimizer,
                                     named_parameters=model.named_parameters(),
                                     compression=compression,
                                     op=hvd.Adasum if args.use_adasum else hvd.Average)

# Horovod: broadcast parameters & optimizer state.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

# Set up fixed fake data
data = torch.randn(args.batch_size, 3, 224, 224)
target = torch.LongTensor(args.batch_size).random_() % 1000
if args.cuda:
    data, target = data.cuda(), target.cuda()


def benchmark_step():
    optimizer.zero_grad()
    output = model(data)
    loss = F.cross_entropy(output, target)
    loss.backward()
    optimizer.step()


def log(s, nl=True):
    if hvd.rank() != 0:
        return
    print(s, end='\n' if nl else '')


log('Model: %s' % args.model)
log('Batch size: %d' % args.batch_size)
device = 'GPU' if args.cuda else 'CPU'
log('Number of %ss: %d' % (device, hvd.size()))

# Warm-up
log('Running warmup...')
timeit.timeit(benchmark_step, number=args.num_warmup_batches)

# Benchmark
log('Running benchmark...')
img_secs = []
for x in range(args.num_iters):
    time = timeit.timeit(benchmark_step, number=args.num_batches_per_iter)
    img_sec = args.batch_size * args.num_batches_per_iter / time
    log('Iter #%d: %.1f img/sec per %s' % (x, img_sec, device))
    img_secs.append(img_sec)

# Results
img_sec_mean = np.mean(img_secs)
img_sec_conf = 1.96 * np.std(img_secs)
log('Img/sec per %s: %.1f +-%.1f' % (device, img_sec_mean, img_sec_conf))
log('Total img/sec on %d %s(s): %.1f +-%.1f' %
    (hvd.size(), device, hvd.size() * img_sec_mean, hvd.size() * img_sec_conf))

run_mpi.sh is as follows:

#!/bin/bash
MY_HOME=/home/ma-user

MY_SSHD_PORT=${MY_SSHD_PORT:-"36666"}

MY_MPI_BTL_TCP_IF=${MY_MPI_BTL_TCP_IF:-"eth0,bond0"}

MY_TASK_INDEX=${MA_TASK_INDEX:-${VC_TASK_INDEX:-${VK_TASK_INDEX}}}

MY_MPI_SLOTS=${MY_MPI_SLOTS:-"${MA_NUM_GPUS}"}

MY_MPI_TUNE_FILE="${MY_HOME}/env_for_user_process"

if [ -z "${MY_MPI_SLOTS}" ]; then
    echo "[run_mpi] MY_MPI_SLOTS is empty, setting it to 1"
    MY_MPI_SLOTS="1"
fi

printf "MY_HOME: ${MY_HOME}\nMY_SSHD_PORT: ${MY_SSHD_PORT}\nMY_MPI_BTL_TCP_IF: ${MY_MPI_BTL_TCP_IF}\nMY_TASK_INDEX: ${MY_TASK_INDEX}\nMY_MPI_SLOTS: ${MY_MPI_SLOTS}\n"

env | grep -E '^MA_|SHARED_|^S3_|^PATH|^VC_WORKER_|^SCC|^CRED' | grep -v '=$' > ${MY_MPI_TUNE_FILE}
# add -x to each line
sed -i 's/^/-x /' ${MY_MPI_TUNE_FILE}

sed -i "s|{{MY_SSHD_PORT}}|${MY_SSHD_PORT}|g" ${MY_HOME}/etc/ssh/sshd_config

# start sshd service
bash -c "$(which sshd) -f ${MY_HOME}/etc/ssh/sshd_config"

# confirm the sshd is up
netstat -anp | grep LIS | grep ${MY_SSHD_PORT}

if [ $MY_TASK_INDEX -eq 0 ]; then
    # generate the hostfile of mpi
    for ((i=0; i<$MA_NUM_HOSTS; i++))
    do
        eval hostname=${MA_VJ_NAME}-${MA_TASK_NAME}-${i}.${MA_VJ_NAME}
        echo "[run_mpi] hostname: ${hostname}"

        ip=""
        while [ -z "$ip" ]; do
            ip=$(ping -c 1 ${hostname} | grep "PING" | sed -E 's/PING .* .([0-9.]+). .*/\1/g')
            sleep 1
        done
        echo "[run_mpi] resolved ip: ${ip}"

        # test the sshd is up
        while :
        do
            if (cat < /dev/null > /dev/tcp/${ip}/${MY_SSHD_PORT}) 2>/dev/null; then
                break
            fi
            sleep 1
        done

        echo "[run_mpi] the sshd of ip ${ip} is up"

        echo "${ip} slots=$MY_MPI_SLOTS" >> ${MY_HOME}/hostfile
    done

    printf "[run_mpi] hostfile:\n`cat ${MY_HOME}/hostfile`\n"
fi

RET_CODE=0

if [ $MY_TASK_INDEX -eq 0 ]; then

    echo "[run_mpi] start exec command time: "$(date +"%Y-%m-%d-%H:%M:%S")

    np=$(( ${MA_NUM_HOSTS} * ${MY_MPI_SLOTS} ))

    echo "[run_mpi] command: mpirun -np ${np} -hostfile ${MY_HOME}/hostfile -mca plm_rsh_args \"-p ${MY_SSHD_PORT}\" -tune ${MY_MPI_TUNE_FILE} ... $@"

    # execute mpirun at worker-0
    # mpirun
    mpirun \
        -np ${np} \
        -hostfile ${MY_HOME}/hostfile \
        -mca plm_rsh_args "-p ${MY_SSHD_PORT}" \
        -tune ${MY_MPI_TUNE_FILE} \
        -bind-to none -map-by slot \
        -x NCCL_DEBUG=INFO -x NCCL_SOCKET_IFNAME=${MY_MPI_BTL_TCP_IF} -x NCCL_SOCKET_FAMILY=AF_INET \
        -x HOROVOD_MPI_THREADS_DISABLE=1 \
        -x LD_LIBRARY_PATH \
        -mca pml ob1 -mca btl ^openib -mca plm_rsh_no_tree_spawn true \
        "$@"

    RET_CODE=$?

    if [ $RET_CODE -ne 0 ]; then
        echo "[run_mpi] exec command failed, exited with $RET_CODE"
    else
        echo "[run_mpi] exec command successfully, exited with $RET_CODE"
    fi

    # stop 1...N worker by killing the sleep proc
    sed -i '1d' ${MY_HOME}/hostfile
    if [ `cat ${MY_HOME}/hostfile | wc -l` -ne 0 ]; then
        echo "[run_mpi] stop 1 to (N - 1) worker by killing the sleep proc"

        sed -i "s/slots=${MY_MPI_SLOTS}/slots=1/g" ${MY_HOME}/hostfile
        printf "[run_mpi] hostfile:\n`cat ${MY_HOME}/hostfile`\n"

        mpirun \
        --hostfile ${MY_HOME}/hostfile \
        --mca btl_tcp_if_include ${MY_MPI_BTL_TCP_IF} \
        --mca plm_rsh_args "-p ${MY_SSHD_PORT}" \
        -x PATH -x LD_LIBRARY_PATH \
        pkill sleep \
        > /dev/null 2>&1
    fi

    echo "[run_mpi] exit time: "$(date +"%Y-%m-%d-%H:%M:%S")
else
    echo "[run_mpi] the training log is in worker-0"
    sleep 365d
    echo "[run_mpi] exit time: "$(date +"%Y-%m-%d-%H:%M:%S")
fi

exit $RET_CODE
