
Running a Single-Node Multi-PU Training Job on ModelArts Standard

Building and Debugging an Image Locally

In this section, the conda environment is packaged to set up the runtime. Alternatively, you can install the conda dependencies manually using pip install or conda install.
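If you install the dependencies manually instead of packing an existing environment, a minimal sketch might look like the following (the environment name, Python version, and package list are examples only; use whatever your training code requires):

  # run on container
  # Create a new conda environment (name and Python version are examples).
  conda create -n pytorch python=3.7 -y
  # Install the required dependencies into that environment with pip or conda.
  conda run -n pytorch pip install torch torchvision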

  • The container image should be smaller than 15 GB. For details, see Constraints on Custom Images of the Training Framework.
  • Build an image through the official open-source website, for example, PyTorch.
  • Build the image in layers. Each layer should contain no more than 1 GB of data or 100,000 files. Start with the layers that change least frequently, for example, build the OS, CUDA driver, Python, PyTorch, and other dependency packages in sequence.
  • If the training data and code change frequently, do not store them in the container image; otherwise, you will have to rebuild the image each time they change.
  • Containers already meet isolation requirements, so avoid creating extra conda virtual environments inside a container.
  1. Export the conda environment.
    1. Start the container image locally:
      # run on terminal
      docker run -ti ${your_image:tag}
    2. Obtain pytorch.tar.gz:
      # run on container
      
      # Create a conda environment named pytorch based on the target base environment.
      conda create --name pytorch --clone base
      
      pip install conda-pack
      
      # Pack pytorch env to generate pytorch.tar.gz.
      conda pack -n pytorch -o pytorch.tar.gz
    3. Copy the package from the container to a local path.
      # run on terminal
      docker cp ${your_container_id}:/xxx/xxx/pytorch.tar.gz .
    4. Upload pytorch.tar.gz to OBS and set it to public read. During image creation, the package is downloaded with wget, decompressed, and then deleted.
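      For example, assuming obsutil has been installed and configured on the local host (the bucket and folder names are placeholders), the package can be uploaded as follows:
      # run on terminal
      # Upload the packed conda environment to OBS.
      ./obsutil cp ./pytorch.tar.gz obs://${bucketname}/${folder_name}/ -f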
  2. Create an image.

    Choose either the official Ubuntu 18.04 image or the image with the CUDA driver from NVIDIA as the base image. Obtain the images on the Docker Hub official website.

    To create the image: install the required apt packages and driver, configure the ma-user user, import the conda environment, and configure the notebook dependencies.

    • Creating images with a Dockerfile is recommended. This keeps the Dockerfile traceable and archivable, and keeps the image content free of redundancy and residue.
    • To reduce the final image size, delete intermediate files such as TAR packages when building each layer. For details about how to clear the cache, see conda clean.
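      For example, a cleanup step at the end of a layer might look like the following (a minimal sketch; adjust it to the tools actually used in that layer):
      # Remove package caches and temporary archives before the layer is committed.
      conda clean -a -y && \
          rm -rf ~/.cache/pip/* && \
          rm -f /tmp/*.tar.gz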
  3. Refer to the following example.
    Dockerfile example:
    FROM nvidia/cuda:11.3.1-cudnn8-devel-ubuntu18.04
    
    USER root
    
    # section1: add user ma-user (uid 1000) and group ma-group (gid 100). If uid 1000 or gid 100 already exists but does not belong to ma-user:ma-group, the code below removes it and recreates ma-user:ma-group.
    RUN default_user=$(getent passwd 1000 | awk -F ':' '{print $1}') || echo "uid: 1000 does not exist" && \
        default_group=$(getent group 100 | awk -F ':' '{print $1}') || echo "gid: 100 does not exist" && \
        if [ ! -z ${default_group} ] && [ ${default_group} != "ma-group" ]; then \
            groupdel -f ${default_group}; \
            groupadd -g 100 ma-group; \
        fi && \
        if [ -z ${default_group} ]; then \
            groupadd -g 100 ma-group; \
        fi && \
        if [ ! -z ${default_user} ] && [ ${default_user} != "ma-user" ]; then \
            userdel -r ${default_user}; \
            useradd -d /home/ma-user -m -u 1000 -g 100 -s /bin/bash ma-user; \
            chmod -R 750 /home/ma-user; \
        fi && \
        if [ -z ${default_user} ]; then \
            useradd -d /home/ma-user -m -u 1000 -g 100 -s /bin/bash ma-user; \
            chmod -R 750 /home/ma-user; \
        fi && \
        # set bash as default
        rm /bin/sh && ln -s /bin/bash /bin/sh
    
    # section2: config apt source and install tools needed.
    RUN sed -i "s@http://.*archive.ubuntu.com@http://repo.huaweicloud.com@g" /etc/apt/sources.list && \
        sed -i "s@http://.*security.ubuntu.com@http://repo.huaweicloud.com@g" /etc/apt/sources.list && \
        apt-get update && \
        apt-get install -y ca-certificates curl ffmpeg git libgl1-mesa-glx libglib2.0-0 libibverbs-dev libjpeg-dev libpng-dev libsm6 libxext6 libxrender-dev ninja-build screen sudo vim wget zip && \
        apt-get clean  && \
        rm -rf /var/lib/apt/lists/*
    
    USER ma-user
    
    # section3: install miniconda and rebuild conda env
    RUN mkdir -p /home/ma-user/work/ && cd /home/ma-user/work/ && \
        wget https://repo.anaconda.com/miniconda/Miniconda3-py37_4.12.0-Linux-x86_64.sh && \
        chmod 777 Miniconda3-py37_4.12.0-Linux-x86_64.sh && \
        bash Miniconda3-py37_4.12.0-Linux-x86_64.sh -bfp /home/ma-user/anaconda3 && \
        wget https://${bucketname}.obs.cn-north-4.myhuaweicloud.com/${folder_name}/pytorch.tar.gz && \
        mkdir -p /home/ma-user/anaconda3/envs/pytorch && \
        tar -xzf pytorch.tar.gz -C /home/ma-user/anaconda3/envs/pytorch && \
        source /home/ma-user/anaconda3/envs/pytorch/bin/activate && conda-unpack && \
        /home/ma-user/anaconda3/bin/conda init bash && \
        rm -rf /home/ma-user/work/*
    
    ENV PATH=/home/ma-user/anaconda3/envs/pytorch/bin:$PATH
    
    # section4: settings of Jupyter Notebook for pytorch env
    RUN source /home/ma-user/anaconda3/envs/pytorch/bin/activate && \
        pip install ipykernel==6.7.0 --trusted-host repo.huaweicloud.com -i https://repo.huaweicloud.com/repository/pypi/simple && \
        ipython kernel install --user --env PATH /home/ma-user/anaconda3/envs/pytorch/bin:$PATH --name=pytorch && \
        rm -rf /home/ma-user/.local/share/jupyter/kernels/pytorch/logo-* && \
        rm -rf ~/.cache/pip/* && \
        echo 'export PATH=$PATH:/home/ma-user/.local/bin' >> /home/ma-user/.bashrc && \
        echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/nvidia/lib64' >> /home/ma-user/.bashrc && \
        echo 'conda activate pytorch' >> /home/ma-user/.bashrc
    
    ENV DEFAULT_CONDA_ENV_NAME=pytorch

    Replace https://${bucketname}.obs.cn-north-4.myhuaweicloud.com/${folder_name}/pytorch.tar.gz in the Dockerfile with the OBS path of pytorch.tar.gz obtained in step 1 (the file must be set to public read).

    Go to the Dockerfile directory and run the following commands to create an image:

    # Use the cd command to navigate to the directory that contains the Dockerfile and run the build command.
    # docker build -t ${image_name}:${image_version} .
    # Example:
    docker build -t pytorch-1.13-cuda11.3-cudnn8-ubuntu18.04:v1 .
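    After the build completes, you can confirm that the image exists and check its size, for example:

    # List the newly built image and its size.
    docker images | grep pytorch-1.13-cuda11.3-cudnn8-ubuntu18.04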
  4. Debug an image.

    It is recommended that you incorporate the changes made during debugging into the Dockerfile, rebuild the image through the standard build process, and test it again.

    1. Ensure that the corresponding script, code, and process are running properly on the Linux server.

      If the run fails, debug it in the container first, and then create the container image.

    2. Ensure that the image files are in the correct locations and that you have the required file permissions.

      Before training, verify that the custom dependency packages work, that the required packages appear in pip list, and that the container uses the expected Python interpreter. (If multiple Python versions are installed in the container image, set the Python path environment variable.)
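      For example, the following commands, run inside the container, show which interpreter is used and whether the key packages are installed:

      # run on container
      which python && python -V
      pip list | grep -iE "torch|ipykernel"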

    3. Test the training boot script.
      1. Data copy and verification

        Generally, the image does not contain training data or code, so you need to copy the required files into the container after starting it. To avoid running out of disk space, store data, code, and intermediate results in the /cache directory. The Linux server should have sufficient memory (more than 8 GB) and disk space (more than 100 GB).

        The following command copies files between the Linux host and the container:

        docker cp data/ 39c9ceedb1f6:/cache/

        Once you have prepared the data, run the training script and verify that the training starts correctly. Generally, the boot script is as follows:

        cd /cache/code/ 
        python start_train.py

        To troubleshoot the training process, you can access the logs and errors in the container instance and adjust the code and environment variables accordingly.
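        For example, you can open an interactive shell in the running container to inspect logs, adjust code, or set environment variables (the container ID is the example ID used in this section):

        docker exec -it 39c9ceedb1f6 bash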

      2. Preset script testing

        The run.sh script is typically used to copy data and code from OBS to containers, and to copy output results from containers to OBS.

        You can edit and iterate the script in the container instance if the preset script does not produce the desired result.
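        The following is a minimal sketch of such a run.sh, assuming obsutil is available in the container and the OBS paths are placeholders; the actual copy method and paths depend on your environment:

        #!/bin/bash
        set -e
        # Copy training data and code from OBS into the container.
        ./obsutil cp obs://${bucket_name}/data/ /cache/data/ -f -r
        ./obsutil cp obs://${bucket_name}/code/ /cache/code/ -f -r
        # Start training.
        cd /cache/code/
        python start_train.py
        # Copy the training output back to OBS.
        ./obsutil cp /cache/output/ obs://${bucket_name}/output/ -f -r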

      3. Dedicated pool scenario

        Mounting SFS in dedicated pools allows you to import code and data without worrying about OBS operations.

        You can either mount the SFS directory to the /mnt/sfs_turbo directory of the debugging node, or sync the directory content with the SFS disk.

        To start a container instance during debugging, use the -v parameter to mount a directory from the host machine to the container environment.

        docker run -ti -d -v /mnt/sfs_turbo:/sfs my_deeplearning_image:v1

        The command above mounts the /mnt/sfs_turbo directory of the host machine to the /sfs directory of the container. Any changes in the corresponding directories of the host machine and container are synchronized in real time.

    4. To locate faults, check the logs for the training image, and check the API response for the inference image.

      Run the following command to view all stdout logs output by the container:

      docker logs -f 39c9ceedb1f6

      For an inference image, some logs are stored inside the container, so you need to access the container to view them. Check whether the logs contain errors, both when the container starts and when the API is called.

    5. If the owner or group of some files is inconsistent, run the following command on the host machine to fix it (the command runs as the root user inside the container):
      docker exec -u root:root 39c9ceedb1f6 bash -c "chown -R ma-user:ma-user /cache"
    6. If an error is found during debugging, fix it in the container instance, and then run the docker commit command to save the changes.
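      For example, to save the current state of the debugged container as a new image (the container ID and image name are examples):
      docker commit 39c9ceedb1f6 my_deeplearning_image:v2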

Uploading an Image

Uploading an image through the client means running Docker commands on the machine where the container engine client is installed to push the image to an SWR image repository.

If your container engine client is an ECS or CCE node, you can push an image over two types of networks.

  • If your client and the image repository are in the same region, you can push an image over private networks.
  • If your client and the image repository are in different regions, you can push an image over public networks and the client needs to be bound to an EIP.
  • Each image layer uploaded through the client cannot be larger than 10 GB.
  • Your container engine client version must be 1.11.2 or later.
  1. Access SWR.
    1. Log in to the SWR console.
    2. Click Create Organization in the upper right corner and enter a custom organization name to create an organization. Replace ${organization_name} in subsequent commands with the actual organization name.
    3. In the navigation pane on the left, choose Dashboard and click Generate Login Command in the upper right corner. On the displayed page, copy the login command.
      • The validity period of the generated login command is 24 hours. To obtain a long-term valid login command, see Obtaining a Long-Term Login or Image Push/Pull Command. After you obtain a long-term valid login command, your temporary login commands will still be valid as long as they are in their validity periods.
      • The domain name at the end of the login command is the image repository address. Record the address for later use.
    4. Run the login command on the machine where the container engine is installed.

      The message Login Succeeded will be displayed upon a successful login.

  2. Run the following command on the device where the container engine is installed to label the image:

    docker tag [image_name_1:tag_1] [image_repository_address]/[organization_name]/[image_name_2:tag_2]

    • [image_name_1:tag_1]: Replace it with the actual name and tag of the image to be uploaded.
    • [image_repository_address]: You can query the address on the SWR console, that is, the domain name at the end of the login command in 1.c.
    • [organization_name]: Replace it with the name of the organization created.
    • [image_name_2:tag_2]: Replace it with the desired image name and tag.

    Example:

    docker tag ${image_name}:${image_version} swr.cn-north-4.myhuaweicloud.com/${organization_name}/${image_name}:${image_version}
  3. Upload the image to the image repository.

    docker push [image_repository_address]/[organization_name]/[image_name_2:tag_2]

    Example:

    docker push swr.cn-north-4.myhuaweicloud.com/${organization_name}/${image_name}:${image_version}

    To view the pushed image, go to the SWR console and refresh the My Images page.

Uploading Data and Algorithms to SFS

  1. Prepare data.

    1. Visit the official COCO dataset website and download the dataset from https://cocodataset.org/#download.
    2. Download Train images (18 GB), Val images (1 GB) and Train/Val annotations (241 MB) of the 2017 COCO dataset, decompress them, and save them to the coco folder.
    3. After the download is complete, upload the data to the target directory of SFS. The dataset is very large, so you are advised to use obsutil to upload it to an OBS bucket first, and then move it to SFS.
      1. Run the following commands on the local host and use obsutil to upload the local dataset to an OBS bucket.
        # Upload local data to OBS.
        # ./obsutil cp ${Path of the local folder where the dataset is stored} ${Path of the OBS folder where the dataset is stored} -f -r
        # Example:
        ./obsutil cp ./coco obs://your_bucket/ -f -r
      2. Log in to the ECS and use obsutil to migrate the dataset to SFS. The sample code is as follows:
        # Transfer OBS data to SFS.
        # ./obsutil cp ${Path of the OBS folder where the dataset is located} ${Path of the SFS folder} -f -r
        # Example:
        ./obsutil cp obs://your_bucket/coco/ /mnt/sfs_turbo/ -f -r

        The directory structure in the /mnt/sfs_turbo/coco folder is as follows:

        coco
        |---annotations
        |---train2017
        |---val2017

        For more obsutil operations, see Introduction to obsutil.

      3. Set the file owner to ma-user and the group to ma-group.
        chown -R ma-user:ma-group coco

  2. Prepare an algorithm.

    1. Download the YOLOX code. Code repository address: https://github.com/Megvii-BaseDetection/YOLOX.git.
      git clone https://github.com/Megvii-BaseDetection/YOLOX.git
      cd YOLOX
      git checkout 4f8f1d79c8b8e530495b5f183280bab99869e845
    2. Change the onnx version in the requirements.txt file to at least 1.12.0.
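      For example, the onnx entry in requirements.txt could read as follows (the version originally pinned in the repository may differ):
      onnx>=1.12.0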
    3. Change data_dir = os.path.join(get_yolox_datadir(), "COCO") in line 59 of yolox/data/datasets/coco.py to data_dir = '/home/ma-user/coco'.
      # data_dir = os.path.join(get_yolox_datadir(), "COCO")
      data_dir = '/home/ma-user/coco'
    4. Add the following code before line 13 in tools/train.py:
      # Add the two lines of code to avoid the yolox module not found error during execution.
      import sys
      sys.path.append(os.getcwd())
      
      # line13
      from yolox.core import launch
      from yolox.exp import Exp, get_exp
      
    5. Change fast_cocoeval to fast_coco_eval_api in line 122 of jit_ops.py in yolox/layers.
      # def __init__(self, name="fast_cocoeval"):
      def __init__(self, name="fast_coco_eval_api"):
    6. Change from yolox.layers import COCOeval_opt as COCOeval to from pycocotools.cocoeval import COCOeval in line 294 of coco_evaluator.py in yolox/evaluators.
      try:
         # from yolox.layers import COCOeval_opt as COCOeval
         from pycocotools.cocoeval import COCOeval
      except ImportError:
         from pycocotools.cocoeval import COCOeval
      
         logger.warning("Use standard COCOeval.")
    7. Create a run.sh script in the tools directory and use it as the boot script. The following is a reference for the run.sh script:
      #!/usr/bin/env sh
      set -x
      set -o pipefail
      
      export NCCL_DEBUG=INFO
      
      DEFAULT_ONE_GPU_BATCH_SIZE=32
      BATCH_SIZE=$((${MA_NUM_GPUS:-8} * ${VC_WORKER_NUM:-1} * ${DEFAULT_ONE_GPU_BATCH_SIZE}))
      if [ ${VC_WORKER_HOSTS} ];then
          YOLOX_DIST_URL=tcp://$(echo ${VC_WORKER_HOSTS} | cut -d "," -f 1):6666
          /home/ma-user/anaconda3/envs/pytorch/bin/python -u tools/train.py \
                                      -n yolox-s \
                                      --devices ${MA_NUM_GPUS:-8} \
                                      --batch-size ${BATCH_SIZE} \
                                      --fp16 \
                                      --occupy \
                                      --num_machines ${VC_WORKER_NUM:-1} \
                                      --machine_rank ${VC_TASK_INDEX:-0} \
                                      --dist-url ${YOLOX_DIST_URL}
      else
          /home/ma-user/anaconda3/envs/pytorch/bin/python -u tools/train.py \
                                      -n yolox-s \
                                      --devices ${MA_NUM_GPUS:-8} \
                                      --batch-size ${BATCH_SIZE} \
                                      --fp16 \
                                      --occupy \
                                      --num_machines ${VC_WORKER_NUM:-1} \
                                      --machine_rank ${VC_TASK_INDEX:-0}
      fi

      Some of these environment variables do not exist in the notebook environment, so the script provides default values for them.
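      The script handles this with shell parameter expansion, for example:

      # ${MA_NUM_GPUS:-8} expands to the value of MA_NUM_GPUS if it is set, or to 8 otherwise.
      echo "devices: ${MA_NUM_GPUS:-8}, nodes: ${VC_WORKER_NUM:-1}, rank: ${VC_TASK_INDEX:-0}"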

    8. Save the code to OBS and upload the code to the target directory of SFS through OBS.
      1. Run the following commands on the local host and use obsutil to upload the local dataset to an OBS bucket.
        # Upload local code to OBS.
        ./obsutil cp ./YOLOX obs://your_bucket/ -f -r
      2. Log in to the ECS and use obsutil to migrate the dataset to SFS. The sample code is as follows:
        # Upload code from OBS to SFS.
        ./obsutil cp obs://your_bucket/YOLOX/ /mnt/sfs_turbo/code/ -f -r

      In this example, obsutil is used to upload files. You can also use SCP to upload files. For details, see How Can I Use SCP to Transfer Files Between a Local Linux Computer and a Linux ECS?

    9. In SFS, set the file owner to ma-user and the group to ma-group.
      chown -R ma-user:ma-group YOLOX
    10. Remove \r from the shell script:
      cd YOLOX
      sed -i 's/\r//' run.sh

      Shell scripts written on Windows use \r\n as the line ending, while Linux uses \n. Linux therefore treats \r as part of the command and reports the error "$'\r': command not found" when running the script. To fix this, remove \r from the script.
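      To check whether a script still contains carriage returns, you can display its control characters, for example:

      # A ^M before the end of a line indicates a remaining carriage return.
      cat -A run.sh | head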

Debugging Code with Notebook

  • Notebook billing is as follows:
    • A running notebook instance will be billed based on used resources. The fees vary depending on your selected resources. For details, see Product Pricing Details. When a notebook instance is not used, stop it.
    • If you select EVS for storage when creating a notebook instance, the EVS disk will be continuously billed. Stop and delete the notebook instance if it is not required.
  • When a notebook instance is created, auto stop is enabled by default. The notebook instance will automatically stop at the specified time.
  • Only running notebook instances can be accessed or stopped.
  • A maximum of 10 notebook instances can be created under one account.

Follow these steps:

  1. Register the image. Log in to the ModelArts console. In the navigation pane on the left, choose Image Management. Click Register. Set SWR Source to the image pushed to SWR: paste the complete SWR address or select a private image from SWR for registration. Add GPU to Type.
  2. Log in to the ModelArts console. In the navigation pane on the left, choose Development Workspace > Notebook.
  3. Click Create Notebook. On the displayed page, configure the parameters.
    1. Configure basic information of the notebook instance, including its name, description, and auto stop status. For details, see Table 1.
      Table 1 Basic parameters

      • Name: Name of the notebook instance, which can contain 1 to 64 characters, including letters, digits, hyphens (-), and underscores (_).
      • Description: Brief description of the notebook instance.
      • Auto Stop: Automatically stops the notebook instance at the specified time. This function is enabled by default. The default value is 1 hour, indicating that the notebook instance automatically stops after running for 1 hour, at which point its resource billing also stops. The options are 1 hour, 2 hours, 4 hours, 6 hours, and Custom. Select Custom to specify any integer from 1 to 24 hours.

    2. Select an image and configure resource specifications for the instance.
      • Image: In the Custom Images tab, select the uploaded custom image.
      • Resource Type: Select a created dedicated resource pool based on site requirements.
      • Instance Specifications: Select 8-GPU specifications. This matches the default value of MA_NUM_GPUS in run.sh, which is 8 PUs.
      • Storage: Select SFS. Mounted Subdirectory is optional.

      To use VS Code to connect to a notebook instance for code debugging, enable Remote SSH and select a key pair. For details, see Connecting to a Notebook Instance Through VS Code.

  4. Click Next.
  5. Confirm the information and click Submit.

    Switch to the notebook instance list. The notebook instance is being created. It will take several minutes before its status changes to Running.

  6. In the notebook instance list, click the instance name. On the instance details page that is displayed, view the instance configuration.
  7. Open a terminal in notebook and enter the boot command to debug the code.
    # Create a dataset soft link.
    # ln -s /home/ma-user/work/${Path of the COCO dataset on SFS} /home/ma-user/coco
    # Go to the corresponding directory.
    # cd /home/ma-user/work/${YOLOX path on SFS}
    # Install the environment and run the script.
    # /home/ma-user/anaconda3/envs/pytorch/bin/pip install -r requirements.txt && /bin/sh tools/run.sh 
    
    # Example:
    ln -s /home/ma-user/work/coco /home/ma-user/coco
    cd /home/ma-user/work/code/YOLOX/
    /home/ma-user/anaconda3/envs/pytorch/bin/pip install -r requirements.txt && /bin/sh tools/run.sh

    After debugging in a notebook instance, if the image is modified, you can save the image for subsequent training. For details, see Saving a Notebook Environment Image.

Creating a Single-Node Multi-PU Training Job

  1. Log in to the ModelArts console and check whether access authorization has been configured for your account. For details, see Configuring Agency Authorization for ModelArts with One Click. If you have been authorized using access keys, clear the authorization and configure agency authorization.
  2. In the navigation pane on the left, choose Model Training > Training Jobs. The training job list is displayed by default. Click Create Training Job.
  3. On the Create Training Job page, configure parameters and click Submit.
    • Algorithm Type: Custom algorithm
    • Boot Mode: Custom image
    • Image: custom image you have uploaded
    • Boot Command:
      ln -s /home/ma-user/work/coco /home/ma-user/coco && cd /home/ma-user/work/code/YOLOX/ && /home/ma-user/anaconda3/envs/pytorch/bin/pip install -r requirements.txt && /bin/sh tools/run.sh
    • Resource Pool: In the Dedicated Resource Pool tab, select a GPU dedicated resource pool.
    • Specifications: Select 8-GPU specifications.
    • Compute Nodes: Enter 1.
    • SFS Turbo: Add the mount configuration and select the SFS name. The cloud mount path is /home/ma-user/work.

      To ensure that the code path and boot command are the same as those for notebook debugging, set the cloud mount path to /home/ma-user/work.

  4. Click Submit. On the information confirmation page, check the parameters, and click OK.
  5. Wait until the training job is created.

    Once you submit the job creation request, the system handles tasks like downloading the container image and code directory, and executing the boot command in the backend. Training jobs take varying amounts of time, from tens of minutes to several hours, depending on the service logic and chosen resources.