Specifications for Custom Images Used for Training Jobs

Updated on 2022-12-08 GMT+08:00

NOTE:

This section describes how to use a custom image to train a model with the old-version training module, which is available only to its existing users. You are advised to use the new version for model training. For details, see New-Version Training.

When creating an image using locally developed models and training scripts, ensure that they meet the specifications defined by ModelArts.

Specifications

Custom images cannot contain malicious code.
Some content in the basic images cannot be changed, including all files in /bin, /sbin, /usr, and /lib(64), some important configuration files in /etc, and the ModelArts tools in $HOME.
Do not add files that are owned by root and have the setuid or setgid permission bit set.
The size of a custom image cannot exceed 9.5 GB.

To ensure that logs are displayed properly, write them to standard output.
The default user of a custom image must be the user whose UID is 1101.
Custom images can be developed based on basic ModelArts images. For details about the supported basic images, see Overview of a Basic Image Package.
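
Two of these rules (image size and default user) can be checked locally before uploading. The following is a minimal sketch, not an official ModelArts check. The function names are hypothetical, the 9.5 GB limit is assumed to mean decimal gigabytes, and the size that Docker reports locally may differ from the size measured after upload.

```shell
#!/usr/bin/env bash
# Hypothetical pre-upload checks; not part of the ModelArts tooling.

# Assumption: 9.5 GB is interpreted as 9.5 * 10^9 bytes.
MAX_IMAGE_BYTES=9500000000

# Succeeds if the given size in bytes is within the limit.
image_size_ok() {
  [ "$1" -le "$MAX_IMAGE_BYTES" ]
}

# Succeeds if the configured user (optionally "uid:gid") has UID 1101.
# A user configured by name is reported as a mismatch, even if it maps to 1101.
default_user_ok() {
  [ "${1%%:*}" = "1101" ]
}

# Runs both checks against a local image using `docker image inspect`.
check_image() {
  local image="$1" size user
  size=$(docker image inspect --format '{{.Size}}' "$image") || return 1
  user=$(docker image inspect --format '{{.Config.User}}' "$image") || return 1
  image_size_ok "$size" || { echo "image too large: $size bytes" >&2; return 1; }
  default_user_ok "$user" || { echo "default user is '$user', expected 1101" >&2; return 1; }
}

# Example (placeholder image name):
# check_image my-train:1.0
```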

Overview of a Basic Image Package

To facilitate code download, training log output, and log file upload to OBS, ModelArts provides basic image packages for creating custom images. The basic images provided by ModelArts have the following features:

The basic images contain the necessary tools. Create your custom image based on a basic image provided by ModelArts.
ModelArts continuously updates the basic images. After a compatible update, you can still use the old basic images. After an incompatible update, custom images created from the old basic image cannot run on ModelArts, except those that have already been approved, which can still be used.
If a custom image fails to be approved and the audit log contains an error message indicating that the basic image does not match, create the custom image again from a new basic image.

Run the following command to obtain a ModelArts image:

docker pull <Address for obtaining a basic image>

After customizing an image, upload it to SWR. Make sure that you have created an organization and obtained the password for logging in to SWR. For details, see "Image Management" > "Uploading an Image Through SWR Console" in Software Repository for Container User Guide.

docker push swr.<region>.myhuaweicloud.com/<Organization to which the target image belongs>/<Image name>
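
Put together, the tag and push steps might look like the sketch below. The function names are hypothetical, the organization and image names in the example are placeholders, and the script assumes that `docker login` to SWR has already succeeded.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; names are illustrative only.

# Prints the SWR address for an image in a given region and organization.
swr_image_ref() {
  local region="$1" organization="$2" image="$3"
  printf 'swr.%s.myhuaweicloud.com/%s/%s\n' "$region" "$organization" "$image"
}

# Tags a locally built image with its SWR address and pushes it.
# Arguments: local image, region, organization, target image name.
push_custom_image() {
  local local_image="$1" target
  target=$(swr_image_ref "$2" "$3" "$4") || return 1
  docker tag "$local_image" "$target" || return 1
  docker push "$target"
}

# Example (placeholders, not real names):
# push_custom_image my-train:latest ap-southeast-1 my-org my-train:1.0
```

Keeping the address construction in its own function makes it easy to print and review the target before anything is pushed.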

Obtain basic images based on chip requirements:

CPU-based Basic Images
GPU-based Basic Images

CPU-based Basic Images

Address for obtaining a basic image

swr.<region>.myhuaweicloud.com/modelarts-job-dev-image/custom-cpu-base:1.3

**Table 1** Optional parameters
Parameter	Possible Value	Description
<region>	ap-southeast-1	Region where the image resides. ap-southeast-1 corresponds to CN-Hong Kong.

Table 2 and Table 3 list the components and tools used by basic images.

**Table 2** Components
Component	Description
run_train.sh	Training boot script. It downloads the code directory, runs the training commands, redirects training log output, and uploads log files to OBS after the training commands are executed.
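
The contents of run_train.sh are not reproduced here. As a rough illustration of the "run training commands and redirect training log output" step only, a wrapper of the following shape keeps the output on standard output while also saving a copy to a file. The function name and the log path in the example are hypothetical.

```shell
#!/usr/bin/env bash
# Illustrative only; this is not the content of run_train.sh.

# Runs a command, merges stderr into stdout, and copies the stream to a log file.
# Returns the exit status of the command rather than that of tee.
run_and_log() {
  local log_file="$1" status_file status
  shift
  status_file=$(mktemp)
  # Record the command's exit status in a temporary file, because the
  # status of a pipeline is normally that of its last command (tee).
  { status=0; "$@" 2>&1 || status=$?; echo "$status" > "$status_file"; } | tee "$log_file"
  status=$(cat "$status_file")
  rm -f "$status_file"
  return "$status"
}

# Example (hypothetical script and path):
# run_and_log /tmp/train.log python train.py
```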

**Table 3** Tool list
Tool	Description
utils.sh	Tool script. The run_train.sh script depends on this script. It provides methods such as SK decryption, code directory download, and log file upload.
ip_mapper.py	Script for obtaining NIC addresses. By default, the IP address of the ib0 NIC is obtained. Training code can use the IP address of the ib0 NIC to accelerate network communications.
dls-downloader.py	OBS download script. The utils.sh script depends on this script.

GPU-based Basic Images

Image for CUDA 10.0, 10.1, or 10.2, based on Ubuntu 18.04, with MoXing pre-installed
```
swr.<region>.myhuaweicloud.com/modelarts-job-dev-image/custom-base-<cuda version>-<python version>-<os>-<arch>:<image tag>
```

Image for CUDA 8, 9, or 9.2, with MoXing pre-installed

swr.<region>.myhuaweicloud.com/modelarts-job-dev-image/custom-gpu-<cuda version>-inner-moxing-<python version>:<image tag>

Image for CUDA 8, 9, or 9.2

swr.<region>.myhuaweicloud.com/modelarts-job-dev-image/custom-gpu-<cuda version>-base:<image tag>

**Table 4** Optional parameters
Parameter	Possible Value	Description
<region>	ap-southeast-1	Region where the image resides. ap-southeast-1 corresponds to CN-Hong Kong.
<cuda version>	cuda92, cuda9, cuda8, cuda10.0, cuda10.1, cuda10.2	CUDA version installed in the image. NOTE: Confirm the CUDA version you need before choosing an image. The CUDA version in the image cannot be changed afterward; changing it causes training to fail.
<image tag>	1.1, 1.3	Image version. Version 1.3 is available for CUDA 8, 9, and 9.2. Version 1.1 is available for CUDA 10.0, 10.1, and 10.2.
<python version>	cp27, cp36	Python environment
<os>	ubuntu18.04	Operating system
<arch>	x86	Architecture
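
The address patterns and the tag rule in Table 4 can be combined mechanically. The sketch below is one way to do so. The function names and the "moxing"/"plain" argument are hypothetical, and it is not confirmed here that every combination of CUDA and Python version exists in the repository.

```shell
#!/usr/bin/env bash
# Hypothetical helpers built from the address patterns above and Table 4.

# Prints the image tag that matches a <cuda version> value, or fails.
gpu_image_tag() {
  case "$1" in
    cuda8|cuda9|cuda92) echo "1.3" ;;
    cuda10.0|cuda10.1|cuda10.2) echo "1.1" ;;
    *) echo "unsupported CUDA version: $1" >&2; return 1 ;;
  esac
}

# Prints the basic image address.
# Arguments: region, cuda version, python version, and (CUDA 8, 9, 9.2 only)
# "moxing" for the image with MoXing or "plain" for the image without it.
gpu_base_image() {
  local region="$1" cuda="$2" python="$3" variant="${4:-moxing}" tag name
  tag=$(gpu_image_tag "$cuda") || return 1
  case "$cuda" in
    cuda10.*)
      name="custom-base-${cuda}-${python}-ubuntu18.04-x86"
      ;;
    *)
      if [ "$variant" = "moxing" ]; then
        name="custom-gpu-${cuda}-inner-moxing-${python}"
      else
        name="custom-gpu-${cuda}-base"
      fi
      ;;
  esac
  echo "swr.${region}.myhuaweicloud.com/modelarts-job-dev-image/${name}:${tag}"
}

# Example:
# docker pull "$(gpu_base_image ap-southeast-1 cuda10.1 cp36)"
```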

Table 5 and Table 6 list the components and tools used by basic images.

**Table 5** Components
Component	Description
run_train.sh	Training boot script. It downloads the code directory, runs the training commands, redirects training log output, and uploads log files to OBS after the training commands are executed.

**Table 6** Tool list
Tool	Description
utils.sh	Tool script. The run_train.sh script depends on this script. It provides methods such as SK decryption, code directory download, and log file upload.
ip_mapper.py	Script for obtaining NIC addresses. By default, the IP address of the ib0 NIC is obtained. Training code can use the IP address of the ib0 NIC to accelerate network communications.
dls-downloader.py	OBS download script. The utils.sh script depends on this script.