Developing Code for Training Using a Custom Image

If the preset images offered by ModelArts Standard do not meet your needs, create custom images for model training.

Customizing an image requires a deep understanding of containers. Use this method only if the subscribed algorithms and preset images cannot meet your requirements. Custom images can be used to train models in ModelArts Standard only after they are uploaded to the Software Repository for Container (SWR).

Boot Command Specifications for Custom Images

To create a training job with a custom image, build the image according to the ModelArts image specifications, select it when creating the job, and configure the boot command and, optionally, the code directory.

Figure 1 Selecting a custom image

When you use a custom image to create a training job, the boot command must be executed in the /home/ma-user directory. Otherwise, the training job may run abnormally.

Training jobs created using custom images do not run in a shell, so you cannot run the conda activate command to activate a Conda environment. If your job relies on a Conda environment, start training in another way. For example, suppose that Conda is installed in the /home/ma-user/anaconda3 directory of your custom image, the Conda environment is python-3.7.10, and the training script is stored in /home/ma-user/modelarts/user-job-dir/code/train.py. You can then start training in the specified Conda environment in one of the following ways:

Method 1: Configure the correct DEFAULT_CONDA_ENV_NAME and ANACONDA_DIR environment variables for the image.
```
ANACONDA_DIR=/home/ma-user/anaconda3
DEFAULT_CONDA_ENV_NAME=python-3.7.10
```
Run the python command to start the training script. The following shows an example:
```
python /home/ma-user/modelarts/user-job-dir/code/train.py
```
Method 2: Use the absolute path of the Python interpreter in the Conda environment.
Run /home/ma-user/anaconda3/envs/python-3.7.10/bin/python directly to start the training script. The following shows an example:
```
/home/ma-user/anaconda3/envs/python-3.7.10/bin/python /home/ma-user/modelarts/user-job-dir/code/train.py
```
Method 3: Configure the PATH environment variable.
Add the bin directory of the specified Conda environment to the PATH environment variable, and then run the python command to start the training script. The following shows an example:
```
export PATH=/home/ma-user/anaconda3/envs/python-3.7.10/bin:$PATH; python /home/ma-user/modelarts/user-job-dir/code/train.py
```
Method 4: Run the conda run -n command.
Run the /home/ma-user/anaconda3/bin/conda run -n python-3.7.10 command to execute the training. The following shows an example:
```
/home/ma-user/anaconda3/bin/conda run -n python-3.7.10 python /home/ma-user/modelarts/user-job-dir/code/train.py
```
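
Whichever method you choose, it can help to confirm at the top of the training script that the interpreter really belongs to the intended Conda environment. The following is a minimal sketch, not part of ModelArts: the helper names conda_env_prefix and runs_in_conda_env are hypothetical, and the check assumes that sys.prefix of a Conda environment's Python points to the environment directory.

```python
import os
import sys


def conda_env_prefix(anaconda_dir, env_name):
    """Return the directory of a named Conda environment."""
    return os.path.join(anaconda_dir, "envs", env_name)


def runs_in_conda_env(prefix, anaconda_dir, env_name):
    """Return True if the interpreter prefix is the given Conda environment."""
    expected = os.path.realpath(conda_env_prefix(anaconda_dir, env_name))
    return os.path.realpath(prefix) == expected


if __name__ == "__main__":
    # Values taken from the example above; adjust them to your own image.
    ok = runs_in_conda_env(sys.prefix, "/home/ma-user/anaconda3", "python-3.7.10")
    print("interpreter:", sys.executable)
    print("expected Conda environment active:", ok)
```

If the check prints False, the boot command is most likely picking up a different python than the one in the Conda environment.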

If an error indicates that a .so file in the $ANACONDA_DIR/envs/$DEFAULT_CONDA_ENV_NAME/lib directory cannot be found, add the directory to LD_LIBRARY_PATH by placing the following command before the boot command:

```
export LD_LIBRARY_PATH=$ANACONDA_DIR/envs/$DEFAULT_CONDA_ENV_NAME/lib:$LD_LIBRARY_PATH;
```

For example, the boot command in method 1 becomes:

```
export LD_LIBRARY_PATH=$ANACONDA_DIR/envs/$DEFAULT_CONDA_ENV_NAME/lib:$LD_LIBRARY_PATH; python /home/ma-user/modelarts/user-job-dir/code/train.py
```
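
If your boot command runs a Python launcher that starts the training script as a child process, the same LD_LIBRARY_PATH adjustment can be applied to the child's environment. This is only a sketch under that assumption: the launcher pattern and the names with_conda_lib_path and launch are hypothetical and are not prescribed by ModelArts.

```python
import os
import posixpath
import subprocess


def with_conda_lib_path(env, anaconda_dir, env_name):
    """Return a copy of env with the Conda env's lib dir first in LD_LIBRARY_PATH."""
    lib_dir = posixpath.join(anaconda_dir, "envs", env_name, "lib")
    new_env = dict(env)
    current = new_env.get("LD_LIBRARY_PATH", "")
    # Avoid a trailing colon when LD_LIBRARY_PATH was previously unset or empty.
    new_env["LD_LIBRARY_PATH"] = f"{lib_dir}:{current}" if current else lib_dir
    return new_env


def launch(command, anaconda_dir, env_name):
    """Run command (a list of arguments) with the adjusted library path."""
    env = with_conda_lib_path(os.environ, anaconda_dir, env_name)
    return subprocess.run(command, env=env, check=True)
```

The adjustment is applied to a child process because changing LD_LIBRARY_PATH inside an already running process generally does not affect how that process resolves shared libraries.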

Training Code Adaptation Specifications for Training Using an Ascend-powered Custom Image

When you create a training job that uses NPU resources, the system automatically generates the Ascend HCCL RANK_TABLE_FILE file in the training container. With a preset image, the file is parsed automatically during training. With a custom image, you must modify the training code so that it reads and parses Ascend HCCL RANK_TABLE_FILE.

Ascend HCCL RANK_TABLE_FILE file description

Ascend HCCL RANK_TABLE_FILE describes the cluster used by Ascend distributed training jobs. It is used for distributed communication between Ascend chips and can be parsed by the Huawei Collective Communication Library (HCCL). The file has two format versions: template 1 and template 2.

ModelArts provides the template 2 format. The Ascend HCCL RANK_TABLE_FILE file in the ModelArts training environment is named jobstart_hccl.json. You can access this file using the preset RANK_TABLE_FILE environment variable.

**Table 1** RANK_TABLE_FILE environment variables

| Environment Variable | Description |
| --- | --- |
| RANK_TABLE_FILE | Directory of Ascend HCCL RANK_TABLE_FILE, which is /user/config. Obtain the file using ${RANK_TABLE_FILE}/jobstart_hccl.json. |
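
In a Python training script, the file path described in Table 1 could be resolved roughly as follows. This is a minimal sketch: the function name rank_table_path is hypothetical, and the code relies only on the RANK_TABLE_FILE variable and the jobstart_hccl.json file name given above.

```python
import os
import posixpath


def rank_table_path(environ=None):
    """Build the path of jobstart_hccl.json from the RANK_TABLE_FILE variable."""
    environ = os.environ if environ is None else environ
    directory = environ.get("RANK_TABLE_FILE")
    if not directory:
        raise RuntimeError("RANK_TABLE_FILE is not set; is this an NPU training job?")
    # RANK_TABLE_FILE holds a directory, so append the file name.
    return posixpath.join(directory, "jobstart_hccl.json")
```

Passing a dictionary instead of relying on os.environ makes the helper easy to exercise outside the training container.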

Example of the jobstart_hccl.json file content in the ModelArts training environment (template 2):

```
{
	"group_count": "1",
	"group_list": [{
		"device_count": "1",
		"group_name": "job-trainjob",
		"instance_count": "1",
		"instance_list": [{
			"devices": [{
				"device_id": "4",
				"device_ip": "192.1.10.254"
			}],
			"pod_name": "jobxxxxxxxx-job-trainjob-0",
			"server_id": "192.168.0.25"
		}]
	}],
	"status": "completed"
}
```

When the training script starts, the status value in jobstart_hccl.json may not yet be completed. In this case, wait until the status value changes to completed, and then read the remaining content of the file.
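
The wait can be sketched as a simple polling loop. The function name wait_for_rank_table, the timeout, and the polling interval below are illustrative assumptions, not values defined by ModelArts.

```python
import json
import time


def wait_for_rank_table(path, timeout_s=600.0, poll_s=1.0):
    """Poll the rank table file until its status is "completed", then return it."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with open(path, encoding="utf-8") as f:
                table = json.load(f)
            if table.get("status") == "completed":
                return table
        except (FileNotFoundError, json.JSONDecodeError):
            # The file may not exist yet or may be partially written; retry.
            pass
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{path} did not reach status 'completed'")
        time.sleep(poll_s)
```

Treating a missing or partially written file the same as an incomplete status keeps the loop tolerant of the file being updated while it is read.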

After the status value changes to completed, have the training script convert the jobstart_hccl.json file from the template 2 format to the template 1 format.

Example of the jobstart_hccl.json file content after conversion (template 1):

```
{
	"server_count": "1",
	"server_list": [{
		"device": [{
			"device_id": "4",
			"device_ip": "192.1.10.254",
			"rank_id": "0"
		}],
		"server_id": "192.168.0.25"
	}],
	"status": "completed",
	"version": "1.0"
}
```
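
A conversion from template 2 to template 1 could look roughly like the following sketch. The function name convert_to_template1 is hypothetical, and the rank_id numbering is an assumption: the text above does not state how rank IDs are assigned, so this sketch numbers devices sequentially in the order they appear in the file.

```python
def convert_to_template1(table2):
    """Convert a template 2 rank table (dict) into the template 1 layout."""
    servers = []
    rank_id = 0
    for group in table2["group_list"]:
        for instance in group["instance_list"]:
            devices = []
            for dev in instance["devices"]:
                devices.append({
                    "device_id": dev["device_id"],
                    "device_ip": dev["device_ip"],
                    # Assumption: ranks follow the order of appearance.
                    "rank_id": str(rank_id),
                })
                rank_id += 1
            servers.append({"device": devices, "server_id": instance["server_id"]})
    return {
        "server_count": str(len(servers)),
        "server_list": servers,
        "status": "completed",
        "version": "1.0",
    }
```

Applied to the template 2 example above, this produces the template 1 example. For multi-node jobs, verify the rank order against your framework's expectations before relying on it.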

Mount Points of a Training Job in a Container

Table 2 lists the mount points in the container when you train a model with a custom image.

**Table 2** Training job mount points

| Mount Point | Read Only | Remarks |
| --- | --- | --- |
| /xxx | No | Directory where a dedicated resource pool mounts an SFS disk. You can specify this directory. |
| /home/ma-user/modelarts | No | This folder is empty. Use it as the main directory. |
| /cache | No | Mount point for the NVMe hard disk of the host (supported by bare metal specifications). |
| /dev/shm | No | Used for PyTorch engine acceleration. |
| /usr/local/nvidia | Yes | NVIDIA library of the host machine. |
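
When debugging a custom image, a training script can report which of these mount points are present and writable. The following is a minimal sketch with a hypothetical helper name, check_mount_points. The default list is taken from the table above, except for the user-specified SFS directory, which is omitted.

```python
import os

# Mount points from the table above; add your own SFS directory if you use one.
DEFAULT_MOUNT_POINTS = [
    "/home/ma-user/modelarts",
    "/cache",
    "/dev/shm",
    "/usr/local/nvidia",
]


def check_mount_points(paths=DEFAULT_MOUNT_POINTS):
    """Report, for each path, whether it is a directory and whether it is writable."""
    report = {}
    for path in paths:
        report[path] = {
            "exists": os.path.isdir(path),
            "writable": os.access(path, os.W_OK),
        }
    return report


if __name__ == "__main__":
    for path, info in check_mount_points().items():
        print(path, info)
```

A read-only mount such as /usr/local/nvidia is expected to report writable as False for a non-root user.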