Developing Code for Training Using a Custom Image

Updated on 2024-12-26 GMT+08:00

If the preset images offered by ModelArts Standard do not meet your needs, create custom images for model training.

Customizing an image requires a deep understanding of containers. Use this method only if the subscribed algorithms and preset images cannot meet your requirements. Custom images can be used to train models in ModelArts Standard only after they are uploaded to the Software Repository for Container (SWR).

Boot Command Specifications for Custom Images

Create an image based on the ModelArts image specifications. Then, when creating a training job, select your own image and configure the code directory (optional) and the boot command.

Figure 1 Selecting a custom image
NOTE:

When you use a custom image to create a training job, the boot command must be executed in the /home/ma-user directory. Otherwise, the training job may run abnormally.

Training jobs created from custom images are started by conda env rather than in a shell, so you cannot run the conda activate command to activate a specific Conda environment. Start training in another way instead. For example, suppose Conda in your custom image is installed in the /home/ma-user/anaconda3 directory, the Conda environment is named python-3.7.10, and the training script is stored in /home/ma-user/modelarts/user-job-dir/code/train.py. You can then start training in the specified Conda environment in any of the following ways:

  • Method 1: Configure the correct DEFAULT_CONDA_ENV_NAME and ANACONDA_DIR environment variables for the image.
    ANACONDA_DIR=/home/ma-user/anaconda3
    DEFAULT_CONDA_ENV_NAME=python-3.7.10
    Run the python command to start the training script. The following shows an example:
    python /home/ma-user/modelarts/user-job-dir/code/train.py
  • Method 2: Use the absolute path of Conda environment Python.
    Run the /home/ma-user/anaconda3/envs/python-3.7.10/bin/python command to start the training script. The following shows an example:
    /home/ma-user/anaconda3/envs/python-3.7.10/bin/python /home/ma-user/modelarts/user-job-dir/code/train.py
  • Method 3: Configure the PATH environment variable.
    Add the bin directory of the specified Conda environment to the PATH environment variable, then run the python command to start the training script. The following shows an example:
    export PATH=/home/ma-user/anaconda3/envs/python-3.7.10/bin:$PATH; python /home/ma-user/modelarts/user-job-dir/code/train.py
  • Method 4: Run the conda run -n command.
    Run the /home/ma-user/anaconda3/bin/conda run -n python-3.7.10 command to execute the training. The following shows an example:
    /home/ma-user/anaconda3/bin/conda run -n python-3.7.10 python /home/ma-user/modelarts/user-job-dir/code/train.py
NOTE:

If there is an error indicating that the .so file is unavailable in the $ANACONDA_DIR/envs/$DEFAULT_CONDA_ENV_NAME/lib directory, add the directory to LD_LIBRARY_PATH and place the following command before the preceding boot command:

export LD_LIBRARY_PATH=$ANACONDA_DIR/envs/$DEFAULT_CONDA_ENV_NAME/lib:$LD_LIBRARY_PATH;

For example, with this addition, the boot command for method 1 becomes:

export LD_LIBRARY_PATH=$ANACONDA_DIR/envs/$DEFAULT_CONDA_ENV_NAME/lib:$LD_LIBRARY_PATH; python /home/ma-user/modelarts/user-job-dir/code/train.py

Training Code Adaptation Specifications for Training Using an Ascend-powered Custom Image

When creating a training job that uses NPU resources, the system automatically generates the Ascend HCCL RANK_TABLE_FILE file in the training container. When using a preset image, Ascend HCCL RANK_TABLE_FILE is automatically parsed during training. When using a custom image, the training code must be modified to read and parse Ascend HCCL RANK_TABLE_FILE.

Ascend HCCL RANK_TABLE_FILE file description

Ascend HCCL RANK_TABLE_FILE describes the cluster used by an Ascend distributed training job. It enables distributed communication between Ascend chips and can be parsed by the Huawei Collective Communication Library (HCCL). The file has two format versions: template 1 and template 2.

  • ModelArts provides the template 2 format. The Ascend HCCL RANK_TABLE_FILE file in the ModelArts training environment is named jobstart_hccl.json. You can access this file using the preset RANK_TABLE_FILE environment variable.
    Table 1 RANK_TABLE_FILE environment variable

    Environment Variable   Description
    RANK_TABLE_FILE        Directory of the Ascend HCCL RANK_TABLE_FILE, which is /user/config. Obtain the file using ${RANK_TABLE_FILE}/jobstart_hccl.json.

    Example of the jobstart_hccl.json file content in the ModelArts training environment (template 2):
    {
    	"group_count": "1",
    	"group_list": [{
    		"device_count": "1",
    		"group_name": "job-trainjob",
    		"instance_count": "1",
    		"instance_list": [{
    			"devices": [{
    				"device_id": "4",
    				"device_ip": "192.1.10.254"
    			}],
    			"pod_name": "jobxxxxxxxx-job-trainjob-0",
    			"server_id": "192.168.0.25"
    		}]
    	}],
    	"status": "completed"
    }

    When the training script starts, the status value in jobstart_hccl.json may not yet be completed. In that case, wait until the status value changes to completed before reading the rest of the file.

  • After the status field changes to completed, the training script can convert the jobstart_hccl.json file from the template 2 format to the template 1 format.
    Format of the jobstart_hccl.json file after format conversion (template 1):
    {
    	"server_count": "1",
    	"server_list": [{
    		"device": [{
    			"device_id": "4",
    			"device_ip": "192.1.10.254",
    			"rank_id": "0"
    		}],
    		"server_id": "192.168.0.25"
    	}],
    	"status": "completed",
    	"version": "1.0"
    }
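The two steps above, waiting for status to become completed and then converting template 2 to template 1, can be sketched in the training code as follows. This is a minimal illustration, not an official ModelArts utility; the helper names and the sequential rank_id assignment across devices are assumptions of this sketch:

```python
import json
import os
import time

def wait_for_hccl_file(timeout=600, interval=5):
    """Poll jobstart_hccl.json until its status field is "completed".
    The file lives under the directory given by the RANK_TABLE_FILE
    environment variable (/user/config in ModelArts)."""
    path = os.path.join(os.environ.get("RANK_TABLE_FILE", "/user/config"),
                        "jobstart_hccl.json")
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with open(path) as f:
                data = json.load(f)
            if data.get("status") == "completed":
                return data
        except (OSError, ValueError):
            pass  # file not written yet, or only partially written
        time.sleep(interval)
    raise TimeoutError("jobstart_hccl.json did not reach 'completed'")

def template2_to_template1(t2):
    """Convert a template 2 rank table to template 1, assigning rank_id
    sequentially across all devices (an assumption of this sketch)."""
    servers, rank = [], 0
    for group in t2["group_list"]:
        for inst in group["instance_list"]:
            devices = []
            for dev in inst["devices"]:
                devices.append({"device_id": dev["device_id"],
                                "device_ip": dev["device_ip"],
                                "rank_id": str(rank)})
                rank += 1
            servers.append({"device": devices,
                            "server_id": inst["server_id"]})
    return {"server_count": str(len(servers)),
            "server_list": servers,
            "status": t2["status"],
            "version": "1.0"}
```

Applied to the template 2 example shown earlier, template2_to_template1 yields the template 1 structure shown above, with rank_id "0" assigned to the single device.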

Mount Points of a Training Job in a Container

When training a model with a custom image, the mount points in the container are shown in Table 2.

Table 2 Training job mount points

Mount Point              Read Only   Remarks
/xxx                     No          Directory where a dedicated resource pool mounts an SFS disk. You can specify this directory.
/home/ma-user/modelarts  No          This folder is empty. Use it as the main directory.
/cache                   No          Mounts the host's NVMe disk (available with bare metal specifications).
/dev/shm                 No          Used for PyTorch engine acceleration.
/usr/local/nvidia        Yes         NVIDIA library of the host machine.
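Because /cache is backed by the host's NVMe disk on supported specifications, a common pattern is to stage input data there before training for faster local reads. The sketch below illustrates this; the source and destination paths are illustrative assumptions, not fixed ModelArts paths:

```python
import os
import shutil

def stage_to_cache(src, dst="/cache/data"):
    """Copy a dataset directory to /cache for faster local reads, falling
    back to the original path if the destination is unavailable. Both
    paths here are illustrative assumptions."""
    if os.path.isdir(os.path.dirname(dst)) and not os.path.exists(dst):
        try:
            shutil.copytree(src, dst)
            return dst
        except OSError:
            pass  # e.g. no space or no permission; use the source as-is
    return src
```

The training script would then read data from whatever path stage_to_cache returns, so the same code works whether or not /cache is available.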
