Updated on 2024-06-12 GMT+08:00

Using a Custom Image to Create a CPU- or GPU-based Training Job

Model training is an iterative optimization process. Unified training management lets you flexibly select algorithms, data, and hyperparameters to obtain the optimal input configuration and model. By comparing metrics between job versions, you can identify the most satisfactory training job.

Prerequisites

Creating a Training Job

  1. Log in to the ModelArts management console. In the left navigation pane, choose Training Management > Training Jobs.
  2. Click Create Training Job and set parameters. Table 1 lists the parameters.
    Table 1 Job parameters

    Parameter

    Description

    Created By

    Select Custom algorithms. This parameter is mandatory.

    If you have created an algorithm based on a custom image in Algorithm Management, choose the created algorithm from My algorithms.

    Boot Mode

    Select Custom images. This parameter is mandatory.

    Image Path

    URL of an SWR image. This parameter is mandatory.

    • Private images or shared images: Click Select on the right to select an SWR image. Ensure that the image has been uploaded to SWR.
    • Public images: You can also manually enter the image path in the format "<Organization to which your image belongs>/<Image name>" on SWR. Do not include the domain name (swr.<region>.xxx.com) in the path, because the system automatically prepends it. For example:
      modelarts-job-dev-image/pytorch_1_8:train-pytorch_1.8.0-cuda_10.2-py_3.7-euleros_2.10.1-x86_64-8.1.1

    Code Directory

    OBS path for storing the training code. This parameter is optional.

    Take OBS path obs://obs-bucket/training-test/demo-code as an example. The training code in this path will be automatically downloaded to ${MA_JOB_DIR}/demo-code in the training container, where demo-code is the last-level directory of the OBS path and can be customized.

    Boot Command

    Command for booting an image. This parameter is mandatory. The boot command will be automatically executed after the code directory is downloaded.

    • If the training startup script is a .py file, train.py for example, the boot command can be python ${MA_JOB_DIR}/demo-code/train.py.
    • If the training startup script is a .sh file, main.sh for example, the boot command can be bash ${MA_JOB_DIR}/demo-code/main.sh.

    In the preceding examples, demo-code is the last-level OBS directory for storing code and can be customized.

    Local Code Directory

    You can specify the local directory of a training container. When a training job starts, the system automatically downloads the code directory to this directory.

    The default local code directory is /home/ma-user/modelarts/user-job-dir. This parameter is optional.

    Work Directory

    Directory where the boot file in the training container is located. When a training job starts, the system automatically runs the cd command to change the work directory to the specified directory.

    Training Input - Parameter Name

    The recommended value is data_url, which must be the same as the parameter for parsing the input data in the training code. You can set multiple training input parameters. The name of each training input parameter must be unique, for example, car_data_url, dog_data_url, and cat_data_url.

    For example, if you use argparse in the training code to parse data_url into the data input, set the parameter name of the training input to data_url.

    import argparse
    # Create a parsing task.
    parser = argparse.ArgumentParser(description="train mnist", formatter_class=argparse.ArgumentDefaultsHelpFormatter)
    # Add parameters.
    parser.add_argument('--train_url', type=str, help='the path where the model is saved')
    parser.add_argument('--data_url', type=str, help='the training data')
    # Parse the parameters.
    args, unknown = parser.parse_known_args()

    Training Input - Data Path

    Select Dataset or Data path as the training input. If you select Data path, set an OBS path as the training input.

    When the training starts, data in the specified path will be automatically downloaded to the training container.

    Take OBS path obs://obs-bucket/training-test/data as an example. The data will be automatically downloaded to ${MA_MOUNT_PATH}/inputs/${data_url}_N of the training container, where N is the sequence number of the training input parameter, starting from 0.

    For example:

    • If there is only one training input parameter data_url, the data will be automatically downloaded to ${MA_MOUNT_PATH}/inputs/data_url_0/ of the training container.
    • If there are multiple training input parameters car_data_url, dog_data_url, and cat_data_url, the training data will be automatically downloaded to ${MA_MOUNT_PATH}/inputs/car_data_url_0/, ${MA_MOUNT_PATH}/inputs/dog_data_url_1/, and ${MA_MOUNT_PATH}/inputs/cat_data_url_2/ of the container, respectively.
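    As a sketch (assuming the default mount behavior described above and a single input parameter named data_url; the fallback path is an illustrative assumption, not a ModelArts guarantee), training code can build the container-local input path from the MA_MOUNT_PATH environment variable:

    ```python
    import os

    # MA_MOUNT_PATH points at the parent of the inputs/ and outputs/
    # directories; the fallback value here is only an assumption.
    mount_path = os.getenv("MA_MOUNT_PATH", "/home/ma-user/modelarts")

    # With a single input parameter data_url, data lands in inputs/data_url_0/.
    data_dir = os.path.join(mount_path, "inputs", "data_url_0")

    print(data_dir)
    ```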

    Training Output - Parameter Name

    The recommended value is train_url, which must be the same as the parameter for parsing the output data in the training code. You can set multiple training output parameters. The name of each training output parameter must be unique.

    Training Output - Data Path

    Select an OBS path as the training output. To minimize errors, select an empty directory.

    The training result files in the training container directory ${MA_MOUNT_PATH}/outputs/${train_url}_N/ will be automatically uploaded to obs://obs-bucket/training-test/output, where N is the sequence number of the training output parameter, starting from 0.

    For example:

    • If there is only one training output parameter train_url, the container directory of the training output is ${MA_MOUNT_PATH}/outputs/train_url_0/.
    • If there are multiple training output parameters, for example, car_train_url, dog_train_url, and cat_train_url, the container directories of the training output are ${MA_MOUNT_PATH}/outputs/car_train_url_0/, ${MA_MOUNT_PATH}/outputs/dog_train_url_1/, and ${MA_MOUNT_PATH}/outputs/cat_train_url_2/, respectively.
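    For example (a minimal sketch; the file name result.txt and the fallback path are assumptions), training code can write results into the mounted output directory, which ModelArts then uploads to the configured OBS path:

    ```python
    import os

    # Resolve the container-local output directory for a single output
    # parameter named train_url; the fallback path is an assumption.
    mount_path = os.getenv("MA_MOUNT_PATH", "/tmp/ma-demo")
    output_dir = os.path.join(mount_path, "outputs", "train_url_0")
    os.makedirs(output_dir, exist_ok=True)

    # Anything written here is uploaded to the training output OBS path.
    with open(os.path.join(output_dir, "result.txt"), "w") as f:
        f.write("placeholder for saved model artifacts")
    ```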

    Training Output - Obtained from

    The following uses the training output train_url as an example.

    Obtain the training output from hyperparameters by using the following code:

    import argparse
    parser = argparse.ArgumentParser()
    parser.add_argument('--train_url')
    args, unknown = parser.parse_known_args()
    train_url = args.train_url 

    Obtain the training output from environment variables by using the following code:

    import os
    train_url = os.getenv("train_url", "")

    Training Output - Predownload

    If you set Predownload to Yes, the system automatically downloads the files in the training output data path to the local directory of the training container when the training job is started.


    Select Yes for resumable training and incremental training.

    Hyperparameters

    Used for training tuning. This parameter is optional.

    Environment Variable

    After the container is started, the system loads the default environment variables and the environment variables customized here.

    Table 2 lists the default environment variables.

    Auto Restart

    After this function is enabled, you can set the number of restart times for a training failure. This parameter is optional.

    Table 2 Default environment variables

    Environment Variable

    Description

    MA_JOB_DIR

    Parent directory of the code directory.

    MA_MOUNT_PATH

    Parent directory of the training input and output directories.

    VC_TASK_INDEX

    Container index, starting from 0. This parameter is meaningless for single-node training. In multi-node training jobs, you can use this parameter to determine the algorithm logic of the container.

    VC_WORKER_HOSTS

    Node communication domain names. Multiple node domain names are separated by commas (,). For example:

    • Single node: ${MA_VJ_NAME}-${MA_TASK_NAME}-0.${MA_VJ_NAME}
    • Two nodes: ${MA_VJ_NAME}-${MA_TASK_NAME}-0.${MA_VJ_NAME},${MA_VJ_NAME}-${MA_TASK_NAME}-1.${MA_VJ_NAME}

    MA_NUM_HOSTS

    Number of compute nodes, which is automatically obtained from Compute Nodes.

    MA_NUM_GPUS

    Number of GPUs on a node

    ${MA_VJ_NAME}-${MA_TASK_NAME}-N.${MA_VJ_NAME}

    Communication domain name of a node. For example, the communication domain name of node 0 is ${MA_VJ_NAME}-${MA_TASK_NAME}-0.${MA_VJ_NAME}.

    N indicates the node index, starting from 0. For example, if there are four compute nodes, the domain names are as follows:

    ${MA_VJ_NAME}-${MA_TASK_NAME}-0.${MA_VJ_NAME}

    ${MA_VJ_NAME}-${MA_TASK_NAME}-1.${MA_VJ_NAME}

    ${MA_VJ_NAME}-${MA_TASK_NAME}-2.${MA_VJ_NAME}

    ${MA_VJ_NAME}-${MA_TASK_NAME}-3.${MA_VJ_NAME}
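    The variables above can be combined in multi-node training code. A minimal sketch (the fallback values are assumptions for local testing, not ModelArts guarantees):

    ```python
    import os

    # Rank of this container among the compute nodes (0-based).
    node_rank = int(os.getenv("VC_TASK_INDEX", "0"))

    # Total number of compute nodes.
    num_nodes = int(os.getenv("MA_NUM_HOSTS", "1"))

    # Comma-separated communication domain names of all nodes.
    hosts = [h for h in os.getenv("VC_WORKER_HOSTS", "").split(",") if h]

    # Node 0 typically acts as the coordinator in multi-node jobs.
    is_master = (node_rank == 0)
    print(f"node {node_rank}/{num_nodes}, master={is_master}, hosts={hosts}")
    ```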

  3. Select an instance flavor. The available options are subject to the constraints of the custom image.
    Table 3 Resource parameters

    Parameter

    Description

    Resource Pool

    Select a resource pool for the job. Public and dedicated resource pools are available for you to select.

    If you select a dedicated resource pool, you can view details about the pool. If the number of available cards of this pool is insufficient, jobs may need to be queued. In this case, use another resource pool or reduce the number of cards required.

    Resource Type

    Select CPU or GPU as needed. Set this parameter based on the resource type specified in your training code.

    Instance Flavor

    Select a resource flavor based on the resource type. If the type of resources to be used has been specified in your training code, only the options that comply with the constraints of the selected algorithm are available for you to choose. For example, if GPU is selected in the training code but you select CPU here, the training may fail.

    During training, ModelArts mounts an NVMe SSD to the /cache directory, which you can use for temporary files. The data disk size varies depending on the resource type. To avoid running out of space during training, click Check Input Size to view the disk size of the selected instance flavor.

    Compute Nodes

    Set the number of compute nodes. The default value is 1.

    Job Priority

    When using a new-version dedicated resource pool, you can set the priority of a training job. The value ranges from 1 to 3. The default priority is 1, and the highest priority is 3.

    You can change the priority of a pending job.

    SFS Turbo

    When using a dedicated resource pool, the training job can be mounted with multiple cloud storage disks (NAS).

    A disk can be mounted only once and to only one mounting path. Each mounting path must be unique. A maximum of 8 disks can be mounted to a training job.

    Persistent Log Saving

    If you select CPU or GPU flavors, Persistent Log Saving is available for you to set.

    This function is disabled by default. ModelArts automatically stores the logs for 30 days. You can download all logs on the job details page.

    After enabling this function, you can store training logs in a specified OBS directory. Set Job Log Path to an empty OBS directory for storing the log files generated during training, and ensure that you have read and write permissions on the directory.

    Job Log Path

    If you select Ascend resources, select an empty OBS path for storing training logs. Ensure that you have read and write permissions to the selected OBS directory.

    Event Notification

    Whether to subscribe to event notifications. After this function is enabled, you will be notified of specific events, such as job status changes or suspected suspensions, via SMS or email.

    If you enable this function, set the following parameters:

    • Topic: topic of event notifications. You can create a topic on the SMN console.
    • Event: type of events you want to subscribe to. Options: JobStarted, JobCompleted, JobFailed, JobTerminated, and JobHanged.
    NOTE:
    • After you create a topic on the SMN console, add a subscription to the topic, and confirm the subscription. Then, you will be notified of events.
    • Currently, only training jobs using GPUs support JobHanged events.

    Auto Stop

    • After this parameter is enabled and the auto stop time is set, a training job automatically stops at the specified time.
    • If this function is disabled, a training job will continue to run.
    • The options are 1 hour, 2 hours, 4 hours, 6 hours, and Customization (1 hour to 72 hours).

  4. Click Submit to create the training job.

    Creating a training job takes some time.

    To view the real-time status of a training job, go to the training job list and click the name of the training job. On the training job details page that is displayed, view the basic information of the training job. For details, see Viewing Training Job Details.