Updated on 2024-06-12 GMT+08:00

Using a Custom Image to Create a CPU- or GPU-based Training Job

Model training is an iterative optimization process. Unified training management lets you flexibly select algorithms, data, and hyperparameters to obtain the optimal input configuration and model. By comparing metrics between job versions, you can identify the most satisfactory training job.

Prerequisites

Creating a Training Job

  1. Log in to the ModelArts management console. In the left navigation pane, choose Training Management > Training Jobs.
  2. Click Create Training Job and set parameters. Table 1 lists the parameters.
    Table 1 Job parameters

    Parameter

    Description

    Created By

    Select Custom algorithms. This parameter is mandatory.

    If you have created an algorithm based on a custom image in Algorithm Management, choose the created algorithm from My algorithms.

    Boot Mode

    Select Custom images. This parameter is mandatory.

    Image Path

    URL of an SWR image. This parameter is mandatory.

    • Private images or shared images: Click Select on the right to select an SWR image. Ensure that the image has been uploaded to SWR.
    • Public images: You can also manually enter the image path in the format "<Organization to which your image belongs>/<Image name>" on SWR. Do not include the domain name (swr.<region>.xxx.com) in the path, because the system automatically prepends it. For example:
      modelarts-job-dev-image/pytorch_1_8:train-pytorch_1.8.0-cuda_10.2-py_3.7-euleros_2.10.1-x86_64-8.1.1

    Code Directory

    OBS path for storing the training code. This parameter is optional.

    Take OBS path obs://obs-bucket/training-test/demo-code as an example. The training code in this path will be automatically downloaded to ${MA_JOB_DIR}/demo-code in the training container, where demo-code is the last-level directory of the OBS path and can be customized.

    Boot Command

    Command for booting an image. This parameter is mandatory. The boot command will be automatically executed after the code directory is downloaded.

    • If the training startup script is a .py file, train.py for example, the boot command can be python ${MA_JOB_DIR}/demo-code/train.py.
    • If the training startup script is a .sh file, main.sh for example, the boot command can be bash ${MA_JOB_DIR}/demo-code/main.sh.

    In the preceding examples, demo-code is the last-level OBS directory for storing code and can be customized.

    Local Code Directory

    You can specify the local directory of a training container. When a training job starts, the system automatically downloads the code directory to this directory.

    The default local code directory is /home/ma-user/modelarts/user-job-dir. This parameter is optional.

    Work Directory

    Directory where the boot file in the training container is located. When a training job starts, the system automatically runs the cd command to change the work directory to the specified directory.

    Training Input - Parameter Name

    The recommended value is data_url, which must be the same as the parameter for parsing the input data in the training code. You can set multiple training input parameters. The name of each training input parameter must be unique, for example, car_data_url, dog_data_url, and cat_data_url.

    For example, if you use argparse in the training code to parse data_url into the data input, set the parameter name of the training input to data_url.

    import argparse
    # Create a parsing task.
    parser = argparse.ArgumentParser(description="train mnist", formatter_class=argparse.ArgumentDefaultsHelpFormatter)
    # Add parameters.
    parser.add_argument('--train_url', type=str, help='the path where the model is saved')
    parser.add_argument('--data_url', type=str, help='the training data')
    # Parse the parameters.
    args, unknown = parser.parse_known_args()

    Training Input - Data Path

    Select Dataset or Data path as the training input. If you select Data path, set an OBS path as the training input.

    When the training starts, data in the specified path will be automatically downloaded to the training container.

    Take OBS path obs://obs-bucket/training-test/data as an example. The data will be automatically downloaded to ${MA_MOUNT_PATH}/inputs/${data_url}_N of the training container, where N is the sequence number of the training input parameter, starting from 0.

    For example:

    • If there is only one training input parameter data_url, the data will be automatically downloaded to ${MA_MOUNT_PATH}/inputs/data_url_0/ of the training container.
    • If there are multiple training input parameters car_data_url, dog_data_url, and cat_data_url, the training data will be automatically downloaded to ${MA_MOUNT_PATH}/inputs/car_data_url_0/, ${MA_MOUNT_PATH}/inputs/dog_data_url_1/, and ${MA_MOUNT_PATH}/inputs/cat_data_url_2/ of the container, respectively.
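    As a sketch (assuming the default mount behavior described above and a single input parameter named data_url; the fallback path is an illustrative assumption, not a ModelArts guarantee), training code can build the container-local input path from the MA_MOUNT_PATH environment variable:

    ```python
    import os

    # MA_MOUNT_PATH points at the parent of the inputs/ and outputs/
    # directories; the fallback value here is only an assumption.
    mount_path = os.getenv("MA_MOUNT_PATH", "/home/ma-user/modelarts")

    # With a single input parameter data_url, data lands in inputs/data_url_0/.
    data_dir = os.path.join(mount_path, "inputs", "data_url_0")

    print(data_dir)
    ```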

    Training Output - Parameter Name

    The recommended value is train_url, which must be the same as the parameter for parsing the output data in the training code. You can set multiple training output parameters. The name of each training output parameter must be unique.

    Training Output - Data Path

    Select an OBS path as the training output. To minimize errors, select an empty directory.

    The training result files in the training container directory ${MA_MOUNT_PATH}/outputs/${train_url}_N/ will be automatically uploaded to obs://obs-bucket/training-test/output, where N is the sequence number of the training output parameter, starting from 0.

    For example:

    • If there is only one training output parameter train_url, the container directory of the training output is ${MA_MOUNT_PATH}/outputs/train_url_0/.
    • If there are multiple training output parameters, for example, car_train_url, dog_train_url, and cat_train_url, the container directories of the training output are ${MA_MOUNT_PATH}/outputs/car_train_url_0/, ${MA_MOUNT_PATH}/outputs/dog_train_url_1/, and ${MA_MOUNT_PATH}/outputs/cat_train_url_2/, respectively.
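    For example (a minimal sketch; the file name result.txt and the fallback path are assumptions), training code can write results into the mounted output directory, which ModelArts then uploads to the configured OBS path:

    ```python
    import os

    # Resolve the container-local output directory for a single output
    # parameter named train_url; the fallback path is an assumption.
    mount_path = os.getenv("MA_MOUNT_PATH", "/tmp/ma-demo")
    output_dir = os.path.join(mount_path, "outputs", "train_url_0")
    os.makedirs(output_dir, exist_ok=True)

    # Anything written here is uploaded to the training output OBS path.
    with open(os.path.join(output_dir, "result.txt"), "w") as f:
        f.write("placeholder for saved model artifacts")
    ```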

    Training Output - Obtained from

    The following uses the training output train_url as an example.

    Obtain the training output from hyperparameters by using the following code:

    import argparse
    parser = argparse.ArgumentParser()
    parser.add_argument('--train_url')
    args, unknown = parser.parse_known_args()
    train_url = args.train_url 

    Obtain the training output from environment variables by using the following code:

    import os
    train_url = os.getenv("train_url", "")

    Training Output - Predownload

    If you set Predownload to Yes, the system automatically downloads the files in the training output data path to the local directory of the training container when the training job is started.


    Select Yes for resumable training and incremental training.

    Hyperparameters

    Used for training tuning. This parameter is optional.

    Environment Variable

    After the container is started, the system loads the default environment variables and the environment variables customized here.

    Table 2 lists the default environment variables.

    Auto Restart

    After this function is enabled, you can set the number of restart times for a training failure. This parameter is optional.

    Table 2 Default environment variables

    Environment Variable

    Description

    MA_JOB_DIR

    Parent directory of the code directory.

    MA_MOUNT_PATH

    Parent directory of the training input and output directories.

    VC_TASK_INDEX

    Container index, starting from 0. This parameter is meaningless for single-node training. In multi-node training jobs, you can use this parameter to determine the algorithm logic of the container.

    VC_WORKER_HOSTS

    Node communication domain names. Multiple node domain names are separated by commas (,). For example:

    • Single node: ${MA_VJ_NAME}-${MA_TASK_NAME}-0.${MA_VJ_NAME}
    • Two nodes: ${MA_VJ_NAME}-${MA_TASK_NAME}-0.${MA_VJ_NAME},${MA_VJ_NAME}-${MA_TASK_NAME}-1.${MA_VJ_NAME}

    MA_NUM_HOSTS

    Number of compute nodes, which is automatically obtained from Compute Nodes.

    MA_NUM_GPUS

    Number of GPUs on a node

    ${MA_VJ_NAME}-${MA_TASK_NAME}-N.${MA_VJ_NAME}

    Communication domain name of a node. For example, the communication domain name of node 0 is ${MA_VJ_NAME}-${MA_TASK_NAME}-0.${MA_VJ_NAME}.

    N indicates the node index, starting from 0. For example, if there are four compute nodes, the domain names are as follows:

    ${MA_VJ_NAME}-${MA_TASK_NAME}-0.${MA_VJ_NAME}

    ${MA_VJ_NAME}-${MA_TASK_NAME}-1.${MA_VJ_NAME}

    ${MA_VJ_NAME}-${MA_TASK_NAME}-2.${MA_VJ_NAME}

    ${MA_VJ_NAME}-${MA_TASK_NAME}-3.${MA_VJ_NAME}
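    The variables above can be combined in multi-node training code. A minimal sketch (the fallback values are assumptions for local testing, not ModelArts guarantees):

    ```python
    import os

    # Rank of this container among the compute nodes (0-based).
    node_rank = int(os.getenv("VC_TASK_INDEX", "0"))

    # Total number of compute nodes.
    num_nodes = int(os.getenv("MA_NUM_HOSTS", "1"))

    # Comma-separated communication domain names of all nodes.
    hosts = [h for h in os.getenv("VC_WORKER_HOSTS", "").split(",") if h]

    # Node 0 typically acts as the coordinator in multi-node jobs.
    is_master = (node_rank == 0)
    print(f"node {node_rank}/{num_nodes}, master={is_master}, hosts={hosts}")
    ```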

  3. Select an instance flavor. The available options are subject to the constraints of the custom image.
    Table 3 Resource parameters

    Parameter

    Description

    Resource Pool

    Select a resource pool for the job. Public and dedicated resource pools are available for you to select.

    If you select a dedicated resource pool, you can view details about the pool. If the number of available cards of this pool is insufficient, jobs may need to be queued. In this case, use another resource pool or reduce the number of cards required.

    Resource Type

    Select CPU or GPU as needed. Set this parameter based on the resource type specified in your training code.

    Instance Flavor

    Select a resource flavor based on the resource type. If the type of resources to be used has been specified in your training code, only the options that comply with the constraints of the selected algorithm are available for you to choose. For example, if GPU is selected in the training code but you select CPU here, the training may fail.

    During training, ModelArts mounts an NVMe SSD to the /cache directory, which you can use for temporary files. The data disk size varies depending on the resource type. To avoid running out of space during training, click Check Input Size to view the disk size of the selected instance flavor.

    Compute Nodes

    Set the number of compute nodes. The default value is 1.

    Job Priority

    When using a new-version dedicated resource pool, you can set the priority of a training job. The value ranges from 1 to 3. The default priority is 1, and the highest priority is 3.

    You can change the priority of a pending job.

    SFS Turbo

    When using a dedicated resource pool, the training job can be mounted with multiple cloud storage disks (NAS).

    A disk can be mounted only once and to only one mounting path. Each mounting path must be unique. A maximum of 8 disks can be mounted to a training job.

    Persistent Log Saving

    If you select CPU or GPU flavors, Persistent Log Saving is available for you to set.

    This function is disabled by default. ModelArts automatically stores the logs for 30 days. You can download all logs on the job details page.

    After enabling this function, you can store training logs in a specified OBS directory. Set Job Log Path to an empty OBS directory for storing the log files generated during training, and ensure that you have read and write permissions on the directory.

    Job Log Path

    If you select Ascend resources, select an empty OBS path for storing training logs. Ensure that you have read and write permissions to the selected OBS directory.

    Event Notification

    Whether to subscribe to event notifications. After this function is enabled, you will be notified of specific events, such as job status changes or suspected suspensions, via SMS or email.

    If you enable this function, set the following parameters:

    • Topic: topic of event notifications. You can create a topic on the SMN console.
    • Event: type of events you want to subscribe to. Options: JobStarted, JobCompleted, JobFailed, JobTerminated, and JobHanged.
    NOTE:
    • After you create a topic on the SMN console, add a subscription to the topic, and confirm the subscription. Then, you will be notified of events.
    • Currently, only training jobs using GPUs support JobHanged events.

    Auto Stop

    • After this parameter is enabled and the auto stop time is set, a training job automatically stops at the specified time.
    • If this function is disabled, a training job will continue to run.
    • The options are 1 hour, 2 hours, 4 hours, 6 hours, and Customization (1 hour to 72 hours).

  4. Click Submit to create the training job.

    Creating a training job takes some time.

    To view the real-time status of a training job, go to the training job list and click the name of the training job. On the training job details page that is displayed, view the basic information of the training job. For details, see Viewing Training Job Details.