Updated on 2024-07-30 GMT+08:00

Creating a Training Job

ModelArts training management enables you to create training jobs, review training statuses, and manage job versions. Model training is an iterative optimization process. Through unified training management, you can flexibly select algorithms, data, and hyperparameters to obtain the optimal input configuration and model. After comparing metrics between training versions, you can determine the most satisfactory training job.

Prerequisites

  • Data is available either by creating a dataset in ModelArts or by uploading the data used for training to an OBS directory.
  • An algorithm has been created either by using a preset image (Using a Preset Image (Custom Script)) or using a custom image (Using a Custom Image).
  • At least one empty folder has been created in OBS for storing the training output. The OBS bucket must not be encrypted. ModelArts does not support encrypted OBS buckets, so do not enable bucket encryption when creating the bucket.
  • Access authorization has been configured. For details, see Configuring Access Authorization (Global Configuration).

Creating a Training Job

  1. Log in to the ModelArts console.
  2. In the navigation pane, choose Training Management > Training Jobs. The training job list is displayed.
  3. Click Create Training Job. Then, configure parameters.
    Table 1 Basic information

    Parameter

    Description

    Name

    Name of a training job.

    The system automatically generates a name. You can rename it based on the following naming rules:

    • The name contains 1 to 64 characters.
    • Letters, digits, hyphens (-), and underscores (_) are allowed.
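The naming rules above can be checked before a name is entered. The following is a minimal sketch, not part of ModelArts; the function name is_valid_job_name is hypothetical, and the sketch assumes that "letters" means ASCII letters.

```python
import re

# Pattern derived from the naming rules above: 1 to 64 characters,
# limited to letters, digits, hyphens, and underscores.
# Assumption: "letters" means ASCII letters only.
JOB_NAME_PATTERN = re.compile(r"[A-Za-z0-9_-]{1,64}")


def is_valid_job_name(name: str) -> bool:
    """Return True if name satisfies the documented naming rules."""
    return JOB_NAME_PATTERN.fullmatch(name) is not None
```

For example, is_valid_job_name("job-2024_07") returns True, while a name containing a space or more than 64 characters is rejected.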

    Description

    Description of a training job.

    Experiment

    The options are Create new, Use existing, and Not required. If you set Experiment to Create new, enter an experiment name and description.

    Table 2 Algorithm parameters (algorithm type)

    Parameter

    Option

    Description

    Algorithm Type > Custom algorithm > Boot Mode

    Preset image

    If Boot Mode is set to Preset image, select a preset engine and configure the code directory and boot file.

    • Code Directory: Select the code directory required for this training job. Upload code to an OBS bucket beforehand. The total size of files in the directory cannot exceed 5 GB, the number of files cannot exceed 1,000, and the folder depth cannot exceed 32.
    • Boot File: Select the Python boot script in the code directory. The boot file must be a .py file because ModelArts supports only boot files written in Python.
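The code directory limits above (5 GB, 1,000 files, folder depth 32) can be checked locally before the code is uploaded to OBS. The following is a sketch under stated assumptions: the function name check_code_directory is hypothetical, 5 GB is read as 5 × 1024³ bytes, and the root folder is counted as depth 1. How ModelArts itself counts size and depth is not specified here.

```python
import os

MAX_TOTAL_BYTES = 5 * 1024 ** 3  # 5 GB, assuming binary gigabytes
MAX_FILES = 1000
MAX_DEPTH = 32


def check_code_directory(root):
    """Return a list of limit violations for the directory tree at root."""
    root = os.path.abspath(root)
    total_bytes = 0
    file_count = 0
    deepest = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        # Assumption: the root folder itself counts as depth 1.
        depth = 1 if rel == "." else rel.count(os.sep) + 2
        deepest = max(deepest, depth)
        for filename in filenames:
            file_count += 1
            # lstat does not follow symlinks, so broken links do not raise.
            total_bytes += os.lstat(os.path.join(dirpath, filename)).st_size
    problems = []
    if total_bytes > MAX_TOTAL_BYTES:
        problems.append(f"total size {total_bytes} bytes exceeds {MAX_TOTAL_BYTES}")
    if file_count > MAX_FILES:
        problems.append(f"file count {file_count} exceeds {MAX_FILES}")
    if deepest > MAX_DEPTH:
        problems.append(f"folder depth {deepest} exceeds {MAX_DEPTH}")
    return problems
```

An empty list means the directory is within all three limits as interpreted by this sketch.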

    Algorithm Type > Custom algorithm > Boot Mode

    Preset image > Customize

    If Boot Mode is set to Preset image and the engine version to Customize, configure the image, code directory, and boot file.

    • Image: Select a container image path.
      • Private images or shared images: Click Select on the right to select an SWR image. Ensure that the required image has been uploaded to SWR.
      • Public images: Enter the SWR image path in the format of Organization name/Image name:Version name. Do not include the domain name (swr.<region>.example.com) in the path because the system automatically adds the domain name to the path. For example, if the SWR address of a public image is swr.<region>.example.com/test-image/tensorflow2_1_1:1.1.1, set this parameter to test-image/tensorflow2_1_1:1.1.1.
    • Code Directory: Select the code directory required for this training job. Upload code to an OBS bucket beforehand. The total size of files in the directory cannot exceed 5 GB, the number of files cannot exceed 1,000, and the folder depth cannot exceed 32.
    • Boot File: Select the Python boot script in the code directory. The boot file must be a .py file because ModelArts supports only boot files written in Python.
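For public images, the rule above is to enter the path without the registry domain. The following sketch shows one way to derive that value from a full SWR address. The function name to_relative_image_path is hypothetical, and the sketch assumes that a first path segment containing a dot is the registry domain.

```python
def to_relative_image_path(swr_address: str) -> str:
    """Strip the registry domain from a full SWR image address.

    Returns a path of the form "organization/image:version".
    """
    first, sep, remainder = swr_address.partition("/")
    # Assumption: a first segment that contains a dot is a registry domain.
    path = remainder if sep and "." in first else swr_address

    # Check the "Organization name/Image name:Version name" shape.
    organization, slash, image = path.partition("/")
    name, colon, version = image.partition(":")
    if not (organization and slash and name and colon and version):
        raise ValueError(f"expected organization/image:version, got {path!r}")
    return path
```

With the example address above, the function returns test-image/tensorflow2_1_1:1.1.1. A path that is already in the short form is returned unchanged.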

    Algorithm Type > Custom algorithm > Boot Mode

    Custom image

    If Boot Mode is set to Custom image, set Image, Code Directory, User ID, and Boot Command. For details, see Using a Custom Image to Create an Algorithm.

    • Image: Select a container image path.
      • Private images or shared images: Click Select on the right to select an SWR image. Ensure that the required image has been uploaded to SWR.
      • Public images: Enter the SWR image path in the format of Organization name/Image name:Version name. Do not include the domain name (swr.<region>.example.com) in the path because the system automatically adds the domain name to the path. For example, if the SWR address of a public image is swr.<region>.example.com/test-image/tensorflow2_1_1:1.1.1, set this parameter to test-image/tensorflow2_1_1:1.1.1.
    • Code Directory: Select the code directory required for this training job. This parameter is optional.

      Take OBS path obs://obs-bucket/training-test/demo-code as an example. The content in the OBS path will be automatically downloaded to ${MA_JOB_DIR}/demo-code in the training container, and demo-code (customizable) is the last-level directory of the OBS path.

    • User ID: Enter the user ID for running the container. The default value 1000 is recommended. This parameter is optional.

      If the UID needs to be specified, its value must be within the specified range. The UID ranges of different resource pools are as follows:

      • Public resource pool: 1000 to 65535
      • Dedicated resource pool: 0 to 65535
    • Boot Command: Enter the image boot command. This parameter is mandatory. The boot command will be automatically executed after the code directory is downloaded.
      • If the training boot script is a .py file, train.py for example, the boot command can be python ${MA_JOB_DIR}/demo-code/train.py.
      • If the training boot script is a .sh file, main.sh for example, the boot command can be bash ${MA_JOB_DIR}/demo-code/main.sh.

      Semicolons (;) and double ampersands (&&) can be used to combine multiple boot commands. demo-code (customizable) in the boot command is the last-level directory of the OBS path.
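The relationship between the OBS code directory, the container path, and the boot command can be sketched as follows. This is only an illustration of the rule described above. The helper names container_code_dir and build_boot_command are hypothetical, and ${MA_JOB_DIR} is kept as a literal string so that the shell in the training container expands it.

```python
import posixpath


def container_code_dir(obs_code_dir: str, job_dir: str = "${MA_JOB_DIR}") -> str:
    """Map an OBS code directory to its download location in the container."""
    # The last-level directory of the OBS path is reused under the job directory.
    last_level = posixpath.basename(obs_code_dir.rstrip("/"))
    if not last_level:
        raise ValueError(f"cannot determine last-level directory of {obs_code_dir!r}")
    return f"{job_dir}/{last_level}"


def build_boot_command(obs_code_dir: str, boot_script: str) -> str:
    """Build a boot command for a .py or .sh boot script."""
    interpreters = {".py": "python", ".sh": "bash"}
    extension = posixpath.splitext(boot_script)[1]
    if extension not in interpreters:
        raise ValueError(f"unsupported boot script type: {boot_script!r}")
    return f"{interpreters[extension]} {container_code_dir(obs_code_dir)}/{boot_script}"
```

For obs://obs-bucket/training-test/demo-code and train.py, build_boot_command returns python ${MA_JOB_DIR}/demo-code/train.py. Several commands built this way can be joined with " && " to form one boot command.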

    Algorithm Type > Custom algorithm

    Local Code Directory

    You can specify the local directory of a training container. When a training job starts, the system automatically downloads the code directory to this directory.

    The default local code directory is /home/ma-user/modelarts/user-job-dir. This parameter is optional.

    Algorithm Type > Custom algorithm

    Work Directory

    Directory where the boot file in the training container is located. When a training job starts, the system automatically runs the cd command to change the work directory to the specified directory.

    Algorithm Type

    My algorithm

    Select an algorithm or create an algorithm by referring to Creating an Algorithm.

    Table 3 Algorithm parameters (input and output)

    Parameter

    Option

    Description

    Input

    Name

    The recommended value is data_url. The training input must match the input configuration set in your selected algorithm. For details, see Table 2.

    You can select a dataset or data path for data input. When the training job is started, ModelArts automatically downloads the data in the input path to the container directory for training.

    Data path

    Select the training data from your OBS bucket.

    Click Data path and select the OBS bucket and folder in the dialog box displayed.

    NOTE:

    If Data path is unavailable, the training data of the selected algorithm cannot be from an OBS path.

    Obtained from

    The following uses training input data_path as an example.

    If you select Hyperparameters, use this code to obtain the data:

    import argparse

    # Parse only --data_path and ignore any other command-line arguments.
    parser = argparse.ArgumentParser()
    parser.add_argument('--data_path')
    args, unknown = parser.parse_known_args()
    data_path = args.data_path

    If you select Environment variables, use this code to obtain the data:

    import os
    data_path = os.getenv("data_path", "")

    Output

    Name

    The algorithm code reads the local path to the training output based on this parameter.

    The recommended value is train_url. The training output must match the output configuration set in your selected algorithm. For details, see Table 3.

    You can select an OBS path for data output. During training, ModelArts automatically uploads the training output to the OBS path.

    Data path

    This data path stores the training output. During and after the training, the system automatically synchronizes files from the local directory to the data path. You can only select an OBS path as the data path.

    Select an OBS path for storing the training result. To minimize errors, select an empty directory.

    Obtained from

    The following uses the training output train_url as an example.

    If you select Hyperparameters, use this code to obtain the data:

    import argparse

    # Parse only --train_url and ignore any other command-line arguments.
    parser = argparse.ArgumentParser()
    parser.add_argument('--train_url')
    args, unknown = parser.parse_known_args()
    train_url = args.train_url

    If you select Environment variables, use this code to obtain the data:

    import os
    train_url = os.getenv("train_url", "")
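The snippets above read each path through one mechanism. A training script that should work with either setting of Obtained from can try the command-line argument first and fall back to the environment variable. The following is a sketch of that idea, not a ModelArts requirement. The function name resolve_paths is hypothetical, and the names data_path and train_url follow the examples above.

```python
import argparse
import os


def resolve_paths(argv=None, environ=None):
    """Return (data_path, train_url) from arguments or environment variables.

    argv defaults to the process command line and environ to os.environ.
    Both are parameters so that the function can be tested in isolation.
    """
    environ = os.environ if environ is None else environ
    parser = argparse.ArgumentParser()
    parser.add_argument("--data_path", default=None)
    parser.add_argument("--train_url", default=None)
    # parse_known_args ignores arguments that this script does not define.
    args, _unknown = parser.parse_known_args(argv)
    # Prefer the hyperparameter; fall back to the environment variable.
    data_path = args.data_path or environ.get("data_path", "")
    train_url = args.train_url or environ.get("train_url", "")
    return data_path, train_url
```

An empty string for either value means that neither mechanism supplied the path, which usually indicates a mismatch between the input or output name and the name used in the code.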

    Predownload

    If you set Predownload to Yes, the system automatically downloads the files in the training output data path to the local directory of the training container before the training job is started. Select Yes for resumable training and incremental training.
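With Predownload set to Yes, files from earlier runs are present in the local output directory when training starts, so the script can look for a checkpoint to resume from. The following is a minimal sketch. The function name find_latest_checkpoint and the file naming scheme checkpoint_<epoch>.pt are hypothetical; adapt them to the files your training code writes.

```python
import os
import re

# Hypothetical naming scheme for this sketch: checkpoint_<epoch>.pt
CHECKPOINT_PATTERN = re.compile(r"checkpoint_(\d+)\.pt")


def find_latest_checkpoint(output_dir):
    """Return (path, epoch) of the newest checkpoint in output_dir, or None."""
    if not os.path.isdir(output_dir):
        return None
    latest = None
    for entry in os.listdir(output_dir):
        match = CHECKPOINT_PATTERN.fullmatch(entry)
        if match is None:
            continue  # not a checkpoint file
        epoch = int(match.group(1))
        if latest is None or epoch > latest[1]:
            latest = (os.path.join(output_dir, entry), epoch)
    return latest
```

If the function returns None, the script starts training from scratch. Otherwise it can load the returned file and continue from the following epoch.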

    Hyperparameter

    N/A

    The value of this parameter varies according to the selected algorithm.

    If you have defined hyperparameters when creating an algorithm, all hyperparameters of the algorithm are displayed. Whether hyperparameters can be modified or deleted depends on how you configure the constraints when creating the algorithm. For details, see Configuring Hyperparameters.

    Environment Variable

    N/A

    Environment variables, which you can add as required. For details about the environment variables preset in the training container, see Viewing Environment Variables of a Training Container.

    Auto Restart

    N/A

    Number of retries for a failed training job. If this parameter is enabled, a failed training job will be automatically re-delivered and run. On the training job details page, you can review the number of retries for a failed training job.

    • This function is disabled by default.
    • If you enable this function, set the number of retries. The value ranges from 1 to 3 and cannot be changed.

    The training input, training output, and hyperparameters vary according to the selected algorithm.

    If the system displays a message for Input, indicating there is no input channel for the selected algorithm, you do not need to set data input on this page.

    If the system displays a message for Output, indicating there is no output channel for the selected algorithm, you do not need to set data output on this page.

    If the system displays a message for Hyperparameters, indicating the selected algorithm does not support custom hyperparameters, you do not need to set hyperparameters on this page.

  4. Select the instance specifications. The value range of the training parameters must comply with the constraints of the selected algorithm.
    Table 4 Resource parameters

    Parameter

    Description

    Resource Pool

    Select a resource pool for the job. Public and dedicated resource pools are available for you to select.

    If you select a dedicated resource pool, you can review details about the pool. If the number of available cards of this pool is insufficient, jobs may need to be queued. In this case, use another resource pool or reduce the number of cards required.

    NOTE:

    Dedicated resource pools can be connected to your VPCs and subnets. For details, see Interconnecting a VPC.

    If you want to change the VPC accessible to your dedicated resource pool, see Interconnecting a VPC.

    Resource Type

    Select CPU or GPU as needed. Set this parameter based on the resource type specified in your training code.

    Specifications

    Select a resource flavor based on the resource type. If the type of resources to be used has been specified in your training code, only the options that comply with the constraints of the selected algorithm are available for you to choose. For example, if GPU is selected in the training code but you select CPU here, the training may fail.

    During training, ModelArts mounts NVMe SSDs to the /cache directory. You can use this directory to store temporary files. The data disk size varies depending on the resource type. To prevent insufficient disk space during training, click Check Input Size to check whether the disk size of the selected instance specifications is sufficient for the input size.

    NOTICE:

    The resource flavor GPU:n*nvidia-t4 (n indicates a specific number) does not support multi-process training.
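Inside the training container, a script can also confirm that a directory such as /cache has enough free space before it copies data there. The following sketch is a local check only and is not the console's Check Input Size function. The function names directory_size and fits_in_cache and the reserve_bytes parameter are hypothetical.

```python
import os
import shutil


def directory_size(path):
    """Return the total size in bytes of all files under path."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(path):
        for filename in filenames:
            # lstat does not follow symlinks, so broken links do not raise.
            total += os.lstat(os.path.join(dirpath, filename)).st_size
    return total


def fits_in_cache(input_dir, cache_dir="/cache", reserve_bytes=0):
    """Return True if input_dir plus a reserve fits in cache_dir's free space."""
    free_bytes = shutil.disk_usage(cache_dir).free
    return directory_size(input_dir) + reserve_bytes <= free_bytes
```

The reserve_bytes argument leaves room for temporary files that the training code writes in addition to the copied input.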

    Compute Nodes

    Set the number of compute nodes. The default value is 1.

    Job Priority

    When using a dedicated resource pool for training, you can set the priority of the training job. The value ranges from 1 to 3. The default priority is 1, and the highest priority is 3. By default, the job priority can be set to 1 or 2. After the permission to set the highest job priority is configured, the priority can be set to 1 to 3.

    You can change the priority of a pending job.

    Persistent Log Saving

    If you select CPU or GPU resources, Persistent Log Saving is available for you to set.

    This function is disabled by default. ModelArts automatically stores training logs for 30 days. You can download all logs on the job details page.

    After enabling this function, you can store training logs in a specified OBS directory. You are advised to select an empty OBS directory to store the log files generated during training.

    Job Log Path

    If you select Ascend resources, select an empty OBS directory for storing training logs. Ensure that you have read and write permissions to the selected OBS directory.

    Event Notification

    You can enable this function to be notified of specific events, such as job status changes or suspected hangs, by SMS or email.

    If you enable this function, configure the following parameters:

    • Topic: Specify the topic of event notifications. You can create a topic on the SMN console.
    • Event: Select events you want to subscribe to. The options include JobStarted, JobCompleted, JobFailed, JobTerminated, and JobHanged.
    NOTE:
    • After you create a topic on the SMN console, add a subscription to the topic, and confirm the subscription. Then, you will be notified of events.
    • Currently, only training jobs using GPUs support JobHanged events.

    Auto Stop

    • After this parameter is enabled and the auto stop time is set, a training job automatically stops at the specified time.
    • If this function is disabled, a training job will continue to run.
    • The options are 1 hour, 2 hours, 4 hours, 6 hours, and Customize (1 hour to 720 hours).
  5. Click Submit to create the training job.

    A training job generally runs for a period of time. To check the real-time status and basic information of a training job, switch to the training job list.

    • In the training job list, Status of the newly created training job is Pending.
    • When the status of a training job changes to Completed, the training job is complete, and the generated model is stored in the specified training output path.
    • If the status is Failed or Abnormal, click the job name to go to the job details page and check logs for troubleshooting. For details, see Reviewing Training Job Details.