Updated on 2023-11-21 GMT+08:00

Submitting a ModelArts Training Job

Run the ma-cli ma-job submit command to submit a ModelArts training job.

Before running this command, configure YAML_FILE to specify the path to the configuration file of the target job. If this parameter is not specified, the configuration file is empty. The configuration file is in YAML format, and its parameters are the option parameter of the command. If you specify both the YAML_FILE configuration file and the option parameter in the CLI, the value of the option parameter will overwrite that in the configuration file.

$ma-cli ma-job submit -h
Usage: ma-cli ma-job submit [OPTIONS] [YAML_FILE]...

  Submit training job.

  Example:

  ma-cli ma-job submit --code-dir obs://your_bucket/code/
                       --boot-file main.py
                       --framework-type PyTorch
                       --working-dir /home/ma-user/modelarts/user-job-dir/code
                       --framework-version pytorch_1.8.0-cuda_10.2-py_3.7-ubuntu_18.04-x86_64
                       --data-url obs://your_bucket/dataset/
                       --log-url obs://your_bucket/logs/
                       --train-instance-type modelarts.vm.cpu.8u
                       --train-instance-count 1

Options:
  --name TEXT                     Job name.
  --description TEXT              Job description.
  --image-url TEXT                Full swr custom image path.
  --uid TEXT                      Uid for custom image (default: 1000).
  --working-dir TEXT              ModelArts training job working directory.
  --local-code-dir TEXT           ModelArts training job local code directory.
  --user-command TEXT             Execution command for custom image.
  --pool-id TEXT                  Dedicated pool id.
  --train-instance-type TEXT      Train worker specification.
  --train-instance-count INTEGER  Number of workers.
  --data-url TEXT                 OBS path for training data.
  --log-url TEXT                  OBS path for training log.
  --code-dir TEXT                 OBS path for source code.
  --output TEXT                   Training output parameter with OBS path.
  --input TEXT                    Training input parameter with OBS path.
  --env-variables TEXT            Env variables for training job.
  --parameters TEXT               Training job parameters (only keyword parameters are supported).
  --boot-file TEXT                Training job boot file path behinds `code_dir`.
  --framework-type TEXT           Training job framework type.
  --framework-version TEXT        Training job framework version.
  --workspace-id TEXT             The workspace where you submit training job(default "0")
  --policy [regular|economic|turbo|auto]
                                  Training job policy, default is regular.
  --volumes TEXT                  Information about the volumes attached to the training job.
  -q, --quiet                     Exit without waiting after submit successfully.
  -C, --config-file PATH          Configure file path for authorization.
  -D, --debug                     Debug Mode. Shows full stack trace when error occurs.
  -P, --profile TEXT              CLI connection profile to use. The default profile is "DEFAULT".
  -H, -h, --help                  Show this message and exit.
Table 1 Parameters

Parameter

Type

Mandatory

Description

YAML_FILE

String

No

Configuration file of a training job. If this parameter is not specified, the configuration file is empty.

--code-dir

String

Yes

OBS path to the training source code

--data-url

String

Yes

OBS path to the training data

--log-url

String

Yes

OBS path to training logs

--train-instance-count

String

Yes

Number of compute nodes in a training job. The default value is 1, indicating a standalone node.

--boot-file

String

No

Boot file specified when you use a preset command is used to submit a training job. This parameter can be omitted when you use a custom image or command to submit a training job.

--name

String

No

Name of a training job

--description

String

No

Description of a training job

--image-url

String

No

SWR URL of a custom image, which is in the format of "organization/image_name:tag".

--uid

String

No

Runtime UID of a custom image. The default value is 1000.

--working-dir

String

No

Work directory where an algorithm is executed

--local-code-dir

String

No

Local directory to the training container to which the algorithm code directory is downloaded

--user-command

String

No

Command for executing a custom image. The directory must be under /home. When code-dir is prefixed with file://, this parameter does not take effect.

--pool-id

String

No

Resource pool ID selected for a training job. To obtain the ID, do as follows: Log in to the ModelArts management console, choose Dedicated Resource Pools in the navigation pane on the left, and view the resource pool ID in the dedicated resource pool list.

--train-instance-type

String

No

Resource flavor selected for a training job

--output

String

No

Training output. After this parameter is specified, the training job will upload the output directory of the training container corresponding to the specified output parameter in the training script to a specified OBS path. To specify multiple parameters, use --output output1=obs://bucket/output1 --output output2=obs://bucket/output2.

--input

String

No

Training input. After this parameter is specified, the training job will download the data from OBS to the training container and transfer the data storage path to the training script through the specified parameter. To specify multiple parameters, use --input data_path1=obs://bucket/data1 --input data_path2=obs://bucket/data2.

--env-variables

String

No

Environment variables input during training. To specify multiple parameters, use --env-variables ENV1=env1 --env-variables ENV2=env2.

--parameters

String

No

Training input parameters. To specify multiple parameters, use --parameters "--epoch 0 --pretrained".

--framework-type

String

No

Engine selected for a training job

--framework-version

String

No

Engine version selected for a training job

-q / --quiet

Bool

No

After a training job is submitted, the system exits directly and does not print the job status synchronously.

--workspace-id

String

No

Workspace where a training job is deployed. The default value is 0.

--policy

String

No

Training resource specification mode. The options are regular, economic, turbo, and auto.

--volumes

String

No

Mount EFS disks. To specify multiple parameters, use --volumes.

"local_path=/xx/yy/zz;read_only=false;nfs_server_path=xxx.xxx.xxx.xxx:/" -volumes "local_path=/xxx/yyy/zzz;read_only=false;nfs_server_path=xxx.xxx.xxx.xxx:/"

Submitting a Training Job Based on a Preset ModelArts Image

Submit a training job by specifying the options parameter in the CLI.

ma-cli ma-job submit --code-dir obs://your-bucket/mnist/code/ \
                  --boot-file main.py \
                  --framework-type PyTorch \
                  --working-dir /home/ma-user/modelarts/user-job-dir/code \
                  --framework-version pytorch_1.8.0-cuda_10.2-py_3.7-ubuntu_18.04-x86_64 \
                  --data-url obs://your-bucket/mnist/dataset/MNIST/ \
                  --log-url obs://your-bucket/mnist/logs/ \
                  --train-instance-type modelarts.vm.cpu.8u \
                  --train-instance-count 1  \
                  -q

The following is an example of train.yaml using a preset image:

# Example .ma/train.yaml (preset image)
# pool_id: pool_xxxx
train-instance-type: modelarts.vm.cpu.8u
train-instance-count: 1
data-url: obs://your-bucket/mnist/dataset/MNIST/
code-dir: obs://your-bucket/mnist/code/
working-dir: /home/ma-user/modelarts/user-job-dir/code
framework-type: PyTorch
framework-version: pytorch_1.8.0-cuda_10.2-py_3.7-ubuntu_18.04-x86_64
boot-file: main.py
log-url: obs://your-bucket/mnist/logs/

##[Optional] Uncomment to set uid when use custom image mode
uid: 1000

##[Optional] Uncomment to upload output file/dir to OBS from training platform
output:
    - name: output_dir
      obs_path: obs://your-bucket/mnist/output1/

##[Optional] Uncomment to download input file/dir from OBS to training platform
input:
    - name: data_url
      obs_path: obs://your-bucket/mnist/dataset/MNIST/

##[Optional] Uncomment pass hyperparameters
parameters:
    - epoch: 10
    - learning_rate: 0.01
    - pretrained:

##[Optional] Uncomment to use dedicated pool
pool_id: pool_xxxx

##[Optional] Uncomment to use volumes attached to the training job
volumes:
  - efs:
      local_path: /xx/yy/zz
      read_only: false
      nfs_server_path: xxx.xxx.xxx.xxx:/

Using a Custom Image to Create a Training Job

Submit a training job by specifying the options parameter in the CLI.

ma-cli ma-job submit --image-url atelier/pytorch_1_8:pytorch_1.8.0-cuda_10.2-py_3.7-ubuntu_18.04-x86_64-20220926104358-041ba2e \
                  --code-dir obs://your-bucket/mnist/code/ \
                  --user-command "export LD_LIBRARY_PATH=/usr/local/cuda/compat:$LD_LIBRARY_PATH && cd /home/ma-user/modelarts/user-job-dir/code && /home/ma-user/anaconda3/envs/PyTorch-1.8/bin/python main.py" \
                  --data-url obs://your-bucket/mnist/dataset/MNIST/ \
                  --log-url obs://your-bucket/mnist/logs/ \
                  --train-instance-type modelarts.vm.cpu.8u \
                  --train-instance-count 1  \
                  -q

The following is an example of train.yaml using a custom image:

# Example .ma/train.yaml (custom image)
image-url: atelier/pytorch_1_8:pytorch_1.8.0-cuda_10.2-py_3.7-ubuntu_18.04-x86_64-20220926104358-041ba2e
user-command: export LD_LIBRARY_PATH=/usr/local/cuda/compat:$LD_LIBRARY_PATH && cd /home/ma-user/modelarts/user-job-dir/code && /home/ma-user/anaconda3/envs/PyTorch-1.8/bin/python main.py
train-instance-type: modelarts.vm.cpu.8u
train-instance-count: 1
data-url: obs://your-bucket/mnist/dataset/MNIST/
code-dir: obs://your-bucket/mnist/code/
log-url: obs://your-bucket/mnist/logs/

##[Optional] Uncomment to set uid when use custom image mode
uid: 1000

##[Optional] Uncomment to upload output file/dir to OBS from training platform
output:
    - name: output_dir
      obs_path: obs://your-bucket/mnist/output1/

##[Optional] Uncomment to download input file/dir from OBS to training platform
input:
    - name: data_url
      obs_path: obs://your-bucket/mnist/dataset/MNIST/

##[Optional] Uncomment pass hyperparameters
parameters:
    - epoch: 10
    - learning_rate: 0.01
    - pretrained:

##[Optional] Uncomment to use dedicated pool
pool_id: pool_xxxx

##[Optional] Uncomment to use volumes attached to the training job
volumes:
  - efs:
      local_path: /xx/yy/zz
      read_only: false
      nfs_server_path: xxx.xxx.xxx.xxx:/

Examples

  • Submit a training job based on a YAML file.
    ma-cli ma-job submit ./train-job.yaml

  • Submit a training job using preset image pytorch1.8-cuda10.2-cudnn7-ubuntu18.04 through the CLI.
    ma-cli ma-job submit --code-dir obs://automation-use-only/Original/TrainJob/TrainJob-v2/pytorch1.8.0_cuda10.2/code/ \
                         --boot-file test-pytorch.py \
                         --framework-type PyTorch \
                         --working-dir /home/ma-user/modelarts/user-job-dir/code \
                         --framework-version pytorch_1.8.0-cuda_10.2-py_3.7-ubuntu_18.04-x86_64 \
                         --data-url obs://automation-use-only/Original/TrainJob/TrainJob-v2/pytorch1.8.0_cuda10.2/data/ \
                         --log-url obs://automation-use-only/Original/TrainJob/TrainJob-v2/pytorch1.8.0_cuda10.2/data/logs/ \
                         --train-instance-type modelarts.vm.cpu.8u \
                         --train-instance-count 1 \