Submitting a ModelArts Training Job

Run the ma-cli ma-job submit command to submit a ModelArts training job.

YAML_FILE is an optional argument that specifies the path to the configuration file of the target job. If it is not specified, the configuration file is treated as empty. The configuration file is in YAML format, and its keys are the option names of the command without the leading dashes (for example, code-dir for --code-dir). If you specify a setting both in the YAML_FILE configuration file and as an option in the CLI, the value of the CLI option overrides the value in the configuration file.
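
The precedence between the configuration file and CLI options can be sketched as follows. This is a minimal illustration, not an official procedure: the temporary directory and the file name train.yaml are assumptions, the bucket paths are placeholders, and the submit call is guarded so that the script only prints the command when ma-cli is not installed.

```shell
#!/bin/sh
# Sketch: a YAML file supplies the job settings, and one CLI option overrides a value.
# The temporary directory and the file name train.yaml are illustrative only.
workdir=$(mktemp -d)
yaml_file="$workdir/train.yaml"

cat > "$yaml_file" <<'EOF'
code-dir: obs://your-bucket/mnist/code/
boot-file: main.py
framework-type: PyTorch
framework-version: pytorch_1.8.0-cuda_10.2-py_3.7-ubuntu_18.04-x86_64
data-url: obs://your-bucket/mnist/dataset/MNIST/
log-url: obs://your-bucket/mnist/logs/
train-instance-type: modelarts.vm.cpu.8u
train-instance-count: 1
EOF

# The file sets train-instance-count to 1; the CLI option below is expected
# to override it, following the precedence rule described above.
if command -v ma-cli >/dev/null 2>&1; then
    ma-cli ma-job submit --train-instance-count 2 "$yaml_file"
else
    # ma-cli is not available here: show the command instead of running it.
    echo "would run: ma-cli ma-job submit --train-instance-count 2 $yaml_file"
fi
```

Keeping stable settings in the file and passing only the values that change per run as options keeps each submit command short.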

$ ma-cli ma-job submit -h
Usage: ma-cli ma-job submit [OPTIONS] [YAML_FILE]...

Submit training job.

Example:

ma-cli ma-job submit --code-dir obs://your_bucket/code/
--boot-file main.py
--framework-type PyTorch
--working-dir /home/ma-user/modelarts/user-job-dir/code
--framework-version pytorch_1.8.0-cuda_10.2-py_3.7-ubuntu_18.04-x86_64
--data-url obs://your_bucket/dataset/
--log-url obs://your_bucket/logs/
--train-instance-type modelarts.vm.cpu.8u
--train-instance-count 1

Options:
--name TEXT Job name.
--description TEXT Job description.
--image-url TEXT Full swr custom image path.
--uid TEXT Uid for custom image (default: 1000).
--working-dir TEXT ModelArts training job working directory.
--local-code-dir TEXT ModelArts training job local code directory.
--user-command TEXT Execution command for custom image.
--pool-id TEXT Dedicated pool id.
--train-instance-type TEXT Train worker specification.
--train-instance-count INTEGER Number of workers.
--data-url TEXT OBS path for training data.
--log-url TEXT OBS path for training log.
--code-dir TEXT OBS path for source code.
--output TEXT Training output parameter with OBS path.
--input TEXT Training input parameter with OBS path.
--env-variables TEXT Env variables for training job.
--parameters TEXT Training job parameters (only keyword parameters are supported).
--boot-file TEXT Training job boot file path behinds `code_dir`.
--framework-type TEXT Training job framework type.
--framework-version TEXT Training job framework version.
--workspace-id TEXT The workspace where you submit training job(default "0")
--policy [regular|economic|turbo|auto]
Training job policy, default is regular.
--volumes TEXT Information about the volumes attached to the training job.
-q, --quiet Exit without waiting after submit successfully.
-C, --config-file PATH Configure file path for authorization.
-D, --debug Debug Mode. Shows full stack trace when error occurs.
-P, --profile TEXT CLI connection profile to use. The default profile is "DEFAULT".
-H, -h, --help Show this message and exit.

**Table 1** Parameters
Parameter	Type	Mandatory	Description
YAML_FILE	String	No	Configuration file of a training job. If this parameter is not specified, the configuration file is empty.
--code-dir	String	Yes	OBS path to the training source code
--data-url	String	Yes	OBS path to the training data
--log-url	String	Yes	OBS path to training logs
--train-instance-count	Integer	Yes	Number of compute nodes in a training job. The default value is 1, indicating a standalone node.
--boot-file	String	No	Boot file, specified when you use a preset command to submit a training job. This parameter can be omitted when you use a custom image or a custom command to submit a training job.
--name	String	No	Name of a training job
--description	String	No	Description of a training job
--image-url	String	No	SWR URL of a custom image, which is in the format of "organization/image_name:tag".
--uid	String	No	Runtime UID of a custom image. The default value is 1000.
--working-dir	String	No	Work directory where an algorithm is executed
--local-code-dir	String	No	Local directory in the training container to which the algorithm code directory is downloaded
--user-command	String	No	Command for executing a custom image. The directory must be under /home. When code-dir is prefixed with file://, this parameter does not take effect.
--pool-id	String	No	Resource pool ID selected for a training job. To obtain the ID, do as follows: Log in to the ModelArts management console, choose Dedicated Resource Pools in the navigation pane on the left, and view the resource pool ID in the dedicated resource pool list.
--train-instance-type	String	No	Resource flavor selected for a training job
--output	String	No	Training output. After this parameter is specified, the training job will upload the output directory of the training container corresponding to the specified output parameter in the training script to a specified OBS path. To specify multiple parameters, use --output output1=obs://bucket/output1 --output output2=obs://bucket/output2.
--input	String	No	Training input. After this parameter is specified, the training job will download the data from OBS to the training container and transfer the data storage path to the training script through the specified parameter. To specify multiple parameters, use --input data_path1=obs://bucket/data1 --input data_path2=obs://bucket/data2.
--env-variables	String	No	Environment variables input during training. To specify multiple parameters, use --env-variables ENV1=env1 --env-variables ENV2=env2.
--parameters	String	No	Training input parameters. To specify multiple parameters, use --parameters "--epoch 0 --pretrained".
--framework-type	String	No	Engine selected for a training job
--framework-version	String	No	Engine version selected for a training job
-q / --quiet	Bool	No	After a training job is submitted, the system exits directly and does not print the job status synchronously.
--workspace-id	String	No	Workspace where a training job is deployed. The default value is 0.
--policy	String	No	Training resource specification mode. The options are regular, economic, turbo, and auto.
--volumes	String	No	Mount EFS disks. To specify multiple volumes, use --volumes "local_path=/xx/yy/zz;read_only=false;nfs_server_path=xxx.xxx.xxx.xxx:/" --volumes "local_path=/xxx/yyy/zzz;read_only=false;nfs_server_path=xxx.xxx.xxx.xxx:/".
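
The repeated-option pattern described in Table 1 can be combined into one command, as in the sketch below. It uses only the placeholder values from the table (data_path1, output1, ENV1, and so on); the bucket paths are not real, and the script prints the command instead of running it when ma-cli is not installed.

```shell
#!/bin/sh
# Sketch: repeat --input, --output, and --env-variables once per value.
# All bucket paths and KEY=VALUE pairs are placeholders taken from Table 1.
set -- ma-job submit \
    --code-dir obs://your-bucket/mnist/code/ \
    --boot-file main.py \
    --framework-type PyTorch \
    --framework-version pytorch_1.8.0-cuda_10.2-py_3.7-ubuntu_18.04-x86_64 \
    --data-url obs://your-bucket/mnist/dataset/MNIST/ \
    --log-url obs://your-bucket/mnist/logs/ \
    --train-instance-type modelarts.vm.cpu.8u \
    --train-instance-count 1 \
    --input data_path1=obs://bucket/data1 \
    --input data_path2=obs://bucket/data2 \
    --output output1=obs://bucket/output1 \
    --output output2=obs://bucket/output2 \
    --env-variables ENV1=env1 \
    --env-variables ENV2=env2 \
    --parameters "--epoch 0 --pretrained"

# Keep the argument count and a printable form of the command for inspection.
arg_count=$#
cmd_line="ma-cli $*"

if command -v ma-cli >/dev/null 2>&1; then
    # "$@" preserves the quoted --parameters value as a single argument.
    ma-cli "$@"
else
    # ma-cli is not available here: show the command instead of running it.
    echo "would run: $cmd_line"
fi
```

Passing the arguments through "$@" keeps the value of --parameters together as one argument, which matters because it contains spaces.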

Submitting a Training Job Based on a Preset ModelArts Image

Submit a training job by specifying options on the command line.

ma-cli ma-job submit --code-dir obs://your-bucket/mnist/code/ \
                  --boot-file main.py \
                  --framework-type PyTorch \
                  --working-dir /home/ma-user/modelarts/user-job-dir/code \
                  --framework-version pytorch_1.8.0-cuda_10.2-py_3.7-ubuntu_18.04-x86_64 \
                  --data-url obs://your-bucket/mnist/dataset/MNIST/ \
                  --log-url obs://your-bucket/mnist/logs/ \
                  --train-instance-type modelarts.vm.cpu.8u \
                  --train-instance-count 1  \
                  -q

The following is an example of train.yaml using a preset image:

# Example .ma/train.yaml (preset image)
# pool_id: pool_xxxx
train-instance-type: modelarts.vm.cpu.8u
train-instance-count: 1
data-url: obs://your-bucket/mnist/dataset/MNIST/
code-dir: obs://your-bucket/mnist/code/
working-dir: /home/ma-user/modelarts/user-job-dir/code
framework-type: PyTorch
framework-version: pytorch_1.8.0-cuda_10.2-py_3.7-ubuntu_18.04-x86_64
boot-file: main.py
log-url: obs://your-bucket/mnist/logs/

##[Optional] Uncomment to set uid when using custom image mode
uid: 1000

##[Optional] Uncomment to upload output file/dir to OBS from training platform
output:
    - name: output_dir
      obs_path: obs://your-bucket/mnist/output1/

##[Optional] Uncomment to download input file/dir from OBS to training platform
input:
    - name: data_url
      obs_path: obs://your-bucket/mnist/dataset/MNIST/

##[Optional] Uncomment to pass hyperparameters
parameters:
    - epoch: 10
    - learning_rate: 0.01
    - pretrained:

##[Optional] Uncomment to use dedicated pool
pool_id: pool_xxxx

##[Optional] Uncomment to use volumes attached to the training job
volumes:
  - efs:
      local_path: /xx/yy/zz
      read_only: false
      nfs_server_path: xxx.xxx.xxx.xxx:/

Using a Custom Image to Create a Training Job

Submit a training job by specifying options on the command line.

ma-cli ma-job submit --image-url atelier/pytorch_1_8:pytorch_1.8.0-cuda_10.2-py_3.7-ubuntu_18.04-x86_64-20220926104358-041ba2e \
                  --code-dir obs://your-bucket/mnist/code/ \
                  --user-command "export LD_LIBRARY_PATH=/usr/local/cuda/compat:$LD_LIBRARY_PATH && cd /home/ma-user/modelarts/user-job-dir/code && /home/ma-user/anaconda3/envs/PyTorch-1.8/bin/python main.py" \
                  --data-url obs://your-bucket/mnist/dataset/MNIST/ \
                  --log-url obs://your-bucket/mnist/logs/ \
                  --train-instance-type modelarts.vm.cpu.8u \
                  --train-instance-count 1  \
                  -q
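
A note on quoting in the command above: because the --user-command value is written in double quotes, the local shell expands $LD_LIBRARY_PATH before ma-cli receives the string. The sketch below only illustrates standard shell quoting behavior. Whether the variable is meant to be expanded locally or inside the training container is not stated in this document, so treat the single-quoted form as an option to consider, not as a required change.

```shell
#!/bin/sh
# Double quotes: $LD_LIBRARY_PATH is replaced by the LOCAL value immediately.
expanded="export LD_LIBRARY_PATH=/usr/local/cuda/compat:$LD_LIBRARY_PATH"

# Single quotes: the text stays literal, so the expansion is left to
# whatever shell eventually runs the command.
literal='export LD_LIBRARY_PATH=/usr/local/cuda/compat:$LD_LIBRARY_PATH'

printf 'double-quoted: %s\n' "$expanded"
printf 'single-quoted: %s\n' "$literal"
```

Comparing the two printed lines on your own machine shows exactly which string would be passed to --user-command in each case.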

The following is an example of train.yaml using a custom image:

# Example .ma/train.yaml (custom image)
image-url: atelier/pytorch_1_8:pytorch_1.8.0-cuda_10.2-py_3.7-ubuntu_18.04-x86_64-20220926104358-041ba2e
user-command: export LD_LIBRARY_PATH=/usr/local/cuda/compat:$LD_LIBRARY_PATH && cd /home/ma-user/modelarts/user-job-dir/code && /home/ma-user/anaconda3/envs/PyTorch-1.8/bin/python main.py
train-instance-type: modelarts.vm.cpu.8u
train-instance-count: 1
data-url: obs://your-bucket/mnist/dataset/MNIST/
code-dir: obs://your-bucket/mnist/code/
log-url: obs://your-bucket/mnist/logs/

##[Optional] Uncomment to set uid when using custom image mode
uid: 1000

##[Optional] Uncomment to upload output file/dir to OBS from training platform
output:
    - name: output_dir
      obs_path: obs://your-bucket/mnist/output1/

##[Optional] Uncomment to download input file/dir from OBS to training platform
input:
    - name: data_url
      obs_path: obs://your-bucket/mnist/dataset/MNIST/

##[Optional] Uncomment to pass hyperparameters
parameters:
    - epoch: 10
    - learning_rate: 0.01
    - pretrained:

##[Optional] Uncomment to use dedicated pool
pool_id: pool_xxxx

##[Optional] Uncomment to use volumes attached to the training job
volumes:
  - efs:
      local_path: /xx/yy/zz
      read_only: false
      nfs_server_path: xxx.xxx.xxx.xxx:/

Examples

Submit a training job based on a YAML file.
```
ma-cli ma-job submit ./train-job.yaml
```

Submit a training job using preset image pytorch1.8-cuda10.2-cudnn7-ubuntu18.04 through the CLI.

ma-cli ma-job submit --code-dir obs://automation-use-only/Original/TrainJob/TrainJob-v2/pytorch1.8.0_cuda10.2/code/ \
                     --boot-file test-pytorch.py \
                     --framework-type PyTorch \
                     --working-dir /home/ma-user/modelarts/user-job-dir/code \
                     --framework-version pytorch_1.8.0-cuda_10.2-py_3.7-ubuntu_18.04-x86_64 \
                     --data-url obs://automation-use-only/Original/TrainJob/TrainJob-v2/pytorch1.8.0_cuda10.2/data/ \
                     --log-url obs://automation-use-only/Original/TrainJob/TrainJob-v2/pytorch1.8.0_cuda10.2/data/logs/ \
                     --train-instance-type modelarts.vm.cpu.8u \
                     --train-instance-count 1
