Updated on 2024-10-29 GMT+08:00

ma-cli ma-job Commands for Training Jobs

Use the ma-cli ma-job command to submit training jobs; query training job details, logs, events, AI engines, resource specifications, and resource pools; and stop or delete training jobs.

$ ma-cli ma-job -h
Usage: ma-cli ma-job [OPTIONS] COMMAND [ARGS]...

  ModelArts job submission and query job details.

Options:
  -h, -H, --help  Show this message and exit.

Commands:
  delete      Delete training job by job id.
  get-engine  Get job engines.
  get-event   Get job running event.
  get-flavor  Get job flavors.
  get-job     Get job details.
  get-log     Get job log details.
  get-pool    Get job engines.
  stop        Stop training job by job id.
  submit      Submit training job.

Table 1 Commands supported by training jobs

| Command | Description |
| --- | --- |
| get-job | Obtain ModelArts training jobs and their details. |
| get-log | Obtain runtime logs of a ModelArts training job. |
| get-engine | Obtain ModelArts AI engines for training. |
| get-event | Obtain ModelArts training job events. |
| get-flavor | Obtain ModelArts resource specifications for training. |
| get-pool | Obtain ModelArts resource pools dedicated for training. |
| stop | Stop a ModelArts training job. |
| submit | Submit a ModelArts training job. |
| delete | Delete a training job with a specified job ID. |

Using ma-cli ma-job get-job to Obtain a ModelArts Training Job

Run the ma-cli ma-job get-job command to obtain training jobs or details about a specific job.

$ ma-cli ma-job get-job -h
Usage: ma-cli ma-job get-job [OPTIONS]

  Get job details.

  Example:

  # Get train job details by job name
  ma-cli ma-job get-job -n ${job_name}

  # Get train job details by job id
  ma-cli ma-job get-job -i ${job_id}

  # Get train job list
  ma-cli ma-job get-job --page-size 5 --page-num 1

Options:
  
  -i, --job-id TEXT               Get training job details by job id.
  -n, --job-name TEXT             Get training job details by job name.
  -pn, --page-num INTEGER         Specify which page to query.  [x>=1]
  -ps, --page-size INTEGER RANGE  The maximum number of results for this query.  [1<=x<=50]
  -v, --verbose                   Show detailed information about training job details.
  -C, --config-file TEXT          Configure file path for authorization.
  -D, --debug                     Debug Mode. Shows full stack trace when error occurs.
  -P, --profile TEXT              CLI connection profile to use. The default profile is "DEFAULT".
  -h, -H, --help                  Show this message and exit.

Table 2 Parameters

| Parameter | Data Type | Mandatory | Description |
| --- | --- | --- | --- |
| -i / --job-id | String | No | ID of the job whose details are to be obtained. |
| -n / --job-name | String | No | Name of the job to be queried or name keyword used to filter training jobs. |
| -pn / --page-num | Int | No | Page number. The default value is 1. |
| -ps / --page-size | Int | No | Number of training jobs displayed on each page. The value ranges from 1 to 50. The default value is 10. |
| -v / --verbose | Bool | No | Whether to display detailed information. It is disabled by default. |

  • Example: Obtain a training job with a specified job ID.
    ma-cli ma-job get-job -i b63e90xxx

  • Example: Filter training jobs by job name keyword auto.
    ma-cli ma-job get-job -n auto
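
The paging options above can be combined in a small loop. The following is a minimal sketch, not part of ma-cli itself: the function name list_job_pages and the MA_CLI variable (which lets the command be swapped for echo in a dry run) are hypothetical, and only the --page-size and --page-num options come from the help output above.

```shell
# Hypothetical override: set MA_CLI=echo to print the commands instead of running them.
MA_CLI="${MA_CLI:-ma-cli}"

# list_job_pages LAST_PAGE [PAGE_SIZE]
# Runs `ma-cli ma-job get-job` once per page, from page 1 to LAST_PAGE.
list_job_pages() {
  local last_page="$1"
  local page_size="${2:-10}"   # 10 is the documented default; the allowed range is 1..50
  local page=1
  while [ "$page" -le "$last_page" ]; do
    "$MA_CLI" ma-job get-job --page-size "$page_size" --page-num "$page"
    page=$((page + 1))
  done
}
```

For example, list_job_pages 3 20 would request the first three pages with 20 jobs per page.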

Using ma-cli ma-job submit to Submit a ModelArts Training Job

Run the ma-cli ma-job submit command to submit a ModelArts training job.

When running this command, use the optional YAML_FILE argument to specify the path to the configuration file of the target job. If YAML_FILE is not specified, the configuration file is treated as empty and all settings must be passed through OPTIONS. The configuration file is in YAML format, and its keys correspond to the OPTIONS of the command. If you specify both YAML_FILE and OPTIONS, each OPTIONS value overrides the item with the same name in the configuration file.

$ ma-cli ma-job submit -h
Usage: ma-cli ma-job submit [OPTIONS] [YAML_FILE]...

  Submit training job.

  Example:

  ma-cli ma-job submit --code-dir obs://your_bucket/code/
                       --boot-file main.py
                       --framework-type PyTorch
                       --working-dir /home/ma-user/modelarts/user-job-dir/code
                       --framework-version pytorch_1.8.0-cuda_10.2-py_3.7-ubuntu_18.04-x86_64
                       --data-url obs://your_bucket/dataset/
                       --log-url obs://your_bucket/logs/
                       --train-instance-type modelarts.vm.cpu.8u
                       --train-instance-count 1

Options:
  --name TEXT                     Job name.
  --description TEXT              Job description.
  --image-url TEXT                Full swr custom image path.
  --uid TEXT                      Uid for custom image (default: 1000).
  --working-dir TEXT              ModelArts training job working directory.
  --local-code-dir TEXT           ModelArts training job local code directory.
  --user-command TEXT             Execution command for custom image.
  --pool-id TEXT                  Dedicated pool id.
  --train-instance-type TEXT      Train worker specification.
  --train-instance-count INTEGER  Number of workers.
  --data-url TEXT                 OBS path for training data.
  --log-url TEXT                  OBS path for training log.
  --code-dir TEXT                 OBS path for source code.
  --output TEXT                   Training output parameter with OBS path.
  --input TEXT                    Training input parameter with OBS path.
  --env-variables TEXT            Env variables for training job.
  --parameters TEXT               Training job parameters (only keyword parameters are supported).
  --boot-file TEXT                Training job boot file path behinds `code_dir`.
  --framework-type TEXT           Training job framework type.
  --framework-version TEXT        Training job framework version.
  --workspace-id TEXT             The workspace where you submit training job(default "0")
  --policy [regular|economic|turbo|auto]
                                  Training job policy, default is regular.
  --volumes TEXT                  Information about the volumes attached to the training job.
  -q, --quiet                     Exit without waiting after submit successfully.
  -C, --config-file PATH          Configure file path for authorization.
  -D, --debug                     Debug Mode. Shows full stack trace when error occurs.
  -P, --profile TEXT              CLI connection profile to use. The default profile is "DEFAULT".
  -H, -h, --help                  Show this message and exit.

Table 3 Parameters

| Parameter | Data Type | Mandatory | Description |
| --- | --- | --- | --- |
| YAML_FILE | String | No | Configuration file of a training job. If this parameter is not specified, the configuration file is empty. |
| --code-dir | String | Yes | OBS path to the training source code. |
| --data-url | String | Yes | OBS path to the training data. |
| --log-url | String | Yes | OBS path to training logs. |
| --train-instance-count | Int | Yes | Number of compute nodes in a training job. The default value is 1, indicating a standalone node. |
| --boot-file | String | No | Boot file specified when you use a preset command to submit a training job. This parameter can be omitted when you use a custom image or a custom command to submit a training job. |
| --name | String | No | Name of a training job. |
| --description | String | No | Description of a training job. |
| --image-url | String | No | SWR URL of a custom image, which is in the format of "organization/image_name:tag". |
| --uid | String | No | UID of the custom image. The default value is 1000. |
| --working-dir | String | No | Work directory where an algorithm is executed. |
| --local-code-dir | String | No | Local directory of the training container to which the algorithm code directory is downloaded. |
| --user-command | String | No | Command for executing a custom image. The directory must be under /home. When code-dir is prefixed with file://, this parameter does not take effect. |
| --pool-id | String | No | Resource pool ID selected for a training job. You can log in to the ModelArts console, choose Dedicated Resource Pools in the navigation pane on the left, and view the resource pool ID in the dedicated resource pool list. |
| --train-instance-type | String | No | Resource flavor selected for a training job. |
| --output | String | No | Training output. When this parameter is specified, the training job uploads the training container's output directory that corresponds to the named output parameter in the training script to the specified OBS path. To specify multiple parameters, use --output output1=obs://bucket/output1 --output output2=obs://bucket/output2. |
| --input | String | No | Training input. When this parameter is specified, the training job downloads the data from OBS to the training container and passes the data storage path to the training script through the named parameter. To specify multiple parameters, use --input data_path1=obs://bucket/data1 --input data_path2=obs://bucket/data2. |
| --env-variables | String | No | Environment variables input during training. To specify multiple parameters, use --env-variables ENV1=env1 --env-variables ENV2=env2. |
| --parameters | String | No | Training input parameters. To specify multiple parameters, use --parameters "--epoch 0 --pretrained". |
| --framework-type | String | No | Framework type selected for a training job. |
| --framework-version | String | No | Framework version selected for a training job. |
| -q / --quiet | Bool | No | Whether to exit directly without printing the job status synchronously after a training job is submitted. |
| --workspace-id | String | No | Workspace where a training job is deployed. The default value is 0. |
| --policy | String | No | Training resource flavor mode. The options are regular, economic, turbo, and auto. The default value is regular. |
| --volumes | String | No | EFS disks to be mounted. To specify multiple parameters, use --volumes "local_path=/xx/yy/zz;read_only=false;nfs_server_path=xxx.xxx.xxx.xxx:/" --volumes "local_path=/xxx/yyy/zzz;read_only=false;nfs_server_path=xxx.xxx.xxx.xxx:/". |
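
Several parameters in the table (--input, --output, --env-variables, --volumes) are passed by repeating the flag once per value. The following is a minimal sketch of how a wrapper could expand a list of KEY=VALUE pairs into repeated --env-variables flags. The function name submit_with_env and the MA_CLI variable are hypothetical, and the sketch assumes the mandatory settings are supplied by the YAML file passed as the first argument.

```shell
# Hypothetical override: set MA_CLI=echo to print the command instead of running it.
MA_CLI="${MA_CLI:-ma-cli}"

# submit_with_env YAML_FILE KEY=VALUE...
# Expands each KEY=VALUE pair into its own --env-variables flag.
submit_with_env() {
  local yaml_file="$1"
  shift
  local pair
  # Rewrite the positional parameters as:
  #   --env-variables K1=V1 --env-variables K2=V2 ...
  for pair in "$@"; do
    shift
    set -- "$@" --env-variables "$pair"
  done
  # OPTIONS come first and YAML_FILE last, matching the documented usage line.
  "$MA_CLI" ma-job submit "$@" "$yaml_file"
}
```

For example, submit_with_env train.yaml ENV1=env1 ENV2=env2 would run ma-cli ma-job submit --env-variables ENV1=env1 --env-variables ENV2=env2 train.yaml.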

Example: Submitting a Training Job Based on a Preset ModelArts Image

Submit a training job by specifying the OPTIONS parameter.

ma-cli ma-job submit --code-dir obs://your-bucket/mnist/code/ \
                  --boot-file main.py \
                  --framework-type PyTorch \
                  --working-dir /home/ma-user/modelarts/user-job-dir/code \
                  --framework-version pytorch_1.8.0-cuda_10.2-py_3.7-ubuntu_18.04-x86_64 \
                  --data-url obs://your-bucket/mnist/dataset/MNIST/ \
                  --log-url obs://your-bucket/mnist/logs/ \
                  --train-instance-type modelarts.vm.cpu.8u \
                  --train-instance-count 1  \
                  -q

Example of train.yaml using a preset image:

# Example .ma/train.yaml (preset image)
# pool_id: pool_xxxx
train-instance-type: modelarts.vm.cpu.8u
train-instance-count: 1
data-url: obs://your-bucket/mnist/dataset/MNIST/
code-dir: obs://your-bucket/mnist/code/
working-dir: /home/ma-user/modelarts/user-job-dir/code
framework-type: PyTorch
framework-version: pytorch_1.8.0-cuda_10.2-py_3.7-ubuntu_18.04-x86_64
boot-file: main.py
log-url: obs://your-bucket/mnist/logs/

##[Optional] Uncomment to set uid when using custom image mode
uid: 1000

##[Optional] Uncomment to upload output file/dir to OBS from training platform
output:
    - name: output_dir
      obs_path: obs://your-bucket/mnist/output1/

##[Optional] Uncomment to download input file/dir from OBS to training platform
input:
    - name: data_url
      obs_path: obs://your-bucket/mnist/dataset/MNIST/

##[Optional] Uncomment to pass hyperparameters
parameters:
    - epoch: 10
    - learning_rate: 0.01
    - pretrained:

##[Optional] Uncomment to use dedicated pool
pool_id: pool_xxxx

##[Optional] Uncomment to use volumes attached to the training job
volumes:
  - efs:
      local_path: /xx/yy/zz
      read_only: false
      nfs_server_path: xxx.xxx.xxx.xxx:/
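
A configuration file such as the one above can be combined with command-line OPTIONS, which take precedence over items with the same name in the file. The following is a minimal sketch of a wrapper that checks that the file exists and then submits it with optional overrides. The function name submit_from_yaml and the MA_CLI variable are hypothetical; the option-first, file-last ordering follows the usage line shown in the help output.

```shell
# Hypothetical override: set MA_CLI=echo to print the command instead of running it.
MA_CLI="${MA_CLI:-ma-cli}"

# submit_from_yaml YAML_FILE [OPTIONS...]
# Submits the job described by YAML_FILE; any OPTIONS given override the file.
submit_from_yaml() {
  local yaml_file="$1"
  shift
  if [ ! -f "$yaml_file" ]; then
    echo "config file not found: $yaml_file" >&2
    return 1
  fi
  "$MA_CLI" ma-job submit "$@" "$yaml_file"
}
```

For example, submit_from_yaml .ma/train.yaml --train-instance-count 2 -q would reuse the file but request two nodes and return without waiting for the job status.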

Example: Using a Custom Image to Create a Training Job

Submit a training job by specifying the OPTIONS parameter.

ma-cli ma-job submit --image-url atelier/pytorch_1_8:pytorch_1.8.0-cuda_10.2-py_3.7-ubuntu_18.04-x86_64-20220926104358-041ba2e \
                  --code-dir obs://your-bucket/mnist/code/ \
                  --user-command "export LD_LIBRARY_PATH=/usr/local/cuda/compat:$LD_LIBRARY_PATH && cd /home/ma-user/modelarts/user-job-dir/code && /home/ma-user/anaconda3/envs/PyTorch-1.8/bin/python main.py" \
                  --data-url obs://your-bucket/mnist/dataset/MNIST/ \
                  --log-url obs://your-bucket/mnist/logs/ \
                  --train-instance-type modelarts.vm.cpu.8u \
                  --train-instance-count 1  \
                  -q

Example of train.yaml using a custom image:

# Example .ma/train.yaml (custom image)
image-url: atelier/pytorch_1_8:pytorch_1.8.0-cuda_10.2-py_3.7-ubuntu_18.04-x86_64-20220926104358-041ba2e
user-command: export LD_LIBRARY_PATH=/usr/local/cuda/compat:$LD_LIBRARY_PATH && cd /home/ma-user/modelarts/user-job-dir/code && /home/ma-user/anaconda3/envs/PyTorch-1.8/bin/python main.py
train-instance-type: modelarts.vm.cpu.8u
train-instance-count: 1
data-url: obs://your-bucket/mnist/dataset/MNIST/
code-dir: obs://your-bucket/mnist/code/
log-url: obs://your-bucket/mnist/logs/

##[Optional] Uncomment to set uid when using custom image mode
uid: 1000

##[Optional] Uncomment to upload output file/dir to OBS from training platform
output:
    - name: output_dir
      obs_path: obs://your-bucket/mnist/output1/

##[Optional] Uncomment to download input file/dir from OBS to training platform
input:
    - name: data_url
      obs_path: obs://your-bucket/mnist/dataset/MNIST/

##[Optional] Uncomment to pass hyperparameters
parameters:
    - epoch: 10
    - learning_rate: 0.01
    - pretrained:

##[Optional] Uncomment to use dedicated pool
pool_id: pool_xxxx

##[Optional] Uncomment to use volumes attached to the training job
volumes:
  - efs:
      local_path: /xx/yy/zz
      read_only: false
      nfs_server_path: xxx.xxx.xxx.xxx:/
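
When --user-command is passed on the command line inside double quotes, the local shell expands $LD_LIBRARY_PATH before ma-cli receives the string, whereas the YAML file passes it through literally. If the intent is for the variable to be expanded inside the training container, single quotes keep it literal. The following is a minimal sketch under that assumption; the function name submit_custom_image_job and the MA_CLI and USER_COMMAND variables are hypothetical, and the option values are the placeholders used in the example above.

```shell
# Hypothetical override: set MA_CLI=echo to print the command instead of running it.
MA_CLI="${MA_CLI:-ma-cli}"

# Single quotes stop the local shell from expanding $LD_LIBRARY_PATH,
# so the literal text reaches the training container.
USER_COMMAND='export LD_LIBRARY_PATH=/usr/local/cuda/compat:$LD_LIBRARY_PATH && cd /home/ma-user/modelarts/user-job-dir/code && /home/ma-user/anaconda3/envs/PyTorch-1.8/bin/python main.py'

# submit_custom_image_job IMAGE_URL
submit_custom_image_job() {
  "$MA_CLI" ma-job submit \
    --image-url "$1" \
    --code-dir obs://your-bucket/mnist/code/ \
    --user-command "$USER_COMMAND" \
    --data-url obs://your-bucket/mnist/dataset/MNIST/ \
    --log-url obs://your-bucket/mnist/logs/ \
    --train-instance-type modelarts.vm.cpu.8u \
    --train-instance-count 1 \
    -q
}
```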

Using ma-cli ma-job get-log to Obtain ModelArts Training Job Logs

Run the ma-cli ma-job get-log command to obtain ModelArts training job logs.

$ ma-cli ma-job get-log -h
Usage: ma-cli ma-job get-log [OPTIONS]

  Get job log details.

  Example:

  # Get job log by job id
  ma-cli ma-job get-log --job-id ${job_id}

Options:
  -i, --job-id TEXT       Get training job details by job id.  [required]
  -t, --task-id TEXT      Get training job details by task id (default "worker-0").
  -C, --config-file TEXT  Configure file path for authorization.
  -D, --debug             Debug Mode. Shows full stack trace when error occurs.
  -P, --profile TEXT      CLI connection profile to use. The default profile is "DEFAULT".
  -h, -H, --help          Show this message and exit.

| Parameter | Data Type | Mandatory | Description |
| --- | --- | --- | --- |
| -i / --job-id | String | Yes | ID of the job whose logs are to be obtained. |
| -t / --task-id | String | No | ID of the task whose logs are to be obtained. The default value is worker-0. |

Example: Obtain logs of a specified training job.

ma-cli ma-job get-log --job-id b63e90baxxx
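
For a job with several tasks, the --task-id option can be used once per task. The following is a minimal sketch that saves the log of each task to its own file. The function name save_task_logs and the MA_CLI variable are hypothetical. The sketch also assumes that get-log writes the log to standard output and that additional tasks follow the worker-N naming suggested by the default worker-0; neither is confirmed by the help output.

```shell
# Hypothetical override: set MA_CLI=echo to print the commands instead of running them.
MA_CLI="${MA_CLI:-ma-cli}"

# save_task_logs JOB_ID OUT_DIR TASK_ID...
# Writes the log of each task to OUT_DIR/TASK_ID.log.
save_task_logs() {
  local job_id="$1"
  local out_dir="$2"
  shift 2
  mkdir -p "$out_dir" || return 1
  local task_id
  for task_id in "$@"; do
    # Assumes the log is printed to standard output.
    "$MA_CLI" ma-job get-log --job-id "$job_id" --task-id "$task_id" \
      > "$out_dir/$task_id.log"
  done
}
```

For example, save_task_logs b63e90baxxx ./logs worker-0 worker-1 would create ./logs/worker-0.log and ./logs/worker-1.log.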

Using ma-cli ma-job get-event to Obtain ModelArts Training Job Events

Run the ma-cli ma-job get-event command to obtain ModelArts training job events.

$ ma-cli ma-job get-event -h
Usage: ma-cli ma-job get-event [OPTIONS]

  Get job running event.

  Example:

  # Get training job running event
  ma-cli ma-job get-event --job-id ${job_id}

Options:
  -i, --job-id TEXT       Get training job event by job id.  [required]
  -C, --config-file TEXT  Configure file path for authorization.
  -D, --debug             Debug Mode. Shows full stack trace when error occurs.
  -P, --profile TEXT      CLI connection profile to use. The default profile is "DEFAULT".
  -H, -h, --help          Show this message and exit.

| Parameter | Data Type | Mandatory | Description |
| --- | --- | --- | --- |
| -i / --job-id | String | Yes | ID of the job whose events are to be obtained. |

Example: Obtain events of a specified training job.

ma-cli ma-job get-event --job-id b63e90baxxx
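
When troubleshooting, it can help to view the job details and the job events together. The following is a minimal sketch that prints both for one job ID. The function name job_snapshot, the section labels, and the MA_CLI variable are hypothetical; the two subcommands and the --job-id option come from the help output in this document.

```shell
# Hypothetical override: set MA_CLI=echo to print the commands instead of running them.
MA_CLI="${MA_CLI:-ma-cli}"

# job_snapshot JOB_ID
# Prints the job details followed by the job events.
job_snapshot() {
  local job_id="$1"
  echo "### details"
  "$MA_CLI" ma-job get-job --job-id "$job_id" || return 1
  echo "### events"
  "$MA_CLI" ma-job get-event --job-id "$job_id"
}
```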

Using ma-cli ma-job get-engine to Obtain the AI Engines Used by ModelArts Training Jobs

Run the ma-cli ma-job get-engine command to obtain the AI engines used by ModelArts training jobs.

$ ma-cli ma-job get-engine -h
Usage: ma-cli ma-job get-engine [OPTIONS]

  Get job engine info.

  Example:

  # Get training job engines
  ma-cli ma-job get-engine

Options:
  -v, --verbose           Show detailed information about training engines.
  -C, --config-file TEXT  Configure file path for authorization.
  -D, --debug             Debug Mode. Shows full stack trace when error occurs.
  -P, --profile TEXT      CLI connection profile to use. The default profile is "DEFAULT".
  -H, -h, --help          Show this message and exit.

Table 4 Parameters

| Parameter | Data Type | Mandatory | Description |
| --- | --- | --- | --- |
| -v / --verbose | Bool | No | Whether to display detailed information. It is disabled by default. |

Example: Obtain the AI engines used by training jobs.

ma-cli ma-job get-engine
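
The engine list can be narrowed down with standard text tools. The following is a minimal sketch that keeps only the lines containing a keyword. The function name find_engines and the MA_CLI variable are hypothetical, and the sketch assumes that get-engine prints the engine names as text on standard output; the exact output layout is not described in this document.

```shell
# Hypothetical override: point MA_CLI at a stub to try the function without ma-cli.
MA_CLI="${MA_CLI:-ma-cli}"

# find_engines KEYWORD
# Prints the lines of the engine list that contain KEYWORD (case-insensitive).
# Returns non-zero when nothing matches, following grep's exit status.
find_engines() {
  "$MA_CLI" ma-job get-engine | grep -i -- "$1"
}
```

For example, find_engines pytorch would show only the lines that mention PyTorch.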

Using ma-cli ma-job get-flavor to Obtain the Resource Flavors Used by ModelArts Training Jobs

Run the ma-cli ma-job get-flavor command to obtain the resource flavors used by ModelArts training jobs.

$ ma-cli ma-job get-flavor -h
Usage: ma-cli ma-job get-flavor [OPTIONS]

  Get job flavor info.

  Example:

  # Get training job flavors
  ma-cli ma-job get-flavor

Options:
  -t, --flavor-type [CPU|GPU|Ascend]
                                  Type of training job flavor.
  -v, --verbose                   Show detailed information about training flavors.
  -C, --config-file TEXT          Configure file path for authorization.
  -D, --debug                     Debug Mode. Shows full stack trace when error occurs.
  -P, --profile TEXT              CLI connection profile to use. The default profile is "DEFAULT".
  -H, -h, --help                  Show this message and exit.

Table 5 Parameters

| Parameter | Data Type | Mandatory | Description |
| --- | --- | --- | --- |
| -t / --flavor-type | String | No | Resource flavor type. The options are CPU, GPU, and Ascend. If this parameter is not specified, all resource flavors are returned by default. |
| -v / --verbose | Bool | No | Whether to display detailed information. It is disabled by default. |

Example: Obtain the resource flavors and types of training jobs.

ma-cli ma-job get-flavor
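
To view the flavors grouped by type, the --flavor-type option can be applied once for each of the documented choices (CPU, GPU, Ascend). The following is a minimal sketch; the function name list_flavors_by_type, the section labels, and the MA_CLI variable are hypothetical.

```shell
# Hypothetical override: set MA_CLI=echo to print the commands instead of running them.
MA_CLI="${MA_CLI:-ma-cli}"

# list_flavors_by_type
# Queries the training flavors once per documented flavor type.
list_flavors_by_type() {
  local flavor_type
  for flavor_type in CPU GPU Ascend; do
    echo "== $flavor_type =="
    "$MA_CLI" ma-job get-flavor --flavor-type "$flavor_type"
  done
}
```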

Using ma-cli ma-job stop to Stop a ModelArts Training Job

Run the ma-cli ma-job stop command to stop a training job with a specified job ID.

$ ma-cli ma-job stop -h
Usage: ma-cli ma-job stop [OPTIONS]

  Stop training job by job id.

  Example:

  Stop training job by job id
  ma-cli ma-job stop --job-id ${job_id}

Options:
  -i, --job-id TEXT       Get training job event by job id.  [required]
  -y, --yes               Confirm stop operation.
  -C, --config-file TEXT  Configure file path for authorization.
  -D, --debug             Debug Mode. Shows full stack trace when error occurs.
  -P, --profile TEXT      CLI connection profile to use. The default profile is "DEFAULT".
  -H, -h, --help          Show this message and exit.
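
Table 6 Parameters

| Parameter | Data Type | Mandatory | Description |
| --- | --- | --- | --- |
| -i / --job-id | String | Yes | ID of the ModelArts training job to be stopped. |
| -y / --yes | Bool | No | Whether to confirm the stop operation and forcibly stop the training job. |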

Example: Stop a running training job.

ma-cli ma-job stop --job-id efd3e2f8xxx
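
In scripts, several jobs can be stopped in one pass. The following is a minimal sketch that stops each job ID in turn and reports failure if any stop command fails. The function name stop_jobs and the MA_CLI variable are hypothetical, and the sketch assumes that --yes confirms the operation without an interactive prompt, which the help output suggests but does not state explicitly.

```shell
# Hypothetical override: set MA_CLI=echo to print the commands instead of running them.
MA_CLI="${MA_CLI:-ma-cli}"

# stop_jobs JOB_ID...
# Stops each job; returns 1 if at least one stop command failed.
stop_jobs() {
  local job_id
  local failed=0
  for job_id in "$@"; do
    # Assumption: --yes confirms the stop without prompting.
    "$MA_CLI" ma-job stop --job-id "$job_id" --yes || failed=1
  done
  return "$failed"
}
```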