Submitting a ModelArts Training Job
Run the ma-cli ma-job submit command to submit a ModelArts training job.
Before running this command, set YAML_FILE to the path of the configuration file for the target job. If this parameter is not specified, the configuration file is treated as empty. The configuration file is in YAML format, and its keys correspond to the options of the command. If the same setting is specified both in the YAML_FILE configuration file and as an option in the CLI, the option value overrides the value in the configuration file.

    $ ma-cli ma-job submit -h
    Usage: ma-cli ma-job submit [OPTIONS] [YAML_FILE]...

      Submit training job.

      Example:

      ma-cli ma-job submit --code-dir obs://your_bucket/code/
                           --boot-file main.py
                           --framework-type PyTorch
                           --working-dir /home/ma-user/modelarts/user-job-dir/code
                           --framework-version pytorch_1.8.0-cuda_10.2-py_3.7-ubuntu_18.04-x86_64
                           --data-url obs://your_bucket/dataset/
                           --log-url obs://your_bucket/logs/
                           --train-instance-type modelarts.vm.cpu.8u
                           --train-instance-count 1

    Options:
      --name TEXT                     Job name.
      --description TEXT              Job description.
      --image-url TEXT                Full swr custom image path.
      --uid TEXT                      Uid for custom image (default: 1000).
      --working-dir TEXT              ModelArts training job working directory.
      --local-code-dir TEXT           ModelArts training job local code directory.
      --user-command TEXT             Execution command for custom image.
      --pool-id TEXT                  Dedicated pool id.
      --train-instance-type TEXT      Train worker specification.
      --train-instance-count INTEGER  Number of workers.
      --data-url TEXT                 OBS path for training data.
      --log-url TEXT                  OBS path for training log.
      --code-dir TEXT                 OBS path for source code.
      --output TEXT                   Training output parameter with OBS path.
      --input TEXT                    Training input parameter with OBS path.
      --env-variables TEXT            Env variables for training job.
      --parameters TEXT               Training job parameters (only keyword
                                      parameters are supported).
      --boot-file TEXT                Training job boot file path behinds `code_dir`.
      --framework-type TEXT           Training job framework type.
      --framework-version TEXT        Training job framework version.
      --workspace-id TEXT             The workspace where you submit training
                                      job(default "0")
      --policy [regular|economic|turbo|auto]
                                      Training job policy, default is regular.
      --volumes TEXT                  Information about the volumes attached to
                                      the training job.
      -q, --quiet                     Exit without waiting after submit
                                      successfully.
      -C, --config-file PATH          Configure file path for authorization.
      -D, --debug                     Debug Mode. Shows full stack trace when
                                      error occurs.
      -P, --profile TEXT              CLI connection profile to use. The default
                                      profile is "DEFAULT".
      -H, -h, --help                  Show this message and exit.

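The precedence rule stated at the start of this section (an option given in the CLI overrides the same setting in YAML_FILE) can be illustrated with a small dry-run sketch. It assumes a hypothetical `./train-job.yaml` that already sets `train-instance-count: 1`; the script only prints the command it would run, so nothing is submitted.

```shell
# Assumption: ./train-job.yaml is a hypothetical job file that contains
# "train-instance-count: 1" along with the other job settings.
yaml_file="./train-job.yaml"

# Build the argument list. Per the precedence rule, the CLI option
# (2 workers) is expected to override the value in the YAML file (1).
set -- ma-cli ma-job submit --train-instance-count 2 "$yaml_file"

# Dry run: print the command instead of executing it.
submit_cmd="$*"
echo "$submit_cmd"

# To submit for real, replace the echo above with: "$@"
```

Keeping rarely changed settings in the YAML file and overriding only what differs per run avoids maintaining several near-identical job files.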
Parameter | Type | Mandatory | Description |
---|---|---|---|
YAML_FILE | String | No | Configuration file of a training job. If this parameter is not specified, the configuration file is treated as empty. |
--code-dir | String | Yes | OBS path to the training source code |
--data-url | String | Yes | OBS path to the training data |
--log-url | String | Yes | OBS path to training logs |
--train-instance-count | Integer | Yes | Number of compute nodes in a training job. The default value is 1, indicating a standalone node. |
--boot-file | String | No | Boot file specified when you use a preset command to submit a training job. This parameter can be omitted when you use a custom image or command to submit a training job. |
--name | String | No | Name of a training job |
--description | String | No | Description of a training job |
--image-url | String | No | SWR URL of a custom image, which is in the format of "organization/image_name:tag". |
--uid | String | No | Runtime UID of a custom image. The default value is 1000. |
--working-dir | String | No | Work directory where an algorithm is executed |
--local-code-dir | String | No | Local directory in the training container to which the algorithm code directory is downloaded |
--user-command | String | No | Command for executing a custom image. The directory must be under /home. When code-dir is prefixed with file://, this parameter does not take effect. |
--pool-id | String | No | Resource pool ID selected for a training job. To obtain the ID, log in to the ModelArts management console, choose Dedicated Resource Pools in the navigation pane on the left, and view the resource pool ID in the dedicated resource pool list. |
--train-instance-type | String | No | Resource flavor selected for a training job |
--output | String | No | Training output. After this parameter is specified, the training job uploads the output directory of the training container that corresponds to the specified output parameter in the training script to the specified OBS path. To specify multiple parameters, use --output output1=obs://bucket/output1 --output output2=obs://bucket/output2. |
--input | String | No | Training input. After this parameter is specified, the training job downloads the data from OBS to the training container and passes the data storage path to the training script through the specified parameter. To specify multiple parameters, use --input data_path1=obs://bucket/data1 --input data_path2=obs://bucket/data2. |
--env-variables | String | No | Environment variables input during training. To specify multiple parameters, use --env-variables ENV1=env1 --env-variables ENV2=env2. |
--parameters | String | No | Training input parameters. To specify multiple parameters, use --parameters "--epoch 0 --pretrained". |
--framework-type | String | No | Engine selected for a training job |
--framework-version | String | No | Engine version selected for a training job |
-q / --quiet | Bool | No | After a training job is submitted, the command exits immediately and does not print the job status synchronously. |
--workspace-id | String | No | Workspace where a training job is deployed. The default value is 0. |
--policy | String | No | Training resource specification mode. The options are regular, economic, turbo, and auto. The default value is regular. |
--volumes | String | No | Mount EFS disks. To specify multiple parameters, use --volumes "local_path=/xx/yy/zz;read_only=false;nfs_server_path=xxx.xxx.xxx.xxx:/" --volumes "local_path=/xxx/yyy/zzz;read_only=false;nfs_server_path=xxx.xxx.xxx.xxx:/". |
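Several rows in the table above describe options that are repeated to pass multiple values (--input, --output, --env-variables), while --parameters takes all hyperparameters in one quoted string. The following dry-run sketch puts these forms together. The bucket paths and variable values are the placeholders used in the table, and `./train-job.yaml` is a hypothetical file assumed to supply the mandatory settings; the script prints the arguments rather than submitting a job.

```shell
# Dry-run sketch: all paths and values are placeholders, and
# ./train-job.yaml is a hypothetical file holding the mandatory settings.
set -- ma-cli ma-job submit \
    --input data_path1=obs://bucket/data1 \
    --input data_path2=obs://bucket/data2 \
    --output output1=obs://bucket/output1 \
    --env-variables ENV1=env1 \
    --env-variables ENV2=env2 \
    --parameters "--epoch 0 --pretrained" \
    ./train-job.yaml

# --parameters carries its hyperparameters as ONE quoted argument, so the
# list holds 3 command words + 6 option/value pairs + 1 YAML file = 16 items.
arg_count=$#
echo "argument count: $arg_count"

# Print one argument per line to make the quoting visible.
printf '%s\n' "$@"

# To submit for real, run: "$@"
```

Printing one argument per line is a quick way to confirm that the quoted --parameters value reaches the command as a single argument instead of being split by the shell.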
Submitting a Training Job Based on a Preset ModelArts Image
Submit a training job by specifying options in the CLI.

    ma-cli ma-job submit --code-dir obs://your-bucket/mnist/code/ \
      --boot-file main.py \
      --framework-type PyTorch \
      --working-dir /home/ma-user/modelarts/user-job-dir/code \
      --framework-version pytorch_1.8.0-cuda_10.2-py_3.7-ubuntu_18.04-x86_64 \
      --data-url obs://your-bucket/mnist/dataset/MNIST/ \
      --log-url obs://your-bucket/mnist/logs/ \
      --train-instance-type modelarts.vm.cpu.8u \
      --train-instance-count 1 \
      -q

The following is an example of train.yaml using a preset image:

    # Example .ma/train.yaml (preset image)
    # pool_id: pool_xxxx
    train-instance-type: modelarts.vm.cpu.8u
    train-instance-count: 1
    data-url: obs://your-bucket/mnist/dataset/MNIST/
    code-dir: obs://your-bucket/mnist/code/
    working-dir: /home/ma-user/modelarts/user-job-dir/code
    framework-type: PyTorch
    framework-version: pytorch_1.8.0-cuda_10.2-py_3.7-ubuntu_18.04-x86_64
    boot-file: main.py
    log-url: obs://your-bucket/mnist/logs/

    ##[Optional] Uncomment to set uid when use custom image mode
    uid: 1000

    ##[Optional] Uncomment to upload output file/dir to OBS from training platform
    output:
      - name: output_dir
        obs_path: obs://your-bucket/mnist/output1/

    ##[Optional] Uncomment to download input file/dir from OBS to training platform
    input:
      - name: data_url
        obs_path: obs://your-bucket/mnist/dataset/MNIST/

    ##[Optional] Uncomment pass hyperparameters
    parameters:
      - epoch: 10
      - learning_rate: 0.01
      - pretrained:

    ##[Optional] Uncomment to use dedicated pool
    pool_id: pool_xxxx

    ##[Optional] Uncomment to use volumes attached to the training job
    volumes:
      - efs:
          local_path: /xx/yy/zz
          read_only: false
          nfs_server_path: xxx.xxx.xxx.xxx:/

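A job file like the one above can be sanity-checked before it is submitted. The following is a minimal sketch and not part of ma-cli. It assumes that the four options marked mandatory in the parameter table (code-dir, data-url, log-url, train-instance-count) are supplied through the YAML file rather than in the CLI. It writes a placeholder job file to a temporary path and reports any of those keys that are missing.

```shell
# Write a minimal job file to a temporary path (values are placeholders).
yaml_file="$(mktemp)"
cat > "$yaml_file" <<'EOF'
code-dir: obs://your-bucket/mnist/code/
data-url: obs://your-bucket/mnist/dataset/MNIST/
log-url: obs://your-bucket/mnist/logs/
train-instance-count: 1
EOF

# Look for each key that the parameter table marks as mandatory.
# This is a plain text match on top-level keys, not a full YAML parse.
missing=""
for key in code-dir data-url log-url train-instance-count; do
    if ! grep -q "^${key}:" "$yaml_file"; then
        missing="$missing $key"
    fi
done

if [ -n "$missing" ]; then
    echo "missing mandatory keys:$missing" >&2
else
    echo "all mandatory keys present"
fi

rm -f "$yaml_file"
```

Because options given in the CLI override the file, a key reported as missing here is only a problem if it is not passed as an option either.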
Using a Custom Image to Create a Training Job
Submit a training job by specifying options in the CLI.

    ma-cli ma-job submit --image-url atelier/pytorch_1_8:pytorch_1.8.0-cuda_10.2-py_3.7-ubuntu_18.04-x86_64-20220926104358-041ba2e \
      --code-dir obs://your-bucket/mnist/code/ \
      --user-command "export LD_LIBRARY_PATH=/usr/local/cuda/compat:$LD_LIBRARY_PATH && cd /home/ma-user/modelarts/user-job-dir/code && /home/ma-user/anaconda3/envs/PyTorch-1.8/bin/python main.py" \
      --data-url obs://your-bucket/mnist/dataset/MNIST/ \
      --log-url obs://your-bucket/mnist/logs/ \
      --train-instance-type modelarts.vm.cpu.8u \
      --train-instance-count 1 \
      -q

The following is an example of train.yaml using a custom image:

    # Example .ma/train.yaml (custom image)
    image-url: atelier/pytorch_1_8:pytorch_1.8.0-cuda_10.2-py_3.7-ubuntu_18.04-x86_64-20220926104358-041ba2e
    user-command: export LD_LIBRARY_PATH=/usr/local/cuda/compat:$LD_LIBRARY_PATH && cd /home/ma-user/modelarts/user-job-dir/code && /home/ma-user/anaconda3/envs/PyTorch-1.8/bin/python main.py
    train-instance-type: modelarts.vm.cpu.8u
    train-instance-count: 1
    data-url: obs://your-bucket/mnist/dataset/MNIST/
    code-dir: obs://your-bucket/mnist/code/
    log-url: obs://your-bucket/mnist/logs/

    ##[Optional] Uncomment to set uid when use custom image mode
    uid: 1000

    ##[Optional] Uncomment to upload output file/dir to OBS from training platform
    output:
      - name: output_dir
        obs_path: obs://your-bucket/mnist/output1/

    ##[Optional] Uncomment to download input file/dir from OBS to training platform
    input:
      - name: data_url
        obs_path: obs://your-bucket/mnist/dataset/MNIST/

    ##[Optional] Uncomment pass hyperparameters
    parameters:
      - epoch: 10
      - learning_rate: 0.01
      - pretrained:

    ##[Optional] Uncomment to use dedicated pool
    pool_id: pool_xxxx

    ##[Optional] Uncomment to use volumes attached to the training job
    volumes:
      - efs:
          local_path: /xx/yy/zz
          read_only: false
          nfs_server_path: xxx.xxx.xxx.xxx:/

Examples
- Submit a training job based on a YAML file.

      ma-cli ma-job submit ./train-job.yaml

- Submit a training job using preset image pytorch1.8-cuda10.2-cudnn7-ubuntu18.04 through the CLI.

      ma-cli ma-job submit --code-dir obs://automation-use-only/Original/TrainJob/TrainJob-v2/pytorch1.8.0_cuda10.2/code/ \
        --boot-file test-pytorch.py \
        --framework-type PyTorch \
        --working-dir /home/ma-user/modelarts/user-job-dir/code \
        --framework-version pytorch_1.8.0-cuda_10.2-py_3.7-ubuntu_18.04-x86_64 \
        --data-url obs://automation-use-only/Original/TrainJob/TrainJob-v2/pytorch1.8.0_cuda10.2/data/ \
        --log-url obs://automation-use-only/Original/TrainJob/TrainJob-v2/pytorch1.8.0_cuda10.2/data/logs/ \
        --train-instance-type modelarts.vm.cpu.8u \
        --train-instance-count 1