Updated on 2024-04-30 GMT+08:00

Creating a Training Job

Function

This API is used to create a training job.

Debugging

You can debug this API through automatic authentication in API Explorer or use the SDK sample code generated by API Explorer.

URI

POST /v2/{project_id}/training-jobs

Table 1 Path Parameters

Parameter

Mandatory

Type

Description

project_id

Yes

String

Project ID. For details, see Obtaining a Project ID and Name.
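
As an illustration of how the path parameter fits into the URI, the following minimal sketch builds the request URL. The endpoint host and the project ID value are placeholders assumed for the example; use the endpoint and project ID of your own region and account.

```python
from urllib.parse import quote


def training_jobs_url(endpoint: str, project_id: str) -> str:
    """Build the URL for POST /v2/{project_id}/training-jobs."""
    if not project_id:
        raise ValueError("project_id is mandatory")
    # Percent-encode the path parameter so that unexpected characters
    # cannot change the structure of the path.
    return f"{endpoint.rstrip('/')}/v2/{quote(project_id, safe='')}/training-jobs"


# Hypothetical endpoint and project ID, for illustration only.
print(training_jobs_url("https://modelarts.example.com", "0123456789abcdef"))
```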

Request Parameters

Table 2 Request body parameters

Parameter

Mandatory

Type

Description

kind

Yes

String

Training job type, which is job by default. Options:

  • job: training job

metadata

Yes

JobMetadata object

Metadata of a training job.

algorithm

No

JobAlgorithm object

Algorithm used by a training job. Options:

  • id: Only the algorithm ID is used.

  • subscription_id+item_version_id: The subscription ID and version ID of the algorithm are used.

  • code_dir+boot_file: The code directory and boot file of the training job are used.

tasks

No

Array of Task objects

Task list. This function is not implemented currently.

spec

No

spec object

Specifications of a training job. If this parameter is specified, leave the tasks parameter blank.
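
The top-level parameters above can be combined as in the following sketch of a request body. It assumes the code_dir+boot_file option for the algorithm and a resource identified by flavor_id only. The helper name and the flavor ID value are hypothetical, and the tasks parameter is omitted because spec is specified.

```python
import json


def build_training_job_body(name, code_dir, boot_file, flavor_id, node_count=1):
    """Assemble a minimal request body (a sketch, not a complete schema)."""
    return {
        "kind": "job",
        "metadata": {"name": name},
        # code_dir + boot_file option: the job runs custom code.
        "algorithm": {"code_dir": code_dir, "boot_file": boot_file},
        # spec is specified, so the tasks parameter is left out.
        "spec": {"resource": {"flavor_id": flavor_id, "node_count": node_count}},
    }


body = build_training_job_body(
    name="demo-job",
    code_dir="/usr/app/",
    boot_file="/usr/app/boot.py",
    flavor_id="example-flavor-id",  # hypothetical value
)
print(json.dumps(body, indent=2))
```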

Table 3 JobMetadata

Parameter

Mandatory

Type

Description

id

No

String

Training job ID, which is generated and returned by ModelArts after the training job is created.

name

Yes

String

Name of a training job. The value must contain 1 to 64 characters consisting of only digits, letters, underscores (_), and hyphens (-).

workspace_id

No

String

Workspace where a job is located. The default value is 0.

description

No

String

Training job description. The value must contain 0 to 256 characters. The default value is NULL.

create_time

No

Long

Time when a training job was created, in milliseconds. The value is generated and returned by ModelArts after a training job is created.

user_name

No

String

Name of the user who created the training job. The username is generated and returned by ModelArts after a training job is created.

annotations

No

Map<String,String>

Advanced configuration of a training job. Options:

  • job_template: Template RL (heterogeneous job)

  • fault-tolerance/job-retry-num: 3 (number of retries upon a fault)
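
The length and character rules for name and description can be checked on the client before the request is sent. The following sketch assumes that "letters" means ASCII letters; the function name is hypothetical and the service remains the authority on what it accepts.

```python
import re

# 1 to 64 characters: digits, letters, underscores (_), and hyphens (-).
# Assumption: "letters" is read as ASCII letters only.
NAME_PATTERN = re.compile(r"[A-Za-z0-9_-]{1,64}")


def validate_metadata(metadata: dict) -> list:
    """Return a list of problems found in a JobMetadata dict (empty if none)."""
    errors = []
    name = metadata.get("name")
    if not isinstance(name, str) or not NAME_PATTERN.fullmatch(name):
        errors.append("name must be 1 to 64 characters of digits, letters, _ and -")
    description = metadata.get("description")
    if description is not None and len(description) > 256:
        errors.append("description must contain 0 to 256 characters")
    return errors


print(validate_metadata({"name": "demo-job", "description": "first run"}))
print(validate_metadata({"name": "bad name"}))
```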

Table 4 JobAlgorithm

Parameter

Mandatory

Type

Description

id

No

String

Algorithm ID.

name

No

String

Algorithm name. Leave it blank.

subscription_id

No

String

Subscription ID of a subscribed algorithm, which must be used with item_version_id.

item_version_id

No

String

Version ID of the subscribed algorithm, which must be used with subscription_id.

code_dir

No

String

Code directory of a training job, for example, /usr/app/. This parameter must be used together with boot_file. If id or subscription_id+item_version_id is set, leave it blank.

boot_file

No

String

Boot file of a training job, which must be stored in the code directory, for example, /usr/app/boot.py. This parameter must be used with code_dir. Leave this parameter blank if id, or subscription_id and item_version_id are specified.

autosearch_config_path

No

String

YAML configuration path of auto search jobs. An OBS URL is required.

autosearch_framework_path

No

String

Framework code directory of auto search jobs. An OBS URL is required.

command

No

String

Boot command used to start the container of a custom image of a training job.

parameters

No

Array of parameters objects

Running parameters of a training job.

policies

No

policies object

Policies supported by jobs, which are used for hyperparameter search.

inputs

No

Array of Input objects

Input of a training job.

outputs

No

Array of Output objects

Output of a training job.

engine

No

engine object

Engine of a training job. Leave this parameter blank if the job is created using id of the algorithm in algorithm management, or subscription_id+item_version_id of the subscribed algorithm.

local_code_dir

No

String

Local directory in the training container to which the algorithm code directory is downloaded. Rules:

  • The value must be a directory in /home.

  • In v1 compatibility mode, the current field does not take effect.

  • When code_dir is prefixed with file://, the current field does not take effect.

working_dir

No

String

Work directory where an algorithm is executed. Note that this parameter does not take effect in v1 compatibility mode.

environments

No

Array of Map<String,String> objects

Environment variables of a training job. The format is key: value. Leave this parameter blank.
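
The pairing rules in this table (subscription_id with item_version_id, code_dir with boot_file, and code_dir and boot_file left blank when an algorithm ID or a subscription is used) can be expressed as a small client-side check. This is a sketch only: the function name and the returned labels are hypothetical, and treating id and the subscription pair as mutually exclusive is an assumption based on the three options listed for the algorithm parameter.

```python
def algorithm_source(algorithm: dict) -> str:
    """Return which option a JobAlgorithm dict uses, or raise ValueError."""
    has_id = bool(algorithm.get("id"))
    has_sub = bool(algorithm.get("subscription_id"))
    has_ver = bool(algorithm.get("item_version_id"))
    has_code = bool(algorithm.get("code_dir"))
    has_boot = bool(algorithm.get("boot_file"))

    if has_sub != has_ver:
        raise ValueError("subscription_id and item_version_id must be used together")
    if has_code != has_boot:
        raise ValueError("code_dir and boot_file must be used together")

    modes = [
        label
        for label, used in (
            ("id", has_id),
            ("subscription", has_sub),
            ("custom_code", has_code),
        )
        if used
    ]
    if len(modes) != 1:
        raise ValueError(f"exactly one algorithm option is expected, got {modes}")
    return modes[0]


print(algorithm_source({"code_dir": "/usr/app/", "boot_file": "/usr/app/boot.py"}))
```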

Table 5 parameters

Parameter

Mandatory

Type

Description

name

No

String

Parameter name.

value

No

String

Parameter value.

description

No

String

Parameter description.

constraint

No

constraint object

Parameter constraint.

i18n_description

No

i18n_description object

Internationalization description.

Table 6 constraint

Parameter

Mandatory

Type

Description

type

No

String

Parameter type.

editable

No

Boolean

Whether the parameter is editable.

required

No

Boolean

Whether the parameter is mandatory.

sensitive

No

Boolean

Whether the parameter is sensitive. This function is not implemented currently.

valid_type

No

String

Valid type.

valid_range

No

Array of strings

Valid range.

Table 7 i18n_description

Parameter

Mandatory

Type

Description

language

No

String

Internationalization language.

description

No

String

Description.

Table 8 policies

Parameter

Mandatory

Type

Description

auto_search

No

auto_search object

Hyperparameter search configuration.

Table 10 reward_attrs

Parameter

Mandatory

Type

Description

name

No

String

Metric name.

mode

No

String

Search direction.

  • max: A larger metric value indicates better performance.

  • min: A smaller metric value indicates better performance.

regex

No

String

Regular expression of a metric.

Table 11 search_params

Parameter

Mandatory

Type

Description

name

No

String

Hyperparameter name.

param_type

No

String

Parameter type. Options:

  • continuous: The hyperparameter is of the continuous type. When an algorithm is used in a training job, continuous hyperparameters are displayed as text boxes on the console.

  • discrete: The hyperparameter is of the discrete type. When an algorithm is used in a training job, discrete hyperparameters are displayed as drop-down lists on the console.

lower_bound

No

String

Lower bound of the hyperparameter.

upper_bound

No

String

Upper bound of the hyperparameter.

discrete_points_num

No

String

Number of discrete points of a continuous hyperparameter.

discrete_values

No

Array of strings

List of discrete hyperparameter values.
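
Because every field in search_params is typed as a string (or an array of strings), numeric bounds and values need to be converted before they are sent. The following sketch shows one way to build entries for both hyperparameter types; the helper names are hypothetical, and the bound check is an assumption about sensible input rather than a documented rule.

```python
def continuous_param(name, lower, upper, points=None):
    """Build a search_params entry for a continuous hyperparameter."""
    if float(lower) > float(upper):
        raise ValueError("lower_bound must not exceed upper_bound")
    param = {
        "name": name,
        "param_type": "continuous",
        "lower_bound": str(lower),
        "upper_bound": str(upper),
    }
    if points is not None:
        # Number of discrete points of the continuous hyperparameter.
        param["discrete_points_num"] = str(points)
    return param


def discrete_param(name, values):
    """Build a search_params entry for a discrete hyperparameter."""
    return {
        "name": name,
        "param_type": "discrete",
        "discrete_values": [str(v) for v in values],
    }


print(continuous_param("learning_rate", 0.001, 0.1, points=5))
print(discrete_param("batch_size", [16, 32, 64]))
```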

Table 12 algo_configs

Parameter

Mandatory

Type

Description

name

No

String

Name of the search algorithm.

params

No

Array of AutoSearchAlgoConfigParameter objects

Search algorithm parameters.

Table 13 AutoSearchAlgoConfigParameter

Parameter

Mandatory

Type

Description

key

No

String

Parameter key.

value

No

String

Parameter value.

type

No

String

Parameter type.

Table 14 engine

Parameter

Mandatory

Type

Description

engine_id

No

String

Engine ID selected for a training job. Specify the engine using engine_id, engine_name + engine_version, or image_url.

engine_name

No

String

Name of the engine selected for a training job. If engine_id is set, leave this parameter blank.

engine_version

No

String

Name of the engine version selected for a training job. If engine_id is set, leave this parameter blank.

image_url

No

String

Custom image URL selected for a training job.

Table 15 Task

Parameter

Mandatory

Type

Description

role

No

String

Task role. This function is not supported currently.

algorithm

No

algorithm object

Algorithm management and configuration.

task_resource

No

task_resource object

Resource flavors of a training job.

Table 16 algorithm

Parameter

Mandatory

Type

Description

job_config

No

job_config object

Algorithm configuration, such as the boot file.

code_dir

No

String

Algorithm code directory, for example, /usr/app/. This parameter must be used together with boot_file.

boot_file

No

String

Code boot file of the algorithm, which needs to be stored in the code directory, for example, /usr/app/boot.py. This parameter must be used together with code_dir.

engine

No

engine object

Engine of a heterogeneous job algorithm.

inputs

No

Array of inputs objects

Data input of an algorithm.

outputs

No

Array of outputs objects

Data output of an algorithm.

local_code_dir

No

String

Local directory in the training container to which the algorithm code directory is downloaded. Rules:

  • The value must be a directory in /home.

  • In v1 compatibility mode, the current field does not take effect.

  • When code_dir is prefixed with file://, the current field does not take effect.

working_dir

No

String

Work directory where an algorithm is executed. Note that this parameter does not take effect in v1 compatibility mode.

Table 17 job_config

Parameter

Mandatory

Type

Description

parameters

No

Array of Parameter objects

Running parameter of an algorithm.

inputs

No

Array of Input objects

Data input of an algorithm.

outputs

No

Array of Output objects

Data output of an algorithm.

engine

No

engine object

Algorithm engine.

Table 18 Parameter

Parameter

Mandatory

Type

Description

name

No

String

Parameter name.

value

No

String

Parameter value.

description

No

String

Parameter description.

constraint

No

constraint object

Parameter constraint.

i18n_description

No

i18n_description object

Internationalization description.

Table 19 constraint

Parameter

Mandatory

Type

Description

type

No

String

Parameter type.

editable

No

Boolean

Whether the parameter is editable.

required

No

Boolean

Whether the parameter is mandatory.

sensitive

No

Boolean

Whether the parameter is sensitive. This function is not implemented currently.

valid_type

No

String

Valid type.

valid_range

No

Array of strings

Valid range.

Table 20 i18n_description

Parameter

Mandatory

Type

Description

language

No

String

Language. Options:

  • zh-cn: Chinese

  • en-us: English

description

No

String

Description.

Table 21 Input

Parameter

Mandatory

Type

Description

name

Yes

String

Name of the data input channel.

description

No

String

Description of the data input channel.

local_dir

No

String

Local directory of the container to which the data input channel is mapped.

remote

Yes

InputDataInfo object

Data input. Options:

  • dataset: Dataset as the data input

  • obs: OBS path as the data input

remote_constraint

No

Array of remote_constraint objects

Data input constraint.

Table 22 InputDataInfo

Parameter

Mandatory

Type

Description

dataset

No

dataset object

Dataset as the data input.

obs

No

obs object

OBS path in which the input and output data is stored.

Table 23 dataset

Parameter

Mandatory

Type

Description

id

Yes

String

Dataset ID of a training job.

version_id

Yes

String

Dataset version ID of a training job.

obs_url

No

String

OBS URL of the dataset required by a training job. ModelArts automatically parses and generates the URL based on the dataset and dataset version IDs. For example, /usr/data/.

Table 24 obs

Parameter

Mandatory

Type

Description

obs_url

Yes

String

OBS URL of the dataset required by a training job. For example, /usr/data/.

Table 25 remote_constraint

Parameter

Mandatory

Type

Description

data_type

No

String

Data input type, including the data storage location and dataset.

attributes

No

String

Attributes if a dataset is used as the data input. Options:

  • data_format: Data format

  • data_segmentation: Data segmentation

  • dataset_type: Labeling type
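
The two remote options of an input channel (dataset or obs) can be sketched as follows. The helper names, the channel names, and the ID values are hypothetical; only the mandatory fields from the tables above are filled in, plus local_dir when one is given.

```python
def obs_input(name, obs_url, local_dir=None):
    """Build an Input entry that reads from an OBS path."""
    channel = {"name": name, "remote": {"obs": {"obs_url": obs_url}}}
    if local_dir:
        channel["local_dir"] = local_dir
    return channel


def dataset_input(name, dataset_id, version_id):
    """Build an Input entry that reads from a dataset version."""
    return {
        "name": name,
        "remote": {"dataset": {"id": dataset_id, "version_id": version_id}},
    }


inputs = [
    obs_input("data_url", "/usr/data/"),
    dataset_input("labeled_data", "example-dataset-id", "example-version-id"),
]
print(inputs)
```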

Table 26 Output

Parameter

Mandatory

Type

Description

name

Yes

String

Name of the data output channel.

description

No

String

Description of the data output channel.

local_dir

No

String

Local directory of the container to which the data output channel is mapped.

remote

Yes

remote object

Description of the actual data output.

Table 27 remote

Parameter

Mandatory

Type

Description

obs

Yes

obs object

OBS to which data is actually exported.

Table 28 obs

Parameter

Mandatory

Type

Description

obs_url

Yes

String

OBS URL to which data is actually exported.

Table 29 engine

Parameter

Mandatory

Type

Description

engine_id

No

String

Engine ID selected for an algorithm.

engine_name

No

String

Engine name selected for an algorithm. If engine_id is specified, leave this parameter blank.

engine_version

No

String

Engine version name selected for an algorithm. If engine_id is specified, leave this parameter blank.

image_url

No

String

Custom image URL selected by an algorithm.

Table 30 engine

Parameter

Mandatory

Type

Description

engine_id

No

String

Engine ID of a heterogeneous job, for example, caffe-1.0.0-python2.7.

engine_name

No

String

Engine name of a heterogeneous job, for example, Caffe.

engine_version

No

String

Engine version of a heterogeneous job.

image_url

No

String

Custom image URL selected by an algorithm.

Table 31 inputs

Parameter

Mandatory

Type

Description

name

Yes

String

Name of the data input channel.

description

No

String

Description of the data input channel.

local_dir

No

String

Local directory of the container to which the data input channel is mapped.

remote

Yes

remote object

Data input. Options:

  • dataset: Dataset as the data input

  • obs: OBS path as the data input

Table 32 remote

Parameter

Mandatory

Type

Description

obs

No

obs object

OBS path in which the input and output data is stored.

Table 33 obs

Parameter

Mandatory

Type

Description

obs_url

Yes

String

OBS URL of the dataset required by a training job. For example, /usr/data/.

Table 34 outputs

Parameter

Mandatory

Type

Description

name

Yes

String

Name of the data output channel.

description

No

String

Description of the data output channel.

local_dir

No

String

Local directory of the container to which the data output channel is mapped.

remote

Yes

remote object

Description of the actual data output.

Table 35 remote

Parameter

Mandatory

Type

Description

obs

Yes

obs object

OBS to which data is actually exported.

Table 36 obs

Parameter

Mandatory

Type

Description

obs_url

Yes

String

OBS URL to which data is actually exported.

Table 37 task_resource

Parameter

Mandatory

Type

Description

flavor_id

No

String

Resource flavor ID of a training job.

node_count

Yes

Integer

Number of resource replicas selected for a training job.

Table 38 spec

Parameter

Mandatory

Type

Description

resource

No

resource object

Resource flavors of a training job. Select either flavor_id or pool_id+[flavor_id].

volumes

No

Array of volumes objects

Volumes attached to a training job.

log_export_path

No

log_export_path object

Export path of training job logs.

auto_stop

No

auto_stop object

Auto stop configuration of a training job.

Table 39 resource

Parameter

Mandatory

Type

Description

flavor_id

No

String

ID of the resource flavor selected for a training job. flavor_id cannot be specified for dedicated resource pools with CPU specifications. The options for dedicated resource pools with GPU/Ascend specifications are as follows:

  • modelarts.pool.visual.xlarge (1 card)

  • modelarts.pool.visual.2xlarge (2 cards)

  • modelarts.pool.visual.4xlarge (4 cards)

  • modelarts.pool.visual.8xlarge (8 cards)

node_count

No

Integer

Number of nodes used for creating a training job in a pool. By default, a single node is used.

pool_id

No

String

Dedicated resource pool ID.
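
The rule "flavor_id or pool_id+[flavor_id]" from the spec table can be sketched as a small builder for the resource object. The function name and the pool ID value are hypothetical, and rejecting a node count below 1 is an assumption based on the single-node default.

```python
def build_resource(flavor_id=None, pool_id=None, node_count=1):
    """Build a resource object from flavor_id, or pool_id with optional flavor_id."""
    if not flavor_id and not pool_id:
        raise ValueError("specify flavor_id, or pool_id with an optional flavor_id")
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    resource = {"node_count": node_count}
    if pool_id:
        resource["pool_id"] = pool_id
    if flavor_id:
        resource["flavor_id"] = flavor_id
    return resource


# Dedicated pool with a 2-card flavor on two nodes (pool ID is a placeholder).
print(build_resource("modelarts.pool.visual.2xlarge", "example-pool-id", node_count=2))
```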

Table 40 volumes

Parameter

Mandatory

Type

Description

nfs

No

nfs object

Volumes attached in NFS mode.

Table 41 nfs

Parameter

Mandatory

Type

Description

nfs_server_path

No

String

NFS server path.

local_path

No

String

Path for attaching volumes to the training container.

read_only

No

Boolean

Whether the volumes attached to the container in NFS mode are read-only.

Table 42 log_export_path

Parameter

Mandatory

Type

Description

obs_url

No

String

OBS URL for storing training job logs.

host_path

No

String

Path of the host where training job logs are stored.

Table 43 auto_stop

Parameter

Mandatory

Type

Description

time_unit

Yes

String

Time unit. Options:

  • HOURS

duration

Yes

Integer

Running time. The minimum value is 1.
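
A minimal sketch of an auto_stop object follows. HOURS is the only time unit listed above, so the hypothetical helper takes the duration in hours and enforces the minimum value of 1.

```python
def build_auto_stop(hours: int) -> dict:
    """Build an auto_stop object that stops the job after the given hours."""
    # bool is a subclass of int in Python, so it is rejected explicitly.
    if isinstance(hours, bool) or not isinstance(hours, int) or hours < 1:
        raise ValueError("duration must be an integer of at least 1")
    return {"time_unit": "HOURS", "duration": hours}


print(build_auto_stop(12))
```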

Response Parameters

Status code: 201

Table 44 Response body parameters

Parameter

Type

Description

kind

String

Training job type, which is job by default. Options:

  • job: training job

metadata

JobMetadata object

Metadata of a training job.

status

Status object

Status of a training job. You do not need to set this parameter when creating a job.

algorithm

JobAlgorithmResponse object

Algorithm used by a training job. Options:

  • id: Only the algorithm ID is used.

  • subscription_id+item_version_id: The subscription ID and version ID of the algorithm are used.

  • code_dir+boot_file: The code directory and boot file of the training job are used.

tasks

Array of TaskResponse objects

List of tasks in heterogeneous training jobs.

spec

spec object

Specifications of a training job.

Table 45 JobMetadata

Parameter

Type

Description

id

String

Training job ID, which is generated and returned by ModelArts after the training job is created.

name

String

Name of a training job. The value must contain 1 to 64 characters consisting of only digits, letters, underscores (_), and hyphens (-).

workspace_id

String

Workspace where a job is located. The default value is 0.

description

String

Training job description. The value must contain 0 to 256 characters. The default value is NULL.

create_time

Long

Time when a training job was created, in milliseconds. The value is generated and returned by ModelArts after a training job is created.

user_name

String

Name of the user who created the training job. The username is generated and returned by ModelArts after a training job is created.

annotations

Map<String,String>

Advanced configuration of a training job. Options:

  • job_template: Template RL (heterogeneous job)

  • fault-tolerance/job-retry-num: 3 (number of retries upon a fault)

Table 46 Status

Parameter

Type

Description

phase

String

Level-1 status of a training job. Options:

  • Creating

  • Pending

  • Running

  • Failed

  • Completed

  • Terminating

  • Terminated

  • Abnormal

secondary_phase

String

Level-2 status of a training job. This is an internal detailed status whose values may be added, modified, or deleted, so do not depend on it. Options:

  • Creating

  • Queuing

  • Running

  • Failed

  • Completed

  • Terminating

  • Terminated

  • CreateFailed

  • TerminatedFailed

  • Unknown

  • Lost

duration

Long

Running duration of a training job, in milliseconds.

node_count_metrics

Array<Array<Integer>>

Node count changes during the training job running period.

tasks

Array of strings

Tasks of a training job.

start_time

Long

Start time of a training job. The value is in timestamp format.

task_statuses

Array of task_statuses objects

Status of a training job task.

running_records

Array of running_records objects

Running and fault recovery records of a training job.
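
A client typically reads phase and duration from the Status object. The sketch below converts the millisecond duration into a readable value and classifies the phase. Which level-1 phases count as final is not stated in this table, so the set used here is an assumption, and the function name is hypothetical.

```python
from datetime import timedelta

# Assumption: these level-1 phases mean the job will not change state again.
TERMINAL_PHASES = {"Completed", "Failed", "Terminated", "Abnormal"}


def summarize_status(status: dict) -> str:
    """Describe a Status object using its phase and duration fields."""
    phase = status["phase"]
    # duration is reported in milliseconds.
    elapsed = timedelta(milliseconds=status.get("duration", 0))
    state = "finished" if phase in TERMINAL_PHASES else "in progress"
    return f"{phase} ({state}), ran for {elapsed}"


print(summarize_status({"phase": "Running", "duration": 90000}))
```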

Table 47 task_statuses

Parameter

Type

Description

task

String

Name of a training job task.

exit_code

Integer

Exit code of a training job task.

message

String

Error message of a training job task.

Table 48 running_records

Parameter

Type

Description

start_at

Integer

Unix timestamp of the start time in the current running record, in seconds.

end_at

Integer

Unix timestamp of the end time in the current running record, in seconds.

start_type

String

Startup mode of the current running record. Options:

  • init_or_rescheduled: This startup is the first run after scheduling, including the first startup and the run after scheduling recovery.

  • restarted: This startup is not the first run after scheduling but the run after a process restart.

end_reason

String

Reason why the current running record ends.

end_related_task

String

ID of the task worker that causes the end of the current running record, for example, worker-0.

end_recover

String

Fault tolerance policy used after the current running record ends. Options:

  • npu_proc_restart: NPU in-place hot recovery

  • gpu_proc_restart: GPU in-place hot recovery

  • proc_restart: Process in-place recovery

  • pod_reschedule: Pod-level rescheduling

  • job_reschedule: Job-level rescheduling

  • job_reschedule_with_taint: Isolated job-level rescheduling

end_recover_before_downgrade

String

Tolerance policy used after the current running record ends and before the fault tolerance policy is degraded. The options are the same as those of end_recover.
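
Since start_at and end_at are Unix timestamps in seconds, the length of each run and the recovery policy that followed it can be derived directly from running_records. The following is a sketch with a hypothetical function name; skipping records whose end_at is missing or 0 is an assumption about how an unfinished run is reported.

```python
def summarize_running_records(records):
    """Return the start type, run length in seconds, and recovery policy per record."""
    summary = []
    for record in records:
        start, end = record.get("start_at"), record.get("end_at")
        # Assumption: a record without both timestamps has not ended yet.
        if not start or not end:
            continue
        summary.append(
            {
                "start_type": record.get("start_type"),
                "seconds": end - start,
                "end_recover": record.get("end_recover"),
            }
        )
    return summary


records = [
    {"start_at": 1700000000, "end_at": 1700000600,
     "start_type": "init_or_rescheduled", "end_recover": "proc_restart"},
    {"start_at": 1700000660, "start_type": "restarted"},
]
print(summarize_running_records(records))
```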

Table 49 JobAlgorithmResponse

Parameter

Type

Description

id

String

Algorithm ID.

name

String

Algorithm name.

subscription_id

String

Subscription ID of a subscribed algorithm, which must be used with item_version_id

item_version_id

String

Version ID of the subscribed algorithm, which must be used with subscription_id

code_dir

String

Code directory of a training job, for example, /usr/app/. This parameter must be used together with boot_file. If id or subscription_id+item_version_id is set, leave it blank.

boot_file

String

Boot file of a training job, which must be stored in the code directory, for example, /usr/app/boot.py. This parameter must be used with code_dir. Leave this parameter blank if id, or subscription_id and item_version_id are specified.

autosearch_config_path

String

YAML configuration path of auto search jobs. An OBS URL is required.

autosearch_framework_path

String

Framework code directory of auto search jobs. An OBS URL is required.

command

String

Boot command used to start the container of a custom image of a training job. For example, python train.py.

parameters

Array of Parameter objects

Running parameters of a training job.

policies

policies object

Policies supported by jobs.

inputs

Array of Input objects

Input of a training job.

outputs

Array of Output objects

Output of a training job.

engine

engine object

Engine of a training job. Leave this parameter blank if the job is created using id of the algorithm in algorithm management, or subscription_id+item_version_id of the subscribed algorithm.

local_code_dir

String

Local directory in the training container to which the algorithm code directory is downloaded. Rules:

  • The value must be a directory in /home.

  • In v1 compatibility mode, the current field does not take effect.

  • When code_dir is prefixed with file://, the current field does not take effect.

working_dir

String

Work directory where an algorithm is executed. Note that this parameter does not take effect in v1 compatibility mode.

environments

Array of Map<String,String> objects

Environment variables of a training job. The format is key: value. Leave this parameter blank.

Table 50 Parameter

Parameter

Type

Description

name

String

Parameter name.

value

String

Parameter value.

description

String

Parameter description.

constraint

constraint object

Parameter constraint.

i18n_description

i18n_description object

Internationalization description.

Table 51 constraint

Parameter

Type

Description

type

String

Parameter type.

editable

Boolean

Whether the parameter is editable.

required

Boolean

Whether the parameter is mandatory.

sensitive

Boolean

Whether the parameter is sensitive. This function is not implemented currently.

valid_type

String

Valid type.

valid_range

Array of strings

Valid range.

Table 52 i18n_description

Parameter

Type

Description

language

String

Language. Options:

  • zh-cn: Chinese

  • en-us: English

description

String

Description.

Table 53 policies

Parameter

Type

Description

auto_search

auto_search object

Hyperparameter search configuration.

Table 55 reward_attrs

Parameter

Type

Description

name

String

Metric name.

mode

String

Search direction.

  • max: A larger metric value indicates better performance.

  • min: A smaller metric value indicates better performance.

regex

String

Regular expression of a metric.

Table 56 search_params

Parameter

Type

Description

name

String

Hyperparameter name.

param_type

String

Parameter type. Options:

  • continuous: The hyperparameter is of the continuous type. When an algorithm is used in a training job, continuous hyperparameters are displayed as text boxes on the console.

  • discrete: The hyperparameter is of the discrete type. When an algorithm is used in a training job, discrete hyperparameters are displayed as drop-down lists on the console.

lower_bound

String

Lower bound of the hyperparameter.

upper_bound

String

Upper bound of the hyperparameter.

discrete_points_num

String

Number of discrete points of a continuous hyperparameter.

discrete_values

Array of strings

List of discrete hyperparameter values.

Table 57 algo_configs

Parameter

Type

Description

name

String

Name of the search algorithm.

params

Array of AutoSearchAlgoConfigParameter objects

Search algorithm parameters.

Table 58 AutoSearchAlgoConfigParameter

Parameter

Type

Description

key

String

Parameter key.

value

String

Parameter value.

type

String

Parameter type.

Table 59 Input

Parameter

Type

Description

name

String

Name of the data input channel.

description

String

Description of the data input channel.

local_dir

String

Local directory of the container to which the data input channel is mapped.

remote

InputDataInfo object

Data input. Options:

  • dataset: Dataset as the data input

  • obs: OBS path as the data input

remote_constraint

Array of remote_constraint objects

Data input constraint.

Table 60 InputDataInfo

Parameter

Type

Description

dataset

dataset object

Dataset as the data input.

obs

obs object

OBS path in which the input and output data is stored.

Table 61 dataset

Parameter

Type

Description

id

String

Dataset ID of a training job.

version_id

String

Dataset version ID of a training job.

obs_url

String

OBS URL of the dataset required by a training job. ModelArts automatically parses and generates the URL based on the dataset and dataset version IDs. For example, /usr/data/.

Table 62 obs

Parameter

Type

Description

obs_url

String

OBS URL of the dataset required by a training job. For example, /usr/data/.

Table 63 remote_constraint

Parameter

Type

Description

data_type

String

Data input type, including the data storage location and dataset.

attributes

String

Attributes if a dataset is used as the data input. Options:

  • data_format: Data format

  • data_segmentation: Data segmentation

  • dataset_type: Labeling type

Table 64 Output

Parameter

Type

Description

name

String

Name of the data output channel.

description

String

Description of the data output channel.

local_dir

String

Local directory of the container to which the data output channel is mapped.

remote

remote object

Description of the actual data output.

Table 65 remote

Parameter

Type

Description

obs

obs object

OBS to which data is actually exported.

Table 66 obs

Parameter

Type

Description

obs_url

String

OBS URL to which data is actually exported.

Table 67 engine

Parameter

Type

Description

engine_id

String

Engine ID selected for a training job. The engine is specified using engine_id, engine_name + engine_version, or image_url.

engine_name

String

Name of the engine selected for a training job. If engine_id is set, leave this parameter blank.

engine_version

String

Name of the engine version selected for a training job. If engine_id is set, leave this parameter blank.

image_url

String

Custom image URL selected for a training job.

Table 68 TaskResponse

Parameter

Type

Description

role

String

Task role. This function is not supported currently.

algorithm

algorithm object

Algorithm management and configuration.

task_resource

FlavorResponse object

Flavors of a training job or an algorithm.

Table 69 algorithm

Parameter

Type

Description

code_dir

String

Absolute path of the directory where the algorithm boot file is stored.

boot_file

String

Absolute path of the algorithm boot file.

inputs

inputs object

Algorithm input channel.

outputs

outputs object

Algorithm output channel.

engine

engine object

Engine on which a heterogeneous job depends.

local_code_dir

String

Local directory in the training container to which the algorithm code directory is downloaded. Rules:

  • The value must be a directory in /home.

  • In v1 compatibility mode, the current field does not take effect.

  • When code_dir is prefixed with file://, the current field does not take effect.

working_dir

String

Work directory where an algorithm is executed. Note that this parameter does not take effect in v1 compatibility mode.

Table 70 inputs

Parameter

Type

Description

name

String

Name of the data input channel.

local_dir

String

Local path of the container to which the data input and output channels are mapped.

remote

remote object

Actual data input. Heterogeneous jobs support only OBS.

Table 71 remote

Parameter

Type

Description

obs

obs object

OBS path in which the input and output data is stored.

Table 72 obs

Parameter

Type

Description

obs_url

String

OBS URL of the dataset required by a training job. For example, /usr/data/.

Table 73 outputs

Parameter

Type

Description

name

String

Name of the data output channel.

local_dir

String

Local directory of the container to which the data output channel is mapped.

remote

remote object

Description of the actual data output.

mode

String

Data transmission mode. The default value is upload_periodically.

period

String

Data transmission period. The default value is 30s.

Table 74 remote

Parameter

Type

Description

obs

obs object

OBS to which data is actually exported.

Table 75 obs

Parameter

Type

Description

obs_url

String

OBS URL to which data is actually exported.

Table 76 engine

Parameter

Type

Description

engine_id

String

Engine ID of a heterogeneous job, for example, caffe-1.0.0-python2.7.

engine_name

String

Engine name of a heterogeneous job, for example, Caffe.

engine_version

String

Engine version of a heterogeneous job.

v1_compatible

Boolean

Whether the v1 compatibility mode is used.

run_user

String

UID of the user that the engine runs as by default.

image_url

String

Custom image URL selected by an algorithm.

Table 77 FlavorResponse

Parameter

Type

Description

flavor_id

String

ID of the resource flavor.

flavor_name

String

Name of the resource flavor.

max_num

Integer

Maximum number of nodes in a resource flavor.

flavor_type

String

Resource flavor type. Options:

  • CPU

  • GPU

  • Ascend

billing

billing object

Billing information of a resource flavor.

flavor_info

flavor_info object

Resource flavor details.

attributes

Map<String,String>

Other specification attributes.

Table 78 billing

Parameter

Type

Description

code

String

Billing code.

unit_num

Integer

Number of billing units.

Table 79 flavor_info

Parameter

Type

Description

max_num

Integer

Maximum number of nodes that can be selected. The value 1 indicates that the distributed mode is not supported.

cpu

cpu object

CPU specifications.

gpu

gpu object

GPU specifications.

npu

npu object

Ascend specifications.

memory

memory object

Memory information.

disk

disk object

Disk information.
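
The max_num rule in the table above can be turned into a simple check for distributed training support. The following is a minimal sketch, assuming the response body has already been parsed into a Python dict. The helper name supports_distributed is hypothetical, and treating a missing max_num as 1 is an assumption of this sketch, not documented behavior.

```python
def supports_distributed(flavor_info: dict) -> bool:
    """Return True if the flavor allows more than one node.

    Per the flavor_info table, max_num == 1 means that the distributed
    mode is not supported. A missing max_num is treated as 1 here
    (assumption made for this sketch).
    """
    return int(flavor_info.get("max_num", 1)) > 1


single_node = supports_distributed({"max_num": 1})
multi_node = supports_distributed({"max_num": 4})
```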

Table 80 cpu

Parameter

Type

Description

arch

String

CPU architecture.

core_num

Integer

Number of cores.

Table 81 gpu

Parameter

Type

Description

unit_num

Integer

Number of GPUs.

product_name

String

Product name.

memory

String

Memory.

Table 82 npu

Parameter

Type

Description

unit_num

String

Number of NPUs.

product_name

String

Product name.

memory

String

Memory.

Table 83 memory

Parameter

Type

Description

size

Integer

Memory size.

unit

String

Unit of the memory size.

Table 84 disk

Parameter

Type

Description

size

Integer

Disk size.

unit

String

Unit of the disk size.

Table 85 spec

Parameter

Type

Description

resource

Resource object

Resource flavor of a training job. Specify either flavor_id alone, or pool_id with an optional flavor_id.

volumes

Array of volumes objects

Volumes attached to a training job.

log_export_path

log_export_path object

Export path of training job logs.

Table 86 Resource

Parameter

Type

Description

policy

String

Resource policy of a training job. Options: regular

flavor_id

String

ID of the resource flavor selected for a training job. flavor_id cannot be specified for dedicated resource pools with CPU specifications. The options for dedicated resource pools with GPU/Ascend specifications are as follows:

  • modelarts.pool.visual.xlarge (1 card)

  • modelarts.pool.visual.2xlarge (2 cards)

  • modelarts.pool.visual.4xlarge (4 cards)

  • modelarts.pool.visual.8xlarge (8 cards)

flavor_name

String

Read-only flavor name returned by ModelArts when flavor_id is used.

node_count

Integer

Number of resource replicas selected for a training job.

pool_id

String

Resource pool ID selected for a training job.

flavor_detail

flavor_detail object

Flavors of a training job or an algorithm.
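
The rule for selecting resources (flavor_id alone, or pool_id with an optional flavor_id) can be checked on the client before a request is sent. The following is a minimal sketch under that reading of the rule. The helper name build_resource is hypothetical, and the checks shown are client-side conveniences, not a description of the validation that ModelArts performs.

```python
from typing import Optional


def build_resource(node_count: int = 1,
                   flavor_id: Optional[str] = None,
                   pool_id: Optional[str] = None) -> dict:
    """Build the spec.resource object for a training job request."""
    # At least one of flavor_id or pool_id must identify the resources.
    if flavor_id is None and pool_id is None:
        raise ValueError("specify flavor_id, or pool_id with an optional flavor_id")
    if node_count < 1:
        raise ValueError("node_count must be at least 1")

    resource = {"node_count": node_count}
    if pool_id is not None:
        resource["pool_id"] = pool_id
    if flavor_id is not None:
        resource["flavor_id"] = flavor_id
    return resource


# Public flavor only, and dedicated pool plus flavor, using values
# taken from the example requests in this document.
public = build_resource(flavor_id="modelarts.p3.large.public.free")
dedicated = build_resource(flavor_id="modelarts.pool.visual.xlarge",
                           pool_id="poolfaf38d76")
```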

Table 87 flavor_detail

Parameter

Type

Description

flavor_type

String

Resource flavor type. Options:

  • CPU

  • GPU

  • Ascend

billing

billing object

Billing information of a resource flavor.

flavor_info

flavor_info object

Resource flavor details.

Table 88 billing

Parameter

Type

Description

code

String

Billing code.

unit_num

Integer

Number of billing units.

Table 89 flavor_info

Parameter

Type

Description

max_num

Integer

Maximum number of nodes that can be selected. The value 1 indicates that the distributed mode is not supported.

cpu

cpu object

CPU specifications.

gpu

gpu object

GPU specifications.

npu

npu object

Ascend specifications.

memory

memory object

Memory information.

disk

disk object

Disk information.

Table 90 cpu

Parameter

Type

Description

arch

String

CPU architecture.

core_num

Integer

Number of cores.

Table 91 gpu

Parameter

Type

Description

unit_num

Integer

Number of GPUs.

product_name

String

Product name.

memory

String

Memory.

Table 92 npu

Parameter

Type

Description

unit_num

String

Number of NPUs.

product_name

String

Product name.

memory

String

Memory.

Table 93 memory

Parameter

Type

Description

size

Integer

Memory size.

unit

String

Unit of the memory size.

Table 94 disk

Parameter

Type

Description

size

String

Disk size.

unit

String

Unit of the disk size. Generally, the value is GB.

Table 95 volumes

Parameter

Type

Description

nfs

nfs object

Volumes attached in NFS mode.

Table 96 nfs

Parameter

Type

Description

nfs_server_path

String

NFS server path.

local_path

String

Path for attaching volumes to the training container.

read_only

Boolean

Whether the volumes attached to the container in NFS mode are read-only.
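
The three nfs fields map directly onto one entry of the volumes array. The following is a minimal sketch of building such an entry. The helper name nfs_volume is hypothetical, and defaulting read_only to False is an assumption of this sketch rather than a documented default.

```python
def nfs_volume(nfs_server_path: str, local_path: str,
               read_only: bool = False) -> dict:
    """Build one entry of spec.volumes for an NFS mount."""
    return {
        "nfs": {
            "nfs_server_path": nfs_server_path,  # NFS server path
            "local_path": local_path,            # mount path in the training container
            "read_only": bool(read_only),
        }
    }


# Values taken from the custom image example request in this document.
volumes = [nfs_volume("192.168.0.82:/", "/home/ma-user/nfs/")]
```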

Table 97 log_export_path

Parameter

Type

Description

obs_url

String

OBS URL for storing training job logs.

host_path

String

Path of the host where training job logs are stored.

Status code: 400

Table 98 Response body parameters

Parameter

Type

Description

error_msg

String

Error message.

error_code

String

Error code.

error_solution

String

Solution.

Example Requests

  • The following example creates a training job that uses a free flavor. The job is named TestModelArtsJob, its description is This is a ModelArts job, and it uses the algorithm whose ID is 3f5d6706-7b67-408d-8ba0-ec08048c45ed. No inputs or outputs have been defined for the algorithm.

    POST https://endpoint/v2/{project_id}/training-jobs
    
    {
      "kind" : "job",
      "metadata" : {
        "name" : "TestModelArtsJob",
        "description" : "This is a ModelArts job"
      },
      "algorithm" : {
        "id" : "3f5d6706-7b67-408d-8ba0-ec08048c45ed",
        "parameters" : [ {
          "name" : "input_dir",
          "value" : "obs://cn-north-4-rse/test/moxingtest-dir/"
        }, {
          "name" : "input_file",
          "value" : "obs://cn-north-4-rse/test/moxingtest/"
        }, {
          "name" : "large_file_method",
          "value" : "1"
        } ],
        "policies" : {
          "auto_search" : null
        },
        "environments" : { }
      },
      "spec" : {
        "resource" : {
          "flavor_id" : "modelarts.p3.large.public.free",
          "node_count" : 1
        },
        "log_export_path" : {
          "obs_url" : ""
        }
      }
    }

  • The following example uses a custom image to create a training job named TestModelArtsJob2 with the description This is a ModelArts job2. The job runs in a dedicated resource pool and mounts an NFS volume.

    POST https://endpoint/v2/{project_id}/training-jobs
    
    {
      "kind" : "job",
      "metadata" : {
        "name" : "TestModelArtsJob2",
        "description" : "This is a ModelArts job2"
      },
      "algorithm" : {
        "engine" : {
          "image_url" : "xxxxxxxx/fastseq:1.2"
        },
        "command" : "cd /home/ma-user/ddp_demo && sh run_ddp.sh",
        "parameters" : [ ],
        "policies" : {
          "auto_search" : null
        },
        "environments" : {
          "NCCL_DEBUG" : "INFO",
          "NCCL_IB_DISABLE" : "0"
        }
      },
      "spec" : {
        "resource" : {
          "flavor_id" : "modelarts.pool.visual.xlarge",
          "node_count" : 1,
          "pool_id" : "poolfaf38d76"
        },
        "log_export_path" : {
          "obs_url" : "/cn-north-4-training-test/limou/ddp-demo-log/"
        },
        "volumes" : [ {
          "nfs" : {
            "nfs_server_path" : "192.168.0.82:/",
            "local_path" : "/home/ma-user/nfs/",
            "read_only" : false
          }
        } ]
      }
    }
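
The example requests above can also be assembled in code. The following is a minimal sketch that uses only the Python standard library and builds the request without sending it. The function name build_create_job_request, the placeholder endpoint, project ID, and token values, and the use of an X-Auth-Token header for authentication are assumptions made for illustration. Use the authentication method that applies to your account.

```python
import json
import urllib.request


def build_create_job_request(endpoint: str, project_id: str,
                             token: str, body: dict) -> urllib.request.Request:
    """Prepare POST /v2/{project_id}/training-jobs without sending it."""
    url = f"https://{endpoint}/v2/{project_id}/training-jobs"
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            "X-Auth-Token": token,  # assumption: token-based authentication
        },
    )


# Trimmed version of the first example request body.
body = {
    "kind": "job",
    "metadata": {
        "name": "TestModelArtsJob",
        "description": "This is a ModelArts job",
    },
    "algorithm": {"id": "3f5d6706-7b67-408d-8ba0-ec08048c45ed"},
    "spec": {
        "resource": {
            "flavor_id": "modelarts.p3.large.public.free",
            "node_count": 1,
        }
    },
}

# "endpoint", "my-project-id", and "my-token" are placeholders.
req = build_create_job_request("endpoint", "my-project-id", "my-token", body)
# To send the request: urllib.request.urlopen(req)
```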

Example Responses

Status code: 201

ok

{
  "kind" : "job",
  "metadata" : {
    "id" : "425b7087-83de-49ed-9e40-5bb642be956f",
    "name" : "TestModelArtsJob",
    "description" : "This is a ModelArts job",
    "create_time" : 1637045545982,
    "workspace_id" : "0",
    "user_name" : ""
  },
  "status" : {
    "phase" : "Creating",
    "secondary_phase" : "Creating",
    "duration" : 0,
    "start_time" : 0,
    "node_count_metrics" : null,
    "tasks" : [ "worker-0", "server-0" ]
  },
  "algorithm" : {
    "id" : "3f5d6706-7b67-408d-8ba0-ec08048c45ed",
    "name" : "ttt-obs-gpu",
    "code_dir" : "/cn-north-4-rse/test/moxingtest-code/",
    "boot_file" : "/cn-north-4-rse/test/moxingtest-code/test_obs_gpu.py",
    "parameters" : [ {
      "name" : "input_dir",
      "description" : "",
      "i18n_description" : null,
      "value" : "obs://cn-north-4-rse/test/moxingtest-dir/",
      "constraint" : {
        "type" : "String",
        "editable" : true,
        "required" : true,
        "sensitive" : false,
        "valid_type" : "None",
        "valid_range" : [ ]
      }
    }, {
      "name" : "input_file",
      "description" : "",
      "i18n_description" : null,
      "value" : "obs://cn-north-4-rse/test/moxingtest/",
      "constraint" : {
        "type" : "String",
        "editable" : true,
        "required" : true,
        "sensitive" : false,
        "valid_type" : "None",
        "valid_range" : [ ]
      }
    }, {
      "name" : "large_file_method",
      "description" : "",
      "i18n_description" : null,
      "value" : "1",
      "constraint" : {
        "type" : "Integer",
        "editable" : true,
        "required" : true,
        "sensitive" : false,
        "valid_type" : "None",
        "valid_range" : [ ]
      }
    } ],
    "engine" : {
      "engine_id" : "horovod-cp36-tf-1.16.2",
      "engine_name" : "Horovod",
      "engine_version" : "0.16.2-TF-1.13.1-python3.6"
    },
    "policies" : { }
  },
  "spec" : {
    "resource" : {
      "policy" : "regular",
      "flavor_id" : "modelarts.p3.large.public.free",
      "flavor_name" : "Computing GPU(V100) instance",
      "node_count" : 1,
      "flavor_detail" : {
        "flavor_type" : "GPU",
        "billing" : {
          "code" : "modelarts.vm.gpu.free",
          "unit_num" : 1
        },
        "flavor_info" : {
          "cpu" : {
            "arch" : "x86",
            "core_num" : 8
          },
          "gpu" : {
            "unit_num" : 1,
            "product_name" : "NVIDIA-V100",
            "memory" : "32GB"
          },
          "memory" : {
            "size" : 64,
            "unit" : "GB"
          }
        }
      }
    },
    "log_export_path" : { }
  }
}
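
A client usually needs only a few fields from this response, such as the job ID and the current phase. The following is a minimal sketch of extracting them from the response text. The helper name summarize_job and the choice of fields are assumptions made for illustration, and the sample is a trimmed copy of the response above.

```python
import json


def summarize_job(response_text: str) -> dict:
    """Extract commonly needed fields from a 201 response body."""
    job = json.loads(response_text)
    # spec and resource are read defensively because they may be absent.
    resource = job.get("spec", {}).get("resource", {})
    return {
        "id": job["metadata"]["id"],
        "name": job["metadata"]["name"],
        "phase": job["status"]["phase"],
        "flavor_id": resource.get("flavor_id"),
    }


sample = """
{
  "kind": "job",
  "metadata": {
    "id": "425b7087-83de-49ed-9e40-5bb642be956f",
    "name": "TestModelArtsJob"
  },
  "status": {"phase": "Creating"},
  "spec": {
    "resource": {
      "flavor_id": "modelarts.p3.large.public.free",
      "node_count": 1
    }
  }
}
"""

summary = summarize_job(sample)
```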

Status code: 400

Body of a common error response. The following example is returned when the algorithm with ID 3f5d6706-7b67-408d-8ba0-ec08048c45ee is not found.

{
  "error_msg" : "algorithm not found.",
  "error_code" : "ModelArts.2755",
  "error_solution" : "Check whether the training project information in the request is valid."
}
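
The three fields of the error body can be wrapped in an exception so that callers see the error code and the suggested solution together. The following is a minimal sketch. The class name TrainingJobError and the function name parse_error are hypothetical, and falling back to empty strings for missing fields is an assumption of this sketch.

```python
import json


class TrainingJobError(Exception):
    """Carries error_code, error_msg, and error_solution from an error body."""

    def __init__(self, code: str, message: str, solution: str):
        super().__init__(f"{code}: {message}")
        self.code = code
        self.message = message
        self.solution = solution


def parse_error(body_text: str) -> TrainingJobError:
    """Convert an error response body into a TrainingJobError."""
    body = json.loads(body_text)
    return TrainingJobError(
        body.get("error_code", ""),
        body.get("error_msg", ""),
        body.get("error_solution", ""),
    )


# Copy of the example 400 response body above.
sample_error = """
{
  "error_msg": "algorithm not found.",
  "error_code": "ModelArts.2755",
  "error_solution": "Check whether the training project information in the request is valid."
}
"""

err = parse_error(sample_error)
```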

Status Codes

Status Code

Description

201

ok

400

Body of a common error response. The following example is returned when the algorithm with ID 3f5d6706-7b67-408d-8ba0-ec08048c45ee is not found.

Error Codes

See Error Codes.