Creating a Training Job
Function
This API is used to create a training job.
Debugging
You can debug this API through automatic authentication in API Explorer or use the SDK sample code generated by API Explorer.
URI
POST /v2/{project_id}/training-jobs
Parameter | Mandatory | Type | Description |
---|---|---|---|
project_id | Yes | String | Project ID. For details, see Obtaining a Project ID and Name. |

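The sketch below shows how the URI and the request body fit together. It uses only Python's standard library and builds the request without sending it. It assumes token-based IAM authentication through an `X-Auth-Token` header. The endpoint host, project ID, and token are hypothetical placeholders, so look up the ModelArts endpoint for your region.

```python
import json
import urllib.request


def build_create_job_request(endpoint: str, project_id: str, token: str, body: dict) -> urllib.request.Request:
    """Build, without sending, a POST /v2/{project_id}/training-jobs request."""
    if not project_id:
        raise ValueError("project_id is mandatory in the URI")
    url = f"{endpoint.rstrip('/')}/v2/{project_id}/training-jobs"
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Assumption: IAM token authentication via the X-Auth-Token header.
            "X-Auth-Token": token,
        },
    )


# Hypothetical endpoint, project ID, and token, for illustration only.
request = build_create_job_request(
    "https://modelarts.example-region.example.com",
    "example-project-id",
    "example-token",
    {"kind": "job", "metadata": {"name": "demo-job"}},
)
```

Passing the object to `urllib.request.urlopen` would send it. That step is omitted here because it needs real credentials.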
Request Parameters
Parameter | Mandatory | Type | Description |
---|---|---|---|
kind | Yes | String | Training job type. The default value is job. Options: job (training job) and visualization_job (visualization job). |
metadata | Yes | JobMetadata object | Metadata of a training job. |
algorithm | No | JobAlgorithm object | Algorithm used by a training job. The options are as follows: |
tasks | No | Array of Task objects | Task list. This function is not implemented currently. |
spec | No | Spec object | Specifications of a training job. If this parameter is specified, leave the tasks parameter blank. |
endpoints | No | JobEndpointsReq object | Configurations required for remotely accessing a training job. |

JobMetadata

Parameter | Mandatory | Type | Description |
---|---|---|---|
name | Yes | String | Name of a training job. The value must contain 1 to 64 characters consisting of only digits, letters, underscores (_), and hyphens (-). |
workspace_id | No | String | Workspace where a job is located. The default value is 0. |
description | No | String | Training job description. The value must contain 0 to 256 characters. The default value is NULL. |
annotations | No | Map<String,String> | Advanced configurations of a training job. The options are as follows: |

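The name and description limits above can be checked on the client before a request is sent. The following is a minimal sketch. It assumes "letters" means ASCII letters, which the table does not state.

```python
import re

# 1 to 64 characters: digits, letters (assumed ASCII), underscores, hyphens.
_JOB_NAME = re.compile(r"[A-Za-z0-9_-]{1,64}")


def validate_job_metadata(metadata: dict) -> list[str]:
    """Return a list of rule violations for a JobMetadata dict (empty if none found)."""
    errors = []
    name = metadata.get("name")
    if not isinstance(name, str) or not _JOB_NAME.fullmatch(name):
        errors.append("name must be 1-64 digits, letters, underscores, or hyphens")
    description = metadata.get("description")
    if description is not None and len(description) > 256:
        errors.append("description must contain at most 256 characters")
    return errors
```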
JobAlgorithm

Parameter | Mandatory | Type | Description |
---|---|---|---|
id | No | String | Algorithm ID. |
name | No | String | Algorithm name. Leave this parameter blank. |
subscription_id | No | String | Subscription ID of a subscribed algorithm. This parameter must be used together with item_version_id. |
item_version_id | No | String | Version ID of the subscribed algorithm. This parameter must be used together with subscription_id. |
code_dir | No | String | Code directory of a training job, for example, /usr/app/. This parameter must be used together with boot_file. If id or subscription_id+item_version_id is set, you do not need to set this parameter. |
boot_file | No | String | Boot file of a training job, which must be stored in the code directory, for example, /usr/app/boot.py. This parameter must be used together with code_dir. If id or subscription_id+item_version_id is set, you do not need to set this parameter. |
autosearch_config_path | No | String | YAML configuration path of auto search jobs. An OBS URL is required. |
autosearch_framework_path | No | String | Framework code directory of auto search jobs. An OBS URL is required. |
command | No | String | Command for starting the container of a custom image for a training job. This parameter applies only to the custom image scenario. |
parameters | No | Array of Parameters objects | Running parameters of a training job. |
policies | No | JobPolicies object | Policies supported by jobs, which are used for hyperparameter search. |
inputs | No | Array of Input objects | Input of a training job. |
outputs | No | Array of Output objects | Output of a training job. |
engine | No | JobEngine object | Engine of a training job. Leave this parameter blank if the job is created using id of the algorithm in algorithm management, or subscription_id+item_version_id of the subscribed algorithm. |
local_code_dir | No | String | Local directory of the training container to which the algorithm code directory is downloaded. The rules are as follows: |
working_dir | No | String | Working directory where an algorithm is executed. Note that this parameter does not take effect in v1 compatibility mode. |
environments | No | Map<String,String> | Environment variables of a training job, in the format of "key":"value". The key can contain a maximum of 8192 characters, and the value can contain a maximum of 4096 characters. A maximum of 100 key-value pairs are allowed. The variable name can contain only letters, digits, and underscores (_), and must start with a letter or underscore (_). |
summary | No | Summary object | Visualization log summary. |

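Several JobAlgorithm fields depend on each other: subscription_id goes with item_version_id, and code_dir goes with boot_file. The environments map also has size and naming limits. The checker below is a minimal sketch of those rules as written in the table. It additionally assumes two things that the table implies but does not spell out: code_dir and boot_file are plain path prefixes, and "letters" means ASCII letters.

```python
import re

# Variable names: letters, digits, underscores; must start with a letter or underscore.
_ENV_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def check_algorithm(algorithm: dict) -> list[str]:
    """Return rule violations for a JobAlgorithm dict (an empty list means none found)."""
    errors = []
    if bool(algorithm.get("subscription_id")) != bool(algorithm.get("item_version_id")):
        errors.append("subscription_id and item_version_id must be used together")

    # code_dir/boot_file are only needed when no managed (id) or subscribed algorithm is used.
    managed = bool(algorithm.get("id") or algorithm.get("subscription_id"))
    code_dir, boot_file = algorithm.get("code_dir"), algorithm.get("boot_file")
    if not managed:
        if bool(code_dir) != bool(boot_file):
            errors.append("code_dir and boot_file must be used together")
        elif code_dir and not boot_file.startswith(code_dir.rstrip("/") + "/"):
            errors.append("boot_file must be stored in code_dir")

    envs = algorithm.get("environments") or {}
    if len(envs) > 100:
        errors.append("at most 100 environment variables are allowed")
    for key, value in envs.items():
        if len(key) > 8192 or not _ENV_NAME.fullmatch(key):
            errors.append(f"invalid environment variable name: {key!r}")
        if len(value) > 4096:
            errors.append(f"value of {key!r} exceeds 4096 characters")
    return errors
```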
Parameters

Parameter | Mandatory | Type | Description |
---|---|---|---|
name | No | String | Parameter name. |
value | No | String | Parameter value. |
description | No | String | Parameter description. |
constraint | No | ParametersConstraint object | Parameter constraint. |
i18n_description | No | I18nDescription object | Internationalization description. |

ParametersConstraint

Parameter | Mandatory | Type | Description |
---|---|---|---|
type | No | String | Parameter type. |
editable | No | Boolean | Whether the parameter is editable. |
required | No | Boolean | Whether the parameter is mandatory. |
sensitive | No | Boolean | Whether the parameter is sensitive. This function is not implemented currently. |
valid_type | No | String | Valid type. |
valid_range | No | Array of strings | Valid range. |

I18nDescription

Parameter | Mandatory | Type | Description |
---|---|---|---|
language | No | String | Internationalization language. |
description | No | String | Description. |

JobPolicies

Parameter | Mandatory | Type | Description |
---|---|---|---|
auto_search | No | AutoSearch object | Hyperparameter search configuration. |

AutoSearch

Parameter | Mandatory | Type | Description |
---|---|---|---|
skip_search_params | No | String | Hyperparameters to skip during the search. |
reward_attrs | No | Array of RewardAttrs objects | Search metrics. |
search_params | No | Array of SearchParams objects | Search parameters. |
algo_configs | No | Array of AlgoConfigs objects | Search algorithm configurations. |

RewardAttrs

Parameter | Mandatory | Type | Description |
---|---|---|---|
name | No | String | Metric name. |
mode | No | String | Search mode. max: a larger metric value is better. min: a smaller metric value is better. |
regex | No | String | Regular expression of a metric. |

SearchParams

Parameter | Mandatory | Type | Description |
---|---|---|---|
name | No | String | Hyperparameter name. |
param_type | No | String | Parameter type. continuous: the hyperparameter takes continuous values and, when the algorithm is used in a training job, is displayed as a text box on the console. discrete: the hyperparameter takes discrete values and, when the algorithm is used in a training job, is displayed as a drop-down list on the console. |
lower_bound | No | String | Lower bound of the hyperparameter. |
upper_bound | No | String | Upper bound of the hyperparameter. |
discrete_points_num | No | String | Number of discrete points of a hyperparameter with continuous values. |
discrete_values | No | Array of strings | Discrete hyperparameter values. |

AlgoConfigs

Parameter | Mandatory | Type | Description |
---|---|---|---|
name | No | String | Name of the search algorithm. |
params | No | Array of AutoSearchAlgoConfigParameter objects | Search algorithm parameters. |

AutoSearchAlgoConfigParameter

Parameter | Mandatory | Type | Description |
---|---|---|---|
key | No | String | Parameter key. |
value | No | String | Parameter value. |
type | No | String | Parameter type. |

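The sketch below shows a policies.auto_search block and a check against the value sets the tables define: mode is max or min, and param_type is continuous or discrete. The metric name, regular expression, and hyperparameters are hypothetical. Two further rules are assumptions of this sketch, not documented rules: continuous parameters carry numeric bounds with lower below upper, and discrete parameters carry discrete_values.

```python
def validate_auto_search(policies: dict) -> list[str]:
    """Check the auto_search part of a JobPolicies dict against the documented value sets."""
    errors = []
    auto_search = policies.get("auto_search") or {}
    for attr in auto_search.get("reward_attrs", []):
        if attr.get("mode") not in ("max", "min"):
            errors.append(f"reward_attrs {attr.get('name')!r}: mode must be max or min")
    for param in auto_search.get("search_params", []):
        name, ptype = param.get("name"), param.get("param_type")
        if ptype == "continuous":
            try:
                # Bounds are typed as String in the API, so parse them before comparing.
                lower, upper = float(param["lower_bound"]), float(param["upper_bound"])
            except (KeyError, ValueError):
                errors.append(f"{name!r}: continuous parameters need numeric bounds")
                continue
            if lower >= upper:
                errors.append(f"{name!r}: lower_bound must be below upper_bound")
        elif ptype == "discrete":
            if not param.get("discrete_values"):
                errors.append(f"{name!r}: discrete parameters need discrete_values")
        else:
            errors.append(f"{name!r}: param_type must be continuous or discrete")
    return errors


# Hypothetical metric, regex, and hyperparameters for illustration.
policies = {
    "auto_search": {
        "reward_attrs": [{"name": "accuracy", "mode": "max", "regex": r"accuracy=([0-9.]+)"}],
        "search_params": [
            {"name": "learning_rate", "param_type": "continuous",
             "lower_bound": "0.0001", "upper_bound": "0.1"},
            {"name": "batch_size", "param_type": "discrete",
             "discrete_values": ["32", "64", "128"]},
        ],
    }
}
```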
JobEngine

Parameter | Mandatory | Type | Description |
---|---|---|---|
engine_id | No | String | Engine ID selected for a training job. An engine can be selected using engine_id, engine_name + engine_version, or image_url. |
engine_name | No | String | Name of the engine selected for a training job. If engine_id has been set, you do not need to set this parameter. |
engine_version | No | String | Version of the engine selected for a training job. If engine_id has been set, you do not need to set this parameter. |
image_url | No | String | Custom image URL selected for a training job. The URL is obtained from SWR. |
install_sys_packages | No | Boolean | Whether to install the MoXing version specified by the training platform. Value true means to install the specified MoXing version. This parameter is available only when engine_name, engine_version, and image_url are set. |

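The table allows three ways to choose an engine: engine_id, engine_name + engine_version, or image_url. The helper below reports which one a JobEngine dict uses. When several are present, its precedence (engine_id first, then name and version, then image_url) is this sketch's own assumption, not documented API behavior.

```python
def engine_selection(engine: dict) -> str:
    """Return which documented selection method a JobEngine dict uses, or raise ValueError."""
    if engine.get("engine_id"):
        # engine_name/engine_version need not be set when engine_id is present.
        return "engine_id"
    has_name, has_version = bool(engine.get("engine_name")), bool(engine.get("engine_version"))
    if has_name != has_version:
        raise ValueError("engine_name and engine_version must be set together")
    if has_name:
        return "engine_name+engine_version"
    if engine.get("image_url"):
        return "image_url"
    raise ValueError("set engine_id, engine_name + engine_version, or image_url")
```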
Summary

Parameter | Mandatory | Type | Description |
---|---|---|---|
log_type | No | String | Visualization log type of a training job. After this parameter is configured, the training job can be used as the data source of a visualization job. The options are as follows: |
log_dir | No | LogDir object | Visualization log output of a training job. This parameter is mandatory when log_type is not empty. |
data_sources | No | Array of DataSource objects | Visualization log input of a visualization job or debug training job. This parameter is mandatory when tensorboard/enable or mindstudio-insight/enable is set to true for advanced training functions. |

LogDir

Parameter | Mandatory | Type | Description |
---|---|---|---|
pfs | Yes | PFSSummary object | Output of an OBS parallel file system. |

PFSSummary

Parameter | Mandatory | Type | Description |
---|---|---|---|
pfs_path | Yes | String | URL of an OBS parallel file system. |

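The Summary, LogDir, and PFSSummary tables nest as summary.log_dir.pfs.pfs_path, and log_dir is mandatory whenever log_type is non-empty. The builder below is a minimal sketch of that rule. The log type string and the parallel file system URL used with it are hypothetical, because the list of log_type options is not reproduced above.

```python
from typing import Optional


def build_summary(log_type: Optional[str], pfs_path: Optional[str]) -> dict:
    """Build a Summary dict; log_dir is mandatory whenever log_type is non-empty."""
    if not log_type:
        return {}
    if not pfs_path:
        raise ValueError("log_dir (pfs.pfs_path) is mandatory when log_type is set")
    return {"log_type": log_type, "log_dir": {"pfs": {"pfs_path": pfs_path}}}


# Hypothetical log type and OBS parallel file system URL.
summary = build_summary("example_log_type", "obs://example-pfs-bucket/logs/")
```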
Task

Parameter | Mandatory | Type | Description |
---|---|---|---|
role | No | String | Task role. This function is not supported currently. |
algorithm | No | algorithm object | Algorithm management and configuration. |
task_resource | No | task_resource object | Resource flavors of a training job. |

algorithm

Parameter | Mandatory | Type | Description |
---|---|---|---|
job_config | No | job_config object | Algorithm configuration, such as the boot file. |
code_dir | No | String | Algorithm code directory, for example, /usr/app/. This parameter must be used together with boot_file. |
boot_file | No | String | Code boot file of the algorithm, which must be stored in the code directory, for example, /usr/app/boot.py. This parameter must be used together with code_dir. |
engine | No | engine object | Engine of a heterogeneous job algorithm. |
inputs | No | Array of inputs objects | Data input of an algorithm. |
outputs | No | Array of outputs objects | Data output of an algorithm. |
local_code_dir | No | String | Local directory of the training container to which the algorithm code directory is downloaded. The rules are as follows: |
working_dir | No | String | Working directory where an algorithm is executed. Note that this parameter does not take effect in v1 compatibility mode. |

job_config

Parameter | Mandatory | Type | Description |
---|---|---|---|
parameters | No | Array of Parameter objects | Running parameters of an algorithm. |
inputs | No | Array of Input objects | Data input of an algorithm. |
outputs | No | Array of Output objects | Data output of an algorithm. |
engine | No | engine object | Algorithm engine. |

Parameter

Parameter | Mandatory | Type | Description |
---|---|---|---|
name | No | String | Parameter name. |
value | No | String | Parameter value. |
description | No | String | Parameter description. |
constraint | No | constraint object | Parameter constraint. |
i18n_description | No | i18n_description object | Internationalization description. |

constraint

Parameter | Mandatory | Type | Description |
---|---|---|---|
type | No | String | Parameter type. |
editable | No | Boolean | Whether the parameter is editable. |
required | No | Boolean | Whether the parameter is mandatory. |
sensitive | No | Boolean | Whether the parameter is sensitive. This function is not implemented currently. |
valid_type | No | String | Valid type. |
valid_range | No | Array of strings | Valid range. |

i18n_description

Parameter | Mandatory | Type | Description |
---|---|---|---|
language | No | String | Internationalization language. The options are as follows: |
description | No | String | Description in the specified language. |

Input

Parameter | Mandatory | Type | Description |
---|---|---|---|
name | Yes | String | Name of the data input channel. |
description | No | String | Description of the data input channel. |
local_dir | No | String | Local directory of the container to which the data input channel is mapped, for example, /home/ma-user/modelarts/inputs/data_url_0. |
remote | Yes | InputDataInfo object | Information about the data input. Enums: |
remote_constraint | No | Array of remote_constraint objects | Data input constraint. |

InputDataInfo

Parameter | Mandatory | Type | Description |
---|---|---|---|
dataset | No | dataset object | Dataset used as the data input. |
obs | No | obs object | OBS location where the data input and output are stored. |

dataset

Parameter | Mandatory | Type | Description |
---|---|---|---|
id | Yes | String | Dataset ID of a training job. |
version_id | Yes | String | Dataset version ID of a training job. |

obs

Parameter | Mandatory | Type | Description |
---|---|---|---|
obs_url | Yes | String | OBS URL of the dataset required by a training job, for example, /usr/data/. |

remote_constraint

Parameter | Mandatory | Type | Description |
---|---|---|---|
data_type | No | String | Data input type, including the data storage location and dataset. |
attributes | No | String | Attributes if a dataset is used as the data input. Options: |

Output

Parameter | Mandatory | Type | Description |
---|---|---|---|
name | Yes | String | Name of the data output channel. |
description | No | String | Description of the data output channel. |
local_dir | No | String | Local directory of the container to which the data output channel is mapped. |
remote | Yes | Remote object | Description of the actual data output. |

Remote

Parameter | Mandatory | Type | Description |
---|---|---|---|
obs | Yes | RemoteObs object | OBS to which data is actually exported. |

RemoteObs

Parameter | Mandatory | Type | Description |
---|---|---|---|
obs_url | Yes | String | OBS URL to which data is exported. |

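Input and output channels nest as name plus remote, where remote holds either a dataset (id and version_id) or an obs.obs_url. The builders below are a minimal sketch of those shapes. They assume that an input uses exactly one of dataset or OBS, which the InputDataInfo table suggests but does not state. The bucket URLs and local directory are hypothetical.

```python
from typing import Optional


def make_input(name: str, *, obs_url: Optional[str] = None,
               dataset_id: Optional[str] = None, dataset_version_id: Optional[str] = None,
               local_dir: Optional[str] = None) -> dict:
    """Build an Input channel backed by either an OBS URL or a dataset version."""
    if (obs_url is None) == (dataset_id is None):
        raise ValueError("specify exactly one of obs_url or dataset_id")
    if dataset_id is not None:
        if dataset_version_id is None:
            raise ValueError("dataset_version_id is mandatory with dataset_id")
        remote = {"dataset": {"id": dataset_id, "version_id": dataset_version_id}}
    else:
        remote = {"obs": {"obs_url": obs_url}}
    channel = {"name": name, "remote": remote}
    if local_dir:
        channel["local_dir"] = local_dir
    return channel


def make_output(name: str, obs_url: str, local_dir: Optional[str] = None) -> dict:
    """Build an Output channel that exports to OBS."""
    channel = {"name": name, "remote": {"obs": {"obs_url": obs_url}}}
    if local_dir:
        channel["local_dir"] = local_dir
    return channel
```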
engine

Parameter | Mandatory | Type | Description |
---|---|---|---|
engine_id | No | String | Engine ID selected for an algorithm. |
engine_name | No | String | Name of the engine selected for an algorithm. If engine_id is specified, leave this parameter blank. |
engine_version | No | String | Version of the engine selected for an algorithm. If engine_id is specified, leave this parameter blank. |
image_url | No | String | Custom image URL selected for an algorithm. |

engine

Parameter | Mandatory | Type | Description |
---|---|---|---|
engine_id | No | String | Engine ID of a heterogeneous job, for example, caffe-1.0.0-python2.7. |
engine_name | No | String | Engine name of a heterogeneous job, for example, Caffe. |
engine_version | No | String | Engine version of a heterogeneous job. |
image_url | No | String | Custom image URL selected for an algorithm. |

inputs

Parameter | Mandatory | Type | Description |
---|---|---|---|
name | Yes | String | Name of the data input channel. |
description | No | String | Description of the data input channel. |
local_dir | No | String | Local directory of the container to which the data input channel is mapped. |
remote | Yes | remote object | Information about the data input. Enums: |

remote

Parameter | Mandatory | Type | Description |
---|---|---|---|
obs | No | obs object | OBS location where the data input and output are stored. |

obs

Parameter | Mandatory | Type | Description |
---|---|---|---|
obs_url | Yes | String | OBS URL of the dataset required by a training job, for example, /usr/data/. |

outputs

Parameter | Mandatory | Type | Description |
---|---|---|---|
name | Yes | String | Name of the data output channel. |
description | No | String | Description of the data output channel. |
local_dir | No | String | Local directory of the container to which the data output channel is mapped. |
remote | Yes | remote object | Description of the actual data output. |

remote

Parameter | Mandatory | Type | Description |
---|---|---|---|
obs | Yes | obs object | OBS to which data is actually exported. |

obs

Parameter | Mandatory | Type | Description |
---|---|---|---|
obs_url | Yes | String | OBS URL to which data is exported. |

task_resource

Parameter | Mandatory | Type | Description |
---|---|---|---|
flavor_id | No | String | Resource flavor ID of a training job. |
node_count | Yes | Integer | Number of resource replicas selected for a training job. |

Spec

Parameter | Mandatory | Type | Description |
---|---|---|---|
resource | No | SpecResource object | Resource flavors of a training job. Select either flavor_id or pool_id+[flavor_id]. |
volumes | No | Array of SpecVolumes objects | Volumes attached to a training job. |
log_export_path | No | LogExportPath object | Export path of training job logs. |
auto_stop | No | AutoStop object | Auto stop configuration of a training job. |
schedule_policy | No | SchedulePolicy object | Training job scheduling policy. |
notification | No | Notification object | Training event notification. |
custom_metrics | No | Array of CustomMetrics objects | Training metric collection configuration. |

SpecResource

Parameter | Mandatory | Type | Description |
---|---|---|---|
flavor_id | No | String | ID of the resource flavor selected for a training job. flavor_id cannot be specified for dedicated resource pools with CPU specifications. The options for dedicated resource pools with GPU/Ascend specifications are as follows: |
node_count | No | Integer | Number of nodes used for creating a training job in a pool. By default, a single node is used. |
pool_id | No | String | Dedicated resource pool ID. |

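spec.resource takes either flavor_id alone or pool_id with an optional flavor_id. flavor_id is not allowed for a dedicated CPU pool. The builder below is a minimal sketch of those rules. The cpu_pool flag is a hypothetical input, because the request itself does not say whether a pool is CPU-based, so the caller must know. The lower limit of 1 on node_count is an assumption based on the single-node default.

```python
from typing import Optional


def build_resource(flavor_id: Optional[str] = None, pool_id: Optional[str] = None,
                   node_count: int = 1, cpu_pool: bool = False) -> dict:
    """Build a SpecResource dict: flavor_id alone, or pool_id with an optional flavor_id."""
    if not flavor_id and not pool_id:
        raise ValueError("set flavor_id, or pool_id with an optional flavor_id")
    # cpu_pool is a hypothetical flag: the caller must know whether the dedicated pool is CPU-based.
    if pool_id and cpu_pool and flavor_id:
        raise ValueError("flavor_id cannot be specified for a dedicated CPU resource pool")
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    resource = {"node_count": node_count}
    if flavor_id:
        resource["flavor_id"] = flavor_id
    if pool_id:
        resource["pool_id"] = pool_id
    return resource
```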
SpecVolumes

Parameter | Mandatory | Type | Description |
---|---|---|---|
nfs | No | Nfs object | NFS volumes attached to a training job. |
pfs | No | Pfs object | obsfs volumes attached to a training job. |
obs | No | Obs object | OBS volumes attached to a training job. |

Nfs

Parameter | Mandatory | Type | Description |
---|---|---|---|
nfs_server_path | No | String | NFS server path, for example, 10.10.10.10:/example/path. |
local_path | No | String | Path for attaching volumes to the training container, for example, /example/path. |
read_only | No | Boolean | Whether the disks attached to the container in NFS mode are read-only. |

Pfs

Parameter | Mandatory | Type | Description |
---|---|---|---|
pfs_path | No | String | obsfs path, for example, /test-bucket/path. |
local_path | No | String | Path for attaching volumes to the training container, for example, /example/path. |

Obs

Parameter | Mandatory | Type | Description |
---|---|---|---|
obs_path | No | String | OBS path to be attached, for example, /test-bucket/path. |
local_path | No | String | Path for attaching volumes to the training container, for example, /example/path. |

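Each spec.volumes entry wraps one of nfs, pfs, or obs. The helpers below build those entries using the example paths from the tables. Two checks are this sketch's own assumptions, not documented rules: nfs_server_path must look like `<server>:/<path>` (inferred from the example value), and no two volumes may share a container local_path.

```python
def nfs_volume(nfs_server_path: str, local_path: str, read_only: bool = False) -> dict:
    """SpecVolumes entry for an NFS share such as 10.10.10.10:/example/path."""
    host, sep, path = nfs_server_path.partition(":")
    if not host or not sep or not path.startswith("/"):
        raise ValueError("nfs_server_path should look like <server>:/<path>")
    return {"nfs": {"nfs_server_path": nfs_server_path, "local_path": local_path,
                    "read_only": read_only}}


def pfs_volume(pfs_path: str, local_path: str) -> dict:
    """SpecVolumes entry for an obsfs (parallel file system) path."""
    return {"pfs": {"pfs_path": pfs_path, "local_path": local_path}}


def obs_volume(obs_path: str, local_path: str) -> dict:
    """SpecVolumes entry for an OBS path."""
    return {"obs": {"obs_path": obs_path, "local_path": local_path}}


def local_paths_unique(volumes: list[dict]) -> bool:
    """True if no two volumes are mounted at the same container path."""
    paths = [spec["local_path"] for volume in volumes for spec in volume.values()]
    return len(paths) == len(set(paths))


volumes = [
    nfs_volume("10.10.10.10:/example/path", "/example/path", read_only=True),
    obs_volume("/test-bucket/path", "/example/obs"),
]
```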
LogExportPath

Parameter | Mandatory | Type | Description |
---|---|---|---|
obs_url | No | String | OBS path for storing training job logs, for example, obs://example/path. |
host_path | No | String | Path of the host where training job logs are stored, for example, /example/path. |

AutoStop

Parameter | Mandatory | Type | Description |
---|---|---|---|
time_unit | Yes | String | Time unit. The options are as follows: |
duration | Yes | Integer | Running duration. The minimum value is 1. |

SchedulePolicy

Parameter | Mandatory | Type | Description |
---|---|---|---|
required_affinity | No | RequiredAffinity object | Affinity requirements for training jobs. |
priority | No | Integer | Priority of the training job. |
preemptible | No | Boolean | Whether preemption is allowed. |

RequiredAffinity

Parameter | Mandatory | Type | Description |
---|---|---|---|
affinity_type | No | String | Affinity scheduling policy. Possible values are as follows: |
affinity_group_size | No | Integer | Affinity group size. This parameter is mandatory when affinity_type is set to hyperinstance. In this case, the system schedules the number of tasks specified by affinity_group_size to a supernode to form an affinity group. If this parameter is not set when a training job is delivered to a supernode resource pool, the system uses 1 by default. |

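The builder below produces a SchedulePolicy dict and enforces the one documented dependency: affinity_group_size must be set when affinity_type is hyperinstance. The requirement that the group size be positive is an assumption of this sketch.

```python
from typing import Optional


def build_schedule_policy(affinity_type: Optional[str] = None,
                          affinity_group_size: Optional[int] = None,
                          priority: Optional[int] = None,
                          preemptible: Optional[bool] = None) -> dict:
    """Build a SchedulePolicy dict, enforcing the hyperinstance group-size rule."""
    policy = {}
    if affinity_type is not None:
        if affinity_type == "hyperinstance" and affinity_group_size is None:
            raise ValueError("affinity_group_size is mandatory when affinity_type is hyperinstance")
        if affinity_group_size is not None and affinity_group_size < 1:
            raise ValueError("affinity_group_size must be positive")
        affinity = {"affinity_type": affinity_type}
        if affinity_group_size is not None:
            affinity["affinity_group_size"] = affinity_group_size
        policy["required_affinity"] = affinity
    if priority is not None:
        policy["priority"] = priority
    if preemptible is not None:
        policy["preemptible"] = preemptible
    return policy
```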
Notification

Parameter | Mandatory | Type | Description |
---|---|---|---|
topic_urn | No | String | URN of the selected topic in SMN. |
events | No | Array of strings | Training events that trigger message notifications. The value can be: JobStarted (the job is started), JobCompleted (the job is completed), JobFailed (the job failed), JobTerminated (the job is terminated), JobRestarted (the job is restarted), JobHanged (the job is suspended), or JobPreempted (the job is preempted). |

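Because the events list accepts only the seven names in the table, a client can reject typos before creating the job. The sketch below does that and removes duplicates while preserving order. The SMN topic URN in the test is a hypothetical placeholder.

```python
# Event names exactly as listed in the Notification table.
SUPPORTED_EVENTS = frozenset({
    "JobStarted", "JobCompleted", "JobFailed", "JobTerminated",
    "JobRestarted", "JobHanged", "JobPreempted",
})


def build_notification(topic_urn: str, events: list[str]) -> dict:
    """Build a Notification dict, rejecting unknown events and dropping duplicates."""
    unknown = sorted(set(events) - SUPPORTED_EVENTS)
    if unknown:
        raise ValueError(f"unsupported events: {unknown}")
    # dict.fromkeys keeps first-seen order while removing duplicates.
    return {"topic_urn": topic_urn, "events": list(dict.fromkeys(events))}
```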
CustomMetrics

Parameter | Mandatory | Type | Description |
---|---|---|---|
metrics_url | No | String | URL for collecting metrics. Either configure all ports or leave all ports blank. |
metrics_port | No | Integer | Port for collecting metrics. Either configure all ports or leave all ports blank. |

JobEndpointsReq

Parameter | Mandatory | Type | Description |
---|---|---|---|
ssh | No | SSHReq object | SSH connection information. |

Response Parameters
Status code: 201
Parameter | Type | Description |
---|---|---|
kind | String | Training job type, which is job by default. Options: |
metadata | JobMetadata object | Metadata of a training job. |
status | Status object | Status of a training job. You do not need to set this parameter when creating a job. |
algorithm | JobAlgorithmResponse object | Algorithm used by a training job. The options are as follows: |
tasks | Array of TaskResponse objects | List of tasks in heterogeneous training jobs. |
spec | SpecResponce object | Specifications of a training job. |
endpoints | JobEndpointsResp object | Configurations required for remotely accessing a training job. |

JobMetadata

Parameter | Type | Description |
---|---|---|
id | String | Training job ID, which is generated and returned by ModelArts after the training job is created. |
name | String | Name of a training job. The value must contain 1 to 64 characters consisting of only digits, letters, underscores (_), and hyphens (-). |
workspace_id | String | Workspace where a job is located. The default value is 0. |
description | String | Training job description. The value must contain 0 to 256 characters. The default value is NULL. |
create_time | Long | Time when a training job was created, in milliseconds. The value is generated and returned by ModelArts after a training job is created. |
user_name | String | Username for creating a training job. The username is generated and returned by ModelArts after a training job is created. |
annotations | Map<String,String> | Advanced configurations of a training job. The options are as follows: |

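After a 201 response, the fields most clients need first are metadata.id and metadata.create_time. Note that create_time is in milliseconds. The sketch below reads both from a response body and converts the timestamp to UTC. The sample body is hypothetical and contains only the fields used.

```python
import json
from datetime import datetime, timezone


def summarize_created_job(response_text: str) -> dict:
    """Extract the job ID, name, and UTC creation time from a 201 response body."""
    metadata = json.loads(response_text)["metadata"]
    # create_time is documented in milliseconds; fromtimestamp expects seconds.
    created = datetime.fromtimestamp(metadata["create_time"] / 1000, tz=timezone.utc)
    return {"id": metadata["id"], "name": metadata["name"], "created_at": created.isoformat()}


# Hypothetical response body with only the fields used above.
sample = json.dumps({
    "kind": "job",
    "metadata": {"id": "example-job-id", "name": "demo-job", "create_time": 1700000000000},
})
created_job = summarize_created_job(sample)
```

For this sample, created_at is `2023-11-14T22:13:20+00:00`.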
Status

Parameter | Type | Description |
---|---|---|
phase | String | Level-1 status of a training job. The options are: |
secondary_phase | String | Level-2 status of a training job. This is an internal, detailed status whose values may be added, modified, or removed, so do not depend on it. The options are: |
duration | Long | Running duration of a training job, in milliseconds. |
node_count_metrics | Array<Array<Integer>> | Node count changes while the training job is running. |
tasks | Array of strings | Tasks of a training job. |
start_time | Long | Start time of a training job. The value is in timestamp format. |
task_statuses | Array of TaskStatuses objects | Status of a training job task. |
running_records | Array of RunningRecord objects | Running and fault recovery records of a training job. |

TaskStatuses

Parameter | Type | Description |
---|---|---|
task | String | Task of a training job. |
exit_code | Integer | Exit code of a training job task. |
message | String | Error message of a training job task. |

RunningRecord

Parameter | Type | Description |
---|---|---|
start_at | Integer | Unix timestamp of the start time of the current running record, in seconds. |
end_at | Integer | Unix timestamp of the end time of the current running record, in seconds. |
start_type | String | Startup mode of the current running record. |
end_reason | String | Reason why the current running record ended. |
end_related_task | String | ID of the task worker that caused the current running record to end, for example, worker-0. |
end_recover | String | Fault tolerance policy used after the current running record ends. The enums are as follows: |
end_recover_before_downgrade | String | Fault tolerance policy used after the current running record ends and before the policy is downgraded. The options are the same as those of end_recover. |

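Unlike create_time and duration, which are in milliseconds, start_at and end_at in running records are Unix timestamps in seconds. The sketch below totals the running time across records. It assumes that a record without end_at is still in progress, which the table does not state, and skips such records.

```python
def running_record_durations(records: list[dict]) -> list[int]:
    """Seconds covered by each finished running record (records without end_at are skipped)."""
    durations = []
    for record in records:
        end = record.get("end_at")
        if end is None:
            # Assumption: a missing end_at means the record is still running.
            continue
        durations.append(end - record["start_at"])
    return durations


def total_running_seconds(records: list[dict]) -> int:
    """Sum of finished running-record durations, in seconds."""
    return sum(running_record_durations(records))
```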
JobAlgorithmResponse

Parameter | Type | Description |
---|---|---|
id | String | Algorithm used by a training job. The options are as follows: |
name | String | Algorithm name. |
subscription_id | String | Subscription ID of a subscribed algorithm. This parameter must be used together with item_version_id. |
item_version_id | String | Version ID of the subscribed algorithm. This parameter must be used together with subscription_id. |
code_dir | String | Code directory of a training job, for example, /usr/app/. This parameter must be used together with boot_file. If id or subscription_id+item_version_id is set, you do not need to set this parameter. |
boot_file | String | Boot file of a training job, which must be stored in the code directory, for example, /usr/app/boot.py. This parameter must be used together with code_dir. If id or subscription_id+item_version_id is set, you do not need to set this parameter. |
autosearch_config_path | String | YAML configuration path of an auto search job. An OBS URL is required, for example, obs://bucket/file.yaml. |
autosearch_framework_path | String | Framework code directory of auto search jobs. An OBS URL is required, for example, obs://bucket/files/. |
command | String | Boot command for starting the container of a custom image for a training job, for example, python train.py. |
parameters | Array of Parameter objects | Running parameters of a training job. |
policies | policies object | Policies supported by jobs. |
inputs | Array of Input objects | Input of a training job. |
outputs | Array of Output objects | Output of a training job. |
engine | JobEngine object | Engine of a training job. Leave this parameter blank if the job is created using id of the algorithm in algorithm management, or subscription_id+item_version_id of the subscribed algorithm. |
local_code_dir | String | Local directory of the training container to which the algorithm code directory is downloaded. The rules are as follows: |
working_dir | String | Working directory where an algorithm is executed. Note that this parameter does not take effect in v1 compatibility mode. |
environments | Array of Map<String,String> objects | Environment variables of a training job. The format is key:value. Leave this parameter blank. |
summary | Summary object | Visualization log summary. |

Parameter

Parameter | Type | Description |
---|---|---|
name | String | Parameter name. |
value | String | Parameter value. |
description | String | Parameter description. |
constraint | constraint object | Parameter constraint. |
i18n_description | i18n_description object | Internationalization description. |

constraint

Parameter | Type | Description |
---|---|---|
type | String | Parameter type. |
editable | Boolean | Whether the parameter is editable. |
required | Boolean | Whether the parameter is mandatory. |
sensitive | Boolean | Whether the parameter is sensitive. This function is not implemented currently. |
valid_type | String | Valid type. |
valid_range | Array of strings | Valid range. |

i18n_description

Parameter | Type | Description |
---|---|---|
language | String | Internationalization language. The options are as follows: |
description | String | Description in the specified language. |

policies

Parameter | Type | Description |
---|---|---|
auto_search | auto_search object | Hyperparameter search configuration. |

auto_search

Parameter | Type | Description |
---|---|---|
skip_search_params | String | Hyperparameters to skip during the search. |
reward_attrs | Array of reward_attrs objects | List of search metrics. |
search_params | Array of search_params objects | Search parameters. |
algo_configs | Array of algo_configs objects | Search algorithm configurations. |

reward_attrs

Parameter | Type | Description |
---|---|---|
name | String | Metric name. |
mode | String | Search mode. |
regex | String | Regular expression of a metric. |

algo_configs

Parameter | Type | Description |
---|---|---|
name | String | Name of the search algorithm. |
params | Array of AutoSearchAlgoConfigParameter objects | Search algorithm parameters. |

AutoSearchAlgoConfigParameter

Parameter | Type | Description |
---|---|---|
key | String | Parameter key. |
value | String | Parameter value. |
type | String | Parameter type. |

Input

Parameter | Type | Description |
---|---|---|
name | String | Name of the data input channel. |
description | String | Description of the data input channel. |
local_dir | String | Local directory of the container to which the data input channel is mapped, for example, /home/ma-user/modelarts/inputs/data_url_0. |
remote | InputDataInfo object | Information about the data input. Enums: |
remote_constraint | Array of remote_constraint objects | Data input constraint. |

InputDataInfo

Parameter | Type | Description |
---|---|---|
dataset | dataset object | Dataset used as the data input. |
obs | obs object | OBS location where the data input and output are stored. |

dataset

Parameter | Type | Description |
---|---|---|
id | String | Dataset ID of a training job. |
version_id | String | Dataset version ID of a training job. |
obs_url | String | OBS URL of the dataset for a training job. It is automatically parsed by ModelArts based on the dataset ID and dataset version ID, for example, /usr/data/. |

obs

Parameter | Type | Description |
---|---|---|
obs_url | String | OBS URL of the dataset required by a training job, for example, /usr/data/. |

remote_constraint

Parameter | Type | Description |
---|---|---|
data_type | String | Data input type, including the data storage location and dataset. |
attributes | String | Attributes if a dataset is used as the data input. Options: |

Output

Parameter | Type | Description |
---|---|---|
name | String | Name of the data output channel. |
description | String | Description of the data output channel. |
local_dir | String | Local directory of the container to which the data output channel is mapped. |
remote | Remote object | Description of the actual data output. |

JobEngine

Parameter | Type | Description |
---|---|---|
engine_id | String | Engine ID selected for a training job. An engine can be selected using engine_id, engine_name + engine_version, or image_url. |
engine_name | String | Name of the engine selected for a training job. If engine_id has been set, you do not need to set this parameter. |
engine_version | String | Version of the engine selected for a training job. If engine_id has been set, you do not need to set this parameter. |
image_url | String | Custom image URL selected for a training job. The URL is obtained from SWR. |
install_sys_packages | Boolean | Whether to install the MoXing version specified by the training platform. Value true means to install the specified MoXing version. This parameter is available only when engine_name, engine_version, and image_url are set. |

Summary

Parameter | Type | Description |
---|---|---|
log_type | String | Visualization log type of a training job. After this parameter is configured, the training job can be used as the data source of a visualization job. The options are as follows: |
log_dir | LogDir object | Visualization log output of a training job. This parameter is mandatory when log_type is not empty. |
data_sources | Array of DataSource objects | Visualization log input of a visualization job or debug training job. This parameter is mandatory when tensorboard/enable or mindstudio-insight/enable is set to true for advanced training functions. |

LogDir

Parameter | Type | Description |
---|---|---|
pfs | PFSSummary object | Output of an OBS parallel file system. |

TaskResponse

Parameter | Type | Description |
---|---|---|
role | String | Task role. This function is not supported currently. |
algorithm | TaskResponseAlgorithm object | Algorithm management and configuration. |
task_resource | FlavorResponse object | Flavors of a training job or an algorithm. |

Parameter |
Type |
Description |
---|---|---|
code_dir |
String |
Absolute path of the directory where the algorithm boot file is stored. |
boot_file |
String |
Absolute path of the algorithm boot file. |
inputs |
AlgorithmInput object |
Algorithm input channel. |
outputs |
AlgorithmOutput object |
Algorithm output channel. |
engine |
AlgorithmEngine object |
Engine on which a heterogeneous job depends. |
local_code_dir |
String |
Local directory of the training container to which the algorithm code directory is downloaded. The rules are as follows: |
working_dir |
String |
Working directory in which the algorithm is executed. This parameter does not take effect in v1 compatibility mode. |
Parameter |
Type |
Description |
---|---|---|
name |
String |
Name of the data input channel. |
local_dir |
String |
Local path of the container to which the data input and output channels are mapped. |
remote |
AlgorithmRemote object |
Actual data input, which can only be OBS for heterogeneous jobs. |
Parameter |
Type |
Description |
---|---|---|
obs |
RemoteObs object |
OBS in which data input and output are stored. |
Parameter |
Type |
Description |
---|---|---|
name |
String |
Name of the data output channel. |
local_dir |
String |
Local directory of the container to which the data output channel is mapped. |
remote |
Remote object |
Description of the actual data output. |
mode |
String |
Data transmission mode. The default value is upload_periodically. |
period |
String |
Data transmission period. The default value is 30s. |
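The output channel defaults above (mode upload_periodically, period 30s) can be made explicit on the client side. The sketch assumes that a period string is an integer followed by s, m, or h. This page confirms only the 30s form, so the other suffixes are an assumption.

```python
import re

# Assumed suffixes; only "30s" is confirmed by this page.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def with_output_defaults(output: dict) -> dict:
    """Return a copy of an output channel with the documented defaults filled in."""
    merged = {"mode": "upload_periodically", "period": "30s"}
    merged.update(output)  # explicit values override the defaults
    return merged


def period_seconds(period: str) -> int:
    """Convert a period such as "30s" to seconds."""
    match = re.fullmatch(r"(\d+)([smh])", period)
    if match is None:
        raise ValueError(f"unsupported period: {period!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]
```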
Parameter |
Type |
Description |
---|---|---|
obs |
RemoteObs object |
OBS to which data is actually exported. |
Parameter |
Type |
Description |
---|---|---|
engine_id |
String |
Engine ID, for example, caffe-1.0.0-python2.7. |
engine_name |
String |
Engine name, for example, Caffe. |
engine_version |
String |
Engine version. An engine with a given name can have multiple versions, for example, Caffe-1.0.0-python2.7 for Python 2.7. |
v1_compatible |
Boolean |
Whether the v1 compatibility mode is used. |
run_user |
String |
UID of the user that the engine runs as by default. |
image_url |
String |
Custom image URL selected for an algorithm. |
Parameter |
Type |
Description |
---|---|---|
flavor_id |
String |
ID of the resource flavor. |
flavor_name |
String |
Name of the resource flavor. |
max_num |
Integer |
Maximum number of nodes in a resource flavor. |
flavor_type |
String |
Resource flavor type. Options: |
billing |
BillingInfo object |
Billing information of a resource flavor. |
flavor_info |
FlavorInfoResponse object |
Resource flavor details. |
attributes |
Map<String,String> |
Other specification attributes. |
Parameter |
Type |
Description |
---|---|---|
max_num |
Integer |
Maximum number of nodes that can be selected. The value 1 indicates that the distributed mode is not supported. |
cpu |
Cpu object |
CPU specifications. |
gpu |
Gpu object |
GPU specifications. |
npu |
Npu object |
Ascend specifications. |
memory |
Memory object |
Memory information. |
disk |
DiskResponse object |
Disk information. |
Parameter |
Type |
Description |
---|---|---|
size |
Integer |
Disk size. |
unit |
String |
Unit of the disk size. |
Parameter |
Type |
Description |
---|---|---|
resource |
Resource object |
Resource flavors of a training job. Specify either flavor_id alone, or pool_id with an optional flavor_id. |
volumes |
Array of JobVolume objects |
Volumes attached for a training job. |
log_export_path |
LogExportPath object |
Export path of training job logs. |
schedule_policy |
SchedulePolicy object |
Training job scheduling policy. |
custom_metrics |
Array of CustomMetrics objects |
Metric collection configuration. |
Parameter |
Type |
Description |
---|---|---|
policy |
String |
Resource specification mode of a training job. The value can be regular, indicating the standard mode. |
flavor_id |
String |
ID of the resource flavor selected for a training job. flavor_id cannot be specified for dedicated resource pools with CPU specifications. The options for dedicated resource pools with GPU/Ascend specifications are as follows: |
flavor_name |
String |
Read-only flavor name returned by ModelArts when flavor_id is used. |
node_count |
Integer |
Number of resource replicas selected for a training job. |
pool_id |
String |
Resource pool ID selected for a training job. |
flavor_detail |
FlavorDetail object |
Flavor details of a training job or algorithm. This parameter is available only for public resource pools. |
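The resource selection rule can be encoded as a small builder. Set either flavor_id alone, or pool_id with an optional flavor_id. Do not set flavor_id for a dedicated pool with CPU specifications. build_resource and its cpu_pool flag are hypothetical. Whether a pool is CPU-only must come from your own pool metadata.

```python
from typing import Optional


def build_resource(node_count: int = 1, flavor_id: Optional[str] = None,
                   pool_id: Optional[str] = None, cpu_pool: bool = False) -> dict:
    """Build a spec.resource dict following the selection rule above.

    cpu_pool is a hypothetical flag the caller sets when pool_id refers to a
    dedicated resource pool with CPU specifications.
    """
    if flavor_id is None and pool_id is None:
        raise ValueError("set flavor_id, or pool_id with an optional flavor_id")
    if pool_id is not None and cpu_pool and flavor_id is not None:
        raise ValueError("flavor_id cannot be set for a dedicated CPU pool")
    resource: dict = {"node_count": node_count}
    if flavor_id is not None:
        resource["flavor_id"] = flavor_id
    if pool_id is not None:
        resource["pool_id"] = pool_id
    return resource
```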
Parameter |
Type |
Description |
---|---|---|
flavor_type |
String |
Resource flavor type. The options are as follows: |
billing |
BillingInfo object |
Billing information of a resource flavor. |
flavor_info |
FlavorInfo object |
Resource flavor details. |
Parameter |
Type |
Description |
---|---|---|
code |
String |
Billing code. |
unit_num |
Integer |
Billing unit. |
Parameter |
Type |
Description |
---|---|---|
max_num |
Integer |
Maximum number of nodes that can be selected. The value 1 indicates that the distributed mode is not supported. |
cpu |
Cpu object |
CPU specifications. |
gpu |
Gpu object |
GPU specifications. |
npu |
Npu object |
Ascend specifications. |
memory |
Memory object |
Memory information. |
disk |
Disk object |
Disk information. |
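max_num caps the node count, and a value of 1 means the flavor cannot run distributed jobs. A client can therefore check node_count against max_num and estimate the total GPU count. This sketch uses the flavor_info object from the 201 response example on this page. total_gpus is a hypothetical helper. max_num is treated as optional because that example omits it.

```python
def total_gpus(flavor_info: dict, node_count: int) -> int:
    """Return the GPU count across all nodes, rejecting node counts above max_num."""
    max_num = flavor_info.get("max_num")
    # max_num == 1 means the flavor does not support distributed mode.
    if max_num is not None and node_count > max_num:
        raise ValueError(f"flavor allows at most {max_num} node(s), got {node_count}")
    gpu = flavor_info.get("gpu")
    per_node = gpu["unit_num"] if gpu else 0
    return per_node * node_count


# flavor_info copied from the 201 response example on this page.
example_info = {
    "cpu": {"arch": "x86", "core_num": 8},
    "gpu": {"unit_num": 1, "product_name": "GP-Vnt1", "memory": "32GB"},
    "memory": {"size": 64, "unit": "GB"},
}
print(total_gpus(example_info, 1))  # prints 1
```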
Parameter |
Type |
Description |
---|---|---|
arch |
String |
CPU architecture. |
core_num |
Integer |
Number of cores. |
Parameter |
Type |
Description |
---|---|---|
unit_num |
Integer |
Number of GPUs. |
product_name |
String |
Product name. |
memory |
String |
Memory. |
Parameter |
Type |
Description |
---|---|---|
unit_num |
String |
Number of NPUs. |
product_name |
String |
Product name. |
memory |
String |
Memory. |
Parameter |
Type |
Description |
---|---|---|
size |
Integer |
Memory size. |
unit |
String |
Unit of the memory size, for example, GB. |
Parameter |
Type |
Description |
---|---|---|
size |
String |
Disk size. |
unit |
String |
Unit of the disk size, generally GB. |
Parameter |
Type |
Description |
---|---|---|
nfs_server_path |
String |
NFS server path, for example, 10.10.10.10:/example/path. |
local_path |
String |
Path for attaching volumes to the training container, for example, /example/path. |
read_only |
Boolean |
Whether the disks attached to the container in NFS mode are read-only. |
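A volumes entry wraps these fields in an nfs object, as the second request example does. The sketch below assumes two formats, both taken from the examples on this page. nfs_server_path has the form host:/absolute/path, and local_path is an absolute path. nfs_volume is a hypothetical helper.

```python
def nfs_volume(nfs_server_path: str, local_path: str, read_only: bool = False) -> dict:
    """Build one entry of spec.volumes for an NFS mount (hypothetical helper)."""
    host, sep, export = nfs_server_path.partition(":")
    # Assumed format: host:/absolute/export/path, as in the examples here.
    if not host or not sep or not export.startswith("/"):
        raise ValueError(f"expected host:/path, got {nfs_server_path!r}")
    # Assumed: local_path is an absolute path inside the training container.
    if not local_path.startswith("/"):
        raise ValueError(f"local_path must be absolute, got {local_path!r}")
    return {"nfs": {"nfs_server_path": nfs_server_path,
                    "local_path": local_path,
                    "read_only": read_only}}
```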
Parameter |
Type |
Description |
---|---|---|
obs_url |
String |
OBS path for storing training job logs, for example, obs://example/path. |
host_path |
String |
Path of the host where training job logs are stored, for example, /example/path. |
Parameter |
Type |
Description |
---|---|---|
required_affinity |
RequiredAffinity object |
Affinity requirements for training jobs. |
priority |
Integer |
Priority of the training job. |
preemptible |
Boolean |
Whether preemption is allowed. |
Parameter |
Type |
Description |
---|---|---|
affinity_type |
String |
Affinity scheduling policy. Possible values are as follows: |
affinity_group_size |
Integer |
Affinity group size. This parameter is mandatory when affinity_type is set to hyperinstance. In this case, the system schedules the number of tasks specified by affinity_group_size onto one supernode to form an affinity group. If this value is not set when a training job is delivered to a supernode resource pool, the system sets it to 1 by default. |
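A schedule_policy object could be assembled as follows. Following the description above, the sketch sets affinity_group_size to 1 for hyperinstance when no size is given, rather than relying on the server-side default. schedule_policy is a hypothetical helper, and the sketch assumes no other affinity_type values.

```python
from typing import Optional


def schedule_policy(affinity_type: Optional[str] = None,
                    affinity_group_size: Optional[int] = None,
                    priority: Optional[int] = None,
                    preemptible: Optional[bool] = None) -> dict:
    """Assemble spec.schedule_policy, omitting fields that were not given."""
    policy: dict = {}
    if affinity_type is not None:
        affinity: dict = {"affinity_type": affinity_type}
        if affinity_group_size is not None:
            affinity["affinity_group_size"] = affinity_group_size
        elif affinity_type == "hyperinstance":
            # Mandatory for hyperinstance; 1 mirrors the documented default.
            affinity["affinity_group_size"] = 1
        policy["required_affinity"] = affinity
    if priority is not None:
        policy["priority"] = priority
    if preemptible is not None:
        policy["preemptible"] = preemptible
    return policy
```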
Parameter |
Type |
Description |
---|---|---|
metrics_url |
String |
URL for collecting metrics. Either configure all ports or leave all ports blank. |
metrics_port |
Integer |
Port for collecting metrics. Either configure all ports or leave all ports blank. |
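One reading of the rule above is that each custom_metrics entry must either set both metrics_url and metrics_port or leave both blank. The check below implements only that interpretation. It is an assumption and has not been confirmed as the server-side validation. check_custom_metrics is a hypothetical helper.

```python
def check_custom_metrics(metrics: list) -> None:
    """Raise if an entry sets only one of metrics_url and metrics_port.

    This encodes one interpretation of the rule above, not a confirmed check.
    """
    for index, entry in enumerate(metrics):
        has_url = bool(entry.get("metrics_url"))
        has_port = entry.get("metrics_port") is not None
        if has_url != has_port:
            raise ValueError(
                f"custom_metrics[{index}]: set metrics_url and metrics_port together")
```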
Parameter |
Type |
Description |
---|---|---|
ssh |
SSHResp object |
SSH connection information. |
jupyter_lab |
JupyterLab object |
JupyterLab connection information. |
tensorboard |
Tensorboard object |
TensorBoard connection information. |
mindstudio_insight |
MindStudioInsight object |
MindStudio Insight connection information. |
Parameter |
Type |
Description |
---|---|---|
key_pair_names |
Array of strings |
Names of the SSH key pairs. Key pairs can be created and viewed on the Key Pair page of the ECS console. |
task_urls |
Array of TaskUrls objects |
SSH connection address information. |
Parameter |
Type |
Description |
---|---|---|
task |
String |
ID of a training job. |
url |
String |
SSH connection address of a training job. |
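To reach a task over SSH, a client can index endpoints.ssh.task_urls by task. The sketch assumes that the response nests these objects as endpoints.ssh.task_urls, following the table hierarchy above. ssh_urls is a hypothetical helper. The sample URL is a documentation address, not real output.

```python
def ssh_urls(endpoints: dict) -> dict:
    """Map each task to its SSH connection address (hypothetical helper)."""
    ssh = endpoints.get("ssh") or {}
    return {item["task"]: item["url"] for item in ssh.get("task_urls") or []}


# Hypothetical endpoints object for illustration only.
example = {"ssh": {"key_pair_names": ["my-key"],
                   "task_urls": [{"task": "worker-0", "url": "ssh://192.0.2.10:30022"}]}}
print(ssh_urls(example))  # prints {'worker-0': 'ssh://192.0.2.10:30022'}
```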
Parameter |
Type |
Description |
---|---|---|
url |
String |
JupyterLab address of a training job. |
token |
String |
JupyterLab token of a training job. |
Parameter |
Type |
Description |
---|---|---|
url |
String |
TensorBoard URL of a training job. |
token |
String |
TensorBoard token of a training job. |
Parameter |
Type |
Description |
---|---|---|
url |
String |
MindStudio Insight URL of a training job. |
token |
String |
MindStudio Insight token of a training job. |
Status code: 400
Parameter |
Type |
Description |
---|---|---|
error_msg |
String |
Error message |
error_code |
String |
Error code |
error_solution |
String |
Solution |
Example Requests
-
The following example creates a training job with free specifications. The job is named TestModelArtsJob and has the description "This is a ModelArts job". It uses the algorithm whose ID is 3f5d6706-7b67-408d-8ba0-ec08048c45ed and does not define inputs or outputs for the algorithm.
POST https://endpoint/v2/{project_id}/training-jobs

{
  "kind" : "job",
  "metadata" : {
    "name" : "TestModelArtsJob",
    "description" : "This is a ModelArts job"
  },
  "algorithm" : {
    "id" : "3f5d6706-7b67-408d-8ba0-ec08048c45ed",
    "parameters" : [ {
      "name" : "input_dir",
      "value" : "obs://cn-north-4-rse/test/moxingtest-dir/"
    }, {
      "name" : "input_file",
      "value" : "obs://cn-north-4-rse/test/moxingtest/"
    }, {
      "name" : "large_file_method",
      "value" : "1"
    } ],
    "policies" : {
      "auto_search" : null
    },
    "environments" : { }
  },
  "spec" : {
    "resource" : {
      "flavor_id" : "modelarts.p3.large.public.free",
      "node_count" : 1
    },
    "log_export_path" : {
      "obs_url" : ""
    }
  }
}
-
The following example uses a custom image to create a training job. The job is named TestModelArtsJob2 and has the description "This is a ModelArts job2". It runs in a dedicated resource pool and mounts an NFS volume.
POST https://endpoint/v2/{project_id}/training-jobs

{
  "kind" : "job",
  "metadata" : {
    "name" : "TestModelArtsJob2",
    "description" : "This is a ModelArts job2"
  },
  "algorithm" : {
    "engine" : {
      "image_url" : "xxxxxxxx/fastseq:1.2"
    },
    "command" : "cd /home/ma-user/ddp_demo && sh run_ddp.sh",
    "parameters" : [ ],
    "policies" : {
      "auto_search" : null
    },
    "environments" : {
      "NCCL_DEBUG" : "INFO",
      "NCCL_IB_DISABLE" : "0"
    }
  },
  "spec" : {
    "resource" : {
      "flavor_id" : "modelarts.pool.visual.xlarge",
      "node_count" : 1,
      "pool_id" : "poolfaf38d76"
    },
    "log_export_path" : {
      "obs_url" : "/cn-north-4-training-test/limou/ddp-demo-log/"
    },
    "volumes" : [ {
      "nfs" : {
        "nfs_server_path" : "192.168.0.82:/",
        "local_path" : "/home/ma-user/nfs/",
        "read_only" : false
      }
    } ]
  }
}
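The first example request can be assembled with the Python standard library, as in the following sketch. It assumes token-based authentication through the X-Auth-Token header and uses a placeholder endpoint host, project ID, and token. build_create_job_request is a hypothetical helper. The request is only built, not sent, and the body is trimmed (algorithm parameters omitted).

```python
import json
import urllib.request


def build_create_job_request(endpoint: str, project_id: str, token: str,
                             body: dict) -> urllib.request.Request:
    """Build (but do not send) POST /v2/{project_id}/training-jobs."""
    url = f"https://{endpoint}/v2/{project_id}/training-jobs"
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        # Assumption: IAM token authentication through X-Auth-Token.
        headers={"Content-Type": "application/json", "X-Auth-Token": token},
        method="POST",
    )


# Trimmed body of the first example; endpoint, project ID, and token are placeholders.
body = {
    "kind": "job",
    "metadata": {"name": "TestModelArtsJob", "description": "This is a ModelArts job"},
    "algorithm": {"id": "3f5d6706-7b67-408d-8ba0-ec08048c45ed"},
    "spec": {"resource": {"flavor_id": "modelarts.p3.large.public.free", "node_count": 1}},
}
request = build_create_job_request("modelarts.example.com", "my-project-id", "my-token", body)
```

To send the request, pass the object to urllib.request.urlopen. A 201 status indicates that the job was accepted.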
Example Responses
Status code: 201
ok
{
"kind" : "job",
"metadata" : {
"id" : "425b7087-83de-49ed-9e40-5bb642be956f",
"name" : "TestModelArtsJob",
"description" : "This is a ModelArts job",
"create_time" : 1637045545982,
"workspace_id" : "0",
"user_name" : ""
},
"status" : {
"phase" : "Creating",
"secondary_phase" : "Creating",
"duration" : 0,
"start_time" : 0,
"node_count_metrics" : null,
"tasks" : [ "worker-0", "server-0" ]
},
"algorithm" : {
"id" : "3f5d6706-7b67-408d-8ba0-ec08048c45ed",
"name" : "ttt-obs-gpu",
"code_dir" : "/cn-north-4-rse/test/moxingtest-code/",
"boot_file" : "/cn-north-4-rse/test/moxingtest-code/test_obs_gpu.py",
"parameters" : [ {
"name" : "input_dir",
"description" : "",
"i18n_description" : null,
"value" : "obs://cn-north-4-rse/test/moxingtest-dir/",
"constraint" : {
"type" : "String",
"editable" : true,
"required" : true,
"sensitive" : false,
"valid_type" : "None",
"valid_range" : [ ]
}
}, {
"name" : "input_file",
"description" : "",
"i18n_description" : null,
"value" : "obs://cn-north-4-rse/test/moxingtest/",
"constraint" : {
"type" : "String",
"editable" : true,
"required" : true,
"sensitive" : false,
"valid_type" : "None",
"valid_range" : [ ]
}
}, {
"name" : "large_file_method",
"description" : "",
"i18n_description" : null,
"value" : "1",
"constraint" : {
"type" : "Integer",
"editable" : true,
"required" : true,
"sensitive" : false,
"valid_type" : "None",
"valid_range" : [ ]
}
} ],
"engine" : {
"engine_id" : "horovod-cp36-tf-1.16.2",
"engine_name" : "Horovod",
"engine_version" : "0.16.2-TF-1.13.1-python3.6"
},
"policies" : { }
},
"spec" : {
"resource" : {
"policy" : "regular",
"flavor_id" : "modelarts.p3.large.public.free",
"flavor_name" : "Computing GPU(Vnt1) instance",
"node_count" : 1,
"flavor_detail" : {
"flavor_type" : "GPU",
"billing" : {
"code" : "modelarts.vm.gpu.free",
"unit_num" : 1
},
"flavor_info" : {
"cpu" : {
"arch" : "x86",
"core_num" : 8
},
"gpu" : {
"unit_num" : 1,
"product_name" : "GP-Vnt1",
"memory" : "32GB"
},
"memory" : {
"size" : 64,
"unit" : "GB"
}
}
}
},
"log_export_path" : { },
"custom_metrics" : [ {
"metrics_url" : "/raw_text",
"metrics_port" : 5006
} ]
}
}
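The create_time field in the response above is a 13-digit epoch value, which appears to be in milliseconds. The following sketch works under that assumption. It extracts the job ID and phase and converts create_time to a UTC datetime using integer arithmetic. summarize_created_job is a hypothetical helper.

```python
import json
from datetime import datetime, timedelta, timezone


def summarize_created_job(raw: str) -> dict:
    """Extract the job ID, phase, and creation time from a 201 response body."""
    job = json.loads(raw)
    millis = job["metadata"]["create_time"]
    # Assumption: create_time is milliseconds since the Unix epoch.
    created = (datetime.fromtimestamp(millis // 1000, tz=timezone.utc)
               + timedelta(milliseconds=millis % 1000))
    return {"id": job["metadata"]["id"],
            "phase": job["status"]["phase"],
            "created": created}
```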
Status code: 400
Common error response body. The following is returned when the algorithm with ID 3f5d6706-7b67-408d-8ba0-ec08048c45ee cannot be found.
{
"error_msg" : "algorithm not found.",
"error_code" : "ModelArts.2755",
"error_solution" : "Check whether the training project information in the request is valid."
}
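A client can turn the error body above into an exception that carries error_code, error_msg, and error_solution. The sketch below assumes that the body is JSON for every non-2xx status. ModelArtsError and raise_for_error are hypothetical names.

```python
import json


class ModelArtsError(Exception):
    """Error raised for a non-2xx response (hypothetical client-side type)."""

    def __init__(self, status: int, code: str, message: str, solution: str):
        super().__init__(f"HTTP {status} {code}: {message}")
        self.status = status
        self.code = code
        self.solution = solution


def raise_for_error(status: int, raw: str) -> dict:
    """Return the parsed body on 2xx; otherwise raise ModelArtsError."""
    body = json.loads(raw) if raw else {}
    if 200 <= status < 300:
        return body
    raise ModelArtsError(status, body.get("error_code", ""),
                         body.get("error_msg", ""), body.get("error_solution", ""))
```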
Status Codes
Status Code |
Description |
---|---|
201 |
ok |
400 |
Common error response body. The following is returned when the algorithm with ID 3f5d6706-7b67-408d-8ba0-ec08048c45ee cannot be found. |
Error Codes
See Error Codes.