Querying the Details About a Training Job
Function
This API is used to query the details about a training job.
URI
GET /v2/{project_id}/training-jobs/{training_job_id}

| Parameter | Mandatory | Type | Description |
|---|---|---|---|
| project_id | Yes | String | Project ID. For details, see Obtaining a Project ID and Name. |
| training_job_id | Yes | String | ID of a training job. |
Request Parameters
None
Response Parameters
Status code: 200

| Parameter | Type | Description |
|---|---|---|
| kind | String | Training job type, which is job by default. Options: |
| metadata | JobMetadata object | Metadata of a training job. |
| status | Status object | Status of a training job. You do not need to set this parameter when creating a job. |
| algorithm | JobAlgorithmResponse object | Algorithm used by a training job. Options: |
| tasks | Array of TaskResponse objects | List of tasks in heterogeneous training jobs. |
| spec | spec object | Specifications of a training job. |
| endpoints | JobEndpointsResp object | Configuration required for remotely accessing a training job. |

| Parameter | Type | Description |
|---|---|---|
| id | String | Training job ID, which is generated and returned by ModelArts after the training job is created. |
| name | String | Name of a training job. The value must contain 1 to 64 characters consisting of only digits, letters, underscores (_), and hyphens (-). |
| workspace_id | String | Workspace where a job is located. The default value is 0. |
| description | String | Training job description. The value must contain 0 to 256 characters. The default value is NULL. |
| create_time | Long | Time when a training job was created, in milliseconds. The value is generated and returned by ModelArts after a training job is created. |
| user_name | String | Username for creating a training job. The username is generated and returned by ModelArts after a training job is created. |
| annotations | Map<String,String> | Advanced configuration of a training job. Options: |
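
The name rule and the millisecond create_time value described above can be checked and converted on the client side. The following is a minimal sketch, not part of the API: the helper names are hypothetical, and it assumes that "letters" means ASCII letters and that create_time counts milliseconds since the Unix epoch in UTC.

```python
import re
from datetime import datetime, timedelta, timezone

# 1 to 64 characters: digits, letters, underscores, and hyphens.
# Assumption: "letters" means ASCII letters only.
JOB_NAME_PATTERN = re.compile(r"[A-Za-z0-9_-]{1,64}")

# Assumption: create_time counts milliseconds since the Unix epoch (UTC).
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def is_valid_job_name(name: str) -> bool:
    """Return True if the name satisfies the documented length and character rule."""
    return JOB_NAME_PATTERN.fullmatch(name) is not None


def create_time_to_datetime(create_time_ms: int) -> datetime:
    """Convert the create_time field (milliseconds) to an aware UTC datetime."""
    return EPOCH + timedelta(milliseconds=create_time_ms)


# Values taken from the example response on this page.
print(is_valid_job_name("trainjob--py14_mem06-108"))
print(create_time_to_datetime(1636447346315).isoformat())
```

Adding a timedelta to a fixed epoch avoids floating-point division, so the millisecond part of the timestamp is preserved exactly.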

| Parameter | Type | Description |
|---|---|---|
| phase | String | Level-1 status of a training job. Options: Creating, Pending, Running, Failed, Completed, Terminating, Terminated, Abnormal. |
| secondary_phase | String | Level-2 status of a training job. This is an internal detailed status whose values may be added, modified, or deleted, so depending on it is not recommended. Options: Creating, Queuing, Running, Failed, Completed, Terminating, Terminated, CreateFailed, TerminatedFailed, Unknown, Lost. |
| duration | Long | Running duration of a training job, in milliseconds. |
| node_count_metrics | Array<Array<Integer>> | Node count changes during the training job running period. |
| tasks | Array of strings | Tasks of a training job. |
| start_time | Long | Start time of a training job. The value is in timestamp format. |
| task_statuses | Array of task_statuses objects | Status of a training job task. |
| running_records | Array of running_records objects | Running and fault recovery records of a training job. |
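
A client that polls this API usually needs to decide whether a job has stopped and to read the duration and node count fields. The following sketch shows one way to do that. The helper names are hypothetical, and the code makes two assumptions that this page does not state: that Failed, Completed, Terminated, and Abnormal are final phases, and that each node_count_metrics entry is a [timestamp in milliseconds, node count] pair, as the example response suggests.

```python
from datetime import timedelta

# Assumption: these level-1 phases are final. The page lists the phase values
# but does not say which of them are terminal.
TERMINAL_PHASES = {"Failed", "Completed", "Terminated", "Abnormal"}


def is_finished(status: dict) -> bool:
    """Return True if the level-1 phase is one of the assumed terminal phases."""
    return status.get("phase") in TERMINAL_PHASES


def running_duration(status: dict) -> timedelta:
    """Convert the duration field (milliseconds) to a timedelta."""
    return timedelta(milliseconds=status.get("duration", 0))


def peak_node_count(status: dict) -> int:
    """Return the largest node count seen in node_count_metrics, or 0 if empty."""
    # Assumption: each entry is [timestamp_ms, node_count].
    metrics = status.get("node_count_metrics") or []
    return max((count for _, count in metrics), default=0)


# Hypothetical status object for a job that is still running.
status = {
    "phase": "Running",
    "duration": 90000,
    "node_count_metrics": [[1636447746000, 1], [1636447755000, 2]],
}
print(is_finished(status), running_duration(status), peak_node_count(status))
```

Because secondary_phase is an internal status whose values may change, the sketch relies on phase only.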

| Parameter | Type | Description |
|---|---|---|
| task | String | Name of a training job task. |
| exit_code | Integer | Exit code of a training job task. |
| message | String | Error message of a training job task. |

| Parameter | Type | Description |
|---|---|---|
| start_at | Integer | Unix timestamp of the start time in the current running record, in seconds. |
| end_at | Integer | Unix timestamp of the end time in the current running record, in seconds. |
| start_type | String | Startup mode of the current running record. Options: init_or_rescheduled (the first run after scheduling, which covers both the first startup and a run after scheduling recovery); restarted (not the first run after scheduling, but a run after a process restart). |
| end_reason | String | Reason why the current running record ended. |
| end_related_task | String | ID of the task worker that caused the current running record to end, for example, worker-0. |
| end_recover | String | Fault tolerance policy used after the current running record ends. Options: npu_proc_restart (NPU in-place hot recovery); gpu_proc_restart (GPU in-place hot recovery); proc_restart (process in-place recovery); pod_reschedule (pod-level rescheduling); job_reschedule (job-level rescheduling); job_reschedule_with_taint (isolated job-level rescheduling). |
| end_recover_before_downgrade | String | Fault tolerance policy used after the current running record ends and before the fault tolerance policy is degraded. The options are the same as those of end_recover. |
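
Note that start_at and end_at are in seconds, while create_time and duration are in milliseconds. The following sketch summarizes a list of running records under that reading. The function name and the keys of the returned dictionaries are hypothetical, and optional fields such as end_recover are treated as possibly absent because the second record in the example response omits them.

```python
from datetime import datetime, timezone


def summarize_running_records(records: list) -> list:
    """Return one summary dictionary per running record."""
    summaries = []
    for record in records:
        start_at = record.get("start_at")
        end_at = record.get("end_at")
        # Elapsed time is only meaningful when both timestamps are present.
        elapsed = end_at - start_at if start_at is not None and end_at is not None else None
        started = (
            datetime.fromtimestamp(start_at, tz=timezone.utc)
            if start_at is not None
            else None
        )
        summaries.append({
            "started": started,                      # aware UTC datetime or None
            "elapsed_seconds": elapsed,              # seconds, not milliseconds
            "start_type": record.get("start_type"),
            "end_reason": record.get("end_reason"),
            "recovery": record.get("end_recover"),   # None if no recovery was recorded
        })
    return summaries


# Second running record from the example response on this page.
records = [{
    "start_at": 1701323345,
    "end_at": 1701325432,
    "start_type": "init_or_rescheduled",
    "end_reason": "job completed",
}]
for item in summarize_running_records(records):
    print(item["start_type"], item["elapsed_seconds"], item["end_reason"])
```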

| Parameter | Type | Description |
|---|---|---|
| id | String | Algorithm used by a training job. Options: |
| name | String | Algorithm name. |
| subscription_id | String | Subscription ID of a subscribed algorithm, which must be used with item_version_id. |
| item_version_id | String | Version ID of the subscribed algorithm, which must be used with subscription_id. |
| code_dir | String | Code directory of a training job, for example, /usr/app/. This parameter must be used with boot_file. Leave this parameter blank if id, or subscription_id and item_version_id are specified. |
| boot_file | String | Boot file of a training job, which must be stored in the code directory, for example, /usr/app/boot.py. This parameter must be used with code_dir. Leave this parameter blank if id, or subscription_id and item_version_id are specified. |
| autosearch_config_path | String | YAML configuration path of auto search jobs. An OBS URL is required. |
| autosearch_framework_path | String | Framework code directory of auto search jobs. An OBS URL is required. |
| command | String | Boot command used to start the container of a custom image of a training job, for example, python train.py. |
| parameters | Array of Parameter objects | Running parameters of a training job. |
| policies | policies object | Policies supported by jobs. |
| inputs | Array of Input objects | Input of a training job. |
| outputs | Array of Output objects | Output of a training job. |
| engine | engine object | Engine of a training job. Leave this parameter blank if the job is created using id of the algorithm in algorithm management, or subscription_id+item_version_id of the subscribed algorithm. |
| local_code_dir | String | Local directory in the training container to which the algorithm code directory is downloaded. Ensure that the following rules are complied with: |
| working_dir | String | Work directory where an algorithm is executed. Note that this parameter does not take effect in v1 compatibility mode. |
| environments | Array of Map<String,String> objects | Environment variables of a training job. The format is key: value. Leave this parameter blank. |

| Parameter | Type | Description |
|---|---|---|
| name | String | Parameter name. |
| value | String | Parameter value. |
| description | String | Parameter description. |
| constraint | constraint object | Parameter constraint. |
| i18n_description | i18n_description object | Internationalization description. |

| Parameter | Type | Description |
|---|---|---|
| type | String | Parameter type. |
| editable | Boolean | Whether the parameter is editable. |
| required | Boolean | Whether the parameter is mandatory. |
| sensitive | Boolean | Whether the parameter is sensitive. This function is not implemented currently. |
| valid_type | String | Valid type. |
| valid_range | Array of strings | Valid range. |

| Parameter | Type | Description |
|---|---|---|
| language | String | Language. Options: |
| description | String | Description. |

| Parameter | Type | Description |
|---|---|---|
| auto_search | auto_search object | Hyperparameter search configuration. |

| Parameter | Type | Description |
|---|---|---|
| skip_search_params | String | Hyperparameters that need to be skipped. |
| reward_attrs | Array of reward_attrs objects | List of search metrics. |
| search_params | Array of search_params objects | Search parameters. |
| algo_configs | Array of algo_configs objects | Search algorithm configurations. |

| Parameter | Type | Description |
|---|---|---|
| name | String | Metric name. |
| mode | String | Search direction. |
| regex | String | Regular expression of a metric. |

| Parameter | Type | Description |
|---|---|---|
| name | String | Name of the search algorithm. |
| params | Array of AutoSearchAlgoConfigParameter objects | Search algorithm parameters. |

| Parameter | Type | Description |
|---|---|---|
| key | String | Parameter key. |
| value | String | Parameter value. |
| type | String | Parameter type. |

| Parameter | Type | Description |
|---|---|---|
| name | String | Name of the data input channel. |
| description | String | Description of the data input channel. |
| local_dir | String | Local directory of the container to which the data input channel is mapped. |
| remote | InputDataInfo object | Data input. Options: |
| remote_constraint | Array of remote_constraint objects | Data input constraint. |

| Parameter | Type | Description |
|---|---|---|
| dataset | dataset object | Dataset as the data input. |
| obs | obs object | OBS in which the input and output data is stored. |

| Parameter | Type | Description |
|---|---|---|
| id | String | Dataset ID of a training job. |
| version_id | String | Dataset version ID of a training job. |
| obs_url | String | OBS URL of the dataset required by a training job. ModelArts automatically parses and generates the URL based on the dataset and dataset version IDs. For example, /usr/data/. |

| Parameter | Type | Description |
|---|---|---|
| obs_url | String | OBS URL of the dataset required by a training job. For example, /usr/data/. |

| Parameter | Type | Description |
|---|---|---|
| data_type | String | Data input type, including the data storage location and dataset. |
| attributes | String | Attributes if a dataset is used as the data input. Options: |

| Parameter | Type | Description |
|---|---|---|
| name | String | Name of the data output channel. |
| description | String | Description of the data output channel. |
| local_dir | String | Local directory of the container to which the data output channel is mapped. |
| remote | remote object | Description of the actual data output. |

| Parameter | Type | Description |
|---|---|---|
| obs_url | String | OBS URL to which data is actually exported. |

| Parameter | Type | Description |
|---|---|---|
| engine_id | String | Engine ID selected for a training job. You can set this parameter to engine_id, engine_name + engine_version, or image_url. |
| engine_name | String | Name of the engine selected for a training job. If engine_id is set, leave this parameter blank. |
| engine_version | String | Name of the engine version selected for a training job. If engine_id is set, leave this parameter blank. |
| image_url | String | Custom image URL selected for a training job. |

| Parameter | Type | Description |
|---|---|---|
| role | String | Task role. This function is not supported currently. |
| algorithm | algorithm object | Algorithm management and configuration. |
| task_resource | FlavorResponse object | Flavors of a training job or an algorithm. |

| Parameter | Type | Description |
|---|---|---|
| code_dir | String | Absolute path of the directory where the algorithm boot file is stored. |
| boot_file | String | Absolute path of the algorithm boot file. |
| inputs | inputs object | Algorithm input channel. |
| outputs | outputs object | Algorithm output channel. |
| engine | engine object | Engine on which a heterogeneous job depends. |
| local_code_dir | String | Local directory in the training container to which the algorithm code directory is downloaded. Ensure that the following rules are complied with: |
| working_dir | String | Work directory where an algorithm is executed. Note that this parameter does not take effect in v1 compatibility mode. |

| Parameter | Type | Description |
|---|---|---|
| name | String | Name of the data input channel. |
| local_dir | String | Local path of the container to which the data input and output channels are mapped. |
| remote | remote object | Actual data input. Heterogeneous jobs support only OBS. |

| Parameter | Type | Description |
|---|---|---|
| obs | obs object | OBS in which the input and output data is stored. |

| Parameter | Type | Description |
|---|---|---|
| obs_url | String | OBS URL of the dataset required by a training job. For example, /usr/data/. |

| Parameter | Type | Description |
|---|---|---|
| name | String | Name of the data output channel. |
| local_dir | String | Local directory of the container to which the data output channel is mapped. |
| remote | remote object | Description of the actual data output. |
| mode | String | Data transmission mode. The default value is upload_periodically. |
| period | String | Data transmission period. The default value is 30s. |

| Parameter | Type | Description |
|---|---|---|
| obs | obs object | OBS to which data is actually exported. |

| Parameter | Type | Description |
|---|---|---|
| obs_url | String | OBS URL to which data is actually exported. |

| Parameter | Type | Description |
|---|---|---|
| engine_id | String | Engine ID of a heterogeneous job, for example, caffe-1.0.0-python2.7. |
| engine_name | String | Engine name of a heterogeneous job, for example, Caffe. |
| engine_version | String | Engine version of a heterogeneous job. |
| v1_compatible | Boolean | Whether the v1 compatibility mode is used. |
| run_user | String | UID of the user that the engine starts with by default. |
| image_url | String | Custom image URL selected by an algorithm. |

| Parameter | Type | Description |
|---|---|---|
| flavor_id | String | ID of the resource flavor. |
| flavor_name | String | Name of the resource flavor. |
| max_num | Integer | Maximum number of nodes in a resource flavor. |
| flavor_type | String | Resource flavor type. Options: |
| billing | billing object | Billing information of a resource flavor. |
| flavor_info | flavor_info object | Resource flavor details. |
| attributes | Map<String,String> | Other specification attributes. |

| Parameter | Type | Description |
|---|---|---|
| code | String | Billing code. |
| unit_num | Integer | Number of billing units. |

| Parameter | Type | Description |
|---|---|---|
| max_num | Integer | Maximum number of nodes that can be selected. The value 1 indicates that the distributed mode is not supported. |
| cpu | cpu object | CPU specifications. |
| gpu | gpu object | GPU specifications. |
| npu | npu object | Ascend specifications. |
| memory | memory object | Memory information. |
| disk | disk object | Disk information. |

| Parameter | Type | Description |
|---|---|---|
| arch | String | CPU architecture. |
| core_num | Integer | Number of cores. |

| Parameter | Type | Description |
|---|---|---|
| unit_num | Integer | Number of GPUs. |
| product_name | String | Product name. |
| memory | String | Memory. |

| Parameter | Type | Description |
|---|---|---|
| unit_num | String | Number of NPUs. |
| product_name | String | Product name. |
| memory | String | Memory. |

| Parameter | Type | Description |
|---|---|---|
| resource | Resource object | Resource flavors of a training job. Select either flavor_id or pool_id+[flavor_id]. |
| volumes | Array of volumes objects | Volumes attached to a training job. |
| log_export_path | log_export_path object | Export path of training job logs. |

| Parameter | Type | Description |
|---|---|---|
| policy | String | Resource flavor of a training job. Options: regular |
| flavor_id | String | ID of the resource flavor selected for a training job. flavor_id cannot be specified for dedicated resource pools with CPU specifications. The options for dedicated resource pools with GPU/Ascend specifications are as follows: |
| flavor_name | String | Read-only flavor name returned by ModelArts when flavor_id is used. |
| node_count | Integer | Number of resource replicas selected for a training job. |
| pool_id | String | Resource pool ID selected for a training job. |
| flavor_detail | flavor_detail object | Flavors of a training job or an algorithm. |

| Parameter | Type | Description |
|---|---|---|
| flavor_type | String | Resource flavor type. Options: |
| billing | billing object | Billing information of a resource flavor. |
| flavor_info | flavor_info object | Resource flavor details. |

| Parameter | Type | Description |
|---|---|---|
| code | String | Billing code. |
| unit_num | Integer | Number of billing units. |

| Parameter | Type | Description |
|---|---|---|
| max_num | Integer | Maximum number of nodes that can be selected. The value 1 indicates that the distributed mode is not supported. |
| cpu | cpu object | CPU specifications. |
| gpu | gpu object | GPU specifications. |
| npu | npu object | Ascend specifications. |
| memory | memory object | Memory information. |
| disk | disk object | Disk information. |

| Parameter | Type | Description |
|---|---|---|
| arch | String | CPU architecture. |
| core_num | Integer | Number of cores. |

| Parameter | Type | Description |
|---|---|---|
| unit_num | Integer | Number of GPUs. |
| product_name | String | Product name. |
| memory | String | Memory. |

| Parameter | Type | Description |
|---|---|---|
| unit_num | String | Number of NPUs. |
| product_name | String | Product name. |
| memory | String | Memory. |

| Parameter | Type | Description |
|---|---|---|
| size | Integer | Memory size. |
| unit | String | Unit of the memory size, for example, GB. |

| Parameter | Type | Description |
|---|---|---|
| size | String | Disk size. |
| unit | String | Unit of the disk size. Generally, the value is GB. |

| Parameter | Type | Description |
|---|---|---|
| nfs_server_path | String | NFS server path. |
| local_path | String | Path for attaching volumes to the training container. |
| read_only | Boolean | Whether the volumes attached to the container in NFS mode are read-only. |

| Parameter | Type | Description |
|---|---|---|
| obs_url | String | OBS URL for storing training job logs. |
| host_path | String | Path of the host where training job logs are stored. |

| Parameter | Type | Description |
|---|---|---|
| ssh | SSHResp object | SSH connection information. |
| jupyter_lab | JupyterLab object | JupyterLab connection information. |

| Parameter | Type | Description |
|---|---|---|
| key_pair_names | Array of strings | SSH key pair names, which can be created and viewed on the Key Pair page of the ECS console. |
| task_urls | Array of TaskUrls objects | SSH connection address information. |
Example Requests
The following shows how to query a training job whose UUID is 3faf5c03-aaa1-4cbe-879d-24b05d997347.
GET https://endpoint/v2/{project_id}/training-jobs/3faf5c03-aaa1-4cbe-879d-24b05d997347
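
The request above can be issued from a script. The following is a minimal sketch that uses only the Python standard library. The function names, the endpoint, the project ID, and the token value are hypothetical, and the X-Auth-Token header assumes token-based authentication, which this page does not describe. The request is built but not sent, so the example runs without network access.

```python
import json
import urllib.request


def build_job_detail_request(endpoint: str, project_id: str,
                             training_job_id: str, token: str) -> urllib.request.Request:
    """Build (but do not send) the GET request for one training job."""
    url = f"https://{endpoint}/v2/{project_id}/training-jobs/{training_job_id}"
    # Assumption: the service accepts a user token in the X-Auth-Token header.
    headers = {"X-Auth-Token": token, "Content-Type": "application/json"}
    return urllib.request.Request(url, headers=headers, method="GET")


def fetch_job_detail(request: urllib.request.Request) -> dict:
    """Send the request and decode the JSON body of a successful response."""
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_job_detail_request(
    endpoint="modelarts.example.com",   # hypothetical endpoint
    project_id="my-project-id",         # hypothetical project ID
    training_job_id="3faf5c03-aaa1-4cbe-879d-24b05d997347",
    token="my-token",                   # hypothetical token
)
print(request.get_method(), request.full_url)
```

Calling fetch_job_detail(request) against a real endpoint returns the response body as a dictionary. urlopen raises urllib.error.HTTPError for error status codes, so callers should handle that exception.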
Example Responses
Status code: 200
ok
{
  "kind" : "job",
  "metadata" : {
    "id" : "3faf5c03-aaa1-4cbe-879d-24b05d997347",
    "name" : "trainjob--py14_mem06-108",
    "description" : "",
    "create_time" : 1636447346315,
    "workspace_id" : "0",
    "user_name" : ""
  },
  "status" : {
    "phase" : "Abnormal",
    "secondary_phase" : "CreateFailed",
    "duration" : 0,
    "start_time" : 0,
    "node_count_metrics" : [ [ 1636447746000, 0 ], [ 1636447755000, 0 ], [ 1636447756000, 0 ] ],
    "tasks" : [ "worker-0" ],
    "running_records" : [ {
      "start_at" : 1701327093,
      "end_at" : 1701322341,
      "start_type" : "init_or_rescheduled",
      "end_recover" : "job_reschedule",
      "end_reason" : "exit with 127",
      "end_related_task" : "worker-2",
      "end_recover_before_downgrade" : "npu_proc_restart"
    }, {
      "start_at" : 1701323345,
      "end_at" : 1701325432,
      "start_type" : "init_or_rescheduled",
      "end_reason" : "job completed"
    } ]
  },
  "algorithm" : {
    "code_dir" : "obs://test/economic_test/py_minist/",
    "boot_file" : "obs://test/economic_test/py_minist/minist_common.py",
    "inputs" : [ {
      "name" : "data_url",
      "local_dir" : "/home/ma-user/modelarts/inputs/data_url_0",
      "remote" : {
        "obs" : {
          "obs_url" : "/test/data/py_minist/"
        }
      }
    } ],
    "outputs" : [ {
      "name" : "train_url",
      "local_dir" : "/home/ma-user/modelarts/outputs/train_url_0",
      "remote" : {
        "obs" : {
          "obs_url" : "/test/train_output/"
        }
      }
    } ],
    "engine" : {
      "engine_id" : "pytorch-cp36-1.4.0-v2",
      "engine_name" : "PyTorch",
      "engine_version" : "PyTorch-1.4.0-python3.6-v2"
    }
  },
  "spec" : {
    "resource" : {
      "flavor_id" : "modelarts.vm.p100.large.eco",
      "node_count" : 1,
      "flavor_detail" : {
        "flavor_type" : "GPU",
        "billing" : {
          "code" : "modelarts.vm.gpu.p100.eco",
          "unit_num" : 1
        },
        "flavor_info" : {
          "cpu" : {
            "arch" : "x86",
            "core_num" : 8
          },
          "gpu" : {
            "unit_num" : 1,
            "memory" : "8GB"
          },
          "memory" : {
            "size" : 64,
            "unit" : "GB"
          }
        }
      }
    }
  }
}
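
The nested response can be flattened into the few fields that most callers need. The following sketch is one possible approach and is not part of the API: summarize_job and its output keys are hypothetical names, and the sample string is a trimmed copy of the example response. Every lookup uses a default because the parameter tables do not mark any response field as mandatory.

```python
import json

# Trimmed copy of the example response; only the fields used below are kept.
SAMPLE_RESPONSE = """
{
  "kind": "job",
  "metadata": {"id": "3faf5c03-aaa1-4cbe-879d-24b05d997347",
               "name": "trainjob--py14_mem06-108"},
  "status": {"phase": "Abnormal", "secondary_phase": "CreateFailed", "duration": 0},
  "algorithm": {"engine": {"engine_id": "pytorch-cp36-1.4.0-v2"}},
  "spec": {"resource": {"flavor_id": "modelarts.vm.p100.large.eco", "node_count": 1}}
}
"""


def summarize_job(job: dict) -> dict:
    """Pick commonly used fields out of a training job detail response."""
    metadata = job.get("metadata", {})
    status = job.get("status", {})
    resource = job.get("spec", {}).get("resource", {})
    engine = job.get("algorithm", {}).get("engine", {})
    return {
        "id": metadata.get("id"),
        "name": metadata.get("name"),
        "phase": status.get("phase"),
        "secondary_phase": status.get("secondary_phase"),
        "flavor_id": resource.get("flavor_id"),
        "node_count": resource.get("node_count"),
        # A job uses either a preset engine or a custom image.
        "engine": engine.get("engine_id") or engine.get("image_url"),
    }


summary = summarize_job(json.loads(SAMPLE_RESPONSE))
print(summary["name"], summary["phase"], summary["flavor_id"])
```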
Status Codes

| Status Code | Description |
|---|---|
| 200 | ok |
Error Codes
See Error Codes.