Updated on 2025-12-05 GMT+08:00

Querying the Details About a Training Job

Function

This API is used to query the details about a training job on ModelArts.

This API applies to the following scenario: When you need to view the running status and configuration information of a specific training job, you can use this API to obtain the job details. Before using this API, ensure that you have obtained the training job ID and have the permission to view job details. After the query is complete, the platform returns the details about the training job, including the job status, configuration, and logs. If the training job ID does not exist or you do not have the operation permission, the API will return an error message.

Debugging

You can debug this API through automatic authentication in API Explorer or use the SDK sample code generated by API Explorer. The corresponding CLI example is hcloud ModelArts ShowTrainingJobDetails.

Authorization Information

Each account has all the permissions required to call all APIs, but IAM users must be assigned the required permissions.

If you are using role/policy-based authorization, see Permissions Policies and Supported Actions for details on the required permissions.

If you are using identity policy-based authorization, the following identity policy-based permissions are required.

Action	Access Level	Resource Type (*: required)	Condition Key	Alias	Dependencies
modelarts:trainJob:get	Read	trainJob *	g:ResourceTag/<tag-key>	-	-
modelarts:trainJob:get	Read	-	modelarts:poolType modelarts:poolId	-	-

URI

GET /v2/{project_id}/training-jobs/{training_job_id}

**Table 1** Path Parameters
Parameter	Mandatory	Type	Description
project_id	Yes	String	Definition: Project ID. For details, see Obtaining a Project ID and Name. Constraints: The value can contain 1 to 64 characters. Letters, digits, and hyphens (-) are allowed. Range: N/A Default Value: N/A
training_job_id	Yes	String	Definition: Training job ID. For details, see Obtaining Training Jobs. Constraints: N/A Range: N/A Default Value: N/A
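
As a rough illustration of how the path parameters in Table 1 fit into the URI, the following Python sketch assembles the request path. The helper name build_job_path is hypothetical, and the regular expression encodes only the project_id constraint stated above (1 to 64 letters, digits, or hyphens); the table gives no constraints for training_job_id, so the sketch merely URL-escapes it.

```python
import re
from urllib.parse import quote

# Constraint from Table 1: 1 to 64 characters; letters, digits, and hyphens.
_PROJECT_ID_RE = re.compile(r"[A-Za-z0-9-]{1,64}")


def build_job_path(project_id: str, training_job_id: str) -> str:
    """Return the URI path for querying one training job (hypothetical helper)."""
    if not _PROJECT_ID_RE.fullmatch(project_id):
        raise ValueError("project_id must be 1-64 letters, digits, or hyphens")
    if not training_job_id:
        raise ValueError("training_job_id must not be empty")
    # No constraints are documented for training_job_id, so it is only escaped.
    return f"/v2/{project_id}/training-jobs/{quote(training_job_id, safe='')}"
```

Rejecting a malformed project_id locally is a design choice of this sketch; the service performs its own validation regardless.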

Request Parameters

None

Response Parameters

Status code: 200

**Table 2** Response body parameters
Parameter	Type	Description
kind	String	Definition: Type of a training job. Range: job: common job federated_pool_job: resource pool federated job edge_job: edge job hetero_job: heterogeneous job mrs_job: MRS job autosearch_job: auto search job diag_job: diagnosis job visualization_job: visualization job
metadata	JobMetadataResponse object	Definition: Training job metadata.
status	Status object	Definition: Training job status information.
algorithm	JobAlgorithmResponse object	Definition: Training job algorithm.
tasks	Array of TaskResponse objects	Definition: Heterogeneous training tasks.
spec	SpecResponse object	Definition: Training job specifications.
endpoints	JobEndpointsResp object	Definition: Configurations required for remotely accessing a training job.

**Table 3** JobMetadataResponse
Parameter	Type	Description
id	String	Definition: Training job ID, which is generated and returned by ModelArts after a training job is created. Range: N/A
name	String	Definition: Name of a training job. Range: The value must contain 1 to 64 characters consisting of only digits, letters, underscores (_), and hyphens (-).
workspace_id	String	Definition: Workspace where a specified job is located. Range: N/A
description	String	Definition: Description of a training job. Range: N/A
create_time	Long	Definition: Time when a training job was created, in milliseconds. The value is generated and returned by ModelArts after a training job is created. Range: N/A
user_name	String	Definition: Username for creating a training job. The username is generated and returned by ModelArts after a training job is created. Range: N/A
annotations	Map<String,String>	Definition: Advanced functions of a training job.
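
Because create_time is reported in milliseconds, it usually needs converting before display. The sketch below assumes the value counts milliseconds since the Unix epoch (the table states only the unit); the function name is hypothetical. Integer arithmetic is used so that the millisecond part is not lost to floating-point rounding.

```python
from datetime import datetime, timedelta, timezone


def create_time_to_datetime(create_time_ms: int) -> datetime:
    """Convert a millisecond timestamp to a timezone-aware UTC datetime."""
    # Assumption: create_time is measured from the Unix epoch.
    seconds, millis = divmod(create_time_ms, 1000)
    return datetime.fromtimestamp(seconds, tz=timezone.utc) + timedelta(milliseconds=millis)
```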

**Table 4** Status
Parameter	Type	Description
phase	String	Definition: Level-1 status of a training job. Range: Creating: The job is being created. Pending: The job is pending. Running: The job is running. Failed: The job failed to run. Completed: The job is complete. Terminating: The job is being stopped. Terminated: The job has been stopped. Abnormal: The job is abnormal.
secondary_phase	String	Definition: Level-2 status of a training job. The values are internal detailed statuses and may be added, changed, or deleted. Dependency on the status is not recommended. Range: Creating: The job is being created. Queuing: The job is queuing. Running: The job is running. Failed: The job failed to run. Completed: The job is complete. Terminating: The job is being stopped. Terminated: The job has been stopped. CreateFailed: The job fails to be created. TerminatedFailed: The job fails to be stopped. Unknown: The job is in an unknown state. Lost: The job is abnormal.
duration	Long	Definition: Running duration of a training job, in ms. Range: N/A
node_count_metrics	Array<Array<Integer>>	Definition: Node quantity change metric during a training job runtime.
tasks	Array of strings	Definition: Training job subtask name.
start_time	Long	Definition: Timestamp when a training job is started. Range: N/A
task_statuses	Array of TaskStatuses objects	Definition: Status of the first failed subtask of a training job.
running_records	Array of RunningRecord objects	Definition: Running and fault recovery records of a training job.
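
A client that polls this API typically needs to decide when to stop. The following Python sketch checks the level-1 phase from Table 4. Which phases count as final is an assumption of this sketch (Failed, Completed, Terminated, and Abnormal), not something the table states, and the function name is hypothetical. It deliberately ignores secondary_phase, since the table advises against depending on it.

```python
# Level-1 phases listed in Table 4.
ALL_PHASES = {
    "Creating", "Pending", "Running", "Failed",
    "Completed", "Terminating", "Terminated", "Abnormal",
}

# Assumption: these phases are treated as final; adjust if your jobs behave differently.
ASSUMED_FINAL_PHASES = {"Failed", "Completed", "Terminated", "Abnormal"}


def is_finished(status: dict) -> bool:
    """Return True if the status object reports an (assumed) final phase."""
    phase = status.get("phase")
    if phase not in ALL_PHASES:
        raise ValueError(f"unexpected phase: {phase!r}")
    return phase in ASSUMED_FINAL_PHASES
```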

**Table 5** TaskStatuses
Parameter	Type	Description
task	String	Definition: Training job subtask name. Range: N/A
exit_code	Integer	Definition: Exit code of a training job subtask. Range: N/A
message	String	Definition: Error message of a training job subtask. Range: N/A

**Table 6** RunningRecord
Parameter	Type	Description
start_at	Integer	Definition: Unix timestamp of the start time in the current running record, in seconds. Range: N/A
end_at	Integer	Definition: Unix timestamp of the end time in the current running record, in seconds. Range: N/A
xpu_start_at	Integer	Definition: Unix timestamp of the accelerator card startup time in the current running record, in seconds. Range: N/A
start_type	String	Definition: Startup mode of the current execution. Range: init_or_rescheduled: This startup is the first running after scheduling, including the first startup and the running after scheduling recovery. restarted: This startup is not the first running after scheduling but the running after a process restart.
end_reason	String	Definition: Reason why the running ends. Range: N/A
end_related_task	String	Definition: ID of the task worker (for example, worker-0) that ends the running. Range: N/A
end_recover	String	Definition: Fault tolerance policy adopted when the execution ends abnormally. Range: npu_proc_restart: NPU in-place hot recovery proc_restart: in-place process recovery npu_step_retry: step recomputation pod_reschedule: pod-level rescheduling job_reschedule: job-level rescheduling job_reschedule_with_taint: isolated job-level rescheduling
end_recover_before_downgrade	String	Definition: There is a downgrade relationship between policies. If a policy fails to be executed, it will be downgraded to another specified policy. end_recover_before_downgrade indicates the tolerance policy used before end_recover is downgraded. Range: same as that of end_recover.
recover_records	Array of RecoverRecord objects	Definition: Details about all fault tolerance policies adopted when the execution ends abnormally.
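
Unlike create_time and duration, the timestamps in a running record are given in seconds. As a small sketch of working with them, the hypothetical helper below computes how long one running record lasted. It returns None when either timestamp is missing, which is an assumption about how an unfinished record might appear.

```python
from typing import Optional


def record_duration_seconds(record: dict) -> Optional[int]:
    """Return end_at - start_at in seconds, or None if either field is absent."""
    start = record.get("start_at")
    end = record.get("end_at")
    if start is None or end is None:
        return None
    return end - start
```

A negative result would indicate that the two timestamps are inconsistent and is worth logging rather than silently clamping.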

**Table 7** RecoverRecord
Parameter	Type	Description
recover_start_at	Integer	Unix timestamp of the start time of the fault tolerance policy, in seconds. The timestamp is also the fault occurrence time.
recover_end_at	Integer	Unix timestamp of the end time of the fault tolerance policy, in seconds.
recover	String	Fault tolerance policy. Options: npu_step_retry: step recomputation npu_proc_restart: NPU in-place hot recovery proc_restart: in-place process recovery pod_reschedule: pod-level rescheduling job_reschedule: job-level rescheduling job_reschedule_with_taint: isolated job-level rescheduling
fault_scenario	String	Fault scenario. Options: chip_fault: chip fault node_fault: node fault job_failed: job exit upon a failure job_hanged: job suspension job_subhealth: job subhealth error_in_log: log exception
reason	String	Cause of the fault.
related_task	String	ID of the task worker that causes the end of the current running record, for example, worker-0.
recover_result	String	Execution result of the fault tolerance policy. Options: recovering: executing success: successful failed: failed downgrade: policy downgrade
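
When a job has gone through several recovery attempts, it can help to tally them. The sketch below counts recover records by policy and result using the recover and recover_result fields from Table 7. The function name is hypothetical, and records missing either field are counted under None.

```python
from collections import Counter


def count_recover_results(recover_records: list) -> Counter:
    """Count records per (recover policy, recover_result) pair."""
    return Counter(
        (record.get("recover"), record.get("recover_result"))
        for record in recover_records
    )
```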

**Table 8** JobAlgorithmResponse
Parameter	Type	Description
id	String	Definition: Training job algorithm. Range: id: Only the algorithm ID is used. subscription_id+item_version_id: The subscription ID and version ID of the algorithm are used. code_dir+boot_file: The code directory and boot file of the training job are used.
name	String	Definition: Algorithm name. Range: N/A
subscription_id	String	Definition: Subscription ID of a subscription algorithm, which must be used with item_version_id. Range: N/A
item_version_id	String	Definition: Version of a subscription algorithm, which must be used with subscription_id. Range: N/A
code_dir	String	Definition: Code directory of a training job, for example, /usr/app/. This parameter must be used with boot_file. Leave this parameter blank if id, or subscription_id and item_version_id are specified. Range: N/A
boot_file	String	Definition: Boot file of a training job, which must be stored in the code directory, for example, /usr/app/boot.py. This parameter must be used with code_dir. Leave this parameter blank if id, or subscription_id and item_version_id are specified. Range: N/A
autosearch_config_path	String	Definition: YAML configuration path of an auto search job. An OBS URL is required. For example, obs://bucket/file.yaml. Range: N/A
autosearch_framework_path	String	Definition: Framework code directory of an auto search job. An OBS URL is required. For example, obs://bucket/files/. Range: N/A
command	String	Definition: Boot command for starting the container of a custom image for a training job. For example, python train.py. Range: N/A
parameters	Array of ParameterResp objects	Definition: Running parameters of the training job.
policies	policies object	Definition: Policy supported by a job.
inputs	Array of InputResp objects	Definition: Data input of a training job.
outputs	Array of OutputResp objects	Definition: Output of the training job.
engine	JobEngineResp object	Definition: Engine of a training job. Leave this parameter blank if the job is created using id of the algorithm in algorithm management, or subscription_id+item_version_id of the subscribed algorithm.
local_code_dir	String	Definition: Local directory of the training container to which the algorithm code directory is downloaded. The rules are as follows: The directory must be under /home. In v1 compatibility mode, the current field does not take effect. When code_dir is prefixed with file://, the current field does not take effect. Range: N/A
working_dir	String	Definition: Work directory where an algorithm is executed. Rules: In v1 compatibility mode, this parameter does not take effect. Range: N/A
environments	Array of Map<String,String> objects	Definition: Environment variables of a training job. The format is key:value. Leave this parameter blank.
summary	SummaryResp object	Definition: Visualization log summary.
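
Table 8 describes three ways an algorithm can be identified: by id, by subscription_id plus item_version_id, or by code_dir plus boot_file. The following sketch reports which combination a returned algorithm object carries. The function name and the order in which the combinations are checked are assumptions of this sketch.

```python
def algorithm_source(algorithm: dict) -> str:
    """Report which field combination identifies the algorithm."""
    if algorithm.get("id"):
        return "id"
    if algorithm.get("subscription_id") and algorithm.get("item_version_id"):
        return "subscription_id+item_version_id"
    if algorithm.get("code_dir") and algorithm.get("boot_file"):
        return "code_dir+boot_file"
    # None of the documented combinations is fully present.
    return "unknown"
```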

**Table 9** ParameterResp
Parameter	Type	Description
name	String	Definition: Parameter name. Range: N/A
value	String	Definition: Parameter value. Range: N/A
description	String	Definition: Parameter description. Range: N/A
constraint	constraint object	Definition: Parameter attribute.
i18n_description	i18n_description object	Definition: Internationalization description.

**Table 10** constraint
Parameter	Type	Description
type	String	Definition: Parameter type. Range: N/A
editable	Boolean	Definition: Whether the parameter can be edited. Range: true: editable false: not editable
required	Boolean	Definition: Whether the parameter is mandatory. Range: true: mandatory false: optional
sensitive	Boolean	Definition: Whether the parameter is sensitive. This function is unavailable currently. Range: true: sensitive false: insensitive
valid_type	String	Definition: Valid type. Range: N/A
valid_range	Array of strings	Definition: Valid range.

**Table 11** i18n_description
Parameter	Type	Description
language	String	Definition: Internationalization language. The options are as follows: zh-cn: Chinese en-us: English Range: N/A
description	String	Definition: Internationalization language description. Range: N/A

**Table 12** policies
Parameter	Type	Description
auto_search	auto_search object	Definition: Hyperparameter search configuration.

**Table 13** auto_search
Parameter	Type	Description
skip_search_params	String	Definition: Hyperparameters to be skipped during the search. Range: N/A
reward_attrs	Array of reward_attrs objects	Definition: Search metrics.
search_params	Array of search_params objects	Definition: Search parameters.
algo_configs	Array of algo_configs objects	Definition: Search algorithm configurations.

**Table 14** reward_attrs
Parameter	Type	Description
name	String	Definition: Metric name. Range: N/A
mode	String	Definition: Search mode. Range: max: A larger metric value is preferred. min: A smaller metric value is preferred.
regex	String	Definition: Regular expression of a metric. Range: N/A

**Table 15** search_params
Parameter	Type	Description
name	String	Definition: Hyperparameter name. Range: N/A
param_type	String	Definition: Parameter type. Range: continuous: The hyperparameter is of the continuous type. When an algorithm is used in a training job, continuous hyperparameters are displayed as text boxes on the console. discrete: The hyperparameter is of the discrete type. When an algorithm is used in a training job, discrete hyperparameters are displayed as drop-down lists on the console.
lower_bound	String	Definition: Lower bound of the hyperparameter. Range: N/A
upper_bound	String	Definition: Upper bound of the hyperparameter. Range: N/A
discrete_points_num	String	Definition: Number of discrete points of a hyperparameter with continuous values. Range: N/A
discrete_values	Array of strings	Definition: Discrete hyperparameter values.

**Table 16** algo_configs
Parameter	Type	Description
name	String	Definition: Search algorithm name. Range: N/A
params	Array of AutoSearchAlgoConfigParameterResp objects	Definition: Search algorithm parameters.

**Table 17** AutoSearchAlgoConfigParameterResp
Parameter	Type	Description
key	String	Definition: Parameter key. Range: N/A
value	String	Definition: Parameter value. Range: N/A
type	String	Definition: Parameter type. Range: N/A

**Table 18** InputResp
Parameter	Type	Description
name	String	Definition: Name of the data input channel. Range: N/A
description	String	Definition: Description of the data input channel. Range: N/A
local_dir	String	Definition: Local path of the container to which the data input channels are mapped. Example: /home/ma-user/modelarts/inputs/data_url_0 Range: N/A
access_method	String	Definition: Access method of the input data channel path (local_dir). Range: parameter: hyperparameters env: environment variables
remote	InputDataInfoResp object	Definition: Description of the actual data input.
remote_constraint	Array of remote_constraint objects	Definition: Data input constraint.

**Table 19** InputDataInfoResp
Parameter	Type	Description
dataset	dataset object	Definition: The input is a dataset.
obs	obs object	Definition: OBS in which data input and output are stored.

**Table 20** dataset
Parameter	Type	Description
id	String	Definition: Dataset ID of a training job. Range: N/A
version_id	String	Definition: Dataset version ID of a training job. Range: N/A
obs_url	String	Definition: OBS URL of the dataset for a training job. It is automatically parsed by ModelArts based on the dataset ID and dataset version ID. For example, /usr/data/. Range: N/A

**Table 21** obs
Parameter	Type	Description
obs_url	String	Definition: OBS URL of the dataset for a training job. For example, /usr/data/. Range: N/A

**Table 22** remote_constraint
Parameter	Type	Description
data_type	String	Definition: Data input type, including the data storage location and dataset. Constraints: N/A Range: N/A Default Value: N/A
attributes	String	Definition: Related attributes. Constraints: N/A Range: If the input is a dataset: data_format: data format data_segmentation: data segmentation method dataset_type: data labeling type Default Value: N/A

**Table 23** OutputResp
Parameter	Type	Description
name	String	Definition: Name of the data output channel. Range: N/A
description	String	Definition: Description of the data output channel. Range: N/A
local_dir	String	Definition: Local path of the container to which the data output channels are mapped. Range: N/A
access_method	String	Definition: Access method of the output data channel path (local_dir). Range: parameter: hyperparameters env: environment variables
remote	RemoteResp object	Definition: Description of the actual data output.

**Table 24** JobEngineResp
Parameter	Type	Description
engine_id	String	Definition: Engine ID selected for a training job. Range: N/A
engine_name	String	Definition: Engine name selected for a training job. Range: N/A
engine_version	String	Definition: Engine version selected for a training job. Range: N/A
image_url	String	Definition: Custom image URL selected for a training job. The URL is obtained from SWR. Range: N/A
install_sys_packages	Boolean	Definition: Specifies whether to install the MoXing version specified by the training platform. Range: true: yes false: no

**Table 25** SummaryResp
Parameter	Type	Description
log_type	String	Definition: Visualization log type of a training job. After this parameter is configured, the training job can be used as the data source of a visualization job. Range: tensorboard: TensorBoard mindstudio-insight: MindStudio Insight
log_dir	LogDirResp object	Definition: Visualization log output of a training job.
data_sources	Array of DataSourceResp objects	Definition: Visualization log input of the visualization job or training job debugging mode.

**Table 26** LogDirResp
Parameter	Type	Description
pfs	PFSSummaryResp object	Definition: Output of an OBS parallel file system.

**Table 27** PFSSummaryResp
Parameter	Type	Description
pfs_path	String	Definition: URL of the OBS parallel file system. Range: N/A

**Table 28** DataSourceResp
Parameter	Type	Description
job	JobSummaryResp object	Definition: Job data source.

**Table 29** JobSummaryResp
Parameter	Type	Description
job_id	String	Definition: ID of a training job. Range: N/A

**Table 30** TaskResponse
Parameter	Type	Description
role	String	Definition: Task role. This function is not supported currently. Range: N/A
algorithm	TaskResponseAlgorithm object	Definition: Algorithm configurations for algorithm management.
task_resource	FlavorResponse object	Definition: Specifications of a training job or algorithm.
log_export_path	log_export_path object	Definition: Saved information about training job logs.

**Table 31** TaskResponseAlgorithm
Parameter	Type	Description
code_dir	String	Definition: Absolute path of the directory where the algorithm boot file is stored. Range: N/A
boot_file	String	Definition: Absolute path of an algorithm boot file. Range: N/A
inputs	AlgorithmInput object	Definition: Information about the algorithm input channel.
outputs	AlgorithmOutput object	Definition: Information about the algorithm output channel.
engine	AlgorithmEngine object	Definition: Engine that a heterogeneous job depends on.
local_code_dir	String	Definition: Local directory of the training container to which the algorithm code directory is downloaded. The rules are as follows: The directory must be under /home. In v1 compatibility mode, the current field does not take effect. When code_dir is prefixed with file://, the current field does not take effect. Range: N/A
working_dir	String	Definition: Work directory where an algorithm is executed. Note that this parameter does not take effect in v1 compatibility mode. Range: N/A
environments	Map<String,String>	Definition: Environment variables related to a training job. Range: N/A

**Table 32** AlgorithmInput
Parameter	Type	Description
name	String	Definition: Name of the data input channel. Range: N/A
local_dir	String	Definition: Local path of the container to which the data input and output channels are mapped. Range: N/A
remote	AlgorithmRemote object	Definition: Actual data input, which can only be OBS for heterogeneous jobs.

**Table 33** AlgorithmRemote
Parameter	Type	Description
obs	RemoteObsResp object	Definition: OBS in which data input and output are stored.

**Table 34** AlgorithmOutput
Parameter	Type	Description
name	String	Definition: Name of the data output channel. Range: N/A
local_dir	String	Definition: Local path of the container to which the data output channels are mapped. Range: N/A
remote	RemoteResp object	Definition: Description of the actual data output.
mode	String	Definition: Data transmission mode. The default value is upload_periodically. Range: N/A
period	String	Definition: Data transmission period. The default value is 30s. Range: N/A

**Table 35** RemoteResp
Parameter	Type	Description
obs	RemoteObsResp object	Definition: Data actually output to OBS.

**Table 36** RemoteObsResp
Parameter	Type	Description
obs_url	String	Definition: Path of the data output to OBS. Range: N/A

**Table 37** AlgorithmEngine
Parameter	Type	Description
engine_id	String	Definition: Engine flavor ID, for example, caffe-1.0.0-python2.7. Range: N/A
engine_name	String	Definition: Engine flavor name, for example, Caffe. Range: N/A
engine_version	String	Definition: Engine flavor version. Engines with the same name have multiple versions, for example, Caffe-1.0.0-python2.7 of Python 2.7. Range: N/A
v1_compatible	Boolean	Definition: Specifies whether the v1 compatibility mode is used. Range: true: The v1 compatibility mode is used. false: The v1 compatibility mode is not used.
run_user	String	Definition: Default UID for the engine startup. Range: N/A
image_url	String	Definition: Custom image URL selected for an algorithm. Range: N/A

**Table 38** FlavorResponse
Parameter	Type	Description
pool_id	String	Definition: ID of the resource pool selected for a training job. Range: N/A
flavor_id	String	Definition: Resource flavor ID. Range: N/A
flavor_name	String	Definition: Resource flavor name. Range: N/A
max_num	Integer	Definition: Maximum number of nodes supported by a flavor. Range: N/A
flavor_type	String	Definition: Resource flavor type. Range: CPU GPU Ascend
billing	BillingInfo object	Definition: Billing information of a resource flavor.
flavor_info	FlavorInfoResponse object	Definition: Resource flavor details.
attributes	Map<String,String>	Definition: Other flavor attributes. Range: N/A

**Table 39** FlavorInfoResponse
Parameter	Type	Description
max_num	Integer	Definition: Maximum number of nodes that can be selected. The value 1 indicates that the distributed mode is not supported. Range: N/A
cpu	Cpu object	Definition: CPU specifications.
gpu	Gpu object	Definition: GPU specifications.
npu	Npu object	Definition: Ascend specifications.
memory	Memory object	Definition: Memory information.
disk	DiskResponse object	Definition: Disk information.

**Table 40** DiskResponse
Parameter	Type	Description
size	Integer	Definition: Disk size. Range: N/A
unit	String	Definition: Unit of the disk size. Range: N/A

**Table 41** log_export_path
Parameter	Type	Description
obs_url	String	Definition: OBS path for storing training job logs.

**Table 42** SpecResponse
Parameter	Type	Description
resource	Resource object	Definition: Resource flavor of a training job. Specify either flavor_id alone or pool_id together with flavor_id.
volumes	Array of JobVolumeResp objects	Definition: Mounting volume information of a training job.
log_export_path	LogExportPathResp object	Definition: Log output of a training job.
schedule_policy	SchedulePolicyResp object	Definition: Scheduling policy of a training job.
custom_metrics	Array of CustomMetrics objects	Definition: Metric collection configuration.

**Table 43** Resource
Parameter	Type	Description
policy	String	Definition: Resource flavor mode of a training job. Range: regular: standard mode
flavor_id	String	Definition: ID of the resource flavor of a training job. Range: The flavor_id parameter cannot be specified for a dedicated resource pool of CPU specifications. The options for dedicated resource pools with GPU/Ascend specifications are as follows: modelarts.pool.visual.xlarge (1 PU) modelarts.pool.visual.2xlarge (2 PUs) modelarts.pool.visual.4xlarge (4 PUs) modelarts.pool.visual.8xlarge (8 PUs)
flavor_name	String	Definition: Read-only flavor name returned by ModelArts when flavor_id is used. Range: N/A
node_count	Integer	Definition: Number of resource replicas selected for a training job. Range: N/A
pool_id	String	Definition: ID of the resource pool selected for a training job. Range: N/A
pool_group_id	String	Definition: ID of the resource pool federation selected for a training job. Range: N/A
flavor_detail	FlavorDetail object	Definition: Flavor details of a training job or algorithm. This parameter is available only for public resource pools.
main_container_allocated_resources	MainContainerAllocatedResources object	Definition: Resource specifications actually obtained by the training container of a training job.
main_container_customized_flavor	MainContainerCustomizedFlavor object	Definition: Custom flavor of a training job. Range: The number of CPU cores and memory size must be greater than 0, and the number of accelerator PUs must be greater than or equal to 0.

**Table 44** FlavorDetail
Parameter	Type	Description
flavor_type	String	Definition: Resource flavor type. Range: CPU GPU Ascend
billing	BillingInfo object	Definition: Billing information of a resource flavor.
flavor_info	FlavorInfo object	Definition: Resource flavor details.

**Table 45** BillingInfo
Parameter	Type	Description
code	String	Definition: Billing code. Range: N/A
unit_num	Integer	Definition: Billing unit. Range: N/A

**Table 46** FlavorInfo
Parameter	Type	Description
max_num	Integer	Definition: Maximum number of nodes that can be selected. The value 1 indicates that the distributed mode is not supported. Range: N/A
cpu	Cpu object	Definition: CPU specifications.
gpu	Gpu object	Definition: GPU specifications.
npu	Npu object	Definition: Ascend specifications.
memory	Memory object	Definition: Memory information.
disk	Disk object	Definition: Disk information.

**Table 47** Cpu
Parameter	Type	Description
arch	String	Definition: CPU architecture. Range: N/A
core_num	Integer	Definition: Number of cores. Range: N/A

**Table 48** Gpu
Parameter	Type	Description
unit_num	Integer	Definition: Number of GPUs. Range: N/A
product_name	String	Definition: Product name. Range: N/A
memory	String	Definition: Memory. Range: N/A

**Table 49** Npu
Parameter	Type	Description
unit_num	String	Definition: Number of NPUs. Range: N/A
product_name	String	Definition: Product name. Range: N/A
memory	String	Definition: Memory. Range: N/A

**Table 50** Memory
Parameter	Type	Description
size	Integer	Definition: Memory size. Range: N/A
unit	String	Definition: Unit of the memory size. Range: N/A

**Table 51** Disk
Parameter	Type	Description
size	String	Definition: Disk size. Range: N/A
unit	String	Definition: Unit of the disk size. Generally, the unit is GB. Range: N/A

**Table 52** MainContainerAllocatedResources
Parameter	Type	Description
cpu_arch	String	Definition: CPU architecture. Range: N/A
cpu_core_num	Float	Definition: Number of cores. Range: N/A
mem_size	Float	Definition: Memory size. Range: N/A
accelerator_num	Float	Definition: Number of accelerator cards. Range: N/A
accelerator_type	String	Definition: Type of accelerator cards. Range: N/A

**Table 53** MainContainerCustomizedFlavor
Parameter	Type	Description
cpu_core_num	Float	Definition: Number of CPU cores. Range: greater than 0
mem_size	Float	Definition: Memory size. Range: greater than 0
accelerator_num	Float	Definition: Number of accelerator cards. Range: greater than or equal to 0
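
The ranges in Table 53 can be checked mechanically. The sketch below returns a list of violated rules for a main_container_customized_flavor object. The function name is hypothetical, and treating a missing field as 0 is an assumption made to keep the example short.

```python
def customized_flavor_errors(flavor: dict) -> list:
    """Return the violated range rules from Table 53 (empty list if none)."""
    errors = []
    # Assumption: a missing field is treated as 0.
    if not flavor.get("cpu_core_num", 0) > 0:
        errors.append("cpu_core_num must be greater than 0")
    if not flavor.get("mem_size", 0) > 0:
        errors.append("mem_size must be greater than 0")
    if flavor.get("accelerator_num", 0) < 0:
        errors.append("accelerator_num must be greater than or equal to 0")
    return errors
```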

**Table 54** JobVolumeResp
Parameter	Type	Description
nfs	NfsResp object	Definition: Volumes attached in NFS mode.

**Table 55** NfsResp
Parameter	Type	Description
nfs_server_path	String	Definition: NFS server path, for example, 10.10.10.10:/example/path. Range: N/A
local_path	String	Definition: Path for attaching volumes to the training container, for example, /example/path. Range: N/A
read_only	Boolean	Definition: Specifies whether the disks attached to the container in NFS mode are read-only. Range: true: read only false: non-read-only

**Table 56** LogExportPathResp
Parameter	Type	Description
obs_url	String	Definition: OBS path for storing training job logs, for example, obs://example/path. Range: N/A
host_path	String	Definition: Path of the host where training job logs are stored, for example, /example/path. Range: N/A

**Table 57** SchedulePolicyResp
Parameter	Type	Description
required_affinity	RequiredAffinityResp object	Definition: Affinity requirements of a training job.
priority	Integer	Definition: Priority of a training job. Range: 0 to 3
preemptible	Boolean	Definition: Whether the resource can be preempted. Range: true: The resource can be preempted. false: The resource cannot be preempted.

**Table 58** RequiredAffinityResp
Parameter	Type	Description
affinity_type	String	Definition: Affinity scheduling policy. Range: cabinet: strong cabinet scheduling hyperinstance: supernode affinity scheduling
affinity_group_size	Integer	Definition: Size of an affinity group. Range: N/A

**Table 59** CustomMetrics
Parameter	Type	Description
exec	Exec object	Definition: Metrics are collected in CLI mode.
http_get	HttpGet object	Definition: Metrics are collected in HTTP mode.

**Table 60** Exec
Parameter	Type	Description
command	Array of strings	Definition: Metrics are collected in CLI mode.

**Table 61** HttpGet
Parameter	Type	Description
path	String	Definition: URL for obtaining metrics over HTTP. Both the URL and the port below must either be configured together or remain empty. Range: N/A
port	Integer	Definition: Port for obtaining metrics over HTTP. This parameter and the URL above must be set or left blank at the same time. Range: N/A
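
Table 61 states that path and port must be set together or both left empty. A minimal sketch of that rule, with a hypothetical function name, could look like this. It treats an empty string as an unset path, which is an assumption.

```python
def http_get_is_consistent(http_get: dict) -> bool:
    """Return True if path and port are both set or both unset."""
    has_path = bool(http_get.get("path"))  # assumption: "" counts as unset
    has_port = http_get.get("port") is not None
    return has_path == has_port
```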

**Table 62** JobEndpointsResp
Parameter	Type	Description
ssh	SSHResp object	Definition: SSH connection information.
jupyter_lab	JupyterLab object	Definition: JupyterLab connection information.
tensorboard	Tensorboard object	Definition: TensorBoard connection information.
mindstudio_insight	MindStudioInsight object	Definition: MindStudio Insight connection information.

**Table 63** SSHResp
Parameter	Type	Description
key_pair_names	Array of strings	Definition: Name of the SSH key pair, which can be created and viewed on the Key Pair page of the Elastic Cloud Server (ECS) console. Range: N/A
task_urls	Array of TaskUrls objects	Definition: SSH connection address.

**Table 64** TaskUrls
Parameter	Type	Description
task	String	Definition: Task ID of a training job. Range: N/A
url	String	Definition: SSH connection address of a training job. Range: N/A

**Table 65** JupyterLab
Parameter	Type	Description
url	String	Definition: JupyterLab address of a training job. Range: N/A
token	String	Definition: JupyterLab token of a training job. Range: N/A

**Table 66** Tensorboard
Parameter	Type	Description
url	String	Definition: TensorBoard address of a training job. Range: N/A
token	String	Definition: TensorBoard token of a training job. Range: N/A

**Table 67** MindStudioInsight
Parameter	Type	Description
url	String	Definition: MindStudio Insight address of a training job. Range: N/A
token	String	Definition: MindStudio Insight token of a training job. Range: N/A

Example Requests

The following shows how to query a training job whose UUID is 3faf5c03-aaa1-4cbe-879d-24b05d997347.

GET https://endpoint/v2/{project_id}/training-jobs/3faf5c03-aaa1-4cbe-879d-24b05d997347
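
For readers calling the API without the SDK, the following Python sketch builds the same GET request with the standard library. The function name and all argument values are placeholders. Passing the credential in an X-Auth-Token header is an assumption based on token authentication and is not stated on this page; use whichever authentication method your account is configured for.

```python
from urllib.request import Request


def build_job_request(endpoint: str, project_id: str, job_id: str, token: str) -> Request:
    """Build (but do not send) the GET request for one training job."""
    url = f"https://{endpoint}/v2/{project_id}/training-jobs/{job_id}"
    # Assumption: token-based authentication via the X-Auth-Token header.
    return Request(url, method="GET", headers={"X-Auth-Token": token})
```

The returned object can be passed to urllib.request.urlopen, and the response body decoded with json.load.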

Example Responses

Status code: 200

{
  "kind" : "job",
  "metadata" : {
    "id" : "3faf5c03-aaa1-4cbe-879d-24b05d997347",
    "name" : "trainjob--py14_mem06-108",
    "description" : "",
    "create_time" : 1636447346315,
    "workspace_id" : "0",
    "user_name" : ""
  },
  "status" : {
    "phase" : "Abnormal",
    "secondary_phase" : "CreateFailed",
    "duration" : 0,
    "start_time" : 0,
    "node_count_metrics" : [ [ 1636447746000, 0 ], [ 1636447755000, 0 ], [ 1636447756000, 0 ] ],
    "tasks" : [ "worker-0" ],
    "running_records" : [ {
      "start_at" : 1701327093,
      "end_at" : 1701322341,
      "start_type" : "init_or_rescheduled",
      "end_recover" : "job_reschedule",
      "end_reason" : "exit with 127",
      "end_related_task" : "worker-2",
      "end_recover_before_downgrade" : "npu_proc_restart"
    }, {
      "start_at" : 1701323345,
      "end_at" : 1701325432,
      "start_type" : "init_or_rescheduled",
      "end_reason" : "job completed"
    } ]
  },
  "algorithm" : {
    "code_dir" : "obs://test/economic_test/py_minist/",
    "boot_file" : "obs://test/economic_test/py_minist/minist_common.py",
    "inputs" : [ {
      "name" : "data_url",
      "local_dir" : "/home/ma-user/modelarts/inputs/data_url_0",
      "remote" : {
        "obs" : {
          "obs_url" : "/test/data/py_minist/"
        }
      }
    } ],
    "outputs" : [ {
      "name" : "train_url",
      "local_dir" : "/home/ma-user/modelarts/outputs/train_url_0",
      "remote" : {
        "obs" : {
          "obs_url" : "/test/train_output/"
        }
      }
    } ],
    "engine" : {
      "engine_id" : "pytorch-cp36-1.4.0-v2",
      "engine_name" : "PyTorch",
      "engine_version" : "PyTorch-1.4.0-python3.6-v2"
    }
  },
  "spec" : {
    "resource" : {
      "flavor_id" : "modelarts.vm.pnt1.large.eco",
      "node_count" : 1,
      "flavor_detail" : {
        "flavor_type" : "GPU",
        "billing" : {
          "code" : "modelarts.vm.gpu.pnt1.eco",
          "unit_num" : 1
        },
        "flavor_info" : {
          "cpu" : {
            "arch" : "x86",
            "core_num" : 8
          },
          "gpu" : {
            "unit_num" : 1,
            "memory" : "8GB"
          },
          "memory" : {
            "size" : 64,
            "unit" : "GB"
          }
        }
      },
      "main_container_allocated_resources" : {
        "cpu_arch" : "x86",
        "cpu_core_num" : 5,
        "mem_size" : 44,
        "accelerator_num" : 1,
        "accelerator_type" : "nvidia-v100-pcie32"
      }
    },
    "custom_metrics" : [ {
      "exec" : {
        "command" : [ "cat", "/a/b/c.prom" ]
      }
    }, {
      "http_get" : {
        "path" : "/raw_text",
        "port" : 10001
      }
    } ]
  }
}
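
To show how a client might read such a response, the sketch below parses a shortened excerpt of the example above and pulls out a few commonly needed fields. The function name and the choice of fields are illustrative only. Every lookup uses a default because the response tables do not mark any field as always present.

```python
import json

# Shortened excerpt of the example response above.
RESPONSE_EXCERPT = """
{
  "kind": "job",
  "metadata": {"id": "3faf5c03-aaa1-4cbe-879d-24b05d997347",
               "name": "trainjob--py14_mem06-108"},
  "status": {"phase": "Abnormal", "secondary_phase": "CreateFailed",
             "tasks": ["worker-0"]},
  "spec": {"resource": {"flavor_id": "modelarts.vm.pnt1.large.eco",
                        "node_count": 1}}
}
"""


def summarize_job(body: dict) -> dict:
    """Extract a few fields from a parsed job-details response."""
    metadata = body.get("metadata", {})
    status = body.get("status", {})
    resource = body.get("spec", {}).get("resource", {})
    return {
        "id": metadata.get("id"),
        "name": metadata.get("name"),
        "phase": status.get("phase"),
        "flavor_id": resource.get("flavor_id"),
        "node_count": resource.get("node_count"),
    }


summary = summarize_job(json.loads(RESPONSE_EXCERPT))
```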