Updated on 2025-08-20 GMT+08:00

Querying the Details About a Training Job

Function

This API is used to query the details about a training job on ModelArts.

This API applies to the following scenario: When you need to view the running status and configuration information of a specific training job, you can use this API to obtain the job details. Before using this API, ensure that you have obtained the training job ID and have the permission to view job details. After the query is complete, the platform returns the details about the training job, including the job status, configuration, and logs. If the training job ID does not exist or you do not have the operation permission, the API will return an error message.

Debugging

You can debug this API through automatic authentication in API Explorer or use the SDK sample code generated by API Explorer.

URI

GET /v2/{project_id}/training-jobs/{training_job_id}

Table 1 Path Parameters

Parameter

Mandatory

Type

Description

project_id

Yes

String

Definition: Project ID. For details, see Obtaining a Project ID and Name.

Constraints: The value can contain 1 to 64 characters. Letters, digits, and hyphens (-) are allowed.

Range: N/A

Default Value: N/A

training_job_id

Yes

String

Definition: ID of a training job

Constraints: For details, see Querying a Training Job List.

Range: N/A

Default Value: N/A
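As a rough illustration of how the two path parameters combine into the request URI, the sketch below builds the path and checks project_id against the constraint stated in Table 1. The helper name build_job_uri is hypothetical. The check on training_job_id is limited to non-emptiness, because its format is not specified here.

```python
import re
from urllib.parse import quote

# Constraint from Table 1: 1 to 64 characters; letters, digits, and hyphens.
_PROJECT_ID_RE = re.compile(r"[A-Za-z0-9-]{1,64}")


def build_job_uri(project_id: str, training_job_id: str) -> str:
    """Return the request path for querying one training job (hypothetical helper)."""
    if not _PROJECT_ID_RE.fullmatch(project_id):
        raise ValueError("project_id must be 1-64 letters, digits, or hyphens")
    if not training_job_id:
        raise ValueError("training_job_id must not be empty")
    # Percent-encode the job ID so it is safe to place in a URL path segment.
    return f"/v2/{project_id}/training-jobs/{quote(training_job_id, safe='')}"
```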

Request Parameters

None

Response Parameters

Status code: 200

Table 2 Response body parameters

Parameter

Type

Description

kind

String

Definition: Type of a training job.

Range:

  • job: common job

  • edge_job: edge job

  • hetero_job: heterogeneous job

  • mrs_job: MRS job

  • autosearch_job: auto search job

  • diag_job: diagnosis job

  • visualization_job: visualization job

metadata

JobMetadataResponse object

Definition: Training job metadata.

status

Status object

Definition: Training job status information.

algorithm

JobAlgorithmResponse object

Definition: Training job algorithm.

tasks

Array of TaskResponse objects

Definition: Heterogeneous training tasks.

spec

SpecResponce object

Definition: Training job specifications.

endpoints

JobEndpointsResp object

Definition: Configurations required for remotely accessing a training job.

Table 3 JobMetadataResponse

Parameter

Type

Description

id

String

Definition: Training job ID, which is generated and returned by ModelArts after a training job is created.

Range: N/A

name

String

Definition: Name of a training job.

Range: The value must contain 1 to 64 characters consisting of only digits, letters, underscores (_), and hyphens (-).

workspace_id

String

Definition: Workspace where a specified job is located.

Range: N/A

description

String

Definition: Description of a training job.

Range: N/A

create_time

Long

Definition: Time when a training job was created, in milliseconds. The value is generated and returned by ModelArts after a training job is created.

Range: N/A

user_name

String

Definition: Name of the user who created the training job. The value is generated and returned by ModelArts after the training job is created.

Range: N/A

annotations

Map<String,String>

Definition: Advanced functions of a training job.
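The metadata fields above can be handled along the following lines. This is a minimal sketch: it assumes create_time counts milliseconds since the Unix epoch (the table states only "in milliseconds", although the sample value in Example Responses is consistent with that reading), and the function names are hypothetical.

```python
import re
from datetime import datetime, timedelta, timezone

_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
# Name rule from Table 3: 1 to 64 digits, letters, underscores, or hyphens.
_NAME_RE = re.compile(r"[A-Za-z0-9_-]{1,64}")


def created_at(metadata: dict) -> datetime:
    """Convert create_time to a timezone-aware UTC datetime.

    Assumption: the value is milliseconds since the Unix epoch.
    """
    return _EPOCH + timedelta(milliseconds=metadata["create_time"])


def is_valid_job_name(name: str) -> bool:
    """Check a job name against the documented character and length rule."""
    return _NAME_RE.fullmatch(name) is not None
```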

Table 4 Status

Parameter

Type

Description

phase

String

Definition: Level-1 status of a training job.

Range:

  • Creating: The job is being created.

  • Pending: The job is pending.

  • Running: The job is running.

  • Failed: The job failed to run.

  • Completed: The job is complete.

  • Terminating: The job is being stopped.

  • Terminated: The job has been stopped.

  • Abnormal: The job is abnormal.

secondary_phase

String

Definition: Level-2 status of a training job. The values are internal, fine-grained statuses that may be added, changed, or removed, so do not build logic that depends on them.

Range:

  • Creating: The job is being created.

  • Queuing: The job is queuing.

  • Running: The job is running.

  • Failed: The job failed to run.

  • Completed: The job is complete.

  • Terminating: The job is being stopped.

  • Terminated: The job has been stopped.

  • CreateFailed: The job fails to be created.

  • TerminatedFailed: The job fails to be stopped.

  • Unknown: The job is in an unknown state.

  • Lost: The job is abnormal.

duration

Long

Definition: Running duration of a training job, in ms.

Range: N/A

node_count_metrics

Array<Array<Integer>>

Definition: Metrics recording how the number of nodes changes while a training job is running.

tasks

Array of strings

Definition: Training job subtask name.

start_time

Long

Definition: Timestamp when a training job is started.

Range: N/A

task_statuses

Array of TaskStatuses objects

Definition: Training job subtask status.

running_records

Array of RunningRecord objects

Definition: Running and fault recovery records of a training job.
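A client that polls this API typically needs to decide when to stop. The sketch below shows one way to do that with the phase field only, in line with the advice not to depend on secondary_phase. Which phases are final is an assumption made here; Table 4 lists the phases but does not say which ones are terminal. The function names are hypothetical.

```python
from datetime import timedelta

# Assumption: these level-1 phases are final. Table 4 lists the phases
# but does not state which of them are terminal.
ASSUMED_TERMINAL_PHASES = {"Completed", "Failed", "Terminated", "Abnormal"}


def is_finished(status: dict) -> bool:
    """Return True when the job is in a phase assumed to be final."""
    return status.get("phase") in ASSUMED_TERMINAL_PHASES


def run_duration(status: dict) -> timedelta:
    """Convert the duration field (milliseconds) to a timedelta."""
    return timedelta(milliseconds=status.get("duration", 0))
```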

Table 5 TaskStatuses

Parameter

Type

Description

task

String

Definition: Training job subtask name.

Range: N/A

exit_code

Integer

Definition: Exit code of a training job subtask.

Range: N/A

message

String

Definition: Error message of a training job subtask.

Range: N/A

Table 6 RunningRecord

Parameter

Type

Description

start_at

Integer

Definition: Unix timestamp of the start time in the current running record, in seconds.

Range: N/A

end_at

Integer

Definition: Unix timestamp of the end time in the current running record, in seconds.

Range: N/A

start_type

String

Definition: Startup mode of the current run.

Range:

  • init_or_rescheduled: The run started after scheduling, either as the job's first startup or after a recovery that rescheduled the job.

  • restarted: The run started after a process restart, without rescheduling.

end_reason

String

Definition: Reason why the running ends.

Range: N/A

end_related_task

String

Definition: ID of the task worker (for example, worker-0) that ends the running.

Range: N/A

end_recover

String

Definition: Fault tolerance policy used after the running ends.

Range:

  • npu_proc_restart: NPU in-place hot recovery

  • gpu_proc_restart: GPU in-place hot recovery

  • proc_restart: Process in-place recovery

  • pod_reschedule: Pod-level rescheduling

  • job_reschedule: Job-level rescheduling

  • job_reschedule_with_taint: Isolated job-level rescheduling

end_recover_before_downgrade

String

Definition: Fault tolerance policy initially selected after the run ended, before it was downgraded to the policy given in end_recover.

Range: same as that of end_recover.
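The running records can be turned into a readable timeline roughly as follows. This is a sketch under two assumptions: start_at is a Unix timestamp in seconds (as Table 6 states), and fields such as end_reason or end_recover may be missing from a record, as in the sample response. The function name and the output format are hypothetical.

```python
from datetime import datetime, timezone


def summarize_records(records: list) -> list:
    """Return one human-readable line per running record."""
    lines = []
    for index, rec in enumerate(records):
        # start_at is in seconds, unlike create_time and duration (milliseconds).
        start = datetime.fromtimestamp(rec["start_at"], tz=timezone.utc)
        # Optional fields may be absent, so fall back to placeholders.
        reason = rec.get("end_reason", "unknown")
        recover = rec.get("end_recover", "none")
        lines.append(
            f"#{index} {start:%Y-%m-%d %H:%M:%S} {rec['start_type']} "
            f"ended: {reason}; recovery: {recover}"
        )
    return lines
```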

Table 7 JobAlgorithmResponse

Parameter

Type

Description

id

String

Definition: Training job algorithm.

Range:

  • id: Only the algorithm ID is used.

  • subscription_id+item_version_id: The subscription ID and version ID of the algorithm are used.

  • code_dir+boot_file: The code directory and boot file of the training job are used.

name

String

Definition: Algorithm name.

Range: N/A

subscription_id

String

Definition: Subscription ID of a subscription algorithm, which must be used with item_version_id.

Range: N/A

item_version_id

String

Definition: Version of a subscription algorithm, which must be used with subscription_id.

Range: N/A

code_dir

String

Definition: Code directory of a training job, for example, /usr/app/. This parameter must be used with boot_file. Leave this parameter blank if id, or subscription_id and item_version_id are specified.

Range: N/A

boot_file

String

Definition: Boot file of a training job, which must be stored in the code directory, for example, /usr/app/boot.py. This parameter must be used with code_dir. Leave this parameter blank if id, or subscription_id and item_version_id are specified.

Range: N/A

autosearch_config_path

String

Definition: YAML configuration path of an auto search job. An OBS URL is required. For example, obs://bucket/file.yaml.

Range: N/A

autosearch_framework_path

String

Definition: Framework code directory of an auto search job. An OBS URL is required. For example, obs://bucket/files/.

Range: N/A

command

String

Definition: Boot command for starting the container of a custom image for a training job. For example, python train.py.

Range: N/A

parameters

Array of ParameterResp objects

Definition: Running parameters of the training job.

policies

policies object

Definition: Policy supported by a job.

inputs

Array of InputResp objects

Definition: Data input of a training job.

outputs

Array of OutputResp objects

Definition: Output of the training job.

engine

JobEngineResp object

Definition: Engine of a training job. Leave this parameter blank if the job is created using id of the algorithm in algorithm management, or subscription_id+item_version_id of the subscribed algorithm.

local_code_dir

String

Definition: Local directory of the training container to which the algorithm code directory is downloaded. The rules are as follows:

  • The directory must be under /home.

  • In v1 compatibility mode, the current field does not take effect.

  • When code_dir is prefixed with file://, the current field does not take effect.

Range: N/A

working_dir

String

Definition: Work directory where an algorithm is executed. This parameter does not take effect in v1 compatibility mode.

Range: N/A

environments

Array of Map<String,String> objects

Definition: Environment variables of a training job. The format is key:value. Leave this parameter blank.

summary

SummaryResp object

Definition: Visualization log summary.
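Table 7 describes three ways an algorithm can be specified: by id, by subscription_id plus item_version_id, or by code_dir plus boot_file. The sketch below shows how a client might work out which one a returned algorithm object uses. The order in which the cases are checked is an assumption, and the function name is hypothetical.

```python
def algorithm_source(algorithm: dict) -> str:
    """Guess how the algorithm of a job was specified.

    Assumption: the checks run in this order; the reference does not
    define a precedence when several fields are present.
    """
    if algorithm.get("id"):
        return "id"
    if algorithm.get("subscription_id") and algorithm.get("item_version_id"):
        return "subscription_id+item_version_id"
    if algorithm.get("code_dir") and algorithm.get("boot_file"):
        return "code_dir+boot_file"
    return "unknown"
```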

Table 8 ParameterResp

Parameter

Type

Description

name

String

Definition: Parameter name.

Range: N/A

value

String

Definition: Parameter value.

Range: N/A

description

String

Definition: Parameter description.

Range: N/A

constraint

constraint object

Definition: Parameter attribute.

i18n_description

i18n_description object

Definition: Internationalization description.

Table 9 constraint

Parameter

Type

Description

type

String

Definition: Parameter type.

Range: N/A

editable

Boolean

Definition: Whether the parameter can be edited.

Range:

  • true: editable

  • false: not editable

required

Boolean

Definition: Whether the parameter is mandatory.

Range:

  • true: mandatory

  • false: optional

sensitive

Boolean

Definition: Whether the parameter is sensitive. This function is unavailable currently.

Range:

  • true: sensitive

  • false: insensitive

valid_type

String

Definition: Valid type.

Range: N/A

valid_range

Array of strings

Definition: Valid range.

Table 10 i18n_description

Parameter

Type

Description

language

String

Definition: Internationalization language. The options are as follows:

  • zh-cn: Chinese

  • en-us: English

Range: N/A

description

String

Definition: Internationalization language description.

Range: N/A

Table 11 policies

Parameter

Type

Description

auto_search

auto_search object

Definition: Hyperparameter search configuration.

Table 13 reward_attrs

Parameter

Type

Description

name

String

Definition: Metric name.

Range: N/A

mode

String

Definition: Search mode.

Range:

  • max: A larger metric value is preferred.

  • min: A smaller metric value is preferred.

regex

String

Definition: Regular expression of a metric.

Range: N/A
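To make the roles of regex and mode concrete, the sketch below extracts a metric from log lines and picks the preferred value. It assumes the regular expression contains exactly one capture group that matches the numeric value, which this reference does not state. The function name and the log lines in the usage note are hypothetical.

```python
import re


def best_metric(log_lines, reward_attr):
    """Return the preferred metric value found in log_lines, or None.

    Assumption: reward_attr["regex"] has one capture group holding the number.
    """
    pattern = re.compile(reward_attr["regex"])
    values = []
    for line in log_lines:
        match = pattern.search(line)
        if match:
            values.append(float(match.group(1)))
    if not values:
        return None
    # mode "max" prefers larger values; mode "min" prefers smaller ones.
    return max(values) if reward_attr["mode"] == "max" else min(values)
```

For example, with a reward attribute whose regex is accuracy=([0-9.]+) and whose mode is max, the helper returns the largest accuracy value found in the lines that match.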

Table 14 search_params

Parameter

Type

Description

name

String

Definition: Hyperparameter name.

Range: N/A

param_type

String

Definition: Parameter type.

Range:

  • continuous: The hyperparameter is of the continuous type. When an algorithm is used in a training job, continuous hyperparameters are displayed as text boxes on the console.

  • discrete: The hyperparameter is of the discrete type. When an algorithm is used in a training job, discrete hyperparameters are displayed as drop-down lists on the console.

lower_bound

String

Definition: Lower bound of the hyperparameter.

Range: N/A

upper_bound

String

Definition: Upper bound of the hyperparameter.

Range: N/A

discrete_points_num

String

Definition: Number of discrete points of a hyperparameter with continuous values.

Range: N/A

discrete_values

Array of strings

Definition: Discrete hyperparameter values.

Table 15 algo_configs

Parameter

Type

Description

name

String

Definition: Search algorithm name.

Range: N/A

params

Array of AutoSearchAlgoConfigParameterResp objects

Definition: Search algorithm parameters.

Table 16 AutoSearchAlgoConfigParameterResp

Parameter

Type

Description

key

String

Definition: Parameter key.

Range: N/A

value

String

Definition: Parameter value.

Range: N/A

type

String

Definition: Parameter type.

Range: N/A

Table 17 InputResp

Parameter

Type

Description

name

String

Definition: Name of the data input channel.

Range: N/A

description

String

Definition: Description of the data input channel.

Range: N/A

local_dir

String

Definition: Local path of the container to which the data input channels are mapped. Example: /home/ma-user/modelarts/inputs/data_url_0

Range: N/A

access_method

String

Definition: Access method of the input data channel path (local_dir).

Range:

  • parameter: hyperparameters

  • env: environment variables

remote

InputDataInfoResp object

Definition: Description of the actual data input.

remote_constraint

Array of remote_constraint objects

Definition: Data input constraint.

Table 18 InputDataInfoResp

Parameter

Type

Description

dataset

dataset object

Definition: The input is a dataset.

obs

obs object

Definition: OBS in which data input and output are stored.

Table 19 dataset

Parameter

Type

Description

id

String

Definition: Dataset ID of a training job.

Range: N/A

version_id

String

Definition: Dataset version ID of a training job.

Range: N/A

obs_url

String

Definition: OBS URL of the dataset for a training job. It is automatically parsed by ModelArts based on the dataset ID and dataset version ID. For example, /usr/data/.

Range: N/A

Table 20 obs

Parameter

Type

Description

obs_url

String

Definition: OBS URL of the dataset for a training job. For example, /usr/data/.

Range: N/A

Table 21 remote_constraint

Parameter

Type

Description

data_type

String

Definition: Data input type, including the data storage location and dataset.

Constraints: N/A

Range: N/A

Default Value: N/A

attributes

String

Definition: Related attributes.

Constraints: N/A

Range:

If the input is a dataset:

  • data_format: data format

  • data_segmentation: data segmentation method

  • dataset_type: data labeling type

Default Value: N/A

Table 22 OutputResp

Parameter

Type

Description

name

String

Definition: Name of the data output channel.

Range: N/A

description

String

Definition: Description of the data output channel.

Range: N/A

local_dir

String

Definition: Local path of the container to which the data output channels are mapped.

Range: N/A

access_method

String

Definition: Access method of the output data channel path (local_dir).

Range:

  • parameter: hyperparameters

  • env: environment variables

remote

RemoteResp object

Definition: Description of the actual data output.

Table 23 JobEngineResp

Parameter

Type

Description

engine_id

String

Definition: Engine ID selected for a training job.

Range: N/A

engine_name

String

Definition: Engine name selected for a training job.

Range: N/A

engine_version

String

Definition: Engine version selected for a training job.

Range: N/A

image_url

String

Definition: Custom image URL selected for a training job. The URL is obtained from SWR.

Range: N/A

install_sys_packages

Boolean

Definition: Specifies whether to install the MoXing version specified by the training platform.

Range:

  • true: yes

  • false: no

Table 24 SummaryResp

Parameter

Type

Description

log_type

String

Definition: Visualization log type of a training job. After this parameter is configured, the training job can be used as the data source of a visualization job.

Range:

  • tensorboard: TensorBoard

  • mindstudio-insight: MindStudio Insight

log_dir

LogDirResp object

Definition: Visualization log output of a training job.

data_sources

Array of DataSourceResp objects

Definition: Visualization log input of the visualization job or training job debugging mode.

Table 25 LogDirResp

Parameter

Type

Description

pfs

PFSSummaryResp object

Definition: Output of an OBS parallel file system.

Table 26 PFSSummaryResp

Parameter

Type

Description

pfs_path

String

Definition: URL of the OBS parallel file system.

Range: N/A

Table 27 DataSourceResp

Parameter

Type

Description

job

JobSummaryResp object

Definition: Job data source.

Table 28 JobSummaryResp

Parameter

Type

Description

job_id

String

Definition: ID of a training job.

Range: N/A

Table 29 TaskResponse

Parameter

Type

Description

role

String

Definition: Task role. This function is not supported currently.

Range: N/A

algorithm

TaskResponseAlgorithm object

Definition: Algorithm configurations for algorithm management.

task_resource

FlavorResponse object

Definition: Specifications of a training job or algorithm.

Table 30 TaskResponseAlgorithm

Parameter

Type

Description

code_dir

String

Definition: Absolute path of the directory where the algorithm boot file is stored.

Range: N/A

boot_file

String

Definition: Absolute path of an algorithm boot file.

Range: N/A

inputs

AlgorithmInput object

Definition: Information about the algorithm input channel.

outputs

AlgorithmOutput object

Definition: Information about the algorithm output channel.

engine

AlgorithmEngine object

Definition: Engine that a heterogeneous job depends on.

local_code_dir

String

Definition: Local directory of the training container to which the algorithm code directory is downloaded. The rules are as follows:

  • The directory must be under /home.

  • In v1 compatibility mode, the current field does not take effect.

  • When code_dir is prefixed with file://, the current field does not take effect.

Range: N/A

working_dir

String

Definition: Work directory where an algorithm is executed. Note that this parameter does not take effect in v1 compatibility mode.

Range: N/A

Table 31 AlgorithmInput

Parameter

Type

Description

name

String

Definition: Name of the data input channel.

Range: N/A

local_dir

String

Definition: Local path of the container to which the data input channels are mapped.

Range: N/A

remote

AlgorithmRemote object

Definition: Actual data input, which can only be OBS for heterogeneous jobs.

Table 32 AlgorithmRemote

Parameter

Type

Description

obs

RemoteObsResp object

Definition: OBS in which data input and output are stored.

Table 33 AlgorithmOutput

Parameter

Type

Description

name

String

Definition: Name of the data output channel.

Range: N/A

local_dir

String

Definition: Local path of the container to which the data output channels are mapped.

Range: N/A

remote

RemoteResp object

Definition: Description of the actual data output.

mode

String

Definition: Data transmission mode. The default value is upload_periodically.

Range: N/A

period

String

Definition: Data transmission period. The default value is 30s.

Range: N/A

Table 34 RemoteResp

Parameter

Type

Description

obs

RemoteObsResp object

Definition: Data actually output to OBS.

Table 35 RemoteObsResp

Parameter

Type

Description

obs_url

String

Definition: Path of the data output to OBS.

Range: N/A

Table 36 AlgorithmEngine

Parameter

Type

Description

engine_id

String

Definition: Engine flavor ID, for example, caffe-1.0.0-python2.7.

Range: N/A

engine_name

String

Definition: Engine flavor name, for example, Caffe.

Range: N/A

engine_version

String

Definition: Engine flavor version. Engines with the same name have multiple versions, for example, Caffe-1.0.0-python2.7 of Python 2.7.

Range: N/A

v1_compatible

Boolean

Definition: Specifies whether the v1 compatibility mode is used.

Range:

  • true: The v1 compatibility mode is used.

  • false: The v1 compatibility mode is not used.

run_user

String

Definition: Default UID for the engine startup.

Range: N/A

image_url

String

Definition: Custom image URL selected for an algorithm.

Range: N/A

Table 37 FlavorResponse

Parameter

Type

Description

flavor_id

String

Definition: Resource flavor ID.

Range: N/A

flavor_name

String

Definition: Resource flavor name.

Range: N/A

max_num

Integer

Definition: Maximum number of nodes supported by a flavor.

Range: N/A

flavor_type

String

Definition: Resource flavor type.

Range:

  • CPU

  • GPU

  • Ascend

billing

BillingInfo object

Definition: Billing information of a resource flavor.

flavor_info

FlavorInfoResponse object

Definition: Resource flavor details.

attributes

Map<String,String>

Definition: Other flavor attributes.

Range: N/A

Table 38 FlavorInfoResponse

Parameter

Type

Description

max_num

Integer

Definition: Maximum number of nodes that can be selected. The value 1 indicates that the distributed mode is not supported.

Range: N/A

cpu

Cpu object

Definition: CPU specifications.

gpu

Gpu object

Definition: GPU specifications.

npu

Npu object

Definition: Ascend specifications.

memory

Memory object

Definition: Memory information.

disk

DiskResponse object

Definition: Disk information.

Table 39 DiskResponse

Parameter

Type

Description

size

Integer

Definition: Disk size.

Range: N/A

unit

String

Definition: Unit of the disk size.

Range: N/A

Table 40 SpecResponce

Parameter

Type

Description

resource

Resource object

Definition: Resource flavor of a training job. Either flavor_id alone or pool_id together with flavor_id is specified.

volumes

Array of JobVolumeResp objects

Definition: Mounting volume information of a training job.

log_export_path

LogExportPathResp object

Definition: Log output of a training job.

schedule_policy

SchedulePolicyResp object

Definition: Scheduling policy of a training job.

custom_metrics

Array of CustomMetrics objects

Metric collection configuration

Table 41 Resource

Parameter

Type

Description

policy

String

Definition: Resource flavor mode of a training job.

Range:

  • regular: standard mode

flavor_id

String

Definition: ID of the resource flavor of a training job.

Range: The flavor_id parameter cannot be specified for a dedicated resource pool of CPU specifications. The options for dedicated resource pools with GPU/Ascend specifications are as follows:

  • modelarts.pool.visual.xlarge (1 PU)

  • modelarts.pool.visual.2xlarge (2 PUs)

  • modelarts.pool.visual.4xlarge (4 PUs)

  • modelarts.pool.visual.8xlarge (8 PUs)

flavor_name

String

Definition: Read-only flavor name returned by ModelArts when flavor_id is used.

Range: N/A

node_count

Integer

Definition: Number of resource replicas selected for a training job.

Range: N/A

pool_id

String

Definition: ID of the resource pool selected for a training job.

Range: N/A

flavor_detail

FlavorDetail object

Definition: Flavor details of a training job or algorithm. This parameter is available only for public resource pools.

main_container_allocated_resources

MainContainerAllocatedResources object

Resource specifications actually obtained by the training container of a training job.
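The dedicated-pool flavor IDs listed for flavor_id map directly to a PU count, which a client can use for a quick capacity estimate. In the sketch below, multiplying the per-node PU count by node_count is an assumption about how the two fields relate, and the function name is hypothetical.

```python
# PU counts for the dedicated-pool flavors listed under flavor_id in Table 41.
POOL_FLAVOR_PUS = {
    "modelarts.pool.visual.xlarge": 1,
    "modelarts.pool.visual.2xlarge": 2,
    "modelarts.pool.visual.4xlarge": 4,
    "modelarts.pool.visual.8xlarge": 8,
}


def total_pus(resource: dict):
    """Return PUs per node times node_count, or None for other flavors.

    Assumption: each node of the job receives the PU count of the flavor.
    """
    per_node = POOL_FLAVOR_PUS.get(resource.get("flavor_id"))
    if per_node is None:
        return None
    return per_node * resource.get("node_count", 1)
```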

Table 42 FlavorDetail

Parameter

Type

Description

flavor_type

String

Definition: Resource flavor type.

Range:

  • CPU

  • GPU

  • Ascend

billing

BillingInfo object

Definition: Billing information of a resource flavor.

flavor_info

FlavorInfo object

Definition: Resource flavor details.

Table 43 BillingInfo

Parameter

Type

Description

code

String

Definition: Billing code.

Range: N/A

unit_num

Integer

Definition: Billing unit.

Range: N/A

Table 44 FlavorInfo

Parameter

Type

Description

max_num

Integer

Definition: Maximum number of nodes that can be selected. The value 1 indicates that the distributed mode is not supported.

Range: N/A

cpu

Cpu object

Definition: CPU specifications.

gpu

Gpu object

Definition: GPU specifications.

npu

Npu object

Definition: Ascend specifications.

memory

Memory object

Definition: Memory information.

disk

Disk object

Definition: Disk information.

Table 45 Cpu

Parameter

Type

Description

arch

String

Definition: CPU architecture.

Range: N/A

core_num

Integer

Definition: Number of cores.

Range: N/A

Table 46 Gpu

Parameter

Type

Description

unit_num

Integer

Definition: Number of GPUs.

Range: N/A

product_name

String

Definition: Product name.

Range: N/A

memory

String

Definition: Memory.

Range: N/A

Table 47 Npu

Parameter

Type

Description

unit_num

String

Definition: Number of NPUs.

Range: N/A

product_name

String

Definition: Product name.

Range: N/A

memory

String

Definition: Memory.

Range: N/A

Table 48 Memory

Parameter

Type

Description

size

Integer

Definition: Memory size.

Range: N/A

unit

String

Definition: Unit of the memory size.

Range: N/A

Table 49 Disk

Parameter

Type

Description

size

String

Definition: Disk size.

Range: N/A

unit

String

Definition: Unit of the disk size. Generally, the unit is GB.

Range: N/A

Table 50 MainContainerAllocatedResources

Parameter

Type

Description

cpu_arch

String

CPU architecture.

cpu_core_num

Float

Number of cores.

mem_size

Float

Memory information.

accelerator_num

Float

Number of accelerator cards.

accelerator_type

String

Accelerator card type.
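The resources actually allocated to the training container can differ from the flavor's nominal specifications. The sketch below puts the two side by side for a resource object. It reports CPU cores and accelerators only, because the unit of mem_size is not stated in Table 50. The function name and the output format are hypothetical.

```python
def describe_allocation(resource: dict) -> str:
    """Summarize allocated resources against the flavor's CPU core count."""
    info = resource.get("flavor_detail", {}).get("flavor_info", {})
    allocated = resource.get("main_container_allocated_resources", {})
    flavor_cores = info.get("cpu", {}).get("core_num")
    used_cores = allocated.get("cpu_core_num")
    parts = []
    if flavor_cores is not None and used_cores is not None:
        parts.append(f"cpu {used_cores}/{flavor_cores} cores")
    if allocated.get("accelerator_num"):
        card = allocated.get("accelerator_type", "unknown")
        parts.append(f"{allocated['accelerator_num']} x {card}")
    return ", ".join(parts) or "no allocation data"
```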

Table 51 JobVolumeResp

Parameter

Type

Description

nfs

NfsResp object

Definition: Volumes attached in NFS mode.

Table 52 NfsResp

Parameter

Type

Description

nfs_server_path

String

Definition: NFS server path, for example, 10.10.10.10:/example/path.

Range: N/A

local_path

String

Definition: Path for attaching volumes to the training container, for example, /example/path.

Range: N/A

read_only

Boolean

Definition: Specifies whether the disks attached to the container in NFS mode are read-only.

Range:

  • true: read only

  • false: non-read-only

Table 53 LogExportPathResp

Parameter

Type

Description

obs_url

String

Definition: OBS path for storing training job logs, for example, obs://example/path.

Range: N/A

host_path

String

Definition: Path of the host where training job logs are stored, for example, /example/path.

Range: N/A

Table 54 SchedulePolicyResp

Parameter

Type

Description

required_affinity

RequiredAffinityResp object

Definition: Affinity requirements of a training job.

priority

Integer

Definition: Priority of a training job.

Range: 0 to 3

preemptible

Boolean

Definition: Whether the resource can be preempted.

Range:

  • true: The resource can be preempted.

  • false: The resource cannot be preempted.

Table 55 RequiredAffinityResp

Parameter

Type

Description

affinity_type

String

Definition: Affinity scheduling policy.

Range:

  • cabinet: strong cabinet scheduling

  • hyperinstance: supernode affinity scheduling

affinity_group_size

Integer

Definition: Size of an affinity group.

Range: N/A

Table 56 CustomMetrics

Parameter

Type

Description

exec

Exec object

Metrics are collected using commands.

http_get

HttpGet object

Metrics are collected using HTTP.

Table 57 Exec

Parameter

Type

Description

command

Array of strings

Metrics are collected using commands.

Table 58 HttpGet

Parameter

Type

Description

path

String

URL for obtaining metrics over HTTP. Both the URL and the port below must either be configured together or remain empty.

port

Integer

Port for obtaining metrics over HTTP. This parameter and the URL above must be set or left blank at the same time.
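The rule that path and port must be set together or left empty together can be checked with a small helper such as the one sketched below. The function name is hypothetical, and treating an empty string as "not set" is an assumption.

```python
def http_get_is_consistent(http_get: dict) -> bool:
    """Return True when path and port are both set or both empty."""
    # Assumption: an empty string for path counts as not set.
    has_path = bool(http_get.get("path"))
    has_port = http_get.get("port") is not None
    return has_path == has_port
```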

Table 59 JobEndpointsResp

Parameter

Type

Description

ssh

SSHResp object

Definition: SSH connection information.

jupyter_lab

JupyterLab object

Definition: JupyterLab connection information.

tensorboard

Tensorboard object

Definition: TensorBoard connection information.

mindstudio_insight

MindStudioInsight object

Definition: MindStudio Insight connection information.

Table 60 SSHResp

Parameter

Type

Description

key_pair_names

Array of strings

Definition: Name of the SSH key pair, which can be created and viewed on the Key Pair page of the Elastic Cloud Server (ECS) console.

Range: N/A

task_urls

Array of TaskUrls objects

Definition: SSH connection address.

Table 61 TaskUrls

Parameter

Type

Description

task

String

Definition: Task ID of a training job.

Range: N/A

url

String

Definition: SSH connection address of a training job.

Range: N/A

Table 62 JupyterLab

Parameter

Type

Description

url

String

Definition: JupyterLab address of a training job.

Range: N/A

token

String

Definition: JupyterLab token of a training job.

Range: N/A

Table 63 Tensorboard

Parameter

Type

Description

url

String

Definition: TensorBoard address of a training job.

Range: N/A

token

String

Definition: TensorBoard token of a training job.

Range: N/A

Table 64 MindStudioInsight

Parameter

Type

Description

url

String

Definition: MindStudio Insight address of a training job.

Range: N/A

token

String

Definition: MindStudio Insight token of a training job.

Range: N/A
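The endpoints object nests its connection addresses under several keys. One possible way to flatten them into a single mapping is sketched below. The function name and the key format used for SSH entries are hypothetical, and tokens are deliberately left out of the result.

```python
def endpoint_urls(endpoints: dict) -> dict:
    """Collect the connection URLs found in a JobEndpointsResp object."""
    urls = {}
    # SSH addresses are listed per task under ssh.task_urls.
    for task_url in endpoints.get("ssh", {}).get("task_urls", []):
        urls[f"ssh:{task_url['task']}"] = task_url["url"]
    # The remaining endpoints each carry a single url field.
    for key in ("jupyter_lab", "tensorboard", "mindstudio_insight"):
        section = endpoints.get(key)
        if section and section.get("url"):
            urls[key] = section["url"]
    return urls
```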

Example Requests

The following shows how to query a training job whose UUID is 3faf5c03-aaa1-4cbe-879d-24b05d997347.

GET https://endpoint/v2/{project_id}/training-jobs/3faf5c03-aaa1-4cbe-879d-24b05d997347
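A request like the one above could be prepared with the Python standard library as sketched below. The X-Auth-Token header is an assumption: this page does not describe authentication, so check the authentication documentation for your account. The endpoint, project ID, and token are placeholders supplied by the caller, and the function name is hypothetical.

```python
import urllib.request


def build_request(endpoint: str, project_id: str, job_id: str, token: str) -> urllib.request.Request:
    """Prepare (but do not send) the GET request for one training job."""
    url = f"https://{endpoint}/v2/{project_id}/training-jobs/{job_id}"
    # Assumption: token-based authentication passes the token in X-Auth-Token.
    headers = {"X-Auth-Token": token, "Content-Type": "application/json"}
    return urllib.request.Request(url, headers=headers, method="GET")


# Sending is left to the caller, for example:
#   with urllib.request.urlopen(build_request(...)) as resp:
#       body = resp.read().decode("utf-8")
```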

Example Responses

Status code: 200

ok

{
  "kind" : "job",
  "metadata" : {
    "id" : "3faf5c03-aaa1-4cbe-879d-24b05d997347",
    "name" : "trainjob--py14_mem06-108",
    "description" : "",
    "create_time" : 1636447346315,
    "workspace_id" : "0",
    "user_name" : ""
  },
  "status" : {
    "phase" : "Abnormal",
    "secondary_phase" : "CreateFailed",
    "duration" : 0,
    "start_time" : 0,
    "node_count_metrics" : [ [ 1636447746000, 0 ], [ 1636447755000, 0 ], [ 1636447756000, 0 ] ],
    "tasks" : [ "worker-0" ],
    "running_records" : [ {
      "start_at" : 1701327093,
      "end_at" : 1701322341,
      "start_type" : "init_or_rescheduled",
      "end_recover" : "job_reschedule",
      "end_reason" : "exit with 127",
      "end_related_task" : "worker-2",
      "end_recover_before_downgrade" : "npu_proc_restart"
    }, {
      "start_at" : 1701323345,
      "end_at" : 1701325432,
      "start_type" : "init_or_rescheduled",
      "end_reason" : "job completed"
    } ]
  },
  "algorithm" : {
    "code_dir" : "obs://test/economic_test/py_minist/",
    "boot_file" : "obs://test/economic_test/py_minist/minist_common.py",
    "inputs" : [ {
      "name" : "data_url",
      "local_dir" : "/home/ma-user/modelarts/inputs/data_url_0",
      "remote" : {
        "obs" : {
          "obs_url" : "/test/data/py_minist/"
        }
      }
    } ],
    "outputs" : [ {
      "name" : "train_url",
      "local_dir" : "/home/ma-user/modelarts/outputs/train_url_0",
      "remote" : {
        "obs" : {
          "obs_url" : "/test/train_output/"
        }
      }
    } ],
    "engine" : {
      "engine_id" : "pytorch-cp36-1.4.0-v2",
      "engine_name" : "PyTorch",
      "engine_version" : "PyTorch-1.4.0-python3.6-v2"
    }
  },
  "spec" : {
    "resource" : {
      "flavor_id" : "modelarts.vm.pnt1.large.eco",
      "node_count" : 1,
      "flavor_detail" : {
        "flavor_type" : "GPU",
        "billing" : {
          "code" : "modelarts.vm.gpu.pnt1.eco",
          "unit_num" : 1
        },
        "flavor_info" : {
          "cpu" : {
            "arch" : "x86",
            "core_num" : 8
          },
          "gpu" : {
            "unit_num" : 1,
            "memory" : "8GB"
          },
          "memory" : {
            "size" : 64,
            "unit" : "GB"
          }
        }
      },
      "main_container_allocated_resources" : {
        "cpu_arch" : "x86",
        "cpu_core_num" : 5,
        "mem_size" : 44,
        "accelerator_num" : 1,
        "accelerator_type" : "nvidia-v100-pcie32"
      }
    },
    "custom_metrics" : [ {
      "exec" : {
        "command" : [ "cat", "/a/b/c.porm" ]
      }
    }, {
      "http_get" : {
        "path" : "/raw_text",
        "port" : 10001
      }
    } ]
  }
}
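A response body like the one above can be reduced to the fields most callers care about. The sketch below is one way to do it. The function name and the selection of fields are choices made for illustration, and optional sections are read defensively because not every job returns all of them.

```python
import json


def summarize_job(body: str) -> dict:
    """Extract a few key fields from the JSON response body."""
    job = json.loads(body)
    status = job.get("status", {})
    resource = job.get("spec", {}).get("resource", {})
    return {
        "id": job["metadata"]["id"],
        "kind": job.get("kind"),
        "phase": status.get("phase"),
        "secondary_phase": status.get("secondary_phase"),
        # Number of running and fault recovery records, 0 if none are returned.
        "runs": len(status.get("running_records", [])),
        "flavor_id": resource.get("flavor_id"),
        "node_count": resource.get("node_count"),
    }
```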

Status Codes

Status Code

Description

200

ok

Error Codes

See Error Codes.