Updated on 2024-06-13 GMT+08:00

Querying a Training Job List

Function

This API is used to query the created training jobs that meet the search criteria.

URI

POST /v2/{project_id}/training-job-searches

Table 1 Path Parameters

Parameter

Mandatory

Type

Description

project_id

Yes

String

Project ID. For details, see Obtaining a Project ID and Name.

Request Parameters

Table 2 Request body parameters

Parameter

Mandatory

Type

Description

offset

No

Integer

Offset for querying jobs. The minimum value is 0. For example, if this parameter is set to 1, the query starts from the second job.

limit

No

Integer

Maximum number of jobs to be queried. The value ranges from 1 to 50.

sort_by

No

String

Metric for sorting jobs to be queried. create_time is used by default for sorting.

order

No

String

Order of queried jobs. The default value is desc, indicating the descending order. You can also set this parameter to asc, indicating the ascending order.

group_by

No

String

Condition for grouping the jobs to be queried.

filters

No

Array of filters objects

Filters for querying jobs.
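
The offset and limit parameters above can be combined to page through a large job list. The following is a minimal sketch, not an official client: the helper name page_bodies is hypothetical, and it assumes that offset counts jobs (not pages), which matches the description of offset in Table 2.

```python
def page_bodies(total_jobs, page_size=50):
    """Yield request bodies that page through total_jobs jobs.

    page_size is sent as limit, so it must stay within the documented 1-50 range.
    """
    if not 1 <= page_size <= 50:
        raise ValueError("limit must be between 1 and 50")
    if total_jobs < 0:
        raise ValueError("total_jobs must not be negative")
    for offset in range(0, total_jobs, page_size):
        # sort_by and order are set to their documented defaults for clarity.
        yield {
            "offset": offset,
            "limit": page_size,
            "sort_by": "create_time",
            "order": "desc",
        }


for body in page_bodies(120, 50):
    print(body["offset"], body["limit"])
```

Each yielded dictionary can be serialized to JSON and used as the request body of one query.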

Table 3 filters

Parameter

Mandatory

Type

Description

key

No

String

Grouping condition key.

operator

No

String

Grouping condition key-value relationship. The options are between (range), like (similar), in (included), and not (excluded).

value

No

Array of strings

Value of the grouping condition key.
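
A filters entry is a key, an operator, and a list of string values. The sketch below shows one way to build such entries with basic validation. The helper name make_filter is hypothetical, and the rule that between takes exactly two values is an assumption inferred from the example request, not something the tables state.

```python
ALLOWED_OPERATORS = {"between", "like", "in", "not"}


def make_filter(key, operator, values):
    """Build one entry for the filters array of the request body."""
    if operator not in ALLOWED_OPERATORS:
        raise ValueError(f"unsupported operator: {operator}")
    # The value field is documented as an array of strings.
    values = [str(v) for v in values]
    # Assumption: a range needs a lower and an upper bound.
    if operator == "between" and len(values) != 2:
        raise ValueError("between expects exactly two values")
    return {"key": key, "operator": operator, "value": values}


filters = [
    make_filter("name", "like", ["trainjob"]),
    make_filter("phase", "in", ["Running", "Completed"]),
]
print(filters)
```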

Response Parameters

Status code: 200

Table 4 Response body parameters

Parameter

Type

Description

total

Integer

Total number of queried jobs of the current user.

count

Integer

Total number of jobs that meet the search criteria of the current user.

limit

Integer

Maximum number of jobs to be queried. The value ranges from 1 to 50.

offset

Integer

Offset for querying jobs. The minimum value is 0. For example, if this parameter is set to 1, the query starts from the second job.

sort_by

String

Metric for sorting jobs to be queried. create_time is used by default for sorting.

order

String

Order of queried jobs. The default value is desc, indicating the descending order. You can also set this parameter to asc, indicating the ascending order.

group_by

String

Condition for grouping the jobs to be queried.

workspace_id

String

Workspace where a job is located. The default value is 0.

ai_project

String

AI project to which a job belongs. The default value is default-ai-project.

items

Array of JobResponse objects

Jobs that meet the search criteria of the current user.
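
The count, offset, and items fields are enough to decide whether another page should be requested. The following sketch assumes that count is the number of matching jobs across all pages, as Table 4 describes it. The function name next_offset is hypothetical.

```python
def next_offset(response):
    """Return the offset for the next page, or None when nothing is left."""
    items = response.get("items") or []
    seen = response["offset"] + len(items)
    # Stop on an empty page as well, to avoid looping forever.
    if not items or seen >= response["count"]:
        return None
    return seen


page = {"total": 5059, "count": 120, "limit": 50, "offset": 50, "items": [{}] * 50}
print(next_offset(page))
```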

Table 5 JobResponse

Parameter

Type

Description

kind

String

Training job type, which is job by default. Options:

  • job: training job

metadata

JobMetadata object

Metadata of a training job.

status

Status object

Status of a training job. You do not need to set this parameter when creating a job.

algorithm

JobAlgorithmResponse object

Algorithm used by a training job. Options:

  • id: Only the algorithm ID is used.

  • subscription_id+item_version_id: The subscription ID and version ID of the algorithm are used.

  • code_dir+boot_file: The code directory and boot file of the training job are used.

tasks

Array of TaskResponse objects

List of tasks in heterogeneous training jobs.

spec

spec object

Specifications of a training job.

Table 6 JobMetadata

Parameter

Type

Description

id

String

Training job ID, which is generated and returned by ModelArts after the training job is created.

name

String

Name of a training job. The value must contain 1 to 64 characters consisting of only digits, letters, underscores (_), and hyphens (-).

workspace_id

String

Workspace where a job is located. The default value is 0.

description

String

Training job description. The value must contain 0 to 256 characters. The default value is NULL.

create_time

Long

Time when a training job was created, in milliseconds. The value is generated and returned by ModelArts after a training job is created.

user_name

String

Username for creating a training job. The username is generated and returned by ModelArts after a training job is created.

annotations

Map<String,String>

Advanced configuration of a training job. Options:

  • job_template: Template RL (heterogeneous job)

  • fault-tolerance/job-retry-num: 3 (number of retries upon a fault)
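
Two of the metadata fields lend themselves to small client-side helpers: name has a documented character set and length, and create_time is a millisecond value. The sketch below makes two assumptions that the table does not state: "letters" means ASCII letters, and create_time counts milliseconds since the Unix epoch. Both helper names are hypothetical.

```python
import re
from datetime import datetime, timezone

# 1 to 64 characters: digits, letters, underscores, and hyphens.
# Assumption: only ASCII letters are allowed.
NAME_PATTERN = re.compile(r"[A-Za-z0-9_-]{1,64}")


def is_valid_job_name(name):
    """Check a job name against the documented length and character rules."""
    return NAME_PATTERN.fullmatch(name) is not None


def created_at(metadata):
    """Convert create_time (assumed Unix epoch milliseconds) to a UTC datetime."""
    return datetime.fromtimestamp(metadata["create_time"] / 1000, tz=timezone.utc)


meta = {"name": "trainjob--py14_mem06-byd-108", "create_time": 1636447346315}
print(is_valid_job_name(meta["name"]), created_at(meta).isoformat())
```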

Table 7 Status

Parameter

Type

Description

phase

String

Level-1 status of a training job. Options: Creating, Pending, Running, Failed, Completed, Terminating, Terminated, and Abnormal.

secondary_phase

String

Level-2 status of a training job. This is an internal detailed status whose values may be added, modified, or deleted, so depending on it is not recommended. Options: Creating, Queuing, Running, Failed, Completed, Terminating, Terminated, CreateFailed, TerminatedFailed, Unknown, and Lost.

duration

Long

Running duration of a training job, in milliseconds.

node_count_metrics

Array<Array<Integer>>

Node count changes during the training job running period.

tasks

Array of strings

Tasks of a training job.

start_time

String

Start time of a training job. The value is in timestamp format.

task_statuses

Array of task_statuses objects

Status of a training job task.
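
The Status object can be reduced to a few values that are easier to read. The sketch below converts duration to a timedelta and extracts the peak node count. It assumes that each node_count_metrics entry is a [timestamp_ms, node_count] pair, which is inferred from the example response and not stated in Table 7. The function name summarize_status is hypothetical.

```python
from datetime import timedelta


def summarize_status(status):
    """Return the phase, the duration as a timedelta, and the peak node count."""
    duration = timedelta(milliseconds=status.get("duration", 0))
    # Assumption: each entry is a [timestamp_ms, node_count] pair.
    counts = [
        pair[1]
        for pair in status.get("node_count_metrics") or []
        if len(pair) == 2
    ]
    return {
        "phase": status.get("phase"),
        "duration": duration,
        "peak_nodes": max(counts, default=0),
    }


status = {
    "phase": "Running",
    "duration": 90000,
    "node_count_metrics": [[1636447746000, 0], [1636447755000, 2]],
}
print(summarize_status(status))
```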

Table 8 task_statuses

Parameter

Type

Description

task

String

Name of a training job task.

exit_code

Integer

Exit code of a training job task.

message

String

Error message of a training job task.
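
The task_statuses entries can be scanned to find the tasks that failed. The sketch below treats any non-zero exit_code as a failure, which is the usual process convention and an assumption here, since Table 8 does not define the meaning of individual codes. The function name failed_tasks is hypothetical.

```python
def failed_tasks(status):
    """List (task, exit_code, message) for tasks with a non-zero exit code."""
    failures = []
    for entry in status.get("task_statuses") or []:
        code = entry.get("exit_code")
        # Assumption: 0 means success; a missing code is not counted as a failure.
        if code not in (0, None):
            failures.append((entry.get("task"), code, entry.get("message", "")))
    return failures


status = {
    "task_statuses": [
        {"task": "worker-0", "exit_code": 0, "message": ""},
        {"task": "worker-1", "exit_code": 137, "message": "killed"},
    ]
}
print(failed_tasks(status))
```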

Table 9 JobAlgorithmResponse

Parameter

Type

Description

id

String

Algorithm used by a training job. Options:

  • id: Only the algorithm ID is used.

  • subscription_id+item_version_id: The subscription ID and version ID of the algorithm are used.

  • code_dir+boot_file: The code directory and boot file of the training job are used.

name

String

Algorithm name.

subscription_id

String

Subscription ID of a subscribed algorithm, which must be used with item_version_id.

item_version_id

String

Version ID of the subscribed algorithm, which must be used with subscription_id.

code_dir

String

Code directory of a training job, for example, /usr/app/. This parameter must be used together with boot_file. If id or subscription_id+item_version_id is set, leave it blank.

boot_file

String

Boot file of a training job, which must be stored in the code directory, for example, /usr/app/boot.py. This parameter must be used with code_dir. Leave this parameter blank if id, or subscription_id and item_version_id are specified.

autosearch_config_path

String

YAML configuration path of auto search jobs. An OBS URL is required.

autosearch_framework_path

String

Framework code directory of auto search jobs. An OBS URL is required.

command

String

Boot command used to start the container of a custom image of a training job. For example, python train.py.

parameters

Array of Parameter objects

Running parameters of a training job.

policies

policies object

Policies supported by jobs.

inputs

Array of Input objects

Input of a training job.

outputs

Array of Output objects

Output of a training job.

engine

engine object

Engine of a training job. Leave this parameter blank if the job is created using id of the algorithm in algorithm management, or subscription_id+item_version_id of the subscribed algorithm.

local_code_dir

String

Local directory in the training container to which the algorithm code directory is downloaded. The following rules apply:

  • The directory must be in the /home directory.

  • In v1 compatibility mode, this field does not take effect.

  • When code_dir is prefixed with file://, this field does not take effect.

working_dir

String

Work directory where an algorithm is executed. Note that this parameter does not take effect in v1 compatibility mode.

environments

Array of Map<String,String> objects

Environment variables of a training job. The format is key: value. Leave this parameter blank.
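
Table 9 describes three ways an algorithm can be specified: by id, by subscription_id plus item_version_id, or by code_dir plus boot_file. The sketch below classifies a returned algorithm object accordingly. The function name, the returned labels, and the order in which the checks run are all illustrative assumptions.

```python
def algorithm_source(algorithm):
    """Classify how the algorithm of a job was specified."""
    if algorithm.get("id"):
        return "id"
    if algorithm.get("subscription_id") and algorithm.get("item_version_id"):
        return "subscription"
    if algorithm.get("code_dir") and algorithm.get("boot_file"):
        return "custom_code"
    # None of the documented combinations is complete.
    return "unknown"


algorithm = {
    "code_dir": "obs://test-crq/economic_test/py_minist/",
    "boot_file": "obs://test-crq/economic_test/py_minist/minist_common.py",
}
print(algorithm_source(algorithm))
```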

Table 10 Parameter

Parameter

Type

Description

name

String

Parameter name.

value

String

Parameter value.

description

String

Parameter description.

constraint

constraint object

Parameter constraint.

i18n_description

i18n_description object

Internationalization description.

Table 11 constraint

Parameter

Type

Description

type

String

Parameter type.

editable

Boolean

Whether the parameter is editable.

required

Boolean

Whether the parameter is mandatory.

sensitive

Boolean

Whether the parameter is sensitive. This function is not implemented currently.

valid_type

String

Valid type.

valid_range

Array of strings

Valid range.

Table 12 i18n_description

Parameter

Type

Description

language

String

Language.

description

String

Description.

Table 13 policies

Parameter

Type

Description

auto_search

auto_search object

Hyperparameter search configuration.

Table 15 reward_attrs

Parameter

Type

Description

name

String

Metric name.

mode

String

Search direction.

  • max: A larger metric value indicates better performance.

  • min: A smaller metric value indicates better performance.

regex

String

Regular expression of a metric.
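
To illustrate how name, mode, and regex fit together, the sketch below extracts a metric from log lines and picks the best value. This is only an illustration of the idea: how ModelArts actually applies the regular expression is not described here, and the sketch assumes that the first capturing group holds the numeric value. The function name best_metric and the sample log lines are hypothetical.

```python
import re


def best_metric(log_lines, reward_attr):
    """Return the best metric value found in log_lines, or None if absent."""
    pattern = re.compile(reward_attr["regex"])
    # Assumption: group 1 of the regular expression captures the number.
    values = [
        float(match.group(1))
        for line in log_lines
        if (match := pattern.search(line))
    ]
    if not values:
        return None
    # max: larger is better; min: smaller is better.
    return max(values) if reward_attr["mode"] == "max" else min(values)


logs = ["epoch 1 accuracy=0.81", "epoch 2 accuracy=0.93", "done"]
attr = {"name": "accuracy", "mode": "max", "regex": r"accuracy=([0-9.]+)"}
print(best_metric(logs, attr))
```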

Table 16 search_params

Parameter

Type

Description

name

String

Hyperparameter name.

param_type

String

Parameter type. Options:

  • continuous: The hyperparameter is of the continuous type. When an algorithm is used in a training job, continuous hyperparameters are displayed as text boxes on the console.

  • discrete: The hyperparameter is of the discrete type. When an algorithm is used in a training job, discrete hyperparameters are displayed as drop-down lists on the console.

lower_bound

String

Lower bound of the hyperparameter.

upper_bound

String

Upper bound of the hyperparameter.

discrete_points_num

String

Number of discrete points of a continuous hyperparameter.

discrete_values

Array of strings

List of discrete hyperparameter values.
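
The search_params fields describe either a list of discrete values or a continuous range with a number of discrete points. The sketch below expands one entry into candidate values. It assumes that the points of a continuous hyperparameter are evenly spaced and include both bounds. Table 16 does not say how the points are chosen, so treat the spacing as illustrative only. The function name candidate_values is hypothetical.

```python
def candidate_values(param):
    """Expand one search_params entry into a list of candidate values."""
    if param["param_type"] == "discrete":
        return list(param["discrete_values"])
    # Bounds and point count are documented as strings, so convert them first.
    low = float(param["lower_bound"])
    high = float(param["upper_bound"])
    points = int(param["discrete_points_num"])
    if points < 2:
        return [low]
    # Assumption: evenly spaced points, both bounds included.
    step = (high - low) / (points - 1)
    return [low + i * step for i in range(points)]


learning_rate = {
    "name": "learning_rate",
    "param_type": "continuous",
    "lower_bound": "0",
    "upper_bound": "1",
    "discrete_points_num": "5",
}
print(candidate_values(learning_rate))
```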

Table 17 algo_configs

Parameter

Type

Description

name

String

Name of the search algorithm.

params

Array of AutoSearchAlgoConfigParameter objects

Search algorithm parameters.

Table 18 AutoSearchAlgoConfigParameter

Parameter

Type

Description

key

String

Parameter key.

value

String

Parameter value.

type

String

Parameter type.

Table 19 Input

Parameter

Type

Description

name

String

Name of the data input channel.

description

String

Description of the data input channel.

local_dir

String

Local directory of the container to which the data input channel is mapped.

remote

InputDataInfo object

Data input. Options:

  • dataset: Dataset as the data input

  • obs: OBS path as the data input

remote_constraint

Array of remote_constraint objects

Data input constraint.

Table 20 InputDataInfo

Parameter

Type

Description

dataset

dataset object

Dataset as the data input.

obs

obs object

OBS in which the input and output data is stored.

Table 21 dataset

Parameter

Type

Description

id

String

Dataset ID of a training job.

version_id

String

Dataset version ID of a training job.

obs_url

String

OBS URL of the dataset required by a training job. ModelArts automatically parses and generates the URL based on the dataset and dataset version IDs. For example, /usr/data/.

Table 22 obs

Parameter

Type

Description

obs_url

String

OBS URL of the dataset required by a training job. For example, /usr/data/.

Table 23 remote_constraint

Parameter

Type

Description

data_type

String

Data input type, including the data storage location and dataset.

attributes

String

Attributes if a dataset is used as the data input. Options:

  • data_format: Data format

  • data_segmentation: Data segmentation

  • dataset_type: Labeling type

Table 24 Output

Parameter

Type

Description

name

String

Name of the data output channel.

description

String

Description of the data output channel.

local_dir

String

Local directory of the container to which the data output channel is mapped.

remote

remote object

Description of the actual data output.

Table 25 remote

Parameter

Type

Description

obs

obs object

OBS to which data is actually exported.

Table 26 obs

Parameter

Type

Description

obs_url

String

OBS URL to which data is actually exported.

Table 27 engine

Parameter

Type

Description

engine_id

String

Engine ID selected for a training job. You can set this parameter to engine_id, engine_name + engine_version, or image_url.

engine_name

String

Name of the engine selected for a training job. If engine_id is set, leave this parameter blank.

engine_version

String

Name of the engine version selected for a training job. If engine_id is set, leave this parameter blank.

image_url

String

Custom image URL selected for a training job.

Table 28 TaskResponse

Parameter

Type

Description

role

String

Task role. This function is not supported currently.

algorithm

algorithm object

Algorithm management and configuration.

task_resource

FlavorResponse object

Flavors of a training job or an algorithm.

Table 29 algorithm

Parameter

Type

Description

code_dir

String

Absolute path of the directory where the algorithm boot file is stored.

boot_file

String

Absolute path of the algorithm boot file.

inputs

inputs object

Algorithm input channel.

outputs

outputs object

Algorithm output channel.

engine

engine object

Engine on which a heterogeneous job depends.

local_code_dir

String

Local directory in the training container to which the algorithm code directory is downloaded. The following rules apply:

  • The directory must be in the /home directory.

  • In v1 compatibility mode, this field does not take effect.

  • When code_dir is prefixed with file://, this field does not take effect.

working_dir

String

Work directory where an algorithm is executed. Note that this parameter does not take effect in v1 compatibility mode.

Table 30 inputs

Parameter

Type

Description

name

String

Name of the data input channel.

local_dir

String

Local path of the container to which the data input and output channels are mapped.

remote

remote object

Actual data input. Heterogeneous jobs support only OBS.

Table 31 remote

Parameter

Type

Description

obs

obs object

OBS in which the input and output data is stored.

Table 32 obs

Parameter

Type

Description

obs_url

String

OBS URL of the dataset required by a training job. For example, /usr/data/.

Table 33 outputs

Parameter

Type

Description

name

String

Name of the data output channel.

local_dir

String

Local directory of the container to which the data output channel is mapped.

remote

remote object

Description of the actual data output.

mode

String

Data transmission mode. The default value is upload_periodically.

period

String

Data transmission period. The default value is 30s.

Table 34 remote

Parameter

Type

Description

obs

obs object

OBS to which data is actually exported.

Table 35 obs

Parameter

Type

Description

obs_url

String

OBS URL to which data is actually exported.

Table 36 engine

Parameter

Type

Description

engine_id

String

Engine ID of a heterogeneous job, for example, caffe-1.0.0-python2.7.

engine_name

String

Engine name of a heterogeneous job, for example, Caffe.

engine_version

String

Engine version of a heterogeneous job.

v1_compatible

Boolean

Whether the v1 compatibility mode is used.

run_user

String

UID of the user that the engine starts as by default.

image_url

String

Custom image URL selected by an algorithm.

Table 37 FlavorResponse

Parameter

Type

Description

flavor_id

String

ID of the resource flavor.

flavor_name

String

Name of the resource flavor.

max_num

Integer

Maximum number of nodes in a resource flavor.

flavor_type

String

Resource flavor type. Options:

  • CPU

  • GPU

billing

billing object

Billing information of a resource flavor.

flavor_info

flavor_info object

Resource flavor details.

attributes

Map<String,String>

Other specification attributes.

Table 38 billing

Parameter

Type

Description

code

String

Billing code.

unit_num

Integer

Number of billing units.

Table 39 flavor_info

Parameter

Type

Description

max_num

Integer

Maximum number of nodes that can be selected. The value 1 indicates that the distributed mode is not supported.

cpu

cpu object

CPU specifications.

gpu

gpu object

GPU specifications.

npu

npu object

Ascend specifications.

memory

memory object

Memory information.

disk

disk object

Disk information.

Table 40 cpu

Parameter

Type

Description

arch

String

CPU architecture.

core_num

Integer

Number of cores.

Table 41 gpu

Parameter

Type

Description

unit_num

Integer

Number of GPUs.

product_name

String

Product name.

memory

String

Memory.

Table 42 npu

Parameter

Type

Description

unit_num

String

Number of NPUs.

product_name

String

Product name.

memory

String

Memory.

Table 43 memory

Parameter

Type

Description

size

Integer

Memory size.

unit

String

Unit of the memory size.

Table 44 disk

Parameter

Type

Description

size

Integer

Disk size.

unit

String

Unit of the disk size.

Table 45 spec

Parameter

Type

Description

resource

Resource object

Resource flavors of a training job. Select either flavor_id or pool_id+[flavor_id].

volumes

Array of volumes objects

Volumes attached to a training job.

log_export_path

log_export_path object

Export path of training job logs.

Table 46 Resource

Parameter

Type

Description

policy

String

Resource flavor of a training job. Options: regular

flavor_id

String

ID of the resource flavor selected for a training job. flavor_id cannot be specified for dedicated resource pools with CPU specifications. The options for dedicated resource pools with GPU/Ascend specifications are as follows:

  • modelarts.pool.visual.xlarge (1 card)

  • modelarts.pool.visual.2xlarge (2 cards)

  • modelarts.pool.visual.4xlarge (4 cards)

  • modelarts.pool.visual.8xlarge (8 cards)

flavor_name

String

Read-only flavor name returned by ModelArts when flavor_id is used.

node_count

Integer

Number of resource replicas selected for a training job.

pool_id

String

Resource pool ID selected for a training job.

flavor_detail

flavor_detail object

Flavors of a training job or an algorithm.

Table 47 flavor_detail

Parameter

Type

Description

flavor_type

String

Resource flavor type. Options:

  • CPU

  • GPU

billing

billing object

Billing information of a resource flavor.

flavor_info

flavor_info object

Resource flavor details.

Table 48 billing

Parameter

Type

Description

code

String

Billing code.

unit_num

Integer

Number of billing units.

Table 49 flavor_info

Parameter

Type

Description

max_num

Integer

Maximum number of nodes that can be selected. The value 1 indicates that the distributed mode is not supported.

cpu

cpu object

CPU specifications.

gpu

gpu object

GPU specifications.

npu

npu object

Ascend specifications.

memory

memory object

Memory information.

disk

disk object

Disk information.

Table 50 cpu

Parameter

Type

Description

arch

String

CPU architecture.

core_num

Integer

Number of cores.

Table 51 gpu

Parameter

Type

Description

unit_num

Integer

Number of GPUs.

product_name

String

Product name.

memory

String

Memory.

Table 52 npu

Parameter

Type

Description

unit_num

String

Number of NPUs.

product_name

String

Product name.

memory

String

Memory.

Table 53 memory

Parameter

Type

Description

size

Integer

Memory size.

unit

String

Unit of the memory size.

Table 54 disk

Parameter

Type

Description

size

String

Disk size.

unit

String

Unit of the disk size. Generally, the value is GB.
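
The cpu, gpu, npu, memory, and disk objects of flavor_info can be folded into a one-line description, for example for logging. The following is a minimal sketch with a hypothetical function name and output format. It skips any object that is absent from the response.

```python
def describe_flavor(flavor_info):
    """Build a short human-readable summary of a flavor_info object."""
    parts = []
    cpu = flavor_info.get("cpu")
    if cpu:
        parts.append(f"{cpu['core_num']} {cpu['arch']} CPU cores")
    # gpu and npu share the same fields: unit_num, product_name, memory.
    for key in ("gpu", "npu"):
        accelerator = flavor_info.get(key)
        if accelerator:
            parts.append(
                f"{accelerator['unit_num']} x {accelerator['product_name']} "
                f"({accelerator['memory']})"
            )
    # memory and disk share the same fields: size and unit.
    for key in ("memory", "disk"):
        resource = flavor_info.get(key)
        if resource:
            parts.append(f"{resource['size']} {resource['unit']} {key}")
    return ", ".join(parts)


info = {
    "cpu": {"arch": "x86", "core_num": 8},
    "gpu": {"unit_num": 1, "product_name": "NVIDIA-P100", "memory": "8GB"},
    "memory": {"size": 64, "unit": "GB"},
}
print(describe_flavor(info))
```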

Table 55 volumes

Parameter

Type

Description

nfs

nfs object

Volumes attached in NFS mode.

Table 56 nfs

Parameter

Type

Description

nfs_server_path

String

NFS server path.

local_path

String

Path for attaching volumes to the training container.

read_only

Boolean

Whether the volumes attached to the container in NFS mode are read-only.

Table 57 log_export_path

Parameter

Type

Description

obs_url

String

OBS URL for storing training job logs.

host_path

String

Path of the host where training job logs are stored.

Example Requests

The following example queries training jobs. The request limits the result to one job and returns only jobs whose names contain trainjob.

POST https://endpoint/v2/{project_id}/training-job-searches?limit=1

{
  "offset" : 0,
  "limit" : 1,
  "filters" : [ {
    "key" : "name",
    "operator" : "like",
    "value" : [ "trainjob" ]
  }, {
    "key" : "create_time",
    "operator" : "between",
    "value" : [ "", "" ]
  }, {
    "key" : "phase",
    "operator" : "in",
    "value" : [ "" ]
  }, {
    "key" : "algorithm_name",
    "operator" : "like",
    "value" : [ "" ]
  }, {
    "key" : "kind",
    "operator" : "in",
    "value" : [ ]
  }, {
    "key" : "user_id",
    "operator" : "in",
    "value" : [ "" ]
  } ]
}
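
The request above can be assembled with the Python standard library. The sketch below only builds the request object and does not send it. This page does not describe authentication, so the X-Auth-Token header is an assumption, and the endpoint, project ID, and token values are placeholders. The function name build_search_request is hypothetical.

```python
import json
import urllib.request


def build_search_request(endpoint, project_id, token, body):
    """Build (but do not send) a POST request for the job search API."""
    url = f"https://{endpoint}/v2/{project_id}/training-job-searches"
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumption: token-based authentication through this header.
            "X-Auth-Token": token,
        },
        method="POST",
    )


body = {
    "offset": 0,
    "limit": 1,
    "filters": [{"key": "name", "operator": "like", "value": ["trainjob"]}],
}
request = build_search_request("endpoint", "my-project-id", "my-token", body)
print(request.get_method(), request.full_url)
```

Passing the request to urllib.request.urlopen would send it. That step is left out here because it needs a real endpoint and valid credentials.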

Example Responses

Status code: 200

ok

{
  "total" : 5059,
  "count" : 1,
  "limit" : 1,
  "offset" : 0,
  "sort_by" : "create_time",
  "order" : "desc",
  "group_by" : "",
  "workspace_id" : "0",
  "ai_project" : "default-ai-project",
  "items" : [ {
    "kind" : "job",
    "metadata" : {
      "id" : "3faf5c03-aaa1-4cbe-879d-24b05d997347",
      "name" : "trainjob--py14_mem06-byd-108",
      "description" : "",
      "create_time" : 1636447346315,
      "workspace_id" : "0",
      "user_name" : "ei_modelarts_q00357245_01"
    },
    "status" : {
      "phase" : "Abnormal",
      "secondary_phase" : "CreateFailed",
      "duration" : 0,
      "start_time" : 0,
      "node_count_metrics" : [ [ 1636447746000, 0 ], [ 1636447755000, 0 ], [ 1636447756000, 0 ] ],
      "tasks" : [ "worker-0" ]
    },
    "algorithm" : {
      "code_dir" : "obs://test-crq/economic_test/py_minist/",
      "boot_file" : "obs://test-crq/economic_test/py_minist/minist_common.py",
      "inputs" : [ {
        "name" : "data_url",
        "local_dir" : "/home/ma-user/modelarts/inputs/data_url_0",
        "remote" : {
          "obs" : {
            "obs_url" : "/test-crq/data/py_minist/"
          }
        }
      } ],
      "outputs" : [ {
        "name" : "train_url",
        "local_dir" : "/home/ma-user/modelarts/outputs/train_url_0",
        "remote" : {
          "obs" : {
            "obs_url" : "/test-crq/train_output/"
          }
        }
      } ],
      "engine" : {
        "engine_id" : "pytorch-cp36-1.4.0-v2",
        "engine_name" : "PyTorch",
        "engine_version" : "PyTorch-1.4.0-python3.6-v2"
      }
    },
    "spec" : {
      "resource" : {
        "policy" : "economic",
        "flavor_id" : "modelarts.vm.p100.large.eco",
        "flavor_name" : "Computing GPU(P100) instance",
        "node_count" : 1,
        "flavor_detail" : {
          "flavor_type" : "GPU",
          "billing" : {
            "code" : "modelarts.vm.gpu.p100.eco",
            "unit_num" : 1
          },
          "flavor_info" : {
            "cpu" : {
              "arch" : "x86",
              "core_num" : 8
            },
            "gpu" : {
              "unit_num" : 1,
              "product_name" : "NVIDIA-P100",
              "memory" : "8GB"
            },
            "memory" : {
              "size" : 64,
              "unit" : "GB"
            }
          }
        }
      }
    }
  } ]
}
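
A response like the one above can be flattened into one row per job. The sketch below reads only fields that appear in the example response and tolerates missing objects. The function name job_rows and the choice of columns are illustrative.

```python
def job_rows(response):
    """Return (name, phase, flavor_id, node_count) for each job in the response."""
    rows = []
    for item in response.get("items") or []:
        metadata = item.get("metadata") or {}
        status = item.get("status") or {}
        resource = (item.get("spec") or {}).get("resource") or {}
        rows.append(
            (
                metadata.get("name"),
                status.get("phase"),
                resource.get("flavor_id"),
                resource.get("node_count"),
            )
        )
    return rows


response = {
    "total": 5059,
    "count": 1,
    "items": [
        {
            "kind": "job",
            "metadata": {"name": "trainjob--py14_mem06-byd-108"},
            "status": {"phase": "Abnormal"},
            "spec": {
                "resource": {
                    "flavor_id": "modelarts.vm.p100.large.eco",
                    "node_count": 1,
                }
            },
        }
    ],
}
for row in job_rows(response):
    print(row)
```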

Status Codes

Status Code

Description

200

ok

Error Codes

See Error Codes.