Updated on 2024-06-20 GMT+08:00

Terminating a Training Job

Function

This API is used to terminate a training job. Only jobs in the creating, awaiting, or running state can be terminated.
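
As a rough illustration of the rule above, a client could check a job's state before calling this API. The helper below is a sketch, not part of any ModelArts SDK: the function name is hypothetical, and it assumes that the creating, awaiting, and running states correspond to the Creating, Pending, and Running values of status.phase described in Table 5.

```python
# Hypothetical helper; not part of any ModelArts SDK.
# Assumption: "creating", "awaiting", and "running" map to these phase values.
TERMINABLE_PHASES = frozenset({"Creating", "Pending", "Running"})


def is_terminable(phase: str) -> bool:
    """Return True if a job in this level-1 phase can be terminated."""
    return phase in TERMINABLE_PHASES


print(is_terminable("Running"))    # True
print(is_terminable("Completed"))  # False
```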

Debugging

You can debug this API through automatic authentication in API Explorer or use the SDK sample code generated by API Explorer.

URI

POST /v2/{project_id}/training-jobs/{training_job_id}/actions

Table 1 Path Parameters

Parameter | Mandatory | Type | Description

project_id | Yes | String | Project ID. For details, see Obtaining a Project ID and Name.

training_job_id | Yes | String | ID of a training job.

Request Parameters

Table 2 Request body parameters

Parameter | Mandatory | Type | Description

action_type | Yes | String | Operation request for a training job. If this parameter is set to terminate, the training job is terminated.

Response Parameters

Status code: 202

Table 3 Response body parameters

Parameter

Type

Description

kind

String

Training job type, which is job by default. Options:

  • job: training job

metadata

JobMetadata object

Metadata of a training job.

status

Status object

Status of a training job. You do not need to set this parameter when creating a job.

algorithm

JobAlgorithmResponse object

Algorithm used by a training job. The options are as follows:

  • id: Only the algorithm ID is used.

  • subscription_id+item_version_id: The subscription ID and version ID of the algorithm are used.

  • code_dir+boot_file: The code directory and boot file of the training job are used.

tasks

Array of TaskResponse objects

List of tasks in heterogeneous training jobs.

spec

spec object

Specifications of a training job.

Table 4 JobMetadata

Parameter

Type

Description

id

String

Training job ID, which is generated and returned by ModelArts after the training job is created.

name

String

Name of a training job. The value must contain 1 to 64 characters consisting of only digits, letters, underscores (_), and hyphens (-).

workspace_id

String

Workspace where a job is located. The default value is 0.

description

String

Training job description. The value must contain 0 to 256 characters. The default value is NULL.

create_time

Long

Time when a training job was created, in milliseconds. The value is generated and returned by ModelArts after a training job is created.

user_name

String

Username for creating a training job. The username is generated and returned by ModelArts after a training job is created.

annotations

Map<String,String>

Advanced configuration of a training job. Options:

  • job_template: Template RL (heterogeneous job)

  • fault-tolerance/job-retry-num: 3 (number of retries upon a fault)
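
The naming rule and the millisecond create_time above can be handled on the client side roughly as follows. This is a minimal sketch with hypothetical function names. It assumes that "letters" means ASCII letters and that create_time counts milliseconds since the Unix epoch, neither of which is stated explicitly in the table.

```python
import re
from datetime import datetime, timedelta, timezone

# Rule from Table 4: 1 to 64 characters; digits, letters, underscores, hyphens.
# Assumption: "letters" means ASCII letters only.
_NAME_RE = re.compile(r"[A-Za-z0-9_-]{1,64}")


def is_valid_job_name(name: str) -> bool:
    """Check a job name against the length and character rule."""
    return _NAME_RE.fullmatch(name) is not None


def create_time_to_utc(create_time_ms: int) -> datetime:
    """Convert a millisecond timestamp to a timezone-aware UTC datetime."""
    # Assumption: create_time counts milliseconds since the Unix epoch.
    seconds, millis = divmod(create_time_ms, 1000)
    return datetime.fromtimestamp(seconds, tz=timezone.utc) + timedelta(milliseconds=millis)


# Values taken from the example response at the end of this page.
print(is_valid_job_name("trainjob--py14_mem06-110"))
print(create_time_to_utc(1636515222282).isoformat())
```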

Table 5 Status

Parameter

Type

Description

phase

String

Level-1 status of a training job. The options are as follows: Creating, Pending, Running, Failed, Completed, Terminating, Terminated, Abnormal.

secondary_phase

String

Level-2 status of a training job. The values are internal detailed statuses and may be added, changed, or deleted. Dependency on the status is not recommended. The options are as follows: Creating, Queuing, Running, Failed, Completed, Terminating, Terminated, CreateFailed, TerminatedFailed, Unknown, Lost.

duration

Long

Running duration of a training job, in milliseconds.

node_count_metrics

Array<Array<Integer>>

Node count changes during the training job running period.

tasks

Array of strings

Tasks of a training job.

start_time

Long

Start time of a training job. The value is in timestamp format.

task_statuses

Array of task_statuses objects

Status of a training job task.

running_records

Array of running_records objects

Running and fault recovery records of a training job
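
Because termination is asynchronous (the API returns 202 and the job first enters Terminating), a client typically polls the level-1 phase until the job settles. The sketch below shows one way to do that without relying on secondary_phase. The function name, the polling parameters, and the choice of which phases count as final are assumptions; get_phase stands in for whatever call the client uses to query the job.

```python
import time
from typing import Callable

# Assumption: a job in one of these level-1 phases no longer changes state.
FINAL_PHASES = frozenset({"Failed", "Completed", "Terminated", "Abnormal"})


def wait_for_final_phase(
    get_phase: Callable[[], str],
    interval_s: float = 5.0,
    max_polls: int = 60,
) -> str:
    """Poll get_phase() until it returns a final phase or max_polls is reached."""
    phase = get_phase()
    for _ in range(max_polls - 1):
        if phase in FINAL_PHASES:
            break
        time.sleep(interval_s)
        phase = get_phase()
    return phase


# Usage with a stubbed phase source instead of a real API call.
phases = iter(["Terminating", "Terminating", "Terminated"])
print(wait_for_final_phase(lambda: next(phases), interval_s=0.0))
```

The function returns the last phase it saw, so the caller can tell a clean Terminated result from a timeout while the job is still Terminating.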

Table 6 task_statuses

Parameter

Type

Description

task

String

Name of a training job task.

exit_code

Integer

Exit code of a training job task.

message

String

Error message of a training job task.

Table 7 running_records

Parameter

Type

Description

start_at

Integer

Unix timestamp of the start time in the current running record, in seconds.

end_at

Integer

Unix timestamp of the end time in the current running record, in seconds.

start_type

String

Startup mode of the current running record. The options are as follows:

  • init_or_rescheduled: This startup is the first running after scheduling, including the first startup and the running after scheduling recovery.

  • restarted: This startup is not the first running after scheduling but the running after a process restart.

end_reason

String

Reason why the current running record ends.

end_related_task

String

ID of the task worker that causes the end of the current running record, for example, worker-0

end_recover

String

Fault tolerance policy used after the current running record ends. The enums are as follows:

  • npu_proc_restart: NPU in-place hot recovery

  • gpu_proc_restart: GPU in-place hot recovery

  • proc_restart: Process in-place recovery

  • pod_reschedule: Pod-level rescheduling

  • job_reschedule: Job-level rescheduling

  • job_reschedule_with_taint: Isolated job-level rescheduling

end_recover_before_downgrade

String

Tolerance policy used after the current running record ends and before the fault tolerance policy is degraded. The options are the same as those of end_recover.
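
Since start_at and end_at are Unix timestamps in seconds, the time a job actually spent running across restarts can be estimated by summing the individual records. The following is a sketch under assumptions: the function name is hypothetical, the sample records are made up for illustration, and records without a usable end_at are skipped.

```python
from typing import Iterable, Mapping


def total_running_seconds(records: Iterable[Mapping[str, object]]) -> int:
    """Sum end_at - start_at over all completed running records."""
    total = 0
    for record in records:
        start_at = record.get("start_at")
        end_at = record.get("end_at")
        # Assumption: skip records that have not ended or lack timestamps.
        if isinstance(start_at, int) and isinstance(end_at, int) and end_at >= start_at:
            total += end_at - start_at
    return total


# Illustrative data only; these values do not come from a real job.
sample = [
    {"start_at": 1700000000, "end_at": 1700000600,
     "start_type": "init_or_rescheduled", "end_recover": "proc_restart"},
    {"start_at": 1700000660, "end_at": 1700001260, "start_type": "restarted"},
]
print(total_running_seconds(sample))  # 1200
```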

Table 8 JobAlgorithmResponse

Parameter

Type

Description

id

String

Algorithm used by a training job. The options are as follows:

  • id: Only the algorithm ID is used.

  • subscription_id+item_version_id: The subscription ID and version ID of the algorithm are used.

  • code_dir+boot_file: The code directory and boot file of the training job are used.

name

String

Algorithm name.

subscription_id

String

Subscription ID of a subscribed algorithm, which must be used with item_version_id.

item_version_id

String

Version ID of the subscribed algorithm, which must be used with subscription_id.

code_dir

String

Code directory of a training job, for example, /usr/app/. This parameter must be used together with boot_file. If id or subscription_id+item_version_id is set, leave it blank.

boot_file

String

Boot file of a training job, which must be stored in the code directory, for example, /usr/app/boot.py. This parameter must be used with code_dir. Leave this parameter blank if id, or subscription_id and item_version_id are specified.

autosearch_config_path

String

YAML configuration path of auto search jobs. An OBS URL is required.

autosearch_framework_path

String

Framework code directory of auto search jobs. An OBS URL is required.

command

String

Boot command for starting the container of a custom image for a training job. For example, python train.py.

parameters

Array of Parameter objects

Running parameters of a training job.

policies

policies object

Policies supported by jobs.

inputs

Array of Input objects

Input of a training job.

outputs

Array of Output objects

Output of a training job.

engine

engine object

Engine of a training job. Leave this parameter blank if the job is created using id of the algorithm in algorithm management, or subscription_id+item_version_id of the subscribed algorithm.

local_code_dir

String

Local directory of the training container to which the algorithm code directory is downloaded. The rules are as follows:

  • The directory must be under /home.

  • In v1 compatibility mode, the current field does not take effect.

  • When code_dir is prefixed with file://, the current field does not take effect.

working_dir

String

Work directory where an algorithm is executed. Note that this parameter does not take effect in v1 compatibility mode.

environments

Array of Map<String,String> objects

Environment variables of a training job. The format is key:value. Leave this parameter blank.
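
The three ways of specifying an algorithm (id, subscription_id+item_version_id, or code_dir+boot_file) can be told apart from the returned fields. The sketch below shows one possible check. The function name and the order in which the combinations are tested are assumptions, not documented behavior.

```python
from typing import Mapping


def algorithm_source(algorithm: Mapping[str, object]) -> str:
    """Classify which field combination identifies the algorithm."""
    # Assumption: the combinations are checked in the order listed in Table 8.
    if algorithm.get("id"):
        return "id"
    if algorithm.get("subscription_id") and algorithm.get("item_version_id"):
        return "subscription_id+item_version_id"
    if algorithm.get("code_dir") and algorithm.get("boot_file"):
        return "code_dir+boot_file"
    return "unknown"


# Fields taken from the example response at the end of this page.
example = {
    "code_dir": "obs://test/economic_test/py_minist/",
    "boot_file": "obs://test/economic_test/py_minist/minist_common.py",
}
print(algorithm_source(example))  # code_dir+boot_file
```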

Table 9 Parameter

Parameter

Type

Description

name

String

Parameter name.

value

String

Parameter value.

description

String

Parameter description.

constraint

constraint object

Parameter constraint.

i18n_description

i18n_description object

Internationalization description.

Table 10 constraint

Parameter

Type

Description

type

String

Parameter type.

editable

Boolean

Whether the parameter is editable.

required

Boolean

Whether the parameter is mandatory.

sensitive

Boolean

Whether the parameter is sensitive. This function is not implemented currently.

valid_type

String

Valid type.

valid_range

Array of strings

Valid range.

Table 11 i18n_description

Parameter

Type

Description

language

String

Language. Options:

  • zh-cn: Chinese

  • en-us: English

description

String

Description.

Table 12 policies

Parameter

Type

Description

auto_search

auto_search object

Hyperparameter search configuration.

Table 14 reward_attrs

Parameter

Type

Description

name

String

Metric name.

mode

String

Search mode.

  • max: A larger metric value is preferred.

  • min: A smaller metric value is preferred.

regex

String

Regular expression of a metric.

Table 15 search_params

Parameter

Type

Description

name

String

Hyperparameter name.

param_type

String

Parameter type.

  • continuous: The hyperparameter is of the continuous type. When an algorithm is used in a training job, continuous hyperparameters are displayed as text boxes on the console.

  • discrete: The hyperparameter is of the discrete type. When an algorithm is used in a training job, discrete hyperparameters are displayed as drop-down lists on the console.

lower_bound

String

Lower bound of the hyperparameter.

upper_bound

String

Upper bound of the hyperparameter.

discrete_points_num

String

Number of discrete points of a continuous hyperparameter.

discrete_values

Array of strings

List of discrete hyperparameter values.

Table 16 algo_configs

Parameter

Type

Description

name

String

Name of the search algorithm.

params

Array of AutoSearchAlgoConfigParameter objects

Search algorithm parameters.

Table 17 AutoSearchAlgoConfigParameter

Parameter

Type

Description

key

String

Parameter key.

value

String

Parameter value.

type

String

Parameter type.

Table 18 Input

Parameter

Type

Description

name

String

Name of the data input channel.

description

String

Description of the data input channel.

local_dir

String

Local directory of the container to which the data input channel is mapped.

remote

InputDataInfo object

Information of the data input. Enums:

  • dataset: The data input is a dataset.

  • obs: The data input is an OBS path.

remote_constraint

Array of remote_constraint objects

Data input constraint.

Table 19 InputDataInfo

Parameter

Type

Description

dataset

dataset object

Dataset as the data input.

obs

obs object

OBS in which the input and output data is stored.

Table 20 dataset

Parameter

Type

Description

id

String

Dataset ID of a training job.

version_id

String

Dataset version ID of a training job.

obs_url

String

OBS URL of the dataset for a training job. It is automatically parsed by ModelArts based on the dataset ID and dataset version ID. For example, /usr/data/.

Table 21 obs

Parameter

Type

Description

obs_url

String

OBS URL of the dataset required by a training job. For example, /usr/data/.

Table 22 remote_constraint

Parameter

Type

Description

data_type

String

Data input type, including the data storage location and dataset.

attributes

String

Attributes if a dataset is used as the data input. Options:

  • data_format: Data format

  • data_segmentation: Data segmentation

  • dataset_type: Labeling type

Table 23 Output

Parameter

Type

Description

name

String

Name of the data output channel.

description

String

Description of the data output channel.

local_dir

String

Local directory of the container to which the data output channel is mapped.

remote

remote object

Description of the actual data output.

Table 24 remote

Parameter

Type

Description

obs

obs object

OBS to which data is actually exported.

Table 25 obs

Parameter

Type

Description

obs_url

String

OBS URL to which data is actually exported.

Table 26 engine

Parameter

Type

Description

engine_id

String

Engine ID selected for a training job. The value can be engine_id, engine_name + engine_version, or image_url.

engine_name

String

Name of the engine selected for a training job. If engine_id is set, leave this parameter blank.

engine_version

String

Name of the engine version selected for a training job. If engine_id is set, leave this parameter blank.

image_url

String

Custom image URL selected for a training job.

Table 27 TaskResponse

Parameter

Type

Description

role

String

Task role. This function is not supported currently.

algorithm

algorithm object

Algorithm management and configuration.

task_resource

FlavorResponse object

Flavors of a training job or an algorithm.

Table 28 algorithm

Parameter

Type

Description

code_dir

String

Absolute path of the directory where the algorithm boot file is stored.

boot_file

String

Absolute path of the algorithm boot file.

inputs

inputs object

Algorithm input channel.

outputs

outputs object

Algorithm output channel.

engine

engine object

Engine on which a heterogeneous job depends.

local_code_dir

String

Local directory of the training container to which the algorithm code directory is downloaded. The rules are as follows:

  • The directory must be under /home.

  • In v1 compatibility mode, the current field does not take effect.

  • When code_dir is prefixed with file://, the current field does not take effect.

working_dir

String

Work directory where an algorithm is executed. Note that this parameter does not take effect in v1 compatibility mode.

Table 29 inputs

Parameter

Type

Description

name

String

Name of the data input channel.

local_dir

String

Local path of the container to which the data input and output channels are mapped.

remote

remote object

Actual data input. Heterogeneous jobs support only OBS.

Table 30 remote

Parameter

Type

Description

obs

obs object

OBS in which the input and output data is stored.

Table 31 obs

Parameter

Type

Description

obs_url

String

OBS URL of the dataset required by a training job. For example, /usr/data/.

Table 32 outputs

Parameter

Type

Description

name

String

Name of the data output channel.

local_dir

String

Local directory of the container to which the data output channel is mapped.

remote

remote object

Description of the actual data output.

mode

String

Data transmission mode. The default value is upload_periodically.

period

String

Data transmission period. The default value is 30s.

Table 33 remote

Parameter

Type

Description

obs

obs object

OBS to which data is actually exported.

Table 34 obs

Parameter

Type

Description

obs_url

String

OBS URL to which data is exported.

Table 35 engine

Parameter

Type

Description

engine_id

String

Engine ID of a heterogeneous job, for example, caffe-1.0.0-python2.7.

engine_name

String

Engine name of a heterogeneous job, for example, Caffe.

engine_version

String

Engine version of a heterogeneous job.

v1_compatible

Boolean

Whether the v1 compatibility mode is used.

run_user

String

UID of the user that the engine starts as by default.

image_url

String

Custom image URL selected by an algorithm.

Table 36 FlavorResponse

Parameter

Type

Description

flavor_id

String

ID of the resource flavor.

flavor_name

String

Name of the resource flavor.

max_num

Integer

Maximum number of nodes in a resource flavor.

flavor_type

String

Resource flavor type. Options:

  • CPU

  • GPU

  • Ascend

billing

billing object

Billing information of a resource flavor.

flavor_info

flavor_info object

Resource flavor details.

attributes

Map<String,String>

Other specification attributes.

Table 37 billing

Parameter

Type

Description

code

String

Billing code.

unit_num

Integer

Number of billing units.

Table 38 flavor_info

Parameter

Type

Description

max_num

Integer

Maximum number of nodes that can be selected. The value 1 indicates that the distributed mode is not supported.

cpu

cpu object

CPU specifications.

gpu

gpu object

GPU specifications.

npu

npu object

Ascend specifications.

memory

memory object

Memory information.

disk

disk object

Disk information.

Table 39 cpu

Parameter

Type

Description

arch

String

CPU architecture.

core_num

Integer

Number of cores.

Table 40 gpu

Parameter

Type

Description

unit_num

Integer

Number of GPUs.

product_name

String

Product name.

memory

String

Memory.

Table 41 npu

Parameter

Type

Description

unit_num

String

Number of NPUs.

product_name

String

Product name.

memory

String

Memory.

Table 42 memory

Parameter

Type

Description

size

Integer

Memory size.

unit

String

Unit of the memory size.

Table 43 disk

Parameter

Type

Description

size

Integer

Disk size.

unit

String

Unit of the disk size.

Table 44 spec

Parameter

Type

Description

resource

Resource object

Resource flavors of a training job. Select either flavor_id or pool_id+[flavor_id].

volumes

Array of volumes objects

Volumes attached to a training job.

log_export_path

log_export_path object

Export path of training job logs.

Table 45 Resource

Parameter

Type

Description

policy

String

Resource flavor mode of a training job. The value is regular.

flavor_id

String

ID of the resource flavor selected for a training job. flavor_id cannot be specified for dedicated resource pools with CPU specifications. The options for dedicated resource pools with GPU/Ascend specifications are as follows:

  • modelarts.pool.visual.xlarge (1 card)

  • modelarts.pool.visual.2xlarge (2 cards)

  • modelarts.pool.visual.4xlarge (4 cards)

  • modelarts.pool.visual.8xlarge (8 cards)

flavor_name

String

Read-only flavor name returned by ModelArts when flavor_id is used.

node_count

Integer

Number of resource replicas selected for a training job.

pool_id

String

Resource pool ID selected for a training job.

flavor_detail

flavor_detail object

Flavor details of a training job or algorithm. This parameter is available only for public resource pools.
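
The rule "either flavor_id or pool_id+[flavor_id]" from Table 44 can be expressed as a small check on the resource object. This is a sketch only: the function name is hypothetical, and it reads the square brackets as meaning that flavor_id is optional when pool_id is present.

```python
from typing import Mapping


def resource_selection(resource: Mapping[str, object]) -> str:
    """Report which combination of flavor_id and pool_id a resource uses."""
    has_flavor = bool(resource.get("flavor_id"))
    has_pool = bool(resource.get("pool_id"))
    if has_pool:
        # Assumption: "[flavor_id]" means flavor_id is optional with pool_id.
        return "pool_id+flavor_id" if has_flavor else "pool_id"
    if has_flavor:
        return "flavor_id"
    raise ValueError("resource needs flavor_id or pool_id")


# flavor_id taken from the example response at the end of this page.
print(resource_selection({"flavor_id": "modelarts.vm.pnt1.large.eco", "node_count": 1}))
```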

Table 46 flavor_detail

Parameter

Type

Description

flavor_type

String

Resource flavor type. Options:

  • CPU

  • GPU

  • Ascend

billing

billing object

Billing information of a resource flavor.

flavor_info

flavor_info object

Resource flavor details.

Table 47 billing

Parameter

Type

Description

code

String

Billing code.

unit_num

Integer

Number of billing units.

Table 48 flavor_info

Parameter

Type

Description

max_num

Integer

Maximum number of nodes that can be selected. The value 1 indicates that the distributed mode is not supported.

cpu

cpu object

CPU specifications.

gpu

gpu object

GPU specifications.

npu

npu object

Ascend specifications.

memory

memory object

Memory information.

disk

disk object

Disk information.

Table 49 cpu

Parameter

Type

Description

arch

String

CPU architecture.

core_num

Integer

Number of cores.

Table 50 gpu

Parameter

Type

Description

unit_num

Integer

Number of GPUs.

product_name

String

Product name.

memory

String

Memory.

Table 51 npu

Parameter

Type

Description

unit_num

String

Number of NPUs.

product_name

String

Product name.

memory

String

Memory.

Table 52 memory

Parameter

Type

Description

size

Integer

Memory size.

unit

String

Unit of the memory size.

Table 53 disk

Parameter

Type

Description

size

String

Disk size.

unit

String

Unit of the disk size. Generally, the value is GB.

Table 54 volumes

Parameter

Type

Description

nfs

nfs object

Volumes attached in NFS mode.

Table 55 nfs

Parameter

Type

Description

nfs_server_path

String

NFS server path.

local_path

String

Path for attaching volumes to the training container.

read_only

Boolean

Whether the volumes attached to the container in NFS mode are read-only.

Table 56 log_export_path

Parameter

Type

Description

obs_url

String

OBS URL for storing training job logs.

host_path

String

Path of the host where training job logs are stored.

Example Requests

The following is an example of how to terminate the training job whose UUID is cf63aba9-63b1-4219-b717-708a2665100b.

POST https://endpoint/v2/{project_id}/training-jobs/cf63aba9-63b1-4219-b717-708a2665100b/actions

{
  "action_type" : "terminate"
}
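
The same request can be assembled in Python with the standard library, for example as sketched below. The function name and the placeholder values are hypothetical. Token authentication through an X-Auth-Token header is an assumption about how the caller authenticates; this page does not specify the request headers. The request is only built here, not sent.

```python
import json
import urllib.request


def build_terminate_request(
    endpoint: str, project_id: str, training_job_id: str, token: str
) -> urllib.request.Request:
    """Build (but do not send) the POST request that terminates a job."""
    url = f"https://{endpoint}/v2/{project_id}/training-jobs/{training_job_id}/actions"
    body = json.dumps({"action_type": "terminate"}).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        # Assumption: token-based authentication via the X-Auth-Token header.
        "X-Auth-Token": token,
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


# Placeholder endpoint, project ID, and token; replace with real values.
req = build_terminate_request(
    "endpoint", "my-project-id", "cf63aba9-63b1-4219-b717-708a2665100b", "my-token"
)
print(req.get_method(), req.full_url)
```

Passing the returned object to urllib.request.urlopen would send the request. A successful call is expected to return status code 202.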

Example Responses

Status code: 202

ok

{
  "kind" : "job",
  "metadata" : {
    "id" : "cf63aba9-63b1-4219-b717-708a2665100b",
    "name" : "trainjob--py14_mem06-110",
    "description" : "",
    "create_time" : 1636515222282,
    "workspace_id" : "0",
    "user_name" : "ei_modelarts_z00424192_01"
  },
  "status" : {
    "phase" : "Terminating",
    "secondary_phase" : "Terminating",
    "duration" : 0,
    "start_time" : 0,
    "node_count_metrics" : null,
    "tasks" : [ "worker-0" ]
  },
  "algorithm" : {
    "code_dir" : "obs://test/economic_test/py_minist/",
    "boot_file" : "obs://test/economic_test/py_minist/minist_common.py",
    "inputs" : [ {
      "name" : "data_url",
      "local_dir" : "/home/ma-user/modelarts/inputs/data_url_0",
      "remote" : {
        "obs" : {
          "obs_url" : "/test/data/py_minist/"
        }
      }
    } ],
    "outputs" : [ {
      "name" : "train_url",
      "local_dir" : "/home/ma-user/modelarts/outputs/train_url_0",
      "remote" : {
        "obs" : {
          "obs_url" : "/test/train_output/"
        }
      }
    } ],
    "engine" : {
      "engine_id" : "pytorch-cp36-1.4.0-v2",
      "engine_name" : "PyTorch",
      "engine_version" : "PyTorch-1.4.0-python3.6-v2"
    }
  },
  "spec" : {
    "resource" : {
      "policy" : "economic",
      "flavor_id" : "modelarts.vm.pnt1.large.eco",
      "flavor_name" : "Computing GPU(Pnt1) instance",
      "node_count" : 1,
      "flavor_detail" : {
        "flavor_type" : "GPU",
        "billing" : {
          "code" : "modelarts.vm.gpu.pnt1.eco",
          "unit_num" : 1
        },
        "flavor_info" : {
          "cpu" : {
            "arch" : "x86",
            "core_num" : 8
          },
          "gpu" : {
            "unit_num" : 1,
            "product_name" : "GP-Pnt1",
            "memory" : "8GB"
          },
          "memory" : {
            "size" : 64,
            "unit" : "GB"
          }
        }
      }
    }
  }
}
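
A client usually needs only a few fields from this response, such as the job ID and its phase. The sketch below extracts them from an abbreviated copy of the example above. The function name is hypothetical, and treating missing objects as empty is a design choice of this example, not documented API behavior.

```python
import json

# Abbreviated copy of the example response above.
response_text = """
{
  "kind": "job",
  "metadata": {"id": "cf63aba9-63b1-4219-b717-708a2665100b",
               "name": "trainjob--py14_mem06-110"},
  "status": {"phase": "Terminating", "secondary_phase": "Terminating",
             "tasks": ["worker-0"]}
}
"""


def summarize_job(job: dict) -> dict:
    """Pick the job ID, name, phase, and task list out of a response body."""
    metadata = job.get("metadata") or {}
    status = job.get("status") or {}
    return {
        "id": metadata.get("id"),
        "name": metadata.get("name"),
        "phase": status.get("phase"),
        "tasks": status.get("tasks") or [],
    }


summary = summarize_job(json.loads(response_text))
print(summary["id"], summary["phase"])
```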

Status Codes

Status Code | Description

202 | ok

Error Codes

See Error Codes.