Creating a Training Job

Updated: 2025-06-30 GMT+08:00

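Function

Creates a training job.

Debugging

You can debug this API in API Explorer, which supports automatic authentication. API Explorer can also generate SDK code examples and lets you debug them.

URI

POST /v2/{project_id}/training-jobs

Table 1 Path parameters
Parameter	Mandatory	Type	Description
project_id	Yes	String	Project ID. For details about how to obtain it, see "Obtaining a Project ID and Name".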

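Request Parameters

Table 2 Request body parameters
Parameter	Mandatory	Type	Description
kind	Yes	String	Training job type. The default is job, which indicates a training job. visualization_job indicates a visualization job.
metadata	Yes	JobMetadata object	Metadata of the training job.
algorithm	No	JobAlgorithm object	Algorithm of the training job. Three forms are currently supported: id: only the algorithm ID; subscription_id+item_version_id: the subscription ID and version ID of the algorithm; code_dir+boot_file: the code directory and boot file of the training job.
tasks	No	Array of Task objects	Task list. This function is not implemented yet.
spec	No	Spec object	Specifications of the training job. If this field is set, tasks does not need to be set.
endpoints	No	JobEndpointsReq object	Configuration required for remotely accessing the training job.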

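Table 3 JobMetadata
Parameter	Mandatory	Type	Description
name	Yes	String	Training job name. 1 to 64 characters; only digits, letters, underscores (_), and hyphens (-) are allowed.
workspace_id	No	String	Workspace where the job is located. Default: "0".
description	No	String	Description of the training job. Default: "NULL". Length: 0 to 256 characters.
annotations	No	Map<String,String>	Advanced feature configuration of the training job. Options: "job_template": "Template RL" (heterogeneous job); "fault-tolerance/job-retry-num": "3" (number of automatic restarts upon a fault); "fault-tolerance/job-unconditional-retry": "true" (unconditional restart); "fault-tolerance/hang-retry": "true" (restart upon a hang); "jupyter-lab/enable": "true" (JupyterLab training application); "tensorboard/enable": "true" (TensorBoard training application); "mindstudio-insight/enable": "true" (MindStudio Insight training application).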
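The name and description limits in Table 3 can be checked on the client before the request is sent. The following is a minimal sketch, not part of the API: the helper name build_metadata is hypothetical, and it assumes that "letters" means ASCII letters.

```python
import re

# Rule from Table 3: 1 to 64 characters; digits, letters, underscores, hyphens.
# Assumption: "letters" means ASCII letters only.
_JOB_NAME_RE = re.compile(r"[A-Za-z0-9_-]{1,64}")
MAX_DESCRIPTION_LEN = 256


def build_metadata(name, description=None, workspace_id=None):
    """Return a JobMetadata dict, raising ValueError on a rule violation."""
    if _JOB_NAME_RE.fullmatch(name) is None:
        raise ValueError("name must be 1-64 chars of [A-Za-z0-9_-]")
    metadata = {"name": name}
    if description is not None:
        if len(description) > MAX_DESCRIPTION_LEN:
            raise ValueError("description must be at most 256 characters")
        metadata["description"] = description
    if workspace_id is not None:
        metadata["workspace_id"] = workspace_id
    return metadata
```

Optional fields are left out when not given, so the service-side defaults described in the table apply.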

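Table 4 JobAlgorithm
Parameter	Mandatory	Type	Description
id	No	String	ID of the algorithm in algorithm management.
name	No	String	Algorithm name. You do not need to set this parameter.
subscription_id	No	String	Subscription ID of a subscribed algorithm. Must be used together with item_version_id.
item_version_id	No	String	Version of a subscribed algorithm. Must be used together with subscription_id.
code_dir	No	String	Code directory of the training job, for example, "/usr/app/". Must be used together with boot_file. Not required if id or subscription_id+item_version_id is set.
boot_file	No	String	Boot file of the training job, which must be in the code directory, for example, "/usr/app/boot.py". Must be used together with code_dir. Not required if id or subscription_id+item_version_id is set.
autosearch_config_path	No	String	YAML configuration path of the auto search job. An OBS path is required.
autosearch_framework_path	No	String	Framework code directory of the auto search job. An OBS path is required.
command	No	String	Container startup command of the custom image used by the training job (custom image scenario).
parameters	No	Array of Parameters objects	Running parameters of the training job.
policies	No	JobPolicies object	Policies supported by the job, used for hyperparameter search.
inputs	No	Array of Input objects	Data inputs of the training job.
outputs	No	Array of Output objects	Result outputs of the training job.
engine	No	JobEngine object	Engine of the training job. Not required when the job is created using an algorithm ID from algorithm management or a subscribed algorithm (subscription_id+item_version_id).
local_code_dir	No	String	Local path in the training container to which the algorithm code directory is downloaded. Rules: it must be a directory under /home; this field does not take effect in v1 compatibility mode; this field does not take effect when code_dir is prefixed with file://.
working_dir	No	String	Working directory used when the algorithm runs. Rule: this field does not take effect in v1 compatibility mode.
environments	No	Map<String,String>	Environment variables of the training job, in the format "key":"value". A key can contain up to 8192 characters and a value up to 4096 characters; up to 100 pairs are allowed. A variable name can contain only letters, digits, and underscores (_), and must start with a letter or an underscore (_). Note: referencing variables with the $ symbol is not supported.
summary	No	Summary object	Visualization log summary.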
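The constraints on environments can be pre-checked before a job is submitted. This is an illustrative sketch only: validate_environments is a hypothetical helper, and it assumes "letters" means ASCII letters. It does not reject values containing $, because the table only states that $ references are not expanded.

```python
import re

# Variable names: letters, digits, underscores; must start with a letter or underscore.
_ENV_NAME_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
MAX_PAIRS, MAX_KEY_LEN, MAX_VALUE_LEN = 100, 8192, 4096


def validate_environments(env):
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    if len(env) > MAX_PAIRS:
        problems.append(f"too many variables: {len(env)} > {MAX_PAIRS}")
    for key, value in env.items():
        if _ENV_NAME_RE.fullmatch(key) is None:
            problems.append(f"invalid variable name: {key!r}")
        if len(key) > MAX_KEY_LEN:
            problems.append(f"key too long: {key[:20]!r}...")
        if len(value) > MAX_VALUE_LEN:
            problems.append(f"value too long for {key!r}")
    return problems
```

For the environment block used in the custom image request example (NCCL_DEBUG and NCCL_IB_DISABLE), the function returns an empty list.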

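Table 5 Parameters
Parameter	Mandatory	Type	Description
name	No	String	Parameter name.
value	No	String	Parameter value.
description	No	String	Parameter description.
constraint	No	ParametersConstraint object	Parameter attributes.
i18n_description	No	I18nDescription object	Internationalized description.

Table 6 ParametersConstraint
Parameter	Mandatory	Type	Description
type	No	String	Parameter type.
editable	No	Boolean	Whether the parameter is editable.
required	No	Boolean	Whether the parameter is mandatory.
sensitive	No	Boolean	Whether the parameter is sensitive. This function is not implemented yet.
valid_type	No	String	Valid type.
valid_range	No	Array of strings	Valid range.

Table 7 I18nDescription
Parameter	Mandatory	Type	Description
language	No	String	Language.
description	No	String	Description.

Table 8 JobPolicies
Parameter	Mandatory	Type	Description
auto_search	No	AutoSearch object	Hyperparameter search configuration.

Table 9 AutoSearch
Parameter	Mandatory	Type	Description
skip_search_params	No	String	Hyperparameter combinations to exclude.
reward_attrs	No	Array of RewardAttrs objects	List of search metrics.
search_params	No	Array of SearchParams objects	Search parameters.
algo_configs	No	Array of AlgoConfigs objects	Search algorithm configurations.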

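**Table 10** RewardAttrs
Parameter	Mandatory	Type	Description
name	No	String	Metric name.
mode	No	String	Search direction. max: a larger metric value is better; min: a smaller metric value is better.
regex	No	String	Regular expression of the metric.

**Table 11** SearchParams
Parameter	Mandatory	Type	Description
name	No	String	Hyperparameter name.
param_type	No	String	Parameter type. continuous: the hyperparameter is continuous; when the algorithm is used in a training job, it is displayed as a text box on the console. discrete: the hyperparameter is discrete; when the algorithm is used in a training job, it is displayed as a drop-down list on the console.
lower_bound	No	String	Lower bound of the hyperparameter.
upper_bound	No	String	Upper bound of the hyperparameter.
discrete_points_num	No	String	Number of values into which a continuous hyperparameter is discretized.
discrete_values	No	Array of strings	List of values of a discrete hyperparameter.

**Table 12** AlgoConfigs
Parameter	Mandatory	Type	Description
name	No	String	Search algorithm name.
params	No	Array of AutoSearchAlgoConfigParameter objects	Search algorithm parameters.

**Table 13** AutoSearchAlgoConfigParameter
Parameter	Mandatory	Type	Description
key	No	String	Parameter key.
value	No	String	Parameter value.
type	No	String	Parameter type.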

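**Table 14** JobEngine
Parameter	Mandatory	Type	Description
engine_id	No	String	ID of the engine flavor selected for the training job. Use one of engine_id, engine_name+engine_version, or image_url.
engine_name	No	String	Name of the engine selected for the training job. Not required if engine_id is set. When a job is created with a preset framework plus a custom image, both this parameter and image_url must be passed.
engine_version	No	String	Version name of the engine selected for the training job. Not required if engine_id is set.
image_url	No	String	URL of the custom image selected for the training job, obtained from SWR. Format: organization name/image name:version.
install_sys_packages	No	Boolean	Whether to install the MoXing version specified by the training platform. true means yes. This setting is supported only when engine_name, engine_version, and image_url are all set.

**Table 15** Summary
Parameter	Mandatory	Type	Description
log_type	No	String	Visualization log type of the training job. Once configured, the training job can serve as a data source for visualization jobs. Options: "tensorboard", "mindstudio-insight".
log_dir	No	LogDir object	Visualization log output of the training job. Mandatory when log_type is not empty.
data_sources	No	Array of DataSource objects	Visualization log input for a visualization job or for a training job in debug mode. Mandatory when the advanced feature "tensorboard/enable": "true" or "mindstudio-insight/enable": "true" is enabled for the training job.

**Table 16** LogDir
Parameter	Mandatory	Type	Description
pfs	Yes	PFSSummary object	Output to an OBS parallel file system.

**Table 17** PFSSummary
Parameter	Mandatory	Type	Description
pfs_path	Yes	String	URL of the OBS parallel file system path.

**Table 18** DataSource
Parameter	Mandatory	Type	Description
job	Yes	JobSummary object	Job data source.

**Table 19** JobSummary
Parameter	Mandatory	Type	Description
job_id	Yes	String	Training job ID.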

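**Table 20** Task
Parameter	Mandatory	Type	Description
role	No	String	Task role. This function is not supported yet.
algorithm	No	algorithm object	Algorithm configuration from algorithm management.
task_resource	No	task_resource object	Resource flavor information of the training job.

**Table 21** algorithm
Parameter	Mandatory	Type	Description
job_config	No	job_config object	Algorithm configuration, such as the boot file.
code_dir	No	String	Code directory of the algorithm, for example, "/usr/app/". Must be used together with boot_file.
boot_file	No	String	Boot file of the algorithm, which must be in the code directory, for example, "/usr/app/boot.py". Must be used together with code_dir.
engine	No	engine object	Engine of the heterogeneous job algorithm.
inputs	No	Array of inputs objects	Data inputs of the algorithm.
outputs	No	Array of outputs objects	Data outputs of the algorithm.
local_code_dir	No	String	Local path in the training container to which the algorithm code directory is downloaded. Rules: it must be a directory under /home; this field does not take effect in v1 compatibility mode; this field does not take effect when code_dir is prefixed with file://.
working_dir	No	String	Working directory used when the algorithm runs. Rule: this field does not take effect in v1 compatibility mode.

**Table 22** job_config
Parameter	Mandatory	Type	Description
parameters	No	Array of Parameter objects	Running parameters of the algorithm.
inputs	No	Array of Input objects	Data inputs of the algorithm.
outputs	No	Array of Output objects	Data outputs of the algorithm.
engine	No	engine object	Engine of the algorithm.

**Table 23** Parameter
Parameter	Mandatory	Type	Description
name	No	String	Parameter name.
value	No	String	Parameter value.
description	No	String	Parameter description.
constraint	No	constraint object	Parameter attributes.
i18n_description	No	i18n_description object	Internationalized description.

**Table 24** constraint
Parameter	Mandatory	Type	Description
type	No	String	Parameter type.
editable	No	Boolean	Whether the parameter is editable.
required	No	Boolean	Whether the parameter is mandatory.
sensitive	No	Boolean	Whether the parameter is sensitive. This function is not implemented yet.
valid_type	No	String	Valid type.
valid_range	No	Array of strings	Valid range.

**Table 25** i18n_description
Parameter	Mandatory	Type	Description
language	No	String	Language. Options: zh-cn (Chinese); en-us (English).
description	No	String	Description in the specified language.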

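**Table 26** Input
Parameter	Mandatory	Type	Description
name	Yes	String	Name of the data input channel.
description	No	String	Description of the data input channel.
local_dir	No	String	Local container path mapped to the data input channel, for example, "/home/ma-user/modelarts/inputs/data_url_0".
access_method	No	String	How the data input channel path (local_dir) is delivered. If left empty, it is delivered as a hyperparameter by default. parameter: as a hyperparameter; env: as an environment variable.
remote	Yes	InputDataInfo object	Actual data input. Options: dataset: the input is a dataset; obs: the input is an OBS path.
remote_constraint	No	Array of remote_constraint objects	Data input constraints.

**Table 27** InputDataInfo
Parameter	Mandatory	Type	Description
dataset	No	dataset object	The data input is a dataset.
obs	No	obs object	The data input or output uses OBS.

**Table 28** dataset
Parameter	Mandatory	Type	Description
id	Yes	String	Dataset ID of the training job.
version_id	Yes	String	Dataset version ID of the training job.

**Table 29** obs
Parameter	Mandatory	Type	Description
obs_url	Yes	String	OBS URL of the dataset required by the training job, for example, "/usr/data/".

**Table 30** remote_constraint
Parameter	Mandatory	Type	Description
data_type	No	String	Data input type, either a data storage location or a dataset.
attributes	No	String	Attributes used when the data input is a dataset. Options: data_format: data format; data_segmentation: data segmentation mode; dataset_type: labeling type.

**Table 31** Output
Parameter	Mandatory	Type	Description
name	Yes	String	Name of the data output channel.
description	No	String	Description of the data output channel.
local_dir	No	String	Local container path mapped to the data output channel.
access_method	No	String	How the data output channel path (local_dir) is delivered. If left empty, it is delivered as a hyperparameter by default. parameter: as a hyperparameter; env: as an environment variable.
remote	Yes	Remote object	Actual data output.

**Table 32** Remote
Parameter	Mandatory	Type	Description
obs	Yes	RemoteObs object	Data is output to OBS.

**Table 33** RemoteObs
Parameter	Mandatory	Type	Description
obs_url	Yes	String	OBS path to which data is output.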

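**Table 34** engine
Parameter	Mandatory	Type	Description
engine_id	No	String	ID of the engine flavor selected for the algorithm.
engine_name	No	String	Name of the engine selected for the algorithm. Not required if engine_id is set.
engine_version	No	String	Version name of the engine selected for the algorithm. Not required if engine_id is set.
image_url	No	String	URL of the custom image selected for the algorithm.

**Table 35** engine
Parameter	Mandatory	Type	Description
engine_id	No	String	ID of the engine flavor of the heterogeneous job, for example, "caffe-1.0.0-python2.7".
engine_name	No	String	Name of the engine flavor of the heterogeneous job, for example, "Caffe".
engine_version	No	String	Version of the engine flavor of the heterogeneous job.
image_url	No	String	URL of the custom image selected for the algorithm.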

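**Table 36** inputs
Parameter	Mandatory	Type	Description
name	Yes	String	Name of the data input channel.
description	No	String	Description of the data input channel.
local_dir	No	String	Local container path mapped to the data input channel.
remote	Yes	remote object	Actual data input. Options: dataset: the input is a dataset; obs: the input is an OBS path.

**Table 37** remote
Parameter	Mandatory	Type	Description
obs	No	obs object	The data input or output uses OBS.

**Table 38** obs
Parameter	Mandatory	Type	Description
obs_url	Yes	String	OBS URL of the dataset required by the training job, for example, "/usr/data/".

**Table 39** outputs
Parameter	Mandatory	Type	Description
name	Yes	String	Name of the data output channel.
description	No	String	Description of the data output channel.
local_dir	No	String	Local container path mapped to the data output channel.
remote	Yes	remote object	Actual data output.

**Table 40** remote
Parameter	Mandatory	Type	Description
obs	Yes	obs object	Data is output to OBS.

**Table 41** obs
Parameter	Mandatory	Type	Description
obs_url	Yes	String	OBS path to which data is output.

**Table 42** task_resource
Parameter	Mandatory	Type	Description
flavor_id	No	String	ID of the resource flavor selected for the training job.
node_count	Yes	Integer	Number of resource replicas selected for the training job.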

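**Table 43** Spec
Parameter	Mandatory	Type	Description
resource	No	SpecResource object	Resource flavor information of the training job. Use either flavor_id or pool_id+[flavor_id]. For a public resource pool, pass only flavor_id to select the flavor (number of cards, memory, and so on) that the job needs; the job can be scheduled when the idle resources of the public pool meet the selected flavor. For a dedicated resource pool, pass pool_id and flavor_id to select an actual flavor available in that pool, that is, the minimum number of cards that meets the job's requirements, which saves dedicated resources and improves utilization.
volumes	No	Array of SpecVolumes objects	Volumes mounted to the training job.
log_export_path	No	LogExportPath object	Log output information of the training job.
auto_stop	No	AutoStop object	Auto-stop configuration of the training job.
schedule_policy	No	SchedulePolicy object	Scheduling policy of the training job.
notification	No	Notification object	Message notification for training events.
custom_metrics	No	Array of CustomMetrics objects	Metric collection configuration.

**Table 44** SpecResource
Parameter	Mandatory	Type	Description
flavor_id	No	String	Resource flavor ID of the training job. Dedicated resource pools with CPU flavors do not support flavor_id. Options for dedicated resource pools with GPU/Ascend flavors: modelarts.pool.visual.xlarge (1 card); modelarts.pool.visual.2xlarge (2 cards); modelarts.pool.visual.4xlarge (4 cards); modelarts.pool.visual.8xlarge (8 cards); modelarts.pool.visual.16xlarge (16 cards, currently only for Snt9b23 supernode resource pools).
node_count	No	Integer	Number of nodes the training job uses in the resource pool. Default: one node.
pool_id	No	String	Dedicated resource pool ID.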
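The two ways of filling spec.resource (flavor_id alone for a public pool, or pool_id with an optional flavor_id for a dedicated pool) might be wrapped in a small helper. This is a sketch under the rules stated in Tables 43 and 44; build_resource is a hypothetical name, not an SDK function.

```python
def build_resource(flavor_id=None, pool_id=None, node_count=1):
    """Build the spec.resource dict for a public or dedicated resource pool.

    Public pool:    pass flavor_id only.
    Dedicated pool: pass pool_id, plus flavor_id where the pool supports it
                    (CPU dedicated pools do not accept flavor_id per Table 44).
    """
    if flavor_id is None and pool_id is None:
        raise ValueError("set flavor_id (public pool) or pool_id (dedicated pool)")
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    resource = {"node_count": node_count}
    if flavor_id is not None:
        resource["flavor_id"] = flavor_id
    if pool_id is not None:
        resource["pool_id"] = pool_id
    return resource
```

Calling build_resource(flavor_id="modelarts.pool.visual.xlarge", pool_id="poolfaf38d76") reproduces the resource block of the custom image request example.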

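**Table 45** SpecVolumes
Parameter	Mandatory	Type	Description
nfs	No	Nfs object	NFS volume mounted to the training job.
pfs	No	Pfs object	OBSFS volume mounted to the training job.
obs	No	Obs object	OBS volume mounted to the training job.

**Table 46** Nfs
Parameter	Mandatory	Type	Description
nfs_server_path	No	String	NFS server path, for example, "10.10.10.10:/example/path".
local_path	No	String	Path mounted in the training container, for example, "/example/path".
read_only	No	Boolean	Whether the NFS volume is read-only in the container.

**Table 47** Pfs
Parameter	Mandatory	Type	Description
pfs_path	No	String	OBSFS address, for example, "/test-bucket/path".
local_path	No	String	Path mounted in the training container, for example, "/example/path".

**Table 48** Obs
Parameter	Mandatory	Type	Description
obs_path	No	String	OBS path to mount, for example, "/test-bucket/path".
local_path	No	String	Path mounted in the training container, for example, "/example/path".

**Table 49** LogExportPath
Parameter	Mandatory	Type	Description
obs_url	No	String	OBS path where training job logs are stored, for example, "obs://example/path".
host_path	No	String	Host path where training job logs are stored, for example, "/example/path".

**Table 50** AutoStop
Parameter	Mandatory	Type	Description
time_unit	Yes	String	Time unit. Options: HOURS.
duration	Yes	Integer	Running duration. Minimum value: 1.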

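**Table 51** SchedulePolicy
Parameter	Mandatory	Type	Description
required_affinity	No	RequiredAffinity object	Affinity requirements of the training job.
priority	No	Integer	Priority of the training job. Constraint: priority can be set only when training uses a dedicated resource pool. The value ranges from 1 to 3; the default is 1 and the highest is 3. Users with default permissions can select priority 1 or 2; users granted the "set job to high priority" permission can select 1 to 3.
preemptible	No	Boolean	Whether the job can be preempted.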
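The priority constraints in Table 51 can be expressed as a client-side check. The sketch below is illustrative: check_priority and its can_set_high_priority flag are hypothetical, and the flag stands in for the "set job to high priority" permission, which the client would have to know by other means.

```python
def check_priority(spec, can_set_high_priority=False):
    """Return a list of problems with spec.schedule_policy.priority."""
    priority = spec.get("schedule_policy", {}).get("priority")
    if priority is None:
        return []  # the service default (1) applies
    problems = []
    if priority not in (1, 2, 3):
        problems.append("priority must be 1, 2, or 3")
    elif priority == 3 and not can_set_high_priority:
        problems.append("priority 3 requires the high-priority permission")
    # Priority is only supported for dedicated resource pools (pool_id set).
    if not spec.get("resource", {}).get("pool_id"):
        problems.append("priority requires a dedicated resource pool")
    return problems
```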

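**Table 52** RequiredAffinity
Parameter	Mandatory	Type	Description
affinity_type	No	String	Affinity scheduling policy. Options: cabinet: strict whole-cabinet scheduling; hyperinstance: supernode affinity scheduling.
affinity_group_size	No	Integer	Affinity group size. Mandatory when affinity_type is hyperinstance. The system schedules affinity_group_size tasks into one supernode to form an affinity group. If a job is submitted to a supernode resource pool without an affinity group size, the system sets it to 1 by default.

**Table 53** Notification
Parameter	Mandatory	Type	Description
topic_urn	No	String	URN (unique resource name) of the topic selected in the message notification service.
events	No	Array of strings	Training events that trigger a notification. Options: JobStarted: job started; JobCompleted: job completed; JobFailed: job failed; JobTerminated: job terminated; JobRestarted: job restarted; JobHanged: job hung; JobPreempted: job preempted.

**Table 54** CustomMetrics
Parameter	Mandatory	Type	Description
exec	No	Exec object	Metric collection using a command line.
http_get	No	HttpGet object	Metric collection over HTTP.

**Table 55** Exec
Parameter	Mandatory	Type	Description
command	No	Array of strings	Command used to collect metrics.

**Table 56** HttpGet
Parameter	Mandatory	Type	Description
path	No	String	URL path for obtaining metrics over HTTP. Must be set together with port, or both left unset.
port	No	Integer	Port for obtaining metrics over HTTP. Must be set together with path, or both left unset.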
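The rule that path and port in HttpGet must be set together or both omitted is easy to violate when entries are assembled by hand. A minimal sketch of a check follows; the function name is hypothetical.

```python
def http_get_is_consistent(custom_metric):
    """True if http_get (when present) sets both path and port, or neither."""
    http_get = custom_metric.get("http_get")
    if http_get is None:
        return True  # exec-only entries are not affected by this rule
    has_path = http_get.get("path") is not None
    has_port = http_get.get("port") is not None
    return has_path == has_port
```

An entry such as {"http_get": {"path": "/raw_text", "port": 10001}} passes, while an entry with only a path does not.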

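**Table 57** JobEndpointsReq
Parameter	Mandatory	Type	Description
ssh	No	SSHReq object	SSH connection information.

**Table 58** SSHReq
Parameter	Mandatory	Type	Description
key_pair_names	No	Array of strings	SSH key pair names. Key pairs can be created and viewed on the "Key Pair" page of the Elastic Cloud Server (ECS) console.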

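Response Parameters

Status code: 201

**Table 59** Response body parameters
Parameter	Type	Description
kind	String	Training job type. The default is job. Options: job: training job.
metadata	JobMetadata object	Metadata of the training job.
status	Status object	Status of the training job. Not required when creating a job.
algorithm	JobAlgorithmResponse object	Algorithm of the training job. Three forms are currently supported: id: only the algorithm ID; subscription_id+item_version_id: the subscription ID and version ID of the algorithm; code_dir+boot_file: the code directory and boot file of the training job.
tasks	Array of TaskResponse objects	Task list of a heterogeneous training job.
spec	SpecResponce object	Specifications of the training job.
endpoints	JobEndpointsResp object	Configuration required for remotely accessing the training job.

**Table 60** JobMetadata
Parameter	Type	Description
id	String	Training job ID. Generated and returned by ModelArts after the job is created; you do not need to set it.
name	String	Training job name. 1 to 64 characters; only digits, letters, underscores (_), and hyphens (-) are allowed.
workspace_id	String	Workspace where the job is located. Default: "0".
description	String	Description of the training job. Default: "NULL". Length: 0 to 256 characters.
create_time	Long	Timestamp when the training job was created, in milliseconds. Generated and returned by ModelArts after the job is created; you do not need to set it.
user_name	String	Username of the user who created the training job. Generated and returned by ModelArts after the job is created; you do not need to set it.
annotations	Map<String,String>	Advanced feature configuration of the training job. Options: "job_template": "Template RL" (heterogeneous job); "fault-tolerance/job-retry-num": "3" (number of automatic restarts upon a fault); "fault-tolerance/job-unconditional-retry": "true" (unconditional restart); "fault-tolerance/hang-retry": "true" (restart upon a hang); "jupyter-lab/enable": "true" (JupyterLab training application); "tensorboard/enable": "true" (TensorBoard training application); "mindstudio-insight/enable": "true" (MindStudio Insight training application).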

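**Table 61** Status
Parameter	Type	Description
phase	String	Level-1 status of the training job. Options: Creating; Pending; Running; Failed; Completed; Terminating (stopping); Terminated (stopped); Abnormal.
secondary_phase	String	Level-2 status of the training job. It is an internal detailed status that may be added to, changed, or removed, so do not depend on it. Options: Creating; Queuing; Running; Failed; Completed; Terminating (stopping); Terminated (stopped); CreateFailed (creation failed); TerminatedFailed (stop failed); Unknown; Lost (abnormal).
duration	Long	Running duration of the training job, in milliseconds.
node_count_metrics	Array<Array<Integer>>	Metrics of node count changes while the training job runs.
tasks	Array of strings	Names of the training job subtasks.
start_time	Long	Start time of the training job, as a timestamp.
task_statuses	Array of TaskStatuses objects	Status of the training job subtasks.
running_records	Array of RunningRecord objects	Running and fault recovery records of the training job.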
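Client code that waits for a job usually needs to decide when to stop polling based on phase. The sketch below classifies the level-1 phases listed in Table 61. Treating Completed, Failed, Terminated, and Abnormal as final is an assumption made here for illustration; the table itself does not state which phases are terminal.

```python
# Assumption: these level-1 phases are final; the rest are still in progress.
ASSUMED_FINAL_PHASES = frozenset({"Completed", "Failed", "Terminated", "Abnormal"})


def job_is_finished(job):
    """Return True if the job's level-1 phase is one of the assumed final ones.

    `job` is a response body dict. secondary_phase is deliberately ignored,
    because the reference advises against depending on it.
    """
    phase = job.get("status", {}).get("phase")
    return phase in ASSUMED_FINAL_PHASES
```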

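**Table 62** TaskStatuses
Parameter	Type	Description
task	String	Name of the training job subtask.
exit_code	Integer	Exit code of the training job subtask.
message	String	Error message of the training job subtask.

**Table 63** RunningRecord
Parameter	Type	Description
start_at	Integer	Unix timestamp of the start of this run, in seconds.
end_at	Integer	Unix timestamp of the end of this run, in seconds.
start_type	String	Startup mode of this run. init_or_rescheduled: this is the first run after scheduling, including the initial start and runs after rescheduling recovery. restarted: this is not the first run after scheduling; it is a run after a process restart.
end_reason	String	Reason this run ended.
end_related_task	String	Task worker ID (for example, worker-0) that caused this run to end.
end_recover	String	Fault tolerance policy applied after this run ended. Options: npu_proc_restart: in-place NPU hot recovery; gpu_proc_restart: in-place GPU hot recovery; proc_restart: in-place process restart; pod_reschedule: pod-level rescheduling; job_reschedule: job-level rescheduling; job_reschedule_with_taint: isolated job rescheduling.
end_recover_before_downgrade	String	Fault tolerance policy applied after this run ended, before the policy was downgraded. The value range is the same as that of end_recover.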
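Because start_at and end_at are in seconds while status.duration is in milliseconds, it can help to summarize running_records separately. A minimal sketch, with a hypothetical helper name and made-up sample values in the usage note:

```python
def summarize_runs(running_records):
    """Return (total_run_seconds, recovery_policies) for a list of RunningRecord dicts.

    Records that lack start_at or end_at (for example, a run still in
    progress) are skipped when summing the run time.
    """
    total_seconds = 0
    recoveries = []
    for record in running_records:
        start, end = record.get("start_at"), record.get("end_at")
        if start is not None and end is not None:
            total_seconds += end - start
        if record.get("end_recover"):
            recoveries.append(record["end_recover"])
    return total_seconds, recoveries
```

For two illustrative records, one running from 100 to 160 and ending with proc_restart and one running from 170 to 200, the function returns 90 seconds and the single policy proc_restart.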

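**Table 64** JobAlgorithmResponse
Parameter	Type	Description
id	String	ID of the algorithm in algorithm management.
name	String	Algorithm name.
subscription_id	String	Subscription ID of a subscribed algorithm. Must be used together with item_version_id.
item_version_id	String	Version of a subscribed algorithm. Must be used together with subscription_id.
code_dir	String	Code directory of the training job, for example, "/usr/app/". Must be used together with boot_file. Not required if id or subscription_id+item_version_id is set.
boot_file	String	Boot file of the training job, which must be in the code directory, for example, "/usr/app/boot.py". Must be used together with code_dir. Not required if id or subscription_id+item_version_id is set.
autosearch_config_path	String	YAML configuration path of the auto search job. An OBS path is required, for example, "obs://bucket/file.yaml".
autosearch_framework_path	String	Framework code directory of the auto search job. An OBS path is required, for example, "obs://bucket/files/".
command	String	Container startup command of the custom image used by the training job, for example, python train.py.
parameters	Array of Parameter objects	Running parameters of the training job.
policies	policies object	Policies supported by the job.
inputs	Array of Input objects	Data inputs of the training job.
outputs	Array of Output objects	Result outputs of the training job.
engine	JobEngine object	Engine of the training job. Not required when the job is created using an algorithm ID from algorithm management or a subscribed algorithm (subscription_id+item_version_id).
local_code_dir	String	Local path in the training container to which the algorithm code directory is downloaded. Rules: it must be a directory under /home; this field does not take effect in v1 compatibility mode; this field does not take effect when code_dir is prefixed with file://.
working_dir	String	Working directory used when the algorithm runs. Rule: this field does not take effect in v1 compatibility mode.
environments	Array of Map<String,String> objects	Environment variables of the training job, in the format "key":"value". You do not need to set this parameter.
summary	Summary object	Visualization log summary.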

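**Table 65** Parameter
Parameter	Type	Description
name	String	Parameter name.
value	String	Parameter value.
description	String	Parameter description.
constraint	constraint object	Parameter attributes.
i18n_description	i18n_description object	Internationalized description.

**Table 66** constraint
Parameter	Type	Description
type	String	Parameter type.
editable	Boolean	Whether the parameter is editable.
required	Boolean	Whether the parameter is mandatory.
sensitive	Boolean	Whether the parameter is sensitive. This function is not implemented yet.
valid_type	String	Valid type.
valid_range	Array of strings	Valid range.

**Table 67** i18n_description
Parameter	Type	Description
language	String	Language. Options: zh-cn (Chinese); en-us (English).
description	String	Description in the specified language.

**Table 68** policies
Parameter	Type	Description
auto_search	auto_search object	Hyperparameter search configuration.

**Table 69** auto_search
Parameter	Type	Description
skip_search_params	String	Hyperparameter combinations to exclude.
reward_attrs	Array of reward_attrs objects	List of search metrics.
search_params	Array of search_params objects	Search parameters.
algo_configs	Array of algo_configs objects	Search algorithm configurations.

**Table 70** reward_attrs
Parameter	Type	Description
name	String	Metric name.
mode	String	Search direction. max: a larger metric value is better; min: a smaller metric value is better.
regex	String	Regular expression of the metric.

**Table 71** search_params
Parameter	Type	Description
name	String	Hyperparameter name.
param_type	String	Parameter type. continuous: the hyperparameter is continuous; when the algorithm is used in a training job, it is displayed as a text box on the console. discrete: the hyperparameter is discrete; when the algorithm is used in a training job, it is displayed as a drop-down list on the console.
lower_bound	String	Lower bound of the hyperparameter.
upper_bound	String	Upper bound of the hyperparameter.
discrete_points_num	String	Number of values into which a continuous hyperparameter is discretized.
discrete_values	Array of strings	List of values of a discrete hyperparameter.

**Table 72** algo_configs
Parameter	Type	Description
name	String	Search algorithm name.
params	Array of AutoSearchAlgoConfigParameter objects	Search algorithm parameters.

**Table 73** AutoSearchAlgoConfigParameter
Parameter	Type	Description
key	String	Parameter key.
value	String	Parameter value.
type	String	Parameter type.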

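**Table 74** Input
Parameter	Type	Description
name	String	Name of the data input channel.
description	String	Description of the data input channel.
local_dir	String	Local container path mapped to the data input channel, for example, "/home/ma-user/modelarts/inputs/data_url_0".
access_method	String	How the data input channel path (local_dir) is delivered. If left empty, it is delivered as a hyperparameter by default. parameter: as a hyperparameter; env: as an environment variable.
remote	InputDataInfo object	Actual data input. Options: dataset: the input is a dataset; obs: the input is an OBS path.
remote_constraint	Array of remote_constraint objects	Data input constraints.

**Table 75** InputDataInfo
Parameter	Type	Description
dataset	dataset object	The data input is a dataset.
obs	obs object	The data input or output uses OBS.

**Table 76** dataset
Parameter	Type	Description
id	String	Dataset ID of the training job.
version_id	String	Dataset version ID of the training job.
obs_url	String	OBS URL of the dataset required by the training job, for example, "/usr/data/". ModelArts resolves it automatically from the dataset ID and dataset version ID.

**Table 77** obs
Parameter	Type	Description
obs_url	String	OBS URL of the dataset required by the training job, for example, "/usr/data/".

**Table 78** remote_constraint
Parameter	Type	Description
data_type	String	Data input type, either a data storage location or a dataset.
attributes	String	Attributes used when the data input is a dataset. Options: data_format: data format; data_segmentation: data segmentation mode; dataset_type: labeling type.

**Table 79** Output
Parameter	Type	Description
name	String	Name of the data output channel.
description	String	Description of the data output channel.
local_dir	String	Local container path mapped to the data output channel.
access_method	String	How the data output channel path (local_dir) is delivered. If left empty, it is delivered as a hyperparameter by default. parameter: as a hyperparameter; env: as an environment variable.
remote	Remote object	Actual data output.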

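**Table 80** JobEngine
Parameter	Type	Description
engine_id	String	ID of the engine flavor selected for the training job. Use one of engine_id, engine_name+engine_version, or image_url.
engine_name	String	Name of the engine selected for the training job. Not required if engine_id is set. When a job is created with a preset framework plus a custom image, both this parameter and image_url must be passed.
engine_version	String	Version name of the engine selected for the training job. Not required if engine_id is set.
image_url	String	URL of the custom image selected for the training job, obtained from SWR. Format: organization name/image name:version.
install_sys_packages	Boolean	Whether to install the MoXing version specified by the training platform. true means yes. This setting is supported only when engine_name, engine_version, and image_url are all set.

**Table 81** Summary
Parameter	Type	Description
log_type	String	Visualization log type of the training job. Once configured, the training job can serve as a data source for visualization jobs. Options: "tensorboard", "mindstudio-insight".
log_dir	LogDir object	Visualization log output of the training job. Mandatory when log_type is not empty.
data_sources	Array of DataSource objects	Visualization log input for a visualization job or for a training job in debug mode. Mandatory when the advanced feature "tensorboard/enable": "true" or "mindstudio-insight/enable": "true" is enabled for the training job.

**Table 82** LogDir
Parameter	Type	Description
pfs	PFSSummary object	Output to an OBS parallel file system.

**Table 83** PFSSummary
Parameter	Type	Description
pfs_path	String	URL of the OBS parallel file system path.

**Table 84** DataSource
Parameter	Type	Description
job	JobSummary object	Job data source.

**Table 85** JobSummary
Parameter	Type	Description
job_id	String	Training job ID.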

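**Table 86** TaskResponse
Parameter	Type	Description
role	String	Task role. This function is not supported yet.
algorithm	TaskResponseAlgorithm object	Algorithm configuration from algorithm management.
task_resource	FlavorResponse object	Flavor information of the training job and algorithm.

**Table 87** TaskResponseAlgorithm
Parameter	Type	Description
code_dir	String	Absolute path of the directory that contains the algorithm boot file.
boot_file	String	Absolute path of the algorithm boot file.
inputs	AlgorithmInput object	Input channel information of the algorithm.
outputs	AlgorithmOutput object	Output channel information of the algorithm.
engine	AlgorithmEngine object	Engine that the heterogeneous job depends on.
local_code_dir	String	Local path in the training container to which the algorithm code directory is downloaded. Rules: it must be a directory under /home; this field does not take effect in v1 compatibility mode; this field does not take effect when code_dir is prefixed with file://.
working_dir	String	Working directory used when the algorithm runs. Rule: this field does not take effect in v1 compatibility mode.

**Table 88** AlgorithmInput
Parameter	Type	Description
name	String	Name of the data input channel.
local_dir	String	Local container path mapped to the data input/output channel.
remote	AlgorithmRemote object	Actual data input. Heterogeneous jobs support only OBS.

**Table 89** AlgorithmRemote
Parameter	Type	Description
obs	RemoteObs object	The data input or output uses OBS.

**Table 90** AlgorithmOutput
Parameter	Type	Description
name	String	Name of the data output channel.
local_dir	String	Local container path mapped to the data output channel.
remote	Remote object	Actual data output.
mode	String	Data transfer mode. Default: "upload_periodically".
period	String	Data transfer period. Default: 30s.

**Table 91** Remote
Parameter	Type	Description
obs	RemoteObs object	Data is output to OBS.

**Table 92** RemoteObs
Parameter	Type	Description
obs_url	String	OBS path to which data is output.

**Table 93** AlgorithmEngine
Parameter	Type	Description
engine_id	String	ID of the engine flavor, for example, "caffe-1.0.0-python2.7".
engine_name	String	Name of the engine flavor, for example, "Caffe".
engine_version	String	Version of the engine flavor. One engine name can have multiple versions, for example, "Caffe-1.0.0-python2.7", which uses Python 2.7.
v1_compatible	Boolean	Whether v1 compatibility mode is used.
run_user	String	UID of the default startup user of the engine.
image_url	String	URL of the custom image selected for the algorithm.

**Table 94** FlavorResponse
Parameter	Type	Description
flavor_id	String	ID of the resource flavor.
flavor_name	String	Name of the resource flavor.
max_num	Integer	Maximum number of nodes for the resource flavor.
flavor_type	String	Type of the resource flavor. Options: CPU, GPU, Ascend.
billing	BillingInfo object	Billing information of the resource flavor.
flavor_info	FlavorInfoResponse object	Details of the resource flavor.
attributes	Map<String,String>	Other flavor attributes.

**Table 95** FlavorInfoResponse
Parameter	Type	Description
max_num	Integer	Maximum number of nodes that can be selected. A value of 1 means distributed training is not supported.
cpu	Cpu object	CPU specifications.
gpu	Gpu object	GPU specifications.
npu	Npu object	Ascend specifications.
memory	Memory object	Memory information.
disk	DiskResponse object	Disk information.

**Table 96** DiskResponse
Parameter	Type	Description
size	Integer	Disk size.
unit	String	Unit of the disk size.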

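**Table 97** SpecResponce
Parameter	Type	Description
resource	Resource object	Resource flavor information of the training job. Use either flavor_id or pool_id+[flavor_id].
volumes	Array of JobVolume objects	Volumes mounted to the training job.
log_export_path	LogExportPath object	Log output information of the training job.
schedule_policy	SchedulePolicy object	Scheduling policy of the training job.
custom_metrics	Array of CustomMetrics objects	Metric collection configuration.

**Table 98** Resource
Parameter	Type	Description
policy	String	Resource flavor mode of the training job. The option is regular, which indicates standard mode.
flavor_id	String	Resource flavor ID of the training job. Dedicated resource pools with CPU flavors do not support flavor_id. Options for dedicated resource pools with GPU/Ascend flavors: modelarts.pool.visual.xlarge (1 card); modelarts.pool.visual.2xlarge (2 cards); modelarts.pool.visual.4xlarge (4 cards); modelarts.pool.visual.8xlarge (8 cards).
flavor_name	String	Read-only flavor name returned by ModelArts when flavor_id is used.
node_count	Integer	Number of resource replicas selected for the training job.
pool_id	String	ID of the resource pool selected for the training job.
flavor_detail	FlavorDetail object	Flavor information of the training job and algorithm. This field is present only for public resource pools.
main_container_allocated_resources	MainContainerAllocatedResources object	Resources actually allocated to the training container of the training job.

**Table 99** FlavorDetail
Parameter	Type	Description
flavor_type	String	Type of the resource flavor. Options: CPU, GPU, Ascend.
billing	BillingInfo object	Billing information of the resource flavor.
flavor_info	FlavorInfo object	Details of the resource flavor.

**Table 100** BillingInfo
Parameter	Type	Description
code	String	Billing code.
unit_num	Integer	Number of billing units.

**Table 101** FlavorInfo
Parameter	Type	Description
max_num	Integer	Maximum number of nodes that can be selected. A value of 1 means distributed training is not supported.
cpu	Cpu object	CPU specifications.
gpu	Gpu object	GPU specifications.
npu	Npu object	Ascend specifications.
memory	Memory object	Memory information.
disk	Disk object	Disk information.

**Table 102** Cpu
Parameter	Type	Description
arch	String	CPU architecture.
core_num	Integer	Number of cores.

**Table 103** Gpu
Parameter	Type	Description
unit_num	Integer	Number of GPU cards.
product_name	String	Product name.
memory	String	Memory.

**Table 104** Npu
Parameter	Type	Description
unit_num	String	Number of NPU cards.
product_name	String	Product name.
memory	String	Memory.

**Table 105** Memory
Parameter	Type	Description
size	Integer	Memory size.
unit	String	Unit of the memory size.

**Table 106** Disk
Parameter	Type	Description
size	String	Disk size.
unit	String	Unit of the disk size, generally GB.

**Table 107** MainContainerAllocatedResources
Parameter	Type	Description
cpu_arch	String	CPU architecture.
cpu_core_num	Float	Number of cores.
mem_size	Float	Memory size.
accelerator_num	Float	Number of accelerator cards.
accelerator_type	String	Accelerator card type.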

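**Table 108** JobVolume
Parameter	Type	Description
nfs	Nfs object	Volume mounted through NFS.

**Table 109** Nfs
Parameter	Type	Description
nfs_server_path	String	NFS server path, for example, "10.10.10.10:/example/path".
local_path	String	Path mounted in the training container, for example, "/example/path".
read_only	Boolean	Whether the NFS volume is read-only in the container.

**Table 110** LogExportPath
Parameter	Type	Description
obs_url	String	OBS path where training job logs are stored, for example, "obs://example/path".
host_path	String	Host path where training job logs are stored, for example, "/example/path".

**Table 111** SchedulePolicy
Parameter	Type	Description
required_affinity	RequiredAffinity object	Affinity requirements of the training job.
priority	Integer	Priority of the training job. Constraint: priority can be set only when training uses a dedicated resource pool. The value ranges from 1 to 3; the default is 1 and the highest is 3. Users with default permissions can select priority 1 or 2; users granted the "set job to high priority" permission can select 1 to 3.
preemptible	Boolean	Whether the job can be preempted.

**Table 112** RequiredAffinity
Parameter	Type	Description
affinity_type	String	Affinity scheduling policy. Options: cabinet: strict whole-cabinet scheduling; hyperinstance: supernode affinity scheduling.
affinity_group_size	Integer	Affinity group size. Mandatory when affinity_type is hyperinstance. The system schedules affinity_group_size tasks into one supernode to form an affinity group. If a job is submitted to a supernode resource pool without an affinity group size, the system sets it to 1 by default.

**Table 113** CustomMetrics
Parameter	Type	Description
exec	Exec object	Metric collection using a command line.
http_get	HttpGet object	Metric collection over HTTP.

**Table 114** Exec
Parameter	Type	Description
command	Array of strings	Command used to collect metrics.

**Table 115** HttpGet
Parameter	Type	Description
path	String	URL path for obtaining metrics over HTTP. Must be set together with port, or both left unset.
port	Integer	Port for obtaining metrics over HTTP. Must be set together with path, or both left unset.

**Table 116** JobEndpointsResp
Parameter	Type	Description
ssh	SSHResp object	SSH connection information.
jupyter_lab	JupyterLab object	JupyterLab connection information.
tensorboard	Tensorboard object	TensorBoard connection information.
mindstudio_insight	MindStudioInsight object	MindStudio Insight connection information.

**Table 117** SSHResp
Parameter	Type	Description
key_pair_names	Array of strings	SSH key pair names. Key pairs can be created and viewed on the "Key Pair" page of the Elastic Cloud Server (ECS) console.
task_urls	Array of TaskUrls objects	SSH connection addresses.

**Table 118** TaskUrls
Parameter	Type	Description
task	String	Task ID of the training job.
url	String	SSH connection address of the training job.

**Table 119** JupyterLab
Parameter	Type	Description
url	String	JupyterLab URL of the training job.
token	String	JupyterLab token of the training job.

**Table 120** Tensorboard
Parameter	Type	Description
url	String	TensorBoard URL of the training job.
token	String	TensorBoard token of the training job.

**Table 121** MindStudioInsight
Parameter	Type	Description
url	String	MindStudio Insight URL of the training job.
token	String	MindStudio Insight token of the training job.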

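Status code: 400

**Table 122** Response body parameters
Parameter	Type	Description
error_msg	String	Error message.
error_code	String	Error code.
error_solution	String	Suggested solution.

Example Requests

Create a training job that uses a free flavor. The job name is "TestModelArtsJob" and the description is "This is a ModelArts job". The job depends on the algorithm whose ID is 3f5d6706-7b67-408d-8ba0-ec08048c45ed, which defines no inputs or outputs. A free GPU flavor is used.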

POST https://endpoint/v2/{project_id}/training-jobs

{
  "kind" : "job",
  "metadata" : {
    "id" : "425b7087-83de-49ed-9e40-5bb642be956f",
    "name" : "TestModelArtsJob",
    "description" : "This is a ModelArts job",
    "create_time" : 1637045545982,
    "workspace_id" : "0",
    "user_name" : ""
  },
  "algorithm" : {
    "id" : "3f5d6706-7b67-408d-8ba0-ec08048c45ed",
    "name" : "ttt-obs-gpu",
    "code_dir" : "/cn-north-4-rse/test/moxingtest-code/",
    "boot_file" : "/cn-north-4-rse/test/moxingtest-code/test_obs_gpu.py",
    "parameters" : [ {
      "name" : "input_dir",
      "description" : "",
      "i18n_description" : null,
      "value" : "s://cn-north-4-rse/test/moxingtest-dir/",
      "constraint" : {
        "type" : "String",
        "editable" : true,
        "required" : true,
        "sensitive" : false,
        "valid_type" : "None",
        "valid_range" : [ ]
      }
    }, {
      "name" : "input_file",
      "description" : "",
      "i18n_description" : null,
      "value" : "obs://cn-north-4-rse/test/moxingtest/",
      "constraint" : {
        "type" : "String",
        "editable" : true,
        "required" : true,
        "sensitive" : false,
        "valid_type" : "None",
        "valid_range" : [ ]
      }
    }, {
      "name" : "large_file_method",
      "description" : "",
      "i18n_description" : null,
      "value" : "1",
      "constraint" : {
        "type" : "Integer",
        "editable" : true,
        "required" : true,
        "sensitive" : false,
        "valid_type" : "None",
        "valid_range" : [ ]
      }
    } ],
    "engine" : {
      "engine_id" : "horovod-cp36-tf-1.16.2",
      "engine_name" : "Horovod",
      "engine_version" : "0.16.2-TF-1.13.1-python3.6"
    },
    "policies" : { }
  },
  "spec" : {
    "resource" : {
      "flavor_id" : "modelarts.p3.large.public.free",
      "node_count" : 1
    },
    "log_export_path" : { },
    "custom_metrics" : [ {
      "http_get" : {
        "path" : "/raw_text",
        "port" : 10001
      }
    } ]
  }
}

Create a training job that uses a custom image. The job name is "TestModelArtsJob2" and the description is "This is a ModelArts job2". A dedicated resource pool and an NFS mount are used.

POST https://endpoint/v2/{project_id}/training-jobs

{
  "kind" : "job",
  "metadata" : {
    "name" : "TestModelArtsJob2",
    "description" : "This is a ModelArts job2"
  },
  "algorithm" : {
    "engine" : {
      "image_url" : "xxxxxxxx/fastseq:1.2"
    },
    "command" : "cd /home/ma-user/ddp_demo && sh run_ddp.sh",
    "parameters" : [ ],
    "policies" : {
      "auto_search" : null
    },
    "environments" : {
      "NCCL_DEBUG" : "INFO",
      "NCCL_IB_DISABLE" : "0"
    }
  },
  "spec" : {
    "resource" : {
      "flavor_id" : "modelarts.pool.visual.xlarge",
      "node_count" : 1,
      "pool_id" : "poolfaf38d76"
    },
    "log_export_path" : {
      "obs_url" : "/cn-north-4-training-test/limou/ddp-demo-log/"
    },
    "volumes" : [ {
      "nfs" : {
        "nfs_server_path" : "192.168.0.82:/",
        "local_path" : "/home/ma-user/nfs/",
        "read_only" : false
      }
    } ]
  }
}
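The requests above can be assembled with the Python standard library. The sketch below only builds the request object and does not send it. The endpoint value and the X-Auth-Token header are assumptions for illustration (this page does not describe authentication), and build_create_job_request is a hypothetical helper name.

```python
import json
import urllib.request


def build_create_job_request(endpoint, project_id, token, body):
    """Build (but do not send) a POST /v2/{project_id}/training-jobs request.

    endpoint: host name of the ModelArts endpoint (assumed placeholder).
    token:    credential sent in the X-Auth-Token header (assumed auth scheme).
    body:     request body dict following Table 2.
    """
    url = f"https://{endpoint}/v2/{project_id}/training-jobs"
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "X-Auth-Token": token},
        method="POST",
    )


# Minimal body mirroring the custom image example above.
example_body = {
    "kind": "job",
    "metadata": {"name": "TestModelArtsJob2", "description": "This is a ModelArts job2"},
    "algorithm": {
        "engine": {"image_url": "xxxxxxxx/fastseq:1.2"},
        "command": "cd /home/ma-user/ddp_demo && sh run_ddp.sh",
    },
    "spec": {
        "resource": {
            "flavor_id": "modelarts.pool.visual.xlarge",
            "node_count": 1,
            "pool_id": "poolfaf38d76",
        }
    },
}
```

Passing the returned object to urllib.request.urlopen would send it; a successful call is expected to return status code 201 with the body described under Response Parameters.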

Example Responses

Status code: 201

{
  "kind" : "job",
  "metadata" : {
    "id" : "425b7087-83de-49ed-9e40-5bb642be956f",
    "name" : "TestModelArtsJob",
    "description" : "This is a ModelArts job",
    "create_time" : 1637045545982,
    "workspace_id" : "0",
    "user_name" : ""
  },
  "status" : {
    "phase" : "Creating",
    "secondary_phase" : "Creating",
    "duration" : 0,
    "start_time" : 0,
    "node_count_metrics" : null,
    "tasks" : [ "worker-0", "server-0" ]
  },
  "algorithm" : {
    "id" : "3f5d6706-7b67-408d-8ba0-ec08048c45ed",
    "name" : "ttt-obs-gpu",
    "code_dir" : "/cn-north-4-rse/test/moxingtest-code/",
    "boot_file" : "/cn-north-4-rse/test/moxingtest-code/test_obs_gpu.py",
    "parameters" : [ {
      "name" : "input_dir",
      "description" : "",
      "i18n_description" : null,
      "value" : "s://cn-north-4-rse/test/moxingtest-dir/",
      "constraint" : {
        "type" : "String",
        "editable" : true,
        "required" : true,
        "sensitive" : false,
        "valid_type" : "None",
        "valid_range" : [ ]
      }
    }, {
      "name" : "input_file",
      "description" : "",
      "i18n_description" : null,
      "value" : "obs://cn-north-4-rse/test/moxingtest/",
      "constraint" : {
        "type" : "String",
        "editable" : true,
        "required" : true,
        "sensitive" : false,
        "valid_type" : "None",
        "valid_range" : [ ]
      }
    }, {
      "name" : "large_file_method",
      "description" : "",
      "i18n_description" : null,
      "value" : "1",
      "constraint" : {
        "type" : "Integer",
        "editable" : true,
        "required" : true,
        "sensitive" : false,
        "valid_type" : "None",
        "valid_range" : [ ]
      }
    } ],
    "engine" : {
      "engine_id" : "horovod-cp36-tf-1.16.2",
      "engine_name" : "Horovod",
      "engine_version" : "0.16.2-TF-1.13.1-python3.6"
    },
    "policies" : { }
  },
  "spec" : {
    "resource" : {
      "policy" : "regular",
      "flavor_id" : "modelarts.p3.large.public.free",
      "flavor_name" : "Computing GPU(Vnt1) instance",
      "node_count" : 1,
      "flavor_detail" : {
        "flavor_type" : "GPU",
        "billing" : {
          "code" : "modelarts.vm.gpu.free",
          "unit_num" : 1
        },
        "flavor_info" : {
          "cpu" : {
            "arch" : "x86",
            "core_num" : 8
          },
          "gpu" : {
            "unit_num" : 1,
            "product_name" : "GP-Vnt1",
            "memory" : "32GB"
          },
          "memory" : {
            "size" : 64,
            "unit" : "GB"
          }
        }
      },
      "main_container_allocated_resources" : {
        "cpu_arch" : "x86",
        "cpu_core_num" : 5,
        "mem_size" : 44,
        "accelerator_num" : 1,
        "accelerator_type" : "nvidia-v100-pcie32"
      }
    },
    "log_export_path" : { },
    "custom_metrics" : [ {
      "exec" : {
        "command" : [ "cat", "/a/b/c.porm" ]
      }
    }, {
      "http_get" : {
        "path" : "/raw_text",
        "port" : 10001
      }
    } ]
  }
}

Status code: 400

Common error response body format. The following is returned when the algorithm whose ID is 3f5d6706-7b67-408d-8ba0-ec08048c45ee is not found.

{
  "error_msg" : "algorithm not found.",
  "error_code" : "ModelArts.2755",
  "error_solution" : "Check whether the training project information in the request is valid."
}
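A client might turn this error body into an exception so that error_code and error_solution remain available to callers. A minimal sketch follows; the class and function names are hypothetical.

```python
import json


class TrainingJobApiError(Exception):
    """Carries the three fields of the common error response body."""

    def __init__(self, status, code, message, solution):
        super().__init__(f"[{status}] {code}: {message}")
        self.status = status
        self.code = code
        self.message = message
        self.solution = solution


def parse_error(status, body_text):
    """Parse an error response body (JSON text) into a TrainingJobApiError."""
    body = json.loads(body_text)
    return TrainingJobApiError(
        status,
        body.get("error_code"),
        body.get("error_msg"),
        body.get("error_solution"),
    )
```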

Status Codes

Status Code	Description
201	OK
400	Common error response body format, returned for example when the specified algorithm is not found.

Error Codes

See Error Codes.
