Updated: 2026-05-07 GMT+08:00

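Creating a Training Job from a Custom Image

This section walks through a sequence of API calls, using model training as the example, to show how the ModelArts APIs are used together.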

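Overview

The workflow for creating a training job with the PyTorch framework is as follows:

  1. Call the authentication API to obtain a user Token. Subsequent requests must carry this Token in the request header for authentication.
  2. Call the create-training-job API to create a training job from the custom image, and record the training job ID.
  3. Call the query-training-job-details API with the ID returned for the new training job to check the job status.
  4. Call the API for querying the logs (OBS link) of a specified task of the training job to obtain the OBS path of the job logs.
  5. Call the API for querying the running metrics of a specified task of the training job to view the job's running metrics.
  6. When the training job has finished or is no longer needed, call the delete-training-job API to delete it.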
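
The six calls above can be collected in one place as HTTP methods and URL templates, copied from the procedure in this section. This is only an illustrative sketch: the step names and the `build_url` helper are hypothetical, and the endpoint and ID values in any call are placeholders.

```python
# HTTP method and URL template for each step, as listed in this section.
_JOBS = "https://{ma_endpoint}/v2/{project_id}/training-jobs"

STEPS = [
    ("get_token", "POST", "https://{iam_endpoint}/v3/auth/tokens"),
    ("create_job", "POST", _JOBS),
    ("get_job", "GET", _JOBS + "/{training_job_id}"),
    ("get_log_url", "GET", _JOBS + "/{training_job_id}/tasks/{task_id}/logs/url"),
    ("get_metrics", "GET", _JOBS + "/{training_job_id}/metrics/{task_id}"),
    ("delete_job", "DELETE", _JOBS + "/{training_job_id}"),
]


def build_url(step_name, **values):
    """Return (method, url) for a step, filling the {placeholders}."""
    for name, method, template in STEPS:
        if name == step_name:
            return method, template.format(**values)
    raise KeyError(step_name)
```

Keeping the templates in a single table makes it easy to see that steps 3 to 6 all build on the job ID returned by step 2.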

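Prerequisites

  • The PyTorch training code is ready. For example, the boot file “test-pytorch.py” is stored in the OBS directory “obs://cnnorth4-job-test-v2/pytorch/fast_example/code/cpu”.
  • The log output location for the training job has been created, for example “obs://cnnorth4-job-test-v2/pytorch/fast_example/log”.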
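
The prerequisites write OBS locations as obs:// URLs, whereas the request body in step 2 writes the same locations in the form /bucket/path/ with a trailing slash. A small helper can convert one into the other. The name `obs_url_to_api_path` is hypothetical, and the conversion rule is inferred only from the examples in this section, not from a documented specification.

```python
def obs_url_to_api_path(obs_url: str) -> str:
    """Convert 'obs://bucket/dir' into '/bucket/dir/' (rule inferred from this section)."""
    prefix = "obs://"
    if not obs_url.startswith(prefix):
        raise ValueError(f"not an obs:// URL: {obs_url!r}")
    # Drop the scheme, normalise surrounding slashes, then wrap in single slashes.
    path = obs_url[len(prefix):].strip("/")
    if not path:
        raise ValueError("the OBS URL has no bucket name")
    return "/" + path + "/"
```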

Procedure

  1. Call the authentication API to obtain the user's Token.
    1. Request:

      URI format: POST https://{iam_endpoint}/v3/auth/tokens

      Request header: Content-Type →application/json

      Request body:
      {
        "auth": {
          "identity": {
            "methods": ["password"],
            "password": {
              "user": {
                "name": "user_name", 
                "password": "user_password",
                "domain": {
                  "name": "domain_name"  
                }
              }
            }
          },
          "scope": {
            "project": {
              "name": "cn-north-1"  
            }
          }
        }
      }
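      Replace the following values with your actual values:
      • iam_endpoint: the IAM endpoint.
      • user_name: the IAM user name.
      • user_password: the user's login password.
      • domain_name: the name of the account that the user belongs to.
      • cn-north-1: the project name, which identifies the region where the service is deployed.
    2. Status code “201 Created” is returned. The value of “X-Subject-Token” in the response header is the Token, for example: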
      x-subject-token →MIIZmgYJKoZIhvcNAQcCoIIZizCCGYcCAQExDTALBglghkgBZQMEAgEwgXXXXXX...
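
As a minimal sketch of step 1, the token request can be assembled with Python's standard library. The function names `build_token_request` and `extract_token` are hypothetical, and all argument values are placeholders. The request is only built here, not sent.

```python
import json
import urllib.request


def build_token_request(iam_endpoint, user_name, user_password,
                        domain_name, project_name):
    """Build the POST /v3/auth/tokens request described in step 1."""
    body = {
        "auth": {
            "identity": {
                "methods": ["password"],
                "password": {
                    "user": {
                        "name": user_name,
                        "password": user_password,
                        "domain": {"name": domain_name},
                    }
                },
            },
            "scope": {"project": {"name": project_name}},
        }
    }
    return urllib.request.Request(
        url=f"https://{iam_endpoint}/v3/auth/tokens",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_token(response_headers):
    """Return the X-Subject-Token value; header names are case-insensitive."""
    for name, value in response_headers.items():
        if name.lower() == "x-subject-token":
            return value
    raise KeyError("X-Subject-Token")
```

To send the request, pass it to `urllib.request.urlopen` and hand the response's `headers` attribute to `extract_token`.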
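  2. Call the create-training-job API to create a training job from the custom image, and record the training job ID.
    1. Request:

      URI format: POST https://{ma_endpoint}/v2/{project_id}/training-jobs

      Request headers:

      • X-Auth-Token →MIIZmgYJKoZIhvcNAQcCoIIZizCCGYcCAQExDTALBglghkgBZQMEAgEwgXXXXXX...
      • Content-Type →application/json

      Replace the placeholders in braces and the Token value with your actual values.

      Request body: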

      {
          "kind": "job",
          "metadata": {
              "name": "test-pytorch-cpu01",
              "description": "test pytorch work cpu"
          },
          "algorithm": {
              "code_dir": "/cnnorth4-job-test-v2/pytorch/fast_example/code/cpu/",
              "local_code_dir": "/home/ma-user/modelarts/user-job-dir",
              "engine": {
                  "image_url": "atelier/pytorch_cuda:pytorch_2.7.0-cuda_12.8-py_3.11.10-ubuntu_22.04-x86_64-20251215163925-4e5422a"
              },
              "command": "python ${MA_JOB_DIR}/cpu/test-pytorch.py"
          },
          "spec": {
              "resource": {
                  "node_count": 1,
                  "pool_id": "pool-maostest-train-06024304be00d5092fbdc0013d201342"
              },
              "log_export_path": {
                  "obs_url": "/cnnorth4-job-test-v2/pytorch/fast_example/log/"
              }
          }
      }

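      Replace the following fields with your actual values:

      • “kind”: the type of the training job. The default is job.
      • “name” and “description” under “metadata”: the name and description of the training job.
      • “code_dir” and “local_code_dir” under “algorithm”: the code directory and the local directory inside the job to which the code is downloaded.
      • “image_url” under “algorithm” > “engine”: the address of the training job image.
      • “command” under “algorithm”: the startup command of the training job.
      • “pool_id” under “spec” > “resource”: the ID of the resource pool that the training job depends on.
      • “node_count” under “spec” > “resource”: the number of nodes, which determines whether training is multi-node (distributed). This single-node example uses the default value “1”.
      • “log_export_path” under “spec”: the OBS directory to which the job logs are uploaded.
    2. Status code “201 Created” is returned, indicating that the training job was created. The response body is as follows: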
      {
          "kind": "job",
          "metadata": {
              "id": "31318695-2011-4e48-9b90-9c9178c57951",
              "name": "test-pytorch-cpu01",
              "description": "test pytorch work cpu",
              "create_time": 1777545352008,
              "workspace_id": "0",
              "ai_project": "default-ai-project",
              "labels": {
                  "training-job": "modelarts-os"
              },
              "user_name": "",
              "annotations": {
                  "job_template": "Template DL",
                  "key_task": "worker"
              },
              "training_experiment_reference": {},
              "tags": []
          },
          "status": {
              "phase": "Pending",
              "secondary_phase": "Creating",
              "pending_time": 1000,
              "duration": 0,
              "is_hanged": false,
              "retry_count": 0,
              "start_time": 0,
              "node_count_metrics": null,
              "tasks": [
                  "worker-0"
              ],
              "metrics_statistics": {
                  "cpu_usage": {
                      "average": -1,
                      "max": -1,
                      "min": -1
                  },
                  "mem_usage": {
                      "average": -1,
                      "max": -1,
                      "min": -1
                  }
              }
          },
          "algorithm": {
              "code_dir": "/cnnorth4-job-test-v2/pytorch/fast_example/code/cpu/",
              "local_code_dir": "/home/ma-user/modelarts/user-job-dir",
              "command": "python ${MA_JOB_DIR}/cpu/test-pytorch.py",
              "engine": {
                  "engine_id": "",
                  "engine_name": "",
                  "engine_version": "",
                  "v1_compatible": false,
                  "image_url": "atelier/pytorch_cuda:pytorch_2.7.0-cuda_12.8-py_3.11.10-ubuntu_22.04-x86_64-20251215163925-4e5422a",
                  "non_swr_image": false,
                  "run_user": "",
                  "image_source": true,
                  "image_repo_id": "",
                  "image_id": ""
              }
          },
          "spec": {
              "resource": {
                  "pool_id": "pool-maostest-train-06024304be00d5092fbdc0013d201342",
                  "pool_resource_flavor": "",
                  "node_count": 1,
                  "pool_info": {
                      "cpu_arch": "x86",
                      "core_num": 5,
                      "mem_size": 22,
                      "cache_size": 0,
                      "accelerator": "",
                      "accelerator_num": 0,
                      "accelerator_type": "",
                      "accelerator_size": 0,
                      "variant": "",
                      "huge_pages": 0,
                      "x_parameter_plane": "",
                      "use_privileged": false,
                      "use_host_network": false,
                      "use_ib_network": false,
                      "project_id": "",
                      "pool_resource_flavor": "liumuqi-eni-test",
                      "pool_id": "pool-maostest-train-06024304be00d5092fbdc0013d201342",
                      "cluster_id": "",
                      "maos_pool": true,
                      "quota_id": "",
                      "maos_migrated": false,
                      "detect_all_in_int": false,
                      "pool_type": "",
                      "enable_cabinet": false,
                      "enable_memarts": false,
                      "enable_ems": false,
                      "empty_dir_size": 0
                  },
                  "main_container_allocated_resources": {
                      "cpu_arch": "x86",
                      "cpu_core_num": 4,
                      "mem_size": 20,
                      "accelerator_num": 0,
                      "accelerator_type": ""
                  }
              },
              "log_export_path": {
                  "obs_url": "/cnnorth4-job-test-v2/pytorch/fast_example/log/"
              },
              "is_hosted_log": true,
              "runtime_type": "production"
          },
          "ftjob_config": {
              "checkpoint_config": {
                  "save_checkpoints_max": 0,
                  "checkpoint_id": "",
                  "skipped_steps": 0,
                  "restore_training": 0
              },
              "task_env": {
                  "envs": null
              }
          }
      }
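      • Record the value of “id” under “metadata” (the training job ID) for use in later steps.
      • “phase” and “secondary_phase” under “status” give the job's primary status and its finer-grained secondary status. In this example, “Creating” means the training job is being created.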
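
Step 2 can be sketched in Python as follows, assuming only the request fields shown in this section. The function names are hypothetical and the argument values are placeholders. The request is built but not sent.

```python
import json
import urllib.request


def build_create_job_request(ma_endpoint, project_id, token, *, name,
                             image_url, command, code_dir, pool_id,
                             log_dir, description="", node_count=1):
    """Build the POST /v2/{project_id}/training-jobs request from step 2."""
    body = {
        "kind": "job",
        "metadata": {"name": name, "description": description},
        "algorithm": {
            "code_dir": code_dir,
            # Local directory used by the example request in this section.
            "local_code_dir": "/home/ma-user/modelarts/user-job-dir",
            "engine": {"image_url": image_url},
            "command": command,
        },
        "spec": {
            "resource": {"node_count": node_count, "pool_id": pool_id},
            "log_export_path": {"obs_url": log_dir},
        },
    }
    return urllib.request.Request(
        url=f"https://{ma_endpoint}/v2/{project_id}/training-jobs",
        data=json.dumps(body).encode("utf-8"),
        headers={"X-Auth-Token": token, "Content-Type": "application/json"},
        method="POST",
    )


def job_id_from_response(response_body: str) -> str:
    """Pull metadata.id (the training job ID) out of the 201 response body."""
    return json.loads(response_body)["metadata"]["id"]
```

Returning the job ID from a dedicated function keeps the one value that every later step needs easy to find.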
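  3. Call the query-training-job-details API with the ID returned for the new training job to check the job status.
    1. Request:

      URI format: GET https://{ma_endpoint}/v2/{project_id}/training-jobs/{training_job_id}

      Request header: X-Auth-Token →MIIZmgYJKoZIhvcNAQcCoIIZizCCGYcCAQExDTALBglghkgBZQMEAgEwgXXXXXX...

      Replace the following value with your actual value:

      “training_job_id”: the training job ID recorded in step 2.

    2. Status code “200 OK” is returned. The response body is as follows: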
      {
          "kind": "job",
          "metadata": {
              "id": "31318695-2011-4e48-9b90-9c9178c57951",
              "name": "test-pytorch-cpu01",
              "description": "test pytorch work cpu",
              "create_time": 1777545352008,
              "workspace_id": "0",
              "ai_project": "default-ai-project",
              "labels": {
                  "training-job": "modelarts-os"
              },
              "user_name": "modelarts_xxx",
              "annotations": {
                  "job_template": "Template DL",
                  "key_task": "worker"
              },
              "training_experiment_reference": {},
              "tags": []
          },
          "status": {
              "phase": "Running",
              "secondary_phase": "Running",
              "pending_time": 68992,
              "duration": 4000,
              "is_hanged": false,
              "retry_count": 0,
              "task_ips": [
                  {
                      "task": "worker-0",
                      "ip": "172.16.0.31",
                      "host_ip": "192.168.140.98",
                      "schedule_count": 1
                  }
              ],
              "start_time": 1777545421000,
              "node_count_metrics": [
                  [
                      1777545411000,
                      0
                  ],
                  [
                      1777545420000,
                      0
                  ],
                  [
                      1777545421000,
                      1
                  ],
                  [
                      1777545424000,
                      1
                  ],
                  [
                      1777545425000,
                      1
                  ]
              ],
              "tasks": [
                  "worker-0"
              ],
              "metrics_statistics": {
                  "cpu_usage": {
                      "average": -1,
                      "max": -1,
                      "min": -1
                  },
                  "mem_usage": {
                      "average": -1,
                      "max": -1,
                      "min": -1
                  }
              },
              "running_records": [
                  {
                      "start_at": 1777545424,
                      "start_type": "init_or_rescheduled"
                  }
              ]
          },
          "algorithm": {
              "code_dir": "/cnnorth4-job-test-v2/pytorch/fast_example/code/cpu/",
              "local_code_dir": "/home/ma-user/modelarts/user-job-dir",
              "command": "python ${MA_JOB_DIR}/cpu/test-pytorch.py",
              "engine": {
                  "engine_id": "",
                  "engine_name": "",
                  "engine_version": "",
                  "v1_compatible": false,
                  "image_url": "atelier/pytorch_cuda:pytorch_2.7.0-cuda_12.8-py_3.11.10-ubuntu_22.04-x86_64-20251215163925-4e5422a",
                  "non_swr_image": false,
                  "run_user": "",
                  "image_source": true,
                  "image_repo_id": "",
                  "image_id": ""
              }
          },
          "spec": {
              "resource": {
                  "pool_id": "pool-maostest-train-06024304be00d5092fbdc0013d201342",
                  "pool_resource_flavor": "",
                  "node_count": 1,
                  "pool_info": {
                      "cpu_arch": "x86",
                      "core_num": 5,
                      "mem_size": 22,
                      "cache_size": 0,
                      "accelerator": "",
                      "accelerator_num": 0,
                      "accelerator_type": "",
                      "accelerator_size": 0,
                      "variant": "",
                      "huge_pages": 0,
                      "x_parameter_plane": "",
                      "use_privileged": false,
                      "use_host_network": false,
                      "use_ib_network": false,
                      "project_id": "",
                      "pool_resource_flavor": "liumuqi-eni-test",
                      "pool_id": "pool-maostest-train-06024304be00d5092fbdc0013d201342",
                      "cluster_id": "",
                      "maos_pool": true,
                      "quota_id": "",
                      "maos_migrated": false,
                      "detect_all_in_int": false,
                      "pool_type": "",
                      "enable_cabinet": false,
                      "enable_memarts": false,
                      "enable_ems": false,
                      "empty_dir_size": 0
                  },
                  "main_container_allocated_resources": {
                      "cpu_arch": "x86",
                      "cpu_core_num": 4,
                      "mem_size": 20,
                      "accelerator_num": 0,
                      "accelerator_type": ""
                  }
              },
              "log_export_path": {
                  "obs_url": "/cnnorth4-job-test-v2/pytorch/fast_example/log/"
              },
              "is_hosted_log": true,
              "runtime_type": "production"
          },
          "ftjob_config": {
              "checkpoint_config": {
                  "save_checkpoints_max": 0,
                  "checkpoint_id": "",
                  "skipped_steps": 0,
                  "restore_training": 0
              },
              "task_env": {
                  "envs": null
              }
          }
      }

      The response shows the details of the training job. A “phase” of “Running” under “status” means the training job is running.
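
In practice step 3 is usually repeated until the job leaves the “Pending” phase. The sketch below shows one way to poll, assuming a caller-supplied `fetch_job` function (hypothetical) that performs the GET request and returns the parsed response body. Only the “Pending” and “Running” phases appear in this section, so the loop simply waits for any phase other than “Pending”.

```python
import time


def read_phase(job: dict):
    """Return (phase, secondary_phase) from a job-details response."""
    status = job["status"]
    return status["phase"], status["secondary_phase"]


def wait_while_pending(fetch_job, interval_seconds=10.0, max_polls=60,
                       sleep=time.sleep):
    """Call fetch_job() repeatedly until status.phase is no longer 'Pending'."""
    for _ in range(max_polls):
        phase, secondary_phase = read_phase(fetch_job())
        if phase != "Pending":
            return phase, secondary_phase
        sleep(interval_seconds)
    raise TimeoutError("job is still Pending after the polling limit")
```

The `sleep` parameter exists so that the wait can be replaced in tests; the interval and poll limit are arbitrary defaults, not values recommended by the service.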

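  4. Call the API for querying the logs (OBS link) of a specified task of the training job to obtain the OBS path of the job logs.
    1. Request:

      URI format: GET https://{ma_endpoint}/v2/{project_id}/training-jobs/{training_job_id}/tasks/{task_id}/logs/url

      Request headers:

      • X-Auth-Token →MIIZmgYJKoZIhvcNAQcCoIIZizCCGYcCAQExDTALBglghkgBZQMEAgEwgXXXXXX...
      • Content-Type →text/plain

      Replace the following values with your actual values:

      • “task_id”: the task name of the training job, typically worker-0.
      • Content-Type: selects the kind of link that is returned. text/plain returns a temporary OBS preview link; application/octet-stream returns a temporary OBS download link.
    2. Status code “200 OK” is returned. The response body is as follows: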
      {
          "obs_url": "https://modelarts-training-log-cn-north-4.obs.cn-north-4.myhuaweicloud.com:443/66ff6991-fd66-40b6-8101-0829a46d3731/worker-0/modelarts-job-66ff6991-fd66-40b6-8101-0829a46d3731-worker-0.log?AWSAccessKeyId=GFGTBKOZENDD83QEMZMV&Expires=1641896599&Signature=BedFZHEU1oCmqlI912UL9mXlhkg%3D"
      }

      The returned field is the OBS URL of the log. Paste it into a browser to view the log.
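
The choice between a preview link and a download link in step 4 is made through the Content-Type request header. A minimal sketch of building that request follows; the function name and the "preview"/"download" labels are hypothetical, and the request is not sent.

```python
import urllib.request

# Content-Type values described in step 4 and the kind of link each returns.
LINK_CONTENT_TYPES = {
    "preview": "text/plain",
    "download": "application/octet-stream",
}


def build_log_url_request(ma_endpoint, project_id, training_job_id, token,
                          task_id="worker-0", link_kind="preview"):
    """Build the GET .../tasks/{task_id}/logs/url request from step 4."""
    if link_kind not in LINK_CONTENT_TYPES:
        raise ValueError(f"link_kind must be one of {sorted(LINK_CONTENT_TYPES)}")
    url = (f"https://{ma_endpoint}/v2/{project_id}/training-jobs/"
           f"{training_job_id}/tasks/{task_id}/logs/url")
    headers = {
        "X-Auth-Token": token,
        "Content-Type": LINK_CONTENT_TYPES[link_kind],
    }
    return urllib.request.Request(url=url, headers=headers, method="GET")
```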

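  5. Call the API for querying the running metrics of a specified task of the training job to view the job's running metrics.
    1. Request:

      URI format: GET https://{ma_endpoint}/v2/{project_id}/training-jobs/{training_job_id}/metrics/{task_id}

      Request header: X-Auth-Token →MIIZmgYJKoZIhvcNAQcCoIIZizCCGYcCAQExDTALBglghkgBZQMEAgEwgXXXXXX...

      Replace the placeholders in braces and the Token value with your actual values.

    2. Status code “200 OK” is returned. The response body is as follows: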
      {
          "metrics": [
              {
                  "metric": "cpuUsage",
                  "value": [
                      -1,
                      -1,
                      28.622,
                      35.053,
                      39.988,
                      40.069,
                      40.082,
                      40.094
                  ]
              },
              {
                  "metric": "memUsage",
                  "value": [
                      -1,
                      -1,
                      0.544,
                      0.641,
                      0.736,
                      0.737,
                      0.738,
                      0.739
                  ]
              },
              {
                  "metric": "npuUtil",
                  "value": [
                      -1,
                      -1,
                      -1,
                      -1,
                      -1,
                      -1,
                      -1,
                      -1
                  ]
              },
              {
                  "metric": "npuMemUsage",
                  "value": [
                      -1,
                      -1,
                      -1,
                      -1,
                      -1,
                      -1,
                      -1,
                      -1
                  ]
              },
              {
                  "metric": "gpuUtil",
                  "value": [
                      -1,
                      -1,
                      -1,
                      -1,
                      -1,
                      -1,
                      -1,
                      -1
                  ]
              },
              {
                  "metric": "gpuMemUsage",
                  "value": [
                      -1,
                      -1,
                      -1,
                      -1,
                      -1,
                      -1,
                      -1,
                      -1
                  ]
              }
          ]
      }

      The response contains usage metrics such as CPU and memory usage.
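
The metrics response is a list of named series, which can be reduced to one average per metric. The sketch below assumes that a value of -1 means no sample was collected; that reading is a guess based on the example response (the NPU and GPU series are all -1 for this CPU job) and is not stated in this section. The function name `summarize_metrics` is hypothetical.

```python
from statistics import mean


def summarize_metrics(response: dict, missing=-1) -> dict:
    """Map each metric name to the mean of its samples, or None if it has none.

    Samples equal to `missing` are skipped (assumed to mean "no data").
    """
    summary = {}
    for entry in response["metrics"]:
        samples = [value for value in entry["value"] if value != missing]
        summary[entry["metric"]] = mean(samples) if samples else None
    return summary
```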

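  6. When the training job has finished or is no longer needed, call the delete-training-job API to delete it.
    1. Request:

      URI format: DELETE https://{ma_endpoint}/v2/{project_id}/training-jobs/{training_job_id}

      Request header: X-Auth-Token →MIIZmgYJKoZIhvcNAQcCoIIZizCCGYcCAQExDTALBglghkgBZQMEAgEwgXXXXXX...

      Replace the placeholders in braces and the Token value with your actual values.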

    2. A “202” status code with no content in the response indicates that the job was deleted.
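
Step 6 can be sketched the same way. The function names are hypothetical, and the success check simply mirrors the 202 status code stated above. The request is built but not sent.

```python
import urllib.request


def build_delete_job_request(ma_endpoint, project_id, training_job_id, token):
    """Build the DELETE /v2/{project_id}/training-jobs/{training_job_id} request."""
    url = (f"https://{ma_endpoint}/v2/{project_id}/training-jobs/"
           f"{training_job_id}")
    return urllib.request.Request(
        url=url,
        headers={"X-Auth-Token": token},
        method="DELETE",
    )


def is_delete_success(status_code: int) -> bool:
    """Step 6 treats status code 202 as a successful deletion."""
    return status_code == 202
```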
