Last updated: 2026-05-07 GMT+08:00

Creating a TensorBoard Training Job from a Custom Image

This section calls a series of APIs, using model training as an example, to show how the ModelArts API is used.

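Overview

The workflow for creating a training job with the PyTorch framework is as follows:

  1. Call the authentication API to obtain a user token. The token must be placed in the request header of subsequent requests for authentication.
  2. Call the API for creating a training job, and record the ID of the training job.
  3. Call the API for querying training job details, using the ID returned when the job was created, to query the job status and the TensorBoard address.
  4. Open the TensorBoard endpoint to view the visualized metrics.
  5. When the training job is finished or no longer needed, call the API for deleting a training job to delete it.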

Prerequisites

  • A directory for storing TensorBoard logs has been prepared, for example the "obs://cnnorth4-job-pfs/summary" directory in an OBS parallel file system.

Procedure

  1. Call the authentication API to obtain the user's token.
    1. Request message:

      URI format: POST https://{iam_endpoint}/v3/auth/tokens

      Request header: Content-Type →application/json

      Request body:
      {
        "auth": {
          "identity": {
            "methods": ["password"],
            "password": {
              "user": {
                "name": "user_name", 
                "password": "user_password",
                "domain": {
                  "name": "domain_name"  
                }
              }
            }
          },
          "scope": {
            "project": {
              "name": "cn-north-1"  
            }
          }
        }
      }
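      Set the following fields to actual values:
      • iam_endpoint: the IAM endpoint.
      • user_name: the IAM user name.
      • user_password: the user's login password.
      • domain_name: the name of the account to which the user belongs.
      • cn-north-1: the project name, which identifies the region where the service is deployed.
    2. The status code "201 Created" is returned. The value of "X-Subject-Token" in the response header is the token, for example: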
      x-subject-token →MIIZmgYJKoZIhvcNAQcCoIIZizCCGYcCAQExDTALBglghkgBZQMEAgEwgXXXXXX...
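The token request can be sketched in Python with only the standard library. This is a minimal sketch under stated assumptions, not an official client: the function names `build_token_request` and `fetch_token` are hypothetical, and the arguments stand in for the placeholder values described in this step.

```python
import json
import urllib.request


def build_token_request(iam_endpoint, user_name, user_password, domain_name, project_name):
    """Build (but do not send) the POST /v3/auth/tokens request from this step."""
    body = {
        "auth": {
            "identity": {
                "methods": ["password"],
                "password": {
                    "user": {
                        "name": user_name,
                        "password": user_password,
                        "domain": {"name": domain_name},
                    }
                },
            },
            "scope": {"project": {"name": project_name}},
        }
    }
    return urllib.request.Request(
        url=f"https://{iam_endpoint}/v3/auth/tokens",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_token(request):
    """Send the request and return the X-Subject-Token response header."""
    with urllib.request.urlopen(request) as response:
        # This step states that success is signalled by 201 Created.
        if response.status != 201:
            raise RuntimeError(f"unexpected status {response.status}")
        return response.headers["X-Subject-Token"]
```

Separating request construction from sending keeps the body easy to inspect before any credentials leave the machine; `fetch_token(build_token_request(...))` performs the actual call.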
  2. Call the API for creating a training job, and record the ID of the training job.
    1. Request message:

      URI format: POST https://{ma_endpoint}/v2/{project_id}/training-jobs

      Request headers:

      • X-Auth-Token →MIIZmgYJKoZIhvcNAQcCoIIZizCCGYcCAQExDTALBglghkgBZQMEAgEwgXXXXXX...
      • Content-Type →application/json

      Set the placeholder fields to actual values.

      Request body:

      {
          "kind": "job",
          "metadata": {
              "name": "test-pytorch-cpu01",
              "description": "test pytorch work cpu",
              "annotations": {
                  "tensorboard/enable": "true"
              }
          },
          "algorithm": {
              "engine": {
                  "image_url": "atelier/pytorch_cuda:pytorch_2.7.0-cuda_12.8-py_3.11.10-ubuntu_22.04-x86_64-20251215163925-4e5422a"
              },
              "command": "sleep 1h",
              "summary": {
                  "data_sources": [
                      {
                          "pfs": {
                              "pfs_path": "obs://cnnorth4-job-pfs/summary/"
                          }
                      }
                  ]
              }
          },
          "spec": {
              "resource": {
                  "node_count": 1,
                  "pool_id": "pool-maostest-train-06024304be00d5092fbdc0013d201342"
              }
          }
      }

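      Set the following fields to actual values:

      • "kind": the type of the training job. The default value is job.
      • "name" and "description" under "metadata": the name and description of the training job.
      • "annotations" under "metadata": set "tensorboard/enable": "true" to enable TensorBoard.
      • "image_url" under "algorithm" > "engine": the address of the training job image.
      • "command" under "algorithm": the startup command of the training job.
      • "data_sources" under "algorithm" > "summary": the directory where TensorBoard logs are saved.
      • "pool_id" under "spec" > "resource": the ID of the resource pool that the training job depends on.
      • "node_count" under "spec" > "resource": indicates whether the job needs multi-node (distributed) training. This example is single-node, so the default value 1 is used.
      • "log_export_path": the OBS directory to which the user's logs are uploaded (not set in this example).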
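The request body and headers described above can be assembled as follows. This is a sketch only: `build_job_body` and `build_create_request` are hypothetical helper names, and the argument values are whatever applies to your own image, resource pool, and OBS path.

```python
import json
import urllib.request


def build_job_body(name, description, image_url, command, summary_path, pool_id, node_count=1):
    """Return the create-training-job body with TensorBoard enabled."""
    return {
        "kind": "job",
        "metadata": {
            "name": name,
            "description": description,
            # The annotation value is the string "true", not a JSON boolean.
            "annotations": {"tensorboard/enable": "true"},
        },
        "algorithm": {
            "engine": {"image_url": image_url},
            "command": command,
            # Directory where TensorBoard logs are saved.
            "summary": {"data_sources": [{"pfs": {"pfs_path": summary_path}}]},
        },
        "spec": {"resource": {"node_count": node_count, "pool_id": pool_id}},
    }


def build_create_request(ma_endpoint, project_id, token, body):
    """Build (but do not send) the POST /v2/{project_id}/training-jobs request."""
    return urllib.request.Request(
        url=f"https://{ma_endpoint}/v2/{project_id}/training-jobs",
        data=json.dumps(body).encode("utf-8"),
        headers={"X-Auth-Token": token, "Content-Type": "application/json"},
        method="POST",
    )
```

Passing the resulting request to `urllib.request.urlopen` would submit the job.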
    2. The status code "201 Created" is returned, indicating that the training job was created. The response body is as follows:
      {
          "kind": "job",
          "metadata": {
              "id": "dc0e330d-c6a8-4f1d-9cd3-1f108c3bbaf9",
              "name": "test-pytorch-cpu01",
              "description": "test pytorch work cpu",
              "create_time": 1777545590838,
              "workspace_id": "0",
              "ai_project": "default-ai-project",
              "labels": {
                  "training-job": "modelarts-os"
              },
              "user_name": "",
              "annotations": {
                  "job_template": "Template DL",
                  "key_task": "worker",
                  "tensorboard/enable": "true"
              },
              "training_experiment_reference": {},
              "tags": []
          },
          "status": {
              "phase": "Pending",
              "secondary_phase": "Creating",
              "pending_time": 2000,
              "duration": 0,
              "is_hanged": false,
              "retry_count": 0,
              "start_time": 0,
              "node_count_metrics": null,
              "tasks": [
                  "worker-0"
              ],
              "metrics_statistics": {
                  "cpu_usage": {
                      "average": -1,
                      "max": -1,
                      "min": -1
                  },
                  "mem_usage": {
                      "average": -1,
                      "max": -1,
                      "min": -1
                  }
              }
          },
          "algorithm": {
              "command": "sleep 1h",
              "engine": {
                  "engine_id": "",
                  "engine_name": "",
                  "engine_version": "",
                  "v1_compatible": false,
                  "image_url": "atelier/pytorch_cuda:pytorch_2.7.0-cuda_12.8-py_3.11.10-ubuntu_22.04-x86_64-20251215163925-4e5422a",
                  "non_swr_image": false,
                  "run_user": "",
                  "image_source": true,
                  "image_repo_id": "",
                  "image_id": ""
              },
              "summary": {
                  "data_sources": [
                      {
                          "pfs": {
                              "pfs_path": "obs://cnnorth4-job-pfs/summary/"
                          }
                      }
                  ]
              }
          },
          "spec": {
              "resource": {
                  "pool_id": "pool-maostest-train-06024304be00d5092fbdc0013d201342",
                  "pool_resource_flavor": "",
                  "node_count": 1,
                  "pool_info": {
                      "cpu_arch": "x86",
                      "core_num": 5,
                      "mem_size": 22,
                      "cache_size": 0,
                      "accelerator": "",
                      "accelerator_num": 0,
                      "accelerator_type": "",
                      "accelerator_size": 0,
                      "variant": "",
                      "huge_pages": 0,
                      "x_parameter_plane": "",
                      "use_privileged": false,
                      "use_host_network": false,
                      "use_ib_network": false,
                      "project_id": "",
                      "pool_resource_flavor": "liumuqi-eni-test",
                      "pool_id": "pool-maostest-train-06024304be00d5092fbdc0013d201342",
                      "cluster_id": "",
                      "maos_pool": true,
                      "quota_id": "",
                      "maos_migrated": false,
                      "detect_all_in_int": false,
                      "pool_type": "",
                      "enable_cabinet": false,
                      "enable_memarts": false,
                      "enable_ems": false,
                      "empty_dir_size": 0
                  },
                  "main_container_allocated_resources": {
                      "cpu_arch": "x86",
                      "cpu_core_num": 4,
                      "mem_size": 20,
                      "accelerator_num": 0,
                      "accelerator_type": ""
                  }
              },
              "is_hosted_log": true,
              "runtime_type": "production"
          },
          "endpoints": {
              "tensorboard": {}
          },
          "ftjob_config": {
              "checkpoint_config": {
                  "save_checkpoints_max": 0,
                  "checkpoint_id": "",
                  "skipped_steps": 0,
                  "restore_training": 0
              },
              "task_env": {
                  "envs": null
              }
          }
      }
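      • Record the value of "id" under "metadata" (the training job ID) for use in later steps.
      • "phase" and "secondary_phase" under "status" indicate the status and the secondary status of the training job. In this example, "Creating" indicates that the training job is being created.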
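Pulling these fields out of the parsed response might look like the sketch below. The helper name `summarize_created_job` is hypothetical, and the sample dictionary is an excerpt of the response body shown in this step.

```python
def summarize_created_job(response_body):
    """Extract the job ID and status fields from a create-job response."""
    metadata = response_body["metadata"]
    status = response_body.get("status", {})
    return {
        "id": metadata["id"],  # needed for the query and delete calls
        "phase": status.get("phase"),
        "secondary_phase": status.get("secondary_phase"),
    }


# Excerpt of the response body from this step.
sample = {
    "kind": "job",
    "metadata": {"id": "dc0e330d-c6a8-4f1d-9cd3-1f108c3bbaf9", "name": "test-pytorch-cpu01"},
    "status": {"phase": "Pending", "secondary_phase": "Creating"},
}
summary = summarize_created_job(sample)
```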
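  3. Call the API for querying training job details, using the ID returned when the job was created, to query the job status.
    1. Request message:

      URI format: GET https://{ma_endpoint}/v2/{project_id}/training-jobs/{training_job_id}

      Request header: X-Auth-Token →MIIZmgYJKoZIhvcNAQcCoIIZizCCGYcCAQExDTALBglghkgBZQMEAgEwgXXXXXX...

      Set the following field to an actual value:

      "training_job_id": the training job ID recorded in step 2.

    2. The status code "200 OK" is returned. The response body is as follows: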
      {
          "kind": "job",
          "metadata": {
              "id": "dc0e330d-c6a8-4f1d-9cd3-1f108c3bbaf9",
              "name": "test-pytorch-cpu01",
              "description": "test pytorch work cpu",
              "create_time": 1777545590838,
              "workspace_id": "0",
              "ai_project": "default-ai-project",
              "labels": {
                  "training-job": "modelarts-os"
              },
              "user_name": "ei_modelarts_y00218826_05",
              "annotations": {
                  "job_template": "Template DL",
                  "key_task": "worker",
                  "tensorboard/enable": "true"
              },
              "training_experiment_reference": {},
              "tags": []
          },
          "status": {
              "phase": "Running",
              "secondary_phase": "Running",
              "pending_time": 77162,
              "duration": 181000,
              "is_hanged": false,
              "retry_count": 0,
              "task_ips": [
                  {
                      "task": "worker-0",
                      "ip": "172.16.0.79",
                      "host_ip": "192.168.209.44",
                      "schedule_count": 1
                  }
              ],
              "start_time": 1777545668000,
              "node_count_metrics": [
                  [
                      1777545658000,
                      0
                  ],
                  [
                      1777545667000,
                      0
                  ],
                  [
                      1777545668000,
                      1
                  ],
                  [
                      1777545848000,
                      1
                  ],
                  [
                      1777545849000,
                      1
                  ]
              ],
              "tasks": [
                  "worker-0"
              ],
              "metrics_statistics": {
                  "cpu_usage": {
                      "average": -1,
                      "max": -1,
                      "min": -1
                  },
                  "mem_usage": {
                      "average": -1,
                      "max": -1,
                      "min": -1
                  }
              },
              "running_records": [
                  {
                      "start_at": 1777545675,
                      "start_type": "init_or_rescheduled"
                  }
              ]
          },
          "algorithm": {
              "command": "sleep 1h",
              "engine": {
                  "engine_id": "",
                  "engine_name": "",
                  "engine_version": "",
                  "v1_compatible": false,
                  "image_url": "atelier/pytorch_cuda:pytorch_2.7.0-cuda_12.8-py_3.11.10-ubuntu_22.04-x86_64-20251215163925-4e5422a",
                  "non_swr_image": false,
                  "run_user": "",
                  "image_source": true,
                  "image_repo_id": "",
                  "image_id": ""
              },
              "summary": {
                  "data_sources": [
                      {
                          "pfs": {
                              "pfs_path": "obs://cnnorth4-job-pfs/summary/"
                          }
                      }
                  ]
              }
          },
          "spec": {
              "resource": {
                  "pool_id": "pool-maostest-train-06024304be00d5092fbdc0013d201342",
                  "pool_resource_flavor": "",
                  "node_count": 1,
                  "pool_info": {
                      "cpu_arch": "x86",
                      "core_num": 5,
                      "mem_size": 22,
                      "cache_size": 0,
                      "accelerator": "",
                      "accelerator_num": 0,
                      "accelerator_type": "",
                      "accelerator_size": 0,
                      "variant": "",
                      "huge_pages": 0,
                      "x_parameter_plane": "",
                      "use_privileged": false,
                      "use_host_network": false,
                      "use_ib_network": false,
                      "project_id": "",
                      "pool_resource_flavor": "liumuqi-eni-test",
                      "pool_id": "pool-maostest-train-06024304be00d5092fbdc0013d201342",
                      "cluster_id": "",
                      "maos_pool": true,
                      "quota_id": "",
                      "maos_migrated": false,
                      "detect_all_in_int": false,
                      "pool_type": "",
                      "enable_cabinet": false,
                      "enable_memarts": false,
                      "enable_ems": false,
                      "empty_dir_size": 0
                  },
                  "main_container_allocated_resources": {
                      "cpu_arch": "x86",
                      "cpu_core_num": 4,
                      "mem_size": 20,
                      "accelerator_num": 0,
                      "accelerator_type": ""
                  }
              },
              "is_hosted_log": true,
              "runtime_type": "production"
          },
          "endpoints": {
              "tensorboard": {
                  "url": "https://authoring-modelarts-cnnorth7.ulanqab.huawei.com/3efaddb9-dc0e330d-c6a8-4f1d-9cd3-1f108c3bbaf9/proxy/6006/",
                  "token": "fa8e321f772xxxxxxxxxxx3f1844d06"
              }
          },
          "ftjob_config": {
              "checkpoint_config": {
                  "save_checkpoints_max": 0,
                  "checkpoint_id": "",
                  "skipped_steps": 0,
                  "restore_training": 0
              },
              "task_env": {
                  "envs": null
              }
          }
      }

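      The response shows the details of the training job. A "phase" of "Running" under "status" indicates that the training job is running. Once the job is running, TensorBoard starts. After TensorBoard has started successfully, "endpoints" in the job details returns the TensorBoard URL and token.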
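Because the TensorBoard URL and token only appear after the job is running, a client typically has to query the job details repeatedly. The sketch below shows one way to do that; the function names, the polling interval, and the attempt limit are all assumptions, and `fetch_job` is a caller-supplied callable that performs the GET request and returns the parsed JSON body.

```python
import time


def tensorboard_endpoint(job):
    """Return (url, token) once both are present in the job details, else None."""
    tensorboard = (job.get("endpoints") or {}).get("tensorboard") or {}
    url, token = tensorboard.get("url"), tensorboard.get("token")
    return (url, token) if url and token else None


def wait_for_tensorboard(fetch_job, interval_seconds=10.0, max_attempts=60, sleep=time.sleep):
    """Poll fetch_job() until the TensorBoard endpoint is populated."""
    for attempt in range(max_attempts):
        endpoint = tensorboard_endpoint(fetch_job())
        if endpoint is not None:
            return endpoint
        if attempt < max_attempts - 1:
            sleep(interval_seconds)  # wait before querying the job again
    raise TimeoutError("TensorBoard endpoint did not appear in the job details")
```

Injecting `fetch_job` and `sleep` keeps the loop testable without network access. A production version would probably also stop early when the job reaches a failed or terminated phase.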

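  4. Open the TensorBoard endpoint to view the visualized metrics.

    URI format: GET {tensorboard_endpoint}?token={tensorboard_token}

    Set the following fields to actual values:

    • "tensorboard_endpoint": the TensorBoard URL of the training job.
    • "tensorboard_token": the TensorBoard access credential of the training job.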
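Joining the endpoint and the token by hand is error-prone if the token contains characters that need escaping. A small sketch using the standard library's URL helpers is shown below; the function name `tensorboard_url` is hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def tensorboard_url(endpoint, token):
    """Append token=<token> to the TensorBoard endpoint as a query parameter."""
    parts = urlsplit(endpoint)
    # Keep any query parameters the endpoint already carries.
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append(("token", token))
    return urlunsplit(parts._replace(query=urlencode(query)))
```

The resulting URL can be opened in a browser or passed to `webbrowser.open`.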
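  5. When the training job is finished or no longer needed, call the API for deleting a training job to delete it.
    1. Request message:

      URI format: DELETE https://{ma_endpoint}/v2/{project_id}/training-jobs/{training_job_id}

      Request header: X-Auth-Token →MIIZmgYJKoZIhvcNAQcCoIIZizCCGYcCAQExDTALBglghkgBZQMEAgEwgXXXXXX...

      Set the placeholder fields to actual values.

    2. If a "202 No Content" response is returned, the job was deleted successfully.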
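The delete call can be sketched in the same style. The helper names are hypothetical. The sender accepts any 2xx status as success, which is a deliberate simplification rather than documented behaviour.

```python
import urllib.request


def build_delete_request(ma_endpoint, project_id, training_job_id, token):
    """Build (but do not send) the DELETE request for a training job."""
    return urllib.request.Request(
        url=f"https://{ma_endpoint}/v2/{project_id}/training-jobs/{training_job_id}",
        headers={"X-Auth-Token": token},
        method="DELETE",
    )


def delete_job(request):
    """Send the request and report whether a 2xx status came back."""
    with urllib.request.urlopen(request) as response:
        return 200 <= response.status < 300
```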
