Updated: 2024-05-30 GMT+08:00

Creating a Training Job Using the PyTorch Framework (New-Version Training)

This section walks through a series of API calls, using model training as the example, to show how the ModelArts APIs are used together.

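Overview

The process for creating a training job using the PyTorch framework is as follows:

  1. Call the authentication API to obtain a user token. The token must be placed in the request header of subsequent requests for authentication.
  2. Call the API for obtaining the public flavors supported by training jobs to get the available resource flavors.
  3. Call the API for obtaining the preset AI frameworks supported by training jobs to view the supported engine types and versions.
  4. Call the API for creating an algorithm to create an algorithm, and record the algorithm ID.
  5. Call the API for creating a training job with the ID (UUID) of the algorithm just created, and record the training job ID.
  6. Call the API for querying training job details with the ID of the training job just created to check the job status.
  7. Call the API for querying the logs of a specified task of a training job (OBS link) to get the OBS path of the training job logs.
  8. Call the API for querying the running metrics of a specified task of a training job to view the metric details.
  9. When the training job is finished or no longer needed, call the API for deleting a training job to delete it.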

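Prerequisites

  • The PyTorch training code is ready. For example, the boot file "test-pytorch.py" is stored in the OBS directory "obs://cnnorth4-job-test-v2/pytorch/fast_example/code/cpu".
  • The data files for the training job are ready. For example, the training dataset is stored in the OBS directory "obs://cnnorth4-job-test-v2/pytorch/fast_example/data".
  • The model output location for the training job has been created, for example, "obs://cnnorth4-job-test-v2/pytorch/fast_example/outputs".
  • The log output location for the training job has been created, for example, "obs://cnnorth4-job-test-v2/pytorch/fast_example/log".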

Procedure

  1. Call the authentication API to obtain the user token.
    1. Request:

      URI format: POST https://{iam_endpoint}/v3/auth/tokens

      Request header: Content-Type → application/json

      Request body:
      {
        "auth": {
          "identity": {
            "methods": ["password"],
            "password": {
              "user": {
                "name": "user_name", 
                "password": "user_password",
                "domain": {
                  "name": "domain_name"  
                }
              }
            }
          },
          "scope": {
            "project": {
              "name": "cn-north-1"  
            }
          }
        }
      }
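      Replace the following fields with actual values:
      • iam_endpoint is the IAM endpoint.
      • user_name is the IAM user name.
      • user_password is the user's login password.
      • domain_name is the name of the account to which the user belongs.
      • cn-north-1 is the project name, which indicates the region where the service is deployed.
    2. Status code "201 Created" is returned. The value of "X-Subject-Token" in the response header is the token, as shown below: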
      x-subject-token →MIIZmgYJKoZIhvcNAQcCoIIZizCCGYcCAQExDTALBglghkgBZQMEAgEwgXXXXXX...
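
The token request in step 1 can be sketched in Python using only the standard library. This is a minimal illustration, not an official client: the function names `build_token_request` and `fetch_token` are hypothetical, and the endpoint and credentials are placeholders that you supply.

```python
import json
import urllib.request


def build_token_request(iam_endpoint, user_name, user_password, domain_name, project_name):
    """Build (but do not send) the POST /v3/auth/tokens request shown in step 1."""
    body = {
        "auth": {
            "identity": {
                "methods": ["password"],
                "password": {
                    "user": {
                        "name": user_name,
                        "password": user_password,
                        "domain": {"name": domain_name},
                    }
                },
            },
            "scope": {"project": {"name": project_name}},
        }
    }
    return urllib.request.Request(
        url=f"https://{iam_endpoint}/v3/auth/tokens",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_token(request):
    """Send the request and return the X-Subject-Token response header value."""
    # urlopen treats any 2xx status (including 201 Created) as success.
    with urllib.request.urlopen(request) as response:
        return response.headers["X-Subject-Token"]
```

Keeping request construction separate from sending makes the payload easy to inspect before any credentials leave the machine.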
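  2. Call the API for obtaining the public flavors supported by training jobs to get the available resource flavors.
    1. Request:

      URI format: GET https://{ma_endpoint}/v2/{project_id}/training-job-flavors?flavor_type=CPU

      Request header: X-Auth-Token → MIIZmgYJKoZIhvcNAQcCoIIZizCCGYcCAQExDTALBglghkgBZQMEAgEwgXXXXXX...

      Replace the following fields with actual values:

      • ma_endpoint is the ModelArts endpoint.
      • project_id is the user's project ID.
      • The value of "X-Auth-Token" is the token obtained in the previous step.
    2. Status code "200" is returned. The response body is as follows: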
      {
        "total_count": 2,
        "flavors": [
          {
            "flavor_id": "modelarts.vm.cpu.2u",
            "flavor_name": "Computing CPU(2U) instance",
            "flavor_type": "CPU",
            "billing": {
              "code": "modelarts.vm.cpu.2u",
              "unit_num": 1
            },
            "flavor_info": {
              "max_num": 1,
              "cpu": {
                "arch": "x86",
                "core_num": 2
              },
              "memory": {
                "size": 8,
                "unit": "GB"
              },
              "disk": {
                "size": 50,
                "unit": "GB"
              }
            }
          },
          {
            "flavor_id": "modelarts.vm.cpu.8u",
            "flavor_name": "Computing CPU(8U) instance",
            "flavor_type": "CPU",
            "billing": {
              "code": "modelarts.vm.cpu.8u",
              "unit_num": 1
            },
            "flavor_info": {
              "max_num": 16,
              "cpu": {
                "arch": "x86",
                "core_num": 8
              },
              "memory": {
                "size": 32,
                "unit": "GB"
              },
              "disk": {
                "size": 50,
                "unit": "GB"
              }
            }
          }
        ]
      }
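      • Use the "flavor_id" field to select and record the flavor needed for creating the training job. This section uses "modelarts.vm.cpu.8u" as an example and records its "max_num" value, "16".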
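
As a rough sketch of step 2, the helpers below build the query URL and pick one flavor out of a parsed response body. The names `flavors_url` and `pick_flavor` are hypothetical, and `sample` is a trimmed copy of the response above.

```python
from urllib.parse import urlencode


def flavors_url(ma_endpoint, project_id, flavor_type="CPU"):
    """Return the URL for listing public training flavors of one type."""
    query = urlencode({"flavor_type": flavor_type})
    return f"https://{ma_endpoint}/v2/{project_id}/training-job-flavors?{query}"


def pick_flavor(response_body, flavor_id):
    """Return (flavor_id, max_num) for the requested flavor."""
    for flavor in response_body.get("flavors", []):
        if flavor["flavor_id"] == flavor_id:
            return flavor["flavor_id"], flavor["flavor_info"]["max_num"]
    raise LookupError(f"flavor {flavor_id!r} not found in response")


# Trimmed copy of the sample response shown above.
sample = {
    "total_count": 2,
    "flavors": [
        {"flavor_id": "modelarts.vm.cpu.2u", "flavor_info": {"max_num": 1}},
        {"flavor_id": "modelarts.vm.cpu.8u", "flavor_info": {"max_num": 16}},
    ],
}

chosen_id, max_num = pick_flavor(sample, "modelarts.vm.cpu.8u")
```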
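  3. Call the API for obtaining the preset AI frameworks supported by training jobs to view the supported engine types and versions.
    1. Request:

      URI format: GET https://{ma_endpoint}/v2/{project_id}/job/training-job-engines

      Request headers:

      X-Auth-Token → MIIZmgYJKoZIhvcNAQcCoIIZizCCGYcCAQExDTALBglghkgBZQMEAgEwgXXXXXX...

      Content-Type → application/json

      Replace the placeholder fields with actual values.

    2. Status code "200" is returned. The response body is as follows (there are many engines; only some are shown):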
      {
          "total": 28,
          "items": [
              ......
              {
                  "engine_id": "mindspore_1.6.0-cann_5.0.3.6-py_3.7-euler_2.8.3-aarch64",
                  "engine_name": "Ascend-Powered-Engine",
                  "engine_version": "mindspore_1.6.0-cann_5.0.3.6-py_3.7-euler_2.8.3-aarch64",
                  "v1_compatible": false,
                  "run_user": "1000",
                  "image_info": {
                      "cpu_image_url": "",
                      "gpu_image_url": "atelier/mindspore_1_6_0:train",
                      "image_version": "mindspore_1.6.0-cann_5.0.3.6-py_3.7-euler_2.8.3-aarch64-snt9-roma-20211231193205-33131ee"
                  }
              },
      		......
              {
                  "engine_id": "pytorch_1.8.0-cuda_10.2-py_3.7-ubuntu_18.04-x86_64",
                  "engine_name": "PyTorch",
                  "engine_version": "pytorch_1.8.0-cuda_10.2-py_3.7-ubuntu_18.04-x86_64",
                  "tags": [
                      {
                          "key": "auto_search",
                          "value": "True"
                      }
                  ],
                  "v1_compatible": false,
                  "run_user": "1102",
                  "image_info": {
                      "cpu_image_url": "aip/pytorch_1_8:train",
                      "gpu_image_url": "aip/pytorch_1_8:train",
                      "image_version": "pytorch_1.8.0-cuda_10.2-py_3.7-ubuntu_18.04-x86_64-20210912152543-1e0838d"
                  }
              },
              ......
              {
                  "engine_id": "tensorflow_2.1.0-cuda_10.1-py_3.7-ubuntu_18.04-x86_64",
                  "engine_name": "TensorFlow",
                  "engine_version": "tensorflow_2.1.0-cuda_10.1-py_3.7-ubuntu_18.04-x86_64",
                  "tags": [
                      {
                          "key": "auto_search",
                          "value": "True"
                      }
                  ],
                  "v1_compatible": false,
                  "run_user": "1102",
                  "image_info": {
                      "cpu_image_url": "aip/tensorflow_2_1:train",
                      "gpu_image_url": "aip/tensorflow_2_1:train",
                      "image_version": "tensorflow_2.1.0-cuda_10.1-py_3.7-ubuntu_18.04-x86_64-20210912152543-1e0838d"
                  }
              },
              ......
          ]
      }

      Use the "engine_name" and "engine_version" fields to select the engine for the training job, and record both values. This section uses the PyTorch engine as an example: "engine_name" is "PyTorch" and "engine_version" is "pytorch_1.8.0-cuda_10.2-py_3.7-ubuntu_18.04-x86_64".
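
Selecting an engine from the response can be sketched as a simple filter over "items". The helper name `find_engine_versions` is hypothetical, and `sample` keeps only the two fields used here from the response above.

```python
def find_engine_versions(response_body, engine_name):
    """Return every engine_version whose engine_name matches exactly."""
    return [
        item["engine_version"]
        for item in response_body.get("items", [])
        if item.get("engine_name") == engine_name
    ]


# Trimmed copy of the sample response shown above.
sample = {
    "total": 28,
    "items": [
        {
            "engine_name": "Ascend-Powered-Engine",
            "engine_version": "mindspore_1.6.0-cann_5.0.3.6-py_3.7-euler_2.8.3-aarch64",
        },
        {
            "engine_name": "PyTorch",
            "engine_version": "pytorch_1.8.0-cuda_10.2-py_3.7-ubuntu_18.04-x86_64",
        },
        {
            "engine_name": "TensorFlow",
            "engine_version": "tensorflow_2.1.0-cuda_10.1-py_3.7-ubuntu_18.04-x86_64",
        },
    ],
}

pytorch_versions = find_engine_versions(sample, "PyTorch")
```

The match is case-sensitive, so the name must be written exactly as the API returns it ("PyTorch").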

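  4. Call the API for creating an algorithm to create an algorithm, and record the algorithm ID.
    1. Request:

      URI format: POST https://{ma_endpoint}/v2/{project_id}/algorithms

      Request headers:

      X-Auth-Token → MIIZmgYJKoZIhvcNAQcCoIIZizCCGYcCAQExDTALBglghkgBZQMEAgEwgXXXXXX...

      Content-Type → application/json

      Replace the placeholder fields with actual values.

      Request body: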

      {
      	"metadata": {
      		"name": "test-pytorch-cpu",
      		"description": "test pytorch job in cpu in mode gloo"
      	},
      	"job_config": {
      		"boot_file": "/cnnorth4-job-test-v2/pytorch/fast_example/code/cpu/test-pytorch.py",
      		"code_dir": "/cnnorth4-job-test-v2/pytorch/fast_example/code/cpu/",
      		"engine": {
      			"engine_name": "PyTorch",
      			"engine_version": "pytorch_1.8.0-cuda_10.2-py_3.7-ubuntu_18.04-x86_64"
      		},
      		"inputs": [{
      			"name": "data_url",
      			"description": "数据来源1"
      		}],
      		"outputs": [{
      			"name": "train_url",
      			"description": "输出数据1"
      		}],
      		"parameters": [{
      				"name": "dist",
      				"description": "",
      				"value": "False",
      				"constraint": {
      					"editable": true,
      					"required": false,
      					"sensitive": false,
      					"type": "Boolean",
      					"valid_range": [],
      					"valid_type": "None"
      				}
      			},
      			{
      				"name": "world_size",
      				"description": "",
      				"value": "1",
      				"constraint": {
      					"editable": true,
      					"required": false,
      					"sensitive": false,
      					"type": "Integer",
      					"valid_range": [],
      					"valid_type": "None"
      				}
      			}
      		],
      		"parameters_customization": true
      	},
      	"resource_requirements": []
      }

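      Replace the following fields with actual values:

      • Under "metadata", "name" and "description" are the name and description of the algorithm.
      • Under "job_config", "code_dir" and "boot_file" are the algorithm's code directory and boot file. The code directory is the parent directory of the boot file.
      • Under "job_config", "inputs" and "outputs" are the algorithm's input and output channels. You can specify "data_url" and "train_url" as in the example, then parse these hyperparameters in the code to get the local path of the training data and the local path for the generated model output.
      • Under "job_config", "parameters_customization" indicates whether custom hyperparameters are supported. Set it to true here.
      • Under "job_config", "parameters" lists the algorithm's own hyperparameters. "name" is the hyperparameter name (up to 64 characters; only letters, digits, underscores, and hyphens are allowed), "value" is the default value, and "constraint" holds the constraints on the hyperparameter. For example, "type" is the data type (String, Integer, Float, and Boolean are supported; this example uses Boolean and Integer), "editable" is set to "true", and "required" is set to "false".
      • Under "job_config", "engine" is the engine that the algorithm depends on. Use the "engine_name" and "engine_version" recorded in step 3.
    2. Status code "200 OK" is returned, indicating that the algorithm was created. The response body is as follows: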
      {
          "metadata": {
              "id": "01c399ae-8593-4ef5-9e4d-085950aacde1",
              "name": "test-pytorch-cpu",
              "description": "test pytorch job in cpu in mode gloo",
              "create_time": 1641890623262,
              "workspace_id": "0",
              "ai_project": "default-ai-project",
              "user_name": "",
              "domain_id": "0659fbf6de00109b0ff1c01fc037d240",
              "source": "custom",
              "api_version": "",
              "is_valid": true,
              "state": "",
              "size": 4790,
              "tags": null,
              "attr_list": null,
              "version_num": 0,
              "update_time": 0
          },
          "share_info": {},
          "job_config": {
              "code_dir": "/cnnorth4-job-test-v2/pytorch/fast_example/code/cpu/",
              "boot_file": "/cnnorth4-job-test-v2/pytorch/fast_example/code/cpu/test-pytorch.py",
              "parameters": [
                  {
                      "name": "dist",
                      "description": "",
                      "i18n_description": null,
                      "value": "False",
                      "constraint": {
                          "type": "Boolean",
                          "editable": true,
                          "required": false,
                          "sensitive": false,
                          "valid_type": "None",
                          "valid_range": []
                      }
                  },
                  {
                      "name": "world_size",
                      "description": "",
                      "i18n_description": null,
                      "value": "1",
                      "constraint": {
                          "type": "Integer",
                          "editable": true,
                          "required": false,
                          "sensitive": false,
                          "valid_type": "None",
                          "valid_range": []
                      }
                  }
              ],
              "parameters_customization": true,
              "inputs": [
                  {
                      "name": "data_url",
                      "description": "数据来源1"
                  }
              ],
              "outputs": [
                  {
                      "name": "train_url",
                      "description": "输出数据1"
                  }
              ],
              "engine": {
                  "engine_id": "pytorch_1.8.0-cuda_10.2-py_3.7-ubuntu_18.04-x86_64",
                  "engine_name": "PyTorch",
                  "engine_version": "pytorch_1.8.0-cuda_10.2-py_3.7-ubuntu_18.04-x86_64",
                  "tags": [
                      {
                          "key": "auto_search",
                          "value": "True"
                      }
                  ],
                  "v1_compatible": false,
                  "run_user": "1102",
                  "image_info": {
                      "cpu_image_url": "aip/pytorch_1_8:train",
                      "gpu_image_url": "aip/pytorch_1_8:train",
                      "image_version": "pytorch_1.8.0-cuda_10.2-py_3.7-ubuntu_18.04-x86_64-20210912152543-1e0838d"
                  }
              },
              "code_tree": {
                  "name": "cpu/",
                  "children": [
                      {
                          "name": "test-pytorch.py"
                      }
                  ]
              }
          },
          "resource_requirements": [],
          "advanced_config": {}
      }

      Record the value of "id" under "metadata" (the algorithm ID, a UUID with 32 hexadecimal digits) for use in later steps.
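
The algorithm request body in step 4 could be assembled along the following lines. This is a sketch under stated assumptions: the helper names are hypothetical, the name and type checks simply encode the constraints described above, and the check that the boot file sits under the code directory is an added convenience rather than documented API behavior.

```python
import re

# Constraint from the field description: up to 64 characters,
# only letters, digits, underscores, and hyphens.
PARAM_NAME_RE = re.compile(r"[A-Za-z0-9_-]{1,64}")
ALLOWED_TYPES = {"String", "Integer", "Float", "Boolean"}


def make_parameter(name, value, type_, editable=True, required=False):
    """Build one hyperparameter entry in the shape used by the sample request."""
    if not PARAM_NAME_RE.fullmatch(name):
        raise ValueError(f"invalid hyperparameter name: {name!r}")
    if type_ not in ALLOWED_TYPES:
        raise ValueError(f"unsupported hyperparameter type: {type_!r}")
    return {
        "name": name,
        "description": "",
        "value": str(value),  # the sample sends values as strings
        "constraint": {
            "editable": editable,
            "required": required,
            "sensitive": False,
            "type": type_,
            "valid_range": [],
            "valid_type": "None",
        },
    }


def build_algorithm_body(name, description, code_dir, boot_file,
                         engine_name, engine_version, parameters):
    """Assemble the request body for creating an algorithm."""
    # Assumption (local sanity check only): the boot file lives under code_dir.
    if not boot_file.startswith(code_dir):
        raise ValueError("boot_file must be located under code_dir")
    return {
        "metadata": {"name": name, "description": description},
        "job_config": {
            "boot_file": boot_file,
            "code_dir": code_dir,
            "engine": {"engine_name": engine_name, "engine_version": engine_version},
            "inputs": [{"name": "data_url", "description": ""}],
            "outputs": [{"name": "train_url", "description": ""}],
            "parameters": parameters,
            "parameters_customization": True,
        },
        "resource_requirements": [],
    }


body = build_algorithm_body(
    "test-pytorch-cpu",
    "test pytorch job in cpu in mode gloo",
    "/cnnorth4-job-test-v2/pytorch/fast_example/code/cpu/",
    "/cnnorth4-job-test-v2/pytorch/fast_example/code/cpu/test-pytorch.py",
    "PyTorch",
    "pytorch_1.8.0-cuda_10.2-py_3.7-ubuntu_18.04-x86_64",
    [make_parameter("dist", False, "Boolean"), make_parameter("world_size", 1, "Integer")],
)
```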

  5. Call the API for creating a training job with the ID (UUID) of the algorithm just created, and record the training job ID.
    1. Request:

      URI format: POST https://{ma_endpoint}/v2/{project_id}/training-jobs

      Request headers:

      • X-Auth-Token → MIIZmgYJKoZIhvcNAQcCoIIZizCCGYcCAQExDTALBglghkgBZQMEAgEwgXXXXXX...
      • Content-Type → application/json

      Replace the placeholder fields with actual values.

      Request body:

      {
      	"kind": "job",
      	"metadata": {
      		"name": "test-pytorch-cpu01",
      		"description": "test pytorch work cpu in mode gloo"
      	},
      	"algorithm": {
      		"id": "01c399ae-8593-4ef5-9e4d-085950aacde1",
      		"parameters": [{
      				"name": "dist",
      				"value": "False"
      			},
      			{
      				"name": "world_size",
      				"value": "1"
      			}
      		],
      		"inputs": [{
      			"name": "data_url",
      			"remote": {
      				"obs": {
      					"obs_url": "/cnnorth4-job-test-v2/pytorch/fast_example/data/"
      				}
      			}
      		}],
      		"outputs": [{
      			"name": "train_url",
      			"remote": {
      				"obs": {
      					"obs_url": "/cnnorth4-job-test-v2/pytorch/fast_example/outputs/"
      				}
      			}
      		}]
      	},
      	"spec": {
      		"resource": {
      			"flavor_id": "modelarts.vm.cpu.8u",
      			"node_count": 1
      		},
      		"log_export_path": {
      			"obs_url": "/cnnorth4-job-test-v2/pytorch/fast_example/log/"
      		}
      	}
      }

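      Replace the following fields with actual values:

      • "kind" is the training job type. The default is job.
      • Under "metadata", "name" and "description" are the name and description of the training job.
      • Under "algorithm", "id" is the algorithm ID obtained in step 4.
      • Under "algorithm", "inputs" and "outputs" give the details of the training job's input and output channels. In the example, "obs_url" under "inputs" > "remote" is the OBS path from which the training data is read, and "obs_url" under "outputs" > "remote" is the OBS path to which the training output is uploaded.
      • Under "spec", "flavor_id" is the flavor that the training job depends on; use the flavor_id recorded in step 2. "node_count" is the number of nodes and determines whether the training is multi-node (distributed); this single-node example uses the default value "1". "log_export_path" specifies the OBS directory to which logs are uploaded.
    2. Status code "201 Created" is returned, indicating that the training job was created. The response body is as follows: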
      {
          "kind": "job",
          "metadata": {
              "id": "66ff6991-fd66-40b6-8101-0829a46d3731",
              "name": "test-pytorch-cpu01",
              "description": "test pytorch work cpu in mode gloo",
              "create_time": 1641892642625,
              "workspace_id": "0",
              "ai_project": "default-ai-project",
              "user_name": "",
              "annotations": {
                  "job_template": "Template DL",
                  "key_task": "worker"
              }
          },
          "status": {
              "phase": "Creating",
              "secondary_phase": "Creating",
              "duration": 0,
              "start_time": 0,
              "node_count_metrics": null,
              "tasks": [
                  "worker-0"
              ]
          },
          "algorithm": {
              "id": "01c399ae-8593-4ef5-9e4d-085950aacde1",
              "name": "test-pytorch-cpu",
              "code_dir": "/cnnorth4-job-test-v2/pytorch/fast_example/code/cpu/",
              "boot_file": "/cnnorth4-job-test-v2/pytorch/fast_example/code/cpu/test-pytorch.py",
              "parameters": [
                  {
                      "name": "dist",
                      "description": "",
                      "i18n_description": null,
                      "value": "False",
                      "constraint": {
                          "type": "Boolean",
                          "editable": true,
                          "required": false,
                          "sensitive": false,
                          "valid_type": "None",
                          "valid_range": []
                      }
                  },
                  {
                      "name": "world_size",
                      "description": "",
                      "i18n_description": null,
                      "value": "1",
                      "constraint": {
                          "type": "Integer",
                          "editable": true,
                          "required": false,
                          "sensitive": false,
                          "valid_type": "None",
                          "valid_range": []
                      }
                  }
              ],
              "parameters_customization": true,
              "inputs": [
                  {
                      "name": "data_url",
                      "description": "数据来源1",
                      "local_dir": "/home/ma-user/modelarts/inputs/data_url_0",
                      "remote": {
                          "obs": {
                              "obs_url": "/cnnorth4-job-test-v2/pytorch/fast_example/data/"
                          }
                      }
                  }
              ],
              "outputs": [
                  {
                      "name": "train_url",
                      "description": "输出数据1",
                      "local_dir": "/home/ma-user/modelarts/outputs/train_url_0",
                      "remote": {
                          "obs": {
                              "obs_url": "/cnnorth4-job-test-v2/pytorch/fast_example/outputs/"
                          }
                      },
                      "mode": "upload_periodically",
                      "period": 30
                  }
              ],
              "engine": {
                  "engine_id": "pytorch_1.8.0-cuda_10.2-py_3.7-ubuntu_18.04-x86_64",
                  "engine_name": "PyTorch",
                  "engine_version": "pytorch_1.8.0-cuda_10.2-py_3.7-ubuntu_18.04-x86_64",
                  "usage": "training",
                  "support_groups": "public",
                  "tags": [
                      {
                          "key": "auto_search",
                          "value": "True"
                      }
                  ],
                  "v1_compatible": false,
                  "run_user": "1102"
              }
          },
          "spec": {
              "resource": {
                  "flavor_id": "modelarts.vm.cpu.8u",
                  "flavor_name": "Computing CPU(8U) instance",
                  "node_count": 1,
                  "flavor_detail": {
                      "flavor_type": "CPU",
                      "billing": {
                          "code": "modelarts.vm.cpu.8u",
                          "unit_num": 1
                      },
                      "flavor_info": {
                          "cpu": {
                              "arch": "x86",
                              "core_num": 8
                          },
                          "memory": {
                              "size": 32,
                              "unit": "GB"
                          },
                          "disk": {
                              "size": 50,
                              "unit": "GB"
                          }
                      }
                  }
              },
              "log_export_path": {
                  "obs_url": "/cnnorth4-job-test-v2/pytorch/fast_example/log/"
              },
              "is_hosted_log": true
          }
      }
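      • Record the value of "id" under "metadata" (the training job ID) for use in later steps.
      • Under "status", "phase" and "secondary_phase" indicate the status and the secondary status of the training job. In the example, "Creating" means that the training job is being created.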
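
The training job request body in step 5 can be assembled from the values recorded in steps 2 and 4. The helper names below are hypothetical. Treating the recorded "max_num" as the upper bound for "node_count" is an assumption made for this sketch; the section records the value but does not state how it is used.

```python
def obs_channel(name, obs_url):
    """One input or output channel bound to an OBS path."""
    return {"name": name, "remote": {"obs": {"obs_url": obs_url}}}


def build_training_job_body(name, description, algorithm_id, parameters,
                            data_obs_url, output_obs_url, log_obs_url,
                            flavor_id, node_count=1, max_num=None):
    """Assemble the request body for creating a training job."""
    # Assumption: "max_num" from step 2 limits how many nodes may be requested.
    if max_num is not None and not 1 <= node_count <= max_num:
        raise ValueError(f"node_count must be between 1 and {max_num}")
    return {
        "kind": "job",
        "metadata": {"name": name, "description": description},
        "algorithm": {
            "id": algorithm_id,
            # Values are sent as strings, as in the sample request.
            "parameters": [{"name": k, "value": str(v)} for k, v in parameters.items()],
            "inputs": [obs_channel("data_url", data_obs_url)],
            "outputs": [obs_channel("train_url", output_obs_url)],
        },
        "spec": {
            "resource": {"flavor_id": flavor_id, "node_count": node_count},
            "log_export_path": {"obs_url": log_obs_url},
        },
    }


job_body = build_training_job_body(
    name="test-pytorch-cpu01",
    description="test pytorch work cpu in mode gloo",
    algorithm_id="01c399ae-8593-4ef5-9e4d-085950aacde1",
    parameters={"dist": False, "world_size": 1},
    data_obs_url="/cnnorth4-job-test-v2/pytorch/fast_example/data/",
    output_obs_url="/cnnorth4-job-test-v2/pytorch/fast_example/outputs/",
    log_obs_url="/cnnorth4-job-test-v2/pytorch/fast_example/log/",
    flavor_id="modelarts.vm.cpu.8u",
    node_count=1,
    max_num=16,
)
```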
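  6. Call the API for querying training job details with the ID of the training job just created to check the job status.
    1. Request:

      URI format: GET https://{ma_endpoint}/v2/{project_id}/training-jobs/{training_job_id}

      Request header: X-Auth-Token → MIIZmgYJKoZIhvcNAQcCoIIZizCCGYcCAQExDTALBglghkgBZQMEAgEwgXXXXXX...

      Replace the following fields with actual values:

      "training_job_id" is the training job ID recorded in step 5.

    2. Status code "200 OK" is returned. The response body is as follows: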
      {
          "kind": "job",
          "metadata": {
              "id": "66ff6991-fd66-40b6-8101-0829a46d3731",
              "name": "test-pytorch-cpu01",
              "description": "test pytorch work cpu in mode gloo",
              "create_time": 1641892642625,
              "workspace_id": "0",
              "ai_project": "default-ai-project",
              "user_name": "hwstaff_z00424192",
              "annotations": {
                  "job_template": "Template DL",
                  "key_task": "worker"
              }
          },
          "status": {
              "phase": "Running",
              "secondary_phase": "Running",
              "duration": 268000,
              "start_time": 1641892655000,
              "node_count_metrics": [
                  [
                      1641892645000,
                      0
                  ],
                  [
                      1641892654000,
                      0
                  ],
                  [
                      1641892655000,
                      1
                  ],
                  [
                      1641892922000,
                      1
                  ],
                  [
                      1641892923000,
                      1
                  ]
              ],
              "tasks": [
                  "worker-0"
              ]
          },
          "algorithm": {
              "id": "01c399ae-8593-4ef5-9e4d-085950aacde1",
              "name": "test-pytorch-cpu",
              "code_dir": "/cnnorth4-job-test-v2/pytorch/fast_example/code/cpu/",
              "boot_file": "/cnnorth4-job-test-v2/pytorch/fast_example/code/cpu/test-pytorch.py",
              "parameters": [
                  {
                      "name": "dist",
                      "description": "",
                      "i18n_description": null,
                      "value": "False",
                      "constraint": {
                          "type": "Boolean",
                          "editable": true,
                          "required": false,
                          "sensitive": false,
                          "valid_type": "None",
                          "valid_range": []
                      }
                  },
                  {
                      "name": "world_size",
                      "description": "",
                      "i18n_description": null,
                      "value": "1",
                      "constraint": {
                          "type": "Integer",
                          "editable": true,
                          "required": false,
                          "sensitive": false,
                          "valid_type": "None",
                          "valid_range": []
                      }
                  }
              ],
              "parameters_customization": true,
              "inputs": [
                  {
                      "name": "data_url",
                      "description": "数据来源1",
                      "local_dir": "/home/ma-user/modelarts/inputs/data_url_0",
                      "remote": {
                          "obs": {
                              "obs_url": "/cnnorth4-job-test-v2/pytorch/fast_example/data/"
                          }
                      }
                  }
              ],
              "outputs": [
                  {
                      "name": "train_url",
                      "description": "输出数据1",
                      "local_dir": "/home/ma-user/modelarts/outputs/train_url_0",
                      "remote": {
                          "obs": {
                              "obs_url": "/cnnorth4-job-test-v2/pytorch/fast_example/outputs/"
                          }
                      },
                      "mode": "upload_periodically",
                      "period": 30
                  }
              ],
              "engine": {
                  "engine_id": "pytorch_1.8.0-cuda_10.2-py_3.7-ubuntu_18.04-x86_64",
                  "engine_name": "PyTorch",
                  "engine_version": "pytorch_1.8.0-cuda_10.2-py_3.7-ubuntu_18.04-x86_64",
                  "usage": "training",
                  "support_groups": "public",
                  "tags": [
                      {
                          "key": "auto_search",
                          "value": "True"
                      }
                  ],
                  "v1_compatible": false,
                  "run_user": "1102"
              }
          },
          "spec": {
              "resource": {
                  "flavor_id": "modelarts.vm.cpu.8u",
                  "flavor_name": "Computing CPU(8U) instance",
                  "node_count": 1,
                  "flavor_detail": {
                      "flavor_type": "CPU",
                      "billing": {
                          "code": "modelarts.vm.cpu.8u",
                          "unit_num": 1
                      },
                      "flavor_info": {
                          "cpu": {
                              "arch": "x86",
                              "core_num": 8
                          },
                          "memory": {
                              "size": 32,
                              "unit": "GB"
                          },
                          "disk": {
                              "size": 50,
                              "unit": "GB"
                          }
                      }
                  }
              },
              "log_export_path": {
                  "obs_url": "/cnnorth4-job-test-v2/pytorch/fast_example/log/"
              },
              "is_hosted_log": true
          }
      }

      The response shows the details of the training job. A "phase" value of "Running" under "status" indicates that the training job is running.
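
Because a job moves from "Creating" to "Running" and beyond, step 6 is typically repeated until the job stops. The polling loop below is a sketch only: `wait_for_job` and its `fetch_job` callback are hypothetical, and the set of phases treated as terminal is an assumption, since this section shows only "Creating" and "Running".

```python
import time

# Assumption: phases after which the job no longer changes state.
# This section itself only shows "Creating" and "Running".
TERMINAL_PHASES = {"Completed", "Failed", "Terminated", "Abnormal"}


def wait_for_job(fetch_job, interval_seconds=30, max_polls=120, sleep=time.sleep):
    """Poll fetch_job() until status.phase is terminal and return that phase.

    fetch_job is any callable returning the parsed job-detail response body.
    """
    for attempt in range(max_polls):
        phase = fetch_job()["status"]["phase"]
        if phase in TERMINAL_PHASES:
            return phase
        if attempt < max_polls - 1:
            sleep(interval_seconds)
    raise TimeoutError(f"job still not finished after {max_polls} polls")
```

Passing the fetch function and the sleep function as arguments keeps the loop independent of any particular HTTP client and lets it be exercised without waiting.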

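  7. Call the API for querying the logs of a specified task of a training job (OBS link) to get the OBS path of the training job logs.
    1. Request:

      URI format: GET https://{ma_endpoint}/v2/{project_id}/training-jobs/{training_job_id}/tasks/{task_id}/logs/url

      Request headers:

      X-Auth-Token → MIIZmgYJKoZIhvcNAQcCoIIZizCCGYcCAQExDTALBglghkgBZQMEAgEwgXXXXXX...

      Content-Type → text/plain

      Replace the following fields with actual values:

      • "task_id" is the task name of the training job, usually worker-0.
      • Content-Type selects the kind of link returned: text/plain returns a temporary OBS preview link, and application/octet-stream returns a temporary OBS download link.
    2. Status code "200 OK" is returned. The response body is as follows: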
      {
          "obs_url": "https://modelarts-training-log-cn-north-4.obs.cn-north-4.myhuaweicloud.com:443/66ff6991-fd66-40b6-8101-0829a46d3731/worker-0/modelarts-job-66ff6991-fd66-40b6-8101-0829a46d3731-worker-0.log?AWSAccessKeyId=GFGTBKOZENDD83QEMZMV&Expires=1641896599&Signature=BedFZHEU1oCmqlI912UL9mXlhkg%3D"
      }

      The returned field is the OBS path of the log. Open it in a browser to view the log.
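
The choice between a preview link and a download link comes down to the Content-Type header, which a small request builder can make explicit. The function name `build_log_url_request` and the `link_type` labels are hypothetical; the request is only constructed here, not sent.

```python
import urllib.request

# Content-Type values described in step 7 and the kind of link each returns.
LOG_LINK_TYPES = {
    "preview": "text/plain",
    "download": "application/octet-stream",
}


def build_log_url_request(ma_endpoint, project_id, training_job_id, task_id,
                          token, link_type="preview"):
    """Build (but do not send) the GET request for a task's log link."""
    url = (
        f"https://{ma_endpoint}/v2/{project_id}/training-jobs/"
        f"{training_job_id}/tasks/{task_id}/logs/url"
    )
    headers = {
        "X-Auth-Token": token,
        "Content-Type": LOG_LINK_TYPES[link_type],
    }
    return urllib.request.Request(url, headers=headers, method="GET")
```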

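  8. Call the API for querying the running metrics of a specified task of a training job to view the metric details.
    1. Request:

      URI format: GET https://{ma_endpoint}/v2/{project_id}/training-jobs/{training_job_id}/metrics/{task_id}

      Request header: X-Auth-Token → MIIZmgYJKoZIhvcNAQcCoIIZizCCGYcCAQExDTALBglghkgBZQMEAgEwgXXXXXX...

      Replace the placeholder fields with actual values.

    2. Status code "200 OK" is returned. The response body is as follows: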
      {
          "metrics": [
              {
                  "metric": "cpuUsage",
                  "value": [
                      -1,
                      -1,
                      28.622,
                      35.053,
                      39.988,
                      40.069,
                      40.082,
                      40.094
                  ]
              },
              {
                  "metric": "memUsage",
                  "value": [
                      -1,
                      -1,
                      0.544,
                      0.641,
                      0.736,
                      0.737,
                      0.738,
                      0.739
                  ]
              },
              {
                  "metric": "npuUtil",
                  "value": [
                      -1,
                      -1,
                      -1,
                      -1,
                      -1,
                      -1,
                      -1,
                      -1
                  ]
              },
              {
                  "metric": "npuMemUsage",
                  "value": [
                      -1,
                      -1,
                      -1,
                      -1,
                      -1,
                      -1,
                      -1,
                      -1
                  ]
              },
              {
                  "metric": "gpuUtil",
                  "value": [
                      -1,
                      -1,
                      -1,
                      -1,
                      -1,
                      -1,
                      -1,
                      -1
                  ]
              },
              {
                  "metric": "gpuMemUsage",
                  "value": [
                      -1,
                      -1,
                      -1,
                      -1,
                      -1,
                      -1,
                      -1,
                      -1
                  ]
              }
          ]
      }

      The response shows usage metrics such as CPU usage.
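
One way to read the metrics response is to average each series while skipping placeholder samples. The sketch below assumes that a value of -1 means "no data collected" (for example, GPU and NPU metrics on a CPU flavor); this section does not state that explicitly. The helper name `summarize_metrics` is hypothetical, and `sample` is a trimmed copy of the response above.

```python
def summarize_metrics(response_body):
    """Return {metric_name: mean of valid samples, or None if there are none}."""
    summary = {}
    for entry in response_body.get("metrics", []):
        # Assumption: -1 marks a sample for which no data was collected.
        samples = [v for v in entry["value"] if v != -1]
        summary[entry["metric"]] = sum(samples) / len(samples) if samples else None
    return summary


# Trimmed copy of the sample response shown above.
sample = {
    "metrics": [
        {"metric": "cpuUsage",
         "value": [-1, -1, 28.622, 35.053, 39.988, 40.069, 40.082, 40.094]},
        {"metric": "memUsage",
         "value": [-1, -1, 0.544, 0.641, 0.736, 0.737, 0.738, 0.739]},
        {"metric": "gpuUtil",
         "value": [-1, -1, -1, -1, -1, -1, -1, -1]},
    ]
}

summary = summarize_metrics(sample)
```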

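  9. When the training job is finished or no longer needed, call the API for deleting a training job to delete it.
    1. Request:

      URI format: DELETE https://{ma_endpoint}/v2/{project_id}/training-jobs/{training_job_id}

      Request header: X-Auth-Token → MIIZmgYJKoZIhvcNAQcCoIIZizCCGYcCAQExDTALBglghkgBZQMEAgEwgXXXXXX...

      Replace the placeholder fields with actual values.

    2. If a "202 No Content" response is returned, the job was deleted.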
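
The delete call in step 9 follows the same pattern as the earlier requests. In this sketch the function names `build_delete_request` and `delete_job` are hypothetical, and `delete_job` simply returns whatever status code the service sends back so the caller can compare it with the documented value.

```python
import urllib.request


def build_delete_request(ma_endpoint, project_id, training_job_id, token):
    """Build (but do not send) the DELETE request for one training job."""
    url = f"https://{ma_endpoint}/v2/{project_id}/training-jobs/{training_job_id}"
    return urllib.request.Request(url, headers={"X-Auth-Token": token}, method="DELETE")


def delete_job(request):
    """Send the request and return the HTTP status code of the response."""
    # urlopen raises urllib.error.HTTPError for 4xx and 5xx responses.
    with urllib.request.urlopen(request) as response:
        return response.status
```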
