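Creating a Training Job from a Custom Image
This section uses model training as an example and walks through a series of API calls to show how the ModelArts APIs are used.
Overview
The process of creating a training job with the PyTorch framework is as follows: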
- Call the authentication API to obtain a user token. The token must be placed in the request header of all subsequent requests for authentication.
- Call the API for creating a training job to create a training job from the custom image, and record the training job ID.
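- Call the API for querying training job details to check the job status, using the ID returned when the job was created.
- Call the API for querying the logs (OBS link) of a specified task of a training job to obtain the OBS path of the training job logs.
- Call the API for querying the running metrics of a specified task of a training job to view the job's running metrics.
- When the training job has finished or is no longer needed, call the API for deleting a training job to delete it.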
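Prerequisites
- You have obtained the endpoints of IAM and ModelArts.
- You have confirmed the region where the service is deployed and obtained the project ID and name, the account name and ID, and the username and user ID.
- You have prepared the PyTorch training code. For example, the boot file “test-pytorch.py” is stored in the OBS directory “obs://cnnorth4-job-test-v2/pytorch/fast_example/code/cpu”.
- You have created the log output location for the training job, for example, “obs://cnnorth4-job-test-v2/pytorch/fast_example/log”.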
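Procedure
- Call the authentication API to obtain the user token.
- Request message:
URI format: POST https://{iam_endpoint}/v3/auth/tokens
Request header: Content-Type →application/json
Request body: { "auth": { "identity": { "methods": ["password"], "password": { "user": { "name": "user_name", "password": "user_password", "domain": { "name": "domain_name" } } } }, "scope": { "project": { "name": "cn-north-1" } } } }
Replace the following placeholder values with actual values:
- iam_endpoint: the IAM endpoint.
- user_name: the IAM username.
- user_password: the user's login password.
- domain_name: the name of the account to which the user belongs.
- cn-north-1: the project name, which indicates the region where the service is deployed.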
- The status code “201 Created” is returned. The value of “X-Subject-Token” in the response header is the token, for example:
x-subject-token →MIIZmgYJKoZIhvcNAQcCoIIZizCCGYcCAQExDTALBglghkgBZQMEAgEwgXXXXXX...
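The token request above can be sketched in Python using only the standard library. This is a minimal illustration, not an official client: the function names and the endpoint `iam.example.com` are hypothetical, and the credentials are the placeholder values from the request body.

```python
import json
import urllib.request


def build_token_request(iam_endpoint, user_name, user_password, domain_name, project_name):
    """Build (but do not send) the POST /v3/auth/tokens request."""
    body = {
        "auth": {
            "identity": {
                "methods": ["password"],
                "password": {
                    "user": {
                        "name": user_name,
                        "password": user_password,
                        "domain": {"name": domain_name},
                    }
                },
            },
            "scope": {"project": {"name": project_name}},
        }
    }
    return urllib.request.Request(
        url=f"https://{iam_endpoint}/v3/auth/tokens",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_token(request):
    """Send the request and return the token from the X-Subject-Token header."""
    with urllib.request.urlopen(request) as response:
        # Header lookup is case-insensitive, so "x-subject-token" also matches.
        return response.headers["X-Subject-Token"]


# Hypothetical endpoint and placeholder credentials, for illustration only.
token_request = build_token_request(
    "iam.example.com", "user_name", "user_password", "domain_name", "cn-north-1"
)
```

Calling `fetch_token(token_request)` against a real IAM endpoint would return the token string to pass as “X-Auth-Token” in the following steps.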
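- Call the API for creating a training job to create a training job from the custom image, and record the training job ID.
- Request message:
URI format: POST https://{ma_endpoint}/v2/{project_id}/training-jobs
Request headers:
- X-Auth-Token →MIIZmgYJKoZIhvcNAQcCoIIZizCCGYcCAQExDTALBglghkgBZQMEAgEwgXXXXXX...
- Content-Type →application/json
Replace the placeholder values with actual values.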
Request body:
{ "kind": "job", "metadata": { "name": "test-pytorch-cpu01", "description": "test pytorch work cpu" }, "algorithm": { "code_dir": "/cnnorth4-job-test-v2/pytorch/fast_example/code/cpu/", "local_code_dir": "/home/ma-user/modelarts/user-job-dir", "engine": { "image_url": "atelier/pytorch_cuda:pytorch_2.7.0-cuda_12.8-py_3.11.10-ubuntu_22.04-x86_64-20251215163925-4e5422a" }, "command": "python ${MA_JOB_DIR}/cpu/test-pytorch.py" }, "spec": { "resource": { "node_count": 1, "pool_id": "pool-maostest-train-06024304be00d5092fbdc0013d201342" }, "log_export_path": { "obs_url": "/cnnorth4-job-test-v2/pytorch/fast_example/log/" } } }
Replace the following field values with actual values:
- “kind”: the type of the training job. The default value is job.
- “name” and “description” under “metadata”: the name and description of the training job.
- “code_dir” and “local_code_dir” under “algorithm”: the code directory, and the local directory inside the job to which the code is downloaded.
- “image_url” under “engine” in “algorithm”: the address of the training job image.
- “command” under “algorithm”: the boot command of the training job.
- “pool_id” under “resource” in “spec”: the ID of the resource pool that the training job depends on. “node_count” indicates whether multi-node (distributed) training is required; this example is single-node, so the default value “1” is used. “log_export_path” specifies the OBS directory to which logs are uploaded.
- The status code “201 Created” is returned, indicating that the training job has been created. The response body is as follows:
{ "kind": "job", "metadata": { "id": "31318695-2011-4e48-9b90-9c9178c57951", "name": "test-pytorch-cpu01", "description": "test pytorch work cpu", "create_time": 1777545352008, "workspace_id": "0", "ai_project": "default-ai-project", "labels": { "training-job": "modelarts-os" }, "user_name": "", "annotations": { "job_template": "Template DL", "key_task": "worker" }, "training_experiment_reference": {}, "tags": [] }, "status": { "phase": "Pending", "secondary_phase": "Creating", "pending_time": 1000, "duration": 0, "is_hanged": false, "retry_count": 0, "start_time": 0, "node_count_metrics": null, "tasks": [ "worker-0" ], "metrics_statistics": { "cpu_usage": { "average": -1, "max": -1, "min": -1 }, "mem_usage": { "average": -1, "max": -1, "min": -1 } } }, "algorithm": { "code_dir": "/cnnorth4-job-test-v2/pytorch/fast_example/code/cpu/", "local_code_dir": "/home/ma-user/modelarts/user-job-dir", "command": "python ${MA_JOB_DIR}/cpu/test-pytorch.py", "engine": { "engine_id": "", "engine_name": "", "engine_version": "", "v1_compatible": false, "image_url": "atelier/pytorch_cuda:pytorch_2.7.0-cuda_12.8-py_3.11.10-ubuntu_22.04-x86_64-20251215163925-4e5422a", "non_swr_image": false, "run_user": "", "image_source": true, "image_repo_id": "", "image_id": "" } }, "spec": { "resource": { "pool_id": "pool-maostest-train-06024304be00d5092fbdc0013d201342", "pool_resource_flavor": "", "node_count": 1, "pool_info": { "cpu_arch": "x86", "core_num": 5, "mem_size": 22, "cache_size": 0, "accelerator": "", "accelerator_num": 0, "accelerator_type": "", "accelerator_size": 0, "variant": "", "huge_pages": 0, "x_parameter_plane": "", "use_privileged": false, "use_host_network": false, "use_ib_network": false, "project_id": "", "pool_resource_flavor": "liumuqi-eni-test", "pool_id": "pool-maostest-train-06024304be00d5092fbdc0013d201342", "cluster_id": "", "maos_pool": true, "quota_id": "", "maos_migrated": false, "detect_all_in_int": false, "pool_type": "", "enable_cabinet": false, "enable_memarts": false, "enable_ems": false, "empty_dir_size": 0 }, "main_container_allocated_resources": { "cpu_arch": "x86", "cpu_core_num": 4, "mem_size": 20, "accelerator_num": 0, "accelerator_type": "" } }, "log_export_path": { "obs_url": "/cnnorth4-job-test-v2/pytorch/fast_example/log/" }, "is_hosted_log": true, "runtime_type": "production" }, "ftjob_config": { "checkpoint_config": { "save_checkpoints_max": 0, "checkpoint_id": "", "skipped_steps": 0, "restore_training": 0 }, "task_env": { "envs": null } } }
- Record the value of “id” under “metadata” (the training job ID) for use in later steps.
- “phase” and “secondary_phase” under “status” indicate the status and secondary status of the training job. In this example, “Creating” indicates that the training job is being created.
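As a rough sketch of how the request body for this step might be assembled, the Python below builds the same JSON structure from a handful of parameters and reads the job ID from a response. The helper names are hypothetical. The conversion from an “obs://” URL to the “/bucket/path/” form is inferred from comparing the prerequisites with the example body, so treat it as an assumption.

```python
def obs_to_api_path(obs_url):
    """Convert 'obs://bucket/dir' to the '/bucket/dir/' form seen in the request body."""
    prefix = "obs://"
    if not obs_url.startswith(prefix):
        raise ValueError(f"not an OBS URL: {obs_url}")
    return "/" + obs_url[len(prefix):].strip("/") + "/"


def build_training_job_body(name, description, code_obs_url, log_obs_url,
                            image_url, command, pool_id, node_count=1):
    """Assemble the body for POST /v2/{project_id}/training-jobs."""
    return {
        "kind": "job",
        "metadata": {"name": name, "description": description},
        "algorithm": {
            "code_dir": obs_to_api_path(code_obs_url),
            "local_code_dir": "/home/ma-user/modelarts/user-job-dir",
            "engine": {"image_url": image_url},
            "command": command,
        },
        "spec": {
            "resource": {"node_count": node_count, "pool_id": pool_id},
            "log_export_path": {"obs_url": obs_to_api_path(log_obs_url)},
        },
    }


def extract_job_id(response_body):
    """Return the training job ID ('metadata.id') from a create-job response."""
    return response_body["metadata"]["id"]


# Values taken from the example request in this section.
job_body = build_training_job_body(
    name="test-pytorch-cpu01",
    description="test pytorch work cpu",
    code_obs_url="obs://cnnorth4-job-test-v2/pytorch/fast_example/code/cpu",
    log_obs_url="obs://cnnorth4-job-test-v2/pytorch/fast_example/log",
    image_url="atelier/pytorch_cuda:pytorch_2.7.0-cuda_12.8-py_3.11.10-ubuntu_22.04-x86_64-20251215163925-4e5422a",
    command="python ${MA_JOB_DIR}/cpu/test-pytorch.py",
    pool_id="pool-maostest-train-06024304be00d5092fbdc0013d201342",
)
```

Serializing `job_body` with `json.dumps` yields a payload with the same fields as the example request body.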
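- Call the API for querying training job details to check the job status, using the ID returned when the job was created.
- Request message:
URI format: GET https://{ma_endpoint}/v2/{project_id}/training-jobs/{training_job_id}
Request header: X-Auth-Token →MIIZmgYJKoZIhvcNAQcCoIIZizCCGYcCAQExDTALBglghkgBZQMEAgEwgXXXXXX...
Replace the placeholder values with actual values:
“training_job_id” is the training job ID recorded in step 2 (creating the training job).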
- The status code “200 OK” is returned. The response body is as follows:
{ "kind": "job", "metadata": { "id": "31318695-2011-4e48-9b90-9c9178c57951", "name": "test-pytorch-cpu01", "description": "test pytorch work cpu", "create_time": 1777545352008, "workspace_id": "0", "ai_project": "default-ai-project", "labels": { "training-job": "modelarts-os" }, "user_name": "modelarts_xxx", "annotations": { "job_template": "Template DL", "key_task": "worker" }, "training_experiment_reference": {}, "tags": [] }, "status": { "phase": "Running", "secondary_phase": "Running", "pending_time": 68992, "duration": 4000, "is_hanged": false, "retry_count": 0, "task_ips": [ { "task": "worker-0", "ip": "172.16.0.31", "host_ip": "192.168.140.98", "schedule_count": 1 } ], "start_time": 1777545421000, "node_count_metrics": [ [ 1777545411000, 0 ], [ 1777545420000, 0 ], [ 1777545421000, 1 ], [ 1777545424000, 1 ], [ 1777545425000, 1 ] ], "tasks": [ "worker-0" ], "metrics_statistics": { "cpu_usage": { "average": -1, "max": -1, "min": -1 }, "mem_usage": { "average": -1, "max": -1, "min": -1 } }, "running_records": [ { "start_at": 1777545424, "start_type": "init_or_rescheduled" } ] }, "algorithm": { "code_dir": "/cnnorth4-job-test-v2/pytorch/fast_example/code/cpu/", "local_code_dir": "/home/ma-user/modelarts/user-job-dir", "command": "python ${MA_JOB_DIR}/cpu/test-pytorch.py", "engine": { "engine_id": "", "engine_name": "", "engine_version": "", "v1_compatible": false, "image_url": "atelier/pytorch_cuda:pytorch_2.7.0-cuda_12.8-py_3.11.10-ubuntu_22.04-x86_64-20251215163925-4e5422a", "non_swr_image": false, "run_user": "", "image_source": true, "image_repo_id": "", "image_id": "" } }, "spec": { "resource": { "pool_id": "pool-maostest-train-06024304be00d5092fbdc0013d201342", "pool_resource_flavor": "", "node_count": 1, "pool_info": { "cpu_arch": "x86", "core_num": 5, "mem_size": 22, "cache_size": 0, "accelerator": "", "accelerator_num": 0, "accelerator_type": "", "accelerator_size": 0, "variant": "", "huge_pages": 0, "x_parameter_plane": "", "use_privileged": false, "use_host_network": false, "use_ib_network": false, "project_id": "", "pool_resource_flavor": "liumuqi-eni-test", "pool_id": "pool-maostest-train-06024304be00d5092fbdc0013d201342", "cluster_id": "", "maos_pool": true, "quota_id": "", "maos_migrated": false, "detect_all_in_int": false, "pool_type": "", "enable_cabinet": false, "enable_memarts": false, "enable_ems": false, "empty_dir_size": 0 }, "main_container_allocated_resources": { "cpu_arch": "x86", "cpu_core_num": 4, "mem_size": 20, "accelerator_num": 0, "accelerator_type": "" } }, "log_export_path": { "obs_url": "/cnnorth4-job-test-v2/pytorch/fast_example/log/" }, "is_hosted_log": true, "runtime_type": "production" }, "ftjob_config": { "checkpoint_config": { "save_checkpoints_max": 0, "checkpoint_id": "", "skipped_steps": 0, "restore_training": 0 }, "task_env": { "envs": null } } }
The response provides the details of the training job. Here “phase” under “status” is “Running”, which indicates that the training job is running.
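In practice the details query is usually repeated until the job finishes. The sketch below shows one possible polling loop in Python. The set of terminal phases is an assumption and should be checked against the API reference, and `fetch_job` is a hypothetical callable that performs the GET request and returns the parsed response body.

```python
import time

# Assumption: phases treated as terminal here; confirm the list in the API reference.
TERMINAL_PHASES = {"Completed", "Failed", "Terminated", "Abnormal"}


def wait_for_job(fetch_job, interval_seconds=10.0, max_polls=60, sleep=time.sleep):
    """Call fetch_job() until 'status.phase' is terminal, then return that phase."""
    if max_polls < 1:
        raise ValueError("max_polls must be at least 1")
    phase = None
    for attempt in range(max_polls):
        phase = fetch_job()["status"]["phase"]
        if phase in TERMINAL_PHASES:
            return phase
        if attempt < max_polls - 1:
            sleep(interval_seconds)  # wait before the next query
    raise TimeoutError(f"job still in phase {phase!r} after {max_polls} polls")


# Simulated sequence of responses, standing in for real API calls.
fake_responses = iter([
    {"status": {"phase": "Pending", "secondary_phase": "Creating"}},
    {"status": {"phase": "Running", "secondary_phase": "Running"}},
    {"status": {"phase": "Completed", "secondary_phase": "Completed"}},
])
final_phase = wait_for_job(lambda: next(fake_responses), sleep=lambda seconds: None)
```

Passing the fetch function and the sleep function as parameters keeps the loop independent of any HTTP library and lets it be exercised without network access.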
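- Call the API for querying the logs (OBS link) of a specified task of a training job to obtain the OBS path of the training job logs.
- Request message:
URI format: GET https://{ma_endpoint}/v2/{project_id}/training-jobs/{training_job_id}/tasks/{task_id}/logs/url
Request headers: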
X-Auth-Token→MIIZmgYJKoZIhvcNAQcCoIIZizCCGYcCAQExDTALBglghkgBZQMEAgEwgXXXXXX...
Content-Type→text/plain
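Replace the placeholder values with actual values:
- “task_id” is the task name of the training job, typically worker-0.
- Content-Type determines the type of link returned: text/plain returns a temporary OBS preview link, and application/octet-stream returns a temporary OBS download link.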
- The status code “200 OK” is returned. The response body is as follows:
{ "obs_url": "https://modelarts-training-log-cn-north-4.obs.cn-north-4.myhuaweicloud.com:443/66ff6991-fd66-40b6-8101-0829a46d3731/worker-0/modelarts-job-66ff6991-fd66-40b6-8101-0829a46d3731-worker-0.log?AWSAccessKeyId=GFGTBKOZENDD83QEMZMV&Expires=1641896599&Signature=BedFZHEU1oCmqlI912UL9mXlhkg%3D" }
The returned field is the OBS path of the log. Copy the link into a browser to view the log.
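Because the returned link is temporary, it can help to check when it expires before using it. The Python sketch below selects the Content-Type for a preview or download link and reads the “Expires” query parameter from a returned URL. It assumes that “Expires” is a Unix timestamp in seconds, as is common for signed URLs of this style. The function names and the shortened sample URL are hypothetical.

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs, urlsplit


def build_log_url_request(ma_endpoint, project_id, training_job_id, task_id, token, download=False):
    """Return (url, headers) for the log-link query; download=True asks for a download link."""
    url = (
        f"https://{ma_endpoint}/v2/{project_id}/training-jobs/"
        f"{training_job_id}/tasks/{task_id}/logs/url"
    )
    content_type = "application/octet-stream" if download else "text/plain"
    return url, {"X-Auth-Token": token, "Content-Type": content_type}


def link_expiry(obs_url):
    """Parse the 'Expires' query parameter (assumed Unix seconds) into a UTC datetime."""
    query = parse_qs(urlsplit(obs_url).query)
    return datetime.fromtimestamp(int(query["Expires"][0]), tz=timezone.utc)


# Shortened, hypothetical URL that reuses the Expires value from the example response.
sample_url = "https://example.com/worker-0/job.log?AWSAccessKeyId=XXX&Expires=1641896599&Signature=XXX"
expiry = link_expiry(sample_url)

log_url, log_headers = build_log_url_request(
    "modelarts.example.com", "project_id", "31318695-2011-4e48-9b90-9c9178c57951", "worker-0", "TOKEN"
)
```

Comparing `expiry` with `datetime.now(timezone.utc)` indicates whether a new link needs to be requested.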
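- Call the API for querying the running metrics of a specified task of a training job to view the job's running metrics.
- Request message:
URI format: GET https://{ma_endpoint}/v2/{project_id}/training-jobs/{training_job_id}/metrics/{task_id}
Request header: X-Auth-Token →MIIZmgYJKoZIhvcNAQcCoIIZizCCGYcCAQExDTALBglghkgBZQMEAgEwgXXXXXX...
Replace the placeholder values with actual values.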
- The status code “200 OK” is returned. The response body is as follows:
{ "metrics": [ { "metric": "cpuUsage", "value": [ -1, -1, 28.622, 35.053, 39.988, 40.069, 40.082, 40.094 ] }, { "metric": "memUsage", "value": [ -1, -1, 0.544, 0.641, 0.736, 0.737, 0.738, 0.739 ] }, { "metric": "npuUtil", "value": [ -1, -1, -1, -1, -1, -1, -1, -1 ] }, { "metric": "npuMemUsage", "value": [ -1, -1, -1, -1, -1, -1, -1, -1 ] }, { "metric": "gpuUtil", "value": [ -1, -1, -1, -1, -1, -1, -1, -1 ] }, { "metric": "gpuMemUsage", "value": [ -1, -1, -1, -1, -1, -1, -1, -1 ] } ] }
The response shows usage metrics such as CPU and memory usage.
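A response like the one above can be condensed into per-metric statistics. The Python sketch below does this under the assumption that a value of -1 means "no sample available", which is how the NPU and GPU series of this CPU-only job appear. The function name is hypothetical.

```python
def summarize_metrics(response_body):
    """Return {metric: {'average', 'max', 'min'}} or None when a metric has no samples."""
    summary = {}
    for item in response_body["metrics"]:
        # Assumption: -1 marks a missing sample and is excluded from the statistics.
        samples = [value for value in item["value"] if value != -1]
        if samples:
            summary[item["metric"]] = {
                "average": sum(samples) / len(samples),
                "max": max(samples),
                "min": min(samples),
            }
        else:
            summary[item["metric"]] = None
    return summary


# Subset of the example response in this section.
sample = {
    "metrics": [
        {"metric": "cpuUsage", "value": [-1, -1, 28.622, 35.053, 39.988, 40.069, 40.082, 40.094]},
        {"metric": "memUsage", "value": [-1, -1, 0.544, 0.641, 0.736, 0.737, 0.738, 0.739]},
        {"metric": "gpuUtil", "value": [-1, -1, -1, -1, -1, -1, -1, -1]},
    ]
}
summary = summarize_metrics(sample)
```

Metrics that report only -1, such as “gpuUtil” here, map to `None` so that callers can tell "not applicable" apart from a real reading of zero.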
- When the training job has finished or is no longer needed, call the API for deleting a training job to delete it.
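This section does not show the delete request itself. The Python sketch below is therefore an assumption: it supposes that the delete API uses the same resource path as the details query with the DELETE method, which should be verified in the API reference before use. The function name and endpoint are hypothetical.

```python
import urllib.request


def build_delete_job_request(ma_endpoint, project_id, training_job_id, token):
    """Build (but do not send) a DELETE request for a training job."""
    # Assumption: same path as the details query, with the DELETE method.
    return urllib.request.Request(
        url=f"https://{ma_endpoint}/v2/{project_id}/training-jobs/{training_job_id}",
        headers={"X-Auth-Token": token},
        method="DELETE",
    )


# Hypothetical endpoint and placeholder values, for illustration only.
delete_request = build_delete_job_request(
    "modelarts.example.com", "project_id", "31318695-2011-4e48-9b90-9c9178c57951", "TOKEN"
)
```

Sending it with `urllib.request.urlopen(delete_request)` would issue the call; an `urllib.error.HTTPError` is raised if the service responds with an error status.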