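Creating a Tensorboard Training Job from a Custom Image
This section walks through a series of API calls, using model training as an example, to show how the ModelArts APIs are used.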
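Overview
The workflow for creating a training job with the PyTorch framework is as follows: obtain a token, create a Tensorboard-enabled training job from a custom image, query the job details until the Tensorboard endpoint is returned, open Tensorboard, and delete the job when it is no longer needed.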
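Prerequisites
- You have obtained the IAM endpoint and the ModelArts endpoint.
- You have confirmed the region where the service is deployed and obtained the project ID and name, the account name and ID, and the username and user ID.
- You have prepared a directory for storing Tensorboard logs, for example, the “obs://cnnorth4-job-pfs/summary” directory in an OBS parallel file system.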
Procedure
- Call the authentication API to obtain a user token.
  - Request:
    URI format: POST https://{iam_endpoint}/v3/auth/tokens
    Request header: Content-Type → application/json
    Request body:
    {
      "auth": {
        "identity": {
          "methods": ["password"],
          "password": {
            "user": {
              "name": "user_name",
              "password": "user_password",
              "domain": {
                "name": "domain_name"
              }
            }
          }
        },
        "scope": {
          "project": {
            "name": "cn-north-1"
          }
        }
      }
    }
    Replace the following placeholders with actual values:
    - iam_endpoint: the IAM endpoint.
    - user_name: the IAM username.
    - user_password: the user's login password.
    - domain_name: the name of the account to which the user belongs.
    - cn-north-1: the project name, which indicates the region where the service is deployed.
  - Response: The status code “201 Created” is returned. The token is the value of “X-Subject-Token” in the response header, for example:
    x-subject-token → MIIZmgYJKoZIhvcNAQcCoIIZizCCGYcCAQExDTALBglghkgBZQMEAgEwgXXXXXX...
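The token request above can be sketched in Python using only the standard library. This is a minimal illustration, not an official client: the function names `build_token_request` and `fetch_token` are hypothetical, and the endpoint and credentials must be supplied by the caller. Building the request is kept separate from sending it so that the payload can be inspected without network access.

```python
import json
import urllib.request


def build_token_request(iam_endpoint, user_name, user_password,
                        domain_name, project_name):
    """Build (but do not send) the POST /v3/auth/tokens request.

    Hypothetical helper; argument names mirror the placeholders in the text.
    """
    body = {
        "auth": {
            "identity": {
                "methods": ["password"],
                "password": {
                    "user": {
                        "name": user_name,
                        "password": user_password,
                        "domain": {"name": domain_name},
                    }
                },
            },
            "scope": {"project": {"name": project_name}},
        }
    }
    return urllib.request.Request(
        url=f"https://{iam_endpoint}/v3/auth/tokens",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_token(request):
    """Send the request and return the X-Subject-Token response header."""
    with urllib.request.urlopen(request) as response:
        # Header lookup on the response is case-insensitive.
        return response.headers["X-Subject-Token"]
```

A call such as `fetch_token(build_token_request(iam_endpoint, "user_name", "user_password", "domain_name", "cn-north-1"))` would then return the token string, assuming the credentials are valid.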
- Call the API for creating a training job to create a training job, and record the training job ID.
  - Request:
    URI format: POST https://{ma_endpoint}/v2/{project_id}/training-jobs
    Request headers:
    - X-Auth-Token → MIIZmgYJKoZIhvcNAQcCoIIZizCCGYcCAQExDTALBglghkgBZQMEAgEwgXXXXXX...
    - Content-Type → application/json
    Replace the placeholders (ma_endpoint, project_id, and the token value) with actual values.
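    Request body:
    {
      "kind": "job",
      "metadata": {
        "name": "test-tensorboard-cpu01",
        "description": "test tensorboard cpu",
        "annotations": {
          "tensorboard/enable": "true"
        }
      },
      "algorithm": {
        "engine": {
          "image_url": "atelier/pytorch_cuda:pytorch_2.7.0-cuda_12.8-py_3.11.10-ubuntu_22.04-x86_64-20251215163925-4e5422a"
        },
        "command": "sleep 1h",
        "summary": {
          "data_sources": [
            {
              "pfs": {
                "pfs_path": "obs://notebook-pfs-storage/wzy/runs/"
              }
            }
          ]
        }
      },
      "spec": {
        "resource": {
          "pool_id": "pool-test-train-xx",
          "node_count": 1
        },
        "log_export_path": {
          "obs_url": "/cnnorth4-job-test-v2/pytorch/fast_example/log/"
        }
      }
    }
    Set the fields according to your actual values:
    - “kind”: the type of the training job. The default value is job.
    - “name” and “description” under “metadata”: the name and description of the training job.
    - “annotations” under “metadata”: set "tensorboard/enable": "true" to enable Tensorboard.
    - “code_dir” and “local_code_dir” under “algorithm”: the code directory and the local directory inside the job to which the code is downloaded. Neither is set in the request body of this example.
    - “image_url” under “engine” in “algorithm”: the address of the training job image.
    - “command” under “algorithm”: the startup command of the training job.
    - “data_sources” under “summary” in “algorithm”: the directory where Tensorboard logs are stored.
    - “pool_id” under “resource” in “spec”: the ID of the resource pool that the training job uses. “node_count” is the number of nodes; a value greater than 1 means multi-node (distributed) training, and this single-node example uses the default value “1”. “log_export_path” specifies the OBS directory to which logs are uploaded.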
  - Response: The status code “201 Created” indicates that the training job was created. The response body is as follows:
    {
      "kind": "job",
      "metadata": {
        "id": "66ff6991-fd66-40b6-8101-0829a46d3731",
        "name": "test-pytorch-cpu01",
        "description": "test pytorch work cpu",
        "create_time": 1641892642625,
        "workspace_id": "0",
        "ai_project": "default-ai-project",
        "user_name": "",
        "annotations": {
          "job_template": "Template DL",
          "key_task": "worker"
        }
      },
      "status": {
        "phase": "Creating",
        "secondary_phase": "Creating",
        "duration": 0,
        "start_time": 0,
        "node_count_metrics": null,
        "tasks": ["worker-0"]
      },
      "algorithm": {
        "code_dir": "/cnnorth4-job-test-v2/pytorch/fast_example/code/cpu/",
        "local_code_dir": "/home/ma-user/modelarts/user-job-dir",
        "engine": {
          "image_url": "atelier/pytorch_cuda:pytorch_2.7.0-cuda_12.8-py_3.11.10-ubuntu_22.04-x86_64-20251215163925-4e5422a"
        },
        "command": "python ${MA_JOB_DIR}/cpu/train.py"
      },
      "spec": {
        "resource": {
          "pool_id": "pool-test-train-xx",
          "node_count": 1
        },
        "log_export_path": {
          "obs_url": "/cnnorth4-job-test-v2/pytorch/fast_example/log/"
        }
      }
    }
    - Record the value of “id” under “metadata” (the training job ID) for use in later steps.
    - “phase” and “secondary_phase” under “status” indicate the status and the secondary status of the training job. In this example, “Creating” means that the training job is being created.
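The create-job call can be sketched as follows. This is an illustrative outline under stated assumptions: `build_create_job_request` and `job_id_from_response` are hypothetical helper names, the body contains only the fields shown in the sample request above, and all values are supplied by the caller.

```python
import json
import urllib.request


def build_create_job_request(ma_endpoint, project_id, token, name, image_url,
                             command, summary_pfs_path, pool_id, log_obs_url,
                             description=""):
    """Build (but do not send) the POST /v2/{project_id}/training-jobs request."""
    body = {
        "kind": "job",
        "metadata": {
            "name": name,
            "description": description,
            # This annotation is what enables Tensorboard for the job.
            "annotations": {"tensorboard/enable": "true"},
        },
        "algorithm": {
            "engine": {"image_url": image_url},
            "command": command,
            # Directory where Tensorboard logs are stored.
            "summary": {
                "data_sources": [{"pfs": {"pfs_path": summary_pfs_path}}]
            },
        },
        "spec": {
            "resource": {"pool_id": pool_id, "node_count": 1},
            "log_export_path": {"obs_url": log_obs_url},
        },
    }
    return urllib.request.Request(
        url=f"https://{ma_endpoint}/v2/{project_id}/training-jobs",
        data=json.dumps(body).encode("utf-8"),
        headers={"X-Auth-Token": token, "Content-Type": "application/json"},
        method="POST",
    )


def job_id_from_response(response_body):
    """Extract the training job ID ("metadata" -> "id") from the response body."""
    return json.loads(response_body)["metadata"]["id"]
```

Sending the request with `urllib.request.urlopen` and passing the decoded response body to `job_id_from_response` yields the ID needed by the following steps.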
- Call the API for querying training job details to check the status of the training job, using the ID returned when the job was created.
  - Request:
    URI format: GET https://{ma_endpoint}/v2/{project_id}/training-jobs/{training_job_id}
    Request header: X-Auth-Token → MIIZmgYJKoZIhvcNAQcCoIIZizCCGYcCAQExDTALBglghkgBZQMEAgEwgXXXXXX...
    Replace the placeholders with actual values:
    “training_job_id” is the training job ID recorded when the training job was created.
  - Response: The status code “200 OK” is returned. The response body is as follows:
    {
      "kind": "job",
      "metadata": {
        "id": "66ff6991-fd66-40b6-8101-0829a46d3731",
        "name": "test-pytorch-cpu01",
        "description": "test pytorch work cpu in mode gloo",
        "create_time": 1641892642625,
        "workspace_id": "0",
        "ai_project": "default-ai-project",
        "user_name": "hwstaff_z00424192",
        "annotations": {
          "job_template": "Template DL",
          "key_task": "worker"
        }
      },
      "status": {
        "phase": "Running",
        "secondary_phase": "Running",
        "duration": 268000,
        "start_time": 1641892655000,
        "node_count_metrics": [
          [1641892645000, 0],
          [1641892654000, 0],
          [1641892655000, 1],
          [1641892922000, 1],
          [1641892923000, 1]
        ],
        "tasks": ["worker-0"]
      },
      "algorithm": {
        "id": "01c399ae-8593-4ef5-9e4d-085950aacde1",
        "name": "test-pytorch-cpu",
        "code_dir": "/cnnorth4-job-test-v2/pytorch/fast_example/code/cpu/",
        "boot_file": "/cnnorth4-job-test-v2/pytorch/fast_example/code/cpu/test-pytorch.py",
        "parameters": [
          {
            "name": "dist",
            "description": "",
            "i18n_description": null,
            "value": "False",
            "constraint": {
              "type": "Boolean",
              "editable": true,
              "required": false,
              "sensitive": false,
              "valid_type": "None",
              "valid_range": []
            }
          },
          {
            "name": "world_size",
            "description": "",
            "i18n_description": null,
            "value": "1",
            "constraint": {
              "type": "Integer",
              "editable": true,
              "required": false,
              "sensitive": false,
              "valid_type": "None",
              "valid_range": []
            }
          }
        ],
        "parameters_customization": true,
        "inputs": [
          {
            "name": "data_url",
            "description": "Data source 1",
            "local_dir": "/home/ma-user/modelarts/inputs/data_url_0",
            "remote": {
              "obs": {
                "obs_url": "/cnnorth4-job-test-v2/pytorch/fast_example/data/"
              }
            }
          }
        ],
        "outputs": [
          {
            "name": "train_url",
            "description": "Output data 1",
            "local_dir": "/home/ma-user/modelarts/outputs/train_url_0",
            "remote": {
              "obs": {
                "obs_url": "/cnnorth4-job-test-v2/pytorch/fast_example/outputs/"
              }
            },
            "mode": "upload_periodically",
            "period": 30
          }
        ],
        "engine": {
          "engine_id": "pytorch_1.8.0-cuda_10.2-py_3.7-ubuntu_18.04-x86_64",
          "engine_name": "PyTorch",
          "engine_version": "pytorch_1.8.0-cuda_10.2-py_3.7-ubuntu_18.04-x86_64",
          "usage": "training",
          "support_groups": "public",
          "tags": [
            {
              "key": "auto_search",
              "value": "True"
            }
          ],
          "v1_compatible": false,
          "run_user": "1102"
        }
      },
      "spec": {
        "resource": {
          "flavor_id": "modelarts.vm.cpu.8u",
          "flavor_name": "Computing CPU(8U) instance",
          "node_count": 1,
          "flavor_detail": {
            "flavor_type": "CPU",
            "billing": {
              "code": "modelarts.vm.cpu.8u",
              "unit_num": 1
            },
            "flavor_info": {
              "cpu": {
                "arch": "x86",
                "core_num": 8
              },
              "memory": {
                "size": 32,
                "unit": "GB"
              },
              "disk": {
                "size": 50,
                "unit": "GB"
              }
            }
          }
        },
        "log_export_path": {
          "obs_url": "/cnnorth4-job-test-v2/pytorch/fast_example/log/"
        },
        "is_hosted_log": true
      },
      "endpoints": {
        "tensorboard": {
          "url": "https://authoring-modelarts-cnnorth7.ulanqab.huawei.com/3efaddb9-66ff6991-fd66-40b6-8101-0829a46d3731/proxy/6006/",
          "token": "fa8e321f772xxxxxxxxxxx3f1844d06"
        }
      }
    }
    The response shows the details of the training job. A “phase” value of “Running” under “status” indicates that the training job is running. After the job starts running, Tensorboard is started. Once Tensorboard has started successfully, “endpoints” in the job details returns the Tensorboard URL and token.
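Because the Tensorboard address appears in “endpoints” only after Tensorboard has started, a client typically queries the job details repeatedly. The sketch below shows one way to do that. The helper names, the polling interval, and the attempt limit are all assumptions made for illustration; the source does not specify how long startup takes or which phases are terminal, so the loop simply gives up after a fixed number of attempts.

```python
import json
import time
import urllib.request


def build_job_detail_request(ma_endpoint, project_id, training_job_id, token):
    """Build the GET /v2/{project_id}/training-jobs/{training_job_id} request."""
    return urllib.request.Request(
        url=f"https://{ma_endpoint}/v2/{project_id}/training-jobs/{training_job_id}",
        headers={"X-Auth-Token": token},
        method="GET",
    )


def fetch_job_details(request):
    """Send the request and return the decoded JSON response body."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def extract_tensorboard(job):
    """Return (url, token) if the Tensorboard endpoint is present, else None."""
    tensorboard = (job.get("endpoints") or {}).get("tensorboard") or {}
    if tensorboard.get("url") and tensorboard.get("token"):
        return tensorboard["url"], tensorboard["token"]
    return None


def wait_for_tensorboard(fetch_job, interval_seconds=10, max_attempts=60):
    """Call fetch_job() until the Tensorboard endpoint appears.

    fetch_job is any zero-argument callable returning the job details dict.
    The interval and attempt limit are illustrative defaults, not documented values.
    """
    for _ in range(max_attempts):
        found = extract_tensorboard(fetch_job())
        if found is not None:
            return found
        time.sleep(interval_seconds)
    raise TimeoutError("Tensorboard endpoint was not returned in time")
```

For example, `wait_for_tensorboard(lambda: fetch_job_details(build_job_detail_request(ma_endpoint, project_id, training_job_id, token)))` returns the URL and token once they are available. Passing the fetcher as a callable keeps the polling logic testable without network access.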
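- Open the Tensorboard endpoint to view the visualized metrics.
    URI format: GET {tensorboard_endpoint}?token={tensorboard_token}
    Replace the placeholders with actual values:
    - “tensorboard_endpoint”: the Tensorboard URL of the training job.
    - “tensorboard_token”: the Tensorboard access credential of the training job.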
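The Tensorboard address is the endpoint URL with the token appended as a query parameter. A small sketch of composing it is shown below; `build_tensorboard_url` is a hypothetical helper and assumes that the endpoint URL does not already carry a query string.

```python
from urllib.parse import urlencode


def build_tensorboard_url(tensorboard_endpoint, tensorboard_token):
    """Append the access token to the Tensorboard endpoint as ?token=...

    Assumes tensorboard_endpoint has no existing query string.
    """
    # urlencode percent-escapes any characters in the token that are not URL-safe.
    return f"{tensorboard_endpoint}?{urlencode({'token': tensorboard_token})}"
```

The resulting URL can be opened in a browser or requested with GET.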
- When the training job has finished or is no longer needed, call the API for deleting a training job to delete it.
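This section does not list the URI of the delete API. The sketch below assumes that it is the same resource path as the details query, called with the DELETE method. Treat both that path and the helper name `build_delete_job_request` as assumptions to verify against the API reference before use.

```python
import urllib.request


def build_delete_job_request(ma_endpoint, project_id, training_job_id, token):
    """Build (but do not send) a DELETE request for a training job.

    Assumption: the delete API shares the path of the details query,
    /v2/{project_id}/training-jobs/{training_job_id}.
    """
    return urllib.request.Request(
        url=f"https://{ma_endpoint}/v2/{project_id}/training-jobs/{training_job_id}",
        headers={"X-Auth-Token": token},
        method="DELETE",
    )
```

The request can be sent with `urllib.request.urlopen`, which raises `urllib.error.HTTPError` if the service returns an error status.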