Using a Custom Image to Create a Training Job
This section describes how to train a model by calling ModelArts APIs.
Overview
The process for creating a training job using PyTorch is as follows:
- Call the API for authentication to obtain a user token, which will be added in a request header for authentication.
- Call the API for creating a training job to create a training job using a custom image and record the job ID.
- Call the API for querying details about a training job to query the job status using the job ID.
- Call the API for querying the logs of a specified task in a training job (OBS link) to obtain the OBS path of the training job logs.
- Call the API for querying the running metrics of a specified task in a training job to view detailed metrics of the job.
- Call the API for deleting a training job to delete the job if it is no longer needed.
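The steps above map onto six HTTP calls. As a rough orientation aid, the sketch below collects the method and URI template of each call, copied from the Procedure section, into one table. The dictionary keys and the render helper are illustrative names, not part of ModelArts.

```python
# Method and URI template for each step of the workflow.
# Templates are taken from the Procedure section; the keys are illustrative.
JOB_URL = "https://{ma_endpoint}/v2/{project_id}/training-jobs/{training_job_id}"

WORKFLOW = {
    "get_token": ("POST", "https://{iam_endpoint}/v3/auth/tokens"),
    "create_job": ("POST", "https://{ma_endpoint}/v2/{project_id}/training-jobs"),
    "get_job": ("GET", JOB_URL),
    "get_log_url": ("GET", JOB_URL + "/tasks/{task_id}/logs/url"),
    "get_metrics": ("GET", JOB_URL + "/metrics/{task_id}"),
    "delete_job": ("DELETE", JOB_URL),
}


def render(step, **params):
    """Return (method, url) for a step, filling in the URI placeholders."""
    method, template = WORKFLOW[step]
    return method, template.format(**params)


if __name__ == "__main__":
    # Placeholder values only; replace them with site-specific ones.
    print(render("get_job", ma_endpoint="ma.example.com",
                 project_id="my_project_id", training_job_id="my_job_id"))
```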
Prerequisites
- You have obtained the endpoints of IAM and ModelArts.
- The following information is available: region where ModelArts is deployed, project ID and name, account name and ID, and username and user ID.
- The PyTorch training code is available. For example, the startup file test-pytorch.py has been stored in the obs://cnnorth4-job-test-v2/pytorch/fast_example/code/cpu directory of OBS.
- A path for outputting the training job logs has been created, for example, obs://cnnorth4-job-test-v2/pytorch/fast_example/log.
Procedure
- Call the API for authentication to obtain a user token.
- Request body:
URI: POST https://{iam_endpoint}/v3/auth/tokens
Request header: Content-Type → application/json
Request body:
{
    "auth": {
        "identity": {
            "methods": ["password"],
            "password": {
                "user": {
                    "name": "user_name",
                    "password": "user_password",
                    "domain": {
                        "name": "domain_name"
                    }
                }
            }
        },
        "scope": {
            "project": {
                "name": "ap-southeast-1"
            }
        }
    }
}
Set the following parameters based on site requirements:
- iam_endpoint: IAM endpoint
- user_name: IAM username
- user_password: login password of the user
- domain_name: account to which the user belongs
- ap-southeast-1: Project name, which is the region where ModelArts is deployed
- Status code 201 Created is returned. The X-Subject-Token value in the response header is the token.
x-subject-token →MIIZmgYJKoZIhvcNAQcCoIIZizCCGYcCAQExDTALBglghkgBZQMEAgEwgXXXXXX...
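The token request can be scripted with the Python standard library. The following is a minimal sketch, not an official client: the function names are illustrative, and error handling (for example, for wrong credentials) is left out. It builds the request exactly as described above and reads the token from the X-Subject-Token response header.

```python
import json
import urllib.request


def build_token_request(iam_endpoint, user_name, user_password,
                        domain_name, project_name):
    """Build the POST /v3/auth/tokens request described in this step."""
    body = {
        "auth": {
            "identity": {
                "methods": ["password"],
                "password": {
                    "user": {
                        "name": user_name,
                        "password": user_password,
                        "domain": {"name": domain_name},
                    }
                },
            },
            "scope": {"project": {"name": project_name}},
        }
    }
    return urllib.request.Request(
        url=f"https://{iam_endpoint}/v3/auth/tokens",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_token(request):
    """Send the request and return the token from the response header."""
    with urllib.request.urlopen(request) as response:
        return response.headers["X-Subject-Token"]
```

Calling fetch_token(build_token_request(...)) with site-specific values returns the string to pass as X-Auth-Token in the following steps.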
- Call the API for creating a training job to create a training job using a custom image and record the job ID.
- Request body:
URI: POST https://{ma_endpoint}/v2/{project_id}/training-jobs
Request header:
- X-Auth-Token →MIIZmgYJKoZIhvcNAQcCoIIZizCCGYcCAQExDTALBglghkgBZQMEAgEwgXXXXXX...
- Content-Type →application/json
Set ma_endpoint, project_id, and the X-Auth-Token value based on site requirements.
Request body:
{
    "kind": "job",
    "metadata": {
        "name": "test-pytorch-cpu01",
        "description": "test pytorch work cpu"
    },
    "algorithm": {
        "code_dir": "/cnnorth4-job-test-v2/pytorch/fast_example/code/cpu/",
        "local_code_dir": "/home/ma-user/modelarts/user-job-dir",
        "engine": {
            "image_url": "atelier/pytorch_cuda:pytorch_2.7.0-cuda_12.8-py_3.11.10-ubuntu_22.04-x86_64-20251215163925-4e5422a"
        },
        "command": "python ${MA_JOB_DIR}/cpu/test-pytorch.py"
    },
    "spec": {
        "resource": {
            "node_count": 1,
            "pool_id": "pool-maostest-train-06024304be00d5092fbdc0013d201342"
        },
        "log_export_path": {
            "obs_url": "/cnnorth4-job-test-v2/pytorch/fast_example/log/"
        }
    }
}
Set the following parameters based on site requirements:
- Set kind to the type of the training job. The default value is job.
- Set name and description in the metadata field to the name and description of the training job.
- Set code_dir and local_code_dir in the algorithm field to the code directory and the local directory where the code is downloaded to the job, respectively.
- Set image_url in the algorithm field to the address of the training job image.
- Set command in the algorithm field to the command for starting the training job.
- In the spec field, pool_id is the ID of the resource pool on which the training job depends. node_count is the number of nodes used for training. Set it to 1 (the default) for single-node training, or to a larger value for multi-node (distributed) training. log_export_path specifies the OBS path to which logs are uploaded.
- Status code 201 Created is returned, indicating that the training job has been created. The response body is as follows:
{
    "kind": "job",
    "metadata": {
        "id": "31318695-2011-4e48-9b90-9c9178c57951",
        "name": "test-pytorch-cpu01",
        "description": "test pytorch work cpu",
        "create_time": 1777545352008,
        "workspace_id": "0",
        "ai_project": "default-ai-project",
        "labels": { "training-job": "modelarts-os" },
        "user_name": "",
        "annotations": { "job_template": "Template DL", "key_task": "worker" },
        "training_experiment_reference": {},
        "tags": []
    },
    "status": {
        "phase": "Pending",
        "secondary_phase": "Creating",
        "pending_time": 1000,
        "duration": 0,
        "is_hanged": false,
        "retry_count": 0,
        "start_time": 0,
        "node_count_metrics": null,
        "tasks": [ "worker-0" ],
        "metrics_statistics": {
            "cpu_usage": { "average": -1, "max": -1, "min": -1 },
            "mem_usage": { "average": -1, "max": -1, "min": -1 }
        }
    },
    "algorithm": {
        "code_dir": "/cnnorth4-job-test-v2/pytorch/fast_example/code/cpu/",
        "local_code_dir": "/home/ma-user/modelarts/user-job-dir",
        "command": "python ${MA_JOB_DIR}/cpu/test-pytorch.py",
        "engine": {
            "engine_id": "",
            "engine_name": "",
            "engine_version": "",
            "v1_compatible": false,
            "image_url": "atelier/pytorch_cuda:pytorch_2.7.0-cuda_12.8-py_3.11.10-ubuntu_22.04-x86_64-20251215163925-4e5422a",
            "non_swr_image": false,
            "run_user": "",
            "image_source": true,
            "image_repo_id": "",
            "image_id": ""
        }
    },
    "spec": {
        "resource": {
            "pool_id": "pool-maostest-train-06024304be00d5092fbdc0013d201342",
            "pool_resource_flavor": "",
            "node_count": 1,
            "pool_info": {
                "cpu_arch": "x86",
                "core_num": 5,
                "mem_size": 22,
                "cache_size": 0,
                "accelerator": "",
                "accelerator_num": 0,
                "accelerator_type": "",
                "accelerator_size": 0,
                "variant": "",
                "huge_pages": 0,
                "x_parameter_plane": "",
                "use_privileged": false,
                "use_host_network": false,
                "use_ib_network": false,
                "project_id": "",
                "pool_resource_flavor": "liumuqi-eni-test",
                "pool_id": "pool-maostest-train-06024304be00d5092fbdc0013d201342",
                "cluster_id": "",
                "maos_pool": true,
                "quota_id": "",
                "maos_migrated": false,
                "detect_all_in_int": false,
                "pool_type": "",
                "enable_cabinet": false,
                "enable_memarts": false,
                "enable_ems": false,
                "empty_dir_size": 0
            },
            "main_container_allocated_resources": {
                "cpu_arch": "x86",
                "cpu_core_num": 4,
                "mem_size": 20,
                "accelerator_num": 0,
                "accelerator_type": ""
            }
        },
        "log_export_path": {
            "obs_url": "/cnnorth4-job-test-v2/pytorch/fast_example/log/"
        },
        "is_hosted_log": true,
        "runtime_type": "production"
    },
    "ftjob_config": {
        "checkpoint_config": {
            "save_checkpoints_max": 0,
            "checkpoint_id": "",
            "skipped_steps": 0,
            "restore_training": 0
        },
        "task_env": { "envs": null }
    }
}
- Record the id value (training job ID) in the metadata field for subsequent steps.
- phase and secondary_phase under status indicate the level-1 status and level-2 status of the training job, respectively. In the example, phase is Pending and secondary_phase is Creating, indicating that the training job is being created.
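The request body for this step can be assembled programmatically. The sketch below is one possible way to do it, with hypothetical helper names. It assumes that the code_dir and obs_url values are the obs:// paths from the prerequisites with the scheme removed and a trailing slash added; this is inferred from comparing the prerequisites with the request example, not from a stated rule.

```python
import json


def obs_to_job_path(obs_uri):
    """Turn 'obs://bucket/dir' into '/bucket/dir/' (assumed convention)."""
    prefix = "obs://"
    if not obs_uri.startswith(prefix):
        raise ValueError(f"not an OBS URI: {obs_uri!r}")
    return "/" + obs_uri[len(prefix):].strip("/") + "/"


def build_create_job_body(name, description, code_obs_uri, log_obs_uri,
                          image_url, command, pool_id, node_count=1,
                          local_code_dir="/home/ma-user/modelarts/user-job-dir"):
    """Build the JSON body for POST /v2/{project_id}/training-jobs."""
    return {
        "kind": "job",
        "metadata": {"name": name, "description": description},
        "algorithm": {
            "code_dir": obs_to_job_path(code_obs_uri),
            "local_code_dir": local_code_dir,
            "engine": {"image_url": image_url},
            "command": command,
        },
        "spec": {
            "resource": {"node_count": node_count, "pool_id": pool_id},
            "log_export_path": {"obs_url": obs_to_job_path(log_obs_uri)},
        },
    }


def extract_job_id(create_response):
    """Return metadata.id from the parsed response body of this step."""
    return create_response["metadata"]["id"]


if __name__ == "__main__":
    body = build_create_job_body(
        name="test-pytorch-cpu01",
        description="test pytorch work cpu",
        code_obs_uri="obs://cnnorth4-job-test-v2/pytorch/fast_example/code/cpu",
        log_obs_uri="obs://cnnorth4-job-test-v2/pytorch/fast_example/log",
        image_url="my_image_url",  # placeholder
        command="python ${MA_JOB_DIR}/cpu/test-pytorch.py",
        pool_id="my_pool_id",  # placeholder
    )
    print(json.dumps(body, indent=4))
```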
- Call the API for querying details about a training job to query the job status using the job ID.
- Request body:
URI: GET https://{ma_endpoint}/v2/{project_id}/training-jobs/{training_job_id}
Request header: X-Auth-Token →MIIZmgYJKoZIhvcNAQcCoIIZizCCGYcCAQExDTALBglghkgBZQMEAgEwgXXXXXX...
Set the following parameters based on site requirements:
Set training_job_id to the training job ID recorded in step 2.
- Status code 200 OK is returned. The response body is as follows:
{
    "kind": "job",
    "metadata": {
        "id": "31318695-2011-4e48-9b90-9c9178c57951",
        "name": "test-pytorch-cpu01",
        "description": "test pytorch work cpu",
        "create_time": 1777545352008,
        "workspace_id": "0",
        "ai_project": "default-ai-project",
        "labels": { "training-job": "modelarts-os" },
        "user_name": "modelarts_xxx",
        "annotations": { "job_template": "Template DL", "key_task": "worker" },
        "training_experiment_reference": {},
        "tags": []
    },
    "status": {
        "phase": "Running",
        "secondary_phase": "Running",
        "pending_time": 68992,
        "duration": 4000,
        "is_hanged": false,
        "retry_count": 0,
        "task_ips": [
            { "task": "worker-0", "ip": "172.16.0.31", "host_ip": "192.168.140.98", "schedule_count": 1 }
        ],
        "start_time": 1777545421000,
        "node_count_metrics": [
            [ 1777545411000, 0 ],
            [ 1777545420000, 0 ],
            [ 1777545421000, 1 ],
            [ 1777545424000, 1 ],
            [ 1777545425000, 1 ]
        ],
        "tasks": [ "worker-0" ],
        "metrics_statistics": {
            "cpu_usage": { "average": -1, "max": -1, "min": -1 },
            "mem_usage": { "average": -1, "max": -1, "min": -1 }
        },
        "running_records": [
            { "start_at": 1777545424, "start_type": "init_or_rescheduled" }
        ]
    },
    "algorithm": {
        "code_dir": "/cnnorth4-job-test-v2/pytorch/fast_example/code/cpu/",
        "local_code_dir": "/home/ma-user/modelarts/user-job-dir",
        "command": "python ${MA_JOB_DIR}/cpu/test-pytorch.py",
        "engine": {
            "engine_id": "",
            "engine_name": "",
            "engine_version": "",
            "v1_compatible": false,
            "image_url": "atelier/pytorch_cuda:pytorch_2.7.0-cuda_12.8-py_3.11.10-ubuntu_22.04-x86_64-20251215163925-4e5422a",
            "non_swr_image": false,
            "run_user": "",
            "image_source": true,
            "image_repo_id": "",
            "image_id": ""
        }
    },
    "spec": {
        "resource": {
            "pool_id": "pool-maostest-train-06024304be00d5092fbdc0013d201342",
            "pool_resource_flavor": "",
            "node_count": 1,
            "pool_info": {
                "cpu_arch": "x86",
                "core_num": 5,
                "mem_size": 22,
                "cache_size": 0,
                "accelerator": "",
                "accelerator_num": 0,
                "accelerator_type": "",
                "accelerator_size": 0,
                "variant": "",
                "huge_pages": 0,
                "x_parameter_plane": "",
                "use_privileged": false,
                "use_host_network": false,
                "use_ib_network": false,
                "project_id": "",
                "pool_resource_flavor": "liumuqi-eni-test",
                "pool_id": "pool-maostest-train-06024304be00d5092fbdc0013d201342",
                "cluster_id": "",
                "maos_pool": true,
                "quota_id": "",
                "maos_migrated": false,
                "detect_all_in_int": false,
                "pool_type": "",
                "enable_cabinet": false,
                "enable_memarts": false,
                "enable_ems": false,
                "empty_dir_size": 0
            },
            "main_container_allocated_resources": {
                "cpu_arch": "x86",
                "cpu_core_num": 4,
                "mem_size": 20,
                "accelerator_num": 0,
                "accelerator_type": ""
            }
        },
        "log_export_path": {
            "obs_url": "/cnnorth4-job-test-v2/pytorch/fast_example/log/"
        },
        "is_hosted_log": true,
        "runtime_type": "production"
    },
    "ftjob_config": {
        "checkpoint_config": {
            "save_checkpoints_max": 0,
            "checkpoint_id": "",
            "skipped_steps": 0,
            "restore_training": 0
        },
        "task_env": { "envs": null }
    }
}
The response contains the details of the training job. The phase value under status is Running, indicating that the training job is running.
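Because a job moves from Pending to Running over time, the details query is usually repeated until the job reaches the phase of interest. The following is a hedged sketch of such a polling loop. The function names, the polling interval, and the attempt limit are illustrative choices, and the caller supplies the set of phases to wait for, since only Pending and Running appear in the examples here.

```python
import json
import time
import urllib.request


def make_job_fetcher(ma_endpoint, project_id, training_job_id, token):
    """Return a function that performs the GET request of this step."""
    url = f"https://{ma_endpoint}/v2/{project_id}/training-jobs/{training_job_id}"

    def fetch_job():
        request = urllib.request.Request(
            url, headers={"X-Auth-Token": token}, method="GET")
        with urllib.request.urlopen(request) as response:
            return json.load(response)

    return fetch_job


def wait_for_phase(fetch_job, target_phases, interval_seconds=10.0,
                   max_attempts=60, sleep=time.sleep):
    """Poll until status.phase is in target_phases, then return the job."""
    for attempt in range(max_attempts):
        job = fetch_job()
        if job["status"]["phase"] in target_phases:
            return job
        if attempt < max_attempts - 1:
            sleep(interval_seconds)  # wait before the next query
    raise TimeoutError(
        f"job did not reach {sorted(target_phases)} in {max_attempts} attempts")
```

For example, wait_for_phase(make_job_fetcher(...), {"Running"}) blocks until the job is reported as running. The fetch function is passed in so that the loop can be exercised without network access.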
- Call the API for querying the logs of a specified task in a training job (OBS link) to obtain the OBS path of the training job logs.
- Request body:
URI format: GET https://{ma_endpoint}/v2/{project_id}/training-jobs/{training_job_id}/tasks/{task_id}/logs/url
Request header:
X-Auth-Token →MIIZmgYJKoZIhvcNAQcCoIIZizCCGYcCAQExDTALBglghkgBZQMEAgEwgXXXXXX...
Content-Type → text/plain
Set the following parameters based on site requirements:
- task_id indicates the name of the task in the training job. Generally, set it to worker-0.
- Content-Type can be set either to text/plain or application/octet-stream. text/plain indicates that a temporary OBS preview URL is returned. application/octet-stream indicates that a temporary OBS download URL is returned.
- Status code 200 OK is returned. The response body is as follows:
{
    "obs_url": "https://modelarts-training-log-cn-north-4.obs.cn-north-4.myhuaweicloud.com:443/66ff6991-fd66-40b6-8101-0829a46d3731/worker-0/modelarts-job-66ff6991-fd66-40b6-8101-0829a46d3731-worker-0.log?AWSAccessKeyId=GFGTBKOZENDD83QEMZMV&Expires=1641896599&Signature=BedFZHEU1oCmqlI912UL9mXlhkg%3D"
}
The obs_url field contains the temporary OBS URL of the log. Copy the value to a browser to view the log.
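The choice between a preview URL and a download URL is made only through the Content-Type header. A minimal sketch of building this request follows. The function name and the mode labels "preview" and "download" are illustrative, and the default task name worker-0 is taken from the tasks list in the job details response.

```python
import urllib.request

# Content-Type values described in this step, keyed by illustrative labels.
LOG_URL_MODES = {
    "preview": "text/plain",
    "download": "application/octet-stream",
}


def build_log_url_request(ma_endpoint, project_id, training_job_id, token,
                          task_id="worker-0", mode="preview"):
    """Build the GET request that returns a temporary OBS log URL."""
    if mode not in LOG_URL_MODES:
        raise ValueError(f"mode must be one of {sorted(LOG_URL_MODES)}")
    url = (f"https://{ma_endpoint}/v2/{project_id}/training-jobs/"
           f"{training_job_id}/tasks/{task_id}/logs/url")
    headers = {
        "X-Auth-Token": token,
        "Content-Type": LOG_URL_MODES[mode],
    }
    return urllib.request.Request(url, headers=headers, method="GET")
```

The parsed response body then holds the temporary link under the obs_url key.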
- Call the API for querying the running metrics of a specified task in a training job to view detailed metrics of the job.
- Request body:
URI format: GET https://{ma_endpoint}/v2/{project_id}/training-jobs/{training_job_id}/metrics/{task_id}
Request header: X-Auth-Token →MIIZmgYJKoZIhvcNAQcCoIIZizCCGYcCAQExDTALBglghkgBZQMEAgEwgXXXXXX...
Set ma_endpoint, project_id, training_job_id, task_id, and the X-Auth-Token value based on site requirements.
- Status code 200 OK is returned. The response body is as follows:
{
    "metrics": [
        { "metric": "cpuUsage", "value": [ -1, -1, 28.622, 35.053, 39.988, 40.069, 40.082, 40.094 ] },
        { "metric": "memUsage", "value": [ -1, -1, 0.544, 0.641, 0.736, 0.737, 0.738, 0.739 ] },
        { "metric": "npuUtil", "value": [ -1, -1, -1, -1, -1, -1, -1, -1 ] },
        { "metric": "npuMemUsage", "value": [ -1, -1, -1, -1, -1, -1, -1, -1 ] },
        { "metric": "gpuUtil", "value": [ -1, -1, -1, -1, -1, -1, -1, -1 ] },
        { "metric": "gpuMemUsage", "value": [ -1, -1, -1, -1, -1, -1, -1, -1 ] }
    ]
}
You can view metrics such as the CPU usage.
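The metric series can be condensed into a short summary. The sketch below assumes that a value of -1 means that no sample was collected; this is an interpretation of the example response (where the NPU and GPU series on a CPU job contain only -1), not a documented rule. The function name is illustrative.

```python
def summarize_metrics(metrics_response):
    """Return min, max, and average per metric, ignoring -1 placeholders."""
    summary = {}
    for item in metrics_response["metrics"]:
        # Assumption: -1 marks a missing sample and is excluded.
        samples = [value for value in item["value"] if value != -1]
        if samples:
            summary[item["metric"]] = {
                "min": min(samples),
                "max": max(samples),
                "average": sum(samples) / len(samples),
            }
        else:
            summary[item["metric"]] = None  # no data for this metric
    return summary


if __name__ == "__main__":
    example = {
        "metrics": [
            {"metric": "cpuUsage",
             "value": [-1, -1, 28.622, 35.053, 39.988, 40.069, 40.082, 40.094]},
            {"metric": "gpuUtil",
             "value": [-1, -1, -1, -1, -1, -1, -1, -1]},
        ]
    }
    print(summarize_metrics(example))
```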
- Request body:
- Call the API for deleting a training job to delete the job if it is no longer needed.
- Request body:
URI: DELETE https://{ma_endpoint}/v2/{project_id}/training-jobs/{training_job_id}
Request header: X-Auth-Token →MIIZmgYJKoZIhvcNAQcCoIIZizCCGYcCAQExDTALBglghkgBZQMEAgEwgXXXXXX...
Set ma_endpoint, project_id, training_job_id, and the X-Auth-Token value based on site requirements.
- Status code 202 is returned, indicating that the job is successfully deleted.
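The delete call carries no request body, only the token header. A minimal sketch with illustrative function names follows. It returns the HTTP status code so that the caller can compare it with the code stated above.

```python
import urllib.request


def build_delete_request(ma_endpoint, project_id, training_job_id, token):
    """Build the DELETE request for a training job."""
    url = f"https://{ma_endpoint}/v2/{project_id}/training-jobs/{training_job_id}"
    return urllib.request.Request(
        url, headers={"X-Auth-Token": token}, method="DELETE")


def delete_job(request):
    """Send the request and return the HTTP status code of the response."""
    with urllib.request.urlopen(request) as response:
        return response.status
```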