Updated on 2026-06-03 GMT+08:00

Creating a TensorBoard Training Job Using a Custom Image

This section describes how to create a TensorBoard-enabled training job from a custom image by calling ModelArts APIs.

Overview

The process for creating a TensorBoard training job using a custom PyTorch image is as follows:

  1. Call the API for authentication to obtain a user token, which will be added in a request header for authentication.
  2. Call the API for creating a training job to create a TensorBoard-enabled training job from the custom image and record the job ID.
  3. Call the API for querying details about a training job to query the job status and TensorBoard address using the job ID.
  4. Open the TensorBoard endpoint to view the visualized metric data.
  5. Call the API for deleting a training job to delete the job if it is no longer needed.
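
The five steps above map onto five HTTP calls. The following is a minimal sketch that only renders the method and URL for each step; the names WORKFLOW and render_workflow are illustrative and not part of any ModelArts SDK, and all values passed in are placeholders.

```python
# Each entry: (step title, HTTP method, URL template taken from the procedure).
WORKFLOW = [
    ("Obtain a user token", "POST",
     "https://{iam_endpoint}/v3/auth/tokens"),
    ("Create the training job", "POST",
     "https://{ma_endpoint}/v2/{project_id}/training-jobs"),
    ("Query job details", "GET",
     "https://{ma_endpoint}/v2/{project_id}/training-jobs/{training_job_id}"),
    ("Open TensorBoard", "GET",
     "{tensorboard_endpoint}?token={tensorboard_token}"),
    ("Delete the job", "DELETE",
     "https://{ma_endpoint}/v2/{project_id}/training-jobs/{training_job_id}"),
]


def render_workflow(**values):
    """Fill in the URL templates and return one summary line per step."""
    return [
        f"{number}. {title}: {method} {template.format(**values)}"
        for number, (title, method, template) in enumerate(WORKFLOW, start=1)
    ]


# Placeholder values only; replace them with site-specific ones.
for line in render_workflow(
    iam_endpoint="iam.example.com",
    ma_endpoint="modelarts.example.com",
    project_id="my-project-id",
    training_job_id="my-job-id",
    tensorboard_endpoint="https://tensorboard.example.com/proxy/6006/",
    tensorboard_token="my-token",
):
    print(line)
```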

Prerequisites

  • The directory for storing TensorBoard logs has been prepared, for example, obs://cnnorth4-job-pfs/summary in the OBS parallel file system.

Procedure

  1. Call the API for authentication to obtain a user token.
    1. Request:

      URI: POST https://{iam_endpoint}/v3/auth/tokens

      Request header: Content-Type → application/json

      Request body:
      {
        "auth": {
          "identity": {
            "methods": ["password"],
            "password": {
              "user": {
                "name": "user_name", 
                "password": "user_password",
                "domain": {
                  "name": "domain_name"
                }
              }
            }
          },
          "scope": {
            "project": {
              "name": "ap-southeast-1"
            }
          }
        }
      }
      Set the following parameters based on site requirements:
      • iam_endpoint: IAM endpoint
      • user_name: IAM username
      • user_password: login password of the user
      • domain_name: account to which the user belongs
      • ap-southeast-1: Project name, which is the region where ModelArts is deployed
    2. Status code 201 Created is returned. The X-Subject-Token value in the response header is the token.
      x-subject-token →MIIZmgYJKoZIhvcNAQcCoIIZizCCGYcCAQExDTALBglghkgBZQMEAgEwgXXXXXX...
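
The token request in this step could be issued from Python roughly as follows. This is a minimal sketch that uses only the standard library; the helper names build_token_request and get_token are illustrative and not part of any ModelArts SDK, and the endpoint and credentials are placeholders to be supplied by the caller.

```python
import json
import urllib.request


def build_token_request(iam_endpoint, user_name, user_password, domain_name, project_name):
    """Build (without sending) the POST /v3/auth/tokens request from this step."""
    body = {
        "auth": {
            "identity": {
                "methods": ["password"],
                "password": {
                    "user": {
                        "name": user_name,
                        "password": user_password,
                        "domain": {"name": domain_name},
                    }
                },
            },
            # The project name is the region where ModelArts is deployed.
            "scope": {"project": {"name": project_name}},
        }
    }
    return urllib.request.Request(
        url=f"https://{iam_endpoint}/v3/auth/tokens",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def get_token(request):
    """Send the request and return the X-Subject-Token response header."""
    with urllib.request.urlopen(request) as response:
        # Header lookup on the response is case-insensitive.
        return response.headers["X-Subject-Token"]
```

Keeping request construction separate from sending makes it possible to inspect the payload before any credentials leave the machine. In practice, read the password from an environment variable or a secret store rather than hard-coding it.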
  2. Call the API for creating a training job to create a TensorBoard-enabled training job from the custom image and record the job ID.
    1. Request:

      URI: POST https://{ma_endpoint}/v2/{project_id}/training-jobs

      Request header:

      • X-Auth-Token →MIIZmgYJKoZIhvcNAQcCoIIZizCCGYcCAQExDTALBglghkgBZQMEAgEwgXXXXXX...
      • Content-Type →application/json

      Set the parameters in braces (ma_endpoint and project_id) and the X-Auth-Token value based on site requirements.

      Request body:

      {
          "kind": "job",
          "metadata": {
              "name": "test-pytorch-cpu01",
              "description": "test pytorch work cpu",
              "annotations": {
                  "tensorboard/enable": "true"
              }
          },
          "algorithm": {
              "engine": {
                  "image_url": "atelier/pytorch_cuda:pytorch_2.7.0-cuda_12.8-py_3.11.10-ubuntu_22.04-x86_64-20251215163925-4e5422a"
              },
              "command": "sleep 1h",
              "summary": {
                  "data_sources": [
                      {
                          "pfs": {
                              "pfs_path": "obs://cnnorth4-job-pfs/summary/"
                          }
                      }
                  ]
              }
          },
          "spec": {
              "resource": {
                  "node_count": 1,
                  "pool_id": "pool-maostest-train-06024304be00d5092fbdc0013d201342"
              }
          }
      }

      Set the following parameters based on site requirements:

      • Set kind to the type of the training job. The default value is job.
      • Set name and description in the metadata field to the name and description of the training job.
      • Set annotations in metadata to "tensorboard/enable": "true" to enable TensorBoard.
      • Set image_url in the algorithm field to the address of the training job image.
      • Set command in the algorithm field to the command for starting the training job.
      • Set data_sources in the algorithm field to the directory for storing TensorBoard logs.
      • In the spec field, pool_id specifies the ID of the resource pool that the training job runs on. node_count specifies the number of compute nodes: set it to 1 for single-node training (the default) or to a larger value for multi-node (distributed) training. log_export_path, which is not set in this example, specifies the OBS path to which logs are uploaded.
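
The request body described above can also be assembled programmatically. The following is a minimal sketch; the function name build_tensorboard_job is illustrative and not part of any ModelArts SDK, and it covers only the fields used in this example.

```python
import json


def build_tensorboard_job(name, description, image_url, command,
                          pfs_path, pool_id, node_count=1):
    """Return the request body for a TensorBoard-enabled custom-image job."""
    return {
        "kind": "job",
        "metadata": {
            "name": name,
            "description": description,
            # The annotation value is the string "true", not a JSON boolean.
            "annotations": {"tensorboard/enable": "true"},
        },
        "algorithm": {
            "engine": {"image_url": image_url},
            "command": command,
            # Directory in the OBS parallel file system holding TensorBoard logs.
            "summary": {"data_sources": [{"pfs": {"pfs_path": pfs_path}}]},
        },
        "spec": {"resource": {"node_count": node_count, "pool_id": pool_id}},
    }


body = build_tensorboard_job(
    name="test-pytorch-cpu01",
    description="test pytorch work cpu",
    image_url="atelier/pytorch_cuda:pytorch_2.7.0-cuda_12.8-py_3.11.10-ubuntu_22.04-x86_64-20251215163925-4e5422a",
    command="sleep 1h",
    pfs_path="obs://cnnorth4-job-pfs/summary/",
    pool_id="pool-maostest-train-06024304be00d5092fbdc0013d201342",
)
print(json.dumps(body, indent=4))
```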
    2. Status code 201 Created is returned, indicating that the training job has been created. The response body is as follows:
      {
          "kind": "job",
          "metadata": {
              "id": "dc0e330d-c6a8-4f1d-9cd3-1f108c3bbaf9",
              "name": "test-pytorch-cpu01",
              "description": "test pytorch work cpu",
              "create_time": 1777545590838,
              "workspace_id": "0",
              "ai_project": "default-ai-project",
              "labels": {
                  "training-job": "modelarts-os"
              },
              "user_name": "",
              "annotations": {
                  "job_template": "Template DL",
                  "key_task": "worker",
                  "tensorboard/enable": "true"
              },
              "training_experiment_reference": {},
              "tags": []
          },
          "status": {
              "phase": "Pending",
              "secondary_phase": "Creating",
              "pending_time": 2000,
              "duration": 0,
              "is_hanged": false,
              "retry_count": 0,
              "start_time": 0,
              "node_count_metrics": null,
              "tasks": [
                  "worker-0"
              ],
              "metrics_statistics": {
                  "cpu_usage": {
                      "average": -1,
                      "max": -1,
                      "min": -1
                  },
                  "mem_usage": {
                      "average": -1,
                      "max": -1,
                      "min": -1
                  }
              }
          },
          "algorithm": {
              "command": "sleep 1h",
              "engine": {
                  "engine_id": "",
                  "engine_name": "",
                  "engine_version": "",
                  "v1_compatible": false,
                  "image_url": "atelier/pytorch_cuda:pytorch_2.7.0-cuda_12.8-py_3.11.10-ubuntu_22.04-x86_64-20251215163925-4e5422a",
                  "non_swr_image": false,
                  "run_user": "",
                  "image_source": true,
                  "image_repo_id": "",
                  "image_id": ""
              },
              "summary": {
                  "data_sources": [
                      {
                          "pfs": {
                              "pfs_path": "obs://cnnorth4-job-pfs/summary/"
                          }
                      }
                  ]
              }
          },
          "spec": {
              "resource": {
                  "pool_id": "pool-maostest-train-06024304be00d5092fbdc0013d201342",
                  "pool_resource_flavor": "",
                  "node_count": 1,
                  "pool_info": {
                      "cpu_arch": "x86",
                      "core_num": 5,
                      "mem_size": 22,
                      "cache_size": 0,
                      "accelerator": "",
                      "accelerator_num": 0,
                      "accelerator_type": "",
                      "accelerator_size": 0,
                      "variant": "",
                      "huge_pages": 0,
                      "x_parameter_plane": "",
                      "use_privileged": false,
                      "use_host_network": false,
                      "use_ib_network": false,
                      "project_id": "",
                      "pool_resource_flavor": "liumuqi-eni-test",
                      "pool_id": "pool-maostest-train-06024304be00d5092fbdc0013d201342",
                      "cluster_id": "",
                      "maos_pool": true,
                      "quota_id": "",
                      "maos_migrated": false,
                      "detect_all_in_int": false,
                      "pool_type": "",
                      "enable_cabinet": false,
                      "enable_memarts": false,
                      "enable_ems": false,
                      "empty_dir_size": 0
                  },
                  "main_container_allocated_resources": {
                      "cpu_arch": "x86",
                      "cpu_core_num": 4,
                      "mem_size": 20,
                      "accelerator_num": 0,
                      "accelerator_type": ""
                  }
              },
              "is_hosted_log": true,
              "runtime_type": "production"
          },
          "endpoints": {
              "tensorboard": {}
          },
          "ftjob_config": {
              "checkpoint_config": {
                  "save_checkpoints_max": 0,
                  "checkpoint_id": "",
                  "skipped_steps": 0,
                  "restore_training": 0
              },
              "task_env": {
                  "envs": null
              }
          }
      }
      • Record the id value (training job ID) in the metadata field for subsequent steps.
      • phase and secondary_phase in the status field indicate the status and the more fine-grained secondary status of the training job, respectively. In the example, phase is Pending and secondary_phase is Creating, indicating that the training job is being created.
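
Extracting the job ID and status from the creation response could look like the following. This is a minimal sketch; parse_created_job is an illustrative helper name, and the sample is a trimmed version of the response shown in this step.

```python
import json


def parse_created_job(response_text):
    """Return (job_id, phase, secondary_phase) from a create-job response."""
    job = json.loads(response_text)
    status = job.get("status", {})
    return job["metadata"]["id"], status.get("phase"), status.get("secondary_phase")


# Trimmed copy of the example response; a real response has many more fields.
sample = """
{
    "kind": "job",
    "metadata": {"id": "dc0e330d-c6a8-4f1d-9cd3-1f108c3bbaf9"},
    "status": {"phase": "Pending", "secondary_phase": "Creating"}
}
"""

job_id, phase, secondary_phase = parse_created_job(sample)
print(job_id, phase, secondary_phase)
# prints: dc0e330d-c6a8-4f1d-9cd3-1f108c3bbaf9 Pending Creating
```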
  3. Call the API for querying details about a training job to query the job status using the job ID.
    1. Request:

      URI: GET https://{ma_endpoint}/v2/{project_id}/training-jobs/{training_job_id}

      Request header: X-Auth-Token →MIIZmgYJKoZIhvcNAQcCoIIZizCCGYcCAQExDTALBglghkgBZQMEAgEwgXXXXXX...

      Set training_job_id to the training job ID recorded in step 2, and set the other parameters based on site requirements.

    2. Status code 200 OK is returned. The response body is as follows:
      {
          "kind": "job",
          "metadata": {
              "id": "dc0e330d-c6a8-4f1d-9cd3-1f108c3bbaf9",
              "name": "test-pytorch-cpu01",
              "description": "test pytorch work cpu",
              "create_time": 1777545590838,
              "workspace_id": "0",
              "ai_project": "default-ai-project",
              "labels": {
                  "training-job": "modelarts-os"
              },
              "user_name": "ei_modelarts_y00218826_05",
              "annotations": {
                  "job_template": "Template DL",
                  "key_task": "worker",
                  "tensorboard/enable": "true"
              },
              "training_experiment_reference": {},
              "tags": []
          },
          "status": {
              "phase": "Running",
              "secondary_phase": "Running",
              "pending_time": 77162,
              "duration": 181000,
              "is_hanged": false,
              "retry_count": 0,
              "task_ips": [
                  {
                      "task": "worker-0",
                      "ip": "172.16.0.79",
                      "host_ip": "192.168.209.44",
                      "schedule_count": 1
                  }
              ],
              "start_time": 1777545668000,
              "node_count_metrics": [
                  [
                      1777545658000,
                      0
                  ],
                  [
                      1777545667000,
                      0
                  ],
                  [
                      1777545668000,
                      1
                  ],
                  [
                      1777545848000,
                      1
                  ],
                  [
                      1777545849000,
                      1
                  ]
              ],
              "tasks": [
                  "worker-0"
              ],
              "metrics_statistics": {
                  "cpu_usage": {
                      "average": -1,
                      "max": -1,
                      "min": -1
                  },
                  "mem_usage": {
                      "average": -1,
                      "max": -1,
                      "min": -1
                  }
              },
              "running_records": [
                  {
                      "start_at": 1777545675,
                      "start_type": "init_or_rescheduled"
                  }
              ]
          },
          "algorithm": {
              "command": "sleep 1h",
              "engine": {
                  "engine_id": "",
                  "engine_name": "",
                  "engine_version": "",
                  "v1_compatible": false,
                  "image_url": "atelier/pytorch_cuda:pytorch_2.7.0-cuda_12.8-py_3.11.10-ubuntu_22.04-x86_64-20251215163925-4e5422a",
                  "non_swr_image": false,
                  "run_user": "",
                  "image_source": true,
                  "image_repo_id": "",
                  "image_id": ""
              },
              "summary": {
                  "data_sources": [
                      {
                          "pfs": {
                              "pfs_path": "obs://cnnorth4-job-pfs/summary/"
                          }
                      }
                  ]
              }
          },
          "spec": {
              "resource": {
                  "pool_id": "pool-maostest-train-06024304be00d5092fbdc0013d201342",
                  "pool_resource_flavor": "",
                  "node_count": 1,
                  "pool_info": {
                      "cpu_arch": "x86",
                      "core_num": 5,
                      "mem_size": 22,
                      "cache_size": 0,
                      "accelerator": "",
                      "accelerator_num": 0,
                      "accelerator_type": "",
                      "accelerator_size": 0,
                      "variant": "",
                      "huge_pages": 0,
                      "x_parameter_plane": "",
                      "use_privileged": false,
                      "use_host_network": false,
                      "use_ib_network": false,
                      "project_id": "",
                      "pool_resource_flavor": "liumuqi-eni-test",
                      "pool_id": "pool-maostest-train-06024304be00d5092fbdc0013d201342",
                      "cluster_id": "",
                      "maos_pool": true,
                      "quota_id": "",
                      "maos_migrated": false,
                      "detect_all_in_int": false,
                      "pool_type": "",
                      "enable_cabinet": false,
                      "enable_memarts": false,
                      "enable_ems": false,
                      "empty_dir_size": 0
                  },
                  "main_container_allocated_resources": {
                      "cpu_arch": "x86",
                      "cpu_core_num": 4,
                      "mem_size": 20,
                      "accelerator_num": 0,
                      "accelerator_type": ""
                  }
              },
              "is_hosted_log": true,
              "runtime_type": "production"
          },
          "endpoints": {
              "tensorboard": {
                  "url": "https://authoring-modelarts-cnnorth7.ulanqab.huawei.com/3efaddb9-dc0e330d-c6a8-4f1d-9cd3-1f108c3bbaf9/proxy/6006/",
                  "token": "fa8e321f772xxxxxxxxxxx3f1844d06"
              }
          },
          "ftjob_config": {
              "checkpoint_config": {
                  "save_checkpoints_max": 0,
                  "checkpoint_id": "",
                  "skipped_steps": 0,
                  "restore_training": 0
              },
              "task_env": {
                  "envs": null
              }
          }
      }

      The response contains the details of the training job. The phase value in the status field is Running, indicating that the training job is running. TensorBoard starts once the job is running. After TensorBoard has started, the endpoints field in the job details returns the URL and token for opening TensorBoard.
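
Because the tensorboard entry under endpoints is empty until TensorBoard has started, a client typically has to repeat this query. The following is a minimal polling sketch; the helper names, the 10-second interval, and the attempt limit are assumptions chosen for illustration, and fetch_details stands for any caller-supplied function that returns the parsed JSON of the query in this step.

```python
import time


def tensorboard_endpoint(job_details):
    """Return (url, token) if the details contain a TensorBoard endpoint, else None."""
    endpoint = (job_details.get("endpoints") or {}).get("tensorboard") or {}
    url, token = endpoint.get("url"), endpoint.get("token")
    return (url, token) if url and token else None


def wait_for_tensorboard(fetch_details, interval_s=10.0, max_attempts=60,
                         sleep=time.sleep):
    """Call fetch_details() repeatedly until the TensorBoard endpoint appears.

    Raises TimeoutError if the endpoint is still missing after max_attempts.
    """
    for attempt in range(max_attempts):
        endpoint = tensorboard_endpoint(fetch_details())
        if endpoint is not None:
            return endpoint
        # Do not sleep after the final attempt.
        if attempt < max_attempts - 1:
            sleep(interval_s)
    raise TimeoutError("TensorBoard endpoint did not appear in the job details")
```

Passing the fetch function and the sleep function as arguments keeps the polling logic independent of the HTTP client and makes it easy to exercise with canned responses.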

  4. Open the TensorBoard endpoint to view the visualized metric data.

    URI format: GET {tensorboard_endpoint}?token={tensorboard_token}

    Set the following parameters based on site requirements:

    • tensorboard_endpoint indicates the URL for opening TensorBoard of a training job.
    • tensorboard_token indicates the TensorBoard access credential of a training job.
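
Combining the two values into the address to open can be done with the standard library. This is a minimal sketch; tensorboard_url is an illustrative helper name, and the URL and token below are placeholders rather than real credentials.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def tensorboard_url(endpoint_url, token):
    """Append token=<token> to the endpoint URL, keeping any existing query."""
    parts = urlsplit(endpoint_url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append(("token", token))
    # urlencode percent-encodes characters that are not safe in a query string.
    return urlunsplit(parts._replace(query=urlencode(query)))


print(tensorboard_url("https://tensorboard.example.com/job/proxy/6006/", "abc123"))
# prints: https://tensorboard.example.com/job/proxy/6006/?token=abc123
```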
  5. Call the API for deleting a training job to delete the job if it is no longer needed.
    1. Request:

      URI: DELETE https://{ma_endpoint}/v2/{project_id}/training-jobs/{training_job_id}

      Request header: X-Auth-Token →MIIZmgYJKoZIhvcNAQcCoIIZizCCGYcCAQExDTALBglghkgBZQMEAgEwgXXXXXX...

      Set the parameters in braces (ma_endpoint, project_id, and training_job_id) and the X-Auth-Token value based on site requirements.

    2. Status code 202 is returned, indicating that the job is successfully deleted.
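
The delete call could be issued from Python roughly as follows. This is a minimal sketch that uses only the standard library; build_delete_request and delete_job are illustrative helper names, and all argument values are placeholders.

```python
import urllib.request


def build_delete_request(ma_endpoint, project_id, training_job_id, token):
    """Build (without sending) the DELETE request for a training job."""
    return urllib.request.Request(
        url=f"https://{ma_endpoint}/v2/{project_id}/training-jobs/{training_job_id}",
        headers={"X-Auth-Token": token},
        method="DELETE",
    )


def delete_job(request):
    """Send the request and return the HTTP status code of the response."""
    # urlopen raises urllib.error.HTTPError for 4xx and 5xx responses.
    with urllib.request.urlopen(request) as response:
        return response.status
```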