Updated on 2026-07-02 GMT+08:00

Creating a Custom Training Job (New Console)

Description

This topic is specific to CN Southwest-Guiyang1. The console uses the new UI version.

Developing models involves optimizing their performance effectively. Traditional methods require repeatedly testing various model structures, datasets, and hyperparameters, which takes significant time and effort but may still fail to deliver good results. ModelArts simplifies this process by offering tools for creating training jobs, tracking progress in real time, and managing versions. With ModelArts, you can test different configurations easily and identify the best-performing setup faster.

Create a production training job in either of the following ways:

Constraints

  • Supported region: This feature is only available in the CN Southwest-Guiyang1 region.
  • Job quota: By default, you can create up to 10,000 training jobs.
  • Storage: ModelArts does not support OBS buckets with bucket encryption enabled. Ensure this option is disabled when creating your OBS bucket.

Prerequisites

  • Your account is not in arrears. Training jobs consume paid resources.
  • The training data has been uploaded to an OBS directory.
  • At least one empty OBS folder is available for storing the training output.

    ModelArts does not support encrypted OBS buckets. When creating an OBS bucket, do not enable bucket encryption.

  • The OBS directory and ModelArts are in the same region.
  • Access authorization configured. If you have not yet configured access, follow the instructions in Configuring Agency Authorization for ModelArts with One Click.

Billing

Model training in ModelArts uses compute and storage resources, which are billed. Compute resources are billed for running training jobs. Storage resources are billed for storing data in OBS or SFS. For details, see Model Training Billing Items.

Procedure

To create a training job, follow these steps:

Step 1: Accessing the Creation Page: Log in to the console and navigate to the training job list.

Step 2: Choosing the Training Mode: Configure the training mode.

Step 3: Setting Basic Information: Define the job name, description, and other basic details.

Step 4: Defining Training Configuration: Configure parameters such as the image, boot command, and environment variables.

Step 5: Configuring Resources: Specify the resource pool type, specifications, number of instances, storage mounts, job priority, and preemption settings.

Step 6: Configuring Data: Configure a dataset.

Step 7: Publishing Models to Assets: Configure whether to publish the trained model to assets.

Step 8: Configuring HA: Configure automatic restart policies (including unconditional restarts and restarts upon job suspensions).

Step 9: Managing Access Configuration: Configure debugging options, SSH remote development, and password-free SSH between nodes.

Step 10: Enabling Observability: Configure TensorBoard, MindStudio Insight, and Prometheus metric collection.

Step 11: Adjusting Additional Configurations: Configure logging, job visibility, automatic stop, event notifications, and tags.

Step 12: Submitting and Viewing the Job: Submit the job and view the training job details.

Step 1: Accessing the Creation Page

  1. Log in to the ModelArts console.
  2. In the navigation pane, choose Model Build > Training.
  3. Click Create Training Job. The new UI is displayed by default. The following describes how to create a training job on the new UI.

Step 2: Choosing the Training Mode

Table 1 Training mode

Training Mode

Description

Fine-Tuning

Ideal for scenarios where you need to fine-tune existing pre-trained models, such as Pangu models or ResNet.

Custom Job

Designed for scenarios requiring full control over the training workflow, including the use of proprietary code or specialized images. You can customize training with your own Docker images and algorithms.

For this example, select Custom Job.

Step 3: Setting Basic Information

On the Create Training Job page, select Custom Job and set basic information.
Table 2 Basic information

Parameter

Description

Name

Job name, which is mandatory.

The system automatically generates a name, which you can then rename according to the following rules.

  • The name contains 1 to 64 characters.
  • Letters, digits, hyphens (-), and underscores (_) are allowed.

Description (Optional)

Job description, which helps you identify the job in the training job list.

Enter 0 to 256 characters. Only letters, digits, spaces, hyphens (-), underscores (_), commas (,), and periods (.) are supported.
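
The naming rules in Table 2 can be checked locally before you submit a job. The following is a minimal sketch in Python. The function names are hypothetical, and the patterns only restate the rules above under the assumption that "letters" means ASCII letters. The console remains the authority on what is accepted.

```python
import re

# Patterns restating the rules in Table 2.
# Assumption: "letters" means ASCII letters only.
JOB_NAME_RE = re.compile(r"[A-Za-z0-9_-]{1,64}")
JOB_DESC_RE = re.compile(r"[A-Za-z0-9 _,.-]{0,256}")


def is_valid_job_name(name: str) -> bool:
    """Return True if name consists of 1 to 64 allowed characters."""
    return JOB_NAME_RE.fullmatch(name) is not None


def is_valid_job_description(text: str) -> bool:
    """Return True if text consists of 0 to 256 allowed characters."""
    return JOB_DESC_RE.fullmatch(text) is not None
```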

Step 4: Defining Training Configuration

Table 3 Training configuration

Parameter

Description

Preset Template (Optional)

Click Select Preset Template to filter templates by type (currently supporting text generation and image understanding) or brand (currently supporting Qwen).

After you select a preset template, some templates automatically fill in the description (optional), image, boot command, code directory, local code directory, and environment variables of the current job. The settings displayed on the GUI prevail.

You can also adjust the configuration as needed.

Select Image

Specifies the container image used to run the training code. The following options are available:

Preset Images: Ready-to-use images provided by ModelArts that include popular frameworks (e.g., PyTorch 1.8, TensorFlow 2.1). Ideal for most standard scenarios.

Custom Images: Select an image that you have created and pushed to the SWR image repository or a registered image.

  • You can select an image from SWR Basic Edition or Enterprise Edition.
  • Registered Images: Select an image that you have registered on the ModelArts console.

Custom images must be registered in ModelArts Image Management before use. This option is recommended when preset base images do not meet specific dependency requirements.

Boot Command

Defines the command executed upon container startup to launch your training script.
  • If the training boot script is a .py file, train.py for example, the boot command is as follows.
    python ${MA_JOB_DIR}/demo-code/train.py
  • If you use a shell script (e.g., main.sh):
    bash ${MA_JOB_DIR}/demo-code/main.sh

The boot command supports multiple commands concatenated with ; or &&. Note that demo-code represents the leaf directory of the OBS path where your code is stored; adjust this according to your actual project structure.

NOTE:

To ensure data security, do not include sensitive information such as plaintext passwords.

Code Directory (Optional)

Specifies the OBS directory containing the training code. This parameter is required when using a preset image and optional when using a custom image. You can choose your own OBS bucket or enter a path. The path must start with obs:// and end with a slash (/), like this: obs://bucketname/path/. For shared buckets from other users, you must enter the path. In the OBS bucket, files with the .txt, .py, .sh, and .yaml extensions can be edited online, and files with the .log, .json, and .md extensions can be viewed online.

  • Upload code to the OBS bucket beforehand. The total size of files in the directory cannot exceed 5 GB, the number of files cannot exceed 1,000, and the folder depth cannot exceed 32. If there is a pre-trained model, put it in the code directory.
  • The training code file is automatically downloaded to the ${MA_JOB_DIR}/demo-code directory of the training container when the training job is started. demo-code is the last-level OBS directory for storing the code. For example, if Code Directory is set to /test/code, the training code file is downloaded to the ${MA_JOB_DIR}/code directory of the training container.
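
The mapping from the Code Directory setting to the download location in the container can be sketched as follows. This is an illustration only: container_code_dir is a hypothetical helper, and ma_job_dir stands for whatever ${MA_JOB_DIR} resolves to in your training container.

```python
import posixpath


def container_code_dir(code_directory: str, ma_job_dir: str) -> str:
    """Map a Code Directory setting to the directory the code is downloaded to.

    Only the last-level directory of the OBS path is kept, and it is placed
    under ma_job_dir (the value of ${MA_JOB_DIR} in the training container).
    """
    path = code_directory
    # Drop an optional obs:// prefix and any trailing slash, then keep the leaf.
    if path.startswith("obs://"):
        path = path[len("obs://"):]
    leaf = posixpath.basename(path.rstrip("/"))
    if not leaf:
        raise ValueError(f"no leaf directory in {code_directory!r}")
    return posixpath.join(ma_job_dir, leaf)
```

With Code Directory set to /test/code, the helper returns the code subdirectory under ma_job_dir, so the matching boot command would be python ${MA_JOB_DIR}/code/train.py.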

Code Backup Directory (Optional)

Specifies the OBS directory where you want to back up the training code file.

  • You must create the directory in advance.
  • When a training job is started, files in the code directory are automatically backed up to ${MA_RECORD_CODE_DIR_OBS}/${ma_job_name}-${MA_TASK_NAME}-${task_index}/user-job-dir/demo-code. demo-code is the last-level OBS directory for storing the code. For example, if the code backup directory is /test/record-dir and the code directory is /test/code, the training code file will be backed up to the /test/record-dir/job_name-work-0/user-job-dir/code directory.
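
The backup path pattern above can be reproduced with a short helper, for example to locate a backup after a job has run. This is a sketch under the assumption that the path is composed exactly as described. The function and parameter names are hypothetical.

```python
import posixpath


def code_backup_dir(backup_dir, code_dir, job_name, task_name, task_index):
    """Build the backup location of the training code for one task.

    Follows the pattern
    <backup dir>/<job name>-<task name>-<task index>/user-job-dir/<code leaf>.
    """
    leaf = posixpath.basename(code_dir.rstrip("/"))
    run_dir = f"{job_name}-{task_name}-{task_index}"
    return posixpath.join(backup_dir.rstrip("/"), run_dir, "user-job-dir", leaf)
```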

Local Code Directory

Specifies the local directory within the training container where the code will be downloaded. The default path is /home/ma-user/modelarts/user-job-dir.

The path cannot be set to /home/ma-user or any subdirectory under /home/ma-user/modelarts/*, /home/ma-user/modelarts-dev/*, or /home/ma-user/infer/*.

Click Preview Runtime Environment to view the actual working directory of the training job.

Environment Variable

Allows you to add custom environment variables based on service requirements. For predefined environment variables in the training container, see Managing Environment Variables of a Training Container.

  • Click Add to manually enter environment variables (up to 100 entries).
  • Click Upload to batch import environment variables using a template. Total entries must not exceed 100 to avoid import failure.
NOTE:

To ensure data security, do not include sensitive information such as plaintext passwords.
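
Custom environment variables are visible to the training process like any other environment variable. The sketch below shows one way a training script might read them with fallback defaults. LEARNING_RATE and EPOCHS are hypothetical names chosen for illustration and are not predefined by ModelArts.

```python
import os


def read_settings(environ=os.environ):
    """Read training settings from environment variables, with defaults."""
    return {
        # LEARNING_RATE and EPOCHS are hypothetical variable names.
        "learning_rate": float(environ.get("LEARNING_RATE", "0.001")),
        "epochs": int(environ.get("EPOCHS", "10")),
    }
```

Passing the mapping as an argument keeps the function easy to test without touching the real process environment.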

Step 5: Configuring Resources

Table 4 Resource configuration

Parameter

Description

Source of resources

  • Public resource pool: Available to all tenants. You do not need to create it.
  • Dedicated resource pool: Dedicated resource pools are created separately and used exclusively. For details, see Creating a Dedicated Resource Pool.

Resource Pool

This parameter appears only for dedicated resource pools. In the Resource Pool section, click Select Resource Pool and choose your desired dedicated resource pool or logical subpool from the menu on the right. Click OK.

You can view the dedicated resource pool name, node pool specifications, number of available nodes/maximum number of nodes, number of available NPUs/GPUs, available CPUs (vCPUs), available memory (GiB), and resource fragments. Hover over View in the Resource Fragment column to check fragment details and confirm that the resource pool meets the training requirements.

Once you choose a resource pool, its details appear. To choose a different one, click Reselect.

Specification Type

Displayed when you select a dedicated resource pool. The following specification types are supported:

  • Preset: Select preset instance specifications in the dedicated resource pool. Ensure that the selected flavor has sufficient disk space to download the input file.
  • Customized Specifications: With a dedicated resource pool, a training job can use custom resource specifications to improve pool utilization. You can set CPU (vCPUs), Memory (GB), Ascend (PU), and Compute Nodes as required. Custom specifications must not exceed the node specifications of the dedicated resource pool.
  • For CPU specifications, you can only customize the number of vCPUs and memory. For GPU and Ascend specifications, you can customize the number of vCPUs, memory, and cards.
Figure 1 Specifications

Specifications

Determines the hardware specifications for the training instances. For Dedicated resource pool, you must select a pool first. For Public resource pool, select a specification directly from the list.

  • The upper part displays the CPU, NPU, and GPU specifications. The lower part displays details such as the specification name, memory, and reference price.
  • The specification name contains the number of PUs, processor model, CPU and GPU information, and memory size.
  • If a type is grayed out or invisible, it is unsupported.
    NOTE:
    • Specifications labeled GPU:n*tnt004 do not support multi-process training jobs.
    • For a public resource pool: When you create a job with a custom image, the image type must match the resource type of the selected specifications, for example, a GPU image with GPU specifications. Otherwise, the training job will fail.

Compute Nodes

Select the number of instances as required. The default value is 1.

  • If only one instance is used, a single-node training job is created. ModelArts starts one training container on this node. The training container exclusively uses the compute resources of the selected specifications.
  • If more than one instance is used, a distributed training job is created. For more information about distributed training configurations, see Overview.

    Once you set up hot standby nodes for a resource pool, these nodes are reserved for high availability and can only be used for recovering faulty nodes. They cannot be used for training jobs. This reduces the number of training job instances you can create. For details about how to disable hot standby nodes, see Rectifying a Faulty Node in a Dedicated Resource Pool.

    Before creating a distributed training job, pre-install all required pip dependencies (see Installing pip Dependencies in an Image). If there are more than 10 nodes, the system automatically deletes the pip source configuration. Executing pip install commands during training may cause training failures.

Specify Affinity Nodes

Supported only for dedicated resource pools. It allows you to configure supernode and node affinity for training jobs. Select the checkbox to enable it.

When enabled, it allows fine-grained control over pod deployment strategies, including: strict placement (strong affinity), preferred placement (weak affinity), prohibited placement (strong anti-affinity), and avoided placement (weak anti-affinity).

Affinity Type:

  • Node affinity: All instances of the training job are scheduled onto the selected nodes, either strictly or preferentially.
  • Node anti-affinity: All instances of the training job are kept off the selected nodes, either strictly or preferentially.

Strength: The degree of affinity.

  • Weak: The system tries to place the pod on the specified node, but placement is not guaranteed.
  • Strong: The pod must be scheduled onto the specified node; otherwise, scheduling does not proceed.

Supernode Affinity Method: Supported only for supernode resource pools. At the supernode level, only scenarios where all instances of a training job belong to one affinity group are currently supported. This suits training jobs whose traffic must not cross supernodes.

  • Random child nodes: The system randomly schedules tasks to child nodes within the target supernode.
  • Specify child nodes: The system schedules tasks to the specified child nodes.

Select Supernode: Choose the supernodes to configure. Supported only for supernode resource pools.

Select Node: Choose the nodes to configure.

Storage Mounting

Enables mounting high-performance storage to improve data access efficiency. Dedicated resource pools support multiple options. For details, see Table 1.

  • Add Extended Storage (SFS Turbo)

    When ModelArts and SFS Turbo are directly connected, multiple SFS Turbo file systems can be mounted to a training job to store training data. You can mount a file system multiple times, but each mount path must be distinct. A maximum of five disks can be mounted to a training job. For details, see Configuring Network Passthrough Between ModelArts and SFS Turbo.

    • Name: Select an SFS Turbo file system.
    • File System Directory: Select the storage location of the SFS Turbo file system. If you have configured the folder control permission, select a storage location. If you have not configured the folder control permission, retain the default value / or customize a location.
    • Mount Path: Enter the SFS Turbo mounting path in the training container. The path cannot be a / directory or a system-mounted directory like /cache or /home/ma-user/modelarts.
    • Mounting Mode: Shows the permissions on the mounted SFS Turbo file system. This parameter is displayed as Read/Write or Read-only based on the permissions on the SFS Turbo storage location. If you have not configured the folder control permission, this parameter is unavailable. For details about how to set permissions for SFS Turbo folders, see Permissions Management.
    • Mount Options: Configure SFS mount parameters to accelerate training. Alternatively, retain the default settings below:
      mountOptions:
      - vers=3 
      - timeo=600 
      - nolock 
      - hard
      NOTE:

      Configuring SFS Turbo allows the frontend page to fetch the latest storage details and settings directly, ensuring valid and accurate data.

      1. Querying Details About a File System
      2. Listing File Systems
  • Add OBS Bucket:

    Training jobs allow you to mount an OBS bucket for storing training data. Set the parameters for adding a bucket as follows:

    Directory: Select the storage location of the OBS extended storage. If you have configured the folder control permission, select a storage location. If you have not configured the folder control permission, retain the default value / or customize a location. You can choose your own OBS bucket or enter a path. The path must start with obs:// and end with a slash (/), like this: obs://bucketname/path/. For shared buckets from other users, you must enter the path.

    Mount Path: Enter the cloud mount path of OBS in the training container. The path cannot be a / directory or a system-mounted directory like /cache or /home/ma-user/modelarts.

    WARNING:

    Due to the inherent differences between Object Storage Service (OBS) semantics and POSIX file systems, using rclone via FUSE for file-system-like access comes with the following restrictions:

    • Write semantics: No support for random or append writes. You cannot perform in-place modifications on existing files, including seek writes or appends. For example, opening and modifying a file in r+, w+, a, or a+ mode, or continuously appending to a log (e.g., while true; do echo "$(date) line" >> app.log; sleep 1; done), is not supported.
    • Permissions: Most displayed permissions are faked or uniformly mapped by rclone. It is not a true multi-user permission system; therefore, using chmod or chown to modify file permissions will not work.
    • Concurrent writes: Writing to the same location at the same time leads to unpredictable results. When several processes modify one file simultaneously, the outcome may vary.
    • Link restrictions: Hard links are not supported. Symbolic links may be treated as regular files (storing the link text). For example, running ln -s target.txt link.txt might fail outright or result in link.txt being uploaded as a regular file containing the string target.txt.
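
One way to work within these restrictions is to write and append files on local container storage, then copy each finished file to the OBS mount path in a single sequential write. The following is a minimal sketch of that pattern, not a ModelArts API. publish_file is a hypothetical helper, and the temporary directories stand in for the local disk and the mount path.

```python
import shutil
import tempfile
from pathlib import Path


def publish_file(local_file: Path, mounted_dir: Path) -> Path:
    """Copy a completed local file to the mounted directory as a whole-file write.

    The destination is written once from start to end. It is never opened
    in append or read-write mode.
    """
    mounted_dir.mkdir(parents=True, exist_ok=True)
    target = mounted_dir / local_file.name
    shutil.copyfile(local_file, target)
    return target


if __name__ == "__main__":
    # Temporary directories stand in for local disk and the OBS mount path.
    with tempfile.TemporaryDirectory() as local, tempfile.TemporaryDirectory() as mount:
        log = Path(local) / "app.log"
        with log.open("a") as f:  # appending is fine on local storage
            f.write("epoch 1 done\n")
            f.write("epoch 2 done\n")
        print(publish_file(log, Path(mount)).read_text())
```

Each process should publish to its own file name so that no two processes write to the same location at the same time.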

Job Scheduling Priority

  • When using a dedicated resource pool, you can set and change the scheduling priority of the training job. This parameter is not supported when a public resource pool is used.
  • The platform handles jobs by prioritizing them from highest to lowest. If multiple jobs share the same priority, they are scheduled in the order they were submitted. When resources are available, the earliest-submitted job gets processed first.
  • Changing the number changes the priority of the job in the queue. The priority can be set to 1, 2, or 3. A larger number indicates a higher priority. The default priority is 1, and the highest priority is 3.
  • To set the priority to 3, you will also need the permission. For details about how to set the permission, see Assigning the Permission to Set the Highest Job Priority to an IAM User.
  • If a training job is in the Pending state for a long time, you can change the job priority to reduce the queuing duration. For details, see Priority of a Training Job.
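
The ordering rule described above (higher priority first, earlier submission first within the same priority) can be sketched as a sort key. This illustrates the rule only. QueuedJob is a hypothetical type, and the actual scheduler also depends on resource availability.

```python
from dataclasses import dataclass


@dataclass
class QueuedJob:
    name: str
    priority: int      # 1, 2, or 3; a larger number means a higher priority
    submitted_at: int  # submission time, for example a Unix timestamp


def scheduling_order(jobs):
    """Sort by priority (highest first), then by submission time (earliest first)."""
    return sorted(jobs, key=lambda job: (-job.priority, job.submitted_at))
```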

Preemption

  • When using a dedicated resource pool, you can set this parameter. This parameter is not supported when a public resource pool is used.
  • When enabled, jobs that allow preemption may be terminated and re-queued if resource pool capacity is insufficient. To avoid losing training progress, configure resumable training before enabling this function. For details, see Resumable Training.

Step 6: Configuring Data

Table 5 Data configuration

Parameter

Description

Training Dataset

Training datasets are used to improve model performance on specific tasks.

You can select up to three datasets for training. You can select both Preset Data and My Data.

Preset Data: template data officially provided by the platform.

My Data: personal data uploaded by you. If existing data assets cannot be selected, go to Asset Management > Data to publish them.

Step 7: Publishing Models to Assets

Parameter

Description

Publish to Assets

When enabled, the system will automatically publish model artifacts as assets, enabling operations such as inference and evaluation on the platform.

Model Output Path

  • Directory: OBS path for storing the model after training is complete. You can choose your own OBS bucket or enter a path. The path must start with obs:// and end with a slash (/), like this: obs://bucketname/path/. For shared buckets from other users, you must enter the path.

  • Mount Path: This parameter is displayed when you use a dedicated resource pool. The system mounts the file directory in the storage location to the specified path in the training container. You can customize this path, but system directories such as /home/, /home/ma-user/, and /home/ma-user/modelarts/ are not supported.

Auto-publish to Assets

When enabled, the trained model will be automatically uploaded to the Asset Management > Models > My Models page.

This option is deselected by default.

Publishing Method

A trained model can be published as a new model or a new version of an existing model.

New model: The published model is a new model and is displayed as a new asset model on the Asset Management > Models > My Models page.

New version: The published model is the model with the same name on the Asset Management > Models > My Models page. Only the model version number changes.

Model Name

Name of the new model.

Enter 2 to 128 characters. Only letters, digits, hyphens (-), and underscores (_) are allowed. The name must start with a letter and end with a letter or digit.
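
The model naming rule can be expressed as a single regular expression. The sketch below assumes that "letter" means an ASCII letter. is_valid_model_name is a hypothetical helper, and the console remains the authority on what is accepted.

```python
import re

# First character: a letter. Last character: a letter or digit.
# Up to 126 characters in between: letters, digits, hyphens, or underscores.
MODEL_NAME_RE = re.compile(r"[A-Za-z][A-Za-z0-9_-]{0,126}[A-Za-z0-9]")


def is_valid_model_name(name: str) -> bool:
    """Return True if name has 2 to 128 characters and follows the rule above."""
    return MODEL_NAME_RE.fullmatch(name) is not None
```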

Model Type

Model type of the published model.

Model Brand

Model brand.

Model Version

If the model is published as a new model, the version number is V1.

If the model is published as a new version of an existing model, the version number is automatically incremented by 1 based on the previous version number of the model.

Note: The model version number cannot be modified and is automatically generated by the system.
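
The numbering rule can be illustrated as follows. The system generates the version number itself, so this sketch only helps to predict the version a job will produce. It assumes that versions always have the form V followed by an integer, and next_model_version is a hypothetical helper.

```python
import re


def next_model_version(previous=None):
    """Return "V1" for a new model, otherwise the previous number plus 1."""
    if previous is None:
        return "V1"
    match = re.fullmatch(r"V(\d+)", previous)
    if match is None:
        raise ValueError(f"unexpected version format: {previous!r}")
    return f"V{int(match.group(1)) + 1}"
```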

Description (Optional)

Description of the trained model. This field is optional and can contain a maximum of 256 characters.

Step 8: Configuring HA

Table 6 HA configuration

Parameter

Description

Fault Tolerance and Recovery

Specifies whether to enable automatic restart for the training job.

  • Deselected (default): Automatic restart is disabled. If an error occurs, the training job will stop immediately.
  • Selected: If a training job fails due to environment issues, process suspensions, or other abnormalities, the system automatically detects the fault and applies recovery strategies to improve the success rate. The system supports process-level, container-level, and job-level automatic restart and recovery. These strategies are matched and escalated automatically without requiring additional configuration.

    To avoid losing training progress and make full use of compute, ensure that your code logic supports resumable training before enabling this function. For details, see Resumable Training.

    If auto restart is triggered during training, the system records the restart information. You can check the fault recovery details on the training job details page. For details, see Training Job Fault Tolerance Check.
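
Resumable training generally means that the script saves its progress at regular intervals and, on startup, continues from the last saved point instead of starting over. The following is a framework-independent sketch of that idea. It stores only an epoch counter in a JSON file, whereas a real job would also save model and optimizer state with its framework's own tools. The function names and the checkpoint path are hypothetical, and the checkpoint must be on storage that survives a restart.

```python
import json
import os


def save_checkpoint(path, state):
    """Write state to path via a temporary file so a restart never sees a partial file."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"epoch": 0}


def train(path, total_epochs):
    """Run the remaining epochs, saving progress after each one."""
    state = load_checkpoint(path)
    for epoch in range(state["epoch"], total_epochs):
        # ... run one epoch of training here ...
        state["epoch"] = epoch + 1
        save_checkpoint(path, state)
    return state
```

If the job is restarted after epoch 2 of 5, train picks up at epoch 3 because load_checkpoint returns the last saved counter.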

Maximum Restarts

This parameter is available when Fault Tolerance and Recovery is selected.

The training job stops if it is still abnormal after the maximum number of automatic restarts.

  • Default value: 3
  • Value range: 1 to 128

The value cannot be changed once the training job is created. Set this parameter based on your needs.

Unconditional Auto Restart

This parameter is available when Fault Tolerance and Recovery is selected. If Unconditional auto restart is selected, the training job will be restarted unconditionally once the system detects a training exception. To prevent invalid restarts, it supports a maximum of three consecutive unconditional restarts.

Restart Upon Suspension

This parameter is available when Fault Tolerance and Recovery is selected. ModelArts continuously monitors job processes to detect suspension and optimize resource usage. When this feature is enabled, suspended jobs can be automatically restarted at the process level.

CPU specifications do not support job restarts upon suspension.

However, ModelArts does not verify code logic, and suspension detection is periodic, which may result in false reports. By enabling this feature, you acknowledge the possibility of false positives. To prevent unnecessary restarts, ModelArts limits consecutive restarts to three.

Step 9: Managing Access Configuration

Table 7 Access configuration

Parameter

Description

JupyterLab

Enables online debugging and development via integrated tools such as JupyterLab.

Remote SSH

Allows remote connection to training job instances from a local IDE for real-time debugging and execution. The system automatically starts the SSHD service for each instance and configures SSH passwordless login between instances to facilitate cross-node collaboration.

A key pair must be created in advance.

If enabled, Password-free SSH Between Instances will be unavailable.

Password-free SSH Between Instances

Specifies whether to generate SSH passwordless mutual trust files between instances.

  • Deselected (default): Mutual trust files are not generated.
  • Selected: Mutual trust files are generated. When performing distributed training using custom images based on frameworks like MPI or Horovod, this configuration is mandatory to ensure seamless inter-instance communication and the successful execution of distributed tasks.

    You must also configure the Password-free SSH File Directory, which specifies where the auto-generated SSH key files are stored in the training container (default: /home/ma-user/.ssh). For details, see Configuring Password-free SSH Mutual Trust Between Instances for a Training Job Created Using a Custom Image.

    If enabled, Remote SSH will be unavailable.

Step 10: Enabling Observability

Table 8 Observability configuration

Parameter

Description

TensorBoard

TensorBoard is the visualization toolkit of TensorFlow. It provides the visualization functions and tools required for machine learning experiments, displaying the computational graph, metric trends, and data used during training. For details about TensorBoard, see the official website.

This parameter is not supported when a public resource pool is used.

Stores the results generated by the visualization tool TensorBoard.

MindStudio Insight

MindStudio Insight visualizes information such as scalars, images, computational graphs, and model hyperparameters during training. It supports training jobs based on the MindSpore engine. For details about MindStudio Insight, see the MindSpore official website.

This parameter is not supported when a public resource pool is used.

Stores the results generated by the visualization tool MindStudio Insight.

Interconnect Metrics with AOM

Specifies whether to enable Prometheus metrics collection. Configure parameters in your training container to collect Prometheus metrics. Once set up, the system periodically gathers metric data during training and uploads it to AOM, allowing you to monitor custom Prometheus metrics via the AOM console.

ModelArts provides two configuration methods.

  • Method 1: Provide an HTTP process for the platform to call regularly to get metrics. For details, see Collecting Prometheus Metrics for AOM Over HTTP.
    • Set Metrics Collection Method to HTTP.
    • Enter the collection URL and port.

    Make sure the URL and port number match those of the metric collection process. If they do not, metrics might not report correctly. Additionally, ensure that the network environment where the training job is located allows access to the configured URL and port number to prevent metric collection failures due to network issues.

  • Method 2: Provide a Linux command for the platform to execute regularly to get metrics. For details, see Collecting Prometheus Metrics for AOM Using Commands.
    • Metrics Collection Method: Set it to Command Line.
    • Command: Enter the Linux command for reading metrics, for example, cat.
    • Command Parameters: Enter the path to the custom metric file, for example, /XXX/a.prom.

    Make sure the command and its parameters can reliably produce metrics with responses in seconds. If not, metric collection might fail.
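
For the command-line method, the training code needs to keep a metrics file up to date so that a command such as cat can return it quickly. The sketch below writes metrics in the Prometheus text exposition format and replaces the file in one step. It assumes that the platform expects this format. The function names and metric names are hypothetical, and the file path must match the one entered in Command Parameters.

```python
import os


def render_metrics(metrics):
    """Render {name: value} as gauge lines in the Prometheus text format."""
    lines = []
    for name, value in sorted(metrics.items()):
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


def write_metrics_file(path, metrics):
    """Replace the metrics file in one step so a reader never sees half a file."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        f.write(render_metrics(metrics))
    os.replace(tmp, path)
```

Calling write_metrics_file once per training step or epoch keeps the file current without making the collection command do any work beyond reading it.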

Step 11: Adjusting Additional Configurations

Parameter

Description

Persistent Log Saving

This function is enabled by default when Ascend specifications are selected.

This function is available when CPU or GPU specifications are selected.

  • If this function is enabled (default), configure Log Path. The platform permanently stores training logs to the specified OBS path.
  • If this function is disabled, ModelArts automatically stores the logs for 30 days. You can download all logs on the job details page to a local path.

Log Path

When Persistent Log Saving is enabled, you must configure a log path to store log files generated by the training job.

Ensure that you have read and write permissions to the selected OBS directory. You can choose your own OBS bucket or enter a path. The path must start with obs:// and end with a slash (/), like this: obs://bucketname/path/. For shared buckets from other users, you must enter the path.

Job Visibility

The options are Workspace and Creator.

  • Workspace: The created training job is visible to all users in the current workspace.
  • Creator: Only the creator can view the job by default. To access it, other users must request the modelarts:trainJob:listAll permission, which allows them to view all training jobs, including those limited to the creator.

Auto Stop

Choose whether to enable Auto Stop.

  • This function is disabled by default, and the training job keeps running until the training is completed.
  • If you enable this function, set the auto stop time. The value can be 1 hour, 2 hours, 4 hours, 6 hours, or Customize. The customized time must range from 1 hour to 720 hours. When you enable this function, the training stops automatically when the time limit is reached. The time limit does not count down when the training is paused.

Retention Period

Choose whether to retain the training container environment after the job completes or fails, and set how long to retain it.

  • Deselected: If you use a dedicated resource pool and deselect this option, the training container environment is not retained after the job completes or fails, and you cannot log in to the container through Cloud Shell to view training details.
  • Selected: If you use a dedicated resource pool and select this option, set how long the training container environment is retained after the job ends. During this period, you can log in to the container through Cloud Shell on the training details page to view training information.
    • If you select a public resource pool, the retention period cannot be selected.
    • If you select this option, you cannot configure Fault Tolerance and Recovery in the HA settings, so automatic restart is unavailable.
    • If you select this option, the job status is displayed as Completed (retained) or Failed (retained). The same status appears in the response body of the training job details API.
    • Set the period to 1, 2, 4, or 6 hours, or choose a custom time from 1 to 720 hours.

Note: You will still be charged for the training environment during the retention period. Set the retention time based on your needs.

Event Notification

Choose whether to enable event notification for the training job.

  • This function is deselected by default, which means SMN is disabled.
  • If this function is enabled, you will be notified of specific events, such as job status changes or suspected suspensions, via an SMS or email. Notifications will be billed based on SMN pricing. In this case, you must configure the topic name and events.
    • Topic: topic of event notifications. Click Create Topic to go to the SMN console to create a topic and add a subscription to the topic. You will receive event notifications only after the subscription status changes to Confirmed. For details, see Adding a Subscription.
    • Event: events you want to subscribe to. Examples: JobStarted, JobCompleted, JobFailed, JobTerminated, JobHanged, JobRestarted, and JobPreempted.
NOTE:
  • SMN charges you for the number of notification messages. For details, see Billing.
  • Only training jobs using GPUs or NPUs support JobHanged events.

Tags

TMS's predefined tags are recommended for adding the same tag to different cloud resources. For details about how to use tags, see Using TMS Tags to Manage Resources by Group.

You can add up to 20 tags to a training job.

Step 12: Submitting and Viewing the Job

After setting the parameters, click Submit.

A training job runs for a period of time. You can go to the training job list to view the basic information about the training job.

  • In the training job list, Status of a newly created training job is Pending.
  • Once the training job shows Completed, it has finished. The system saves the created model in model assets for later access.
  • If the status is Failed or Abnormal, click the job name to go to the job details page and view logs for troubleshooting.