Updated on 2025-08-18 GMT+08:00

Creating a Production Training Job (New Version)

ModelArts has enhanced the creation page to improve the efficiency of creating training jobs. The updated page streamlines operations and enhances the GUI display.

Developing a model involves continually optimizing its performance. Traditional methods require repeatedly testing various model designs, datasets, and hyperparameters, which takes significant time and effort and may still fail to deliver good results. ModelArts simplifies this process by offering tools for creating training jobs, tracking progress in real time, and managing versions. With ModelArts, you can easily test different configurations and identify the best-performing setup faster.

This section describes how to create a production training job on the ModelArts console.

Notes and Constraints

By default, up to 10,000 training jobs can be created. You can view the remaining quota on the training job list page.

Figure 1 Viewing the remaining quota of a training job

Prerequisites

  • The account is not in arrears, because training jobs use paid resources.
  • The data used for training has been uploaded to an OBS directory.
  • At least one empty folder has been created in OBS for storing the training output.

    ModelArts does not support encrypted OBS buckets. When creating an OBS bucket, do not enable bucket encryption.

  • The OBS directory and ModelArts are in the same region.
  • Access authorization has been configured. If it has not, configure it by referring to Configuring Agency Authorization for ModelArts with One Click.

Billing

Model training in ModelArts uses compute and storage resources, which are billed. Compute resources are billed for running training jobs. Storage resources are billed for storing data in OBS or SFS. For details, see Model Training Billing Items.

Procedure

To create a training job, follow these steps:

  1. Access the page for creating a training job. For details, see Step 1: Accessing the Page for Creating a Training Job.
  2. Configure basic information. For details, see Step 2: Configuring Basic Parameters.
  3. Configure the environment. For details, see Step 3: Configuring the Training Job Environment.
  4. Configure training parameters, including inputs, outputs, hyperparameters, and environment variables. For details, see (Optional) Step 4: Configuring Training Settings.
  5. Configure the resource pool and instance specifications. For details, see Step 5: Configuring Training Job Resources.
  6. Configure auto restart. For details, see Step 6: Configuring Fault Tolerance and Recovery.
  7. Set the job priority, preemption, and auto stop. For details, see Step 7: Configuring Scheduling Parameters.
  8. Configure logs, event notifications, and tags. For details, see Step 8: Configuring Advanced Parameters.
  9. Submit the training job and view its status. For details, see Step 9: Submitting a Training Job and Viewing Its Status.

Step 1: Accessing the Page for Creating a Training Job

  1. Log in to the ModelArts console.
  2. In the navigation pane, choose Model Training > Training Jobs.
  3. Click Create Training Job. The new-version page is displayed by default. The following describes how to create a training job on the new-version page.

Step 2: Configuring Basic Parameters

On the Create Training Job page, configure basic parameters.
Table 1 Basic parameters

Parameter

Description

Runtime Type

Select Production.

Debug your training code either in the cloud or locally before using it to create production training jobs.

  • Production: Run the training job in the production environment. You are advised to start with the debug mode to save resources.
  • Debug: Modify and debug your training code in real time.

Name

Job name, which is mandatory.

The system automatically generates a name, which you can then rename according to the following rules.

  • The name contains 1 to 64 characters.
  • Letters, digits, hyphens (-), and underscores (_) are allowed.

Description (Optional)

Job description, which helps you learn about the job information in the training job list.

Experiment

Specifies whether to organize training jobs into experiments for better management. It helps manage multiple job versions efficiently.

Experiments help manage and optimize training jobs. For example, after fine-tuning hyperparameters, you can sort and compare job results in an experiment to find the optimal training configuration.

  • Use existing: Select an existing experiment to add the training job to the experiment.
  • Create new: Enter an experiment name and description to add the training job to the new experiment.

If you do not enable this feature, the job will not be managed in any experiment.

Step 3: Configuring the Training Job Environment

When creating a training job, configure the algorithm source, boot mode, image engine and version, and code directory. The following algorithm types are supported:
  • Custom algorithm: Create a training job using a preset image or a custom image.
    Table 2 Environment settings (custom algorithm)

    Parameter

    Description

    Algorithm Type

    Select Custom algorithm. This parameter is mandatory.

    Boot Mode

    This parameter is mandatory when Algorithm Type is set to Custom algorithm. Options:

    • Preset image: Create a training job using a preset training framework and image.
    • Custom image: Create a training job using a custom image.

    If the software in the preset images cannot meet your needs, you can use a custom image for training. This image must be uploaded to SWR beforehand. For details about how to create an image, see Preparing a Model Training Image.

    Engine and Version

    If Boot Mode is set to Preset image, you need to select the required engine and version.

    Ensure that the framework of the AI engine you select is the same as the one you use for writing algorithm code. For example, if PyTorch is used for writing algorithm code, select PyTorch when you create a job.

    Image

    Select a container image for training. For details about the training image creation requirements, see Preparing a Model Training Image.

    • If Boot Mode is set to Preset image and the engine version is set to Customize, you need to select a proper image from the container images.

    • If the Boot Mode is set to Custom image, you need to select a proper image from the container images.

    You can set the container image path in either of the following ways:
    • To select your image or an image shared by others, click Select on the right and select a container image for training. The required image must be uploaded to SWR beforehand.
    • To select a public image, enter the address of the public image in SWR. Enter the image path in the format "Organization name/Image name:Version name". Do not include the domain name (swr.<region>.myhuaweicloud.com) in the path, because the system automatically adds the domain name to it. For example, if the SWR address of a public image is swr.<region>.myhuaweicloud.com/test-image/tensorflow2_1_1:1.1.1, enter test-image/tensorflow2_1_1:1.1.1.

    Code Source

    Select the code source. OBS is selected by default.

    • OBS: Select OBS if the training code is stored in an OBS bucket.

    Code Directory

    This parameter is available only when Code Source is set to OBS.

    Select the OBS directory where the training code file is stored. This parameter is mandatory when Boot Mode is set to Preset image. This parameter is optional when Boot Mode is set to Custom image.

    • Upload code to the OBS bucket beforehand. The total size of files in the directory cannot exceed 5 GB, the number of files cannot exceed 1,000, and the folder depth cannot exceed 32. If there is a pre-trained model, put it in the code directory.
    • The training code file is automatically downloaded to the ${MA_JOB_DIR}/demo-code directory of the training container when the training job is started. demo-code is the last-level OBS directory for storing the code. For example, if Code Directory is set to /test/code, the training code file is downloaded to the ${MA_JOB_DIR}/code directory of the training container.

    Boot File

    Select or enter the Python boot script of the training job in the code directory. This parameter is mandatory when Boot Mode is set to Preset image. This parameter is not required when Boot Mode is set to Custom image.

    ModelArts supports only the boot file written in Python. Therefore, the boot file must end with .py.

    Boot Command

    Command for booting an image. This parameter is not required when Boot Mode is set to Preset image. This parameter is mandatory when Boot Mode is set to Custom image.

    When a training job is running, the boot command is automatically executed after the code directory is downloaded.
    • If the training boot script is a .py file, train.py for example, the boot command is as follows.
      python ${MA_JOB_DIR}/demo-code/train.py
    • If the training boot script is a .sh file, main.sh for example, the boot command is as follows:
      bash ${MA_JOB_DIR}/demo-code/main.sh

    You can use semicolons (;) and ampersands (&&) to combine multiple commands. demo-code in the command is the last-level OBS directory where the code is stored. Replace it with the actual one.

    If there are input pipes, output pipes, or hyperparameters, ensure that the last command of the boot command runs the training script.

    Reason: The system appends input pipes, output pipes, and hyperparameters to the end of the boot command. If the last command is not the training script, an error will occur.

    Example: If the last line of the boot command is python train.py and the --data_url hyperparameter exists, the system executes python train.py --data_url=/input when running properly. However, if the boot command ends with another command, such as:

    python train.py
    pwd    # The last command is pwd instead of the training script.

    The system will execute pwd --data_url=/input instead of passing the hyperparameter to the training script, leading to an error.
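
    A correct version of this example keeps the training script as the last command, for instance (demo-code and train.py are placeholders for your own code directory and boot script):

    pwd; python ${MA_JOB_DIR}/demo-code/train.py

    Because the training script is the last command, the system appends the hyperparameter to it and executes python ${MA_JOB_DIR}/demo-code/train.py --data_url=/input as expected.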

    NOTE:

    To ensure data security, do not enter sensitive information, such as plaintext passwords.

    User ID

    ID of the user who runs the container. This parameter is not required when Boot Mode is set to Preset image. This parameter is optional when Boot Mode is set to Custom image.

    If the UID needs to be specified, its value must be within the specified range. The UID ranges of different resource pools are as follows:

    • Public resource pool: 1000 to 65535
    • Dedicated resource pool: 0 to 65535

    The default value 1000 is recommended.

    If the user ID is set to 0, the user in the training container is root.

    Local Code Directory

    This parameter is available only when Code Source is set to OBS. This parameter is optional.

    This parameter specifies the local directory of the training container. When training starts, the code directory is downloaded to this directory. The default local code directory is /home/ma-user/modelarts/user-job-dir.

    Click Preview Runtime Environment in the upper right corner of the page to view the work directory of the training job.

    Container Execution Directory

    Specify the local directory in the training container where the boot command is executed. During training, the system automatically runs the cd command to switch to this directory before executing the boot file.

    This directory can store temporary files generated during training. It must be the parent directory of the boot file's local directory.

  • My algorithm: Use an algorithm in Algorithm Management to create a training job.

    Set Algorithm Type to My algorithm and select an algorithm from the algorithm list. If no algorithm meets the requirements, you can create an algorithm. For details, see Creating an Algorithm.

(Optional) Step 4: Configuring Training Settings

When creating a training job, you can configure the inputs, outputs, hyperparameters, and environment variables of the training job.

Table 3 Training settings

Parameter

Description

Input

Click Add and configure training inputs.

  • Parameter name

    The algorithm code reads the training input data based on the input parameter name.

    The recommended value is data_url. The training input parameters must match the input parameters of the selected algorithm.

  • Input Source: The training input supports OBS storage and datasets.
    • Click Select next to Input Source and select the storage path to the training input data from an OBS bucket. Files must not exceed 10 GB in total size, 1,000 in number, or 1 GB per file.
    • Click Dataset and select the target dataset and its version in the ModelArts dataset list.

    When the training job is started, ModelArts automatically downloads the data in the input path to the training container.

  • Obtained from

    The following uses training input data_path as an example.

    • If you select Hyperparameters, use this code to obtain the data:
      import argparse
      parser = argparse.ArgumentParser()
      parser.add_argument('--data_path')
      args, unknown = parser.parse_known_args()
      data_path = args.data_path 
    • If you select Environment variables, use this code to obtain the data:
      import os
      data_path = os.getenv("data_path", "")

Output

Click Add and configure training outputs.

  • Parameter name

    The algorithm code reads the training output data based on the output parameter name.

    The recommended value is train_url. The training output parameters must match the output parameters of the selected algorithm.

  • Data path

    Click Select next to Data path and select the storage path to the training output data from an OBS bucket. Files must not exceed 1 GB in total size, 128 in number, or 128 MB per file.

    During training, the system automatically synchronizes files from the local output directory of the training container to the data path.

    Only OBS can be used to store output data. Select an empty folder as the output path, and do not use the folder that stores the training input data.

  • Obtained from

    The following uses the training output train_url as an example.

    • If you select Hyperparameters, use this code to obtain the data:
      import argparse
      parser = argparse.ArgumentParser()
      parser.add_argument('--train_url')
      args, unknown = parser.parse_known_args()
      train_url = args.train_url 
    • If you select Environment variables, use this code to obtain the data:
      import os
      train_url = os.getenv("train_url", "")
  • Predownload to Container Directory

    Choose whether to pre-download files in the output directory to the training container.

    • If you set this parameter to No, the system does not download the files in the training output path to the local directory of the training container when the training job is started.
    • If you set this parameter to Yes, the system automatically downloads the files in the training output path to the local directory of the training container when the training job is started. The larger the file size, the longer the download time. To avoid excessive training time, remove any unneeded files as soon as possible. Select Yes for Incremental Model Training or Resumable Training.
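
For incremental or resumable training, the training code typically checks the pre-downloaded output directory for earlier results before training starts. The following is a minimal sketch, assuming the output parameter train_url is obtained from hyperparameters and points to the local output directory that the platform synchronizes with OBS; the progress file name and the training loop are placeholders:

  import argparse
  import os

  parser = argparse.ArgumentParser()
  parser.add_argument('--train_url')             # training output path, as configured above
  args, unknown = parser.parse_known_args()

  # Hypothetical progress file; real jobs usually store framework checkpoints instead.
  progress_file = os.path.join(args.train_url, 'last_epoch.txt')

  start_epoch = 0
  if os.path.exists(progress_file):              # present only if predownload is set to Yes
      with open(progress_file) as f:
          start_epoch = int(f.read().strip()) + 1

  for epoch in range(start_epoch, 10):           # placeholder training loop
      # ... train one epoch here ...
      with open(progress_file, 'w') as f:        # files written here are synced back to OBS
          f.write(str(epoch))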

Hyperparameter

Used for tuning. This parameter is determined by the selected algorithm. If hyperparameters have been defined in the algorithm, all hyperparameters in the algorithm are displayed.

Hyperparameters can be modified and deleted. The status depends on the hyperparameter constraint settings in the algorithm. For details, see Table 6.

  • Click Add to add hyperparameters. The total number of hyperparameters cannot exceed 100.
  • To import hyperparameters in batches, click Upload. You will need to fill in the hyperparameters based on the provided template. The total number of hyperparameters should not exceed 100, or the import will fail.
NOTE:

To ensure data security, do not enter sensitive information, such as plaintext passwords.
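
For example, if you add a hyperparameter named learning_rate (a hypothetical name), the training code can read it with argparse, in the same way as the input and output parameters above:

  import argparse
  parser = argparse.ArgumentParser()
  parser.add_argument('--learning_rate', type=float, default=0.001)  # hypothetical hyperparameter
  args, unknown = parser.parse_known_args()
  learning_rate = args.learning_rate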

Environment Variable

Add environment variables based on service requirements. For details about the environment variables preset in the training container, see Managing Environment Variables of a Training Container.

  • Click Add to add environment variables. The total number of environment variables cannot exceed 100.
  • To import environment variables in batches, click Upload. You will need to fill in the environment variables based on the provided template. The total number of environment variables should not exceed 100, or the import will fail.
NOTE:

To ensure data security, do not enter sensitive information, such as plaintext passwords.
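
For example, if you add an environment variable named BATCH_SIZE (a hypothetical name), the training code can read it with the Python standard library:

  import os
  batch_size = int(os.getenv("BATCH_SIZE", "32"))  # falls back to 32 if the variable is not set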

Automated Hyperparameter Search

If you select My algorithm for Algorithm Type and the selected algorithm supports the autoSearch(S) policy, you can click More Configurations to show Automated Hyperparameter Search.

Selecting it enables automated hyperparameter search during training, potentially increasing the time needed.

For details, see Overview.

Step 5: Configuring Training Job Resources

When creating a training job, you must select training resources. Select a public resource pool or a dedicated resource pool as needed. A dedicated resource pool is recommended for optimal performance. For details about the differences between dedicated and public resource pools, see Differences Between Dedicated Resource Pools and Public Resource Pools.

Table 4 Resource parameters

Parameter

Description

Source of resources

  • Public resource pool: The public resource pool is available for all tenants and does not require user creation.
  • Dedicated resource pool: Dedicated resource pools are created separately and used exclusively. For details, see Creating a Standard Dedicated Resource Pool.

Resource Pool

After selecting Dedicated resource pool, click Select Resource Pool to select the target dedicated resource pool. You can view the status, node specifications, number of idle/fragmented nodes, number of available/total nodes, and number of cards of the dedicated resource pool. Hover over View in the Idle/Fragmented Nodes column to check fragment details and check whether the resource pool meets the training requirements.

Specification Type

This parameter is displayed when you select a dedicated resource pool. The following specifications types are supported:

  • Preset: Select preset instance specifications in the dedicated resource pool.
  • Customized Specifications: Customize the resource specifications used by the training job to improve resource utilization of the dedicated resource pool. Custom specifications cannot exceed the node specifications of the dedicated resource pool. For CPU specifications, you can only customize the number of vCPUs and memory. For GPU and Ascend specifications, you can customize the number of vCPUs, memory, and cards.
Figure 2 Specifications

Specifications

This parameter is mandatory when Specification Type is set to Preset. Select resource specifications, including server type and model.

  • If a resource type has been defined in the training code, select a proper resource type based on algorithm constraints. For example, if the resource type defined in the training code is CPU and you select other types, the training fails.
  • If some resource types are invisible or unavailable for selection, they are not supported.
  • If Input is configured, click Check Input Size next to resource pool specifications to ensure the storage is larger than the input data size.
NOTICE:

The instance specifications GPU:n*tnt004 (n indicates a specific number) do not support multi-process training.

Compute Nodes

Select the number of instances as required. The default value is 1.

  • If only one instance is used, a single-node training job is created. ModelArts starts one training container on this node. The training container exclusively uses the compute resources of the selected specifications.
  • If more than one instance is used, a distributed training job is created. For more information about distributed training configurations, see Overview.

    Before creating a distributed training job, pre-install all required pip dependencies in the image (see Installing pip Dependencies in an Image). If there are more than 10 nodes, the system automatically deletes the pip source configuration, so executing pip install commands during training may cause training failures.

Storage Mounting

When you select a dedicated resource pool, you can mount multiple storage types to improve data access efficiency.

Figure 3 Storage mounting
  • Add SFS Turbo

    When ModelArts and SFS Turbo are directly connected, multiple SFS Turbo file systems can be mounted to a training job to store training data. You can mount a file system multiple times, but each mount path must be distinct. A maximum of five disks can be mounted to a training job. For details about how to configure the network connection between ModelArts and SFS Turbo, see Creating a Network.

    Figure 4 SFS Turbo
    • Name: Select an SFS Turbo file system.
    • Mount Path: Enter the SFS Turbo mount path in the training container. The path cannot be a / directory or a system-mounted directory like /cache or /home/ma-user/modelarts.
    • Directory: Specify the SFS Turbo storage location. If you have configured the folder control permission, select a storage location. If you have not configured the folder control permission, retain the default value / or customize a location.
    • Mounting Mode: Permission on the mounted SFS Turbo file system. This parameter is displayed as Read/Write or Read-only based on the permission of the SFS Turbo storage location. If you have not configured the folder control permission, this parameter is unavailable. For details about how to set permissions for SFS Turbo folders, see Permissions Management.
    • Mount Options: Configure SFS mount parameters to accelerate and optimize training. For details about the parameters, see Configuring SFS Turbo Mount Options. Alternatively, retain the default settings below:
      mountOptions:
      - vers=3 
      - timeo=600 
      - nolock 
      - hard
  • Add SFS 3.0 Capacity-Oriented: A training job can mount an SFS 3.0 file system to store training data. Set the parameters below.
    • Name: Enter the name of the file system. The name must be the same as that in SFS. Otherwise, the mounting fails.
    • Mount Path: Enter the cloud mount path in the training container.
    NOTE:

    To add SFS 3.0 Capacity-Oriented, submit a service ticket.

  • Add PFS: A training job can mount an OBS parallel file system to store training data. Set the parameters below.
    Figure 5 PFS
    • Storage Configuration: Select a parallel file system.
    • Mount Path: Enter the cloud mount path in the training container.
    NOTE:

    To add PFS, submit a service ticket.

Supernode Affinity Group Instances

  • Selecting a dedicated resource pool and an Snt9b23 flavor allows you to set Supernode Affinity Group Instances.
  • You can set this parameter if you select a supernode resource pool. If Supernode Affinity Group Instances is set to N, every N pods are scheduled to the same supernode for affinity job scheduling. In distributed training, affinity scheduling ensures uniformity in the architecture of the allocated compute resources.
  • You must set the number of instances as an integral multiple of Supernode Affinity Group Instances. Otherwise, the training job cannot be created.
  • For more information about supernode affinity group instances, see Configuring Supernode Affinity Group Instances.

Training Mode

ModelArts offers various training modes when using a MindSpore preset image with Ascend specifications.

  • Common mode: It is the default training scenario.
  • High-performance mode: Certain O&M functions will be adjusted or even disabled to maximally accelerate the running speed, but this will deteriorate fault locating. This mode is suitable for stable networks requiring high performance.
  • Fault diagnosis mode: Certain O&M functions will be enabled or adjusted to collect more information for locating faults.

Step 6: Configuring Fault Tolerance and Recovery

You can set auto restart for a training job when creating it.

Table 5 Automatic restart settings

Parameter

Description

Auto Restart

Choose whether to enable automatic restart for a training job.

  • This function is disabled by default. If a training exception occurs, the job is directly stopped.
  • If this function is enabled, the system will handle any exceptions caused by environmental or suspension issues during a training job. The system automatically detects faults and processes them according to the corresponding policies, thereby increasing the training success rate. Training job recovery policies enable automatic restarts at the process, container, and job levels. These policies require no manual configuration as they are automatically applied and upgraded as needed.

    To avoid losing training progress, ensure your code can resume training from where it is interrupted, and then enable unconditional auto restart to optimize compute usage. For details, see Resumable Training.

    If auto restart is triggered during training, the system records the restart information. You can check the fault recovery details on the training job details page. For details, see Training Job Fault Tolerance Check.

Maximum Restarts

This parameter is available when Auto Restart is enabled.

The training job will stop if it is still abnormal after maximum automatic restarts.

  • Default value: 3
  • Value range: 1 to 128

The value cannot be changed once the training job is created. Set this parameter based on your needs.

Unconditional Auto Restart

This parameter is available when Auto Restart is enabled. If Unconditional auto restart is selected, the training job will be restarted unconditionally once the system detects a training exception. To prevent invalid restarts, the system limits unconditional restarts to three consecutive attempts.

Restart Upon Suspension

This parameter is available when Auto Restart is enabled. ModelArts continuously monitors job processes to detect suspension and optimize resource usage. When this feature is enabled, suspended jobs can be automatically restarted at the process level.

However, ModelArts does not verify code logic, and suspension detection is periodic, which may result in false reports. By enabling this feature, you acknowledge the possibility of false positives. To prevent unnecessary restarts, ModelArts limits consecutive restarts to three.

Step 7: Configuring Scheduling Parameters

When creating a training job, you can configure its scheduling policy. For example, you can increase its priority, enable preemption, or set it to stop automatically. These changes help boost scheduling efficiency.

Table 6 Scheduling settings

Parameter

Description

Increase job scheduling priority

  • Jobs join the queue based on their submission time by default. Enabling this feature increases job priorities in the queue.
  • When using a dedicated resource pool, you can increase the scheduling priority of the training job. This parameter is not supported when a public resource pool is used.
  • The priority can be set to 1, 2, or 3. A larger number indicates a higher priority. The default priority is 1, and the highest priority is 3.
  • To set the priority to 3, you will also need the permission. For details about how to set the permission, see Assigning the Permission to Set the Highest Job Priority to an IAM User.
  • If a training job is in the Pending state for a long time, you can change the job priority to reduce the queuing duration. For details, see Priority of a Training Job.

Preemption

When using a dedicated resource pool, you can set this parameter. This parameter is not supported when a public resource pool is used.

When enabled, jobs that allow preemption may be terminated and re-queued if resource pool capacity is insufficient. To avoid losing training progress, configure resumable training before enabling this function. For details, see Resumable Training.

Auto Stop

Choose whether to enable Auto Stop.

  • This function is disabled by default, and the training job keeps running until the training is completed.
  • If you enable this function, set the auto stop time. The value can be 1 hour, 2 hours, 4 hours, 6 hours, or Customize. The customized time must range from 1 hour to 720 hours. When you enable this function, the training stops automatically when the time limit is reached. The time limit does not count down when the training is paused.

Step 8: Configuring Advanced Parameters

Table 7 Advanced settings

Parameter

Description

Persistent Log Saving

When Ascend specifications are selected, this function is enabled by default and cannot be modified.

When CPU or GPU specifications are selected, you can enable or disable this function.

  • If this function is enabled (default), configure Log Path. The platform permanently stores training logs to the specified OBS path.
  • If this function is disabled, ModelArts automatically stores the logs for 30 days. You can download all logs on the job details page to a local path.

Log Path

When Persistent Log Saving is enabled, you must configure a log path to store log files generated by the training job.

Ensure that you have read and write permissions to the selected OBS directory.

Event Notification

Choose whether to enable event notification for the training job.

  • This function is deselected by default, which means SMN is disabled.
  • If this function is enabled, you will be notified of specific events, such as job status changes or suspected suspensions, via an SMS or email. Notifications will be billed based on SMN pricing. In this case, you must configure the topic name and events.
    • Topic: topic of event notifications. Click Create Topic to go to the SMN console to create a topic and add a subscription to the topic. You will receive event notifications only after the subscription status changes to Confirmed. For details, see Adding a Subscription.
    • Event: events you want to subscribe to. Examples: JobStarted, JobCompleted, JobFailed, JobTerminated, JobHanged, JobRestarted, and JobPreempted.
NOTE:
  • SMN charges you for the number of notification messages. For details, see Billing.
  • Only training jobs using GPUs or NPUs support JobHanged events.

Password-free SSH Between Nodes

Choose whether to enable password-free SSH between nodes.

  • This function is disabled by default, which means the password-free SSH file is not generated.
  • If this function is enabled, the password-free SSH file is generated. To enable distributed training with a custom MPI or Horovod-based image, set up password-free SSH trust between nodes for seamless communication.

    Configure the password-free SSH file directory. This directory stores the auto-generated SSH key files in the training container. By default, it is set to /home/ma-user/.ssh. For details, see Configuring Password-free SSH Mutual Trust Between Instances for a Training Job Created Using a Custom Image.

Tags

TMS's predefined tags are recommended for adding the same tag to different cloud resources. For details about how to use tags, see Using TMS Tags to Manage Resources by Group.

You can add up to 20 tags to a training job.

Step 9: Submitting a Training Job and Viewing Its Status

After setting the parameters, click Submit.

A training job runs for a period of time. You can go to the training job list to view the basic information about the training job.

  • In the training job list, Status of a newly created training job is Pending.
  • When the status of a training job changes to Completed, the training job is finished, and the generated model is stored in the corresponding output path.
  • If the status is Failed or Abnormal, click the job name to go to the job details page and view logs for troubleshooting.