Help Center/ ModelArts/ Model Training/ Creating a Training Job/ Creating a Model Fine-Tuning Job (New Console)
Updated on 2026-07-02 GMT+08:00

Creating a Model Fine-Tuning Job (New Console)

What Is Model Fine-Tuning?

In the context of large model training, fine-tuning refers to the process of performing secondary training on a pre-trained model using domain-specific datasets. This process updates the model weights, allowing it to adapt more effectively to specific task requirements. This phase enables the model to accurately perform tasks in specific scenarios such as copywriting, code generation, and professional Q&A.

Model Fine-Tuning Scenarios

In the model development lifecycle, the primary use cases for fine-tuning are detailed in Table 1.

Table 1 Fine-tuning scenarios

Fine-Tuning Scenario

Description

Objective

Domain adaptation

Used when a base model lacks vertical industry knowledge (e.g., medical diagnostic standards, specialized programming languages, or proprietary corporate terminology).

Injecting domain-specific knowledge graphs or terminologies to mitigate model hallucination and enhance professional accuracy.

Instruction/Format compliance

Used when a model must strictly output formats such as JSON, SQL, or XML, or follow specific chain of thought (CoT) logic, where prompt engineering proves inconsistent.

Solidifying output structures via specialized fine-tuning datasets to reduce parsing error rates.

Style alignment

Used when a model needs to respond in a specific persona or style, such as role-playing, anthropomorphic customer service, or formal document writing.

Adjusting logit distributions to align with the stylistic characteristics of specific corpora.

Constraints

  • Supported region: This feature is only available in the CN Southwest-Guiyang1 region.
  • Storage: ModelArts does not support OBS buckets with bucket encryption enabled. Ensure this option is disabled when creating your OBS bucket.

Prerequisites

Billing

Model fine-tuning in ModelArts uses compute and storage resources, which are billed. Compute resources are billed for running fine-tuning jobs. Storage resources are billed for storing data in OBS or SFS. For details, see Model Training Billing Items.

Procedure

To create a fine-tuning job, follow these steps:

Step 1: Accessing the Creation Page: Log in to the console and navigate to the training job list.

Step 2: Configuring Fine-Tuning Parameters: Configure fine-tuning parameters.

Step 3: Submitting and Monitoring the Job.

Step 1: Accessing the Creation Page

  1. Log in to the ModelArts console.
  2. In the navigation pane, choose Model Build > Training.
  3. Click Create Training Job. The new UI is displayed by default. The following describes how to create a training job on the new UI.

Step 2: Configuring Fine-Tuning Parameters

Table 2 Parameters

Parameter

Description

Training Mode

Fine-Tuning

Ideal for scenarios where you need to fine-tune existing pre-trained models, such as Qwen series. Low-threshold training: Use pre-configured high-quality model assets. There is no need to manage image building, environment dependencies, or code debugging, simply upload your training data and adjust key parameters.

For this example, select Fine-Tuning.

Custom Job

Designed for scenarios requiring full control over the training workflow, including the use of proprietary code or specialized images.

Basic Information

Task name

Specifies a custom name for the fine-tuning job. The name can contain 1 to 64 characters, including only letters, digits, hyphens (-), and underscores (_).

Description (Optional)

Provides a brief overview of the job. Supports up to 256 characters.

Training Configuration

Select Model

Identifies the base model. Click the card and choose from Preset Model or My Model. You can filter models by source, type, or brand, or search for models by keyword in the search box.

NOTE:

Only compatible models are selectable; others will be hidden.

Type

Currently, model fine-tuning is supported.

Training Objective

ModelArts supports two fine-tuning types: full fine-tuning and LoRA fine-tuning. Different models support different fine-tuning types.

  • Full fine-tuning: Updates all model parameters. Offers high accuracy but features slower convergence and longer training times.
  • LoRA fine-tuning: Freezes the original model and injects trainable layers. Provides near-full parameter performance with fast convergence and shorter training times.

Model Output Path

Fine-tuned models can be stored in OBS and SFS Turbo. Currently, fine-tuned models can only be stored in OBS. Future updates will include support for SFS Turbo. You can choose your own OBS bucket or enter a path. The path must start with obs:// and end with a slash (/), like this: obs://bucketname/path/. For shared buckets from other users, you must enter the path.

NOTE:

Note: To store fine-tuned models in OBS, ensure that you have subscribed to OBS in advance and that your OBS bucket has sufficient space.

Resource Disposition

Resource Pool Type

Resource pools are classified into public resource pools and dedicated resource pool. Currently, only public resource pools are supported.

  • Public Resource Pool: Shared across all tenants.
  • Dedicated Resource Pool: A private pool that must be pre-created.

Specification

Specifies the hardware (server type/model). Only resources compatible with the selected model are displayed.

Number of instances

Select the number of instances as required. The default value is 1.

  • Count = 1: Standalone training; the container exclusively uses resources on one node.
  • If more than one instance is used, a distributed training job is created. For more information about distributed training configurations, see Overview.

Once you set up hot standby nodes for a resource pool, these nodes are reserved for high availability and can only be used for recovering faulty nodes. They cannot be used for training jobs. This reduces the number of training job instances you can create. For details about how to disable hot standby nodes, see Rectifying a Faulty Node in a Dedicated Resource Pool.

Before creating a distributed training job, pre-install all required pip dependencies (see Installing pip Dependencies in an Image). If there are more than 10 nodes, the system automatically deletes the pip source configuration. Executing pip install commands during training may cause training failures.

Data Configuration

Training Dataset

In the pop-up dialog, choose Preset Data or My Data. Preset Data refers to commonly used datasets built in the platform. My Data refers to your own raw or processed datasets. Select a dataset as required.

Training Parameters

Learning Rate

Sets the rate at which parameters/weights are updated per iteration. An excessively high value may prevent convergence; an excessively low value will slow it down.

MIN_LR

Controls the learning rate decay floor.

Formula: Minimum learning rate = Initial learning rate x Learning rate decay ratio.

Iterations

Specifies total parameter/weight updates.

Recommended epochs based on dataset size: Hundreds of records: 4 to 8 epochs; thousands of records: 2 to 4 epochs; larger datasets: 1 to 2 epochs

Total iterations = (Dataset size/Records per iteration) × Number of epochs. Example: For a dataset with 3,200 records, if each iteration uses 32 records and you set 2 epochs, the total number of iterations would be: (3,200/32) × 2 = 100 × 2 = 200

EPOCH

The number of epochs is the number of complete iterations of the training set. The entire dataset is traversed once in each epoch.

GBS

Determines the number of samples processed per iteration.

Generally, a larger batch size can make gradients more stable and help the model converge. However, a larger batch size also occupies more GPU memory, which may cause GPU memory insufficiency and prolong the training time.

SEQ_LEN

Defines the maximum input length. Data exceeding this limit is truncated.

LR_WARMUP_RATIO

Defines the initial phase of training where the learning rate gradually increases from a small value to the target maximum.

Since model weights are often randomly initialized at the start of training, their predictive capability is weak. Using a high learning rate immediately can cause updates that are too aggressive, leading to divergence. To resolve this problem, a small learning rate is usually used at the beginning of the training and gradually increased until the preset maximum learning rate is reached. In this way, an appropriate warmup ratio can prevent the initial update from being too fast, helping the model converge better.

Data Records

Indicates the total number of records in the input dataset.

DATA_TYPE

Specifies the dataset format (e.g., open-source Alpaca or ShareGPT).

Options: AlpacaStyleInstructionHandler, SharegptStyleInstructionHandler, GeneralInstructionHandler.

publish model

Auto-publish to Assets

When enabled, the trained model will be automatically published to the Asset Management > Models > My Models page on the console.

Publishing Method

The fine-tuned model can be published as a new model or as a new version of an existing model. Select the publication method based on your needs:

New model: The published fine-tuned model is a new model and is displayed on the Asset Management > Models > My Models page.

New version: The published fine-tuned model will be associated with an existing model in Asset Management > Models > My Models. Only the version number will change; you can view the updated version number within the model's details page.

Model Name

Sets the name for the newly generated model.

Enter 2 to 128 characters. Only letters, digits, hyphens (-), and underscores (_) are allowed. The name must start with a letter and end with a letter or digit.

Model Asset Description (Optional)

Description of the trained model. This parameter is mandatory when Publishing Method is set to New model.

Model Version

If the model is published as a new model, the version number is V1.

If the model is published as a new version of an existing model, the version number is automatically incremented by 1 based on the previous version number of the model.

Note: The model version number cannot be modified and is automatically generated by the system.

Version Description (Optional)

Description of the trained model. This field is optional and can contain a maximum of 256 characters.

HA Settings

Fault Tolerance and Recovery

Specifies whether to enable automatic restart for the training job.

  • Deselected (default): Automatic restart is disabled. If an error occurs, the training job will stop immediately.
  • Selected: If a training job fails due to environment issues, process suspensions, or other abnormalities, the system automatically detects the fault and applies recovery strategies to improve the success rate. The system supports process-level, container-level, and job-level automatic restart and recovery. These strategies are matched and upgraded automatically without requiring additional configuration.

    To avoid losing training progress and make full use of compute, ensure that your code logic supports resumable training before enabling this function. For details, see Resumable Training.

    If auto restart is triggered during training, the system records the restart information. You can check the fault recovery details on the training job details page. For details, see Training Job Fault Tolerance Check.

Maximum Restarts

This parameter is available when Fault Tolerance and Recovery is selected.

The training job will stop if it is still abnormal after maximum automatic restarts.

  • Default value: 3
  • Range: 1–128

The value cannot be changed once the training job is created. Set this parameter based on your needs.

Unconditional Auto Restart

This parameter is available when Fault Tolerance and Recovery is selected. If Unconditional auto restart is selected, the training job will be restarted unconditionally once the system detects a training exception. To prevent invalid restarts, it supports a maximum of three consecutive unconditional restarts.

Restart Upon Suspension

This parameter is available when Fault Tolerance and Recovery is selected. ModelArts continuously monitors job processes to detect suspension and optimize resource usage. When this feature is enabled, suspended jobs can be automatically restarted at the process level.

CPU specifications do not support job restarts upon suspension.

However, ModelArts does not verify code logic, and suspension detection is periodic, which may result in false reports. By enabling this feature, you acknowledge the possibility of false positives. To prevent unnecessary restarts, ModelArts limits consecutive restarts to three.

More Configurations

Checkpoints Saving Policy

checkpoints: During a model training job, checkpoints are used to store the model weight and status.

  • Close: After this function is disabled, checkpoints are not saved and training cannot be resumed based on checkpoints.
  • Custom: A specified number of checkpoints are saved based on the settings.

Event Notification

Indicates whether to enable event notification.

  • This feature is disabled by default, which means SMN is disabled.
  • After this function is enabled, you will be notified of specific events, such as job status changes or suspected suspensions, via an SMS or email. Notifications will be billed based on SMN pricing. In this case, you must configure the topic name and events.
    • Topic: topic of event notifications. Click Create Topic to create a topic on the SMN console.
    • Event: events you want to subscribe to, for example, Creating, Completed, or Failed.
NOTE:
  • After you create a topic on the SMN console, add a subscription to the topic, and confirm the subscription. Then, you will be notified of events. For details about how to subscribe to a topic, see Adding a Subscription.

Tags

TMS's predefined tags are recommended for adding the same tag to different cloud resources. For details about how to use tags, see Using TMS Tags to Manage Resources by Group.

You can add up to 20 tags.

Step 3: Submitting and Monitoring the Job

After setting the parameters, click Submit.

A fine-tuning job runs for a period of time. You can go to the fine-tuning job list to view the basic information about the fine-tuning job.

  • In the fine-tuning job list, Status of a newly created fine-tuning job is Pending.
  • Once the fine-tuning job shows Completed, it has finished. The system saves the created model in model assets for later access.
  • If the status is Failed or Abnormal, click the job name to go to the job details page and view logs for troubleshooting.

FAQs

  1. How can I enable a thinking mode in my trained model?

    To train a model that incorporates a CoT or thinking process, you must explicitly structure the response data. You can achieve this by wrapping the reasoning process within <think> and </think> tags in the output field of the Alpaca format, or by prepending a prompt such as Let's think step by step to the start of the output field. Currently, the preset data uses the standard Alpaca format without these tags. If you use this data, the trained model will likely only provide direct answers without a visible reasoning process.

    Example 1:

    {
        "instruction": "Don't ask me where I come from; my hometown is in Qinglong Mountain.",
        "input": "",
        "output": "<think>The user shared a modified lyric: 'Don't ask me where I come from; my hometown is in Qinglong Mountain.' First, I need to identify the origin of this song, then xxx, and finally xxx.\n</think>\n\n'Don't ask me where I come from; my hometown is in Qinglong Mountain' appears to be an adaptation of the classic song *The Olive Tree* (the original lyrics being 'Don't ask me where I come from; my hometown is far far away')."
    }

    Example 2:

    {
        "instruction": "Karin 's science class weighed plastic rings for an experiment . They found that the orange ring weighed 0.08333333333333333 ounce , the purple ring weighed 0.3333333333333333 ounce , and the white ring weighed 0.4166666666666667 ounce . What was the total weight of the plastic rings ?",
        "input": "",
        "output": "Let's think step by step. \n\nThe orange ring weighed 0.08333333333333333 ounce. \nThe purple ring weighed 0.3333333333333333 ounce. \nThe white ring weighed 0.4166666666666667 ounce. \n\nTo find the total weight of the plastic rings, we need to add the weights of the orange ring, the purple ring, and the white ring. \n\n0.08333333333333333 + 0.3333333333333333 + 0.4166666666666667 = 0.8333333333333334 \nTherefore, the answer (arabic numerals) is 0.8333333333333334."
    }