Updated on 2026-05-30 GMT+08:00

Resumable Training

Overview

Resumable training means that an interrupted training job can automatically resume from the most recent checkpoint instead of starting over. It is intended for model training that takes a long time.

Resumable training relies on the checkpoint mechanism.

During model training, the training state (including but not limited to the epoch, model weights, optimizer state, and scheduler state) is saved periodically as checkpoints.

To resume a training job, load a checkpoint and use the checkpoint information to initialize the training state. To do so, add checkpoint reloading logic (reload ckpt) to the training code.

Implementing Resumable Training via Training Output in ModelArts

New version:

To implement resumable training or incremental training in ModelArts, you are advised to use storage mounts.

When creating a training job, you can save and load checkpoint files by mounting a storage path. The procedure is as follows:

  1. In the training job settings, mount the storage directory (where checkpoints are stored) to a local directory within the training container.
  2. During the training process, save checkpoint files to the mounted local directory. The data will automatically synchronize to the mounted path.
  3. To resume from a breakpoint, ensure the mounted storage directory contains the previous checkpoint files. Your training script will then automatically load the latest checkpoint to continue the training.
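
Steps 2 and 3 can be sketched in a framework-agnostic way as follows. This is a minimal illustration, not ModelArts code: the ckpt_<step>.json naming scheme, the JSON payload, and the function names are assumptions made for the example, and a real job would save framework state (for example with torch.save) instead. The file is written under a temporary name and then renamed so that a job killed mid-write does not leave a truncated file that looks like a valid checkpoint; whether the rename is atomic depends on the mounted storage backend.

```python
import json
import os
import re

# Hypothetical naming scheme for this sketch: ckpt_<step>.json
CKPT_PATTERN = re.compile(r"^ckpt_(\d+)\.json$")


def save_checkpoint(ckpt_dir, step, state):
    """Write the state for `step` into ckpt_dir and return the file path."""
    os.makedirs(ckpt_dir, exist_ok=True)
    final_path = os.path.join(ckpt_dir, f"ckpt_{step}.json")
    tmp_path = final_path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump({"step": step, "state": state}, f)
    # Rename only after the file is fully written.
    os.replace(tmp_path, final_path)
    return final_path


def load_latest_checkpoint(ckpt_dir):
    """Return the checkpoint with the highest step, or None if there is none."""
    if not os.path.isdir(ckpt_dir):
        return None
    found = []
    for name in os.listdir(ckpt_dir):
        match = CKPT_PATTERN.match(name)
        if match:
            # Compare steps as integers so that step 10 ranks above step 9.
            found.append((int(match.group(1)), name))
    if not found:
        return None
    _, latest_name = max(found)
    with open(os.path.join(ckpt_dir, latest_name), encoding="utf-8") as f:
        return json.load(f)
```

At startup, a training script would call load_latest_checkpoint on the mounted local directory and start from scratch when it returns None.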

Using storage mounts ensures persistent data storage and enables model reuse across different training jobs.

When you create a training job in ModelArts, you can choose any of the storage mount options below. Table 1 compares them to help you select one based on your needs.

Table 1 Comparison of storage mount options

| Storage Type | Performance | Capacity | Scenario | Price | Remarks |
|---|---|---|---|---|---|
| SFS Turbo | High | Large | SFS Turbo is suitable for AI training, AI-generated content, autonomous driving, rendering, EDA simulation, and enterprise NAS applications. | Relatively high | General |
| OBS | Medium | Large | OBS is suitable for decoupling storage from compute in big data scenarios. | Moderate | High-frequency read and low-frequency write |

Old version:

To resume model training or incrementally train a model in ModelArts, configure training output.

When creating a training job, set the name of the training Output parameter to train_output. You can then retrieve this parameter in your code through environment variables or hyperparameters, and save checkpoints to the data storage location it specifies. Set Predownload to Yes so that, before the training job starts, the system automatically downloads the checkpoint files in the training output data path to a local directory of the training container.

Figure 1 Configuring training output
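
Retrieving train_output inside the training script might look like the sketch below. It assumes that the hyperparameter arrives as a --train_output command-line argument and that the environment variable has the same name as the parameter; the function name resolve_train_output is hypothetical. Check the actual names against your job configuration.

```python
import argparse
import os


def resolve_train_output(argv=None, environ=None):
    """Return the checkpoint directory from --train_output, else from the environment."""
    environ = os.environ if environ is None else environ
    parser = argparse.ArgumentParser()
    parser.add_argument("--train_output", type=str, default=None)
    # parse_known_args ignores any other hyperparameters passed to the script.
    args, _unknown = parser.parse_known_args(argv)
    # Assumption: the environment variable is named after the parameter.
    path = args.train_output or environ.get("train_output")
    if not path:
        raise ValueError(
            "train_output was not provided as a hyperparameter or environment variable"
        )
    return path
```

Calling resolve_train_output() with no arguments reads sys.argv and os.environ. Passing argv and environ explicitly is only a convenience for testing.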

To complete the setup for resumable training, enable the fault tolerance check (auto restart): on the training job creation page, enable Auto Restart. If the environment pre-check fails, the hardware malfunctions, or the training job fails, ModelArts automatically resubmits the training job.

Reloading Checkpoints in the VeRL Framework

VeRL is a flexible, efficient, and widely used reinforcement learning training library, serving as the de facto standard framework for post-training. VeRL is an open-source implementation of the paper HybridFlow: A Flexible and Efficient RLHF Framework.

  1. Configure trainer.save_freq and trainer.default_local_dir in the VeRL training YAML file.

    VeRL uses the trainer.default_local_dir parameter to specify the output directory. Weight subdirectories named global_step_xx are created in this directory. The trainer.save_freq parameter sets the save frequency: a checkpoint is stored every trainer.save_freq steps.

  2. Configure trainer.resume_mode in the VeRL training YAML file.

    When trainer.resume_mode is set to auto, VeRL automatically scans the trainer.default_local_dir path to load the most recent and valid checkpoint. Taking the train_output parameter from Implementing Resumable Training via Training Output in ModelArts as an example, the parameter settings are as follows:

    trainer.default_local_dir="${train_output}"
    trainer.resume_mode=auto
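
Before restarting a job, it can help to confirm which step directory is present in the output path. The helper below is a hypothetical inspection script, not part of VeRL, and it does not reproduce how VeRL selects or validates a checkpoint. The directory prefix is treated as an assumption, so the pattern accepts both global_step_N and global_steps_N.

```python
import os
import re

# Assumption: step directories end in an integer step count.
# Both the global_step_ and global_steps_ prefixes are accepted.
STEP_DIR = re.compile(r"^global_steps?_(\d+)$")


def latest_step_dir(local_dir):
    """Return (step, path) for the highest-numbered step directory, or None."""
    if not os.path.isdir(local_dir):
        return None
    candidates = []
    for name in os.listdir(local_dir):
        match = STEP_DIR.match(name)
        path = os.path.join(local_dir, name)
        # Only directories count; stray files with a matching name are skipped.
        if match and os.path.isdir(path):
            candidates.append((int(match.group(1)), path))
    return max(candidates) if candidates else None
```

A return value of None means that the directory holds no step subdirectories, in which case training would be expected to start from the beginning.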

Reloading Checkpoints in the MindSpeed-LLM Framework

MindSpeed LLM is a distributed training framework for large language models (LLMs) based on the Ascend ecosystem. It aims to provide an E2E LLM training solution for Huawei Ascend chip ecosystem partners, including distributed pre-training, distributed instruction fine-tuning, and the corresponding development toolchain, such as data preprocessing, weight transformation, online inference, and baseline evaluation. As the flagship training framework for Ascend computing, it is deeply optimized for performance, particularly for large-scale parameters, large clusters, and Mixture-of-Experts (MoE) models. It is also compatible with Megatron-LM, allowing Megatron users to migrate smoothly.

  1. Configure the --save and --save-interval parameters in MindSpeed-LLM.

    In the MindSpeed-LLM training startup script, the --save parameter specifies the output directory. This directory will contain multiple weight subdirectories named iter_xx and a latest_checkpointed_iteration.txt file that records the step count of the most recent saved weights. The latest_checkpointed_iteration.txt file is updated after every save. The --save-interval parameter defines the frequency of weight saving, ensuring checkpoints are stored every set number of steps.

  2. Configure the --load parameter to match the --save path in MindSpeed-LLM.

    The --load parameter in the training startup script specifies the input directory. When the --load path is set to be identical to the --save path, the training task will automatically load the latest weights upon each restart. Taking the train_output parameter from Implementing Resumable Training via Training Output in ModelArts as an example, the parameter configuration is as follows:

    --save-interval 1000 
    --save ${train_output} 
    --load ${train_output} 
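
To check which iteration a restart would pick up, the tracker file in the --save directory can be read directly. The sketch below is an illustrative helper, not part of MindSpeed-LLM. It assumes that latest_checkpointed_iteration.txt contains a single integer step count, and it reports any other content as unknown instead of guessing.

```python
import os

TRACKER_NAME = "latest_checkpointed_iteration.txt"


def read_latest_iteration(save_dir):
    """Return the step recorded in the tracker file, or None if unavailable."""
    tracker = os.path.join(save_dir, TRACKER_NAME)
    if not os.path.isfile(tracker):
        return None  # Nothing has been saved yet.
    with open(tracker, encoding="utf-8") as f:
        text = f.read().strip()
    # Assumption: the file holds one integer. Anything else is treated as unknown.
    return int(text) if text.isdigit() else None
```

If the function returns None for the directory passed to --load, there is no recorded checkpoint to resume from.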

Reloading Checkpoints in the LLaMA-Factory Framework

LLaMA-Factory is a popular open-source framework for training foundation models. You can easily fine-tune hundreds of models, such as language and multimodal ones, using either the CLI or WebUI. Built on Transformers and DeepSpeed, it works well with various open-source models.

  1. Configure output_dir and save_steps in the LLaMA-Factory training YAML file.

    LLaMA-Factory uses the output_dir parameter to specify the output directory. Within this directory, multiple weight subdirectories named checkpoint-xxx will be created. The save_steps parameter configures the frequency of weight saving.

  2. Configure resume_from_checkpoint to match the output_dir path in the LLaMA-Factory training YAML file.

    The resume_from_checkpoint parameter explicitly specifies the checkpoint to be used for the current training session. If a valid checkpoint is provided, training resumes from it. However, if resume_from_checkpoint is set to the same path as output_dir, and output_dir itself is not a valid checkpoint directory (but rather a parent directory containing multiple checkpoints), additional steps (3 and 4) are required. Taking the train_output parameter from Implementing Resumable Training via Training Output in ModelArts as an example, the parameter settings are as follows:

    ### output
    output_dir: ${train_output}
    save_steps: 500 
    
    ### train
    resume_from_checkpoint: ${train_output}
  3. Create a resume.py script. This script requires the absolute path of the training configuration YAML file as an input. The specific code is shown below:
    import os
    import re
    import sys
    
    
    def update_resume_config(config_file): # Receives the configuration file path.
        # Read the configuration content
        with open(config_file, 'r', encoding='utf-8') as f:
            lines = f.readlines()
    
        resume_line_num = None
        resume_path = None
    
        # Locate the resume_from_checkpoint line
        for i, line in enumerate(lines):
            if line.strip().startswith('resume_from_checkpoint:'):
                resume_line_num = i
                # Extract the value
                parts = line.split(':', 1)
                if len(parts) > 1:
                    resume_path = parts[1].strip().strip('"\'')  # Remove quotes
                break
    
        # If not found or value is null, do nothing
        if resume_line_num is None or resume_path in (None, 'null', ''):
            return
    
        # Check the directory and find the latest checkpoint
        new_resume_path = None
        if os.path.isdir(resume_path):
            # Find all checkpoint-number folders
            checkpoint_pattern = re.compile(r'^checkpoint-(\d+)$')
            checkpoints = []
    
            for item in os.listdir(resume_path):
                item_path = os.path.join(resume_path, item)
                if os.path.isdir(item_path):
                    match = checkpoint_pattern.match(item)
                    if match:
                        step = int(match.group(1))
                        checkpoints.append((step, item_path))
    
            # If checkpoints are found, use the latest one
            if checkpoints:
                checkpoints.sort(key=lambda x: x[0])
                new_resume_path = checkpoints[-1][1]
    
        # Modify the configuration line
        original_line = lines[resume_line_num]
        indent = len(original_line) - len(original_line.lstrip())  # Preserve original indentation
        if new_resume_path:
            lines[resume_line_num] = f'{" " * indent}resume_from_checkpoint: {new_resume_path}\n'
        else:
            lines[resume_line_num] = f'{" " * indent}resume_from_checkpoint: null\n'
    
        # Write back to the file
        with open(config_file, 'w', encoding='utf-8') as f:
            f.writelines(lines)
    
    
    if __name__ == "__main__":
        # Get the configuration file path from command line arguments
        if len(sys.argv) < 2:
            print("Usage: python resume.py <config_file_path>")
            sys.exit(1)
        config_file = sys.argv[1]  # Path to the training YAML file passed on the command line
        update_resume_config(config_file)  # Rewrite resume_from_checkpoint in place
  4. Modify the training startup script. Before executing the llamafactory-cli train command, run the resume.py script to update the YAML. The script will scan the path provided in resume_from_checkpoint, find the checkpoint-xxx directory with the largest step number, and update the parameter to that absolute path. Example modification for train_lora/deepseek3_lora_sft_kt.yaml (where WORK_DIR is the working directory):
    #!/bin/bash
    ...
    ...
    
    python $WORK_DIR/resume.py $WORK_DIR/LLaMA-Factory/examples/train_lora/deepseek3_lora_sft_kt.yaml
    llamafactory-cli train $WORK_DIR/LLaMA-Factory/examples/train_lora/deepseek3_lora_sft_kt.yaml 

Reloading Checkpoints in PyTorch

  • Use either of the following methods to save a PyTorch model.
    • Save model parameters only.
      state_dict = model.state_dict()
      torch.save(state_dict, path)
    • Save the entire model (not recommended).
      torch.save(model, path)
  • Save the data generated during model training at regular intervals based on steps and time.

    The data includes the network weights, optimizer state, and epoch, which are used to resume interrupted training.

    checkpoint = {
        "net": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "epoch": epoch,
    }
    # Create the directory if it does not exist yet.
    os.makedirs('model_save_dir', exist_ok=True)
    torch.save(checkpoint, 'model_save_dir/ckpt_{}.pth'.format(epoch))
  • Check the complete code example below.
    import argparse
    import os
    import re
    from datetime import datetime

    import torch

    parser = argparse.ArgumentParser()
    parser.add_argument("--train_output", type=str)
    parser.add_argument("--epochs", type=int, default=10)  # Example default; set as needed.
    args, unparsed = parser.parse_known_args()
    # train_output is set to /home/ma-user/modelarts/outputs/train_output_0.
    train_output = args.train_output
    os.makedirs(train_output, exist_ok=True)

    # model, optimizer, criterion, and train_loader are assumed to be defined before this point.

    # Check whether there is a model file in the output path. If there is no file, the model is trained from the beginning by default. If there is a model file, the CKPT file with the maximum epoch value is loaded as the pre-trained model.
    start_epoch = -1  # No checkpoint found: the loop below starts at epoch 0.
    ckpt_pattern = re.compile(r'_(\d+)\.pth$')
    ckpt_files = [file for file in os.listdir(train_output) if ckpt_pattern.search(file)]
    if ckpt_files:
        print('> load last ckpt and continue training!!')
        # Compare epochs as integers so that epoch 10 ranks above epoch 9.
        last_ckpt = max(ckpt_files, key=lambda file: int(ckpt_pattern.search(file).group(1)))
        local_ckpt_file = os.path.join(train_output, last_ckpt)
        print('last_ckpt:', last_ckpt)
        # Load the checkpoint.
        checkpoint = torch.load(local_ckpt_file)
        # Load the parameters that can be learned by the model.
        model.load_state_dict(checkpoint['net'])
        # Load optimizer parameters.
        optimizer.load_state_dict(checkpoint['optimizer'])
        # Obtain the saved epoch. Training continues from the epoch after it.
        start_epoch = checkpoint['epoch']
    start = datetime.now()
    total_step = len(train_loader)
    for epoch in range(start_epoch + 1, args.epochs):
        for i, (images, labels) in enumerate(train_loader):
            images = images.cuda(non_blocking=True)
            labels = labels.cuda(non_blocking=True)
            # Forward pass
            outputs = model(images)
            loss = criterion(outputs, labels)
            # Backward and optimize
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            ...

        # Save the network weights, optimizer state, and epoch at the end of every epoch.
        checkpoint = {
            "net": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "epoch": epoch,
        }
        torch.save(checkpoint, os.path.join(train_output, 'ckpt_best_{}.pth'.format(epoch)))