Resumable Training

Overview

Resumable training means that an interrupted training job can automatically resume from the most recent checkpoint saved before the interruption. It is intended for long-running model training jobs.

The checkpoint mechanism enables resumable training.

During model training, the training state (including but not limited to the epoch, model weights, optimizer state, and scheduler state) is saved periodically. After an interruption, the job restores this state from the latest checkpoint instead of starting over.

To resume a training job, load a checkpoint and use the checkpoint information to initialize the training state. To do so, add checkpoint reloading logic (reload ckpt) to the training code.
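
The save-and-reload cycle described above can be sketched without any framework. The following toy loop is only an illustration: the function name `train`, the JSON file format, and the `stop_after` argument that simulates an interruption are all assumptions made for this example, and real jobs would store model and optimizer state instead of a single number.

```python
import json
import os
import tempfile


def train(ckpt_path, total_epochs, stop_after=None):
    """Toy loop: resumes from ckpt_path if it exists, saves after every epoch."""
    state = {"epoch": -1, "weights": 0.0}  # fresh start: no epoch completed yet
    if os.path.exists(ckpt_path):
        with open(ckpt_path, "r", encoding="utf-8") as f:
            state = json.load(f)  # reload ckpt: restore the saved training state

    for epoch in range(state["epoch"] + 1, total_epochs):
        state["weights"] += 1.0  # stand-in for one epoch of real training
        state["epoch"] = epoch
        with open(ckpt_path, "w", encoding="utf-8") as f:
            json.dump(state, f)  # save a checkpoint at the end of the epoch
        if stop_after is not None and epoch == stop_after:
            break  # simulate an interruption
    return state


ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.json")
first_run = train(ckpt, total_epochs=5, stop_after=1)  # interrupted after epoch 1
second_run = train(ckpt, total_epochs=5)  # continues with epoch 2
```

The second call does not repeat epochs 0 and 1 because the loop starts at the saved epoch plus one.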

Implementing Resumable Training via Training Output in ModelArts

New version:

To implement resumable training or incremental training in ModelArts, you are advised to use storage mounts.

When creating a training job, you can save and load checkpoint files by mounting a storage path. The procedure is as follows:

In the training job settings, mount the storage directory (where checkpoints are stored) to a local directory within the training container.
During the training process, save checkpoint files to the mounted local directory. The data will automatically synchronize to the mounted path.
To resume interrupted training, ensure that the mounted storage directory contains the previous checkpoint files. Your training script then loads the latest checkpoint to continue the training.

Using storage mounts ensures persistent data storage and enables model reuse across different training jobs.
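
The lookup of the latest checkpoint in the mounted directory could be sketched as follows. This is a minimal example under stated assumptions: the helper name `find_latest_checkpoint` and the `ckpt_<step>.pth` file naming are hypothetical, and a temporary directory stands in for the mounted local path.

```python
import os
import re
import tempfile

CKPT_PATTERN = re.compile(r"^ckpt_(\d+)\.pth$")  # hypothetical naming scheme


def find_latest_checkpoint(ckpt_dir):
    """Return the path of the checkpoint with the highest step, or None."""
    if not os.path.isdir(ckpt_dir):
        return None
    candidates = []
    for name in os.listdir(ckpt_dir):
        match = CKPT_PATTERN.match(name)
        if match:
            candidates.append((int(match.group(1)), name))
    if not candidates:
        return None
    # Compare steps as integers so that step 100 ranks above step 9.
    return os.path.join(ckpt_dir, max(candidates)[1])


mount_dir = tempfile.mkdtemp()  # stands in for the mounted local directory
for step in (9, 10, 100):
    open(os.path.join(mount_dir, f"ckpt_{step}.pth"), "w").close()
latest = find_latest_checkpoint(mount_dir)
```

Returning `None` for a missing or empty directory lets the caller fall back to training from scratch on the first run.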

When you create a training job in ModelArts, you can choose either of the storage mount options compared in Table 1.

**Table 1** Comparison of storage mount options
| Storage Type | Performance | Capacity | Scenario | Price | Remarks |
| --- | --- | --- | --- | --- | --- |
| SFS Turbo | High | Large | AI training, AI-generated content, autonomous driving, rendering, EDA simulation, and enterprise NAS applications | Relatively high | General |
| OBS | Medium | Large | Big data scenarios that decouple storage from compute | Moderate | High-frequency read and low-frequency write |

Old version:

To resume model training or incrementally train a model in ModelArts, configure training output.

When creating a training job, set the name of the training Output parameter to train_output. You can then retrieve this parameter through environment variables or hyperparameters, and checkpoints can be saved to the specified data storage location. Set Predownload to Yes so that, before the training job starts, the system automatically downloads the checkpoint files in the training output data path to a local directory of the training container.

Figure 1 Configuring training output
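
Reading the train_output parameter inside the training script might look like the sketch below. The helper name `resolve_train_output` and the environment variable name `TRAIN_OUTPUT` are hypothetical and used for illustration only; check the job's actual environment for the variable that ModelArts sets.

```python
import argparse
import os


def resolve_train_output(argv=None, env=None):
    """Return the output path from --train_output or from an environment variable."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser()
    parser.add_argument("--train_output", type=str, default=None)
    # parse_known_args ignores other hyperparameters passed to the script.
    args, _unknown = parser.parse_known_args(argv)
    # TRAIN_OUTPUT is a hypothetical variable name, not a documented one.
    path = args.train_output or env.get("TRAIN_OUTPUT")
    if not path:
        raise ValueError("train_output was not provided")
    return path


from_cli = resolve_train_output(["--train_output", "/tmp/out", "--lr", "0.1"], env={})
from_env = resolve_train_output([], env={"TRAIN_OUTPUT": "/tmp/env_out"})
```

The command-line value takes precedence here; the environment variable is only a fallback.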

Enable the fault tolerance check (auto restart) for resumable training. On the training job creation page, enable Auto Restart. If the environment pre-check fails, the hardware malfunctions, or the training job fails, ModelArts automatically resubmits the training job.

Reloading Checkpoints in the VeRL Framework

VeRL is a flexible and efficient reinforcement learning training library that is widely used for post-training. It is the open-source implementation of the paper HybridFlow: A Flexible and Efficient RLHF Framework.

Configure trainer.save_freq and trainer.default_local_dir in the VeRL training YAML file.
VeRL uses the trainer.default_local_dir parameter to specify the output directory. Within this directory, one weight subdirectory named global_step_xx is created for each saved checkpoint. The trainer.save_freq parameter sets the saving frequency: a checkpoint is stored every trainer.save_freq steps.
Configure trainer.resume_mode in the VeRL training YAML file.
When trainer.resume_mode is set to auto, VeRL automatically scans the trainer.default_local_dir path to load the most recent and valid checkpoint. Taking the train_output parameter from Implementing Resumable Training via Training Output in ModelArts as an example, the parameter settings are as follows:
```
trainer.default_local_dir="${train_output}" 
trainer.resume_mode=auto
```

Reloading Checkpoints in the MindSpeed-LLM Framework

MindSpeed-LLM is a distributed training framework for large language models (LLMs) based on the Ascend ecosystem. It aims to provide an end-to-end LLM training solution for Huawei Ascend chip ecosystem partners, including distributed pre-training, distributed instruction fine-tuning, and the corresponding development toolchain, such as data preprocessing, weight conversion, online inference, and baseline evaluation. It is optimized for performance on Ascend hardware, particularly for models with large parameter counts, large clusters, and Mixture-of-Experts (MoE) models. It is also compatible with Megatron-LM, allowing Megatron users to migrate smoothly.

Configure the --save and --save-interval parameters in MindSpeed-LLM.
In the MindSpeed-LLM training startup script, the --save parameter specifies the output directory. This directory will contain multiple weight subdirectories named iter_xx and a latest_checkpointed_iteration.txt file that records the step count of the most recent saved weights. The latest_checkpointed_iteration.txt file is updated after every save. The --save-interval parameter defines the frequency of weight saving, ensuring checkpoints are stored every set number of steps.
Configure the --load parameter to match the --save path in MindSpeed-LLM.
The --load parameter in the training startup script specifies the input directory. When the --load path is set to be identical to the --save path, the training task will automatically load the latest weights upon each restart. Taking the train_output parameter from Implementing Resumable Training via Training Output in ModelArts as an example, the parameter configuration is as follows:
```
--save-interval 1000 
--save ${train_output} 
--load ${train_output} 
```
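
To check which weights a restart would pick up, the tracker file can be read directly. The sketch below is an illustration only: the helper name `latest_iteration_dir` is hypothetical, and the zero-padded directory name (for example, iter_0001000) is an assumption based on the Megatron-LM convention, since the text above only gives the pattern iter_xx.

```python
import os
import tempfile


def latest_iteration_dir(save_dir):
    """Return the checkpoint directory named by the tracker file, or None."""
    tracker = os.path.join(save_dir, "latest_checkpointed_iteration.txt")
    if not os.path.isfile(tracker):
        return None  # nothing saved yet: training starts from scratch
    with open(tracker, "r", encoding="utf-8") as f:
        content = f.read().strip()
    if not content.isdigit():
        return None  # unexpected content: treat as no resumable checkpoint
    # Assumption: Megatron-style zero-padded names such as iter_0001000.
    ckpt_dir = os.path.join(save_dir, f"iter_{int(content):07d}")
    return ckpt_dir if os.path.isdir(ckpt_dir) else None


save_dir = tempfile.mkdtemp()  # stands in for ${train_output}
before_first_save = latest_iteration_dir(save_dir)
os.makedirs(os.path.join(save_dir, "iter_0001000"))
with open(os.path.join(save_dir, "latest_checkpointed_iteration.txt"), "w") as f:
    f.write("1000")
after_first_save = latest_iteration_dir(save_dir)
```

A `None` result before the first save matches the expected behavior of a new job whose --load path is still empty.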

Reloading Checkpoints in the LLaMA-Factory Framework

LLaMA-Factory is a popular open-source framework for training foundation models. You can easily fine-tune hundreds of models, such as language and multimodal ones, using either the CLI or WebUI. Built on Transformers and DeepSpeed, it works well with various open-source models.

Configure output_dir and save_steps in the LLaMA-Factory training YAML file.
LLaMA-Factory uses the output_dir parameter to specify the output directory. Within this directory, multiple weight subdirectories named checkpoint-xxx will be created. The save_steps parameter configures the frequency of weight saving.
Configure resume_from_checkpoint to match the output_dir path in the LLaMA-Factory training YAML file.
The resume_from_checkpoint parameter explicitly specifies the checkpoint to be used for the current training session. If a valid checkpoint is provided, training resumes from it. However, if resume_from_checkpoint is set to the same path as output_dir, and output_dir itself is not a valid checkpoint directory (but rather a parent directory containing multiple checkpoints), additional steps (3 and 4) are required. Taking the train_output parameter from Implementing Resumable Training via Training Output in ModelArts as an example, the parameter settings are as follows:
```
### output
output_dir: ${train_output}
save_steps: 500 

### train
resume_from_checkpoint: ${train_output}
```

Create a resume.py script. This script requires the absolute path of the training configuration YAML file as an input. The specific code is shown below:

```
import os
import re
import sys


def update_resume_config(config_file):  # Receives the configuration file path.
    # Read the configuration content
    with open(config_file, 'r', encoding='utf-8') as f:
        lines = f.readlines()

    resume_line_num = None
    resume_path = None

    # Locate the resume_from_checkpoint line
    for i, line in enumerate(lines):
        if line.strip().startswith('resume_from_checkpoint:'):
            resume_line_num = i
            # Extract the value
            parts = line.split(':', 1)
            if len(parts) > 1:
                resume_path = parts[1].strip().strip('"\'')  # Remove quotes
            break

    # If not found or value is null, do nothing
    if resume_line_num is None or resume_path in (None, 'null', ''):
        return

    # Check the directory and find the latest checkpoint
    new_resume_path = None
    if os.path.isdir(resume_path):
        # Find all checkpoint-number folders
        checkpoint_pattern = re.compile(r'^checkpoint-(\d+)$')
        checkpoints = []

        for item in os.listdir(resume_path):
            item_path = os.path.join(resume_path, item)
            if os.path.isdir(item_path):
                match = checkpoint_pattern.match(item)
                if match:
                    step = int(match.group(1))
                    checkpoints.append((step, item_path))

        # If checkpoints are found, use the one with the largest step number
        if checkpoints:
            checkpoints.sort(key=lambda x: x[0])
            new_resume_path = checkpoints[-1][1]

    # Modify the configuration line, preserving the original indentation.
    # Read the indentation from the matched line explicitly instead of
    # relying on the loop variable left over from the search above.
    original_line = lines[resume_line_num]
    indent = len(original_line) - len(original_line.lstrip())
    if new_resume_path:
        lines[resume_line_num] = f'{" " * indent}resume_from_checkpoint: {new_resume_path}\n'
    else:
        # No checkpoint found: set null so that training starts from scratch
        lines[resume_line_num] = f'{" " * indent}resume_from_checkpoint: null\n'

    # Write back to the file
    with open(config_file, 'w', encoding='utf-8') as f:
        f.writelines(lines)


if __name__ == "__main__":
    # Get the configuration file path from command line arguments
    if len(sys.argv) < 2:
        print("Usage: python resume.py <config_file_path>")
        sys.exit(1)
    config_file = sys.argv[1]  # Receive the abc.yaml passed from the command line
    update_resume_config(config_file)  # Execute by passing to the function
```

Modify the training startup script. Before executing the llamafactory-cli train command, run the resume.py script to update the YAML. The script will scan the path provided in resume_from_checkpoint, find the checkpoint-xxx directory with the largest step number, and update the parameter to that absolute path. Example modification for train_lora/deepseek3_lora_sft_kt.yaml (where WORK_DIR is the working directory):
```
#!/bin/bash
...
...

python $WORK_DIR/resume.py $WORK_DIR/LLaMA-Factory/examples/train_lora/deepseek3_lora_sft_kt.yaml
llamafactory-cli train $WORK_DIR/LLaMA-Factory/examples/train_lora/deepseek3_lora_sft_kt.yaml 
```

Reloading Checkpoints in PyTorch

Use either of the following methods to save a PyTorch model.
- Save model parameters only.
```
state_dict = model.state_dict()
torch.save(state_dict, path)
```
- Save the entire model (not recommended).
```
torch.save(model, path)
```

At regular intervals (by step count or elapsed time), save the state generated during model training.

The saved state includes the network weights, optimizer state, and epoch, which are used to resume interrupted training.

```
import os
import torch

# model, optimizer, and epoch come from the surrounding training loop.
checkpoint = {
    "net": model.state_dict(),
    "optimizer": optimizer.state_dict(),
    "epoch": epoch,
}
os.makedirs('model_save_dir', exist_ok=True)
torch.save(checkpoint, 'model_save_dir/ckpt_{}.pth'.format(epoch))
```
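
Saving at regular intervals based on steps and time can be decided by a small helper such as the one sketched below. The class name `CheckpointSchedule` and its parameters are hypothetical, and the injected fake clock exists only so that the example runs deterministically; a real job would keep the default `time.monotonic`.

```python
import time


class CheckpointSchedule:
    """Decides when to save, by step count or elapsed wall-clock time."""

    def __init__(self, every_steps, every_seconds, clock=time.monotonic):
        self.every_steps = every_steps
        self.every_seconds = every_seconds
        self.clock = clock
        self.last_save_time = clock()

    def should_save(self, step):
        due_by_step = step > 0 and step % self.every_steps == 0
        due_by_time = self.clock() - self.last_save_time >= self.every_seconds
        if due_by_step or due_by_time:
            self.last_save_time = self.clock()  # restart the time window
            return True
        return False


fake_now = [0.0]  # mutable cell read by the fake clock below
schedule = CheckpointSchedule(every_steps=100, every_seconds=600,
                              clock=lambda: fake_now[0])
decisions = []
for step, now in [(50, 10.0), (100, 20.0), (150, 700.0), (160, 710.0)]:
    fake_now[0] = now
    decisions.append(schedule.should_save(step))
```

In this run, step 100 triggers a save by step count and step 150 triggers one because more than 600 seconds have passed since the previous save.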

Check the complete code example below.

```
import argparse
import os
import re
from datetime import datetime

import torch

parser = argparse.ArgumentParser()
parser.add_argument("--train_output", type=str)
parser.add_argument("--epochs", type=int, default=10)  # Total number of epochs (example default).
args, unparsed = parser.parse_known_args()
# train_output is set to /home/ma-user/modelarts/outputs/train_output_0.
train_output = args.train_output
os.makedirs(train_output, exist_ok=True)

# model, optimizer, criterion, and train_loader are assumed to be defined elsewhere.

# Check whether the output path contains checkpoint files. If it does not, the model is trained
# from the beginning. If it does, the checkpoint with the largest epoch value is loaded.
start_epoch = -1  # No epoch completed yet, so training starts at epoch 0.
ckpt_pattern = re.compile(r'^ckpt_best_(\d+)\.pth$')
ckpt_files = [file for file in os.listdir(train_output) if ckpt_pattern.match(file)]
if ckpt_files:
    print('> load last ckpt and continue training!!')
    # Compare epochs as integers so that epoch 10 ranks after epoch 9.
    last_ckpt = max(ckpt_files, key=lambda file: int(ckpt_pattern.match(file).group(1)))
    local_ckpt_file = os.path.join(train_output, last_ckpt)
    print('last_ckpt:', last_ckpt)
    # Load the checkpoint.
    checkpoint = torch.load(local_ckpt_file)
    # Load the parameters that can be learned by the model.
    model.load_state_dict(checkpoint['net'])
    # Load optimizer parameters.
    optimizer.load_state_dict(checkpoint['optimizer'])
    # Obtain the saved epoch. The model will continue to be trained based on the epoch value.
    start_epoch = checkpoint['epoch']
start = datetime.now()
total_step = len(train_loader)
for epoch in range(start_epoch + 1, args.epochs):
    for i, (images, labels) in enumerate(train_loader):
        images = images.cuda(non_blocking=True)
        labels = labels.cuda(non_blocking=True)
        # Forward pass
        outputs = model(images)
        loss = criterion(outputs, labels)
        # Backward and optimize
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        ...

    # Save the network weights, optimizer state, and epoch at the end of every epoch.
    checkpoint = {
        "net": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "epoch": epoch,
    }
    torch.save(checkpoint, os.path.join(train_output, 'ckpt_best_{}.pth'.format(epoch)))
```
