
Resumable Training

Resumable training lets you restart an interrupted training job from the most recently saved weights instead of from scratch, avoiding redundant computation. To resume training after an interruption, follow the steps for your training backend.

MindSpeed-LLM

  1. Go to the training result output directory and check the saved weights.
    cd ${af_output_dir}

    ${af_output_dir} is the output folder set by the --save parameter in the training script. It contains several iter_xx weight folders and a latest_checkpointed_iteration.txt file that records the iteration number of the most recent checkpoint. To resume from a specific checkpoint, edit this file so that it contains the iteration number of the iter_xx folder you want to load (see the sketch after this list).

  2. Modify the following resumable training configurations in the YAML training configuration file.
    backend_config.training.no-load-optim: false
    backend_config.training.no-load-rng: false
    backend_config.training.finetune: false
  3. Restart the training job. For details, see Step 2: Starting a Training Job. Add the following hyperparameter to the ascendfactory-cli train command.
    --load ${af_output_dir}/saved_checkpoints
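
The following is a minimal sketch of this resume flow; the iteration numbers and the placeholder <original arguments> are illustrative, not values taken from this guide.
    # Inspect the saved checkpoints in the output directory.
    cd ${af_output_dir}
    ls    # several iter_xx weight folders plus latest_checkpointed_iteration.txt
    cat latest_checkpointed_iteration.txt    # iteration of the most recent checkpoint, e.g. 2000
    # To resume from an earlier checkpoint instead, write its iteration number into the file.
    echo 1000 > latest_checkpointed_iteration.txt
    # After updating the YAML settings from step 2, restart with --load appended to the original command.
    ascendfactory-cli train <original arguments> --load ${af_output_dir}/saved_checkpoints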

LLaMA-Factory

  1. Go to the training result output directory and obtain the specified checkpoint directory.
    cd ${af_output_dir}

    ${af_output_dir} is the output directory defined in the training YAML file. It contains several checkpoint folders named checkpoint-xx; one is written every backend_config.training.save_steps steps, up to backend_config.training.max_steps steps in total (see the sketch after this list).

  2. Modify the following resumable training configurations in the YAML training configuration file.
    backend_config.training.resume_from_checkpoint: ${af_output_dir}/checkpoint-xx
  3. Restart the training job. For details, see Step 2: Starting a Training Job.
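
As a hedged illustration, the checkpoint can be picked from a directory listing and wired into the YAML file as follows; checkpoint-1000 is a placeholder step number, not a value from this guide.
    cd ${af_output_dir}
    # List the saved checkpoints; the highest step number is the most recent one.
    ls -d checkpoint-*    # e.g. checkpoint-500  checkpoint-1000
    # In the YAML training configuration file, point resume_from_checkpoint at the chosen folder:
    backend_config.training.resume_from_checkpoint: ${af_output_dir}/checkpoint-1000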

VeRL

  1. Go to the training result output directory and obtain the specified checkpoint directory.
    cd ${af_output_dir}/checkpoint_${af_model_name}

    The checkpoint location is set in the YAML training configuration file by backend_config.trainer.default_local_dir. This directory contains multiple weight folders named global_step_xx; choose the one to resume from, typically the latest (see the sketch after this list).

  2. Edit the YAML training configuration file to resume training from a specific checkpoint or the most recent one.
    Method 1: Training from a specified checkpoint
    backend_config.trainer.resume_mode: resume_path
    backend_config.trainer.resume_from_path: ${af_output_dir}/checkpoint_${af_model_name}/global_step_xx
    
    Method 2: Training from the latest checkpoint
    backend_config.trainer.resume_mode: auto
  3. Restart the training job. For details, see Step 2: Starting a Training Job.
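
A minimal sketch of locating the checkpoint and filling in the corresponding settings; the global_step_300 value is a placeholder, not a value from this guide.
    cd ${af_output_dir}/checkpoint_${af_model_name}
    # List the saved checkpoints; the highest step number is the most recent one.
    ls -d global_step_*    # e.g. global_step_100  global_step_200  global_step_300
With that listing, method 1 would point resume_from_path at the chosen folder, for example:
    backend_config.trainer.resume_mode: resume_path
    backend_config.trainer.resume_from_path: ${af_output_dir}/checkpoint_${af_model_name}/global_step_300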

MindSpeed-RL

  1. Go to the training result output directory and obtain the specified checkpoint directory.
    cd ${af_output_dir}/checkpoint_${af_model_name}

    The directory set by backend_config.trainer.default_local_dir contains several global_step_xx weight folders. Choose the most recent one (see the sketch after this list).

  2. Modify the following resumable training configurations in the YAML training configuration file.
    actor_config:
      finetune: false # Set finetune to false for resumable training.
      load: ./ckpt-32b # Path to the previously saved weights to load for resumable training.
      save: ./ckpt
      no_load_optim: false # Set no_load_optim to false for resumable training.
      no_load_rng: false # Set no_load_rng to false for resumable training.
    
    rf_config:
      integrated_mode_config:
        ref_model_load_path: ./Qwen2.5-7B-tp4 # Keep the reference model's weight path pointing to the original model weights for resumable training.
  3. Restart the training job. For details, see Step 2: Starting a Training Job.
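
To confirm which checkpoint is the most recent before editing the configuration, a minimal shell sketch (the step numbers in the listing are illustrative):
    cd ${af_output_dir}/checkpoint_${af_model_name}
    # List the saved checkpoints; the highest global_step number is the most recent one.
    ls -d global_step_*    # e.g. global_step_50  global_step_100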

MindSpeed-MM

  1. Go to the training result output directory and check the saved weights.
    cd ${af_output_dir}

    ${af_output_dir} is the output folder set by the --save parameter in the training script. It contains several iter_xx weight folders and a latest_checkpointed_iteration.txt file that records the iteration number of the most recent checkpoint. To resume from a specific checkpoint, edit this file so that it contains the iteration number of the iter_xx folder you want to load (see the sketch after this list).

  2. Modify the following resumable training configurations in the YAML training configuration file.
    backend_config.training.no-load-optim: false
    backend_config.training.no-load-rng: false
  3. Restart the training job. For details, see Step 2: Starting a Training Job. Add the following hyperparameter to the ascendfactory-cli train command.
    --load ${af_output_dir}/saved_checkpoints
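
A minimal sketch of choosing the checkpoint to load and restarting; the iteration numbers and the placeholder <original arguments> are illustrative, not values taken from this guide.
    cd ${af_output_dir}
    cat latest_checkpointed_iteration.txt    # iteration of the most recent checkpoint, e.g. 2000
    # To load a different iter_xx folder, write its iteration number into the file.
    echo 1000 > latest_checkpointed_iteration.txt
    # After updating the YAML settings from step 2, restart with --load appended to the original command.
    ascendfactory-cli train <original arguments> --load ${af_output_dir}/saved_checkpoints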