Resumable Training
Resumable training lets you restart an interrupted training job from the most recently saved weights instead of starting over, which avoids recomputing completed steps. To resume training after an interruption, follow the steps for your training backend.
MindSpeed-LLM
- Go to the training result output directory and check the saved weights.
cd ${af_output_dir}
${af_output_dir} is the output directory set by the --save parameter in the training script. It contains several iter_xx weight folders and a latest_checkpointed_iteration.txt file that records the last checkpointed iteration. To resume from a specific checkpoint, edit this file so that it points to the weight folder you want to load.
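A minimal sketch of this step, assuming the directory holds checkpoints at iterations 500 and 1000 (the folder names and iteration numbers are illustrative):
cd ${af_output_dir}
ls
# iter_0000500  iter_0001000  latest_checkpointed_iteration.txt
cat latest_checkpointed_iteration.txt
# 1000
# To resume from iteration 500 instead of the latest checkpoint,
# overwrite the file with that iteration number:
echo 500 > latest_checkpointed_iteration.txt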
- Modify the following resumable training configurations in the YAML training configuration file.
backend_config.training.no-load-optim: false
backend_config.training.no-load-rng: false
backend_config.training.finetune: false
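Depending on how your configuration file is organized, these dotted keys may correspond to nested YAML sections. A minimal sketch, assuming a nested layout (check your existing file for the exact structure):
backend_config:
  training:
    no-load-optim: false   # false = load the optimizer state from the checkpoint
    no-load-rng: false     # false = load the RNG state from the checkpoint
    finetune: false        # false = continue the interrupted run from its saved iteration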
- Restart the training job. For details, see Step 2: Starting a Training Job. Add the following hyperparameter to the ascendfactory-cli train command.
--load ${af_output_dir}/saved_checkpoints
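For reference, a hedged sketch of the resumed launch. Everything except the --load flag is a placeholder for the exact command described in Step 2: Starting a Training Job:
ascendfactory-cli train <your_training_config>.yaml --load ${af_output_dir}/saved_checkpoints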
LLaMA-Factory
- Go to the training result output directory and obtain the specified checkpoint directory.
cd ${af_output_dir}
${af_output_dir} stores outputs as defined in the training YAML file. It contains several checkpoint folders named checkpoint-xx, which are created according to backend_config.training.max_steps and backend_config.training.save_steps.
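For example, with hypothetical values of backend_config.training.max_steps: 1000 and backend_config.training.save_steps: 200, a checkpoint is written every 200 steps:
cd ${af_output_dir}
ls -d checkpoint-*
# checkpoint-200  checkpoint-400  checkpoint-600  checkpoint-800  checkpoint-1000
Pick the folder with the highest step number to resume from the most recent state.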
- Modify the following resumable training configurations in the YAML training configuration file.
backend_config.training.resume_from_checkpoint: ${af_output_dir}/checkpoint-xx
- Restart the training job. For details, see Step 2: Starting a Training Job.
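Continuing the illustrative listing above, resuming from step 800 would mean setting (the step number is hypothetical):
backend_config.training.resume_from_checkpoint: ${af_output_dir}/checkpoint-800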
VeRL
- Go to the training result output directory and obtain the specified checkpoint directory.
cd ${af_output_dir}/checkpoint_${af_model_name}
The YAML training configuration file sets the checkpoint location using backend_config.trainer.default_local_dir. This folder contains multiple weight directories named global_step_XX. Choose the latest one.
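A short sketch of this step, assuming checkpoints were saved at steps 50 and 100 (the step numbers are illustrative):
cd ${af_output_dir}/checkpoint_${af_model_name}
ls -d global_step_*
# global_step_50  global_step_100
# global_step_100 is the most recent checkpoint in this example.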
- Edit the YAML training configuration file to resume training from a specific checkpoint or the most recent one.
Method 1: Training from a specified checkpoint
backend_config.trainer.resume_mode: resume_path
backend_config.trainer.resume_from_path: ${af_output_dir}/checkpoint_${af_model_name}/global_step_xx
Method 2: Training from the latest checkpoint
backend_config.trainer.resume_mode: auto
- Restart the training job. For details, see Step 2: Starting a Training Job.
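For reference, if your configuration file uses nested sections rather than dotted keys, the two methods above might look roughly like this (a sketch; the global_step_xx path is a placeholder you must replace with a real folder):
backend_config:
  trainer:
    # Method 1: resume from an explicitly chosen checkpoint
    resume_mode: resume_path
    resume_from_path: ${af_output_dir}/checkpoint_${af_model_name}/global_step_xx
    # Method 2: comment out the two lines above and resume from the latest
    # checkpoint under backend_config.trainer.default_local_dir instead
    # resume_mode: auto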
MindSpeed-RL
- Go to the training result output directory and obtain the specified checkpoint directory.
cd ${af_output_dir}/checkpoint_${af_model_name}
The directory specified by backend_config.trainer.default_local_dir contains several global_step_XX weight folders. Choose the most recent one.
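To avoid picking the wrong folder by eye, the step numbers can be sorted numerically (assuming the global_step_XX naming shown above):
cd ${af_output_dir}/checkpoint_${af_model_name}
ls -d global_step_* | sort -t_ -k3 -n | tail -n 1   # prints the newest global_step_XX folder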
- Modify the following resumable training configurations in the YAML training configuration file.
actor_config:
  finetune: false        # Set finetune to false for resumable training.
  load: ./ckpt-32b       # Path to the previously saved weights to load for resumable training.
  save: ./ckpt
  no_load_optim: false   # Set no_load_optim to false for resumable training.
  no_load_rng: false     # Set no_load_rng to false for resumable training.
rf_config:
  integrated_mode_config:
    ref_model_load_path: ./Qwen2.5-7B-tp4   # Set the reference model's weight path to the original model for resumable training.
- Restart the training job. For details, see Step 2: Starting a Training Job.
MindSpeed-MM
- Go to the training result output directory and check the saved weights.
cd ${af_output_dir}
${af_output_dir} is the output directory set by the --save parameter in the training script. It contains several iter_xx weight folders and a latest_checkpointed_iteration.txt file that records the last checkpointed iteration. To resume from a specific checkpoint, edit this file so that it points to the weight folder you want to load.
- Modify the following resumable training configurations in the YAML training configuration file.
backend_config.training.no-load-optim: false
backend_config.training.no-load-rng: false
- Restart the training job. For details, see Step 2: Starting a Training Job. Add the following hyperparameter to the ascendfactory-cli train command.
--load ${af_output_dir}/saved_checkpoints