Resumable Training
Overview
Resumable training indicates that an interrupted training job can be automatically resumed from the checkpoint where the previous training was interrupted. This method is applicable to model training that takes a long time.
The checkpoint mechanism enables resumable training.
During model training, training results (including but not limited to epochs, model weights, optimizer status, and scheduler status) are continuously saved. In this way, an interrupted training job can be automatically resumed from the checkpoint where the previous training was interrupted.
To resume a training job, load a checkpoint and use the checkpoint information to initialize the training status. To do so, add reload ckpt to the code.
Implementing Resumable Training in ModelArts Standard
To resume model training or incrementally train a model in ModelArts Standard, configure training output.
When creating a training job, set the data path to the training output, save checkpoints in this data path, and set Predownload to Yes. If you set Predownload to Yes, the system automatically downloads the checkpoint file in the training output data path to a local directory of the training container before the training job is started.
Enable fault tolerance check (auto restart) for resumable training. On the training job creation page, enable Auto Restart. If the environment pre-check fails, the hardware is not functional, or the training job fails, ModelArts will automatically issue the training job again.
Feedback
Was this page helpful?
Provide feedbackThank you very much for your feedback. We will continue working to improve the documentation.