Updated on 2025-06-03 GMT+08:00

Standard Model Training

ModelArts Standard Model Training offers containerized services and compute resource management capabilities. It establishes and manages the infrastructure for machine learning training workloads, alleviating the burden on users and providing a flexible, stable, user-friendly, and high-performance deep learning training environment. With ModelArts Standard Model Training, you can focus on developing, training, and fine-tuning models.

ModelArts Standard Model Training supports large-scale training jobs and provides a highly available training environment.

  • Supports distributed training with single-device multi-card and multi-device multi-card configurations, effectively accelerating the training process.
  • Supports fault awareness, diagnosis, and recovery for training jobs, covering hardware failures and job freezes. Recovery is available at the process, container, and job levels, which keeps long-running training jobs stable.
  • Supports checkpoint-based training resumption and incremental training. If training is interrupted, it can resume from a saved checkpoint instead of starting from scratch. This improves reliability for models that take a long time to train and avoids the time and compute cost of a full restart.
  • Supports mounting an SFS Turbo file system for training data. Intermediate and result data generated by training jobs can be written directly to the SFS Turbo cache, where downstream business processes can read and process it. Result data can be exported asynchronously to the associated OBS for long-term, low-cost storage. This accelerates access to OBS data in training scenarios.
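
The checkpoint-based resumption described in the list above can be illustrated with a minimal, framework-agnostic sketch. It does not use any ModelArts API. The function names (`save_checkpoint`, `load_checkpoint`, `train`), the JSON checkpoint format, and the `stop_after` parameter are all hypothetical and exist only for illustration. A real training script would typically save model and optimizer state using its framework's own serialization.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the checkpoint atomically so an interruption never leaves a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    # os.replace swaps the finished temporary file into place in one step.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh initial state if no checkpoint exists."""
    if not os.path.exists(path):
        return {"epoch": 0, "weight": 0.0}
    with open(path) as f:
        return json.load(f)


def train(path, total_epochs, stop_after=None):
    """Train from the last completed epoch; stop_after simulates an interruption."""
    state = load_checkpoint(path)
    for epoch in range(state["epoch"], total_epochs):
        if stop_after is not None and epoch >= stop_after:
            break  # simulated interruption (hypothetical, for demonstration only)
        state["weight"] += 0.5  # stand-in for a real training step
        state["epoch"] = epoch + 1
        save_checkpoint(path, state)  # persist progress after every epoch
    return state
```

Calling `train` once with `stop_after` set and then again without it continues from the last saved epoch rather than epoch 0. Writing to a temporary file and renaming it is one common way to keep the previous checkpoint usable if the job is interrupted during a save.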

ModelArts Standard Model Training provides convenient job management capabilities that make model training development more efficient.

  • Provides algorithm asset management. You can create training jobs from algorithm assets, custom algorithms, or algorithms subscribed to in AI Gallery, which makes job creation more flexible and user-friendly.
  • Provides experiment management. Finding the best model often takes multiple rounds of training jobs with adjusted datasets and hyperparameters. Model training supports unified management of multiple training jobs, making it easier for you to choose the best model.
  • Provides event information (key events in the training job lifecycle), training logs (runtime and exception information), resource monitoring (resource utilization), and Cloud Shell (for logging in to training containers). These give you a clearer view of how a training job runs and help you locate and troubleshoot issues more accurately when exceptions occur.