Updated on 2025-11-04 GMT+08:00

Solution Overview

Description

ModelArts Lite Servers allow you to train popular open-source foundation models with the AscendFactory framework using Snt9b and Snt9b23 compute resources.

AscendFactory enables one-click training by integrating frameworks such as MindSpeed-LLM (formerly ModelLink), Llama-Factory, VeRL, MindSpeed-RL, and MindSpeed-MM.

Table 1 AscendFactory adaptation and training stages and strategies (√: supported; x: not supported)

| Training Framework | Pre-Training | Reinforcement Learning: GRPO | Supervised Fine-Tuning: Full | Supervised Fine-Tuning: LoRA |
|---|---|---|---|---|
| Llama-Factory | √ | x | √ | √ |
| MindSpeed-LLM | √ | x | √ | √ |
| VeRL | x | √ | x | x |
| MindSpeed-RL | x | √ | x | x |
| MindSpeed-MM | √ | x | √ | x |

Solution Architecture

This architecture shows how to train and deploy open-source third-party foundation models.

  • It outlines a solution for using these models in Lite Server scenarios, covering training, tuning, and O&M.
  • Lite Server and SFS Turbo serve as the deployment infrastructure. For public network access, bind an EIP to Lite Server resources.
  • Before using a third-party foundation model, create an image package using AscendFactory and its base image. This package includes the required training framework and dependencies.
  • Store trained model weights and process files in SFS Turbo shared file systems mounted on all nodes. Combined with resumable training, this ensures reliability. (Optional) If Snt9b23 nodes lack local disks, use OBS or EVS disks for log and file storage.
  • (Optional) Consider using Cloud Eye for monitoring and configuring alarms.
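The shared-storage bullet above amounts to mounting the same SFS Turbo file system on every training node. The following is a minimal ops sketch, assuming a hypothetical shared path (192.168.0.100:/) and mount point (/mnt/sfs_turbo); substitute the values shown on your SFS Turbo details page:

```shell
# Run as root on each node. The IP and mount point below are placeholders.
mkdir -p /mnt/sfs_turbo
mount -t nfs -o vers=3,timeo=600,noresvport,nolock 192.168.0.100:/ /mnt/sfs_turbo

# (Optional) Persist the mount across reboots.
echo "192.168.0.100:/ /mnt/sfs_turbo nfs vers=3,timeo=600,noresvport,nolock 0 0" >> /etc/fstab
```

Because every node sees the same directory tree, checkpoints written by one node are immediately visible to the node that resumes training.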

Usage Process

Use this solution to deploy open-source third-party foundation models by following these steps:

  1. Resource planning: Using the overall architecture as a guide, choose a deployment solution, then select the required compute, storage, and access-layer resources based on the supported models, features, and your resource plan.
  2. Training preparation: After buying compute and storage resources on Huawei Cloud, download a base image and software package to build your image. Then, prepare the required open-source weights and datasets for the model.
  3. Training execution: Adjust the suggested settings for each model using AI Compute Service and run the training jobs.
  4. (Optional) Log collection: After the preceding steps, the basic training job is complete. For easier maintenance, collect and review the key logs generated during training so that important issues can be identified quickly.
  5. (Optional) Monitoring and alarming: The system provides built-in monitoring. Learn the key metrics and configure alarms for them so that exceptions are reported promptly.
  6. (Optional) Configuration optimization: To improve training efficiency, consider using suggested optimization techniques tailored to your model's parameters and data size.
  7. (Optional) Resumable training: If training is interrupted unexpectedly, configure resumable training and restart the training job from the latest checkpoint.
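Resumable training (step 7) relies on checkpoints being written periodically to shared storage, so a restarted job can continue from the last saved step instead of step 0. The following is a minimal, framework-agnostic sketch; the file layout and names are illustrative, not AscendFactory's actual checkpoint format:

```python
import json
import os
import tempfile


def save_checkpoint(ckpt_dir, step, state):
    """Atomically write a checkpoint so a crash never leaves a partial file."""
    os.makedirs(ckpt_dir, exist_ok=True)
    path = os.path.join(ckpt_dir, f"ckpt_{step:08d}.json")
    fd, tmp = tempfile.mkstemp(dir=ckpt_dir)
    with os.fdopen(fd, "w") as f:
        json.dump({"step": step, "state": state}, f)
    os.replace(tmp, path)  # atomic rename on POSIX


def latest_checkpoint(ckpt_dir):
    """Return (step, state) of the newest checkpoint, or (0, None) if none exists."""
    if not os.path.isdir(ckpt_dir):
        return 0, None
    ckpts = sorted(f for f in os.listdir(ckpt_dir) if f.startswith("ckpt_"))
    if not ckpts:
        return 0, None
    with open(os.path.join(ckpt_dir, ckpts[-1])) as f:
        data = json.load(f)
    return data["step"], data["state"]


def train(ckpt_dir, total_steps, save_every=10):
    """Resume from the latest checkpoint and continue to total_steps."""
    step, state = latest_checkpoint(ckpt_dir)
    state = state or {"loss": None}
    while step < total_steps:
        step += 1
        state["loss"] = 1.0 / step  # stand-in for a real training step
        if step % save_every == 0 or step == total_steps:
            save_checkpoint(ckpt_dir, step, state)
    return step, state
```

With ckpt_dir on the SFS Turbo mount, a job killed mid-run restarts from the most recent multiple of save_every rather than from scratch, regardless of which node picks it up.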