Updated on 2025-11-04 GMT+08:00

Training Features Supported by Each Model

The AscendFactory solution supports various training features for each model, as outlined in this section.

Type

Series

Model

Pre-Training and Fine-Tuning

Reinforcement Learning

MindSpeed-LLM

LlamaFactory

MindSpeed-MM

VeRL

MindSpeed-RL

Pre-training and full-parameter fine-tuning

LoRA fine-tuning

Multi-sample pack

Flash

SPTD parallelism

Long sequence parallelism

Mixture of Expert (MoE) parallelism

Dynamic sentence length

Training methods

ZeRO parallelism

Flash

Pre-training and full-parameter fine-tuning

SPTD parallelism

Distributed optimizer

Recomputation

Training methods

sglang

vllm

Training backend

Training methods

vllm

Training backend Megatron

Long sequence parallelism

Fine-tuning

Attention

(SP, PP, TP, DP)

(Ring Attention, Ulysses, hybrid long sequence)

(Expert parallelism and communication rearrangement optimization)

(PT: pre-training)

(ZeRO-1, ZeRO-2, and ZeRO-3)

Attention

(SP, PP, TP, DP)

Version

Version

Version

LLM

DeepSeek

DeepSeek-R1-671B

N/A

N/A

N/A

N/A

DeepSeek-V3-671B

N/A

N/A

N/A

N/A

DeepSeek-V2-Lite 16B

N/A

N/A

N/A

N/A

Qwen2

Qwen2-0.5B

PT, SFT

N/A

N/A

N/A

N/A

Qwen2-1.5B

N/A

N/A

N/A

N/A

Qwen2-7B

PT, SFT

N/A

N/A

N/A

N/A

Qwen2-72B

PT, SFT

N/A

N/A

N/A

N/A

Qwen2.5

Qwen2.5-0.5B

PT, SFT

N/A

N/A

N/A

N/A

Qwen2.5-1.5B

N/A

N/A

N/A

N/A

GRPO

0.9.1

Qwen2.5-7B

PT, SFT

N/A

N/A

N/A

N/A

GRPO

0.9.1

Qwen2.5-14B

PT, SFT, DPO

N/A

N/A

N/A

N/A

Qwen2.5-32B

PT, SFT

N/A

N/A

N/A

N/A

GRPO, DAPO, PPO

0.9.1

FSDP

GRPO

0.9.1

Qwen2.5-72B

PT, SFT, DPO

N/A

N/A

N/A

N/A

Qwen3

Qwen3-0.6B

PT, SFT

N/A

N/A

N/A

N/A

Qwen3-1.7B

PT, SFT

N/A

N/A

N/A

N/A

Qwen3-4B

PT, SFT

N/A

N/A

N/A

N/A

Qwen3-8B

PT, SFT

N/A

N/A

N/A

N/A

GRPO

0.9.1

FSDP

Qwen3-14B

PT, SFT

N/A

N/A

N/A

N/A

GRPO, DAPO, PPO

0.9.1

FSDP

Qwen3-32B

PT, SFT

N/A

N/A

N/A

N/A

GRPO, DAPO, PPO

0.9.1

FSDP

Qwen3-30B-A3B

PT, SFT

N/A

N/A

N/A

N/A

Qwen3-235b-A22B

PT, SFT

N/A

N/A

N/A

N/A

Llama

Llama3.1 -8B/70B

PT, SFT

N/A

N/A

N/A

N/A

Llama3.2-1B/3B

PT, SFT

N/A

N/A

N/A

N/A

GLM

glm-4-9b-chat

PT, SFT

N/A

N/A

N/A

N/A

Mixtral

Mixtral-8x7B-Instruct-v0.1

N/A

N/A

N/A

N/A

Multimodal model

Qwen2 VL

Qwen2-VL-2B

N/A

N/A

N/A

N/A

N/A

N/A

N/A

N/A

PT, SFT

N/A

N/A

N/A

N/A

Qwen2-VL-7B

N/A

N/A

N/A

N/A

N/A

N/A

N/A

N/A

PT, SFT

N/A

N/A

N/A

N/A

Qwen2-VL-72B

N/A

N/A

N/A

N/A

N/A

N/A

N/A

N/A

PT, SFT

N/A

N/A

N/A

N/A

Qwen2.5 VL

Qwen2.5-VL-3B

N/A

N/A

N/A

N/A

N/A

N/A

N/A

N/A

PT, SFT

GRPO

0.9.1

FSDP

N/A

N/A

N/A

N/A

Qwen2.5-VL-7B

N/A

N/A

N/A

N/A

N/A

N/A

N/A

N/A

PT, SFT, DPO

GRPO, DAPO, PPO

0.9.1

FSDP

N/A

N/A

N/A

N/A

Qwen2.5-VL-32B

N/A

N/A

N/A

N/A

N/A

N/A

N/A

N/A

PT, SFT

GRPO, DAPO, PPO

0.9.1

FSDP

N/A

N/A

N/A

N/A

Qwen2.5-VL-72B

N/A

N/A

N/A

N/A

N/A

N/A

N/A

N/A

PT, SFT

GRPO

0.9.1

FSDP

N/A

N/A

N/A

N/A

Gemma

Gemma3-27b

N/A

N/A

N/A

N/A

N/A

N/A

N/A

N/A

PT, SFT

N/A

N/A

N/A

N/A

  • "N/A" means the model does not work with the framework. Multimodal models, for example, do not support the MindSpeed-LLM training framework.