Minimum Number of PUs and Sequence Length Supported by Each Model
Model Training Time and Cluster Scale Prediction
Training time and the number of PUs required depend on the model, the cluster specifications (Snt9b B3/B2/B1 or Snt9b23), and the dataset size. To estimate the PU count or the training time, use the formulas below; a short worked example follows the parameter descriptions.
- Training time (seconds) = Total tokens/(TPS x Number of PUs). This estimate gives only a rough range and is for reference.
- Number of training PUs = Total tokens/(Time x TPS). If this value exceeds eight, round it up to the next multiple of eight, and make sure it is no less than the model's minimum PU requirement.
Parameters:
- Total tokens: Depends on several factors, such as dataset size, number of epochs, sequence length, and the model used. Preprocessing steps such as tokenization and padding can add additional tokens.
- Total tokens (calculated from training steps) = Sequence length x Total number of dataset samples. The sequence length can be either dynamic or fixed; dynamic sequences typically produce fewer tokens than this calculation suggests. For details about how to set these parameters, see Table 1. By default, MindSpeed-LLM and LlaMA-Factory use a fixed sequence length.
- TPS: Check the benchmark table for each model's throughput (token/s/p, that is, tokens per second per PU) and the number of training PUs used. The table shows baseline measurements taken with a fixed sequence length. To obtain the benchmark table, contact Huawei engineers.
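The following minimal Python sketch applies the two formulas above. The helper names and the sample figures (tokens per second per PU, dataset size, time budget) are illustrative assumptions, not values from the benchmark table.

```python
# Hedged sketch of the estimation formulas above; sample numbers are illustrative only.
import math

def total_tokens(sequence_length: int, num_samples: int) -> int:
    """Upper-bound token count for fixed-length sequences (dynamic packing yields fewer)."""
    return sequence_length * num_samples

def training_time_seconds(tokens: int, tps_per_pu: float, num_pus: int) -> float:
    """Training time (seconds) = Total tokens / (TPS x Number of PUs)."""
    return tokens / (tps_per_pu * num_pus)

def required_pus(tokens: int, target_seconds: float, tps_per_pu: float, minimum_pus: int = 8) -> int:
    """Number of PUs = Total tokens / (Time x TPS), rounded up to a multiple of eight when
    it exceeds eight and clamped to the model's minimum PU requirement."""
    raw = tokens / (target_seconds * tps_per_pu)
    rounded = max(1, math.ceil(raw))
    if rounded > 8:
        rounded = math.ceil(rounded / 8) * 8   # next multiple of eight
    return max(rounded, minimum_pus)

# Example: 100,000 samples at a fixed 4,096 sequence length, assuming 1,500 tokens/s per PU.
tokens = total_tokens(4096, 100_000)
print(training_time_seconds(tokens, tps_per_pu=1500, num_pus=8) / 3600, "hours on 8 PUs")
print(required_pus(tokens, target_seconds=24 * 3600, tps_per_pu=1500), "PUs for a 24-hour run")
```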
Minimum Number of PUs for Model Training
The table below lists the recommended training parameters and compute specifications for each model. Currently, only the number of PUs for the supervised fine-tuning and pre-training phases is provided. An Snt9b flavor typically provides eight PUs per node. An Snt9b23 flavor also provides eight PUs per node, which correspond to 16 DIEs; one DIE is roughly equivalent to one Snt9b PU. When you set the parallel strategy for Snt9b23 training, the smallest unit is 2 DIEs. The configurations below are for reference only. If a setup lists fewer than eight PUs, default to eight PUs for training and adjust the PU count as needed. A short sketch that converts a table entry into a node count follows this paragraph.
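As a quick illustration of the flavor notes above, the hedged Python sketch below converts a table entry into a node count. The per-node figures (eight PUs for Snt9b, 16 DIEs for Snt9b23) and the 2-DIE minimum come from the paragraph above; the function name and the example values are assumptions for illustration.

```python
# Minimal sketch: turn a table entry into a node count for each flavor.
import math

def nodes_needed(count: int, flavor: str) -> int:
    """count is a PU count for Snt9b or a DIE count for Snt9b23, as listed in the table below."""
    if flavor == "Snt9b":
        return math.ceil(count / 8)      # 8 PUs per Snt9b node
    if flavor == "Snt9b23":
        count = max(count, 2)            # smallest parallel unit is 2 DIEs
        return math.ceil(count / 16)     # 16 DIEs (8 PUs x 2 DIEs) per node
    raise ValueError(f"unknown flavor: {flavor}")

print(nodes_needed(64, "Snt9b"))    # a 64-PU entry -> 8 nodes
print(nodes_needed(32, "Snt9b23"))  # a 32-DIE entry -> 2 nodes
```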
| Supported Model Parameters | Training Strategy | Sequence Length | MindSpeed-LLM Snt9b (PUs) | MindSpeed-LLM Snt9b23 (DIEs) | LlaMA-Factory Snt9b (PUs) | LlaMA-Factory Snt9b23 (DIEs) | VeRL Snt9b (PUs) | VeRL Snt9b23 (DIEs) | MindSpeed-RL Snt9b (PUs) | MindSpeed-RL Snt9b23 (DIEs) | MindSpeed-MM Snt9b (PUs) | MindSpeed-MM Snt9b23 (DIEs) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| llama3.1-8b | Full-parameter | 4,096/8,192 | 4 | 8 | - | - | - | - | - | - | - | - |
| llama3.1-8b | LoRA | 4,096/8,192 | 4 | - | 1 | 2 | - | - | - | - | - | - |
| llama3.1-70b | Full-parameter | 4,096 | 32 | 64 | - | - | - | - | - | - | - | - |
| llama3.1-70b | LoRA | 4,096 | 16 | 32 | - | - | - | - | - | - | - | - |
| llama3.1-70b | Full-parameter | 8,192 | 64 | 64 | - | - | - | - | - | - | - | - |
| llama3.1-70b | LoRA | 8,192 | 16 | 32 | - | - | - | - | - | - | - | - |
| llama3.2-1b | Full-parameter/LoRA | 4,096/8,192 | 1 | 2 | 1 | 1 | - | - | - | - | - | - |
| llama3.2-3b | Full-parameter | 4,096/8,192 | 2 | 4 | - | - | - | - | - | - | - | - |
| llama3.2-3b | LoRA | 4,096/8,192 | 1 | 2 | 1 | 2 | - | - | - | - | - | - |
| qwen2-0.5b | Full-parameter/LoRA | 4,096/8,192 | 1 | 2 | 1 | 2 | - | - | - | - | - | - |
| qwen2-1.5b | Full-parameter/LoRA | 4,096/8,192 | 1 | 2 | - | - | - | - | - | - | - | - |
| qwen2-7b | Full-parameter | 4,096 | 4 | - | 1 | 2 | - | - | - | - | - | - |
| qwen2-7b | LoRA | 4,096 | 4 | 8 | - | - | - | - | - | - | - | - |
| qwen2-7b | Full-parameter | 8,192 | 8 | - | 1 | 2 | - | - | - | - | - | - |
| qwen2-7b | LoRA | 8,192 | 8 | 8 | - | - | - | - | - | - | - | - |
| qwen2-72b | Full-parameter | 4,096 | 32 | 64 | - | - | - | - | - | - | - | - |
| qwen2-72b | LoRA | 4,096 | 16 | 32 | - | - | - | - | - | - | - | - |
| qwen2-72b | Full-parameter | 8,192 | 64 | 64 | - | - | - | - | - | - | - | - |
| qwen2-72b | LoRA | 8,192 | 16 | 32 | - | - | - | - | - | - | - | - |
| qwen2.5-0.5b | Full-parameter/LoRA | 4,096/8,192 | 1 | 2 | 1 | 2 | - | - | - | - | - | - |
| qwen2.5-1.5b | Full-parameter/LoRA | 4,096/8,192 | 1 | 2 | - | - | - | 8 | - | - | - | - |
| qwen2.5-7b | Full-parameter | 4,096 | 4 | 8 | 8 | 8 | 8 | 8 | - | - | - | - |
| qwen2.5-7b | LoRA | 4,096 | 2 | - | 1 | 2 | - | - | - | - | - | - |
| qwen2.5-7b | Full-parameter | 8,192 | 8 | 8 | - | - | - | - | - | - | - | - |
| qwen2.5-7b | LoRA | 8,192 | 2 | - | 1 | 2 | - | - | - | - | - | - |
| qwen2.5-14b | Full-parameter | 4,096 | 8 | 8 | 8 | 8 | - | - | - | - | - | - |
| qwen2.5-14b | LoRA | 4,096 | 4 | 4 | - | - | - | - | - | - | - | - |
| qwen2.5-14b | Full-parameter | 8,192 | 8 | 16 | - | - | - | - | - | - | - | - |
| qwen2.5-14b | LoRA | 8,192 | 8 | 4 | - | - | - | - | - | - | - | - |
| qwen2.5-32b | Full-parameter | 4,096 | 16 | 32 | 16 | 16 | 16 | 16 | - | - | - | - |
| qwen2.5-32b | LoRA | 4,096 | 16 | 8 | - | - | - | - | - | - | - | - |
| qwen2.5-32b | Full-parameter | 8,192 | 16 | 32 | - | - | - | - | - | - | - | - |
| qwen2.5-32b | LoRA | 8,192 | 16 | 16 | - | - | - | - | - | - | - | - |
| qwen2.5-72b | Full-parameter | 4,096 | 32 | 64 | - | - | - | - | - | - | - | - |
| qwen2.5-72b | LoRA | 4,096 | 16 | 32 | - | - | - | - | - | - | - | - |
| qwen2.5-72b | Full-parameter | 8,192 | 64 | 64 | - | - | - | - | - | - | - | - |
| qwen2.5-72b | LoRA | 8,192 | 16 | 32 | - | - | - | - | - | - | - | - |
| qwen2vl-2b | Full-parameter | 4,096/8,192 | - | - | 2 | - | - | - | - | - | - | - |
| qwen2vl-2b | LoRA | 4,096/8,192 | - | - | 1 | - | - | - | - | - | - | - |
| qwen2vl-7b | Full-parameter | 4,096/8,192 | - | - | 8 | - | - | - | - | - | - | - |
| qwen2vl-7b | LoRA | 4,096/8,192 | - | - | 1 | 2 | - | - | - | - | - | - |
| qwen2vl-72b | Full-parameter | 1,024 | - | - | 32 | - | - | - | - | - | - | - |
| qwen2vl-72b | LoRA | 1,024 | - | - | 16 | - | - | - | - | - | - | - |
| qwen2.5_vl-3b | Full-parameter | 1,024 | - | - | - | - | - | - | - | - | 8 | - |
| qwen2.5_vl-7b | Full-parameter | 1,024/4,096/8,192 | - | - | 8 | 8 | 8 | - | - | - | 8 | - |
| qwen2.5_vl-7b | LoRA | 4,096 | - | - | 1 | 2 | - | - | - | - | - | - |
| qwen2.5_vl-32b | Full-parameter | 4,096 | - | - | 32 | 16 | - | - | - | - | - | - |
| qwen2.5_vl-32b | Full-parameter | 8,192 | - | - | 64 | - | - | - | - | - | - | - |
| qwen2.5_vl-32b | LoRA | 4,096/8,192 | - | - | 16 | - | - | - | - | - | - | - |
| qwen2.5_vl-72b | Full-parameter | 4,096/8,192 | - | - | 64 | - | - | - | - | - | - | - |
| qwen2.5_vl-72b | LoRA | 4,096/8,192 | - | - | 32 | - | - | - | - | - | - | - |
| qwen3-0.6b | Full-parameter/LoRA | 4,096/8,192 | 8 | 8 | - | - | - | - | - | - | - | - |
| qwen3-1.7b | Full-parameter/LoRA | 4,096/8,192 | 8 | 8 | - | - | - | - | - | - | - | - |
| qwen3-4b | Full-parameter/LoRA | 4,096/8,192 | 8 | 8 | - | - | - | - | - | - | - | - |
| qwen3-8b | Full-parameter/LoRA | 4,096/8,192 | 8 | 8 | 8 | - | - | - | - | - | - | - |
| qwen3-14b | Full-parameter/LoRA | 4,096/8,192 | 8 | 8 | - | - | - | - | - | - | - | - |
| qwen3-32b | Full-parameter | 4,096 | 16 | 32 | 16 | - | - | - | - | - | - | - |
| qwen3-32b | Full-parameter | 8,192 | 16 | 32 | - | - | - | - | - | - | - | - |
| qwen3-32b | LoRA | 4,096 | 8 | 8 | - | - | - | - | - | - | - | - |
| qwen3-32b | LoRA | 8,192 | 8 | 16 | - | - | - | - | - | - | - | - |
| qwen3_moe-30B_A3B | Full-parameter | 4,096 | 16 | 32 | - | - | - | - | - | - | - | - |
| qwen3_moe-30B_A3B | Full-parameter | 8,192 | 32 | 64 | - | - | - | - | - | - | - | - |
| qwen3_moe-30B_A3B | LoRA | 4,096/8,192 | 16 | 32 | - | - | - | - | - | - | - | - |
| qwen3_moe-235B_A22B | Full-parameter | 4,096 | 256 | 512 | - | - | - | - | - | - | - | - |
| qwen3_moe-235B_A22B | LoRA | 4,096 | 128 | 256 | - | - | - | - | - | - | - | - |
| glm4-9b | Full-parameter | 4,096/8,192 | 8 | 8 | - | - | - | - | - | - | - | - |
| glm4-9b | LoRA | 4,096/8,192 | 2 | - | 1 | 2 | - | - | - | - | - | - |
| mixtral-8x7b | Full-parameter | 4,096/8,192 | 16 | - | - | - | - | - | - | - | - | - |
| DeepSeek-V3/R1 | Full-parameter | 4,096 | 512 | - | - | - | - | - | - | - | - | - |
| DeepSeek-V3/R1 | LoRA | 4,096 | 64 | - | - | - | - | - | - | - | - | - |
| internvl2.5-8b | Full-parameter/LoRA | 4,096/8,192 | - | - | 8 | - | - | - | - | - | - | - |
| internvl2.5-38b | Full-parameter | 4,096/8,192 | - | - | 32 | - | - | - | - | - | - | - |
| internvl2.5-38b | LoRA | 4,096/8,192 | - | - | 16 | - | - | - | - | - | - | - |
| internvl2.5-78b | Full-parameter | 4,096 | - | - | 32 | - | - | - | - | - | - | - |
| internvl2.5-78b | Full-parameter | 8,192 | - | - | 64 | - | - | - | - | - | - | - |
| internvl2.5-78b | LoRA | 4,096 | - | - | 16 | - | - | - | - | - | - | - |
| internvl2.5-78b | LoRA | 8,192 | - | - | 32 | - | - | - | - | - | - | - |
| gemma3-27b | Full-parameter | 4,096 | - | - | 16 | - | - | - | - | - | - | - |
| gemma3-27b | Full-parameter | 8,192 | - | - | 48 | - | - | - | - | - | - | - |
| gemma3-27b | LoRA | 4,096/8,192 | - | - | 16 | - | - | - | - | - | - | - |
- LlaMA-Factory's ZeRO parallelism splits the optimizer states, gradients, and weights across multiple PUs, so the cluster size affects both the optimal configuration and overall performance (see the sketch after these notes).
- Enabling distributed optimizer parallelism on MindSpeed-LLM splits and shares optimizer parameters across all nodes in the cluster. The best setup depends on the total number of PUs used.
- The benchmark configurations balance using the fewest PUs against peak performance. Adjust the settings according to your cluster size and needs.
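The following rough, illustrative Python sketch shows why the cluster size changes the viable setup under ZeRO-style sharding: the per-PU bytes for weights, gradients, and optimizer states shrink roughly linearly with the number of PUs. The byte counts assume mixed precision with an Adam-style optimizer (fp16 weights and gradients plus about 12 bytes per parameter of optimizer states); activations and framework overhead are ignored, and the figures are not from the benchmark table.

```python
# Rough per-PU memory estimate under ZeRO-style sharding (illustrative assumptions only).
def per_pu_state_gib(num_params_billion: float, num_pus: int, shard: bool = True) -> float:
    params = num_params_billion * 1e9
    total_bytes = params * (2 + 2 + 12)   # fp16 weights + fp16 gradients + ~12 B/param optimizer states
    if shard:
        total_bytes /= num_pus            # ZeRO-3-style sharding spreads all three across PUs
    return total_bytes / 1024 ** 3

for pus in (8, 16, 32, 64):
    print(f"{pus} PUs: ~{per_pu_state_gib(70, pus):.1f} GiB of model/optimizer state per PU")
```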