Minimum Number of PUs and Sequence Length Supported by Each Model
Model Training Time and Cluster Scale Prediction
Training time and the number of PUs required depend on the model, the cluster specifications (Snt9b B3/B2/B1 or Snt9b23), and the dataset size. To estimate the PU count or the training time, use the formulas below; a short worked example follows the parameter descriptions.
- Training time (seconds) = Total tokens/(TPS x Number of PUs). This estimate gives only a rough range and is for reference.
- Number of training PUs = Total tokens/(Time x TPS). If this value exceeds eight, round it up to the next multiple of eight, and make sure it is no less than the model's minimum PU requirement.
Parameters:
- Total tokens: Depends on several factors, such as dataset size, number of epochs, sequence length, and the model used. Preprocessing steps such as tokenization and padding can add additional tokens.
- Total tokens (calculated from training steps) = Sequence length x Total number of dataset samples. The sequence length can be either dynamic or fixed; dynamic sequences typically produce fewer tokens than this calculation suggests. For details about how to set these parameters, see Table 1. By default, MindSpeed-LLM and LlaMA-Factory use a fixed sequence length.
- TPS: Check the benchmark table for each model's throughput (token/s/p, that is, tokens per second per PU) and the number of training PUs used. The table shows baseline measurements taken with a fixed sequence length. To obtain the benchmark table, contact Huawei engineers.
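The following minimal Python sketch applies the two formulas above. The helper names and the sample figures (tokens per second per PU, dataset size, time budget) are illustrative assumptions, not values from the benchmark table.

```python
# Hedged sketch of the estimation formulas above; sample numbers are illustrative only.
import math

def total_tokens(sequence_length: int, num_samples: int) -> int:
    """Upper-bound token count for fixed-length sequences (dynamic packing yields fewer)."""
    return sequence_length * num_samples

def training_time_seconds(tokens: int, tps_per_pu: float, num_pus: int) -> float:
    """Training time (seconds) = Total tokens / (TPS x Number of PUs)."""
    return tokens / (tps_per_pu * num_pus)

def required_pus(tokens: int, target_seconds: float, tps_per_pu: float, minimum_pus: int = 8) -> int:
    """Number of PUs = Total tokens / (Time x TPS), rounded up to a multiple of eight when
    it exceeds eight and clamped to the model's minimum PU requirement."""
    raw = tokens / (target_seconds * tps_per_pu)
    rounded = max(1, math.ceil(raw))
    if rounded > 8:
        rounded = math.ceil(rounded / 8) * 8   # next multiple of eight
    return max(rounded, minimum_pus)

# Example: 100,000 samples at a fixed 4,096 sequence length, assuming 1,500 tokens/s per PU.
tokens = total_tokens(4096, 100_000)
print(training_time_seconds(tokens, tps_per_pu=1500, num_pus=8) / 3600, "hours on 8 PUs")
print(required_pus(tokens, target_seconds=24 * 3600, tps_per_pu=1500), "PUs for a 24-hour run")
```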
Minimum Number of PUs for Model Training
The table below lists the recommended training parameters and compute specifications for each model. Currently, only the number of PUs for the supervised fine-tuning and pre-training phases is provided. An Snt9b flavor typically provides eight PUs per node. An Snt9b23 flavor also provides eight PUs per node, which correspond to 16 DIEs; one DIE is roughly equivalent to one Snt9b PU. When you set the parallel strategy for Snt9b23 training, the smallest unit is 2 DIEs. The configurations below are for reference only. If a setup lists fewer than eight PUs, default to eight PUs for training and adjust the PU count as needed. A short sketch that converts a table entry into a node count follows this paragraph.
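As a quick illustration of the flavor notes above, the hedged Python sketch below converts a table entry into a node count. The per-node figures (eight PUs for Snt9b, 16 DIEs for Snt9b23) and the 2-DIE minimum come from the paragraph above; the function name and the example values are assumptions for illustration.

```python
# Minimal sketch: turn a table entry into a node count for each flavor.
import math

def nodes_needed(count: int, flavor: str) -> int:
    """count is a PU count for Snt9b or a DIE count for Snt9b23, as listed in the table below."""
    if flavor == "Snt9b":
        return math.ceil(count / 8)      # 8 PUs per Snt9b node
    if flavor == "Snt9b23":
        count = max(count, 2)            # smallest parallel unit is 2 DIEs
        return math.ceil(count / 16)     # 16 DIEs (8 PUs x 2 DIEs) per node
    raise ValueError(f"unknown flavor: {flavor}")

print(nodes_needed(64, "Snt9b"))    # a 64-PU entry -> 8 nodes
print(nodes_needed(32, "Snt9b23"))  # a 32-DIE entry -> 2 nodes
```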
| Supported Model Parameters | Training Strategy | Sequence Length | MindSpeed-LLM Snt9b (PUs) | MindSpeed-LLM Snt9b23 (DIEs) | LlaMA-Factory Snt9b (PUs) | LlaMA-Factory Snt9b23 (DIEs) | VeRL Snt9b (PUs) | VeRL Snt9b23 (DIEs) | MindSpeed-RL Snt9b (PUs) | MindSpeed-RL Snt9b23 (DIEs) | MindSpeed-MM Snt9b (PUs) | MindSpeed-MM Snt9b23 (DIEs) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| llama3.1-8b | Full-parameter | 4,096/8,192 | 4 | 8 | - | - | - | - | - | - | - | - |
| llama3.1-8b | LoRA | 4,096/8,192 | 4 | - | 1 | 2 | - | - | - | - | - | - |
| llama3.1-70b | Full-parameter | 4,096 | 32 | 64 | - | - | - | - | - | - | - | - |
| llama3.1-70b | LoRA | 4,096 | 16 | 32 | - | - | - | - | - | - | - | - |
| llama3.1-70b | Full-parameter | 8,192 | 64 | 64 | - | - | - | - | - | - | - | - |
| llama3.1-70b | LoRA | 8,192 | 16 | 32 | - | - | - | - | - | - | - | - |
| llama3.2-1b | Full-parameter/LoRA | 4,096/8,192 | 1 | 2 | 1 | 1 | - | - | - | - | - | - |
| llama3.2-3b | Full-parameter | 4,096/8,192 | 2 | 4 | - | - | - | - | - | - | - | - |
| llama3.2-3b | LoRA | 4,096/8,192 | 1 | 2 | 1 | 2 | - | - | - | - | - | - |
| qwen2-0.5b | Full-parameter/LoRA | 4,096/8,192 | 1 | 2 | 1 | 2 | - | - | - | - | - | - |
| qwen2-1.5b | Full-parameter/LoRA | 4,096/8,192 | 1 | 2 | - | - | - | - | - | - | - | - |
| qwen2-7b | Full-parameter | 4,096 | 4 | - | 1 | 2 | - | - | - | - | - | - |
| qwen2-7b | LoRA | 4,096 | 4 | 8 | - | - | - | - | - | - | - | - |
| qwen2-7b | Full-parameter | 8,192 | 8 | - | 1 | 2 | - | - | - | - | - | - |
| qwen2-7b | LoRA | 8,192 | 8 | 8 | - | - | - | - | - | - | - | - |
| qwen2-72b | Full-parameter | 4,096 | 32 | 64 | - | - | - | - | - | - | - | - |
| qwen2-72b | LoRA | 4,096 | 16 | 32 | - | - | - | - | - | - | - | - |
| qwen2-72b | Full-parameter | 8,192 | 64 | 64 | - | - | - | - | - | - | - | - |
| qwen2-72b | LoRA | 8,192 | 16 | 32 | - | - | - | - | - | - | - | - |
| qwen2.5-0.5b | Full-parameter/LoRA | 4,096/8,192 | 1 | 2 | 1 | 2 | - | - | - | - | - | - |
| qwen2.5-1.5b | Full-parameter/LoRA | 4,096/8,192 | 1 | 2 | - | - | - | 8 | - | - | - | - |
| qwen2.5-7b | Full-parameter | 4,096 | 4 | 8 | 8 | 8 | 8 | 8 | - | - | - | - |
| qwen2.5-7b | LoRA | 4,096 | 2 | - | 1 | 2 | - | - | - | - | - | - |
| qwen2.5-7b | Full-parameter | 8,192 | 8 | 8 | - | - | - | - | - | - | - | - |
| qwen2.5-7b | LoRA | 8,192 | 2 | - | 1 | 2 | - | - | - | - | - | - |
| qwen2.5-14b | Full-parameter | 4,096 | 8 | 8 | 8 | 8 | - | - | - | - | - | - |
| qwen2.5-14b | LoRA | 4,096 | 4 | 4 | - | - | - | - | - | - | - | - |
| qwen2.5-14b | Full-parameter | 8,192 | 8 | 16 | - | - | - | - | - | - | - | - |
| qwen2.5-14b | LoRA | 8,192 | 8 | 4 | - | - | - | - | - | - | - | - |
| qwen2.5-32b | Full-parameter | 4,096 | 16 | 32 | 16 | 16 | 16 | 16 | - | - | - | - |
| qwen2.5-32b | LoRA | 4,096 | 16 | 8 | - | - | - | - | - | - | - | - |
| qwen2.5-32b | Full-parameter | 8,192 | 16 | 32 | - | - | - | - | - | - | - | - |
| qwen2.5-32b | LoRA | 8,192 | 16 | 16 | - | - | - | - | - | - | - | - |
| qwen2.5-72b | Full-parameter | 4,096 | 32 | 64 | - | - | - | - | - | - | - | - |
| qwen2.5-72b | LoRA | 4,096 | 16 | 32 | - | - | - | - | - | - | - | - |
| qwen2.5-72b | Full-parameter | 8,192 | 64 | 64 | - | - | - | - | - | - | - | - |
| qwen2.5-72b | LoRA | 8,192 | 16 | 32 | - | - | - | - | - | - | - | - |
| qwen2vl-2b | Full-parameter | 4,096/8,192 | - | - | 2 | - | - | - | - | - | - | - |
| qwen2vl-2b | LoRA | 4,096/8,192 | - | - | 1 | - | - | - | - | - | - | - |
| qwen2vl-7b | Full-parameter | 4,096/8,192 | - | - | 8 | - | - | - | - | - | - | - |
| qwen2vl-7b | LoRA | 4,096/8,192 | - | - | 1 | 2 | - | - | - | - | - | - |
| qwen2vl-72b | Full-parameter | 1,024 | - | - | 32 | - | - | - | - | - | - | - |
| qwen2vl-72b | LoRA | 1,024 | - | - | 16 | - | - | - | - | - | - | - |
| qwen2.5_vl-3b | Full-parameter | 1,024 | - | - | - | - | - | - | - | - | 8 | - |
| qwen2.5_vl-7b | Full-parameter | 1,024/4,096/8,192 | - | - | 8 | 8 | 8 | - | - | - | 8 | - |
| qwen2.5_vl-7b | LoRA | 4,096 | - | - | 1 | 2 | - | - | - | - | - | - |
| qwen2.5_vl-32b | Full-parameter | 4,096 | - | - | 32 | 16 | - | - | - | - | - | - |
| qwen2.5_vl-32b | Full-parameter | 8,192 | - | - | 64 | - | - | - | - | - | - | - |
| qwen2.5_vl-32b | LoRA | 4,096/8,192 | - | - | 16 | - | - | - | - | - | - | - |
| qwen2.5_vl-72b | Full-parameter | 4,096/8,192 | - | - | 64 | - | - | - | - | - | - | - |
| qwen2.5_vl-72b | LoRA | 4,096/8,192 | - | - | 32 | - | - | - | - | - | - | - |
| qwen3-0.6b | Full-parameter/LoRA | 4,096/8,192 | 8 | 8 | - | - | - | - | - | - | - | - |
| qwen3-1.7b | Full-parameter/LoRA | 4,096/8,192 | 8 | 8 | - | - | - | - | - | - | - | - |
| qwen3-4b | Full-parameter/LoRA | 4,096/8,192 | 8 | 8 | - | - | - | - | - | - | - | - |
| qwen3-8b | Full-parameter/LoRA | 4,096/8,192 | 8 | 8 | 8 | - | - | - | - | - | - | - |
| qwen3-14b | Full-parameter/LoRA | 4,096/8,192 | 8 | 8 | - | - | - | - | - | - | - | - |
| qwen3-32b | Full-parameter | 4,096 | 16 | 32 | 16 | - | - | - | - | - | - | - |
| qwen3-32b | Full-parameter | 8,192 | 16 | 32 | - | - | - | - | - | - | - | - |
| qwen3-32b | LoRA | 4,096 | 8 | 8 | - | - | - | - | - | - | - | - |
| qwen3-32b | LoRA | 8,192 | 8 | 16 | - | - | - | - | - | - | - | - |
| qwen3_moe-30B_A3B | Full-parameter | 4,096 | 16 | 32 | - | - | - | - | - | - | - | - |
| qwen3_moe-30B_A3B | Full-parameter | 8,192 | 32 | 64 | - | - | - | - | - | - | - | - |
| qwen3_moe-30B_A3B | LoRA | 4,096/8,192 | 16 | 32 | - | - | - | - | - | - | - | - |
| qwen3_moe-235B_A22B | Full-parameter | 4,096 | 256 | 512 | - | - | - | - | - | - | - | - |
| qwen3_moe-235B_A22B | LoRA | 4,096 | 128 | 256 | - | - | - | - | - | - | - | - |
| glm4-9b | Full-parameter | 4,096/8,192 | 8 | 8 | - | - | - | - | - | - | - | - |
| glm4-9b | LoRA | 4,096/8,192 | 2 | - | 1 | 2 | - | - | - | - | - | - |
| mixtral-8x7b | Full-parameter | 4,096/8,192 | 16 | - | - | - | - | - | - | - | - | - |
| DeepSeek-V3/R1 | Full-parameter | 4,096 | 512 | - | - | - | - | - | - | - | - | - |
| DeepSeek-V3/R1 | LoRA | 4,096 | 64 | - | - | - | - | - | - | - | - | - |
| internvl2.5-8b | Full-parameter/LoRA | 4,096/8,192 | - | - | 8 | - | - | - | - | - | - | - |
| internvl2.5-38b | Full-parameter | 4,096/8,192 | - | - | 32 | - | - | - | - | - | - | - |
| internvl2.5-38b | LoRA | 4,096/8,192 | - | - | 16 | - | - | - | - | - | - | - |
| internvl2.5-78b | Full-parameter | 4,096 | - | - | 32 | - | - | - | - | - | - | - |
| internvl2.5-78b | Full-parameter | 8,192 | - | - | 64 | - | - | - | - | - | - | - |
| internvl2.5-78b | LoRA | 4,096 | - | - | 16 | - | - | - | - | - | - | - |
| internvl2.5-78b | LoRA | 8,192 | - | - | 32 | - | - | - | - | - | - | - |
| gemma3-27b | Full-parameter | 4,096 | - | - | 16 | - | - | - | - | - | - | - |
| gemma3-27b | Full-parameter | 8,192 | - | - | 48 | - | - | - | - | - | - | - |
| gemma3-27b | LoRA | 4,096/8,192 | - | - | 16 | - | - | - | - | - | - | - |
- LlaMA-Factory's ZeRO parallelism splits the optimizer states, gradients, and weights across multiple PUs, so the cluster size affects both the optimal configuration and overall performance (see the sketch after these notes).
- Enabling distributed optimizer parallelism on MindSpeed-LLM splits and shares optimizer parameters across all nodes in the cluster. The best setup depends on the total number of PUs used.
- The benchmark configurations balance using the fewest PUs against peak performance. Adjust the settings according to your cluster size and needs.
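The following rough, illustrative Python sketch shows why the cluster size changes the viable setup under ZeRO-style sharding: the per-PU bytes for weights, gradients, and optimizer states shrink roughly linearly with the number of PUs. The byte counts assume mixed precision with an Adam-style optimizer (fp16 weights and gradients plus about 12 bytes per parameter of optimizer states); activations and framework overhead are ignored, and the figures are not from the benchmark table.

```python
# Rough per-PU memory estimate under ZeRO-style sharding (illustrative assumptions only).
def per_pu_state_gib(num_params_billion: float, num_pus: int, shard: bool = True) -> float:
    params = num_params_billion * 1e9
    total_bytes = params * (2 + 2 + 12)   # fp16 weights + fp16 gradients + ~12 B/param optimizer states
    if shard:
        total_bytes /= num_pus            # ZeRO-3-style sharding spreads all three across PUs
    return total_bytes / 1024 ** 3

for pus in (8, 16, 32, 64):
    print(f"{pus} PUs: ~{per_pu_state_gib(70, pus):.1f} GiB of model/optimizer state per PU")
```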