
Minimum Number of PUs and Sequence Length Supported by Each Model

Model Training Time and Cluster Scale Prediction

Training time and the number of PUs depend on the model, cluster specifications (Snt9b B3/B2/B1 or Snt9b23), and dataset size. To estimate PU count or training time, use these formulas:

  • Training time (seconds) = Total tokens/(TPS x Number of PUs). This estimate gives only a rough range and is for reference only.
  • Number of training PUs = Total tokens/(Training time x TPS). If this value exceeds eight, round it up to the next multiple of eight, and make sure it is no less than the model's minimum PU requirement. A worked sketch of both estimates follows the parameter descriptions below.

Parameters:

  1. Total tokens: Depends on several factors, such as dataset size, number of epochs, sequence length, and the model used. Preprocessing steps such as tokenization and padding can add extra tokens.
    • Total tokens (calculated based on training steps) = Sequence length x Total number of dataset samples. The sequence length can be either dynamic or fixed. With dynamic sequence lengths, the actual token count is typically lower than this calculation suggests. For details about how to set the parameters, see Table 1. By default, MindSpeed-LLM and LlaMA-Factory use a fixed sequence length.
  2. TPS: Check the benchmark table for each model's throughput (tokens/s/p) and the number of training PUs used in the measurement. The table shows baseline measurements taken with a fixed sequence length. To obtain the benchmark table, contact Huawei engineers.
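The two formulas above can be combined into a quick estimate. The following is a minimal sketch, assuming a fixed sequence length; the helper names and the example numbers (sample count, epochs, and TPS) are placeholders, not values from the benchmark table.

import math

def total_tokens(seq_length: int, num_samples: int) -> int:
    """Total tokens = sequence length x total number of dataset samples (fixed-length case)."""
    return seq_length * num_samples

def estimate_training_time(tokens: int, tps_per_pu: float, num_pus: int) -> float:
    """Training time (seconds) = total tokens / (TPS x number of PUs)."""
    return tokens / (tps_per_pu * num_pus)

def estimate_num_pus(tokens: int, target_seconds: float, tps_per_pu: float,
                     min_pus: int = 8) -> int:
    """Number of training PUs = total tokens / (time x TPS), rounded up to the next
    multiple of eight when it exceeds eight, and never below the model's minimum."""
    pus = math.ceil(tokens / (target_seconds * tps_per_pu))
    if pus > 8:
        pus = math.ceil(pus / 8) * 8  # round up to the next multiple of eight
    return max(pus, min_pus)

# Placeholder example: 4,096-token fixed sequences, 200,000 samples, 3 epochs,
# and an assumed throughput of 1,800 tokens/s/p taken from the benchmark table.
tokens = total_tokens(4096, 200_000) * 3  # epochs multiply the total token count
print(estimate_training_time(tokens, tps_per_pu=1800, num_pus=32))   # seconds on 32 PUs
print(estimate_num_pus(tokens, target_seconds=24 * 3600, tps_per_pu=1800, min_pus=16))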

Minimum Number of PUs for Model Training

The table below describes the recommended training parameters and compute specifications for different models. Currently, only the number of PUs for the supervised fine-tuning and pre-training phases is provided. An Snt9b flavor typically provides eight PUs per node. An Snt9b23 flavor also provides eight PUs per node, but each PU contains two DIEs, so one node corresponds to 16 DIEs; one DIE on Snt9b23 is equivalent to one PU on Snt9b. When setting the parallel strategy for training on Snt9b23, the smallest allocation unit is 2 DIEs. The configurations below are for reference only. If a recommended configuration lists fewer than eight PUs, you can default to using eight PUs for training and adjust the PU count as needed.

In the table, "-" indicates that the specification is not supported. "4 x Snt" indicates four PUs on Snt9b or four DIEs on Snt9b23.
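To convert a table entry into a node count, divide it by the per-node capacity of the flavor. Below is a small illustrative sketch, assuming eight PUs per Snt9b node, 16 DIEs per Snt9b23 node, and the 2-DIE minimum allocation unit on Snt9b23 described above; the function name is hypothetical.

import math

PER_NODE = {"Snt9b": 8, "Snt9b23": 16}  # PUs per Snt9b node, DIEs per Snt9b23 node

def nodes_needed(table_value: int, flavor: str) -> int:
    """Convert an 'N x Snt' table entry into the number of nodes for a flavor."""
    if flavor == "Snt9b23" and table_value % 2:
        table_value += 1  # the smallest allocation unit on Snt9b23 is 2 DIEs
    return math.ceil(table_value / PER_NODE[flavor])

# Example: a 16 x Snt entry means 16 PUs (two Snt9b nodes) or 16 DIEs (one Snt9b23 node).
print(nodes_needed(16, "Snt9b"), nodes_needed(16, "Snt9b23"))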
Table 1 Minimum number of PUs for model training

Supported Model Parameters

Training Strategy

Sequence Length

MindSpeed-LLM PUs/DIEs (Snt9b / Snt9b23)

LlaMA-Factory PUs/DIEs (Snt9b / Snt9b23)

VeRL PUs/DIEs (Snt9b / Snt9b23)

MindSpeed-RL PUs/DIEs (Snt9b / Snt9b23)

MindSpeed-MM PUs/DIEs (Snt9b / Snt9b23)

llama3.1-8b

Full-parameter

4,096/8,192

4 x Snt

8 x Snt

-

-

-

-

-

-

LoRA

4 x Snt

1 x Snt

2 x Snt

-

-

-

-

-

-

llama3.1-70b

Full-parameter

4,096

32 x Snt

64 x Snt

-

-

-

-

-

-

LoRA

16 x Snt

32 x Snt

-

-

-

-

-

-

Full-parameter

8,192

64 x Snt

64 x Snt

-

-

-

-

-

-

LoRA

16 x Snt

32 x Snt

-

-

-

-

-

-

llama3.2-1b

Full-parameter/LoRA

4,096/8,192

1 x Snt

2 x Snt

1 x Snt

1 x Snt

-

-

-

-

-

-

llama3.2-3b

Full-parameter

4,096/8,192

2 x Snt

4 x Snt

-

-

-

-

-

-

LoRA

1 x Snt

2 x Snt

1 x Snt

2 x Snt

-

-

-

-

-

-

qwen2-0.5b

Full-parameter/LoRA

4,096/8,192

1 x Snt

2 x Snt

1 x Snt

2 x Snt

-

-

-

-

-

-

qwen2-1.5b

Full-parameter/LoRA

4,096/8,192

1 x Snt

2 x Snt

-

-

-

-

-

-

-

qwen2-7b

Full-parameter

4,096

4 x Snt

1 x Snt

2 x Snt

-

-

-

-

-

-

LoRA

4 x Snt

8 x Snt

-

-

-

-

-

-

Full-parameter

8,192

8 x Snt

1 x Snt

2 x Snt

-

-

-

-

-

-

LoRA

8 x Snt

8 x Snt

-

-

-

-

-

-

qwen2-72b

Full-parameter

4,096

32 x Snt

64 x Snt

-

-

-

-

-

-

LoRA

16 x Snt

32 x Snt

-

-

-

-

-

-

Full-parameter

8,192

64 x Snt

64 x Snt

-

-

-

-

-

-

LoRA

16 x Snt

32 x Snt

-

-

-

-

-

-

qwen2.5-0.5b

Full-parameter/LoRA

4,096/8,192

1 x Snt

2 x Snt

1 x Snt

2 x Snt

-

-

-

-

-

-

qwen2.5-1.5b

Full-parameter/LoRA

4,096/8,192

1 x Snt

2 x Snt

-

-

-

8 x Snt

-

-

qwen2.5-7b

Full-parameter

4,096

4 x Snt

8 x Snt

8 x Snt

8 x Snt

8 x Snt

8 x Snt

-

-

LoRA

2 x Snt

1 x Snt

2 x Snt

-

-

Full-parameter

8,192

8 x Snt

8 x Snt

-

-

LoRA

2 x Snt

1 x Snt

2 x Snt

-

-

qwen2.5-14b

Full-parameter

4,096

8 x Snt

8 x Snt

8 x Snt

8 x Snt

-

-

-

-

LoRA

4 x Snt

4 x Snt

-

-

-

-

Full-parameter

8,192

8 x Snt

16 x Snt

-

-

-

-

LoRA

8 x Snt

4 x Snt

-

-

-

-

qwen2.5-32b

Full-parameter

4,096

16 x Snt

32 x Snt

16 x Snt

16 x Snt

16 x Snt

16 x Snt

-

-

LoRA

16 x Snt

8 x Snt

-

-

Full-parameter

8,192

16 x Snt

32 x Snt

-

-

LoRA

16 x Snt

16 x Snt

-

-

qwen2.5-72b

Full-parameter

4,096

32 x Snt

64 x Snt

-

-

-

-

-

-

LoRA

16 x Snt

32 x Snt

-

-

-

-

-

-

Full-parameter

8,192

64 x Snt

64 x Snt

-

-

-

-

-

-

LoRA

16 x Snt

32 x Snt

-

-

-

-

-

-

qwen2vl-2b

Full-parameter

4,096/8,192

-

2 x Snt

-

-

-

-

-

-

LoRA

4,096/8,192

-

1 x Snt

-

-

-

-

-

-

qwen2vl-7b

Full-parameter

4,096/8,192

-

8 x Snt

-

-

-

-

-

-

LoRA

4,096/8,192

-

1 x Snt

2 x Snt

-

-

-

-

-

-

qwen2vl-72b

Full-parameter

1,024

-

32 x Snt

-

-

-

-

-

-

LoRA

1,024

-

16 x Snt

-

-

-

-

-

-

qwen2.5_vl-3b

Full-parameter

1,024

-

-

-

-

-

-

8 x Snt

qwen2.5_vl-7b

Full-parameter

1,024/4,096/8,192

-

8 x Snt

8 x Snt

8 x Snt

-

-

8 x Snt

LoRA

4,096

-

1 x Snt

2 x Snt

-

-

-

-

qwen2.5_vl-32b

Full-parameter

4,096

-

32 x Snt

16 x Snt

-

-

-

-

8,192

-

64 x Snt

-

-

-

-

-

-

LoRA

4,096/8,192

-

16 x Snt

-

-

-

-

-

-

qwen2.5_vl-72b

Full-parameter

4,096/8,192

-

64 x Snt

-

-

-

-

-

-

LoRA

4,096/8,192

-

32 x Snt

-

-

-

-

-

-

qwen3-0.6b

Full-parameter/LoRA

4,096/8,192

8 x Snt

8 x Snt

-

-

-

-

-

-

qwen3-1.7b

Full-parameter/LoRA

4,096/8,192

8 x Snt

8 x Snt

-

-

-

-

-

-

qwen3-4b

Full-parameter/LoRA

4,096/8,192

8 x Snt

8 x Snt

-

-

-

-

-

-

qwen3-8b

Full-parameter/LoRA

4,096/8,192

8 x Snt

8 x Snt

8 x Snt

-

-

-

-

qwen3-14b

Full-parameter/LoRA

4,096/8,192

8 x Snt

8 x Snt

-

-

-

-

-

-

qwen3-32b

Full-parameter

4,096

16 x Snt

32 x Snt

16 x Snt

-

-

-

-

8,192

16 x Snt

32 x Snt

-

-

-

-

-

-

LoRA

4,096

8 x Snt

8 x Snt

-

-

-

-

-

-

8,192

8 x Snt

16 x Snt

-

-

-

-

-

-

qwen3_moe-30B_A3B

Full-parameter

4,096

16 x Snt

32 x Snt

-

-

-

-

-

-

8,192

32 x Snt

64 x Snt

-

-

-

-

-

-

LoRA

4,096/8,192

16 x Snt

32 x Snt

-

-

-

-

-

-

qwen3_moe-235B_A22B

Full-parameter

4,096

256 x Snt

512 x Snt

-

-

-

-

-

-

LoRA

4,096

128 x Snt

256 x Snt

-

-

-

-

-

-

glm4-9b

Full-parameter

4,096/8,192

8 x Snt

8 x Snt

-

-

-

-

-

-

LoRA

4,096/8,192

2 x Snt

1 x Snt

2 x Snt

-

-

-

-

-

-

mixtral-8x7b

Full-parameter

4,096/8,192

16 x Snt

-

-

-

-

-

-

-

DeepSeek-V3/R1

Full-parameter

4,096

512 x Snt

-

-

-

-

-

-

-

LoRA

64 x Snt

-

-

-

-

-

-

-

internvl2.5-8b

Full-parameter/LoRA

4,096/8,192

-

8 x Snt

-

-

-

-

-

-

internvl2.5-38b

Full-parameter

4,096/8,192

-

32 x Snt

-

-

-

-

-

-

LoRA

4,096/8,192

-

16 x Snt

-

-

-

-

-

-

internvl2.5-78b

Full-parameter

4,096

-

32 x Snt

-

-

-

-

-

-

8,192

-

64 x Snt

-

-

-

-

-

-

LoRA

4,096

-

16 x Snt

-

-

-

-

-

-

8,192

-

32 x Snt

-

-

-

-

-

-

gemma3-27b

Full-parameter

4,096

-

16 x Snt

-

-

-

-

-

-

8,192

-

48 x Snt

-

-

-

-

-

-

LoRA

4,096/8,192

-

16 x Snt

-

-

-

-

-

-

  • LlaMA-Factory's ZeRO parallelism splits the optimizer states, gradients, and weights across multiple PUs, so the cluster size affects both the optimal configuration and overall performance (see the sketch after this list).
  • When distributed optimizer parallelism is enabled in MindSpeed-LLM, optimizer parameters are split and shared across all nodes in the cluster, so the optimal configuration depends on the total number of PUs used.
  • The benchmark configurations balance the minimum number of running PUs against peak performance. Adjust the settings according to your cluster size and requirements.
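As an illustration of the kind of ZeRO configuration referred to above, the following is a minimal sketch of a DeepSpeed ZeRO-3 configuration file of the sort LlaMA-Factory can be pointed at; the file name and every value are assumptions to be tuned to your cluster, not settings taken from the benchmark.

import json

# Hypothetical ZeRO-3 settings; every value below is an assumption to be tuned
# for your cluster, not a recommendation from the benchmark.
ds_config = {
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "zero_allow_untested_optimizer": True,
    "bf16": {"enabled": "auto"},
    "zero_optimization": {
        "stage": 3,                  # partition optimizer states, gradients, and weights
        "overlap_comm": True,        # overlap communication with computation
        "contiguous_gradients": True,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_gather_16bit_weights_on_model_save": True,
    },
}

with open("ds_z3_config.json", "w") as f:  # hypothetical file name
    json.dump(ds_config, f, indent=2)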