Minimum Number of Cards and Sequence Lengths Supported by Each Model
Estimating Training Time and Cluster Size
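Training time and the number of cards in the cluster vary with the model, the cluster specification (Snt9b B3/B2/B1, Snt9b23), the dataset size, and other factors. If you have requirements on the number of cards or the training time, you can estimate them with the following formulas:
- Training time (seconds): Time = Tok_total / (TPS * N_cards). The result is an approximate range and is for reference only.
- Number of training cards: N_cards = Tok_total / (Time * TPS). When N_cards > 8, round it to a multiple of 8. The result must not be smaller than the minimum card configuration of the model.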
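Parameter description:
- Total tokens (Tok_total): obtaining this value usually requires preprocessing the dataset, including tokenization. It depends on the dataset size, the number of passes over the dataset (epochs), the sequence length, and the model, as well as other factors such as padding, which adds invalid tokens.
- Total tokens (estimated from training steps) = Seql × total number of dataset samples. Seql can be dynamic or fixed. With dynamic Seql, the actual total is generally smaller than the value calculated above. For the relevant setting, see the packing parameter in Table 2 Llama-Factory parameters. MindSpeed-LLM and Llama-Factory generally default to fixed Seq.
- TPS: the throughput of each model (tokens/s/p, that is, tokens per second per card) and the corresponding number of training cards can be looked up in the benchmark table. All throughput values in the benchmark are baselines measured with fixed Seq. Contact a Huawei engineer to obtain the benchmark table.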
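The two formulas and the token estimate above can be sketched in Python as follows. This is a minimal sketch, not an official tool: the function names are hypothetical, the `epochs` multiplier is an assumption based on the statement that the token total depends on the number of epochs, rounding up (rather than down) to a multiple of 8 is an assumption, and the throughput of 1000 tokens/s/p is a made-up placeholder rather than a benchmark value.

```python
import math


def total_tokens(seq_len: int, num_samples: int, epochs: int = 1) -> int:
    """Estimate for fixed Seq: Seql x number of samples (x epochs, assumed)."""
    return seq_len * num_samples * epochs


def training_time_seconds(tok_total: int, tps_per_card: float, n_cards: int) -> float:
    """Time = Tok_total / (TPS * N_cards)."""
    return tok_total / (tps_per_card * n_cards)


def required_cards(tok_total: int, tps_per_card: float,
                   target_seconds: float, min_cards: int = 1) -> int:
    """N_cards = Tok_total / (Time * TPS), followed by the rounding rules."""
    n = math.ceil(tok_total / (target_seconds * tps_per_card))
    if n > 8:
        # Assumption: round UP to the next multiple of 8 so the time target holds.
        n = math.ceil(n / 8) * 8
    # Never go below the model's minimum card configuration.
    return max(n, min_cards)


if __name__ == "__main__":
    tok = total_tokens(seq_len=4096, num_samples=100_000)   # 409,600,000 tokens
    tps = 1000.0  # placeholder throughput in tokens/s/p, NOT a benchmark value
    print(training_time_seconds(tok, tps, n_cards=8))       # 51200.0
    print(required_cards(tok, tps, target_seconds=20_000, min_cards=8))  # 24
```

Because the benchmark throughput is measured with fixed Seq and the token total is an upper-bound estimate, the result should be read as a rough range rather than a guarantee.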
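Minimum Card Configuration per Model
The following table lists the recommended training parameters and compute specification requirements for each model. Currently, card configurations are provided only for the fine-tuning (SFT) and pre-training (PT) stages. Generally, a Snt9b node has 8 cards. A Snt9b23 node has 8 cards = 16*DIE, where 1*DIE is equivalent to 1 card in Snt9b, and 2*DIE is the minimum unit when setting the parallel strategy during actual training on Snt9b23. The following configurations are for reference only. Configurations smaller than 8 cards are generally trained with 8 cards, and you can adjust the card count up or down from the listed values.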
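Supported model (parameter size) |
Training strategy |
Sequence length (SEQ_LEN) |
MindSpeed-LLM cards/DIEs by specification |
Llama-Factory cards/DIEs by specification |
VeRL cards/DIEs by specification |
MindSpeed-RL cards/DIEs by specification |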
||||
---|---|---|---|---|---|---|---|---|---|---|
Snt9b |
Snt9b23 |
Snt9b |
Snt9b23 |
Snt9b |
Snt9b23 |
Snt9b |
Snt9b23 |
|||
llama3.1-8b |
full |
4096/8192 |
4*Ascend |
8*Ascend |
- |
- |
- |
- |
||
lora |
4*Ascend |
1*Ascend |
2*Ascend |
- |
- |
- |
- |
|||
llama3.1-70b |
full |
4096 |
32*Ascend |
64*Ascend |
- |
- |
- |
- |
||
lora |
16*Ascend |
32*Ascend |
- |
- |
- |
- |
||||
full |
8192 |
64*Ascend |
64*Ascend |
- |
- |
- |
- |
|||
lora |
16*Ascend |
32*Ascend |
- |
- |
- |
- |
||||
llama3.2-1b |
full/lora |
4096/8192 |
1*Ascend |
2*Ascend |
1*Ascend |
1*Ascend |
- |
- |
- |
- |
llama3.2-3b |
full |
4096/8192 |
2*Ascend |
4*Ascend |
- |
- |
- |
- |
||
lora |
1*Ascend |
2*Ascend |
1*Ascend |
2*Ascend |
- |
- |
- |
- |
||
qwen2-0.5b |
full/lora |
4096/8192 |
1*Ascend |
2*Ascend |
1*Ascend |
2*Ascend |
- |
- |
- |
- |
qwen2-1.5b |
full/lora |
4096/8192 |
1*Ascend |
2*Ascend |
- |
- |
- |
- |
- |
|
qwen2-7b |
full |
4096 |
4*Ascend |
1*Ascend |
2*Ascend |
- |
- |
- |
- |
|
lora |
4*Ascend |
8*Ascend |
- |
- |
- |
- |
||||
full |
8192 |
8*Ascend |
1*Ascend |
2*Ascend |
- |
- |
- |
- |
||
lora |
8*Ascend |
8*Ascend |
- |
- |
- |
- |
||||
qwen2-72b |
full |
4096 |
32*Ascend |
64*Ascend |
- |
- |
- |
- |
||
lora |
16*Ascend |
32*Ascend |
- |
- |
- |
- |
||||
full |
8192 |
64*Ascend |
64*Ascend |
- |
- |
- |
- |
|||
lora |
16*Ascend |
32*Ascend |
- |
- |
- |
- |
||||
qwen2.5-0.5b |
full/lora |
4096/8192 |
1*Ascend |
2*Ascend |
1*Ascend |
2*Ascend |
- |
- |
- |
- |
qwen2.5-1.5b |
full/lora |
4096/8192 |
1*Ascend |
2*Ascend |
- |
8*Ascend |
- |
8*Ascend |
||
qwen2.5-7b |
full |
4096 |
4*Ascend |
8*Ascend |
8*Ascend |
- |
8*Ascend |
8*Ascend |
||
lora |
2*Ascend |
1*Ascend |
2*Ascend |
- |
||||||
full |
8192 |
8*Ascend |
8*Ascend |
- |
||||||
lora |
2*Ascend |
1*Ascend |
2*Ascend |
- |
||||||
qwen2.5-14b |
full |
4096 |
8*Ascend |
8*Ascend |
- |
- |
- |
- |
||
lora |
4*Ascend |
4*Ascend |
- |
- |
- |
- |
||||
full |
8192 |
8*Ascend |
16*Ascend |
- |
- |
- |
- |
|||
lora |
8*Ascend |
4*Ascend |
- |
- |
- |
- |
||||
qwen2.5-32b |
full |
4096 |
16*Ascend |
32*Ascend |
32*Ascend |
- |
16*Ascend |
16*Ascend |
||
lora |
16*Ascend |
8*Ascend |
- |
|||||||
full |
8192 |
16*Ascend |
32*Ascend |
- |
||||||
lora |
16*Ascend |
16*Ascend |
- |
|||||||
qwen2.5-72b |
full |
4096 |
32*Ascend |
64*Ascend |
- |
- |
- |
- |
||
lora |
16*Ascend |
32*Ascend |
- |
- |
- |
- |
||||
full |
8192 |
64*Ascend |
64*Ascend |
- |
- |
- |
- |
|||
lora |
16*Ascend |
32*Ascend |
- |
- |
- |
- |
||||
qwen2vl-2b |
full |
4096/8192 |
- |
2*Ascend |
- |
- |
- |
- |
||
lora |
4096/8192 |
- |
1*Ascend |
- |
- |
- |
- |
|||
qwen2vl-7b |
full |
4096/8192 |
- |
8*Ascend |
- |
- |
- |
- |
||
lora |
4096/8192 |
- |
1*Ascend |
2*Ascend |
- |
- |
- |
- |
||
qwen2vl-72b |
full |
1024 |
- |
32*Ascend |
- |
- |
- |
- |
||
lora |
1024 |
- |
16*Ascend |
- |
- |
- |
- |
|||
qwen2.5_vl-7b |
full |
4096/8192 |
- |
8*Ascend |
- |
- |
- |
- |
||
lora |
4096/8192 |
- |
1*Ascend |
2*Ascend |
- |
- |
- |
- |
||
qwen2.5_vl-32b |
full |
4096 |
- |
32*Ascend |
16*Ascend |
- |
- |
|||
8192 |
- |
64*Ascend |
- |
- |
- |
- |
||||
lora |
4096/8192 |
- |
16*Ascend |
- |
- |
- |
- |
|||
qwen2.5_vl-72b |
full |
4096/8192 |
- |
64*Ascend |
- |
- |
- |
- |
||
lora |
4096/8192 |
- |
32*Ascend |
- |
- |
- |
- |
|||
qwen3-0.6b |
full/lora |
4096/8192 |
8*Ascend |
8*Ascend |
- |
- |
- |
- |
||
qwen3-1.7b |
full/lora |
4096/8192 |
8*Ascend |
8*Ascend |
- |
- |
- |
- |
||
qwen3-4b |
full/lora |
4096/8192 |
8*Ascend |
8*Ascend |
- |
- |
- |
- |
||
qwen3-8b |
full/lora |
4096/8192 |
8*Ascend |
8*Ascend |
8*Ascend |
- |
- |
|||
qwen3-14b |
full/lora |
4096/8192 |
8*Ascend |
8*Ascend |
- |
- |
- |
- |
||
qwen3-32b |
full |
4096 |
16*Ascend |
32*Ascend |
16*Ascend |
- |
- |
|||
8192 |
16*Ascend |
32*Ascend |
- |
- |
- |
- |
||||
lora |
4096 |
8*Ascend |
8*Ascend |
- |
- |
- |
- |
|||
8192 |
8*Ascend |
16*Ascend |
- |
- |
- |
- |
||||
qwen3_moe-30B_A3B |
full |
4096 |
16*Ascend |
32*Ascend |
- |
- |
- |
- |
||
8192 |
32*Ascend |
64*Ascend |
- |
- |
- |
- |
||||
lora |
4096/8192 |
16*Ascend |
32*Ascend |
- |
- |
- |
- |
|||
qwen3_moe-235B_A22B |
full |
4096 |
256*Ascend |
512*Ascend |
- |
- |
- |
- |
||
lora |
4096 |
128*Ascend |
256*Ascend |
- |
- |
- |
- |
|||
glm4-9b |
full |
4096/8192 |
8*Ascend |
8*Ascend |
- |
- |
- |
- |
||
lora |
4096/8192 |
2*Ascend |
1*Ascend |
2*Ascend |
- |
- |
- |
- |
||
mixtral-8x7b |
full |
4096/8192 |
16*Ascend |
- |
- |
- |
- |
- |
||
DeepSeek-V3/R1 |
full |
4096 |
512*Ascend |
- |
- |
- |
- |
- |
||
lora |
64*Ascend |
- |
- |
- |
- |
- |
||||
internvl2.5-8b |
full/lora |
4096/8192 |
- |
8*Ascend |
- |
- |
- |
- |
||
internvl2.5-38b |
full |
4096/8192 |
- |
32*Ascend |
- |
- |
- |
- |
||
lora |
4096/8192 |
- |
16*Ascend |
- |
- |
- |
- |
|||
internvl2.5-78b |
full |
4096 |
- |
32*Ascend |
- |
- |
- |
- |
||
8192 |
- |
64*Ascend |
- |
- |
- |
- |
||||
lora |
4096 |
- |
16*Ascend |
- |
- |
- |
- |
|||
8192 |
- |
32*Ascend |
- |
- |
- |
- |
||||
gemma3-27b |
full |
4096 |
- |
16*Ascend |
- |
- |
- |
- |
||
8192 |
- |
48*Ascend |
- |
- |
- |
- |
||||
lora |
4096/8192 |
- |
16*Ascend |
- |
- |
- |
- |

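1. The ZeRO parallelism used by Llama-Factory shards the optimizer states, gradients, and weights across multiple cards, so the cluster size affects both the optimal configuration and performance.
2. When distributed optimizer parallelism is enabled in MindSpeed-LLM, the optimizer parameters are sharded and shared across all machines in the cluster, so the optimal configuration depends on the number of cards.
3. The current benchmark configurations were measured by balancing the minimum runnable card count against the best performance. In practice, you can adjust the parameters according to your cluster size and performance trade-offs.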
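The node and DIE accounting described before the table (8 cards per Snt9b node, 16 DIEs per Snt9b23 node, 2 DIEs as the minimum parallel unit on Snt9b23) can be sketched as a small helper. The function and constant names are hypothetical, and rounding a DIE count up to the next multiple of 2 is an assumption drawn from the stated minimum unit.

```python
import math

# Units per node, taken from the text: Snt9b counts cards, Snt9b23 counts DIEs.
UNITS_PER_NODE = {"Snt9b": 8, "Snt9b23": 16}


def round_up_to_die_unit(dies: int) -> int:
    """Round a Snt9b23 DIE count up to a multiple of 2 (assumed rounding rule)."""
    return math.ceil(dies / 2) * 2


def nodes_needed(units: int, spec: str) -> int:
    """Number of nodes needed to host `units` cards (Snt9b) or DIEs (Snt9b23)."""
    return math.ceil(units / UNITS_PER_NODE[spec])


if __name__ == "__main__":
    print(round_up_to_die_unit(1))      # 2
    print(nodes_needed(32, "Snt9b"))    # 4
    print(nodes_needed(32, "Snt9b23"))  # 2
```

For example, a configuration listed as 32*Ascend occupies four Snt9b nodes but only two Snt9b23 nodes, because each Snt9b23 node exposes 16 DIEs.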