Minimum Card Configuration per Model
The recommended training parameters and compute specifications for each model are shown in Table 1; card configurations are currently provided only for the fine-tuning (SFT) and pre-training (PT) stages. A Snt9B node typically provides 8 cards per node, while a Snt9B23 node provides 8 cards = 16*DIE per machine, where 1*DIE is equivalent to 1 Snt9B card; when setting the parallel strategy for actual training on Snt9B23, 2*DIE is the minimum unit (see the sketch below). The configurations here are for reference only: jobs that need fewer than 8 cards are generally still trained on 8 cards, and users can scale the card counts up or down from these configurations.
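As a rough illustration of the card/DIE equivalence, the hypothetical helper below maps a Snt9B card count onto the equivalent Snt9B23 DIE count, rounding up to the 2*DIE minimum parallelism unit. The function name and rounding rule are our own reading of the rules above, not a MindSpeed-LLM API.

```python
# A minimal sketch, assuming 1 Snt9B card == 1 Snt9B23 DIE and a 2*DIE
# minimum parallelism unit on Snt9B23; the helper name is hypothetical.

def snt9b23_die_for(snt9b_cards: int) -> int:
    """Map a Snt9B card count to the equivalent Snt9B23 DIE count."""
    die = snt9b_cards          # 1 card is compute-equivalent to 1 DIE
    if die % 2:                # Snt9B23 parallel strategies use units of 2*DIE,
        die += 1               # so odd counts round up
    return die

assert snt9b23_die_for(1) == 2  # matches llama3.2-1b: 1*Ascend -> 2*Ascend
assert snt9b23_die_for(4) == 4  # matches llama3.1-8b: 4*Ascend stays 4*DIE
```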
* In the table, "-" means not supported. In the card-count columns, 4*Ascend means 4 cards on Snt9B and 4*DIE on Snt9B23, and so on.
Table 1: Minimum card/DIE configuration per model (MindSpeed-LLM)

| Supported model | Training strategy | Sequence length (SEQ_LEN) | Snt9B | Snt9B23 |
| --- | --- | --- | --- | --- |
| llama3.1-8b | full | 4096/8192 | 4*Ascend | 4*Ascend |
| llama3.1-8b | lora | 4096/8192 | 4*Ascend | 4*Ascend |
| llama3.1-70b | full | 4096 | 32*Ascend | 32*Ascend |
| llama3.1-70b | lora | 4096 | 16*Ascend | 16*Ascend |
| llama3.1-70b | full | 8192 | 64*Ascend | 64*Ascend |
| llama3.1-70b | lora | 8192 | 16*Ascend | 16*Ascend |
| llama3.2-1b | full/lora | 4096/8192 | 1*Ascend | 2*Ascend |
| llama3.2-3b | full | 4096/8192 | 2*Ascend | 2*Ascend |
| llama3.2-3b | lora | 4096/8192 | 1*Ascend | 2*Ascend |
| qwen2-0.5b | full/lora | 4096/8192 | 1*Ascend | 2*Ascend |
| qwen2-1.5b | full/lora | 4096/8192 | 1*Ascend | 2*Ascend |
| qwen2-7b | full | 4096 | 4*Ascend | 4*Ascend |
| qwen2-7b | lora | 4096 | 2*Ascend | 2*Ascend |
| qwen2-7b | full | 8192 | 8*Ascend | 8*Ascend |
| qwen2-7b | lora | 8192 | 2*Ascend | 2*Ascend |
| qwen2-72b | full | 4096 | 32*Ascend | 32*Ascend |
| qwen2-72b | lora | 4096 | 16*Ascend | 16*Ascend |
| qwen2-72b | full | 8192 | 64*Ascend | 64*Ascend |
| qwen2-72b | lora | 8192 | 16*Ascend | 16*Ascend |
| qwen2.5-0.5b | full/lora | 4096/8192 | 1*Ascend | 2*Ascend |
| qwen2.5-7b | full | 4096 | 2*Ascend | 2*Ascend |
| qwen2.5-7b | lora | 4096 | 2*Ascend | 2*Ascend |
| qwen2.5-7b | full | 8192 | 2*Ascend | 2*Ascend |
| qwen2.5-7b | lora | 8192 | 2*Ascend | 2*Ascend |
| qwen2.5-14b | full | 4096 | 8*Ascend | 8*Ascend |
| qwen2.5-14b | lora | 4096 | 4*Ascend | 4*Ascend |
| qwen2.5-14b | full | 8192 | 8*Ascend | 8*Ascend |
| qwen2.5-14b | lora | 8192 | 8*Ascend | 8*Ascend |
| qwen2.5-32b | full | 4096 | 16*Ascend | 16*Ascend |
| qwen2.5-32b | lora | 4096 | 16*Ascend | 16*Ascend |
| qwen2.5-32b | full | 8192 | 16*Ascend | 16*Ascend |
| qwen2.5-32b | lora | 8192 | 16*Ascend | 16*Ascend |
| qwen2.5-72b | full | 4096 | 32*Ascend | 32*Ascend |
| qwen2.5-72b | lora | 4096 | 16*Ascend | 16*Ascend |
| qwen2.5-72b | full | 8192 | 64*Ascend | 64*Ascend |
| qwen2.5-72b | lora | 8192 | 16*Ascend | 16*Ascend |
| qwen3-0.6b | full/lora | 4096/8192 | 8*Ascend | 8*Ascend |
| qwen3-1.7b | full/lora | 4096/8192 | 8*Ascend | 8*Ascend |
| qwen3-4b | full/lora | 4096/8192 | 8*Ascend | 8*Ascend |
| qwen3-8b | full/lora | 4096/8192 | 8*Ascend | 8*Ascend |
| qwen3-14b | full/lora | 4096/8192 | 8*Ascend | 8*Ascend |
| qwen3-32b | full | 4096/8192 | 32*Ascend | 32*Ascend |
| qwen3-32b | lora | 4096 | 8*Ascend | 8*Ascend |
| qwen3-32b | lora | 8192 | 16*Ascend | 16*Ascend |
| qwen3_moe-30B_A3B | full | 4096 | 16*Ascend | 16*Ascend |
| qwen3_moe-30B_A3B | full | 8192 | 32*Ascend | 32*Ascend |
| qwen3_moe-30B_A3B | lora | 4096/8192 | 16*Ascend | 16*Ascend |
| qwen3_moe-235B_A22B | full | 4096 | 256*Ascend | 256*Ascend |
| qwen3_moe-235B_A22B | lora | 4096 | 128*Ascend | 128*Ascend |
| glm4-9b | full | 4096/8192 | 8*Ascend | 8*Ascend |
| glm4-9b | lora | 4096/8192 | 2*Ascend | 2*Ascend |
| mixtral-8x7b | full | 4096/8192 | 16*Ascend | 16*Ascend |
| DeepSeek-V3/R1 | full | 4096 | 512*Ascend | 512*Ascend |
| DeepSeek-V3/R1 | lora | 4096 | 64*Ascend | 64*Ascend |
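For checking a planned job against Table 1 in scripts, the table can be transcribed into a plain mapping, as in the hypothetical sketch below. The dictionary and helper are ours, only a few Snt9B rows are shown, and the values are copied from Table 1.

```python
# Hypothetical lookup of minimum Snt9B card counts, transcribed from a few
# rows of Table 1; extend with the remaining rows as needed.
MIN_CARDS_SNT9B = {
    # (model, strategy, seq_len): minimum cards
    ("qwen2.5-7b", "full", 4096): 2,
    ("qwen2.5-72b", "full", 8192): 64,
    ("qwen2.5-72b", "lora", 8192): 16,
    ("glm4-9b", "lora", 4096): 2,
}

def min_cards(model: str, strategy: str, seq_len: int) -> int:
    """Return the minimum Snt9B card count for a combination in the table,
    or raise if the combination is not listed."""
    try:
        return MIN_CARDS_SNT9B[(model, strategy, seq_len)]
    except KeyError:
        raise ValueError(f"unsupported combination: {model}/{strategy}/{seq_len}")

print(min_cards("qwen2.5-72b", "lora", 8192))  # 16
```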

1. When the distributed optimizer is enabled in MindSpeed-LLM, optimizer state is sharded across all machines in the cluster, so the optimal configuration depends on the total card count (see the sketch below).
2. The current benchmark configurations were chosen to balance the minimum runnable card count against optimal performance; in practice, parameters can be adjusted according to cluster size and performance trade-offs.
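To make note 1 concrete, a back-of-the-envelope sketch follows: with a ZeRO-1-style distributed optimizer, optimizer state is split evenly across ranks, so per-card memory falls as the cluster grows. The 12 bytes/parameter figure assumes Adam-style mixed-precision state (fp32 master weights plus two fp32 moments); the numbers and function are our assumptions, not measured MindSpeed-LLM behavior.

```python
# A back-of-the-envelope estimate of per-card optimizer-state memory when the
# distributed optimizer shards state across every card in the cluster.
# Assumed: Adam-style state in mixed precision, i.e. fp32 master weights (4 B)
# plus two fp32 moments (8 B) per parameter = 12 bytes/parameter.
BYTES_PER_PARAM_OPT_STATE = 12

def optimizer_state_gib_per_card(n_params: float, world_size: int) -> float:
    """Optimizer-state bytes per card, in GiB, with even sharding."""
    total = n_params * BYTES_PER_PARAM_OPT_STATE
    return total / world_size / 2**30

# Example: a 72B-parameter model on 32 vs. 64 cards.
print(f"{optimizer_state_gib_per_card(72e9, 32):.1f} GiB/card on 32 cards")
print(f"{optimizer_state_gib_per_card(72e9, 64):.1f} GiB/card on 64 cards")
```

Under these assumptions, doubling the cluster from 32 to 64 cards halves the per-card optimizer-state footprint (about 25 GiB to about 13 GiB for a 72B model), which is why the optimal parallel configuration shifts with card count.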