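Reference table of NPU card counts and gradient accumulation values by model
Table 1 lists the recommended training parameters and compute specification requirements for each model. In the "Specification and node count" column, 1*node & 4*Ascend means a single machine with 4 cards, and so on.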
| Model | Template | Parameter count | Training strategy | Sequence length (cutoff_len) | Gradient accumulation value | Optimization tool (DeepSpeed) | Specification and node count |
|---|---|---|---|---|---|---|---|
| llama2 | llama2 | 7B | lora | 4096/8192 | gradient_accumulation_steps: 8 | ZeRO-1 | 1*node & 1*Ascend |
| llama2 | llama2 | 7B | full | 4096/8192 | gradient_accumulation_steps: 8 | ZeRO-2 | 1*node & 8*Ascend |
| llama2 | llama2 | 13B | lora | 4096/8192 | gradient_accumulation_steps: 8 | ZeRO-2 | 1*node & 1*Ascend |
| llama2 | llama2 | 13B | full | 4096/8192 | gradient_accumulation_steps: 8 | ZeRO-3 | 1*node & 8*Ascend |
| llama2 | llama2 | 70B | lora | 4096 | gradient_accumulation_steps: 8 | ZeRO-3 | 2*node & 8*Ascend |
| llama2 | llama2 | 70B | lora | 8192 | gradient_accumulation_steps: 8 | ZeRO-3-Offload | 2*node & 8*Ascend |
| llama2 | llama2 | 70B | full | 4096/8192 | gradient_accumulation_steps: 4 | ZeRO-3-Offload | 4*node & 8*Ascend |
| llama3 | llama3 | 70B | lora | 4096/8192 | gradient_accumulation_steps: 8 | ZeRO-3 | 2*node & 8*Ascend |
| llama3 | llama3 | 70B | full | 4096/8192 | gradient_accumulation_steps: 4 | ZeRO-3-Offload | 4*node & 8*Ascend |
| llama3 | llama3 | 8B | lora | 4096/8192 | gradient_accumulation_steps: 8 | ZeRO-2 | 1*node & 1*Ascend |
| llama3 | llama3 | 8B | full | 4096/8192 | gradient_accumulation_steps: 8 | ZeRO-2 | 1*node & 8*Ascend |
| llama3.1 | llama3 | 8B | lora | 4096/8192 | gradient_accumulation_steps: 8 | ZeRO-1 | 1*node & 1*Ascend |
| llama3.1 | llama3 | 8B | full | 4096/8192 | gradient_accumulation_steps: 8 | ZeRO-2 | 1*node & 8*Ascend |
| llama3.1 | llama3 | 70B | lora | 4096 | gradient_accumulation_steps: 8 | ZeRO-3 | 2*node & 8*Ascend |
| llama3.1 | llama3 | 70B | lora | 8192 | gradient_accumulation_steps: 8 | ZeRO-3-Offload | 2*node & 8*Ascend |
| llama3.1 | llama3 | 70B | full | 4096/8192 | gradient_accumulation_steps: 4 | ZeRO-3-Offload | 4*node & 8*Ascend |
| Qwen2 | qwen | 72B | lora | 4096 | gradient_accumulation_steps: 8 | ZeRO-3 | 2*node & 8*Ascend |
| Qwen2 | qwen | 72B | lora | 8192 | gradient_accumulation_steps: 8 | ZeRO-3-Offload | 2*node & 8*Ascend |
| Qwen2 | qwen | 72B | full | 4096/8192 | gradient_accumulation_steps: 4 | ZeRO-3-Offload | 4*node & 8*Ascend |
| Qwen2 | qwen | 7B | lora | 4096/8192 | gradient_accumulation_steps: 8 | ZeRO-0 | 1*node & 1*Ascend |
| Qwen2 | qwen | 7B | full | 4096/8192 | gradient_accumulation_steps: 8 | ZeRO-2 | 1*node & 8*Ascend |
| Qwen2 | qwen | 0.5/1.5B | lora/full | 4096/8192 | gradient_accumulation_steps: 8 | ZeRO-0 | 1*node & 1*Ascend |
| Qwen2_vl | qwen2_vl | 2B | lora | 4096/8192 | gradient_accumulation_steps: 8 | ZeRO-0 | 1*node & 1*Ascend |
| Qwen2_vl | qwen2_vl | 2B | full | 4096/8192 | gradient_accumulation_steps: 8 | ZeRO-0 | 1*node & 2*Ascend |
| Qwen2_vl | qwen2_vl | 7B | lora | 4096 | gradient_accumulation_steps: 8 | ZeRO-0 | 1*node & 1*Ascend |
| Qwen2_vl | qwen2_vl | 7B | lora | 8192 | gradient_accumulation_steps: 8 | ZeRO-1 | 1*node & 1*Ascend |
| Qwen2_vl | qwen2_vl | 7B | full | 4096 | gradient_accumulation_steps: 8 | ZeRO-2 | 1*node & 8*Ascend |
| Qwen2_vl | qwen2_vl | 7B | full | 8192 | gradient_accumulation_steps: 8 | ZeRO-2-Offload | 1*node & 8*Ascend |
| Qwen1.5 | qwen | 0.5/1.8B | lora/full | 4096/8192 | gradient_accumulation_steps: 8 | ZeRO-0 | 1*node & 1*Ascend |
| Qwen1.5 | qwen | 4B | lora | 4096/8192 | gradient_accumulation_steps: 8 | ZeRO-1 | 1*node & 1*Ascend |
| Qwen1.5 | qwen | 4B | full | 4096/8192 | gradient_accumulation_steps: 8 | ZeRO-1 | 1*node & 4*Ascend |
| Qwen1.5 | qwen | 7B | lora | 4096/8192 | gradient_accumulation_steps: 8 | ZeRO-1 | 1*node & 1*Ascend |
| Qwen1.5 | qwen | 7B | full | 4096/8192 | gradient_accumulation_steps: 8 | ZeRO-2 | 1*node & 8*Ascend |
| Qwen1.5 | qwen | 14B | lora | 4096/8192 | gradient_accumulation_steps: 8 | ZeRO-3 | 1*node & 1*Ascend |
| Qwen1.5 | qwen | 14B | full | 4096/8192 | gradient_accumulation_steps: 8 | ZeRO-3 | 1*node & 8*Ascend |
| Qwen1.5 | qwen | 32B | lora | 4096/8192 | gradient_accumulation_steps: 8 | ZeRO-3 | 1*node & 4*Ascend |
| Qwen1.5 | qwen | 32B | full | 4096 | gradient_accumulation_steps: 8 | ZeRO-3 | 2*node & 8*Ascend |
| Qwen1.5 | qwen | 32B | full | 8192 | gradient_accumulation_steps: 4 | ZeRO-3-Offload | 2*node & 8*Ascend |
| Qwen1.5 | qwen | 72B | lora | 4096 | gradient_accumulation_steps: 8 | ZeRO-3 | 2*node & 8*Ascend |
| Qwen1.5 | qwen | 72B | lora | 8192 | gradient_accumulation_steps: 8 | ZeRO-3-Offload | 2*node & 8*Ascend |
| Qwen1.5 | qwen | 72B | full | 4096/8192 | gradient_accumulation_steps: 4 | ZeRO-3-Offload | 4*node & 8*Ascend |
| falcon2 | falcon | 11B | lora | 4096/8192 | gradient_accumulation_steps: 8 | ZeRO-2 | 1*node & 1*Ascend |
| falcon2 | falcon | 11B | full | 4096/8192 | gradient_accumulation_steps: 8 | ZeRO-2 | 1*node & 8*Ascend |
| GLM4 | glm4 | 9B | lora | 4096/8192 | gradient_accumulation_steps: 8 | ZeRO-2 | 1*node & 1*Ascend |
| GLM4 | glm4 | 9B | full | 4096/8192 | gradient_accumulation_steps: 8 | ZeRO-3 | 1*node & 8*Ascend |
| Yi | yi | 6B | lora | 4096/8192 | gradient_accumulation_steps: 8 | ZeRO-1 | 1*node & 1*Ascend |
| Yi | yi | 6B | full | 4096/8192 | gradient_accumulation_steps: 8 | ZeRO-1 | 1*node & 4*Ascend |
| Yi | yi | 34B | full | 4096 | gradient_accumulation_steps: 8 | ZeRO-3 | 2*node & 8*Ascend |
| Yi | yi | 34B | lora | 4096 | gradient_accumulation_steps: 8 | ZeRO-3 | 1*node & 2*Ascend |
| Yi | yi | 34B | full | 8192 | gradient_accumulation_steps: 8 | ZeRO-3 | 4*node & 8*Ascend |
| Yi | yi | 34B | lora | 8192 | gradient_accumulation_steps: 8 | ZeRO-3 | 1*node & 4*Ascend |
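The parameters above assume that the NPU FlashAttention fused operator is enabled. The values are for reference only; configure other acceleration frameworks or the ZeRO (Zero Redundancy Optimizer) optimizer, the number of NPU nodes, and other settings according to your actual requirements.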
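When adjusting the node count or the gradient accumulation value, it can help to check how the effective global batch size changes. The following is a minimal sketch, not part of the original guide. It assumes pure data parallelism across all NPUs (which is how ZeRO distributes work) and a hypothetical `per_device_batch_size` parameter that the table does not list. The helper names `parse_spec` and `global_batch_size` are illustrative only.

```python
import re


def parse_spec(spec):
    """Parse a table entry such as '2*node & 8*Ascend' into (nodes, npus_per_node).

    The word after the first '*' is not checked, so the untranslated
    form of the entry is accepted as well.
    """
    match = re.fullmatch(r"\s*(\d+)\*\S+\s*&\s*(\d+)\*Ascend\s*", spec)
    if match is None:
        raise ValueError(f"unrecognized spec: {spec!r}")
    return int(match.group(1)), int(match.group(2))


def global_batch_size(spec, grad_accum_steps, per_device_batch_size=1):
    """Samples consumed per optimizer step under plain data parallelism.

    per_device_batch_size is an assumed value; the table does not specify it.
    """
    nodes, npus_per_node = parse_spec(spec)
    total_npus = nodes * npus_per_node
    return total_npus * per_device_batch_size * grad_accum_steps


if __name__ == "__main__":
    # A full-parameter 70B row: 4 nodes with 8 NPUs each, accumulation of 4.
    print(global_batch_size("4*node & 8*Ascend", grad_accum_steps=4))
```

If you halve the number of NPUs, doubling `gradient_accumulation_steps` keeps the global batch size unchanged under these assumptions, at the cost of longer steps.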
For detailed usage of the optimization tools, see "How to choose the zero-stage and offloads for best performance".
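As a rough illustration of how the labels in the "Optimization tool (DeepSpeed)" column could translate into a DeepSpeed configuration, the sketch below builds a minimal config dictionary. The keys `zero_optimization`, `stage`, `offload_optimizer`, `offload_param`, and `gradient_accumulation_steps` are standard DeepSpeed config fields. The function name `zero_config` and the reading of "Offload" as offloading to CPU memory are assumptions for illustration; the configuration files actually shipped with this guide may differ.

```python
import json


def zero_config(label, grad_accum_steps):
    """Build a minimal DeepSpeed config dict from a label such as 'ZeRO-3-Offload'."""
    parts = label.split("-")  # e.g. ['ZeRO', '3', 'Offload']
    if len(parts) not in (2, 3) or parts[0] != "ZeRO" or not parts[1].isdigit():
        raise ValueError(f"unrecognized label: {label!r}")
    stage = int(parts[1])
    if stage not in (0, 1, 2, 3):
        raise ValueError(f"unsupported ZeRO stage: {stage}")
    if len(parts) == 3 and parts[2] != "Offload":
        raise ValueError(f"unrecognized suffix in label: {label!r}")

    zero = {"stage": stage}
    if len(parts) == 3:
        # Assumption: "Offload" means moving optimizer state to CPU memory.
        zero["offload_optimizer"] = {"device": "cpu"}
        if stage == 3:
            # Parameter offload is only available with ZeRO stage 3.
            zero["offload_param"] = {"device": "cpu"}

    return {
        "gradient_accumulation_steps": grad_accum_steps,
        "zero_optimization": zero,
    }


if __name__ == "__main__":
    # A full-parameter 70B row from the table: ZeRO-3-Offload, accumulation of 4.
    print(json.dumps(zero_config("ZeRO-3-Offload", 4), indent=2))
```

A real training run needs further fields (batch size, precision, and so on), so treat this fragment as a starting point rather than a complete configuration.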