Updated: 2024-12-09 GMT+08:00
NPU Card Counts and Gradient Accumulation Values by Model

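Table 1 lists the recommended training parameters and compute specifications for each model. In the specification and node count column, 1*node & 4*Ascend means a single node with 4 Ascend cards, and so on.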
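The specification notation can be read mechanically. The following is a minimal sketch, not part of any tool referenced on this page: the helper names `parse_spec` and `total_npus` are hypothetical, and the code assumes every node contributes the stated number of Ascend cards. It accepts both the original `节点` spelling and the English `node` spelling.

```python
import re
from typing import Tuple

# Matches strings such as "2*节点 & 8*Ascend" or "2*node & 8*Ascend".
_SPEC_RE = re.compile(
    r"^\s*(\d+)\s*\*\s*(?:节点|nodes?)\s*&\s*(\d+)\s*\*\s*Ascend\s*$",
    re.IGNORECASE,
)


def parse_spec(spec: str) -> Tuple[int, int]:
    """Return (node_count, cards_per_node) for a specification string."""
    match = _SPEC_RE.match(spec)
    if match is None:
        raise ValueError(f"unrecognized specification: {spec!r}")
    return int(match.group(1)), int(match.group(2))


def total_npus(spec: str) -> int:
    """Total number of Ascend cards across all nodes."""
    nodes, cards_per_node = parse_spec(spec)
    return nodes * cards_per_node


if __name__ == "__main__":
    for spec in ("1*节点 & 4*Ascend", "4*node & 8*Ascend"):
        print(spec, "->", total_npus(spec), "cards")
```

For instance, a row that lists 4 nodes with 8 cards each requires 32 cards in total.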

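Table 1 NPU card counts, acceleration framework, and gradient configuration values

| Model | Template | Parameters | Training strategy | Sequence length (cutoff_len) | Gradient accumulation | Optimization tool (DeepSpeed) | Specification and node count |
|---|---|---|---|---|---|---|---|
| llama2 | llama2 | 7B | lora | 4096/8192 | gradient_accumulation_steps: 8 | ZeRO-1 | 1*node & 1*Ascend |
| llama2 | llama2 | 7B | full | 4096/8192 | gradient_accumulation_steps: 8 | ZeRO-2 | 1*node & 8*Ascend |
| llama2 | llama2 | 13B | lora | 4096/8192 | gradient_accumulation_steps: 8 | ZeRO-2 | 1*node & 1*Ascend |
| llama2 | llama2 | 13B | full | 4096/8192 | gradient_accumulation_steps: 8 | ZeRO-3 | 1*node & 8*Ascend |
| llama2 | llama2 | 70B | lora | 4096 | gradient_accumulation_steps: 8 | ZeRO-3 | 2*node & 8*Ascend |
| llama2 | llama2 | 70B | lora | 8192 | gradient_accumulation_steps: 8 | ZeRO-3-Offload | 2*node & 8*Ascend |
| llama2 | llama2 | 70B | full | 4096/8192 | gradient_accumulation_steps: 4 | ZeRO-3-Offload | 4*node & 8*Ascend |
| llama3 | llama3 | 70B | lora | 4096/8192 | gradient_accumulation_steps: 8 | ZeRO-3 | 2*node & 8*Ascend |
| llama3 | llama3 | 70B | full | 4096/8192 | gradient_accumulation_steps: 4 | ZeRO-3-Offload | 4*node & 8*Ascend |
| llama3 | llama3 | 8B | lora | 4096/8192 | gradient_accumulation_steps: 8 | ZeRO-2 | 1*node & 1*Ascend |
| llama3 | llama3 | 8B | full | 4096/8192 | gradient_accumulation_steps: 8 | ZeRO-2 | 1*node & 8*Ascend |
| llama3.1 | llama3 | 8B | lora | 4096/8192 | gradient_accumulation_steps: 8 | ZeRO-1 | 1*node & 1*Ascend |
| llama3.1 | llama3 | 8B | full | 4096/8192 | gradient_accumulation_steps: 8 | ZeRO-2 | 1*node & 8*Ascend |
| llama3.1 | llama3 | 70B | lora | 4096 | gradient_accumulation_steps: 8 | ZeRO-3 | 2*node & 8*Ascend |
| llama3.1 | llama3 | 70B | lora | 8192 | gradient_accumulation_steps: 8 | ZeRO-3-Offload | 2*node & 8*Ascend |
| llama3.1 | llama3 | 70B | full | 4096/8192 | gradient_accumulation_steps: 4 | ZeRO-3-Offload | 4*node & 8*Ascend |
| llama3.2 | llama3 | 1B | lora/full | 4096/8192 | gradient_accumulation_steps: 8 | ZeRO-0 | 1*node & 1*Ascend |
| llama3.2 | llama3 | 3B | lora | 4096/8192 | gradient_accumulation_steps: 8 | ZeRO-0 | 1*node & 1*Ascend |
| llama3.2 | llama3 | 3B | full | 4096/8192 | gradient_accumulation_steps: 8 | ZeRO-0 | 1*node & 2*Ascend |
| Qwen2 | qwen | 72B | lora | 4096 | gradient_accumulation_steps: 8 | ZeRO-3 | 2*node & 8*Ascend |
| Qwen2 | qwen | 72B | lora | 8192 | gradient_accumulation_steps: 8 | ZeRO-3-Offload | 2*node & 8*Ascend |
| Qwen2 | qwen | 72B | full | 4096/8192 | gradient_accumulation_steps: 4 | ZeRO-3-Offload | 4*node & 8*Ascend |
| Qwen2 | qwen | 7B | lora | 4096/8192 | gradient_accumulation_steps: 8 | ZeRO-0 | 1*node & 1*Ascend |
| Qwen2 | qwen | 7B | full | 4096/8192 | gradient_accumulation_steps: 8 | ZeRO-2 | 1*node & 8*Ascend |
| Qwen2 | qwen | 0.5/1.5B | lora/full | 4096/8192 | gradient_accumulation_steps: 8 | ZeRO-0 | 1*node & 1*Ascend |
| Qwen2_vl | qwen2_vl | 2B | lora | 4096/8192 | gradient_accumulation_steps: 8 | ZeRO-0 | 1*node & 1*Ascend |
| Qwen2_vl | qwen2_vl | 2B | full | 4096/8192 | gradient_accumulation_steps: 8 | ZeRO-0 | 1*node & 2*Ascend |
| Qwen2_vl | qwen2_vl | 7B | lora | 4096/8192 | gradient_accumulation_steps: 8 | ZeRO-0 | 1*node & 1*Ascend |
| Qwen2_vl | qwen2_vl | 7B | full | 4096 | gradient_accumulation_steps: 8 | ZeRO-2 | 1*node & 8*Ascend |
| Qwen2_vl | qwen2_vl | 7B | full | 8192 | gradient_accumulation_steps: 8 | ZeRO-2-Offload | 1*node & 8*Ascend |
| Qwen1.5 | qwen | 0.5/1.8B | lora/full | 4096/8192 | gradient_accumulation_steps: 8 | ZeRO-0 | 1*node & 1*Ascend |
| Qwen1.5 | qwen | 7B | lora | 4096/8192 | gradient_accumulation_steps: 8 | ZeRO-1 | 1*node & 1*Ascend |
| Qwen1.5 | qwen | 7B | full | 4096/8192 | gradient_accumulation_steps: 8 | ZeRO-2 | 1*node & 8*Ascend |
| Qwen1.5 | qwen | 14B | lora | 4096/8192 | gradient_accumulation_steps: 8 | ZeRO-3 | 1*node & 1*Ascend |
| Qwen1.5 | qwen | 14B | full | 4096 | gradient_accumulation_steps: 8 | ZeRO-3 | 1*node & 8*Ascend |
| Qwen1.5 | qwen | 14B | full | 8192 | gradient_accumulation_steps: 8 | ZeRO-3 | 2*node & 8*Ascend |
| Qwen1.5 | qwen | 32B | lora | 4096/8192 | gradient_accumulation_steps: 8 | ZeRO-3 | 1*node & 4*Ascend |
| Qwen1.5 | qwen | 32B | full | 4096 | gradient_accumulation_steps: 8 | ZeRO-3 | 2*node & 8*Ascend |
| Qwen1.5 | qwen | 32B | full | 8192 | gradient_accumulation_steps: 4 | ZeRO-3-Offload | 2*node & 8*Ascend |
| Qwen1.5 | qwen | 72B | lora | 4096 | gradient_accumulation_steps: 8 | ZeRO-3 | 2*node & 8*Ascend |
| Qwen1.5 | qwen | 72B | lora | 8192 | gradient_accumulation_steps: 8 | ZeRO-3-Offload | 2*node & 8*Ascend |
| Qwen1.5 | qwen | 72B | full | 4096/8192 | gradient_accumulation_steps: 4 | ZeRO-3-Offload | 4*node & 8*Ascend |
| Qwen2.5 | qwen | 0.5B | lora/full | 4096/8192 | gradient_accumulation_steps: 8 | ZeRO-0 | 1*node & 1*Ascend |
| Qwen2.5 | qwen | 7B | lora | 4096/8192 | gradient_accumulation_steps: 8 | ZeRO-1 | 1*node & 1*Ascend |
| Qwen2.5 | qwen | 7B | full | 4096/8192 | gradient_accumulation_steps: 8 | ZeRO-2 | 1*node & 8*Ascend |
| Qwen2.5 | qwen | 14B | lora | 4096/8192 | gradient_accumulation_steps: 8 | ZeRO-3 | 1*node & 1*Ascend |
| Qwen2.5 | qwen | 14B | full | 4096 | gradient_accumulation_steps: 8 | ZeRO-3 | 1*node & 8*Ascend |
| Qwen2.5 | qwen | 14B | full | 8192 | gradient_accumulation_steps: 8 | ZeRO-3 | 2*node & 8*Ascend |
| Qwen2.5 | qwen | 32B | lora | 4096/8192 | gradient_accumulation_steps: 8 | ZeRO-3 | 1*node & 4*Ascend |
| Qwen2.5 | qwen | 32B | full | 4096 | gradient_accumulation_steps: 8 | ZeRO-3 | 2*node & 8*Ascend |
| Qwen2.5 | qwen | 32B | full | 8192 | gradient_accumulation_steps: 4 | ZeRO-3-Offload | 2*node & 8*Ascend |
| Qwen2.5 | qwen | 72B | lora | 4096 | gradient_accumulation_steps: 8 | ZeRO-3 | 2*node & 8*Ascend |
| Qwen2.5 | qwen | 72B | lora | 8192 | gradient_accumulation_steps: 8 | ZeRO-3-Offload | 2*node & 8*Ascend |
| Qwen2.5 | qwen | 72B | full | 4096/8192 | gradient_accumulation_steps: 4 | ZeRO-3-Offload | 4*node & 8*Ascend |
| falcon2 | falcon | 11B | lora | 4096/8192 | gradient_accumulation_steps: 8 | ZeRO-2 | 1*node & 1*Ascend |
| falcon2 | falcon | 11B | full | 4096/8192 | gradient_accumulation_steps: 8 | ZeRO-2 | 1*node & 8*Ascend |
| GLM4 | glm4 | 9B | lora | 4096/8192 | gradient_accumulation_steps: 8 | ZeRO-2 | 1*node & 1*Ascend |
| GLM4 | glm4 | 9B | full | 4096/8192 | gradient_accumulation_steps: 8 | ZeRO-3 | 1*node & 8*Ascend |
| Yi | yi | 6B | lora | 4096/8192 | gradient_accumulation_steps: 8 | ZeRO-1 | 1*node & 1*Ascend |
| Yi | yi | 6B | full | 4096/8192 | gradient_accumulation_steps: 8 | ZeRO-1 | 1*node & 8*Ascend |
| Yi | yi | 34B | full | 4096 | gradient_accumulation_steps: 8 | ZeRO-3 | 4*node & 8*Ascend |
| Yi | yi | 34B | lora | 4096 | gradient_accumulation_steps: 8 | ZeRO-3 | 1*node & 4*Ascend |
| Yi | yi | 34B | full | 8192 | gradient_accumulation_steps: 8 | ZeRO-3 | 8*node & 8*Ascend |
| Yi | yi | 34B | lora | 8192 | gradient_accumulation_steps: 8 | ZeRO-3 | 2*node & 8*Ascend |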

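The values above assume that the NPU FlashAttention fused operator is enabled. They are for reference only: choose another acceleration framework or ZeRO (Zero Redundancy Optimizer) stage, the number of NPU nodes, and other settings according to your actual requirements.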
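When adjusting the node count or the gradient accumulation value, it helps to check how the effective global batch size changes. The following is a minimal sketch under two assumptions that this page does not state: training is purely data-parallel (each card is one data-parallel rank, as with DeepSpeed ZeRO without tensor or pipeline parallelism), and the per-device micro-batch size is chosen by the user. The function name `global_batch_size` is hypothetical.

```python
def global_batch_size(
    per_device_batch: int,
    grad_accum_steps: int,
    nodes: int,
    cards_per_node: int,
) -> int:
    """Samples consumed per optimizer step under pure data parallelism."""
    if min(per_device_batch, grad_accum_steps, nodes, cards_per_node) < 1:
        raise ValueError("all arguments must be positive integers")
    world_size = nodes * cards_per_node  # one data-parallel rank per card
    return per_device_batch * grad_accum_steps * world_size


if __name__ == "__main__":
    # 70B LoRA row: gradient_accumulation_steps 8 on 2 nodes x 8 cards.
    # The per-device batch size of 1 is an assumed value.
    print(global_batch_size(1, 8, 2, 8))
    # 70B full row: gradient_accumulation_steps 4 on 4 nodes x 8 cards.
    print(global_batch_size(1, 4, 4, 8))
```

With a per-device batch size of 1, both configurations give the same global batch size: halving the accumulation steps offsets the doubled card count.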

For details on using the optimization tool, see "How to choose the ZeRO stage and offloads for best performance".
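To illustrate how the ZeRO labels in the table might translate into a DeepSpeed configuration, here is a minimal sketch that builds the JSON as a Python dictionary. The helper name `make_zero_config` is hypothetical. The mapping of "Offload" to CPU offload of optimizer state (plus parameters at stage 3) is an assumption, since this page does not say what is offloaded, and the micro-batch size of 1 is a placeholder. The keys themselves (`zero_optimization`, `stage`, `offload_optimizer`, `offload_param`, `gradient_accumulation_steps`, `train_micro_batch_size_per_gpu`) are standard DeepSpeed configuration keys.

```python
import json
from typing import Any, Dict


def make_zero_config(
    stage: int,
    grad_accum_steps: int,
    micro_batch_size: int = 1,  # placeholder value, not taken from the table
    offload: bool = False,
) -> Dict[str, Any]:
    """Build a minimal DeepSpeed config dict for a given ZeRO stage."""
    if stage not in (0, 1, 2, 3):
        raise ValueError("ZeRO stage must be 0, 1, 2, or 3")
    zero: Dict[str, Any] = {"stage": stage}
    if offload:
        # The table only lists offload for stages 2 and 3, so this sketch
        # restricts itself to those.
        if stage < 2:
            raise ValueError("this sketch only handles offload for stages 2 and 3")
        # Assumption: "Offload" means moving optimizer state to CPU memory.
        zero["offload_optimizer"] = {"device": "cpu", "pin_memory": True}
        if stage == 3:
            # Parameter offload is only available with ZeRO stage 3.
            zero["offload_param"] = {"device": "cpu", "pin_memory": True}
    return {
        "train_micro_batch_size_per_gpu": micro_batch_size,
        "gradient_accumulation_steps": grad_accum_steps,
        "zero_optimization": zero,
    }


if __name__ == "__main__":
    # Roughly corresponds to a "ZeRO-3-Offload" row with
    # gradient_accumulation_steps: 4.
    print(json.dumps(make_zero_config(3, 4, offload=True), indent=2))
```

The printed JSON could be saved to a file and passed to the training framework as its DeepSpeed configuration. The accumulation value in that file should match the one set in the training arguments.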
