Updated: 2025-07-29 GMT+08:00

Training Parameter Configuration (Legacy)

To train with customized parameters, edit the training script of the corresponding model directly and adjust the values to match your actual model.
Table 1 Model training script parameters

Each entry below lists the parameter, an example value, and its description.

ORIGINAL_TRAIN_DATA_PATH
  Example: [Pre-training: pt] relative or absolute path of the pre-training dataset; [Fine-tuning: sft] relative or absolute path of the fine-tuning dataset
  Description: [Required] Path of the raw input data used for training. Set it according to your actual plan, choosing one of the two options above to match the training scenario.

USER_PROCESSED_DATA_DIR
  Example: /home/ma-user/ws/process_data
  Description: [Optional] If preprocessed data already exists, point this parameter at its directory; training then loads that directory first and skips the preprocessing step. The parameter is absent by default.

ORIGINAL_HF_WEIGHT
  Example: /home/ma-user/ws/llm_train/AscendFactory/model/llama2-70B
  Description: [Required] Directory from which the tokenizer and the Hugging Face weights are loaded. Set it according to your actual plan.

OUTPUT_SAVE_DIR
  Example: /home/ma-user/ws/save_dir/llama2-70B_sft_lora_4096
  Description: [Required] Directory where logs and weight files are written when the training job ends. Set it as appropriate.

SHELL_FOLDER
  Example: $(dirname $(readlink -f "$0"))
  Description: Path of the script at execution time (the script's own directory).

MODEL_NAME
  Example: llama2-70b
  Description: Name of the model. Modify it to match your actual model.

STAGE
  Example: pt
  Description: Current training stage. Valid values: [pt, sft]
    • sft: supervised fine-tuning
    • pt: pre-training

FINETUNING_TYPE
  Example: full
  Description: Training strategy. Valid values: [full, lora]
    • full: full-parameter fine-tuning
    • lora: LoRA fine-tuning

DATA_TYPE
  Example: [GeneralPretrainHandler, GeneralInstructionHandler, MOSSInstructionHandler, AlpacaStyleInstructionHandler, SharegptStyleInstructionHandler]
  Description: [Required] Choose one of the example values according to the dataset:
    • GeneralPretrainHandler: Alpaca dataset for pre-training
    • GeneralInstructionHandler: Alpaca dataset for fine-tuning
    • MOSSInstructionHandler: MOSS dataset for fine-tuning
    • AlpacaStyleInstructionHandler: Alpaca-format dataset using the LLaMA-Factory template
    • SharegptStyleInstructionHandler: ShareGPT-format dataset using the LLaMA-Factory template

MBS
  Example: 1
  Description: Number of samples processed by one micro batch in pipeline parallelism. To reduce bubble time, pipeline parallelism splits the data of one step into several micro batches. The appropriate value depends on TP, PP, and the model size; adjust it as needed.

GBS
  Example: 128
  Description: Number of samples processed in one step across all machines. It determines the duration of each training iteration.

TP
  Example: 8
  Description: Tensor parallel size. Corresponds to the training argument tensor-model-parallel-size.

PP
  Example: 4
  Description: Pipeline parallel size. Usually equal to the number of training nodes, and must match the value set during weight conversion. Corresponds to the training argument pipeline-model-parallel-size.

CP
  Example: 1
  Description: Context parallel size; defaults to 1. Used when training models on long sequences; if SEQ_LEN exceeds 32768, increasing CP (CP ≥ 2) is recommended. Corresponds to the training argument context-parallel-size. (Currently applicable only to long-sequence training of Llama3-series models.)

LR
  Example: 2.5e-5
  Description: Learning rate.

MIN_LR
  Example: 2.5e-6
  Description: Minimum learning rate.

SEQ_LEN
  Example: 4096
  Description: Maximum sequence length to process.

MAX_PE
  Example: 8192
  Description: Maximum sequence length the model can handle.

SN
  Example: 1200
  Description: [Required] Total number of samples in the specified input dataset. Must be updated whenever the dataset is changed.

EPOCH
  Example: 5
  Description: Number of training epochs; modify as needed. One epoch is one pass over all training samples.

TRAIN_ITERS
  Example: 10
  Description: Optional. Number of training iterations (steps); it is computed automatically.

SEED
  Example: 1234
  Description: Random seed, kept identical for every data-sampling pass.

SAVE_INTERVAL
  Example: 1000
  Description: Interval for saving intermediate model versions.
    • If the value >= TRAIN_ITERS, only the final version produced after TRAIN_ITERS iterations is saved.
    • If the value < TRAIN_ITERS, a model version is saved every SAVE_INTERVAL iterations.
  Number of saved versions = TRAIN_ITERS // SAVE_INTERVAL + 1. For example, with TRAIN_ITERS=10 and SAVE_INTERVAL=1000, only the final version is saved.

SAVE_TOTAL_LIMIT
  Example: 0
  Description: Limits how many weight versions are kept.
    • If unset or <= 0, the limit has no effect.
    • The value must be <= TRAIN_ITERS // SAVE_INTERVAL + 1.
    • If the value > 1, the number of saved versions equals SAVE_TOTAL_LIMIT.

MA_TRAIN_AUTO_RESUME
  Example: False
  Description: [Optional] [Fast fault recovery] Whether to enable the feature. Valid values [True, False]; defaults to False (disabled). When enabled, restarting an interrupted job resumes training from the most recently saved weight file. See the description of resumable training and fast fault recovery.

CKPT_LOAD_TYPE
  Example: 1
  Description: [Optional] Valid values [0, 1, 2]; defaults to 1.
    • 0: load no weights
    • 1: load weights but not optimizer state (incremental training)
    • 2: load weights and optimizer state (resumable training); see the description of resumable training and fast fault recovery

USER_CONVERTED_CKPT_PATH
  Example: /home/ma-user/ws/xxx
  Description: [Optional] Directory of weights already converted to Megatron format, or of weights output by a previous training run; usually used together with resumable or incremental training.
    • Incremental training: converted Megatron weights; if not specified, defaults to the ${output_dir}/converted_hf2mg_weight_TP{tp}PP{pp} directory.
    • Resumable training: a checkpoint saved during training; see the description of resumable training and fast fault recovery.
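
For orientation, here is a minimal sketch of what the variable block of a training script could look like once customized with the example values from Table 1. It assumes the variables are plain shell assignments, as the SHELL_FOLDER example suggests; the dataset path is a placeholder, not a value shipped with the script.

#!/bin/bash
# Illustrative values only; adjust every path and value to your environment.
SHELL_FOLDER=$(dirname $(readlink -f "$0"))    # directory containing this script
MODEL_NAME=llama2-70b                          # model being trained
STAGE=sft                                      # pt = pre-training, sft = fine-tuning
FINETUNING_TYPE=lora                           # full or lora
DATA_TYPE=AlpacaStyleInstructionHandler        # must match the dataset format
ORIGINAL_TRAIN_DATA_PATH=/home/ma-user/ws/data/train.json   # placeholder path
ORIGINAL_HF_WEIGHT=/home/ma-user/ws/llm_train/AscendFactory/model/llama2-70B
OUTPUT_SAVE_DIR=/home/ma-user/ws/save_dir/llama2-70B_sft_lora_4096
TP=8; PP=4; CP=1                               # see the rules in the next section
MBS=1; GBS=128
SEQ_LEN=4096; MAX_PE=8192
LR=2.5e-5; MIN_LR=2.5e-6
SN=1200; EPOCH=5; SEED=1234
SAVE_INTERVAL=1000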

Rules for model parameter settings

  • TP (tensor parallel), PP (pipeline parallel), and CP (context parallel): the total number of NPUs (world_size) must be divisible by TP × PP × CP.
  • num_attention_heads in the model configuration must be divisible by TP × CP.
  • MBS (micro-batch-size) and GBS (global-batch-size): GBS/MBS must be divisible by NPU count / (TP × PP × CP), i.e. by the data-parallel size. The sketch below checks all three rules.
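
The following bash sketch makes the rules concrete. It is illustrative only: the NPU count assumes a 4-node × 8-card job, and NUM_ATTENTION_HEADS is a value you would read from the model's configuration, not a variable defined by the training script.

#!/bin/bash
# Validate the parallelism rules above (all values are examples).
NPUS=32                      # world_size: total NPU count, e.g. 4 nodes x 8 cards
TP=8; PP=4; CP=1             # parallelism settings
MBS=1; GBS=128               # micro / global batch sizes
NUM_ATTENTION_HEADS=64       # assumed: taken from the model configuration

# Rule 1: NPU count must be divisible by TP x PP x CP.
(( NPUS % (TP * PP * CP) == 0 )) || { echo "NPUS not divisible by TP*PP*CP"; exit 1; }
# Rule 2: num_attention_heads must be divisible by TP x CP.
(( NUM_ATTENTION_HEADS % (TP * CP) == 0 )) || { echo "heads not divisible by TP*CP"; exit 1; }
# Rule 3: GBS/MBS must be divisible by the data-parallel size.
DP=$(( NPUS / (TP * PP * CP) ))
(( (GBS / MBS) % DP == 0 )) || { echo "GBS/MBS not divisible by DP=${DP}"; exit 1; }
echo "Parallelism settings are consistent (DP=${DP})"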

Recommended parameters and NPU card counts per model

Table 2 lists the recommended training parameters and the compute requirements for each supported model. In the "Nodes & cards" column, "1 node × 4 Ascend" means a single machine with four cards, and so on.

Table 2 Recommended parameters and NPU card counts per model

| No. | Model family | Model size | Strategy | SEQ_LEN | Parallelism | MBS | Nodes & cards |
|-----|--------------|------------|----------|---------|-------------|-----|---------------|
| 1 | llama2 | llama2-7b | full | 4096 | TP=1, PP=4 | 1 | 1 node × 8 Ascend |
| | | | lora | 4096 | TP=1, PP=4 | 2 | 1 node × 8 Ascend |
| | | | full | 8192 | TP=2, PP=4 | 1 | 1 node × 8 Ascend |
| | | | lora | 8192 | TP=2, PP=4 | 2 | 1 node × 8 Ascend |
| 2 | | llama2-13b | full | 4096 | TP=8, PP=1 | 4 | 1 node × 8 Ascend |
| | | | lora | 4096 | TP=8, PP=1 | 4 | 1 node × 8 Ascend |
| | | | full | 8192 | TP=8, PP=1 | 2 | 1 node × 8 Ascend |
| | | | lora | 8192 | TP=8, PP=1 | 2 | 1 node × 8 Ascend |
| 3 | | llama2-70b | full | 4096 | TP=8, PP=4 | 1 | 4 nodes × 8 Ascend |
| | | | lora | 4096 | TP=8, PP=4 | 2 | 4 nodes × 8 Ascend |
| | | | full | 8192 | TP=8, PP=8 | 1 | 8 nodes × 8 Ascend |
| | | | lora | 8192 | TP=8, PP=4 | 1 | 4 nodes × 8 Ascend |
| 4 | llama3 | llama3-8b | full | 4096 | TP=4, PP=1 | 2 | 1 node × 8 Ascend |
| | | | lora | 4096 | TP=4, PP=1 | 4 | 1 node × 8 Ascend |
| | | | full | 8192 | TP=4, PP=1 | 1 | 1 node × 8 Ascend |
| | | | lora | 8192 | TP=4, PP=1 | 2 | 1 node × 8 Ascend |
| 5 | | llama3-70b | full | 4096 | TP=8, PP=4 | 1 | 4 nodes × 8 Ascend |
| | | | lora | 4096 | TP=8, PP=4 | 2 | 4 nodes × 8 Ascend |
| | | | full | 8192 | TP=8, PP=8 | 1 | 8 nodes × 8 Ascend |
| | | | lora | 8192 | TP=8, PP=4 | 1 | 4 nodes × 8 Ascend |
| 6 | Qwen | qwen-7b | full | 4096 | TP=4, PP=1 | 2 | 1 node × 8 Ascend |
| | | | lora | 4096 | TP=4, PP=1 | 4 | 1 node × 8 Ascend |
| | | | full | 8192 | TP=4, PP=1 | 1 | 1 node × 8 Ascend |
| | | | lora | 8192 | TP=4, PP=1 | 2 | 1 node × 8 Ascend |
| 7 | | qwen-14b | full | 4096 | TP=4, PP=2 | 2 | 1 node × 8 Ascend |
| | | | lora | 4096 | TP=4, PP=1 | 2 | 1 node × 8 Ascend |
| | | | full | 8192 | TP=4, PP=2 | 1 | 1 node × 8 Ascend |
| | | | lora | 8192 | TP=4, PP=2 | 2 | 1 node × 8 Ascend |
| 8 | | qwen-72b | full | 4096 | TP=8, PP=4 | 1 | 4 nodes × 8 Ascend |
| | | | lora | 4096 | TP=8, PP=4 | 2 | 4 nodes × 8 Ascend |
| | | | full | 8192 | TP=8, PP=8 | 1 | 8 nodes × 8 Ascend |
| | | | lora | 8192 | TP=8, PP=4 | 1 | 4 nodes × 8 Ascend |
| 9 | Qwen1.5 | qwen1.5-7b | full | 4096 | TP=1, PP=4 | 1 | 1 node × 8 Ascend |
| | | | lora | 4096 | TP=1, PP=4 | 2 | 1 node × 8 Ascend |
| | | | full | 8192 | TP=4, PP=1 | 1 | 1 node × 8 Ascend |
| | | | lora | 8192 | TP=1, PP=4 | 1 | 1 node × 8 Ascend |
| 10 | | qwen1.5-14b | full | 4096 | TP=8, PP=1 | 4 | 1 node × 8 Ascend |
| | | | lora | 4096 | TP=4, PP=1 | 4 | 1 node × 8 Ascend |
| | | | full | 8192 | TP=8, PP=1 | 2 | 1 node × 8 Ascend |
| | | | lora | 8192 | TP=8, PP=1 | 2 | 1 node × 8 Ascend |
| 11 | | qwen1.5-32b | full | 4096 | TP=8, PP=2 | 2 | 2 nodes × 8 Ascend |
| | | | lora | 4096 | TP=8, PP=2 | 4 | 2 nodes × 8 Ascend |
| | | | full | 8192 | TP=8, PP=2 | 1 | 2 nodes × 8 Ascend |
| | | | lora | 8192 | TP=8, PP=2 | 2 | 2 nodes × 8 Ascend |
| 12 | | qwen1.5-72b | full | 4096 | TP=8, PP=4 | 1 | 4 nodes × 8 Ascend |
| | | | lora | 4096 | TP=8, PP=4 | 2 | 4 nodes × 8 Ascend |
| | | | full | 8192 | TP=8, PP=8 | 1 | 8 nodes × 8 Ascend |
| | | | lora | 8192 | TP=8, PP=4 | 1 | 4 nodes × 8 Ascend |
| 13 | Yi | yi-6b | full | 4096 | TP=1, PP=4 | 1 | 1 node × 8 Ascend |
| | | | lora | 4096 | TP=1, PP=4 | 2 | 1 node × 8 Ascend |
| | | | full | 8192 | TP=2, PP=2 | 1 | 1 node × 8 Ascend |
| | | | lora | 8192 | TP=1, PP=4 | 1 | 1 node × 8 Ascend |
| 14 | | yi-34b | full | 4096 | TP=4, PP=4 | 1 | 2 nodes × 8 Ascend |
| | | | lora | 4096 | TP=4, PP=4 | 2 | 2 nodes × 8 Ascend |
| | | | full | 8192 | TP=8, PP=4 | 1 | 4 nodes × 8 Ascend |
| | | | lora | 8192 | TP=8, PP=4 | 2 | 4 nodes × 8 Ascend |
| 15 | ChatGLMv3 | glm3-6b | full | 4096 | TP=1, PP=2 | 1 | 1 node × 4 Ascend |
| | | | lora | 4096 | TP=1, PP=2 | 2 | 1 node × 4 Ascend |
| | | | full | 8192 | TP=1, PP=4 | 1 | 1 node × 4 Ascend |
| | | | lora | 8192 | TP=1, PP=2 | 1 | 1 node × 4 Ascend |
| 16 | Baichuan2 | baichuan2-7b | full | 4096 | TP=1, PP=4 | 1 | 1 node × 4 Ascend |
| | | | lora | 4096 | TP=1, PP=4 | 2 | 1 node × 4 Ascend |
| | | | full | 8192 | TP=4, PP=1 | 1 | 1 node × 4 Ascend |
| | | | lora | 8192 | TP=1, PP=4 | 1 | 1 node × 4 Ascend |
| 17 | | baichuan2-13b | full | 4096 | TP=8, PP=1 | 2 | 1 node × 8 Ascend |
| | | | lora | 4096 | TP=8, PP=1 | 4 | 1 node × 8 Ascend |
| | | | full | 8192 | TP=8, PP=2 | 1 | 2 nodes × 8 Ascend |
| | | | lora | 8192 | TP=8, PP=1 | 1 | 1 node × 8 Ascend |
| 18 | Qwen2 | qwen2-0.5b | full | 4096 | TP=1, PP=1 | 2 | 1 node × 4 Ascend |
| | | | lora | 4096 | TP=1, PP=1 | 2 | 1 node × 4 Ascend |
| | | | full | 8192 | TP=1, PP=1 | 1 | 1 node × 4 Ascend |
| | | | lora | 8192 | TP=1, PP=1 | 1 | 1 node × 4 Ascend |
| 19 | | qwen2-1.5b | full | 4096 | TP=1, PP=1 | 2 | 1 node × 4 Ascend |
| | | | lora | 4096 | TP=1, PP=1 | 2 | 1 node × 4 Ascend |
| | | | full | 8192 | TP=1, PP=1 | 1 | 1 node × 4 Ascend |
| | | | lora | 8192 | TP=1, PP=1 | 1 | 1 node × 4 Ascend |
| 20 | | qwen2-7b | full | 4096 | TP=4, PP=1 | 2 | 1 node × 8 Ascend |
| | | | lora | 4096 | TP=4, PP=1 | 2 | 1 node × 8 Ascend |
| | | | full | 8192 | TP=4, PP=2 | 1 | 1 node × 8 Ascend |
| | | | lora | 8192 | TP=4, PP=2 | 2 | 1 node × 8 Ascend |
| 21 | | qwen2-72b | full | 4096 | TP=8, PP=4 | 1 | 4 nodes × 8 Ascend |
| | | | lora | 4096 | TP=8, PP=4 | 2 | 4 nodes × 8 Ascend |
| | | | full | 8192 | TP=8, PP=8 | 1 | 8 nodes × 8 Ascend |
| | | | lora | 8192 | TP=8, PP=8 | 1 | 8 nodes × 8 Ascend |
| 22 | GLMv4 | glm4-9b | full | 4096 | TP=1, PP=4 | 1 | 1 node × 8 Ascend |
| | | | lora | 4096 | TP=1, PP=2 | 1 | 1 node × 4 Ascend |
| | | | full | 8192 | TP=2, PP=2 | 1 | 1 node × 8 Ascend |
| | | | lora | 8192 | TP=2, PP=1 | 1 | 1 node × 4 Ascend |
| 23 | mistral | mistral-7b | full | 4096 | TP=1, PP=4 | 1 | 1 node × 8 Ascend |
| | | | lora | 4096 | TP=1, PP=4 | 2 | 1 node × 8 Ascend |
| 24 | mixtral | mixtral-8x7b | full | 4096 | TP=2, PP=8 | 1 | 2 nodes × 8 Ascend |
| | | | full | 8192 | TP=2, PP=8 | 1 | 2 nodes × 8 Ascend |
| 25 | llama3.1 | llama3.1-8b | full | 4096 | TP=4, PP=1 | 2 | 1 node × 8 Ascend |
| | | | lora | 4096 | TP=4, PP=1 | 4 | 1 node × 8 Ascend |
| | | | full | 8192 | TP=4, PP=1 | 1 | 1 node × 8 Ascend |
| | | | lora | 8192 | TP=4, PP=1 | 2 | 1 node × 8 Ascend |
| 26 | | llama3.1-70b | full | 4096 | TP=8, PP=4 | 1 | 4 nodes × 8 Ascend |
| | | | lora | 4096 | TP=8, PP=2 | 4 | 2 nodes × 8 Ascend |
| | | | full | 8192 | TP=8, PP=8 | 1 | 8 nodes × 8 Ascend |
| | | | lora | 8192 | TP=8, PP=2 | 2 | 2 nodes × 8 Ascend |
| 27 | Qwen2.5 | qwen2.5-0.5b | full | 4096 | TP=1, PP=1 | 1 | 1 node × 4 Ascend |
| | | | lora | 4096 | TP=1, PP=1 | 2 | 1 node × 4 Ascend |
| | | | full | 8192 | TP=1, PP=1 | 1 | 1 node × 4 Ascend |
| | | | lora | 8192 | TP=1, PP=1 | 1 | 1 node × 4 Ascend |
| 28 | | qwen2.5-7b | full | 4096 | TP=4, PP=1 | 2 | 1 node × 8 Ascend |
| | | | lora | 4096 | TP=4, PP=1 | 4 | 1 node × 8 Ascend |
| | | | full | 8192 | TP=4, PP=2 | 1 | 1 node × 8 Ascend |
| | | | lora | 8192 | TP=4, PP=2 | 2 | 1 node × 8 Ascend |
| 29 | | qwen2.5-14b | full | 4096 | TP=8, PP=1 | 4 | 1 node × 8 Ascend |
| | | | lora | 4096 | TP=4, PP=1 | 4 | 1 node × 8 Ascend |
| | | | full | 8192 | TP=8, PP=1 | 2 | 1 node × 8 Ascend |
| | | | lora | 8192 | TP=8, PP=1 | 2 | 1 node × 8 Ascend |
| 30 | | qwen2.5-32b | full | 4096 | TP=8, PP=2 | 2 | 2 nodes × 8 Ascend |
| | | | lora | 4096 | TP=8, PP=2 | 4 | 2 nodes × 8 Ascend |
| | | | full | 8192 | TP=8, PP=2 | 1 | 2 nodes × 8 Ascend |
| | | | lora | 8192 | TP=8, PP=2 | 2 | 2 nodes × 8 Ascend |
| 31 | | qwen2.5-72b | full | 4096 | TP=8, PP=4 | 1 | 4 nodes × 8 Ascend |
| | | | lora | 4096 | TP=8, PP=4 | 4 | 4 nodes × 8 Ascend |
| | | | full | 8192 | TP=8, PP=8 | 1 | 8 nodes × 8 Ascend |
| | | | lora | 8192 | TP=8, PP=4 | 2 | 4 nodes × 8 Ascend |
| 32 | llama3.2 | llama3.2-1b | full | 4096 | TP=1, PP=1 | 2 | 1 node × 4 Ascend |
| | | | lora | 4096 | TP=1, PP=1 | 2 | 1 node × 4 Ascend |
| | | | full | 8192 | TP=1, PP=1 | 1 | 1 node × 4 Ascend |
| | | | lora | 8192 | TP=1, PP=1 | 1 | 1 node × 4 Ascend |
| 33 | | llama3.2-3b | full | 4096 | TP=1, PP=2 | 2 | 1 node × 4 Ascend |
| | | | lora | 4096 | TP=1, PP=1 | 2 | 1 node × 4 Ascend |
| | | | full | 8192 | TP=1, PP=2 | 1 | 1 node × 4 Ascend |
| | | | lora | 8192 | TP=1, PP=1 | 1 | 1 node × 4 Ascend |
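
To connect the two tables: each row of Table 2 maps directly onto the script variables of Table 1. As a sketch (not the shipped script), row 3, llama2-70b with full tuning at SEQ_LEN=4096 on 4 nodes × 8 Ascend, would translate to:

# llama2-70b, full tuning, SEQ_LEN=4096, 4 nodes x 8 Ascend (32 NPUs)
MODEL_NAME=llama2-70b
FINETUNING_TYPE=full
SEQ_LEN=4096
TP=8    # tensor-model-parallel-size
PP=4    # pipeline-model-parallel-size; equals the 4 training nodes
MBS=1
# With 32 NPUs and TP*PP*CP = 32, the data-parallel size is 1, so any GBS that
# is a multiple of MBS satisfies the batch-size rule.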
