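Training launch script description and parameter configuration
This code package integrates training scripts for different models (including llama2, llama3, Qwen, Qwen1.5, ...), and each model can be trained in one step through its own training script. The training script checks whether the preprocessed data and the converted model weights already exist. If they do not, it runs the corresponding scripts to complete data preprocessing and weight conversion automatically.
If you need custom dataset preprocessing or weight conversion, edit the specific python commands in 1_preprocess_data.sh and 2_convert_mg_hf.sh in the Notebook environment and run them there. The code sets many environment variables, which are explained in detail in the steps below.
To train with custom parameters, edit the training script of the corresponding model directly. The editable parameters are described below, using llama2-13b pre-training as the example: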
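Parameter | Example value | Description |
---|---|---|
ORIGINAL_TRAIN_DATA_PATH | /home/ma-user/work/training_data/pretrain/train-00000-of-00001-a09b74b3ef9c3b56.parquet | Must be changed. Path of the input data used for training. Adjust it to your actual layout. |
ORIGINAL_HF_WEIGHT | /home/ma-user/work/model/llama-2-13b-chat-hf | Must be changed. Directory from which the tokenizer and the Hugging Face weights are loaded. Adjust it to your actual layout. |
SHELL_FOLDER | $(dirname $(readlink -f "$0")) | Directory of the script being executed. |
MODEL_NAME | llama2-13b | Name of the model. |
RUN_TYPE | pretrain | Training type. Allowed values: [pretrain, sft, lora]. |
DATA_TYPE | [GeneralPretrainHandler, GeneralInstructionHandler, MOSSMultiTurnHandler] | Choose exactly one of the listed values, depending on the dataset. |
MBS | 4 | Number of samples processed in one micro batch under pipeline parallelism. To reduce bubble time, pipeline parallelism splits the data of one step into multiple micro batches. The value depends on TP, PP, and the model size, and can be adjusted as needed. |
GBS | 512 | Number of samples processed by all machines in one training step. It affects the duration of each training iteration. |
TP | 8 | Tensor parallel size. Corresponds to the training parameter tensor-model-parallel-size. |
PP | 1 | Pipeline parallel size. It generally equals the number of training nodes and must equal the value set during weight conversion. Corresponds to the training parameter pipeline-model-parallel-size. |
CP | 1 | Context parallel size, 1 by default. Used for models trained on long text sequences. If SEQ_LEN exceeds 32768, increasing CP (CP ≥ 2) is recommended. Corresponds to the training parameter context-parallel-size. (This parameter currently applies only to long-sequence training of Llama3-series models.) |
LR | 2.5e-5 | Learning rate. |
MIN_LR | 2.5e-6 | Minimum learning rate. |
SEQ_LEN | 4096 | Maximum sequence length to process. |
MAX_PE | 8192 | Maximum sequence length the model is able to handle. |
SN | 1200 | Must be changed. Total number of samples in the specified input dataset. Update it whenever the dataset changes. |
EPOCH | 5 | Number of training epochs. Adjust as needed. One epoch is one pass over all training samples. |
TRAIN_ITERS | SN / GBS * EPOCH | Optional. Number of training step iterations. Adjust as needed. |
SEED | 1234 | Random seed. Keeps data sampling consistent across runs. |
SAVE_INTERVAL | 10 | Number of training steps between two saves of the weight files. |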
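The TRAIN_ITERS formula in the table (SN / GBS * EPOCH) can be checked by hand before editing the script. The sketch below is only an illustration: the function name `train_iters` is hypothetical, and rounding a partial step up is an assumption, since the actual training script may truncate instead.

```python
import math


def train_iters(sn: int, gbs: int, epoch: int) -> int:
    """Number of training steps needed for `epoch` passes over `sn` samples.

    Each step consumes `gbs` samples, so the step count is SN / GBS * EPOCH.
    Assumption: a partial last step is rounded up; the real script may truncate.
    """
    if sn <= 0 or gbs <= 0 or epoch <= 0:
        raise ValueError("SN, GBS and EPOCH must all be positive")
    return math.ceil(sn * epoch / gbs)


if __name__ == "__main__":
    # Example values from the parameter table: SN=1200, GBS=512, EPOCH=5.
    print(train_iters(1200, 512, 5))
```

With the example values, 1200 × 5 / 512 = 11.71875, which this sketch rounds up to 12 steps.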
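Rules for model parameter settings
- TP (tensor parallel), PP (pipeline parallel), and CP (context parallel): the number of NPUs (world_size) must be divisible by TP×PP×CP.
- num_attention_heads in the model configuration must be divisible by TP×CP.
- MBS (micro-batch-size) and GBS (global-batch-size): GBS/MBS must be divisible by NPU/(TP×PP×CP).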
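These constraints can be checked before launching a job. The following is a minimal sketch, not part of the code package: the function name `check_parallel_config` and its argument names are hypothetical, and it reads the rules as "the NPU count must be divisible by TP×PP×CP", "num_attention_heads must be divisible by TP×CP", and "GBS/MBS must be divisible by NPU/(TP×PP×CP)".

```python
from typing import List


def check_parallel_config(
    world_size: int,
    tp: int,
    pp: int,
    cp: int,
    num_attention_heads: int,
    mbs: int,
    gbs: int,
) -> List[str]:
    """Return the list of violated rules (empty when the settings are valid)."""
    errors: List[str] = []

    # Rule: attention heads are split across TP x CP ranks.
    if num_attention_heads % (tp * cp) != 0:
        errors.append("num_attention_heads must be divisible by TP*CP")

    # Rule: the NPUs must form whole model-parallel groups.
    model_parallel = tp * pp * cp
    if world_size % model_parallel != 0:
        errors.append("NPU count must be divisible by TP*PP*CP")
        # The data-parallel size is undefined here, so skip the batch check.
        return errors

    # Rule: GBS/MBS must be a multiple of the data-parallel size.
    dp = world_size // model_parallel
    if gbs % mbs != 0 or (gbs // mbs) % dp != 0:
        errors.append("GBS/MBS must be divisible by NPU/(TP*PP*CP)")

    return errors


if __name__ == "__main__":
    # llama2-13b example: 8 NPUs, TP=8, PP=1, CP=1, 40 attention heads.
    print(check_parallel_config(8, 8, 1, 1, 40, mbs=4, gbs=512))
```

An empty list means all three rules hold. For instance, raising PP to 2 on a single 8-NPU node would be reported, because 8 is not divisible by 8×2×1.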
Recommended parameters and NPU card counts per model
No. | Model family | Model size | Training strategy | Sequence length (SEQ_LEN) | Parallel settings (TP = tensor model parallel size, PP = pipeline model parallel size) | Micro batch size (MBS) | Specification and node count |
---|---|---|---|---|---|---|---|
1 | llama2 | llama2-7b | pretrain/sft | 4096 | TP=1, PP=4 | 1 | 1*node & 4*Ascend |
1 | llama2 | llama2-7b | lora | 4096 | TP=1, PP=4 | 2 | 1*node & 4*Ascend |
1 | llama2 | llama2-7b | pretrain/sft | 8192 | TP=2, PP=4 | 1 | 1*node & 8*Ascend |
1 | llama2 | llama2-7b | lora | 8192 | TP=2, PP=4 | 2 | 1*node & 8*Ascend |
2 | llama2 | llama2-13b | pretrain/sft | 4096 | TP=8, PP=1 | 4 | 1*node & 8*Ascend |
2 | llama2 | llama2-13b | lora | 4096 | TP=8, PP=1 | 4 | 1*node & 8*Ascend |
2 | llama2 | llama2-13b | pretrain/sft | 8192 | TP=8, PP=1 | 2 | 1*node & 8*Ascend |
2 | llama2 | llama2-13b | lora | 8192 | TP=8, PP=1 | 2 | 1*node & 8*Ascend |
3 | llama2 | llama2-70b | pretrain/sft | 4096 | TP=8, PP=4 | 1 | 4*node & 8*Ascend |
3 | llama2 | llama2-70b | lora | 4096 | TP=8, PP=4 | 2 | 4*node & 8*Ascend |
3 | llama2 | llama2-70b | pretrain/sft | 8192 | TP=8, PP=8 | 1 | 8*node & 8*Ascend |
3 | llama2 | llama2-70b | lora | 8192 | TP=8, PP=4 | 1 | 4*node & 8*Ascend |
4 | llama3 | llama3-8b | pretrain/sft | 4096 | TP=4, PP=1 | 2 | 1*node & 4*Ascend |
4 | llama3 | llama3-8b | lora | 4096 | TP=4, PP=1 | 4 | 1*node & 4*Ascend |
4 | llama3 | llama3-8b | pretrain/sft | 8192 | TP=4, PP=1 | 1 | 1*node & 4*Ascend |
4 | llama3 | llama3-8b | lora | 8192 | TP=4, PP=1 | 2 | 1*node & 4*Ascend |
5 | llama3 | llama3-70b | pretrain/sft | 4096 | TP=8, PP=4 | 1 | 4*node & 8*Ascend |
5 | llama3 | llama3-70b | lora | 4096 | TP=8, PP=4 | 2 | 4*node & 8*Ascend |
5 | llama3 | llama3-70b | pretrain/sft | 8192 | TP=8, PP=8 | 1 | 8*node & 8*Ascend |
5 | llama3 | llama3-70b | lora | 8192 | TP=8, PP=4 | 1 | 4*node & 8*Ascend |
6 | Qwen | qwen-7b | pretrain/sft | 4096 | TP=4, PP=1 | 2 | 1*node & 4*Ascend |
6 | Qwen | qwen-7b | lora | 4096 | TP=4, PP=1 | 4 | 1*node & 4*Ascend |
6 | Qwen | qwen-7b | pretrain/sft | 8192 | TP=4, PP=1 | 1 | 1*node & 4*Ascend |
6 | Qwen | qwen-7b | lora | 8192 | TP=4, PP=1 | 2 | 1*node & 4*Ascend |
7 | Qwen | qwen-14b | pretrain/sft | 4096 | TP=4, PP=2 | 2 | 1*node & 8*Ascend |
7 | Qwen | qwen-14b | lora | 4096 | TP=4, PP=1 | 2 | 1*node & 4*Ascend |
7 | Qwen | qwen-14b | pretrain/sft | 8192 | TP=4, PP=2 | 1 | 1*node & 8*Ascend |
7 | Qwen | qwen-14b | lora | 8192 | TP=4, PP=2 | 2 | 1*node & 8*Ascend |
8 | Qwen | qwen-72b | pretrain/sft | 4096 | TP=8, PP=4 | 1 | 4*node & 8*Ascend |
8 | Qwen | qwen-72b | lora | 4096 | TP=8, PP=4 | 2 | 4*node & 8*Ascend |
8 | Qwen | qwen-72b | pretrain/sft | 8192 | TP=8, PP=8 | 1 | 8*node & 8*Ascend |
8 | Qwen | qwen-72b | lora | 8192 | TP=8, PP=4 | 1 | 4*node & 8*Ascend |
9 | Qwen1.5 | qwen1.5-7b | pretrain/sft | 4096 | TP=1, PP=4 | 1 | 1*node & 4*Ascend |
9 | Qwen1.5 | qwen1.5-7b | lora | 4096 | TP=1, PP=4 | 2 | 1*node & 4*Ascend |
9 | Qwen1.5 | qwen1.5-7b | pretrain/sft | 8192 | TP=4, PP=1 | 1 | 1*node & 4*Ascend |
9 | Qwen1.5 | qwen1.5-7b | lora | 8192 | TP=1, PP=4 | 1 | 1*node & 4*Ascend |
10 | Qwen1.5 | qwen1.5-14b | pretrain/sft | 4096 | TP=8, PP=1 | 4 | 1*node & 8*Ascend |
10 | Qwen1.5 | qwen1.5-14b | lora | 4096 | TP=4, PP=1 | 4 | 1*node & 4*Ascend |
10 | Qwen1.5 | qwen1.5-14b | pretrain/sft | 8192 | TP=8, PP=1 | 2 | 1*node & 8*Ascend |
10 | Qwen1.5 | qwen1.5-14b | lora | 8192 | TP=8, PP=1 | 2 | 1*node & 8*Ascend |
11 | Qwen1.5 | qwen1.5-32b | pretrain/sft | 4096 | TP=8, PP=2 | 2 | 2*node & 8*Ascend |
11 | Qwen1.5 | qwen1.5-32b | lora | 4096 | TP=8, PP=2 | 4 | 2*node & 8*Ascend |
11 | Qwen1.5 | qwen1.5-32b | pretrain/sft | 8192 | TP=8, PP=2 | 1 | 2*node & 8*Ascend |
11 | Qwen1.5 | qwen1.5-32b | lora | 8192 | TP=8, PP=2 | 2 | 2*node & 8*Ascend |
12 | Qwen1.5 | qwen1.5-72b | pretrain/sft | 4096 | TP=8, PP=4 | 1 | 4*node & 8*Ascend |
12 | Qwen1.5 | qwen1.5-72b | lora | 4096 | TP=8, PP=4 | 2 | 4*node & 8*Ascend |
12 | Qwen1.5 | qwen1.5-72b | pretrain/sft | 8192 | TP=8, PP=8 | 1 | 8*node & 8*Ascend |
12 | Qwen1.5 | qwen1.5-72b | lora | 8192 | TP=8, PP=4 | 1 | 4*node & 8*Ascend |
13 | Yi | yi-6b | pretrain/sft | 4096 | TP=1, PP=4 | 1 | 1*node & 4*Ascend |
13 | Yi | yi-6b | lora | 4096 | TP=1, PP=4 | 2 | 1*node & 4*Ascend |
13 | Yi | yi-6b | pretrain/sft | 8192 | TP=2, PP=2 | 1 | 1*node & 4*Ascend |
13 | Yi | yi-6b | lora | 8192 | TP=1, PP=4 | 1 | 1*node & 4*Ascend |
14 | Yi | yi-34b | pretrain/sft | 4096 | TP=4, PP=4 | 1 | 2*node & 8*Ascend |
14 | Yi | yi-34b | lora | 4096 | TP=4, PP=4 | 2 | 2*node & 8*Ascend |
14 | Yi | yi-34b | pretrain/sft | 8192 | TP=8, PP=4 | 1 | 4*node & 8*Ascend |
14 | Yi | yi-34b | lora | 8192 | TP=8, PP=4 | 2 | 4*node & 8*Ascend |
15 | ChatGLMv3 | glm3-6b | pretrain/sft | 4096 | TP=1, PP=2 | 1 | 1*node & 2*Ascend |
15 | ChatGLMv3 | glm3-6b | lora | 4096 | TP=1, PP=2 | 2 | 1*node & 2*Ascend |
15 | ChatGLMv3 | glm3-6b | pretrain/sft | 8192 | TP=1, PP=4 | 1 | 1*node & 4*Ascend |
15 | ChatGLMv3 | glm3-6b | lora | 8192 | TP=1, PP=2 | 1 | 1*node & 2*Ascend |
16 | Baichuan2 | baichuan2-13b | pretrain/sft | 4096 | TP=8, PP=1 | 2 | 1*node & 8*Ascend |
16 | Baichuan2 | baichuan2-13b | lora | 4096 | TP=8, PP=1 | 4 | 1*node & 8*Ascend |
16 | Baichuan2 | baichuan2-13b | pretrain/sft | 8192 | TP=8, PP=2 | 1 | 1*node & 8*Ascend |
16 | Baichuan2 | baichuan2-13b | lora | 8192 | TP=8, PP=1 | 1 | 2*node & 8*Ascend |
17 | Qwen2 | qwen2-0.5b | pretrain/sft | 4096 | TP=1, PP=1 | 2 | 1*node & 1*Ascend |
17 | Qwen2 | qwen2-0.5b | lora | 4096 | TP=1, PP=1 | 2 | 1*node & 1*Ascend |
17 | Qwen2 | qwen2-0.5b | pretrain/sft | 8192 | TP=1, PP=1 | 1 | 1*node & 1*Ascend |
17 | Qwen2 | qwen2-0.5b | lora | 8192 | TP=1, PP=1 | 1 | 1*node & 1*Ascend |
18 | Qwen2 | qwen2-1.5b | pretrain/sft | 4096 | TP=1, PP=1 | 2 | 1*node & 1*Ascend |
18 | Qwen2 | qwen2-1.5b | lora | 4096 | TP=1, PP=1 | 2 | 1*node & 1*Ascend |
18 | Qwen2 | qwen2-1.5b | pretrain/sft | 8192 | TP=1, PP=1 | 1 | 1*node & 1*Ascend |
18 | Qwen2 | qwen2-1.5b | lora | 8192 | TP=1, PP=1 | 1 | 1*node & 1*Ascend |
19 | Qwen2 | qwen2-7b | pretrain/sft | 4096 | TP=4, PP=1 | 2 | 1*node & 4*Ascend |
19 | Qwen2 | qwen2-7b | lora | 4096 | TP=4, PP=1 | 2 | 1*node & 4*Ascend |
19 | Qwen2 | qwen2-7b | pretrain/sft | 8192 | TP=4, PP=2 | 1 | 1*node & 8*Ascend |
19 | Qwen2 | qwen2-7b | lora | 8192 | TP=4, PP=2 | 2 | 1*node & 8*Ascend |
20 | Qwen2 | qwen2-72b | pretrain/sft | 4096 | TP=8, PP=4 | 1 | 4*node & 8*Ascend |
20 | Qwen2 | qwen2-72b | lora | 4096 | TP=8, PP=4 | 2 | 4*node & 8*Ascend |
20 | Qwen2 | qwen2-72b | pretrain/sft | 8192 | TP=8, PP=8 | 1 | 8*node & 8*Ascend |
20 | Qwen2 | qwen2-72b | lora | 8192 | TP=8, PP=8 | 1 | 8*node & 8*Ascend |
21 | GLMv4 | glm4-9b | pretrain/sft | 4096 | TP=1, PP=4 | 1 | 1*node & 4*Ascend |
21 | GLMv4 | glm4-9b | lora | 4096 | TP=1, PP=2 | 1 | 1*node & 2*Ascend |
21 | GLMv4 | glm4-9b | pretrain/sft | 8192 | TP=2, PP=2 | 1 | 1*node & 4*Ascend |
21 | GLMv4 | glm4-9b | lora | 8192 | TP=2, PP=1 | 1 | 1*node & 2*Ascend |
22 | mistral | mistral-7b | pretrain/sft | 4096 | TP=1, PP=4 | 1 | 1*node & 4*Ascend |
22 | mistral | mistral-7b | lora | 4096 | TP=1, PP=4 | 2 | 1*node & 4*Ascend |
23 | mixtral | mixtral-8x7b | pretrain/sft | 4096 | TP=2, PP=8 | 1 | 2*node & 8*Ascend |
23 | mixtral | mixtral-8x7b | pretrain/sft | 8192 | TP=2, PP=8 | 1 | 2*node & 8*Ascend |
24 | llama3.1 | llama3.1-8b | pretrain/sft | 4096 | TP=4, PP=1 | 2 | 1*node & 4*Ascend |
24 | llama3.1 | llama3.1-8b | lora | 4096 | TP=4, PP=1 | 4 | 1*node & 4*Ascend |
24 | llama3.1 | llama3.1-8b | pretrain/sft | 8192 | TP=4, PP=1 | 1 | 1*node & 4*Ascend |
24 | llama3.1 | llama3.1-8b | lora | 8192 | TP=4, PP=1 | 2 | 1*node & 4*Ascend |
25 | llama3.1 | llama3.1-70b | pretrain/sft | 4096 | TP=8, PP=4 | 1 | 4*node & 8*Ascend |
25 | llama3.1 | llama3.1-70b | lora | 4096 | TP=8, PP=2 | 4 | 2*node & 8*Ascend |
25 | llama3.1 | llama3.1-70b | pretrain/sft | 8192 | TP=8, PP=8 | 1 | 8*node & 8*Ascend |
25 | llama3.1 | llama3.1-70b | lora | 8192 | TP=8, PP=2 | 2 | 2*node & 8*Ascend |
26 | Qwen2.5 | qwen2.5-0.5b | pretrain/sft | 4096 | TP=1, PP=1 | 1 | 1*node & 1*Ascend |
26 | Qwen2.5 | qwen2.5-0.5b | lora | 4096 | TP=1, PP=1 | 2 | 1*node & 1*Ascend |
26 | Qwen2.5 | qwen2.5-0.5b | pretrain/sft | 8192 | TP=1, PP=1 | 1 | 1*node & 1*Ascend |
26 | Qwen2.5 | qwen2.5-0.5b | lora | 8192 | TP=1, PP=1 | 1 | 1*node & 1*Ascend |
27 | Qwen2.5 | qwen2.5-7b | pretrain/sft | 4096 | TP=4, PP=1 | 2 | 1*node & 4*Ascend |
27 | Qwen2.5 | qwen2.5-7b | lora | 4096 | TP=4, PP=1 | 4 | 1*node & 4*Ascend |
27 | Qwen2.5 | qwen2.5-7b | pretrain/sft | 8192 | TP=4, PP=2 | 1 | 1*node & 8*Ascend |
27 | Qwen2.5 | qwen2.5-7b | lora | 8192 | TP=4, PP=2 | 2 | 1*node & 8*Ascend |
28 | Qwen2.5 | qwen2.5-14b | pretrain/sft | 4096 | TP=8, PP=1 | 4 | 1*node & 8*Ascend |
28 | Qwen2.5 | qwen2.5-14b | lora | 4096 | TP=4, PP=1 | 4 | 1*node & 4*Ascend |
28 | Qwen2.5 | qwen2.5-14b | pretrain/sft | 8192 | TP=8, PP=1 | 2 | 1*node & 8*Ascend |
28 | Qwen2.5 | qwen2.5-14b | lora | 8192 | TP=8, PP=1 | 2 | 1*node & 8*Ascend |
29 | Qwen2.5 | qwen2.5-32b | pretrain/sft | 4096 | TP=8, PP=2 | 2 | 2*node & 8*Ascend |
29 | Qwen2.5 | qwen2.5-32b | lora | 4096 | TP=8, PP=2 | 4 | 2*node & 8*Ascend |
29 | Qwen2.5 | qwen2.5-32b | pretrain/sft | 8192 | TP=8, PP=2 | 1 | 2*node & 8*Ascend |
29 | Qwen2.5 | qwen2.5-32b | lora | 8192 | TP=8, PP=2 | 2 | 2*node & 8*Ascend |
30 | Qwen2.5 | qwen2.5-72b | pretrain/sft | 4096 | TP=8, PP=4 | 2 | 4*node & 8*Ascend |
30 | Qwen2.5 | qwen2.5-72b | lora | 4096 | TP=8, PP=4 | 4 | 4*node & 8*Ascend |
30 | Qwen2.5 | qwen2.5-72b | pretrain/sft | 8192 | TP=8, PP=8 | 1 | 8*node & 8*Ascend |
30 | Qwen2.5 | qwen2.5-72b | lora | 8192 | TP=8, PP=4 | 2 | 4*node & 8*Ascend |
31 | llama3.2 | llama3.2-1b | pretrain/sft | 4096 | TP=1, PP=1 | 2 | 1*node & 1*Ascend |
31 | llama3.2 | llama3.2-1b | lora | 4096 | TP=1, PP=1 | 2 | 1*node & 1*Ascend |
31 | llama3.2 | llama3.2-1b | pretrain/sft | 8192 | TP=1, PP=1 | 1 | 1*node & 1*Ascend |
31 | llama3.2 | llama3.2-1b | lora | 8192 | TP=1, PP=1 | 1 | 1*node & 1*Ascend |
32 | llama3.2 | llama3.2-3b | pretrain/sft | 4096 | TP=1, PP=2 | 2 | 1*node & 2*Ascend |
32 | llama3.2 | llama3.2-3b | lora | 4096 | TP=1, PP=1 | 2 | 1*node & 1*Ascend |
32 | llama3.2 | llama3.2-3b | pretrain/sft | 8192 | TP=1, PP=2 | 1 | 1*node & 2*Ascend |
32 | llama3.2 | llama3.2-3b | lora | 8192 | TP=1, PP=1 | 1 | 1*node & 1*Ascend |