Commit ca22425

[Optimization] Support lower memory cards. (#9804)
* Support lower-memory cards.
* Add documentation for devices such as the V100 16G.
* Remove debug info.
* Add a pre-division factor to overcome the overflow problem in fp16 attention.
1 parent 1ca1d59 commit ca22425
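
The commit message mentions a pre-division factor for fp16 attention, but the code for it is not part of this excerpt. As a rough illustration of why dividing before the dot product can help in half precision, the sketch below emulates fp16 rounding with Python's `struct` format `e`. The helper names `to_fp16` and `fp16_dot`, the head dimension of 64, and the input values are all assumptions chosen for the demonstration; this is not the commit's implementation.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE half-precision value; overflow becomes inf."""
    try:
        return struct.unpack("e", struct.pack("e", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)


def fp16_dot(a, b):
    """Dot product with every product and partial sum rounded to fp16."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = to_fp16(acc + to_fp16(x * y))
    return acc


head_dim = 64                      # hypothetical attention head size
q = [40.0] * head_dim              # hypothetical query vector
k = [40.0] * head_dim              # hypothetical key vector
scale = 1.0 / math.sqrt(head_dim)  # the usual 1/sqrt(d) attention scaling

# Scaling after the dot product: the raw score 64 * 1600 = 102400 exceeds
# the fp16 maximum (65504), so it is already inf before the scale is applied.
post_scaled = to_fp16(fp16_dot(q, k) * scale)

# Pre-dividing q by sqrt(head_dim) keeps every intermediate value in range.
q_pre = [to_fp16(x * scale) for x in q]
pre_scaled = fp16_dot(q_pre, k)

print(post_scaled, pre_scaled)  # inf 12800.0
```

Both orderings are mathematically identical; only the order of operations changes which intermediate values must fit in fp16.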

23 files changed: +475 -73 lines changed
README.md

Lines changed: 4 additions & 4 deletions
@@ -166,7 +166,7 @@
 ### Environment Dependencies
 
 * python >= 3.8
-* paddlepaddle >= 3.0.0b0
+* paddlepaddle >= 3.0.0rc0
 
 If you have not installed PaddlePaddle yet, please refer to the [PaddlePaddle official website](https://www.paddlepaddle.org.cn/) for installation.
 
@@ -211,7 +211,7 @@ wget https://bj.bcebos.com/paddlenlp/models/transformers/llama/data/llama_openwe
 wget https://bj.bcebos.com/paddlenlp/models/transformers/llama/data/llama_openwebtext_100k.idx
 cd .. # change folder to PaddleNLP/llm
 # To use use_fused_rms_norm=true, first install fused_ln from slm/model_zoo/gpt-3/external_ops
-python -u -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" run_pretrain.py ./config/llama/pretrain_argument.json --use_fused_rms_norm false
+python -u run_pretrain.py ./config/qwen/pretrain_argument_0p5b.json
 ```
 
 ### LLM SFT Fine-tuning
@@ -221,7 +221,7 @@ git clone https://github.com/PaddlePaddle/PaddleNLP.git && cd PaddleNLP # if already
 mkdir -p llm/data && cd llm/data
 wget https://bj.bcebos.com/paddlenlp/datasets/examples/AdvertiseGen.tar.gz && tar -zxvf AdvertiseGen.tar.gz
 cd .. # change folder to PaddleNLP/llm
-python -u -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" run_finetune.py ./config/llama/sft_argument.json
+python -u run_finetune.py ./config/qwen/sft_argument_0p5b.json
 ```
 
 For more end-to-end LLM workflow steps, see the [PaddlePaddle LLM toolkit](./llm) introduction.
@@ -236,7 +236,7 @@ dataset = load_dataset("ZHUI/alpaca_demo", split="train")
 training_args = SFTConfig(output_dir="Qwen/Qwen2.5-0.5B-SFT", device="gpu")
 trainer = SFTTrainer(
     args=training_args,
-    model="Qwen/Qwen2.5-0.5B",
+    model="Qwen/Qwen2.5-0.5B-Instruct",
     train_dataset=dataset,
 )
 trainer.train()

llm/README.md

Lines changed: 56 additions & 15 deletions
@@ -37,6 +37,11 @@
 
 ## 🚀 Quick Start 🚀
 
+Before you start, you can first install the latest develop version of PaddleNLP:
+```shell
+pip install --pre --upgrade paddlenlp -f https://www.paddlepaddle.org.cn/whl/paddlenlp.html
+```
+
 ### 1. Pretraining
 
 PaddleNLP integrates PaddlePaddle's 4D parallel strategy into the Trainer API, so users only need to modify the Trainer configuration to use different distributed strategies. The LLM toolkit currently provides pretraining for models such as [LLaMA/LLaMA2/LLaMA3](./config/llama), [GPT-3](./config/gpt-3), [Qwen](./config/qwen), [Baichuan/Baichuan2](./config/baichuan), and [Mixtral](./config/mixtral), with support for more models being added continuously.
@@ -73,19 +78,30 @@ mkdir data
 mv llama_openwebtext_100k.bin ./data
 mv llama_openwebtext_100k.idx ./data
 ```
+Single-card training:
+```shell
+# Trainable with 16G of GPU memory
+python -u run_pretrain.py ./config/qwen/pretrain_argument_0p5b.json
+```
+- This configuration can be trained with 16G of GPU memory; enabling use_flash_attention, use_fused_rms_norm, and recompute saves further memory
+- If the options above cannot be enabled, or memory is still insufficient, enable `offload_optim`, which brings memory usage to about 11G: `python -u run_pretrain.py ./config/qwen/pretrain_argument_0p5b.json --offload_optim 1`
 
+High-performance, multi-card, and multi-node training:
 ```shell
 # Compile custom operators (optional)
 cd ../slm/model_zoo/gpt-3/external_ops/ && python3 setup.py install && cd -
 
-# Model pretraining reference
-python -u -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" run_pretrain.py ./config/llama/pretrain_argument.json
+# Multi-card model pretraining reference:
+python -u -m paddle.distributed.launch --devices "0,1,2,3,4,5,6,7" run_pretrain.py ./config/llama/pretrain_argument.json
+# Multi-node training reference: uses about 45G of GPU memory
+python -u -m paddle.distributed.launch --devices "0,1,2,3,4,5,6,7" --master=192.168.1.1:8090 --nnodes=2 run_pretrain.py ./config/llama/pretrain_argument.json
 ```
+- For more detailed distributed launch commands, see [here](https://www.paddlepaddle.org.cn/documentation/docs/zh/2.6/api/paddle/distributed/launch_cn.html#launch)
 
 Note:
 
 1. Training with the paddle develop version is recommended; install any missing whl packages, e.g. `pip install fast_dataindex visualdl==2.5.3`
-2. `use_flash_attention` must be enabled on A100 machines; a cuda11.8 environment is recommended
+2. `use_flash_attention` must be enabled on A100 or newer machines; cuda11.8 or later is recommended
 3. `use_fused_rms_norm` requires the custom operators to be installed. If the operators still cannot be found after installation, PYTHONPATH must also be set
 4. `continue_training` means training starts from an existing pretrained model. The initial loss of the 7b model is about 2.xx, while the loss of a randomly initialized model decreases from about 11.x.
 5. For multi-node training, if all machines use the same training data file location (for example, a mounted shared disk), specify `--share_folder true` so that the global card 0 builds the cached data. Otherwise, by default card 0 of each machine builds the cached data independently,
@@ -125,29 +141,45 @@ PaddleNLP supports SFT, PEFT, and other fine-tuning strategies for multiple mainstream LLMs, providing uni
 For convenient testing, we also provide a demo dataset from [tatsu-lab/alpaca](https://huggingface.co/datasets/tatsu-lab/alpaca) that can be used directly:
 
 ```shell
+# Run in the PaddleNLP/llm directory
 wget https://bj.bcebos.com/paddlenlp/datasets/examples/alpaca_demo.gz
 tar -xvf alpaca_demo.gz
 ```
 
 #### 2.2 Full-Parameter Fine-tuning: SFT
 
+Single card
+```bash
+# Requires about 12G of GPU memory
+python -u run_finetune.py ./config/qwen/sft_argument_0p5b.json
+# Single-card performance best practice with 16G of GPU memory; see this config for which switches to enable.
+# ./config/qwen/sft_argument_0p5b_best.json
+```
+
+Multiple cards
 ```bash
-# SFT launch command reference
-python -u -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" run_finetune.py ./config/llama/sft_argument.json
+# SFT launch command reference; requires about 45G of GPU memory
+python -u -m paddle.distributed.launch --devices "0,1,2,3,4,5,6,7" run_finetune.py ./config/qwen/sft_argument.json
 ```
 
 #### 2.3 LoRA
 
+LoRA launch command reference
 ```bash
-# LoRA launch command reference
-python run_finetune.py ./config/llama/lora_argument.json
+# Requires about 9G of GPU memory
+python run_finetune.py ./config/qwen/lora_argument_0p5b.json
+# Requires about 29G of GPU memory
+python run_finetune.py ./config/qwen/lora_argument.json
 ```
 
 #### 2.4 Prefix Tuning
 
+Prefix Tuning launch command reference
 ```bash
-# Prefix Tuning launch command reference
-python run_finetune.py ./config/llama/pt_argument.json
+# Requires about 10G of GPU memory
+python run_finetune.py ./config/qwen/pt_argument_0p5b.json
+# Requires about 30G of GPU memory
+python run_finetune.py ./config/qwen/pt_argument.json
 ```
 
 Besides LoRA and Prefix Tuning, many other fine-tuning algorithms are supported, including LoKr, VeRA, MoRA, ReFT, rsLoRA, LoRA+, PiSSA, and MoSLoRA. For more LLM fine-tuning documentation, training details, and results, see the [LLM fine-tuning tutorial](./docs/finetune.md)
@@ -192,18 +224,26 @@ tar -zxvf ultrafeedback_binarized.tar.gz
 
 ##### Full-Parameter DPO
 
+
 ```bash
-# DPO launch command reference
-python -u -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" ./alignment/dpo/run_dpo.py ./config/llama/dpo_argument.json
+# DPO launch command reference: 8-card training, requires about 40G of GPU memory
+python -u -m paddle.distributed.launch --devices "0,1,2,3,4,5,6,7" ./alignment/dpo/run_dpo.py ./config/llama/dpo_argument.json
+
+# Single-card training, requires about 26G of GPU memory
+python -u ./alignment/dpo/run_dpo.py ./config/qwen/dpo_argument_0p5b.json
 ```
 
 ##### LoRA DPO
 
 ```bash
 # DPO launch command reference
-python -u -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" ./alignment/dpo/run_dpo.py ./config/llama/dpo_lora_argument.json
+python -u -m paddle.distributed.launch --devices "0,1,2,3,4,5,6,7" ./alignment/dpo/run_dpo.py ./config/llama/dpo_lora_argument.json
 ```
 For more DPO technical details and usage instructions, see the [DPO documentation](./docs/dpo.md)
+```bash
+# Requires about 52G of GPU memory
+python -u ./alignment/dpo/run_dpo.py ./config/llama/dpo_lora_argument.json
+```
 
 #### 3.2 KTO
 
@@ -240,13 +280,13 @@ tar -zxvf ultrafeedback_binarized.tar.gz
 
 ```bash
 # KTO launch command reference
-python -u -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" ./alignment/kto/run_kto.py ./config/llama/kto_argument.json
+python -u -m paddle.distributed.launch --devices "0,1,2,3,4,5,6,7" ./alignment/kto/run_kto.py ./config/llama/kto_argument.json
 ```
 ##### LoRA KTO
 
 ```bash
 # KTO launch command reference
-python -u -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" ./alignment/kto/run_kto.py ./config/llama/kto_lora_argument.json
+python -u -m paddle.distributed.launch --devices "0,1,2,3,4,5,6,7" ./alignment/kto/run_kto.py ./config/llama/kto_lora_argument.json
 ```
 
 #### 3.3 RLHF
@@ -362,7 +402,8 @@ python ./predict/predictor.py --model_name_or_path ./inference --inference_model
 
 Serving deployment script
 
-```shell
+```shell
+# Single card; paddle.distributed.launch can be used to start multi-card inference
 python ./predict/flask_server.py \
     --model_name_or_path Qwen/Qwen2.5-0.5B-Instruct \
     --port 8010 \

llm/config/llama/dpo_argument.json

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
 {
-    "model_name_or_path": "meta-llama/Meta-Llama-3-8B",
+    "model_name_or_path": "meta-llama/Meta-Llama-3-8B-Instruct",
     "train_dataset_path": "./data/train.jsonl",
     "dev_dataset_path": "./data/dev.jsonl",
     "output_dir": "./checkpoints/dpo_ckpts",

llm/config/llama/pretrain_argument.json

Lines changed: 1 addition & 1 deletion
@@ -28,7 +28,7 @@
     "warmup_ratio": 0.01,
     "max_grad_norm": 1.0,
     "dataloader_num_workers": 1,
-    "continue_training": 1,
+    "continue_training": 0,
     "do_train": true,
     "do_eval": true,
     "do_predict": true,
Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@
+{
+    "model_name_or_path": "Qwen/Qwen2.5-0.5B-Instruct",
+    "train_dataset_path": "./data/train.jsonl",
+    "dev_dataset_path": "./data/dev.jsonl",
+    "output_dir": "./checkpoints/dpo_ckpts",
+    "per_device_train_batch_size": 1,
+    "gradient_accumulation_steps": 8,
+    "per_device_eval_batch_size": 1,
+    "num_train_epochs": 1,
+    "max_steps": 100,
+    "learning_rate": 1e-06,
+    "warmup_steps": 10,
+    "logging_steps": 1,
+    "evaluation_strategy": "steps",
+    "save_strategy": "steps",
+    "eval_steps": 100,
+    "save_steps": 500,
+    "max_seq_len": 2048,
+    "max_prompt_len": 1024,
+    "fp16": true,
+    "fp16_opt_level": "O2",
+    "do_train": true,
+    "do_eval": true,
+    "disable_tqdm": true,
+    "load_best_model_at_end": true,
+    "tensor_parallel_degree": 1,
+    "sharding": "stage1",
+    "use_flash_attention": false,
+    "flash_mask": false,
+    "recompute": true,
+    "recompute_granularity": "full",
+    "benchmark": false,
+    "unified_checkpoint": true,
+    "autotuner_benchmark":false,
+    "beta": 0.1,
+    "loss_type": "sigmoid",
+    "greedy_zero_padding": false,
+    "label_smoothing": 0.0
+}
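
The DPO config above sets `"loss_type": "sigmoid"`, `"beta": 0.1`, and `"label_smoothing": 0.0`. A minimal sketch of the sigmoid DPO objective as it is commonly written follows, to show what those three values control. The function names and the sample log-probabilities are hypothetical, and PaddleNLP's own implementation is not shown in this diff and may differ in detail.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    return min(x, 0.0) - math.log1p(math.exp(-abs(x)))


def dpo_sigmoid_loss(policy_chosen_logp, policy_rejected_logp,
                     ref_chosen_logp, ref_rejected_logp,
                     beta=0.1, label_smoothing=0.0):
    """Sigmoid DPO loss for one preference pair of sequence log-probabilities."""
    # How much more the policy prefers the chosen answer than the reference does.
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    logits = beta * margin
    # label_smoothing > 0 assumes a fraction of preference labels are flipped.
    return (-(1.0 - label_smoothing) * log_sigmoid(logits)
            - label_smoothing * log_sigmoid(-logits))


# Hypothetical log-probabilities: policy equal to the reference, then improved.
at_init = dpo_sigmoid_loss(-10.0, -12.0, -10.0, -12.0)
improved = dpo_sigmoid_loss(-8.0, -14.0, -10.0, -12.0)
print(at_init, improved)
```

When the policy equals the reference model the margin is zero and the loss is log 2, which is a handy value to look for in the first logged steps.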

llm/config/qwen/lora_argument.json

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@
     "output_dir": "./checkpoints/lora_ckpts",
     "per_device_train_batch_size": 4,
     "gradient_accumulation_steps": 4,
-    "per_device_eval_batch_size": 8,
+    "per_device_eval_batch_size": 4,
     "eval_accumulation_steps":16,
     "num_train_epochs": 3,
     "learning_rate": 3e-04,
Lines changed: 34 additions & 0 deletions
@@ -0,0 +1,34 @@
+{
+    "model_name_or_path": "Qwen/Qwen2.5-0.5B-Instruct",
+    "dataset_name_or_path": "./data",
+    "output_dir": "./checkpoints/lora_ckpts",
+    "per_device_train_batch_size": 2,
+    "gradient_accumulation_steps": 8,
+    "per_device_eval_batch_size": 2,
+    "eval_accumulation_steps": 32,
+    "num_train_epochs": 3,
+    "learning_rate": 3e-04,
+    "warmup_steps": 30,
+    "logging_steps": 1,
+    "evaluation_strategy": "epoch",
+    "save_strategy": "epoch",
+    "src_length": 1024,
+    "max_length": 2048,
+    "fp16": true,
+    "fp16_opt_level": "O2",
+    "do_train": true,
+    "do_eval": true,
+    "disable_tqdm": true,
+    "load_best_model_at_end": true,
+    "eval_with_do_generation": false,
+    "metric_for_best_model": "accuracy",
+    "recompute": true,
+    "save_total_limit": 1,
+    "tensor_parallel_degree": 1,
+    "pipeline_parallel_degree": 1,
+    "lora": true,
+    "unified_checkpoint": true,
+    "zero_padding": false,
+    "use_flash_attention": false,
+    "pissa": false
+}
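
Compared with `lora_argument.json` (per-device batch 4, accumulation 4), the low-memory LoRA config above uses a per-device batch of 2 with 8 accumulation steps, so both reach 16 samples per optimizer step on each device. The sketch below uses a toy one-parameter model to show that averaging gradients over equally sized micro-batches matches the full-batch gradient. The model, data, and function names are hypothetical and unrelated to PaddleNLP's trainer.

```python
def grad_mse(w, batch):
    """Gradient of the mean squared error for the toy model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w, data, micro_batch_size):
    """Average of per-micro-batch gradients, as gradient accumulation computes it."""
    chunks = [data[i:i + micro_batch_size]
              for i in range(0, len(data), micro_batch_size)]
    return sum(grad_mse(w, chunk) for chunk in chunks) / len(chunks)


# 16 toy samples, i.e. one effective batch under either config.
data = [(float(i), 3.0 * i + 1.0) for i in range(16)]
w = 0.5

g_4x4 = accumulated_grad(w, data, micro_batch_size=4)  # batch 4, 4 accumulation steps
g_2x8 = accumulated_grad(w, data, micro_batch_size=2)  # batch 2, 8 accumulation steps
g_full = grad_mse(w, data)                             # all 16 samples at once
print(g_4x4, g_2x8, g_full)
```

Smaller micro-batches lower peak activation memory at the cost of more forward and backward passes per optimizer step.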
Lines changed: 40 additions & 0 deletions
@@ -0,0 +1,40 @@
+{
+    "model_name_or_path": "Qwen/Qwen2.5-0.5B",
+    "tokenizer_name_or_path": "Qwen/Qwen2.5-0.5B",
+    "input_dir": "./data",
+    "output_dir": "./checkpoints/pretrain_ckpts",
+    "per_device_train_batch_size": 1,
+    "gradient_accumulation_steps": 1,
+    "per_device_eval_batch_size": 2,
+    "tensor_parallel_degree": 1,
+    "pipeline_parallel_degree": 1,
+    "sharding": "stage2",
+    "virtual_pp_degree": 1,
+    "sequence_parallel": 0,
+    "use_flash_attention": false,
+    "use_fused_rms_norm": false,
+    "max_seq_length": 1024,
+    "learning_rate": 3e-05,
+    "min_learning_rate": 3e-06,
+    "warmup_steps": 30,
+    "logging_steps": 1,
+    "max_steps": 10000,
+    "save_steps": 5000,
+    "eval_steps": 1000,
+    "weight_decay": 0.01,
+    "fp16": true,
+    "fp16_opt_level": "O2",
+    "warmup_ratio": 0.01,
+    "max_grad_norm": 1.0,
+    "dataloader_num_workers": 1,
+    "continue_training": 0,
+    "do_train": true,
+    "do_eval": true,
+    "do_predict": true,
+    "disable_tqdm": true,
+    "recompute": false,
+    "distributed_dataloader": 1,
+    "recompute_granularity": "full",
+    "unified_checkpoint": true,
+    "save_total_limit": 2
+}
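
The pretraining config above gives a peak `learning_rate` of 3e-05, a `min_learning_rate` of 3e-06, `warmup_steps` of 30, and `max_steps` of 10000. The sketch below shows one plausible reading of those values: linear warmup followed by cosine decay down to the minimum. The cosine shape, the choice to decay over exactly `max_steps`, and the assumption that `warmup_steps` takes precedence over `warmup_ratio` are not stated anywhere in this diff; the function `lr_at` is hypothetical.

```python
import math


def lr_at(step, max_lr=3e-5, min_lr=3e-6, warmup_steps=30, decay_steps=10000):
    """Learning rate at a given optimizer step (assumed warmup + cosine decay)."""
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak learning rate.
        return max_lr * step / warmup_steps
    if step >= decay_steps:
        # After the decay horizon, hold at the floor.
        return min_lr
    progress = (step - warmup_steps) / (decay_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, 15, 30, 5015, 10000):
    print(step, lr_at(step))
```

With a per-device batch of 1, no accumulation, and `max_seq_length` of 1024, each optimizer step in this config covers 1024 tokens per device.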

llm/config/qwen/pt_argument.json

Lines changed: 2 additions & 2 deletions
@@ -4,8 +4,8 @@
     "output_dir": "./checkpoints/pt_ckpts",
     "per_device_train_batch_size": 4,
     "gradient_accumulation_steps": 4,
-    "per_device_eval_batch_size": 8,
-    "eval_accumulation_steps":16,
+    "per_device_eval_batch_size": 4,
+    "eval_accumulation_steps": 32,
     "num_train_epochs": 3,
     "learning_rate": 3e-02,
     "warmup_steps": 30,

llm/config/qwen/pt_argument_0p5b.json

Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
+{
+    "model_name_or_path": "Qwen/Qwen2.5-0.5B-Instruct",
+    "dataset_name_or_path": "./data",
+    "output_dir": "./checkpoints/pt_ckpts",
+    "per_device_train_batch_size": 2,
+    "gradient_accumulation_steps": 8,
+    "per_device_eval_batch_size": 4,
+    "eval_accumulation_steps": 32,
+    "num_train_epochs": 3,
+    "learning_rate": 3e-02,
+    "warmup_steps": 30,
+    "logging_steps": 1,
+    "evaluation_strategy": "epoch",
+    "save_strategy": "epoch",
+    "src_length": 1024,
+    "max_length": 2048,
+    "fp16": true,
+    "fp16_opt_level": "O2",
+    "do_train": true,
+    "do_eval": true,
+    "disable_tqdm": true,
+    "load_best_model_at_end": true,
+    "eval_with_do_generation": false,
+    "metric_for_best_model": "accuracy",
+    "recompute": true,
+    "save_total_limit": 1,
+    "tensor_parallel_degree": 1,
+    "pipeline_parallel_degree": 1,
+    "prefix_tuning": true,
+    "use_flash_attention": false
+}
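
The README commands in this commit pass a JSON config followed by trailing flags, for example `--offload_optim 1` or `--use_fused_rms_norm false`. The sketch below shows one simple way such a merge could work, to make the precedence explicit: values from the command line replace values from the file. The function `apply_overrides` is hypothetical; how PaddleNLP's own argument parser combines the two is not shown in this diff.

```python
import json


def apply_overrides(config, argv):
    """Return a copy of config with `--key value` pairs from argv applied on top."""
    merged = dict(config)
    args = iter(argv)
    for flag in args:
        if not flag.startswith("--"):
            raise ValueError(f"expected a --flag, got {flag!r}")
        raw = next(args, None)
        if raw is None:
            raise ValueError(f"missing value for {flag}")
        try:
            # Parse numbers and true/false the same way the JSON file would.
            value = json.loads(raw)
        except json.JSONDecodeError:
            value = raw  # keep anything else as a plain string
        merged[flag[2:]] = value
    return merged


# A small excerpt of the config above, plus two example flags.
file_config = {"recompute": True, "use_flash_attention": False}
merged = apply_overrides(
    file_config, ["--offload_optim", "1", "--use_flash_attention", "true"]
)
print(merged)
```

Keeping the low-memory defaults in the JSON file and toggling individual switches on the command line makes it easy to try one memory-saving option at a time.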
