Commit b35b8d6

[Trainer] support sharding for trainer. (#3352)

* support sharding for trainer.
* support sharding for t5.
* add t5-3b
* add bf16 support.
* fix sharding stage2
* support iterable dataset.
* fix save strategy.

1 parent 25473a8 commit b35b8d6

20 files changed: +1820 −156 lines

docs/trainer.md

Lines changed: 31 additions & 0 deletions

@@ -395,6 +395,37 @@ Trainer is a simple but feature-complete Paddle training and evaluation module, and
    The value of initial scale_loss for fp16. (default: 32768)

    --sharding
        Whether or not to use Paddle Sharding Data Parallel training (in distributed
        training only). The base option should be `stage1`, `stage2` or `stage3`, and
        you can add CPU offload to `stage2` or `stage3` like this: `stage2 offload` or
        `stage3 offload`.
        Each stage means:
            stage1 : shard the optimizer states across cards
            stage2 : shard the optimizer states + gradients across cards
            stage3 : shard the parameters + gradients + optimizer states across cards
            offload : offload parameters to CPU
        (`str`, *optional*, defaults to `` , i.e. sharding disabled)
        NOTICE: stage3 is temporarily unavailable.
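The same option can also be set in code. A minimal sketch, assuming PaddleNLP's `TrainingArguments` exposes a `sharding` string field exactly as documented above (field name and default not verified against the final API):

```python
from paddlenlp.trainer import TrainingArguments

# Sketch only: shard optimizer states + gradients across cards and
# offload parameters to CPU, i.e. the "stage2 offload" combination above.
training_args = TrainingArguments(
    output_dir="outputs/rte/",   # any output path
    sharding="stage2 offload",   # "" (the default) disables sharding
)
```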
    --sharding_degree
        Size of the sharding communication group. Parameters are sharded across the
        cards within one sharding group, while different sharding groups behave as
        plain data parallelism with respect to each other. This option only takes
        effect when `--sharding` is enabled.
        For example, assume we use 2 machines with 8 cards each: setting
        sharding_degree=8 keeps sharding communication inside each machine.
        The default -1 means all training cards are sharded within a single group,
        i.e. parameters are sharded across all workers.
        (`int`, *optional*, defaults to `-1`)
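To make the grouping arithmetic concrete, here is a small illustrative helper (hypothetical, not part of PaddleNLP) that maps a worker rank to its sharding group under the semantics described above:

```python
def sharding_group(rank, world_size, sharding_degree):
    """Return the ranks that shard parameters together with `rank`."""
    if sharding_degree == -1:  # default: one group spanning all workers
        sharding_degree = world_size
    start = (rank // sharding_degree) * sharding_degree
    return list(range(start, start + sharding_degree))

# 2 machines x 8 cards with sharding_degree=8: sharding communication
# stays inside each machine; the two machines are data-parallel replicas.
assert sharding_group(rank=3, world_size=16, sharding_degree=8) == list(range(0, 8))
assert sharding_group(rank=11, world_size=16, sharding_degree=8) == list(range(8, 16))
```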
    --recompute
        Whether to use recompute training, which can save GPU memory.
        The forward pass is recomputed to obtain gradients, reducing the memory
        taken by intermediate activations.

examples/language_model/t5/README.md

Lines changed: 41 additions & 0 deletions

@@ -46,12 +46,53 @@ python run_glue.py \
- `scheduler_type` The scheduler type; linear or cosine, defaults to linear.
- `output_dir` The directory where the model is saved.
Fine-tune with the trainer:

```shell
python -m paddle.distributed.launch --gpus "0,1,2,3" run_glue_trainer.py \
    --model_name_or_path t5-base \
    --task_name rte \
    --max_seq_length 256 \
    --do_train \
    --do_eval \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 64 \
    --learning_rate 1e-4 \
    --weight_decay 0.01 \
    --warmup_ratio 0.1 \
    --num_train_epochs 10 \
    --eval_steps 200 \
    --logging_steps 20 \
    --save_steps 200 \
    --save_total_limit 3 \
    --metric_for_best_model "eval_accuracy" \
    --fp16 false \
    --fp16_opt_level "O1" \
    --recompute true \
    --sharding "stage1" \
    --overwrite_output_dir \
    --disable_tqdm true \
    --output_dir outputs/rte/
```

For detailed parameter descriptions, see: https://paddlenlp.readthedocs.io/zh/latest/trainer.html
###### Results of t5-base on the GLUE dev sets:

| Model          | cola  | sst-2 | mrpc  | sts-b   | qqp   | mnli  | qnli  | rte   | mean    |
|----------------|-------|-------|-------|---------|-------|-------|-------|-------|---------|
|                | mcc   | acc   | acc   | pearson | acc   | acc   | acc   | acc   |         |
| T5-base-Paddle | 61.74 | 95.18 | 90.44 | 90.09   | 91.60 | 87.18 | 93.56 | 81.95 | 86.4675 |
###### Results of t5_v1_1-base on the GLUE dev sets:

Run with `run_glue_trainer.py`. Since `t5_v1_1-base` has not been trained on the GLUE tasks, the strategy of directly generating the label text needs a longer training time.

| Model               | cola    | sst-2 | mrpc  | sts-b   | qqp   | mnli  | qnli   | rte   |
|---------------------|---------|-------|-------|---------|-------|-------|--------|-------|
|                     | mcc     | acc   | acc   | pearson | acc   | acc   | acc    | acc   |
| T5-v1_1-base Paddle | 47.6845 | 94.38 | 84.31 | 87.74   | 88.05 | 85.39 | 90.518 | 65.70 |
| epoch               | 100     | 10    | 100   | 100     | 3     | 3     | 10     | 100   |

Notes:
- Fine-tuning by directly generating the label text is fairly hard: early in training the model mainly learns to produce well-formed label strings, and only later does it learn the classification task itself.
- When designing the generated label strings, making the labels more distinct from one another gives better results.
- Moderately increasing the number of training epochs on the `qqp` and `mnli` datasets can yield better results.
### GLUE Demo Test
examples/language_model/t5/data.py

Lines changed: 27 additions & 0 deletions

@@ -50,6 +50,33 @@
    ),
])
GLUE_1_1_PROCESSED = collections.OrderedDict([
    ("cola", (["cola sentence: "], ["outrageous", "acceptable"])),
    ("sst-2", (["sst2 sentence: "], ["negative", "positive"])),
    (
        "mrpc",
        (["mrpc sentence1: ", " sentence2: "], ["nonidentical", "equivalent"]),
    ),
    ("sts-b", (["stsb sentence1: ", " sentence2: "], None)),
    ("qqp", (["qqp question1: ", " question2: "], ["inequable", "duplicate"])),
    (
        "mnli",
        (
            ["mnli hypothesis: ", " premise: "],
            ["contradiction", "entailment", "neutral"],
        ),
    ),
    (
        "qnli",
        (["qnli question: ", " sentence: "], ["entailment", "contradiction"]),
    ),
    (
        "rte",
        (["rte sentence1: ", " rte sentence2: "], ["entailment", "contradiction"]),
    ),
])
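For clarity, here is an illustrative sketch (not part of the committed code) of how an entry in `GLUE_1_1_PROCESSED` pairs the prompt prefixes with the input sentences and supplies the generated label text; the committed logic lives in `trans_func` below:

```python
# Illustrative only: build a prompted source/target pair from the mapping above.
prefixes, labels = GLUE_1_1_PROCESSED["rte"]
sentence1 = "The cat sat on the mat."
sentence2 = "A cat is on a mat."
source = prefixes[0] + sentence1 + prefixes[1] + sentence2
# -> "rte sentence1: The cat sat on the mat. rte sentence2: A cat is on a mat."
target = labels[0]  # "entailment" when the pair carries the first label index
```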
def trans_func(example, tokenizer, args):
    task_name = args.task_name