
[Trainer] support sharding for trainer. #3352


Merged: 24 commits, Nov 15, 2022. Changes shown are from 14 commits.
31 changes: 31 additions & 0 deletions docs/trainer.md
@@ -395,6 +395,37 @@ Trainer is a simple but feature-complete Paddle training and evaluation module, and

The value of initial scale_loss for fp16. (default: 32768)

--sharding
Collaborator: You could add a NOTICE here describing which sharding options are currently usable.

Collaborator (author): Added.

Whether or not to use Paddle's Sharding Data Parallel training (in distributed training only).
The base option is `stage1`, `stage2` or `stage3`, and CPU offload can be combined with
`stage2` or `stage3`, e.g. `stage2 offload` or `stage3 offload`.
Each stage means:
stage1 : the optimizer states are sharded across different cards
stage2 : the optimizer states + gradients are sharded across different cards
stage3 : the parameters + gradients + optimizer states are all sharded across different cards
offload : offload parameters to CPU
(`str`, *optional*, defaults to `""`, i.e. sharding is not used)
NOTICE: stage3 is temporarily unavailable.
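
As a quick illustration, here is a minimal sketch of setting this option in code rather than on the command line; it assumes the `sharding` field added by this PR is exposed on `paddlenlp.trainer.TrainingArguments` and that training is launched with `paddle.distributed.launch`:

```python
# Minimal sketch; assumes paddlenlp.trainer.TrainingArguments exposes the
# `sharding` field added in this PR, and that the script runs under
# `python -m paddle.distributed.launch`.
from paddlenlp.trainer import TrainingArguments

training_args = TrainingArguments(
    output_dir="outputs/rte/",
    per_device_train_batch_size=16,
    # Base option: "stage1", "stage2" or "stage3"; "offload" may be appended
    # to stage2/stage3, e.g. "stage2 offload". stage3 is temporarily unavailable.
    sharding="stage2 offload",
)
print(training_args.sharding)  # "stage2 offload"
```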

--sharding_degree
Sets the size of the sharding communication group. Parameters inside the same sharding group
are sharded across the cards of that group; different sharding groups behave as plain data
parallelism between each other. This option only takes effect when `--sharding` is enabled.
For example, assume we use 2 machines with 8 cards each: setting sharding_degree=8 keeps
sharding communication inside each machine. The default -1 means all training cards belong
to one sharding group. (`int`, *optional*, defaults to `-1`)
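
To make the grouping concrete, the following small sketch (plain Python, not the Paddle API) shows which worker ranks would fall into the same sharding group for a given `sharding_degree`:

```python
# Illustrative only: which ranks end up in the same sharding group.
# The actual grouping is handled inside Paddle's distributed fleet/Trainer.
def sharding_group(rank: int, sharding_degree: int, world_size: int) -> list:
    """Return the ranks whose parameters are sharded together with `rank`."""
    degree = world_size if sharding_degree == -1 else sharding_degree
    start = (rank // degree) * degree
    return list(range(start, start + degree))

# 2 machines x 8 cards each, sharding_degree=8: sharding stays inside a machine,
# while the two groups act as plain data parallelism between each other.
print(sharding_group(rank=3, sharding_degree=8, world_size=16))   # [0, 1, ..., 7]
print(sharding_group(rank=11, sharding_degree=8, world_size=16))  # [8, 9, ..., 15]
```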

--recompute
Whether to use recompute during training, which saves GPU memory.
The forward pass is recomputed when obtaining gradients, reducing the memory held by intermediate activations.
41 changes: 41 additions & 0 deletions examples/language_model/t5/README.md
@@ -46,12 +46,53 @@ python run_glue.py \
- `scheduler_type` Scheduler type; `linear` or `cosine`, defaults to `linear`.
- `output_dir` The directory where the model is saved.

Fine-tuning with the Trainer:
```shell
python -m paddle.distributed.launch --gpus "0,1,2,3" run_glue_trainer.py \
--model_name_or_path t5-base \
--task_name rte \
--max_seq_length 256 \
--do_train \
--do_eval \
--per_device_train_batch_size 16 \
--per_device_eval_batch_size 64 \
--learning_rate 1e-4 \
--weight_decay 0.01 \
--warmup_ratio 0.1 \
--num_train_epochs 10 \
--eval_steps 200 \
--logging_steps 20 \
--save_steps 200 \
--save_total_limit 3 \
--metric_for_best_model "eval_accuarcy" \
--fp16 false \
--fp16_opt_level "O1" \
--recompute true \
--sharding "stage1" \
--overwrite_output_dir \
--disable_tqdm true \
--output_dir outputs/rte/
```
See https://paddlenlp.readthedocs.io/zh/latest/trainer.html for the meaning of each argument.

###### Results of the t5-base model on the GLUE dev sets:
| Model | cola | sst-2 | mrpc | sts-b | qqp | mnli | qnli | rte | mean |
|--------------------------------|-------|-------|-------------|------------------|-------------|-------------|------|-------|-------|
| | mcc | acc | acc | pearson | acc | acc | acc | acc | |
| T5-base-Paddle | 61.74 | 95.18 | 90.44 | 90.09 | 91.60 | 87.18 | 93.56 | 81.95 | 86.4675 |

###### Results of the t5_v1_1-base model on the GLUE dev sets:
Run with `run_glue_trainer.py`. Since `t5_v1_1-base` was not pre-trained on the GLUE tasks, the strategy of directly generating the label text needs a longer training time.

| Model | cola | sst-2 | mrpc | sts-b | qqp | mnli | qnli | rte |
|--------------------------------|-------|-------|-------------|------------------|-------------|-------------|------|-------|
| | mcc | acc | acc | pearson | acc | acc | acc | acc |
| T5-v1_1-base Paddle | 47.6845 | 94.38 | 84.31 | 87.74 | 88.05 | 85.39 | 90.518 | 65.70 |
| epoch | 100 | 10 | 100 | 100 | 3 | 3 | 10 | 100 |
Collaborator: Has the T5_v1_1_base performance been aligned?

Collaborator (author): As discussed offline.


Note:
- Fine-tuning by directly generating the label text is relatively hard: early in training the model mainly learns how to produce the label tokens correctly, and only later does it learn the classification task itself (see the sketch after this list).
- When designing the generated label words, making the labels more distinct from one another tends to give better results.
- Moderately increasing the number of training epochs for the `qqp` and `mnli` datasets can yield better results.
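
For a rough picture of what "directly generating the label" means, the sketch below builds a text-to-text source/target pair for an RTE example in the style of the `GLUE_1_1_PROCESSED` prompts and label words added in `data.py` further down in this diff; the helper function and example sentences are illustrative, not the actual `trans_func`:

```python
# Illustrative only: mirrors the prompt/label-word scheme of GLUE_1_1_PROCESSED
# in examples/language_model/t5/data.py; the real conversion is done by trans_func.
RTE_PROMPTS = ["rte sentence1: ", " rte sentence2: "]
RTE_LABEL_WORDS = ["entailment", "contradiction"]

def build_text_to_text(sentence1: str, sentence2: str, label: int):
    source = RTE_PROMPTS[0] + sentence1 + RTE_PROMPTS[1] + sentence2
    target = RTE_LABEL_WORDS[label]  # the model is trained to generate this word
    return source, target

src, tgt = build_text_to_text(
    "A new law was passed in 2020.", "A law was enacted.", label=0)
print(src)  # rte sentence1: A new law was passed in 2020. rte sentence2: A law was enacted.
print(tgt)  # entailment
```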

### GLUE Demo test

27 changes: 27 additions & 0 deletions examples/language_model/t5/data.py
@@ -50,6 +50,33 @@
),
])

# Maps each GLUE task name to (prompt segments inserted before each input
# sentence, label words the model learns to generate; None for the sts-b
# regression task).
GLUE_1_1_PROCESSED = collections.OrderedDict([
("cola", (["cola sentence: "], ["outrageous", "acceptable"])),
("sst-2", (["sst2 sentence: "], ["negative", "positive"])),
(
"mrpc",
(["mrpc sentence1: ", " sentence2: "], ["nonidentical", "equivalent"]),
),
("sts-b", (["stsb sentence1: ", " sentence2: "], None)),
("qqp", (["qqp question1: ", " question2: "], ["inequable", "duplicate"])),
(
"mnli",
(
["mnli hypothesis: ", " premise: "],
["contradiction", "entailment", "neutral"],
),
),
(
"qnli",
(["qnli question: ", " sentence: "], ["entailment", "contradiction"]),
),
(
"rte",
(["rte sentence1: ",
" rte sentence2: "], ["entailment", "contradiction"]),
),
])


def trans_func(example, tokenizer, args):
task_name = args.task_name