[Trainer] support sharding for trainer. #3352
Conversation
    paddle.distributed.all_reduce(
        p.bw_storage, group=self.dp_group)
elif (args.recompute and args.local_rank != -1):
    fused_allreduce_gradients(list(model.parameters()),
                              None)

if self.do_grad_scaling:
@haohongxiang The scaler usage experience here should stay consistent with the official scaler.
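For context, a minimal sketch (not the PR's code, and assuming Paddle >= 2.2 where `GradScaler.step`/`update` are available) of the standard `paddle.amp.GradScaler` flow the reviewer is asking to match:

```python
import paddle

model = paddle.nn.Linear(4, 4)
optimizer = paddle.optimizer.AdamW(parameters=model.parameters())
scaler = paddle.amp.GradScaler(init_loss_scaling=32768)  # fp16 initial loss scale

x = paddle.randn([8, 4])
with paddle.amp.auto_cast():
    loss = model(x).mean()

scaled = scaler.scale(loss)   # scale the loss to avoid fp16 gradient underflow
scaled.backward()
scaler.step(optimizer)        # unscales gradients, then runs optimizer.step()
scaler.update()               # adjusts the loss scaling for the next iteration
optimizer.clear_grad()
```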
@@ -1117,11 +1266,25 @@ def _save_checkpoint(self, model, metrics=None):

    self.save_model(output_dir)

    if self.sharding is not None:
@haohongxiang Please provide an interface that gathers the parameters to CPU on the rank 0 card.
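One possible shape of such an interface, as a hypothetical sketch only: the helper name and the assumption that each rank holds an even dim-0 shard of every parameter are mine, and it is meant to run under `python -m paddle.distributed.launch`, not to mirror the PR's actual implementation.

```python
import paddle
import paddle.distributed as dist

def gather_full_state_dict_to_cpu(model, group=None):
    """Hypothetical helper: every rank contributes its shard, rank 0 keeps a CPU copy."""
    full_state = {}
    for name, param in model.state_dict().items():
        shards = []
        dist.all_gather(shards, param, group=group)  # collect this parameter's shards
        if dist.get_rank() == 0:
            # reassemble along dim 0 and move the full tensor off the GPU
            full_state[name] = paddle.concat(shards, axis=0).cpu()
    return full_state
```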
paddlenlp/trainer/trainer_base.py
Outdated
if self.do_grad_scaling:
    self.scaler.minimize(self.optimizer, tr_loss)
    # TODO: fix sharding stage2 stage3 with original scaler usage.
There is a problem with the API usage here.
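A rough sketch of how sharding stage2/stage3 usually obtains a scaler that works with `minimize()` in Paddle >= 2.3 (public `group_sharded_parallel` API, not the Trainer's internals; it must be launched with `python -m paddle.distributed.launch` on GPUs):

```python
import paddle
from paddle.distributed.sharding import group_sharded_parallel

paddle.distributed.init_parallel_env()

model = paddle.nn.Linear(4, 4)
optimizer = paddle.optimizer.AdamW(parameters=model.parameters())
scaler = paddle.amp.GradScaler(init_loss_scaling=32768)

# level="os_g" roughly corresponds to sharding stage2 (optimizer state + gradients);
# the returned scaler knows how to unscale gradients held in sharded storages.
model, optimizer, scaler = group_sharded_parallel(
    model, optimizer, level="os_g", scaler=scaler)
```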
@@ -395,6 +395,35 @@ Trainer is a simple but feature-complete Paddle training and evaluation module, and …

    The value of initial scale_loss for fp16. (default: 32768)

    --sharding
You could add a NOTICE here describing the currently supported states.
Added.
|---------------------|---------|-------|-------|---------|-------|-------|--------|-------|
|                     | mcc     | acc   | acc   | pearson | acc   | acc   | acc    | acc   |
| T5-v1_1-base Paddle | 47.6845 | 94.38 | 84.31 | 87.74   | 88.05 | 85.39 | 90.518 | 65.70 |
| epoch               | 100     | 10    | 100   | 100     | 3     | 3     | 10     | 100   |
Has the T5_v1_1_base accuracy been aligned?
As discussed offline.
if self.label2id:
    label = self.label2id[label]
    if pred not in self.label2id:
        pred = 0
Why is pred = 1 here when the label is 0?
Please explain this logic in more detail.
When the generated label is not in the label list, it ends up being assigned label 0. This situation never occurs with encoder models, so could you explain specifically how the metrics were aligned?
Comments added.
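My reading of the snippet above, as a self-contained sketch (the helper name `to_ids` and the example label list are mine, not the PR's code): for a generation-based model the predicted "label" is decoded text, so it may fall outside `label2id`; the code falls back to id 0 so the metric can still be computed, whereas an encoder-style classifier (argmax over a fixed label set) never produces an out-of-vocabulary prediction.

```python
label2id = {"entailment": 0, "not_entailment": 1}  # example label list

def to_ids(label_text, pred_text):
    label = label2id[label_text]      # gold labels are always in the list
    if pred_text not in label2id:     # decoded text can be arbitrary
        pred = 0                      # fall back to a fixed id; counted as wrong
    else:                             # unless the gold label happens to be 0
        pred = label2id[pred_text]
    return label, pred

print(to_ids("not_entailment", "maybe"))  # -> (1, 0)
```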
LGTM for sharding+dp
@@ -971,7 +971,8 @@ def greedy_search(self, input_ids, logits_processors, max_length,
         probs = F.softmax(logits)
         probs = paddle.log(probs)
         next_tokens = paddle.argmax(probs, axis=-1).unsqueeze(-1)
-        next_scores = paddle.index_sample(probs, next_tokens)
+        next_scores = paddle.index_sample(probs.astype("float32"),
+                                          next_tokens)
Here, index_sample has no fp16/bf16 kernel.
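A small illustration of the workaround, assuming a GPU device where fp16 softmax/log kernels are available; the point is only that `paddle.index_sample` is reported to lack fp16/bf16 kernels, so its input is cast to float32 first:

```python
import paddle
import paddle.nn.functional as F

logits = paddle.randn([2, 8]).astype("float16")  # half-precision decoder logits (AMP case)
probs = paddle.log(F.softmax(logits))
next_tokens = paddle.argmax(probs, axis=-1).unsqueeze(-1)
# index_sample has no fp16/bf16 kernel, so cast its input to float32 before sampling
next_scores = paddle.index_sample(probs.astype("float32"), next_tokens)
```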
    f"{self.dtype} not recognized. `dtype` should be set to either `paddle.float32` or `paddle.float16`"
)
encoder_extended_attention_mask = (
    1.0 - encoder_extended_attention_mask) * -1e4
For bf16 dtype
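A sketch of the additive attention-mask bias in mixed precision (my illustration, not the file's final code): positions to keep get 0, masked positions get a large negative value. -1e4 is representable in both fp16 and bf16, so supporting bf16 is mainly a matter of accepting it in the dtype check rather than choosing a different constant.

```python
import paddle

attention_mask = paddle.to_tensor([[1, 1, 0, 0]], dtype="float32")  # 1 = keep, 0 = mask out
extended_mask = (1.0 - attention_mask) * -1e4                        # added to attention scores
print(extended_mask)  # [[0., 0., -10000., -10000.]]
```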
            labels.flatten())
        loss = loss_fct(
            lm_logits.reshape(
                shape=[-1, lm_logits.shape[-1]]).astype("float32"),
CrossEntropyLoss has no fp16/bf16 kernel.
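A small standalone illustration of the cast above (shapes and tensors are made up): the logits are cast to float32 before computing the loss because CrossEntropyLoss lacks fp16/bf16 kernels.

```python
import paddle

loss_fct = paddle.nn.CrossEntropyLoss()
lm_logits = paddle.randn([2, 3, 8]).astype("float16")  # [batch, seq_len, vocab] under AMP
labels = paddle.randint(0, 8, shape=[2, 3])

loss = loss_fct(
    lm_logits.reshape(shape=[-1, lm_logits.shape[-1]]).astype("float32"),
    labels.flatten())
```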
LGTM
LGTM
PR types
New features
PR changes
APIs
Description
Support sharding for trainer.
stage1: supported
stage2: partially supported
stage3: not supported yet
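For reference, a minimal sketch of enabling the new option through `TrainingArguments`; the field names `fp16`, `scale_loss`, and `sharding` are taken from the docs diff and PR description above, and the exact accepted values should be treated as assumptions rather than the final API.

```python
from paddlenlp.trainer import TrainingArguments

# Assumed usage of the flag added in this PR: stage1 is fully supported,
# stage2 partially, stage3 not yet (see the support matrix above).
args = TrainingArguments(
    output_dir="./checkpoints",
    fp16=True,
    scale_loss=32768,    # initial loss scale for fp16, as documented above
    sharding="stage1",   # currently the safest choice per the PR description
)
```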