
fix model run error when use auto parallel and recompute(use_reentrant=false) #65188


Conversation

@jeff41404 (Contributor) commented Jun 14, 2024

PR Category

Auto Parallel

PR Types

Bug fixes

Description

pcard-84677
The model raises an error and core dumps when the following two conditions are met:

class Large_model(......):  # a subclass of paddle.nn.Layer
    def __init__(
        self,
        ......
    ):
        super().__init__()
        ......
        self.layers = nn.LayerList()
        for i in range(num_layers):
            # condition 1: parameters are sharded with auto parallel, e.g. the parameters inside TransformerBlock
            self.layers.append(TransformerBlock(......))

    def forward(self, x, label):
        ......
        for i, layer in enumerate(self.layers):
            # condition 2: recompute is used with use_reentrant=False
            x = paddle.distributed.fleet.utils.recompute(layer, x, use_reentrant=False)
        ......
        return loss

model = Large_model(......)
for step, inputs in enumerate(dataloader):
    # inputs is a (data, label) pair, matching forward(self, x, label)
    loss = model(*inputs)
    loss.backward()  # errors and core dumps; this PR fixes the issue
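
For reference, here is a minimal, self-contained sketch of the failing pattern. It is illustrative only, not the PR's own test: it assumes Paddle's newer auto-parallel API (dist.ProcessMesh, dist.shard_tensor, dist.Replicate) and a one-card mesh so it can run on a single device, and Block with its shapes is a hypothetical stand-in for TransformerBlock.

import paddle
import paddle.distributed as dist
from paddle.distributed.fleet.utils import recompute

mesh = dist.ProcessMesh([0], dim_names=["x"])

class Block(paddle.nn.Layer):  # hypothetical stand-in for TransformerBlock
    def __init__(self, hidden=8):
        super().__init__()
        # condition 1: the block's parameter is placed on a ProcessMesh
        # (replicated here; sharded placements exercise the same code path)
        self.w = dist.shard_tensor(
            self.create_parameter(shape=[hidden, hidden]),
            mesh,
            [dist.Replicate()],
        )

    def forward(self, x):
        return paddle.matmul(x, self.w)

layer = Block()
x = paddle.randn([2, 8])
x.stop_gradient = False
# condition 2: non-reentrant recompute
y = recompute(layer, x, use_reentrant=False)
y.sum().backward()  # errored and core-dumped before this fix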

paddle-bot bot commented Jun 14, 2024

Your PR has been submitted. Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

@jeff41404 changed the title from "fix model run error when auto parallel and recompute(use_reentrant=false)" to "fix model run error when use auto parallel and recompute(use_reentrant=false)" on Jun 17, 2024
@zhiqiu (Contributor) left a comment

LGTM

@JiabinYang (Contributor) left a comment

LGTM for Tensor construction

@jeff41404 jeff41404 merged commit 86206c8 into PaddlePaddle:develop Jun 20, 2024
32 of 33 checks passed
@jeff41404 jeff41404 deleted the fix_auto_parallel_recompute_use_reentrant_false_issue branch June 20, 2024 10:19
co63oc pushed a commit to co63oc/Paddle that referenced this pull request Jun 25, 2024
fix model run error when use auto parallel and recompute(use_reentrant=false) (PaddlePaddle#65188)

* fix model run error when auto parallel and recompute and use_reentrant=false

* solve the defect of TensorWrapper not considering DistTensor

* add unittest

* fix recompute not supporting CPU when use_reentrant is False
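
The second bullet points at the root cause: with use_reentrant=False, Paddle's recompute is built on saved-tensor hooks rather than a custom PyLayer, so every activation kept for backward (held in a TensorWrapper on the C++ side) passes through that saving path, and under auto parallel those activations can be DistTensor. Below is a minimal sketch of the hook mechanism using the public paddle.autograd.saved_tensors_hooks API; it illustrates the code path only and is not the PR's C++ change.

import paddle

def pack(tensor):
    # The non-reentrant recompute path would discard the activation here
    # and re-create it during backward; under auto parallel, `tensor` can
    # be a DistTensor, which the saving machinery must handle correctly.
    return tensor

def unpack(packed):
    # Called during backward to hand the saved tensor back to the grad op.
    return packed

x = paddle.randn([4, 4])
x.stop_gradient = False
with paddle.autograd.saved_tensors_hooks(pack, unpack):
    y = paddle.matmul(x, x)
y.sum().backward()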