
fix model run error when use auto parallel and recompute(use_reentrant=false) #65188


Conversation

@jeff41404 (Contributor) commented Jun 14, 2024

PR Category

Auto Parallel

PR Types

Bug fixes

Description

pcard-84677
The model raises an error and core dumps when the following two conditions are met:

class Large_model(......):  # a subclass of paddle.nn.Layer
    def __init__(
        self,
        ......
    ):
        super().__init__()
        ......
        self.layers = nn.LayerList()
        for i in range(num_layers):
            # condition 1: parameters are sharded with auto parallel, e.g. the parameters inside TransformerBlock
            self.layers.append(TransformerBlock(......))

    def forward(self, x, label):
        ......
        for i, layer in enumerate(self.layers):
            # condition 2: recompute is used with use_reentrant=False
            x = paddle.distributed.fleet.utils.recompute(layer, x, use_reentrant=False)
        ......
        return loss

model = Large_model(......)
for step, inputs in enumerate(dataloader):
    # inputs is a (data, label) pair, matching forward(self, x, label)
    loss = model(*inputs)
    loss.backward()  # errors and core dumps; this PR fixes the issue
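
For reference, here is a minimal, self-contained sketch of the failing pattern. It is illustrative only, not the PR's own test: it assumes Paddle's newer auto-parallel API (dist.ProcessMesh, dist.shard_tensor, dist.Replicate) and a one-card mesh so it can run on a single device, and Block with its shapes is a hypothetical stand-in for TransformerBlock.

import paddle
import paddle.distributed as dist
from paddle.distributed.fleet.utils import recompute

mesh = dist.ProcessMesh([0], dim_names=["x"])

class Block(paddle.nn.Layer):  # hypothetical stand-in for TransformerBlock
    def __init__(self, hidden=8):
        super().__init__()
        # condition 1: the block's parameter is placed on a ProcessMesh
        # (replicated here; sharded placements exercise the same code path)
        self.w = dist.shard_tensor(
            self.create_parameter(shape=[hidden, hidden]),
            mesh,
            [dist.Replicate()],
        )

    def forward(self, x):
        return paddle.matmul(x, self.w)

layer = Block()
x = paddle.randn([2, 8])
x.stop_gradient = False
# condition 2: non-reentrant recompute
y = recompute(layer, x, use_reentrant=False)
y.sum().backward()  # errored and core-dumped before this fix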

paddle-bot bot commented Jun 14, 2024

Your PR has been submitted. Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

@jeff41404 changed the title from "fix model run error when auto parallel and recompute(use_reentrant=false)" to "fix model run error when use auto parallel and recompute(use_reentrant=false)" on Jun 17, 2024
@zhiqiu (Contributor) left a comment

LGTM

@JiabinYang (Contributor) left a comment

LGTM for Tensor construction

@jeff41404 jeff41404 merged commit 86206c8 into PaddlePaddle:develop Jun 20, 2024
32 of 33 checks passed
@jeff41404 jeff41404 deleted the fix_auto_parallel_recompute_use_reentrant_false_issue branch June 20, 2024 10:19
co63oc pushed a commit to co63oc/Paddle that referenced this pull request Jun 25, 2024
fix model run error when use auto parallel and recompute(use_reentrant=false) (PaddlePaddle#65188)

* fix model run error when auto parallel and recompute and use_reentrant=false

* solve the defect of TensorWrapper not considering DistTensor

* add unittest

* fix recompute not supporting CPU when use_reentrant is False
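
The second bullet points at the root cause: with use_reentrant=False, Paddle's recompute is built on saved-tensor hooks rather than a custom PyLayer, so every activation kept for backward (held in a TensorWrapper on the C++ side) passes through that saving path, and under auto parallel those activations can be DistTensor. Below is a minimal sketch of the hook mechanism using the public paddle.autograd.saved_tensors_hooks API; it illustrates the code path only and is not the PR's C++ change.

import paddle

def pack(tensor):
    # The non-reentrant recompute path would discard the activation here
    # and re-create it during backward; under auto parallel, `tensor` can
    # be a DistTensor, which the saving machinery must handle correctly.
    return tensor

def unpack(packed):
    # Called during backward to hand the saved tensor back to the grad op.
    return packed

x = paddle.randn([4, 4])
x.stop_gradient = False
with paddle.autograd.saved_tensors_hooks(pack, unpack):
    y = paddle.matmul(x, x)
y.sum().backward()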