-
Notifications
You must be signed in to change notification settings - Fork 29.5k
Disable delay_optimizer_creation in Trainer
to support fsdp2
#37147
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. cc @SunMarc for double-check on the added tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks ! Left a couple of comments. It would be nice to check that all the fsdp tests works with fsdpv2 also @S1ro1 but that can be done in a follow-up PR
Trainer
to support fsdp2Trainer
to support fsdp2
@SunMarc I've addressed your comments and the test suite passes on I'm a bit worried about how this interacts with |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks ! make sure to update your ruff version to fix the quaility CI test
cc @S1ro1 for FSDP2_PYTORCH_VERSION comment. Maybe we need to update it to 2.6.0 ? |
Yeah after updating ruff the 2 files that were being changed seem to be unaffected. |
…ngface#37147) * github why you do this * fix * make fixup * disable cpu offload test * fixup * tmp reworks * git branch movement * make fixup * add require_fsdp_v2_version * dep issues * update ruff and fixup
…ngface#37147) * github why you do this * fix * make fixup * disable cpu offload test * fixup * tmp reworks * git branch movement * make fixup * add require_fsdp_v2_version * dep issues * update ruff and fixup
What does this PR do?
In order to get Trainer support for FSDP2 in
accelerate
, we have to pass the model and optimizer intoAccelerator.prepare()
at the same time (https://github.com/huggingface/accelerate/pull/3394/files#r2017637611).Note: This may not be sufficient for full FSDP2 support in Trainer. This PR might be scrapped and replaced with something more complete if some issues arise.
Before submitting
Pull Request section?
to it if that's the case.
documentation guidelines, and
here are tips on formatting docstrings.
Testing
2 unit tests were added to
tests/fsdp/test_fsdp.py
test_fsdp2_cpu_offloading
, which matchestest_fsdp_cpu_offloading
, but is skipped since the second test appears to be non functioningtest_accelerate_fsdp2_integration
which matchestest_training_and_can_resume_normally
with SHARDED_STATE_DICT (as this is the only config that can run accelerate), and passes.This will not be caught by CI, as it depends on an unreleased feature, but was instead manually run with the command
RUN_SLOW=1 pytest tests/fsdp/test_fsdp.py
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.