Fixup Zero3 + save_model
#3146
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Thanks for fixing this bug. The failing test looks unrelated.
Just a nit, I would move the directory creation code below the newly introduced early return, as it's an unnecessary side effect if we're not saving anything, right?
LGTM! Left a small suggestion, but feel free to ignore.
    # Case: DeepSpeed zero3 gets gathered and `state_dict` is empty
    if state_dict is None:
        return
We can maybe make it more specific to DeepSpeed?
In their code, this is how they do this: https://github.com/microsoft/DeepSpeed/blob/f3943cf9109226ed3ecf2d5dbb639a11cd925555/deepspeed/runtime/engine.py#L3414
It seems like the model only gets saved on rank 0.
I went more generic just in case this somehow happens in other places that we don't know about yet (so we enable them rather than block them).
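To make the trade-off in this thread concrete, here is a hedged sketch comparing the two kinds of guard. Both predicates are hypothetical and are not part of Accelerate or DeepSpeed; `is_zero3` and `rank` are assumed inputs, and the rank-based check is only one possible reading of "more specific to DeepSpeed".

```python
def skip_save_generic(state_dict):
    """Generic guard: skip whenever there is no state dict to write."""
    return state_dict is None


def skip_save_deepspeed_only(state_dict, is_zero3, rank):
    """Hypothetical narrower guard: skip only on non-0 ranks under Zero3."""
    return is_zero3 and rank != 0


# Zero3, rank 1: the gathered state dict is absent, and both guards skip.
zero3_generic = skip_save_generic(None)
zero3_specific = skip_save_deepspeed_only(None, is_zero3=True, rank=1)

# Some other (unknown) setup that also yields None: only the generic guard
# skips, so the narrower guard would fall through to code that expects a dict.
other_generic = skip_save_generic(None)
other_specific = skip_save_deepspeed_only(None, is_zero3=False, rank=1)
```

The generic check covers the Zero3 case and any not-yet-known case with the same symptom, at the cost of being less explicit about why `state_dict` is `None`.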
* Fixup + test
* Easier diff
* Move os.makedirs to under return statement
What does this PR do?
When `save_model` is called under Zero3, only a single rank has all of the parameters. As a result, `Accelerator.save_model` will throw an error under Zero3 saying `'NoneType' object has no attribute 'items'`, because in this case the `state_dict` of the model on non-0 ranks won't exist.

Fixes #2985
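A minimal sketch of the behavior described above, not the actual `Accelerator.save_model` implementation. The function name `save_state_dict_sketch`, the JSON format, and the file name `model.json` are assumptions chosen so the example runs without DeepSpeed. It shows the early return when `state_dict` is `None` (as on non-0 ranks under Zero3), placed before the directory is created.

```python
import json
import os
import tempfile


def save_state_dict_sketch(state_dict, save_directory):
    """Hypothetical stand-in for a save routine; not the Accelerate API."""
    # Ranks that did not receive the gathered parameters have nothing to save.
    # Without this guard, iterating `state_dict.items()` would raise
    # AttributeError: 'NoneType' object has no attribute 'items'.
    if state_dict is None:
        return None

    # Create the directory only after the early return, so ranks that skip
    # saving do not touch the filesystem.
    os.makedirs(save_directory, exist_ok=True)
    path = os.path.join(save_directory, "model.json")  # assumed file name
    with open(path, "w") as f:
        json.dump(dict(state_dict.items()), f)
    return path


with tempfile.TemporaryDirectory() as tmp:
    skipped_dir = os.path.join(tmp, "rank1")
    saved_dir = os.path.join(tmp, "rank0")

    skipped = save_state_dict_sketch(None, skipped_dir)
    saved = save_state_dict_sketch({"weight": [1.0, 2.0]}, saved_dir)

    skipped_dir_exists = os.path.isdir(skipped_dir)
    saved_file_exists = os.path.isfile(saved)
```

Placing `os.makedirs` after the guard follows the review suggestion in this PR: a rank that saves nothing should not create an empty output directory.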
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.
@SunMarc @BenjaminBossan