add init group during exp manager #7497
Conversation
nemo/utils/exp_manager.py
Outdated
trainer.strategy.launcher.launch(dummy, trainer=trainer)
trainer.strategy.setup_environment()

global_rank = torch.distributed.get_rank()
This forces even single-GPU runs to launch torch.distributed; let's only do this if there's more than one GPU.
Addressed in 07ff8ee: it is now only updated from before if we initialized the process group.
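For context, the gating described here amounts to something like the following sketch; this is illustrative only, not the exact diff:

import torch

# Only query the rank when a process group actually exists;
# otherwise keep the pre-existing value (rank 0 on single-GPU runs).
global_rank = 0
if torch.distributed.is_available() and torch.distributed.is_initialized():
    global_rank = torch.distributed.get_rank()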
nemo/utils/exp_manager.py
Outdated
# the exp manager calls some operations that require explicit
# synchronization, therefore we need to initialize the process
# group to initiate a barrier
if parallel_state.is_unitialized():
Just check if torch.distributed is initialized here; there's no need for megatron.core.
Also make sure there's more than one GPU here.
Fixed in 07ff8ee.
force-pushed from daa2284 to 07ff8ee
force-pushed from 07ff8ee to e3d5351
force-pushed from e3d5351 to 6dc19ec
Signed-off-by: Gerald Shen <[email protected]>
Signed-off-by: Gerald Shen <[email protected]>
…cked version Signed-off-by: Gerald Shen <[email protected]>
force-pushed from d178eeb to 0daa321
LGTM. Thanks!
# group to initiate a barrier
if not torch.distributed.is_initialized() and trainer.num_nodes * trainer.num_devices > 1:

    def dummy():
Hmm, that seems reasonable to avoid race conditions.
It also should probably have no effect on single-GPU runs.
However, I'm worried about what hacking PTL like this will do to DDP training. What happens when you call trainer.fit() after already setting up DDP with that dummy? Will it reinitialize? Not likely. So what does the dummy do then to normal PTL behaviour?
When testing on my local setup, calling trainer.fit after this initialization seems totally fine, but I don't really know how it behaves on larger runs. To be safe, would you rather I modify NLPDDPStrategy as well and set a flag that checks whether setup_environment has already been called?
Actually, looking more closely, I think this is safe because PTL guards against the already-initialized case: https://github.com/Lightning-AI/lightning/blob/984f49f7195ddc67e961c7c498ee6e19fc0cecb5/src/lightning/fabric/utilities/distributed.py#L237
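For reference, the guard being linked to follows the common "skip if already initialized" pattern; a minimal sketch of that pattern (not the exact Lightning code) looks like this:

import os
import torch

def maybe_init_process_group(backend: str = "gloo") -> None:
    # A second call becomes a no-op instead of raising, which is why
    # trainer.fit() after the dummy setup does not reinitialize anything.
    if not torch.distributed.is_available() or torch.distributed.is_initialized():
        return
    # Defaults for a single-process run; real launchers set these env vars.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    torch.distributed.init_process_group(
        backend=backend,
        rank=int(os.environ.get("RANK", 0)),
        world_size=int(os.environ.get("WORLD_SIZE", 1)),
    )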
Personally, I really do not want to make Exp Manager deal with distributed computing under any circumstances. While this may be a foolproof solution with a barrier, it messes with PTL internals. Instead, please see if you can do two things -
I know of runs that have slept 30+ seconds but still hit this issue. I agree this may mess with PTL internals, but the sleep duration is going to have to depend on the size of the run and the size of the cluster. We'll have this issue forever unless we have a barrier.
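To illustrate the trade-off in this thread: a fixed sleep has to be tuned to the slowest run, while a barrier waits exactly as long as needed but requires an initialized process group. A rough sketch of the two approaches (the helper names here are made up for illustration):

import time
import torch

def wait_with_sleep(seconds: float = 30.0) -> None:
    # Fragile: a large run or a busy cluster may need longer than any fixed value.
    time.sleep(seconds)

def wait_with_barrier() -> None:
    # Robust: every rank blocks until all ranks arrive, regardless of
    # cluster size, but only works once the process group is initialized.
    if torch.distributed.is_available() and torch.distributed.is_initialized():
        torch.distributed.barrier()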
Closing in favor of #7498
What does this PR do?
Add a one line overview of what this PR aims to accomplish.
Fixes #7460. Exp manager now initializes distributed groups to force a barrier.
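At a high level the change amounts to the sketch below (simplified; the helper name is hypothetical and the real code goes through the trainer's strategy and launcher): initialize distributed state only for multi-device runs, then synchronize so no rank races ahead of the exp manager's file operations.

import torch

def maybe_force_barrier(trainer) -> None:
    # Multi-device runs need a process group so that a barrier can
    # keep all ranks in step with the directory/file setup.
    if trainer.num_nodes * trainer.num_devices > 1:
        if not torch.distributed.is_initialized():
            trainer.strategy.setup_environment()
        torch.distributed.barrier()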
Changelog
PR Type:
Additional Information
The moving of files in exp_manager may cause crashes in other processes #7460