
add init group during exp manager #7497

Closed
gshennvm wants to merge 3 commits into r1.21.0 from geshen/fix_file_move_on_large_runs

Conversation

gshennvm
Collaborator

What does this PR do?

Fixes #7460. Exp manager now initializes distributed groups to force a barrier.
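Roughly, the change amounts to the following sketch, assembled from the diff hunks quoted in the review below (the body of dummy is not shown in the hunks and is assumed here to be a no-op; trainer is the PyTorch Lightning Trainer passed to exp_manager):

import torch

# the exp manager calls some operations that require explicit synchronization,
# so the process group has to be initialized for barriers to take effect;
# only do this when the run actually spans more than one device
if not torch.distributed.is_initialized() and trainer.num_nodes * trainer.num_devices > 1:

    def dummy():
        return  # no-op, launched only so PTL spawns workers and sets up distributed

    trainer.strategy.launcher.launch(dummy, trainer=trainer)
    trainer.strategy.setup_environment()

    global_rank = torch.distributed.get_rank()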

Changelog

  • exp_manager now initializes the torch.distributed process group (only when trainer.num_nodes * trainer.num_devices > 1) so that its synchronization/barrier calls actually take effect before files are moved.

PR Type:

  • New Feature
  • Bugfix
  • Documentation

Additional Information

# launch a no-op through the strategy's launcher, then set up the distributed
# environment (this is what initializes torch.distributed)
trainer.strategy.launcher.launch(dummy, trainer=trainer)
trainer.strategy.setup_environment()

global_rank = torch.distributed.get_rank()
Collaborator

This forces even single gpu runs to launch torch.distributed, let's only do this if there's more than one gpu

Collaborator Author

07ff8ee — now this only changes behavior from before if we initialized the process group

# the exp manager calls some operations that require explicit
# synchronization, therefore we need to initialize the process
# group to initiate a barrier
if parallel_state.is_unitialized():
Collaborator

just check if torch.distributed is initialized here, no need for megatron.core
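In other words, something like this (sketch; the parallel_state import is not shown in the hunk above and is assumed):

# current hunk: relies on megatron.core
from megatron.core import parallel_state
if parallel_state.is_unitialized():
    ...

# suggested: plain torch is enough
import torch
if not torch.distributed.is_initialized():
    ...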

Collaborator

also make sure there's more than one gpu here

Collaborator Author

@gshennvm gshennvm Sep 23, 2023

fixed in 07ff8ee

@gshennvm gshennvm force-pushed the geshen/fix_file_move_on_large_runs branch from daa2284 to 07ff8ee on September 23, 2023 21:45
@gshennvm gshennvm changed the base branch from main to r1.21.0 on September 23, 2023 21:47
@gshennvm gshennvm force-pushed the geshen/fix_file_move_on_large_runs branch from 07ff8ee to e3d5351 on September 23, 2023 21:48
@gshennvm gshennvm force-pushed the geshen/fix_file_move_on_large_runs branch from e3d5351 to 6dc19ec on September 23, 2023 21:56
@gshennvm gshennvm force-pushed the geshen/fix_file_move_on_large_runs branch from d178eeb to 0daa321 on September 23, 2023 22:08
Collaborator

@ericharper ericharper left a comment

LGTM. Thanks!

# group to initiate a barrier
if not torch.distributed.is_initialized() and trainer.num_nodes * trainer.num_devices > 1:

def dummy():
Collaborator

Hmm, that seems reasonable for avoiding race conditions.

It also should probably not have an effect on single-GPU runs.

However, I'm worried about what hacking PTL like this will do to DDP training. What happens when you call trainer.fit() after already setting up DDP with that dummy? Will it reinitialize? Not likely. So what does the dummy do to normal PTL behaviour then?

Collaborator Author

@gshennvm gshennvm Sep 23, 2023

When testing on my local setup, calling trainer.fit after this initialization seems totally fine, but I don't really know how it behaves on larger runs. To be safe, would you rather I modify NLPDDPStrategy as well and add a flag that checks whether setup_environment has already been called? (Sketched below.)
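For reference, a sketch of that floated (and ultimately not pursued) alternative; the _setup_environment_called flag is hypothetical and does not exist in NLPDDPStrategy:

class NLPDDPStrategy(DDPStrategy):

    def setup_environment(self) -> None:
        # hypothetical guard: make repeated calls (e.g. one from exp_manager,
        # a later one from trainer.fit) a no-op after the first
        if getattr(self, "_setup_environment_called", False):
            return
        super().setup_environment()
        self._setup_environment_called = True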

Collaborator Author

Actually, looking closely, I think this is safe because PTL guards against the already-initialized case: https://github.com/Lightning-AI/lightning/blob/984f49f7195ddc67e961c7c498ee6e19fc0cecb5/src/lightning/fabric/utilities/distributed.py#L237
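The linked check boils down to an early return of roughly this shape (paraphrased, not a verbatim copy of the Lightning source):

# inside lightning.fabric.utilities.distributed._init_dist_connection (paraphrased)
if torch.distributed.is_initialized():
    # a process group already exists (here: the one created via exp_manager),
    # so PTL skips init_process_group instead of re-initializing
    return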

@titu1994
Collaborator

Personally, I really do not want Exp Manager to deal with distributed computing under any circumstances.

While this may be a foolproof solution with the barrier and all, it messes around with PTL internals.

Instead, please see if you can do the following: if trainer.num_nodes * num_gpus > 1 and trainer.global_rank == 0 (assuming you can properly access trainer.global_rank), then simply sleep on rank 0 for 5-10 seconds. It's a minor waste, but way simpler than making all these hacks with distributed.
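A sketch of that suggestion (using trainer.num_devices for the device count; the 5-10 second figure is taken from the comment above):

import time

# alternative: avoid touching torch.distributed entirely and just delay rank 0,
# giving the other ranks time to get past the files before exp_manager moves them
if trainer.num_nodes * trainer.num_devices > 1 and trainer.global_rank == 0:
    time.sleep(10)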

@gshennvm
Collaborator Author

> Personally, I really do not want Exp Manager to deal with distributed computing under any circumstances.
>
> While this may be a foolproof solution with the barrier and all, it messes around with PTL internals.
>
> Instead, please see if you can do the following: if trainer.num_nodes * num_gpus > 1 and trainer.global_rank == 0 (assuming you can properly access trainer.global_rank), then simply sleep on rank 0 for 5-10 seconds. It's a minor waste, but way simpler than making all these hacks with distributed.

I know of runs that have slept 30+ seconds and still hit this issue. I agree this may mess with PTL internals, but the sleep duration would have to depend on the size of the run and the size of the cluster. We'll keep hitting this issue unless we have a barrier.

@ericharper ericharper closed this Sep 24, 2023
@ericharper
Collaborator

Closing in favor of #7498

Development

Successfully merging this pull request may close these issues.

The moving of files in exp_manager may cause crashes in other processes (#7460)
3 participants