
support branch parallel for evoformer #14

Open · wants to merge 2 commits into base: main

Conversation

@GuoxiaWang commented Nov 10, 2022

@guolinke (Member) commented:

Thank you, I will review this over the weekend.

@@ -49,6 +49,7 @@ def main(args) -> None:
), "Must specify batch size either with --batch-size"
metrics.reset()

args.seed += args.dp_rank
Member:

Is this change needed?

Author (@GuoxiaWang):

When using a hybrid distributed parallel strategy such as DP + BP, the parameters and data within the same BP group must be identical, so the ranks in that group need the same seed.
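For illustration only, a minimal sketch of what the seed offset does; the rank layout and helper name below are assumptions for this sketch, not code from the PR:

    # Hypothetical layout (assumed): world_size = dp_degree * bp_degree, with the
    # bp_degree ranks of each BP group contiguous in the global rank ordering.
    def seed_for_rank(base_seed: int, global_rank: int, bp_degree: int) -> int:
        dp_rank = global_rank // bp_degree   # identical for every rank in one BP group
        return base_seed + dp_rank           # mirrors the added line: args.seed += args.dp_rank

    # Example with bp_degree=2 and base_seed=1:
    #   ranks 0,1 -> seed 1,  ranks 2,3 -> seed 2
    # so the branch-parallel ranks of one group draw the same data and initialise the
    # same parameters, while different DP ranks still get different seeds.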

@@ -137,6 +139,9 @@ def distributed_init(args):
if torch.cuda.is_available():
dist.all_reduce(torch.zeros(1).cuda())

scg.init_group(bp_degree=args.bp_degree, dap_degree=1)
Member:

Will this affect the normal c10d / no_c10d modes?
Can we make "bp" a choice, like the current c10d and no_c10d options?

Author (@GuoxiaWang):

I'm not quite sure about this. This PR is just to show how to use BP; it is not meant to be merged into UniCore as-is.

Member:

Sorry, I may be missing some context.
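For readers missing that context as well, here is a rough sketch of the kind of process-group setup a call like scg.init_group(bp_degree=...) performs; the grouping logic and function name are assumptions based on the arguments shown in the diff, not the actual scg implementation:

    import torch.distributed as dist

    def init_bp_groups(bp_degree: int):
        """Sketch: carve the world into branch-parallel (BP) groups.

        Assumes dist.init_process_group() has already run and that the
        bp_degree ranks of each BP group are contiguous in the global ordering.
        """
        world_size = dist.get_world_size()
        rank = dist.get_rank()
        assert world_size % bp_degree == 0
        bp_group = None
        for start in range(0, world_size, bp_degree):
            ranks = list(range(start, start + bp_degree))
            group = dist.new_group(ranks)  # every rank must call new_group for every group
            if rank in ranks:
                bp_group = group
        return bp_group

Guarding such a call (for example, skipping it when bp_degree == 1) would presumably leave the existing c10d / no_c10d training paths untouched, which seems to be what the reviewer is asking about.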


return outer_grad.clone(), msa_grad.clone(), pair_grad.clone()

def sync_evoformer_results(outer, msa, pair, training):
Member:

I feel like the functions in this file would be better placed in Uni-Fold.

Author (@GuoxiaWang):

Same issue as above. The code needs to be designed jointly and then merged into Uni-Fold and UniCore respectively.
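As a rough illustration of what a helper with this signature might do, here is a minimal sketch; the broadcast-from-group-leader strategy, the bp_group handle, and the src_rank parameter are assumptions, not the PR's actual implementation:

    import torch.distributed as dist

    def sync_evoformer_results(outer, msa, pair, training, bp_group=None, src_rank=0):
        """Sketch: make the three Evoformer outputs identical across a BP group.

        Broadcasts the tensors held by src_rank (the global rank of the group
        leader) to the other branch-parallel ranks, so modules that are not
        branch-parallel see consistent activations. The `training` flag is kept
        only to mirror the signature shown in the diff.
        """
        if bp_group is None:
            return outer, msa, pair
        for t in (outer, msa, pair):
            dist.broadcast(t, src=src_rank, group=bp_group)
        return outer, msa, pair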

