Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix issues with ddp/hydra and add tests #796

Merged
merged 5 commits into from
Aug 8, 2024
Merged

fix issues with ddp/hydra and add tests #796

merged 5 commits into from
Aug 8, 2024

Conversation

misko
Copy link
Collaborator

@misko misko commented Aug 6, 2024

The original hydra commit does not play nice with DDP because there are unused parameters. These parameters are the original heads present in the backbone that are duplicated in the heads. To fix this issue this PR sets them to None in the underlying class when it is used through the Backbone class.

Additionally tests for DDP versions were not present in the code before this. I have added every network to in tests_e2e to be tested in both DDP and non-DDP modes.
A slight issue with DDP tests is that when using spawn method for distributed we cannot properly pickle and load LMDBs. A workaround for now is when testing with DDP , num_workers=0.

A future BE task could be to re-initialize LMDB hooks in workers instead of pickling them over.

@misko misko requested a review from rayg1234 August 6, 2024 21:55
@misko misko marked this pull request as ready for review August 6, 2024 21:56
@misko misko requested a review from lbluque August 6, 2024 21:56
trainer: forces

task:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw i think the other configs like test_equiformers_v2 also has the "task" field and can be removed

@@ -683,6 +684,11 @@ def no_weight_decay(self) -> set:

@registry.register_model("equiformer_v2_backbone")
class EquiformerV2Backbone(EquiformerV2, BackboneInterface):
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
self.energy_block = None
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a comment to get rid of this when we stop inheriting from EquiformersV2?

@rayg1234
Copy link
Collaborator

rayg1234 commented Aug 6, 2024

Thanks for debugging! this is OK for now but i think we should aggresively remove OR deprecate equiformersV2 instead of creating this inheritance structure:

class EquiformerV2Backbone(EquiformerV2, BackboneInterface):

Can you also run a quick 31M eqv2 vs eqv2 hydra experiment to make sure they are the same?

@rayg1234
Copy link
Collaborator

rayg1234 commented Aug 6, 2024

Additionally tests for DDP versions were not present in the code before this. I have added every network to in tests_e2e to be tested in both DDP and non-DDP modes. A slight issue with DDP tests is that when using spawn method for distributed we cannot properly pickle and load LMDBs. A workaround for now is when testing with DDP , num_workers=0.

A future BE task could be to re-initialize LMDB hooks in workers instead of pickling them over.

The lmdb thing is unfortunate, but we can test with DDP and NO multiprocess by just always initializing a local distributed process group, this is a simple thing to fix when we get rid of the no-ddp flag

Copy link

codecov bot commented Aug 6, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Files Coverage Δ
...airchem/core/models/equiformer_v2/equiformer_v2.py 88.81% <100.00%> (+0.30%) ⬆️
src/fairchem/core/models/escn/escn.py 96.79% <100.00%> (+0.01%) ⬆️
src/fairchem/core/models/gemnet_oc/gemnet_oc.py 90.13% <100.00%> (+0.08%) ⬆️
src/fairchem/core/models/painn/painn.py 90.84% <100.00%> (+0.06%) ⬆️

@misko
Copy link
Collaborator Author

misko commented Aug 7, 2024

Thanks for debugging! this is OK for now but i think we should aggresively remove OR deprecate equiformersV2 instead of creating this inheritance structure:

class EquiformerV2Backbone(EquiformerV2, BackboneInterface):

Can you also run a quick 31M eqv2 vs eqv2 hydra experiment to make sure they are the same?

Running these now on devfair

Copy link
Collaborator

@rayg1234 rayg1234 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shipit!

@misko
Copy link
Collaborator Author

misko commented Aug 7, 2024

Thanks for debugging! this is OK for now but i think we should aggresively remove OR deprecate equiformersV2 instead of creating this inheritance structure:
class EquiformerV2Backbone(EquiformerV2, BackboneInterface):
Can you also run a quick 31M eqv2 vs eqv2 hydra experiment to make sure they are the same?

Running these now on devfair

https://fburl.com/z1fez1mf

The results of these show that equiformer v2 regular is the same as equiformer v2 hydra

@rayg1234 rayg1234 mentioned this pull request Aug 8, 2024
7 tasks
@misko misko added this pull request to the merge queue Aug 8, 2024
Merged via the queue into main with commit 44c71b3 Aug 8, 2024
7 checks passed
@misko misko deleted the hydra_fixes branch August 8, 2024 16:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants