fix issues with ddp/hydra and add tests #796
trainer: forces

task:
btw I think the other configs like test_equiformers_v2 also have the "task" field, and it can be removed there too
@@ -683,6 +684,11 @@ def no_weight_decay(self) -> set:

@registry.register_model("equiformer_v2_backbone")
class EquiformerV2Backbone(EquiformerV2, BackboneInterface):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.energy_block = None
Can you add a comment to get rid of this when we stop inheriting from EquiformerV2?
Thanks for debugging! This is OK for now, but I think we should aggressively remove OR deprecate EquiformerV2 instead of creating this inheritance structure.
Can you also run a quick 31M eqv2 vs eqv2 hydra experiment to make sure they are the same?
The LMDB thing is unfortunate, but we can test with DDP and NO multiprocessing by always initializing a local distributed process group. This will be simple to fix once we get rid of the no-ddp flag.
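A minimal sketch of the "local process group" idea from the comment above, assuming plain PyTorch on CPU with the `gloo` backend. The helper names `init_local_process_group` and `run_single_process_ddp_step` are hypothetical and are not part of this repository; how the no-ddp flag interacts with this is not shown.

```python
import os
import tempfile

import torch
import torch.distributed as dist
from torch import nn
from torch.nn.parallel import DistributedDataParallel as DDP


def init_local_process_group():
    """Start a one-process 'gloo' group so DDP can run without spawning workers."""
    if dist.is_initialized():
        return
    # A file-based rendezvous avoids having to pick a free TCP port.
    store_file = os.path.join(tempfile.mkdtemp(), "ddp_store")
    dist.init_process_group(
        backend="gloo",
        init_method=f"file://{store_file}",
        rank=0,
        world_size=1,
    )


def run_single_process_ddp_step():
    """Wrap a toy model in DDP, run one backward pass, and report whether grads exist."""
    init_local_process_group()
    try:
        model = DDP(nn.Linear(4, 1))
        model(torch.randn(2, 4)).sum().backward()
        return all(p.grad is not None for p in model.parameters())
    finally:
        dist.destroy_process_group()
```

Because the whole run stays in one process, nothing is pickled across a spawn boundary, so a test written this way would exercise the DDP wrapper without touching the LMDB pickling problem.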
Codecov Report: All modified and coverable lines are covered by tests ✅
Running these now on devfair
shipit!
The results of these runs show that regular equiformer v2 is the same as equiformer v2 hydra
The original hydra commit does not play nicely with DDP because it leaves unused parameters. These are the original heads built into the backbone, which are now duplicated by the separate hydra heads. To fix this, this PR sets them to None in the underlying class when it is used through the Backbone class.
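As a rough illustration of why assigning None helps, here is a small sketch with toy modules. `ToyModel`, `ToyBackbone`, and `parameter_names` are hypothetical stand-ins, not the actual EquiformerV2 classes; only the attribute name `energy_block` is taken from the diff.

```python
from torch import nn


class ToyModel(nn.Module):
    """Hypothetical stand-in for a model that owns its own output head."""

    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(8, 8)
        self.energy_block = nn.Linear(8, 1)  # built-in head


class ToyBackbone(ToyModel):
    """Backbone variant: external heads are expected to replace the built-in one."""

    def __init__(self):
        super().__init__()
        # Assigning None removes the head's parameters from .parameters(),
        # so DDP no longer waits for gradients that are never produced.
        self.energy_block = None

    def forward(self, x):
        return self.encoder(x)


def parameter_names(module):
    """List the parameter names that a wrapper such as DDP would see."""
    return [name for name, _ in module.named_parameters()]
```

Calling `parameter_names(ToyBackbone())` lists only the encoder weights and bias, while `parameter_names(ToyModel())` still includes the `energy_block` entries.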
Additionally, tests for the DDP versions were not present in the code before this. I have added every network in tests_e2e to be tested in both DDP and non-DDP modes.
A slight issue with the DDP tests is that, when using the spawn method for distributed, we cannot properly pickle and load LMDBs. The workaround for now is to set num_workers=0 when testing with DDP.
A future BE task could be to re-initialize LMDB hooks in workers instead of pickling them over.
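One possible shape for that follow-up, sketched with a hypothetical `LazyHandleDataset` class (not the repository's dataset code): the handle is opened on first use in whichever process needs it and is dropped from the pickled state. The `opener` argument is an assumption for illustration; in practice it could be a module-level function wrapping `lmdb.open(path, readonly=True, lock=False)`.

```python
class LazyHandleDataset:
    """Hypothetical dataset wrapper that opens its storage handle lazily per process."""

    def __init__(self, path, opener):
        self.path = path
        self.opener = opener  # assumed: a picklable callable returning an open handle
        self._env = None

    @property
    def env(self):
        # Open on first access, so each worker creates its own handle.
        if self._env is None:
            self._env = self.opener(self.path)
        return self._env

    def __getstate__(self):
        # Never ship the live handle to another process; only the path travels.
        state = self.__dict__.copy()
        state["_env"] = None
        return state
```

With this layout, pickling the dataset for a spawned worker carries only the path and the opener, and the worker reopens the database the first time it reads `env`.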