Stage3: Use new torch grad accumulation hooks API #6773

deepcharm · 2024-11-21T14:57:53Z

This commit addresses DeepSpeed issue #6718 (microsoft#6718).
The existing code uses a hook on the grad_acc node to reduce parameter gradients.
Constructs such as param.data = replicated_tensor.data, used in allgather_params(..), are compiled into param.set(), and as a result the hook assigned to the grad_acc node is not called.
This is a known torch issue: pytorch/pytorch#139742.
The missed hook caused accuracy issues, which could be worked around temporarily by disabling torch compile when activation checkpointing is used.
This commit provides a clean solution by replacing the hook on the grad_acc node with a hook registered on the param itself through a newer, more robust API: param.register_post_accumulate_grad_hook(..)

tjruwase · 2024-11-21T16:04:34Z

deepspeed/runtime/zero/stage3.py

-                        self._grad_acc_hooks.append(grad_acc.register_hook(reduce_partition_and_remove_grads))
-                        self.grad_accs.append(grad_acc)
+                        self._grad_acc_hooks.append(
+                            param.register_post_accumulate_grad_hook(reduce_partition_and_remove_grads))


Which pytorch version introduced this API? How should we handle older versions?

Hi @tjruwase

This API was introduced in PyTorch v2.1.

We can add a version check and use the older API when needed.

However, compilation should then be disabled on the older versions to avoid the accuracy issues when activation checkpointing is enabled.

Please advise on the preferred approach. Thanks.
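
One way the fallback discussed above could look is sketched below. The helper name `register_grad_hook_compat` and its `force_legacy` flag are hypothetical and only for illustration. The sketch uses feature detection rather than parsing the version string, and the legacy branch assumes the usual expand_as trick for reaching the grad_acc node. It does not address disabling compilation on older versions.

```python
import torch


def register_grad_hook_compat(param, hook, force_legacy=False):
    """Hypothetical helper: register `hook` to run once param.grad is ready.

    Returns (handle, keepalive). On the legacy path, `keepalive` is the
    grad_acc node, which the caller should hold so the node is not freed.
    """
    has_new_api = hasattr(param, "register_post_accumulate_grad_hook")
    if has_new_api and not force_legacy:
        return param.register_post_accumulate_grad_hook(hook), None
    # Legacy path: reach the AccumulateGrad node through a view of the param.
    grad_acc = param.expand_as(param).grad_fn.next_functions[0][0]
    return grad_acc.register_hook(hook), grad_acc


weight = torch.nn.Parameter(torch.ones(3))
seen = []


def on_grad_ready(*_unused):
    # The two APIs pass different arguments, so the hook ignores them
    # and reads the gradient from the parameter it closes over.
    seen.append(weight.grad.sum().item())


handle, keepalive = register_grad_hook_compat(weight, on_grad_ready)
(weight * 3.0).sum().backward()
handle.remove()
```

Writing the hook to accept `*_unused` keeps a single callback usable on both paths, which mirrors how a reduction callback that already captures its parameter would be written.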

deepcharm requested a review from tjruwase as a code owner November 21, 2024 14:57

tjruwase reviewed Nov 21, 2024

yitingw1 mentioned this pull request Nov 22, 2024

[Compiled_autograd] running nn.LayerNorm failed for torch.compile with compiled_autograd when deepspeed Zero3 pytorch/pytorch#140091

Open

loadams added 2 commits November 22, 2024 08:22

Merge branch 'master' into stage3-use-new-grad-acc-api

99bb156

Merge branch 'master' into stage3-use-new-grad-acc-api

e103272

yitingw1 mentioned this pull request Nov 27, 2024

[Compiled_autograd] running deepspeed Zero3 failed for torch.compile with compiled_autograd pytorch/pytorch#141646

Open


deepcharm commented Nov 21, 2024

tjruwase Nov 21, 2024

deepcharm Nov 28, 2024
