Give example on how to handle gradient accumulation with cross-entropy #3193

Open

ylacombe wants to merge 11 commits into huggingface:main from ylacombe:add-cross-entropy-accumulation-example

ylacombe commented Oct 24, 2024

What does this PR do?

Following the recent discussion of how gradient accumulation with the cross-entropy loss is usually computed incorrectly, it would be good to mention this in the docs. I've added code and an explanation to the gradient accumulation page.
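To make the problem concrete, here is a minimal sketch in plain Python. The per-token loss values and the function names are made up for illustration and are not taken from the PR. It assumes the usual cause of the mismatch: each micro-batch is averaged over its own non-padded tokens before the micro-batch losses are averaged, so micro-batches with fewer tokens get too much weight.

```python
# Hypothetical per-token cross-entropy losses for two micro-batches that
# contain different numbers of non-padded tokens.
micro_batch_losses = [
    [2.0, 4.0],                      # 2 tokens
    [1.0, 1.0, 1.0, 1.0, 1.0, 1.0],  # 6 tokens
]


def naive_accumulated_loss(batches):
    """Average each micro-batch on its own, then average those means."""
    means = [sum(b) / len(b) for b in batches]
    return sum(means) / len(means)


def token_normalized_loss(batches):
    """Sum every token loss, then divide once by the total token count."""
    total_tokens = sum(len(b) for b in batches)
    return sum(sum(b) for b in batches) / total_tokens


naive = naive_accumulated_loss(micro_batch_losses)
exact = token_normalized_loss(micro_batch_losses)
print(f"mean of micro-batch means: {naive}")
print(f"mean over all tokens:      {exact}")
```

The two values differ (2.0 versus 1.5 here) whenever the micro-batches hold different token counts, which is why the total number of items has to be known before the loss is scaled.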

cc @SunMarc and @muellerzr, let me know what you think of it or if I can make this any clearer!

ylacombe added 2 commits

October 24, 2024 12:13


          Add cross-entropy example in the gradient accumulation docs

2eb684e


          add example of logs

4d4ed80

HuggingFaceDocBuilderDev commented Oct 24, 2024

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.


          correct skeleton code

3b8c887

muellerzr approved these changes

Collaborator

muellerzr left a comment

Looks great! (We need to do the gather() + div by num processes in the trainer still).

Left a few nits. I think it'd be really cool if we could show full training graphs: after working with FP8, I don't fully trust taking "the end result is the same" at face value :)

docs/source/usage_guides/gradient_accumulation.md Outdated

Comment on lines 432 to 452

+              Results on a single device:
+              ```
+              initial model weight is tensor([-0.0075,  0.5364])
+              initial model clone weight is tensor([-0.0075,  0.5364])
+              Step 0 - Device 0 - num items in the local batch 36
+              Total num items 36
+              Device 0 - w/ accumulation, the final model weight is tensor([0.0953, 0.4337])
+              w/o accumulation, the final model weight is tensor([0.0953, 0.4337])
+              ```
+              Results on a two devices set-up:
+              ```
+              initial model weight is tensor([-0.0075,  0.5364])
+              initial model clone weight is tensor([-0.0075,  0.5364])
+              Step 0 - Device 0 - num items in the local batch 52
+              Step 0 - Device 1 - num items in the local batch 84
+              Total num items 136
+              Device 1 - w/ accumulation, the final model weight is tensor([0.2117, 0.3172])
+              Device 0 - w/ accumulation, the final model weight is tensor([0.2117, 0.3172])
+              w/o accumulation, the final model weight is tensor([0.2117, 0.3172])
+              ```

Collaborator

muellerzr Oct 24, 2024

Honestly if we can let's even toss up some wandb graphs 🔥

Author

ylacombe Oct 24, 2024 (edited)

Indeed, it'd be great, but here we only run a single global batch, so I don't think it's worth adding a graph. Should I modify the current code snippet to run multiple global steps?

Author

ylacombe Oct 24, 2024

Or add some wandb graphs from the upcoming modification of examples/by_feature/gradient_accumulation?

docs/source/usage_guides/gradient_accumulation.md

		model_optimizer.zero_grad()


		logger.warning(f"Device {accelerator.process_index} - w/ accumulation, the final model weight is {accelerator.unwrap_model(model).weight.detach().cpu().squeeze()}", main_process_only=False)

Collaborator

muellerzr Oct 24, 2024

Rather than logger.warning, we can use print() here or change the default logging level :) (Using logging.warning rather than logging.info just weirds me out)
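As a rough illustration of the second option, the sketch below uses only the standard-library `logging` module (not the logger from the PR) and a hypothetical logger name. It shows that an `info` message is dropped at the default `WARNING` level and emitted once the level is lowered to `INFO`.

```python
import io
import logging

# Capture log output in memory so the effect of the level is easy to inspect.
stream = io.StringIO()
logger = logging.getLogger("grad_accum_demo")  # hypothetical logger name
logger.addHandler(logging.StreamHandler(stream))
logger.propagate = False

# WARNING is the root logger's default level; set it explicitly for clarity.
logger.setLevel(logging.WARNING)
logger.info("dropped: below the WARNING threshold")

# Lowering the level lets informational messages through.
logger.setLevel(logging.INFO)
logger.info("kept")

output = stream.getvalue()
print(output, end="")
```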

muellerzr reviewed

ylacombe added 3 commits

October 24, 2024 14:28


          replace gather_for_metrics with gather

c01827c


          batch_size -> per_device_batch_size

22cbf9c


          remove main_process_only=True

395c572

SunMarc reviewed

Member

SunMarc left a comment

Nice job @ylacombe! Left a few suggestions!

docs/source/usage_guides/gradient_accumulation.md

Comment on lines +366 to +378

+              num_samples_in_epoch = len(dataloader)
+              remainder = num_samples_in_epoch % gradient_accumulation_steps
+              remainder = remainder if remainder != 0 else gradient_accumulation_steps
+              total_gradient_updates = math.ceil(num_samples_in_epoch / gradient_accumulation_steps)
+              total_batched_samples = 0
+              for update_step in range(total_gradient_updates):
+                      # In order to correctly compute the total number of non-padded tokens on which we'll compute the cross-entropy loss
+                      # we need to pre-load the full local batch - i.e. the next per_device_batch_size * accumulation_steps samples
+                      batch_samples = []
+                      num_batches_in_step = gradient_accumulation_steps if update_step != (total_gradient_updates - 1) else remainder
+                      for _ in range(num_batches_in_step):
+                          batch_samples += [next(training_iterator)]

Member

SunMarc Oct 24, 2024

This only works when we know the size of the dataloader. Can we think of a solution that doesn't require this information? I think we can just iterate over the dataloader until we have gradient_accumulation_steps batches to build batch_samples, and stop once the dataloader is exhausted. I think that code will be easier to understand.
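A minimal sketch of that idea, assuming only that the dataloader is iterable. The helper name `iter_accumulation_groups` is hypothetical and this is not necessarily how the Trainer implements it.

```python
from itertools import islice


def iter_accumulation_groups(dataloader, gradient_accumulation_steps):
    """Yield lists of up to `gradient_accumulation_steps` batches.

    Never calls len(dataloader); the last group may be shorter.
    """
    iterator = iter(dataloader)
    while True:
        batch_samples = list(islice(iterator, gradient_accumulation_steps))
        if not batch_samples:
            return  # the dataloader is exhausted
        yield batch_samples


def batches():
    """Stand-in dataloader: a generator, so it has no len()."""
    for i in range(5):
        yield [i]


groups = list(iter_accumulation_groups(batches(), gradient_accumulation_steps=2))
print(groups)
```

With five batches and two accumulation steps, the last group holds a single batch, so the remainder is handled without any arithmetic on the dataloader length.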

Collaborator

muellerzr Oct 24, 2024

Yes agreed :) (What we do in the Trainer)

docs/source/usage_guides/gradient_accumulation.md Outdated

Comment on lines 431 to 432


		Results on a single device:

Member

SunMarc Oct 24, 2024

Maybe we can specify the exact setup? I think we are doing the following:

dp=1 grad_acc=2 batch_size=4 vs dp=1 grad_acc=1 batch_size=8?
If we are only doing one update, then we won't be able to get a graph. Maybe we do this on a larger dataset where batch_size != len(data_loader) and add the graphs.

docs/source/usage_guides/gradient_accumulation.md Outdated

Comment on lines 442 to 443

		Results on a two devices set-up:
		```

Member

SunMarc Oct 24, 2024 (edited)

On a two-device setup, the modification you made to account for dp won't be reflected here, as we are only changing grad_acc and batch_size, so the loss will be the same regardless. However, it's nice to see that total_num_items really changed:

dp=2 grad_acc=2 batch_size=4 vs dp=2 grad_acc=1 batch_size=8

Maybe we should do a separate section/experiment to show that the following setups have the same loss graph:

dp=2 batch_size=2 is the same as dp=1 batch_size=4. See this experiment for clarification
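The equivalence can be checked on paper with a small plain-Python sketch. It assumes that gradients (and therefore, by linearity, the scaled losses) are averaged across processes, as data-parallel wrappers such as PyTorch DDP do. The sample values and function names are hypothetical, and the shared `total_items` stands in for a cross-process gather of the local item counts.

```python
# Hypothetical per-token losses for four samples of different lengths.
samples = [[2.0, 4.0], [1.0], [3.0, 3.0, 3.0], [0.5, 1.5]]


def single_process_loss(samples):
    """dp=1, batch_size=4: sum all token losses, divide by the total count."""
    total_items = sum(len(s) for s in samples)
    return sum(sum(s) for s in samples) / total_items


def data_parallel_loss(samples, num_processes):
    """dp=num_processes: each process scales its local sum, then results are averaged."""
    total_items = sum(len(s) for s in samples)  # stands in for gathering local counts
    shards = [samples[rank::num_processes] for rank in range(num_processes)]
    per_process = [
        # Multiply by num_processes to cancel the cross-process averaging below.
        sum(sum(s) for s in shard) * num_processes / total_items
        for shard in shards
    ]
    return sum(per_process) / num_processes  # mimics the cross-process average


dp1 = single_process_loss(samples)
dp2 = data_parallel_loss(samples, num_processes=2)
print(dp1, dp2)
```

Without the multiplication by the number of processes, the data-parallel result would be smaller than the single-process one by exactly that factor.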

ylacombe and others added 5 commits

October 29, 2024 16:48


          add autoregressive example in examples/

2e80bf0


          Update docs/source/usage_guides/gradient_accumulation.md

5e3e811

Co-authored-by: Marc Sun


          ruff format

c56c780


          add grad accum test

80c720a


          update docs

e5d2c50

muellerzr reviewed

tests/test_examples.py

Comment on lines +249 to +251

+                  def test_gradient_accumulation_for_autoregressive_models(self):
+                      testargs = ["examples/by_feature/gradient_accumulation_for_autoregressive_models.py"]
+                      run_command(self.launch_args + testargs)

Collaborator

muellerzr Oct 30, 2024

Just a nit: this doesn't use gradient accumulation here since it uses the default of 1

muellerzr reviewed

examples/by_feature/gradient_accumulation_for_autoregressive_models.py

+                      "--per_device_batch_size",
+                      type=int,
+                      default=2,
+                      help="The number of minibatches to be ran before gradients are accumulated.",

Collaborator

muellerzr Oct 30, 2024

Shouldn't this be "The size of each minibatch"?

Suggested change

-                      help="The number of minibatches to be ran before gradients are accumulated.",
+                      help="The size of each minibatch",

github-actions bot commented Nov 24, 2024

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
