LoRA: zero_pad speed improvements #770
Conversation
I did a very quick benchmark with Pythia-410m on 1xT4 comparing the code from this PR against the current main. It would be nice if someone with access to a multi-GPU machine could run a quick LoRA finetune just to make sure.
LGTM. I'll run finetuning
x = torch.randint(0, config.padded_vocab_size, size=(2, config.block_size), dtype=torch.int64, device=fabric.device)
model = fabric.setup(model)
y = model(x)
# logits shape: (batch_size, block_size, padded_vocab_size)
assert y.shape == torch.Size([2, 8, 512])
@Andrei-Aksionov Could we maybe add a sanity test that iterates over all model attributes of all submodules and asserts that, if an attribute is a tensor, its .is_meta is False (a rough sketch follows the excerpt below)? The previous bug wasn't caught simply because the LoRA defaults for query, key and value were all True, which would essentially skip this code path:
lit_gpt/lora.py, lines 330 to 342 in 90a16e4
if all(self.enable_lora):
    return x
# Let's imagine that:
# ⚬ input x has shape (64, 64, 256): (batch_size, sequence_length, embeddings_size)
# ⚬ embeddings_size: 128
# ⚬ self.linear.out_features: 384 (3 * embeddings_size)
# ⚬ enable_lora: [True, False, True]
# Then x has an embeddings_size of 256 (2 * 128, since enable_lora is True only for query and value, not key) and the
# expected embeddings_size is 384 (self.linear.out_features), so we need to pad from 256 to 384 with zeros, but
# only for key updates (this is where self.lora_ind comes in handy)
result = x.new_zeros(*x.shape[:-1], self.linear.out_features)  # (64, 64, 384)
return result.index_copy_(dim=-1, index=self.lora_ind, source=x)  # (64, 64, 384)
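Such a check could be a small helper along these lines (a rough sketch; assert_no_meta_tensors is a hypothetical name, not existing test code):

```python
import torch


def assert_no_meta_tensors(model: torch.nn.Module) -> None:
    # Fail if any tensor-valued attribute of any submodule is still on the meta device.
    for module_name, module in model.named_modules():
        for attr_name, attr in vars(module).items():
            if isinstance(attr, torch.Tensor):
                assert not attr.is_meta, f"{module_name}.{attr_name} is a meta tensor"
    # Parameters and buffers live in internal dicts, so check them explicitly as well.
    for name, tensor in list(model.named_parameters()) + list(model.named_buffers()):
        assert not tensor.is_meta, f"{name} is a meta tensor"
```

Calling it on the model right after fabric.setup(model) in the test above would flag any tensor attribute that was left on the meta device.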
Makes sense.
Sure, I'll do this.
While experimenting with GitHub Actions I deleted my fork (I know, I know) and thus all open PRs were automatically closed. This PR mirrors #630.
Hi there 👋

This PR is a result of #461. In that issue I found out that the creation of a new tensor with lora_ind (which is stored as a Python list on the CPU) for each zero_pad call ...

https://github.com/Lightning-AI/lit-gpt/blob/807c7bc17413d53961f96dc668aa03c0b970a43f/lit_gpt/lora.py#L293-L295

... implicitly calls cudaStreamSynchronize every time, which slows down the forward pass a bit.
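To illustrate the direction of the fix, here is a minimal sketch of the idea (ZeroPadSketch is an illustrative module, not the actual lit_gpt.lora code): keep the indices as a tensor that travels with the model, so zero_pad never has to rebuild it from a Python list on every call.

```python
import torch


class ZeroPadSketch(torch.nn.Module):
    """Illustrative only: pads updates with zeros along the last dimension."""

    def __init__(self, lora_ind: list, out_features: int) -> None:
        super().__init__()
        self.out_features = out_features
        # A buffer is moved to the target device once, together with the model,
        # instead of being re-created from a Python list (and synchronizing the
        # CUDA stream) on every zero_pad call.
        self.register_buffer("lora_ind", torch.tensor(lora_ind), persistent=False)

    def zero_pad(self, x: torch.Tensor) -> torch.Tensor:
        result = x.new_zeros(*x.shape[:-1], self.out_features)
        return result.index_copy_(dim=-1, index=self.lora_ind, source=x)
```

With this layout, moving the module to a device (e.g. via fabric.setup) also moves lora_ind, so the forward pass only launches the index_copy_ kernel without an extra host-to-device copy.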
Traces

Note: Numbers are provided for an Nvidia T4 and 16-mixed precision.

Let's take a look at the traces for Pythia-410m. Currently, zero_pad takes a significant part of the time:

Note: Compare the size of cudaStreamSynchronize in the screenshot above (CUDA 12.1) with the one from the "Performance Study" issue (CUDA 11.8) - it's much smaller thanks to the newer CUDA.

After the code is optimized, the trace shows that zero_pad now takes a much smaller portion of the time:

In numbers, it's 830 μs vs 126 μs.
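For anyone who wants to reproduce this kind of measurement, a Chrome trace can be captured with torch.profiler; the snippet below is a generic sketch with a stand-in model, not the exact setup used for the numbers above.

```python
import torch
from torch.profiler import ProfilerActivity, profile

# Stand-in model and input; the traces in this PR were collected for Pythia-410m.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Sequential(torch.nn.Embedding(512, 256), torch.nn.Linear(256, 512)).to(device)
x = torch.randint(0, 512, (2, 8), device=device)

activities = [ProfilerActivity.CPU] + ([ProfilerActivity.CUDA] if device == "cuda" else [])
with profile(activities=activities) as prof:
    with torch.no_grad():
        for _ in range(10):
            model(x)

# Inspect the result in chrome://tracing or Perfetto.
prof.export_chrome_trace("trace.json")
```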
LoRA fine-tuning

Comparing LoRA fine-tuning with Pythia-410m and 1k iterations, we have:

Not a drastic difference, but still a nice optimization.