LoRA: zero_pad speed improvements #630
Conversation
finetune/lora.py (Outdated)

@@ -250,7 +250,7 @@ def train(
         save_lora_checkpoint(fabric, model, checkpoint_path)


-@torch.inference_mode()
+@torch.no_grad()
In inference mode every new tensor is created as an inference tensor. In order to use such tensors for training, we have to clone them. Since every other fine-tuning script uses torch.no_grad for validation, I think it's easier/better to use this decorator here too.
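For context, a minimal sketch of the general PyTorch behavior behind this (not code from the PR): tensors created under torch.inference_mode() cannot be reused by autograd later unless they are cloned, whereas tensors created under torch.no_grad() are ordinary tensors.

```python
import torch

# A tensor created under inference_mode becomes an "inference tensor".
with torch.inference_mode():
    buf = torch.arange(4, dtype=torch.float32)

w = torch.randn(4, requires_grad=True)
try:
    (w * buf).sum().backward()  # autograd must save `buf` for backward
except RuntimeError as err:
    print(err)  # "Inference tensors cannot be saved for backward. ... make a clone ..."

# Cloning (or creating the tensor under no_grad instead) yields a normal tensor.
buf_ok = buf.clone()
(w * buf_ok).sum().backward()  # works
print(w.grad)
```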
Since validation (as a sanity check) is the first step of training, the lora_ind property is called for the first time there. So if validation runs in inference_mode, the indices are also stored in an inference tensor. That explains the issue above.
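A small illustration of that interaction (a simplified, hypothetical cache rather than the actual property): whichever context accesses a lazily cached tensor first determines what kind of tensor ends up in the cache.

```python
import torch

_cache = {}

def lora_ind() -> torch.Tensor:
    # Lazily built and cached on first access, like the lora_ind property.
    if "ind" not in _cache:
        _cache["ind"] = torch.arange(8)
    return _cache["ind"]

with torch.inference_mode():  # the sanity-check validation runs first...
    lora_ind()

# ...so the cached indices are an inference tensor, and later training
# steps that rely on them would fail.
print(lora_ind().is_inference())  # True
```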
indices.append(torch.arange(in_features + self.kv_embd_size, out_features, device=device))
self.register_buffer("_lora_ind", torch.cat(indices), persistent=False)

return self._lora_ind
One might ask why we don't place these indices on the target device (i.e. the GPU) in the init method. There is an issue with FSDP and meta devices if they are placed during init, so as a workaround a "lazy" initialization is used.
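A hedged sketch of that lazy-initialization pattern (names, shapes, and index layout are illustrative, not the exact lit-gpt code): the index buffer is only materialized on first use, at which point the layer's weight already lives on its real device, so FSDP/meta-device initialization never has to deal with it.

```python
import torch
from torch import nn

class QKVLinearSketch(nn.Module):
    def __init__(self, in_features: int, out_features: int, kv_embd_size: int):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features, bias=False)
        self.kv_embd_size = kv_embd_size
        # Note: no index tensor is created here, so meta-device / FSDP
        # initialization has nothing extra to materialize or shard.

    @property
    def lora_ind(self) -> torch.Tensor:
        if not hasattr(self, "_lora_ind"):
            # Resolved lazily: by now the weight sits on the real target device.
            device = self.linear.weight.device
            out_features = self.linear.out_features
            indices = [
                torch.arange(0, self.kv_embd_size, device=device),
                torch.arange(out_features - self.kv_embd_size, out_features, device=device),
            ]
            self.register_buffer("_lora_ind", torch.cat(indices), persistent=False)
        return self._lora_ind
```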
@@ -345,7 +339,7 @@ def merge(self):
            0
        )  # (1, 4, 128) @ (256, 2, 1) -> (1, 256, 128) -> (256, 128)
        # W = W + delta_W (merge)
-       self.linear.weight.data += self.zero_pad(delta_w * self.scaling)  # (256, 128) after zero_pad (384, 128)
+       self.linear.weight.data += self.zero_pad(delta_w.T * self.scaling).T  # (256, 128) after zero_pad (384, 128)
I think it's better to do the double transpose here (and only once), rather than every time in zero_pad, including the cases where it's not needed at all.
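A minimal sketch of that trade-off (hypothetical shapes and a simplified zero_pad, not the exact lit-gpt implementation): zero_pad stays specialized to the layout used by the hot forward path, and merge(), which runs only once, pays for the two transposes.

```python
import torch

def zero_pad(x: torch.Tensor, lora_ind: torch.Tensor, out_dim: int) -> torch.Tensor:
    # Scatter the trained slices into a zero tensor along the last dimension,
    # the layout the forward pass already uses.
    result = x.new_zeros(x.shape[0], out_dim)
    return result.index_copy_(1, lora_ind, x)

lora_ind = torch.tensor([0, 1, 2, 3, 8, 9, 10, 11])  # trained columns of a 12-wide output
delta_w = torch.randn(8, 4)                           # (out_trained, in) = lora_B @ lora_A
weight = torch.zeros(12, 4)                           # full (out, in) weight
scaling = 0.5

# merge(): transpose once so the padded dimension is last, pad, transpose back.
weight += zero_pad(delta_w.T * scaling, lora_ind, 12).T
print(weight.shape)  # torch.Size([12, 4])
```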
Hi there 👋

This PR is a result of #461. In that issue I found out that the creation of a new tensor from lora_ind (which is stored as a Python list on the CPU) for each zero_pad call ...

https://github.com/Lightning-AI/lit-gpt/blob/807c7bc17413d53961f96dc668aa03c0b970a43f/lit_gpt/lora.py#L293-L295

... implicitly calls cudaStreamSynchronize every time, and that slows down the forward pass a bit.
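To make the before/after concrete, here is a hedged sketch (simplified shapes and a standalone zero_pad, not the exact lit-gpt code) of the difference between rebuilding the index tensor from the Python list on every call and reusing a tensor that is materialized once on the device:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
lora_ind_list = list(range(0, 4096)) + list(range(8192, 12288))  # host-side Python list

def zero_pad_before(x: torch.Tensor, out_dim: int = 12288) -> torch.Tensor:
    # Before: a fresh index tensor is built from the Python list on every call,
    # forcing a host-to-device copy (and the implicit cudaStreamSynchronize).
    ind = torch.tensor(lora_ind_list, device=x.device)
    result = x.new_zeros(x.shape[0], out_dim)
    return result.index_copy_(1, ind, x)

# After: the indices are created once (in lit-gpt they are registered as a
# non-persistent buffer) and every zero_pad call reuses the on-device tensor.
lora_ind = torch.tensor(lora_ind_list, device=device)

def zero_pad_after(x: torch.Tensor, out_dim: int = 12288) -> torch.Tensor:
    result = x.new_zeros(x.shape[0], out_dim)
    return result.index_copy_(1, lora_ind, x)

x = torch.randn(4, len(lora_ind_list), device=device)
assert torch.equal(zero_pad_before(x), zero_pad_after(x))
```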
Traces

Note: numbers are provided for an Nvidia T4 with 16-mixed precision.

Let's take a look at the traces for Pythia-410m. Currently, zero_pad takes a significant portion of the time.

Note: compare the size of cudaStreamSynchronize in the screenshot above (CUDA 12.1) with the one from the "Performance Study" issue (CUDA 11.8) - it's much smaller thanks to the newer CUDA.

After the code is optimized, the trace shows that zero_pad takes a much smaller portion of the time. In numbers it's 830 μs vs 126 μs.

LoRA fine-tuning

Comparing LoRA fine-tuning with Pythia-410m over 1k iterations: not a drastic difference, but still a nice optimization.