
Eliminate cuda syncs #1374

Merged: 2 commits into main on Apr 29, 2024
Conversation

@robieta commented Apr 29, 2024

This PR fixes two CUDA syncs that I ran across when optimizing Gemma:

1) `max(1, non_masked_elems)`
This punts to a Python int before the value is implicitly converted back to a Tensor, which forces a host-device sync. (I'm pretty sure I'm responsible for this one.) We need to use the uglier but more performant `non_masked_elems.maximum(torch.ones_like(non_masked_elems))`.

2) `torch.tensor(self.lora_ind, device=result.device)`
This one is a little harder because we genuinely do need to move data from host to device. However, `lora_ind` is set in `__init__` and doesn't change, so the best we can do is cache the tensor the first time we see it on a given device.

NOTE: It's very important that we do our own caching rather than use `functools.cache`, as the latter extends the life of `self` by storing it in the cache.
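A minimal sketch of the two patterns described above (module and attribute names are illustrative, not the actual litgpt implementation):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Fix 1: clamp on-device instead of round-tripping through a Python int.
targets = torch.tensor([1, -100, 2, -100], device=device)            # toy data
non_masked_elems = (targets != -100).sum()                            # 0-dim tensor
denom = non_masked_elems.maximum(torch.ones_like(non_masked_elems))   # no host sync


# Fix 2: pay the host-to-device copy of the constant index list at most once
# per device, using a plain dict on the module (not functools.cache, which
# would keep `self` alive inside a module-level cache).
class LoRAModule(torch.nn.Module):  # hypothetical stand-in
    def __init__(self, lora_ind: list) -> None:
        super().__init__()
        self._lora_ind = torch.tensor(lora_ind)   # built once from the constant list
        self._lora_ind_cache: dict = {}           # device -> cached copy

    def _lora_ind_on(self, device: torch.device) -> torch.Tensor:
        if device not in self._lora_ind_cache:
            self._lora_ind_cache[device] = self._lora_ind.to(device)
        return self._lora_ind_cache[device]
```

With the dict keyed by device, the copy happens only on the first forward pass per device.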

@rasbt (Collaborator) commented Apr 29, 2024

Thanks a lot for the PR! Do you have rough estimates of the performance before and after? E.g., if it's a noticeable difference, it could potentially be related to #1369.

@robieta (Author) commented Apr 29, 2024

@rasbt It's going to be super case-dependent. (The LoRA one is definitely the much more important one.) I saw ~5%, but host-device syncs can vary from no difference to a several-fold slowdown. For #1369 it's impossible to say anything without a profile. (It's not clear to me that it should be related, but stranger things have happened.)
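Not part of the PR, but for anyone trying to gauge this on their own workload: PyTorch's CUDA sync debug mode can surface implicit host-device syncs. A small sketch with toy tensors (illustrative only):

```python
import torch

if torch.cuda.is_available():
    # Warn (or raise, with "error") whenever an op forces a CUDA synchronization.
    torch.cuda.set_sync_debug_mode("warn")

    n = (torch.randn(1024, device="cuda") > 0).sum()  # 0-dim CUDA tensor
    _ = max(1, n)                        # builtin max() reads the value on the host -> sync
    _ = n.maximum(torch.ones_like(n))    # stays on device, no warning

    torch.cuda.set_sync_debug_mode("default")
```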

@robieta (Author) commented Apr 29, 2024

By the way, I did an audit of other uses of torch.tensor and was pleasantly surprised to find no other cases that looked problematic. (Which is very unusual for a codebase of this size and complexity.) Thanks for keeping the bar high everyone!

@lantiga (Contributor) left a review:

Looks great. Let's add a couple of comments so future readers understand.

@robieta (Author) commented Apr 29, 2024

Added comments and fixed the lora_ind issue.

@robieta merged commit 4780604 into main on Apr 29, 2024 (9 checks passed)
@robieta deleted the robieta/eliminate_syncs branch on Apr 29, 2024 at 22:49
@awaelchli mentioned this pull request on Apr 30, 2024
Diff excerpt under discussion (litgpt/lora.py):

```diff
 if enable_v:
     v_ind = [x for x in ind if (x // head_size) % total_qkv == total_qkv - 1]
-    self.lora_ind.extend(v_ind)
+    lora_ind.extend(v_ind)
+self._lora_ind = torch.tensor(lora_ind)
```
Contributor:

The LoRA training script supports FSDP with meta-device initialization, but this change breaks that: `_lora_ind` is now a tensor created on the meta device that never gets re-initialized.

`self._lora_ind = torch.tensor(lora_ind)`

should probably move to `reset_parameters()`.
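A hedged sketch of that suggestion, assuming `reset_parameters()` is re-run when the meta-device model is materialized and that the raw indices are also kept as a plain Python list (names are illustrative):

```python
import torch

class LoRAModule(torch.nn.Module):  # hypothetical stand-in
    def __init__(self, lora_ind: list) -> None:
        super().__init__()
        self._lora_ind_list = list(lora_ind)  # a plain list survives meta-device init
        self.reset_parameters()

    def reset_parameters(self) -> None:
        # Rebuilding the tensor here means it is re-created on a real device at
        # materialization time instead of staying a stale meta-device tensor.
        self._lora_ind = torch.tensor(self._lora_ind_list)
```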

Collaborator:

Or we can keep `_lora_ind` as a Python list during initialization and place it on the target device as a tensor (inside the `zero_pad` method) if it's not already in the cache.
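Roughly, that alternative could look like the following (the real `zero_pad` does more than this; the attribute names and the padding-logic placeholder are assumptions):

```python
import torch

class LoRAModule(torch.nn.Module):  # hypothetical stand-in
    def __init__(self, lora_ind: list) -> None:
        super().__init__()
        self._lora_ind = list(lora_ind)  # no tensor (and no meta-device issue) at init
        self._lora_ind_cache: dict = {}  # device -> tensor

    def zero_pad(self, result: torch.Tensor) -> torch.Tensor:
        ind = self._lora_ind_cache.get(result.device)
        if ind is None:  # first call on this device: one host-to-device copy
            ind = torch.tensor(self._lora_ind, device=result.device)
            self._lora_ind_cache[result.device] = ind
        # ... the actual padding/scatter logic would use `ind` here ...
        return result
```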

Contributor:

Replacing the changes in this PR with those in #770 could also be a good alternative

Collaborator:

In theory that should work on a multi-GPU machine too, thanks to `self.register_buffer`, but I haven't checked it. Due to the higher cost of a multi-GPU machine, I almost never use more than a single GPU, so I'm lacking knowledge in this department.

At the same time, the code in this PR looks fairly compact, so I don't have any preference.
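For context, a minimal sketch of the `register_buffer` approach being referenced (class and attribute names assumed; #770's actual change may differ):

```python
import torch

class LoRAModule(torch.nn.Module):  # hypothetical stand-in
    def __init__(self, lora_ind: list) -> None:
        super().__init__()
        # A non-persistent buffer follows the module across devices
        # (model.to(device), distributed wrappers, etc.) and is excluded from
        # the state_dict, so no per-forward host-to-device copy is needed.
        self.register_buffer("_lora_ind", torch.tensor(lora_ind), persistent=False)
```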

@carmocca (Contributor) commented May 6, 2024

It just needs a test. You can let it run on CI. A simple test like:

```python
@RunIf(standalone=True, min_cuda_gpus=2)
def test_lora_model_fsdp_init():
    config = ...
    fabric = Fabric(devices=2, strategy="fsdp", precision="16-true")
    fabric.launch()
    with fabric.init_module(empty_init=True):
        model = GPT(config)
    x = ...
    model = fabric.setup(model)
    y = model(x)
    assert y.shape == ...
```

should catch this issue.

@Andrei-Aksionov (Collaborator) commented:

Hah, I had already forgotten that I created a PR to eliminate an unnecessary CUDA sync during the `zero_pad` call. I remember there was an issue with a CUDA stream overflow during the backward pass, which made the backward call slower, but thanks to the speedup during the forward pass the overall training time was still lower.
The funniest part is that I started to investigate it after @carmocca recommended watching the video where Taylor explained how to do profiling.
Eventually, @robieta fixed the issue himself 🙃.
