Benchmarking Phi-3 on a single A100 40GB GPU: unable to reproduce benchmark results #236

cosmicBboy commented Sep 9, 2024

🐛 Describe the bug

I'm using Flyte to reproduce the token throughput and memory-savings results reported in this repo's README, under slightly different conditions: the microsoft/Phi-3-mini-4k-instruct model on a single A100 GPU.

Are the performance benefits of Liger only applicable to multi-GPU training workloads, or should they also take effect with single-GPU training?

Reproduce

The code I used for this is essentially the same as the code in this repo's Hugging Face example: https://github.com/linkedin/Liger-Kernel/tree/main/examples/huggingface

The full Flyte code is here: https://github.com/unionai/unionai-examples/pull/56/files
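
For reference, the Liger toggle in that example boils down to roughly the following (a minimal sketch, assuming the AutoLigerKernelForCausalLM loader from liger_kernel.transformers that the linked example uses; exact arguments may differ):

```python
import torch
import transformers
from liger_kernel.transformers import AutoLigerKernelForCausalLM

USE_LIGER = True  # toggle to compare against the plain HF baseline

if USE_LIGER:
    # Patches the Phi-3 modeling code with Liger's fused Triton kernels
    # (RMSNorm, RoPE, SwiGLU, cross-entropy) before loading the weights.
    model = AutoLigerKernelForCausalLM.from_pretrained(
        "microsoft/Phi-3-mini-4k-instruct", torch_dtype=torch.bfloat16
    )
else:
    model = transformers.AutoModelForCausalLM.from_pretrained(
        "microsoft/Phi-3-mini-4k-instruct", torch_dtype=torch.bfloat16
    )
```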

It produces this Flyte deck with the basic benchmark of Liger vs. the regular HF Transformers baseline:
[image: Flyte deck showing token throughput and peak memory reserved for Liger vs. baseline]

As you can see, Liger does reduce peak reserved memory, but token throughput is slightly lower.

Please advise! I can work on a Google Colab example if that would help with reproducing the issue.

Versions

datasets==2.21.0
pandas==2.2.2
matplotlib==3.9.2
huggingface-hub==0.24.6
transformers==4.42.2
trl==0.10.1
torch==2.4.0
liger-kernel==0.2.1

tyler-romero (Collaborator) commented Sep 9, 2024

Hmm, as a sanity check, can you try running your benchmark with a per-device batch size of 8 instead of 4? Using tensors whose dimensions are multiples of 8 can be important for Tensor Core utilization on modern NVIDIA GPUs (although that statement comes with a lot of caveats, and I'm not sure it's the issue here).
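
Assuming the benchmark configures the run through transformers.TrainingArguments, as in the repo's Hugging Face example, that is a one-line change (other values below are hypothetical):

```python
from transformers import TrainingArguments

# Sketch: bump the per-device batch size from 4 to 8 so activation tensors
# have dimensions that are multiples of 8, which helps Tensor Core
# utilization on A100-class GPUs.
training_args = TrainingArguments(
    output_dir="phi3-liger-benchmark",  # hypothetical path
    per_device_train_batch_size=8,      # was 4
    gradient_accumulation_steps=1,
    bf16=True,  # assumption: bf16 mixed precision, as in the example script
)
```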

A Colab that reproduces the issue would be helpful as well.

ByronHsu (Collaborator) commented

I think this is because you are using an A100 40GB, so the workload is heavily memory-bound, whereas we benchmarked on an A100 80GB. You could try the SGD optimizer instead of AdamW so that it takes less memory (-> more headroom for compute).
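
If the benchmark uses transformers.TrainingArguments, switching optimizers is a one-argument change. A minimal sketch, assuming a transformers release where "sgd" is a supported optim value:

```python
from transformers import TrainingArguments

# Sketch, assuming the HF Trainer setup from the example script.
# SGD keeps no per-parameter exp_avg/exp_avg_sq state, so it avoids the
# roughly two extra parameter-sized buffers that AdamW allocates.
training_args = TrainingArguments(
    output_dir="phi3-liger-benchmark",  # hypothetical path
    per_device_train_batch_size=8,
    optim="sgd",          # default is "adamw_torch"
    learning_rate=1e-3,   # hypothetical; SGD usually needs retuning vs. AdamW
)
```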

cosmicBboy (Author) commented

Okay, gonna try out SGD and report back with findings.
