
adding support for GPT-J #1

Open
wants to merge 7 commits into main

Conversation

@ImanHosseini commented Dec 6, 2022

Adding a LinearGELU kernel. This first commit is a small change; this kernel will be used in GPT-J.

@Guangxuan-Xiao (Owner)

Could you add a test file in tests to verify the correctness? Thank you!

@ImanHosseini changed the title from "adding LinearGELU kernel" to "adding support for GPT-J" on Dec 6, 2022
@ImanHosseini (Author) commented Dec 6, 2022

The GELU unit is working (it passes the test), but the GPT-J attention is failing (the test shows a high r2 error). Any idea what's going wrong there?
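For context, here is a minimal sketch of the kind of relative-error check that "r2" refers to here; the exact metric used by the repo's tests may differ, so treat this as an assumption:

    import torch

    @torch.no_grad()
    def relative_error(y_ref: torch.Tensor, y: torch.Tensor) -> float:
        # Squared error normalized by the power of the reference output.
        return ((y_ref.float() - y.float()).pow(2).mean()
                / y_ref.float().pow(2).mean()).item()

    # Example (module names are placeholders):
    # y_fp16 = fp16_attention(x)                      # ground-truth output
    # y_int8 = int8_attention(x)                      # quantized module under test
    # assert relative_error(y_fp16, y_int8) < 1e-3    # threshold is illustrative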

@Guangxuan-Xiao (Owner)

Hi Iman, sorry, I don't have the bandwidth at the moment to take a detailed look at your implementation, but I can point out some common pitfalls and offer implementation advice.

  • First, be careful with the memory ordering of matrices, especially in the bmm functions. CUTLASS uses column-major matrices, and a t in the function name means a transposed column-major matrix, i.e., a row-major matrix.
  • Second, be aware of the contiguity of tensor elements. The CUTLASS kernels only receive a raw pointer from the PyTorch tensor, so if you don't call contiguous() after operations like transpose() and permute(), the underlying memory remains unchanged and the kernel will read the elements in the old order (see the sketch after this list).
  • Third, you can cache each module's input and output activations and compare the ground truth with your modules' outputs. That way you will be able to spot the bug and fix it :).
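A minimal, standalone illustration of the contiguity pitfall from the second point (plain PyTorch, arbitrary shapes):

    import torch

    x = torch.arange(6, dtype=torch.int8).reshape(2, 3)
    xt = x.transpose(0, 1)                  # a view: shape becomes (3, 2), memory is untouched

    print(xt.is_contiguous())               # False
    print(x.data_ptr() == xt.data_ptr())    # True: both views share the same buffer

    # A kernel that only receives xt.data_ptr() would still see the elements in the
    # original (2, 3) layout. Materialize the transposed layout before handing it off:
    xt = xt.contiguous()
    print(xt.is_contiguous())               # True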

@ImanHosseini (Author) commented Dec 7, 2022

Thanks for the advice! It helped me narrow down and fix many of the problems.
I have added tests for each submodule, and every check passes; only 'test_gptj' (the end-to-end test) is left. I have a question: in test_opt here we have:

    int8_model_path = '/dataset/opt/opt-13b-smoothquant'

What am I supposed to do here? How do I generate this (for my GPT-J case) for this test?
I added the new changes. The test for the Int8GPTJBlock module has a low R2, and it incorporates the attention module. The next step is GPTJModel, which is basically a stack of GPTJBlocks, and that one has a high R2, so something is going wrong there.

@Guangxuan-Xiao (Owner)

> What am I supposed to do here? How do I generate this (for my GPT-J case) for this test?

You can refer to this script to generate the GPT-J INT8 model.

> The next step is GPTJModel, which is basically a stack of GPTJBlocks, and that one has a high R2, so something is going wrong there.

I am not sure why this happens. Judging from the differences in the cached activation values, at which layer does a significant error first appear? You can continue using this method to locate the bug (a sketch follows below).
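A minimal sketch of that activation-caching approach using plain PyTorch forward hooks (the model and tensor names in the usage comments are placeholders, not the actual ones in this PR):

    import torch

    def cache_activations(model, store):
        # Register hooks that record each submodule's output under its qualified name.
        handles = []
        for name, module in model.named_modules():
            def hook(mod, inputs, output, name=name):
                store[name] = output.detach().cpu() if torch.is_tensor(output) else output
            handles.append(module.register_forward_hook(hook))
        return handles

    # Usage (illustrative):
    # fp16_acts, int8_acts = {}, {}
    # handles = cache_activations(fp16_model, fp16_acts) + cache_activations(int8_model, int8_acts)
    # fp16_model(x); int8_model(x)
    # for name, y_ref in fp16_acts.items():
    #     if name in int8_acts and torch.is_tensor(y_ref):
    #         print(name, relative_error(y_ref, int8_acts[name]))  # first layer that diverges
    # for h in handles: h.remove()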

@ImanHosseini (Author) commented Dec 11, 2022

It seems to be an issue with how the model is reconstructed. To try it out, I started with OPT (no changes to the repo, just trying to recreate the results for OPT-350M).
First, I generate act_scales and then export the model (OPT-350M):

    python3 examples/generate_act_scales.py --model-name facebook/opt-350m --output-path act_scales/opt-350m.pt
    python3 examples/export_int8_model.py --model-name facebook/opt-350m --act-scales act_scales/opt-350m.pt --output-path i8m/opt-350m-sq

Then I ran 'test_opt':

    python3 tests/test_opt.py

I ran it once with the original model (which gives an accuracy of ~64%, as expected) and then with "int8" (using what I got from step 1: 'i8m/opt-350m-sq'), and the int8 run showed an accuracy of 0!
[I did not change the dataset or anything else.]
Update:
This was getting flipped (not sure why): https://github.com/huggingface/transformers/blob/799cea64ac1029d66e9e58f18bc6f47892270723/src/transformers/modeling_utils.py#L2293 (with no error, no warning, nothing), which seems bad?
I commented out those lines in my transformers installation, and now I get many errors like this:

    size mismatch for model.decoder.layers.0.self_attn.k_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([1, 1024]).
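A minimal sketch for diagnosing this kind of shape mismatch before calling load_state_dict (the checkpoint path and model variable in the usage comment are placeholders):

    import torch
    from torch import nn

    def report_shape_mismatches(model: nn.Module, checkpoint: dict) -> None:
        # Print every checkpoint entry whose shape differs from the model's parameter.
        model_state = model.state_dict()
        for key, ckpt_tensor in checkpoint.items():
            if key not in model_state:
                print(f"unexpected key in checkpoint: {key}")
            elif model_state[key].shape != ckpt_tensor.shape:
                print(f"{key}: checkpoint {tuple(ckpt_tensor.shape)} "
                      f"vs model {tuple(model_state[key].shape)}")

    # Usage (illustrative):
    # checkpoint = torch.load("i8m/opt-350m-sq/pytorch_model.bin", map_location="cpu")
    # report_shape_mismatches(int8_model, checkpoint)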

@Guangxuan-Xiao (Owner)

Currently, we don't support the OPT-350M model since it uses post-layernorm. Please try a 125M or 1.3B model.

@Guangxuan-Xiao (Owner)

Hi Iman, could you check the box Allow edits from maintainers so I can work on this PR?

@ImanHosseini (Author) commented Dec 15, 2022

Sure (it seems like it was checked already; let me know if it's not), but could you please hold off on looking into this until next week? I have found all the issues and can now recreate the results for GPT-J with high accuracy. I will clean up the changes and commit by the end of next week (make any changes you want on top of that commit). Is that OK?

@Guangxuan-Xiao (Owner)

Great! If you have any further questions or need assistance, feel free to ask. 😄

@ImanHosseini (Author) commented Dec 23, 2022

To generate that "model_dec_scales.json" I used a tweaked version of smoothquant: https://gist.github.com/ImanHosseini/ad922cc9c01e05bc16c59926b6f35fd9
It generates the scales based on the codegen models.
I added the "layers_to_keep" parameter to test how quantization affects the model: I start by keeping all the original layers intact and then quantize progressively more layers.
One thing I am considering for higher accuracy is to keep the MLP module as it is (no quantization), since that seems to be where most of the accuracy is lost. This might end up being better than, say, only quantizing half the layers: for the same memory savings, it could give better accuracy (see the sketch below).
Thoughts?
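A minimal sketch of the selective-quantization idea, assuming GPT-J blocks live in model.transformer.h (as in the Hugging Face implementation) and using a placeholder quantize_block callable rather than the actual conversion code in this PR:

    def selectively_quantize(model, layers_to_keep, quantize_block):
        # Replace transformer blocks with their INT8 counterparts, except the
        # indices listed in layers_to_keep, which stay in their original FP16 form.
        for idx, block in enumerate(model.transformer.h):
            if idx in layers_to_keep:
                continue                         # keep this layer unquantized
            model.transformer.h[idx] = quantize_block(block)
        return model

    # Example: keep the first two blocks intact; the MLP-only variant discussed
    # above would instead be handled inside quantize_block.
    # model = selectively_quantize(model, layers_to_keep={0, 1}, quantize_block=...)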

@LeoYoo247

@Guangxuan-Xiao Did you have the chance to look into this PR?
