Adding support for GPT-J #1
base: main
Conversation
Could you add a test file in `tests` to verify the correctness? Thank you!
The GELU unit is working (it passes the test), but the GPT-J attention module is failing (the test shows a high r2). Any idea what's going wrong there?
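As used in this thread, r2 is a relative squared error between the float reference output and the INT8 output, so lower is better. A minimal sketch of the check, with hypothetical `fp_attn`/`int8_attn` module names standing in for the real ones:

```python
import torch

@torch.no_grad()
def rel_sq_err(y_ref: torch.Tensor, y_hat: torch.Tensor) -> torch.Tensor:
    # Relative squared error ("r2" above): ~0 when the INT8 module
    # matches the float reference, large when it diverges.
    y_ref, y_hat = y_ref.float(), y_hat.float()
    return ((y_ref - y_hat) ** 2).mean() / (y_ref ** 2).mean()

# Hypothetical usage: fp_attn and int8_attn are stand-ins for the real modules.
# x = torch.randn(1, 128, 4096, dtype=torch.float16).cuda()
# print(rel_sq_err(fp_attn(x)[0], int8_attn(x)[0]).item())
```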
Hi Iman, sorry, I don't have the bandwidth right now to take a detailed look at your implementation, but I can offer some common pitfalls and advice for the implementation.
Thanks for the advice! It helped me narrow down and fix many of the problems (see line 49 in d03a8da). I added the new changes. The test for the Int8GPTJBlock module has a low R2, and that block incorporates the attention module. The next step is GPTJModel, which is basically a stack of GPTJBlocks, and that one has a high R2, so something is going wrong there.
You can refer to this script to generate the GPT-J INT8 model.
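That script is linked rather than quoted, so here is a hedged sketch of what a GPT-J export flow might look like, modeled on SmoothQuant's OPT flow. `Int8GPTJForCausalLM` is the class this PR would add (name assumed), the file paths are placeholders, and the stock `smooth_lm` would need GPT-J support:

```python
import json
import torch
from transformers import GPTJForCausalLM
from smoothquant.smooth import smooth_lm  # would need GPT-J support (assumption)
# from smoothquant.gptj import Int8GPTJForCausalLM  # module this PR adds (assumed)

model = GPTJForCausalLM.from_pretrained("EleutherAI/gpt-j-6B",
                                        torch_dtype=torch.float16)

# Activation statistics from a calibration pass (placeholder path).
act_scales = torch.load("act_scales/gpt-j-6B.pt")
smooth_lm(model, act_scales, alpha=0.5)  # fold activation outliers into weights

# Static per-layer scales, e.g. the model_dec_scales.json discussed below.
with open("model_dec_scales.json") as f:
    decoder_layer_scales = json.load(f)

# Hypothetical: mirrors SmoothQuant's Int8OPTForCausalLM.from_float.
int8_model = Int8GPTJForCausalLM.from_float(model, decoder_layer_scales)
int8_model.save_pretrained("i8m/gpt-j-6B-sq")
```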
I am not sure why this happens. Judging from the differences in the cached activation values, at which layer does a significant error first appear? You can continue using this method to locate the bug.
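One way to do that layer-by-layer comparison is with forward hooks on the matching decoder blocks. A sketch, assuming both models expose GPT-J's `transformer.h` block list:

```python
import torch

@torch.no_grad()
def first_divergent_layer(model_ref, model_int8, input_ids):
    """Run both models once and print each decoder block's relative output
    error, to see at which layer the INT8 model starts to diverge."""
    acts_ref, acts_int8 = {}, {}

    def grab(store, idx):
        def hook(_module, _inputs, output):
            out = output[0] if isinstance(output, tuple) else output
            store[idx] = out.float()
        return hook

    handles = []
    for i, (blk_ref, blk_i8) in enumerate(zip(model_ref.transformer.h,
                                              model_int8.transformer.h)):
        handles.append(blk_ref.register_forward_hook(grab(acts_ref, i)))
        handles.append(blk_i8.register_forward_hook(grab(acts_int8, i)))

    model_ref(input_ids)
    model_int8(input_ids)
    for h in handles:
        h.remove()

    for i in sorted(acts_ref):
        a, b = acts_ref[i], acts_int8[i]
        err = ((a - b) ** 2).mean() / (a ** 2).mean()
        print(f"block {i:2d}: relative error {err.item():.4e}")
```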
It seems to be an issue with how the model is reconstructed. To try this out, I started with OPT (so no changes to the repo), just trying to recreate the results for OPT-350M. Then I ran `test_opt` twice: once with the original model, which gives an accuracy of ~64% as expected, and then with "int8", using what I got from step 1 ('i8m/opt-350m-sq'). The int8 run showed an accuracy of 0!
We haven't supported the OPT-350M model yet, since it uses post-LayerNorm. Please try with a 125M or 1.3B model.
Hi Iman, could you check the box?
Sure (it seems like it was checked already; let me know if it's not), but could you please hold off on looking into this until next week? I have found all the issues, and I can now recreate the results for GPT-J with high accuracy. I will clean up the changes and commit by the end of next week; feel free to make any changes you want on top of that commit. Is that ok?
Great! If you have any further questions or need assistance, feel free to ask. 😄
To generate that `model_dec_scales.json` I used a tweaked version of smoothquant: https://gist.github.com/ImanHosseini/ad922cc9c01e05bc16c59926b6f35fd9
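The gist has the authoritative code; the core idea, in simplified form, is to record activation maxima per decoder block over a calibration set and turn them into static INT8 scales. (The real SmoothQuant flow collects several scales per block, e.g. for the q/k/v and FC outputs, not just block input/output, so this is only a sketch.)

```python
import json
import torch

@torch.no_grad()
def collect_decoder_scales(model, calib_batches, path="model_dec_scales.json"):
    """Record the max |activation| at each decoder block's input and output
    over a calibration set and save them as static INT8 scales (max / 127)."""
    stats = {}

    def hook(name):
        def fn(_module, inputs, output):
            x = inputs[0]
            y = output[0] if isinstance(output, tuple) else output
            s = stats.setdefault(name, {"in": 0.0, "out": 0.0})
            s["in"] = max(s["in"], x.abs().max().item())
            s["out"] = max(s["out"], y.abs().max().item())
        return fn

    handles = [block.register_forward_hook(hook(f"h.{i}"))
               for i, block in enumerate(model.transformer.h)]
    for input_ids in calib_batches:
        model(input_ids)
    for h in handles:
        h.remove()

    scales = {k: {kk: v / 127 for kk, v in s.items()} for k, s in stats.items()}
    with open(path, "w") as f:
        json.dump(scales, f, indent=2)
```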
@Guangxuan-Xiao Did you have the chance to look into this PR? |
Adding a LinearGELU kernel. This first commit is a small change; the kernel will be used in GPT-J.
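For readers following along, here is a float reference for what such a fused kernel computes. The scale arguments `a` and `b` are assumed names, following the style of torch-int's existing LinearReLU modules, and GPT-J's activation is the tanh-approximated GELU ("gelu_new"):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def linear_gelu_ref(x_q, w_q, bias_q, a, b):
    # Dequantized INT8 GEMM plus bias, then GELU: this is the computation a
    # fused LinearGELU kernel performs in one pass. `a` scales x_q @ w_q.T
    # back to float; `b` scales the quantized bias.
    y = a * (x_q.float() @ w_q.float().t()) + b * bias_q.float()
    return F.gelu(y, approximate="tanh")  # GPT-J uses the tanh approximation
```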