Gemma 2 #9672
Conversation
Hi! Thank you for your work! When continuing pretraining of the model using your updated Gemma-related code, I found that the initial loss is around 9.x, while HF was about 2.x. Are there still some differences that are not aligned?
Hi @Emperorizzis, thanks for your interest! It should work the same as HF, but there could be a bug somewhere. Do you mind sharing your config and/or run command?
Hi @cuichenx, I used your modeling file, as well as native Megatron (f3a3020031f384ddafd9b7e9f3a587798c0aea21) for training (with a few additional arguments). Below is my Megatron argument configuration during training:
And below are the losses of the first 5 steps:
The data should be fine; I sampled 1 million articles from enwiki and tokenized them using the tokenizer from gemma2-9b.
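A quick way to establish the HF-side reference number mentioned above is a loss check along the following lines. This is only a rough sketch: the model name and sample text are placeholders, not the actual setup from the report.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "google/gemma-2-9b"  # placeholder: the checkpoint being compared
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

text = "A sample passage from the enwiki pretraining data."  # placeholder text
inputs = tokenizer(text, return_tensors="pt")  # Gemma tokenizers typically prepend <bos> by default
with torch.no_grad():
    loss = model(**inputs, labels=inputs["input_ids"]).loss
print(f"HF reference loss: {loss.item():.3f}")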
@Emperorizzis I verified inference and finetuning performance with the code in the NeMo framework, and the accuracy looked okay.
* gemma2 initial commit
* enable conversion on cpu
* fix code scanning
* typo in config
* fix output layer and add comments
* refactor model customize to one function
* unpin transformers version
* Apply isort and black reformatting

Signed-off-by: Chen Cui <[email protected]>
Signed-off-by: cuichenx <[email protected]>
Co-authored-by: cuichenx <[email protected]>
Hi! Thank you for your response! After our testing, we found that there seem to be two bugs:

1) Sliding window

import torch

def get_swa(seq_q, seq_kv, w):
    """Create the equivalent attention mask for SWA in [seq_q, seq_kv] shape."""
    m = torch.ones(seq_q, seq_kv, dtype=torch.bool, device="cuda")
    ### original
    # mu = torch.triu(m, diagonal=seq_kv - seq_q - w[0])
    ### after the fix
    mu = torch.triu(m, diagonal=seq_kv - seq_q - w[0] + 1)
    ml = torch.tril(mu, diagonal=seq_kv - seq_q + w[1])
    # invert so that True marks positions that must NOT be attended to
    ml = ~ml
    return ml

For example:
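Here is a minimal sketch of what the one-token shift changes, assuming seq_q = seq_kv = 4 and w = (2, 0). These values and the helper below are illustrative only, not part of the original code.

import torch

def band_mask(seq_q, seq_kv, w, shift):
    # Same construction as get_swa above, on CPU, with `shift` = 0 (original)
    # or 1 (proposed fix). True marks positions that are masked out.
    m = torch.ones(seq_q, seq_kv, dtype=torch.bool)
    mu = torch.triu(m, diagonal=seq_kv - seq_q - w[0] + shift)
    ml = torch.tril(mu, diagonal=seq_kv - seq_q + w[1])
    return ~ml

print(band_mask(4, 4, (2, 0), shift=0).long())
# original diagonal: each query attends to itself and the 2 previous tokens
# tensor([[0, 1, 1, 1],
#         [0, 0, 1, 1],
#         [0, 0, 0, 1],
#         [1, 0, 0, 0]])

print(band_mask(4, 4, (2, 0), shift=1).long())
# shifted diagonal: each query attends to itself and the previous token,
# i.e. w[0] positions in total including the current one
# tensor([[0, 1, 1, 1],
#         [0, 0, 1, 1],
#         [1, 0, 0, 1],
#         [1, 1, 0, 0]])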
2) The odd and even sliding-window layers are reversed relative to the reference implementation; see huggingface/transformers@a695c18.
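For clarity, a sketch of the parity selection at issue follows. Which parity is correct should be taken from the linked transformers commit; the even-index convention below is only an assumption.

# Sketch only: alternate sliding-window and global attention across layers.
# The correct parity must match the reference implementation in the
# transformers commit linked above; even-index layers are assumed here.
SLIDING_WINDOW = 4096  # Gemma 2 sliding window size

def uses_sliding_window(layer_idx: int) -> bool:
    return layer_idx % 2 == 0  # flip to == 1 if the reference uses the other parity

# (-1, -1) denotes full (global) attention in the flash-attention window convention
window_sizes = [(SLIDING_WINDOW, 0) if uses_sliding_window(i) else (-1, -1)
                for i in range(8)]
print(window_sizes)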
We also found that adding or not adding the <bos> token when continuing pretraining of the Gemma base model has a significant impact on the initial loss (possibly up to a 2x difference). Hope this finding helps other people. Thank you again for your work!
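As a small, assumption-laden sketch of the <bos> point above, one can make sure every pretraining sample starts with the token explicitly. The tokenizer name and the helper below are illustrative, not taken from the original setup.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-9b")  # assumed tokenizer

def with_bos(text: str) -> list[int]:
    # Tokenize without special tokens, then prepend <bos> explicitly so the
    # continual-pretraining samples match the format the base model expects.
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    return [tokenizer.bos_token_id] + ids

ids = with_bos("An enwiki article goes here.")
assert ids[0] == tokenizer.bos_token_id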
@Emperorizzis @cuichenx I am hitting the same problem (high initial loss) when continually pre-training the gemma2-2B model. The SFT scripts work well (training loss < 3), but the pre-training loss is extremely high (training loss > 20).

Experiments
The following experiments are run with the same docker image.

SFT training:
[B] gemma2_2b_sft_0830_main

Continual pre-training:
[C] gemma2_2b_sft_pretraining
[D] gemma2_2b_sft_pretraining_bos

Observations
(1) SFT tuning is normal, but the loss also differs with codebase changes: [A] and [B] use the same data and docker image but produce different loss curves.

Looking for help
Could @cuichenx provide a stable codebase version tag and the necessary guidance on how to run NeMo for the Gemma 2 model, for both pre-training and SFT? Thanks for your great work!
Thanks for reporting these issues! I will look into them this week.
What does this PR do?
Duplicate of #9587 for release branch
Also includes transformers version update from #9606
Collection: [Note which collection this PR will affect]
Changelog
Usage
# Add a code snippet demonstrating how to use this
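A hedged sketch of one possible way to use the result of this PR: restore a converted Gemma 2 checkpoint in NeMo and inspect its config. The .nemo path is hypothetical, the conversion step itself is not shown, and exact Trainer options may vary with the Lightning version.

from pytorch_lightning import Trainer
from nemo.collections.nlp.models.language_modeling.megatron_gpt_model import MegatronGPTModel
from nemo.collections.nlp.parts.nlp_overrides import NLPDDPStrategy

trainer = Trainer(devices=1, accelerator="gpu", strategy=NLPDDPStrategy(), precision="bf16-mixed")
# "gemma2_9b.nemo" is a hypothetical path produced by the HF-to-NeMo checkpoint converter
model = MegatronGPTModel.restore_from("gemma2_9b.nemo", trainer=trainer)
model.freeze()
print(model.cfg.num_layers, model.cfg.hidden_size)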
GitHub Actions CI
The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.
The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".
Before your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
The contributor guidelines contain specific people who can review PRs in various areas.
Additional Information