
feat: Modify Torchtune to train Ichigo Qwen #125

Closed
4 tasks done
Tracked by #122
hahuyhoang411 opened this issue Nov 20, 2024 · 4 comments
Comments

@hahuyhoang411 (Contributor) commented Nov 20, 2024

Problem

Our training pipeline currently supports the standard Qwen model but requires two critical modifications:

  1. The tokenizer implementation needs updating (adapt from the old tiktoken-based tokenizer to Qwen's BPE tokenizer)
  2. The LoRA checkpointing strategy needs revision

Background

  • We're exploring Qwen as our base model for Asian language support (particularly for Project Ichigo) due to its superior performance with Asian languages

Goal:

  • Upload the resized-embedding Qwen model
  • Modify the tokenizer to expand the vocab (a minimal sketch follows this list)
  • Modify the saving strategy
  • Investigate the bug where syncing new code from torchtune causes VRAM usage to fluctuate
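
A minimal sketch of the vocab-expansion and embedding-resize steps using the Hugging Face transformers API; the base checkpoint name, output directory, and sound-token strings below are illustrative assumptions, not the exact ones used for Ichigo:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    BASE = "Qwen/Qwen2.5-7B"  # illustrative base checkpoint

    tokenizer = AutoTokenizer.from_pretrained(BASE)
    model = AutoModelForCausalLM.from_pretrained(BASE)

    # Hypothetical Ichigo-style audio tokens added to the vocab.
    new_tokens = [f"<|sound_{i:04d}|>" for i in range(512)]
    tokenizer.add_tokens(new_tokens, special_tokens=True)

    # Grow the input/output embeddings to match the expanded tokenizer.
    model.resize_token_embeddings(len(tokenizer))

    tokenizer.save_pretrained("qwen2.5-ichigo-resized")
    model.save_pretrained("qwen2.5-ichigo-resized")

The resized checkpoint can then be pointed at from the training config in place of the stock Qwen weights.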
@hahuyhoang411 added the "type: experiment (One off experiment to test something)" label on Nov 20, 2024
@hahuyhoang411 changed the title from "feat: [DESCRIPTION]" to "feat: Modify Torchtune to train Ichigo Qwen" on Nov 20, 2024
@bachvudinh (Contributor) commented Nov 20, 2024

Sync Torchtune from Upstream

  • Thanks @hahuyhoang411 for creating the issue to track this. The latest version of torchtune supports Qwen2.5, so I had to sync from upstream and do some testing with the training. All the changes are summarized here: Sync upstream/bach torchtune#5.
    cc @tikikun
  • After the sync, there is a bug where VRAM usage spikes unevenly across GPUs; it may need further investigation.

@bachvudinh (Contributor) commented Nov 20, 2024

  • When uploading the resized-embedding Qwen 2.5 32B model, I realised there is a bug in the HF tokenizer: the tokenizer's vocab size differs from the embedding size. Found a similar issue: Tokenizer size and embedding size mismatch QwenLM/Qwen2.5#29.
  • I found the answer:
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "Qwen/Qwen2.5-32B"  # example checkpoint; substitute the one under test
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    print(tokenizer.vocab_size)  # base vocab, without added special tokens
    print(len(tokenizer))        # full vocab, including added special tokens
    params = model.state_dict()
    print(f"Updated embedding weights shape: {params['model.embed_tokens.weight'].shape}")
    
    The first value is the vocab size without special tokens, while the second includes them. The embedding size is larger still because padding rows were added for distributed training during the pretraining stage.
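
    As a quick sanity check of that explanation (the checkpoint name is again only an example), the gap between the embedding row count and the tokenizer length corresponds to those padding rows:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "Qwen/Qwen2.5-32B"  # example checkpoint
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    emb_rows = model.get_input_embeddings().weight.shape[0]
    print(emb_rows - len(tokenizer))  # rows kept as padding
    print(emb_rows % 128 == 0)        # True if padded to a multiple of 128 (see the next comment)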

@hahuyhoang411 (Contributor, Author) commented

Solution here?

I think the reason behind this mismatch is that they want to optimize the training process by padding the vocab size to a multiple of 128
(https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#)

So we can learn from them. First, let's expand the vocab from the ground-truth vocab (the tokenizer vocab); then we can add the padding embeddings later.

Reference from HF: https://github.com/huggingface/transformers/blob/40821a247823b35d7ff10ba490d0d930fe8f5afa/src/transformers/models/idefics2/modeling_idefics2.py#L1289
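
Following that approach, recent transformers versions let you do both steps in one call via the pad_to_multiple_of argument of resize_token_embeddings; a minimal sketch, with the checkpoint path being an illustrative placeholder for the expanded-vocab model:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    ckpt = "qwen2.5-ichigo-resized"  # placeholder: checkpoint with the expanded tokenizer vocab

    tokenizer = AutoTokenizer.from_pretrained(ckpt)
    model = AutoModelForCausalLM.from_pretrained(ckpt)

    # Resize to the true tokenizer length, then pad the row count up to a
    # multiple of 128 so the matmul shapes stay tensor-core friendly.
    model.resize_token_embeddings(len(tokenizer), pad_to_multiple_of=128)

    emb_rows = model.get_input_embeddings().weight.shape[0]
    print(emb_rows, emb_rows % 128)  # padded embedding rows; second value is 0 after padding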

@bachvudinh (Contributor) commented

I added the training code for the Ichigo Qwen2.5 family as a base to the torchtune codebase. It's in the dev branch and will soon be merged into main. I think we can close this issue. cc @tikikun @hahuyhoang411
