
feat: Modify Torchtune to train Ichigo Qwen #125

Closed
4 tasks done
Tracked by #122
hahuyhoang411 opened this issue Nov 20, 2024 · 4 comments
Comments

@hahuyhoang411 (Contributor) commented Nov 20, 2024

Problem

Our training pipeline currently supports the standard Qwen model but requires two critical modifications:

  1. The tokenizer implementation needs updating (adapt from the old tiktoken-based tokenizer to Qwen's BPE tokenizer)
  2. The LoRA checkpointing strategy needs revision

Background

  • We're exploring Qwen as our base model for Asian language support (particularly for Project Ichigo) due to its superior performance with Asian languages

Goal:

  • Upload the resized-embedding Qwen model
  • Modify the tokenizer to expand the vocab (a minimal sketch follows this list)
  • Modify the saving strategy
  • Investigate the bug where syncing new code from torchtune causes VRAM usage to fluctuate
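
A minimal sketch of the vocab-expansion and embedding-resize steps using the Hugging Face transformers API; the base checkpoint name, output directory, and sound-token strings below are illustrative assumptions, not the exact ones used for Ichigo:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    BASE = "Qwen/Qwen2.5-7B"  # illustrative base checkpoint

    tokenizer = AutoTokenizer.from_pretrained(BASE)
    model = AutoModelForCausalLM.from_pretrained(BASE)

    # Hypothetical Ichigo-style audio tokens added to the vocab.
    new_tokens = [f"<|sound_{i:04d}|>" for i in range(512)]
    tokenizer.add_tokens(new_tokens, special_tokens=True)

    # Grow the input/output embeddings to match the expanded tokenizer.
    model.resize_token_embeddings(len(tokenizer))

    tokenizer.save_pretrained("qwen2.5-ichigo-resized")
    model.save_pretrained("qwen2.5-ichigo-resized")

The resized checkpoint can then be pointed at from the training config in place of the stock Qwen weights.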
@hahuyhoang411 added the "type: experiment (One off experiment to test something)" label on Nov 20, 2024
@hahuyhoang411 changed the title from "feat: [DESCRIPTION]" to "feat: Modify Torchtune to train Ichigo Qwen" on Nov 20, 2024
@bachvudinh (Contributor) commented Nov 20, 2024

Sync Torchtune from Upstream

  • Thanks @hahuyhoang411 for creating the issue to track this. The latest version of torchtune supports Qwen2.5, so I had to sync from upstream and do some testing with the training. All the changes are summarized here: Sync upstream/bach torchtune#5.
    cc @tikikun
  • After the sync, there is a bug where VRAM usage spikes unevenly across GPUs; it may need further investigation.

@bachvudinh (Contributor) commented Nov 20, 2024

  • When uploading the resized-embedding Qwen 2.5 32B model, I realised there is a bug in the HF tokenizer: the tokenizer's vocab size differs from the embedding size. Found a similar issue: Tokenizer size and embedding size mismatch QwenLM/Qwen2.5#29.
  • I found the answer:
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "Qwen/Qwen2.5-32B"  # example checkpoint; substitute the one under test
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    print(tokenizer.vocab_size)  # base vocab, without added special tokens
    print(len(tokenizer))        # full vocab, including added special tokens
    params = model.state_dict()
    print(f"Updated embedding weights shape: {params['model.embed_tokens.weight'].shape}")
    
    The first value is the vocab size without special tokens, while the second includes them. The embedding size is larger still because padding rows were added for distributed training during the pretraining stage.
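
    As a quick sanity check of that explanation (the checkpoint name is again only an example), the gap between the embedding row count and the tokenizer length corresponds to those padding rows:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "Qwen/Qwen2.5-32B"  # example checkpoint
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    emb_rows = model.get_input_embeddings().weight.shape[0]
    print(emb_rows - len(tokenizer))  # rows kept as padding
    print(emb_rows % 128 == 0)        # True if padded to a multiple of 128 (see the next comment)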

@hahuyhoang411 (Contributor, Author) commented

Solution here?

I think the reason behind this mismatch is that they want to optimize the training process by padding the vocab size to a multiple of 128
(https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#)

So we can learn from them. First, let's expand the vocab from the ground-truth vocab (the tokenizer vocab); then we can add the padding embeddings later.

Reference from HF: https://github.com/huggingface/transformers/blob/40821a247823b35d7ff10ba490d0d930fe8f5afa/src/transformers/models/idefics2/modeling_idefics2.py#L1289
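
Following that approach, recent transformers versions let you do both steps in one call via the pad_to_multiple_of argument of resize_token_embeddings; a minimal sketch, with the checkpoint path being an illustrative placeholder for the expanded-vocab model:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    ckpt = "qwen2.5-ichigo-resized"  # placeholder: checkpoint with the expanded tokenizer vocab

    tokenizer = AutoTokenizer.from_pretrained(ckpt)
    model = AutoModelForCausalLM.from_pretrained(ckpt)

    # Resize to the true tokenizer length, then pad the row count up to a
    # multiple of 128 so the matmul shapes stay tensor-core friendly.
    model.resize_token_embeddings(len(tokenizer), pad_to_multiple_of=128)

    emb_rows = model.get_input_embeddings().weight.shape[0]
    print(emb_rows, emb_rows % 128)  # padded embedding rows; second value is 0 after padding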

@bachvudinh (Contributor) commented

I added the training code for the Ichigo Qwen2.5 family as a base to the torchtune codebase. It's in the dev branch and will soon be merged into main. I think we can close this issue. cc @tikikun @hahuyhoang411
