
Sequence packing with ChatDataset #11357

Open · LiweiPE opened this issue Nov 21, 2024 · 1 comment

Comments

LiweiPE commented Nov 21, 2024

Hello, may I know how to use gpt_sft_chat_dataset with sequence packing? I couldn't find any documentation on this; I only found instructions for converting from the input-output format.

qijiaxing (Collaborator) commented

The sequence packing steps can be found here.

To make it work for a chat dataset, you probably need to add the chat-format options, like this:

# Any StarCoder2 .nemo model works for model.restore_from_path.
python /workspace/sequence_packing/tokenize_dataset.py \
  model.data.train_ds.file_names=[/path/to/training.jsonl] \
  model.data.train_ds.max_seq_length=4096 \
  +model.data.chat=True \
  +model.data.chat_prompt_tokens.system_turn_start='<extra_id_0>' \
  +model.data.chat_prompt_tokens.turn_start='<extra_id_1>' \
  +model.data.chat_prompt_tokens.label_start='<extra_id_2>' \
  +model.data.chat_prompt_tokens.end_of_turn="\x0A" \
  +model.data.chat_prompt_tokens.end_of_name="\x0A" \
  model.restore_from_path=/path/to/starcoder2.nemo \
  +output_path=/path/to/my_dataset.npy
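
For reference, each line of training.jsonl would be a chat-format record roughly like the one below (field names shown here are illustrative of the gpt_sft_chat_dataset conversation format; please double-check them against the docs for your NeMo version):

{"system": "You are a helpful assistant.",
 "conversations": [
   {"from": "User", "value": "Write a Python function that reverses a string."},
   {"from": "Assistant", "value": "def reverse(s):\n    return s[::-1]"}
 ],
 "mask": "User",
 "type": "VALUE_TO_TEXT"}

The resulting my_dataset.npy can then be used as the packed training file; see the sequence packing docs linked above for the exact training-config flags.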
