
Sequence packing with ChatDataset #11357

Open · LiweiPE opened this issue Nov 21, 2024 · 1 comment

Comments

LiweiPE commented Nov 21, 2024

Hello, may I know how to use gpt_sft_chat_dataset with sequence packing? I couldn't find any documentation on this; I only found instructions for converting from the input-output format.

qijiaxing (Collaborator) commented

The sequence packing steps can be found here.

To make it work for a chat dataset, you probably need to add the chat-format options, like this:

# Any StarCoder2 .nemo model works for model.restore_from_path.
python /workspace/sequence_packing/tokenize_dataset.py \
  model.data.train_ds.file_names=[/path/to/training.jsonl] \
  model.data.train_ds.max_seq_length=4096 \
  +model.data.chat=True \
  +model.data.chat_prompt_tokens.system_turn_start='<extra_id_0>' \
  +model.data.chat_prompt_tokens.turn_start='<extra_id_1>' \
  +model.data.chat_prompt_tokens.label_start='<extra_id_2>' \
  +model.data.chat_prompt_tokens.end_of_turn="\x0A" \
  +model.data.chat_prompt_tokens.end_of_name="\x0A" \
  model.restore_from_path=/path/to/starcoder2.nemo \
  +output_path=/path/to/my_dataset.npy
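
For reference, each line of training.jsonl would be a chat-format record roughly like the one below (field names shown here are illustrative of the gpt_sft_chat_dataset conversation format; please double-check them against the docs for your NeMo version):

{"system": "You are a helpful assistant.",
 "conversations": [
   {"from": "User", "value": "Write a Python function that reverses a string."},
   {"from": "Assistant", "value": "def reverse(s):\n    return s[::-1]"}
 ],
 "mask": "User",
 "type": "VALUE_TO_TEXT"}

The resulting my_dataset.npy can then be used as the packed training file; see the sequence packing docs linked above for the exact training-config flags.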
