Need clarification on token limit of input used for fine tuning #20
Thank you for your question! The reason behind this is that most examples from OpenOrca and MathInstruct should fit within this context length (only chat examples were longer, but as they were sampled with a probability of 0.01, we decided to truncate them). In the instruction tuning code, examples are truncated (a prefix is kept and the rest is discarded):
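For illustration only, a minimal sketch of what that truncation amounts to (the function name and the max_input_length parameter are assumptions, not the repository's actual identifiers):

```python
# Illustrative sketch of prefix truncation: keep the first max_input_length
# tokens of a tokenized example and discard the rest.
def truncate_to_prefix(token_ids: list[int], max_input_length: int) -> list[int]:
    return token_ids[:max_input_length]
```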
To handle both long and short inputs without running everything with a 15K context, you can disable the always_pad option and set max_*_length to the maximum value in your dataset; the DataCollator will then pad to the maximum length within each batch of examples.
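Roughly, the effect of per-batch padding can be sketched like this (a hypothetical helper, not the repository's DataCollator; the name pad_batch and its arguments are assumptions):

```python
import torch

# Hypothetical sketch of dynamic per-batch padding: each batch is padded
# only up to its own longest sequence, not to a fixed 15K context.
def pad_batch(token_id_lists: list[list[int]], pad_token_id: int) -> torch.Tensor:
    batch_max = max(len(ids) for ids in token_id_lists)
    padded = [ids + [pad_token_id] * (batch_max - len(ids)) for ids in token_id_lists]
    return torch.tensor(padded, dtype=torch.long)
```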
These padding parameters were used to simulate long-context scenarios with short-context data (so that the model does not forget how to use the memory layers; for short examples, the random padding decides how much goes into the memory and how much into the last local context). This is possible because FoT assigns the same positional encoding to all tokens coming from the memory (in the memory layers). (One could ask about the positional encodings in the local context, as they are less utilized in our case, but we assume the model became well accustomed to them during the whole pre-training procedure and will not forget how to use them.) Note that in the FoT continued-pretraining code we do not truncate long documents (take a prefix and discard the rest); instead, the remaining parts are moved to the next batch. This is because standard language modeling on parts of a document still makes sense, whereas answering an unknown question may not.
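As a rough sketch of the random left/right padding described above (a hypothetical helper; the name random_pad and its parameters are assumptions): for a short example, the amount of padding placed on the left decides how many real tokens end up in the memory part versus the last local context.

```python
import random

# Hypothetical sketch of random left/right padding to a fixed target length.
def random_pad(token_ids: list[int], target_length: int, pad_token_id: int) -> list[int]:
    missing = max(0, target_length - len(token_ids))
    left = random.randint(0, missing)   # random share of padding on the left
    right = missing - left
    return [pad_token_id] * left + token_ids + [pad_token_id] * right
```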
Hi CStanKonrad, should I set the "mem_layers" parameter like
Hi, I am going through the page https://huggingface.co/syzymon/long_llama_code_7b_instruct. I found the text "All inputs were truncated and randomly padded (left/right) to 3072 tokens" under Training. Is there a reason behind this truncation? I have noticed that, when creating the instruct version from the LongLLaMA base model, the context lengths used for fine-tuning are significantly smaller than the context length available at inference time. I would like this clarified because I have prepared a dataset in a format similar to the jsonl file at https://github.com/chrishayuk/opl-train/blob/main/JSONL/train.jsonl, where each line is one input. Several of the inputs in my custom jsonl file are long, around ~15K tokens. If I follow the fine-tuning script you provided at https://github.com/CStanKonrad/long_llama/tree/main/instruction_fine_tuning for my dataset, will the longer inputs get truncated (or ignored beyond a certain number of tokens) during fine-tuning? Or is there a possibility that a longer input gets split into several windows during fine-tuning?