❓ The question

I generate training samples with Dolma, and I found that some of the texts are really long (up to 8k tokens), but my `max_seq_len` is only 2k. In this case, will the OLMo dataset split the 8k sample into 4 parts (each 2k long), or is only the first 2k tokens kept while the remainder is dropped?
When generating training samples with Dolma, a text that exceeds `max_seq_len` (e.g., an 8k-token document with a 2k limit) is not simply truncated, but it is also not guaranteed to be split into neat, document-aligned 2k-token chunks. Tokenized documents are sliced into fixed-length windows, and a given slice can begin at an arbitrary position within the original text, with no guarantee that consecutive slices of the same document appear together across batches. The model may therefore see a slice that starts mid-document as a fresh start, without the preceding context. In short, rather than a clean four-way split, the text is tokenized and sliced in a way that can introduce discontinuities between training instances.
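To make this concrete, here is a minimal sketch of concatenate-then-chunk slicing. The NumPy implementation, the `max_seq_len` value, and the toy document lengths are all illustrative assumptions, not OLMo's actual code:

```python
import numpy as np

# Minimal sketch (illustrative, not OLMo's actual implementation): tokenized
# documents are concatenated into one flat token stream and cut into
# fixed-length windows of max_seq_len tokens. No document is truncated up
# front, but a window can begin or end mid-document.
max_seq_len = 2048

# Stand-ins for tokenized documents: a 100-token doc followed by an 8k-token doc.
docs = [np.arange(100), np.arange(8192)]
token_stream = np.concatenate(docs)

# Cut the stream into whole windows; any trailing partial window is dropped.
num_windows = len(token_stream) // max_seq_len
windows = token_stream[: num_windows * max_seq_len].reshape(num_windows, max_seq_len)

print(windows.shape)       # (4, 2048)
print(windows[0][98:102])  # [98 99 0 1] -- the short doc ends and the long
                           # doc begins mid-window
```

Because the long document starts at offset 100 in the stream, its four slices do not align with its own boundaries, and since windows are typically shuffled at training time, its consecutive slices need not land in the same batch.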