
Can long text be split into short texts? #655

Open
CoinCheung opened this issue Jul 12, 2024 · 1 comment
Labels
type/question An issue that's a question

Comments

@CoinCheung

❓ The question

I generate training samples with Dolma, and I found that some of the texts are really long (up to 8k tokens), but my max_seq_len is only 2k. In this case, will the OLMo dataset split the 8k sample into 4 parts (each 2k long), or will only the first 2k tokens be kept while the remainder is dropped?

CoinCheung added the type/question label Jul 12, 2024
@aman-17
Member

aman-17 commented Oct 22, 2024

When generating training samples with Dolma, a text that exceeds your max_seq_len (e.g., 8k tokens when the limit is 2k) is not necessarily split into neat, contiguous 2k-token chunks. Instead, each slice can start at a different position within the original text, and slices are drawn in arbitrary order during training. There is therefore no guarantee that a document's continuation is preserved across batches: the model treats each new slice as a fresh sequence, so context from earlier tokens can be lost. In short, rather than a strict, ordered split into 4 parts, the text is tokenized and sliced in a way that can introduce discontinuities between batches.
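
To make the boundary effects concrete, here is a minimal, illustrative sketch of fixed-length slicing over a tokenized document. `slice_token_stream` is a hypothetical helper written for this comment, not part of the OLMo or Dolma codebase; the actual dataset reads fixed-length chunks lazily from tokenized data on disk, but the consequences for long documents are the same.

```python
import numpy as np

def slice_token_stream(token_ids: np.ndarray, max_seq_len: int) -> list[np.ndarray]:
    """Hypothetical illustration: cut a flat stream of token IDs into
    fixed-length training sequences. A document boundary need not align
    with a chunk boundary, and after shuffling, consecutive chunks of
    one document need not appear in the same (or adjacent) batches."""
    n_chunks = len(token_ids) // max_seq_len  # trailing remainder is dropped here
    return [token_ids[i * max_seq_len : (i + 1) * max_seq_len] for i in range(n_chunks)]

# An 8k-token document with max_seq_len = 2k yields 4 slices ...
doc = np.arange(8192, dtype=np.int32)  # stand-in for a tokenized document
chunks = slice_token_stream(doc, 2048)
print(len(chunks))  # 4

# ... but a training loader typically shuffles slice order, so the model may
# see chunk 3 long before chunk 0, with no shared context between them.
rng = np.random.default_rng(0)
print(rng.permutation(len(chunks)))  # e.g. [2 0 3 1]
```

If within-sequence continuity matters for your use case, one common mitigation is to pre-split long documents upstream so that each piece fits within max_seq_len after tokenization.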
