
EOS token '\n' not working properly in llama3 #72

Open
sjaelee25 opened this issue May 22, 2024 · 1 comment

Comments

@sjaelee25

There is an issue with '\n' not working properly as a stop token in llama3. Encoding '\n' with tokenizer.encode returns token ID 198, but generation does not terminate at that token and continues producing subsequent text.
eos_token_id = base_model.tokenizer.encode("\n", bos=False, eos=False)[-1]
In contrast, other strings such as 'Q' work correctly. Additionally, testing with llama2 shows that all strings, including '\n', behave as expected.

Could you please look into this issue?
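A minimal sketch of the behavior (using the Hugging Face tokenizer as a stand-in for base_model.tokenizer above; the exact IDs come from the Llama-3 vocabulary and are worth verifying locally):

```python
# Sketch: why token 198 ("\n") can be unreliable as a stop token for Llama-3.
# Assumes the Hugging Face tokenizer as a stand-in for base_model.tokenizer.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

print(tok.encode("\n", add_special_tokens=False))    # [198]
print(tok.encode("\n\n", add_special_tokens=False))  # a single, different ID -- not [198, 198]
```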

@Ber666
Collaborator

Ber666 commented May 22, 2024

Yes, Llama-3's tokenization differs slightly from other models; e.g., \n\n is a single token distinct from \n. To use Llama-3, you may want to experiment with the tokenizer and investigate what the truly desired eos_token is for your use case.
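
One way to investigate, sketched below (assuming the Hugging Face tokenizer and generate API; the vocabulary scan is illustrative, not code from this repo):

```python
# Sketch: enumerate every vocabulary token whose decoded text ends in a
# newline, so all of them can be treated as stop tokens during generation.
# Assumes the Hugging Face tokenizer; this helper is not part of this repo.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

newline_ids = [
    tid for tid in range(tok.vocab_size)
    if tok.decode([tid]).endswith("\n")
]
print(newline_ids[:10])

# Hugging Face generate() accepts a list of stop-token IDs:
#   model.generate(..., eos_token_id=newline_ids)
```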
