
tokenizer.encode function's param add_special_tokens=False does not work #765

Open
xiaohan2909 opened this issue Dec 12, 2024 · 0 comments
Labels
type/bug An issue about a bug

Comments

@xiaohan2909

🐛 Describe the bug

The tokenizer is from the olmo.tokenizer package.
With the EOS token id left at its default value (50279), load the default tokenizer and run the code below:
input:
tokenizer.encode("hello", add_special_tokens=False)
output:
[25521, 50279]
The result shows that the parameter add_special_tokens=False does not work: the EOS token (50279) is still appended.
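For completeness, a minimal standalone reproduction sketch. The loading call is an assumption on my side (Tokenizer.from_pretrained and the "allenai/gpt-neox-olmo-dolma-v1_5" identifier may differ from how the default tokenizer is loaded in your setup):

```python
# Minimal reproduction sketch. The loading call below is an assumption;
# substitute however you load the default OLMo tokenizer.
from olmo.tokenizer import Tokenizer

tokenizer = Tokenizer.from_pretrained("allenai/gpt-neox-olmo-dolma-v1_5")  # assumed identifier

ids = tokenizer.encode("hello", add_special_tokens=False)
print(ids)
# actual:   [25521, 50279]  -> EOS (50279) is still appended
# expected: [25521]         -> no special tokens
```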
I found the cause in olmo/tokenizer.py line 183:
batch_encoding = self.base_tokenizer.encode_batch(inputs)
The add_special_tokens parameter is not passed through to the base tokenizer's encode_batch call.
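A sketch of one possible fix, assuming base_tokenizer is a Hugging Face tokenizers.Tokenizer (whose encode_batch accepts an add_special_tokens keyword) and that add_special_tokens is in scope at that line; the surrounding code is paraphrased rather than copied from the repo:

```python
# Sketch of a possible fix for olmo/tokenizer.py (paraphrased, not verbatim).
# Forward add_special_tokens so the base tokenizer's post-processor does not
# append EOS (50279) when the caller asked for no special tokens.
batch_encoding = self.base_tokenizer.encode_batch(
    inputs, add_special_tokens=add_special_tokens
)
```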

I found the bug because it caused an assertion failure in scripts/prepare_tulu_data.py line 90.
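Until this is fixed upstream, a possible workaround (my own suggestion, not from the repo) is to strip a trailing EOS id after encoding; the eos_token_id attribute is assumed here:

```python
# Hypothetical workaround: drop the trailing EOS that gets appended even with
# add_special_tokens=False. Assumes the tokenizer exposes eos_token_id (50279).
ids = tokenizer.encode("hello", add_special_tokens=False)
if ids and ids[-1] == tokenizer.eos_token_id:
    ids = ids[:-1]
```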

Versions

0.5.1

xiaohan2909 added the type/bug label on Dec 12, 2024