
tokenizer.encode function's param add_special_tokens=False does not work #765

Open
xiaohan2909 opened this issue Dec 12, 2024 · 0 comments
Labels
type/bug An issue about a bug

Comments

@xiaohan2909

🐛 Describe the bug

The tokenizer is from the olmo.tokenizer package.
With the EOS token id left at its default value (50279), load the default tokenizer and run the code below:
input:
tokenizer.encode("hello", add_special_tokens=False)
output:
[25521, 50279]
The result shows that the parameter add_special_tokens=False does not work: the EOS token (50279) is still appended.
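For completeness, a minimal standalone reproduction sketch. The loading call is an assumption on my side (Tokenizer.from_pretrained and the "allenai/gpt-neox-olmo-dolma-v1_5" identifier may differ from how the default tokenizer is loaded in your setup):

```python
# Minimal reproduction sketch. The loading call below is an assumption;
# substitute however you load the default OLMo tokenizer.
from olmo.tokenizer import Tokenizer

tokenizer = Tokenizer.from_pretrained("allenai/gpt-neox-olmo-dolma-v1_5")  # assumed identifier

ids = tokenizer.encode("hello", add_special_tokens=False)
print(ids)
# actual:   [25521, 50279]  -> EOS (50279) is still appended
# expected: [25521]         -> no special tokens
```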
I found the cause in olmo/tokenizer.py line 183:
batch_encoding = self.base_tokenizer.encode_batch(inputs)
The add_special_tokens parameter is not passed through to the base tokenizer's encode_batch call.
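A sketch of one possible fix, assuming base_tokenizer is a Hugging Face tokenizers.Tokenizer (whose encode_batch accepts an add_special_tokens keyword) and that add_special_tokens is in scope at that line; the surrounding code is paraphrased rather than copied from the repo:

```python
# Sketch of a possible fix for olmo/tokenizer.py (paraphrased, not verbatim).
# Forward add_special_tokens so the base tokenizer's post-processor does not
# append EOS (50279) when the caller asked for no special tokens.
batch_encoding = self.base_tokenizer.encode_batch(
    inputs, add_special_tokens=add_special_tokens
)
```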

I found the bug because it caused an assertion failure in scripts/prepare_tulu_data.py line 90.
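Until this is fixed upstream, a possible workaround (my own suggestion, not from the repo) is to strip a trailing EOS id after encoding; the eos_token_id attribute is assumed here:

```python
# Hypothetical workaround: drop the trailing EOS that gets appended even with
# add_special_tokens=False. Assumes the tokenizer exposes eos_token_id (50279).
ids = tokenizer.encode("hello", add_special_tokens=False)
if ids and ids[-1] == tokenizer.eos_token_id:
    ids = ids[:-1]
```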

Versions

0.5.1

xiaohan2909 added the type/bug label on Dec 12, 2024