🐛 Describe the bug
The tokenizer is from the olmo.tokenizer package. With the EOS token id kept at its default value of 50279, load the default tokenizer and run the code below:

Input:
tokenizer.encode("hello", add_special_tokens=False)

Output:
[25521, 50279]

The result shows that the parameter add_special_tokens=False does not work: the EOS token id 50279 is still appended.
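For completeness, here is a minimal self-contained reproduction sketch. The loading call is an assumption (I use Tokenizer.from_pretrained with a made-up identifier); adapt it to however you construct the tokenizer:

```python
# Minimal reproduction sketch. The loading call below is an assumption;
# adapt it to however you construct the OLMo tokenizer in your setup.
from olmo.tokenizer import Tokenizer

tokenizer = Tokenizer.from_pretrained("allenai/OLMo-1B")  # hypothetical identifier

# add_special_tokens=False should suppress the EOS token (id 50279),
# but it is still appended:
ids = tokenizer.encode("hello", add_special_tokens=False)
print(ids)  # observed: [25521, 50279]
assert ids[-1] != 50279, "EOS appended despite add_special_tokens=False"
```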
I found the cause at /olmo/tokenizer.py, line 183:

batch_encoding = self.base_tokenizer.encode_batch(inputs)

The add_special_tokens parameter is never forwarded to the base tokenizer's encode call, so special tokens are always appended regardless of the flag.
I discovered this because it triggered an assertion failure at /scripts/prepare_tulu_data.py, line 90.
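For what it's worth, a minimal sketch of a possible fix, assuming self.base_tokenizer is a Hugging Face tokenizers.Tokenizer (whose encode_batch accepts an add_special_tokens keyword argument). This is a suggestion, not an official patch:

```python
# olmo/tokenizer.py, line 183 -- sketch of a possible fix:
# forward the flag to the underlying tokenizers.Tokenizer instead of dropping it.
batch_encoding = self.base_tokenizer.encode_batch(
    inputs, add_special_tokens=add_special_tokens
)
```

With this change, tokenizer.encode("hello", add_special_tokens=False) should return [25521] only. One caveat: if the wrapper also appends EOS itself when add_special_tokens=True, passing True through here could add the special tokens twice; in that case the base call should stay hard-coded to add_special_tokens=False and the wrapper alone should handle EOS.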
Versions
0.5.1