This repository has been archived by the owner on Jun 18, 2024. It is now read-only.
Hi -
I recently noticed that, for some inputs, the tokens produced by ALBERT's tokenizer implementation differ from those produced by the SentencePiece library directly. See below:
SentencePiece Implementation
Using Albert
After looking at ALBERT's tokenizer implementation, I see that the `if` condition here causes the differences in the outputs above: https://github.com/google-research/albert/blob/master/tokenization.py#L67
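For reference, here is a minimal sketch of what that branch appears to do, assuming the linked condition is the check for a piece that ends in a digit followed by a comma. This is my reading of the code, not a copy of it: `split_digit_comma`, `toy_encode`, and the sample pieces are hypothetical names and data, and `encode` stands in for the model's `EncodeAsPieces` call so the snippet runs without a model file.

```python
SPIECE_UNDERLINE = "\u2581"  # SentencePiece word-boundary marker


def split_digit_comma(pieces, encode):
    """Re-split pieces ending in '<digit>,' so the comma becomes its own piece.

    `encode` is a hypothetical stand-in for sp_model.EncodeAsPieces:
    a callable that maps a string to a list of pieces.
    """
    new_pieces = []
    for piece in pieces:
        if len(piece) > 1 and piece[-1] == "," and piece[-2].isdigit():
            # Re-encode the piece without its trailing comma and boundary marker.
            cur = list(encode(piece[:-1].replace(SPIECE_UNDERLINE, "")))
            # Re-encoding may add a boundary marker the original piece lacked; drop it.
            if cur and piece[0] != SPIECE_UNDERLINE and cur[0].startswith(SPIECE_UNDERLINE):
                if len(cur[0]) == 1:
                    cur = cur[1:]
                else:
                    cur[0] = cur[0][1:]
            cur.append(piece[-1])  # the comma, now a separate piece
            new_pieces.extend(cur)
        else:
            new_pieces.append(piece)
    return new_pieces


def toy_encode(text):
    # Toy encoder (assumption): returns the whole input as one marked piece.
    return [SPIECE_UNDERLINE + text]


raw = [SPIECE_UNDERLINE + "about", SPIECE_UNDERLINE + "1,", "000", SPIECE_UNDERLINE + "items"]
print(split_digit_comma(raw, toy_encode))
```

With a real model, `raw` would be the output of `EncodeAsPieces` and `encode` would be that same method, so any piece such as `1,` would be post-processed in this way while plain SentencePiece output would keep it intact.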
Could you explain the intuition behind these additional steps in ALBERT's tokenizer and what purpose they serve here?
Thanks!