I understand how to train a SentencePiece model in the monolingual case, but the multilingual case isn't clear to me, because dataset sizes vary greatly across languages. I think this leads to a biased shared vocabulary.
Is a sampling technique also used while training SentencePiece (see the sketch below for the kind of rebalancing I have in mind)?
If yes, how many times is sampling performed?
Wouldn't it be better to go through all the text in the dataset to create the sub-word vocabulary, instead of using only the samples?
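To make the question concrete, here is a minimal sketch of the temperature-based corpus rebalancing described in the XLM line of work (Lample & Conneau, 2019), where sentences are drawn with multinomial probabilities q_i ∝ p_i^α so that low-resource languages are over-sampled before the shared vocabulary is learned. The file paths, `alpha`, sentence budget, and vocabulary size below are illustrative assumptions, not values taken from this repository:

```python
import random

import sentencepiece as spm

# Hypothetical per-language corpora; paths are illustrative.
corpora = {"en": "train.en", "fr": "train.fr", "sw": "train.sw"}
alpha = 0.5  # smoothing exponent; the XLM-style literature uses values around 0.3-0.7

# Count sentences per language.
sizes = {}
for lang, path in corpora.items():
    with open(path, encoding="utf-8") as f:
        sizes[lang] = sum(1 for _ in f)

total = sum(sizes.values())
# Multinomial probabilities q_i proportional to p_i^alpha, where p_i = n_i / total.
# Raising raw frequencies to alpha < 1 up-weights low-resource languages.
weights = {lang: (n / total) ** alpha for lang, n in sizes.items()}
norm = sum(weights.values())
probs = {lang: w / norm for lang, w in weights.items()}

budget = 10_000_000  # total sentences fed to the SentencePiece trainer (assumed)
with open("spm_train_input.txt", "w", encoding="utf-8") as out:
    for lang, path in corpora.items():
        k = int(probs[lang] * budget)
        with open(path, encoding="utf-8") as f:
            lines = f.readlines()  # fine for a sketch; stream for large corpora
        # Sample with replacement so small corpora can be over-sampled.
        for line in random.choices(lines, k=k):
            out.write(line)

# SentencePiece itself can also subsample: input_sentence_size caps how many
# sentences the trainer reads, and shuffle_input_sentence randomizes the pick.
spm.SentencePieceTrainer.train(
    input="spm_train_input.txt",
    model_prefix="multilingual",
    vocab_size=64000,
    input_sentence_size=budget,
    shuffle_input_sentence=True,
)
```

Note that even without this external rebalancing step, SentencePiece's own `input_sentence_size` / `shuffle_input_sentence` options mean the trainer may already be working from a sample rather than the full corpus, which is part of what my question is about.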