This repository has been archived by the owner on Oct 31, 2023. It is now read-only.

How is sentence piece model trained in XLM-R? #350

Open
mani-rai opened this issue Jun 3, 2022 · 0 comments

Comments


mani-rai commented Jun 3, 2022

I understand how the SentencePiece model is trained in the monolingual case, but in the multilingual case it's not clear enough, because dataset sizes vary greatly across languages. I think this leads to a biased shared vocabulary.

  1. Does it use a sampling technique while training the SentencePiece model as well?
  2. If yes, how many times is the sampling performed?
  3. Isn't it better to go through all the text in the dataset to create the sub-word vocabulary, instead of just the samples?
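For context on question 1: the XLM-R paper (Conneau et al., 2020) describes sampling sentences across languages from a multinomial distribution whose probabilities are the raw language frequencies raised to an exponent α = 0.3, which upsamples low-resource languages. A minimal sketch of that rebalancing is below; the corpus sizes are made up for illustration, and whether SentencePiece itself is trained on exactly this sampled mix is part of what this issue is asking.

```python
# Sketch of exponentiated multinomial sampling (q_i ∝ p_i^alpha), as described
# in the XLM-R paper for rebalancing languages. Corpus sizes are hypothetical.
sizes = {"en": 300_000, "hi": 20_000, "ne": 2_000}  # invented line counts

alpha = 0.3  # smoothing exponent reported in the XLM-R paper

total = sum(sizes.values())
p = {lang: n / total for lang, n in sizes.items()}        # raw frequencies
q_unnorm = {lang: pi ** alpha for lang, pi in p.items()}  # p_i ** alpha
z = sum(q_unnorm.values())
q = {lang: qi / z for lang, qi in q_unnorm.items()}       # sampling probs

# Low-resource languages get a larger share than their raw frequency:
for lang in sizes:
    print(f"{lang}: raw={p[lang]:.3f} -> sampled={q[lang]:.3f}")
```

With α = 1 the distribution reduces to the raw frequencies; pushing α toward 0 flattens it toward uniform, so α = 0.3 is a middle ground that boosts low-resource languages without drowning out high-resource ones.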