This repository has been archived by the owner on Oct 31, 2023. It is now read-only.

How is sentence piece model trained in XLM-R? #350

Open
mani-rai opened this issue Jun 3, 2022 · 0 comments

Comments


mani-rai commented Jun 3, 2022

I understand how the SentencePiece model is trained in the monolingual case, but in the multilingual case it's not clear enough, because dataset sizes vary greatly across languages. I think this leads to a biased shared vocabulary.

  1. Does it use a sampling technique while training the SentencePiece model as well?
  2. If yes, how many times is the sampling performed?
  3. Isn't it better to go through all the text in the dataset to create the sub-word vocabulary, instead of just the samples?
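For context on question 1: the XLM-R paper (Conneau et al., 2020) describes sampling sentences across languages from a multinomial distribution whose probabilities are the raw language frequencies raised to an exponent α = 0.3, which upsamples low-resource languages. A minimal sketch of that rebalancing is below; the corpus sizes are made up for illustration, and whether SentencePiece itself is trained on exactly this sampled mix is part of what this issue is asking.

```python
# Sketch of exponentiated multinomial sampling (q_i ∝ p_i^alpha), as described
# in the XLM-R paper for rebalancing languages. Corpus sizes are hypothetical.
sizes = {"en": 300_000, "hi": 20_000, "ne": 2_000}  # invented line counts

alpha = 0.3  # smoothing exponent reported in the XLM-R paper

total = sum(sizes.values())
p = {lang: n / total for lang, n in sizes.items()}        # raw frequencies
q_unnorm = {lang: pi ** alpha for lang, pi in p.items()}  # p_i ** alpha
z = sum(q_unnorm.values())
q = {lang: qi / z for lang, qi in q_unnorm.items()}       # sampling probs

# Low-resource languages get a larger share than their raw frequency:
for lang in sizes:
    print(f"{lang}: raw={p[lang]:.3f} -> sampled={q[lang]:.3f}")
```

With α = 1 the distribution reduces to the raw frequencies; pushing α toward 0 flattens it toward uniform, so α = 0.3 is a middle ground that boosts low-resource languages without drowning out high-resource ones.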