Tokenizing Phonetics #10
Comments
@ClashLuke How about using byte-level BPE?
That would be a perfect combination! Do you want to give it a try?
I am still checking with the original Facebook authors on some of the technical details about BBPE. Once I understand the BBPE mechanism, I can help implement this for your project, but I suppose you are already well-versed in the BBPE tokenizer?
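For reference, a minimal sketch of training a byte-level BPE tokenizer with the Hugging Face `tokenizers` package is below; the corpus path and vocabulary size are placeholders for illustration, not choices made in this thread.

```python
# Minimal byte-level BPE sketch using the Hugging Face `tokenizers` package.
# "corpus.txt" and the vocabulary size are illustrative placeholders.
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(files=["corpus.txt"], vocab_size=8000, min_frequency=2)

# Because merges happen over raw UTF-8 bytes, any input can be encoded,
# regardless of script.
encoding = tokenizer.encode("こんにちは world")
print(encoding.tokens)
print(encoding.ids)
```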
It's a bit hard to see, but the outputs do differ. Look at the bytes of the Japanese tokens and the spaces between them.
Yup, it is hard to see the extra whitespace. However, it is confusing how extra whitespace would be introduced by the dynamic-programming equation (1). Any idea? Besides, the equation might need to be adapted to limit
It's not about Chinese characters; it's about UTF-8, which encodes a single character in up to 4 bytes. Please look at its Wikipedia entry, as it explains the encoding quite well.
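A quick way to see the variable byte width in action (a small illustrative snippet, not from the original thread):

```python
# UTF-8 uses 1 to 4 bytes per character, depending on the code point.
for ch in ["a", "é", "日", "𝄞"]:
    encoded = ch.encode("utf-8")
    print(f"{ch!r}: {len(encoded)} byte(s) -> {[hex(b) for b in encoded]}")
```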
Currently, all tokenisers work on a character level. This means that transferring them to a new language is often not possible. At the same time, a model trained with such a tokeniser is specific to that particular language and won't transfer from Spanish to Italian without significant effort. Additionally, written language is a quantised form of speech that reduces the space needed to store it. However, this conversion is very lossy, as it doesn't preserve sarcasm or other vocal information.
We hope to reduce the first issue by using phonetic information while leaving the second untouched. The second could be solved by #9, although that uses less sparsity and therefore needs a bigger context to encode the same information.
This issue tracks the progress of implementing such a tokeniser built on phonetic information and the resulting language model trained with it.
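As a rough sketch of what such a pipeline could look like, the text could first be mapped to phonemes and the byte-level BPE tokeniser then trained on the phoneme strings. The use of the `phonemizer` package with the espeak backend, the file names, and the vocabulary size are all assumptions for illustration, not decisions made in this issue.

```python
# Hypothetical pipeline: grapheme-to-phoneme conversion followed by
# byte-level BPE over the phoneme strings. `phonemizer`/espeak, the file
# names and the vocab size are illustrative assumptions.
from phonemizer import phonemize
from tokenizers import ByteLevelBPETokenizer

# Phonemise the corpus once and write it to a side file.
with open("corpus.txt", encoding="utf-8") as src, \
        open("corpus.phonemes.txt", "w", encoding="utf-8") as dst:
    for line in src:
        dst.write(phonemize(line.strip(), language="en-us", backend="espeak") + "\n")

# Train byte-level BPE on the phonemised text instead of the raw text.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(files=["corpus.phonemes.txt"], vocab_size=8000)

sample = phonemize("tokenizing phonetics", language="en-us", backend="espeak")
print(tokenizer.encode(sample).tokens)
```

Switching to a new language would then mostly be a matter of changing the `language` code passed to the phonemiser, which is the transfer property this issue is after.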