Tokenizing Phonetics #10

Open
ClashLuke opened this issue Apr 30, 2022 · 7 comments
Labels: downstream (Changes code wrapping the core model), ML (Requires machine-learning knowledge, can be built up on the fly), research (Creative project that might fail but could give high returns)

Comments

@ClashLuke
Member

Currently, all of our tokenisers work on the character level. This means that transferring them to a new language is often not possible, and a model trained with such a tokeniser is specific to that language: it won't transfer from Spanish to Italian without significant effort. Additionally, written language is a quantised form of speech that reduces the space needed to store it, but the conversion is very lossy, as it drops sarcasm and other vocal information.
We hope to reduce the first issue by using phonetic information while leaving the second untouched. The second could be addressed by #9, although that approach uses less sparsity and therefore needs a bigger context to encode the same information.
This issue tracks the progress of implementing such a tokeniser built on phonetic information and the resulting language model trained with it.
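A minimal sketch of what such a tokeniser could look like, assuming a grapheme-to-phoneme step (here via the third-party `phonemizer` package, which is not part of this repository) in front of an ordinary subword vocabulary; the function names and the toy vocabulary are purely illustrative:

```python
# Sketch: phonetic tokenisation = grapheme-to-phoneme conversion + subword split.
# Assumes the third-party `phonemizer` package (espeak backend) is installed;
# the tiny vocabulary below is illustrative, not a real phoneme inventory.
from phonemizer import phonemize


def text_to_phonemes(text: str, language: str = "en-us") -> str:
    # Convert written text to an IPA string, discarding spelling quirks.
    return phonemize(text, language=language, backend="espeak", strip=True)


def tokenize_phonemes(ipa: str, vocab: dict[str, int]) -> list[int]:
    # Greedy longest-match split of the IPA string into known phoneme tokens.
    tokens, i = [], 0
    while i < len(ipa):
        for j in range(len(ipa), i, -1):
            piece = ipa[i:j]
            if piece in vocab:
                tokens.append(vocab[piece])
                i = j
                break
        else:
            i += 1  # skip characters not covered by the toy vocabulary
    return tokens


if __name__ == "__main__":
    toy_vocab = {"h": 0, "ə": 1, "l": 2, "oʊ": 3, " ": 4, "w": 5, "ɜː": 6, "d": 7}
    ipa = text_to_phonemes("hello world")
    print(ipa, tokenize_phonemes(ipa, toy_vocab))
```

Because the model only ever sees phoneme tokens, two languages with different spelling conventions but overlapping sound inventories share most of the vocabulary.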

@ClashLuke added the research and ML labels on Apr 30, 2022
@buttercutter

@ClashLuke How about using byte-level BPE?

@ClashLuke
Member Author

That would be a perfect combination! Do you want to give it a try?
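For reference, byte-level BPE is readily available in the Hugging Face `tokenizers` library; the snippet below is only a sketch of how it could be wired up, with a placeholder corpus and vocabulary size rather than real project settings:

```python
# Sketch: training a byte-level BPE tokenizer with the Hugging Face `tokenizers`
# library. The corpus and vocab_size below are placeholders, not project settings.
from tokenizers import ByteLevelBPETokenizer

corpus = [
    "Currently, all tokenisers work on a character level.",
    "Written language is a quantised form of speech.",
]

tokenizer = ByteLevelBPETokenizer()
tokenizer.train_from_iterator(corpus, vocab_size=4096, min_frequency=1)

encoding = tokenizer.encode("phonetic tokenisation")
print(encoding.tokens)  # byte-level subword pieces
print(encoding.ids)     # their integer ids
```

Since every input is first mapped to raw UTF-8 bytes, the base alphabet is fixed at 256 symbols and no text can ever fall outside the vocabulary.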

@buttercutter

I am still checking with the original Facebook authors on some of the technical details of BBPE.

Once I understand the BBPE mechanism, I can help implement it for this project, but I suppose you are already well-versed in the BBPE tokeniser?

@buttercutter

I managed to understand the rationale behind the dynamic programming in equation (1) of the BBPE paper.

However, I am still trying to work out why the BBPE output would be the same between the 4K and 32K vocabularies.

[screenshot: example BBPE outputs at the 4K and 32K vocabulary sizes]
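My reading of equation (1) is a dynamic program that recovers the largest number of valid characters from a byte sequence by trying every span of 1 to 4 bytes ending at each position. A rough, unofficial reconstruction of that idea (not the authors' code):

```python
# Rough sketch of the character-recovery dynamic program as I understand
# equation (1): f[k] = maximum number of valid characters recoverable from the
# first k bytes, trying every span of 1..4 bytes that ends at position k.
# This is my own reconstruction, not the paper authors' implementation.

def is_valid_char(span: bytes) -> bool:
    try:
        decoded = span.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return len(decoded) == 1  # the span must decode to exactly one character


def recoverable_chars(byte_seq: bytes) -> int:
    n = len(byte_seq)
    f = [0] * (n + 1)
    for k in range(1, n + 1):
        best = f[k - 1]  # drop byte k if it cannot complete a character
        for t in range(1, 5):  # UTF-8 characters are 1 to 4 bytes long
            if k - t >= 0 and is_valid_char(byte_seq[k - t:k]):
                best = max(best, f[k - t] + 1)
        f[k] = best
    return f[n]


if __name__ == "__main__":
    ok = "こんにちは".encode("utf-8")   # 5 characters, 3 bytes each
    corrupted = ok[:-1]                 # chop the last byte of the final character
    print(recoverable_chars(ok), recoverable_chars(corrupted))  # 5 4
```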

@ClashLuke
Member Author

It's a bit hard to see, but the outputs do differ. Look at the bytes of the Japanese tokens and the spaces between them.

@buttercutter
Copy link

Yup, it is hard to see the extra whitespace. However, I'm confused about how an extra whitespace would be introduced by the dynamic programming in equation (1). Any idea?

Besides, the equation might need to be adapted to limit t to a maximum of 3 instead of 4, given that Chinese characters are encoded with just 3 bytes, as far as I found. Please correct me if I'm wrong.

[screenshot: byte encoding of Chinese characters]

@ClashLuke
Member Author

It's not about Chinese characters; it's about UTF-8, which encodes a character with up to 4 bytes. Please look at its Wikipedia entry, which explains the encoding quite well.
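A quick way to check the byte lengths directly with plain Python: ASCII letters take 1 byte, most CJK characters 3 bytes, and supplementary-plane characters such as emoji or rare CJK ideographs 4 bytes, which is why the bound of 4 is needed:

```python
# UTF-8 byte lengths: ASCII = 1, Latin with accents = 2,
# most CJK = 3, supplementary plane (emoji, rare ideographs) = 4.
for ch in ["a", "é", "中", "𠀋", "😀"]:
    print(ch, len(ch.encode("utf-8")))
# a 1, é 2, 中 3, 𠀋 4, 😀 4
```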

@ClashLuke added the downstream label on May 8, 2022