Tokenizing Phonetics #10

Open
ClashLuke opened this issue Apr 30, 2022 · 7 comments
Labels: downstream (Changes code wrapping the core model), ML (Requires machine-learning knowledge, can be built up on the fly), research (Creative project that might fail but could give high returns)

Comments

@ClashLuke
Member

Currently, all of our tokenisers work on the character level. This means that transferring them to a new language is often not possible, and a model trained with such a tokeniser is specific to that language: it won't transfer from Spanish to Italian without significant effort. Additionally, written language is a quantised form of speech that reduces the space needed to store it, but the conversion is very lossy, as it drops sarcasm and other vocal information.
We hope to reduce the first issue by using phonetic information while leaving the second untouched. The second could be addressed by #9, although that approach uses less sparsity and therefore needs a bigger context to encode the same information.
This issue tracks the progress of implementing such a tokeniser built on phonetic information and the resulting language model trained with it.
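A minimal sketch of what such a tokeniser could look like, assuming a grapheme-to-phoneme step (here via the third-party `phonemizer` package, which is not part of this repository) in front of an ordinary subword vocabulary; the function names and the toy vocabulary are purely illustrative:

```python
# Sketch: phonetic tokenisation = grapheme-to-phoneme conversion + subword split.
# Assumes the third-party `phonemizer` package (espeak backend) is installed;
# the tiny vocabulary below is illustrative, not a real phoneme inventory.
from phonemizer import phonemize


def text_to_phonemes(text: str, language: str = "en-us") -> str:
    # Convert written text to an IPA string, discarding spelling quirks.
    return phonemize(text, language=language, backend="espeak", strip=True)


def tokenize_phonemes(ipa: str, vocab: dict[str, int]) -> list[int]:
    # Greedy longest-match split of the IPA string into known phoneme tokens.
    tokens, i = [], 0
    while i < len(ipa):
        for j in range(len(ipa), i, -1):
            piece = ipa[i:j]
            if piece in vocab:
                tokens.append(vocab[piece])
                i = j
                break
        else:
            i += 1  # skip characters not covered by the toy vocabulary
    return tokens


if __name__ == "__main__":
    toy_vocab = {"h": 0, "ə": 1, "l": 2, "oʊ": 3, " ": 4, "w": 5, "ɜː": 6, "d": 7}
    ipa = text_to_phonemes("hello world")
    print(ipa, tokenize_phonemes(ipa, toy_vocab))
```

Because the model only ever sees phoneme tokens, two languages with different spelling conventions but overlapping sound inventories share most of the vocabulary.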

@ClashLuke added the research and ML labels on Apr 30, 2022
@buttercutter

@ClashLuke How about using byte-level BPE?

@ClashLuke
Member Author

That would be a perfect combination! Do you want to give it a try?
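For reference, byte-level BPE is readily available in the Hugging Face `tokenizers` library; the snippet below is only a sketch of how it could be wired up, with a placeholder corpus and vocabulary size rather than real project settings:

```python
# Sketch: training a byte-level BPE tokenizer with the Hugging Face `tokenizers`
# library. The corpus and vocab_size below are placeholders, not project settings.
from tokenizers import ByteLevelBPETokenizer

corpus = [
    "Currently, all tokenisers work on a character level.",
    "Written language is a quantised form of speech.",
]

tokenizer = ByteLevelBPETokenizer()
tokenizer.train_from_iterator(corpus, vocab_size=4096, min_frequency=1)

encoding = tokenizer.encode("phonetic tokenisation")
print(encoding.tokens)  # byte-level subword pieces
print(encoding.ids)     # their integer ids
```

Since every input is first mapped to raw UTF-8 bytes, the base alphabet is fixed at 256 symbols and no text can ever fall outside the vocabulary.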

@buttercutter

I am still checking with the original Facebook authors on some of the technical details of BBPE.

Once I understand the BBPE mechanism, I can help implement it for this project, but I suppose you are already well-versed in the BBPE tokeniser?

@buttercutter

I managed to understand the rationale behind the dynamic programming in equation (1) of the BBPE paper.

However, I am still trying to work out why the BBPE output would be the same between the 4K and 32K vocabularies.

[screenshot: example BBPE outputs at the 4K and 32K vocabulary sizes]
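My reading of equation (1) is a dynamic program that recovers the largest number of valid characters from a byte sequence by trying every span of 1 to 4 bytes ending at each position. A rough, unofficial reconstruction of that idea (not the authors' code):

```python
# Rough sketch of the character-recovery dynamic program as I understand
# equation (1): f[k] = maximum number of valid characters recoverable from the
# first k bytes, trying every span of 1..4 bytes that ends at position k.
# This is my own reconstruction, not the paper authors' implementation.

def is_valid_char(span: bytes) -> bool:
    try:
        decoded = span.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return len(decoded) == 1  # the span must decode to exactly one character


def recoverable_chars(byte_seq: bytes) -> int:
    n = len(byte_seq)
    f = [0] * (n + 1)
    for k in range(1, n + 1):
        best = f[k - 1]  # drop byte k if it cannot complete a character
        for t in range(1, 5):  # UTF-8 characters are 1 to 4 bytes long
            if k - t >= 0 and is_valid_char(byte_seq[k - t:k]):
                best = max(best, f[k - t] + 1)
        f[k] = best
    return f[n]


if __name__ == "__main__":
    ok = "こんにちは".encode("utf-8")   # 5 characters, 3 bytes each
    corrupted = ok[:-1]                 # chop the last byte of the final character
    print(recoverable_chars(ok), recoverable_chars(corrupted))  # 5 4
```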

@ClashLuke
Member Author

It's a bit hard to see, but the outputs do differ. Look at the bytes of the Japanese tokens and the spaces between them.

@buttercutter
Copy link

Yup, it is hard to see the extra whitespace. However, I'm confused about how an extra whitespace would be introduced by the dynamic programming in equation (1). Any idea?

Besides, the equation might need to be adapted to limit t to a maximum of 3 instead of 4, given that Chinese characters are encoded with just 3 bytes, as far as I found. Please correct me if I'm wrong.

[screenshot: byte encoding of Chinese characters]

@ClashLuke
Member Author

It's not about Chinese characters; it's about UTF-8, which encodes a character with up to 4 bytes. Please look at its Wikipedia entry, which explains the encoding quite well.
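A quick way to check the byte lengths directly with plain Python: ASCII letters take 1 byte, most CJK characters 3 bytes, and supplementary-plane characters such as emoji or rare CJK ideographs 4 bytes, which is why the bound of 4 is needed:

```python
# UTF-8 byte lengths: ASCII = 1, Latin with accents = 2,
# most CJK = 3, supplementary plane (emoji, rare ideographs) = 4.
for ch in ["a", "é", "中", "𠀋", "😀"]:
    print(ch, len(ch.encode("utf-8")))
# a 1, é 2, 中 3, 𠀋 4, 😀 4
```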

@ClashLuke added the downstream label on May 8, 2022