[TTS] Add VietnameseCharsTokenizer #9665

huutuongtu · 2024-07-10T04:40:03Z

What does this PR do?

Adds a Vietnamese character-level tokenizer, VietnameseCharsTokenizer, for TTS training.

Collection: [TTS]

Changelog

Add VietnameseCharsTokenizer
Add unit tests for Vietnamese

Usage

from nemo.collections.common.tokenizers.text_to_speech.tts_tokenizers import VietnameseCharsTokenizer

text = "Xin chào các bạn."

tokenizer = VietnameseCharsTokenizer(
    pad_with_space=True,
)

tokens = tokenizer(text)
graphemes = tokenizer.decode(tokens)
graphemes = graphemes.replace('|', '')

print(tokens)
print(graphemes)
#  xin chào các bạn.
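
To illustrate the idea behind the usage above without requiring NeMo, here is a minimal, self-contained sketch of a character-level tokenizer. It is not the implementation added in this PR: the class name ToyCharsTokenizer, its character set, and the special tokens are all assumptions made for illustration. It only mirrors the behaviour visible in the usage snippet (lowercased text, optional space padding, and a decode step that joins symbols with '|').

```python
import unicodedata


class ToyCharsTokenizer:
    """Hypothetical stand-in for a character tokenizer; not NeMo's class."""

    PAD, OOV = "<pad>", "<oov>"

    def __init__(self, chars, punct=".,!?", pad_with_space=False):
        self.pad_with_space = pad_with_space
        # Normalize to NFC so each accented Vietnamese letter is one code point.
        letters = set(unicodedata.normalize("NFC", chars)) - set(punct) - {" "}
        self.tokens = [self.PAD, self.OOV, " "] + sorted(letters) + list(punct)
        self._token2id = {tok: i for i, tok in enumerate(self.tokens)}

    def encode(self, text):
        # Lowercase and normalize, then map each character to its id.
        text = unicodedata.normalize("NFC", text).lower()
        if self.pad_with_space:
            text = f" {text} "
        oov_id = self._token2id[self.OOV]
        return [self._token2id.get(ch, oov_id) for ch in text]

    def decode(self, ids):
        # Join symbols with '|' so token boundaries stay visible.
        return "|".join(self.tokens[i] for i in ids)


# Assumed character set: ASCII letters plus the accented letters in the sample.
tok = ToyCharsTokenizer("abcdefghijklmnopqrstuvwxyzàáạ", pad_with_space=True)
text = "Xin chào các bạn."
ids = tok.encode(text)
decoded = tok.decode(ids).replace("|", "")
print(ids)
print(repr(decoded))
```

A real Vietnamese character set is much larger than the three accented letters assumed here; any character outside the set is mapped to the out-of-vocabulary id in this sketch.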

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you add or update any necessary documentation?
Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
- Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

New Feature
Bugfix
Documentation

Signed-off-by: huutuongtu <[email protected]>


nemo/collections/common/tokenizers/text_to_speech/tts_tokenizers.py

XuesongYang

please refactor the code accordingly.

XuesongYang · 2024-07-24T06:57:19Z

nemo/collections/common/tokenizers/text_to_speech/ipa_lexicon.py

@@ -184,6 +208,32 @@ def get_ipa_punctuation_list(locale):
                '—',  # em dash, U+2014, decimal 8212
            ]
        )
+    if locale == "vi-VN":
+        punct_set.update(


Could you please add the source for the Vietnamese punctuation marks?

huutuongtu

Hmm, it seems there is no 'official' source that covers Vietnamese punctuation marks. I found some information about punctuation marks here: https://languagedrops.com/word/en/english/vietnamese/topics/punctuation/.
Maybe we just need to use DEFAULT_PUNCTUATION.

XuesongYang

Thanks, LGTM. Please add the source for the Vietnamese punctuation marks, if any.
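
The fallback discussed in this thread (use only the default punctuation when a locale has no dedicated, sourced list) could be sketched roughly as follows. This is not the code in ipa_lexicon.py: the function name get_punctuation_list, the LOCALE_EXTRA_PUNCTUATION table, the placeholder locale "xx-XX", and the use of Python's string.punctuation as a stand-in for DEFAULT_PUNCTUATION are all assumptions for illustration.

```python
import string

# Assumption: stand-in for the library's DEFAULT_PUNCTUATION, which may differ.
DEFAULT_PUNCTUATION = frozenset(string.punctuation)

# Hypothetical per-locale additions. "xx-XX" is a placeholder locale; "vi-VN"
# is deliberately absent so that it falls back to the defaults.
LOCALE_EXTRA_PUNCTUATION = {
    "xx-XX": {"\u2014"},  # em dash, U+2014
}


def get_punctuation_list(locale):
    """Return the sorted punctuation marks for a locale (illustrative only)."""
    punct_set = set(DEFAULT_PUNCTUATION)
    # Locales without an entry contribute nothing beyond the defaults.
    punct_set.update(LOCALE_EXTRA_PUNCTUATION.get(locale, ()))
    return sorted(punct_set)


vi_punct = get_punctuation_list("vi-VN")
print(len(vi_punct))
```

Keeping locale-specific marks in a separate table makes it easy to add a sourced Vietnamese list later without touching the default set.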


* Update tts_tokenizers.py
* Update tokenizer_utils.py
* Update test_tts_tokenizers.py
* Apply isort and black reformatting
  Signed-off-by: huutuongtu <[email protected]>
* Signed-off-by: Tu [[email protected]](mailto:[email protected])
* Update ipa_lexicon.py - Signed-off-by: Tu [[email protected]](mailto:[email protected])
  Signed-off-by: XuesongYang <[email protected]>

---------

Signed-off-by: huutuongtu <[email protected]>
Signed-off-by: Xuesong Yang <[email protected]>
Signed-off-by: XuesongYang <[email protected]>
Co-authored-by: huutuongtu <[email protected]>
Co-authored-by: Xuesong Yang <[email protected]>
Co-authored-by: XuesongYang <[email protected]>
Signed-off-by: Boxiang Wang <[email protected]>


huutuongtu added 5 commits July 10, 2024 11:00

Update tts_tokenizers.py

145f3fb

Update tokenizer_utils.py

9f4ad78

Update tts_tokenizers.py

59d0dd5

Update test_tts_tokenizers.py

d120a3d

Update tts_tokenizers.py

6efc946

github-actions bot added TTS common labels Jul 10, 2024

huutuongtu and others added 3 commits July 10, 2024 04:40

Apply isort and black reformatting

6a780c3

Signed-off-by: huutuongtu <[email protected]>

Signed-off-by: Tu [[email protected]](mailto:[email protected])

6258be6

Apply isort and black reformatting

e79b31a

Signed-off-by: huutuongtu <[email protected]>

XuesongYang requested review from XuesongYang, rlangman, mgrafu and blisc July 24, 2024 00:35

Merge branch 'main' into main

84c21b8

Signed-off-by: Xuesong Yang <[email protected]>

XuesongYang added the Run CICD label Jul 24, 2024

Apply isort and black reformatting

61091c7

Signed-off-by: XuesongYang <[email protected]>

XuesongYang reviewed Jul 24, 2024

nemo/collections/common/tokenizers/text_to_speech/tts_tokenizers.py

XuesongYang requested changes Jul 24, 2024

huutuongtu added 2 commits July 24, 2024 11:14

Update ipa_lexicon.py

503ad81

Update tts_tokenizers.py

52c9b00

XuesongYang reviewed Jul 24, 2024

XuesongYang self-requested a review July 24, 2024 06:58

XuesongYang previously approved these changes Jul 24, 2024

XuesongYang added Run CICD and removed Run CICD labels Jul 24, 2024

XuesongYang and others added 2 commits July 24, 2024 00:00

Merge branch 'main' into main

ce4dad8

Update tts_tokenizers.py

6f57a55

huutuongtu dismissed XuesongYang’s stale review via 6f57a55 July 24, 2024 07:58

Update ipa_lexicon.py - Signed-off-by: Tu [[email protected]](mailto:[email protected])

45cc59e

XuesongYang previously approved these changes Jul 24, 2024

Merge branch 'main' into main

f7c7a0c

Signed-off-by: Xuesong Yang <[email protected]>

XuesongYang dismissed their stale review via f7c7a0c July 25, 2024 17:19

Apply isort and black reformatting

b46a3ea

Signed-off-by: XuesongYang <[email protected]>

XuesongYang added Run CICD and removed Run CICD labels Jul 25, 2024

XuesongYang self-requested a review July 25, 2024 22:39

XuesongYang approved these changes Jul 25, 2024

XuesongYang merged commit 74c2caf into NVIDIA:main Jul 26, 2024
206 of 207 checks passed
