FYI: Run models from piper with the Next-gen Kaldi subproject sherpa-onnx #251

csukuangfj opened this issue Oct 26, 2023 · 66 comments

@csukuangfj

FYI: We have supported piper models in
https://github.com/k2-fsa/sherpa-onnx

Note that it does not depend on https://github.com/rhasspy/piper-phonemize

sherpa-onnx supports a variety of platforms, such as

  • Windows (x86, x64)
  • Linux (x64, arm, arm64), e.g., Raspberry Pi
  • macOS (x64, arm64)

It also provides APIs for various programming languages, e.g., C/C++/Python/Kotlin/Swift/C#/Go. We also have Android APKs for TTS.

You can find the installation doc at https://k2-fsa.github.io/sherpa/onnx/install/index.html

You can find the usage of piper models with sherpa-onnx at
https://k2-fsa.github.io/sherpa/onnx/tts/pretrained_models/vits.html#lessac-blizzard2013-medium-english-single-speaker

We also have a huggingface space for you to try piper models with sherpa-onnx.
Please visit
https://huggingface.co/spaces/k2-fsa/text-to-speech



You can find the PR supporting piper in sherpa-onnx at k2-fsa/sherpa-onnx#390

@mush42
Contributor

mush42 commented Oct 26, 2023

@csukuangfj where to find the Android APKs?

@beqabeqa473

@csukuangfj Yes, it would be good to know about android tts as well. Could you please tell where to get it?

@csukuangfj
Author

I'm sorry for not getting back to you sooner.

I have been working on converting more models from piper.

Now all models of the following languages have been converted to sherpa-onnx:

  • English (both US and GB)
  • French
  • German
  • Spanish (both ES and MX)

You can find the Android APKs on the following page.
https://k2-fsa.github.io/sherpa/onnx/tts/apk.html


@beqabeqa473

beqabeqa473 commented Oct 29, 2023 via email

@csukuangfj
Author

Are they using the standard Android text-to-speech API or not?

@beqabeqa473

No, it uses sherpa-onnx with vits pre-trained models for tts.

Everything is open-sourced. You can find the source code for the android project at
https://github.com/k2-fsa/sherpa-onnx/tree/master/android/SherpaOnnxTts

The underlying C++ code can be found at https://github.com/k2-fsa/sherpa-onnx

The JNI C++ binding code can be found at
https://github.com/k2-fsa/sherpa-onnx/tree/master/sherpa-onnx/jni

You can find kotlin API examples at
https://github.com/k2-fsa/sherpa-onnx/tree/master/kotlin-api-examples

@beqabeqa473

beqabeqa473 commented Oct 29, 2023 via email

@synesthesiam
Contributor

Thanks for doing this @csukuangfj! I'd looked into sherpa-onnx at one point, but wasn't sure how to proceed. I'd like to link to your work when you think it's stable enough; I do want to make sure people understand that pronunciations may be slightly different due to the pre-computed lexicon.

Speaking of the lexicon, could it be extended dynamically at runtime with your approach?

@csukuangfj
Author

@synesthesiam

but wasn't sure how to proceed.

We have detailed documentation at
https://k2-fsa.github.io/sherpa/onnx/

Could you tell us what you want to do? We can clarify the doc if you think it is not clear.


I do want to make sure people understand that pronunciations may be slightly different due to the pre-computed lexicon.

The lexicon.txt is generated by following the colab notebook from this repo
https://github.com/rhasspy/piper/blob/master/notebooks/piper_inference_(ONNX).ipynb

The exact code can be found at
https://github.com/csukuangfj/models/tree/master/.github/scripts

Could you explain where the difference comes from?

Speaking of the lexicon, could it be extended dynamically at runtime with your approach?

No, it cannot. If there is an OOV at runtime, it is simply ignored, though a message is printed to tell the user
that an OOV has been ignored.
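
The behaviour described above can be sketched roughly as follows. This is only an illustration in Python, not the actual sherpa-onnx C++ code: the function name `words_to_tokens`, the message text, and the two-word lexicon are all made up for the example.

```python
def words_to_tokens(text, lexicon):
    """Map each word to its phoneme tokens; skip and report OOV words."""
    tokens = []
    skipped = []
    for word in text.lower().split():
        phones = lexicon.get(word)
        if phones is None:
            # Out-of-vocabulary: drop the word, but tell the user about it.
            print(f"OOV ignored: {word}")
            skipped.append(word)
            continue
        tokens.extend(phones)
    return tokens, skipped


# Hypothetical two-word lexicon, only for illustration.
lexicon = {"hello": ["h", "ə", "l", "oʊ"], "world": ["w", "ɜː", "l", "d"]}
tokens, skipped = words_to_tokens("Hello brave world", lexicon)
```

Because the lexicon is a fixed table loaded up front, a word that is missing from it simply contributes no tokens, which is why the synthesized audio would skip it.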

I'd like to link to your work when you think it's stable enough;

Thank you! I think the support for offline VITS models is stable now. (The APIs for the VITS model are quite simple and
there should be no big changes to the APIs in the near future)

@synesthesiam
Contributor

Could you tell us what you want to do? We can clarify the doc if you think it is not clear.

I meant more "big picture" in how I should proceed. I wasn't sure if it was worth investigating porting Piper to sherpa-onnx. I'd be curious if you've noticed any speed difference.

@csukuangfj
Author

Thanks for doing this @csukuangfj! I'd looked into sherpa-onnx at one point, but wasn't sure how to proceed. I'd like to link to your work when you think it's stable enough; I do want to make sure people understand that pronunciations may be slightly different due to the pre-computed lexicon.

Speaking of the lexicon, could it be extended dynamically at runtime with your approach?

@synesthesiam
I am integrating piper-phonemize so that we can discard lexicon.txt in sherpa-onnx.

Could you have a look at the following two PRs?

@csukuangfj
Author


https://huggingface.co/csukuangfj/vits-piper-pt_PT-tugao-medium/tree/main

I have converted all of the models from piper to sherpa-onnx.
No lexicon.txt is required any more. I am using piper-phonemize.

(Note that you can run all of the models on Android, iOS, Raspberry Pi, etc.)

@anita-smith1

anita-smith1 commented Dec 8, 2023

@csukuangfj
"No lexicon.txt is required any more. I am using piper-phonemize."

Does this apply to piper models only? Is a lexicon required for coqui tts models? I'm following up on #257.

I couldn't use my coqui tts model converted to sherpa-onnx because I had to manually add words to the lexicon, and single words were pronounced poorly.

@csukuangfj
Author

is lexicon required for coqui tts models?

No, it is also not required for coqui tts models

All vits models for coqui don't use lexicon.txt for sherpa-onnx.


I couldn't use my coqui tts model converted to sherpa-onnx because I had to manually add words to the lexicon, and single words were pronounced poorly.

Please look at just one coqui model at
https://github.com/k2-fsa/sherpa-onnx/releases/tag/tts-models

For instance, you can look at
https://github.com/k2-fsa/sherpa-onnx/releases/download/tts-models/vits-coqui-en-ljspeech.tar.bz2

Download it, extract it, and you will find the code for exporting models from coqui to sherpa-onnx.

@anita-smith1

@csukuangfj
Author

@csukuangfj meaning your notebook doesn't work anymore ? https://colab.research.google.com/drive/1cI9VzlimS51uAw4uCR-OBeSXRPBc4KoK?usp=sharing

I just updated the colab notebook. Please reload it.

@anita-smith1

The updated colab notebook is much much simpler than before.

@anita-smith1

anita-smith1 commented Dec 8, 2023

@csukuangfj meaning your notebook doesn't work anymore ? https://colab.research.google.com/drive/1cI9VzlimS51uAw4uCR-OBeSXRPBc4KoK?usp=sharing

I just updated the colab notebook. Please reload it.

@anita-smith1

The updated colab notebook is much much simpler than before.

Your colab notebook works for default vits models, but when I use my fine-tuned vits model, which contains words like "orrse" and "atua" (not in the English dictionary), I get the error "Error when reading tokens at Line <PAD> 0. size: 5" when I try to synthesize speech. It seems to be a tokens.txt issue.

The first colab which used lexicons worked, but this does not work with a fine tuned model containing your own words. How can we solve this issue?


@csukuangfj
Author

Please show your metadata and add --debug=1 to your command line.

@anita-smith1

--debug=1

meta_data {'model_type': 'vits', 'comment': 'coqui', 'language': 'English', 'voice': 'en-us', 'has_espeak': 1, 'add_blank': 1, 'blank_id': 3, 'n_speakers': 0, 'use_eos_bos': 0, 'bos_id': 2, 'eos_id': 1, 'sample_rate': 22050}

adding --debug=1, I have the output:

/project/sherpa-onnx/csrc/parse-options.cc:Read:361 sherpa-onnx-offline-tts --vits-model=./model.onnx --vits-tokens=./tokens.txt --vits-data-dir=./espeak-ng-data --output-filename=./test.wav --debug=1 'orrse wo betumi atua de a fa mobile' 

/project/sherpa-onnx/csrc/offline-tts-vits-model.cc:Init:79 ---vits model---
bos_id=2
use_eos_bos=0
n_speakers=0
blank_id=3
has_espeak=1
voice=en-us
sample_rate=22050
language=English
add_blank=1
comment=coqui
eos_id=1
model_type=vits
----------input names----------
0 input
1 input_lengths
2 scales
----------output names----------
0 output


/project/sherpa-onnx/csrc/piper-phonemize-lexicon.cc:ReadTokens:66 Error when reading tokens at Line <PAD> 0. size: 5
---------------------------------------------------------------------------
CalledProcessError                        Traceback (most recent call last)
<ipython-input-13-c8218415962b> in <cell line: 1>()
----> 1 get_ipython().run_cell_magic('shell', '', '\nsherpa-onnx-offline-tts \\\n --vits-model=./model.onnx \\\n --vits-tokens=./tokens.txt \\\n --vits-data-dir=./espeak-ng-data \\\n --output-filename=./test.wav \\\n --debug=1 \\\n "orrse wo betumi atua de a fa mobile"\n')

3 frames
/usr/local/lib/python3.10/dist-packages/google/colab/_system_commands.py in check_returncode(self)
    135   def check_returncode(self):
    136     if self.returncode:
--> 137       raise subprocess.CalledProcessError(
    138           returncode=self.returncode, cmd=self.args, output=self.output
    139       )

CalledProcessError: Command '
sherpa-onnx-offline-tts \
 --vits-model=./model.onnx \
 --vits-tokens=./tokens.txt \
 --vits-data-dir=./espeak-ng-data \
 --output-filename=./test.wav \
 --debug=1 \
 "orrse wo betumi atua de a fa mobile"
' returned non-zero exit status 255.

and this is the content of the generated tokens.txt file:

<PAD> 0
<EOS> 1
<BOS> 2
<BLNK> 3
a 4
b 5
c 6
d 7
e 8
f 9
h 10
i 11
j 12
k 13
l 14
m 15
n 16
o 17
p 18
q 19
r 20
s 21
t 22
u 23
v 24
w 25
x 26
y 27
z 28
æ 29
ç 30
ð 31
ø 32
ħ 33
ŋ 34
œ 35
ǀ 36
ǁ 37
ǂ 38
ǃ 39
ɐ 40
ɑ 41
ɒ 42
ɓ 43
ɔ 44
ɕ 45
ɖ 46
ɗ 47
ɘ 48
ə 49
ɚ 50
ɛ 51
ɜ 52
ɞ 53
ɟ 54
ɠ 55
ɡ 56
ɢ 57
ɣ 58
ɤ 59
ɥ 60
ɦ 61
ɧ 62
ɨ 63
ɪ 64
ɫ 65
ɬ 66
ɭ 67
ɮ 68
ɯ 69
ɰ 70
ɱ 71
ɲ 72
ɳ 73
ɴ 74
ɵ 75
ɶ 76
ɸ 77
ɹ 78
ɺ 79
ɻ 80
ɽ 81
ɾ 82
ʀ 83
ʁ 84
ʂ 85
ʃ 86
ʄ 87
ʈ 88
ʉ 89
ʊ 90
ʋ 91
ʌ 92
ʍ 93
ʎ 94
ʏ 95
ʐ 96
ʑ 97
ʒ 98
ʔ 99
ʕ 100
ʘ 101
ʙ 102
ʛ 103
ʜ 104
ʝ 105
ʟ 106
ʡ 107
ʢ 108
ʲ 109
ˈ 110
ˌ 111
ː 112
ˑ 113
˞ 114
β 115
θ 116
χ 117
ᵻ 118
ⱱ 119
! 120
' 121
( 122
) 123
, 124
- 125
. 126
: 127
; 128
? 129
  130
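
For anyone debugging a similar file, here is a small Python sketch that parses "symbol id" lines like the ones above and lists every symbol longer than one character. The "size: 5" in the error happens to equal the length of `<PAD>`, so one guess is that the reader expects single-character symbols. That is only an assumption; nothing in this thread confirms it. The function name `parse_tokens` is made up for the example.

```python
def parse_tokens(lines):
    """Parse 'symbol id' lines; return (symbol -> id dict, multi-character symbols)."""
    table = {}
    multi_char = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        # Split on the LAST space so that a symbol that is itself a space survives.
        symbol, _, idx = line.rpartition(" ")
        table[symbol] = int(idx)
        if len(symbol) != 1:
            multi_char.append(symbol)
    return table, multi_char


# A few lines copied from the listing above (the last one is the space token).
sample = ["<PAD> 0", "<EOS> 1", "<BOS> 2", "<BLNK> 3", "a 4", "ˈ 110", "  130"]
table, multi_char = parse_tokens(sample)
```

Running this on the full file would show whether the four special tokens are the only multi-character entries.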

@csukuangfj
Author

Could you share your config.json?

The English VITS models from coqui use phonemes. All other non-English models from coqui use Characters.

@csukuangfj
Author

From your config.json:

    "characters_class": "TTS.tts.utils.text.characters.IPAPhonemes",

Unfortunately, we don't support models using IPAPhonemes; only Graphemes and VitsCharacters from coqui-ai/tts are supported.

You can find all supported models at
https://github.com/k2-fsa/sherpa-onnx/releases/tag/tts-models

You can find the script for converting the model by unzipping the downloaded file.
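
Before converting, the check described above can be automated with a short Python sketch. The two supported class paths are the ones named in this thread; the helper names `find_key` and `is_supported` are made up, and the sample JSON assumes that `characters_class` sits under a `characters` object, which may not match every config.json (the search is recursive for that reason).

```python
import json

# Class paths named in this thread as supported by the conversion.
SUPPORTED = {
    "TTS.tts.utils.text.characters.Graphemes",
    "TTS.tts.models.vits.VitsCharacters",
}


def find_key(node, key):
    """Return the first value stored under `key` anywhere in nested JSON, else None."""
    if isinstance(node, dict):
        if key in node:
            return node[key]
        children = list(node.values())
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_key(child, key)
        if found is not None:
            return found
    return None


def is_supported(config_text):
    """True if the config's characters_class is one of the supported classes."""
    return find_key(json.loads(config_text), "characters_class") in SUPPORTED


# Minimal, hypothetical config fragments for illustration.
ipa = '{"characters": {"characters_class": "TTS.tts.utils.text.characters.IPAPhonemes"}}'
vits = '{"characters": {"characters_class": "TTS.tts.models.vits.VitsCharacters"}}'
```

Here `is_supported(ipa)` is false and `is_supported(vits)` is true, so an IPAPhonemes model can be caught before any export is attempted.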

@anita-smith1

@csukuangfj how can I fine-tune my model to support this ? I shared the colab notebook I used in my previous message. Can you take a look ? Is it possible to change the configuration and re-fine tune my model? In case that’s not possible and I decide to train/fine tune using piper , do you have a similar colab notebook for converting piper model to onnx ?

@csukuangfj
Author

Please download a model and unzip it, you will find the converting script.

@anita-smith1

@csukuangfj I have fine-tuned a model with characters_class="TTS.tts.models.vits.VitsCharacters" and I'm able to synthesize speech now using your colab notebook. It is working :) Thanks a lot. Now I want to try it on Android and iOS, but I can see that the Android app uses the old code below. Will it ignore the lexicon file?

fun getOfflineTtsConfig(
    modelDir: String,
    modelName: String,
    lexicon: String,
    dataDir: String,
    ruleFsts: String
): OfflineTtsConfig? {
    return OfflineTtsConfig(
        model = OfflineTtsModelConfig(
            vits = OfflineTtsVitsModelConfig(
                model = "$modelDir/$modelName",
                lexicon = "$modelDir/$lexicon",
                tokens = "$modelDir/tokens.txt",
                dataDir = "$dataDir"
            ),
            numThreads = 2,
            debug = true,
            provider = "cpu",
        ),
        ruleFsts = ruleFsts,
    )
}

@csukuangfj
Author

please see where and how this function is called.

@csukuangfj
Author

Please see
https://github.com/k2-fsa/sherpa-onnx/blob/master/android/SherpaOnnxTts/app/src/main/java/com/k2fsa/sherpa/onnx/MainActivity.kt#L172

https://github.com/k2-fsa/sherpa-onnx/blob/0f053d80408b70efde3c8a37f5eeed1c5fd7f837/android/SherpaOnnxTts/app/src/main/java/com/k2fsa/sherpa/onnx/MainActivity.kt#L167-L183

        // Example 1:
        // modelDir = "vits-vctk"
        // modelName = "vits-vctk.onnx"
        // lexicon = "lexicon.txt"

        // Example 2:
        // https://github.com/k2-fsa/sherpa-onnx/releases/tag/tts-models
        // https://github.com/k2-fsa/sherpa-onnx/releases/download/tts-models/vits-piper-en_US-amy-low.tar.bz2
        // modelDir = "vits-piper-en_US-amy-low"
        // modelName = "en_US-amy-low.onnx"
        // dataDir = "vits-piper-en_US-amy-low/espeak-ng-data"

        // Example 3:
        // modelDir = "vits-zh-aishell3"
        // modelName = "vits-aishell3.onnx"
        // ruleFsts = "vits-zh-aishell3/rule.fst"
        // lexicon = "lexicon.txt"

In your case, please use Example 2.

@anita-smith1

@anita-smith1

anita-smith1 commented Dec 10, 2023

@csukuangfj Thanks a lot for your patience. I'm learning a lot as a beginner. I have run the android app with version 1.9.3 .so files and it worked but I had to make some changes to the initAudioTrack() function. It crashed with an invalid audio buffer size :

java.lang.RuntimeException: Unable to start activity ComponentInfo{com.k2fsa.sherpa.onnx/com.k2fsa.sherpa.onnx.MainActivity}: java.lang.IllegalArgumentException: Invalid audio buffer size.
    at android.app.ActivityThread.performLaunchActivity(ActivityThread.java:4184)
    at android.app.ActivityThread.handleLaunchActivity(ActivityThread.java:4340)
    at android.app.servertransaction.LaunchActivityItem.execute(LaunchActivityItem.java:101)
    at android.app.servertransaction.TransactionExecutor.executeCallbacks(TransactionExecutor.java:135)
    at android.app.servertransaction.TransactionExecutor.execute(TransactionExecutor.java:95)
    at android.app.ActivityThread$H.handleMessage(ActivityThread.java:2584)
    at android.os.Handler.dispatchMessage(Handler.java:106)
    at android.os.Looper.loopOnce(Looper.java:226)
    at android.os.Looper.loop(Looper.java:313)
    at android.app.ActivityThread.main(ActivityThread.java:8810)
    at java.lang.reflect.Method.invoke(Native Method)
    at com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run(RuntimeInit.java:604)
    at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:1067)
Caused by: java.lang.IllegalArgumentException: Invalid audio buffer size.
    at android.media.AudioTrack.audioBuffSizeCheck(AudioTrack.java:1955)
    at android.media.AudioTrack.<init>(AudioTrack.java:810)
    at android.media.AudioTrack.<init>(AudioTrack.java:752)
    at com.k2fsa.sherpa.onnx.MainActivity.initAudioTrack(MainActivity.kt:78)
    at com.k2fsa.sherpa.onnx.MainActivity.onCreate(MainActivity.kt:40)
    at android.app.Activity.performCreate(Activity.java:8657)
    at android.app.Activity.performCreate(Activity.java:8636)
    at android.app.Instrumentation.callActivityOnCreate(Instrumentation.java:1417)
    at android.app.ActivityThread.performLaunchActivity(ActivityThread.java:4165)
    at android.app.ActivityThread.handleLaunchActivity(ActivityThread.java:4340)
    at android.app.servertransaction.LaunchActivityItem.execute(LaunchActivityItem.java:101)
    at android.app.servertransaction.TransactionExecutor.executeCallbacks(TransactionExecutor.java:135)
    at android.app.servertransaction.TransactionExecutor.execute(TransactionExecutor.java:95)
    at android.app.ActivityThread$H.handleMessage(ActivityThread.java:2584)
    at android.os.Handler.dispatchMessage(Handler.java:106)
    at android.os.Looper.loopOnce(Looper.java:226)
    at android.os.Looper.loop(Looper.java:313)
    at android.app.ActivityThread.main(ActivityThread.java:8810)
    at java.lang.reflect.Method.invoke(Native Method)
    at com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run(RuntimeInit.java:604)
    at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:1067)

I had to change the original to the version below, which worked, but I'm not sure if it has any implications:

private fun initAudioTrack() {
        val sampleRate = tts.sampleRate()
        val minBufferSize = AudioTrack.getMinBufferSize(
            sampleRate,
            AudioFormat.CHANNEL_OUT_MONO,
            AudioFormat.ENCODING_PCM_FLOAT
        )

        // Check if getMinBufferSize returned a valid size
        if (minBufferSize == AudioTrack.ERROR || minBufferSize == AudioTrack.ERROR_BAD_VALUE) {
            Log.e(TAG, "Invalid minimum buffer size: $minBufferSize")
            return
        }

        // Ensure buffer size is at least 0.1 seconds of audio or the minimum buffer size, whichever is larger
        val bufLength = max((sampleRate * 0.1).toInt(), minBufferSize)
        Log.i(TAG, "sampleRate: $sampleRate, bufLength: $bufLength")

        val attr = AudioAttributes.Builder()
            .setContentType(AudioAttributes.CONTENT_TYPE_SPEECH)
            .setUsage(AudioAttributes.USAGE_MEDIA)
            .build()

        val format = AudioFormat.Builder()
            .setEncoding(AudioFormat.ENCODING_PCM_FLOAT)
            .setChannelMask(AudioFormat.CHANNEL_OUT_MONO)
            .setSampleRate(sampleRate)
            .build()

        try {
            track = AudioTrack(attr, format, bufLength, AudioTrack.MODE_STREAM, AudioManager.AUDIO_SESSION_ID_GENERATE)

            // Check if AudioTrack is initialized properly
            if (track.state != AudioTrack.STATE_INITIALIZED) {
                Log.e(TAG, "AudioTrack initialization failed")
                return
            }

            track.play()
        } catch (e: IllegalArgumentException) {
            Log.e(TAG, "AudioTrack initialization failed: ${e.message}")
        }
    }
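
One possible explanation for the crash, written out as arithmetic in Python for brevity. It rests on two assumptions that are not confirmed in this thread: that the AudioTrack buffer size is given in bytes, and that the platform rejects a size that is not a whole number of frames (channels × bytes per sample). At 22050 Hz, 0.1 s is 2205 samples, and 2205 is not a multiple of the 4-byte float frame. The helper name and the example values are made up.

```python
def frame_aligned_buffer_size(sample_rate, seconds, min_buffer_size,
                              channels=1, bytes_per_sample=4):
    """Buffer size in bytes: at least `seconds` of audio, at least
    `min_buffer_size`, rounded up to a whole number of frames."""
    frame_size = channels * bytes_per_sample
    wanted = int(sample_rate * seconds) * frame_size  # requested duration in bytes
    size = max(wanted, min_buffer_size)
    # Round up to the next multiple of the frame size.
    return -(-size // frame_size) * frame_size


# 0.1 s at 22050 Hz counted in samples, as a sample count used directly as a size.
naive = int(22050 * 0.1)
remainder = naive % 4  # non-zero means "not a whole number of float frames"

aligned = frame_aligned_buffer_size(22050, 0.1, min_buffer_size=0)
```

Under these assumptions, taking the maximum with the value returned by getMinBufferSize would avoid the crash whenever that value is larger and already frame-aligned, which would explain why the modified function works.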

@csukuangfj
Author

Thanks! Would you mind making a PR to fix it?

@anita-smith1

Thanks! Would you mind making a PR to fix it?

The working code is from ChatGPT. I don't know why it works. I asked it why the app crashed, and it explained the cause and gave a solution. I think you need to first check and confirm that it does not cause any other issue before making a PR. For example, in your recent video on Twitter (X), synthesis is very fast, but mine is a bit slow, so I am not sure if that is due to the code. Thanks

@csukuangfj
Author

I just fixed it in the master branch.

I am using a small model in the video. How large is your model?

@anita-smith1

anita-smith1 commented Dec 10, 2023

Okay that's great. Hope you will soon fix the single word pronunciation issue too. My model size is 145MB

@csukuangfj
Author

is there any chance you can bring back support for models using IPAPhonemes?

@anita-smith1

Sorry, it is not in the plan. The major difficulty is that the phonemizer used by IPAPhonemes is hard to port to C++.

As you know, you are training your model in Python, but if you want to deploy it, every part must be converted to C++, including the phonemizer.


All the VITS models from coqui-ai/tts are listed below.

# Graphemes
# wget https://coqui.gateway.scarf.sh/v0.8.0_models/tts_models--bg--cv--vits.zip
# wget https://coqui.gateway.scarf.sh/v0.8.0_models/tts_models--cs--cv--vits.zip
# wget https://coqui.gateway.scarf.sh/v0.8.0_models/tts_models--da--cv--vits.zip
# wget https://coqui.gateway.scarf.sh/v0.8.0_models/tts_models--et--cv--vits.zip
# wget https://coqui.gateway.scarf.sh/v0.8.0_models/tts_models--ga--cv--vits.zip
# wget https://coqui.gateway.scarf.sh/v0.8.0_models/tts_models--es--css10--vits.zip
# wget https://coqui.gateway.scarf.sh/v0.8.0_models/tts_models--fr--css10--vits.zip
# wget https://coqui.gateway.scarf.sh/v0.8.0_models/tts_models--nl--css10--vits.zip
# wget https://coqui.gateway.scarf.sh/v0.8.0_models/tts_models--de--css10--vits.zip
# wget https://coqui.gateway.scarf.sh/v0.8.0_models/tts_models--hu--css10--vits.zip
# wget https://coqui.gateway.scarf.sh/v0.8.0_models/tts_models--fi--css10--vits.zip
# wget https://coqui.gateway.scarf.sh/v0.8.0_models/tts_models--hr--cv--vits.zip
# wget https://coqui.gateway.scarf.sh/v0.8.0_models/tts_models--lt--cv--vits.zip
# wget https://coqui.gateway.scarf.sh/v0.8.0_models/tts_models--lv--cv--vits.zip
# wget https://coqui.gateway.scarf.sh/v0.8.0_models/tts_models--mt--cv--vits.zip
# wget https://coqui.gateway.scarf.sh/v0.8.0_models/tts_models--pl--mai_female--vits.zip
# wget https://coqui.gateway.scarf.sh/v0.8.0_models/tts_models--pt--cv--vits.zip
# wget https://coqui.gateway.scarf.sh/v0.8.0_models/tts_models--ro--cv--vits.zip
# wget https://coqui.gateway.scarf.sh/v0.8.0_models/tts_models--sk--cv--vits.zip
# wget https://coqui.gateway.scarf.sh/v0.8.0_models/tts_models--sl--cv--vits.zip
# wget https://coqui.gateway.scarf.sh/v0.8.0_models/tts_models--sv--cv--vits.zip
# wget https://coqui.gateway.scarf.sh/v0.13.3_models/tts_models--bn--custom--vits_male.zip
# wget https://coqui.gateway.scarf.sh/v0.13.3_models/tts_models--bn--custom--vits_female.zip

# IPAPhonemes
# wget https://coqui.gateway.scarf.sh/v0.7.0_models/tts_models--de--thorsten--vits.zip
# wget https://coqui.gateway.scarf.sh/v0.8.0_models/tts_models--el--cv--vits.zip
# wget https://coqui.gateway.scarf.sh/v0.10.1_models/tts_models--ca--custom--vits.zip

# VitsCharacters
# wget https://coqui.gateway.scarf.sh/v0.6.1_models/tts_models--it--mai_female--vits.zip
# wget https://coqui.gateway.scarf.sh/v0.6.1_models/tts_models--it--mai_male--vits.zip
# wget https://coqui.gateway.scarf.sh/v0.6.2_models/tts_models--ewe--openbible--vits.zip # ewe
# wget https://coqui.gateway.scarf.sh/v0.6.2_models/tts_models--hau--openbible--vits.zip # hausa
# wget https://coqui.gateway.scarf.sh/v0.6.2_models/tts_models--lin--openbible--vits.zip # lingala
# wget https://coqui.gateway.scarf.sh/v0.6.2_models/tts_models--tw_akuapem--openbible--vits.zip # akuapem-twi
# wget https://coqui.gateway.scarf.sh/v0.6.2_models/tts_models--tw_asante--openbible--vits.zip # asante-twi
# wget https://coqui.gateway.scarf.sh/v0.6.2_models/tts_models--yor--openbible--vits.zip # yoruba

You can see that only 3 of them are using IPAPhonemes.

I suggest that you switch to

"characters_class": "TTS.tts.utils.text.characters.Graphemes",

or

"characters_class": "TTS.tts.models.vits.VitsCharacters",

@csukuangfj
Author

@anita-smith1

I have noticed that my fine-tuned model using IPAPhonemes has much better quality for non-English words (like names of people) than the version using VitsCharacters.

You can also use espeak-ng in coqui-ai/tts, though I find that only English VITS models from coqui-ai/tts are using espeak-ng.

@aaronnewsome

@aaronnewsome

I just wrote a detailed, step-by-step, guide about how to convert a piper vits pre-trained model to sherpa-onnx for you. You can find it at https://k2-fsa.github.io/sherpa/onnx/tts/piper.html

Thank you @csukuangfj , I honestly don't think I stumbled across all of these instructions while I was trying to do the conversion for the hours I was trying. It was much easier to do with the instructions you created.

I was able to use the sherpa-onnx-offline-tts example to create a wav with my custom voice trained from scratch. However, the quality was not very good at all. Lots of words had strange pronunciations. The words were pronounced much more accurately by piper.

Also, the JSON file that piper preprocess created for me needed some changes for your script to run. The language key and espeak key didn't look the same as the en_US-amy-medium.onnx.json file I compared it to. In en_US-amy-medium.onnx.json there is:

"espeak": {
    "voice": "en-us"
  }

and

"language": {
    "code": "en_US",
    "family": "en",
    "region": "US",
    "name_native": "English",
    "name_english": "English",
    "country_english": "United States"
  },

The json for my custom voice, trained from scratch only had this for language:

"language": {
        "code": "en"
    },

and also just "en" for espeak voice. This caused your example python script to error, so I adjusted the JSON manually. The JSON file for my onnx was created by piper preprocess, so maybe I used it wrong, which would explain why those fields are wrong/missing. I'll look into it some more.
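
A quick way to spot this kind of mismatch before running a conversion script is to compare the JSON against the key paths seen in en_US-amy-medium.onnx.json. The Python sketch below does that. Which keys the conversion script actually reads is an assumption here, and the helper name `missing_paths` and the trimmed sample JSON are made up for the example.

```python
import json

# Key paths seen in en_US-amy-medium.onnx.json as quoted above.
# Assumption: a conversion script may read some or all of them.
EXPECTED_PATHS = [
    ("espeak", "voice"),
    ("language", "code"),
    ("language", "name_english"),
]


def missing_paths(config, paths=EXPECTED_PATHS):
    """Return the dotted key paths that are absent from `config`."""
    missing = []
    for path in paths:
        node = config
        for key in path:
            if not isinstance(node, dict) or key not in node:
                missing.append(".".join(path))
                break
            node = node[key]
    return missing


# Trimmed versions of the two JSON files described above.
custom = json.loads('{"language": {"code": "en"}, "espeak": {"voice": "en"}}')
reference = json.loads(
    '{"language": {"code": "en_US", "name_english": "English"},'
    ' "espeak": {"voice": "en-us"}}'
)
```

For the custom JSON this reports `language.name_english` as missing, while the reference JSON reports nothing. It does not check values, so a voice of "en" instead of "en-us" still has to be compared by hand.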

@anita-smith1

@csukuangfj Please check whether my configuration for fine-tuning a VITS model using coqui is okay. I am not getting intelligible sound after fine-tuning with VitsCharacters, even for English words and phrases. It seems I am doing something wrong:

code = """import os

from trainer import Trainer, TrainerArgs

from TTS.tts.configs.shared_configs import BaseDatasetConfig, CharactersConfig
from TTS.tts.configs.vits_config import VitsConfig
from TTS.tts.datasets import load_tts_samples
from TTS.tts.models.vits import Vits, VitsAudioConfig
from TTS.tts.utils.text.tokenizer import TTSTokenizer
from TTS.utils.audio import AudioProcessor
#output_path = os.path.dirname(os.path.abspath(__file__))
##########################################
#Change this to your dataset directory
##########################################
output_path = "/content/drive/MyDrive/"""
code = code + dataset_name + "/" + output_directory + "/" + "\""

code=code + """
dataset_config = BaseDatasetConfig(
##########################################
#Change this to your dataset directory
##########################################
    formatter="ljspeech", meta_file_train="metadata.csv", path=os.path.join(output_path, "/content/drive/MyDrive/"""
code = code + dataset_name
code=code + """")

)
audio_config = VitsAudioConfig(
    sample_rate=22050, win_length=1024, hop_length=256, num_mels=80, mel_fmin=0, mel_fmax=None
)
#i have added character config for sherpa onnx support
character_config = CharactersConfig (
     characters_class="TTS.tts.models.vits.VitsCharacters",
     pad="_",
     eos="",
     bos="",
     characters="ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz",
     punctuations=';:,.!?¡¿—…"«»“” ',
     phonemes="ɑɐɒæɓʙβɔɕçɗɖðʤəɘɚɛɜɝɞɟʄɡɠɢʛɦɧħɥʜɨɪʝɭɬɫɮʟɱɯɰŋɳɲɴøɵɸθœɶʘɹɺɾɻʀʁɽʂʃʈʧʉʊʋⱱʌɣɤʍχʎʏʑʐʒʔʡʕʢǀǁǂǃˈˌːˑʼʴʰʱʲʷˠˤ˞↓↑→↗↘'̩'ᵻ"
)
config = VitsConfig(
    audio=audio_config,
    characters=character_config,
    run_name="vits_ljspeech_ly",
    batch_size=16,
    eval_batch_size=16,
    batch_group_size=5,
#    num_loader_workers=8,
    num_loader_workers=4,
    num_eval_loader_workers=4,
    run_eval=True,
    test_delay_epochs=-1,
    epochs=100000,
    save_step=1000,
    save_checkpoints=True,
    save_n_checkpoints=4,
    save_best_after=2000,
    text_cleaner="english_cleaners",
    use_phonemes=True,
    phoneme_language="en-us",
    phoneme_cache_path=os.path.join(output_path, "phoneme_cache"),
    compute_input_seq_cache=True,
    print_step=25,
    print_eval=True,
    mixed_precision=True,
    output_path=output_path,
    datasets=[dataset_config],
    cudnn_benchmark=False,
)
# INITIALIZE THE AUDIO PROCESSOR
# Audio processor is used for feature extraction and audio I/O.
# It mainly serves to the dataloader and the training loggers.
ap = AudioProcessor.init_from_config(config)

# INITIALIZE THE TOKENIZER
# Tokenizer is used to convert text to sequences of token IDs.
# config is updated with the default characters if not defined in the config.
tokenizer, config = TTSTokenizer.init_from_config(config)

# LOAD DATA SAMPLES
# Each sample is a list of ```[text, audio_file_path, speaker_name]```
# You can define your custom sample loader returning the list of samples.
# Or define your custom formatter and pass it to the `load_tts_samples`.
# Check `TTS.tts.datasets.load_tts_samples` for more details.
train_samples, eval_samples = load_tts_samples(
    dataset_config,
    eval_split=True,
    eval_split_max_size=config.eval_split_max_size,
    eval_split_size=config.eval_split_size,
)

# init model
model = Vits(config, ap, tokenizer, speaker_manager=None)

# init the trainer and 🚀
trainer = Trainer(
    TrainerArgs(),
    config,
    output_path,
    model=model,
    train_samples=train_samples,
    eval_samples=eval_samples,
)
trainer.fit()
"""

I read this and it seems he fixed the issue by setting "use_phonemes=False", but I don't think that applies here.

@csukuangfj
Author

Sorry that I am not familiar with coqui-ai/tts. I suggest that you ask in the repo of coqui-ai/tts.

@anita-smith1

Okay, no problem. I am switching from coqui to piper since I'm facing some issues.

@anita-smith1

I am currently training with "use_phonemes=False" (coqui tts) and it seems to be working so far. If it still doesn't work, I will switch completely to piper. Piper has very good documentation.

@anita-smith1

So I managed to get both coqui tts and piper working, but I have decided to stick with piper because its model is smaller than the coqui tts one, which reduces latency. Piper seems to have better pronunciation too.

@csukuangfj I am not sure if you need to update the script in the model zip file.

pip install piper-phonemize onnx onnxruntime==1.16.0 returns:

ERROR: Could not find a version that satisfies the requirement piper-phonemize (from versions: none)
ERROR: No matching distribution found for piper-phonemize

Changing the version to 1.16.1 doesn't work either.

So I changed it to pip install onnx onnxruntime.
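
Before running the export script, it can help to check which of these packages are actually importable. The snippet below is only a sketch: the helper name `missing_modules` is made up for illustration, and it assumes the import names are `onnx`, `onnxruntime`, and `piper_phonemize`.

```python
import importlib.util


def missing_modules(names):
    """Return the module names from `names` that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Assumed import names for the packages mentioned above.
    wanted = ["onnx", "onnxruntime", "piper_phonemize"]
    print("Missing:", missing_modules(wanted))
```

If `piper_phonemize` shows up as missing, that matches the pip error above.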

Also, I had to manually change the json file to include:

"language": {
    "code": "en_US",
    "family": "en",
    "region": "US",
    "name_native": "English",
    "name_english": "English",
    "country_english": "United States"
  }

because the original export from piper only had

"language": {
        "code": "en-us"
    }

Without this change, the Python script for exporting to sherpa-onnx fails at:

"language": config["language"]["name_english"],

since there is no "name_english".
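
Instead of editing the JSON by hand, the missing fields could be filled in with a few lines of Python before running the export script. This is a rough sketch and not part of piper or sherpa-onnx: the function name `ensure_language_fields` is hypothetical, and the defaults are simply the en_US values quoted above. It keeps whatever piper already wrote (for example "code": "en-us") and only adds the keys that are missing.

```python
import json

# Values copied from the en_US example above; adjust them for other voices.
EN_US_LANGUAGE = {
    "code": "en_US",
    "family": "en",
    "region": "US",
    "name_native": "English",
    "name_english": "English",
    "country_english": "United States",
}


def ensure_language_fields(config, defaults=EN_US_LANGUAGE):
    """Return a copy of `config` whose "language" block contains every default key.

    Existing values in the "language" block are preserved; only missing keys are added.
    """
    patched = dict(config)
    language = dict(defaults)
    language.update(config.get("language", {}))
    patched["language"] = language
    return patched


if __name__ == "__main__":
    # In-memory stand-in for the JSON file exported by piper.
    exported = {"language": {"code": "en-us"}}
    print(json.dumps(ensure_language_fields(exported), indent=2))
```

With "name_english" present, the lookup `config["language"]["name_english"]` in the export script no longer raises a KeyError.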

@csukuangfj
Author

Aah, ok, I meant standard TTS-engine API bindings. I may try to do it at some point in the future, to use this TTS as a standard Android TTS engine, for example with screen readers.

On 10/29/23, Fangjun Kuang wrote:

> Are there using standard android text-to-speech api or not?

@beqabeqa473 No, it uses sherpa-onnx with vits pre-trained models for tts. Everything is open-sourced.

You can find the source code for the android project at https://github.com/k2-fsa/sherpa-onnx/tree/master/android/SherpaOnnxTts

The underlying C++ code can be found at https://github.com/k2-fsa/sherpa-onnx

The JNI C++ binding code can be found at https://github.com/k2-fsa/sherpa-onnx/tree/master/sherpa-onnx/jni

You can find kotlin API examples at https://github.com/k2-fsa/sherpa-onnx/tree/master/kotlin-api-examples

@beqabeqa473

I just added support for replacing the system TTS engine in k2-fsa/sherpa-onnx#508

You can find a YouTube video at
https://www.youtube.com/watch?v=33QYuVzDORA

@nanaghartey

@csukuangfj when will Sherpa support coqui XTTS-v2 models?

@csukuangfj
Author

XTTS-v2

The model is larger than 1 GB, so I think it requires a GPU.

We won't support it in k2-fsa/sherpa-onnx, which mainly targets embedded environments.

But we may support it in k2-fsa/sherpa, though we cannot say when it will be supported.

@nanaghartey

@csukuangfj what about StyleTTS2 models, which have ElevenLabs-like, human-sounding quality and PyTorch support? https://github.com/yl4579/StyleTTS2

@csukuangfj
Author

https://github.com/yl4579/StyleTTS2

Does it have onnx export support?

@nanaghartey

https://github.com/yl4579/StyleTTS2

Does it have onnx export support?

Not at the moment

@nanaghartey

@csukuangfj currently, which model sounds closest to human quality on sherpa onnx? Coqui or piper tts models? And are these two the only ones sherpa onnx supports?

@csukuangfj
Author

Please visit
https://huggingface.co/spaces/k2-fsa/text-to-speech
to try all supported tts models.

There are more than 100 tts models, and the best way to find out which model sounds best to you is to try them yourself.
You don't need to install anything to try it.


@csukuangfj
Author

And are these two the only ones sherpa onnx supports?

No.

sherpa-onnx currently supports VITS tts models, and it is not limited to coqui or piper.

@nanaghartey

Please visit

https://huggingface.co/spaces/k2-fsa/text-to-speech

to try all supported tts models.

There are more than 100 tts models, and the best way to find out which model sounds best to you is to try them yourself.

You don't need to install anything to try it.


I tried a couple of them in the past, actually. I was hoping you'd have a "top 3" model list. What I noticed with sherpa onnx is that there's a trade-off between quality and on-device processing compared to the cloud solutions out there.
For example, standard coqui tts models sound okay, but once converted to sherpa onnx the quality and intonation go down. Are there any tips or tricks to get good quality on sherpa onnx?

@csukuangfj
Author

For example, standard coqui tts models sound okay, but once converted to sherpa onnx the quality and intonation go down

Could you describe which model you are using? @nanaghartey

@nanaghartey

For example, standard coqui tts models sound okay, but once converted to sherpa onnx the quality and intonation go down

Could you describe which model you are using? @nanaghartey

I'm using my own fine-tuned coqui and piper tts vits models. Both sound good before converting to sherpa onnx, but this is also the case for the various other English models I tried out.

@nanaghartey

nanaghartey commented Jul 4, 2024

@csukuangfj Please take a look at this issue on StyleTTS2 - #117
Since someone has successfully converted it to onnx, can you also convert it so that sherpa onnx supports it? If this can be achieved, sherpa onnx will have human-level, high-quality, realistic TTS.

@csukuangfj
Author

We already support Piper. Is there anything special about StyleTTS2? @nanaghartey

@nanaghartey

@csukuangfj Piper can't be compared to StyleTTS2. StyleTTS2 is currently the only open-source solution that comes close to proprietary solutions like ElevenLabs, OpenAI's TTS, and the recent Gemini voices.
You can compare your k2-fsa quality with an onnx implementation of StyleTTS2 here to see the difference.

@csukuangfj
Author

What is the model size of StyleTTS2? Does it require GPU?

Could you post the link to the inference script with onnx for StyleTTS2?

@nanaghartey

@csukuangfj The author who converted it to onnx has not shared the script yet. I was thinking you'd take a look at the repo and see if it's something you can work on. As you can see from the thread, others are trying to export to onnx.

@csukuangfj
Author

I was thinking you'd take a look at the repo and see if it's something you can work on.

Sorry, I don't have extra time to do that. If there are existing ONNX inference scripts, I can take a look.

@nanaghartey

@csukuangfj No problem. I will share the scripts once they're available. Thanks

@DavidDohmen

Is this discussion still related to Rhasspy/Piper or has it drifted to another (impressive) project?
I mainly ask because I'd love to see TTS run locally and natively on Android for the Home Assistant use case.
Many HA users have Android tablets that are always on and constantly have the Home Assistant Companion app in the foreground. Running as many models/parts of the stack as possible on the device would have many benefits:
faster responses, easier configuration, and more processing power than Raspberry Pis, for example.

What's the status here, and are there any ambitions to work in this direction? I don't have time, but I would help fund this.

@csukuangfj
Author

@DavidDohmen

sherpa-onnx provides runtime support for models from various frameworks, including those from piper.

sherpa-onnx does not provide support to train your models, but piper does that.

Unlike piper, sherpa-onnx provides support for various platforms and programming languages.

For instance, you can run piper models with sherpa-onnx on iOS, Android, Linux, Windows, macOS, etc.

Also, sherpa-onnx supports not only text-to-speech, but it also supports speech-to-text, speaker diarization, etc.
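
As a rough illustration of running a converted piper model, the sketch below builds a command line for the `sherpa-onnx-offline-tts` binary without executing it. The flag names follow the piper examples in the sherpa-onnx docs as best recalled here, and the model directory and file names are placeholders, so treat all of them as assumptions and check them against the docs. The helper name `build_tts_command` is hypothetical.

```python
import shlex
from pathlib import Path


def build_tts_command(model_dir, model_name, text, output="generated.wav"):
    """Build an argv list for sherpa-onnx-offline-tts; nothing is executed here."""
    model_dir = Path(model_dir)
    return [
        "sherpa-onnx-offline-tts",
        # Flag names are assumed from the sherpa-onnx docs; verify before use.
        f"--vits-model={(model_dir / model_name).as_posix()}",
        f"--vits-tokens={(model_dir / 'tokens.txt').as_posix()}",
        f"--vits-data-dir={(model_dir / 'espeak-ng-data').as_posix()}",
        f"--output-filename={output}",
        text,
    ]


if __name__ == "__main__":
    # Placeholder directory and model file names.
    cmd = build_tts_command("my-piper-model", "model.onnx", "Hello from piper.")
    print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run` once the binary and the model files are in place.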

@DavidDohmen

Thanks, yes - I understood these differences. Especially the portability to other OSes like Android is super valuable, and my question is basically whether there are considerations to bring the sherpa-onnx functionality into the HA Companion Android app.
