- Align datasets
- Implement modules
- Training
- End-To-End Synthesizer
- Add CE loss for RVQ
- Subjective Evaluation
- Objective Evaluation
- Demo Page
LibriTTS test-clean
- ASR WER: Whisper large-v2
- Speaker Embedding: WavLMForXVector, https://huggingface.co/docs/transformers/model_doc/wavlm#transformers.WavLMForXVector
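
The two metrics listed above could be scored roughly as follows. This is a minimal sketch, not the project's evaluation script: it assumes the transcripts have already been produced by the ASR model and the speaker embeddings have already been extracted, so it only covers the scoring step. The function names `word_error_rate` and `cosine_similarity` are hypothetical, and text normalization is limited to lowercasing and whitespace splitting, which is an assumption (a real setup would likely also strip punctuation).

```python
import math
from typing import Sequence


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    # Assumption: normalization is only lowercasing + whitespace tokenization.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two speaker embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Hypothetical transcript pair and toy 3-dimensional embeddings.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
    print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```

Corpus-level WER is usually computed by summing edits and reference words over all utterances rather than averaging per-utterance rates. The sketch scores a single pair only.
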
Prompt | WER | Speaker Cosine Similarity | Utterance-Level Pitch Mean MAE | Utterance-Level Pitch Std MAE | Utterance-Level Duration Diff |
---|---|---|---|---|---|
Ground Truth | 0.86 | - | - | - | - |
2 Seconds | | | | | |
4 Seconds | | | | | |
6 Seconds | | | | | |
8 Seconds | | | | | |
4 Seconds (PrefixPrompt) | (avg utter duration) | | | | |
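
One possible reading of the three utterance-level columns is sketched below. The definitions are assumptions, since the table does not spell them out: pitch statistics are taken over voiced frames only, unvoiced frames are marked with an F0 of 0, the standard deviation is the population form, and the duration difference is the mean absolute difference in seconds. The names `pitch_stats`, `mean_absolute_error`, and `prosody_metrics` are hypothetical, and F0 extraction itself is out of scope here.

```python
from statistics import fmean, pstdev
from typing import Dict, Sequence, Tuple


def pitch_stats(f0_hz: Sequence[float]) -> Tuple[float, float]:
    """Mean and population std of F0 over the voiced frames of one utterance."""
    # Assumption: the F0 tracker reports 0 for unvoiced frames.
    voiced = [f for f in f0_hz if f > 0.0]
    if not voiced:
        raise ValueError("utterance has no voiced frames")
    return fmean(voiced), pstdev(voiced)


def mean_absolute_error(ref: Sequence[float], syn: Sequence[float]) -> float:
    """Average of |ref - syn| over paired utterance-level values."""
    if len(ref) != len(syn) or len(ref) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return fmean([abs(r - s) for r, s in zip(ref, syn)])


def prosody_metrics(
    ref_f0: Sequence[Sequence[float]],
    syn_f0: Sequence[Sequence[float]],
    ref_dur_s: Sequence[float],
    syn_dur_s: Sequence[float],
) -> Dict[str, float]:
    """Compare reference and synthesized utterances, paired by index."""
    ref_stats = [pitch_stats(f0) for f0 in ref_f0]
    syn_stats = [pitch_stats(f0) for f0 in syn_f0]
    return {
        "pitch_mean_mae": mean_absolute_error(
            [mean for mean, _ in ref_stats], [mean for mean, _ in syn_stats]
        ),
        "pitch_std_mae": mean_absolute_error(
            [std for _, std in ref_stats], [std for _, std in syn_stats]
        ),
        # Assumption: "Duration Diff" is an absolute difference in seconds.
        "duration_diff": mean_absolute_error(ref_dur_s, syn_dur_s),
    }


if __name__ == "__main__":
    # Toy F0 contours (Hz) and durations (s) for two utterance pairs.
    metrics = prosody_metrics(
        ref_f0=[[0.0, 100.0, 110.0, 120.0], [200.0, 0.0, 220.0]],
        syn_f0=[[0.0, 105.0, 115.0, 125.0], [190.0, 210.0, 0.0]],
        ref_dur_s=[2.0, 3.0],
        syn_dur_s=[2.5, 2.5],
    )
    print(metrics)
```
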