VITS for Japanese

VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech

Several recent end-to-end text-to-speech (TTS) models enabling single-stage training and parallel sampling have been proposed, but their sample quality does not match that of two-stage TTS systems. In this repository, I introduce a VITS model for Japanese, built on PyTorch 2.0.0 and customized from the original VITS model.

We also provide pretrained models.

Figures: VITS at training (left) and VITS at inference (right).

Pre-requisites

  1. Python >= 3.6
  2. Clone this repository
  3. Install Python requirements. Please refer to requirements.txt.
  4. Download datasets
    1. Download and extract the Japanese Speech dataset, then choose the basic5000 subset and move it to the jp_dataset folder.
  5. Run preprocessing if you use your own datasets (a filelist sanity check is sketched after this list).
# Preprocessing (g2p) for your own datasets. Preprocessed phonemes for the Japanese dataset are already provided.
python preprocess.py --text_index 1 --filelists filelists/jp_audio_text_train_filelist.txt filelists/jp_audio_text_val_filelist.txt filelists/jp_audio_text_test_filelist.txt
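
Each line of the filelists above pairs an audio path with its transcript, separated by a | character (the layout assumed here follows the original VITS filelist convention; check the provided filelists to confirm). The following is a minimal sanity-check sketch for a custom filelist before running preprocessing:

# check_filelist.py -- sanity-check a VITS-style filelist (assumed "wav_path|text" layout)
import os
import sys

def check_filelist(path):
    problems = []
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f, start=1):
            parts = line.rstrip("\n").split("|")
            # Expect exactly two fields: audio path and transcript
            if len(parts) != 2 or not parts[1].strip():
                problems.append((i, "malformed line"))
            elif not os.path.isfile(parts[0]):
                problems.append((i, "missing audio file: " + parts[0]))
    return problems

if __name__ == "__main__":
    for problem in check_filelist(sys.argv[1]):
        print(problem)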

Training Example

# JP Speech
python train.py -c configs/jp_base.json -m jp_base
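
If you need to adapt training to your hardware, the config is a plain JSON file. The sketch below assumes the original VITS config schema (keys such as train.batch_size and train.fp16_run); verify the key names against configs/jp_base.json before relying on it.

# tune_config.py -- hedged sketch: adjust the training config before launching train.py
import json

with open("configs/jp_base.json", encoding="utf-8") as f:
    cfg = json.load(f)

cfg["train"]["batch_size"] = 16   # reduce for smaller GPUs (assumed key)
cfg["train"]["fp16_run"] = True   # enable mixed precision if supported (assumed key)

with open("configs/jp_base_small.json", "w", encoding="utf-8") as f:
    json.dump(cfg, f, indent=2, ensure_ascii=False)

# Then train with: python train.py -c configs/jp_base_small.json -m jp_base_small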

Inference Example

To get the pretrained model for Japanese:

sh startup.sh

See vits_apply.ipynb, or run streamlit run app.py to try the demo on Streamlit sharing.
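
For a programmatic alternative to the notebook and Streamlit demo, the snippet below is a minimal inference sketch in the style of the original VITS repository. The module names (utils, models.SynthesizerTrn, text.text_to_sequence, commons.intersperse) and the checkpoint path are assumptions; check vits_apply.ipynb for the exact calls this repository uses.

# infer_example.py -- hedged inference sketch following the original VITS API;
# module names and the checkpoint path below are assumptions, not confirmed by this repo.
import torch
import commons
import utils
from models import SynthesizerTrn
from text import text_to_sequence
from text.symbols import symbols

hps = utils.get_hparams_from_file("configs/jp_base.json")
net_g = SynthesizerTrn(
    len(symbols),
    hps.data.filter_length // 2 + 1,
    hps.train.segment_size // hps.data.hop_length,
    **hps.model)
net_g.eval()
utils.load_checkpoint("logs/jp_base/G_latest.pth", net_g, None)  # hypothetical checkpoint path

# Convert text to a phoneme-id sequence, interspersing blanks if the config requires it
text = "こんにちは"
seq = text_to_sequence(text, hps.data.text_cleaners)
if hps.data.add_blank:
    seq = commons.intersperse(seq, 0)
x = torch.LongTensor(seq).unsqueeze(0)
x_lengths = torch.LongTensor([x.size(1)])

with torch.no_grad():
    audio = net_g.infer(x, x_lengths, noise_scale=0.667,
                        noise_scale_w=0.8, length_scale=1.0)[0][0, 0]
# audio is a waveform tensor at hps.data.sampling_rate; write it to disk with e.g. scipy.io.wavfile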