diff --git a/.gitignore b/.gitignore
index 2ae7377..b6a1e93 100644
--- a/.gitignore
+++ b/.gitignore
@@ -1,4 +1,4 @@
 *__pycache__*
-dataset/
+dataset/*
 logs/
 test.py
\ No newline at end of file
diff --git a/README.md b/README.md
index cc064f6..02484e8 100644
--- a/README.md
+++ b/README.md
@@ -1,2 +1,104 @@
 # JK-VITS
 Bilingual-TTS (Japanese and Korean)
+This Repository can speak Japanese even if you train with Korean dataset, and can speak Korean even if you train with Japanese dataset.
+By transcribing pronunciation from Japanese to Korean and Korean to Japanese, the unstable voice produced when using the existing multilingual ipa cleaners has been improved.
+
+
+
+## Table of Contents
+- [Prerequisites](#prerequisites)
+- [Installation](#installation)
+- [Prepare_Datasets](#Prepare_Datasets)
+- [Usage](#usage)
+- [Inference](#inference)
+- [References](#References)
+
+
+## Pre-requisites
+- A Windows/Linux system with a minimum of `16GB` RAM.
+- A GPU with at least `12GB` of VRAM.
+- Python >= 3.8
+- Anaconda installed.
+- PyTorch installed.
+- CUDA 11.7 installed.
+
+
+
+Pytorch install command:
+```sh
+pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu117
+```
+CUDA 11.7 Install:
+`https://developer.nvidia.com/cuda-11-7-0-download-archive`
+CUDNN 11.x Install:
+`https://developer.nvidia.com/rdp/cudnn-archive`
+
+
+---
+## Installation
+1. **Create an Anaconda environment:**
+
+```sh
+conda create -n jk-vits python=3.8
+```
+
+2. **Activate the environment:**
+
+```sh
+conda activate jk-vits
+```
+
+3. **Clone this repository to your local machine:**
+
+```sh
+git clone https://github.com/kdrkdrkdr/JK-VITS.git
+```
+
+4. **Navigate to the cloned directory:**
+
+```sh
+cd JK-VITS
+```
+
+5. **Install the necessary dependencies:**
+
+```sh
+pip install -r requirements.txt
+pip install -U pyopenjtalk==0.2.0 --no-build-isolation
+```
+---
+
+## Preparing Dataset Example
+
+- Place the audio files as follows.
+.wav files are okay. The sample rate of the audio must be 44100 Hz.
+
+
+- Preprocessing (g2p) for your own datasets. Preprocessed phonemes for your dataset.
+```sh
+python preprocess.py --filelists filelists/train.txt filelists/val.txt
+```
+
+- Set configs.
+If you train with japanese dataset, refer [configs/ja.json](configs/ja.json)
+If you train with korean dataset, refer [configs/ko.json](configs/ko.json)
+---
+
+## Training Exmaple
+```sh
+python train.py -c configs/ft.json -m ft
+```
+
+
+---
+## Inference Exmaple
+See [inference.ipynb](inference.ipynb)
+
+
+
+---
+## References
+For more information, please refer to the following repositories:
+- [jaywalnut310/vits](https://github.com/jaywalnut310/vits.git)
+- [MasayaKawamura/MB-iSTFT-VITS](https://github.com/MasayaKawamura/)
+- [Kyubyong/g2pK](https://github.com/Kyubyong/g2pK)
\ No newline at end of file
diff --git a/configs/ja.json b/configs/ja.json
index 4fca341..aec19c3 100644
--- a/configs/ja.json
+++ b/configs/ja.json
@@ -21,6 +21,8 @@
     "window": "hann_window"
   },
   "data": {
+    "is_japanese_dataset":true,
+    "is_korean_dataset":false,
     "training_files":"filelists/ja_train.txt.cleaned",
     "validation_files":"filelists/ja_val.txt.cleaned",
     "text_cleaners":["jk_cleaners"],
diff --git a/configs/ko.json b/configs/ko.json
index b51d847..85d2dc0 100644
--- a/configs/ko.json
+++ b/configs/ko.json
@@ -21,6 +21,8 @@
     "window": "hann_window"
   },
   "data": {
+    "is_japanese_dataset":false,
+    "is_korean_dataset":true,
     "training_files":"filelists/ko_train.txt.cleaned",
     "validation_files":"filelists/ko_val.txt.cleaned",
     "text_cleaners":["jk_cleaners"],
diff --git a/configs/mari.json b/configs/mari.json
deleted file mode 100644
index d760f66..0000000
--- a/configs/mari.json
+++ /dev/null
@@ -1,64 +0,0 @@
-{
-  "train": {
-    "log_interval": 200,
-    "eval_interval": 1000,
-    "seed": 1234,
-    "epochs": 20000,
-    "learning_rate": 2e-4,
-    "betas": [0.8, 0.99],
-    "eps": 1e-9,
-    "batch_size": 64,
-    "fp16_run": false,
-    "lr_decay": 0.999875,
-    "segment_size": 8192,
-    "init_lr_ratio": 1,
-    "warmup_epochs": 0,
-    "c_mel": 45,
-    "c_kl": 1.0,
-    "fft_sizes": [384, 683, 171],
-    "hop_sizes": [30, 60, 10],
-    "win_lengths": [150, 300, 60],
-    "window": "hann_window"
-  },
-  "data": {
-    "training_files":"filelists/mari_train.txt.cleaned",
-    "validation_files":"filelists/mari_val.txt.cleaned",
-    "text_cleaners":["jk_cleaners"],
-    "max_wav_value": 32768.0,
-    "sampling_rate": 44100,
-    "filter_length": 1024,
-    "hop_length": 256,
-    "win_length": 1024,
-    "n_mel_channels": 80,
-    "mel_fmin": 0.0,
-    "mel_fmax": null,
-    "add_blank": true,
-    "n_speakers": 0,
-    "cleaned_text": true
-  },
-  "model": {
-    "ms_istft_vits": true,
-    "mb_istft_vits": false,
-    "istft_vits": false,
-    "subbands": 4,
-    "gen_istft_n_fft": 16,
-    "gen_istft_hop_size": 4,
-    "inter_channels": 192,
-    "hidden_channels": 192,
-    "filter_channels": 768,
-    "n_heads": 2,
-    "n_layers": 6,
-    "kernel_size": 3,
-    "p_dropout": 0.1,
-    "resblock": "1",
-    "resblock_kernel_sizes": [3,7,11],
-    "resblock_dilation_sizes": [[1,3,5], [1,3,5], [1,3,5]],
-    "upsample_rates": [4,4],
-    "upsample_initial_channel": 512,
-    "upsample_kernel_sizes": [16,16],
-    "n_layers_q": 3,
-    "use_spectral_norm": false,
-    "use_sdp": false
-  }
-}
-  
\ No newline at end of file
diff --git a/filelists/train.txt b/filelists/train.txt
new file mode 100644
index 0000000..e69de29
diff --git a/filelists/val.txt b/filelists/val.txt
new file mode 100644
index 0000000..e69de29
diff --git a/inference.ipynb b/inference.ipynb
index 3d092a5..7d0b3a5 100644
--- a/inference.ipynb
+++ b/inference.ipynb
@@ -2,14 +2,15 @@
  "cells": [
   {
    "cell_type": "code",
-   "execution_count": 6,
+   "execution_count": 16,
    "metadata": {},
    "outputs": [
     {
      "name": "stdout",
      "output_type": "stream",
      "text": [
-      "[KO]안녕하세요.[KO][PREPROCESSED]ㅋㅗㄴㄴㅣㅊㅣㅇㅗㅏ.[PREPROCESSED]\n",
+      "/안녕하세요.\n",
+      "[PREPROCESSED]a↓NnyoNNhaseyo.[PREPROCESSED][JA]こんにちは.[JA]\n",
       "Mutli-stream iSTFT VITS\n"
      ]
     },
@@ -18,7 +19,7 @@
       "text/html": [
        "\n",
        " \n",
        " "
@@ -44,13 +45,14 @@
     "from text.j2k import japanese2korean\n",
     "\n",
     "\n",
-    "# 일본어로 학습한 경우 True, 한국어로 학습한 경우 False로 isJaModel 값을 바꿔주세요.\n",
-    "isJaModel = False # True\n",
-    "model_name = 'ko' # ja\n",
+    "model_name = 'ja'\n",
     "config_file = f\"./configs/{model_name}.json\"\n",
-    "model_file = f\"./logs/{model_name}/G_91000.pth\"\n",
+    "model_file = f\"./logs/{model_name}/G_0.pth\"\n",
     "device = 'cpu' # cuda:0\n",
     "\n",
+    "hps = utils.get_hparams_from_file(config_file)\n",
+    "isJaModel = hps.data.is_japanese_dataset\n",
+    "isKoModel = hps.data.is_korean_dataset\n",
     "\n",
     "text = \"\"\"\n",
     "[KO]안녕하세요.[KO]\n",
@@ -60,7 +62,7 @@
     "text = re.sub('[\\n]', '', text).strip()\n",
     "if isJaModel:\n",
     "    text = re.sub(r'\\[KO\\](.*?)\\[KO\\]', lambda x: korean2katakana(x.group(1)), text)\n",
-    "else:\n",
+    "if isKoModel:\n",
     "    text = re.sub(r'\\[JA\\](.*?)\\[JA\\]', lambda x: japanese2korean(x.group(1)), text)\n",
     "\n",
     "print(text)\n",
@@ -72,9 +74,6 @@
     "    text_norm = torch.LongTensor(text_norm)\n",
     "    return text_norm\n",
     "\n",
-    "\n",
-    "hps = utils.get_hparams_from_file(config_file)\n",
-    "\n",
     "net_g = SynthesizerTrn(\n",
     "    len(symbols),\n",
     "    hps.data.filter_length // 2 + 1,\n",
diff --git a/preprocess.py b/preprocess.py
index 92e62d0..6013916 100644
--- a/preprocess.py
+++ b/preprocess.py
@@ -6,7 +6,7 @@
   parser = argparse.ArgumentParser()
   parser.add_argument("--out_extension", default="cleaned")
   parser.add_argument("--text_index", default=1, type=int)
-  parser.add_argument("--filelists", nargs="+", default=["filelists/mari_train.txt", "filelists/mari_val.txt"])
+  parser.add_argument("--filelists", nargs="+", default=["filelists/train.txt", "filelists/val.txt"])
   parser.add_argument("--text_cleaners", nargs="+", default=["jk_cleaners"])
 
   args = parser.parse_args()
diff --git a/requirements.txt b/requirements.txt
index 06e88f6..327b231 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -1,31 +1,17 @@
-# utils
 cmake
 ffmpeg
-
-# torch
---extra-index-url https://download.pytorch.org/whl/cu117
-torch==1.13.1+cu117
-torchvision==0.14.1+cu117
-torchaudio==0.13.1
-
-# vits
-Cython==0.29.21
-librosa==0.8.0
 matplotlib==3.3.1
-numpy==1.18.5
-scipy==1.5.2
 tensorboard==2.3.0
 Unidecode==1.1.1
 pysoundfile==0.9.0.post1
 monotonic-align
 g2pk2
-eunjeon
 ko_pron==1.3
 jamo==0.4.1
-pyopenjtalk==0.2.0
+# pyopenjtalk==0.2.0
+jaconv
 protobuf==3.19.0
-
-# # Nuwave2
-# prefetch_generator
-# omegaconf==2.0.6
-# pytorch_lightning==1.2.10
\ No newline at end of file
+Cython==0.29.21
+librosa==0.8.0
+numpy==1.18.5
+scipy==1.5.2
\ No newline at end of file
diff --git a/train_latest.py b/train.py
similarity index 100%
rename from train_latest.py
rename to train.py
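
Note on the `inference.ipynb` change: the notebook now reads `is_japanese_dataset` and `is_korean_dataset` from the config and uses two independent `if` branches instead of `if`/`else` to decide which tagged spans get transliterated. The following is a minimal sketch of that routing logic only, not the notebook's code. The function name `route_tagged_text` and the two placeholder converters are hypothetical stand-ins for `korean2katakana` and `japanese2korean`, whose implementations are not part of this diff.

```python
import re
from typing import Callable


def route_tagged_text(
    text: str,
    is_japanese_dataset: bool,
    is_korean_dataset: bool,
    ko_to_ja: Callable[[str], str],
    ja_to_ko: Callable[[str], str],
) -> str:
    """Rewrite [KO]...[KO] or [JA]...[JA] spans based on the two config flags.

    Hypothetical helper: it mirrors the branching shown in the notebook hunk,
    with the real converters passed in as callables.
    """
    # Same normalisation as the notebook: drop newlines, trim the ends.
    text = re.sub(r"\n", "", text).strip()
    if is_japanese_dataset:
        # A model trained on Japanese audio gets Korean spans transliterated.
        text = re.sub(r"\[KO\](.*?)\[KO\]", lambda m: ko_to_ja(m.group(1)), text)
    if is_korean_dataset:
        # A model trained on Korean audio gets Japanese spans transliterated.
        text = re.sub(r"\[JA\](.*?)\[JA\]", lambda m: ja_to_ko(m.group(1)), text)
    return text


# Placeholder converters (assumptions for illustration, not the real g2p code).
def fake_ko_to_ja(span: str) -> str:
    return f"<ko2ja:{span}>"


def fake_ja_to_ko(span: str) -> str:
    return f"<ja2ko:{span}>"


sample = "\n[KO]안녕하세요.[KO]\n[JA]こんにちは.[JA]\n"
print(route_tagged_text(sample, True, False, fake_ko_to_ja, fake_ja_to_ko))
print(route_tagged_text(sample, False, True, fake_ko_to_ja, fake_ja_to_ko))
```

With two separate `if` statements, a config that sets both flags to `false` leaves the text untouched, whereas the previous `else` branch would always have applied the Japanese-to-Korean conversion.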