This is the ordinary FastSpeech 2 architecture with some modifications. I just wanted to make the code base better and more readable, and to finally have an open-source implementation of FastSpeech 2 that doesn't sound bad and is easier to hack and work with.
If you like this, you will love Matcha-TTS.
Changes from the original architecture:
- Instead of using MFA, I obtained alignments from a pretrained Matcha-TTS model.
  - This saved me the pain of setting up and training MFA.
- Used IPA phonemes with blanks in between phones.
- No LR decay
- Duration prediction in the log domain (see the sketch after this list)
- Everyone seems to be using the postnet from Tacotron 2; I've used it as well.
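For illustration, here is a minimal sketch of two of the changes above: interspersing a blank token between phones and computing the duration loss in the log domain. This is not the repository's actual code; the function names, the blank token id, and the tensor shapes are assumptions.

```python
import torch
import torch.nn.functional as F

BLANK_ID = 0  # assumed id reserved for the blank symbol


def intersperse_blanks(phone_ids: list[int], blank_id: int = BLANK_ID) -> list[int]:
    """[p1, p2, p3] -> [blank, p1, blank, p2, blank, p3, blank]."""
    out = [blank_id] * (2 * len(phone_ids) + 1)
    out[1::2] = phone_ids
    return out


def duration_loss(pred_log_durations: torch.Tensor,
                  target_durations: torch.Tensor,
                  mask: torch.Tensor) -> torch.Tensor:
    """MSE between predicted log-durations and log(1 + target frame counts)."""
    target_log = torch.log1p(target_durations.float())
    loss = F.mse_loss(pred_log_durations, target_log, reduction="none")
    return (loss * mask).sum() / mask.sum()
```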
Link to the LJ Speech checkpoint. Running the code locally with the CLI will download the checkpoint automatically as well.
- Create an environment (suggested but optional)
conda create -n betterfs2 python=3.10 -y
conda activate betterfs2
- Install from source
git clone https://github.com/shivammehta25/BetterFastSpeech2.git
cd BetterFastSpeech2
pip install -e .
- Run CLI / gradio app / jupyter notebook
# This will download the required models
betterfs2 --text "<INPUT TEXT>"
or open synthesis.ipynb in a Jupyter notebook
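If you want to synthesise several sentences in one go, a small convenience sketch like the one below works. It is not part of the repository and relies only on the `--text` flag shown above.

```python
import subprocess

# Hypothetical batch script: call the betterfs2 CLI once per sentence.
sentences = [
    "The quick brown fox jumps over the lazy dog.",
    "BetterFastSpeech 2 is easier to hack and work with.",
]

for text in sentences:
    subprocess.run(["betterfs2", "--text", text], check=True)
```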
Let's assume we are training with LJ Speech
- Download the dataset from here, extract it to data/LJSpeech-1.1, and prepare the file lists to point to the extracted data, as in item 5 of the setup of the NVIDIA Tacotron 2 repo.
Your data directory should look like:
data/
└── LJSpeech-1.1
├── durations/ # Here
├── metadata.csv
├── README
├── test.txt
├── train.txt
├── val.txt
└── wavs/
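Before training, you can optionally sanity-check that the layout matches what is expected. This is just a sketch assuming the directory and file names shown above; it is not part of the repository.

```python
from pathlib import Path

root = Path("data/LJSpeech-1.1")
expected = ["durations", "wavs", "metadata.csv", "train.txt", "val.txt", "test.txt"]

# Report anything missing from the expected LJ Speech layout.
missing = [name for name in expected if not (root / name).exists()]
if missing:
    raise FileNotFoundError(f"Missing from {root}: {missing}")
print("Dataset layout looks OK.")
```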
- Clone and enter the BetterFastSpeech2 repository
git clone https://github.com/shivammehta25/BetterFastSpeech2.git
cd BetterFastSpeech2
- Install the package from source
pip install -e .
- Go to
configs/data/ljspeech.yaml
and change
train_filelist_path: data/LJSpeech-1.1/train.txt
valid_filelist_path: data/LJSpeech-1.1/val.txt
to the paths of your train and validation filelists.
- Generate normalisation statistics using the dataset's yaml configuration file
python fs2/utils/preprocess.py -i ljspeech
# Output:
#{'pitch_min': 67.836174, 'pitch_max': 578.637146, 'pitch_mean': 207.001846, 'pitch_std': 52.747742, 'energy_min': 0.084354, 'energy_max': 190.849121, 'energy_mean': 21.330254, 'energy_std': 17.663319, 'mel_mean': -5.554245, 'mel_std': 2.059021}
Update these values in configs/data/ljspeech.yaml under the data_statistics key:
data_statistics: # Computed for ljspeech dataset
pitch_min: 67.836174
pitch_max: 792.962036
pitch_mean: 211.046158
pitch_std: 53.012085
energy_min: 0.023226
energy_max: 241.037918
energy_mean: 21.821531
energy_std: 18.17124
mel_mean: -5.517035
mel_std: 2.064413
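If you prefer not to copy these numbers by hand, the dictionary printed by fs2/utils/preprocess.py can be reformatted as YAML with a tiny throwaway script. This is just a sketch, not part of the repository; the values below are the example output from above.

```python
# Paste the dictionary printed by the preprocessing script here.
stats = {
    "pitch_min": 67.836174, "pitch_max": 578.637146,
    "pitch_mean": 207.001846, "pitch_std": 52.747742,
    "energy_min": 0.084354, "energy_max": 190.849121,
    "energy_mean": 21.330254, "energy_std": 17.663319,
    "mel_mean": -5.554245, "mel_std": 2.059021,
}

# Print a block that can be pasted under the data_statistics key.
print("data_statistics:")
for key, value in stats.items():
    print(f"  {key}: {value}")
```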
- Run the training script
python fs2/train.py experiment=ljspeech
- For multi-GPU training, run
python fs2/train.py experiment=ljspeech trainer.devices=[0,1]
- Synthesise from your custom-trained model
betterfs2 --text "<INPUT TEXT>" --checkpoint_path <PATH TO CHECKPOINT>
If you use our code or otherwise find this work useful, please cite our paper:
@inproceedings{mehta2024matcha,
title={Matcha-{TTS}: A fast {TTS} architecture with conditional flow matching},
author={Mehta, Shivam and Tu, Ruibo and Beskow, Jonas and Sz{\'e}kely, {\'E}va and Henter, Gustav Eje},
booktitle={Proc. ICASSP},
year={2024}
}
Since this code uses Lightning-Hydra-Template, you have all the powers that come with it.
Other source code we would like to acknowledge:
- Matcha-TTS: Base TTS from which we get alignments.
- FastPitch: For transformer implementation
- FastSpeech 2: For variance predictor implementations