This repository contains an implementation of the HiFi-VC paper. The model structure is based on an analysis of the graph and code methods of the TorchScript checkpoint provided by the authors of the paper. Most of the missing details were recovered. In addition, the repository contains pre-trained versions of the Speaker Encoder: the VAE part of the original solution and ECAPA-TDNN taken from an available implementation.
Currently, this implementation does not support F0 training. However, the authors reported that the results do not differ much with or without F0.
To stabilize training, an Extra-Adam implementation was added, based on this repo.
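The extrapolate-then-update idea behind Extra-Adam can be illustrated on a toy problem. The sketch below is not the optimizer used in this repository: it applies plain extragradient (no Adam moment estimates) to the bilinear game min over x, max over y of x * y, where simultaneous gradient descent-ascent spirals away from the equilibrium at (0, 0) while extragradient contracts toward it. The function names, step size, starting point, and iteration count are all hypothetical choices made for illustration.

```python
import math


def gda_step(x, y, lr):
    """One simultaneous gradient descent-ascent step on f(x, y) = x * y."""
    # df/dx = y (x descends), df/dy = x (y ascends).
    return x - lr * y, y + lr * x


def extragradient_step(x, y, lr):
    """One plain extragradient step on f(x, y) = x * y (illustrative only)."""
    # 1) Extrapolation: compute a look-ahead point from the current point.
    x_half, y_half = x - lr * y, y + lr * x
    # 2) Update: move the current point using gradients taken at the look-ahead point.
    return x - lr * y_half, y + lr * x_half


def run(step_fn, steps=1000, lr=0.1):
    """Iterate step_fn starting from (1, 1) and return the final distance to (0, 0)."""
    x, y = 1.0, 1.0
    for _ in range(steps):
        x, y = step_fn(x, y, lr)
    return math.hypot(x, y)


gda_distance = run(gda_step)                      # grows with every step
extragradient_distance = run(extragradient_step)  # shrinks with every step
```

As far as we understand it, Extra-Adam keeps this two-step structure but uses Adam-style updates in place of the raw gradients; see the referenced implementation for the actual details.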
Install all packages with the following command:
pip install -r requirements.txt
If you want to use the pre-trained VAE, run the following commands:
pip install gdown
gdown 1oFwMeuQtwaBEyOFkyG7c7LfBQiRe3RdW -O "model.pt"
To run the experiment, run the following command:
python3 train.py -cn CONFIG_NAME +dataset.data_path=PATH_TO_WAV48_DIR
Here CONFIG_NAME is the name of a file (without the .yaml extension) from the src/configs folder, and PATH_TO_WAV48_DIR is the path to the VCTK dataset. For example, on Kaggle the path may look like this: /kaggle/input/vctk-corpus/VCTK-Corpus/VCTK-Corpus/wav48
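To see which values are valid for CONFIG_NAME, the config folder can be listed with a few lines of Python. This is a minimal sketch, assuming it is run from the repository root and that the configs are plain .yaml files directly inside src/configs; the helper name list_config_names is hypothetical and is not part of this repository.

```python
from pathlib import Path


def list_config_names(config_dir="src/configs"):
    """Return the sorted names of .yaml files in config_dir, without the extension."""
    directory = Path(config_dir)
    if not directory.is_dir():
        # Nothing to list if the folder is missing (e.g. wrong working directory).
        return []
    return sorted(path.stem for path in directory.glob("*.yaml"))
```

Calling print(list_config_names()) from the repository root prints the names that can be passed after -cn.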
Note: add HYDRA_FULL_ERROR=1 before python3 to see the full error stack trace.
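When training is launched from another Python script or a notebook, the same variable can be set in the child process environment. The sketch below is one possible way to do this, assuming train.py is in the current working directory; the helper names build_train_command, build_debug_env, and run_training are hypothetical and are not part of this repository.

```python
import os
import subprocess


def build_train_command(config_name, data_path):
    """Build the argument list for the training command shown above."""
    return ["python3", "train.py", "-cn", config_name, f"+dataset.data_path={data_path}"]


def build_debug_env(base_env=None):
    """Copy the environment and set HYDRA_FULL_ERROR=1 in the copy."""
    env = dict(os.environ if base_env is None else base_env)
    env["HYDRA_FULL_ERROR"] = "1"
    return env


def run_training(config_name, data_path):
    """Launch training as a child process; raises CalledProcessError on failure."""
    subprocess.run(
        build_train_command(config_name, data_path),
        env=build_debug_env(),
        check=True,
    )
```

For example, run_training("CONFIG_NAME", "PATH_TO_WAV48_DIR") with the placeholders replaced by real values is equivalent to the shell command with HYDRA_FULL_ERROR=1 in front.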
Official repository (inference only). The Extra-Adam implementation was taken from this repository, and ECAPA-TDNN from this one.