We provide instructions and pre-trained models for the work "Textless Speech-to-Speech Translation on Real Data" (Lee et al., 2021).
Pre-trained mHuBERT model and quantizer:

Model | Pretraining Data | Download | Quantizer |
---|---|---|---|
mHuBERT Base | VoxPopuli En, Es, Fr speech from the 100k subset | download | L11 km1000 |
Unit-based HiFi-GAN vocoders:

Unit config | Unit size | Vocoder language | Dataset | Model |
---|---|---|---|---|
mHuBERT, layer 11 | 1000 | En | LJSpeech | ckpt, config |
mHuBERT, layer 11 | 1000 | Es | CSS10 | ckpt, config |
mHuBERT, layer 11 | 1000 | Fr | CSS10 | ckpt, config |
Speech normalizers:

Language | Training data | Target unit config | Model |
---|---|---|---|
En | 10 mins | mHuBERT, layer 11, km1000 | download |
En | 1 hr | mHuBERT, layer 11, km1000 | download |
En | 10 hrs | mHuBERT, layer 11, km1000 | download |
Es | 10 mins | mHuBERT, layer 11, km1000 | download |
Es | 1 hr | mHuBERT, layer 11, km1000 | download |
Es | 10 hrs | mHuBERT, layer 11, km1000 | download |
Fr | 10 mins | mHuBERT, layer 11, km1000 | download |
Fr | 1 hr | mHuBERT, layer 11, km1000 | download |
Fr | 10 hrs | mHuBERT, layer 11, km1000 | download |
- Refer to the paper for details of the training data.
- Download the pre-trained models, including the dictionary, to `${DATA_DIR}`.
- Format the audio data:
```bash
# AUDIO_EXT: audio extension, e.g. wav, flac, etc.
# Assume all audio files are at ${AUDIO_DIR}/*.${AUDIO_EXT}
python examples/speech_to_speech/preprocessing/prep_sn_data.py \
  --audio-dir ${AUDIO_DIR} --ext ${AUDIO_EXT} \
  --data-name ${GEN_SUBSET} --output-dir ${DATA_DIR} \
  --for-inference
```
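If you want to sanity-check the result, the script writes a fairseq-style audio manifest to `${DATA_DIR}/${GEN_SUBSET}.tsv`. A sketch of the expected layout is below; the file names and sample counts are invented for illustration:

```bash
# Inspect the generated manifest (tab-separated; the first line is the audio root).
head -n 3 ${DATA_DIR}/${GEN_SUBSET}.tsv
# Expected shape (entries below are made up):
#   /path/to/audio_dir
#   utt_0001.wav    183200
#   utt_0002.wav    96000
```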
- Run the speech normalizer and post-process its output:
```bash
mkdir -p ${RESULTS_PATH}

python examples/speech_recognition/new/infer.py \
  --config-dir examples/hubert/config/decode/ \
  --config-name infer_viterbi \
  task.data=${DATA_DIR} \
  task.normalize=false \
  common_eval.results_path=${RESULTS_PATH}/log \
  common_eval.path=${DATA_DIR}/checkpoint_best.pt \
  dataset.gen_subset=${GEN_SUBSET} \
  '+task.labels=["unit"]' \
  +decoding.results_path=${RESULTS_PATH} \
  common_eval.post_process=none \
  +dataset.batch_size=1 \
  common_eval.quiet=True
```
```bash
# Post-process and generate output at ${RESULTS_PATH}/${GEN_SUBSET}.txt
python examples/speech_to_speech/preprocessing/prep_sn_output_data.py \
  --in-unit ${RESULTS_PATH}/hypo.units \
  --in-audio ${DATA_DIR}/${GEN_SUBSET}.tsv \
  --output-root ${RESULTS_PATH}
```
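As an optional sanity check (not part of the original instructions), the number of normalized unit sequences should match the number of audio files in the manifest:

```bash
# Manifest entries (skip the first line, which holds the audio root directory).
tail -n +2 ${DATA_DIR}/${GEN_SUBSET}.tsv | wc -l
# Normalized unit sequences; the two counts should agree.
wc -l < ${RESULTS_PATH}/${GEN_SUBSET}.txt
```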
The pre-trained vocoders support generating audio from both full unit sequences and reduced unit sequences (i.e. with duplicate consecutive units removed). Set `--dur-prediction` to generate audio from reduced unit sequences.
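As a quick illustration of what "reduced" means (the unit values below are made up), collapsing repeated consecutive units turns a full sequence into a reduced one:

```bash
# Reduce a full unit sequence by collapsing consecutive duplicates.
echo "12 12 12 485 485 77 77 77 485" | tr ' ' '\n' | uniq | paste -sd ' ' -
# Output: 12 485 77 485
```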
```bash
# IN_CODE_FILE contains one unit sequence per line. Units are separated by space.
python examples/speech_to_speech/generate_waveform_from_code.py \
  --in-code-file ${IN_CODE_FILE} \
  --vocoder ${VOCODER_CKPT} --vocoder-cfg ${VOCODER_CFG} \
  --results-path ${RESULTS_PATH} --dur-prediction
```
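For example, one plausible way to chain the steps above (an assumption, not part of the original instructions) is to vocode the speech normalizer output directly:

```bash
# Hypothetical chaining: vocode the normalizer output from the previous step.
# VOCODER_CKPT and VOCODER_CFG point to a vocoder downloaded from the table above;
# keep --dur-prediction only if the unit sequences are reduced.
python examples/speech_to_speech/generate_waveform_from_code.py \
  --in-code-file ${RESULTS_PATH}/${GEN_SUBSET}.txt \
  --vocoder ${VOCODER_CKPT} --vocoder-cfg ${VOCODER_CFG} \
  --results-path ${RESULTS_PATH}/waveforms --dur-prediction
```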
To be updated.