Official PyTorch implementation for the paper "Speech2Lip: High-fidelity Speech to Lip Generation by Learning from a Short Video".
Feel free to contact [email protected] if you have any questions about the code.
- You can create an environment with:
  pip install -r requirements.txt
- Install PyTorch3D from source:
  git clone https://github.com/facebookresearch/pytorch3d.git
  cd pytorch3d/ && pip install -e .
- Place 01_MorphableModel.mat in preprocess/data_util/face_tracking/3DMM/.
- Convert the file:
  cd preprocess/data_util/face_tracking/
  python convert_BFM.py
- FFmpeg is required to cut the video and combine the audio with the silent generated videos.
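Before preprocessing, it can help to confirm that the setup steps above are in place. The following is a minimal sketch, not part of the repository: the helper name `check_setup` and its injectable `which` parameter are hypothetical. It only checks for the FFmpeg binary, the `pytorch3d` package, and the 01_MorphableModel.mat location named above.

```python
import importlib.util
import shutil
from pathlib import Path

# Location of the morphable model file, as given in the setup steps above.
BFM_PATH = Path("preprocess/data_util/face_tracking/3DMM/01_MorphableModel.mat")


def check_setup(repo_root=".", which=shutil.which):
    """Return a dict of booleans describing which prerequisites were found.

    Hypothetical helper (not part of the repository). `which` is injectable
    so the check can be exercised on a machine without FFmpeg installed.
    """
    root = Path(repo_root)
    return {
        # FFmpeg must be reachable on PATH for the preprocessing commands.
        "ffmpeg": which("ffmpeg") is not None,
        # find_spec returns None when a top-level package is not installed.
        "pytorch3d": importlib.util.find_spec("pytorch3d") is not None,
        # The model file must sit in the 3DMM directory before conversion.
        "bfm_model": (root / BFM_PATH).is_file(),
    }


if __name__ == "__main__":
    for name, found in check_setup().items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

Run it from the repository root; any entry reported as missing points to the setup step that still needs to be completed.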
The source videos used in our experiments are referred to as LSP and YouTube Video. In this example, we use May's video and provide the bash scripts. After data preprocessing, the training data will be created in the dataset/may_face_crop_lip/ directory. Please replace it with your own data.
- Video preprocessing
  - Download the original video may.mp4. Refer to LSP for the URL and duration.
  - Convert to images:
    ffmpeg -i may.mp4 -q:v 2 -r 25 %05d.jpg
    - Place the images in dataset/may/images/.
    - Once the data preprocessing is complete, the directory dataset/may/ can be deleted.
  - Extract the audio audio.wav:
    ffmpeg -i may.mp4 -vn -acodec pcm_s16le -ar 16000 audio.wav
    - Place it in dataset/may_face_crop_lip/audio/.
  - For convenience, we provide the cropped video of May here.
- Audio preprocessing
  - Extract the DeepSpeech features audio.npy:
    cd preprocess/deepspeech_features/
    bash extract_ds_features_may.sh
    - If successful, a file named audio.npy will be created in dataset/may_face_crop_lip/audio/.
- Image preprocessing
  - [Only for data preprocessing] Download 79999_iter.pth and place it in preprocess/face_parsing/.
  - Generate all the files for training:
    cd preprocess/
    bash preprocess_may.sh
- Configuration file
  - We offer a sample in configs/face_simple_configs/may/.
  - To train with your data, modify the data-related items which are highlighted in the provided sample.
- [Only for training] Sync expert network
  - Download the Sync expert network.
  - Place lipsync_expert.pth in models/.
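The preprocessing steps above can be sketched in Python as follows. This is an illustrative sketch rather than a script shipped with the repository: the names `REQUIRED_FILES`, `ffmpeg_commands`, and `missing_files` are hypothetical. The FFmpeg flags are copied from the commands above; as a convenience assumption, the outputs are written directly into the target directories instead of being moved there afterwards.

```python
from pathlib import Path

# Files the steps above ask you to download or generate (relative to the repo root).
REQUIRED_FILES = [
    "preprocess/face_parsing/79999_iter.pth",     # face-parsing weights
    "models/lipsync_expert.pth",                  # sync expert, training only
    "dataset/may_face_crop_lip/audio/audio.wav",  # audio extracted from may.mp4
    "dataset/may_face_crop_lip/audio/audio.npy",  # DeepSpeech features
]


def ffmpeg_commands(video="may.mp4",
                    image_dir="dataset/may/images",
                    audio_dir="dataset/may_face_crop_lip/audio"):
    """Build the frame-extraction and audio-extraction commands as argument lists."""
    frames = ["ffmpeg", "-i", video, "-q:v", "2", "-r", "25",
              (Path(image_dir) / "%05d.jpg").as_posix()]
    audio = ["ffmpeg", "-i", video, "-vn", "-acodec", "pcm_s16le",
             "-ar", "16000", (Path(audio_dir) / "audio.wav").as_posix()]
    return frames, audio


def missing_files(repo_root="."):
    """Return the entries of REQUIRED_FILES that do not exist under repo_root."""
    root = Path(repo_root)
    return [p for p in REQUIRED_FILES if not (root / p).is_file()]


if __name__ == "__main__":
    for cmd in ffmpeg_commands():
        print(" ".join(cmd))
    for path in missing_files():
        print("missing:", path)
```

Each command list can be passed to `subprocess.run(cmd, check=True)` once the output directories exist. Running `missing_files()` before training gives a quick list of any downloads or generated files still absent.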
We use May's video as an example and provide the bash scripts.
- Train with command:
  bash scripts/example/train_may.sh
- Our pretrained models are available here.
- To run inference, place the pretrained model model_may.pt in log/face_simple/may.
We use May's video as an example and provide the bash scripts.
- For evaluation:
  - Generate images:
    bash scripts/example/inference_may.sh
    - We split the video into 90% train and 10% test sets.
    - Images are generated in rendering_result/may/example/postfusion.
  - Combine images into a video:
    ffmpeg -r 25 -i %05d.jpg -c:v libx264 -pix_fmt yuv420p output.mp4
  - Combine the video with the test audio:
    ffmpeg -i output.mp4 -i audio_test.wav -c:v copy -c:a aac -strict experimental output_with_audio.mp4
    - For the video demo, split the wav file into 90%/10% using ffmpeg, with the 10% used in inference.
    - We provide audio_test.wav as an example.
  - Evaluation metrics including PSNR, SSIM, CPBD, LMD and Sync score can be applied.
- For any given audio:
  - Place the new audio audio.npy in dataset/may_face_crop_lip/audio_test/, then run:
    bash scripts/example/inference_new_audio_may.sh
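The 90%/10% split mentioned above can be worked through with a small sketch. It assumes a chronological split in which the last 10% of frames form the test set; this README does not state the ordering, so check it against the repository's data loading code. The helper names `split_frames` and `build_test_audio_command` are hypothetical, and the 25 fps rate is taken from the FFmpeg commands above.

```python
FPS = 25  # frame rate used in the FFmpeg commands above


def split_frames(num_frames, train_percent=90):
    """Return (num_train, num_test) for a chronological train/test split.

    Assumption: the first `train_percent` of frames are the training set.
    Integer arithmetic avoids floating-point rounding at the boundary.
    """
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    num_train = num_frames * train_percent // 100
    return num_train, num_frames - num_train


def build_test_audio_command(num_frames, wav="audio.wav",
                             out="audio_test.wav", train_percent=90):
    """Build an FFmpeg command that keeps only the audio of the test frames."""
    num_train, _ = split_frames(num_frames, train_percent)
    start_seconds = num_train / FPS  # time at which the test portion begins
    return ["ffmpeg", "-i", wav, "-ss", f"{start_seconds:.2f}",
            "-acodec", "pcm_s16le", out]


if __name__ == "__main__":
    print(split_frames(6000))
    print(" ".join(build_test_audio_command(6000)))
```

For a 6000-frame video this gives 5400 training frames and 600 test frames, so the test audio starts at 216.00 seconds.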
If you find our work useful in your research, please consider citing our paper:
@inproceedings{wu2023speech2lip,
title={Speech2Lip: High-fidelity Speech to Lip Generation by Learning from a Short Video},
author={Wu, Xiuzhe and Hu, Pengfei and Wu, Yang and Lyu, Xiaoyang and Cao, Yan-Pei and Shan, Ying and Yang, Wenming and Sun, Zhongqian and Qi, Xiaojuan},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
pages={22168--22177},
year={2023}
}
We use face-parsing.PyTorch to compute the head mask in the canonical space, DeepSpeech for audio feature extraction, and Wav2Lip for the sync expert network. We are also highly grateful to ADNeRF for their data preprocessing script.