Runtao Liu, Haoyu Wu
Recent progress in generative diffusion models has greatly advanced text-to-video generation. While text-to-video models trained on large-scale, diverse datasets can produce varied outputs, these generations often deviate from user preferences, highlighting the need for preference alignment on pre-trained models. Although Direct Preference Optimization (DPO) has demonstrated significant improvements in language and image generation, we pioneer its adaptation to video diffusion models and propose a VideoDPO pipeline by making several key adjustments. Unlike previous image alignment methods that focus solely on either (i) visual quality or (ii) semantic alignment between text and videos, we comprehensively consider both dimensions and construct a preference score accordingly, which we term the OmniScore. We design a pipeline to automatically collect preference pair data based on the proposed OmniScore and discover that re-weighting these pairs based on the score significantly impacts overall preference alignment. Our experiments demonstrate substantial improvements in both visual quality and semantic alignment, ensuring that no preference aspect is neglected.
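The pair-collection and re-weighting idea described above can be sketched as follows. This is a minimal illustration, not the repository's implementation: the OmniScore itself is not defined here, so the sketch takes precomputed scores as input, and both the best-versus-worst pair selection and the gap-proportional weight are assumptions made for the example. The names `PreferencePair` and `build_pairs` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PreferencePair:
    prompt: str
    preferred: str  # id of the higher-scoring video
    rejected: str   # id of the lower-scoring video
    weight: float   # assumed here to grow with the score gap


def build_pairs(scored, min_gap=0.0):
    """Build one weighted preference pair per prompt.

    `scored` maps a prompt to a list of (video_id, score) tuples, where the
    score stands in for a precomputed preference score (hypothetical input).
    """
    pairs = []
    for prompt, videos in scored.items():
        if len(videos) < 2:
            continue  # a pair needs at least two candidates
        best = max(videos, key=lambda v: v[1])
        worst = min(videos, key=lambda v: v[1])
        gap = best[1] - worst[1]
        if gap <= min_gap:
            continue  # skip pairs with no usable preference signal
        pairs.append(PreferencePair(prompt, best[0], worst[0], gap))

    # Rescale the weights so that they average to 1 across all pairs.
    total = sum(p.weight for p in pairs)
    for p in pairs:
        p.weight = p.weight * len(pairs) / total
    return pairs
```

Under these assumptions, a pair whose two videos differ strongly in score contributes more to the loss than a pair with nearly equal scores.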
- [2024/12/19] 🔥 We release the paper and the project.
- Merge to VideoTuna
- Release videocrafter2, t2v-turbo training dataset
- Release code for cogvideox
- Release code for videocrafter2 and t2v-turbo
conda create -n videodpo python=3.10 -y
conda activate videodpo
pip install -r requirements.txt
Run the following commands to create the initial checkpoints.
mkdir -p checkpoints/vc2
wget -P checkpoints/vc2 https://huggingface.co/VideoCrafter/VideoCrafter2/resolve/main/model.ckpt
python utils/create_ref_model.py
T2V-Turbo is a latent consistency model (LCM). We provide fine-tuning for the LCM based on VC2. Please download the VC2 checkpoints first, then run:
mkdir -p checkpoints/t2v-turbo
wget -O checkpoints/t2v-turbo/unet_lora.pt "https://huggingface.co/jiachenli-ucsb/T2V-Turbo-VC2/resolve/main/unet_lora.pt?download=true"
Download vidpro-vc2-dataset.tar from the following link, then symlink the dataset to /data/vidpro-dpo-dataset with ln -s. Alternatively, you can add a dataset with the same structure in configs/dpo/vidpro/train_data.yaml.
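The dataset step above can be scripted. The sketch below is a hypothetical helper (`extract_and_link` is not part of the repository) that unpacks the archive and creates the symlink. It assumes the archive's contents should sit directly under the linked directory. If the tar file contains a top-level folder, point the link at that folder instead.

```python
import tarfile
from pathlib import Path


def extract_and_link(tar_path, extract_dir, link_path):
    """Unpack a dataset archive and point `link_path` at the unpacked folder."""
    tar_path = Path(tar_path)
    extract_dir = Path(extract_dir)
    link_path = Path(link_path)

    extract_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(tar_path) as archive:
        archive.extractall(extract_dir)  # only unpack archives you trust

    if link_path.is_symlink():
        link_path.unlink()  # replace a stale link
    elif link_path.exists():
        raise FileExistsError(f"{link_path} exists and is not a symlink")

    link_path.parent.mkdir(parents=True, exist_ok=True)
    link_path.symlink_to(extract_dir.resolve(), target_is_directory=True)
    return link_path


# Example call (the extraction directory is an assumption; choose your own):
# extract_and_link("vidpro-vc2-dataset.tar",
#                  "datasets/vidpro-vc2-dataset",
#                  "/data/vidpro-dpo-dataset")
```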
To reduce peak memory usage during training, we recommend disabling validation by not providing val_data.yaml.
bash configs/vc_dpo/run.sh
We support inference with different types of inputs and outputs. Prompts can be read from either JSON or text files.
bash script_sh/inference_t2v.sh
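To illustrate reading prompts from either format, here is a minimal sketch of a loader. The exact file layout that the inference script expects is not documented here, so the sketch assumes a JSON file holding a list of prompt strings and a text file with one prompt per line. `load_prompts` is a hypothetical helper, not a function from this repository.

```python
import json
from pathlib import Path


def load_prompts(path):
    """Load prompts from a .json file (list of strings) or a plain text file."""
    path = Path(path)
    text = path.read_text(encoding="utf-8")

    if path.suffix.lower() == ".json":
        data = json.loads(text)
        if not isinstance(data, list):
            raise ValueError("expected a JSON list of prompt strings")
        prompts = [str(item) for item in data]
    else:
        # Assumed text layout: one prompt per line.
        prompts = text.splitlines()

    # Drop blank entries and surrounding whitespace.
    return [p.strip() for p in prompts if p.strip()]
```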
bash configs/t2v_turbo_dpo/run.sh
bash configs/t2v_turbo_dpo/turbo_visualize.sh
We also provide some tools to improve the fine-tuning experience. For example, the following command automatically removes training logs that have no saved checkpoints.
python utils/clean_results.py -d ./results
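Before deleting anything, it can help to preview which runs have no checkpoints. The sketch below only lists such directories and removes nothing. It is not the logic of utils/clean_results.py, which is not shown here. It assumes that each immediate subdirectory of the results folder is one run and that checkpoints end in .ckpt or .pt. `runs_without_checkpoints` is a hypothetical name.

```python
from pathlib import Path


def runs_without_checkpoints(results_dir, patterns=("*.ckpt", "*.pt")):
    """Return run directories under `results_dir` that hold no checkpoint file."""
    empty_runs = []
    for run_dir in sorted(Path(results_dir).iterdir()):
        if not run_dir.is_dir():
            continue
        # Search the whole run directory tree for any file matching a pattern.
        has_checkpoint = any(
            next(run_dir.rglob(pattern), None) is not None
            for pattern in patterns
        )
        if not has_checkpoint:
            empty_runs.append(run_dir)
    return empty_runs


# Example: print the candidates for removal under ./results
# for run in runs_without_checkpoints("./results"):
#     print(run)
```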
@misc{liu2024videodpoomnipreferencealignmentvideo,
title={VideoDPO: Omni-Preference Alignment for Video Diffusion Generation},
author={Runtao Liu and Haoyu Wu and Zheng Ziqiang and Chen Wei and Yingqing He and Renjie Pi and Qifeng Chen},
year={2024},
eprint={2412.14167},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2412.14167},
}
Our work is built on the following open-source projects, and we sincerely thank their contributors: VideoCrafter2, T2V-Turbo, CogVideoX, VideoTuna, VBench, and VidProM.
Thanks to I Chieh Chen for valuable suggestions on the demos.
Demo videos: Before Alignment | After Alignment