Welcome to the SAVEn-Vid repository! This project targets long-video comprehension by integrating audio and visual modalities, and reports results on benchmarks such as AVBench. You can find more details in our paper. Code and datasets will be released soon.
- 📊 Benchmark Innovation: Introducing AVBench, a comprehensive evaluation suite for audio-visual reasoning in long video contexts.
- 🛠️ Data Pipeline: Automated, scalable data generation pipeline for large-scale multi-modal datasets.
- 💡 Model Excellence: SAVEn-Vid, an audio-visual large language model, achieves cutting-edge results through temporal-spatial alignment and fusion.
SAVEn-Vid leverages a novel Audio-Visual Temporal-Spatial (AVTS) Resampler, aligning features across time and space to enhance multi-modal understanding in complex, long video scenarios.
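This README does not describe the internals of the AVTS Resampler, so the snippet below is only a rough illustration of what "aligning features across time" can mean in general. It is not the SAVEn-Vid implementation. It assumes audio features are sampled more densely than video frames, mean-pools the audio vectors that fall into each frame's time window, and concatenates the result with the frame feature. The function names `align_audio_to_frames` and `fuse` are hypothetical.

```python
from bisect import bisect_right


def align_audio_to_frames(frame_times, audio_times, audio_feats, dim):
    """Mean-pool audio vectors into the time window owned by each video frame.

    Assumption for this sketch: frame i owns [frame_times[i], frame_times[i + 1]),
    and the last frame owns everything after its timestamp.
    frame_times must be sorted in ascending order.
    """
    sums = [[0.0] * dim for _ in frame_times]
    counts = [0] * len(frame_times)
    for t, feat in zip(audio_times, audio_feats):
        idx = bisect_right(frame_times, t) - 1  # last frame starting at or before t
        if idx < 0:
            continue  # audio before the first frame is dropped
        for d in range(dim):
            sums[idx][d] += feat[d]
        counts[idx] += 1
    # Frames with no audio in their window get a zero vector.
    return [
        [s / c for s in row] if c else [0.0] * dim
        for row, c in zip(sums, counts)
    ]


def fuse(video_feats, pooled_audio):
    """Concatenate each frame feature with its time-aligned audio feature."""
    return [v + a for v, a in zip(video_feats, pooled_audio)]


frame_times = [0.0, 1.0, 2.0]
audio_times = [0.25, 0.75, 1.5]
audio_feats = [[1.0, 3.0], [3.0, 5.0], [10.0, 20.0]]
pooled = align_audio_to_frames(frame_times, audio_times, audio_feats, dim=2)
fused = fuse([[0.1], [0.2], [0.3]], pooled)
```

A learned resampler would replace the fixed mean-pooling with trainable attention, but the sketch shows the basic idea of bringing both modalities onto a shared timeline before fusing them.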
AVBench is our tailored benchmark for evaluating advanced audio-visual reasoning tasks in long video contexts. Explore our illustrative comparison with existing benchmarks below:
📄 [AVBench vs. Existing Benchmarks]
Generate high-quality audio-visual datasets with our scalable pipeline designed for efficiency and robustness.
📄 [Pipeline Overview]
SAVEn-Vid combines temporal-spatial alignment, adaptive resampling, and multi-modal feature fusion.
SAVEn-Vid (7B) outperforms the best competitor on VideoMME and Music-AVQA, and trails GPT-4 on AVBench:
| Benchmark | Metric | SAVEn-Vid (7B) | Best Competitor |
|---|---|---|---|
| AVBench | Accuracy | 66.7% | 77.29% (GPT-4) |
| VideoMME | Accuracy | 56.21% | 54.92% |
| Music-AVQA | Accuracy | 83.14% | 81.85% |
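To make the gaps in the table easier to read, here is a small sketch that computes the margin (in accuracy points) between SAVEn-Vid (7B) and the best competitor on each benchmark. The numbers are copied from the table above. The `results` dictionary and the `margins` helper are illustrative names, not part of the repository.

```python
# Accuracy figures (in %) copied from the results table:
# benchmark -> (SAVEn-Vid 7B, best competitor)
results = {
    "AVBench": (66.7, 77.29),
    "VideoMME": (56.21, 54.92),
    "Music-AVQA": (83.14, 81.85),
}


def margins(table):
    """Return SAVEn-Vid accuracy minus best-competitor accuracy, in points."""
    return {name: round(ours - best, 2) for name, (ours, best) in table.items()}


for name, delta in margins(results).items():
    print(f"{name}: {delta:+.2f} points")
```

A positive margin means SAVEn-Vid leads. With the table's numbers the margin is +1.29 points on both VideoMME and Music-AVQA, and -10.59 points on AVBench.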
    git clone https://github.com/username/SAVEn-Vid.git
    cd SAVEn-Vid
### TBD
✨ If you find SAVEn-Vid useful, don’t forget to ⭐ the repo! ✨
    @article{li2024savenvid,
      author  = {Jungang Li and Sicheng Tao and Yibo Yan and Xiaojie Gu and Haodong Xu and Xu Zheng and Yuanhuiyi Lyu and Linfeng Zhang and Xuming Hu},
      title   = {SAVEn-Vid: Synergistic Audio-Visual Integration for Enhanced Understanding in Long Video Context},
      journal = {arXiv preprint arXiv:2411.16213},
      year    = {2024},
    }