SAVEn-Vid: Synergistic Audio-Visual Integration for Enhanced Understanding in Long Video Context

LJungang/SAVEn-Vid

🌟 SAVEn-Vid: Synergistic Audio-Visual Integration for Enhanced Understanding in Long Video Context

Welcome to the SAVEn-Vid repository! This project targets long video comprehension by integrating audio and visual modalities, and is evaluated on benchmarks such as AVBench, VideoMME, and Music-AVQA. You can find more details in our paper (arXiv:2411.16213). Code and datasets will be released soon.

🎯 Highlights

  • 📊 Benchmark Innovation: Introducing AVBench, a comprehensive evaluation suite for audio-visual reasoning in long video contexts.
  • 🛠️ Data Pipeline: Automated, scalable data generation pipeline for large-scale multi-modal datasets.
  • 💡 Model Excellence: SAVEn-Vid, an audio-visual large language model, achieves cutting-edge results through temporal-spatial alignment and fusion.

🏆 Model Overview

SAVEn-Vid leverages a novel Audio-Visual Temporal-Spatial (AVTS) Resampler, aligning features across time and space to enhance multi-modal understanding in complex, long video scenarios.
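This README does not describe the internals of the AVTS Resampler, so the following is only a generic sketch of the resampling idea: a small, fixed set of query vectors attends over the concatenated audio and visual tokens and returns a fixed-size summary. It is not the authors' implementation. The names `resample`, `visual_tokens`, `audio_tokens`, and `queries`, the dimensions, and the random stand-in features are all assumptions made for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def resample(queries, tokens):
    """Generic cross-attention: each query returns a weighted average of the tokens.

    Hypothetical helper for illustration; not the AVTS Resampler itself.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in queries:
        # Scaled dot-product score between this query and every token.
        scores = [scale * sum(qi * ti for qi, ti in zip(q, t)) for t in tokens]
        weights = softmax(scores)
        # Weighted average of the tokens, one output vector per query.
        out.append([sum(w * t[d] for w, t in zip(weights, tokens)) for d in range(dim)])
    return out


rng = random.Random(0)
dim = 4
# Random stand-ins for encoder outputs (assumed shapes, not real features).
visual_tokens = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(12)]
audio_tokens = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(6)]
# Stand-in for query vectors that would be learned in a trained model.
queries = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(3)]

# 18 input tokens are compressed into 3 fused vectors of the same width.
fused = resample(queries, visual_tokens + audio_tokens)
print(len(fused), len(fused[0]))
```

Whatever the number of input tokens, the output has one vector per query, which is what makes this kind of resampling attractive for long videos.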

SAVEn-Vid Architecture

🚀 Features

🔍 AVBench Benchmark

AVBench is our tailored benchmark for evaluating advanced audio-visual reasoning tasks in long video contexts. Explore our illustrative comparison with existing benchmarks below:

📄 [AVBench vs. Existing Benchmarks]

📦 Automated Data Pipeline

Generate high-quality audio-visual datasets with our scalable pipeline designed for efficiency and robustness.

📄 [Pipeline Overview]

🧠 SAVEn-Vid Model

SAVEn-Vid combines temporal-spatial alignment, adaptive resampling, and multi-modal feature fusion to achieve strong results on long-video audio-visual benchmarks.

📈 Performance

SAVEn-Vid (7B) outperforms the best competing model on VideoMME and Music-AVQA; on AVBench, GPT-4 remains ahead:

| Benchmark | Metric | SAVEn-Vid (7B) | Best Competitor |
|---|---|---|---|
| AVBench | Accuracy | 66.7% | 77.29% (GPT-4) |
| VideoMME | Accuracy | 56.21% | 54.92% |
| Music-AVQA | Accuracy | 83.14% | 81.85% |
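
To make the gaps in the table above easier to read, here is a small sketch that computes the margin, in percentage points, between SAVEn-Vid (7B) and the best competitor on each benchmark. The numbers are copied from the table; the variable names `results` and `margins` are only illustrative.

```python
# Accuracy figures (percent) copied from the table above:
# benchmark -> (SAVEn-Vid 7B, best competitor)
results = {
    "AVBench": (66.7, 77.29),
    "VideoMME": (56.21, 54.92),
    "Music-AVQA": (83.14, 81.85),
}

# Positive margin: SAVEn-Vid is ahead; negative: the competitor is ahead.
margins = {name: round(ours - best, 2) for name, (ours, best) in results.items()}

for name, margin in margins.items():
    print(f"{name}: {margin:+.2f} points")
# AVBench: -10.59 points
# VideoMME: +1.29 points
# Music-AVQA: +1.29 points
```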

📖 Getting Started

1️⃣ Clone the Repository

git clone https://github.com/LJungang/SAVEn-Vid.git
cd SAVEn-Vid
# Remaining setup steps are TBD until the code and datasets are released.

✨ If you find SAVEn-Vid useful, don’t forget to ⭐ the repo! ✨

Citation

@article{li2024savenvid,
  author  = {Jungang Li and Sicheng Tao and Yibo Yan and Xiaojie Gu and Haodong Xu and Xu Zheng and Yuanhuiyi Lyu and Linfeng Zhang and Xuming Hu},
  title   = {SAVEn-Vid: Synergistic Audio-Visual Integration for Enhanced Understanding in Long Video Context},
  journal = {arXiv preprint arXiv:2411.16213},
  year    = {2024},
}
