Long-context capability is critical for multi-modal foundation models, especially for long video understanding. We introduce LongVILA, a full-stack solution for long-context visual-language models that co-designs the algorithm and the system. For model training, we upgrade existing VLMs to support long video understanding by adding two stages, i.e., long context extension and long video supervised fine-tuning. However, training on long videos is computationally and memory intensive. We introduce the long-context Multi-Modal Sequence Parallelism (MM-SP) system, which efficiently parallelizes long video training and inference and enables 2M-context-length training on 256 GPUs without any gradient checkpointing. LongVILA efficiently extends the number of video frames of VILA from 8 to 2048, achieving 99.8% accuracy on the 6,000-frame (more than 1 million tokens) video needle-in-a-haystack task. LongVILA-7B demonstrates strong accuracy on 9 popular video benchmarks, e.g., 65.1% on VideoMME with subtitles. In addition, MM-SP is 2.1x - 5.7x faster than ring-style sequence parallelism and 1.1x - 1.4x faster than Megatron with hybrid context and tensor parallelism. Moreover, it seamlessly integrates with Hugging Face Transformers.
Multi-Modal Sequence Parallelism System
6000-frame Needle in the Haystack (More than 1M context)
./environment_setup.sh vila
| Model | LLM Size | Context | Training Frames | Link |
|---|---|---|---|---|
| LongVILA-1.5B-256f | 1.5B | 65536 | 256 | qwen2-1.5b-longvila-256f |
| LongVILA-7B-256f | 7B | 131072 | 256 | qwen2-7b-longvila-256f |
| LongVILA-7B-1M | 7B | 1048576 | 2048 | qwen2-7b-longvila-1M |
| Dataset Usage | Link | Comments |
|---|---|---|
| Stage 4 - LLM Context Extension | 64k 256k 512k 1M | SlimPajama encoded with the Qwen2 tokenizer |
| Stage 5 - Long Video SFT | Data | Long videos sourced from Shot2Story |
We conduct continued training (Stage 4 and Stage 5) based on a VILA model as follows.
This is the first continued-training stage of LongVILA (Stage 4), in which we tune the LLM in the VILA model to long context using the SlimPajama dataset. For the 7B model, this stage runs on one 8xA100 node for 64k context extension and on at least two 8xA100 nodes for 256k context extension.
bash scripts/v1_5/train/8b/4_extend_llm_64k.sh [STAGE3_PATH] [OUTPUT_NAME] [DATA_FILE]
The script takes three arguments. `STAGE3_PATH` points to the trained VILA model. `OUTPUT_NAME` is the desired folder name under `checkpoints` that stores the final checkpoint. `DATA_FILE` is the file that stores the 64k-context SlimPajama data.
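For illustration, a hypothetical invocation could look like the following; the checkpoint and data paths are placeholders, not files shipped with the repository:

```bash
# Hypothetical example -- replace the paths with your own.
# Arguments: STAGE3_PATH OUTPUT_NAME DATA_FILE
bash scripts/v1_5/train/8b/4_extend_llm_64k.sh \
    checkpoints/vila-8b-stage3 \
    llm-extend-64k \
    ./data/slimpajama_64k
```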
bash scripts/v1_5/train/8b/4_extend_llm_256k.sh [EXTENDED_64k_PATH] [OUTPUT_NAME] [DATA_FILE]
The script performs progressive training from 64k context to 256k context. `EXTENDED_64k_PATH` points to the `OUTPUT_NAME` of `4_extend_llm_64k.sh`. `DATA_FILE` is the file that stores the 256k-context SlimPajama data. If you do not need to train models on more than 256 frames (e.g., 512 or 1024 frames), you can skip this 256k context extension step.
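As an illustration, the 256k step might chain onto the 64k output like this; the paths are again placeholders:

```bash
# Hypothetical example -- EXTENDED_64k_PATH corresponds to the OUTPUT_NAME of 4_extend_llm_64k.sh.
# Arguments: EXTENDED_64k_PATH OUTPUT_NAME DATA_FILE
bash scripts/v1_5/train/8b/4_extend_llm_256k.sh \
    checkpoints/llm-extend-64k \
    llm-extend-256k \
    ./data/slimpajama_256k
```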
The 512k and 1M context extension scripts follow similar steps.
This is the last stage of LongVILA training (Stage 5), in which we tune the model to follow long video instructions. This stage runs on 32 8xH100 nodes for all configurations (i.e., 256 frames and 512 frames).
bash scripts/v1_5/train/8b/5_long_sft_256frames.sh [EXTENDED_64k_PATH] [OUTPUT_NAME]
bash scripts/v1_5/train/8b/5_long_sft_512frames.sh [EXTENDED_256k_PATH] [OUTPUT_NAME]
The scripts take two arguments. `EXTENDED_64k_PATH` and `EXTENDED_256k_PATH` point to the `OUTPUT_NAME` of the corresponding stage 4 script. `OUTPUT_NAME` is the desired folder name under `checkpoints` that stores the final checkpoint.
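For example, continuing from the hypothetical 64k extension output above (all names are placeholders):

```bash
# Hypothetical example -- EXTENDED_64k_PATH is the stage 4 output, OUTPUT_NAME is the SFT checkpoint folder.
bash scripts/v1_5/train/8b/5_long_sft_256frames.sh \
    checkpoints/llm-extend-64k \
    longvila-8b-256frames-sft
```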
The 1024-frame and 2048-frame training scripts follow similar steps.
Note
💡Sequence Parallelism Configuration
To enable sequence parallelism, you can set the following parameters in the training script:
- `seq_parallel_size`: The degree of sequence parallelism (SP). SP is disabled by default (value: -1).
- `seq_parallel_ring_size`: The size of the communication process group that uses the optimized Ring Attention approach within SP. The Ring Attention approach is disabled by default in SP.
- `seq_parallel_ring_type`: The Ring Attention implementation. Supports ['ring_varlen', 'zigzag_ring_varlen'] in 2D attention. Only takes effect when `seq_parallel_ring_size` > 1.
Please note that when SP is enabled, we treat each group of `seq_parallel_size` GPUs as a single device, with the global batch size calculated as the product of the per-device batch size and the data parallelism size.
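As a minimal sketch of how these options might be wired up (assuming they are passed as command-line flags to the training entry point; verify against the released stage 5 scripts):

```bash
# Hypothetical excerpt -- $TRAINING_ENTRYPOINT stands in for the repo's actual training
# entry point, and the exact flag spelling should be checked against the released scripts.
# With 8 GPUs and seq_parallel_size=4, the data parallelism size is 8 / 4 = 2, so the
# global batch size = per-device batch size x 2.
torchrun --nproc_per_node=8 $TRAINING_ENTRYPOINT \
    --seq_parallel_size 4 \
    --seq_parallel_ring_size 2 \
    --seq_parallel_ring_type zigzag_ring_varlen \
    "$@"
```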
bash scripts/v1_5/eval/needle.sh LongVILA-7B-1M Efficient-Large-Model/qwen2-7b-longvila-1M $VIDEO_PATH 6000 300
vila-eval -m Efficient-Large-Model/LongVILA-7B-256f -c auto -nf $NUM_VIDEO_FRAMES -t $TASKS
`TASKS` can be chosen from {lmms-videomme-256,lmms-videomme_w_subtitle-256,vnbench_val,lmms-activitynetqa,egoschema_test,egoschema_val,eventbench_val,lmms-longvideobench_val_v,lmms-perceptiontest_val_mc,lmms-mvbench,lmms-nextqa_mc_test}. We set `NUM_VIDEO_FRAMES` to 256 for VideoMME, 128 for VNBench, and 32 for the others.
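For example, evaluating VideoMME with 256 frames (matching the frame settings above) could look like:

```bash
# Example: VideoMME with 256 frames, per the frame settings above.
NUM_VIDEO_FRAMES=256
TASKS=lmms-videomme-256
vila-eval -m Efficient-Large-Model/LongVILA-7B-256f -c auto -nf $NUM_VIDEO_FRAMES -t $TASKS
```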
- The code is released under the Apache 2.0 license as found in the LICENSE file.
- The pretrained weights are released under the CC-BY-NC-SA-4.0 license.
- The service is a research preview intended for non-commercial use only, and is subject to the following licenses and terms:
- Model License of Qwen2. For the terms of use of Qwen2-LongVILA checkpoints, please refer to the Qwen2 License for additional details.
- Terms of Use of the data generated by OpenAI
- Dataset Licenses for each one used during training.
@article{longvila,
  title={LongVILA: Scaling Long-Context Visual Language Models for Long Videos},
  author={Yukang Chen and Fuzhao Xue and Dacheng Li and Qinghao Hu and Ligeng Zhu and Xiuyu Li and Yunhao Fang and Haotian Tang and Shang Yang and Zhijian Liu and Yihui He and Hongxu Yin and Pavlo Molchanov and Jan Kautz and Linxi Fan and Yuke Zhu and Yao Lu and Song Han},
  year={2024},
  eprint={2408.10188},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
- LLaVA: the codebase we built upon. Thanks for their wonderful work.
- LongVA: we borrowed the long video needle in the haystack evaluation script from this repository.
- LongLoRA: we modified the low-rank long-context fine-tuning code from this repository.
- USP (YunChang): we adopted the 2D attention implementation from this repository.
- DeepSpeed Ulysses: we adopted the all-to-all implementation from this repository.
- RingFlashAttention: we adopted the ring flash attention implementation from this repository.
- Video-ChatGPT: we borrowed the video evaluation script from this repository.
- MMC4, COYO-700M, M3IT, OpenORCA/FLAN, ShareGPT4V, WIT, GSM8K-ScRel, VisualGenome, VCR, ScienceQA, Shot2Story, Youcook2, Vatex, ShareGPT-Video, ShareGPT4o for providing datasets used in this research.