Paper | Installation | Eviction | Quantization
We provide three implementations: ThinK_eager contains the code for eager attention, ThinK_flash uses FlashAttention, and ThinK_kivi integrates with KV quantization (KIVI). Please note that the current implementations may not be fully optimized; we are actively working on improving their efficiency. We use LongBench to evaluate performance.
- Support More Models
- Support Multi-GPUs
- Optimize Efficiency
Step 1: Clone this repository
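A minimal sketch of this step, assuming the repository is hosted under the Salesforce AI Research organization on GitHub (substitute the actual URL and directory name if they differ):

git clone https://github.com/SalesforceAIResearch/ThinK.git
cd ThinK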
Step 2: Set up the environment
conda create -n think python=3.10
conda activate think
pip install -r requirements.txt
Evaluate on LongBench: first modify the hyperparameters in scripts/scripts_longBench/eval.sh (e.g., pruning_ratio; see the illustrative snippet after the commands below), then run:
cd ThinK_flash
sh ./scripts/scripts_longBench/eval.sh
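The snippet below illustrates the kind of variables eval.sh might expose; pruning_ratio is the argument named in this README, while the other names and values are placeholders that may not match the actual script:

# Illustrative only: placeholder names and values, except pruning_ratio
model_path=meta-llama/Llama-2-7b-chat-hf   # placeholder model identifier
pruning_ratio=0.4                          # pruning ratio used by ThinK (illustrative value)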
Results:
sh ./scripts/scripts_longBench/metrics.sh
cd ThinK_kivi
Set up the environment as per the instructions from KIVI, adding one additional argument, pruning_ratio. Currently, only LLaMA-2 is supported.
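For illustration only, a LongBench-style run in ThinK_kivi might add this argument on top of KIVI's usual quantization flags; the script name and every flag other than pruning_ratio are assumptions based on KIVI's examples and may differ:

# Hypothetical invocation; only --pruning_ratio is the ThinK-specific addition
python pred_long_bench.py \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --k_bits 2 --v_bits 2 --group_size 32 --residual_length 128 \
    --pruning_ratio 0.4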
Users need to make their own assessment regarding any obligations or responsibilities under the corresponding licenses or terms and conditions pertaining to the original datasets and data. This repository is being released for research purposes only.
@article{xu2024think,
title={ThinK: Thinner Key Cache by Query-Driven Pruning},
author={Xu, Yuhui and Jie, Zhanming and Dong, Hanze and Wang, Lei and Lu, Xudong and Zhou, Aojun and Saha, Amrita and Xiong, Caiming and Sahoo, Doyen},
journal={arXiv preprint arXiv:2407.21018},
year={2024}
}