Quamba: A Post-Training Quantization Recipe for Selective State Space Models

Hung-Yueh Chiang, Chi-Chih Chang, Natalia Frumkin, Kai-Chiang Wu, Diana Marculescu

  • ⚡ 8-bit quantization (W8A8) for Mamba blocks
  • 🚀 1.7× speedup on Orin Nano 8G
  • 🔻 2× memory reduction

Real-time Generation on an NVIDIA Orin Nano 8G

Setup

Hardware Requirements

  • NVIDIA GPU Ampere architecture or above

Software Requirements

  • CUDA 12.1 or above
  • CMake 3.22.1 or above
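
The software requirements above can be checked before starting the build. The following is a minimal sketch and not part of the repository: it assumes `nvcc` and `cmake` are on `PATH`, that `nvcc --version` prints a `release X.Y` string, and that `cmake --version` prints `cmake version X.Y.Z`. All helper names are hypothetical.

```python
import re
import shutil
import subprocess

# Minimum versions taken from the requirements listed above.
MIN_CUDA = (12, 1)
MIN_CMAKE = (3, 22, 1)


def parse_version(text, pattern):
    """Return the first version matched by `pattern` as a tuple of ints, or None."""
    match = re.search(pattern, text)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def parse_cuda_version(nvcc_output):
    # Assumes nvcc prints a line such as "... release 12.1, V12.1.105".
    return parse_version(nvcc_output, r"release (\d+\.\d+)")


def parse_cmake_version(cmake_output):
    # Assumes cmake prints a line such as "cmake version 3.22.1".
    return parse_version(cmake_output, r"cmake version (\d+(?:\.\d+)*)")


def meets_minimum(found, minimum):
    """True if a parsed version tuple is at least the required minimum."""
    return found is not None and found >= minimum


def tool_output(tool):
    """Run `<tool> --version` and return its stdout, or None if the tool is missing."""
    if shutil.which(tool) is None:
        return None
    result = subprocess.run([tool, "--version"], capture_output=True, text=True)
    return result.stdout


if __name__ == "__main__":
    checks = [
        ("CUDA", "nvcc", parse_cuda_version, MIN_CUDA),
        ("CMake", "cmake", parse_cmake_version, MIN_CMAKE),
    ]
    for name, tool, parser, minimum in checks:
        output = tool_output(tool)
        found = parser(output) if output is not None else None
        status = "OK" if meets_minimum(found, minimum) else "missing or too old"
        print(f"{name}: found {found}, need >= {minimum}: {status}")
```

Versions are compared as integer tuples rather than strings, so that, for example, 3.9 sorts below 3.22.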

Clone Quamba

  • Clone the repository with all submodules:
git clone --recurse-submodules git@github.com:enyac-group/Quamba.git
  • Run in Docker (optional)

To build the Docker image with customized kernels, run the following commands:

cd docker
./build_docker.sh
./run.sh # launch the container

Or pull the pre-built Docker image:

docker image pull hychiang/quamba-cuda-12.1:latest
  • Create the Quamba conda environment:
cd Quamba
conda create -n quamba python=3.10
conda activate quamba
pip install -r requirements.txt

Build 3rd-party Libraries

  • Install fast-hadamard-transform:
# set force build to include 12N, 40N from the newer commit
export FAST_HADAMARD_TRANSFORM_FORCE_BUILD=TRUE
pip install 3rdparty/fast-hadamard-transform
  • Install lm-evaluation-harness:
# lm_eval-0.4.2 word2number-1.1
pip install 3rdparty/lm-evaluation-harness
  • Install mamba:
# set force build to use the commit for Quamba
export MAMBA_FORCE_BUILD=TRUE
pip install 3rdparty/mamba
  • Install CUTLASS:
# cmake version >= 3.22.1
bash build_cutlass.sh

Build Quamba

pip install .

Generate

To generate text with Mamba (FP16) from an input prompt:

python generate.py state-spaces/mamba-130m --prompt "My cat wrote all this CUDA code for a new language model and" --topp 0.9 --temperature 0.7 --repetition_penalty 1.2

To generate text with Quamba (Int8) from an input prompt:

python generate.py state-spaces/mamba-130m --prompt "My cat wrote all this CUDA code for a new language model and" --topp 0.9 --temperature 0.7 --repetition_penalty 1.2 --quantize --act_scales_cache mamba-130m_scales.pt
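
The `--temperature`, `--topp`, and `--repetition_penalty` flags correspond to common sampling controls. The sketch below shows how such controls are typically applied to next-token logits. It is an illustration under assumptions, not the code in `generate.py`: the order of operations, the CTRL-style penalty rule, and all function names are assumptions, and the repository's behavior may differ.

```python
import math
import random


def apply_repetition_penalty(logits, previous_tokens, penalty):
    """Make already-generated tokens less likely (common CTRL-style rule)."""
    out = list(logits)
    for tok in set(previous_tokens):
        # Dividing a positive logit or multiplying a negative one both lower it.
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; temperature < 1 sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return {i: probs[i] / mass for i in kept}


def sample_next_token(logits, previous_tokens, temperature=0.7, top_p=0.9,
                      repetition_penalty=1.2, rng=random):
    """Draw one token id; the defaults mirror the flag values in the command above."""
    logits = apply_repetition_penalty(logits, previous_tokens, repetition_penalty)
    probs = softmax(logits, temperature)
    nucleus = top_p_filter(probs, top_p)
    tokens = list(nucleus)
    weights = [nucleus[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]
```

Passing a seeded `random.Random` instance as `rng` makes the draw reproducible.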

Chat

To chat with Mamba (FP16), use the command:

python chat.py --cache_graph

To chat with Quamba (Int8), use the command:

python chat.py --cache_graph --act_scales_cache mamba-2.8b_scales_chat.pt --quantize

Profile latency and memory

  • To profile time-to-first-token (prefilling stage):
python profile_mamba.py state-spaces/mamba-2.8b  --act_scales_cache mamba-2.8b_scales.pt --prompt_len 512 --ttft
  • To profile time-per-output-token (generation stage):
python profile_mamba.py state-spaces/mamba-2.8b  --act_scales_cache mamba-2.8b_scales.pt --tpot
  • To profile time-to-last-token (prefilling + generation stage):
python profile_mamba.py state-spaces/mamba-2.8b  --act_scales_cache mamba-2.8b_scales.pt --prompt_len 512 --gen_len 512 --ttlt
  • To profile memory usage (prefilling + generation stage):
python profile_mamba.py state-spaces/mamba-2.8b  --act_scales_cache mamba-2.8b_scales.pt --prompt_len 512 --gen_len 512 --size
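
The three latency metrics above relate to each other as follows: time-to-first-token covers the prefilling stage, time-per-output-token is the average cost of one generation step, and time-to-last-token covers both stages. The sketch below measures them with a wall-clock timer. It is not the code in `profile_mamba.py`: `prefill` and `decode_step` are hypothetical stand-ins for a model, and on a GPU a real measurement would also need to synchronize the device before each clock read.

```python
import time


def profile_latency(prefill, decode_step, prompt, gen_len):
    """Time a prefill pass followed by gen_len single-token decode steps.

    `prefill(prompt)` returns an initial state and `decode_step(state)` returns
    the next state. Both are hypothetical stand-ins for a real model.
    """
    start = time.perf_counter()
    state = prefill(prompt)                 # prefilling stage
    ttft = time.perf_counter() - start      # time-to-first-token

    step_times = []
    for _ in range(gen_len):                # generation stage
        t0 = time.perf_counter()
        state = decode_step(state)
        step_times.append(time.perf_counter() - t0)

    ttlt = time.perf_counter() - start      # time-to-last-token
    tpot = sum(step_times) / len(step_times) if step_times else 0.0
    return {"ttft": ttft, "tpot": tpot, "ttlt": ttlt}


if __name__ == "__main__":
    # Dummy workload with the same prompt and generation lengths as the commands above.
    stats = profile_latency(
        prefill=lambda prompt: len(prompt),
        decode_step=lambda state: state + 1,
        prompt=[0] * 512,
        gen_len=512,
    )
    for name, seconds in stats.items():
        print(f"{name}: {seconds * 1e3:.3f} ms")
```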

Fake Quantization Evaluation

To evaluate the simulated quantization:

python main.py state-spaces/mamba-130m fake \
--do_hadamard \
--do_percentile_u \
--batch_size 16 \
--task_list lambada_openai \
--eval_zero_shot \
--log_dir logs
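
Simulated ("fake") quantization generally means rounding values onto the integer grid and immediately converting them back to floating point, so the accuracy impact can be measured without integer kernels. The sketch below illustrates the idea with symmetric per-tensor scaling and an optional percentile-based clipping value, in the spirit of the `--do_percentile_u` flag. The function names, the nearest-rank percentile, and the percentile used in the example are assumptions for illustration, not the repository's implementation or defaults.

```python
import math


def percentile_abs_max(values, percentile):
    """Return the given percentile of |values| using the nearest-rank method."""
    magnitudes = sorted(abs(v) for v in values)
    rank = max(1, math.ceil(percentile / 100.0 * len(magnitudes)))
    return magnitudes[rank - 1]


def fake_quantize(values, percentile=100.0, n_bits=8):
    """Quantize to signed n_bits integers, then dequantize back to floats."""
    q_max = 2 ** (n_bits - 1) - 1                  # 127 for int8
    clip = percentile_abs_max(values, percentile)  # 100.0 means plain abs-max
    scale = clip / q_max if clip > 0 else 1.0
    dequantized = []
    for v in values:
        q = round(v / scale)                       # snap to the integer grid
        q = max(-q_max, min(q_max, q))             # clamp values beyond the clip
        dequantized.append(q * scale)              # back to floating point
    return dequantized, scale


if __name__ == "__main__":
    # Small activations plus one large outlier.
    acts = [0.1 * i for i in range(-10, 11)] + [100.0]
    for pct in (100.0, 95.0):
        deq, scale = fake_quantize(acts, percentile=pct)
        err = max(abs(a - b) for a, b in zip(acts[:-1], deq[:-1]))
        print(f"percentile={pct}: scale={scale:.5f}, max error on small values={err:.5f}")
```

With plain abs-max scaling the single outlier sets the scale, so the small values lose resolution. Clipping at a lower percentile gives them a finer grid at the cost of saturating the outlier.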

Real Quantization Evaluation

To evaluate the end-to-end quantization:

python main.py state-spaces/mamba-130m real \
--act_scales_cache mamba-130m_scales.pt \
--batch_size 1 \
--task_list lambada_openai \
--eval_zero_shot \
--log_dir logs
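
In a W8A8 scheme, both weights and activations are held as 8-bit integers. The matrix multiply accumulates in integers, and one floating-point rescale recovers the real-valued output. The sketch below shows this for a single linear layer with symmetric per-tensor scales. It is an illustration only, not the repository's CUDA kernels. It assumes the activation scale is calibrated ahead of time, which is presumably what the file passed to `--act_scales_cache` provides. All function names are hypothetical.

```python
def quantize_int8(values, scale):
    """Map floats to integers in [-127, 127] using a fixed scale."""
    return [max(-127, min(127, round(v / scale))) for v in values]


def w8a8_linear(x, weight, x_scale, w_scale):
    """Compute y = W x with int8 operands and integer accumulation.

    x is a list of floats (activations); weight is a list of rows of floats.
    x_scale is assumed to be calibrated offline; w_scale comes from the weights.
    """
    x_q = quantize_int8(x, x_scale)
    y = []
    for row in weight:
        row_q = quantize_int8(row, w_scale)
        acc = sum(a * b for a, b in zip(row_q, x_q))  # integer accumulation
        y.append(acc * x_scale * w_scale)             # single rescale to float
    return y


if __name__ == "__main__":
    x = [0.5, -1.0, 0.25]
    W = [[0.2, -0.1, 0.4], [1.0, 0.5, -0.5]]
    x_scale = max(abs(v) for v in x) / 127
    w_scale = max(abs(v) for row in W for v in row) / 127
    y_int8 = w8a8_linear(x, W, x_scale, w_scale)
    y_ref = [sum(a * b for a, b in zip(row, x)) for row in W]
    print("int8 path:", y_int8)
    print("reference:", y_ref)
```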

Citation

@article{chiang2024quamba,
  title={Quamba: A Post-Training Quantization Recipe for Selective State Space Models},
  author={Chiang, Hung-Yueh and Chang, Chi-Chih and Frumkin, Natalia and Wu, Kai-Chiang and Marculescu, Diana},
  journal={arXiv preprint arXiv:2410.13229},
  year={2024}
}