Hydra: Bidirectional State Space Models Through Generalized Matrix Mixers
Sukjun Hwang*, Aakash Lahoti*, Tri Dao, Albert Gu
Paper: https://arxiv.org/abs/2407.09941
Blogpost: https://goombalab.github.io/blog/2024/hydra-part1-matrix-mixer/
Follow the installation instructions for Mamba. In most cases, this is simply:
pip install mamba-ssm
[Optional] For training BERT (./hydra/bert), install the additional required packages via pip install -r requirements.txt
The quasiseparable matrix mixer, Hydra, is our best model for bidirectional sequence processing (details in Section 3).
The implementation is at ./hydra/modules/hydra.py.
import torch
from .hydra import Hydra
batch, length, dim = 2, 64, 16
x = torch.randn(batch, length, dim).to("cuda")
model = Hydra(
d_model=dim, # Model dimension d_model
d_state=64, # SSM state expansion factor
d_conv=7, # Local non-causal convolution width
expand=2, # Block expansion factor
use_mem_eff_path=False, # Nightly release. Thanks to Alston Lo
).to("cuda")
y = model(x)
assert y.shape == x.shape
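To build intuition for what "bidirectional" means here, the following is a minimal, dependency-free sketch of quasiseparable-style mixing on a single channel: a shifted forward scan, a shifted backward scan, and a diagonal term. It assumes a scalar state with fixed parameters; the names causal_scan, shift, and quasiseparable_mix are hypothetical and are not part of this repository. The actual Hydra module uses a multi-dimensional state, input-dependent parameters, and optimized kernels.

```python
def causal_scan(x, a, b, c):
    """Scalar linear recurrence: h_t = a * h_{t-1} + b * x_t, y_t = c * h_t."""
    h, ys = 0.0, []
    for x_t in x:
        h = a * h + b * x_t
        ys.append(c * h)
    return ys


def shift(y):
    """Shift right by one step so that position t only sees inputs before t."""
    return [0.0] + y[:-1]


def quasiseparable_mix(x, fwd, bwd, d):
    """Combine a forward scan, a backward scan, and a diagonal term.

    fwd and bwd are (a, b, c) tuples for the two directions (illustrative
    assumption: scalar, input-independent parameters); d scales the input.
    """
    forward = shift(causal_scan(x, *fwd))
    # Flip the sequence, scan and shift, then flip back.
    backward = shift(causal_scan(x[::-1], *bwd))[::-1]
    return [f + b + d * x_t for f, b, x_t in zip(forward, backward, x)]


# An impulse in the middle of the sequence reaches both neighbors.
print(quasiseparable_mix([0.0, 1.0, 0.0], (0.5, 1.0, 1.0), (0.5, 1.0, 1.0), 2.0))
# prints [1.0, 2.0, 1.0]
```

Setting the backward output weight and the diagonal term to zero recovers a strictly causal mixer, which is one way to see how the two directions are kept separate in this sketch.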
The matrix mixer framework is implemented at ./hydra/modules/matrix_mixer.py.
You can integrate your own mixer matrix by following our implementations of various sequence mixers in ./hydra/modules/matrix_mixers/.
from .hydra import MatrixMixer
model = MatrixMixer(
# matrix_mixer_type: one of
#   {'dense', 'toeplitz', 'vandermonde', 'cauchy', 'low_rank', 'attention', 'quasiseparable'}
# is_data_dependent: boolean flag; if True, the mixer matrix is parameterized as a SAM
matrix_mixer_type=matrix_mixer_type,
is_data_dependent=is_data_dependent,
d_model=dim, # Model dimension d_model
qk_dim=qk_dim, # dimension for QK
).to("cuda")
y = model(x)
assert y.shape == x.shape
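As a rough illustration of the idea behind the framework, a sequence mixer can be viewed as multiplying the input by a structured L x L matrix along the sequence axis. The sketch below does this for a Toeplitz matrix on a single channel in plain Python. The helpers toeplitz_matrix and mix are hypothetical names used only for illustration; they do not reflect the repository's API or its efficient implementations.

```python
def toeplitz_matrix(weights, length):
    """Build an L x L Toeplitz matrix, constant along each diagonal.

    weights holds 2L - 1 values indexed by the offset i - j, from -(L - 1)
    to L - 1, so that M[i][j] = weights[i - j + L - 1].
    """
    assert len(weights) == 2 * length - 1
    return [
        [weights[i - j + length - 1] for j in range(length)]
        for i in range(length)
    ]


def mix(matrix, x):
    """Sequence mixing for one channel: y[i] = sum_j M[i][j] * x[j]."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in matrix]


M = toeplitz_matrix([1, 2, 3, 4, 5], length=3)
print(M)                   # prints [[3, 2, 1], [4, 3, 2], [5, 4, 3]]
print(mix(M, [1, 0, 0]))   # prints [3, 4, 5]
```

Swapping in a different structured matrix (for example, low rank or Cauchy) changes only how the matrix is built; the mixing step stays the same in this sketch.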
Our code for training BERT (./hydra/bert/) is based on MosaicBERT and M2.
Follow the instructions of MosaicBERT (./hydra/bert/README.md) for details (e.g., setting up dataset and running code).
The default configurations for Hydra and MatrixMixer are located at:
- Pretrain: ./hydra/bert/yamls/pretrain
- Finetune: ./hydra/bert/yamls/finetune
Pretrain Hydra on C4 using a single GPU:
python main.py yamls/pretrain/hydra.yaml
Pretrain Hydra on C4 using 8 GPUs:
composer -n 8 main.py yamls/pretrain/hydra.yaml
Finetune Hydra on GLUE:
python glue.py yamls/finetune/hydra.yaml
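If you launch these runs from a script, the commands above can be assembled programmatically. The following is a small sketch, assuming the commands are run from ./hydra/bert; build_command is a hypothetical helper, not part of this repository, and it only mirrors the python and composer -n invocations listed above.

```python
def build_command(config, n_gpus=1, script="main.py"):
    """Return the argument list for a single-GPU or multi-GPU run.

    Hypothetical helper: it reproduces the command lines shown above and
    does not validate that the script or config file exists.
    """
    if n_gpus > 1:
        return ["composer", "-n", str(n_gpus), script, config]
    return ["python", script, config]


print(" ".join(build_command("yamls/pretrain/hydra.yaml", n_gpus=8)))
# prints composer -n 8 main.py yamls/pretrain/hydra.yaml
```

The returned list can be passed to subprocess.run(cmd, check=True) to start the run.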
Weights of the 23-layer Hydra model pretrained on C4 are available on Hugging Face.
We thank the authors of Mamba, MosaicBERT, and M2 for their wonderful codebases.
If you use this codebase, or otherwise find our work valuable, please cite Hydra:
@article{hydra,
title={Hydra: Bidirectional State Space Models Through Generalized Matrix Mixers},
author={Hwang, Sukjun and Lahoti, Aakash and Dao, Tri and Gu, Albert},
journal={arXiv preprint arXiv:2407.09941},
year={2024}
}