Q-Sparse-LLM

Q-Sparse-LLM is an implementation of a sparse transformer architecture designed for efficient and high-performance language modeling. This project introduces sparsity and quantization techniques to the traditional transformer architecture, aiming to reduce computational costs and memory footprint while maintaining model performance.

Features

Top-K Sparsity: Implements a sparse activation mechanism that retains only the top K% of values in each layer.
Quantized Top-K Sparsity: Extends the sparsity mechanism with 8-bit quantization for further efficiency.
ReLU²GLU Activation: Uses a squared ReLU Gated Linear Unit for improved sparsity in feed-forward layers.

TODO:

Compatibility with 1-bit LLMs: Designed to be compatible with extremely quantized models like BitNet b1.58.

Architecture Overview

The Q-Sparse architecture is based on the Transformer architecture with modifications to enable sparsity in the activations:

Top-K Sparsity:
- Applies a mask to keep only the top K% of activations (by magnitude).
- Rescales the output by its L2 norm.
Quantized Top-K Sparsity:
- Quantizes the input to 8-bit representation before applying Top-K sparsity.
Squared ReLU (ReLU²GLU):
- Implements ReLU²GLU for feed-forward layers: ReLU²GLU(X) = X · W_up^T ⊙ ReLU²(X · W_gate^T)

Experiment: ReLU vs ReLU2GLU

ReLU

ReLU2GLU

Installation

git clone https://github.com/nanowell/Q-Sparse-LLM.git
cd Q-Sparse-LLM

Usage

Here's a basic example of how to use the Q-Sparse-LLM model:

from q_sparse import QSparseModel

# Initialize the model
model = QSparseModel(
    vocab_size=30000,
    d_model=768,
    nhead=12,
    num_layers=12,
    dim_feedforward=3072,
    k_ratio=0.5,
    quantized=True
)

# Use the model for inference or training
# (Add specific usage instructions based on your implementation)

Contributing

Contributions to Q-Sparse-LLM are welcome!

License

This project is licensed under the MIT License.

Citation

If you use Q-Sparse-LLM in your research, please cite:

@software{Q-Sparse-LLM,
  author = {nanowell},
  title = {Q-Sparse-LLM: Quantized Sparse Language Model},
  year = {2024},
  url = {https://github.com/nanowell/Q-Sparse-LLM}
}

Acknowledgements

This project builds upon the work Q-Sparse paper:

@misc{wang2024qsparselargelanguagemodels,
      title={Q-Sparse: All Large Language Models can be Fully Sparsely-Activated}, 
      author={Hongyu Wang and Shuming Ma and Ruiping Wang and Furu Wei},
      year={2024},
      eprint={2407.10969},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2407.10969}, 
}

Contact

For questions and feedback, please open an issue in the GitHub repository or contact [[email protected]].

Name		Name	Last commit message	Last commit date
Latest commit History 15 Commits
LICENSE		LICENSE
Q_Sparse.py		Q_Sparse.py
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Q-Sparse-LLM

Features

TODO:

Architecture Overview

Experiment: ReLU vs ReLU2GLU

ReLU

ReLU2GLU

Installation

Usage

Contributing

License

Citation

Acknowledgements

Contact

About

Releases

Packages

Languages

License

nanowell/Q-Sparse-LLM

Folders and files

Latest commit

History

Repository files navigation

Q-Sparse-LLM

Features

TODO:

Architecture Overview

Experiment: ReLU vs ReLU2GLU

ReLU

ReLU2GLU

Installation

Usage

Contributing

License

Citation

Acknowledgements

Contact

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages