Releases: casper-hansen/AutoAWQ
v0.1.8
What's Changed
- Fix MPT by @casper-hansen in #206
- Add config to Base model by @casper-hansen in #207
- Add Qwen model by @Sanster in #182
- Robust quantization for Catcher by @casper-hansen in #209
- New scaling to improve perplexity by @casper-hansen in #216
- Benchmark hf generate by @casper-hansen in #237
- Fix position ids by @casper-hansen in #215
- Pass `model_init_kwargs` to `check_and_get_model_type` function by @rycont in #232
- Fixed an issue where the Qwen model had too much error after quantization by @jundolc in #243
- Load on CPU to avoid OOM by @casper-hansen in #236
- Update README.md by @casper-hansen in #245
- [`core`] Make AutoAWQ fused modules compatible with HF transformers by @younesbelkada in #244
- [`core`] Fix quantization issues with transformers==4.36.0 by @younesbelkada in #249
- FEAT: Add possibility of skipping modules when quantizing by @younesbelkada in #248
- Fix quantization issue with transformers >= 4.36.0 by @younesbelkada in #264
- Mixtral: Mixture of Experts quantization by @casper-hansen in #251
- Fused rope theta by @casper-hansen in #270
- FEAT: add llava to autoawq by @younesbelkada in #250
- Add Baichuan2 Support by @AoyuQC in #247
- Set default rope_theta on LlamaLikeBlock by @casper-hansen in #271
- Update news and models supported by @casper-hansen in #272
- Add vLLM async example by @casper-hansen in #273
- Bump to v0.1.8 by @casper-hansen in #274
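Several of the changes above (notably #248 on skipping modules, and #251 on Mixtral) surface through the quantization config dictionary passed to AutoAWQ. A minimal sketch of such a config; the exact key names, especially `modules_to_not_convert`, are an assumption based on later AutoAWQ versions and may differ in this release:

```python
# Typical AutoAWQ quantization config (key names assumed, not verified
# against this exact release).
quant_config = {
    "zero_point": True,    # asymmetric quantization with zero points
    "q_group_size": 128,   # weights quantized in groups of 128
    "w_bit": 4,            # 4-bit weights
    "version": "GEMM",     # GEMM or GEMV kernel variant
    # Assumed key from #248: leave matching modules unquantized in FP16,
    # e.g. Mixtral's expert routing layers.
    "modules_to_not_convert": ["gate"],
}
```

The config is then handed to `model.quantize(tokenizer, quant_config=quant_config)`.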
New Contributors
- @Sanster made their first contribution in #182
- @rycont made their first contribution in #232
- @jundolc made their first contribution in #243
- @AoyuQC made their first contribution in #247
Full Changelog: v0.1.7...v0.1.8
v0.1.7
What's Changed
- Build older cuda wheels by @casper-hansen in #158
- Exclude download of CUDA wheels by @casper-hansen in #159
- New benchmarks in README by @casper-hansen in #160
- Fix typo in benchmark command by @casper-hansen in #161
- Yi support by @casper-hansen in #167
- Make sure to delete dummy model by @casper-hansen in #180
- Fix CUDA error: invalid argument by @casper-hansen in #179
- New logic for passing past_key_value by @younesbelkada in #177
- Reset cache on new generation by @casper-hansen in #178
- Adaptive batch sizing by @casper-hansen in #181
- Pass arguments to AutoConfig by @s4rduk4r in #97
- Fix cache util logic by @casper-hansen in #186
- Fix multi-GPU loading and inference by @casper-hansen in #190
- [`core`] Replace `QuantLlamaMLP` with `QuantFusedMLP` by @younesbelkada in #188
- [`core`] Add `is_hf_transformers` flag by @younesbelkada in #195
- Fixed multi-GPU quantization by @casper-hansen in #196
Full Changelog: v0.1.6...v0.1.7
v0.1.6
What's Changed
- Pseudo dequantize function by @casper-hansen in #127
- CUDA 11.8.0 and 12.1.1 build by @casper-hansen in #128
- AwqConfig class by @casper-hansen in #132
- Fix init quant by @casper-hansen in #136
- Update readme by @casper-hansen in #137
- Benchmark info by @casper-hansen in #138
- Bump to v0.1.6 by @casper-hansen in #139
- CUDA 12 release by @casper-hansen in #140
- Revert to previous version by @casper-hansen in #141
- Fix performance regression by @casper-hansen in #148
- [`core`/`attention`] Fix fused attention generation with newest transformers version by @younesbelkada in #146
- Fix condition when rolling cache by @casper-hansen in #150
- Default to safetensors for quantized models by @casper-hansen in #151
- Create fused LlamaLikeModel by @casper-hansen in #152
Full Changelog: v0.1.5...v0.1.6
v0.1.5
What's Changed
- Only apply attention mask if seqlen is greater than 1 by @casper-hansen in #96
- add gpt_neox support by @twaka in #113
- [`core`] Support fp32 / bf16 inference by @younesbelkada in #121
- Fix potential overflow by @casper-hansen in #102
- Fixing starcoder based models with 15B by @SebastianBodza in #118
- Support Aquila models. by @ftgreat in #123
- Add benchmark of Aquila2 34B AWQ in README.md. by @ftgreat in #126
New Contributors
- @twaka made their first contribution in #113
- @younesbelkada made their first contribution in #121
- @SebastianBodza made their first contribution in #118
- @ftgreat made their first contribution in #123
Full Changelog: v0.1.4...v0.1.5
v0.1.4
What's Changed
- Refactor cache and embedding modules by @casper-hansen in #95
- Fix `TypeError: 'NoneType' object is not subscriptable`
Full Changelog: v0.1.3...v0.1.4
v0.1.3
What's Changed
- Turing inference support (Colab+Kaggle working) by @casper-hansen in #92
- Fix memory bug (save 2GB VRAM)
Full Changelog: v0.1.2...v0.1.3
v0.1.2
What's Changed
- Fix unexpected keyword by @casper-hansen in #88
- Fix Falcon n_kv_heads parameter by @casper-hansen in #89
- Mistral fused modules by @casper-hansen in #90
Full Changelog: v0.1.1...v0.1.2
v0.1.1
What's Changed
- Add GPT BigCode support (StarCoder) by @casper-hansen in #61
- Use typing classes over base types by @VikParuchuri in #69
- Fix KV cache shapes error by @casper-hansen in #75
- Mistral support by @casper-hansen in #79
- Add low_cpu_mem_usage=True in example by @casper-hansen in #80
- Offloading to cpu and disk by @s4rduk4r in #77
- Faster build, fix "no space left". by @casper-hansen in #84
New Contributors
- @VikParuchuri made their first contribution in #69
- @s4rduk4r made their first contribution in #77
Full Changelog: v0.1.0...v0.1.1
v0.1.0
What's Changed
- Support Falcon 180B by @casper-hansen in #35
- [NEW] GEMV kernel implementation by @casper-hansen in #40
- Allow user to use custom calibration data for quantization by @boehm-e in #27
- Safetensors and model sharding by @casper-hansen in #47
- 2x faster context processing with GEMV by @casper-hansen in #58
- Support kv_heads by @casper-hansen in #60
- Refactor quantization code by @casper-hansen in #62
- support windows by @qwopqwop200 in #53
- Improve model loading by @casper-hansen in #66
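Among the changes above, #27 lets quantization run against user-supplied calibration text instead of the default dataset. A hedged sketch of what that looks like; the `calib_data` keyword on `quantize()` is an assumption based on later AutoAWQ versions:

```python
# Custom calibration samples: a plain list of strings drawn from the
# target domain usually calibrates scales better than generic web text.
calib_data = [
    "def fibonacci(n):\n    return n if n < 2 else fibonacci(n - 1) + fibonacci(n - 2)",
    "SELECT name, total FROM orders WHERE total > 100 ORDER BY total DESC;",
]

# Passed at quantization time (assumed call shape, not verified here):
# model.quantize(tokenizer, quant_config=quant_config, calib_data=calib_data)
```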
New Contributors
- @boehm-e made their first contribution in #27
Full Changelog: v0.0.2...v0.1.0
v0.0.2
What's Changed
- Refactor fused modules by @casper-hansen in #18
- fuse_layers bug fix by @qwopqwop200 in #21
- support speedtest to benchmark FP16 model by @wanzhenchn in #25
- Implement batch size for speed test by @casper-hansen in #26
- [BUG] Fix illegal memory access + Quantized Multi-GPU support by @casper-hansen in #28
- YaRN support for LLaMa models by @casper-hansen in #23
New Contributors
- @wanzhenchn made their first contribution in #25
Full Changelog: v0.0.1...v0.0.2