Releases · vllm-project/llm-compressor
v0.3.0
What's New in v0.3.0
Key Features and Improvements
- GPTQ Quantized-weight Sequential Updating (#177): Introduced an efficient sequential updating mechanism for GPTQ quantization, improving model compression performance and compatibility.
- Auto-Infer Mappings for SmoothQuantModifier (#119): Automatically infers `mappings` based on the model architecture, making SmoothQuant easier to apply across various models (see the recipe sketch below).
- Improved Sparse Compression Usability (#191): Added support for targeted sparse compression with specific ignore rules during inference, allowing for more flexible model configurations.
- Generic Wrapper for Any Hugging Face Model (#185): Added a `wrap_hf_model_class` utility, enabling better support and integration for Hugging Face models that are not based on `AutoModelForCausalLM` (see the sketch below).
- Observer Restructure (#837): Introduced calibration and frozen steps within `QuantizationModifier`, moving Observers from compressed-tensors to llm-compressor.
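
Taken together, the first two features shrink a typical quantization recipe considerably. A minimal sketch, assuming the standard `oneshot` entrypoint with an illustrative checkpoint and calibration dataset (neither is prescribed by this release):

```python
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.transformers import oneshot

# SmoothQuantModifier needs no explicit mappings; they are inferred from
# the model architecture (#119). GPTQModifier applies quantized-weight
# sequential updating, compressing the model layer by layer (#177).
recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
]

oneshot(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # illustrative checkpoint
    dataset="open_platypus",                     # illustrative calibration set
    recipe=recipe,
    output_dir="TinyLlama-1.1B-Chat-v1.0-W8A8",
    max_seq_length=2048,
    num_calibration_samples=512,
)
```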
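The `wrap_hf_model_class` utility targets models that `AutoModelForCausalLM` cannot load, such as vision-language models. A minimal sketch, assuming a Mllama-style VLM (the specific class and checkpoint are illustrative, not part of this release):

```python
from transformers import MllamaForConditionalGeneration

from llmcompressor.transformers import wrap_hf_model_class

# Wrapping adds llm-compressor integration to an arbitrary
# Hugging Face model class (#185).
wrapped_class = wrap_hf_model_class(MllamaForConditionalGeneration)
model = wrapped_class.from_pretrained(
    "meta-llama/Llama-3.2-11B-Vision-Instruct",  # illustrative checkpoint
    device_map="auto",
    torch_dtype="auto",
)
```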
Bug Fixes
- Fix Tied Tensors Bug (#659)
- Observer Initialization in GPTQ Wrapper (#883)
- Sparsity Reload Testing (#882)
Documentation
- Updated SmoothQuant Tutorial (#115): Expanded SmoothQuant documentation to include detailed mappings for easier implementation.
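
For cases where the auto-inferred defaults are not appropriate, a mapping pairs the layers to be smoothed with the preceding layer whose output feeds them. A minimal sketch using regex-style targets; the pairs shown mirror the common Llama-style defaults and are included here only for illustration:

```python
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier

# Each entry maps [layers to balance] to the layer producing their input.
mappings = [
    [["re:.*q_proj", "re:.*k_proj", "re:.*v_proj"], "re:.*input_layernorm"],
    [["re:.*gate_proj", "re:.*up_proj"], "re:.*post_attention_layernorm"],
]

modifier = SmoothQuantModifier(smoothing_strength=0.8, mappings=mappings)
```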
What's Changed
- Fix compresed typo by @kylesayrs in #188
- GPTQ Quantized-weight Sequential Updating by @kylesayrs in #177
- Add: targets and ignore inference for sparse compression by @rahul-tuli in #191
- switch tests from weekly to nightly by @dhuangnm in #658
- Compression wrapper abstract methods by @kylesayrs in #170
- Explicitly set sequential_update in examples by @kylesayrs in #187
- Increase Sparsity Threshold for compressors by @rahul-tuli in #679
- Add a generic `wrap_hf_model_class` utility to support VLMs by @mgoin in #185
- Add tests for examples by @dbarbuzzi in #149
- Rename to quantization config by @kylesayrs in #730
- Implement Missing Modifier Methods by @kylesayrs in #166
- Fix 2/4 GPTQ Model Tests by @dsikka in #769
- SmoothQuant mappings tutorial by @rahul-tuli in #115
- Fix import of `ModelCompressor` by @rahul-tuli in #776
- update test by @dsikka in #773
- [Bugfix] Fix saving offloaded state dict by @kylesayrs in #172
- Auto-Infer `mappings` Argument for `SmoothQuantModifier` Based on Model Architecture by @rahul-tuli in #119
- Update workflows/actions by @dbarbuzzi in #774
- [Bugfix] Prepare KD Models when Saving by @kylesayrs in #174
- Set Sparse compression to save_compressed by @rahul-tuli in #821
- Install compressed-tensors after llm-compressor by @dbarbuzzi in #825
- Fix test typo by @kylesayrs in #828
- Add `AutoModelForCausalLM` example by @dsikka in #698
- [Bugfix] Workaround tied tensors bug by @kylesayrs in #659
- Only untie word embeddings by @kylesayrs in #839
- Check for config hidden size by @kylesayrs in #840
- Use float32 for Hessian dtype by @kylesayrs in #847
- GPTQ: Deprecate non-sequential update option by @kylesayrs in #762
- Typehint nits by @kylesayrs in #826
- [ DOC ] Remove version restrictions in W8A8 example by @miaojinc in #849
- Fix inconsistency in example config of 2:4 sparse quantization by @yzlnew in #80
- Fix forward function pass call by @dsikka in #845
- [Bugfix] Use weight parameter of linear layer by @kylesayrs in #836
- [Bugfix] Rename files to remove colons by @kylesayrs in #846
- cover all 3.9-3.12 in commit testing by @dhuangnm in #864
- Add marlin-24 recipe/configs for e2e testing by @dsikka in #866
- [Bugfix] onload during sparsity calculation by @kylesayrs in #862
- Fix HFTrainer overloads by @kylesayrs in #869
- Support Model Offloading Tied Tensors Patch by @kylesayrs in #872
- Add advice about dealing with non-invertible Hessians by @kylesayrs in #875
- seed commit workflow by @andy-neuma in #877
- [Observer Restructure]: Add Observers; Add `calibration` and `frozen` steps to `QuantizationModifier` by @dsikka in #837
- Bugfix observer initialization in `gptq_wrapper` by @rahul-tuli in #883
- BugFix: Fix Sparsity Reload Testing by @dsikka in #882
- Use custom unique test names for e2e tests by @dbarbuzzi in #892
- Revert "Use custom unique test names for e2e tests (#892)" by @dsikka in #893
- Move config["testconfig_path"] assignment by @dbarbuzzi in #895
- Cap accelerate version to avoid bug by @kylesayrs in #897
- Fix observing offloaded weight by @kylesayrs in #896
- Update image in README.md by @mgoin in #861
- update accelerate version by @kylesayrs in #899
- [GPTQ] Iterative Parameter Updating by @kylesayrs in #863
- Small fixes for release by @dsikka in #901
- use smaller portion of dataset by @dsikka in #902
- Update example to not fail hessian inversion by @dsikka in #904
- Bump version to 0.3.0 by @dsikka in #907
New Contributors
- @miaojinc made their first contribution in #849
- @yzlnew made their first contribution in #80
- @andy-neuma made their first contribution in #877
Full Changelog: 0.2.0...0.3.0
v0.2.0
What's Changed
- Correct Typo in SparseAutoModelForCausalLM docstring by @kylesayrs in #56
- Disable Default Bitmask Compression by @Satrat in #60
- TRL Example fix by @rahul-tuli in #59
- Fix typo by @rahul-tuli in #63
- Correct typo by @kylesayrs in #61
- correct import in README.md by @zzc0430 in #66
- Fix for issue #43 -- starcoder model by @horheynm in #71
- Update README.md by @robertgshaw2-neuralmagic in #74
- Layer by Layer Sequential GPTQ Updates by @Satrat in #47
- [ Docs ] Update main readme by @robertgshaw2-neuralmagic in #77
- [ Docs ] `gemma2` examples by @robertgshaw2-neuralmagic in #78
- [ Docs ] Update `FP8` example to use dynamic per token by @robertgshaw2-neuralmagic in #75
- [ Docs ] Overhaul `accelerate` user guide by @robertgshaw2-neuralmagic in #76
- Support `kv_cache_scheme` for quantizing KV Cache by @mgoin in #88
- Propagate `trust_remote_code` Argument by @kylesayrs in #90
- Fix for issue #81 by @horheynm in #84
- Fix for issue 83 by @horheynm in #85
- [ DOC ] Big Model Example by @robertgshaw2-neuralmagic in #99
- Enable obcq/finetune integration tests with `commit` cadence by @dsikka in #101
- metric logging on GPTQ path by @horheynm in #65
- Update test config files by @dsikka in #97
- remove workflows + update runners by @dsikka in #103
- metrics by @horheynm in #104
- add debug by @horheynm in #108
- Add FP8 KV Cache quant example by @mgoin in #113
- Add vLLM e2e tests by @dsikka in #117
- Fix style, fix noqa by @kylesayrs in #123
- GPTQ Algorithm Cleanup by @kylesayrs in #120
- GPTQ Activation Ordering by @kylesayrs in #94
- demote recipe string initialization to debug and make more descriptive by @kylesayrs in #116
- compressed-tensors main dependency for base-tests by @kylesayrs in #125
- Set `ready` label for transformer tests; add message reminder on PR opened by @dsikka in #126
- Fix markdown check test by @dsikka in #127
- Naive Run Compressed Pt. 2 by @Satrat in #62
- Fix transformer test conditions by @dsikka in #131
- Run Compressed Tests by @Satrat in #132
- Correct typo by @kylesayrs in #124
- Activation Ordering Strategies by @kylesayrs in #121
- Fix README Issue by @robertgshaw2-neuralmagic in #139
- update by @dsikka in #143
- Update finetune and oneshot tests by @dsikka in #114
- Validate Recipe Parsing Output by @kylesayrs in #100
- fix build error for nightly by @dhuangnm in #145
- Fix recipe nested in configs by @kylesayrs in #140
- MOE example with warning by @rahul-tuli in #87
- Bug Fix: recipe stages were not being concatenated by @rahul-tuli in #150
- fix package name bug for nightly by @dhuangnm in #155
- Add descriptions for pytest marks by @kylesayrs in #156
- Fix Sparsity Unit Test by @Satrat in #153
- Fix: Error during model saving with shared tensors by @rahul-tuli in #158
- Update 2:4 Examples by @dsikka in #161
- DeepSeek: Fix Hessian Estimation by @Satrat in #157
- bump up main to 0.2.0 by @dhuangnm in #163
- Fix help dialogue by @kylesayrs in #151
- Add MoE and Compressed Inference Examples by @Satrat in #160
- Separate `trust_remote_code` args by @kylesayrs in #152
- Enable a skipped finetune test by @dsikka in #169
- Fix filename in example command by @dbarbuzzi in #173
- Add DeepSeek V2.5 Example by @dsikka in #171
- fix quality by @dsikka in #176
- Patch log function name in gptq by @kylesayrs in #168
- README for Modifiers by @Satrat in #165
- Fix default for sequential updates by @dsikka in #186
- fix default test case by @dsikka in #193
- Fix Initalize typo by @Imss27 in #190
- Update MoE examples by @mgoin in #192
New Contributors
- @zzc0430 made their first contribution in #66
- @horheynm made their first contribution in #71
- @dsikka made their first contribution in #101
- @dhuangnm made their first contribution in #145
- @Imss27 made their first contribution in #190
Full Changelog: 0.1.0...0.2.0
v0.1.0
What's Changed
- Address Test Failures by @Satrat in #1
- Remove SparseZoo Usage by @Satrat in #2
- SparseML Cleanup by @markurtz in #6
- Remove all references to Neural Magic copyright within LLM Compressor by @markurtz in #7
- Add FP8 Support by @Satrat in #4
- Fix Weekly Test Failure by @Satrat in #8
- Add Scheme UX for QuantizationModifier by @Satrat in #9
- Add Group Quantization Test Case by @Satrat in #10
- Loguru logging standardization for LLM Compressor by @markurtz in #11
- Clarify Function Names for Logging by @Satrat in #12
- [ Examples ] E2E Examples by @robertgshaw2-neuralmagic in #5
- Update setup.py by @robertgshaw2-neuralmagic in #15
- SmoothQuant Mapping Defaults by @Satrat in #13
- Initial README by @bfineran in #3
- [Bug] Fix validation errors for smoothquant modifier + update examples by @rahul-tuli in #19
- [MOE Quantization] Warn against "undercalibrated" modules by @dbogunowicz in #20
- Port SparseML Remote Code Fix by @Satrat in #21
- Update Quantization Save Defaults by @Satrat in #22
- [Bugfix] Add fix to preserve modifier order when passed as a list by @rahul-tuli in #26
- GPTQ - move calibration of quantization params to after Hessian calibration by @bfineran in #25
- Fix typos by @eldarkurtic in #31
- Remove ceiling from `datasets` dep by @mgoin in #27
- Revert naive compression format by @Satrat in #32
- Fix layerwise targets by @Satrat in #36
- Move Weight Update Out Of Loop by @Satrat in #40
- Fix End Epoch Default by @Satrat in #39
- Fix typos in example for w8a8 quant by @eldarkurtic in #38
- Model Offloading Support Pt 2 by @Satrat in #34
- set version to 1.0.0 for release by @bfineran in #44
- Update version for first release by @markurtz in #50
- BugFix: Update TRL example scripts to point to the right SFTTrainer by @rahul-tuli in #51
- Update examples/quantization_24_sparse_w4a16 README by @dbarbuzzi in #52
- Fix Failing Transformers Tests by @Satrat in #53
- Offloading Bug Fix by @Satrat in #58
New Contributors
- @markurtz made their first contribution in #6
- @bfineran made their first contribution in #3
- @dbogunowicz made their first contribution in #20
- @eldarkurtic made their first contribution in #31
- @mgoin made their first contribution in #27
- @dbarbuzzi made their first contribution in #52
Full Changelog: https://github.com/vllm-project/llm-compressor/commits/0.1.0