sync with 0.6.4.post1 #235

Merged: 418 commits, merged Nov 27, 2024. The diff below shows the changes from 1 commit of the 418.

Commits (418)
64384bb
[torch.compile] upgrade tests (#9858)
youkaichao Oct 30, 2024
abbfb61
[Misc][OpenAI] deprecate max_tokens in favor of new max_completion_to…
gcalmettes Oct 31, 2024
890ca36
Revert "[Bugfix] Use host argument to bind to interface (#9798)" (#9852)
khluu Oct 31, 2024
d087bf8
[Model] Support quantization of Qwen2VisionTransformer (#9817)
mgoin Oct 31, 2024
3ea2dc2
[Misc] Remove deprecated arg for cuda graph capture (#9864)
ywang96 Oct 31, 2024
5608e61
[Doc] Update Qwen documentation (#9869)
jeejeelee Oct 31, 2024
16b8f7a
[CI/Build] Add Model Tests for Qwen2-VL (#9846)
alex-jw-brooks Oct 31, 2024
77f7ef2
[CI/Build] Adding a forced docker system prune to clean up space (#9849)
Alexei-V-Ivanov-AMD Oct 31, 2024
55650c8
[Bugfix] Fix `illegal memory access` error with chunked prefill, pref…
sasha0552 Oct 31, 2024
9fb12f7
[BugFix][Kernel] Fix Illegal memory access in causal_conv1d in H100 (…
mzusman Oct 31, 2024
b63c64d
[ci/build] Configure dependabot to update pip dependencies (#9811)
khluu Oct 31, 2024
031a799
[Bugfix][Frontend] Reject guided decoding in multistep mode (#9892)
joerunde Nov 1, 2024
96e0c9c
[torch.compile] directly register custom op (#9896)
youkaichao Nov 1, 2024
37a4947
[Bugfix] Fix layer skip logic with bitsandbytes (#9887)
mgoin Nov 1, 2024
566cd27
[torch.compile] rework test plans (#9866)
youkaichao Nov 1, 2024
93a76dd
[Model] Support bitsandbytes for MiniCPMV (#9891)
mgoin Nov 1, 2024
2b5bf20
[torch.compile] Adding torch compile annotations to some models (#9876)
CRZbulabula Nov 1, 2024
d3aa2a8
[Doc] Update multi-input support (#9906)
DarkLight1337 Nov 1, 2024
06386a6
[Frontend] Chat-based Embeddings API (#9759)
DarkLight1337 Nov 1, 2024
30a2e80
[CI/Build] Add Model Tests for PixtralHF (#9813)
mgoin Nov 1, 2024
ba0d892
[Frontend] Use a proper chat template for VLM2Vec (#9912)
DarkLight1337 Nov 1, 2024
1dd4cb2
[Bugfix] Fix edge cases for MistralTokenizer (#9625)
tjohnson31415 Nov 1, 2024
4581d2c
[Core] Refactor: Clean up unused argument in Scheduler._preempt (#9696)
andrejonasson Nov 1, 2024
aff1fd8
[torch.compile] use interpreter with stable api from pytorch (#9889)
youkaichao Nov 1, 2024
598b6d7
[Bugfix/Core] Flashinfer k_scale and v_scale (#9861)
pavanimajety Nov 1, 2024
18bd758
[1/N] pass the complete config from engine to executor (#9933)
youkaichao Nov 1, 2024
27cd36e
[Bugfix] PicklingError on RayTaskError (#9934)
GeneDer Nov 1, 2024
d151fde
[ci/build] Bump the patch-update group with 10 updates (#9897)
dependabot[bot] Nov 1, 2024
6c0b7f5
[Core][VLM] Add precise multi-modal placeholder tracking (#8346)
petersalas Nov 1, 2024
d522034
[ci/build] Have dependabot ignore pinned dependencies (#9935)
khluu Nov 1, 2024
a78dd33
[Encoder Decoder] Add flash_attn kernel support for encoder-decoder m…
sroy745 Nov 2, 2024
af7380d
[torch.compile] fix cpu broken code (#9947)
youkaichao Nov 2, 2024
eed92f1
[Docs] Update Granite 3.0 models in supported models table (#9930)
njhill Nov 2, 2024
1d4cfe2
[Doc] Updated tpu-installation.rst with more details (#9926)
mikegre-google Nov 2, 2024
e893795
[2/N] executor pass the complete config to worker/modelrunner (#9938)
youkaichao Nov 2, 2024
d6459b4
[V1] Fix `EngineArgs` refactor on V1 (#9954)
robertgshaw2-neuralmagic Nov 2, 2024
74b529c
[bugfix] fix chatglm dummy_data_for_glmv (#9955)
youkaichao Nov 2, 2024
cea808f
[3/N] model runner pass the whole config to model (#9958)
youkaichao Nov 2, 2024
1b73ab2
[CI/Build] Quoting around > (#9956)
nokados Nov 2, 2024
ae5279a
[torch.compile] Adding torch compile to vision-language models (#9946)
CRZbulabula Nov 2, 2024
3bb4bef
[bugfix] fix tsts (#9959)
youkaichao Nov 2, 2024
1f1b6d6
[V1] Support per-request seed (#9945)
njhill Nov 3, 2024
5459772
[Model] Add support for H2OVL-Mississippi models (#9747)
cooleel Nov 4, 2024
91c9ebb
[V1] Fix Configs (#9971)
robertgshaw2-neuralmagic Nov 4, 2024
c49f040
[Bugfix] Fix MiniCPMV and Mllama BNB bug (#9917)
jeejeelee Nov 4, 2024
b67feb1
[Bugfix]Using the correct type hints (#9885)
gshtras Nov 4, 2024
4dbcbbe
[Misc] Compute query_start_loc/seq_start_loc on CPU (#9447)
zhengy001 Nov 4, 2024
ea4aded
[Bugfix] Fix E2EL mean and median stats (#9984)
daitran2k1 Nov 4, 2024
ccb5376
[Bugfix][OpenVINO] Fix circular reference #9939 (#9974)
MengqingCao Nov 4, 2024
ac6b8f1
[Frontend] Multi-Modality Support for Loading Local Image Files (#9915)
chaunceyjiang Nov 4, 2024
8d72bb2
[4/N] make quant config first-class citizen (#9978)
youkaichao Nov 4, 2024
fb2716d
[Misc]Reduce BNB static variable (#9987)
jeejeelee Nov 4, 2024
603a661
[Model] factoring out MambaMixer out of Jamba (#8993)
mzusman Nov 4, 2024
1c45f4c
[CI] Basic Integration Test For TPU (#9968)
robertgshaw2-neuralmagic Nov 4, 2024
5208dc7
[Bugfix][CI/Build][Hardware][AMD] Shard ID parameters in AMD tests ru…
hissu-hyvarinen Nov 4, 2024
6e056bc
[Doc] Update VLM doc about loading from local files (#9999)
ywang96 Nov 4, 2024
04cef2c
[Bugfix] Fix `MQLLMEngine` hanging (#9973)
robertgshaw2-neuralmagic Nov 4, 2024
9a5664d
[Misc] Refactor benchmark_throughput.py (#9779)
lk-chen Nov 4, 2024
ac04a97
[Frontend] Add max_tokens prometheus metric (#9881)
tomeras91 Nov 4, 2024
d93478b
[Bugfix] Upgrade to pytorch 2.5.1 (#10001)
bnellnm Nov 4, 2024
2094062
[4.5/N] bugfix for quant config in speculative decode (#10007)
youkaichao Nov 4, 2024
8f0a9ca
[Bugfix] Respect modules_to_not_convert within awq_marlin (#9895)
mgoin Nov 4, 2024
04bbf38
[Core] Use os.sched_yield in ShmRingBuffer instead of time.sleep (#9994)
tlrmchlsmth Nov 5, 2024
bbc3619
[Core] Make encoder-decoder inputs a nested structure to be more comp…
DarkLight1337 Nov 5, 2024
ad23318
[Bugfix] Fixup Mamba (#10004)
tlrmchlsmth Nov 5, 2024
7a83b1a
[BugFix] Lazy import ray (#10021)
GeneDer Nov 5, 2024
93dee88
[Misc] vllm CLI flags should be ordered for better user readability (…
chaunceyjiang Nov 5, 2024
5952d81
[Frontend] Fix tcp port reservation for api server (#10012)
russellb Nov 5, 2024
cd34029
Refactor TPU requirements file and pin build dependencies (#10010)
richardsliu Nov 5, 2024
09d3550
[Misc] Add logging for CUDA memory (#10027)
yangalan123 Nov 5, 2024
731aec5
[CI/Build] Limit github CI jobs based on files changed (#9928)
russellb Nov 5, 2024
a53046b
[Model] Support quantization of PixtralHFTransformer for PixtralHF (#…
mgoin Nov 5, 2024
d2e8033
[Feature] Update benchmark_throughput.py to support image input (#9851)
lk-chen Nov 5, 2024
b9c64c0
[Misc] Modify BNB parameter name (#9997)
jeejeelee Nov 5, 2024
0246246
[CI] Prune tests/models/decoder_only/language/* tests (#9940)
mgoin Nov 5, 2024
235366f
[CI] Prune back the number of tests in tests/kernels/* (#9932)
mgoin Nov 5, 2024
ca9844b
[bugfix] fix weak ref in piecewise cudagraph and tractable test (#10048)
youkaichao Nov 5, 2024
43300bd
[Bugfix] Properly propagate trust_remote_code settings (#10047)
zifeitong Nov 6, 2024
966e316
[Bugfix] Fix pickle of input when async output processing is on (#9931)
wallashss Nov 6, 2024
0c63c34
[Bugfix][SpecDecode] kv corruption with bonus tokens in spec decode (…
llsj14 Nov 6, 2024
c4cacba
[v1] reduce graph capture time for piecewise cudagraph (#10059)
youkaichao Nov 6, 2024
82bfc38
[Misc] Sort the list of embedding models (#10037)
DarkLight1337 Nov 6, 2024
ffc0f2b
[Model][OpenVINO] Fix regressions from #8346 (#10045)
petersalas Nov 6, 2024
2bcbae7
[Bugfix] Fix edge-case crash when using chat with the Mistral Tekken …
tjohnson31415 Nov 6, 2024
ea928f6
[Bugfix] Gpt-j-6B patch kv_scale to k_scale path (#10063)
arakowsk-amd Nov 6, 2024
9d59b75
[Bugfix] Remove CustomChatCompletionContentPartParam multimodal input…
zifeitong Nov 6, 2024
4089985
[V1] Integrate Piecewise CUDA graphs (#10058)
WoosukKwon Nov 6, 2024
4be3a45
[distributed] add function to create ipc buffers directly (#10064)
youkaichao Nov 6, 2024
21063c1
[CI/Build] drop support for Python 3.8 EOL (#8464)
aarnphm Nov 6, 2024
a5fda50
[CI/Build] Fix large_gpu_mark reason (#10070)
Isotr0py Nov 6, 2024
a02a50e
[Hardware][Intel-Gaudi] Add Intel Gaudi (HPU) inference backend (#6143)
kzawora-intel Nov 6, 2024
6a585a2
[Hotfix] Fix ruff errors (#10073)
WoosukKwon Nov 6, 2024
2003cc3
[Model][LoRA]LoRA support added for LlamaEmbeddingModel (#10071)
jeejeelee Nov 6, 2024
a5bba7d
[Model] Add Idefics3 support (#9767)
jeejeelee Nov 6, 2024
406d4cc
[Model][LoRA]LoRA support added for Qwen2VLForConditionalGeneration (…
ericperfect Nov 6, 2024
399c798
Remove ScaledActivation for AWQ (#10057)
mgoin Nov 6, 2024
098f94d
[CI/Build] Drop Python 3.8 support (#10038)
russellb Nov 6, 2024
87bd7e0
[CI/Build] change conflict PR comment from mergify (#10080)
russellb Nov 6, 2024
d58268c
[V1] Make v1 more testable (#9888)
joerunde Nov 6, 2024
74f2f8a
[CI/Build] Always run the ruff workflow (#10092)
russellb Nov 6, 2024
719c1ca
[core][distributed] add stateless_init_process_group (#10072)
youkaichao Nov 7, 2024
4ab3256
[Bugfix] Fix FP8 torch._scaled_mm fallback for torch>2.5 with CUDA<12…
mgoin Nov 7, 2024
d3859f1
[Misc][XPU] Upgrade to Pytorch 2.5 for xpu backend (#9823)
yma11 Nov 7, 2024
29862b8
[Frontend] Adjust try/except blocks in API impl (#10056)
njhill Nov 7, 2024
a4b3e0c
[Hardware][CPU] Update torch 2.5 (#9911)
bigPYJ1151 Nov 7, 2024
e7b84c3
[doc] add back Python 3.8 ABI (#10100)
youkaichao Nov 7, 2024
1fa020c
[V1][BugFix] Fix Generator construction in greedy + seed case (#10097)
njhill Nov 7, 2024
db7db4a
[Misc] Consolidate ModelConfig code related to HF config (#10104)
DarkLight1337 Nov 7, 2024
104d729
[CI/Build] re-add codespell to CI (#10083)
russellb Nov 7, 2024
d7263a1
Doc: Improve benchmark documentation (#9927)
rafvasq Nov 7, 2024
6192e9b
[Core][Distributed] Refactor ipc buffer init in CustomAllreduce (#10030)
hanzhi713 Nov 7, 2024
e036e52
[CI/Build] Improve mypy + python version matrix (#10041)
russellb Nov 7, 2024
aa9078f
Adds method to read the pooling types from model's files (#9506)
flaviabeo Nov 7, 2024
0dfba97
[Frontend] Fix multiple values for keyword argument error (#10075) (#…
DIYer22 Nov 7, 2024
a6f332d
[Hardware][CPU][bugfix] Fix half dtype support on AVX2-only target (#…
bigPYJ1151 Nov 7, 2024
999df95
[Bugfix] Make image processor respect `mm_processor_kwargs` for Qwen2…
li-plus Nov 7, 2024
a62bc01
[Misc] Add Gamma-Distribution Request Generation Support for Serving …
spliii Nov 7, 2024
ae62fd1
[Frontend] Tool calling parser for Granite 3.0 models (#9027)
maxdebayser Nov 7, 2024
9d43afc
[Feature] [Spec decode]: Combine chunked prefill with speculative dec…
NickLucche Nov 7, 2024
de0e61a
[CI/Build] Always run mypy (#10122)
russellb Nov 7, 2024
3be5b26
[CI/Build] Add shell script linting using shellcheck (#7925)
russellb Nov 7, 2024
a2f1f3b
[CI/Build] Automate PR body text cleanup (#10082)
russellb Nov 7, 2024
97b8475
Bump actions/setup-python from 5.2.0 to 5.3.0 (#9745)
dependabot[bot] Nov 7, 2024
28b2877
Online video support for VLMs (#10020)
litianjian Nov 7, 2024
93bff42
Bump actions/checkout from 4.2.1 to 4.2.2 (#9746)
dependabot[bot] Nov 7, 2024
073a472
[Misc] report relevant env vars in collect_env.py tool (#9293)
ycool Nov 8, 2024
42b4f46
[V1] Add all_token_ids attribute to Request (#10135)
WoosukKwon Nov 8, 2024
201fc07
[V1] Prefix caching (take 2) (#9972)
comaniac Nov 8, 2024
6bb52b0
[CI/Build] Give PR cleanup job PR write access (#10139)
russellb Nov 8, 2024
40d0e74
[Doc] Update FAQ links in spec_decode.rst (#9662)
whyiug Nov 8, 2024
ad39bd6
[Bugfix] Add error handling when server cannot respond any valid toke…
DearPlanet Nov 8, 2024
7371749
[Misc] Fix ImportError causing by triton (#9493)
MengqingCao Nov 8, 2024
3a7f15a
[Doc] Move CONTRIBUTING to docs site (#9924)
russellb Nov 8, 2024
da07a9e
Fixes a typo about 'max_decode_seq_len' which causes crashes with cud…
sighingnow Nov 8, 2024
aea6ad6
Add hf_transfer to testing image (#10096)
mgoin Nov 8, 2024
f4c2187
[Misc] Fix typo in #5895 (#10145)
DarkLight1337 Nov 8, 2024
f10797c
[Bugfix][XPU] Fix xpu tp by introducing XpuCommunicator (#10144)
yma11 Nov 8, 2024
1ff4aed
[Model] Expose size to Idefics3 as mm_processor_kwargs (#10146)
Isotr0py Nov 8, 2024
208ce62
[V1]Enable APC by default only for text models (#10148)
ywang96 Nov 8, 2024
b489fc3
[CI/Build] Update CPU tests to include all "standard" tests (#5481)
DarkLight1337 Nov 8, 2024
0535e5f
Fix edge case Mistral tokenizer (#10152)
patrickvonplaten Nov 8, 2024
f677862
Disable spec-decode + chunked-prefill for draft models with tensor pa…
sroy745 Nov 8, 2024
6b30471
[Misc] Improve Web UI (#10090)
rafvasq Nov 8, 2024
b5815c8
[V1] Fix non-cudagraph op name (#10166)
WoosukKwon Nov 8, 2024
87713c6
[CI/Build] Ignore .gitignored files for shellcheck (#10162)
ProExpertProg Nov 8, 2024
e1b5a82
Rename vllm.logging to vllm.logging_utils (#10134)
flozi00 Nov 8, 2024
4f93dfe
[torch.compile] Fuse RMSNorm with quant (#9138)
ProExpertProg Nov 8, 2024
10b67d8
[Bugfix] SymIntArrayRef expected to contain concrete integers (#10170)
bnellnm Nov 8, 2024
127c074
[Kernel][Triton] Add Triton implementation for scaled_mm_triton to su…
rasmith Nov 9, 2024
d7edca1
[CI/Build] Adding timeout in CPU CI to avoid CPU test queue blocking …
bigPYJ1151 Nov 9, 2024
e0191a9
[0/N] Rename `MultiModalInputs` to `MultiModalKwargs` (#10040)
DarkLight1337 Nov 9, 2024
f83fecc
[Bugfix] Ignore GPTQ quantization of Qwen2-VL visual module (#10169)
mgoin Nov 9, 2024
47672f3
[CI/Build] Fix VLM broadcast tests `tensor_parallel_size` passing (#1…
Isotr0py Nov 9, 2024
49d2a41
[Doc] Adjust RunLLM location (#10176)
DarkLight1337 Nov 9, 2024
1a95f10
[5/N] pass the whole config to model (#9983)
youkaichao Nov 9, 2024
8e1529d
[CI/Build] Add run-hpu-test.sh script (#10167)
xuechendi Nov 9, 2024
f192aeb
[Bugfix] Enable some fp8 and quantized fullgraph tests (#10171)
bnellnm Nov 9, 2024
bd46357
[bugfix] fix broken tests of mlp speculator (#10177)
youkaichao Nov 9, 2024
8a4358e
[doc] explaining the integration with huggingface (#10173)
youkaichao Nov 9, 2024
9e37266
bugfix: fix the bug that stream generate not work (#2756)
caijizhuo Nov 9, 2024
d88bff1
[Frontend] add `add_request_id` middleware (#9594)
cjackal Nov 9, 2024
b09895a
[Frontend][Core] Override HF `config.json` via CLI (#5836)
KrishnaM251 Nov 9, 2024
51c2e1f
[CI/Build] Split up models tests (#10069)
DarkLight1337 Nov 9, 2024
9fa4bdd
[ci][build] limit cmake version (#10188)
youkaichao Nov 10, 2024
1968202
[Doc] Fix typo error in CONTRIBUTING.md (#10190)
FuryMartin Nov 10, 2024
bfb7d61
[doc] Polish the integration with huggingface doc (#10195)
CRZbulabula Nov 10, 2024
20cf2f5
[Misc] small fixes to function tracing file path (#9543)
ShawnD200 Nov 10, 2024
73b9083
[misc] improve cloudpickle registration and tests (#10202)
youkaichao Nov 11, 2024
ad9a78b
[Doc] Fix typo error in vllm/entrypoints/openai/cli_args.py (#10196)
yansh97 Nov 11, 2024
f0f2e56
[doc] improve debugging code (#10206)
youkaichao Nov 11, 2024
f89d18f
[6/N] pass whole config to inner model (#10205)
youkaichao Nov 11, 2024
9804ac7
Bump the patch-update group with 5 updates (#10210)
dependabot[bot] Nov 11, 2024
58170d6
[Hardware][CPU] Add embedding models support for CPU backend (#10193)
Isotr0py Nov 11, 2024
36e4acd
[LoRA][Kernel] Remove the unused libentry module (#10214)
jeejeelee Nov 11, 2024
5fb1f93
[V1] Allow `tokenizer_mode` and `trust_remote_code` for Detokenizer (…
ywang96 Nov 11, 2024
2cebda4
[Bugfix][Hardware][CPU] Fix broken encoder-decoder CPU runner (#10218)
Isotr0py Nov 11, 2024
874f551
[Metrics] add more metrics (#4464)
HarryWu99 Nov 11, 2024
36fc439
[Doc] fix doc string typo in block_manager `swap_out` function (#10212)
yyccli Nov 11, 2024
e6de978
[core][distributed] add stateless process group (#10216)
youkaichao Nov 11, 2024
25144ce
Bump actions/setup-python from 5.2.0 to 5.3.0 (#10209)
dependabot[bot] Nov 11, 2024
f9dadfb
[V1] Fix detokenizer ports (#10224)
WoosukKwon Nov 11, 2024
d7a4f22
[V1] Do not use inductor for piecewise CUDA graphs (#10225)
WoosukKwon Nov 11, 2024
330e82d
[v1][torch.compile] support managing cudagraph buffer (#10203)
youkaichao Nov 11, 2024
fe15729
[V1] Use custom ops for piecewise CUDA graphs (#10227)
WoosukKwon Nov 11, 2024
4800339
Add docs on serving with Llama Stack (#10183)
terrytangyuan Nov 11, 2024
8a7fe47
[misc][distributed] auto port selection and disable tests (#10226)
youkaichao Nov 11, 2024
9d5b4e4
[V1] Enable custom ops with piecewise CUDA graphs (#10228)
WoosukKwon Nov 11, 2024
08f93e7
Make shutil rename in python_only_dev (#10233)
shcheglovnd Nov 11, 2024
6ace6fb
[V1] `AsyncLLM` Implementation (#9826)
robertgshaw2-neuralmagic Nov 11, 2024
d1c6799
[doc] update debugging guide (#10236)
youkaichao Nov 11, 2024
9cdba96
[Doc] Update help text for `--distributed-executor-backend` (#10231)
russellb Nov 12, 2024
eea55cc
[1/N] torch.compile user interface design (#10237)
youkaichao Nov 12, 2024
7f5edb5
[Misc][LoRA] Replace hardcoded cuda device with configurable argument…
jeejeelee Nov 12, 2024
812c981
Splitting attention kernel file (#10091)
maleksan85 Nov 12, 2024
3a28f18
[doc] explain the class hierarchy in vLLM (#10240)
youkaichao Nov 12, 2024
d201d41
[CI][CPU]refactor CPU tests to allow to bind with different cores (#1…
zhouyuan Nov 12, 2024
36c513a
[BugFix] Do not raise a `ValueError` when `tool_choice` is set to the…
gcalmettes Nov 12, 2024
a838ba7
[Misc]Fix Idefics3Model argument (#10255)
jeejeelee Nov 12, 2024
176fcb1
[Bugfix] Fix QwenModel argument (#10262)
DamonFool Nov 12, 2024
47db6ec
[Frontend] Add per-request number of cached token stats (#10174)
zifeitong Nov 12, 2024
7c65527
[V1] Use pickle for serializing EngineCoreRequest & Add multimodal in…
WoosukKwon Nov 12, 2024
b41fb9d
[Encoder Decoder] Update Mllama to run with both FlashAttention and X…
sroy745 Nov 12, 2024
8a06428
[LoRA] Adds support for bias in LoRA (#5733)
followumesh Nov 12, 2024
1f55e05
[V1] Enable Inductor when using piecewise CUDA graphs (#10268)
WoosukKwon Nov 12, 2024
96ae0ea
[doc] fix location of runllm widget (#10266)
youkaichao Nov 12, 2024
1808145
[doc] improve debugging doc (#10270)
youkaichao Nov 12, 2024
377b74f
Revert "[ci][build] limit cmake version" (#10271)
youkaichao Nov 12, 2024
112fa0b
[V1] Fix CI tests on V1 engine (#10272)
WoosukKwon Nov 13, 2024
0d4ea3f
[core][distributed] use tcp store directly (#10275)
youkaichao Nov 13, 2024
bbd3e86
[V1] Support VLMs with fine-grained scheduling (#9871)
WoosukKwon Nov 13, 2024
56a955e
Bump to compressed-tensors v0.8.0 (#10279)
dsikka Nov 13, 2024
032fcf1
[Doc] Fix typo in arg_utils.py (#10264)
xyang16 Nov 13, 2024
3945c82
[Model] Add support for Qwen2-VL video embeddings input & multiple im…
imkero Nov 13, 2024
1b886aa
[Model] Adding Support for Qwen2VL as an Embedding Model. Using MrLig…
FurtherAI Nov 13, 2024
b6dde33
[Core] Flashinfer - Remove advance step size restriction (#10282)
pavanimajety Nov 13, 2024
d909acf
[Model][LoRA]LoRA support added for idefics3 (#10281)
B-201 Nov 13, 2024
bb7991a
[V1] Add missing tokenizer options for `Detokenizer` (#10288)
ywang96 Nov 13, 2024
0b8bb86
[1/N] Initial prototype for multi-modal processor (#10044)
DarkLight1337 Nov 13, 2024
ac49b59
[Bugfix] bitsandbytes models fail to run pipeline parallel (#10200)
HoangCongDuc Nov 13, 2024
15bb833
[Bugfix] Fix tensor parallel for qwen2 classification model (#10297)
Isotr0py Nov 14, 2024
504ac53
[misc] error early for old-style class (#10304)
youkaichao Nov 14, 2024
e0853b6
[Misc] format.sh: Simplify tool_version_check (#10305)
russellb Nov 14, 2024
f67ce05
[Frontend] Pythonic tool parser (#9859)
mdepinet Nov 14, 2024
52b48c1
[BugFix]: properly deserialize `tool_calls` iterator before processin…
gcalmettes Nov 14, 2024
294bf46
[Model] Add BNB quantization support for Idefics3 (#10310)
B-201 Nov 14, 2024
29f3ef2
[ci][distributed] disable hanging tests (#10317)
youkaichao Nov 14, 2024
03025c0
[CI/Build] Fix CPU CI online inference timeout (#10314)
Isotr0py Nov 14, 2024
675d603
[CI/Build] Make shellcheck happy (#10285)
DarkLight1337 Nov 14, 2024
1dbae03
[Docs] Publish meetup slides (#10331)
WoosukKwon Nov 14, 2024
4a18fd1
Support Roberta embedding models (#9387)
maxdebayser Nov 14, 2024
b2e0ad3
[Perf] Reduce peak memory usage of llama (#10339)
andoorve Nov 15, 2024
554af92
[Bugfix] use AF_INET6 for OpenAI Compatible Server with ipv6 (#9583)
jxpxxzj Nov 15, 2024
11cd1ae
[Tool parsing] Improve / correct mistral tool parsing (#10333)
patrickvonplaten Nov 15, 2024
972112d
[Bugfix] Fix unable to load some models (#10312)
DarkLight1337 Nov 15, 2024
bf2ddc6
[bugfix] Fix static asymmetric quantization case (#10334)
ProExpertProg Nov 15, 2024
2885ba0
[Misc] Change RedundantReshapesPass and FusionPass logging from info …
tlrmchlsmth Nov 15, 2024
b40cf64
[Model] Support Qwen2 embeddings and use tags to select model tests (…
DarkLight1337 Nov 15, 2024
2ec8827
[Bugfix] Qwen-vl output is inconsistent in speculative decoding (#10…
skylee-01 Nov 15, 2024
2ac6d0e
[Misc] Consolidate pooler config overrides (#10351)
DarkLight1337 Nov 15, 2024
02dbf30
[Build] skip renaming files for release wheels pipeline (#9671)
simon-mo Nov 15, 2024
3d158cd
Add default value to avoid Falcon crash (#5363) (#10347)
wchen61 Nov 15, 2024
b311efd
[Misc] Fix import error in tensorizer tests and cleanup some code (#1…
DarkLight1337 Nov 15, 2024
2690855
[Doc] Remove float32 choice from --lora-dtype (#10348)
xyang16 Nov 15, 2024
1d65ec7
[Bugfix] Fix fully sharded LoRA bug (#10352)
jeejeelee Nov 15, 2024
f2056f7
[Misc] Fix some help info of arg_utils to improve readability (#10362)
ShangmingCai Nov 15, 2024
3a763ba
[core][misc] keep compatibility for old-style classes (#10356)
youkaichao Nov 15, 2024
691a3ec
[Bugfix] Ensure special tokens are properly filtered out for guided s…
gcalmettes Nov 15, 2024
79ee45b
[Misc] Bump up test_fused_moe tolerance (#10364)
ElizaWszola Nov 15, 2024
a6221a1
[Misc] bump mistral common version (#10367)
simon-mo Nov 15, 2024
9a3f1ac
Sync with upstream@v0.6.4.post1
dtrifiro Nov 27, 2024
[BugFix][Kernel] Fix Illegal memory access in causal_conv1d in H100 (vllm-project#9838)

Signed-off-by: mzusman <mor.zusmann@gmail.com>
mzusman authored Oct 31, 2024
commit 9fb12f7848d427b6c1c29052271030a5e96bd74a
34 changes: 32 additions & 2 deletions csrc/mamba/causal_conv1d/causal_conv1d.cu
@@ -418,6 +418,31 @@ void causal_conv1d_fwd_kernel(ConvParamsBase params) {
         typename Ktraits::BlockStoreT(smem_store).Store(out, out_vals_store, seqlen - chunk * kChunkSize);
       }
       out += kChunkSize;
+
+      int final_state_position = ((seqlen - (kWidth - 1)) - (n_chunks - 1) * kChunkSize);
+      // in case the final state is separated between the last "smem_exchange"
+      // and the one before it (chunk = n_chunks - 1 and chunk = n_chunks - 2),
+      // (which occurs when `final_state_position` is a non-positive index)
+      // we load the correct data from smem_exchange from both chunks, the last chunk iteration and the one before it
+      if (final_state_position < 0 && seqlen > kWidth){
+        input_t vals_load[kNElts] = {0};
+        if ((chunk == n_chunks - 2) && (tidx == kNThreads - 1)){
+          // chunk = n_chunks - 2, a segment of the final state sits in the last index
+          reinterpret_cast<vec_t *>(vals_load)[0] = smem_exchange[kNThreads - 1];
+          #pragma unroll
+          for (int w = 0; w < -final_state_position; ++w){
+            conv_states[w] = vals_load[kNElts + final_state_position + w];
+          }
+        }
+        if ((chunk == n_chunks - 1) && tidx == 0){
+          // chunk = n_chunks - 1, the second segment of the final state sits in the first positions
+          reinterpret_cast<vec_t *>(vals_load)[0] = smem_exchange[0];
+          for (int w = -final_state_position; w < kWidth - 1; ++w){
+            conv_states[w] = vals_load[w + final_state_position];
+          }
+          return;
+        }
+      }
     }
     // Final state is stored in the smem_exchange last token slot,
     // in case seqlen < kWidth, we would need to take the final state from the
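To make the arithmetic behind the new branch concrete, here is a small back-of-the-envelope check. It is a sketch only: `kNThreads = 128` and `kNElts = 8` (hence `kChunkSize = 1024`) and `n_chunks = ceil(seqlen / kChunkSize)` are assumed typical values, not taken from this patch; `kWidth = 4` matches the test parametrization further down.

```python
# Sketch: when does the final state straddle the last two smem_exchange chunks?
# Assumed constants (typical values, NOT confirmed by this patch):
kNThreads, kNElts, kWidth = 128, 8, 4
kChunkSize = kNThreads * kNElts  # 1024 elements per chunk

def final_state_position(seqlen: int) -> int:
    n_chunks = -(-seqlen // kChunkSize)  # assumed ceiling division
    return (seqlen - (kWidth - 1)) - (n_chunks - 1) * kChunkSize

print(final_state_position(1024))  # 1021 -> final state fits entirely in the last chunk
print(final_state_position(1025))  # -2   -> 2 of the 3 state elements sit in the
                                   #         previous chunk, 1 in the last chunk
```

A negative result is exactly the `final_state_position < 0` case handled above, and it is why `seqlen = 1025` is added to the test list below.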
@@ -446,9 +471,14 @@ void causal_conv1d_fwd_kernel(ConvParamsBase params) {
       }
       else {
         // in case the final state is in between the threads data
-        reinterpret_cast<vec_t *>(x_vals_load)[1] = smem_exchange[last_thread + 1];
-        reinterpret_cast<vec_t *>(x_vals_load)[0] = smem_exchange[last_thread];
+        const int offset = ((seqlen - (kWidth - 1)) % (kNElts));
+        if ((offset + kWidth - 2) >= kNElts && (last_thread + 1 < kNThreads)){
+          // In case last_thread == kNThreads - 1, accessing last_thread + 1 will result in an
+          // illegal memory access error on H100.
+          // Therefore, we access last_thread + 1 only if the final state data sits there
+          reinterpret_cast<vec_t *>(x_vals_load)[1] = smem_exchange[last_thread + 1];
+        }
+        reinterpret_cast<vec_t *>(x_vals_load)[0] = smem_exchange[last_thread];
         #pragma unroll
         for (int w = 0; w < kWidth - 1; ++w){
           conv_states[w] = x_vals_load[offset + w ];
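The second change guards the `smem_exchange[last_thread + 1]` read. The final state spans `x_vals_load[offset : offset + kWidth - 1]`, so the second `kNElts`-wide vector is only needed when that window crosses past the first one. A minimal sketch of the condition (again assuming `kNElts = 8` and `kWidth = 4` for illustration):

```python
kNElts, kWidth = 8, 4  # assumed values for illustration

def needs_second_vector(seqlen: int) -> bool:
    offset = (seqlen - (kWidth - 1)) % kNElts
    # final state occupies x_vals_load[offset : offset + kWidth - 1]
    return offset + kWidth - 2 >= kNElts

print(needs_second_vector(16))  # offset = 5 -> False, smem_exchange[last_thread] suffices
print(needs_second_vector(17))  # offset = 6 -> True, state spills into last_thread + 1
```

Previously the `[1]` vector was loaded unconditionally, so when `last_thread == kNThreads - 1` the read went one slot past the end of `smem_exchange`, which surfaced as an illegal memory access on H100.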
7 changes: 5 additions & 2 deletions tests/kernels/test_causal_conv1d.py
@@ -151,7 +151,7 @@ def causal_conv1d_opcheck_fn(x: torch.Tensor,
 @pytest.mark.parametrize("has_bias", [True])
 @pytest.mark.parametrize("width", [4])
 @pytest.mark.parametrize(
-    'seqlen', [1, 8, 16, 32, 64, 128, 256, 512, 784, 1024, 2048, 4096])
+    'seqlen', [1, 8, 16, 32, 64, 128, 256, 512, 784, 1024, 1025, 2048, 4096])
 @pytest.mark.parametrize('dim', [64])
 @pytest.mark.parametrize('batch', [1])
 def test_causal_conv1d(batch, dim, seqlen, width, has_bias, silu_activation,
@@ -420,7 +420,10 @@ def test_causal_conv1d_varlen(with_padding, dim, seqlen, width, has_bias,
 
     unpadded_out = out[:, :out_ref_tensor.shape[-1]]
     assert torch.allclose(unpadded_out, out_ref_tensor, rtol=rtol, atol=atol)
-    assert torch.allclose(final_states, final_states_ref, rtol=rtol, atol=atol)
+    assert torch.allclose(final_states[state_indices],
+                          final_states_ref[state_indices],
+                          rtol=rtol,
+                          atol=atol)
 
     causal_conv1d_opcheck_fn(x.squeeze(0), weight, bias, cumsum.cuda(),
                              padded_state_indices, has_initial_states,
6 changes: 3 additions & 3 deletions tests/kernels/test_mamba_ssm.py
@@ -555,7 +555,7 @@ def test_selective_state_update_with_batch_indices(with_padding, dim, dstate,
     device = "cuda"
     rtol, atol = (3e-4, 1e-3) if itype == torch.float32 else (5e-3, 1e-2)
     if itype == torch.bfloat16:
-        rtol, atol = 7e-2, 7e-2
+        rtol, atol = 1e-1, 1e-1
     if torch.version.hip:
         atol *= 2
     # set seed
@@ -610,8 +610,8 @@ def test_selective_state_update_with_batch_indices(with_padding, dim, dstate,
                                   dt_bias=dt_bias,
                                   dt_softplus=True)
 
-    print("Output diff max", (out - out_ref[0]).max())
-    print("Output diff mean", (out - out_ref[0]).mean())
+    print("Output diff max", (out[:batch_size] - out_ref).max())
+    print("Output diff mean", (out[:batch_size] - out_ref).mean())
     print("Output state diff max", (state[state_indices, :] - state_ref).max())
     print("Output state diff mean",
           (state[state_indices, :] - state_ref).mean())
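The print fix follows the padding convention used elsewhere in this test: with padding enabled, `out` carries extra padded rows beyond `batch_size`, while `out_ref` covers only the real rows, so the diff must trim the padding before comparing. A minimal illustration (shapes are assumed from the test's naming, not verified here):

```python
import torch

batch_size, padded_batch_size, dim = 3, 4, 8  # hypothetical sizes
out = torch.randn(padded_batch_size, dim)     # kernel output incl. padded rows
out_ref = out[:batch_size].clone()            # reference covers real rows only

# Trim the padded rows so the shapes align before diffing.
print("Output diff max", (out[:batch_size] - out_ref).max())
```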