
batching/dynamic batching #112

Open
nivibilla opened this issue Feb 27, 2024 · 2 comments
Comments

@nivibilla
Contributor

Thanks for the amazing work! It really is super fast at bs=1.

Can batched use cases or dynamic batching be supported?

@Chillee
Contributor

Chillee commented Feb 27, 2024

It is not so difficult to modify it to support batched use cases, but supporting dynamic batching is quite a bit more work.

If you really want continuous batching, I would suggest looking at projects like vLLM or TensorRT-LLM for now.
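
For the simpler static-batching case, a minimal sketch could look like the following. It assumes a gpt-fast-style interface where the model is called as model(idx, input_pos) and KV caches are pre-allocated with a setup_caches(max_batch_size, max_seq_length) helper; those names, and the fact that left-padding/attention masking is glossed over, are assumptions for illustration, not gpt-fast's actual API.

    # Hypothetical sketch of static (fixed-size) batched greedy decoding.
    import torch

    @torch.no_grad()
    def generate_batched(model, prompt_ids: torch.Tensor, max_new_tokens: int,
                         eos_id: int) -> torch.Tensor:
        """prompt_ids: [batch, prompt_len], padded to a common length.
        NOTE: proper attention masking for padded positions is omitted here."""
        bsz, prompt_len = prompt_ids.shape
        device = prompt_ids.device
        # Allocate KV caches for the whole batch up front (assumed helper).
        model.setup_caches(max_batch_size=bsz,
                           max_seq_length=prompt_len + max_new_tokens)

        # Prefill: run the full prompt in one forward pass.
        input_pos = torch.arange(prompt_len, device=device)
        logits = model(prompt_ids, input_pos)            # [batch, prompt_len, vocab]
        next_tok = logits[:, -1].argmax(dim=-1, keepdim=True)   # greedy, [batch, 1]

        tokens = [next_tok]
        finished = next_tok.squeeze(1) == eos_id
        # Decode: one token per step for every sequence in the batch.
        for step in range(1, max_new_tokens):
            input_pos = torch.tensor([prompt_len + step - 1], device=device)
            logits = model(next_tok, input_pos)          # [batch, 1, vocab]
            next_tok = logits[:, -1].argmax(dim=-1, keepdim=True)
            # Sequences that already hit EOS keep emitting EOS as padding.
            next_tok = torch.where(finished.unsqueeze(1),
                                   torch.full_like(next_tok, eos_id), next_tok)
            tokens.append(next_tok)
            finished |= next_tok.squeeze(1) == eos_id
            if bool(finished.all()):
                break
        return torch.cat(tokens, dim=1)                  # [batch, <= max_new_tokens]

The hard part that this sketch sidesteps is dynamic (continuous) batching, i.e. admitting and retiring sequences mid-generation, which requires per-sequence cache management and scheduling.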

@Ying1123

Ying1123 commented Jul 28, 2024

For anyone interested in this issue, we have successfully integrated torch.compile into a dynamic batching serving system: https://github.com/sgl-project/sglang.

We use flashinfer for the attention kernels and torch.compile for all other parts. We found this combination makes it faster than TensorRT-LLM and the original gpt-fast, and much faster than vLLM. It also supports the other serving features such as continuous batching and prefix caching.

You can give it a try:

python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B --enable-torch-compile
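
Once the server is up, a client call could look roughly like the sketch below. The default port (30000) and the /generate request shape are assumptions based on sglang's native HTTP API; check the sglang docs for the exact endpoint and payload.

    # Hypothetical client request to a running sglang server.
    import requests

    resp = requests.post(
        "http://localhost:30000/generate",
        json={
            "text": "The capital of France is",
            "sampling_params": {"max_new_tokens": 32, "temperature": 0.0},
        },
    )
    print(resp.json())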
