
batching/dynamic batching #112

Open
nivibilla opened this issue Feb 27, 2024 · 2 comments
Comments

@nivibilla
Contributor

Thanks for the amazing work! It really is super fast at bs=1.

Can batched use cases or dynamic batching be supported?

@Chillee
Contributor

Chillee commented Feb 27, 2024

It is not so difficult to modify it to support batched use cases, but supporting dynamic batching is quite a bit more work.

If you really want continuous batching, I would suggest looking at projects like vLLM or TensorRT-LLM for now.
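
For the simpler static-batching case, a minimal sketch could look like the following. It assumes a gpt-fast-style interface where the model is called as model(idx, input_pos) and KV caches are pre-allocated with a setup_caches(max_batch_size, max_seq_length) helper; those names, and the fact that left-padding/attention masking is glossed over, are assumptions for illustration, not gpt-fast's actual API.

    # Hypothetical sketch of static (fixed-size) batched greedy decoding.
    import torch

    @torch.no_grad()
    def generate_batched(model, prompt_ids: torch.Tensor, max_new_tokens: int,
                         eos_id: int) -> torch.Tensor:
        """prompt_ids: [batch, prompt_len], padded to a common length.
        NOTE: proper attention masking for padded positions is omitted here."""
        bsz, prompt_len = prompt_ids.shape
        device = prompt_ids.device
        # Allocate KV caches for the whole batch up front (assumed helper).
        model.setup_caches(max_batch_size=bsz,
                           max_seq_length=prompt_len + max_new_tokens)

        # Prefill: run the full prompt in one forward pass.
        input_pos = torch.arange(prompt_len, device=device)
        logits = model(prompt_ids, input_pos)            # [batch, prompt_len, vocab]
        next_tok = logits[:, -1].argmax(dim=-1, keepdim=True)   # greedy, [batch, 1]

        tokens = [next_tok]
        finished = next_tok.squeeze(1) == eos_id
        # Decode: one token per step for every sequence in the batch.
        for step in range(1, max_new_tokens):
            input_pos = torch.tensor([prompt_len + step - 1], device=device)
            logits = model(next_tok, input_pos)          # [batch, 1, vocab]
            next_tok = logits[:, -1].argmax(dim=-1, keepdim=True)
            # Sequences that already hit EOS keep emitting EOS as padding.
            next_tok = torch.where(finished.unsqueeze(1),
                                   torch.full_like(next_tok, eos_id), next_tok)
            tokens.append(next_tok)
            finished |= next_tok.squeeze(1) == eos_id
            if bool(finished.all()):
                break
        return torch.cat(tokens, dim=1)                  # [batch, <= max_new_tokens]

The hard part that this sketch sidesteps is dynamic (continuous) batching, i.e. admitting and retiring sequences mid-generation, which requires per-sequence cache management and scheduling.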

@Ying1123

Ying1123 commented Jul 28, 2024

For anyone interested in this issue, we have successfully integrated torch.compile into a dynamic batching serving system: https://github.com/sgl-project/sglang.

We use flashinfer for the attention kernels and torch.compile for all other parts. We found this combination makes it faster than TensorRT-LLM and the original gpt-fast, and much faster than vLLM. It also supports the other serving features such as continuous batching and prefix caching.

You can give it a try:

python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B --enable-torch-compile
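
Once the server is up, a client call could look roughly like the sketch below. The default port (30000) and the /generate request shape are assumptions based on sglang's native HTTP API; check the sglang docs for the exact endpoint and payload.

    # Hypothetical client request to a running sglang server.
    import requests

    resp = requests.post(
        "http://localhost:30000/generate",
        json={
            "text": "The capital of France is",
            "sampling_params": {"max_new_tokens": 32, "temperature": 0.0},
        },
    )
    print(resp.json())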
