Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make RegexGuide pickleable again for vllm and tgi #99

Merged
merged 5 commits into from
Dec 2, 2024

Conversation

joennlae
Copy link
Contributor

I understand that pickleable is not your priority right now. But the RegexGuide needs to be pickled for vllm production use, which is multiprocessing-based.

This PR reintroduces this pickling capability + some tests.

I understand that this introduces more effort on your side.

References:
dottxt-ai/outlines#1274
vllm-project/vllm#10490
vllm-project/vllm#10576
vllm-project/vllm#10489

It would also tackle the current caching issues:
huggingface/text-generation-inference#2766
dottxt-ai/outlines#1283

Closes:
#95

joennlae added a commit to 44ai-labs/vllm that referenced this pull request Dec 1, 2024
Essentially a cleaned up version of this `pr`:
vllm-project#10785

Especially since `outlines` is rather slow and the new version is though
to intergrate as they do not focus on being pickleable which is a key
feature for us using the multiprocessing engine: dottxt-ai/outlines-core#99

I assume more and more will change over to `xgrammar`.

This is a minimum implementation.

https://arxiv.org/pdf/2411.15100

Signed-off-by: Jannis Schönleber <[email protected]>
joennlae added a commit to 44ai-labs/vllm that referenced this pull request Dec 1, 2024
Essentially a cleaned up version of this `pr`:
vllm-project#10785

Especially since `outlines` is rather slow and the new version is though
to intergrate as they do not focus on being pickleable which is a key
feature for us using the multiprocessing engine: dottxt-ai/outlines-core#99

I assume more and more will change over to `xgrammar`.

This is a minimum implementation.

https://arxiv.org/pdf/2411.15100

Signed-off-by: Jannis Schönleber <[email protected]>
@rlouf
Copy link
Member

rlouf commented Dec 2, 2024

Hey @joennlae, thank you for contributing, and we would like to move forward quickly with this! Is it ready for review?

@joennlae
Copy link
Contributor Author

joennlae commented Dec 2, 2024

In my opinion this should be ready :-)

src/python_bindings/mod.rs Outdated Show resolved Hide resolved
src/python_bindings/mod.rs Outdated Show resolved Hide resolved
torymur
torymur previously approved these changes Dec 2, 2024
Copy link
Contributor

@torymur torymur left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @joennlae 👋

Thank you for adding serialization support, let's handle unwraps and we'll be ready to merge it right away 🚀

@torymur torymur added bug Something isn't working enhancement New feature or request labels Dec 2, 2024
@joennlae
Copy link
Contributor Author

joennlae commented Dec 2, 2024

Hi @torymur :-)

Thank you for the review. I just wanted to let you know that I updated it accordingly.

@torymur torymur merged commit d1a0e8c into dottxt-ai:main Dec 2, 2024
7 checks passed
@rlouf
Copy link
Member

rlouf commented Dec 2, 2024

I just cut a new release @joennlae. Do you need this in Outlines? (Although you should be able to use outlines-core without importing outlines at this point)

@joennlae
Copy link
Contributor Author

joennlae commented Dec 2, 2024

For me, it is all right as I install it directly. But in the long run I think it would be nice :-)

markmc added a commit to russellb/vllm that referenced this pull request Dec 3, 2024
Currently with MQLLMEngine, we are initializing LogitsProcessors
on the client side, pickling the entire list of LogitsProcessors,
and sending them over ZeroMQ to the engine.

This was put in place so that the expensive initialization (tens
of second) of the Outlines LogitsProcessor could happen in a thread,
such that the client could defer submitting the request to the engine
until the initialization had completed.

This became an issue because recent (Rust-based) Outlines does not
support pickle serialization, but this has resolved by
dottxt-ai/outlines-core#99.

However, this approach is also not desirable in the case of
XGrammar because the initialization is not expensive (hundreds of
milliseconds) and the serialization is just unnecessary complexity.

And so, let's remove the code from the client side of MQLLMEngine
to special case the creation of logits_processors based on guided
decoding params. This will now happen on the engine side once
again.

Signed-off-by: Mark McLoughlin <[email protected]>
markmc added a commit to russellb/vllm that referenced this pull request Dec 3, 2024
Currently with MQLLMEngine, we are initializing LogitsProcessors
on the client side, pickling the entire list of LogitsProcessors,
and sending them over ZeroMQ to the engine.

This was put in place so that the expensive initialization (tens
of second) of the Outlines LogitsProcessor could happen in a thread,
such that the client could defer submitting the request to the engine
until the initialization had completed.

This became an issue because recent (Rust-based) Outlines does not
support pickle serialization, but this has resolved by
dottxt-ai/outlines-core#99.

However, this approach is also not desirable in the case of
XGrammar because the initialization is not expensive (hundreds of
milliseconds) and the serialization is just unnecessary complexity.

And so, let's remove the code from the client side of MQLLMEngine
to special case the creation of logits_processors based on guided
decoding params. This will now happen on the engine side once
again.

Signed-off-by: Mark McLoughlin <[email protected]>
russellb pushed a commit to russellb/vllm that referenced this pull request Dec 3, 2024
Currently with MQLLMEngine, we are initializing LogitsProcessors
on the client side, pickling the entire list of LogitsProcessors,
and sending them over ZeroMQ to the engine.

This was put in place so that the expensive initialization (tens
of second) of the Outlines LogitsProcessor could happen in a thread,
such that the client could defer submitting the request to the engine
until the initialization had completed.

This became an issue because recent (Rust-based) Outlines does not
support pickle serialization, but this has resolved by
dottxt-ai/outlines-core#99.

However, this approach is also not desirable in the case of
XGrammar because the initialization is not expensive (hundreds of
milliseconds) and the serialization is just unnecessary complexity.

And so, let's remove the code from the client side of MQLLMEngine
to special case the creation of logits_processors based on guided
decoding params. This will now happen on the engine side once
again.

Signed-off-by: Mark McLoughlin <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants