Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Warning: Do not support sampling multiple responses #99

Open
Mushoz opened this issue Nov 16, 2024 · 4 comments
Open

Warning: Do not support sampling multiple responses #99

Mushoz opened this issue Nov 16, 2024 · 4 comments
Labels
question Further information is requested

Comments

@Mushoz
Copy link

Mushoz commented Nov 16, 2024

In the README the following warning can be read:

"Note that the Anthropic API, llama-server (and ollama) currently does not support sampling multiple responses from a model, which limits the available approaches"

I have two questions regarding this warning:

  1. Is this accurate? Llama-server allows for the -np switch which will allow for decoding parallel requests. Should this allow the MOA approach to work for example?
  2. Would it be possible to do these requests sequentially instead of in parallel? I understand that this won't be ideal for speed, but it's better than not working at all.
@codelion
Copy link
Owner

For 1, I haven’t tried the np switch, can you point me to some documentation that describes what it does?

There is a PR open to support 2 here - #83 but it will not be the same as sampling multiple responses in the same request using the ‘n’ parameter like the OpenAI client can do. Some API providers also do not support it, e.g Claude doesn’t support it but Gemini does. Once the pr is merged I can try benchmarking again.

@codelion codelion added the question Further information is requested label Nov 16, 2024
@Mushoz
Copy link
Author

Mushoz commented Nov 17, 2024

Sure! The documentation can be read here: https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md

Though I think I misunderstood the OptiLLM warning initially. Does it mean that for chatgpt a single request will have a parameter that signifies multiple responses are wanted? In that case, the -np switch is not going to make a difference. The -np switch simply allows the backend to process multiple requests in parallel, but it doesn't allow a single request to request multiple responses.

But that doesn't mean the -np switch is useless though. Instead of a single request requesting multiple responses, OptiLLM could send out multiple requests asynchronously with the same prompt. Those multiple requests would then be able to be processed in parallel. That way, OptiLLM still obtains multiple responses, although not as elegant perhaps as the chatgpt API that supports multiple responses for a single request natively.

It's good to hear that 2. will be supported as a nice stopgap solution!

@codelion
Copy link
Owner

Yes, the n parameter allows extracting multiple responses from the same request during decoding. This is not the same as sending multiple requests. On top of that many providers will do caching so we will need to change the seed with every request or use some other hack to avoid that. That’s the reason why it won’t work the same as using the n parameter.

@Mushoz
Copy link
Author

Mushoz commented Nov 17, 2024

To be fair though, the PR to support 2 will run into the exact same issues with caching. I still think it's worthwhile to have the option, since caches can be disabled in some cases (when running locally). But doing 2 through multiple requests done in parallel will be much faster than doing the required requests sequentially.

Caches could potentially also be invalidated to include a nonce at the beginning of the system prompt, but that does indeed sound a little bit hacky.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested
Projects
None yet
Development

No branches or pull requests

2 participants