Warning: Do not support sampling multiple responses #99

Mushoz · 2024-11-16T21:06:14Z

In the README the following warning can be read:

"Note that the Anthropic API, llama-server (and ollama) currently does not support sampling multiple responses from a model, which limits the available approaches"

I have two questions regarding this warning:

Is this accurate? Llama-server allows for the -np switch which will allow for decoding parallel requests. Should this allow the MOA approach to work for example?
Would it be possible to do these requests sequentially instead of in parallel? I understand that this won't be ideal for speed, but it's better than not working at all.

codelion · 2024-11-16T23:59:10Z

For 1, I haven’t tried the np switch, can you point me to some documentation that describes what it does?

There is a PR open to support 2 here - #83 but it will not be the same as sampling multiple responses in the same request using the ‘n’ parameter like the OpenAI client can do. Some API providers also do not support it, e.g Claude doesn’t support it but Gemini does. Once the pr is merged I can try benchmarking again.

Mushoz · 2024-11-17T14:53:41Z

Sure! The documentation can be read here: https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md

Though I think I misunderstood the OptiLLM warning initially. Does it mean that for chatgpt a single request will have a parameter that signifies multiple responses are wanted? In that case, the -np switch is not going to make a difference. The -np switch simply allows the backend to process multiple requests in parallel, but it doesn't allow a single request to request multiple responses.

But that doesn't mean the -np switch is useless though. Instead of a single request requesting multiple responses, OptiLLM could send out multiple requests asynchronously with the same prompt. Those multiple requests would then be able to be processed in parallel. That way, OptiLLM still obtains multiple responses, although not as elegant perhaps as the chatgpt API that supports multiple responses for a single request natively.

It's good to hear that 2. will be supported as a nice stopgap solution!

codelion · 2024-11-17T18:51:22Z

Yes, the n parameter allows extracting multiple responses from the same request during decoding. This is not the same as sending multiple requests. On top of that many providers will do caching so we will need to change the seed with every request or use some other hack to avoid that. That’s the reason why it won’t work the same as using the n parameter.

Mushoz · 2024-11-17T20:06:04Z

To be fair though, the PR to support 2 will run into the exact same issues with caching. I still think it's worthwhile to have the option, since caches can be disabled in some cases (when running locally). But doing 2 through multiple requests done in parallel will be much faster than doing the required requests sequentially.

Caches could potentially also be invalidated to include a nonce at the beginning of the system prompt, but that does indeed sound a little bit hacky.

codelion added the question Further information is requested label Nov 16, 2024

codelion mentioned this issue Nov 18, 2024

deepseek can't support n > 1 #83

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Warning: Do not support sampling multiple responses #99

Warning: Do not support sampling multiple responses #99

Mushoz commented Nov 16, 2024

codelion commented Nov 16, 2024

Mushoz commented Nov 17, 2024

codelion commented Nov 17, 2024

Mushoz commented Nov 17, 2024

Warning: Do not support sampling multiple responses #99

Warning: Do not support sampling multiple responses #99

Comments

Mushoz commented Nov 16, 2024

codelion commented Nov 16, 2024

Mushoz commented Nov 17, 2024

codelion commented Nov 17, 2024

Mushoz commented Nov 17, 2024