I don't know why, but I'm encountering this problem with the library. Here is my simple script:
import ollama

client = ollama.Client(host=llm_config["base_url"], timeout=600)
client.chat(model=config["ollama"]["model"], messages=[{
    "role": "user",
    "content": "Why is the sky blue?",
}])
Where llm_config["base_url"] is the URL of the Ollama server (it runs on a serverless GPU), which I can reach successfully from open-webui and even use to query the model without issues. The model I'm using is qwen2.5:32b-instruct-q4_K_M and the GPU is an RTX A6000.
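For reference, the same request can be reproduced without the library by posting to the Ollama REST API directly. This is only a minimal sketch (it assumes the server exposes the standard /api/chat endpoint and uses httpx, which the ollama package already depends on, as a plain HTTP client); it can help tell a library problem apart from a gateway timeout:

import httpx

# Same payload the ollama client would send, but without the library in between.
response = httpx.post(
    f'{llm_config["base_url"]}/api/chat',
    json={
        "model": config["ollama"]["model"],
        "messages": [{"role": "user", "content": "Why is the sky blue?"}],
        "stream": False,
    },
    timeout=600,
)
print(response.status_code)
print(response.text)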
The traceback (client-side) is the following:
Traceback (most recent call last):
File "/mnt/shared/devilteo911/cvr-agent/.venv/lib/python3.11/site-packages/ollama/_client.py", line 236, in chat
return self._request_stream(
^^^^^^^^^^^^^^^^^^^^^
File "/mnt/shared/devilteo911/cvr-agent/.venv/lib/python3.11/site-packages/ollama/_client.py", line 99, in _request_stream
return self._stream(*args, **kwargs) if stream else self._request(*args, **kwargs).json()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/shared/devilteo911/cvr-agent/.venv/lib/python3.11/site-packages/ollama/_client.py", line 75, in _request
raise ResponseError(e.response.text, e.response.status_code) from None
ollama._types.ResponseError: <html><body><h1>504 Gateway Time-out</h1>
The server didn't respond in time.
</body></html>
and this is what I see on the server side:

It happens every time after 50 seconds, even though the timeout is set to 600 seconds. Am I missing something?

Hey @devilteo911 - have you tried not setting a timeout and seeing if there's an issue on the server side regardless? I'm trying to narrow down whether some information isn't passing all the way through to the server, or whether there's an error on the server side.

The issue seems to occur only on the first call, which consistently results in a 504 error. Subsequent calls with the same input complete the generation without any problems.

I believe the problem is related to the time it takes to generate the first token, particularly during a cold start of my service. During a cold start, the model needs to be downloaded from Hugging Face, as my serverless GPU provider lacks permanent storage to keep the model locally.
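If the cold start is indeed the culprit, one possible workaround is to warm the model up with a cheap request and retry while the gateway is still returning 504s. The sketch below is only illustrative: it assumes the gateway's 504 surfaces as ollama.ResponseError with status_code == 504 (as in the traceback above) and reuses the llm_config and config objects from the original script.

import time

import ollama

client = ollama.Client(host=llm_config["base_url"], timeout=600)

def warm_up(retries: int = 10, delay: float = 30.0) -> None:
    """Probe the model until the backend has it loaded, swallowing gateway 504s."""
    for _ in range(retries):
        try:
            # A tiny request; it will keep hitting the gateway's 504 until the model is ready.
            client.generate(model=config["ollama"]["model"], prompt="ping")
            return
        except ollama.ResponseError as err:
            if err.status_code != 504:
                raise
            time.sleep(delay)
    raise RuntimeError("model never became ready")

warm_up()
response = client.chat(
    model=config["ollama"]["model"],
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
)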