(fix) runtime: tweak _wait_until_alive tenacity and exception handling #3878
openhands/runtime/client/runtime.py
Outdated
wait=tenacity.wait_exponential(multiplier=2, min=2, max=20),
retry=tenacity.retry_if_exception_type(ConnectionRefusedError),
reraise=True, # Re-raise exceptions after retries
retry_error_callback=lambda retry_state: None,
If I'm reading this right, we now retry only on ConnectionRefusedError. Is that the only exception we hit here?
Reverted the tenacity changes.
The revert might be good, I was just seeing some other exception in _wait_until_alive. Thanks, let me give it a quick run, this is so weird to reproduce but a few tries won't hurt. 😅
It's still doing this:
Then two things happen:
The last looks like this: (the lines with DEBUG were when I pressed Ctrl+C again)
As far as I can tell, that's all "normal" behavior as the runtime in Maybe @xingyaoww has additional ideas on how to tweak/change either the retry/tenacity and/or the LogBuffer?
Edit: that doesn't solve the problem. Yes, at some point the app has "died" and the runtime client keeps trying to connect; something like that is happening. The error in retry is ConnectionError. FWIW, the experience is worse with the UI than without. Some of the timeouts we use mean that it just won't let go for long minutes. I agree some revert is still good... but wait. I thought simply reverting the commit actually did work well. But that's a bit strange actually...
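To make the retry-filter question concrete, here is a minimal sketch using only the standard library. `retry_on` and `make_flaky` are hypothetical helpers written for this illustration; they are not tenacity and not the code in `runtime.py`. The sketch assumes the `ConnectionError` mentioned above is the Python built-in one (it could equally be the one from `requests`, which is a different class). It shows that a filter on `ConnectionRefusedError` does not retry other `ConnectionError` subclasses, while a filter on `ConnectionError` does.

```python
import time


def retry_on(exc_types, attempts=3, delay=0.0):
    """Hypothetical stand-in for a retry decorator filtered by exception type."""
    def decorator(fn):
        def wrapper(*args, **kwargs):
            for attempt in range(1, attempts + 1):
                try:
                    return fn(*args, **kwargs)
                except exc_types:
                    # Only exceptions matching exc_types reach this branch;
                    # anything else propagates on the first failure.
                    if attempt == attempts:
                        raise  # re-raise once the attempts are used up
                    time.sleep(delay)
        return wrapper
    return decorator


def make_flaky(exc, fail_times):
    """Build a fake liveness check that raises `exc` for the first `fail_times` calls."""
    calls = {"n": 0}

    def check_alive():
        calls["n"] += 1
        if calls["n"] <= fail_times:
            raise exc
        return calls["n"]

    return check_alive, calls


# Narrow filter: ConnectionResetError is not a ConnectionRefusedError,
# so the first failure escapes without any retry.
narrow_check, narrow_calls = make_flaky(ConnectionResetError("reset"), fail_times=2)
try:
    retry_on(ConnectionRefusedError)(narrow_check)()
except ConnectionResetError:
    print("not retried, calls made:", narrow_calls["n"])

# Broad filter: ConnectionError covers both subclasses, so the call is retried.
broad_check, broad_calls = make_flaky(ConnectionResetError("reset"), fail_times=2)
print("succeeded on call:", retry_on(ConnectionError)(broad_check)())
```

If the exception actually comes from `requests`, the same reasoning applies to its own exception hierarchy, so the filter would need to name that class instead.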
openhands/runtime/client/runtime.py
Outdated
response = self.session.get(f'{self.api_url}/alive')
if response.status_code == 200:
except KeyboardInterrupt:
logger.debug('KeyboardInterrupt: exiting _wait_until_alive.')
I'd suggest not adding this, because it doesn't work as expected: the message does not show up in the logs.
It will confuse us later, when we see it in the code and assume it worked. 😢
I've reverted all changes to _wait_until_alive now.
Thanks for looking into this. I think we can merge the removal of the container logs line; it does appear to work on the PR branch (not always, but better).
Short description of the problem this fixes or functionality that this introduces. This may be used for the CHANGELOG
EventStreamRuntime: tweak _wait_until_alive tenacity and exception handling
Give a summary of what the PR does, explaining any non-trivial design decisions
run_action (PR (fix) Update logs after run_action (EventStreamRuntime) #3870) commented out and added TODO
Link of any specific issues this addresses
Fixes #3876