
(fix) runtime: tweak _wait_until_alive tenacity and exception handling #3878

Merged (3 commits into main from wait-alive-exception, Sep 16, 2024)

Conversation

@tobitege (Collaborator) commented Sep 15, 2024

Short description (may be used for the CHANGELOG):

EventStreamRuntime: tweak _wait_until_alive tenacity and exception handling

Fixes #3876

@tobitege tobitege added the enhancement label Sep 15, 2024
@tobitege tobitege requested a review from enyst September 15, 2024 20:03
wait=tenacity.wait_exponential(multiplier=2, min=2, max=20),
retry=tenacity.retry_if_exception_type(ConnectionRefusedError),
reraise=True, # Re-raise exceptions after retries
retry_error_callback=lambda retry_state: None,
@enyst (Collaborator) commented:
If I'm reading this right, now we retry only if it's ConnectionRefusedError. Is that the only one we use?

@tobitege (Collaborator, Author) replied Sep 15, 2024:

Reverted the tenacity changes.

@enyst (Collaborator) replied:

The revert might be good, I was just seeing some other exception in _wait_until_alive. Thanks, let me give it a quick run, this is so weird to reproduce but a few tries won't hurt. 😅
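For context, a minimal sketch of how the keyword arguments quoted at the top of this thread sit on the tenacity decorator. The stop condition, session setup, and method body are assumptions for illustration, not the actual OpenHands code:

import requests
import tenacity

class EventStreamRuntime:
    def __init__(self, api_url: str):
        # Illustrative setup; the real runtime builds these elsewhere.
        self.api_url = api_url
        self.session = requests.Session()

    @tenacity.retry(
        stop=tenacity.stop_after_attempt(10),  # assumed stop condition
        wait=tenacity.wait_exponential(multiplier=2, min=2, max=20),
        retry=tenacity.retry_if_exception_type(ConnectionRefusedError),
        reraise=True,  # re-raise exceptions after retries...
        # ...though when retry_error_callback is set, tenacity returns the
        # callback's result (None) once retries are exhausted instead of re-raising.
        retry_error_callback=lambda retry_state: None,
    )
    def _wait_until_alive(self):
        response = self.session.get(f'{self.api_url}/alive')
        if response.status_code != 200:
            # Illustrative; per the discussion in this PR, the real exception
            # comes from the Jupyter plugin's _connect method.
            raise ConnectionRefusedError('runtime /alive check failed')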

@tobitege tobitege requested a review from enyst September 15, 2024 20:18
@enyst (Collaborator) commented Sep 15, 2024

It's still doing this:

  • (llama): open the UI, write: 42
  • Llama starts to create an app.py as described by our example in the prompt... 😅
  • I let it run for ~18-20 steps (it got BrowsingAgent to work but it didn't finish)
  • do Ctrl+C

Then two things happen:

  • first, it seems to run more steps. I think this shouldn't happen, and didn't happen in the past
  • second, it tries to get container logs, over and over.

The latter looks like this (the lines with DEBUG appeared when I pressed Ctrl+C again):

23:13:28 - openhands:DEBUG: runtime.py:297 - Getting container logs...
23:13:32 - openhands:DEBUG: runtime.py:297 - Getting container logs...

openhands-py3.12(base) ➜  odie git:(wait-alive-exception) ✗ 23:13:40 - openhands:DEBUG: runtime.py:297 - Getting container logs...
23:13:56 - openhands:DEBUG: runtime.py:297 - Getting container logs...

openhands-py3.12(base) ➜  odie git:(wait-alive-exception) ✗ 23:14:16 - openhands:DEBUG: runtime.py:297 - Getting container logs...
23:14:36 - openhands:DEBUG: runtime.py:297 - Getting container logs...
23:14:56 - openhands:DEBUG: runtime.py:297 - Getting container logs...
23:15:16 - openhands:DEBUG: runtime.py:297 - Getting container logs...
23:15:36 - openhands:DEBUG: runtime.py:297 - Getting container logs...
ERROR:asyncio:Task exception was never retrieved
future: <Task finished name='Task-232' coro=<Runtime.on_event() done, defined at /Users/enyst/repos/odie/openhands/runtime/runtime.py:109> exception=ConnectionError(MaxRetryError("HTTPConnectionPool(host='localhost', port=32363): Max retries exceeded with url: /alive (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x16d0fc350>: Failed to establish a new connection: [Errno 61] Connection refused'))"))>

@tobitege (Collaborator, Author) commented Sep 15, 2024

future: <Task finished name='Task-232' coro=<Runtime.on_event() done, defined at /Users/enyst/repos/odie/openhands/runtime/runtime.py:109> exception=ConnectionError(MaxRetryError("HTTPConnectionPool(host='localhost', port=32363): Max retries exceeded with url: /alive (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x16d0fc350>: Failed to establish a new connection: [Errno 61] Connection refused'))"))>

As far as I can tell, that's all "normal" behavior: the runtime's on_event fetches logs, e.g. before an action is about to be executed.
The problem here is potentially that the app inside the container (or the websocket?) just died.
The ConnectionRefusedError is raised by the Jupyter plugin's _connect method (which isn't ideal; from the logs one might think it is the Docker or websocket connection).
Seeing the intermittent errors in integration tests, I'd still go through with the revert here at least.

Maybe @xingyaoww has additional ideas on how to tweak/change either the retry/tenacity and/or the LogBuffer?
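A hedged aside that may explain the traceback (not something stated in the PR): requests wraps a refused TCP connection in requests.exceptions.ConnectionError, which is not a subclass of the builtin ConnectionRefusedError, so a tenacity predicate on ConnectionRefusedError only matches exceptions raised explicitly, such as the one from the Jupyter plugin mentioned above:

import requests

# Hedged illustration: assuming nothing is listening on this port (taken from the
# log above), requests raises requests.exceptions.ConnectionError wrapping urllib3's
# MaxRetryError / NewConnectionError, the same shape as the traceback.
try:
    requests.get('http://localhost:32363/alive', timeout=1)
except requests.exceptions.ConnectionError as exc:
    # Prints False: retry_if_exception_type(ConnectionRefusedError) would not retry on this.
    print(isinstance(exc, ConnectionRefusedError))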

@enyst (Collaborator) commented Sep 15, 2024

We can use retry_if_not_exception_type to avoid the retry: https://tenacity.readthedocs.io/en/latest/api.html?highlight=except#tenacity.retry.retry_if_not_exception_type

Edit: that doesn't solve the problem. At some point the app has "died" and the runtime client keeps trying to connect, or something like that is happening. The exception seen in the retry is ConnectionError.

FWIW, the experience is worse with the UI than without. Some timeouts we use mean that it just won't let go for several minutes.

I agree some revert is still good... but wait. I thought it actually did work well to simply revert the commit. But that's a bit strange, actually...
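A quick sketch of the alternative linked above; the stop/wait values and the excluded exception type are placeholders for illustration, not something proposed in the PR:

import tenacity

@tenacity.retry(
    stop=tenacity.stop_after_delay(120),  # placeholder stop condition
    wait=tenacity.wait_fixed(2),          # placeholder wait strategy
    # Retry on everything except the exception we want to propagate immediately,
    # e.g. a KeyboardInterrupt, so a Ctrl+C during the alive check is not retried.
    retry=tenacity.retry_if_not_exception_type(KeyboardInterrupt),
    reraise=True,
)
def _wait_until_alive():
    ...  # the alive check itself is elided here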

try:
    response = self.session.get(f'{self.api_url}/alive')
    if response.status_code == 200:
        ...
except KeyboardInterrupt:
    logger.debug('KeyboardInterrupt: exiting _wait_until_alive.')
@enyst (Collaborator) commented:

I'd suggest not adding this, because it doesn't work as expected. The message does not show up in the logs.

It will confuse us when we see it in the code and assume it worked. 😢

@tobitege (Collaborator, Author) replied Sep 16, 2024:

I've reverted all changes to _wait_until_alive now.

@enyst (Collaborator) commented Sep 15, 2024

Thanks for looking into this. I think we can merge the removal of the container logs line; on the PR branch it does appear to work (not always, but better...).

@tobitege tobitege merged commit a45b20a into main Sep 16, 2024
@tobitege tobitege deleted the wait-alive-exception branch September 16, 2024 02:25
Labels: enhancement

Successfully merging this pull request may close these issues: [Bug]: Ctrl+C no longer stops the app (sometimes?)

2 participants