
(fix) runtime: tweak _wait_until_alive tenacity and exception handling #3878

Merged (3 commits into main from wait-alive-exception, Sep 16, 2024)

Conversation

@tobitege (Collaborator) commented Sep 15, 2024

Short description (may be used for the CHANGELOG):

EventStreamRuntime: tweak _wait_until_alive tenacity and exception handling

Fixes #3876

@tobitege tobitege added the enhancement label Sep 15, 2024
@tobitege tobitege requested a review from enyst September 15, 2024 20:03
wait=tenacity.wait_exponential(multiplier=2, min=2, max=20),
retry=tenacity.retry_if_exception_type(ConnectionRefusedError),
reraise=True, # Re-raise exceptions after retries
retry_error_callback=lambda retry_state: None,
@enyst (Collaborator) commented:
If I'm reading this right, now we retry only if it's ConnectionRefusedError. Is that the only one we use?

@tobitege (Collaborator, Author) replied Sep 15, 2024:

Reverted the tenacity changes.

@enyst (Collaborator) replied:

The revert might be good, I was just seeing some other exception in _wait_until_alive. Thanks, let me give it a quick run, this is so weird to reproduce but a few tries won't hurt. 😅
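For context, a minimal sketch of how the keyword arguments quoted at the top of this thread sit on the tenacity decorator. The stop condition, session setup, and method body are assumptions for illustration, not the actual OpenHands code:

import requests
import tenacity

class EventStreamRuntime:
    def __init__(self, api_url: str):
        # Illustrative setup; the real runtime builds these elsewhere.
        self.api_url = api_url
        self.session = requests.Session()

    @tenacity.retry(
        stop=tenacity.stop_after_attempt(10),  # assumed stop condition
        wait=tenacity.wait_exponential(multiplier=2, min=2, max=20),
        retry=tenacity.retry_if_exception_type(ConnectionRefusedError),
        reraise=True,  # re-raise exceptions after retries...
        # ...though when retry_error_callback is set, tenacity returns the
        # callback's result (None) once retries are exhausted instead of re-raising.
        retry_error_callback=lambda retry_state: None,
    )
    def _wait_until_alive(self):
        response = self.session.get(f'{self.api_url}/alive')
        if response.status_code != 200:
            # Illustrative; per the discussion in this PR, the real exception
            # comes from the Jupyter plugin's _connect method.
            raise ConnectionRefusedError('runtime /alive check failed')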

@tobitege tobitege requested a review from enyst September 15, 2024 20:18
@enyst (Collaborator) commented Sep 15, 2024

It's still doing this:

  • (llama): open the UI, write: 42
  • Llama starts to create an app.py as described by our example in the prompt... 😅
  • I let it run for ~18-20 steps (it got BrowsingAgent to work but it didn't finish)
  • do Ctrl+C

Then two things happen:

  • first, it seems to run more steps. I think this shouldn't happen, and didn't happen in the past
  • second, it tries to get container logs, over and over.

The latter looks like this (the lines with DEBUG appeared when I pressed Ctrl+C again):

23:13:28 - openhands:DEBUG: runtime.py:297 - Getting container logs...
23:13:32 - openhands:DEBUG: runtime.py:297 - Getting container logs...

openhands-py3.12(base) ➜  odie git:(wait-alive-exception) ✗ 23:13:40 - openhands:DEBUG: runtime.py:297 - Getting container logs...
23:13:56 - openhands:DEBUG: runtime.py:297 - Getting container logs...

openhands-py3.12(base) ➜  odie git:(wait-alive-exception) ✗ 23:14:16 - openhands:DEBUG: runtime.py:297 - Getting container logs...
23:14:36 - openhands:DEBUG: runtime.py:297 - Getting container logs...
23:14:56 - openhands:DEBUG: runtime.py:297 - Getting container logs...
23:15:16 - openhands:DEBUG: runtime.py:297 - Getting container logs...
23:15:36 - openhands:DEBUG: runtime.py:297 - Getting container logs...
ERROR:asyncio:Task exception was never retrieved
future: <Task finished name='Task-232' coro=<Runtime.on_event() done, defined at /Users/enyst/repos/odie/openhands/runtime/runtime.py:109> exception=ConnectionError(MaxRetryError("HTTPConnectionPool(host='localhost', port=32363): Max retries exceeded with url: /alive (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x16d0fc350>: Failed to establish a new connection: [Errno 61] Connection refused'))"))>

@tobitege (Collaborator, Author) commented Sep 15, 2024

future: <Task finished name='Task-232' coro=<Runtime.on_event() done, defined at /Users/enyst/repos/odie/openhands/runtime/runtime.py:109> exception=ConnectionError(MaxRetryError("HTTPConnectionPool(host='localhost', port=32363): Max retries exceeded with url: /alive (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x16d0fc350>: Failed to establish a new connection: [Errno 61] Connection refused'))"))>

As far as I can tell, that's all "normal" behavior: the runtime's on_event fetches logs, e.g. before an action is about to be executed.
The problem here is potentially that the app inside the container (or the websocket?) just died.
The ConnectionRefusedError is raised by the Jupyter plugin's _connect method (which isn't ideal; from the logs one might think it is the Docker or websocket connection).
Seeing the intermittent errors in integration tests, I'd still go through with the revert here at least.

Maybe @xingyaoww has additional ideas on how to tweak/change either the retry/tenacity and/or the LogBuffer?
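A hedged aside that may explain the traceback (not something stated in the PR): requests wraps a refused TCP connection in requests.exceptions.ConnectionError, which is not a subclass of the builtin ConnectionRefusedError, so a tenacity predicate on ConnectionRefusedError only matches exceptions raised explicitly, such as the one from the Jupyter plugin mentioned above:

import requests

# Hedged illustration: assuming nothing is listening on this port (taken from the
# log above), requests raises requests.exceptions.ConnectionError wrapping urllib3's
# MaxRetryError / NewConnectionError, the same shape as the traceback.
try:
    requests.get('http://localhost:32363/alive', timeout=1)
except requests.exceptions.ConnectionError as exc:
    # Prints False: retry_if_exception_type(ConnectionRefusedError) would not retry on this.
    print(isinstance(exc, ConnectionRefusedError))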

@enyst (Collaborator) commented Sep 15, 2024

We can use retry_if_not_exception_type to avoid the retry: https://tenacity.readthedocs.io/en/latest/api.html?highlight=except#tenacity.retry.retry_if_not_exception_type

Edit: that doesn't solve the problem. At some point the app has "died" and the runtime client keeps trying to connect, or something like that is happening. The exception seen in the retry is ConnectionError.

FWIW, the experience is worse with the UI than without. Some timeouts we use mean that it just won't let go for several minutes.

I agree some revert is still good... but wait. I thought it actually did work well to simply revert the commit. But that's a bit strange, actually...
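A quick sketch of the alternative linked above; the stop/wait values and the excluded exception type are placeholders for illustration, not something proposed in the PR:

import tenacity

@tenacity.retry(
    stop=tenacity.stop_after_delay(120),  # placeholder stop condition
    wait=tenacity.wait_fixed(2),          # placeholder wait strategy
    # Retry on everything except the exception we want to propagate immediately,
    # e.g. a KeyboardInterrupt, so a Ctrl+C during the alive check is not retried.
    retry=tenacity.retry_if_not_exception_type(KeyboardInterrupt),
    reraise=True,
)
def _wait_until_alive():
    ...  # the alive check itself is elided here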

try:
    response = self.session.get(f'{self.api_url}/alive')
    if response.status_code == 200:
        ...
except KeyboardInterrupt:
    logger.debug('KeyboardInterrupt: exiting _wait_until_alive.')
@enyst (Collaborator) commented:

I'd suggest not adding this, because it doesn't work as expected. The message does not show up in the logs.

It will confuse us when we see it in the code and assume it worked. 😢

@tobitege (Collaborator, Author) replied Sep 16, 2024:

I've reverted all changes to _wait_until_alive now.

@enyst (Collaborator) commented Sep 15, 2024

Thanks for looking into this. I think we can merge the removal of the container logs line; on the PR branch it does appear to work (not always, but better...).

@tobitege tobitege merged commit a45b20a into main Sep 16, 2024
@tobitege tobitege deleted the wait-alive-exception branch September 16, 2024 02:25
Labels: enhancement

Successfully merging this pull request may close these issues: [Bug]: Ctrl+C no longer stops the app (sometimes?)

2 participants