[nexus] Explicitly terminate pools for qorb #6881
Conversation
Looks good! Just a few questions.
datastore.terminate().await; // shut down the qorb pool's background tasks first
db.cleanup().await.unwrap(); // then stop the test-only CockroachDB instance
logctx.cleanup_successful(); // finally, tear down the test's log context
How hard would it be to abstract this out into a cleanup function? Seems a little excessive to have to add it everywhere.
Making this an abstract function would also help with things like ensuring the right order. (For example, can `db.cleanup().await` and `datastore.terminate().await` be interchanged? Is it valid to use `tokio::join!` across both?)
I gave this a shot in 320960f, for all of `nexus/db-queries`.
For bad reasons, it's harder for other parts of Nexus to use this helper struct defined in `nexus/db-queries/src/db/datastore/pub_test_utils.rs` - the structure itself depends on `nexus-db-queries`, but also on `nexus-test-utils`. I appear to be bumping into circular dependency issues making this struct usable everywhere across Omicron.
But at least for everything in `db-queries`, I've made this conversion, and we're only using a single `.terminate().await` call in each test.
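For illustration, a minimal sketch of the shape such a helper could take (the names and types below are hypothetical stand-ins, not the actual `pub_test_utils.rs` API):

```rust
use std::sync::Arc;

// Hypothetical sketch: `DataStore` and `CockroachInstance` stand in for
// the real Omicron types. The point is that teardown ordering lives in
// one place instead of being repeated in every test.
pub struct TestDatabase {
    datastore: Arc<DataStore>,
    db: CockroachInstance,
}

impl TestDatabase {
    /// Terminate the datastore's connection pool before stopping the
    /// database, so no pool task is still talking to CRDB as it exits.
    pub async fn terminate(mut self) {
        self.datastore.terminate().await;
        self.db.cleanup().await.unwrap();
    }
}
```

With a helper like this, each test's teardown collapses to a single `terminate().await` call, and the ordering question above is answered once, in one place.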
I'm taking this on as a follow-up PR, to refactor the testing utilities and make this usable outside `nexus-db-queries`.
This must have been so tedious -- thanks for doing this.
hmm, this timeout waiting for a test run to exit seems like it could, conceivably, be related? https://buildomat.eng.oxide.computer/wg/0/details/01JADQS55ZJQ9HD1PQCW9CMAT8/ykpQYTGG5Xqoq6TY4bO1jA49o9OWzhxQstcpHFCebPRUhWPd/01JADQSX7AT6DE733S18DNAT71#S6648
thanks for doing this, i'm so sorry you had to hand-update basically every single test ever <3
I'm... confused by this test failure. It's from https://github.com/oxidecomputer/omicron/blob/main/dev-tools/omicron-dev/tests/test-omicron-dev.rs#L128, and seems to be sending SIGINT to a bunch of things to simulate calling "Control + C". I've spun up scripts on both my Linux and illumos boxes to run this test in a loop, and I'm seeing no failures. But I agree, this does seem related to shutdown. That's the weird part to me, though: this is definitely getting called by our other tests, and working fine?
Might be worth sending out a PSA once this lands, over email and in the channel. Thanks again for getting this done.
As a follow-up to #6881, refactors the `TestDatabase` logic within `nexus_db_queries` to be usable outside the crate too, if the `testing` feature is enabled.
Fixes #6505, integrates usage of the new qorb APIs to terminate pools cleanly: oxidecomputer/qorb#45
Background
`qorb` was integrated into Omicron in #5876 ("Qorb integration as connection pool for database"). I used it to connect to our database backend (CockroachDB). This included usage in tests, even with a "single backend host" (one test-only CRDB server) -- I wanted to ensure that we used the same pool and connector logic in tests and prod.

What Went Wrong
As identified in #6505, we saw some tests failing during termination. The specific cause of the failure was a panic from async-bb8-diesel, where we attempted to spawn tasks on a terminating tokio executor. This issue stems from async-bb8-diesel's usage of `tokio::task::spawn_blocking`, where the returned `JoinHandle` is immediately awaited and unwrapped, with the expectation that "we should always be able to spawn a blocking task".

There was a separate discussion about "whether or not async-bb8-diesel should be unwrapping in this case" (see: oxidecomputer/async-bb8-diesel#77), but instead I chose to focus on the question:
Why are we trying to send requests to async-bb8-diesel while the tokio runtime is exiting?
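To make the failure mode concrete, here is a minimal sketch of the pattern described above (illustrative only, not async-bb8-diesel's actual code): the `JoinHandle` from `spawn_blocking` is awaited and unwrapped on the spot, so a runtime that is shutting down turns into a panic.

```rust
// Illustrative sketch, not async-bb8-diesel's actual code.
async fn run_blocking<F, T>(f: F) -> T
where
    F: FnOnce() -> T + Send + 'static,
    T: Send + 'static,
{
    tokio::task::spawn_blocking(f)
        .await
        // If the runtime is terminating, the blocking task is cancelled
        // and this await returns a JoinError; the expect() here is the
        // panic observed in #6505.
        .expect("we should always be able to spawn a blocking task")
}
```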
The answer to this question lies in qorb's internals -- qorb itself spawns many tokio tasks to handle ongoing work, including monitoring DNS resolution, checking on backend health, and making connections before they are requested. One of these qorb tasks, calling `ping_async` (which checks connection health), used the `async-bb8-diesel` interface that ultimately panicked.

Within qorb, most of these tokio tasks have a drop implementation of the form:
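```rust
// A sketch of the form described (not qorb's exact code): a worker owns
// the JoinHandle of its background task, and dropping it merely
// *requests* that the task stop.
struct Worker {
    handle: tokio::task::JoinHandle<()>,
}

impl Drop for Worker {
    fn drop(&mut self) {
        // abort() signals cancellation at the task's next yield point;
        // it does not wait for the task to actually finish.
        self.handle.abort();
    }
}
```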
Tragically, without async drop support in Rust, this is the best we can do implicitly -- signal that the background tasks should stop ASAP -- but that may not be soon enough! Calling `.abort()` on the `JoinHandle` does not terminate the task immediately; it simply signals that it should shut down at the next yield point.

As a result, we can still see the flake observed in #6505:
- A test finishes and calls `drop` on the `qorb` pool, which signals the background task should abort.
- A background task calls `ping_async`, which calls `spawn_blocking`. This fails, because the tokio runtime is terminating, and returns a `JoinError::Cancelled`.
- `async-bb8-diesel` unwraps this `JoinError`, and the test panics.

How do we mitigate this?
That's where this PR comes in. Using the new qorb APIs, we don't rely on the synchronous drop methods -- we explicitly call `.terminate().await` functions, which do the following:
- Use `tokio::sync::oneshot`s as signals to the `tokio::task`s that they should exit
- `.await` the `JoinHandle` for those tasks before returning

Doing this work explicitly as a part of cleanup ensures that there are not any background tasks attempting to do new work while the tokio runtime is terminating.