[RFC] handle runtime shutdowns while tasks are being spawned #77
Conversation
Created using spr 1.3.6-beta.1
I really don't know how to feel about this! It really sucks and adds a significant extra burden to callers, but I think it might be necessary.
@hawkw wonder if you have thoughts about this. Feels unfortunate how janky this is
src/error.rs
Outdated

```rust
/// Errors encountered while running a function on a connection pool.
#[derive(Error, Debug, Clone, Copy, Eq, PartialEq)]
pub enum RunError<E> {
```
Does this actually need to be generic? Could we just have `E` be `DieselError`?

I'm aware that (before this PR) there were spots where we used `E: From<DieselError>` -- seems like those spots could become `E: From<RunError>` instead, but without making `RunError` itself generic.
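For illustration, the non-generic shape being suggested might look something like this (a sketch using `thiserror`; not the crate's actual definition, and the `Diesel` variant payload is an assumption):

```rust
use thiserror::Error;

/// Sketch of a non-generic RunError (illustrative, not the real crate API).
#[derive(Error, Debug)]
pub enum RunError {
    /// The underlying Diesel operation failed.
    #[error(transparent)]
    Diesel(#[from] diesel::result::Error),
    /// The runtime began shutting down before the blocking task ran.
    #[error("tokio runtime is shutting down")]
    RuntimeShutdown,
}

/// Call sites that previously required `E: From<DieselError>` could
/// instead require `E: From<RunError>`:
fn propagate<T, E: From<RunError>>(res: Result<T, RunError>) -> Result<T, E> {
    res.map_err(E::from)
}
```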
I'm not sure! I'll try it out. I was trying to minimize changes in the PR but not having a generic here would certainly simplify things.
Removing the type parameter is possible as long as we change the `run` callback to always return a `DieselError`. Another option is to introduce a separate error type just for that function.
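A hypothetical sketch of the first option, with the callback's error pinned to Diesel's error type (the signature is illustrative, not the crate's real trait method; `RunError` here is the non-generic enum sketched in the previous comment):

```rust
// Hypothetical, simplified `run`: the callback always returns
// `diesel::result::Error`, so `RunError` needs no type parameter.
async fn run<Conn, R, F>(_conn: Conn, f: F) -> Result<R, RunError>
where
    Conn: Send + 'static,
    R: Send + 'static,
    F: FnOnce(&mut Conn) -> Result<R, diesel::result::Error> + Send + 'static,
{
    // The real implementation would move `f` into `spawn_blocking` and
    // map a cancelled `JoinError` to `RunError::RuntimeShutdown`.
    todo!()
}
```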
src/async_traits.rs
Outdated

```rust
                }
            }
        }
    }
}

fn handle_spawn_blocking_error<T, E>(
```
If we're going to be propagating errors out during runtime shutdown, this seems reasonable to me (modulo the comment on whether we can collapse one layer of generics, see my other comment).
However, I'm still a little confused here - before this PR, we had a call-stack of:
- User code
- ... calls async_bb8_diesel API
- ... which tries to spawn a task and fails, leading to a panic
Now, we'll have:
- User code
- ... calls async_bb8_diesel API
- ... which tries to spawn a task and fails, returning an error
But what do we expect the caller to do? In the shutdown case, won't the caller most likely panic anyway?
Put another way -- what useful thing do we expect the caller to do other than panic here?
Also, to be clear, I'm asking this question not because I believe propagating the error is wrong, but because I fear I may have a gap in my understanding of "why this will make the oxidecomputer/omicron#6505 issues less bad".
Ah yeah that's what I was talking about in my comments — the caller would have to learn to squelch the error.
Basically there is some point in the call stack at which the return value would be `()`, or some other type with a sensible default value. At that point, the caller would have to examine the error and immediately return the default value.
One alternative would be to move the runtime shutdown situation into the `Ok` case, but that feels even weirder.
Definitely not a great situation.
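To make the squelching idea concrete, a caller at such a boundary might look like this (entirely hypothetical names; only the `RuntimeShutdown` variant mirrors this PR):

```rust
// A hypothetical sketch of "squelching" at a ()-returning boundary.
#[derive(Debug)]
enum RunError {
    RuntimeShutdown,
    Other(String),
}

async fn do_db_work() -> Result<(), RunError> {
    // Stand-in for an async-bb8-diesel call; pretend the runtime is
    // shutting down.
    Err(RunError::RuntimeShutdown)
}

async fn background_tick() {
    match do_db_work().await {
        Ok(()) => {}
        // The runtime is shutting down: nothing useful is left to do,
        // so immediately return the default value (here, `()`).
        Err(RunError::RuntimeShutdown) => return,
        Err(err) => eprintln!("db work failed: {err:?}"),
    }
}
```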
I think a useful next step would be to try patching this PR into Omicron and seeing how to manage this situation.
So. I agree that this is a bit unfortunate, but I also agree that the only better solution is just "have upstream code promise to never shut down the runtime", which I don't know if we can reasonably do. I do agree with #77 (comment) that we should probably see what callers in Omicron would actually do with the new error variant before committing to this change.
Beyond that, I had some very small suggestions, but none of them are particularly important to me.
src/async_traits.rs
Outdated

```rust
Err(err) => {
    if err.is_cancelled() {
        // The only way a spawn_blocking task can be marked cancelled
        // is if the runtime started shutting down _before_
        // spawn_blocking was called.
        Err(RunError::RuntimeShutdown)
    } else if err.is_panic() {
        // Propagate panics.
        std::panic::panic_any(err.into_panic());
    } else {
        // Not possible to reach this as of Tokio 1.40, but maybe in
        // future versions.
        panic!("unexpected JoinError: {:?}", err);
    }
```
very minor style suggestion, take it or leave it: i might roll the if conditions into the match, personally:
```rust
// The only way a spawn_blocking task can be marked cancelled
// is if the runtime started shutting down _before_
// spawn_blocking was called.
Err(err) if err.is_cancelled() => Err(RunError::RuntimeShutdown),
// Propagate panics.
Err(err) if err.is_panic() => std::panic::panic_any(err.into_panic()),
// Not possible to reach this as of Tokio 1.40, but maybe in
// future versions.
Err(err) => panic!("unexpected JoinError: {:?}", err),
```
src/async_traits.rs
Outdated

```rust
} else {
    // Not possible to reach this as of Tokio 1.40, but maybe in
    // future versions.
    panic!("unexpected JoinError: {:?}", err);
```
also tremendously nitpicky; i might consider `unreachable!()` here, and i might also reword the panic message to make it explicit that this is likely a result of a new Tokio version adding new join error variants?
Changed the panic message. I deliberately didn't use `unreachable!()` because I felt that this branch is potentially reachable with future Tokio versions; I think of `unreachable!()` as handling cases where you can prove that a line of code won't be hit (e.g. you checked for a value earlier in a function, then panicked on seeing that value again). wdyt?
Created using spr 1.3.6-beta.1
Tried integrating this into omicron: oxidecomputer/omicron#6875. It does appear to be possible, but it's a lot of work for marginal gain. oxidecomputer/qorb#45 should work well enough, so closing this.
Fixes #6505, integrates usage of the new qorb APIs to terminate pools cleanly: oxidecomputer/qorb#45

# Background

- [qorb](https://github.com/oxidecomputer/qorb) is a connection pooling crate, which spins up tokio tasks that attempt to connect to backends.
- When `qorb` was integrated into Omicron in #5876, I used it to connect to our database backend (CockroachDB). This included usage in tests, even with a "single backend host" (one, test-only CRDB server) -- I wanted to ensure that we used the same pool and connector logic in tests and prod.

# What Went Wrong

As identified in #6505, we saw some tests failing during **termination**. The specific cause of the failure was a panic from [async-bb8-diesel](https://github.com/oxidecomputer/async-bb8-diesel), where we attempted to spawn tasks with a terminating tokio executor.

This issue stems from async-bb8-diesel's usage of `tokio::task::spawn_blocking`, where the returned `JoinHandle` is immediately awaited and **unwrapped**, with an expectation that "we should always be able to spawn a blocking task". There was a separate discussion about whether or not async-bb8-diesel should be unwrapping in this case (see: oxidecomputer/async-bb8-diesel#77), but instead, I chose to focus on the question:

## Why are we trying to send requests to async-bb8-diesel while the tokio runtime is exiting?

The answer to this question lies in qorb's internals -- qorb itself spawns many tokio tasks to handle ongoing work, including monitoring DNS resolution, checking on backend health, and making connections before they are requested. One of these qorb tasks, which calls `ping_async` to check connection health, used the `async-bb8-diesel` interface that ultimately panicked.

Within qorb, most of these tokio tasks have a drop implementation of the form:

```rust
struct MyQorbStructure {
    handle: tokio::task::JoinHandle<()>,
}

impl Drop for MyQorbStructure {
    fn drop(&mut self) {
        self.handle.abort();
    }
}
```

Tragically, without async drop support in Rust, this is the best we can do implicitly -- signal that the background tasks should stop ASAP -- but that may not be soon enough! Calling `.abort()` on the `JoinHandle` does not terminate the task immediately; it simply signals that it should shut down at the next yield point.

As a result, we can still see the flake observed in #6505:

- A qorb pool is initialized with background tasks.
- One of the qorb worker tasks is about to make a call to check on the health of a connection to a backend.
- The test finishes and returns. The tokio runtime begins terminating.
- We call `drop` on the `qorb` pool, which signals that the background task should abort.
- The aforementioned qorb worker task makes the call to `ping_async`, which calls `spawn_blocking`. This fails, because the tokio runtime is terminating, and returns a [JoinError::Cancelled](https://buildomat.eng.oxide.computer/wg/0/details/01J9YQVN7X5EQNXFSEY6XJBH8B/zfviqPz9RoJp3bY4TafbyqXTwbhqdr7w4oupqBtVARR00CXF/01J9YQWAXY36WM0R2VG27QMFRK#S6049).
- `async-bb8-diesel` unwraps this `JoinError`, and the test panics.

# How do we mitigate this?

That's where this PR comes in. Using the new qorb APIs, we don't rely on the synchronous drop methods -- we explicitly call `.terminate().await` functions, which do the following:

- Use `tokio::sync::oneshot`s as signals to `tokio::task`s that they should exit
- `.await` the `JoinHandle` for those tasks before returning

Doing this work explicitly as a part of cleanup ensures that there are no background tasks attempting to do new work while the tokio runtime is terminating.
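As a rough illustration of that pattern (not qorb's actual API; the type and method names here are made up), explicit termination might look like:

```rust
use tokio::sync::oneshot;

// Illustrative only: qorb's real types and method names differ.
struct MyQorbStructure {
    shutdown_tx: Option<oneshot::Sender<()>>,
    handle: Option<tokio::task::JoinHandle<()>>,
}

impl MyQorbStructure {
    fn spawn() -> Self {
        let (shutdown_tx, mut shutdown_rx) = oneshot::channel();
        let handle = tokio::spawn(async move {
            loop {
                tokio::select! {
                    // Stop promptly when asked, instead of relying on
                    // `abort()` taking effect at some future yield point.
                    _ = &mut shutdown_rx => break,
                    _ = do_background_work() => {}
                }
            }
        });
        Self {
            shutdown_tx: Some(shutdown_tx),
            handle: Some(handle),
        }
    }

    // Explicit async cleanup: signal the task, then wait for it to
    // finish before returning, so no work races with runtime shutdown.
    async fn terminate(&mut self) {
        if let Some(tx) = self.shutdown_tx.take() {
            let _ = tx.send(());
        }
        if let Some(handle) = self.handle.take() {
            let _ = handle.await;
        }
    }
}

async fn do_background_work() {
    // Placeholder for e.g. a health-check ping.
    tokio::time::sleep(std::time::Duration::from_secs(1)).await;
}
```

The key difference from the `Drop` version is that `terminate` awaits the task's `JoinHandle`, so by the time it returns, no background work can race with runtime shutdown.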
If a runtime has been marked as shutting down before `spawn_blocking` is called, or before the blocking task is first scheduled, the `JoinHandle` will return a cancellation error. Currently, this panics, which tends to manifest as test failures (e.g. oxidecomputer/omicron#6505).
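That behavior can be reproduced in isolation. This is my own sketch, not a test from this PR, assuming (as I understand tokio's semantics) that tasks spawned on a shut-down runtime are immediately cancelled:

```rust
fn main() {
    // Create a runtime, keep a handle to it, then shut it down.
    let rt = tokio::runtime::Runtime::new().unwrap();
    let handle = rt.handle().clone();
    rt.shutdown_background();

    // Spawning a blocking task on the shut-down runtime should yield a
    // JoinHandle that resolves to a cancelled JoinError rather than
    // panicking at spawn time.
    let join = handle.spawn_blocking(|| ());

    // Await the JoinHandle from a second, live runtime.
    let rt2 = tokio::runtime::Runtime::new().unwrap();
    let err = rt2.block_on(join).unwrap_err();
    assert!(err.is_cancelled());
}
```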
Instead, add a `RunError` wrapper that captures the cancellation situation. Callers might choose to bubble this error up or to squelch it -- unfortunately, we can't make this decision.
I haven't written tests yet, but I should be able to if this looks reasonable.
The API surface here is really unfortunate, but I can't see a better way to do
this than to propagate errors up the stack.