[RFC] handle runtime shutdowns while tasks are being spawned #77

Conversation


@sunshowers sunshowers commented Oct 12, 2024

If a runtime has been marked as shutting down before spawn_blocking is
called, or before the blocking task is first scheduled, the JoinHandle will
return a cancellation error. Currently, this panics, which tends to manifest as
test failures (e.g. oxidecomputer/omicron#6505).

Instead, add a RunError wrapper that captures the cancellation situation.
Callers might choose to bubble this error up or to squelch it -- unfortunately,
we can't make this decision for them.

I haven't written tests yet, but I should be able to if this looks reasonable.

The API surface here is really unfortunate, but I can't see a better way to do
this than to propagate errors up the stack.

Created using spr 1.3.6-beta.1
@sunshowers sunshowers (Author) commented Oct 12, 2024

I really don't know how to feel about this! It really sucks and adds a significant extra burden to callers, but I think it might be necessary.

  • If Diesel were async and we weren't using tasks, this wouldn't be a problem because the calling code would also get shut down as part of the same task.
  • If Tokio had a way to "propagate cancellations across tasks", e.g. by panicking the current task with a special type rather than returning a value, this also wouldn't be a problem. (Though that might still not be good enough at squelching errors properly.)
  • Unfortunately we must return a value, so we have to add an enum variant to return.

@sunshowers sunshowers (Author) commented

@hawkw, wondering if you have thoughts about this. It feels unfortunate how janky this is.

src/error.rs Outdated

/// Errors encountered while running a function on a connection pool.
#[derive(Error, Debug, Clone, Copy, Eq, PartialEq)]
pub enum RunError<E> {
Collaborator

Does this actually need to be generic? Could we just have E be DieselError?

I'm aware that (before this PR) there were spots where we used E: From<DieselError> -- seems like those spots could become E: From<RunError> instead, but without making RunError itself be generic.

Author

I'm not sure! I'll try it out. I was trying to minimize changes in the PR but not having a generic here would certainly simplify things.

@sunshowers sunshowers (Author) Oct 15, 2024

Removing the type parameter is possible as long as we change the run callback to always return a DieselError. Another option is to introduce a separate error type just for that function.
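For illustration, a rough sketch of the non-generic shape under discussion might look like the following (this is not the code in this PR; it assumes the `run` callback is pinned to returning `diesel::result::Error`):

```rust
use diesel::result::Error as DieselError;
use thiserror::Error;

/// Sketch only: a non-generic version of the error type discussed above.
/// The variant names and derives here are illustrative, not the PR's API.
#[derive(Error, Debug)]
pub enum RunError {
    /// The underlying Diesel operation failed.
    #[error(transparent)]
    Diesel(#[from] DieselError),

    /// The runtime began shutting down before the blocking closure could run.
    #[error("the tokio runtime is shutting down")]
    RuntimeShutdown,
}
```

Callers that previously required `E: From<DieselError>` could then require `E: From<RunError>` instead, as suggested above.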


fn handle_spawn_blocking_error<T, E>(
Collaborator

If we're going to be propagating errors out during runtime shutdown, this seems reasonable to me (modulo the comment on whether we can collapse one layer of generics, see my other comment).

However, I'm still a little confused here - before this PR, we had a call-stack of:

  • User code
  • ... calls async_bb8_diesel API
  • ... which tries to spawn a task and fails, leading to a panic

Now, we'll have:

  • User code
  • ... calls async_bb8_diesel API
  • ... which tries to spawn a task and fails, returning an error

But what do we expect the caller to do? In the shutdown case, won't the caller most likely panic anyway?

Put another way -- what useful thing do we expect the caller to do other than panic here?

Collaborator

Also, to be clear, I'm asking this question not because I believe propagating the error is wrong, but because I fear I may have a gap in my understanding of "why this will make the oxidecomputer/omicron#6505 issues less bad".

Author

Ah yeah that's what I was talking about in my comments — the caller would have to learn to squelch the error.

Basically there is some point in the call stack at which the return value would be (), or some other type with a sensible default value. At that point, the caller would have to examine the error and immediately return the default value.

One alternative would be to move the runtime shutdown situation into the Ok case but that feels even weirder.

Definitely not a great situation.
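For illustration, the squelching described above might look like this sketch; every name here is a placeholder rather than API from this PR:

```rust
// Hypothetical sketch: a caller whose return type is () examines the error
// and returns early on runtime shutdown instead of panicking.
#[derive(Debug)]
enum RunError {
    Diesel(String),
    RuntimeShutdown,
}

// Stand-in for an async-bb8-diesel call that may observe runtime shutdown.
async fn do_db_work() -> Result<u64, RunError> {
    Ok(42)
}

async fn refresh_cache() {
    match do_db_work().await {
        Ok(rows) => println!("loaded {rows} rows"),
        // The runtime is going away; there is nothing useful left to do, so
        // return the sensible default (here, unit) instead of panicking.
        Err(RunError::RuntimeShutdown) => return,
        // Any other error is still worth surfacing.
        Err(err) => eprintln!("database error: {err:?}"),
    }
}
```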

Author

I think a useful next step would be to try patching this PR into Omicron and seeing how to manage this situation.

src/async_traits.rs Outdated (resolved)
@hawkw hawkw (Member) left a comment

So. I agree that this is a bit unfortunate, but I also agree that the only better solution is just "have upstream code promise to never shut down the runtime", which I don't know if we can reasonably do. I do agree with #77 (comment) that we should probably see what callers in Omicron would actually do with the new error variant before committing to this change.

Beyond that, I had some very small suggestions, but none of them are particularly important to me.

Comment on lines 429 to 442
Err(err) => {
    if err.is_cancelled() {
        // The only way a spawn_blocking task can be marked cancelled
        // is if the runtime started shutting down _before_
        // spawn_blocking was called.
        Err(RunError::RuntimeShutdown)
    } else if err.is_panic() {
        // Propagate panics.
        std::panic::panic_any(err.into_panic());
    } else {
        // Not possible to reach this as of Tokio 1.40, but maybe in
        // future versions.
        panic!("unexpected JoinError: {:?}", err);
    }
Member

very minor style suggestion, take it or leave it: i might roll the if conditions into the match, personally:

Suggested change (replacing the if/else chain above with match guards):

// The only way a spawn_blocking task can be marked cancelled
// is if the runtime started shutting down _before_
// spawn_blocking was called.
Err(err) if err.is_cancelled() => Err(RunError::RuntimeShutdown),
// Propagate panics.
Err(err) if err.is_panic() => std::panic::panic_any(err.into_panic()),
// Not possible to reach this as of Tokio 1.40, but maybe in
// future versions.
Err(err) => panic!("unexpected JoinError: {:?}", err),

Member

also tremendously nitpicky; i might consider unreachable!() here, and i might also reword the panic message to make it explicit that this is likely a result of a new Tokio version adding new join error variants?

Author

Changed the panic message. I deliberately didn't use unreachable because I felt that it was potentially reachable with future Tokio versions; I think of unreachable as handling cases where you can prove that a line of code won't be hit (e.g. you checked for a value earlier in a function, then panicked on seeing that value again). wdyt?

src/error.rs Outdated (resolved)
src/async_traits.rs Outdated (resolved)
Created using spr 1.3.6-beta.1
@sunshowers sunshowers (Author)

Tried integrating this into omicron: oxidecomputer/omicron#6875. It does appear to be possible but it's a lot of work for marginal gain.

oxidecomputer/qorb#45 should work well enough, so closing this.

@sunshowers sunshowers closed this Oct 15, 2024
@sunshowers sunshowers deleted the sunshowers/spr/rfc-handle-runtime-shutdowns-while-tasks-are-being-spawned branch October 15, 2024 19:19
smklein added a commit to oxidecomputer/omicron that referenced this pull request Oct 18, 2024
Fixes #6505, and integrates usage of the new qorb APIs to terminate pools
cleanly: oxidecomputer/qorb#45

# Background

- [qorb](https://github.com/oxidecomputer/qorb) is a connection pooling
crate, which spins up tokio tasks that attempt to connect to backends.
- When `qorb` was integrated into Omicron in
#5876, I used it to connect
to our database backend (CockroachDB). This included usage in tests,
even with a "single backend host" (one, test-only CRDB server) -- I
wanted to ensure that we used the same pool and connector logic in tests
and prod.

# What Went Wrong

As identified in #6505, we saw some tests failing during **termination**.
The specific cause of the failure was a panic from
[async-bb8-diesel](https://github.com/oxidecomputer/async-bb8-diesel),
where we attempted to spawn tasks on a terminating tokio executor.
This issue stems from async-bb8-diesel's usage of
`tokio::task::spawn_blocking`, where the returned `JoinHandle` is
immediately awaited and **unwrapped**, with an expectation that "we
should always be able to spawn a blocking task".
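Roughly, the problematic pattern looks like the following sketch (simplified for illustration; this is not the actual async-bb8-diesel source):

```rust
use tokio::task;

// The JoinHandle returned by spawn_blocking is awaited and unwrapped,
// on the assumption that spawning a blocking task always succeeds.
async fn run_blocking_query() -> usize {
    task::spawn_blocking(|| {
        // Synchronous Diesel work would happen here.
        42
    })
    .await
    // If the runtime is already shutting down, this resolves to a JoinError
    // with is_cancelled() == true, and expect() turns it into a panic.
    .expect("failed to join blocking task")
}
```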

There was a separate discussion about "whether or not async-bb8-diesel
should be unwrapping in this case" (see:
oxidecomputer/async-bb8-diesel#77), but instead,
I chose to focus on the question:

## Why are we trying to send requests to async-bb8-diesel while the tokio runtime is exiting?

The answer to this question lies in qorb's internals -- qorb itself
spawns many tokio tasks to handle ongoing work, including monitoring DNS
resolution, checking on backend health, and making connections before
they are requested. One of these qorb tasks, calling `ping_async` (which
checks connection health), used the `async-bb8-diesel` interface that
ultimately panicked.

Within qorb, most of these tokio tasks have a drop implementation of the
form:

```rust
struct MyQorbStructure {
  handle: tokio::task::JoinHandle<()>,
}

impl Drop for MyQorbStructure {
  fn drop(&mut self) {
    self.handle.abort();
  }
}
```

Tragically, without async drop support in Rust, this is the best we can
do implicitly -- signal that the background tasks should stop ASAP --
but that may not be soon enough! Calling `.abort()` on the `JoinHandle`
does not terminate the task immediately; it simply signals that it should
shut down at the next yield point.

As a result, we can still see the flake observed in #6505:

- A qorb pool is initialized with background tasks
- One of the qorb worker tasks is about to make a call to check on the
health of a connection to a backend
- The test finishes, and returns. The tokio runtime begins terminating.
- We call `drop` on the `qorb` pool, which signals that the background task
should abort.
- The aforementioned qorb worker task makes the call to `ping_async`,
which calls `spawn_blocking`. This fails, because the tokio runtime is
terminating, and returns a
[JoinError::Cancelled](https://buildomat.eng.oxide.computer/wg/0/details/01J9YQVN7X5EQNXFSEY6XJBH8B/zfviqPz9RoJp3bY4TafbyqXTwbhqdr7w4oupqBtVARR00CXF/01J9YQWAXY36WM0R2VG27QMFRK#S6049).
- `async-bb8-diesel` unwraps this `JoinError`, and the test panics.

# How do we mitigate this?

That's where this PR comes in. Using the new qorb APIs, we don't rely on
the synchronous drop methods -- we explicitly call `.terminate().await`
functions which do the following:

- Use `tokio::sync::oneshot`s as signals to `tokio::tasks` that they
should exit
- `.await` the `JoinHandle` for those tasks before returning

Doing this work explicitly as a part of cleanup ensures that there are
not any background tasks attempting to do new work while the tokio
runtime is terminating.
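A minimal sketch of that explicit-shutdown pattern is below; the struct and method names are illustrative, and qorb's actual API differs in its details:

```rust
use std::time::Duration;
use tokio::sync::oneshot;

struct Worker {
    shutdown_tx: Option<oneshot::Sender<()>>,
    handle: Option<tokio::task::JoinHandle<()>>,
}

impl Worker {
    fn new() -> Self {
        let (shutdown_tx, mut shutdown_rx) = oneshot::channel();
        let handle = tokio::spawn(async move {
            loop {
                tokio::select! {
                    // Stop as soon as terminate() signals us.
                    _ = &mut shutdown_rx => break,
                    // Stand-in for the periodic work (health checks, etc.).
                    _ = tokio::time::sleep(Duration::from_secs(1)) => {}
                }
            }
        });
        Self {
            shutdown_tx: Some(shutdown_tx),
            handle: Some(handle),
        }
    }

    /// Explicit async cleanup: signal the task, then await its JoinHandle so
    /// no background work can race with runtime shutdown.
    async fn terminate(&mut self) {
        if let Some(tx) = self.shutdown_tx.take() {
            let _ = tx.send(());
        }
        if let Some(handle) = self.handle.take() {
            let _ = handle.await;
        }
    }
}
```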