[RFC] handle runtime shutdowns while tasks are being spawned #77

Conversation


@sunshowers sunshowers commented Oct 12, 2024

If a runtime has been marked as shutting down before spawn_blocking is
called, or before the blocking task is first scheduled, the JoinHandle will
return a cancellation error. Currently, this panics, which tends to manifest as
test failures (e.g. oxidecomputer/omicron#6505).

Instead, add a RunError wrapper that captures the cancellation situation.
Callers might choose to bubble this error up or to squelch it -- unfortunately,
we can't make this decision for them.

I haven't written tests yet, but I should be able to if this looks reasonable.

The API surface here is really unfortunate, but I can't see a better way to do
this than to propagate errors up the stack.

Created using spr 1.3.6-beta.1
@sunshowers sunshowers (Author) commented Oct 12, 2024

I really don't know how to feel about this! It really sucks and adds a significant extra burden to callers, but I think it might be necessary.

  • If Diesel were async and we weren't using tasks, this wouldn't be a problem because the calling code would also get shut down as part of the same task.
  • If Tokio had a way to "propagate cancellations across tasks", e.g. by panicking the current task with a special type rather than returning a value, this also wouldn't be a problem. (Though that might still not be good enough at squelching errors properly.)
  • Unfortunately we must return a value, so we have to add an enum variant to return.

@sunshowers sunshowers (Author) commented

@hawkw, wondering if you have thoughts about this. It feels unfortunate how janky this is.

src/error.rs Outdated

/// Errors encountered while running a function on a connection pool.
#[derive(Error, Debug, Clone, Copy, Eq, PartialEq)]
pub enum RunError<E> {
Collaborator

Does this actually need to be generic? Could we just have E be DieselError?

I'm aware that (before this PR) there were spots where we used E: From<DieselError> -- seems like those spots could become E: From<RunError> instead, but without making RunError itself be generic.

Author

I'm not sure! I'll try it out. I was trying to minimize changes in the PR but not having a generic here would certainly simplify things.

@sunshowers sunshowers (Author) Oct 15, 2024

Removing the type parameter is possible as long as we change the run callback to always return a DieselError. Another option is to introduce a separate error type just for that function.
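For illustration, a rough sketch of the non-generic shape under discussion might look like the following (this is not the code in this PR; it assumes the `run` callback is pinned to returning `diesel::result::Error`):

```rust
use diesel::result::Error as DieselError;
use thiserror::Error;

/// Sketch only: a non-generic version of the error type discussed above.
/// The variant names and derives here are illustrative, not the PR's API.
#[derive(Error, Debug)]
pub enum RunError {
    /// The underlying Diesel operation failed.
    #[error(transparent)]
    Diesel(#[from] DieselError),

    /// The runtime began shutting down before the blocking closure could run.
    #[error("the tokio runtime is shutting down")]
    RuntimeShutdown,
}
```

Callers that previously required `E: From<DieselError>` could then require `E: From<RunError>` instead, as suggested above.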


fn handle_spawn_blocking_error<T, E>(
Collaborator

If we're going to be propagating errors out during runtime shutdown, this seems reasonable to me (modulo the comment on whether we can collapse one layer of generics, see my other comment).

However, I'm still a little confused here - before this PR, we had a call-stack of:

  • User code
  • ... calls async_bb8_diesel API
  • ... which tries to spawn a task and fails, leading to a panic

Now, we'll have:

  • User code
  • ... calls async_bb8_diesel API
  • ... which tries to spawn a task and fails, returning an error

But what do we expect the caller to do? In the shutdown case, won't the caller most likely panic anyway?

Put another way -- what useful thing do we expect the caller to do other than panic here?

Collaborator

Also, to be clear, I'm asking this question not because I believe propagating the error is wrong, but because I fear I may have a gap in my understanding of "why this will make the oxidecomputer/omicron#6505 issues less bad".

Author

Ah yeah that's what I was talking about in my comments — the caller would have to learn to squelch the error.

Basically there is some point in the call stack at which the return value would be (), or some other type with a sensible default value. At that point, the caller would have to examine the error and immediately return the default value.

One alternative would be to move the runtime shutdown situation into the Ok case but that feels even weirder.

Definitely not a great situation.
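For illustration, the squelching described above might look like this sketch; every name here is a placeholder rather than API from this PR:

```rust
// Hypothetical sketch: a caller whose return type is () examines the error
// and returns early on runtime shutdown instead of panicking.
#[derive(Debug)]
enum RunError {
    Diesel(String),
    RuntimeShutdown,
}

// Stand-in for an async-bb8-diesel call that may observe runtime shutdown.
async fn do_db_work() -> Result<u64, RunError> {
    Ok(42)
}

async fn refresh_cache() {
    match do_db_work().await {
        Ok(rows) => println!("loaded {rows} rows"),
        // The runtime is going away; there is nothing useful left to do, so
        // return the sensible default (here, unit) instead of panicking.
        Err(RunError::RuntimeShutdown) => return,
        // Any other error is still worth surfacing.
        Err(err) => eprintln!("database error: {err:?}"),
    }
}
```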

Author

I think a useful next step would be to try patching this PR into Omicron and seeing how to manage this situation.

src/async_traits.rs Outdated (resolved)
@hawkw hawkw (Member) left a comment

So. I agree that this is a bit unfortunate, but I also agree that the only better solution is just "have upstream code promise to never shut down the runtime", which I don't know if we can reasonably do. I do agree with #77 (comment) that we should probably see what callers in Omicron would actually do with the new error variant before committing to this change.

Beyond that, I had some very small suggestions, but none of them are particularly important to me.

Comment on lines 429 to 442
Err(err) => {
    if err.is_cancelled() {
        // The only way a spawn_blocking task can be marked cancelled
        // is if the runtime started shutting down _before_
        // spawn_blocking was called.
        Err(RunError::RuntimeShutdown)
    } else if err.is_panic() {
        // Propagate panics.
        std::panic::panic_any(err.into_panic());
    } else {
        // Not possible to reach this as of Tokio 1.40, but maybe in
        // future versions.
        panic!("unexpected JoinError: {:?}", err);
    }
Member

very minor style suggestion, take it or leave it: i might roll the if conditions into the match, personally:

Suggested change (replacing the if/else chain above with match guards):

// The only way a spawn_blocking task can be marked cancelled
// is if the runtime started shutting down _before_
// spawn_blocking was called.
Err(err) if err.is_cancelled() => Err(RunError::RuntimeShutdown),
// Propagate panics.
Err(err) if err.is_panic() => std::panic::panic_any(err.into_panic()),
// Not possible to reach this as of Tokio 1.40, but maybe in
// future versions.
Err(err) => panic!("unexpected JoinError: {:?}", err),

Member

also tremendously nitpicky; i might consider unreachable!() here, and i might also reword the panic message to make it explicit that this is likely a result of a new Tokio version adding new join error variants?

Author

Changed the panic message. I deliberately didn't use unreachable because I felt that it was potentially reachable with future Tokio versions; I think of unreachable as handling cases where you can prove that a line of code won't be hit (e.g. you checked for a value earlier in a function, then panicked on seeing that value again). wdyt?

src/error.rs Outdated (resolved)
src/async_traits.rs Outdated (resolved)
Created using spr 1.3.6-beta.1
@sunshowers sunshowers (Author)

Tried integrating this into omicron: oxidecomputer/omicron#6875. It does appear to be possible but it's a lot of work for marginal gain.

oxidecomputer/qorb#45 should work well enough, so closing this.

@sunshowers sunshowers closed this Oct 15, 2024
@sunshowers sunshowers deleted the sunshowers/spr/rfc-handle-runtime-shutdowns-while-tasks-are-being-spawned branch October 15, 2024 19:19
smklein added a commit to oxidecomputer/omicron that referenced this pull request Oct 18, 2024
Fixes #6505, and integrates usage of the new qorb APIs to terminate pools
cleanly: oxidecomputer/qorb#45

# Background

- [qorb](https://github.com/oxidecomputer/qorb) is a connection pooling
crate, which spins up tokio tasks that attempt to connect to backends.
- When `qorb` was integrated into Omicron in
#5876, I used it to connect
to our database backend (CockroachDB). This included usage in tests,
even with a "single backend host" (one, test-only CRDB server) -- I
wanted to ensure that we used the same pool and connector logic in tests
and prod.

# What Went Wrong

As identified in #6505, we saw some tests failing during **termination**.
The specific cause of the failure was a panic from
[async-bb8-diesel](https://github.com/oxidecomputer/async-bb8-diesel),
where we attempted to spawn tasks on a terminating tokio executor.
This issue stems from async-bb8-diesel's usage of
`tokio::task::spawn_blocking`, where the returned `JoinHandle` is
immediately awaited and **unwrapped**, with an expectation that "we
should always be able to spawn a blocking task".
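Roughly, the problematic pattern looks like the following sketch (simplified for illustration; this is not the actual async-bb8-diesel source):

```rust
use tokio::task;

// The JoinHandle returned by spawn_blocking is awaited and unwrapped,
// on the assumption that spawning a blocking task always succeeds.
async fn run_blocking_query() -> usize {
    task::spawn_blocking(|| {
        // Synchronous Diesel work would happen here.
        42
    })
    .await
    // If the runtime is already shutting down, this resolves to a JoinError
    // with is_cancelled() == true, and expect() turns it into a panic.
    .expect("failed to join blocking task")
}
```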

There was a separate discussion about "whether or not async-bb8-diesel
should be unwrapping in this case" (see:
oxidecomputer/async-bb8-diesel#77), but instead,
I chose to focus on the question:

## Why are we trying to send requests to async-bb8-diesel while the tokio runtime is exiting?

The answer to this question lies in qorb's internals -- qorb itself
spawns many tokio tasks to handle ongoing work, including monitoring DNS
resolution, checking on backend health, and making connections before
they are requested. One of these qorb tasks, calling `ping_async` (which
checks connection health), used the `async-bb8-diesel` interface that
ultimately panicked.

Within qorb, most of these tokio tasks have a drop implementation of the
form:

```rust
struct MyQorbStructure {
  handle: tokio::task::JoinHandle<()>,
}

impl Drop for MyQorbStructure {
  fn drop(&mut self) {
    self.handle.abort();
  }
}
```

Tragically, without async drop support in Rust, this is the best we can
do implicitly -- signal that the background tasks should stop ASAP --
but that may not be soon enough! Calling `.abort()` on the `JoinHandle`
does not terminate the task immediately; it simply signals that it should
shut down at the next yield point.

As a result, we can still see the flake observed in #6505:

- A qorb pool is initialized with background tasks
- One of the qorb worker tasks is about to make a call to check on the
health of a connection to a backend
- The test finishes, and returns. The tokio runtime begins terminating.
- We call `drop` on the `qorb` pool, which signals that the background task
should abort.
- The aforementioned qorb worker task makes the call to `ping_async`,
which calls `spawn_blocking`. This fails, because the tokio runtime is
terminating, and returns a
[JoinError::Cancelled](https://buildomat.eng.oxide.computer/wg/0/details/01J9YQVN7X5EQNXFSEY6XJBH8B/zfviqPz9RoJp3bY4TafbyqXTwbhqdr7w4oupqBtVARR00CXF/01J9YQWAXY36WM0R2VG27QMFRK#S6049).
- `async-bb8-diesel` unwraps this `JoinError`, and the test panics.

# How do we mitigate this?

That's where this PR comes in. Using the new qorb APIs, we don't rely on
the synchronous drop methods -- we explicitly call `.terminate().await`
functions which do the following:

- Use `tokio::sync::oneshot`s as signals to `tokio::tasks` that they
should exit
- `.await` the `JoinHandle` for those tasks before returning

Doing this work explicitly as a part of cleanup ensures that there are
not any background tasks attempting to do new work while the tokio
runtime is terminating.
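A minimal sketch of that explicit-shutdown pattern is below; the struct and method names are illustrative, and qorb's actual API differs in its details:

```rust
use std::time::Duration;
use tokio::sync::oneshot;

struct Worker {
    shutdown_tx: Option<oneshot::Sender<()>>,
    handle: Option<tokio::task::JoinHandle<()>>,
}

impl Worker {
    fn new() -> Self {
        let (shutdown_tx, mut shutdown_rx) = oneshot::channel();
        let handle = tokio::spawn(async move {
            loop {
                tokio::select! {
                    // Stop as soon as terminate() signals us.
                    _ = &mut shutdown_rx => break,
                    // Stand-in for the periodic work (health checks, etc.).
                    _ = tokio::time::sleep(Duration::from_secs(1)) => {}
                }
            }
        });
        Self {
            shutdown_tx: Some(shutdown_tx),
            handle: Some(handle),
        }
    }

    /// Explicit async cleanup: signal the task, then await its JoinHandle so
    /// no background work can race with runtime shutdown.
    async fn terminate(&mut self) {
        if let Some(tx) = self.shutdown_tx.take() {
            let _ = tx.send(());
        }
        if let Some(handle) = self.handle.take() {
            let _ = handle.await;
        }
    }
}
```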