Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cleanup partially imported disks on CTRL+C #760

Closed
wants to merge 3 commits into from

Conversation

wfchandler
Copy link
Contributor

@wfchandler wfchandler commented Jul 23, 2024

If a disk import request is interrupted then the disk is likely to be
left in a ImportReady or ImportingFromBulkWrites state, which
prevents the user from deleting it. The commonly occurs when users
interrupt the request using ^C, and is a frustrating experience.

Add a new GracefulShutdown struct to listen for SIGINTs. This will
cancel the executing Future on interrupt and run a cleanup task which
will return an error, even if cleanup succeeds. A second SIGINT causes
the process to exit immediately.

Hook calls make in the critical section of the import into the new
system to ensure we delete the partially imported disk.

This introduces a race where the user may cancel the import operation
while the disk creation saga is still executing and the disk is in
Creating state. Attempting to finalize or delete the disk in this
state will fail, so we are forced to wait until it has transitioned to
ImportReady before proceeding. To avoid spamming the API too heavily,
we poll the disk state every 500ms for no more than ten tries.

The new polling creates a problem for our testing, though. We need the
initial disk API query to return a 404, but subsequent ones to find the
disk. httpmock does not provide a way to remove a mock after N hits,
and we have no graceful way of replacing the mock while running the
oxide binary. There is an open issue to add an option like this to
httpmock, but for right now we have no option but to delete the tests
that check for failure.

Closes #651

@wfchandler wfchandler force-pushed the wc/disk-import-signal branch from 29f03c8 to 4891399 Compare July 23, 2024 14:11
@wfchandler
Copy link
Contributor Author

Recording:

disk_cleanup_demo.mov

Function `test_get_disk_size` is tagged with the `#[test]` attribute,
but not in a module configured for tests, so it is not as part of our
test suite.

Move it into a new `tests` submodule.
@wfchandler wfchandler force-pushed the wc/disk-import-signal branch from 4891399 to e81bfe6 Compare July 25, 2024 14:04
If a `disk import` request is interrupted then the disk is likely to be
left in a `ImportReady` or `ImportingFromBulkWrites` state, which
prevents the user from deleting it. The commonly occurs when users
interrupt the request using ^C, and is a frustrating experience.

Add a new `GracefulShutdown` struct to listen for SIGINTs. This will
cancel the executing `Future` on interrupt and run a cleanup task which
will return an error, even if cleanup succeeds. A second SIGINT causes
the process to exit immediately.

Hook calls make in the critical section of the import into the new
system to ensure we delete the partially imported disk.

This introduces a race where the user may cancel the import operation
while the disk creation saga is still executing and the disk is in
`Creating` state. Attempting to finalize or delete the disk in this
state will fail, so we are forced to wait until it has transitioned to
`ImportReady` before proceeding. To avoid spamming the API too heavily,
we poll the disk state every 500ms for no more than ten tries.

The new polling creates a problem for our testing, though. We need the
initial disk API query to return a 404, but subsequent ones to find the
disk. `httpmock` does not provide a way to remove a mock after N hits,
and we have no graceful way of replacing the mock while running the
`oxide` binary. There is an open issue[0] to add an option like this to
`httpmock`, but for right now we have no option but to delete the tests
that check for failure.

Closes #651

[0] alexliesenfeld/httpmock#96
@wfchandler
Copy link
Contributor Author

Opened alexliesenfeld/httpmock#108 to add a delete_after_n method to httpmock that would allow us to clear the initial mock. In the meantime I've just removed the failing tests in this PR. We can add them back in if/when my change to httpmock is accepted.

@wfchandler wfchandler force-pushed the wc/disk-import-signal branch from e81bfe6 to 1ebaa1f Compare July 25, 2024 18:54

if shutdown.shutdown_requested()? {
self.remove_failed_disk(client).await?;
bail!("user canceled request");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be nice to say something more concrete like "User canceled image upload"

@ahl
Copy link
Collaborator

ahl commented Jul 29, 2024

I had been moving the disk upload functionality to the SDK. This seems to complicate that. I had imagined we would do cleanup from a custom Drop implementation (albeit synchronously in the absence of async Drop). When we receive a SIGINT we could cancel the future and let the Drop logic do the cleanup.

From a scan through your changes, it seems like there may be obstacles in the API such as needing to wait for particular state transitions. There may be situations where we want to change the API or state machine.

Copy link
Contributor

@jmpesp jmpesp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes look good to me, though Adam's review should supersede this one. One comment from me:


if force_shutdown {
eprintln_nopipe!("Shutting down immediately.\n{exit_msg}.");
std::process::exit(130);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where did this value come from?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is matching the exit code bash would use had the standard SIGINT handler been invoked. It's not exactly the same behavior as WIFSIGNALED won't be aware of the signal, but I think it's close enough for our purposes.

@wfchandler
Copy link
Contributor Author

I had been moving the disk upload functionality to the SDK. This seems to complicate that. I had imagined we would do cleanup from a custom Drop implementation (albeit synchronously in the absence of async Drop). When we receive a SIGINT we could cancel the future and let the Drop logic do the cleanup.

From a scan through your changes, it seems like there may be obstacles in the API such as needing to wait for particular state transitions. There may be situations where we want to change the API or state machine.

@ahl I'm not sure where that leaves this PR. Should I close it, or do we expect the work to update the API to take long enough it's worthwhile merging this as a stopgap.

@ahl
Copy link
Collaborator

ahl commented Aug 8, 2024

I had been moving the disk upload functionality to the SDK. This seems to complicate that. I had imagined we would do cleanup from a custom Drop implementation (albeit synchronously in the absence of async Drop). When we receive a SIGINT we could cancel the future and let the Drop logic do the cleanup.
From a scan through your changes, it seems like there may be obstacles in the API such as needing to wait for particular state transitions. There may be situations where we want to change the API or state machine.

@ahl I'm not sure where that leaves this PR. Should I close it, or do we expect the work to update the API to take long enough it's worthwhile merging this as a stopgap.

@wfchandler I was hoping we might change the approach here to do cleanup when we drop the future. Do you think that's viable?

@wfchandler
Copy link
Contributor Author

I had been moving the disk upload functionality to the SDK. This seems to complicate that. I had imagined we would do cleanup from a custom Drop implementation (albeit synchronously in the absence of async Drop). When we receive a SIGINT we could cancel the future and let the Drop logic do the cleanup.
From a scan through your changes, it seems like there may be obstacles in the API such as needing to wait for particular state transitions. There may be situations where we want to change the API or state machine.

@ahl I'm not sure where that leaves this PR. Should I close it, or do we expect the work to update the API to take long enough it's worthwhile merging this as a stopgap.

@wfchandler I was hoping we might change the approach here to do cleanup when we drop the future. Do you think that's viable?

@ahl I think the latest revision of this PR should be compatible with your SDK changes. We can replace self.upload_disk with the equivalent SDK command when that's available. WDYT?

@david-crespo
Copy link
Contributor

Ran into this today. Would be nice to figure out how to get this in. In the meantime I think I’m going to add stop and and finalize to the web console.

@wfchandler
Copy link
Contributor Author

Ran into this today. Would be nice to figure out how to get this in. In the meantime I think I’m going to add stop and and finalize to the web console.

I have almost finished a rewrite of this that adds a convenience wrapper to the SDK, hopefully ready to go later today.

@wfchandler
Copy link
Contributor Author

Closing in favor of #840

@wfchandler wfchandler closed this Sep 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

disk requires special intervention to delete if upload process is Ctrl-C'd
4 participants