Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Nexus: Delete state machine on terminal state -- Part 3 #6984

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

justinp-tt
Copy link
Contributor

@justinp-tt justinp-tt commented Dec 12, 2024

What changed?

Add deletion calls for Nexus operation nodes once they reach a terminal state (Completed, Failed, Canceled, Timed Out). Previously, terminal operations lingered in the state machine, potentially causing confusion and wasting resources. In addition, this PR addresses previously existing TODOs related to node cleanup.

Key Changes

  • Automatic Node Deletion on Terminal States:
    Operations are now removed from the state machine immediately after they complete, fail, cancel, or time out.
  • Handler Updates:
    The command handlers (schedule and cancel) now reflect the new terminal state deletion. Attempts to cancel an operation that has already reached a terminal state will produce a clear, non-retryable error.
  • Tests for Terminal State Deletion:
    Introduced two new test suites:
    • TestTerminalStatesDeletion to verify that applying terminal events removes the operation node as expected.
    • TestOperationNodeDeletionOnTerminalEvents to ensure no further modifications (like cancelation) can be made after an operation is terminal.

Areas for Review Focus

  1. Deletion Semantics: Verify that node removal is triggered correctly for all terminal states and that no non-terminal states inadvertently trigger deletion.
  2. Handler Consistency: Ensure that the updated schedule/cancel handlers function correctly when dealing with non-existent or already terminated operations.
  3. Error Handling: Confirm that attempts to modify a deleted operation produce the intended error behavior.
  4. Test Coverage: Review the newly added test suites to ensure all terminal states and edge cases are thoroughly tested.

Why?

  • Prevent confusion and wasted resources by removing irrelevant operation nodes.
  • Prevent customers from experiencing workflow task failures prematurely.

How did you test it?

  • New Test Suites:
    Added TestTerminalStatesDeletion and TestOperationNodeDeletionOnTerminalEvents to specifically target and validate the new deletion behavior.
  • Existing Tests:
    All pre-existing tests have been re-run to ensure no regressions.

Potential risks

  • Assumptions About Completed Operations:
    If any internal or external code implicitly relied on the presence of completed operations, its assumptions might break. Such usage was never officially supported.

Documentation

No external or user-facing documentation changes are required since this feature matches the intended design.

Is hotfix candidate?

No. This change is an enhancement to internal logic and state management, not a fix for a critical bug, so it should follow the normal release process.

@justinp-tt justinp-tt changed the title Add deletion on terminal state Nexus: Delete state machine on terminal state -- Part 3 Dec 12, 2024
@justinp-tt justinp-tt requested a review from bergundy December 12, 2024 22:39
@justinp-tt justinp-tt self-assigned this Dec 12, 2024
return output, err
}

if err := root.DeleteChild(node.Key); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically you'll want to separate this into two operations, a transition, followed by a deletion. Move the DeleteChild call to after the transitionOperation call.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's also a boolean on the operation state saying whether it should be deleted on completion but we can get rid of that and just delete here always.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -216,27 +216,22 @@ func (ch *commandHandler) HandleCancelCommand(
if errors.Is(err, hsm.ErrStateMachineNotFound) {
return workflow.FailWorkflowTaskError{
Cause: enumspb.WORKFLOW_TASK_FAILED_CAUSE_BAD_REQUEST_CANCEL_NEXUS_OPERATION_ATTRIBUTES,
Message: fmt.Sprintf("requested cancelation for a non-existing operation with scheduled event ID of %d", attrs.ScheduledEventId),
// TODO(bergundy): Message: fmt.Sprintf("requested cancelation for a non-existing or already completed operation with scheduled event ID of %d", attrs.ScheduledEventId),
Message: fmt.Sprintf("requested cancelation for a nonexistent or terminated operation with scheduled event ID %d", attrs.ScheduledEventId),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The message I had in the todo is better. The operation isn't terminated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

if !nexusoperations.TransitionCanceled.Possible(op) && !ms.HasAnyBufferedEvent(makeNexusOperationTerminalEventFilter(attrs.ScheduledEventId)) {

// Operation exists but can't be canceled - must be in terminal state
if !nexusoperations.TransitionCanceled.Possible(op) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not possible anymore. And please distinguish between a terminal state and a terminated operation. The latter is not a term that we use.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

// We allow the workflow to request canceling an operation that has just completed while a workflow task is in
// flight since it cannot know about the state of the operation.
// TODO(bergundy): When we support state machine deletion, this condition will have to change.
if !nexusoperations.TransitionCanceled.Possible(op) && !ms.HasAnyBufferedEvent(makeNexusOperationTerminalEventFilter(attrs.ScheduledEventId)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You will need allow cancel while a terminal event is buffered (e.g. the SDK isn't aware yet that the operation has completed).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The check for buffered terminal event should be moved to where we error out when the node is not found.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is what you have in mind. Please take a look

@bergundy
Copy link
Member

Didn't review commands_test.go yet. I will get to that in the next round after some comments are addressed.

@justinp-tt justinp-tt requested a review from bergundy December 13, 2024 21:05
@justinp-tt justinp-tt marked this pull request as ready for review December 13, 2024 21:06
@justinp-tt justinp-tt requested a review from a team as a code owner December 13, 2024 21:06
},
},
})
var failWFTErr workflow.FailWorkflowTaskError
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not to be confused with a WTFErr lol

return node, op, eventID
}

applyEventAndCheckDeletion := func(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can just inline this in the loop body, there's no value in extracting to a function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants