-
Notifications
You must be signed in to change notification settings - Fork 859
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Nexus: Delete state machine on terminal state -- Part 3 #6984
base: main
Are you sure you want to change the base?
Conversation
components/nexusoperations/events.go
Outdated
return output, err | ||
} | ||
|
||
if err := root.DeleteChild(node.Key); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Technically you'll want to separate this into two operations, a transition, followed by a deletion. Move the DeleteChild call to after the transitionOperation call.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's also a boolean on the operation state saying whether it should be deleted on completion but we can get rid of that and just delete here always.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
@@ -216,27 +216,22 @@ func (ch *commandHandler) HandleCancelCommand( | |||
if errors.Is(err, hsm.ErrStateMachineNotFound) { | |||
return workflow.FailWorkflowTaskError{ | |||
Cause: enumspb.WORKFLOW_TASK_FAILED_CAUSE_BAD_REQUEST_CANCEL_NEXUS_OPERATION_ATTRIBUTES, | |||
Message: fmt.Sprintf("requested cancelation for a non-existing operation with scheduled event ID of %d", attrs.ScheduledEventId), | |||
// TODO(bergundy): Message: fmt.Sprintf("requested cancelation for a non-existing or already completed operation with scheduled event ID of %d", attrs.ScheduledEventId), | |||
Message: fmt.Sprintf("requested cancelation for a nonexistent or terminated operation with scheduled event ID %d", attrs.ScheduledEventId), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The message I had in the todo is better. The operation isn't terminated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
if !nexusoperations.TransitionCanceled.Possible(op) && !ms.HasAnyBufferedEvent(makeNexusOperationTerminalEventFilter(attrs.ScheduledEventId)) { | ||
|
||
// Operation exists but can't be canceled - must be in terminal state | ||
if !nexusoperations.TransitionCanceled.Possible(op) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not possible anymore. And please distinguish between a terminal state and a terminated operation. The latter is not a term that we use.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
// We allow the workflow to request canceling an operation that has just completed while a workflow task is in | ||
// flight since it cannot know about the state of the operation. | ||
// TODO(bergundy): When we support state machine deletion, this condition will have to change. | ||
if !nexusoperations.TransitionCanceled.Possible(op) && !ms.HasAnyBufferedEvent(makeNexusOperationTerminalEventFilter(attrs.ScheduledEventId)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You will need allow cancel while a terminal event is buffered (e.g. the SDK isn't aware yet that the operation has completed).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The check for buffered terminal event should be moved to where we error out when the node is not found.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is what you have in mind. Please take a look
Didn't review |
}, | ||
}, | ||
}) | ||
var failWFTErr workflow.FailWorkflowTaskError |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not to be confused with a WTFErr lol
return node, op, eventID | ||
} | ||
|
||
applyEventAndCheckDeletion := func( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can just inline this in the loop body, there's no value in extracting to a function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
What changed?
Add deletion calls for Nexus operation nodes once they reach a terminal state (Completed, Failed, Canceled, Timed Out). Previously, terminal operations lingered in the state machine, potentially causing confusion and wasting resources. In addition, this PR addresses previously existing TODOs related to node cleanup.
Key Changes
Operations are now removed from the state machine immediately after they complete, fail, cancel, or time out.
The command handlers (schedule and cancel) now reflect the new terminal state deletion. Attempts to cancel an operation that has already reached a terminal state will produce a clear, non-retryable error.
Introduced two new test suites:
TestTerminalStatesDeletion
to verify that applying terminal events removes the operation node as expected.TestOperationNodeDeletionOnTerminalEvents
to ensure no further modifications (like cancelation) can be made after an operation is terminal.Areas for Review Focus
Why?
How did you test it?
Added
TestTerminalStatesDeletion
andTestOperationNodeDeletionOnTerminalEvents
to specifically target and validate the new deletion behavior.All pre-existing tests have been re-run to ensure no regressions.
Potential risks
If any internal or external code implicitly relied on the presence of completed operations, its assumptions might break. Such usage was never officially supported.
Documentation
No external or user-facing documentation changes are required since this feature matches the intended design.
Is hotfix candidate?
No. This change is an enhancement to internal logic and state management, not a fix for a critical bug, so it should follow the normal release process.