UI shows stale information about WorkflowTaskFailed and doesn't update #2262

Open
spandan-sharma opened this issue Aug 12, 2024 · 0 comments
Labels
bug Something isn't working


Describe the bug

Originally reported in the forum thread. If a workflow task fails, the UI keeps showing the original failure message even when later retries fail for a different reason (for example, after a new deployment). This makes debugging harder and gives the wrong impression that the new deployment didn't go through or that the workflow code wasn't updated.

To Reproduce
In pseudocode, here is the first deployment the workflow is run with:

do-activity-1
raise Exception('foo')
do-activity-2

This finishes activity-1 and then results in WorkflowTaskFailed, with the reason being that exception foo was raised. The workflow task keeps getting retried.

Now change code to:

do-activity-1
raise Exception('bar')
do-activity-2

and redeploy the worker. From the worker logs I can see that it's now raising the new exception bar, but the UI doesn't update the event history: it stays frozen at WorkflowTaskFailed with the reason that exception foo occurred, which is no longer accurate. This is just an example, but it makes troubleshooting through the UI difficult. It looks as if the worker were still running stale code and hadn't been updated.
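The symptom up to this point can be modelled in a few lines of plain Python. This is a toy sketch and not code from the project: `ToyWorkflowState`, `run_attempt`, `workflow_v1`, and `workflow_v2` are hypothetical names, and the rule that only the first failure is written to history is an assumption chosen to mirror what the UI shows, not a confirmed description of how the server behaves.

```python
from dataclasses import dataclass, field


@dataclass
class ToyWorkflowState:
    """Hypothetical stand-in for what the UI reads and what the worker logs."""
    history: list = field(default_factory=list)     # persisted events shown in the UI
    worker_log: list = field(default_factory=list)  # lines the worker prints
    attempt: int = 0

    def record_task_failure(self, reason: str) -> None:
        self.attempt += 1
        # The worker always logs the current failure.
        self.worker_log.append(f"attempt {self.attempt}: {reason}")
        # Assumption for this sketch: only the first failure reaches history.
        if self.attempt == 1:
            self.history.append(("WorkflowTaskFailed", reason))


def workflow_v1() -> None:
    raise Exception("foo")  # first deployment


def workflow_v2() -> None:
    raise Exception("bar")  # second deployment


def run_attempt(state: ToyWorkflowState, workflow_fn) -> None:
    """Run one workflow task attempt and record its failure, if any."""
    try:
        workflow_fn()
    except Exception as exc:
        state.record_task_failure(str(exc))


state = ToyWorkflowState()
run_attempt(state, workflow_v1)  # fails with foo
run_attempt(state, workflow_v2)  # after the redeploy, fails with bar

print(state.history)         # [('WorkflowTaskFailed', 'foo')]
print(state.worker_log[-1])  # attempt 2: bar
```

In this model the history still reports foo while the newest worker log line reports bar, which is the mismatch described above.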

Next, introduce non-determinism by changing the code to the following and redeploying the (only) worker:

do-activity-3
raise Exception('bar')
do-activity-2

Again, the worker logs show that it stops executing as soon as it sees the divergence: the event history says activity-1 completed, but the new code has activity-3 in its place. So it immediately detects non-determinism and errors out (and is then retried as usual), but the UI for the workflow remains frozen on the original error, that the workflow task failed due to exception foo.
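The divergence check described here can be sketched as a comparison between what the history already records and what the new code asks for. This is a simplified illustration under stated assumptions, not the worker's real replay logic: `NonDeterminismError` and `replay` are hypothetical names, and steps are reduced to plain strings.

```python
class NonDeterminismError(Exception):
    """Hypothetical error type used only in this sketch."""


def replay(recorded_steps, issued_steps) -> None:
    """Raise if the new code's steps disagree with the recorded history.

    Only the overlapping prefix is compared; steps past the end of the
    recorded history are new work and are not checked here.
    """
    for index, (expected, actual) in enumerate(zip(recorded_steps, issued_steps)):
        if expected != actual:
            raise NonDeterminismError(
                f"step {index}: history has {expected!r}, code issued {actual!r}"
            )


recorded = ["activity-1"]                # what the event history already holds
new_code = ["activity-3", "activity-2"]  # what the third deployment would run

try:
    replay(recorded, new_code)
    divergence = None
except NonDeterminismError as exc:
    divergence = str(exc)

print(divergence)
```

With the original ordering (`["activity-1", "activity-2"]`) the same call returns without raising, which matches the later step where reverting the code lets the workflow finish.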

If you now revert to the first snippet and remove the exception, the worker completes the workflow on the next retry, finishing activity-2 as well. The UI then updates with all of that and finally shows the workflow as completed.

In the meantime, however, the lack of updates makes troubleshooting difficult.

Expected behavior
On retries, if the failure reason for the workflow task has changed, the UI should update to show the new reason (as it does for activity tasks in pending activities).
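One way to picture the requested behavior is a per-task record that overwrites its latest failure on every retry. The sketch below is only an illustration of that idea: `PendingWorkflowTaskView` and its fields are hypothetical and do not correspond to any existing UI or server type.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PendingWorkflowTaskView:
    """Hypothetical UI-side record of a workflow task that keeps failing."""
    attempt: int = 0
    first_failure: Optional[str] = None
    last_failure: Optional[str] = None

    def on_failed_attempt(self, reason: str) -> None:
        self.attempt += 1
        if self.first_failure is None:
            self.first_failure = reason  # kept for context
        self.last_failure = reason       # overwritten on every retry


view = PendingWorkflowTaskView()
for reason in ["foo", "bar", "non-determinism detected"]:
    view.on_failed_attempt(reason)

print(view.attempt, view.first_failure, view.last_failure)
```

Showing `last_failure` alongside the attempt count would let a reader see at a glance that the newest deployment is the one failing.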

Desktop (please complete the following information):

  • OS: macOS
  • Browser: Brave, Firefox, etc.

Additional context
Tried this in both the "old" UI and the "new" UI, toggled with the "Labs On" button at the bottom left.

@spandan-sharma spandan-sharma added the bug Something isn't working label Aug 12, 2024