Describe what you are trying to do
(From a user) We are trying to get logs that are good enough to alert on the following use cases:
- A Workflow Task fails more than 3 times (possibly because of an implementation issue)
- A Workflow fails (because of an ApplicationFailure, ActivityFailure, etc.)
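As a rough illustration of the two alert conditions above, here is a minimal sketch that classifies a sequence of simplified events. The `EventKind` names, the `Alert` shape, and the `alertsFor` helper are all hypothetical and are not part of any SDK; the sketch only assumes that Workflow Task failures and Workflow Execution failures can be told apart at the source.

```typescript
// Simplified stand-ins for execution events. These names are illustrative
// assumptions, not an SDK API.
type EventKind =
  | "WorkflowTaskFailed"
  | "WorkflowTaskCompleted"
  | "WorkflowExecutionFailed";

type Alert =
  | { kind: "RepeatedWorkflowTaskFailure"; consecutiveFailures: number }
  | { kind: "WorkflowExecutionFailed" };

// Returns the alerts that a given event sequence would raise.
function alertsFor(events: EventKind[], threshold = 3): Alert[] {
  const alerts: Alert[] = [];
  let consecutive = 0;
  for (const event of events) {
    switch (event) {
      case "WorkflowTaskFailed":
        consecutive += 1;
        // Fire once, when the count first exceeds the threshold
        // ("more than 3 times" means the 4th consecutive failure).
        if (consecutive === threshold + 1) {
          alerts.push({
            kind: "RepeatedWorkflowTaskFailure",
            consecutiveFailures: consecutive,
          });
        }
        break;
      case "WorkflowTaskCompleted":
        // A successful Workflow Task resets the streak.
        consecutive = 0;
        break;
      case "WorkflowExecutionFailed":
        // Terminal failure of the Workflow Execution itself.
        alerts.push({ kind: "WorkflowExecutionFailed" });
        break;
    }
  }
  return alerts;
}
```

This classification is only possible if the two kinds of failure produce distinguishable records, which is what the current log messages do not provide.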
Describe the bug
On Workflow Task failure, the lifecycle logger prints `Workflow failed`. That is exactly the same message as on an actual Workflow failure, which makes it impossible to tell the two cases apart.
Similarly, one may see `Workflow started` printed multiple times for the same Workflow Execution, i.e. every time a Worker needs to rebuild (aka "replay") the runtime state of that Workflow Execution from the very beginning.
Additional context
The underlying problem is that this lifecycle handler logs from the perspective of the "Cached Workflow Instance" (i.e. the specific instance of that Workflow Execution in the cache of that specific Workflow Worker), rather than from the perspective of the actual Workflow Execution's lifecycle.
We need to think of a more precise way of formulating those messages. For various reasons, no mention of "Workflow" or "Workflow Task" (starting, failing, completing…) would be 100% reliable at that precise place. For example, Workflow code may attempt to "Complete Workflow", but the completion command times out or gets rejected by the server because of new incoming events, so what appears to be "Workflow completed" actually ends up being a Workflow Task Failure or Timeout.
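To make the "Cached Workflow Instance" perspective concrete, here is a toy model (not SDK code) of a logger that emits a start message whenever an in-memory instance is created. The `createCachedInstance` and `simulateExecution` helpers and the cache-eviction count are hypothetical. The sketch assumes that each eviction forces a rebuild from the beginning of the history.

```typescript
// Toy model: the log line is tied to creation of the in-memory instance,
// not to the start of the Workflow Execution itself.
function createCachedInstance(
  workflowId: string,
  log: string[],
): { workflowId: string } {
  // Emitted on every (re)creation of the instance, including replays.
  log.push(`Workflow started (${workflowId})`);
  return { workflowId };
}

// Simulates one Workflow Execution whose cached instance is evicted
// `cacheEvictions` times and rebuilt from scratch each time.
function simulateExecution(workflowId: string, cacheEvictions: number): string[] {
  const log: string[] = [];
  // One instance for the first Workflow Task, plus one rebuild per eviction.
  for (let i = 0; i <= cacheEvictions; i++) {
    createCachedInstance(workflowId, log);
  }
  return log;
}

// A single Workflow Execution with two evictions logs the start message three times.
const lines = simulateExecution("wf-1", 2);
console.log(lines.length);
```

In this model the number of start messages depends on cache behavior rather than on the Workflow Execution, which is why such messages are unreliable as alerting signals.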
Community Slack conversation: https://temporalio.slack.com/archives/C01DKSMU94L/p1727436127246899