[Bug] Workflow lifecycle logging is misleading #1537

mjameswh · 2024-09-27T15:43:58Z

(From a user) We are trying to get good logs so that we can alert on the following use cases:

When a Workflow Task fails more than 3 times (possibly because of an implementation issue)
When a Workflow fails (because of ApplicationFailure, ActivityFailure, etc.)

On Workflow Task failure, the lifecycle logger prints a message indicating that the Workflow failed; this is exactly the same message as on an actual Workflow failure, making it impossible to differentiate the two cases.
Similarly, one may see "Workflow started" printed multiple times for the same Workflow Execution, i.e. every time a Worker needs to rebuild (aka "replay") the runtime state of that Workflow Execution from the very beginning.
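
To make the two use cases concrete, here is a minimal sketch of the alerting logic the user is after. It assumes the logs carried a field that tells the cases apart; the `kind` field, the `LifecycleEvent` shape, and `collectAlerts` are all hypothetical and are not something the SDK emits today. Counting consecutive failures (reset on a successful Workflow Task) is also an assumption about what "more than 3 times" should mean.

```typescript
// Hypothetical structured log events; the current lifecycle logger has no `kind` field.
type LifecycleEvent =
  | { kind: 'workflowTaskFailed'; workflowId: string; runId: string }
  | { kind: 'workflowTaskCompleted'; workflowId: string; runId: string }
  | { kind: 'workflowExecutionFailed'; workflowId: string; runId: string; failureType: string };

type Alert = { workflowId: string; runId: string; reason: string };

// Threshold taken from the use case above: alert on more than 3 failures.
const TASK_FAILURE_THRESHOLD = 3;

function collectAlerts(events: LifecycleEvent[]): Alert[] {
  const alerts: Alert[] = [];
  // Consecutive Workflow Task failures, keyed by Workflow Execution.
  const consecutiveTaskFailures = new Map<string, number>();

  for (const event of events) {
    const key = `${event.workflowId}/${event.runId}`;
    switch (event.kind) {
      case 'workflowTaskFailed': {
        const count = (consecutiveTaskFailures.get(key) ?? 0) + 1;
        consecutiveTaskFailures.set(key, count);
        // Alert only once, when the count first exceeds the threshold.
        if (count === TASK_FAILURE_THRESHOLD + 1) {
          alerts.push({
            workflowId: event.workflowId,
            runId: event.runId,
            reason: `Workflow Task failed more than ${TASK_FAILURE_THRESHOLD} times in a row`,
          });
        }
        break;
      }
      case 'workflowTaskCompleted':
        // A successful Workflow Task resets the streak.
        consecutiveTaskFailures.delete(key);
        break;
      case 'workflowExecutionFailed':
        // An actual Workflow failure always alerts.
        alerts.push({
          workflowId: event.workflowId,
          runId: event.runId,
          reason: `Workflow Execution failed (${event.failureType})`,
        });
        break;
    }
  }
  return alerts;
}
```

With the current log output, both branches would have to key off the same "Workflow failed" message, which is why this kind of rule cannot be written today.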

The thing is that this lifecycle handler logs things from the perspective of "the Cached Workflow Instance" (i.e. the specific instance of that Workflow Execution in the cache of that specific Workflow Worker), rather than from the perspective of the actual Workflow Execution's lifecycle.
We need to think of a more precise way of formulating those messages. For various reasons, no mention of "Workflow" or "Workflow Task" (starting, failing, completing…) would be 100% reliable at that precise place. For example, Workflow code may attempt to "Complete Workflow", but the completion command times out or gets rejected by the server because of new incoming events, so what appears to be "Workflow completed" actually ends up being a Workflow Task Failure or Timeout.
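
As a starting point for discussion, here is a rough sketch of what messages worded from the cached instance's perspective could look like. Everything in it is hypothetical: the `InstanceEvent` type, the event names, the `replaying` flag, and the message texts are illustrations only, not the SDK's actual internals or a proposed final wording.

```typescript
// Hypothetical events, as observed by one cached Workflow instance on one Worker.
type InstanceEvent =
  | { type: 'instanceCreated'; replaying: boolean }
  | { type: 'activationFailed'; error: string }
  | { type: 'completionCommandEmitted' }
  | { type: 'failureCommandEmitted'; error: string };

// Describe only what this Worker can actually know at that point,
// without claiming anything about the Workflow Execution's final outcome.
function describeInstanceEvent(event: InstanceEvent): string {
  switch (event.type) {
    case 'instanceCreated':
      return event.replaying
        ? 'Workflow instance created in Worker cache (replaying history)'
        : 'Workflow instance created in Worker cache';
    case 'activationFailed':
      return `Workflow Task processing failed on this Worker: ${event.error}`;
    case 'completionCommandEmitted':
      return 'Workflow code requested completion (pending server acceptance)';
    case 'failureCommandEmitted':
      return `Workflow code requested failure (pending server acceptance): ${event.error}`;
  }
}
```

The idea is that a Workflow Task failure and a Workflow failure requested by Workflow code get distinct messages, and neither claims that the Workflow Execution itself has reached a final state.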
Community Slack conversation: https://temporalio.slack.com/archives/C01DKSMU94L/p1727436127246899


ilijaNL · 2024-10-03T04:20:21Z

Since the Temporal SDK has metric counters, why wouldn't it be possible to do the logging in the same place where the counter is incremented?

mjameswh · 2024-11-06T00:32:02Z

> Since the Temporal SDK has metric counters, why wouldn't it be possible to do the logging in the same place where the counter is incremented?

It's unfortunately not that simple. Metrics bookkeeping is handled by Core SDK, whereas logging is handled on the TypeScript side.

mjameswh added the bug and visibility labels on Sep 27, 2024
