We do not currently distinguish "not found" errors (permanent) from errors such as "the Kubernetes API server temporarily cannot be reached" (transient). Because of this, a Stage's verification process may fail prematurely even though the controller could, in theory, recover automatically if given the time.
Manually recovering from this is cumbersome for the user and potentially wastes the computing power already used by the AnalysisRun. I think we can do a better job of distinguishing these types of errors and avoid giving up on transient ones by, e.g., requeueing and not erasing AnalysisRun references.
I think we've made progress on this, and there's still more to be made, but like #1479, this is an ongoing effort that we can carry from release to release until we're satisfied.
xref: #1611 (comment)
Note: While I have only observed this to happen for a Stage's verification process, this may actually apply to more areas of Kargo.