failed to wait for object to sync in-cache after patching context deadline exceeded #1017

Open
pkit opened this issue Jul 1, 2024 · 7 comments

Comments

@pkit

pkit commented Jul 1, 2024

wut?
really, what does it mean?
why are there no other logs that describe what's going on?

2024-07-01T20:04:53.761Z info HelmRelease/something.flux-system - release out-of-sync with desired state: release config values changed 
2024-07-01T20:04:53.791Z info HelmRelease/something.flux-system - running 'upgrade' action with timeout of 5m0s 
2024-07-01T20:04:54.720Z info HelmRelease/something.flux-system - release is in a failed state 
2024-07-01T20:04:54.789Z info HelmRelease/something.flux-system - running 'rollback' action with timeout of 5m0s 
2024-07-01T20:05:05.069Z error HelmRelease/something.flux-system - failed to wait for object to sync in-cache after patching context deadline exceeded
@stefanprodan
Member

failed to wait for object to sync in-cache after patching context deadline exceeded

This means the controller stopped receiving data from the Kubernetes API; I suspect your Kubernetes control plane is having issues.

@fcuello-fudo

We are having the same problem, but also the helm-controller pod is in a CrashLoopBackOff because of repeated failed liveness probes.

Probably the liveness probe should still work even if there are problems contacting the control plane.

@stefanprodan
Member

stefanprodan commented Oct 18, 2024

Probably the Liveness probe should still work even if there are problems contacting the control plane

Not if you build your controller with Kubernetes controller-runtime. Keeping the controller running and DDoSing the API endpoint would do you no good; kubelet restarts the controller with an exponential backoff, which prevents the API server from being overloaded once it comes back up.
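For context, kubelet's CrashLoopBackOff delay roughly doubles on each restart, starting around 10s and capping at 5 minutes, so a crashing controller quickly stops hammering the API server. A rough sketch of that schedule (the doubling-with-cap shape is documented Kubernetes behavior; the loop itself is just an illustration):

```go
package main

import "fmt"

// Approximate CrashLoopBackOff schedule: the restart delay doubles
// each time the container crashes, capped at 300 seconds (5 minutes).
func main() {
	delay := 10 // seconds before the first restart
	for i := 1; i <= 7; i++ {
		fmt.Printf("restart %d: wait %ds\n", i, delay)
		delay *= 2
		if delay > 300 {
			delay = 300 // kubelet caps the backoff at 5m
		}
	}
}
```

After a handful of crashes the controller is only retried every 5 minutes, which is the throttling effect described above.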

@fcuello-fudo

Having the controller running and DDOSing the API endpoint would do you no good,

We downgraded the control plane (GKE rapid channel) and now everything seems to be fine again. I still haven't found the root cause, but my point was: if the controller is behaving properly but the k8s API is overloaded or unresponsive for some reason other than the controller, the liveness probe on the controller should still pass the checks, right?

@stefanprodan
Member

the liveness probe on the controller should still pass the checks, right?

Not if the CNI is failing: kubelet can't reach the port. There is nothing special about the liveness probe, it's the standard controller-runtime ping handler https://github.com/fluxcd/pkg/blob/ac1007b57e37838e73b8bc95365dab9a0e856e8e/runtime/probes/probes.go#L45

@fcuello-fudo

Not if the CNI is failing, kubelet can't reach the port.

That's not the case, as there are several other applications running in the same cluster (and on the same node as the Flux controllers) and none of them have any problems, neither communicating with the internet nor among each other.

Also, the liveness port of the Flux controllers is reachable, but it just doesn't respond.

What I think is happening is that the problematic control-plane version changed something related to rate limiting of API queries, and that only affects Flux because in our case it's the app that queries the k8s API the most.

I'm pretty sure we can reproduce the issue easily by switching the control plane back to the problematic version if you are willing to debug this together.

@stefanprodan
Member

@fcuello-fudo if Flux runs into rate limits there must be error logs; it would be helpful if you could post those. We use the Kubernetes Priority and Fairness flow control to make our controllers comply with Kubernetes API rate limits; if the flow control API is buggy, this could lead to a disconnect https://github.com/fluxcd/pkg/blob/ac1007b57e37838e73b8bc95365dab9a0e856e8e/runtime/client/client.go#L76

3 participants