fix: retry on errors when watching pods #9373

Open: mikedld wants to merge 6 commits into main


@mikedld mikedld commented Apr 1, 2024

Fixes: #8658

Description
If a timeout (or some network error) occurs while waiting for a pod initialization or termination event, e.g. when a build takes a long time, skaffold becomes stuck and never finishes the operation. Use a retry watcher to handle the errors gracefully.

This PR is based on the patch I posted in #8658 last year; I never got any feedback on it there, so I decided to go ahead. I have been using this patch since then and it works fine on my end. To reiterate,

Also note that the same issue affects WaitForDeploymentToStabilize (and probably some other places where Watch is used), but I can't test those, so I didn't patch them.

I only managed to fix the existing unit test, not add any new tests, as I'm not at all comfortable with Go. If that's an issue, I'm okay with someone else picking this up.
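For readers unfamiliar with the pattern, here is a minimal, self-contained sketch of what "retrying a watch" means. It is not the code from this PR and does not use client-go: `Event`, `WatchFunc`, `waitForPhase`, `flakyWatch`, and `demo` are hypothetical names made up for illustration. The assumption is that a watch is a channel that the server may close early (timeout, network error), and that the caller can resume from the last resource version it saw. client-go ships a `RetryWatcher` in `k8s.io/client-go/tools/watch` built around the same idea, which is presumably the "retry watcher" the description refers to.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// Event is a hypothetical stand-in for a Kubernetes watch event.
type Event struct {
	ResourceVersion int
	Phase           string
}

// WatchFunc opens a watch that delivers events newer than afterVersion.
// The returned channel is closed when the connection ends, for example
// on a server-side timeout or a network error.
type WatchFunc func(ctx context.Context, afterVersion int) (<-chan Event, error)

// waitForPhase blocks until an event with the wanted phase arrives.
// If the watch fails to open or its channel closes early, the watch is
// reopened from the last seen resource version after a short backoff,
// instead of waiting forever on a dead channel.
func waitForPhase(ctx context.Context, watch WatchFunc, want string, backoff time.Duration) (Event, error) {
	last := 0
	for {
		ch, err := watch(ctx, last)
		if err == nil {
		read:
			for {
				select {
				case <-ctx.Done():
					return Event{}, ctx.Err()
				case ev, ok := <-ch:
					if !ok {
						break read // watch ended; reopen it below
					}
					last = ev.ResourceVersion
					if ev.Phase == want {
						return ev, nil
					}
				}
			}
		}
		// Back off before reopening, but give up if the context is done.
		select {
		case <-ctx.Done():
			return Event{}, ctx.Err()
		case <-time.After(backoff):
		}
	}
}

// flakyWatch simulates a server whose first connection attempt fails and
// whose later connections are dropped after delivering a single event.
func flakyWatch(history []Event) WatchFunc {
	calls := 0
	return func(ctx context.Context, afterVersion int) (<-chan Event, error) {
		calls++
		if calls == 1 {
			return nil, errors.New("simulated network error")
		}
		ch := make(chan Event, 1)
		for _, ev := range history {
			if ev.ResourceVersion > afterVersion {
				ch <- ev
				break
			}
		}
		close(ch)
		return ch, nil
	}
}

// demo runs waitForPhase against the flaky watch with a one-second deadline.
func demo() (Event, error) {
	history := []Event{
		{ResourceVersion: 1, Phase: "Pending"},
		{ResourceVersion: 2, Phase: "Running"},
		{ResourceVersion: 3, Phase: "Succeeded"},
	}
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	return waitForPhase(ctx, flakyWatch(history), "Succeeded", time.Millisecond)
}

func main() {
	ev, err := demo()
	fmt.Println(ev, err)
}
```

Without the outer loop, the first closed channel would end the wait with no result, which matches the hang described above. Resuming from the last seen resource version is what keeps events from being missed or replayed across reconnects.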


google-cla bot commented Apr 1, 2024

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

mikedld added 3 commits April 2, 2024 00:11
If timeout (or some network error?) occurs while waiting for a pod
initialization or termination event, e.g. when build takes a long time,
skaffold becomes stuck and never finishes the operation. Use retry
watcher to handle the errors gracefully.
@mikedld mikedld force-pushed the bugfix/gh8658-pod-wait-misses-events-after-timeout branch from 6f3b074 to e1ec0c5 Compare April 1, 2024 23:11
@mikedld mikedld changed the title Retry on errors when watching pods fix: retry on errors when watching pods Apr 1, 2024
@certifiedloud

How can we encourage this fix to be merged? This bug is causing significant problems for skaffold users who want to use kaniko.

@alphanota alphanota self-assigned this Dec 17, 2024
@alphanota
Contributor

@mikedld Thank you for this PR. Would you mind resolving the conflicting files and making sure the PR is synced with skaffold main?

@mikedld mikedld requested a review from a team as a code owner December 17, 2024 21:58
@mikedld mikedld requested a review from plumpy December 17, 2024 21:58
Development

Successfully merging this pull request may close these issues.

Kaniko build hangs waiting for a long-running build that has already finished and pushed the image
3 participants