Kaniko build hangs waiting for a long-running build that has already finished and pushed the image #8658
The patch below seems to fix it for me. Not sure how good it is (this is my first time ever dealing with Go), but I can open a PR. Also note that the same issue affects WaitForPodInitialized.

diff --git a/pkg/skaffold/kubernetes/wait.go b/pkg/skaffold/kubernetes/wait.go
index f1d517f6a..11d5e8f09 100644
--- a/pkg/skaffold/kubernetes/wait.go
+++ b/pkg/skaffold/kubernetes/wait.go
@@ -32,6 +32,8 @@ import (
 	"k8s.io/apimachinery/pkg/watch"
 	"k8s.io/client-go/kubernetes"
 	corev1 "k8s.io/client-go/kubernetes/typed/core/v1"
+	"k8s.io/client-go/tools/cache"
+	watchtools "k8s.io/client-go/tools/watch"

 	"github.com/GoogleContainerTools/skaffold/v2/pkg/skaffold/output/log"
 )
@@ -61,7 +63,7 @@ func watchUntilTimeout(ctx context.Context, timeout time.Duration, w watch.Inter
 func WaitForPodSucceeded(ctx context.Context, pods corev1.PodInterface, podName string, timeout time.Duration) error {
 	log.Entry(ctx).Infof("Waiting for %s to be complete", podName)

-	w, err := pods.Watch(ctx, metav1.ListOptions{})
+	w, err := newPodsWatcher(ctx, pods)
 	if err != nil {
 		return fmt.Errorf("initializing pod watcher: %s", err)
 	}
@@ -101,7 +103,7 @@ func isPodSucceeded(podName string) func(event *watch.Event) (bool, error) {
 func WaitForPodInitialized(ctx context.Context, pods corev1.PodInterface, podName string) error {
 	log.Entry(ctx).Infof("Waiting for %s to be initialized", podName)

-	w, err := pods.Watch(ctx, metav1.ListOptions{})
+	w, err := newPodsWatcher(ctx, pods)
 	if err != nil {
 		return fmt.Errorf("initializing pod watcher: %s", err)
 	}
@@ -154,3 +156,16 @@ func WaitForDeploymentToStabilize(ctx context.Context, c kubernetes.Interface, n
 		return false, nil
 	})
 }
+
+func newPodsWatcher(ctx context.Context, pods corev1.PodInterface) (watch.Interface, error) {
+	initList, err := pods.List(ctx, metav1.ListOptions{})
+	if err != nil {
+		return nil, err
+	}
+
+	return watchtools.NewRetryWatcher(initList.GetResourceVersion(), &cache.ListWatch{
+		WatchFunc: func(listOptions metav1.ListOptions) (watch.Interface, error) {
+			return pods.Watch(ctx, listOptions)
+		},
+	})
+}
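To make the idea behind the patch easier to follow, here is a minimal, self-contained sketch of the same RetryWatcher pattern outside the skaffold tree. It is illustrative only: the waitForPodPhase function and its client/namespace parameters are made up for this example and are not part of skaffold or of the patch above.

package podwait

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	watchtools "k8s.io/client-go/tools/watch"
)

// waitForPodPhase blocks until the named pod reaches the given phase. The
// RetryWatcher re-establishes the underlying watch from the last observed
// resourceVersion whenever the API server closes the connection, so a build
// that outlives the server-side watch timeout is still observed to completion.
func waitForPodPhase(ctx context.Context, client kubernetes.Interface, namespace, podName string, phase corev1.PodPhase) error {
	pods := client.CoreV1().Pods(namespace)

	// List once to obtain a resourceVersion to start watching from.
	initList, err := pods.List(ctx, metav1.ListOptions{})
	if err != nil {
		return fmt.Errorf("listing pods: %w", err)
	}

	w, err := watchtools.NewRetryWatcher(initList.GetResourceVersion(), &cache.ListWatch{
		WatchFunc: func(options metav1.ListOptions) (watch.Interface, error) {
			return pods.Watch(ctx, options)
		},
	})
	if err != nil {
		return fmt.Errorf("initializing pod watcher: %w", err)
	}
	defer w.Stop()

	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case event, ok := <-w.ResultChan():
			if !ok {
				return fmt.Errorf("watch channel closed before %s reached phase %s", podName, phase)
			}
			if pod, ok := event.Object.(*corev1.Pod); ok && pod.Name == podName && pod.Status.Phase == phase {
				return nil
			}
		}
	}
}

Compared to a plain pods.Watch call, the practical difference is that the API server is allowed to drop the long-lived watch connection (which it eventually does), and the RetryWatcher reconnects from the last seen resourceVersion instead of leaving the caller reading from a dead channel. |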
Hi, I'm experiencing the exact same problem. In my case the build takes around 40 minutes, and even though kaniko's pod finishes successfully, Skaffold is not aware of it and hangs forever. @mikedld Were you able to solve the problem in another way? |
@JRuedas, my pipeline builds skaffold with this patch applied instead of installing a prebuilt binary. It adds about 3 minutes, which is negligible compared to the actual image builds (about 3-5 hours). So no, I haven't found another way and am still waiting for this to be fixed upstream. |
Expected behavior
Cluster (kaniko) build succeeds regardless of how long it takes.
Actual behavior
If the build takes considerable time (in my case, more than 35-55 minutes), skaffold hangs after the image has been built and pushed.
Information
Steps to reproduce the behavior
skaffold --interactive=false --verbosity=debug build --default-repo=000000000000.dkr.ecr.eu-west-1.amazonaws.com
While trying to troubleshoot, I've cloned the repo, added some logging (patch attached), and built it myself, hence the version in the log doesn't match the 2.3.0 release but current main. The log shows that at some point (before the image is built and before the cluster timeout is reached) the pods watcher starts to report events with an empty type and a nil object in a tight loop; no pod termination event is ever reported.

I've adjusted the cluster timeout for this report so that it expires 2 minutes after the sleep in the Dockerfile ends, to reduce the log size (otherwise Jenkins kills the build). Increasing the cluster timeout doesn't help: I once waited for 4 hours and nothing happened, skaffold was still sitting there waiting after the "Pushed <image>" message.
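One plausible explanation for those empty events, and this is an inference rather than something confirmed against the skaffold sources: the API server eventually closes the long-lived watch, the watch result channel is closed, and a receive loop that never checks the channel's ok flag then keeps getting zero-value events (empty type, nil object). A minimal sketch of that Go behaviour:

package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/watch"
)

func main() {
	// Simulate a watch whose result channel has been closed on the server side.
	events := make(chan watch.Event)
	close(events)

	// A receive from a closed channel succeeds immediately and yields the zero
	// value: an Event with empty Type and nil Object. A wait loop that never
	// checks the second return value (event, ok := <-events) therefore spins
	// on these empty events forever instead of noticing the watch has ended.
	for i := 0; i < 3; i++ {
		event := <-events
		fmt.Printf("event %d: type=%q object=%v\n", i, event.Type, event.Object)
	}
}

That failure mode is also what the RetryWatcher patch above would side-step, since it re-establishes the watch instead of leaving a closed channel in place.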
Increasing cluster resource requests (for the kaniko pod) doesn't help either; I tried with 3000m CPU and 8Gi memory. The EKS nodes are r5.xlarge and were idling during the test.
Files: