bug: leaking memory and goroutines #3016

Closed
adrianrobotka opened this issue Oct 5, 2023 · 8 comments · Fixed by #3026
Assignees: mjudeikis
Labels: kind/bug, kind/regression, priority/important-soon, severity/timebomb

Comments


adrianrobotka commented Oct 5, 2023

Describe the bug

KCP consumes an increasing amount of memory until it is OOM-killed. Looking at other metrics from its exporter, I found that the number of goroutines correlates strongly with memory consumption. Given the high number of goroutines, I suspect a resource-leak type programming error.
[image: kcp_leak]

Steps To Reproduce

I used the ghcr.io/kcp-dev/kcp:0531e126 image in Kubernetes, which corresponds to v0.20.0.

My application has the following dimensions, in orders of magnitude:

  • well under 100 CRs
  • well under 10 CRDs
  • on average one status update every 10+ seconds
  • resource creation hardly ever
  • fewer than 10 controllers watching

Expected Behaviour

KCP should not consume all the memory it is given.

Additional Context

I used pprof to profile KCP remotely. All the captures are available in the attached zip file.
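
For reference, a goroutine dump like the attached ones can be fetched directly from a pprof HTTP endpoint. The minimal sketch below assumes the endpoint has been made reachable at localhost:6060 (for example via a port-forward); the address and output path are illustrative, not taken from this setup:

```go
// fetch_goroutines.go: grab a full goroutine dump from a process that
// exposes net/http/pprof. The URL assumes a port-forward to localhost:6060.
package main

import (
	"io"
	"log"
	"net/http"
	"os"
)

func main() {
	// debug=2 returns the full stack trace of every goroutine, which is
	// the text format quoted later in this issue.
	resp, err := http.Get("http://localhost:6060/debug/pprof/goroutine?debug=2")
	if err != nil {
		log.Fatalf("fetching goroutine profile: %v", err)
	}
	defer resp.Body.Close()

	out, err := os.Create("goroutines.txt")
	if err != nil {
		log.Fatalf("creating output file: %v", err)
	}
	defer out.Close()

	if _, err := io.Copy(out, resp.Body); err != nil {
		log.Fatalf("writing profile: %v", err)
	}
	log.Println("wrote goroutines.txt")
}
```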

adrianrobotka added the kind/bug label on Oct 5, 2023

adrianrobotka commented Oct 5, 2023

I tried to examine my samples and read up on the usual suspects of goroutine leaks.
My current understanding is that pprof-goroutines.svg points to two hot spots. I hope that patching those potential leak points will be enough to eliminate the vast majority of the leaks.

Next to pprof-goroutines.svg, a goroutine dump text file helped me locate the relevant code lines:

Excerpts:

goroutine 20852 [select, 14 minutes]:
github.com/kcp-dev/apimachinery/v2/third_party/informers.(*processorListener).pop(0xc00c262400)
	/go/pkg/mod/github.com/kcp-dev/apimachinery/[email protected]/third_party/informers/shared_informer.go:660 +0x119
...

goroutine 22074 [chan receive, 14 minutes]:
github.com/kcp-dev/apimachinery/v2/third_party/informers.(*processorListener).run.func1()
	/go/pkg/mod/github.com/kcp-dev/apimachinery/[email protected]/third_party/informers/shared_informer.go:690 +0x57
...

The referenced code lines are processorListener.pop and processorListener.run in shared_informer.go.

Most probably at least one of these is a valid leak point (a goroutine blocked on a channel that never receives data and is never closed).
Ideas for a workaround:

  • the selects should use timeouts
  • a range over a channel usually hangs when the channel is never closed -> this needs further investigation (see the sketch below)
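
To make the suspected pattern concrete, here is a deliberately minimal, hypothetical sketch (not the actual shared_informer.go code) of a pop/run pair in which nothing ever closes the notification channel and nothing cancels the select, so both goroutines hang forever once events stop arriving:

```go
// Hypothetical sketch of the suspected leak: pop has no stop/cancel case in
// its select, and it never closes nextCh, so run's range never terminates.
package main

import "time"

type listener struct {
	addCh  chan int
	nextCh chan int
}

// pop shuttles items from addCh to nextCh, buffering them in between.
// Once addCh goes quiet, this goroutine blocks in the select forever.
func (l *listener) pop() {
	var pending []int
	var out chan int // nil disables the send case until there is something to send
	var next int
	for {
		if out == nil && len(pending) > 0 {
			next, pending = pending[0], pending[1:]
			out = l.nextCh
		}
		select {
		case item := <-l.addCh:
			pending = append(pending, item)
		case out <- next: // never selected while out is nil
			out = nil
		}
	}
}

// run consumes notifications; the range only ends when nextCh is closed,
// which never happens in this sketch (or, per the dumps above, in the leak).
func (l *listener) run() {
	for item := range l.nextCh {
		_ = item // handle the notification
	}
}

func main() {
	l := &listener{addCh: make(chan int), nextCh: make(chan int)}
	go l.pop()
	go l.run()
	l.addCh <- 1
	time.Sleep(100 * time.Millisecond) // both goroutines are now parked forever
}
```

Under these conditions, every listener that is started and never properly stopped leaves two parked goroutines behind, which matches the pop and run.func1 frames in the dump above.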

mjudeikis added the priority/important-soon, kind/regression, and severity/timebomb labels on Oct 6, 2023
adrianrobotka (Author) commented:

I tried to understand how shared_informer.go works in order to come up with a workaround, but without success.

While reading the code I found a comment:

a failing/stalled listener will have infinite pendingNotifications added until we OOM.

This raises a concern: what could I check to ensure that my app behaves correctly when using the API?

Apart from that, what else should I check or try in order to help with this issue?
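
One client-side thing that is commonly checked in this situation is whether event handlers block: the shared informer buffers notifications for a handler that does not return promptly, which is exactly the pendingNotifications growth the quoted comment warns about. Below is a hedged sketch of the usual non-blocking pattern, using upstream client-go types for illustration (the helper and parameter names are made up for this example):

```go
// Sketch of non-blocking informer handlers: each callback only computes a
// key and enqueues it, so the informer's notification buffer stays drained.
package example

import (
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/util/workqueue"
)

// registerHandlers is a hypothetical helper; the real work is done later by
// a worker goroutine draining the queue, never inside the handlers themselves.
func registerHandlers(informer cache.SharedIndexInformer, queue workqueue.RateLimitingInterface) {
	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			if key, err := cache.MetaNamespaceKeyFunc(obj); err == nil {
				queue.Add(key) // cheap and non-blocking
			}
		},
		UpdateFunc: func(_, newObj interface{}) {
			if key, err := cache.MetaNamespaceKeyFunc(newObj); err == nil {
				queue.Add(key)
			}
		},
		DeleteFunc: func(obj interface{}) {
			if key, err := cache.DeletionHandlingMetaNamespaceKeyFunc(obj); err == nil {
				queue.Add(key)
			}
		},
	})
}
```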


mjudeikis commented Oct 9, 2023

If you are in a position to try the main branch now, that would be helpful, so we know the bug still exists on main and we can try to look into it. We just landed the 1.28 rebase.
Other than that, I think we need to sit down and try to replicate it (it's not as straightforward as it sounds ;/).

adrianrobotka (Author) commented:

Same here: in shared_informer.go, run() and pop() leak goroutines.

OCI image: ghcr.io/kcp-dev/kcp:02d405e0

goroutine 42290 [chan receive, 31 minutes]:
github.com/kcp-dev/apimachinery/v2/third_party/informers.(*processorListener).run.func1()
	/go/pkg/mod/github.com/kcp-dev/apimachinery/[email protected]/third_party/informers/shared_informer.go:702 +0x57
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x30?)
	/go/pkg/mod/github.com/kcp-dev/kubernetes/staging/src/k8s.io/[email protected]/pkg/util/wait/backoff.go:226 +0x3e
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc01ec7af38?, {0x3c05580, 0xc01590c390}, 0x1, 0xc022e23380)
	/go/pkg/mod/github.com/kcp-dev/kubernetes/staging/src/k8s.io/[email protected]/pkg/util/wait/backoff.go:227 +0xb6
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x5996730?, 0x3b9aca00, 0x0, 0xf0?, 0x5996730?)
	/go/pkg/mod/github.com/kcp-dev/kubernetes/staging/src/k8s.io/[email protected]/pkg/util/wait/backoff.go:204 +0x89
k8s.io/apimachinery/pkg/util/wait.Until(...)
	/go/pkg/mod/github.com/kcp-dev/kubernetes/staging/src/k8s.io/[email protected]/pkg/util/wait/backoff.go:161
github.com/kcp-dev/apimachinery/v2/third_party/informers.(*processorListener).run(0xc00bed3950)
	/go/pkg/mod/github.com/kcp-dev/apimachinery/[email protected]/third_party/informers/shared_informer.go:701 +0x6b
k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1()
	/go/pkg/mod/github.com/kcp-dev/kubernetes/staging/src/k8s.io/[email protected]/pkg/util/wait/wait.go:72 +0x5a
created by k8s.io/apimachinery/pkg/util/wait.(*Group).Start
	/go/pkg/mod/github.com/kcp-dev/kubernetes/staging/src/k8s.io/[email protected]/pkg/util/wait/wait.go:70 +0x85

goroutine 61501 [select, 28 minutes]:
github.com/kcp-dev/apimachinery/v2/third_party/informers.(*processorListener).pop(0xc00ef2e6c0)
	/go/pkg/mod/github.com/kcp-dev/apimachinery/[email protected]/third_party/informers/shared_informer.go:672 +0x119
k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1()
	/go/pkg/mod/github.com/kcp-dev/kubernetes/staging/src/k8s.io/[email protected]/pkg/util/wait/wait.go:72 +0x5a
created by k8s.io/apimachinery/pkg/util/wait.(*Group).Start
	/go/pkg/mod/github.com/kcp-dev/kubernetes/staging/src/k8s.io/[email protected]/pkg/util/wait/wait.go:70 +0x85
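
As a side note, leaks of this shape can also be surfaced in unit tests. Here is a minimal, hypothetical sketch using go.uber.org/goleak (an assumption for illustration; not necessarily what the kcp test suite uses):

```go
// Hypothetical test sketch: goleak fails the test if goroutines such as
// processorListener.run/pop are still alive after the informer is stopped.
package informers_test

import (
	"testing"

	"go.uber.org/goleak"
)

func TestNoGoroutineLeakAfterShutdown(t *testing.T) {
	defer goleak.VerifyNone(t)

	// Start an informer factory here, let it sync, then stop it.
	// VerifyNone then reports any goroutine that outlived the shutdown.
}
```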


adrianrobotka (Author) commented:

I am up for making an environment/app to reproduce this (if I understood you correctly about replicating the issue). However, I don't know how hard it will be, time-wise.

adrianrobotka (Author) commented:

I got advice (from my colleague) to find the point at which the leak started.
I managed to narrow it down to between 7d017c7 and 88d6071. These commits correspond to two consecutively merged PRs on main. 7d017c7 was fine without leaking, while 88d6071 started to leak (with my app).

mjudeikis (Contributor) commented:

As we have moved a lot of things around recently and lost some tooling in the process, I am going to try to create an observable development setup on my end today. It would help if you could create a simple reproducer we could run.

adrianrobotka (Author) commented:

I managed to make a simple app/environment to reproduce the issue: https://github.com/adrianrobotka/kcp-demo

Against my expectations, even the simplest possible controller causes leakage. Details are in the readme of the linked repo; an illustrative sketch follows below.
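
For a sense of what "simplest possible controller" means here, a minimal watcher along these lines is enough. This is an illustrative sketch only, with a placeholder kubeconfig path and GVR, not the actual code from the kcp-demo repo:

```go
// Hypothetical minimal "controller": a single dynamic informer watching one
// resource type and logging events. Placeholder kubeconfig path and GVR.
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/dynamic/dynamicinformer"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load a kubeconfig pointing at a kcp workspace (path is an example).
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Watch an example resource; the GVR below is a placeholder.
	gvr := schema.GroupVersionResource{Group: "example.dev", Version: "v1", Resource: "widgets"}
	factory := dynamicinformer.NewDynamicSharedInformerFactory(client, 10*time.Minute)
	informer := factory.ForResource(gvr).Informer()
	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc:    func(obj interface{}) { fmt.Println("added") },
		UpdateFunc: func(_, _ interface{}) { fmt.Println("updated") },
		DeleteFunc: func(obj interface{}) { fmt.Println("deleted") },
	})

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	factory.WaitForCacheSync(stop)
	select {} // run until killed
}
```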

mjudeikis self-assigned this on Oct 12, 2023
embik moved this from New to In Progress in kcp on Oct 12, 2023
Moved from In Progress to Done in kcp on Oct 16, 2023