Fix OOM and Handler timeout issue by only returning one item in ListAllMetrics by default #623
Do we know the root cause of "http2: stream closed"? Could you also elaborate on how returning an empty ListAllMetrics response would solve this issue?
ListAllMetrics is called during API discovery and isn't needed for HPA. In this HPA case, if the number of target pods is 5, this function is called 5 times at the same time, which does not scale to a large cluster. In my small cluster, ListAllMetrics was called 11 times within 60s, and each call returned 864502*2 bytes = 1.72 megabytes.
It's a timeout issue, so it will be fixed once the function returns empty values.
/retest I just added more CI pipelines.
…item for ListCustomMetrics
Hi @slash4, could you check the version by I didn't see this error in my cluster, will use custom-metrics-stackdriver-adapter/deploy/production/adapter_new_resource_model.yaml to reproduce. |
Sure : here's the output of the command 👍
I'm so grateful you responded so quickly. Thanks :) |
btw, I found this one https://stackoverflow.com/questions/67073909/error-scaling-up-in-hpa-in-gke-apiserver-was-unable-to-write-a-json-response-h, could you try this? |
Regarding this : https://stackoverflow.com/questions/67073909/error-scaling-up-in-hpa-in-gke-apiserver-was-unable-to-write-a-json-response-h I'm already in Could it be the |
Ok, I understand that spam logs are annoying. If I need more information, will let you know. |
Thanks a lot ! Sure, I'm here if you need me :) |
Hi @slash4, does your cluster still have spam logs? Just want to see whether spam logs are gone after x hours. |
@CatherineF-dev @slash4 are the spam logs gone now? If yes, please let me know what the root cause was, since we are also seeing the same errors.
Could you provide detailed steps to reproduce? @hariapollo Are you using the latest custom-metrics-stackdriver-adapter? |
Hey @CatherineF-dev, We were on
|
After upgrading it to
|
@hariapollo could you open a new issue? Trace logs are not related to this issue. They are ignorable. |
Sure, you mean for cert auth issue right? |
Adapted from #311
Also,
Instead of returning an empty list, this PR returns 1 metric resource item, because an empty response is not acceptable for generic clients that list resources through API discovery, such as the namespace garbage collector and GitOps services like Config Sync and ArgoCD. An APIService that returns an empty list is invalid and causes an error in client-go.
Added a feature gate, list-full-custom-metrics, with default value false. metricsCache is the same as before.
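The gating behavior described above could look roughly like the following. This is a minimal sketch, not the code in this PR: the MetricInfo type, the listAllMetrics helper, and the metric names are hypothetical stand-ins, and the sketch assumes the single returned item is simply the first cached entry.

```go
package main

import "fmt"

// MetricInfo is a hypothetical stand-in for the adapter's metric descriptor.
type MetricInfo struct {
	Resource   string
	Namespaced bool
	Metric     string
}

// listAllMetrics returns the whole cache when listFull is true (the
// list-full-custom-metrics gate enabled). Otherwise it returns at most one
// item, so the discovery response stays small but is not empty as long as
// the cache holds at least one metric.
func listAllMetrics(cache []MetricInfo, listFull bool) []MetricInfo {
	if listFull || len(cache) <= 1 {
		return cache
	}
	return cache[:1]
}

func main() {
	// Hypothetical metric names, for illustration only.
	cache := []MetricInfo{
		{Resource: "*", Namespaced: true, Metric: "example_metric_a"},
		{Resource: "*", Namespaced: true, Metric: "example_metric_b"},
		{Resource: "*", Namespaced: true, Metric: "example_metric_c"},
	}
	fmt.Println("gate off:", len(listAllMetrics(cache, false)))
	fmt.Println("gate on: ", len(listAllMetrics(cache, true)))
}
```

Note that this sketch returns an empty slice when the cache itself is empty; how that case should be handled is left open here.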
Fixes: #582, #545, #510, #458
Tested:
The memory drop will be more significant in a large cluster.
"http2: stream closed" comes from calling /apis/custom.metrics.k8s.io/v1beta2. Since the endpoint now returns only 1 item, a timeout error is unlikely. ListMetrics is not used in HPA, so it's safe to change it.
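One way to check how large the discovery response is before and after this change is to count the entries and bytes in the returned document. The sketch below assumes the body is a standard APIResourceList JSON document (for example, saved from `kubectl get --raw /apis/custom.metrics.k8s.io/v1beta2`); the summarizeDiscovery helper and the sample payload are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// apiResourceList mirrors only the fields of an APIResourceList that this
// sketch needs.
type apiResourceList struct {
	GroupVersion string `json:"groupVersion"`
	Resources    []struct {
		Name string `json:"name"`
	} `json:"resources"`
}

// summarizeDiscovery reports how many resources a discovery document lists
// and how many bytes the document occupies.
func summarizeDiscovery(body []byte) (resources int, size int, err error) {
	var list apiResourceList
	if err := json.Unmarshal(body, &list); err != nil {
		return 0, len(body), err
	}
	return len(list.Resources), len(body), nil
}

func main() {
	// Hypothetical sample payload with a single resource entry.
	sample := []byte(`{"kind":"APIResourceList","apiVersion":"v1",` +
		`"groupVersion":"custom.metrics.k8s.io/v1beta2",` +
		`"resources":[{"name":"*/example_metric","namespaced":true,` +
		`"kind":"MetricValueList","verbs":["get"]}]}`)

	n, size, err := summarizeDiscovery(sample)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("resources=%d bytes=%d\n", n, size)
}
```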
Cons:
API discovery at /apis/custom.metrics.k8s.io/v1beta2 returns an incomplete resource list instead of listing all available metrics. This is fine since customers can find full metric names from GCP monitoring dashboards. Setting the feature gate list-full-custom-metrics to true returns all custom metrics.