gcsfusecsi-metrics-collector container getting OOM killed #373

Open
pdfrod opened this issue Nov 11, 2024 · 6 comments


pdfrod commented Nov 11, 2024

I'm experiencing occasional OOM kills of the gcsfusecsi-metrics-collector container (part of the gcsfusecsi-node DaemonSet). This container has a somewhat low memory limit (30Mi). Is there a way to customize the memory limit of this container?
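
For reference, one way to confirm that it is the metrics-collector container being OOM killed (a minimal sketch; the pod name below is a placeholder for one of the gcsfusecsi-node pods, which run in kube-system on GKE):

  # List the CSI driver's per-node pods
  kubectl get pods -n kube-system | grep gcsfusecsi-node

  # Check why the metrics-collector container last terminated
  kubectl get pod <gcsfusecsi-node-pod> -n kube-system \
    -o jsonpath='{.status.containerStatuses[?(@.name=="gcsfusecsi-metrics-collector")].lastState.terminated.reason}'
  # Prints "OOMKilled" after an out-of-memory kill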


hime commented Nov 11, 2024

Hi @pdfrod, this is interesting behavior. Can you provide the GKE cluster version and the number of pods you are running on each node? Can you also confirm whether this is causing issues in your workload? If so, I can provide the steps to disable metrics exporting.

Could you also share the Cluster ID with me? You can get it by running:

  gcloud container clusters describe <cluster-name> --location <cluster-location> | grep id:

@hime hime self-assigned this Nov 11, 2024
@hime hime added the bug Something isn't working label Nov 11, 2024

pdfrod commented Nov 12, 2024

Sure, here's the info you requested @hime.

  • GKE cluster version: 1.31.1-gke.1678000
  • The number of pods and nodes varies due to autoscaling. At the moment the busiest node has 18 pods (9 kube-system pods and 9 application pods). In total there are 98 pods (69 kube-system + 29 application pods) across 8 nodes.
  • I haven't noticed any issues with my workloads, but the pods that use the GCS FUSE CSI driver only need to access the volume very rarely (maybe a couple of times a week), so I would be very unlikely to notice any problems.
  • The Cluster ID is 121bfe79164042aa9d9011c96cc4c2166952fc6e990d4282b9d3be45c069f917.

I should probably mention that I don't remember seeing this problem when there were just a couple of deployments using this driver. Now that I have 12 deployments using the driver, I'm seeing OOM kills of the metrics collector every day.

If there's a way to disable the metrics collector container, that would be even better as currently I'm not using those metrics.

Let me know if you need more info.


hime commented Nov 13, 2024

Thank you @pdfrod. Could you add the following volume attribute to your spec? See details here.

volumeAttributes:
 ...
 disableMetrics: "true"
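
For context, a fuller sketch of where that attribute sits when the bucket is mounted as a CSI ephemeral volume (the volume and bucket names are placeholders; for a PersistentVolume the same attributes go under spec.csi.volumeAttributes):

  # Pod spec excerpt (CSI ephemeral volume)
  volumes:
  - name: gcs-fuse-csi-ephemeral
    csi:
      driver: gcsfuse.csi.storage.gke.io
      volumeAttributes:
        bucketName: <your-bucket>     # placeholder
        disableMetrics: "true"        # disables GCSFuse metrics collection for this volume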

Please let me know if that stops the OOMs. We are working on fixing this issue.


pdfrod commented Nov 13, 2024

Cool, I'll give that a try. Thanks!


hime commented Nov 13, 2024

Hi @pdfrod, thanks for reporting this issue! I have created and merged #375 to disable metrics exporting by default. We're going to have to research a good way to scale this solution for customers running many workloads on the same VM.


pdfrod commented Nov 14, 2024

Cool, thanks a lot!

Since I've disabled metrics on my cluster, I haven't seen any OOM kills, so it's looking good so far.
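
For anyone hitting the same issue, an easy way to keep an eye on this (standard kubectl; the pod name prefix comes from the DaemonSet):

  # The RESTARTS column should stay flat once the OOM kills stop
  kubectl get pods -n kube-system | grep gcsfusecsi-node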
