Splunk Operator: slow mounting of ebs volume hence pod is keeping "container creating" state for too long #1288

Closed
yaroslav-nakonechnikov opened this issue Feb 20, 2024 · 17 comments

@yaroslav-nakonechnikov

Please select the type of request

Enhancement

Tell us more

Describe the request
We are using EBS volumes of quite large size (10 TB+) for indexers, and it is sometimes required to move pods to a different node. We found that mounting the EBS volume and starting the pod takes far too long: in our case, about 70 minutes just to start the pod after it is assigned to a node.

After investigation, we found that by default Kubernetes recursively changes ownership and permissions of the volume contents to match the pod's fsGroup (ref: https://kubernetes.io/docs/tasks/configure-pod-container/security-context/#configure-volume-permission-and-ownership-change-policy-for-pods), and on large volumes this takes a lot of time.

Expected behavior
The documentation should mention this, with some examples of how to solve it, and the CRD should have fsGroupChangePolicy = "OnRootMismatch" as the default value.
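
For reference, this is roughly what the setting looks like on a plain pod spec, per the Kubernetes docs linked above (a minimal sketch; the fsGroup value, image, and PVC name are placeholders, not values the operator renders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: fsgroup-policy-demo
spec:
  securityContext:
    fsGroup: 41812                         # placeholder GID; use whatever your workload runs as
    fsGroupChangePolicy: "OnRootMismatch"  # skip the recursive chown/chmod when the volume root already matches
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: demo-ebs-claim          # hypothetical pre-created PVC backed by an EBS volume
  containers:
    - name: app
      image: busybox:1.36
      command: ["sh", "-c", "while true; do sleep 3600; done"]
      volumeMounts:
        - name: data
          mountPath: /data
```

With the default policy (Always), the kubelet walks the entire volume on every mount, which is presumably what makes a 10 TB volume take over an hour to attach.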

@yaroslav-nakonechnikov yaroslav-nakonechnikov changed the title Splunk Operator: ebs to slow to mount hence pod is keeping "container creating" state for too long Splunk Operator: slow mounting of ebs volume hence pod is keeping "container creating" state for too long Feb 20, 2024
@akondur akondur assigned akondur and unassigned kumarajeet, jryb and vivekr-splunk Feb 20, 2024
@vivekr-splunk
Collaborator

@yaroslav-nakonechnikov just wanted to check: what version of the Splunk Operator are you using?

@yaroslav-nakonechnikov
Author

@vivekr-splunk the CRD hasn't changed much since the beginning, but I'd say 2.4 and 2.5 don't have that feature.

@logsecvuln

@vivekr-splunk @akondur A Splunk support ticket has also been raised for this matter; please refer to case number "CASE [3423864]".

@akondur
Collaborator

akondur commented Feb 21, 2024

Hi @yaroslav-nakonechnikov, is the request here to change the fsGroupChangePolicy to OnRootMismatch?

@yaroslav-nakonechnikov
Author

The request is to add support for it and to inform users about potential issues with large volumes.

As a result, it could even become the default, since from my perspective it doesn't look necessary to change permissions on every mount.

@akondur
Collaborator

akondur commented Feb 23, 2024

@yaroslav-nakonechnikov, have you tried changing the fsGroupChangePolicy to OnRootMismatch and checking whether that fixes the issue in your environment? This could be done by manually disabling the operator (temporarily) and testing it on one of your Splunk instances. We are currently evaluating the option on our end.

@yaroslav-nakonechnikov
Author

@akondur how? Any change to the StatefulSet/pod leads to it being recreated, and the CRD doesn't have that option.

@akondur
Collaborator

akondur commented Feb 23, 2024

@yaroslav-nakonechnikov You could create a simple Splunk StatefulSet that attaches EBS volumes and try reproducing the issue, then change the policy to see whether it makes a difference. Alternatively, before moving the pods to other nodes, you could delete the operator temporarily and edit the StatefulSet.
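
A minimal standalone repro along those lines might look like the following (a sketch only: the image, group ID, and gp3 StorageClass name are assumptions for an AWS EBS CSI setup, not values taken from the operator):

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: ebs-chown-repro
spec:
  serviceName: ebs-chown-repro
  replicas: 1
  selector:
    matchLabels:
      app: ebs-chown-repro
  template:
    metadata:
      labels:
        app: ebs-chown-repro
    spec:
      securityContext:
        fsGroup: 41812                        # placeholder GID; the recursive chown is what costs time, not the value
        # Toggle between the default ("Always") and "OnRootMismatch" and compare
        # how long the pod sits in ContainerCreating after being rescheduled.
        fsGroupChangePolicy: "OnRootMismatch"
      containers:
        - name: filler
          image: busybox:1.36                 # stand-in workload; the chown happens before any container starts
          command: ["sh", "-c", "while true; do sleep 3600; done"]
          volumeMounts:
            - name: data
              mountPath: /data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: gp3                 # assumes an AWS EBS CSI StorageClass named gp3
        resources:
          requests:
            storage: 100Gi                    # grow this and fill /data with many files to approximate an indexer volume
```

Filling /data with a large number of files, then deleting the pod and letting it be rescheduled, should make the difference between the two policies visible in the time spent in ContainerCreating.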

@yaroslav-nakonechnikov
Author

@akondur in that case, why can't you recheck it yourselves if you already know what to recheck and how?

I reported the problem as a customer; now it is your turn to take it from there and reproduce it. Honestly, I don't understand why I would have to spin up another cluster with another 11 TB of disks and fill it with dummy data. Will you pay for it?

@vivekr-splunk
Collaborator

Hello @yaroslav-nakonechnikov, thank you for investigating this issue and identifying a possible solution. We will replicate the problem on our end and test whether your fix resolves it. We will get back to you soon.

@akondur
Collaborator

akondur commented Feb 28, 2024

Hey @yaroslav-nakonechnikov, we have merged the change to update the fsGroupChangePolicy. Please let us know if the issue still persists and we can revisit it.

@yaroslav-nakonechnikov
Author

@akondur this is good. So now we need to wait until it is released.

For now I don't know how to verify it, given that 2.5.0 and 2.5.1 also don't work as expected.

@akondur
Collaborator

akondur commented Feb 29, 2024

@yaroslav-nakonechnikov We have reverted the change because we are releasing 2.5.2 this week, and will re-introduce it in develop right after. If this change is needed sooner, we will make another minor release. I will update the PR here as soon as it's ready.

@akondur
Collaborator

akondur commented Mar 6, 2024

Hey @yaroslav-nakonechnikov, please find the MR merged into develop here. Please let me know if you're still facing issues with this change.
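
Once a build containing that change is deployed, one way to confirm it took effect is to inspect the StatefulSet the operator renders for the indexers; a rough sketch of the fragment to look for (the user/group IDs shown are placeholders, not guaranteed operator values):

```yaml
# e.g. fragment of the StatefulSet generated by the operator for an indexer cluster
spec:
  template:
    spec:
      securityContext:
        runAsUser: 41812                      # placeholder; whatever UID the operator configures
        fsGroup: 41812                        # placeholder; whatever GID the operator configures
        fsGroupChangePolicy: OnRootMismatch   # presence of this field indicates the fix is in the build
```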

@akondur
Collaborator

akondur commented Apr 16, 2024

Closing this issue per the MR. Please re-open it if the issue still persists.

@akondur akondur closed this as completed Apr 16, 2024
@yaroslav-nakonechnikov
Author

How can it be closed if it is not released yet?

@yaroslav-nakonechnikov
Author

All good, it is there.
