bug: fluent-bit process once again hangs sometimes after being restarted #1407

Open
jjsiv opened this issue Nov 8, 2024 · 3 comments

@jjsiv
Collaborator

jjsiv commented Nov 8, 2024

Describe the issue

Some time ago, the fluentbit-watcher was reworked to use Fluent Bit's hot-reload feature:
90d364b

That rework also removed the SIGKILL call that was sent when the process hangs, so the issue I initially reported in #510 has been reintroduced.

This is something that ideally would be fixed in fluent-bit itself (and I will report it there as well once I investigate this problem more in-depth and can reproduce it consistently...), but in the meantime I think it would be great to have handling for these situations reintroduced in fluent-operator.
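
For illustration, the kind of handling meant here could look roughly like the Go sketch below: trigger the reload, poll a health check until a deadline, and kill the process if it never recovers. This is not the code that was removed in 90d364b. `reloadOrKill`, its three callbacks, and the timeout values are all hypothetical names picked for the example.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errHung is returned when the process did not become healthy in time
// and had to be killed.
var errHung = errors.New("process did not recover after reload; killed")

// reloadOrKill triggers a reload, then polls healthy() until it returns
// true or the timeout elapses. On timeout it calls kill() so that a
// supervisor (or Kubernetes) can start a fresh process.
//
// All three callbacks are hypothetical hooks (assumptions, not
// fluent-operator APIs): reload might send SIGHUP, healthy might query
// an HTTP endpoint, and kill might call cmd.Process.Kill() from os/exec.
func reloadOrKill(reload func() error, healthy func() bool, kill func() error, timeout, poll time.Duration) error {
	if err := reload(); err != nil {
		return fmt.Errorf("reload: %w", err)
	}
	deadline := time.Now().Add(timeout)
	for {
		if healthy() {
			return nil
		}
		if !time.Now().Before(deadline) {
			break
		}
		time.Sleep(poll)
	}
	if err := kill(); err != nil {
		return fmt.Errorf("kill after hang: %w", err)
	}
	return errHung
}

// demo runs the helper against a fake process that never becomes
// healthy and reports whether kill was invoked.
func demo() (killed bool, err error) {
	err = reloadOrKill(
		func() error { return nil },                // pretend the reload signal was sent
		func() bool { return false },               // the fake process never recovers
		func() error { killed = true; return nil }, // record that kill ran
		50*time.Millisecond, 10*time.Millisecond,
	)
	return killed, err
}

func main() {
	killed, err := demo()
	fmt.Println("killed:", killed, "err:", err)
}
```

In a real watcher the `kill` hook could call `cmd.Process.Kill()` from `os/exec`, which sends SIGKILL on Unix. The hard part is `healthy`: it needs some signal that the reload actually finished, and what that signal should be is still an open question here.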

To Reproduce

No clear steps to reproduce. It seems to happen when fluent-bit is restarted many times in a row, but not every time.

Expected behavior

Fluent-bit is restarted and works

Your Environment

- Fluent Operator version:
- Container Runtime:
- Operating system:
- Kernel version:

How did you install fluent operator?

No response

Additional context

Keeping this as somewhat of a reminder to get back to this after 18.11 or so.

@wenchajun
Member

I think this issue might be due to Fluent Bit. I will try to reproduce and test it.

@wenchajun wenchajun added the bug Something isn't working label Nov 18, 2024
@jjsiv
Collaborator Author

jjsiv commented Nov 18, 2024

Sure. I've been testing a livenessProbe as a workaround to restart the pod when it happens, but I'm not sure yet whether it works. Here is a log from when the issue happens:

level=info time=2024-10-07T14:23:52Z msg="Config file changed, reloading..."
[2024/10/07 14:23:52] [engine] caught signal (SIGHUP)
level=info time=2024-10-07T14:23:52Z msg="Config file changed, reloading..."
level=info time=2024-10-07T14:23:52Z msg="Config file changed, reloading..."
[2024/10/07 14:23:52] [engine] caught signal (SIGHUP)
[2024/10/07 14:23:52] [2024/10/07 14:23:52] [error] reloading in progress, aborting.
[engine] caught signal (SIGHUP)
[2024/10/07 14:23:52] [error] reloading in progress, aborting.
[2024/10/07 14:23:52] [error] reloading in progress, aborting.
level=info time=2024-10-07T15:35:46Z msg="Config file changed, reloading..."
[2024/10/07 15:35:46] [engine] caught signal (SIGHUP)
[2024/10/07 15:35:46] [error] reloading in progress, aborting.
level=info time=2024-10-07T15:35:46Z msg="Config file changed, reloading..."
level=info time=2024-10-07T15:35:46Z msg="Config file changed, reloading..."
[2024/10/07 15:35:46] [engine] caught signal (SIGHUP)
[2024/10/07 15:35:46] [error] reloading in progress, aborting.
level=info time=2024-10-07T16:35:34Z msg="Config file changed, reloading..."
[2024/10/07 16:35:34] [engine] caught signal (SIGHUP)
[2024/10/07 16:35:34] [error] reloading in progress, aborting.
level=info time=2024-10-07T16:35:34Z msg="Config file changed, reloading..."
level=info time=2024-10-07T16:35:34Z msg="Config file changed, reloading..."
[2024/10/07 16:35:34] [engine] caught signal (SIGHUP)
[2024/10/07 16:35:34] [2024/10/07 16:35:34] [error] reloading in progress, aborting.
[engine] caught signal (SIGHUP)
[2024/10/07 16:35:34] [error] reloading in progress, aborting.

And nothing happens after that. The process is still running in the pod, but logs are not collected. I did not think to check whether the server is responsive, but I will if I see it happen again.
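
A responsiveness check like that could be sketched in Go as a GET with a hard timeout. `isResponsive` and `demo` are hypothetical names, and the example only probes two local stand-in servers (one that answers, one that stalls). Which Fluent Bit URL to point it at, and whether the built-in HTTP server even stops answering during this hang, are assumptions that still need to be verified.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// isResponsive reports whether a GET to url returns a 2xx status
// within the given timeout. Any error, including a timeout, counts
// as "not responsive".
func isResponsive(url string, timeout time.Duration) bool {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return false
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode >= 200 && resp.StatusCode < 300
}

// demo probes two local stand-in servers: one answers immediately,
// the other stalls for longer than the probe timeout.
func demo() (up, hung bool) {
	ok := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	}))
	defer ok.Close()

	stalled := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Simulate a hung process: hold the request until the client
		// gives up or two seconds pass, whichever comes first.
		select {
		case <-r.Context().Done():
		case <-time.After(2 * time.Second):
		}
	}))
	defer stalled.Close()

	return isResponsive(ok.URL, 500*time.Millisecond),
		isResponsive(stalled.URL, 100*time.Millisecond)
}

func main() {
	up, hung := demo()
	fmt.Println("healthy server responsive:", up)
	fmt.Println("stalled server responsive:", hung)
}
```

If the HTTP server turns out to keep answering while the reload is stuck, a check like this (or an HTTP livenessProbe) would not catch the hang, and the probe would have to look at something else, such as whether the reload ever completes.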

Ultimately, I think this issue could be closed and moved to fluent-bit's repo. Perhaps it shouldn't be fixed in fluent-operator, since any "fix" there would be just a workaround.

@jjsiv
Collaborator Author

jjsiv commented Nov 22, 2024

I've added some more info on an existing fluent-bit issue: fluent/fluent-bit#9354 (comment)

@wenchajun @benjaminhuo - what is your opinion on this? Should a workaround for this problem be added to fluent-operator once again, or should we wait until it is resolved in fluent-bit (uncertain when)?
