Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vcontrold seems to crash if it receives a signal at the same time a request is processed #141

Open
l3u opened this issue Jul 10, 2024 · 1 comment

Comments

@l3u
Copy link
Collaborator

l3u commented Jul 10, 2024

Hi all,

I'm not completely sure what causes this yet. I changed two things recently: I switched the machine vcontrold runs on from Devuan to Alpine, which is built upon musl and not glibc. However, I'm pretty sure this does not cause the issue, because:

The other thing I changed was: I noticed the vcontrold log to grow big, as it ran for years and I never cared about the log file. So I set up a logrotate script. I thought it was possibly a good idea to restart the daemon via a postrotate script.

What happened was: As of today, for the second time now, vcontrold crashed. I run it since 2015 and I never-ever experienced a single crash. First time, I thought this could be due to Alpine/musl, restarted the daemon and moved on. But today, I has a closer look. The crash happened at the very time the logrotate script ran, after the script would stop the daemon. The log contained "Received SIGTERM" as the last entry. Then, the init system (OpenRC) would not restart the daemon anymore due to it being crashed.

And the time this happened could be the very point in time where my other automated scripts polled the daemon for data. Which possibly works in most of cases, because it's not the very second the request is processed, but those two times it crashed, it may have been really simultaneously.

I thus concluded that maybe, the daemon crashes if some request is processed, and whilst this happens, the daemon receives a signal. Could this be the case? I have write access here and back then, I revised the build system – but I'm not a C pro and I'm not really into such deep stuff like daemon programming. So I fear I can't say much about the code … It would be nice if someone with more insight of this could have a look at the sources and check if we possibly have a bug here.

Apart from that, I now try to simply copytruncate the log and not restart the daemon via logrotate to see if this works – after all, there seems to be no real reason to restart it in the first place. But If we actually have a problem with concurrency, I think we should fix it. And I know that things that happen at the same time through different means can cause a lot of trouble ;-)

Thanks a lot!

Cheers, Tobias

@l3u
Copy link
Collaborator Author

l3u commented Jul 10, 2024

Additional info:

The log when the daemon crashed looks like this:

[32153] Wed Jul 10 02:00:04 2024 : Client connected 127.0.0.1:45842 (FD:1)
[32153] Wed Jul 10 02:00:04 2024 : Received SIGTERM
[1072] Wed Jul 10 02:00:04 2024 : Received SIGTERM

Then I started it manually again, and it continues with:

[1404] Wed Jul 10 07:11:02 2024 : TCP socket 0.0.0.0:3002 opened
[1404] Wed Jul 10 07:15:03 2024 : Client connected 127.0.0.1:48592 (FD:1)
[1404] Wed Jul 10 07:20:03 2024 : Client connected 127.0.0.1:39668 (FD:1)
...

So apparently, it got a connection, then SIGTERM (by the logrotate script), and then, it logged another SIGTERM with already another PID, which should be the actual crash. All in the very same second.

The other logs with no crash look like this:

[30854] Tue Jul  9 02:00:01 2024 : Received SIGTERM
[32153] Tue Jul  9 02:00:04 2024 : TCP socket 0.0.0.0:3002 opened
[32153] Tue Jul  9 02:05:03 2024 : Client connected 127.0.0.1:50344 (FD:1)
[32153] Tue Jul  9 02:10:03 2024 : Client connected 127.0.0.1:58136 (FD:1)
...

and a day before:

[29536] Mon Jul  8 02:00:02 2024 : Received SIGTERM
[30854] Mon Jul  8 02:00:05 2024 : TCP socket 0.0.0.0:3002 opened
[30854] Mon Jul  8 02:05:03 2024 : Client connected 127.0.0.1:35112 (FD:1)
[30854] Mon Jul  8 02:10:03 2024 : Client connected 127.0.0.1:40382 (FD:1)
...

So they start with the "Received SIGTERM" – and not with a client connection.

I hope this helps …

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant