Fails to properly clean up closed connections #104
Hi @tfheen! Thanks a lot for the report. I am not able to reproduce this on the example dashboard in a new project. The code you're referring to predates my involvement with Dashing - but it's entirely possible that it's the culprit! We did roll out a Sinatra upgrade a few months back... From what I can tell in the docs though, what we have in Smashing/Dashing was never correct. I have a few questions before I can help further:
If I can reproduce the issue - I'd be super happy to solve it for you! Cheers.
I haven't tried reproducing with the example dashboard; I'll try that, and if I'm unable to reproduce it, I'll see if I can make a test case for you. (The dashboard as such is not particularly sensitive, but it talks to various internal systems which aren't open to the world, so you wouldn't be able to usefully run it.)
Sorry that this has taken a while to get around to testing. I went about this slightly differently: I added a debug endpoint.
When I noticed the dashboard had out-of-date widgets, I hit the debug endpoint and it showed me:
The first of those timestamps is correct; the last one is about a week earlier (this job should run every 30 minutes). Is this sufficient, or do you still need a reproduction case with the default dashboard?
No worries on the delay - such is life with open source! Thanks for the debug info; I'm still confused how we end up with half-open connections sitting in the connections pool.

After doing a bit of digging here - the change you're asking me to make is equivalent to the existing code save one detail: they're both array operations. If this dashboard is being shown only on a TV and people do not often open it in their browser, I can see this change creating a hilarious debugging session where someone notices the TV is not updating - so they open it in a browser, close the window, and it starts updating again. That behavior is not very intuitive!

As a workaround for now, try something like this:

```ruby
require 'rufus-scheduler'

scheduler = Rufus::Scheduler.new

scheduler.every '1s', :overlap => false do
  settings.connections.reject!(&:closed?)
end
```

If that works, I'll add it into Smashing - it shouldn't be harmful (although inefficient / "should be" unnecessary). I'm still really curious why the pool seems to drain for others but yours remains clogged...
Where are you seeing that? To me, it looks like there might be (and I'm guessing here) threads that are reused improperly, so some of those that would otherwise do the updates end up stuck on stuck connections. The dashboard is generally shown on a few TVs, but some people are likely to load it in a browser now and then.
We're running unicorn (though we're not particularly attached to it; it's all through Docker in a Kubernetes cluster, so I'm happy to try something else if you think it'll help).
It happens within a few hours to a day in general. There are likely to be some clients that disappear off the network, but hopefully it should be detected through SSE or TCP keepalives or similar?
https://github.com/Smashing/smashing/blob/master/lib/dashing/app.rb#L141-L14

Okay - I think I've figured it out here. Rufus 2.0 used to rescue all exceptions, but in Rufus 3.0 that was changed to rescue only StandardError. I'm betting there is a race condition where jobs can execute at exactly the wrong moment: Sinatra knows the connection is invalid, but it hasn't yet triggered the callback set up in `get '/events'`. Since Rufus no longer handles the resulting exception, it kills the job thread. The chance of this issue occurring has likely existed for a long time, but the change in Rufus combined with your unique configuration seems to have increased its impact and likelihood. I can imagine a great number of factors that would contribute to the duration of the window where this race condition could occur (purely speculation).
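To make the failure mode concrete: a defensive guard around the write would keep a dying stream from taking the scheduler thread with it. This is only a sketch of that idea, not Smashing's actual code; the method name and the `connections` argument are placeholders.

```ruby
# Sketch only: tolerate streams that close between the closed? check and the
# write, so the exception cannot escape into the scheduler thread.
def write_to_all(connections, event)
  connections.each do |out|
    begin
      out << event unless out.closed?
    rescue IOError, Errno::EPIPE
      # The client went away mid-write; leave it for the next cleanup pass.
    end
  end
end
```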
Rather than periodically cleaning the pool, then, it would be better if I clean the pool when we call `send_event`. Even better, it looks like Sinatra has some mutexing built into it.
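Roughly, the lazy-cleanup idea could look like this; it's a sketch with placeholder names (`connections` and `event` are assumed to come from the surrounding app), not the actual patch.

```ruby
# Sketch only: sweep dead connections at send time instead of on a timer,
# so the pool is cleaned exactly when it is about to be used.
def deliver(connections, event)
  connections.reject!(&:closed?)
  connections.each { |out| out << event }
end
```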
... and/or I could just add some exception handling there as well.
Yeah, we are definitely breaking #4 in this list here :)
Ah, also - Unicorn is a multi-process model vs. Thin, which is a multithreaded model. That also seems highly related to why concurrency bugs are more likely...
So - in conclusion, Dashing was always a bit risky with its connection pooling. It essentially relied on EventMachine in subtle ways to mostly ensure (through ad hoc side effects) that concurrent access to connections was synchronized across jobs, Sinatra callbacks and other endpoints. It also relied on a catch-all exception handler in Rufus to rescue it from any problems. Increasingly, Smashing is being run in environments where these mechanisms are breaking down. Unicorn, Puma, JRuby - I've seen all of these exhibit this kind of behavior. CNI and other whacky k8sisms will probably also expose some flaws in this slapdash approach. The "ultimate" solutions for these issues would be to use something like celluloid-io, angelo, or to wrap all access to the connection pool in a mutex.
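For illustration only: the mutex idea could be as simple as forcing every reader and writer through one synchronized wrapper. The class and method names below are invented for the sketch; none of this is in Smashing today.

```ruby
# Sketch: jobs, Sinatra callbacks and other endpoints all go through this
# object, so the underlying array is never touched without holding the lock.
class ConnectionPool
  def initialize
    @mutex = Mutex.new   # Mutex is part of Ruby core
    @connections = []
  end

  def add(out)
    @mutex.synchronize { @connections << out }
  end

  def broadcast(event)
    @mutex.synchronize do
      @connections.reject!(&:closed?)
      @connections.each { |out| out << event }
    end
  end
end
```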
First of all, thanks a lot for taking the time to dig through all this and figure out what seems to be going wrong. I know it takes a lot of effort, as I've been on your side of the table enough times. :-)
Data point: this problem seems to have increased in frequency when I switched to Smashing from Dashing and at the same time enabled the gzip middleware. I could turn it off and see if that makes it less likely to happen, if that's useful?
Thanks for the offer, but I'm just going to sledgehammer this whole class of problem. A PR with the fix should come in later today.
I'm getting errors from dashing where it complains about trying to send data to closed streams.

I think the problem might be the use of `out.callback { settings.connections.delete(out) }` in `get '/events'`. The example from https://www.rubydoc.info/gems/sinatra#streaming-responses shows using `connections.reject!(&:closed?)` instead.
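For reference, the two patterns being compared might look roughly like this in a classic-style Sinatra app. This is a sketch based on the streaming example linked above, not Smashing's full route (which also sends history, sets headers, etc.); the second route name is invented purely to show the alternative side by side.

```ruby
require 'sinatra'

set :connections, []

# Pattern used by Dashing/Smashing: remove this particular stream only when
# its close callback fires.
get '/events', provides: 'text/event-stream' do
  stream(:keep_open) do |out|
    settings.connections << out
    out.callback { settings.connections.delete(out) }
  end
end

# Pattern from the Sinatra docs: purge every already-closed stream whenever a
# new client subscribes.
get '/events_alt', provides: 'text/event-stream' do
  stream(:keep_open) do |out|
    settings.connections << out
    settings.connections.reject!(&:closed?)
  end
end
```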