Fails to properly clean up closed connections #104
Hi @tfheen! Thanks a lot for the report. I am not able to reproduce this on the example dashboard in a new project. The code you're referring to predates my involvement with Dashing - but it's entirely possible that it's the culprit! We did roll out a Sinatra upgrade a few months back... From what I can tell in the docs though, what we have in Smashing/Dashing was never correct. I have a few questions before I can help further:
If I can reproduce the issue - I'd be super happy to solve it for you! Cheers.
I haven't tried reproducing with the example dashboard; I'll try that, and if I'm unable to reproduce it, I'll see if I can make a test case for you. (The dashboard as such is not particularly sensitive, but it talks to various internal systems which aren't open to the world, so you wouldn't be able to usefully run it.)
Sorry that this has taken a while to get around to testing. I went about this slightly differently: I added a debug endpoint.
When I noticed the dashboard had out-of-date widgets, I hit the debug endpoint and it showed me:
The first of those timestamps is correct; the last one is about a week earlier (this job should run every 30 minutes). Is this sufficient, or do you still need a reproduction case with the default dashboard?
No worries on the delay - such is life with open source! Thanks for the debug info; I'm still confused how we end up with half-open connections sitting in the connections pool.

After doing a bit of digging here - the change you're asking me to make is equivalent to the existing code save one detail: they're both array operations. If this dashboard is being shown only on a TV and people do not often open it in their browser, I can see this change creating a hilarious debugging session where someone notices the TV is not updating - so they open it in a browser, close the window, and it starts updating again. That behavior is not very intuitive!

As a workaround for now, try something like this:

```ruby
require 'rufus-scheduler'

scheduler = Rufus::Scheduler.new

scheduler.every '1s', :overlap => false do
  settings.connections.reject!(&:closed?)
end
```

If that works, I'll add it into Smashing - it shouldn't be harmful (although inefficient / "should be" unnecessary). I'm still really curious why the pool seems to drain for others but yours remains clogged...
Where are you seeing that? To me, it looks like there might be (and I'm guessing here) threads that are reused improperly, so some of those that would otherwise do the updates end up stuck on stuck connections. The dashboard is generally shown on a few TVs, but some people are likely to load it in a browser now and then.
We're running unicorn (though we're not particularly attached to it; it's all through Docker in a Kubernetes cluster, so I'm happy to try something else if you think it'll help).
It happens within a few hours to a day in general. There are likely to be some clients that disappear off the network, but hopefully it should be detected through SSE or TCP keepalives or similar?
https://github.com/Smashing/smashing/blob/master/lib/dashing/app.rb#L141-L14

Okay - I think I've figured it out here. Rufus 2.0 used to rescue all exceptions, but in Rufus 3.0 that was changed to rescue only StandardError. I'm betting there is a race condition where jobs can execute at exactly the wrong moment: Sinatra knows the connection is invalid, but it hasn't yet triggered the callback set up in `get '/events'`. Since Rufus no longer handles the resulting exception, it kills the job thread. The chance of this issue occurring has likely existed for a long time, but the change in Rufus combined with your unique configuration seems to have increased its impact and likelihood. I can imagine a great number of factors that would contribute to the duration of the window where this race condition could occur (purely speculation).
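To make the failure mode concrete: a defensive guard around the write would keep a dying stream from taking the scheduler thread with it. This is only a sketch of that idea, not Smashing's actual code; the method name and the `connections` argument are placeholders.

```ruby
# Sketch only: tolerate streams that close between the closed? check and the
# write, so the exception cannot escape into the scheduler thread.
def write_to_all(connections, event)
  connections.each do |out|
    begin
      out << event unless out.closed?
    rescue IOError, Errno::EPIPE
      # The client went away mid-write; leave it for the next cleanup pass.
    end
  end
end
```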
Rather than periodically cleaning the pool, then, it would be better if I clean the pool when we call `send_event`. Even better, it looks like Sinatra has some mutexing built into it.
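Roughly, the lazy-cleanup idea could look like this; it's a sketch with placeholder names (`connections` and `event` are assumed to come from the surrounding app), not the actual patch.

```ruby
# Sketch only: sweep dead connections at send time instead of on a timer,
# so the pool is cleaned exactly when it is about to be used.
def deliver(connections, event)
  connections.reject!(&:closed?)
  connections.each { |out| out << event }
end
```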
... and/or I could just add some exception handling there as well.
Yeah, we are definitely breaking #4 in this list here :)
Ah, also - Unicorn is a multi-process model vs. Thin, which is a multithreaded model. That also seems highly related to why concurrency bugs are more likely...
So - in conclusion, Dashing was always a bit risky with its connection pooling. It essentially relied on EventMachine in subtle ways to mostly ensure (through ad hoc side effects) that concurrent access to connections was synchronized across jobs, Sinatra callbacks and other endpoints. It also relied on a catch-all exception handler in Rufus to rescue it from any problems. Increasingly, Smashing is being run in environments where these mechanisms are breaking down. Unicorn, Puma, JRuby - I've seen all of these exhibit this kind of behavior. CNI and other whacky k8sisms will probably also expose some flaws in this slapdash approach. The "ultimate" solutions for these issues would be to use something like celluloid-io, angelo, or to wrap all access to the connection pool in a mutex.
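For illustration only: the mutex idea could be as simple as forcing every reader and writer through one synchronized wrapper. The class and method names below are invented for the sketch; none of this is in Smashing today.

```ruby
# Sketch: jobs, Sinatra callbacks and other endpoints all go through this
# object, so the underlying array is never touched without holding the lock.
class ConnectionPool
  def initialize
    @mutex = Mutex.new   # Mutex is part of Ruby core
    @connections = []
  end

  def add(out)
    @mutex.synchronize { @connections << out }
  end

  def broadcast(event)
    @mutex.synchronize do
      @connections.reject!(&:closed?)
      @connections.each { |out| out << event }
    end
  end
end
```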
First of all, thanks a lot for taking the time to dig through all this and figure out what seems to be going wrong. I know it takes a lot of effort, as I've been on your side of the table enough times. :-)
Data point: this problem seems to have increased in frequency when I switched to Smashing from Dashing and at the same time enabled the gzip middleware. I could turn it off and see if that makes it less likely to happen, if that's useful?
Thanks for the offer, but I'm just going to sledgehammer this whole class of problem. A PR with the fix should come in later today.
I'm getting errors from dashing where it complains about trying to send data to closed streams.

I think the problem might be the use of `out.callback { settings.connections.delete(out) }` in `get '/events'`. The example from https://www.rubydoc.info/gems/sinatra#streaming-responses shows using `connections.reject!(&:closed?)` instead.
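For reference, the two patterns being compared might look roughly like this in a classic-style Sinatra app. This is a sketch based on the streaming example linked above, not Smashing's full route (which also sends history, sets headers, etc.); the second route name is invented purely to show the alternative side by side.

```ruby
require 'sinatra'

set :connections, []

# Pattern used by Dashing/Smashing: remove this particular stream only when
# its close callback fires.
get '/events', provides: 'text/event-stream' do
  stream(:keep_open) do |out|
    settings.connections << out
    out.callback { settings.connections.delete(out) }
  end
end

# Pattern from the Sinatra docs: purge every already-closed stream whenever a
# new client subscribes.
get '/events_alt', provides: 'text/event-stream' do
  stream(:keep_open) do |out|
    settings.connections << out
    settings.connections.reject!(&:closed?)
  end
end
```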