Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Crashes on DNS resolvers receiving thousands of requests #26

Open
jaredledvina opened this issue Mar 21, 2017 · 8 comments
Open

Crashes on DNS resolvers receiving thousands of requests #26

jaredledvina opened this issue Mar 21, 2017 · 8 comments

Comments

@jaredledvina
Copy link

Stack Trace:

panic: runtime error: index out of range

goroutine 20 [running]:
panic(0x5ba220, 0xc4200120a0)
        /usr/lib/golang/src/runtime/panic.go:500 +0x1a1
main.initLogEntry(0xc4236eb2f0, 0x4, 0x10, 0xc4236eb310, 0x4, 0x10, 0x0, 0x0, 0x0, 0x0, ...)
        /builddir/build/BUILD/gopassivedns-aa047e156339bc54f46cf6ccc7fa99f24d94c9d2/gocode/src/github.com/phillipmartin/gopassivedns/main.go:157 +0x541
main.handleDns(0xc421450000, 0xc423560180, 0xc42000c1e0, 0xc4236eb2f0, 0x4, 0x10, 0xc4236eb310, 0x4, 0x10)
        /builddir/build/BUILD/gopassivedns-aa047e156339bc54f46cf6ccc7fa99f24d94c9d2/gocode/src/github.com/phillipmartin/gopassivedns/main.go:224 +0x59e
main.handlePacket(0xc42000c300, 0xc42000c1e0, 0x29e8d60800, 0xfffffff207b8a800, 0x1, 0x0)
        /builddir/build/BUILD/gopassivedns-aa047e156339bc54f46cf6ccc7fa99f24d94c9d2/gocode/src/github.com/phillipmartin/gopassivedns/main.go:291 +0x423
created by main.doCapture
        /builddir/build/BUILD/gopassivedns-aa047e156339bc54f46cf6ccc7fa99f24d94c9d2/gocode/src/github.com/phillipmartin/gopassivedns/main.go:383 +0x278

We have a few DNS resolvers that handle between 2800-3800 requests per minute (according to the logs gopassivedns generates). It's much more frequent on resolves seeing >7500 requests per minute. It seems that after a while and not consistently the process will die and dump the above stack trace. This is not an issue at all on resolves seeing between 200-700 requests per minute. Happy to provide any other information that might be useful!

@jaredledvina jaredledvina changed the title Crashes on DNS resolvers receiving Crashes on DNS resolvers receiving thousands of requests Mar 21, 2017
@jaredledvina
Copy link
Author

I believe this is happening due to this section

gopassivedns/main.go

Lines 376 to 402 in a513e78

/* init channels for the packet handlers and kick off handler threads */
var channels []chan *packetData
for i := 0; i < config.numprocs; i++ {
channels = append(channels, make(chan *packetData, 100))
}
for i := 0; i < config.numprocs; i++ {
go handlePacket(channels[i], logChan, gcIntervalDur, gcAgeDur, i, stats)
}
// Use the handle as a packet source to process all packets
packetSource := gopacket.NewPacketSource(handle, handle.LinkType())
//only decode packet in response to function calls, this moves the
//packet processing to the processing threads
packetSource.DecodeOptions.Lazy = true
//We don't mutate bytes of the packets, so no need to make a copy
//this does mean we need to pass the packet via the channel, not a pointer to the packet
//as the underlying buffer will get re-allocated
packetSource.DecodeOptions.NoCopy = true
/*
parse up to the IP layer so we can consistently balance the packets across our
processing threads
TODO: in the future maybe pass this on the channel to so we don't reparse
but the profiling I've done doesn't point to this as a problem
*/

If it helps, in our case, our nproc is 8.

@Phillipmartin
Copy link
Owner

interesting, if you can compile a custom version and test, see if changing the channel size at

channels = append(channels, make(chan *packetData, 100))
to something much larger (I'd start with 500) makes that go away. I could see this being caused by that pre-allocated channel filling up during spikes.

@Phillipmartin
Copy link
Owner

looking at this a little more in depth, I think this is a regression I introduced when I refactored handlePacket to take a channel instead of an individual packet. We turn on ZeroCopy when we capture packets because there is a performance win, however, that means that libpcap is using a circular buffer internally to hold the packets for processing. If the packet capture rate causes the packet being processed to be overwritten before we finish processing it, we can get crazy-looking errors like from your stacktrace because the underlying data changes. An easy way to test this is to comment out https://github.com/Phillipmartin/gopassivedns/blob/master/main.go#L394, re-compile and run the binary on one of these heavy-load systems. If it does not show the same failure, we can look at turning off ZeroCopy or copying the packet data (depending on the perf impact of each strategy)

@jimmystewpot
Copy link
Contributor

Increasing the channel buffer as previously mentioned

channels = append(channels, make(chan *packetData, 100))

has dramatically improved the stability of gopassivedns on our resolvers doing ~5k qps. However as mentioned above it is not a bullet proof fix.

I will investigate further the recommended NoCopy change to see if that has any direct improvement in uptime vs cpu usage penalties.

@jimmystewpot
Copy link
Contributor

@jaredledvina a pull request has recently been merged, in our use case we have seen no crash with significant DNS query load. I would be very interested to know if you have time to see if you see improvements in stability.

@jaredledvina
Copy link
Author

Hey @jimmystewpot ,

Oh nice, it's been a while since I've worked with our install of gopassivedns. I can follow up and see if we can deploy packages for the latest version. Thanks for the ping!

@jimmystewpot
Copy link
Contributor

@jaredledvina did you end up seeing any improvements?

@jaredledvina
Copy link
Author

We're currently looking at reworking how we've built our gopassivedns RPM's so, currently I haven't had a chance to get the latest commits into production. Once we iron that out though, I can follow up as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants