Testing for Flappers

"runit has more flappers than the Great Gatsby."

-- Erik Mackdanz

Flappers (services that repeatedly crash and restart) are a big problem, but they are hard to test for. This page collects ideas for testing for them.

Runit scripts should all append to a log (TBD) when they come up, restart, etc. IronCuke can then just grep for restarts that occur within a short timespan of each other. Idea generated and approved by Flip.
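A minimal sketch of the "grep for restarts close together" step. The log is still TBD, so the line format used here ("<unix timestamp> <service> start"), the function name find_flappers, and the default thresholds are all assumptions for illustration only.

```python
from collections import defaultdict


def find_flappers(log_lines, window_seconds=60, max_starts=3):
    """Return the set of services that started more than max_starts times
    within any window of window_seconds.

    Assumes each line looks like "<unix timestamp> <service> start".
    This format is hypothetical, since the real log is still TBD.
    """
    starts = defaultdict(list)
    for line in log_lines:
        parts = line.split()
        if len(parts) != 3 or parts[2] != "start":
            continue  # skip lines that do not match the assumed format
        try:
            starts[parts[1]].append(float(parts[0]))
        except ValueError:
            continue  # skip lines whose timestamp is not a number

    flappers = set()
    for service, times in starts.items():
        times.sort()
        # If the start that is max_starts positions later still falls inside
        # the window, the service started max_starts + 1 times within it.
        for i in range(len(times) - max_starts):
            if times[i + max_starts] - times[i] <= window_seconds:
                flappers.add(service)
                break
    return flappers
```

With the defaults, a service that logs four starts inside one minute is reported, while one that restarts once an hour is not. Both thresholds would need tuning against real services.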
Allow components to announce a regex that will be matched against the logs to test for stability. For example, many components echo something along the lines of "listening on port 8080" to their logs when they finally come up.
- Concerns around applicability (how many components actually do this?)
- Logs may fill up or may not be flushed to disk, etc.
- Must be careful not to count a past successful startup as the current one
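A minimal sketch of the regex check that addresses the last concern by only looking at lines logged after the most recent start. It assumes the run script writes a recognizable marker line each time it starts the service. That marker, the function name came_up_since, and its parameters are hypothetical.

```python
import re


def came_up_since(log_lines, ready_regex, start_marker):
    """Return True if ready_regex matches a line logged after the most
    recent line containing start_marker.

    start_marker is a hypothetical string that the run script is assumed
    to log on every start attempt.
    """
    pattern = re.compile(ready_regex)

    # Find the last start marker so earlier successful startups are ignored.
    last_start = -1
    for i, line in enumerate(log_lines):
        if start_marker in line:
            last_start = i
    if last_start == -1:
        return False  # no start attempt recorded at all

    return any(pattern.search(line) for line in log_lines[last_start + 1:])
```

For a log that contains an old "listening on port 8080" line followed by a fresh start that then fails, this returns False, which is the case the concern above is about. It does nothing for the other two concerns (components that never log a ready line, and logs that are full or unflushed).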
Nagios's thoughts
Consider using "once" instead of "up" in runit: http://smarden.org/runit/sv.8.html
- This may (mostly) remove the need for flap detection, since:
  - Almost all flapping is due to runit's aggressive restart policy (in my brief experience)
  - The service would simply be down, which can already be detected.
- Downside: no more process restarting, which arguably has a legit use case.
  - Probably better to fix a crasher instead of whitewashing it, assuming the crash is something in our power to prevent.

Check the PID, sleep for a second, then check whether the PID is still the same

Bad because a) it's smelly, b) there is no natural implementation, and c) the PID can stay the same while the service is still flapping. The only evidence of flapping would be in the logs.
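For reference, a rough sketch of what the PID check could look like. It assumes the PID is read from the supervise/pid file that runsv keeps inside the service directory, and that the caller has permission to read it. The function names and the one-second default are illustrative only.

```python
import time
from pathlib import Path


def read_pid(service_dir):
    """Read the PID from <service_dir>/supervise/pid.

    Returns None if the file is missing, unreadable, empty, or not a
    number, which this sketch treats as "service is down".
    """
    try:
        text = Path(service_dir, "supervise", "pid").read_text().strip()
        return int(text) if text else None
    except (OSError, ValueError):
        return None


def pid_unchanged(service_dir, delay=1.0):
    """Sample the PID twice, delay seconds apart, and compare."""
    before = read_pid(service_dir)
    time.sleep(delay)
    after = read_pid(service_dir)
    return before is not None and before == after
```

A True result only says that two samples matched. It does not rule out flapping, for the reasons listed above.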
