Accompanying repository for the "Recovering From Alert Fatigue" talk given at Monitorama 2016 [Slides]
##Abstract
Systems that generate numerous critical alerts cause alert fatigue, which can lead to service outages and developer burnout. My team at Twitter found itself in this situation. The services had scaled by an order of magnitude in two years and were generating hundreds of alerts per quarter. Over the course of a quarter I led an initiative to decrease the number of alerts, improve the experience of being on call, and increase the reliability of the services. These efforts were incredibly successful, reducing the number of critical alerts by 50%. In this talk I’ll discuss the process and the alerting best practices we’ve put in place to successfully combat alert fatigue and avoid over-alerting in the future.
##References
- Novel Approach to Cardiac Alarm Management on Telemetry Units
- How one Hospital Tweaks its EHR to fight alert fatigue
- Applying Cardiac Alarm Management to your Oncall
- Checklist Manifesto
- Engineering for the Long Game
- WTF is Operations? #serverless
- DevOps for Developers: Building an Effective Ops Org
##Observability at Twitter
- Technical Overview Part 1
- Technical Overview Part 2
- Of the Order of Billions: Building Observability at Twitter
##Related Tweets
##Bio
Caitie McCaffrey is a Backend Brat and Distributed Systems Diva at Twitter, where she is the Tech Lead of the Observability Team. Prior to that she spent the majority of her career building large-scale services and systems that power the entertainment industry at 343 Industries, Microsoft Game Studios, and HBO. Caitie has a degree in Computer Science from Cornell University and has worked on several video games, including Gears of War 2, Gears of War 3, Halo 4, and Halo 5. She maintains a blog at CaitieM.com and frequently discusses technology on Twitter @Caitie.