Making Monitoring Boring

Concepts

  • Monitoring: Using real-time production data to provide insight into the system and detect unusual conditions
  • Boring: That which does not provide excitement
  • Excitement: Getting paged at 3am, all-day outages, guessing what is wrong with your system
  • Kitten: A young cat

So you built a thing

kittn.io

  • Disruptive
  • Agile
  • Distributed
  • Kittens (as a Service)
  • See here

Your first outage

  • Phone rings: No kittens!
  • Debugging all night
  • You need monitoring

Your first monitoring setup

  • Install ... something
  • Collect data: Is my server running? Is it overloaded? Is there enough free RAM?
  • Page whenever something bad happens
  • Detects 97 out of every 10 outages
  • ...including some of the real ones.
  • Most of the excitement now caused by being paged.

Improvements

  • Blackbox monitoring
  • Not everything deserves a page
  • System grows in complexity
  • Some excitement replaced by confusion
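
A blackbox check looks at the service the way a user would, from the outside. The sketch below is one possible shape for such a probe, not a description of any particular tool. The `Fetch` type, the two fake fetchers, the URL path and the expected text are all assumptions made for illustration; a real probe would plug in an actual HTTP client.

```python
from typing import Callable, Tuple

# Hypothetical fetcher type: takes a URL, returns (HTTP status, response body).
Fetch = Callable[[str], Tuple[int, str]]


def probe(fetch: Fetch, url: str, expected_text: str) -> bool:
    """Return True if the service looks healthy from the outside."""
    try:
        status, body = fetch(url)
    except Exception:
        # Any failure to get a response at all counts as a failed probe.
        return False
    # A 200 alone is not enough: the page must also contain what users expect.
    return status == 200 and expected_text in body


def fake_fetch_ok(url: str) -> Tuple[int, str]:
    """Stand-in for a healthy service (illustration only)."""
    return 200, "<html>Here is your kitten</html>"


def fake_fetch_down(url: str) -> Tuple[int, str]:
    """Stand-in for an unreachable service (illustration only)."""
    raise ConnectionError("connection refused")


print(probe(fake_fetch_ok, "https://kittn.io/", "kitten"))    # True
print(probe(fake_fetch_down, "https://kittn.io/", "kitten"))  # False
```

Passing the fetcher in as a parameter keeps the probe logic testable without touching the network.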

Doing it properly

Service Level Objective

“I want it to work this well”

At least 99% of requests succeed,

90% of requests handled within 200 ms,

evaluated monthly
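
As a rough illustration, the example SLO above could be checked against a month of request records like this. The `Request` record and `slo_report` function are hypothetical names, and the sketch assumes the latency target is measured over all requests, not only the successful ones.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Request:
    """One served request (hypothetical record format)."""
    ok: bool
    latency_ms: float


def slo_report(
    requests: Iterable[Request],
    success_target: float = 0.99,    # "at least 99% of requests succeed"
    latency_target: float = 0.90,    # "90% of requests ..."
    latency_limit_ms: float = 200.0  # "... handled within 200 ms"
) -> Dict[str, object]:
    """Compare one evaluation window (e.g. a month) against the SLO."""
    reqs = list(requests)
    if not reqs:
        raise ValueError("no requests in the evaluation window")
    success_ratio = sum(r.ok for r in reqs) / len(reqs)
    fast_ratio = sum(r.latency_ms <= latency_limit_ms for r in reqs) / len(reqs)
    return {
        "success_ratio": success_ratio,
        "fast_ratio": fast_ratio,
        "in_slo": success_ratio >= success_target and fast_ratio >= latency_target,
    }


# Made-up sample month: 1000 requests, 5 failures, 80 slow responses.
sample = [
    Request(ok=i >= 5, latency_ms=450.0 if i < 80 else 120.0)
    for i in range(1000)
]
report = slo_report(sample)
print(report["success_ratio"], report["fast_ratio"], report["in_slo"])
```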

What makes a good SLO?

  • Relevant: Something you care about
  • Measurable: You need the numbers
  • Realistic: No one needs five nines of kittens

Alerting becomes simple

  • Page if you will be out of SLO unless you do something today
  • Ticket if you will be out of SLO unless you do something this week
  • Exceptions
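
One way to turn these two rules into something mechanical is to track an error budget (how many failures the SLO still allows this month) and ask how long it lasts at the current failure rate. This is a sketch under assumptions: the function name, the interpretation of "today" as one day and "this week" as seven days, and the sample numbers are all illustrative.

```python
def classify(budget_remaining: float, burn_per_day: float) -> str:
    """Decide how urgently a human needs to act.

    budget_remaining: failed requests the SLO still tolerates this month.
    burn_per_day: failed requests per day at the current rate.
    Returns "page", "ticket" or "none".
    """
    if budget_remaining <= 0:
        return "page"  # already out of SLO
    if burn_per_day <= 0:
        return "none"  # nothing is consuming the budget
    days_left = budget_remaining / burn_per_day
    if days_left <= 1:
        return "page"    # out of SLO unless someone acts today
    if days_left <= 7:
        return "ticket"  # out of SLO unless someone acts this week
    return "none"


# Made-up numbers: 6000 failures of budget left for the month.
print(classify(6000, 8000))  # page
print(classify(6000, 1500))  # ticket
print(classify(6000, 100))   # none
```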

Other best practices

Configuration is code

  • No clicking
  • Version control
  • Code reviews

Document everything

  • Playbook for each alert
  • Explain intent, not implementation
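
If alert definitions live in version control as code, "playbook for each alert" can be enforced by an automated check during review. The sketch below assumes a hypothetical `AlertRule` format; the rule names and the playbook path are invented for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class AlertRule:
    """One alert definition kept in version control (hypothetical format)."""
    name: str
    intent: str    # why the alert exists, not how it is computed
    playbook: str  # where the responder finds instructions


def lint(rules: Iterable[AlertRule]) -> List[str]:
    """Return one message per undocumented rule; an empty list means all is well."""
    problems = []
    for rule in rules:
        if not rule.intent.strip():
            problems.append(f"{rule.name}: missing intent")
        if not rule.playbook.strip():
            problems.append(f"{rule.name}: missing playbook")
    return problems


# Invented example rules.
rules = [
    AlertRule(
        name="KittensUnavailable",
        intent="Users cannot load kittens; the success SLO is at risk",
        playbook="playbooks/kittens-unavailable.txt",
    ),
    AlertRule(name="SlowKittens", intent="", playbook=""),
]
for problem in lint(rules):
    print(problem)
# SlowKittens: missing intent
# SlowKittens: missing playbook
```

Run as part of code review, a check like this rejects a new alert before it can page anyone without instructions.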

Informational alerts

  • Detect conditions the user might find interesting (e.g. an application server is down)
  • May be cause-based
  • Never send any notifications
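
A minimal sketch of how routing might respect this rule: informational alerts stay visible to whoever is looking at the system, but they never reach a pager or a ticket queue. The severity names, destinations and alert names below are assumptions for illustration.

```python
from typing import Dict, Iterable, List, Optional, Tuple

# Hypothetical severity -> destination table. None means "display only".
ROUTES: Dict[str, Optional[str]] = {
    "page": "pager",
    "ticket": "ticket_queue",
    "info": None,
}


def notifications(alerts: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Map firing (name, severity) alerts to (destination, name) notifications.

    Informational alerts are dropped here because they never notify anyone.
    """
    out = []
    for name, severity in alerts:
        if severity not in ROUTES:
            raise ValueError(f"unknown severity: {severity}")
        destination = ROUTES[severity]
        if destination is not None:
            out.append((destination, name))
    return out


# Invented set of firing alerts.
firing = [
    ("KittensUnavailable", "page"),
    ("AppServerDown", "info"),
    ("DiskFillingUp", "ticket"),
]
print(notifications(firing))
# [('pager', 'KittensUnavailable'), ('ticket_queue', 'DiskFillingUp')]
```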

Let's have an outage

Questions?

Thank you