Monitoring Production Systems – Part I

Over the past few decades, the internet has grown beyond imagination, and it now has a tremendous impact on the lives of even the most ordinary citizens of the planet. We now treat internet-based services as a commodity: something that should just work. No screeching sounds from a dial-up modem, no partially loaded pages. A website that doesn’t work as well on a mobile device as it does on a laptop is “so last generation”. The bar is even higher for e-commerce services: we don’t expect a “service unavailable” screen once we have authorized a payment, or while we are waiting on the road for a taxi. On the other side of the table, companies don’t expect to lose business just because a server crashed!

With this innovation comes the tremendous complexity of keeping such ever-growing systems stable and running, even when the world is asleep. Folks, welcome to the unglamorous side of the internet!

At TravelTriangle, the technical team remains committed to providing uninterrupted service to all our users: friends who are planning their next adventure, couples who have just started to explore the world together, our partner agents who use our portal throughout the day for their business, and our internal business units that work round the clock to delight our customers!

Diving a bit into our history: a couple of years ago, our production infrastructure consisted of a single EC2 instance that hosted everything, including the Ruby on Rails application (served by Passenger), the MySQL database, Sphinx, etc. Fast forward to today, and we have reached a scale where we autoscale our stateless tier of web servers horizontally while maintaining dedicated clusters for the stateful data stores, namely MySQL, Elasticsearch, Redis, etc. More on our autoscaling system later.

To start with, we had the following software components that needed to be monitored:

  • Phusion Passenger server
  • Sidekiq
  • MySQL
  • Elasticsearch
  • Redis
  • Postfix

We had a few tools working for us from the beginning:

  1. CloudWatch. With our infrastructure hosted on AWS, CloudWatch was a natural choice.
  2. New Relic, which provides a mix of services: application monitoring (requests per minute and time consumed, grouped by controller/action), infrastructure monitoring (CPU, memory, disk, etc.) and user experience monitoring (page load time, Apdex score, etc.).
  3. Exception Notifier, which notifies us of server errors via email and also plays nicely with Sidekiq.

As our product, team and infrastructure grew, we started facing several issues with our monitoring setup. Broadly, the problems fell into two categories:

  1. On the technical front, we had many tools, each with its own limitations. For example:
    1. CloudWatch metrics (traffic, CPU, etc.) had a resolution of 1 minute at best, which made traffic spikes and the resulting glitches or downtime much harder to detect.
    2. New Relic APM profiles by controller and action rather than by URL. One of our controller actions served about 50 internal analytics reports (selected by a query parameter), and it wasn’t possible to measure the performance or frequency of each report individually.
    3. Exception Notifier emails would run into the thousands in case of a major bug or outage.
    4. There was no single “dashboard” to show system status. We would need to log in to AWS CloudWatch or SSH into the server(s) and try to interpret htop and iotop; use tail, grep, sed and awk to crunch server/application logs; or open the Sidekiq web UI to check queue latency, failures, etc.
  2. On the product side, we started experimenting more and more, and balancing execution speed against system stability/sanctity became a challenge for both the tech and product teams.
    1. To measure the success or failure of an experiment, we would rely either on our database (which stores the final state of mostly transactional data) or on tools like Google Analytics (which capture user intent).
    2. We had no robust system to detect an anomaly that an experiment or bug might have introduced. For example, due to a bug in an HTML form, a few optional user inputs weren’t getting stored in the database. In such scenarios, working out what exactly had happened would take us hours or even days!
    3. Even apart from experiments, we had no real-time “dashboard” to give us an hourly/daily summary of business statistics such as requests created, quotations shared by agents, invoices created, payments processed, etc.
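
To put rough numbers on the resolution problem in the first list above, here is a small Python sketch. The traffic figures and the `rollup` helper are hypothetical and are not measurements from our systems; the point is only to show how a one-minute average can flatten a short burst.

```python
# Hypothetical traffic: a steady 20 requests/second for one minute,
# with a 5-second burst of 400 requests/second in the middle.
per_second = [20] * 60
for second in range(30, 35):
    per_second[second] = 400


def rollup(samples, window):
    """Average consecutive, non-overlapping windows of `window` samples."""
    averages = []
    for start in range(0, len(samples), window):
        chunk = samples[start:start + window]
        averages.append(sum(chunk) / len(chunk))
    return averages


one_minute = rollup(per_second, 60)  # what a 1-minute metric would report
ten_second = rollup(per_second, 10)  # a finer-grained view of the same minute

print(f"peak per-second rate : {max(per_second)}")
print(f"1-minute average     : {one_minute[0]:.1f}")
print(f"max 10-second average: {max(ten_second):.1f}")
```

With these made-up numbers the one-minute average comes out at roughly 52 requests per second, even though the burst itself ran at 400, which is why such spikes were easy to miss at that granularity.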

During this journey of scaling our systems, we learned a few interesting lessons:

  1. Business loss due to an EBS volume crash is no different from loss due to a JavaScript bug in our request form that causes the server to return a 400 error: both end up hurting the business, directly or indirectly.
  2. Separating “devops” responsibilities from “dev” isn’t recommended, as engineers cannot be insulated from how their code behaves under infrastructure issues or uncontrolled events such as request spikes, slow networks, dropped socket connections, etc.
  3. Systems will be unstable from time to time when you are shipping code at a fast pace, but a strong layer of metrics and alerts helps catch issues early and contain their impact on the business.
  4. Not all bugs affect system metrics, so a strong alerting system on business metrics helps catch such issues quickly.
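
The last point can be made concrete with a small sketch. Assuming we already record an hourly count of some business event (say, trip requests created), a check like the one below flags an hour that falls well below the recent average. The function name, the 50% tolerance and the sample counts are all hypothetical; a real rule would need tuning for daily and weekly seasonality.

```python
def looks_anomalous(recent_counts, current_count, max_drop=0.5):
    """Return True if current_count fell more than max_drop below the recent mean.

    recent_counts: counts for previous comparable periods (hypothetical data).
    max_drop: tolerated fractional drop; 0.5 means "flag below 50% of baseline".
    """
    if not recent_counts:
        return False  # no baseline yet, so nothing to compare against
    baseline = sum(recent_counts) / len(recent_counts)
    return current_count < baseline * (1 - max_drop)


# Hypothetical hourly counts of trip requests created.
previous_hours = [40, 44, 38, 42]
print(looks_anomalous(previous_hours, 35))  # mild dip, within tolerance
print(looks_anomalous(previous_hours, 12))  # sharp drop, worth an alert
```

A check of this kind would fire even when CPU, memory and error rates all look healthy, which is exactly the class of bug that system metrics alone miss.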

With this realization, we decided to take a more unified approach to monitoring our system metrics as well as our business metrics, with a single infrastructure to measure and visualise them. Broadly, we wanted to monitor the following entities:

  1. Infrastructure, e.g.:
    1. Hardware parameters like CPU, memory, disk I/O, etc.
    2. Database health: queue depth, replica lag, etc.
    3. Elasticsearch health: JVM memory usage, etc.
    4. Number of healthy/unhealthy hosts behind internal/external load balancers.
    5. Traffic patterns, overall and module-wise: number of 2xx, 5xx, etc. responses at various levels such as Passenger, nginx and ELB.
    6. Number of emails bounced, etc.
  2. Application
    1. Response time, browser load time of pages.
    2. Latency of API, critical pages, etc.
    3. Async systems: time taken to send a “forgot password” email, number of jobs being processed, etc.
    4. Database slow queries and affected pages.
    5. Varnish cache hit/miss ratio.
  3. Business
    1. Trip requests created day by day, number of follow-ups, etc.
    2. Most popular destinations, packages, etc.
    3. Quotes & invoices being created, payment success/failure rate.
    4. Notification system: number of emails, SMS messages and mobile/desktop push notifications sent.

The answer turned out to be simple: break the problem down into parts and solve each part separately. Any metrics and monitoring system can be broken down into four essential components:

  1. Data Collection: Metric data is essentially a collection of immutable numeric values with time as the primary key. The sources can be many and varied: CloudWatch, Zabbix, Nagios, Sensu, Telegraf, application code, Apdex score, server queue length, or even SQL scripts.
  2. Data Storage: A separate system for storing time-series data. A number of such databases exist: OpenTSDB, Graphite, InfluxDB, Prometheus, etc.
  3. Visualization: Tool(s) that can build any visualization you want from your data store. Monitoring tools generally ship with built-in visualization capabilities; however, a number of dedicated tools exist with varying capabilities, such as Grafana and Chronograf, and one can even build APIs to use with Google visualization. Apart from charts and graphs, various log searching and visualization tools exist: Graylog, Kibana, Splunk, etc.
  4. Alarms & Alerting: A tool that generates alerts from the data in the storage system, either in real time or at pre-defined intervals. This is a two-step process: a) determining the state of the alarm (OK, warning, critical, etc.) based on data points and rules; and b) raising the alert via various channels like emails, push notifications, phone calls and text messages. Generally, monitoring tools have built-in alerting functionality as well, e.g. Alertmanager in Prometheus.
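
The first and last components can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than how any particular tool implements them: the `Point` class, the threshold rules, the metric name and the in-memory "channel" are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)  # frozen: a recorded data point is never modified
class Point:
    name: str
    timestamp: int  # Unix time in seconds
    value: float


def alarm_state(point: Point, warning: float, critical: float) -> str:
    """Step (a): map a data point to a state using simple threshold rules."""
    if point.value >= critical:
        return "critical"
    if point.value >= warning:
        return "warning"
    return "ok"


def raise_alert(point: Point, state: str,
                channels: List[Callable[[str], None]]) -> int:
    """Step (b): send a message to every channel unless the state is ok.

    Returns the number of channels notified.
    """
    if state == "ok":
        return 0
    message = f"[{state.upper()}] {point.name}={point.value} at {point.timestamp}"
    for send in channels:
        send(message)
    return len(channels)


# Hypothetical usage: an in-memory list stands in for email/SMS/push channels.
sent: List[str] = []
lag = Point(name="mysql.replica_lag_seconds", timestamp=1500000000, value=45.0)
state = alarm_state(lag, warning=30.0, critical=120.0)
raise_alert(lag, state, channels=[sent.append])
print(state, sent)
```

Keeping the state decision separate from the delivery step means the same rule can feed several channels, and a channel can be swapped without touching the rules.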

When we looked at it, our first thought was: that is too many tools! After a thorough investigation of the tools available, we decided to use a (modified) TICK stack as our primary monitoring and metrics stack:

  • T for Telegraf, the data collection agent.
  • I for InfluxDB, the time-series database.
  • C for Chronograf, which we replaced with Grafana for visualization.
  • K for Kapacitor, the alerting subsystem.
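
To give a feel for the data that flows through this stack: InfluxDB accepts points in a text format called line protocol, made up of a measurement name, optional tags, one or more fields and a timestamp. The helper below is a simplified sketch of that format, not Telegraf's or InfluxDB's own code. The measurement, tag and field names are hypothetical, only integer and float fields are handled, and escaping is limited to commas, equals signs and spaces.

```python
def _escape(text, chars):
    """Backslash-escape each character in `chars` that appears in `text`."""
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def _field_value(value):
    """Integers carry an `i` suffix in line protocol; floats are written as-is."""
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise TypeError("this sketch only handles int and float fields")
    return f"{value}i" if isinstance(value, int) else repr(value)


def to_line(measurement, tags, fields, timestamp_ns):
    """Format one point as: measurement,tag=val field=val timestamp."""
    if not fields:
        raise ValueError("a point needs at least one field")
    head = _escape(measurement, ", ")
    tag_part = "".join(
        f",{_escape(k, ',= ')}={_escape(v, ',= ')}"
        for k, v in sorted(tags.items())
    )
    field_part = ",".join(
        f"{_escape(k, ',= ')}={_field_value(v)}" for k, v in fields.items()
    )
    return f"{head}{tag_part} {field_part} {timestamp_ns}"


# Hypothetical point: one request to an internal report, tagged by report name.
line = to_line(
    "report_requests",
    tags={"report": "daily sales", "status": "200"},
    fields={"count": 1, "duration_ms": 182.5},
    timestamp_ns=1500000000000000000,
)
print(line)
```

Tagging each point with the report name is one way to get the per-report breakdown that controller-level profiling could not give us, since queries can then group by the `report` tag.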

It is worth noting that a single solution would probably not be able to serve all needs. While it is tempting to use a specialized tool for every need, restricting the number of tools to just a handful gives the best of both worlds.

This concludes the first part of this series; in the second part, we’ll talk about the tools we selected for each problem and the pros and cons of our choices. Stay tuned!