How to Set Up a Stack to Instantly Respond to Critical Outages

How we built a system to monitor and respond to outage incidents instantly using PagerDuty and Datadog as a small start-up

The day started off like any other day. Doug and I had just logged onto Pragli and were about to sync up on our engineering tasks for the week. That's when things took a turn for the worse. I tried to start an audio conversation with him and... nothing happened. Our entire AV infrastructure was down, likely for hours, and we had no clue.

We determined that our Firebase clients were desynchronized with our server instance and immediately restarted the relevant AV services. During our post-mortem chat about the outage, we realized that despite how well we stress-tested Pragli in development, outages at our scale were inevitable. We needed to have a system in place to immediately respond to those incidents.

Monitoring and Failure Analysis

Before we implemented a monitoring/alerting system, we itemized the features that we wanted to monitor along with their relative response urgency and potential failure points:

  • Audio / Video Placement Monitoring Chart
  • Audio / Video Signaling Monitoring Chart
  • Main Site Monitoring Chart
  • Auxiliary Background Services Monitoring Chart

Selecting Our Stack: Datadog + PagerDuty

Once we identified the failure points of our backend services, we selected the products that would help us monitor key performance metrics and notify us about incidents. Rather than doing a holistic bakeoff of the available tools in the ecosystem (e.g. PagerDuty vs. VictorOps vs. Opsgenie), we chatted with other tech founders at our stage to inform our decision-making.

The combination of Datadog and PagerDuty was the overwhelming consensus.

  • Datadog for monitoring
  • PagerDuty for incident management and notifications

Datadog

Installation

Installing Datadog on the Ubuntu virtual machines in our infrastructure was dead simple.

  • We executed a single installation shell script on each machine
  • We configured the agent via the /etc/datadog-agent/datadog.yaml file to monitor process information
Process configuration on Datadog
  • Finally, we specified the set of SystemD services that the agent should monitor and report on in the /etc/datadog-agent/conf.d/systemd.d/conf.yaml file (a rough sketch of both files follows this list)
SystemD instances to monitor and report on
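
For reference, here's roughly what those two files look like. Treat this as a sketch: the process_config key differs between Agent versions, and the unit names are placeholders rather than our actual service names.

# /etc/datadog-agent/datadog.yaml (excerpt): turn on process collection
# (key name varies by Agent version - newer Agents use
#  process_config: process_collection: enabled: true)
process_config:
  enabled: "true"

# /etc/datadog-agent/conf.d/systemd.d/conf.yaml: SystemD units to watch
# (unit names below are placeholders, not our actual service names)
init_config:

instances:
  - unit_names:
      - av-placement.service
      - av-signaling.service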

Monitoring Usage

Once the virtual machines were set up with the agent, data started flowing into the Datadog dashboard within a few minutes. The data included statistics about CPU, memory, and unintended SystemD process restarts, broken down by host. These stats covered most of the information we needed to detect an outage.

A consolidated view of all the servers Datadog is tracking internally, along with their health statuses

We then set up the Monitors in Datadog that would identify an outage.

  • For our SystemD services, if the agent detected that any service was restarted more than twice due to a process failure (exit code > 0), the Monitor would go red.
How to set up monitoring and alerts using a Monitor in Datadog
  • For CPU- and memory-intensive services, we used sustained 70% CPU usage and 70% memory usage as the threshold for a Monitor turning red (a rough sketch of the underlying queries follows this list). For most of our virtual machines, we found that this 70% threshold correlated with our system becoming a certifiable shitshow.
  1. Data from requests would swap to disk
  2. Firebase event listeners would lag server updates by several seconds
  3. The above two points would back up the system and progressively snowball CPU/memory usage until the system was pegged within a few minutes
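
As a rough illustration, the metric queries behind those CPU and memory Monitors look something like the following. The evaluation window is arbitrary here, and we're assuming the standard system.* metrics that the Agent ships by default.

# Go red when average CPU usage on any host stays above 70% for 10 minutes
avg(last_10m):100 - avg:system.cpu.idle{*} by {host} > 70

# Analogous memory Monitor, derived from the fraction of usable memory
avg(last_10m):100 * (1 - avg:system.mem.pct_usable{*} by {host}) > 70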

For all of our Monitors, we leveraged the PagerDuty integration for Datadog to forward alerts to PagerDuty to handle notifications. Once integrated, forwarding alerts to PagerDuty was simple - just reference the integration using the "@" symbol in the team notification section of the Monitor.

How to forward alerts to your team using the PagerDuty integration for Datadog
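
Concretely, the notification message of a Monitor ends up looking something like this, where @pagerduty-AV-Signaling stands in for whatever you named the matching PagerDuty service:

{{#is_alert}}
CPU usage on {{host.name}} has been above 70% for the last 10 minutes.
@pagerduty-AV-Signaling
{{/is_alert}}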

PagerDuty

Forwarded Datadog Alerts

Simply create a service in PagerDuty and specify that it will receive alerts forwarded from Datadog.

How to integrate Datadog with PagerDuty

If you followed the PagerDuty integration guide from the previous section, you can find the integration key from the "Integrations" tab once the service is created.

Custom Incidents

To cover custom incidents that Datadog doesn't know about, such as code exceptions that don't entirely crash the SystemD process, we created PagerDuty services that use the PagerDuty Events API.

How to create a PagerDuty service which uses their Events API

When we run into exceptions that signal downtime, we use this simple JavaScript utility function to initiate error notifications in PagerDuty using the service routing key. For example, we use this logic in a script that periodically checks that the main Pragli site is available.

import fetch from 'node-fetch'

// Trigger an incident through the PagerDuty Events API v2.
// `routingKey` is the routing (integration) key of the PagerDuty service,
// and `dedupKey` lets PagerDuty collapse repeated events for the same issue.
const handlePGError = async ({ id, server, routingKey, summary, dedupKey, ts }) => {
  const request = {
    routing_key: routingKey,
    event_action: 'trigger',
    dedup_key: dedupKey,
    payload: {
      summary,
      source: id,
      severity: 'error',
      timestamp: (ts ? new Date(ts) : new Date()).toISOString(),
      group: 'media',
      custom_details: server
    },
  }

  const rsp = await fetch('https://events.pagerduty.com/v2/enqueue', {
    method: 'POST',
    headers: {
      Accept: 'application/vnd.pagerduty+json;version=2',
      'Content-Type': 'application/json',
      From: '[email protected]',
    },
    body: JSON.stringify(request)
  })

  return rsp.text()
}
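
For example, a stripped-down version of our availability check might look like the sketch below. The URL, check interval, and environment variable name are illustrative rather than our exact production values.

// Illustrative uptime check - URL, interval, and env var name are assumptions
const checkMainSite = async () => {
  try {
    const rsp = await fetch('https://pragli.com')
    if (!rsp.ok) throw new Error(`Unexpected status ${rsp.status}`)
  } catch (err) {
    // Trigger (or re-trigger) a "main site down" incident in PagerDuty
    await handlePGError({
      id: 'main-site-checker',
      server: { error: err.message },
      routingKey: process.env.PAGERDUTY_ROUTING_KEY,
      summary: 'Main Pragli site is unreachable',
      dedupKey: 'main-site-down'
    })
  }
}

// Run the check once a minute
setInterval(checkMainSite, 60 * 1000)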

Notifications in PagerDuty

We configured PagerDuty to send loud and aggressive notifications when Datadog or our custom scripts notify it about an incident. We currently have the system configured to firehose us with:

  • Emails
  • Automated phone calls
  • iOS critical alerts
PagerDuty iOS critical alerts

PagerDuty iOS critical alerts have been a particular lifesaver because they cut right through "do not disturb" and notification muting.

Scheduling Pages & Escalating Incidents

When we determined the list of services that we wanted to monitor, we noted down how urgent downtime for a particular service would be.

  • For higher-urgency services, we configured PagerDuty to notify us immediately.
  • For lower-urgency services, we configured it to notify us at the start of our normal working hours (9 AM to 5 PM PST). Doug and I really didn't want to wake up at 3 AM for a Sloth Racing outage 😬.

If the primary responder for a service doesn't acknowledge an incident within 5 minutes, PagerDuty escalates it to the backup responder. Originally, we configured escalation to happen after 30 minutes but realized (after a few painful incidents) that if someone didn't respond to a page within 5 minutes, they were very likely not going to respond within 30.

Future Modifications & Improvements

We are making two significant changes to our engineering infrastructure over the next few months. We plan to:

  • Roll out our own real-time data infrastructure
  • Convert our SystemD services to run within Docker containers to deploy on Google Kubernetes Engine (GKE)

We have used Firebase for the last two years since it allows us to quickly prototype new features without worrying about creating API routes or performing database migrations. But recently, as we've started to grow our user base, we've hit fundamental scale limitations with the platform and have been working to roll out our own infrastructure. Although we're excited about this transition, we will have to actively monitor:

  • Multiple PostgreSQL databases (active and standby)
  • Thousands of concurrent WebSocket connections
  • GraphQL API servers
  • Containers within a GKE cluster

But despite the order-of-magnitude increase in infrastructure complexity, we're confident that this monitoring stack provides enough flexibility to grow with us. Datadog and PagerDuty have a wide integration ecosystem for Apollo and Postgres and provide first-class support for container platforms like Kubernetes.

If you have any questions about how we set up our monitoring stack, feel free to reach out to us on Twitter or shoot me an email at [email protected]!

What's Pragli?

Pragli is a virtual office for remote teams. We built and monitor our product using awesome solutions like Datadog and PagerDuty.

Learn more here - it's free!
