Dec 14: On-Call Systems
This is the fourteenth post in the Fastmail Advent 2024 series. The previous post was Dec 13: It’s knot DNS. There’s no way it’s DNS. It is DNS! The next post is Dec 15: Platform Team working agreement.
After the scary interlude for Friday the 13th, here’s a follow-up to our blog about support and on-call.
At Fastmail, we own and operate most of our stack: from the networking gear and servers in the hosted datacenters we use, through server OS setup, internal services, backups and redundancy, to user-facing services and mail flow.
In the last post, we introduced our organizational model for how our support and on-call team manages on-call and incidents. In this blog post, we get a little more technical about the systems we use to help us do this.
How incidents are raised
We source alerts from our service stack through a number of mechanisms:
- Prometheus alerts for infrastructure and service metrics such as disk usage and replication delay
- Logwatchers that alert on things like Cyrus errors
- Cron scripts that run regular tests and alert on failures (a minimal sketch of this pattern follows the list)
- Alerts from external monitoring
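To make the cron-check pattern concrete, here’s a minimal sketch. The endpoint and the check are hypothetical stand-ins rather than our actual checks: the idea is simply that a failing check exits non-zero and logs an error line that the surrounding tooling can turn into an alert.

```python
#!/usr/bin/env python3
# Hypothetical cron check: probe a health endpoint, exit non-zero and log an
# error line on failure so the cron wrapper / logwatcher can raise an alert.
import sys
import urllib.request

CHECK_URL = "https://internal.example/healthz"  # placeholder endpoint

def main() -> int:
    try:
        with urllib.request.urlopen(CHECK_URL, timeout=10) as resp:
            if resp.status == 200:
                return 0
            print(f"CHECK FAILED: {CHECK_URL} returned {resp.status}", file=sys.stderr)
    except OSError as exc:  # covers URLError/HTTPError and network failures
        print(f"CHECK FAILED: {CHECK_URL} unreachable: {exc}", file=sys.stderr)
    return 1

if __name__ == "__main__":
    sys.exit(main())
```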
We’re currently experimenting with integrating all of these sources into a single observability stack (I’m a big fan of the unified observability model) - maybe you will read about that in next year’s advent blog post series!
But right now, all of these alerts go straight to PagerDuty to create an incident there.
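For a sense of what an alert source hands to PagerDuty, here’s a minimal sketch of a trigger event using the Events API v2. The routing key, summary, and dedup key are made-up placeholders:

```python
# Sketch: raising a PagerDuty incident via the Events API v2.
# The routing key and alert details below are placeholders.
import json
import urllib.request

EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"

def trigger_alert(routing_key: str, summary: str, source: str, dedup_key: str) -> None:
    payload = {
        "routing_key": routing_key,   # the integration key for the service
        "event_action": "trigger",
        "dedup_key": dedup_key,       # lets a later "resolve" close this alert
        "payload": {
            "summary": summary,
            "source": source,
            "severity": "critical",
        },
    }
    req = urllib.request.Request(
        EVENTS_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=10)

# e.g. trigger_alert("<integration key>", "replication delay above threshold",
#                    "imap42.internal", "replication-delay-imap42")
```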
Our support team, and indeed any staff member, is empowered to raise an incident at any time when they observe issues that indicate an infrastructure or service failure. During working hours this is done via Slack pings; outside them, by paging through our slackops bot.
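For illustration, here’s roughly how a chatops bot can open an incident through PagerDuty’s REST API. The token, requester email, service ID, and helper name below are hypothetical, not our actual bot:

```python
# Hypothetical sketch: a chatops bot opening an incident via PagerDuty's
# REST API (POST /incidents). Token, email and service ID are placeholders.
import json
import urllib.request

def page_oncall(api_token: str, from_email: str, service_id: str, title: str) -> None:
    body = {
        "incident": {
            "type": "incident",
            "title": title,
            "service": {"id": service_id, "type": "service_reference"},
        }
    }
    req = urllib.request.Request(
        "https://api.pagerduty.com/incidents",
        data=json.dumps(body).encode(),
        headers={
            "Authorization": f"Token token={api_token}",
            "From": from_email,  # requesting user, required by this endpoint
            "Content-Type": "application/json",
            "Accept": "application/vnd.pagerduty+json;version=2",
        },
    )
    urllib.request.urlopen(req, timeout=10)
```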
PagerDuty alerts
Our PagerDuty on-call rotation and escalation process ensure that incidents are always responded to promptly.
The primary on-call engineer is also responsible for Business As Usual (BAU) tasks such as reviewing non-critical errors and escalations. When incidents have to be escalated past the primary on-call engineer, they go first to the team lead. This allows the on-call engineers that are off rotation to focus on project work during the day, and decompress outside of work hours without worrying about on-call.
To minimize the attention load of on-call alerts on the platform engineers, an engineer can override the on-call schedule and take on-call themselves when doing a large deploy that is expected to trigger alerts.
We have recently created a dedicated incident discussion channel. Previously, incident communication would take place either in the alerts feed channel, the “ops” channel where we post updates for visibility into changes and deploys, or in a team channel for the team principally responding to the incident.
A dedicated incident discussion channel removes incident discussion noise from other channels. We decided against the “create a dedicated channel for every incident” approach, as we think this would create too much churn and reduce visibility.
Outage notifications
When our service is experiencing issues or outages, we want our users to know as soon as possible, both for their benefit and our support team’s. As we have support staff on shift 24/7, they handle outage notifications via our status page at https://fastmailstatus.com/, Zendesk banners, and posts on our social channels. During incidents, they act as the communication conduit to our users: ascertaining the extent of outages, ensuring timely updates, and checking that restored services are actually reflected in the user experience.
PagerDuty nitty-gritty
Some of the things we do to automate and simplify our life as on-call engineers butt up against the limits of PagerDuty.
For example, it would be nice to have an empty escalation layer as the first layer, one that during normal times falls through to the next layer instantly. People could then add themselves to this layer to take over the on-call escalation chain while doing deploys. However, PagerDuty does not allow this, so we have to override the schedules for primary on-call instead.
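Overriding the schedule is at least easy to script against the public API. Here’s a minimal sketch of taking on-call for a deploy window; the schedule ID, user ID, token, and exact payload shape are placeholders and worth checking against the current PagerDuty docs:

```python
# Sketch: overriding the primary on-call schedule for a deploy window via
# PagerDuty's REST API (POST /schedules/{id}/overrides). IDs are placeholders.
import json
import urllib.request

def take_oncall(api_token: str, schedule_id: str, user_id: str, start: str, end: str) -> None:
    body = {
        "override": {
            "start": start,  # e.g. "2024-12-14T09:00:00+10:00"
            "end": end,      # e.g. "2024-12-14T12:00:00+10:00"
            "user": {"id": user_id, "type": "user_reference"},
        }
    }
    req = urllib.request.Request(
        f"https://api.pagerduty.com/schedules/{schedule_id}/overrides",
        data=json.dumps(body).encode(),
        headers={
            "Authorization": f"Token token={api_token}",
            "Content-Type": "application/json",
            "Accept": "application/vnd.pagerduty+json;version=2",
        },
    )
    urllib.request.urlopen(req, timeout=10)
```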
Our on-call engineers also have regular unavailabilities during the week for scheduled events such as gym training or dance classes. For these, we want to add exclusions to the schedule so that alerts go to the fallback on-call immediately.
These unavailabilities are different for every individual engineer. PagerDuty doesn’t let us model this - on-call schedules can have almost arbitrary shift patterns through the week, but these are per schedule and not per on-call user. So we have to split things up and make a separate schedule for every engineer. However, we then can no longer make a rotating schedule - inside one schedule, a user can’t be on for a week and then off for a week, unless that off week is taken by another user.
To work around this, we have created (and paid for) a “Blank” dummy user that has no contact methods and is only there to fill the off week for a user.
As you can see in the picture, there are 3 layers in the schedule. The first layer is a base layer that is a weekly rotation between the fallback engineer and the Blank user.
The second layer has the primary on-call engineer’s shift for the first half of the day, and the third layer has the second half of the day. These are also weekly rotations between the on-call engineer and the Blank user.
To understand what’s going on here, imagine that the primary on-call engineer needs to be offline every Monday from 4PM to 6PM. For that, we put an 8AM - 4PM shift time for every Monday in the second layer, and 6PM to 10PM in the third layer. During the 4PM to 6PM window, there is no on-call user in the second or third layer, so it falls back to the base layer and the fallback engineer will be active.
Finally, we make a separate schedule for every on-call engineer following the same pattern, but offset by one week.
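To make the layering a bit more concrete, here’s a rough sketch of how the Monday example could be expressed in the shape PagerDuty’s schedule layers take. The user IDs, dates, and field names are an approximation of the public API rather than a drop-in payload:

```python
# Sketch: the three-layer schedule from the Monday example, in roughly the
# shape of PagerDuty's schedule_layers. IDs and dates are placeholders.
# Layers are listed from the base layer up; higher layers win where they
# have coverage, and gaps fall through to the base layer.
ONE_WEEK = 7 * 24 * 3600

schedule_layers = [
    # Base layer: fallback engineer alternating weekly with the "Blank" dummy user.
    {
        "name": "Base / fallback",
        "start": "2024-12-02T00:00:00+10:00",
        "rotation_virtual_start": "2024-12-02T00:00:00+10:00",
        "rotation_turn_length_seconds": ONE_WEEK,
        "users": [
            {"user": {"id": "P_FALLBACK", "type": "user_reference"}},
            {"user": {"id": "P_BLANK", "type": "user_reference"}},
        ],
    },
    # Second layer: primary engineer, Mondays 08:00-16:00, weekly rotation with Blank.
    {
        "name": "Primary, first half of day",
        "start": "2024-12-02T00:00:00+10:00",
        "rotation_virtual_start": "2024-12-02T00:00:00+10:00",
        "rotation_turn_length_seconds": ONE_WEEK,
        "users": [
            {"user": {"id": "P_ENGINEER", "type": "user_reference"}},
            {"user": {"id": "P_BLANK", "type": "user_reference"}},
        ],
        "restrictions": [
            {"type": "weekly_restriction", "start_day_of_week": 1,  # Monday
             "start_time_of_day": "08:00:00", "duration_seconds": 8 * 3600},
        ],
    },
    # Third layer: primary engineer, Mondays 18:00-22:00; 16:00-18:00 has no
    # coverage in layers two or three, so it falls through to the base layer.
    {
        "name": "Primary, second half of day",
        "start": "2024-12-02T00:00:00+10:00",
        "rotation_virtual_start": "2024-12-02T00:00:00+10:00",
        "rotation_turn_length_seconds": ONE_WEEK,
        "users": [
            {"user": {"id": "P_ENGINEER", "type": "user_reference"}},
            {"user": {"id": "P_BLANK", "type": "user_reference"}},
        ],
        "restrictions": [
            {"type": "weekly_restriction", "start_day_of_week": 1,  # Monday
             "start_time_of_day": "18:00:00", "duration_seconds": 4 * 3600},
        ],
    },
]
```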
What works well
I have done overnight on-call at previous companies, and that was a lot more stressful. I definitely think being able to split on-call is a lot better for engineers’ quality of life!
We have ops tools for things like log searching and replication stats, plus Grafana dashboards and Prometheus alerts, that give us quick insight into the state of our infrastructure and services so we can pin down the root cause of an incident quickly. These could be more comprehensive and better organized, but they work well.
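As a trivial example of the kind of quick look these give us, anything Prometheus scrapes is one HTTP query away. The server address and metric name below are made-up placeholders:

```python
# Sketch: a quick ad-hoc lookup against Prometheus's HTTP query API.
# The Prometheus address and the metric name are placeholders.
import json
import urllib.parse
import urllib.request

PROM = "http://prometheus.internal:9090"  # placeholder address

def prom_query(expr: str) -> list:
    url = f"{PROM}/api/v1/query?" + urllib.parse.urlencode({"query": expr})
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)["data"]["result"]

# Which instances are currently behind on replication? (hypothetical metric)
for series in prom_query("replication_delay_seconds > 60"):
    print(series["metric"].get("instance"), series["value"][1])
```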
We have runbooks for some, but not all, alerts. Where they exist they are quite good and comprehensive and we frequently review and update them after incidents.
What could work better
Most of our incidents auto-resolve when the service / monitor returns to nominal service. This is good! We are lucky to have very few flappy alerts that bounce up and down.
But a handful of alerts do not auto-resolve because the alerting mechanism is not stateful. This is confusing for engineers, and it is an extra annoyance to clean up alerts after an incident.
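For sources that can be made stateful, the fix is straightforward: send a resolve event carrying the same dedup_key as the original trigger, and PagerDuty closes the alert on its own. A minimal sketch, using the same Events API v2 shape as earlier:

```python
# Sketch: closing an alert by sending a "resolve" event carrying the same
# dedup_key as the original "trigger" event (Events API v2).
import json
import urllib.request

def resolve_alert(routing_key: str, dedup_key: str) -> None:
    payload = {
        "routing_key": routing_key,
        "event_action": "resolve",
        "dedup_key": dedup_key,  # must match the key sent with the trigger
    }
    req = urllib.request.Request(
        "https://events.pagerduty.com/v2/enqueue",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=10)
```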
We don’t currently have good categorization and prioritization of alerts. The only way to know whether an alert is important is to know from experience, or to ask someone who has that experience. We need to spend more time and be more ruthless in weeding out low-quality alerts!