FastMail Outage, August 1st
We know how vital email is to all our customers, whether personal or business. We sincerely apologise for the disruption caused by this downtime, and we appreciate that not being able to access your email causes frustration and concern.
Just the facts
FastMail was unavailable to the vast majority of our customers for nearly 3 hours on August 1st.
This was due to a network equipment failure in our primary datacentre at NYI New Jersey. FastMail staff were working closely with NYI during the entire outage period; the failure was within NYI’s equipment, and the fix was done entirely by their staff.
No mail was lost during the outage, because power remained on and no servers were reset. Incoming mail to FastMail customers was either delayed until the New Jersey network came back up, or delivered to the inbound mail service in our Seattle datacentre.
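For anyone unfamiliar with how mail falls back like this: sending servers look up the destination's MX records and try them in order of preference, so a lower-priority MX in another datacentre picks up delivery when the primary site is unreachable. Here's a minimal sketch of that selection logic, using the third-party dnspython library and a placeholder domain (none of the names or ports below are our real configuration):

```python
# Sketch: how a sending server picks an MX host, falling back to
# lower-priority entries when the preferred one is unreachable.
# Uses the third-party "dnspython" package; the domain is a placeholder.
import socket

import dns.resolver


def deliverable_mx(domain, timeout=5.0):
    """Return the first MX host that accepts a TCP connection on port 25."""
    answers = dns.resolver.resolve(domain, "MX")
    # Lower preference value = try first (e.g. the primary site before a secondary).
    for rdata in sorted(answers, key=lambda r: r.preference):
        host = str(rdata.exchange).rstrip(".")
        try:
            with socket.create_connection((host, 25), timeout=timeout):
                return host
        except OSError:
            continue  # this MX is unreachable; fall through to the next one
    return None


if __name__ == "__main__":
    print(deliverable_mx("example.com"))
```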
Timeline
12:28am PDT / 3:28am EDT / 7:28am GMT / 5:28pm AEST
We went offline to the majority of networks. During this time our servers could still contact some networks, some customers found they could route via VPNs in some parts of the world, and some smartphones still received push updates.
3:15am PDT / 6:15am EDT / 10:15am GMT / 8:15pm AEST
Full service was restored.
Total time of seriously degraded connectivity: 2 hours and 47 minutes.
Technical Tidbits
A memory leak on an NYI BGP backbone router caused BGP sessions to re-initialise. The router was successfully deprioritised and taken out of production by NYI's automated systems, and traffic was diverted to alternate paths. However, this revealed an unrelated issue with advertisements of routes within NYI, which caused FastMail's primary IP range (66.111.4.0/24) to not be advertised correctly.
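If you want to check this kind of thing yourself, looking-glass services such as RIPEstat will tell you whether a prefix is currently visible in the global routing table. Here's a rough sketch using their public data API; the exact response fields we read are an assumption on our part and worth checking against the API documentation before relying on them:

```python
# Sketch: ask RIPEstat whether a prefix is visible in the global BGP table.
# This uses RIPEstat's public "routing-status" data call; the response
# fields accessed below are an assumption, so we read them defensively.
import json
import urllib.request

PREFIX = "66.111.4.0/24"
URL = f"https://stat.ripe.net/data/routing-status/data.json?resource={PREFIX}"

with urllib.request.urlopen(URL, timeout=10) as resp:
    data = json.load(resp).get("data", {})

announced = data.get("announced")        # True/False when the API reports it
visibility = data.get("visibility", {})  # per-address-family peer visibility, if present

print(f"{PREFIX} announced: {announced}")
print(f"visibility: {visibility}")
```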
The 66.111.4.0/24 range is quite special. It was originally hosted at NYI's New York datacentre and was routed to the New Jersey datacentre when we moved our servers last year. It's also automatically protected behind a Level 3 DDoS filter when needed. This complexity meant it took the overnight staff longer to diagnose and repair the network issue caused by the router failure than it otherwise might have.
The memory leak appears to be a bug in the underlying firmware. NYI have reported it to the vendor and are working with their engineers to get a fix released as quickly as possible. They do not anticipate the issue recurring before the firmware update, and no downtime is expected when the update is applied.
What does redundancy mean?
Whenever we experience any outage, we receive a bunch of feedback asking if we have redundant servers or not.
We do have secondary IMAP servers, currently split between two locations: NYI Seattle and Switch in Amsterdam. Both of these datacentres were up, and indeed able to communicate with our New Jersey datacentre via VPN links that run over private IP ranges. Those ranges aren't advertised to the public, because doing so would make them DDoS targets.
In theory we could have routed traffic via those existing links. However, we currently have a 1 hour Time To Live (TTL) on all our DNS records, which means it would take up to an hour from the moment we made that decision until our customers were back online via that path. Given that we expected NYI's network to stabilise at any moment, we decided it wasn't worth starting the clock on rerouting all traffic to another datacentre.
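To make that concrete, here's a small sketch (using the third-party dnspython library and a placeholder hostname) that reads the TTL on a record: with a 3600 second TTL, a resolver that cached the old address just before we switched could keep handing it out for up to an hour.

```python
# Sketch: inspect the TTL on an A record to estimate worst-case failover time.
# Uses the third-party "dnspython" package; the hostname is a placeholder.
import dns.resolver

name = "www.example.com"
answer = dns.resolver.resolve(name, "A")

ttl = answer.rrset.ttl  # seconds a resolver may cache this answer
print(f"{name} -> {[r.address for r in answer]} (TTL {ttl}s)")
print(f"Worst case: clients keep the old address for up to {ttl / 60:.0f} minutes"
      " after we change the record.")
```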
In hindsight, we may have been able to shorten the outage slightly by switching, but hindsight is always easy, and it still would have been over an hour!
Improvements we can make
While this outage was entirely outside our control, there are steps we can take to reduce the impact of any similar issue in the future. NYI have already put in place additional automated checks and regression tests to catch the secondary issue with route advertisements on alternate paths, and have thoroughly tested that automation to confirm it behaves as intended.
They have also engaged a third party to carry out a full audit of the relevant network architecture to ensure maximum reliability. Given that this has never happened before in the more than 15 years we have been with NYI, we're confident that it won't happen again.
It would be easy to react in a knee-jerk fashion and add measures that actually reduce our reliability overall. Any course of action that works around a rare failure needs to be reliable enough that it fails less often than the failure it protects against! We have learned this lesson in the past from high availability solutions which failed more often than our servers did.
Having said that, we will take actions which are valuable for a wider range of potential outage situations, such as:
Short term improvements:
- Adding a second network range directly in NYI/NJ's network block, one which doesn't have the BGP routing or DDoS protection complications. We will have this bound on all our external hosts as well as our 66.111.4.0/24 range, and be able to switch to it without long delays. This gives FastMail operations another option if our main range is hurt for some reason, and it will be tested frequently.
- Reducing DNS TTLs. We haven't blogged about it, but we were hit really badly by a DNS DDoS last year. We've been cautious of risks with DNS; however, we are using Cloudflare DNS fronting, which absorbs the bulk of the DNS load, and we believe our system can remain stable with a shorter TTL. This enables us to switch networks faster, moving either to another network range at NYI, or elsewhere. A rough sketch of what that switching could look like follows this list.
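As a rough illustration of the kind of switching the short term items above make possible: probe the primary range, and after a few consecutive failures repoint the records at the alternate range. The addresses and the update_a_record helper below are placeholders, not our real addresses or tooling; any production version would drive our DNS provider's API and carry far more safety checks.

```python
# Sketch: probe the primary network range and, after repeated failures,
# repoint A records at an alternate range. PRIMARY_IP, ALTERNATE_IP and
# update_a_record() are hypothetical placeholders.
import socket
import time

PRIMARY_IP = "198.51.100.10"   # documentation-range placeholder for the main range
ALTERNATE_IP = "203.0.113.10"  # placeholder for the second range
FAILURES_BEFORE_SWITCH = 3


def reachable(ip, port=443, timeout=5.0):
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False


def update_a_record(name, new_ip):
    # Placeholder: a real implementation would call the DNS provider's API here.
    print(f"Would repoint {name} A record to {new_ip}")


def watch(name="mail.example.com"):
    failures = 0
    while True:
        if reachable(PRIMARY_IP):
            failures = 0
        else:
            failures += 1
            if failures >= FAILURES_BEFORE_SWITCH:
                update_a_record(name, ALTERNATE_IP)
                return
        time.sleep(30)  # short TTLs only help if failure is also detected quickly


if __name__ == "__main__":
    watch()
```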
Medium term improvements:
- We are close to launching a beta of our new JMAP-based webmail interface. The JMAP protocol allows the client to discover its API endpoints separately from where it logs in. By hosting the login/session system in multiple places, we can more easily move sessions when a datacentre goes offline, without impacting performance in the common case where everything is alive. This would have worked very well in this particular outage, where our servers could still talk happily to each other, but not all of them could reach the rest of the world. (There is a sketch of the session discovery step after this list.)
- For non-web services, we can host endpoints in multiple datacentres with minimal performance impact. There is the triangle problem (connecting directly to New Jersey via the internet will almost always be faster than connecting to Seattle and from there to New Jersey), but with the fast DNS switching above, we could ease this in only when needed.
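To show the JMAP part of this in practice, here's a sketch of the session discovery step defined in RFC 8620: the client fetches a small session object and then talks to whatever apiUrl that object points at, so the session endpoint and the mail API don't have to live in the same datacentre. The hostname and token below are placeholders, not our production endpoints.

```python
# Sketch: JMAP session discovery as described in RFC 8620. The client asks
# the well-known session URL where the API lives, then sends requests there.
# The host and token are placeholders.
import json
import urllib.request

SESSION_URL = "https://jmap.example.com/.well-known/jmap"
TOKEN = "app-password-or-token"


def fetch_session(url):
    req = urllib.request.Request(url, headers={"Authorization": f"Bearer {TOKEN}"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)


session = fetch_session(SESSION_URL)
# apiUrl may point at a different host than the session endpoint, which is
# what lets the login/session system live somewhere other than the mail store.
print("API endpoint:", session.get("apiUrl"))
print("Capabilities:", sorted(session.get("capabilities", {})))
```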
Long term improvements:
- We’re still working towards improvements to our Cyrus IMAP server’s replication system to support a true multi-master configuration, which would allow users to be transparently migrated between datacentres. This has always been a goal, but there’s a ton of work to make it really safe and avoid the need for human intervention. You can follow the open source project mailing lists to see how we’re going with developing this.
- Our web interface can be changed to report when parts of the system are down, so that our customers don’t have to check Twitter or fastmailstatus.com directly for partial outages.