Dec 12: Following the Sun
Chief Experience Officer
Platform Engineer
This is the twelfth post in the Fastmail Advent 2024 series. The previous post was Dec 11: Meet the team—Marc. The next post is Dec 13: It’s knot DNS. There’s no way it’s DNS. It is DNS!.
Because Fastmail provides a global service for end users, we operate 24/7 and respond to issues around the clock. Since humans do not operate around the clock, maintaining our service effectively takes some thoughtful process.
To do this, our support team is split across three time zones, and our on-call engineers are split across two.
As a term of art, this is often referred to as “Follow the Sun support”: Wherever the sun is shining right now, agents and support engineers are working and ready there, resulting in 24/7 coverage.
Support
Our support staff is located in both of our offices in Melbourne and Philadelphia, and remotely in India. Having staff located around the world works really well for us and for our customers, who are also spread across the globe!
Having staff spread across multiple time zones allows us to provide support to you, our customer, when you need it. We aim to respond to routine questions within a few hours. We usually do much better than that, and respond within an hour!
If we were all located in a single office, not only would our customers potentially be left waiting for a full day to get a response to a ticket, but our support team would start each morning with a daunting backlog of tickets. It’d be much trickier to surface customers’ most urgent tickets, and the natural response to a backlog might be to rush through tickets. Instead, we have opted out of that unnecessary stress and can give each user the time it takes to fully research their issue and then send them a thorough, thoughtful response.
Having a team across multiple locations does present challenges too, particularly with collaboration. To mitigate that, the support team has multiple brief huddles, with each location having some hours of overlap within their schedule. At the end of our day, we hand the baton off to those who are just starting their day, letting them know if there is anything impacting multiple customers, like moving to a new data center or migrating Pobox users. We also use these huddles to help each other with particularly challenging tickets, to make sure the team is aware of work on the horizon, and to share newly acquired knowledge across the team.
Outside of our daily huddles, we communicate with everyone on the Fastmail team synchronously in Slack and asynchronously via our sister product, Topicbox.
On-Call Engineers
At Fastmail, we own and operate most of our stack: networking gear and servers in the data center space we use, server OS setup, internal services, backups, and redundancy, all the way through to user-facing services and mail flow.
To maintain the availability of all these 24/7, our platform team operates an on-call process split between our US and Australian teams.
The Australian team works in the Australian Eastern time zone and the US team in the US Eastern time zone. We have set up the rotation so that the Australian team is on call from 8AM to 10PM their time, and the US team from 6AM to 4PM their time.
So the hours are a little less friendly for the US team, but in return they cover less time overall: 10 hours a day to the Australian team’s 14.
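To see how those two windows fit together, here is a minimal Python sketch that checks which team is on call for each UTC hour of a day. It is an illustration only, not our actual scheduling tooling: the names `SHIFTS`, `on_call_teams`, and `coverage_by_utc_hour` are made up for this example, and it assumes fixed December offsets (AEDT at UTC+11 and EST at UTC-5) rather than real time zone rules.

```python
from datetime import datetime, time, timedelta, timezone

# Assumption: fixed December offsets (AEDT = UTC+11, EST = UTC-5).
# A real scheduler would use zoneinfo so daylight-saving changes are handled.
MELBOURNE = timezone(timedelta(hours=11), "AEDT")
US_EASTERN = timezone(timedelta(hours=-5), "EST")

# Local on-call windows from the post: (team, zone, start, end), end exclusive.
SHIFTS = [
    ("Australia", MELBOURNE, time(8), time(22)),  # 8AM to 10PM, 14 hours
    ("US", US_EASTERN, time(6), time(16)),        # 6AM to 4PM, 10 hours
]


def on_call_teams(moment: datetime) -> list[str]:
    """Return every team whose local window contains the given aware datetime."""
    teams = []
    for team, zone, start, end in SHIFTS:
        local = moment.astimezone(zone).time()
        if start <= local < end:
            teams.append(team)
    return teams


def coverage_by_utc_hour(day: datetime) -> dict[int, list[str]]:
    """Map each UTC hour of the given day to the teams on call at that hour."""
    midnight = day.replace(hour=0, minute=0, second=0, microsecond=0)
    return {h: on_call_teams(midnight + timedelta(hours=h)) for h in range(24)}


if __name__ == "__main__":
    day = datetime(2024, 12, 12, tzinfo=timezone.utc)
    for hour, teams in coverage_by_utc_hour(day).items():
        print(f"{hour:02d}:00 UTC -> {', '.join(teams) or 'nobody'}")
```

With these December offsets the two windows tile the day exactly, with no gap and no overlap. When the offsets shift with daylight saving, the same local hours no longer line up so neatly. A real rotation would need to account for that, which this sketch does not attempt.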
As well as being the primary on-call engineers, the platform team also owns our observability stack and is empowered to make changes to all parts of our stack to improve monitoring and alerting. Nothing is worse than being responsible for something you can’t fix, so we make sure to avoid that!
Another practice we follow to reduce on-call load is to help the support team troubleshoot and fix issues without needing to escalate to engineers. Our support team can search logs and run admin and fixup tools to resolve problems themselves.
One of Fastmail’s operational maxims is “the spice must flow”. So, the first priority during incidents is to restore service—the mail has to flow, and users need to be able to access their mailboxes and use all the other features we provide. Our on-call engineers are experts in most, but not all, systems and processes at Fastmail, so sometimes this involves paging experts or technical owners to help troubleshoot.
Once service is restored, everything else is less urgent, and incident reports and follow-ups are coordinated via (what else?) emails on our internal mailing lists. We make sure that we understand what happened, fix bugs, and make the system more robust for the future.