Dec 3: On moving house — bringing a new data centre online

Graeme Lee, Senior System Engineer

This is the third post in the Fastmail Advent 2024 series. The previous post was Dec 2: Throwback: security — confidentiality, integrity, and availability. The next post is Dec 4: Meet the team — Bek.

Who hasn’t moved house at some point in their lives? Whether we choose to rent or buy, board, or backpack, each move to new premises brings its own challenges to consider.

You might be striking out, leaving the nest and finding your own dwelling. Or you may have been comfortably in the one location for many years, and now it’s time to move on. Maybe you live in a motor home? Then relocating is probably something you just do. It’s different for everyone, and the decisions and circumstances are unique to each individual.

Well, for reasons, Fastmail decided it was time to relocate our DCs. And it felt a lot like moving house. How? Here are some crossovers.

  • How much space do we need? Usually the first question asked, I suppose. Do we have enough already? Are we wasting space? What if we need more? How do we expand? If the family has moved out, we can downsize. Or there might be another project on the way, and we need another room.
  • What’s the neighbourhood like? When house-hunting, we will consider how we might fit in with the community we are integrating into. We will look at services, schools, shopping facilities. Whatever we may prioritise, it will be unique to each one’s circumstances. When we look at data centres, we might consider their physical location. What about connectivity? Do we have specific requirements for any services? What are the facilities like when connecting, upgrading, or disconnecting?
  • What’s the access like? Is it miles from anywhere, or is it in the CBD? Are we prepared to commute, or do we want quick and easy access? Does it have a garage or is it street parking only? The same questions can be asked when choosing a new DC. We might appreciate the opportunity to get out and visit. Or do we want something close that has convenient parking?
  • Is the furniture we have going to work in our new abode? Maybe it’s all we need, and we’ll “make it fit”. Do we have the budget to refurbish? What can we throw away? What do we keep? We could go on and on here, but it’s important to accept that what works for one person may not be the right fit for the next. The same goes for moving to a new data centre. You might need new hardware, so it’s time to find a new space and start fresh. Either way, what you choose to bring and what you choose to leave behind is always your choice alone.

Timing is everything

How long will the move take? Maybe you’re moving across the continent, so it’s going to take a few days’ travel by road. Or it’s just around the corner, so logistically, the move looks like it will be fairly quick. But now, you’re moving data centres. The lights need to stay on! We now have a conundrum. Your servers might be fine with being turned off. But not your service!

How do you go about moving a live production system from one data centre to another with zero downtime as the goal? This is where your existing data integrity, resilience, and backup strategies come into play.

At Fastmail, data integrity is of utmost importance. It was time to put our integrity policy to the test! We were about to stretch our network’s capabilities to the limit, break some old rules and make new ones, and find some new corner cases that we had not considered.

And you can’t just do one thing and move on to the next. A lot of things need to be thought out and prepared for before you can turn up at the doorstep and ask for the keys!

We did the tours. We shopped around. And we found a data centre that we felt good about. It was accessible, and had the right mix of network carriers in it for our needs. Our floorplan was drawn up, cables were run, and the plumbing was good to go.

Let’s do this!

New switching and routing hardware was delivered and installed. We needed some way to bridge the two data centres together, so we used Megaport to provide a private circuit between our sites. That let us transparently continue to provide services from our existing network connections without disrupting normal operations.
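For the curious, here’s a minimal sketch (in Python, with placeholder hostnames, and not our actual tooling) of the kind of sanity check you might run over a new cross-connect before trusting it with replication traffic: time a handful of TCP connections to a host on the far side and look at the round-trip times.

```python
# A minimal sketch, not Fastmail's tooling: sanity-check a new cross-DC circuit
# by timing TCP connections to a host on the far side.
import socket
import statistics
import time

PEER_HOST = "standby.newdc.example.internal"  # hypothetical host in the new DC
PEER_PORT = 22                                # any TCP service known to be listening

def tcp_connect_rtts(host: str, port: int, samples: int = 10) -> list[float]:
    """Return TCP connect round-trip times in milliseconds."""
    rtts = []
    for _ in range(samples):
        start = time.perf_counter()
        with socket.create_connection((host, port), timeout=5):
            pass  # we only care about how long the handshake took
        rtts.append((time.perf_counter() - start) * 1000)
        time.sleep(0.2)  # be gentle on the far end
    return rtts

if __name__ == "__main__":
    rtts = tcp_connect_rtts(PEER_HOST, PEER_PORT)
    print(f"median {statistics.median(rtts):.1f} ms, worst {max(rtts):.1f} ms "
          f"over {len(rtts)} samples")
```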

We decided to divide and conquer our data. We already have data replicated between our primary and backup DCs. It made sense to employ a similar technique. We transferred the operational load onto the servers that were staying up, offlined our standby systems, put them on a truck, and shipped. Things were OK. But as time passed, we discovered that the load now on our remaining hosts was taxing them to their limits! Fortunately, the time-in-transit was only a few hours, and we were able to keep things running.
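As a rough illustration (not our actual runbook, and `replication_lag_seconds` here is a made-up stand-in for whatever your replication stack reports), the pre-flight logic before powering anything off looks something like this: for every dataset on a host about to be shipped, confirm the copy on a host that’s staying online is fully caught up, and refuse to proceed otherwise.

```python
# Illustrative sketch only: the shape of a pre-flight check before powering off
# hosts that are about to go on a truck. replication_lag_seconds() is a
# hypothetical stand-in for whatever your replication system exposes; it is
# not a real API.
from dataclasses import dataclass

MAX_ACCEPTABLE_LAG = 0  # seconds; nothing on the truck should be ahead of what stays

@dataclass
class DatasetCopy:
    shipping_host: str   # host about to be powered off and shipped
    staying_host: str    # host that keeps serving this dataset

def replication_lag_seconds(source: str, target: str) -> float:
    """Hypothetical: how far target's copy is behind source's copy."""
    raise NotImplementedError("query your replication stack here")

def safe_to_power_off(pairs: list[DatasetCopy]) -> bool:
    """Only give the go-ahead if every staying copy is fully caught up."""
    ok = True
    for pair in pairs:
        lag = replication_lag_seconds(pair.shipping_host, pair.staying_host)
        if lag > MAX_ACCEPTABLE_LAG:
            print(f"{pair.staying_host} is {lag:.0f}s behind {pair.shipping_host}: "
                  f"not safe to power off {pair.shipping_host}")
            ok = False
    return ok
```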

Once our standby hosts were in place, we began to re-sync them with our master systems. This was fairly straightforward. But we quickly discovered that we had not provisioned enough bandwidth between our DCs to cater for the high intra-network load. Extra bandwidth on our DC cross-connect was provisioned, and things synced up nicely.
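The arithmetic that caught us out is simple enough to sketch. With made-up numbers (these are not our real data volumes or link sizes), the re-sync time is just the data to move divided by the share of the cross-connect you can actually use:

```python
# Back-of-the-envelope sketch with made-up numbers, not Fastmail's real figures:
# how long does a full re-sync take for a given data volume and link size?
def resync_hours(data_tb: float, link_gbps: float, efficiency: float = 0.7) -> float:
    """Hours to move data_tb terabytes over a link_gbps circuit, assuming only
    `efficiency` of the raw bandwidth is available for sync traffic."""
    bits_to_move = data_tb * 8e12                       # decimal TB -> bits
    usable_bits_per_sec = link_gbps * 1e9 * efficiency
    return bits_to_move / usable_bits_per_sec / 3600

if __name__ == "__main__":
    for gbps in (1, 10, 40):
        print(f"{gbps:>2} Gbps link: {resync_hours(100, gbps):7.1f} hours to re-sync 100 TB")
```

At these illustrative figures, going from a 1 Gbps to a 10 Gbps cross-connect turns a roughly two-week re-sync into a day-and-a-half one, which is why provisioning the extra bandwidth mattered.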

Because we experienced a higher than expected load on our existing servers, we had to rethink our redundancy strategy. We had already moved two-thirds of our compute, with one-third remaining (excluding our backup DC), and relying on that one-third in the event of an emergency wasn’t something we were comfortable with. Our server pool was reconfigured to provide more compute and more resilience for our primary DC.
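To make the concern concrete, here’s the kind of headroom check we’re talking about, with entirely invented numbers (not our actual fleet): if two-thirds of the pool is on a truck and one more host fails, can what’s left still carry peak load?

```python
# Illustrative headroom check with invented numbers, not Fastmail's actual fleet:
# can the hosts still online carry peak load if one of them also fails?
def surviving_capacity(total_hosts: int, in_transit: int, failures: int,
                       per_host_capacity: float) -> float:
    """Capacity left after some hosts are in transit and some have failed."""
    return max(total_hosts - in_transit - failures, 0) * per_host_capacity

TOTAL_HOSTS = 30      # hypothetical pool size
PER_HOST = 1.0        # capacity units per host
PEAK_LOAD = 12.0      # hypothetical peak demand, in the same units

for in_transit in (10, 20):  # one-third vs two-thirds of the pool on the truck
    capacity = surviving_capacity(TOTAL_HOSTS, in_transit, failures=1,
                                  per_host_capacity=PER_HOST)
    verdict = "enough headroom" if capacity >= PEAK_LOAD else "not enough headroom"
    print(f"{in_transit} in transit + 1 failure: capacity {capacity:.0f} "
          f"vs peak {PEAK_LOAD:.0f} -> {verdict}")
```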

Finally, we brought the remainder of our servers across. This was possibly the most straightforward part of the move. Our new DC was promoted to master, and the servers were offlined, shipped, and installed.

Happy days!

But wait, there’s more!

A lot more! TL;DR more (at this point, does ‘TL’ mean ‘Too Late!’?). It wasn’t plain sailing. We had some hurdles that had to be swiftly cleared so that we could continue with our migration. Cables didn’t arrive. We ran out of certain types of SFP modules. We upgraded every NIC on every blade. And we managed to migrate our backup DC right on the heels of our primary!

I’m sure we could write another post or two about these things. But we are still happy with the outcome. Some of our tooling got a fresh look under stress, and we were able to further improve our backup and migration tools for both present and future use.

The whole global team stepped up for this, and as a result, we were successful. Our network has been revitalised for the future, and we are in a great position to grow and improve our backend to provide the excellent service our customers expect and deserve.

No emails were harmed or lost in the migration of our data centres!


Postscript by Bron: the whole team did amazing work here. While we didn’t lose any email, we did have some customer-visible downtime, unfortunately. This all happened in just a few weeks: shipping equipment from Seattle to Philly over a week first, then the crazy day where we were running fully split as we shipped half of New Jersey to Philly, then finally I was off at a conference while the remaining crew handled the move of the last parts from New Jersey to our new secondary location.

We identified three major learning points from this move:

  1. We hadn’t done enough testing of the states we were going to be in during the move. Things we believed would work turned out not to be able to handle the complete real-world load during the US daytime hourly spikes (a rough sketch of the kind of check we should have run follows this list).
  2. We didn’t actually have the capacity! We also had a Cyrus replication repair inefficiency which, along with a very inopportune IMAP server crash due to the extra load of running overloaded, caused days of scrambling to get things back into the right state. We didn’t lose any data, but we were running with lower redundancy than I wanted for longer than I wanted! We have ordered a bunch of new servers which were shipped just last week and will be up and running by the end of the year.
  3. There was no single coordinator driving the entire process. We still need to be better at designating a single responsible person. With such a senior and self-directing team it wasn’t a disaster, but we could have been even more efficient with a bit more up-front planning.
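With hindsight, the missing check from the first point is easy to sketch (the numbers below are invented, not our real traffic profile): project the hourly load you expect while running split, compare it against the capacity that will actually be online at that point, and either reschedule or add capacity if any hour comes up short.

```python
# Rough sketch with invented numbers, not Fastmail's real traffic profile:
# before the move, project the load the split configuration will see at each
# hour and compare it to the capacity that will actually be online.
ONLINE_CAPACITY_DURING_SPLIT = 10.0  # hypothetical capacity units while split

# Hypothetical hourly load profile (UTC hour -> expected load), peaking in US daytime.
HOURLY_LOAD = {hour: 5.0 for hour in range(24)}
HOURLY_LOAD.update({14: 9.0, 15: 11.0, 16: 12.5, 17: 12.0, 18: 10.5})

def hours_over_capacity(profile: dict[int, float], capacity: float) -> list[int]:
    """Return the hours where projected load exceeds the capacity online."""
    return [hour for hour, load in sorted(profile.items()) if load > capacity]

if __name__ == "__main__":
    risky = hours_over_capacity(HOURLY_LOAD, ONLINE_CAPACITY_DURING_SPLIT)
    if risky:
        print(f"Projected overload at UTC hours {risky}: reschedule the split "
              f"outside those hours or add capacity first")
    else:
        print("Split configuration has headroom at every hour")
```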

There were also some network hiccups in the weeks immediately following, but we’re now in a really stable configuration, with more options for routing traffic than we’ve ever had before, so we can work around faults faster in future.
