Dec 22: Why we use our own hardware at Fastmail

Rob Mueller, Founder & CTO

This is the twenty-second post in the Fastmail Advent 2024 series. The previous post was Dec 21: Fastmail In A Box. Check back tomorrow for another post.

Why we use our own hardware

There has recently been talk of cloud repatriation, where companies move from the cloud back to on-premises hardware, with some particularly noisy examples.

Fastmail has a long history of using our own hardware. We have over two decades of experience running and optimising our systems to use our own bare metal servers efficiently.

We get way better cost optimisation compared to moving everything to the cloud because:

  1. We understand our short, medium and long term usage patterns, requirements and growth very well. This means we can plan our hardware purchases ahead of time and don’t need the fast dynamic scaling that cloud provides.
  2. We have in-house operations experience installing, configuring and running our own hardware and networking, skills we've had to maintain and grow over the 25 years we've been doing this.
  3. We are able to use our hardware for long periods. We find our hardware provides a useful life of anywhere from 5 to 10 years, depending on what it is and when in the global technology cycle it was bought, meaning we can amortise and depreciate the cost of any hardware over many years.

Yes, that means we have to do more ourselves, including planning, choosing, buying and installing hardware, but the tradeoff has been, and we believe continues to be, significantly worth it for us.

Hardware over the years

Of course, over the 25 years we've been running Fastmail we've been through a number of hardware changes. For many years, our IMAP server storage platform was a combination of spinning rust drives and Areca RAID controllers. We tended to use faster 15k RPM SAS drives in RAID1 for our hot metadata, and 7.2k RPM SATA drives in RAID6 for our main email blob data.

In fact it was slightly more complex than this. Email blobs were written to the fast RAID1 SAS volumes on delivery, but then a separate archiving process would move them to the SATA volumes at low server activity times. Support for all of this had been added into cyrus and our tooling over the years in the form of separate “meta”, “data” and “archive” partitions.
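For a sense of what that split looks like in practice, it's driven by per-partition settings in cyrus's imapd.conf. A rough sketch follows; the paths and archive age here are made up for illustration, not our actual config:

# Hypothetical imapd.conf excerpt
metapartition-default: /mnt/meta/part1        # cyrus.index etc on fast SAS RAID1
partition-default: /mnt/data/part1            # freshly delivered email blobs (SAS RAID1)
archivepartition-default: /mnt/archive/part1  # older blobs on SATA RAID6
archive_enabled: 1
archive_days: 7                               # move blobs older than a week to archive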

Moving to NVMe SSDs

A few years ago, however, we made our biggest hardware upgrade ever. We moved all our email servers to a new 2U AMD platform with pure NVMe SSDs. The density increase (24 x 2.5" NVMe drives vs 12 x 3.5" SATA drives per 2U) and the performance increase were enormous. We found that these new servers performed even better than our initial expectations.

At the time we upgraded, however, NVMe RAID controllers weren't widely available, so we had to decide how to handle redundancy. We considered a RAID-less setup using raw SSDs on each machine with synchronous application-level replication to other machines, but the software changes required were going to be more complex than expected.

We were looking at using classic Linux mdadm RAID, but the write hole was a concern and the write cache didn’t seem well tested at the time.

We decided to have a look at ZFS and at least test it out.

Despite some of the cyrus on-disk database structures being fairly hostile to ZFS's copy-on-write semantics, the ZFS-backed test servers were still incredibly fast at all the IO we threw at them. And there were some other wins as well.

ZFS compression and tuning

When we rolled out ZFS for our email servers we also enabled transparent Zstandard compression. This has worked very well for us, saving about 40% space on all our email data.
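If you're curious what that looks like at the ZFS level, it's a single dataset property, and ZFS will report the achieved ratio afterwards. A minimal sketch, with a made-up pool/dataset name:

# Enable transparent zstd level 3 compression (dataset name is illustrative)
zfs set compression=zstd-3 tank/cyrus
# Later, check how much it's actually saving
zfs get compression,compressratio tank/cyrus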

We’ve also recently done some additional calculations to see if we could tune some of the parameters better. We sampled 1 million emails at random and calculated how many blocks would be required to store those emails uncompressed, and then with ZFS record sizes of 32k, 128k or 512k and zstd-3 or zstd-9 compression options. Although ZFS RAIDz2 seems conceptually similar to classic RAID6, the way it actually stores blocks of data is quite different and so you have to take into account volblocksize, how files are split into logical recordsize blocks, and number of drives when doing calculations.

               Emails: 1,026,000
           Raw blocks: 34,140,142
 32k & zstd-3, blocks: 23,004,447 = 32.6% saving
 32k & zstd-9, blocks: 22,721,178 = 33.4% saving
128k & zstd-3, blocks: 20,512,759 = 39.9% saving
128k & zstd-9, blocks: 20,261,445 = 40.7% saving
512k & zstd-3, blocks: 19,917,418 = 41.7% saving
512k & zstd-9, blocks: 19,666,970 = 42.4% saving

This showed that the defaults of 128k record size and zstd-3 were already pretty good. Moving to a 512k record size improved compression over 128k by a bit over 4%. Given all metadata is cached separately, this seems a worthwhile improvement with no significant downside. Moving to zstd-9 improved compression over zstd-3 by only about 2%. Even though emails are immutable and tend to be kept for a long time, the CPU cost of compressing at zstd-9 is about 4x that of zstd-3, so we've decided not to make this change.
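The resulting tuning change is again just a dataset property. Note that recordsize only affects files written after the change; existing data keeps the record size it was written with. The dataset name here is illustrative:

# Raise the maximum logical block size from the 128k default to 512k
zfs set recordsize=512k tank/cyrus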

ZFS encryption

We always enable encryption at rest on all of our drives. Previously this was done with LUKS, but with ZFS it's built in, which again reduces overall system complexity.
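As a minimal sketch of what a natively encrypted dataset looks like, assuming a raw key file; the names and key location are illustrative, not our actual setup:

# Create an encrypted dataset; children created under it inherit encryption
zfs create -o encryption=aes-256-gcm \
           -o keyformat=raw \
           -o keylocation=file:///etc/zfs/keys/cyrus.key \
           tank/cyrus
# After a reboot, load the key before mounting
zfs load-key tank/cyrus
zfs mount tank/cyrus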

Going all in on ZFS

So after the success of our initial testing, we decided to go all in on ZFS for all our large data storage needs. We've now been using ZFS for all our email servers for over 3 years and have been very happy with it. We've since moved all our database, log and backup servers to ZFS on NVMe SSDs as well, with equally good results.

SSD lifetimes

The flash memory in SSDs has a finite life and can only be written a finite number of times. SSDs employ increasingly complex wear-levelling algorithms to spread out writes and extend drive lifetime. The quoted endurance of an enterprise SSD is usually either an absolute figure of "Lifetime Writes"/"Total Bytes Written", such as 65 PBW (petabytes written), or a relative "Drive Writes Per Day" (DWPD) figure, such as 0.3, which you can convert to a lifetime figure by multiplying by the drive size and the drive's expected lifetime, often assumed to be 5 years.
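As a worked example, taking a hypothetical 7.68T drive rated at 0.3 DWPD over a 5 year warranty period:

# drive writes per day x drive size (TB) x days per year x years
awk 'BEGIN { print 0.3 * 7.68 * 365 * 5, "TB" }'
# => 4204.8 TB, i.e. roughly 4.2 PB of rated lifetime writes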

Although we could calculate IO rates for the existing HDD systems, we were making a significant number of changes in moving to the new systems. Switching to a copy-on-write filesystem like ZFS, removing the special-cased meta/data/archive partitions, and the massive latency reduction and performance improvements all change the IO pattern: operations that previously took long enough to end up batched together now complete so quickly that they generate additional, separate IO actions.

So one big unknown was how fast the SSDs would wear in our actual production environment. After several years, we now have some clear data. This is from one server chosen at random, but it is fairly consistent across the fleet of our oldest servers:

# smartctl -a /dev/nvme14
...
Percentage Used:                    4%
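For what it's worth, a quick way to eyeball this across a whole machine (assuming standard Linux NVMe device naming) is to loop over the drives:

# Print the endurance estimate for every NVMe drive in the box
for dev in /dev/nvme[0-9]*n1; do
    printf '%s: ' "$dev"
    smartctl -a "$dev" | grep 'Percentage Used'
done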

At this rate, we'll replace these drives due to increased drive sizes, or entirely new physical drive formats (such as E3.S, which finally appears to be gaining traction), long before they get close to their rated write capacity.

We’ve also anecdotally found SSDs to be much more reliable than HDDs for us. Although we’ve only ever used datacenter-class drives, HDD failures and replacements every few weeks were a regular occurrence on the old fleet of servers. Over the last 3+ years, we’ve only seen a couple of SSD failures in total across the entire upgraded fleet. This is easily less than one tenth the failure rate we used to see with HDDs.

Storage cost calculation

After converting all our email storage to NVMe SSDs, we recently looked at our data backup solution. At the time it consisted of a number of older 2U servers with 12 x 3.5" SATA drive bays, and we decided to do some cost calculations on three options:

  1. Move to cloud storage.
  2. Upgrade the HDDs in the existing servers.
  3. Upgrade to new NVMe SSD machines.

1. Cloud storage:

Looking at various providers, we compared the per TB per month price, and the corresponding yearly price for 1000TB/1PB (prices as at Dec 2024).

Some of these (e.g. Amazon) have potentially significant bandwidth fees as well.

It’s interesting seeing the spread of prices here. Some also have a bunch of weird edge cases as well. e.g. “The S3 Glacier Flexible Retrieval and S3 Glacier Deep Archive storage classes require an additional 32 KB of data per object”. Given the large retrieval time and extra overhead per-object, you’d probably want to store small incremental backups in regular S3, then when you’ve gathered enough, build a biggish object to push down to Glacier. This adds implementation complexity.

  • Pros: No limit to the amount we can store. Assuming we use an S3-compatible API, we can choose between multiple providers.
  • Cons: The implementation cost of converting our existing backup system, which assumes local POSIX files, to an S3-style object API is uncertain and possibly significant. The lowest cost options require extra careful consideration of implementation details and special limitations. Ongoing monthly cost that will only increase as the amount of data we store increases. Uncertain whether prices will go down over time, or even go up. Possible significant bandwidth costs depending on provider.

2. Upgrade HDDs

Seagate Exos 24 drives are 3.5" 24T HDDs. These would allow us to triple the storage in the existing servers. Each HDD is about $500, so upgrading one 2U machine would be about $6,000 and give around 220T of storage.

  • Pros: Reuses existing hardware we already have. Upgrades can be done a machine at a time. Fairly low price.
  • Cons: Will the existing units handle 24T drives? What does the rebuild time on a drive failure look like? It’s almost a day for 8T drives already, so possibly nearly a week for a failed 24T drive. Is there enough IO performance to handle daily backups at capacity?

3. Upgrade to new hardware

As we know, SSDs are denser (24 x 2.5" per 2U vs 12 x 3.5" per 2U), more reliable, and now much higher capacity, up to 61T per 2.5" drive. A single 2U server with 24 x 61T drives arranged as 2 x 12-drive RAIDz2 vdevs gives about 1220T of usable space. Each drive is about $7k right now (prices fluctuate), so all up 24 x $7k = $168k + ~$20k for the server, or roughly $190k one-time cost for over 1000T of storage.
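As a quick sanity check on those numbers (drive count, drive size and prices as above):

# Usable space: two 12-drive RAIDz2 vdevs, each losing 2 drives to parity
awk 'BEGIN { print 2 * (12 - 2) * 61, "TB usable" }'   # => 1220 TB usable
# Rough one-time cost: 24 drives at ~$7k each plus ~$20k of server
awk 'BEGIN { print 24 * 7000 + 20000 }'                # => 188000, call it ~$190k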

  • Pros: Much higher sequential and random IO than HDDs will ever have. Price < 1 year of standard S3 storage. Internal to our WAN, no bandwidth costs and very low latency. No new development required, existing backup system will just work. Consolidate on single 2U platform for all storage (cyrus, db, backups) and SSD for all storage. Significant space and power savings over existing HDD based servers
  • Cons: Greater up front cost. Still need to predict and buy more servers as backups grow.

One thing you don’t see in this calculation is datacenter space, power, cooling, etc. The reason is that compared to the amortised yearly cost of a storage server like this, these are actually reasonably minimal these days, on the order of $3000/2U/year. Calculating person time is harder. We have a lot of home built automation systems that mean installing and running one more server has minimal marginal cost.

Result

We ended up going with the new 2U servers option:

(Image: NVMe IMAP servers)

  • The 2U AMD NVMe platform with ZFS is a platform we have experience with already
  • SSDs are much more reliable and much higher IO compared to HDDs
  • No uncertainty around super large HDDs, RAID controllers, rebuild times, shuffling data around, etc.
  • Significant space and power saving over existing HDD based servers
  • No new development required, can use existing backup system and code
  • Long expected hardware lifetime, controlled upfront cost, can depreciate hardware cost

So far this has worked out very well. The machines have bonded 25Gbps network links, and when filling them from scratch we were able to saturate those links, streaming around 5 gigabytes/second of data from our IMAP servers, compressing and writing it all down to a zstd-3 compressed RAIDz2 ZFS dataset.

Conclusion

Running your own hardware might not be for everyone and has distinct tradeoffs. But when you have the experience and the knowledge of how you expect to scale, the cost improvements can be significant.
