Dec 12: FastMail's MySQL Replication: Multi-Master, Fault Tolerance, Performance. Pick Any Three


Alfie John

This blog post is part of the FastMail 2014 Advent Calendar.

The previous post on 11th December was from our support team. The following post on 13th December is about hosting your own domain with us.

Technical level: medium

For those who prefer watching videos over reading, here’s a talk I gave at Melbourne Perl Mongers in 2011 on FastMail’s open-sourced, custom MySQL replication system.

Most online services store their data within a database, and thanks to the culture of Open Source, these days there are plenty of robust RDBMSs to choose from. At FastMail, we use a Percona build of MySQL 5.1 because of their customised tooling and performance patches (if you haven't heard of Percona, I recommend trying them out). However, even though MySQL 5.1 is a great platform to work with, we do something different here - we don't use its built-in replication system, and instead opted to roll our own.

First, what’s the problem with running an online service on a single database? The most important reason against it is the lack of redundancy. If your database catches on fire - or, as happens more often, the oom-killer zaps your database server because it’s usually the biggest memory hog on the machine - none of your applications can get at the data they need, and your service is taken offline. With multiple databases, when a single database server goes down, your applications still have the others to choose from and connect to.

Another reason against using a single database for your online service is degraded performance - as more and more applications connect and perform work, the load on your database server climbs. Once the server can’t take the requested load any longer, you’re left with query timeouts and even refused connections, which again takes your service offline. By using multiple database servers, you can tell your applications to spread their load across the database farm, reducing the work any single database server has to cope with while giving you a performance boost across the board for free.
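
To make that concrete, here’s a rough sketch (made-up hostnames and credentials, and not our actual code) of an application picking a database server from a small farm, skipping over any server that happens to be down:

    use strict;
    use warnings;
    use DBI;
    use List::Util qw(shuffle);

    # Made-up hostnames: any server in the farm can take the connection.
    my @servers = qw(db1.internal db2.internal db3.internal);

    sub connect_to_farm {
        # Try the servers in a random order to spread the load, and skip
        # any that are down so a single dead server doesn't take us offline.
        for my $host (shuffle @servers) {
            my $dbh = DBI->connect(
                "DBI:mysql:database=app;host=$host",
                'app', 'app-password',
                { RaiseError => 0, PrintError => 0 },
            );
            next unless $dbh;    # this one is down; try the next
            $dbh->{RaiseError} = 1;
            return $dbh;
        }
        die "no database server in the farm is reachable\n";
    }

    my $dbh = connect_to_farm();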

Clearly the best practice is to have multiple databases, but why re-invent the wheel? Mention replication to a veteran database admin and then prepare yourself a nice cup of hot chocolate while they tell you horror stories from the past as if you’re sitting around a campfire. We needed to re-invent the wheel because there are a few fundamental issues with MySQL’s built-in replication system.

When you’re working with multiple databases and problems arise in your replication network, your service can grind to a halt and possibly go offline until every one of your database servers is back up and replicating happily again. We never wanted to be put in a situation like this, so we wanted the database replication network itself to be redundant. By design, MySQL’s built-in replication system couldn’t give us that.

What we wanted was a database replication network where every database server could be a “master”, all at the same time. In other words, all database servers could be read from and written to by all connecting applications. Each time an update occurred on any master, the query would then be replicated to all the other masters. MySQL’s built-in replication system allows for this, but it comes at a very high cost - it is a nightmare to manage if a single master goes down.

To achieve master-master replication with more than two masters, MySQL’s built-in replication system needs the servers to be configured in a ring network topology. Every time an update occurs on a master, it executes the query locally, then passes it off to the next server in the ring, which applies the query to its local database, and so on - much like participants playing pass-the-parcel. This works nicely and is in place at many companies. The nightmares begin, however, when a single database server goes down, breaking the ring. Since the path of communication is broken, queries stop travelling around the replication network and the data on every database server begins to go stale.
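
For the curious, a ring like this is built by pointing each server at the previous one, with binary logging and log-slave-updates enabled so that statements a server applies from its master are written to its own binlog and passed along. Here’s a rough sketch of wiring up a three-node ring (made-up hostnames and credentials; a real setup would also supply MASTER_LOG_FILE and MASTER_LOG_POS):

    use strict;
    use warnings;
    use DBI;

    # Made-up hostnames/credentials. Each server also needs binary logging
    # enabled, a unique server-id and log-slave-updates=1 in its my.cnf so
    # that statements it applies from its master are written to its own
    # binlog and passed on to the next server in the ring.
    my @ring = qw(db1.internal db2.internal db3.internal);

    for my $i (0 .. $#ring) {
        my $host   = $ring[$i];
        my $master = $ring[ $i - 1 ];   # $ring[-1] wraps around, closing the ring
        my $dbh    = DBI->connect("DBI:mysql:database=mysql;host=$host",
                                  'admin', 'admin-password', { RaiseError => 1 });
        my $quoted = $dbh->quote($master);
        # On a fresh server; STOP SLAVE first if it's already replicating.
        $dbh->do("CHANGE MASTER TO MASTER_HOST=$quoted,
                  MASTER_USER='repl', MASTER_PASSWORD='repl-password'");
        $dbh->do('START SLAVE');
        $dbh->disconnect;
    }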

Instead, our MySQL replication system (MySQL::Replication) is based on a peer-to-peer design. Each database server runs its own MySQL::Replication daemon, which serves out its local database updates. Each server then runs a separate MySQL::Replication client for every master it wants a feed from (think of a mesh network topology). Each time a query is executed on a master, the connected MySQL::Replication clients take a copy and apply it locally. The advantage here is that when a database server goes down, only that single feed is broken. All other communication paths continue as normal, and query flow across the database replication network continues as if nothing ever happened. And once the downed server comes back online, the MySQL::Replication clients notice and continue where they left off. Win-win.
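
To give a feel for the layout, here’s a sketch of the shape of it only - made-up hostnames, with replicate_from() standing in for a real client rather than MySQL::Replication’s actual interface - showing one client process per upstream peer, so losing a peer kills only that one feed:

    use strict;
    use warnings;

    # Made-up hostnames; replicate_from() is a stand-in for a real
    # replication client, not MySQL::Replication's actual interface.
    my $self  = 'db1.internal';
    my @peers = grep { $_ ne $self } qw(db1.internal db2.internal db3.internal);

    sub replicate_from {
        my ($peer) = @_;
        # A real client would stream $peer's updates, apply them locally,
        # record how far it has got, and reconnect after failures.
        while (1) { sleep 60 }
    }

    # One client process per upstream peer.
    my %feed;
    for my $peer (@peers) {
        my $pid = fork() // die "fork: $!";
        if ($pid == 0) { replicate_from($peer); exit 0 }
        $feed{$pid} = $peer;
    }

    # If one feed dies because its peer went away, restart just that client;
    # the feeds from every other peer keep flowing untouched.
    while ((my $pid = waitpid(-1, 0)) > 0) {
        my $peer = delete $feed{$pid} or next;
        my $new  = fork() // die "fork: $!";
        if ($new == 0) { replicate_from($peer); exit 0 }
        $feed{$new} = $peer;
    }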

Another issue with MySQL’s built-in replication system is that a slave’s position relative to its master is recorded in a plain text file called relay-log.info, which is not atomically synced to disk. If a slave dies and comes back online, its files may be in an inconsistent state. If the InnoDB tablespace was flushed to disk before the crash but relay-log.info wasn’t, the slave will restart replication from an incorrect position and so will replay queries, leaving your data in an invalid state.

MySQL::Replication clients store their position relative to their masters inside the InnoDB tablespace itself (sounds recursive, but it isn’t, since there is no binlogging of MySQL::Replication queries). Because the position updates are done within the same transaction that the replicated queries are executed in, writes are completely atomic. If a slave dies and comes back online, we are still in a consistent state, since the transaction was either committed or it will be rolled back. It’s a nice place to be in.
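
Here’s a rough DBI-style sketch of the idea (not the module’s actual code, and the replication_position table is hypothetical): the replicated query and the position update either commit together or roll back together:

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('DBI:mysql:database=app;host=localhost',
                           'repl', 'repl-password', { RaiseError => 1 });

    # $peer, $query and $new_pos would come from that peer's replication feed.
    sub apply_replicated_query {
        my ($peer, $query, $new_pos) = @_;

        $dbh->begin_work;
        eval {
            # The replicated statement itself...
            $dbh->do($query);
            # ...and the new position, in the same InnoDB transaction
            # (replication_position is a hypothetical InnoDB table).
            $dbh->do('UPDATE replication_position SET pos = ? WHERE peer = ?',
                     undef, $new_pos, $peer);
            $dbh->commit;
            1;
        } or do {
            my $error = $@;
            $dbh->rollback;    # neither the query nor the position survives
            die "replication apply failed: $error";
        };
    }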

MySQL::Replication - multi-master, peer-to-peer, fault tolerant, performant and without the headaches. It can be found here.
