November 1, 2006

All users now on replicated servers

Historical

Rob Mueller

Founder & CTO

All users’ email is now on replicated servers. This means that every email delivered or deleted, and every email action performed, is replicated within a second to a completely separate server with a completely separate copy of all users’ emails.

We now have at least three levels of redundancy and three copies of every email, and all of those copies are themselves on redundant RAID storage.

All users now have their email stored on a system with RAID disks
and all servers and RAID arrays have dual power supplies.

This means a single drive or power supply failure should cause no
interruption to service at all; we just replace the drive or power
supply while the system is live and online. Hard drives and power
supplies are the most commonly failing hardware components in computer
systems.

All users now have their email replicated to an identical replica
system (RAID drives, dual power supplies, etc). Each system is
completely separate, with its own operating system, filesystem, drives,
power, connections, etc. The replication is performed at the
semantic email level, not at the filesystem level, so filesystem
corruption on the source server will not be replicated. This means
that if there is disk or filesystem corruption on a single machine, we
can just switch to the replica (failover) and it won’t cause a
multi-day outage.
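To illustrate why replicating at the email level does not carry filesystem corruption across, here is a minimal Python sketch. It is a toy model only, not our actual replication code: the `MailStore` and `ReplicatedStore` classes and the operation tuples are hypothetical names made up for this example. The point it shows is that the replica receives the operations ("deliver", "delete", "flag"), never the master's on-disk bytes.

```python
from dataclasses import dataclass, field


@dataclass
class MailStore:
    """Toy mail store (hypothetical): maps message id to body and flags."""
    messages: dict = field(default_factory=dict)

    def apply(self, op):
        """Apply one email-level operation, given as a tuple."""
        kind = op[0]
        if kind == "deliver":
            _, msg_id, body = op
            self.messages[msg_id] = {"body": body, "flags": set()}
        elif kind == "delete":
            _, msg_id = op
            self.messages.pop(msg_id, None)
        elif kind == "flag":
            _, msg_id, flag = op
            self.messages[msg_id]["flags"].add(flag)
        else:
            raise ValueError(f"unknown operation: {kind}")


class ReplicatedStore:
    """Master that forwards each operation, not its stored bytes, to a replica."""

    def __init__(self):
        self.master = MailStore()
        self.replica = MailStore()

    def perform(self, op):
        self.master.apply(op)
        # Assumption: a real system would send the operation over the network.
        self.replica.apply(op)


store = ReplicatedStore()
store.perform(("deliver", 1, "Hello"))
store.perform(("deliver", 2, "Spam"))
store.perform(("flag", 1, "seen"))
store.perform(("delete", 2))

# Simulate disk corruption on the master: damage its stored data directly,
# bypassing the operation stream entirely.
store.master.messages[1]["body"] = "\x00\x00garbage"

print(store.replica.messages[1]["body"])  # the replica still holds "Hello"
```

A filesystem-level or block-level mirror would have copied the damaged bytes to the second machine as well, which is the failure mode this design avoids.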

The failover is manual, not automatic. Depending on the
actual problem that occurs and our ability to analyse and respond,
it should take on the order of minutes to an hour to fail over to a
replica if we decide it’s needed. In some cases, we may decide it’s
easier and safer to reboot a frozen or crashed machine than to fail
over to the replica, so outages of up to an hour are still possible.
If we believe an outage is going to last longer than that, we will
most likely fail over to the replica.

We can also use the failover ability to do maintenance on machines
more easily. If we decide a machine needs servicing (kernel upgrade,
hardware change, etc), we can just fail over to a replica machine
safely, do the work, start the machine up again, wait for
replication to catch up, then fail back to the original machine. For
users, the only visible downtime will be the controlled failover
portion, which is usually on the order of a minute.

All users have their email store backed up incrementally each night
to a separate system and RAID array. Backups of an email are kept
for 1 week after the email is deleted, to allow restoring it in case
of accident. In an emergency, if both a master and replica
server should fail catastrophically, we can still perform a restore
from this backup.
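The one-week retention rule for deleted email can be expressed as a small check. This is a sketch under stated assumptions, not our backup system's code: the function name `is_restorable` is hypothetical, and treating the boundary at exactly seven days as still restorable is an assumption the rule above does not settle.

```python
from datetime import datetime, timedelta

# From the post: backups are kept for 1 week after the email is deleted.
RETENTION = timedelta(weeks=1)


def is_restorable(deleted_at, now):
    """Return True if a backed-up email can still be restored.

    deleted_at is None for an email that has not been deleted.
    Assumption: the 1-week boundary itself counts as restorable.
    """
    if deleted_at is None:
        return True
    return now - deleted_at <= RETENTION


now = datetime(2006, 11, 1, 12, 0)
print(is_restorable(None, now))                       # True: never deleted
print(is_restorable(now - timedelta(days=3), now))    # True: deleted 3 days ago
print(is_restorable(now - timedelta(days=10), now))   # False: past the 1-week window
```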

We believe that this will give us the highest possible reliability while still allowing us to continue to grow our user base.