December 19, 2024

Dec 19: Building offline: mail storage

Technical

Neil Jenkins

Chief Product Officer

This is the nineteenth post in the Fastmail Advent 2024 series. The previous post was Dec 18: Building offline: syncing changes back to the server. The next post is Dec 20: How Fastmail uses Fastmail!.

Yesterday, we looked at how we store changes you make offline so we can accurately and efficiently sync them back to the server when you come online. Today, we’ll discuss why email is special, and what else we do to make this super fast, with support for full-text search offline.

Why offline email is hard

As discussed earlier, because we use JMAP for all of our APIs, we can implement generic offline support once and have it work for everything (currently 56 data types and counting in our app!). However, mail is special, and the reason it’s special is purely the volume of data.

Most web apps severely underestimate how small their data is. In almost all cases, it is more efficient and much faster to load it all into memory and do a linear filter pass whenever you need to query it. This is the difference between responsive as-you-type autocomplete and frustrating loading spinners on each keystroke. Even for users with 10,000 contacts this is only a few megabytes of data, which is perfectly cacheable.
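
As a rough illustration of the linear filter pass, here is a minimal sketch of in-memory autocomplete. The contact shape ({ name, email }) and the autocompleteContacts function are hypothetical, made up for this example rather than taken from the Fastmail codebase.

```javascript
// Hypothetical contact shape: { name, email }. Illustrative only.
function autocompleteContacts(contacts, typed, limit = 10) {
    const needle = typed.trim().toLowerCase();
    if (!needle) {
        return [];
    }
    const matches = [];
    // One linear pass over every cached contact; no index needed.
    for (const contact of contacts) {
        if (
            contact.name.toLowerCase().includes(needle) ||
            contact.email.toLowerCase().includes(needle)
        ) {
            matches.push(contact);
            if (matches.length === limit) {
                break;
            }
        }
    }
    return matches;
}

// 10,000 synthetic contacts, all held in memory.
const contacts = Array.from({ length: 10000 }, (_, i) => ({
    name: `Person ${i}`,
    email: `person${i}@example.com`,
}));
const results = autocompleteContacts(contacts, "person99", 5);
```

Because the whole data set is already in memory, each keystroke costs one pass over the array and no network round trip.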

Email is different though. We have users with millions of messages. Even with attachments handled separately in JMAP, each message could have hundreds of kilobytes of HTML as the body. But we expect opening a mailbox to load a listing pretty much instantly, and searches to be fast too. To make this work, we have to add a number of tricks to our standard offline approach.

Splitting the data

The first trick is to split the data into two separate object stores:

EmailMetadata: this stores just the data that’s not parsed from the email content, like the id, thread id, keywords it has, and mailboxes it’s in. This keeps it small, but crucially also contains all the mutable data. This is treated like our standard JMAP object store for a data type.
EmailContent: this stores the email content, such as who it was sent from/to, the subject, the body, and the list of attachments (but not the attachment data itself).
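
A minimal sketch of that split, assuming an email object whose property names follow the JMAP Email object (id, threadId, keywords, mailboxIds). The splitEmail function and the exact set of fields on each side are assumptions for illustration, not the real schema.

```javascript
// Separates the small, mutable metadata from the bulky, immutable content.
// Which fields land on which side is an assumption based on the text above.
function splitEmail(email) {
    const { id, threadId, keywords, mailboxIds, ...content } = email;
    return {
        metadata: { id, threadId, keywords, mailboxIds },
        // The id is repeated so the content record can be looked up by it.
        content: { id, ...content },
    };
}

const { metadata, content } = splitEmail({
    id: "e1",
    threadId: "t1",
    keywords: { $seen: true },
    mailboxIds: { inbox: true },
    from: [{ email: "alice@example.com" }],
    subject: "Hello",
    bodyHtml: "<p>Hi there</p>",
});
```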

Due to the volume of data, we can’t load everything at once. We page in the data in stages instead:

We fetch a list of just the ids and create placeholder entries in the EmailMetadata object store.
We page in the metadata and basic headers (like to/from/subject) for all messages in batches. This gives us everything we need to show the listing for any folder or label.
We page in the body for pinned and recent messages, or everything if the user has selected this option in settings, again in batches.
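
The stages above could be sketched roughly as follows. All names here (inBatches, makePlaceholders, idsNeedingBody, isPinned, receivedAt, downloadEverything, recentSince) are hypothetical, and the real network fetches are left out so the example stays self-contained.

```javascript
// Yields consecutive slices of at most batchSize items.
function* inBatches(items, batchSize) {
    for (let i = 0; i < items.length; i += batchSize) {
        yield items.slice(i, i + batchSize);
    }
}

// Stage 1: create placeholder entries from a bare list of ids.
function makePlaceholders(ids) {
    return ids.map((id) => ({ id, isPlaceholder: true }));
}

// Stage 3: once metadata is known, choose which bodies to download.
// ISO 8601 UTC timestamps in the same format compare correctly as strings.
function idsNeedingBody(metadata, { downloadEverything, recentSince }) {
    return metadata
        .filter(
            (m) =>
                downloadEverything || m.isPinned || m.receivedAt >= recentSince,
        )
        .map((m) => m.id);
}

const ids = ["a", "b", "c", "d", "e"];
const placeholders = makePlaceholders(ids);
// Stage 2 would fetch metadata and basic headers for each batch in turn.
const metadataBatches = [...inBatches(ids, 2)];
const metadata = [
    { id: "a", isPinned: false, receivedAt: "2024-12-18T09:00:00Z" },
    { id: "b", isPinned: true, receivedAt: "2021-03-01T12:00:00Z" },
    { id: "c", isPinned: false, receivedAt: "2020-01-01T00:00:00Z" },
];
const bodyIds = idsNeedingBody(metadata, {
    downloadEverything: false,
    recentSince: "2024-12-01T00:00:00Z",
});
```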

This split is useful, because for most queries we can get away with just loading the metadata into memory, not the content. This is a big saving in time and memory when deserialising the objects from the underlying datastore.

Efficient mailbox querying

A linear pass through all the metadata is surprisingly tractable, even for large mailboxes. However, it’s slower than we want for common queries (like opening your inbox). This is where we introduce a couple of extra custom indexes: separate object stores we are careful to update in lockstep with any changes to our data.

The first of these is EmailMailboxes. This stores an entry for each addition or removal of a message from a folder/label, allowing us to both very efficiently compute the list of messages/conversations in a particular mailbox, and also calculate a delta update to the query when making changes.

The key for this object store is:

[MAILBOX_ID, REMOVED_MODSEQ, ADDED_MODSEQ];

The values look like:

[EMAIL_ID, THREAD_ID, DATE, IS_UNREAD];

Whenever a message is added to a mailbox, a new entry is created. ADDED_MODSEQ is the current “updated” modseq of the message, and REMOVED_MODSEQ is 0.

If the message is removed from the mailbox, the old entry is deleted, and a new one added with the same ADDED_MODSEQ, but REMOVED_MODSEQ set to the new “updated” modseq of the message.

From this, we can quickly get the list of current messages in a particular mailbox by doing a range query for entries with keys that start with [MAILBOX_ID, 0]. The values include the date and thread id, allowing us to do the most common sort, and remove duplicates for the same thread id, without having to even fetch the metadata objects for the emails.
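
To make the index mechanics concrete, here is a small in-memory sketch. A plain array stands in for the IndexedDB object store, and a filter stands in for the range query; the class, its method names, and the tuple position constants are all invented for this example.

```javascript
// Positions within a value tuple [EMAIL_ID, THREAD_ID, DATE, IS_UNREAD].
// These constants are local to this sketch.
const EMAIL_ID = 0;
const THREAD_ID = 1;
const DATE = 2;

class EmailMailboxesIndex {
    constructor() {
        this.entries = []; // stand-in for the object store
    }
    // Message added to a mailbox: REMOVED_MODSEQ is 0.
    add(mailboxId, email, modseq) {
        this.entries.push({
            key: [mailboxId, 0, modseq],
            value: [email.id, email.threadId, email.date, email.isUnread],
        });
    }
    // Message removed: delete the old entry, then add one with the same
    // ADDED_MODSEQ and REMOVED_MODSEQ set to the new modseq.
    remove(mailboxId, emailId, modseq) {
        const i = this.entries.findIndex(
            ({ key, value }) =>
                key[0] === mailboxId &&
                key[1] === 0 &&
                value[EMAIL_ID] === emailId,
        );
        if (i === -1) return;
        const [old] = this.entries.splice(i, 1);
        this.entries.push({
            key: [mailboxId, modseq, old.key[2]],
            value: old.value,
        });
    }
    // Equivalent of the range query on keys starting [mailboxId, 0].
    current(mailboxId) {
        return this.entries
            .filter(({ key }) => key[0] === mailboxId && key[1] === 0)
            .map(({ value }) => value);
    }
}

// Newest first, keeping only the first message seen for each thread.
function listExemplars(index, mailboxId) {
    const seenThreads = new Set();
    return index
        .current(mailboxId)
        .sort((a, b) => b[DATE] - a[DATE])
        .filter((value) => {
            if (seenThreads.has(value[THREAD_ID])) return false;
            seenThreads.add(value[THREAD_ID]);
            return true;
        })
        .map((value) => value[EMAIL_ID]);
}

const index = new EmailMailboxesIndex();
index.add("inbox", { id: "e1", threadId: "t1", date: 100, isUnread: false }, 1);
index.add("inbox", { id: "e2", threadId: "t1", date: 200, isUnread: true }, 2);
index.add("inbox", { id: "e3", threadId: "t2", date: 150, isUnread: false }, 3);
index.add("inbox", { id: "e4", threadId: "t3", date: 50, isUnread: false }, 4);
index.remove("inbox", "e3", 5);
const listed = listExemplars(index, "inbox");
```

Note that the listing never touches the email metadata itself; everything it needs is in the index values.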

Delta query updates

JMAP has a way for a client to ask for what’s changed in a query. This allows it to more efficiently update its local store and uses less bandwidth. With the EmailMailboxes index, we can also implement this. First we fetch the entries for the current messages as before, but then we also fetch the entries for messages that have been removed since our last state (this is a range query between [MAILBOX_ID, sinceModSeq + 1] and [MAILBOX_ID, max_int]). We sort these entries together according to the sort order the user has requested, normally date descending:

mailboxRecords.sort(
    (a, b) =>
        b[DATE] - a[DATE] ||
        (a[EMAIL_ID] < b[EMAIL_ID] ? 1 : a[EMAIL_ID] > b[EMAIL_ID] ? -1 : 0) ||
        a[ADDED_MODSEQ] - b[ADDED_MODSEQ],
);

Then we can iterate through to calculate what has been added or removed from the query, like so. (“Exemplar” is our term for the email that’s representing a thread when the “collapseThreads” argument is true.)

let index = -1;
const seenExemplar = collapseThreads ? new Set() : null;
const seenOldExemplar = collapseThreads ? new Set() : null;
let uptoHasBeenFound = false;
let total = 0;
const added = [];
const removed = [];
for (const record of mailboxRecords) {
    const isDeleted = !!record[REMOVED_MODSEQ];
    // Created and deleted after our previous state? Ignore.
    const isNew = record[ADDED_MODSEQ] > sinceModSeq;
    if (isNew && isDeleted) {
        continue;
    }

    // Is this message the current exemplar?
    let isNewExemplar = false;
    let isOldExemplar = false;
    const emailId = record[EMAIL_ID];
    const threadId = record[THREAD_ID];
    if (!isDeleted && (!collapseThreads || !seenExemplar.has(threadId))) {
        isNewExemplar = true;
        index += 1;
        total += 1;
        if (collapseThreads) {
            seenExemplar.add(threadId);
        }
    }
    // Was this message an old exemplar?
    // 1. Must not have been added to mailbox after the client's state
    // 2. Must not have been removed from mailbox before the client's state
    // 3. Must not have already found the old exemplar.
    if (!isNew && (!collapseThreads || !seenOldExemplar.has(threadId))) {
        isOldExemplar = true;
        if (collapseThreads) {
            seenOldExemplar.add(threadId);
        }
    }

    if (isOldExemplar && !isNewExemplar) {
        removed.push(emailId);
    } else if (!isOldExemplar && isNewExemplar) {
        // If the message has been moved out and back in again,
        // we'll have separate mailbox records for added/removed,
        // so we won't detect that it's both the old and new exemplar;
        // check for that here.
        const removedIndex = isMutableSort ? -1 : removed.indexOf(emailId);
        if (removedIndex > -1) {
            removed.splice(removedIndex, 1);
        } else {
            added.push({
                index,
                id: emailId,
            });
        }
    }

    // Special case for mutable sorts (based on isFlagged/isUnread)
    if (isMutableSort && isOldExemplar && isNewExemplar) {
        // Has the isUnread/isFlagged status of the message/thread
        // (as appropriate) possibly changed since the client's state?
        // If so, we need to remove the exemplar from the client view
        // and add it back in at the correct position.
        const mayHaveMoved = collapseThreads
            ? threadChanged.has(threadId)
            : emailChanged.has(emailId);
        if (mayHaveMoved) {
            removed.push(emailId);
            added.push({
                index,
                id: emailId,
            });
        }
    }
    // If this is the last message the client cares about, we can stop
    // here and just return what we've calculated so far. We already
    // know the total count for this message list as we keep it
    // precalculated and cached in the Mailbox object.
    // However, if the sort is mutable we can't break early, as
    // messages may have moved from the region we care about to lower
    // down the list.
    if (!isMutableSort && !isNew && emailId === upToId) {
        uptoHasBeenFound = true;
        break;
    }
}

Mail search

Fastmail supports an extremely powerful set of search operators, allowing for fast, precise searching. We support almost all of it offline, with a few caveats discussed below.

To make full-text search work and be performant, we need to build another index. If you have hundreds of thousands of messages, it would be unusably slow to scan through all of them looking for a word, phrase or email address.

Our index is stored in another IndexedDB object store called EmailSearch. The key for each entry is [token, emailId]. The token is usually a word or other sequence of letters and numbers extracted from the email. We also have special token variations to represent a list-id or email addresses found in the headers. We create an entry in EmailSearch for each such token we find in the email. The value encodes where the token was found (e.g. in the To header, or the message body), and the index(es) of the token so we can do phrase searches.
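
A minimal sketch of what such entries might look like, and how stored positions allow a phrase check. The value layout (an object mapping a field name to an array of token positions) and both function names are assumptions; the text only says the value encodes where the token was found and at which index(es).

```javascript
// Builds { key: [token, emailId], value } entries for one email.
// `fields` maps a field name to its already-tokenised text.
function buildSearchEntries(emailId, fields) {
    const byToken = new Map();
    for (const [field, tokens] of Object.entries(fields)) {
        tokens.forEach((token, position) => {
            if (!byToken.has(token)) {
                byToken.set(token, {});
            }
            const where = byToken.get(token);
            if (!where[field]) {
                where[field] = [];
            }
            where[field].push(position);
        });
    }
    return [...byToken].map(([token, where]) => ({
        key: [token, emailId],
        value: where,
    }));
}

// True if the tokens appear consecutively, in order, within `field`.
function hasPhrase(entries, field, phraseTokens) {
    const positionsOf = (token) =>
        entries.find((e) => e.key[0] === token)?.value[field] ?? [];
    return positionsOf(phraseTokens[0]).some((start) =>
        phraseTokens.every((token, offset) =>
            positionsOf(token).includes(start + offset),
        ),
    );
}

const entries = buildSearchEntries("e1", {
    subject: ["very", "important", "news"],
    body: ["this", "is", "important", "very"],
});
const inSubject = hasPhrase(entries, "subject", ["very", "important"]);
const inBody = hasPhrase(entries, "body", ["very", "important"]);
```

Here the phrase matches in the subject but not in the body, where the same two words appear in the opposite order.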

We decided to index the content on the device, rather than download the indexes from the server. This ensures our search index is completely in sync with the cached messages you have on your device, and means we can index, and make searchable, messages and memos you write while you are offline.

However, this does mean the offline search works a little differently to our server-based search, so may return slightly different results (although we think both will do a great job in most cases). In particular:

Our offline search doesn’t index any text inside attachments. When online you can search for content in attached PDFs, spreadsheets, and other documents.
Our offline search doesn’t do stemming. Stemming tries to reduce a word to its common root, so if you search in English for bus you would also match emails containing buses, but not business. Stemming requires language analysis of the email content and custom stemming algorithms for each language, and we decided the extra complexity and code download size was not currently worth it for our offline search. Instead, our offline search does prefix matching by default, so bus will still match buses but also business. Of course, if you wrap the term in quotes (like "bus") it will only look for exact matches, just like with server-based search.
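
The difference between prefix and exact matching falls out naturally from keys sorted by token. The sketch below scans a sorted array of [token, emailId] keys; searchToken and its exact option are hypothetical names, and a real implementation would use a range query on the object store rather than a loop.

```javascript
// `sortedKeys` must be sorted by token, as an ordered store returns them.
function searchToken(sortedKeys, term, { exact = false } = {}) {
    const ids = new Set();
    for (const [token, emailId] of sortedKeys) {
        if (token < term) {
            continue; // before the start of the range
        }
        const isMatch = exact ? token === term : token.startsWith(term);
        if (!isMatch) {
            break; // keys are sorted, so no later key can match
        }
        ids.add(emailId);
    }
    return [...ids];
}

// Already in sorted order: "bus" < "buses" < "business" < "cat".
const keys = [
    ["bus", "e1"],
    ["buses", "e2"],
    ["business", "e3"],
    ["cat", "e4"],
];
const prefixMatches = searchToken(keys, "bus");
const exactMatches = searchToken(keys, "bus", { exact: true });
```

The prefix search returns all three "bus..." emails, while the exact search returns only the first.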

And of course, the search index will only contain messages you have downloaded for offline use, which might not be everything in your account. We therefore try to do a search on the server first and only fall back to the local search if you are offline.

Search tokenisation

To create our index we have to be able to extract the tokens from a sequence of text. We have users around the world, so we knew we had to handle multilingual text and scripts. In the end, we settled on a simple but effective tokenisation algorithm:

We normalise the string into Unicode NFKD normal form. This will decompose diacritics to make it easy to strip them, and replace variants of letters and numbers (such as typographic ligatures, or subscript numbers) with the baseline equivalent.
We divide the string into segments according to the Unicode text segmentation word boundary algorithm.
For each segment, we apply the full Unicode case folding substitutions (for example, this will replace uppercase letters with lowercase for Latin text), then we strip every code point that’s not categorised by Unicode as a number, letter, joining punctuation, or emoji.

If we have anything left, that’s our token. So to give an example, supposing we had the text:

The café is über cheap — only $3.60 a ☕️!!

We would end up with the following tokens:

the
cafe
is
uber
cheap
only
360
a
☕️
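
The three steps can be approximated with standard JavaScript APIs, as in the sketch below. It is an approximation under stated assumptions: toLowerCase() stands in for full Unicode case folding (which JavaScript does not expose directly), "joining punctuation" is read as the Unicode connector punctuation category, and emoji are matched with the Extended_Pictographic property plus the joiner and variation selector code points. The tokenise function is not the actual Fastmail implementation.

```javascript
// Code points to keep: letters, numbers, connector punctuation, emoji,
// plus the zero-width joiner and variation selector used inside emoji.
const STRIP = /[^\p{L}\p{N}\p{Pc}\p{Extended_Pictographic}\u200D\uFE0F]/gu;
// A token must contain at least one real letter, number or emoji.
const HAS_CONTENT = /[\p{L}\p{N}\p{Pc}\p{Extended_Pictographic}]/u;

function tokenise(text) {
    // Step 2: Unicode word boundary segmentation.
    const segmenter = new Intl.Segmenter(undefined, { granularity: "word" });
    const tokens = [];
    // Step 1: NFKD splits "é" into "e" plus a combining accent, which
    // the strip below then removes.
    for (const { segment } of segmenter.segment(text.normalize("NFKD"))) {
        // Step 3: approximate case folding, then strip everything else.
        const token = segment.toLowerCase().replace(STRIP, "");
        if (HAS_CONTENT.test(token)) {
            tokens.push(token);
        }
    }
    return tokens;
}

// The example sentence from above (\u2014 is the dash).
const tokens = tokenise("The café is über cheap \u2014 only $3.60 a ☕️!!");
```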

Wrapping it up

We now have the indexes we need for fast, precise search. There’s still a lot of work involved in putting it all together though! When you search for something complex like in:inbox from:@example.com (is:pinned OR "very important"), we analyse the query to work out which indexes to use and efficiently combine them to compute the results. The speed will depend on how much mail you have—and how fast your device is!—but we believe it lives up to the Fastmail promise of great search everywhere.
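
As a simplified picture of the combining step, the sketch below evaluates an already-parsed query tree by intersecting and unioning sets of email ids. The tree shape, the evaluate function, and the fakeIndexes lookup table are all invented for illustration; the real query planner chooses indexes and orders the work far more carefully than this.

```javascript
function intersect(a, b) {
    return new Set([...a].filter((id) => b.has(id)));
}

function union(a, b) {
    return new Set([...a, ...b]);
}

// Recursively evaluates a query tree to a Set of matching email ids.
// `lookup` resolves a single term to a Set using whichever index applies.
function evaluate(node, lookup) {
    switch (node.type) {
        case "term":
            return lookup(node);
        case "and":
            return node.children
                .map((child) => evaluate(child, lookup))
                .reduce((a, b) => intersect(a, b));
        case "or":
            return node.children
                .map((child) => evaluate(child, lookup))
                .reduce((a, b) => union(a, b));
        default:
            throw new Error(`Unknown node type: ${node.type}`);
    }
}

// in:inbox from:@example.com (is:pinned OR "very important")
const query = {
    type: "and",
    children: [
        { type: "term", index: "mailbox", value: "inbox" },
        { type: "term", index: "from", value: "@example.com" },
        {
            type: "or",
            children: [
                { type: "term", index: "keyword", value: "pinned" },
                { type: "term", index: "phrase", value: "very important" },
            ],
        },
    ],
};

// Made-up index contents standing in for the real object stores.
const fakeIndexes = {
    "mailbox:inbox": new Set(["e1", "e2", "e3", "e4"]),
    "from:@example.com": new Set(["e2", "e3", "e4", "e9"]),
    "keyword:pinned": new Set(["e2"]),
    "phrase:very important": new Set(["e4", "e7"]),
};
const lookup = ({ index, value }) =>
    fakeIndexes[`${index}:${value}`] ?? new Set();

const matchingIds = [...evaluate(query, lookup)].sort();
```

With this sample data, only e2 (pinned) and e4 (contains the phrase) are in the inbox, from the right domain, and satisfy the OR clause.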

There’s so much interesting tech behind our offline support, but for now I need to stop writing. If you’ve read all of this mini series on how we are making our app work offline: thank you, and I hope you found it interesting! Please give the beta a go, and let us know any feedback you might have. We’re excited to finish polishing this highly requested feature and we hope to ship it to everyone early in the new year.