Dec 6: Twoskip and more
This is the sixth post in the Fastmail Advent 2024 series. The previous post is Dec 5: MySQL InnoDB innodb_trx is cached. The next post is Dec 7: Revision of the Core Email Specifications.
Rob wrote yesterday about some of the issues he ran into using MySQL triggers to implement JMAP access to our database tables, so I figured I’d follow up by talking about the other database that a lot of our data is stored in.
I love talking about twoskip. It’s used within the Cyrus IMAP server for all our big databases. And I wrote something about possible futures for the format back in the 2016 advent. Well, we over-engineered our attempt at zeroskip, a more copy-on-write friendly alternative to twoskip, so we’ve been running twoskip basically unmodified since then despite moving to ZFS for all our email storage.
It turns out that when you have really fast NVMe, even suboptimal filesystem usage is still pretty fast.
Before twoskip
When I first started at Fastmail, we were using Cyrus with the skiplist database format. In the early days, it could get corrupted pretty easily on a busy server. I wrote a tool in Perl back in 2006:
commit 20e5db17fd5be878a84c9b122fc22f49fe0f7ea2
Author: Bron Gondwana <brong@fastmailteam.com>
Date: Wed Sep 27 02:59:54 2006 +0000
tool to dump even corrupted skiplists in something resembling a usable form
diff --git a/utils/oneoff/skiplist_dump.pl b/utils/oneoff/skiplist_dump.pl
new file mode 100644
index 0000000000..e8bf1b27d1
--- /dev/null
+++ b/utils/oneoff/skiplist_dump.pl
@@ -0,0 +1,115 @@
+#!/usr/bin/perl
+
+use strict;
+use warnings;
+
+use IO::File;
+use Data::Dumper;
+
+use constant INORDER => 1;
+use constant ADD => 2;
+use constant DELETE => 4;
+use constant COMMIT => 255;
+use constant DUMMY => 257;
+
+my $fh = IO::File->new(shift);
Because we were getting so much corruption, I wound up figuring out what was going on and writing a patch to Cyrus. Here was the description:
conf/patches/cyrus_quilt/cyrus-skiplist-bugfixes-2.3.10.diff
SKIPLIST bugfixes
ACCEPTED UPSTREAM 2007-11-16
In the past we have had issues with bugs in skiplist on seen
files, and we truncated files at the offset with the issue
since they were only seen data.
Lately, we have had more tools updating mailboxes.db more
often, and have lost multiple mailboxes.db files.
There are two detectable issues:
1) incorrect header "logstart" values causing recovery to
fail with either unexpected ADD/DELETE records or
unexpected INORDER records depending which side of the
correct location the logstart value is wrong.
2) a bunch of zero bytes between transactions in the log
section.
The attached patch fixes the following issues:
a) recovery failed to update db->map_base if it truncated
a partial transaction. This reliably recreated the
zero bytes issues above by having the next store command
lseek to a location past the new end of the file, and
hence fill the remainder with blanks.
b) the logic in the "delete" handler for detecting "no
record exists" (ptr == db->map_base) was backwards,
meaning that a delete on a record which didn't exist
caused reads of PTR(db->map_base, i), which is bogus
and nasty. This is the suspect for logstart breakage
though I haven't proven this yet.
c) unsure if this is a real risk, but added a ftruncate
to checkpoint to ensure new file really is empty,
since we don't open it with O_EXCL.
d.1) when abort is called it needs to update_lock() to
ensure that the records it's about to rollback are
actually locked. This fix stopped segfaults in
my testing.
d.2) delete didn't check retry_write for success, and
also suffered from the same problem as:
d.3) if retry_writev in store failed, then it called
abort, but without the records actually written,
abort had no way of knowing which offsets needed
to be switched back, meaning bogus pointers could
be left in the file until the next recovery().
Changed it to write the records first, then only
once that succeeded update the pointers. This way
abort will do the right thing regardless.
*PHEW* - that's three days I want back, but it has survived
10 concurrent processes doing nasty things to it for 10000
operations each and it's still going strong:
skiplist: recovered /tmp/hammer.db (143177 records, 5965096 bytes) in 0 seconds
So that was that. But skiplist was still expensive to recover after a crash, so by 2011 I had replaced it with twoskip. Twoskip had many design goals (this is all in comments at the top of the source file; you can read it on GitHub).
Design Goals:
- 64 bit throughout
- Checksums on all content
- Single on-disk file
- Fully embedded, no daemons or external helper
- Readable without causing writes (if clean)
- May require that an aligned 512 bytes either fully writes or doesn’t write anything.
- May require fsync to work
We achieved all of those, which is why twoskip has been such a workhorse. With that said…
A corrupt twoskip file
Last year, we had our only incident of corrupt twoskip files in this entire time!
I’m basically just going to dump the entire Topicbox thread here, because it’s a peek behind the scenes at how we debugged the issue, and also how having redundant servers with failover meant that we could do all this debugging and recovery without impacting access to email for users.
First email, by me:
Sunday, February 05, 2023 12:28
I got paged about:
2023-02-04T15:54:56.525746-05:00 imap41 sloti41n57/syncserver[2135571]: DBERROR: twoskip checksum head error: filename=</mnt/i41n57/sloti41n57/store56/conf/user/uuid/c/9/c993a9a5-5fed-4515-807e-0b1326c7200d/conversations.db> offset=<00370B30> syserror=<No such file or directory> func=<read_onerecord>
It turns out there’s a block of corruption from 00360000 to 00380000 in that file. And it’s not all zeros. I’m still investigating.
Anyway, failed everything off the machine, rebooted - corruption is still there. Initiated a new zfs scrub. That slot is left down, all others are up.
Marc and I are both heading out for a bit to do personal things today, but I’ll look in again on the scrub later and also keep looking at the corruption. It looks like this:
00360000 2b 01 00 34 00 00 00 19 00 00 00 00 00 76 e3 80 |+..4.........v..|
00360010 00 00 00 00 00 36 00 70 28 b1 a3 a4 fa ef 97 5e |.....6.p(......^|
00360020 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
003600e0 2b 01 00 30 00 00 00 19 00 00 00 00 00 82 4a c8 |+..0..........J.|
003600f0 00 00 00 00 00 36 01 50 11 20 22 7c 62 dd 0d d3 |.....6.P. "|b...|
00360100 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00360150 2b 02 00 31 00 00 00 19 00 00 00 00 00 00 00 00 |+..1............|
00360160 00 00 00 00 00 36 01 c8 00 00 00 00 00 36 03 f8 |.....6.......6..|
00360170 62 6b f2 75 12 c0 98 2e 00 00 00 00 00 00 00 00 |bk.u............|
00360180 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00360230 00 00 00 00 00 00 00 00 2b 01 00 31 00 00 00 19 |........+..1....|
00360240 00 00 00 00 00 75 4d d8 00 00 00 00 00 36 02 a8 |.....uM......6..|
00360250 f8 55 a3 e8 24 1c 9e 87 00 00 00 00 00 00 00 00 |.U..$...........|
00360260 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00360310 00 00 00 00 00 00 00 00 2b 01 00 30 00 00 00 19 |........+..0....|
00360320 00 00 00 00 00 76 df 08 00 00 00 00 00 36 03 88 |.....v.......6..|
00360330 0d 41 4e 13 67 8d aa b4 00 00 00 00 00 00 00 00 |.AN.g...........|
00360340 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
Which is so weird because it looks like keys but not values being written or something. But the checksums also aren’t consistent so it’s not pure on-write corruption. I don’t know yet.
I’m not thrilled with the number of filesystem level blue moons we’ve had in the past few weeks. I want to do more digging later, but for now we’ll let the scrub run.
I came back later with more detail
Sunday, February 05, 2023 22:04
OK - this IS consistent with a zeroed out area being written over later. So the file was written, then this block got zeroed, then later writes came back and updated just the headers of these records, which is how it works.
So it SMELLS like maybe this is a pwrite vs mmap issue. Linux (maybe ZFS) failed to page in the content, and then wrote bits of it, and wrote back the block with zeros for all the bits it wasn’t writing.
So it’s for sure a page cache bug. OUCH. We can maybe work around this by using mmap for writes as well as reads, certainly if we see this again we’re in trouble.
I looked to see if there were ANY other files of the same type on the same machine, and no - everything else was consistent. This took hours:
[fastmail root(brong)@imap41 ~]# find /mnt/*/*/*/conf/user/uuid -name conversations.db | xargs -I{} sh -c 'echo -n {} ": "; sudo -u cyrus /mnt/i41n57/sloti41n57/store56/conf/pkg/bin/cyr_dbtool -C /etc/cyrus/imapd-sloti41n57.conf {} twoskip consistent' | grep No
/mnt/i41n57/sloti41n57/store56/conf/user/uuid/c/9/c993a9a5-5fed-4515-807e-0b1326c7200d/conversations.db : No, not consistent
[fastmail root(brong)@imap41 ~]#
So I moved it aside and reconstructed the user:
[fastmail root(brong)@imap41 ~]# mv /mnt/i41n57/sloti41n57/store56/conf/user/uuid/c/9/c993a9a5-5fed-4515-807e-0b1326c7200d/conversations.db /mnt/i41n57/sloti41n57/store56/conf/user/uuid/c/9/c993a9a5-5fed-4515-807e-0b1326c7200d/conversations.db.old
[fastmail root(brong)@imap41 ~]# cyr sloti41n57 ctl_conversationsdb -R [REDACTED USER]
[REDACTED DETAIL]
[fastmail root(brong)@imap41 ~]#
And finally started up the slot.
Rob M, our CTO, chimed in
Monday, February 06, 2023 10:26
I note that this is 128kb, which is the ZFS recordsize we use for the cyrus spools. That seems awfully coincidental.
It’s well known that the way ZFS uses memory (the ARC) and its interaction with the Linux page cache is messy and has caused problems in the past, but it had been basically dealt with.
This problem though suggests there might still be a very rare hole somewhere. Given this is the first corruption we’ve seen in over a year of heavy usage across many machines, it seems it’s going to be a very very rare edge case, and likely to be extremely hard to ever track down :(
Rob N, Senior Sysadmin
Saturday, February 11, 2023 12:21
This is likely to be https://github.com/openzfs/zfs/issues/13608, or very adjacent to it. It’s describing an mmap read returning wrong/corrupted data.
No one yet really has a handle on it. It is being investigated a little elsewhere and was discussed at the OpenZFS dev call in early January. It seems clear that it’s possible for the page cache to be dirty when ZFS thinks it should be clean, but no one has been able to track it down reliably. It might be related to concurrent mmap reads and modifies over the same record, and then being invalidated inconsistently (which is why clones are in play in the above issue).
Assuming it is the same issue, we might be in a position to assist because we’re coming at it from a different angle.
What I want to do next is try to determine the rough sequence of events (mapped read -> write -> read -> whatever), including any other programs that might have been operating on the same data at the same time. I suspect we can put together a rough sequence just by reading code. Bron, if you have some time this week, I would like to go through this with you (mostly: show me the code and I’ll nod along). If we can pin down a clear sequence, maybe we can write something that can reproduce it if we drive it hard enough. And if not, maybe we can at least write this all down for the ZFS folk to consider.
And finally Rob N wrote up a great summary
Sunday, February 19, 2023 22:08
Bron and I spent a couple of hours last Tuesday going over exactly how twoskip does its work and trying to get a sense of what happened. This is something of a writeup. Thin on technical detail, because that would take even longer for me to write.
Tiny tiny twoskip overview
It helps to know what a twoskip file is before we talk about how it writes to disk.
From outside, it’s a KV store, with variable-length keys and values. Internally, it’s effectively a linked list of records, sorted by key (it’s a skiplist, so there are actually multiple linked lists, but that doesn’t matter for our purposes).
Each record is either a "RECORD", that is, a real active item in the store, or a "DELETE", which is a tombstone in place of a thing that used to exist. (There are a couple of others that don’t matter.) Making a modification means writing a new record to the end of the file, and then going back through the file to fix the linked list pointers to hook that record up in the right place (if you ever wrote a sorted linked list and remember the fiddling to insert a thing in the middle of the list, it’s just a slightly fancier version of that).
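(As a very rough sketch of that append-and-fix-the-pointers idea - plain C, not the actual Cyrus code, and the helper functions here are entirely made up - an insert into a single sorted on-disk list looks something like this:)

/* Sketch of "append the new record, then splice it into the sorted list
 * by rewriting pointers in place". Twoskip layers multiple list levels
 * and checksums on top of this; the helpers below are hypothetical. */
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

uint64_t file_end(int fd);                                  /* current end-of-file offset */
uint64_t get_next(int fd, uint64_t rec);                    /* read a record's "next" pointer */
void     set_next(int fd, uint64_t rec, uint64_t next);     /* overwrite that pointer in place */
int      key_before(int fd, uint64_t rec, const char *key); /* does this record sort before key? */

/* head is assumed to be a dummy record that sorts before every key */
void insert_record(int fd, uint64_t head, const void *rec, size_t reclen, const char *key)
{
    /* 1. append the new record at the end of the file */
    uint64_t newoff = file_end(fd);
    pwrite(fd, rec, reclen, newoff);

    /* 2. walk the list to find the record it should follow */
    uint64_t prev = head;
    while (get_next(fd, prev) && key_before(fd, get_next(fd, prev), key))
        prev = get_next(fd, prev);

    /* 3. splice it in by rewriting two pointers */
    set_next(fd, newoff, get_next(fd, prev));
    set_next(fd, prev, newoff);
}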
Cyrus does fairly regular repack operations, which means creating a new file and writing all the “live” records to it, removing any RECORDs that have a DELETE, and of course writing all the records in order in the file, rather than having pointers back and forth through the file. Same end result, but a better optimised file.
Reading and writing
Accessing a twoskip file is done via two different mechanisms: mmap() for reads, write() (actually pwritev()) for writes.

For the uninitiated, mmap() is (in this case) a way to pretend a file is really a region of memory. Rather than having to read the contents of a file into memory piece by piece, you can just tell the kernel to give you a big chunk of memory. When you try to read from it, the kernel will go and get the data from the file and make it appear in that memory space, so you can kind of pretend that it’s already been read into memory. The main reason to do this is performance - the kernel is usually giving you a view over its own filesystem cache memory, where it’s already likely got a copy of a busy file available, and it doesn’t have to copy that data into a user-space memory buffer.

For writing, we do a more conventional write() at specific offsets. You can set up an mmap() for writes too, but for small random writes throughout a file (as twoskip writes are) it doesn’t really gain much on performance (the kernel has to copy from userspace) and can be tougher for the user program to manage (in a variety of subtle handwavey ways that tbh I am not an expert in). The calls to write() will be reflected in the existing read map anyway (the kernel will copy it there before it sends it down to disk), so it all works out quite nicely in the end.
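(Roughly, that split looks like this in code - a simplified sketch with error handling omitted, not the actual Cyrus implementation:)

/* Simplified illustration of the "mmap() for reads, write() for writes"
 * pattern: not the Cyrus code, just the shape of it. */
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#define MAP_ALIGN 0x4000                 /* maps are extended in 16K steps */

struct db {
    int    fd;
    char  *base;                         /* read-only mapping of the file */
    size_t maplen;                       /* length of the mapping */
    size_t size;                         /* current length of the file */
};

static void db_open(struct db *db, const char *path)
{
    struct stat sb;
    db->fd = open(path, O_RDWR);
    fstat(db->fd, &sb);
    db->size = sb.st_size;
    /* reads come straight out of a map covering the whole file,
     * rounded up to the next 16K boundary */
    db->maplen = (db->size + MAP_ALIGN - 1) & ~((size_t)MAP_ALIGN - 1);
    db->base = mmap(NULL, db->maplen, PROT_READ, MAP_SHARED, db->fd, 0);
}

static void db_append(struct db *db, const void *rec, size_t len)
{
    /* writes go through the file descriptor, not the map */
    pwrite(db->fd, rec, len, db->size);
    db->size += len;
}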
The twoskip write cycle

When we open a twoskip file, we map the entire file into memory with mmap(). The map region is the length of the file, rounded up to the nearest 16K boundary. At this point we could happily read the existing contents out of memory.

When we write a new record, we seek to the end of the file, and then write four distinct parts (in a single write() call):
- a 32-byte header, which includes the lengths of the key and value, linked list pointers, and checksums for the header and the data
- variable-length key
- variable-length value
- 0-7 padding bytes to bring the entire record to an 8-byte boundary
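(To give a feel for the layout, here is a sketch of a 32-byte header carrying those fields - the field names, sizes, and ordering are illustrative, not the real twoskip format:)

/* Illustrative only: a 32-byte fixed header followed by key, value and
 * 0-7 bytes of padding. The real twoskip header differs in detail. */
#include <stdint.h>

struct record_header {
    uint8_t  type;      /* RECORD, DELETE, ... */
    uint8_t  level;     /* how many skiplist levels link through this record */
    uint16_t keylen;    /* length of the key that follows the header */
    uint32_t vallen;    /* length of the value that follows the key */
    uint64_t nextloc;   /* file offset of the next record in key order */
    uint64_t headcrc;   /* checksum over this header */
    uint64_t datacrc;   /* checksum over the key and value */
};                      /* 32 bytes in this sketch */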
Then, we go back through the file (before the record), finding the nodes that should point to this new record, and overwrite their 32-byte headers, updating the pointers to splice it into the list. Each header is a separate call to write().

Now, an mmap() region has a fixed length, and we started with it as the length of the file, plus padding. After writing a record, the file may now actually be larger than the map. There’s nothing wrong with this; mmap() is done by offset and length, so it doesn’t have to cover the whole file. It does of course mean that if we have just written a record past the end of the map, we can’t actually read it back (it would cause a segfault). To keep everything nice, after each write(), we check to see if we’ve gone past the end of the map and if so, we extend it out past the end (to the nearest 16K).

An important thing to note: if you mmap() past the end of a file (and we do as a matter of course, with the 16K alignment) and then you read from that area, you get zeroes back.
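(Continuing the sketch from earlier - again illustrative, not the Cyrus code - the map extension step is roughly:)

/* After a write we make sure the map still covers the (possibly larger)
 * file, re-mapping out to the next 16K boundary if it doesn't. Reading
 * the mapped area beyond the real end of the file returns zeroes. */
static void db_extend_map(struct db *db)
{
    if (db->size <= db->maplen)
        return;                           /* map still covers the file */
    munmap(db->base, db->maplen);
    db->maplen = (db->size + MAP_ALIGN - 1) & ~((size_t)MAP_ALIGN - 1);
    db->base = mmap(NULL, db->maplen, PROT_READ, MAP_SHARED, db->fd, 0);
}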
ZFS is not like other filesystems

There are a few things to know about ZFS that are different from a more conventional overwriting filesystem in this situation.
ZFS’ unit of IO is the "record size". It’s configurable, but we’re using the default, which is 128K. Which is to say, any write smaller than 128K to an existing file requires loading a full 128K record from disk, modifying it in memory, then writing the full 128K back down again (to a new record; ZFS never overwrites existing records, but that’s not really relevant here).
ZFS does not use the kernel-provided filesystem cache (the "page cache") because it has its own cache (the ARC), which has quite different characteristics to support all sorts of ZFS features. However, mmap() by definition exposes memory in the page cache. To make this work, ZFS copies data between the page cache and the ARC as necessary when a file is mapped.

For asynchronous writes, a conventional filesystem will end up having written data staged in the page cache, which is periodically flushed to disk. ZFS has the same concept, but it uses the ARC as the cache, and writes are attached to a "transaction group", which is flushed periodically. This is mostly interesting here because I’m going to say "transaction group" soon. The details past that don’t matter; mostly it’s just a collection of writes waiting to go to the storage pool.
What even happened?
So, what we saw is a single 128K section of the file that had record headers, but zeroes through the data sections. Recall that new records are written to the end of the file, making it larger, then we extend the map over it and past the real end, and then we go back through the file and update the headers.
This 128K section has record headers, but no data. It seems that the only way this could have happened is if the initial record writes were lost somehow, but the header writes survived. The initial record writes are the only thing that can write "outside" of the currently mapped region; the header fixups always happen on existing records that the map will always be covering.
This, and the fact that the data is all zeroes (and not scrambled random memory contents), suggests that write() successfully staged stuff to be written (that is, an ARC buffer for the record was allocated and was attached to a transaction group), but when mmap() was called to extend the map, ZFS decided that the page cache contents (all zeroes) were dirty, and overwrote the pending ARC buffer with its contents, and so those zeroes were written down. Whatever the bug is, this is the heart of it.

Another interesting factor is that the last twoskip record in the previous 128K ZFS record ends exactly at the end of that record. This means that the map wouldn’t have been extended, as it was already right on the end of the block, covering the entire file. So the very first write to the new record would have been the one that allocated it, and the map extension immediately after would have been the first that covered that record, so the record would have only existed in any form in the ARC at that moment, not yet on disk.
(This is interesting, because complex interactions at object boundaries are traditionally a rich source of bugs.)
But the thing we must remember is that we have been running this code for nearly two years. We have written billions, maybe even trillions, of records in that time. If this was a simple case of a small handful of conditions occurring at the same time, we’d almost certainly have hit this already, and we know we haven’t. That almost certainly points to a locking bug; ZFS is working to keep the ARC and the page cache in sync, which means knowing their states in a given moment and choosing whether or not to copy data back and forth. It very much feels like there is a tiny, probably microsecond-long gap, where something is unlocked that shouldn’t be, and if you’re very very unlucky, a decision is made right in the gap based on faulty knowledge.
Making a test case
So Bron and I got to the point where we at least felt like we knew the moving parts that were likely involved, and we could start thinking about a test case. To reproduce a bug that only happens in a very specific set of conditions, some that you can control and one that you can’t, you just have to write a program that hits those conditions as often as possible, and then run it over and over in a short space of time, trying to hit the elusive condition that explodes the whole thing.
So, Bron wrote skipwhack, which attempts to write to a file in the shape described above over and over, and then check back to see that it actually wrote stuff out properly.
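(Something in this spirit - a hand-waved sketch, not the actual skipwhack code - hammers the same append / header-rewrite / re-map / read-back pattern in a tight loop:)

/* Stress the pattern that seems implicated: append a record, rewrite an
 * earlier "header", re-map, and verify the new data is really there
 * (and not zeroes). Purely illustrative; skipwhack itself differs. */
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define RECLEN 64

int main(void)
{
    int fd = open("skipwhack-test.db", O_RDWR | O_CREAT | O_TRUNC, 0600);
    char rec[RECLEN];

    for (uint64_t i = 0; i < 10000000; i++) {
        /* append a recognisable record at the end of the file */
        memset(rec, 'A' + (int)(i % 26), sizeof(rec));
        off_t off = (off_t)(i * RECLEN);
        pwrite(fd, rec, sizeof(rec), off);

        /* rewrite the first 8 bytes of the previous record, like a pointer fixup */
        if (i > 0)
            pwrite(fd, rec, 8, off - RECLEN);

        /* re-map and check the freshly appended data survived */
        size_t len = (size_t)off + RECLEN;
        char *map = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
        assert(map[off] == 'A' + (int)(i % 26));
        munmap(map, len);
    }
    close(fd);
    return 0;
}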
Now, at time of writing, I haven’t had a chance to exercise this - all of my test systems have some deficiency (disks too slow, CPU too slow, not enough memory) that makes them unable to generate enough load to trigger the bug. Of course, it could be that it’s also not an accurate enough reflection of the situation (maybe we missed a condition) to trigger it, but I think it’s far too soon to call that. I’ll be trying to find a suitable machine this week.
Maybe none of this matters
Around the time Bron emailed me, a PR appeared that I think will probably fix this anyway: https://github.com/openzfs/zfs/pull/14498. It’s claiming to fix the previous mmap bug I mentioned upthread. I don’t really follow it well, but it seems like ZFS was expecting a page fault to either succeed or fail, but there’s a rare "in between" state that it wasn’t handling properly. This fix seems to make it handle that case.
I suspect this is the whole story, but I’m not totally sure, so I’m still going to try to get skipwhack to run and blow up, because then we can try this patch and see if it stops doing that. If it does, great; if not, I can write this up in more detail as there may be further things. We’ll see!
Postscript
We never saw another one of these corruptions, and we’re now on a newer ZFS which includes that fix, so we believe we’re all good for this. But in all these years, the only twoskip corruption we have seen has been an operating system level filesystem corruption!
Yes, we also blew away and recreated the filesystem which had given us the error. Our hardware really is super reliable these days, so we really don’t see this kind of issue very often.
Rob N is now working for Klara Systems, on the OpenZFS filesystem. He writes more great material about it over on his blog. We’re all very happy to be using a filesystem that he’s working on improving!
Replacing twoskip
Having said all this, the issues that we identified even back in 2016 still exist. Twoskip is a little heavy on random IO for writes. That’s OK for us. But we have single-file databases of over a gigabyte, and the "checkpoint" command, where a file gets rewritten from scratch, can take many minutes. That’s painful, because it’s a stop-the-world lock.
So I’m working on a new format, twom - pronounced "tomb". It’s basically what I described back in 2016:
- Ancestor pointers so you can perform an MVCC read without blocking writes
- a checkpoint operation which uses MVCC plus log replay, without ever needing long locks
- a faster hash algorithm (xxHash rather than CRC32)
- direct mmap reads and writes rather than mixing mmap with file IO
- blank slop space on the end of files to avoid needing so many syscalls to extend the file
- more direct file handle and mmap manipulation rather than using our library wrappers, so we can be smarter about duplicating key records.
That last one is a bigger deal than you might imagine. We spend a lot of CPU in twoskip just duplicating data that could only be lost in rare edge cases. By being more lazy about that, we can save CPU in the common case.
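To give a feel for the first item on that list, the ancestor-pointer idea looks roughly like this - a conceptual sketch only, not the actual twom layout, and the types and helpers are made up:

/* Conceptual sketch of MVCC reads via ancestor pointers. A reader
 * remembers how long the file was when its transaction began, and any
 * record version written after that point is invisible to it: it walks
 * the ancestor chain back to the version that existed at snapshot time. */
#include <stdint.h>

struct version {
    uint64_t offset;     /* where this version lives in the file */
    uint64_t ancestor;   /* offset of the previous version of the same key (0 = none) */
    /* ... key, value, checksums ... */
};

const struct version *version_at(uint64_t offset);   /* hypothetical: load a version by offset */

const struct version *visible_version(const struct version *v, uint64_t snapshot_end)
{
    while (v && v->offset >= snapshot_end)
        v = v->ancestor ? version_at(v->ancestor) : 0;
    return v;            /* NULL means the key didn't exist at snapshot time */
}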
All this needs to be done without losing any of the existing high reliability that twoskip gives us. I’m very much hoping to have twom stable by early next year and start testing it in production!
Starvation
The thing that triggered me coming back to this problem was an archive user having a couple of million emails deleted. One week later, we run the cyr_expire tool to clean up the deleted email, and this tool currently takes a lock and then does all the deletes. A million updates to a conversations database takes a long time, even just for deletes! There are indexes to recalculate and update, and files to unlink.
It took a couple of hours, and while it did so, it blocked deliveries for that one user. Archive users for big customers get a lot of mail, so this started using up all the processes and slowing email delivery for about 8% of our customers. Ouch. We paused delivery to that one user and got everyone else back working, but having to manually intervene is annoying, particularly outside working hours.
So we’re also working our way through the Cyrus code looking for things like this, and batching them up. In future, cyr_expire is going to do a bunch of messages, release all its locks, then come back and start again with another batch. That way, other tasks like delivering a new email get a chance to interleave.
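The shape of that change is roughly the following - a sketch of the pattern with hypothetical helper names, not the actual Cyrus code:

/* Batching pattern: hold the lock for a bounded chunk of work, then
 * release it so other tasks (like delivering new mail) can interleave.
 * All of the types and helpers below are made up for illustration. */
#define BATCH_SIZE 1000

struct user;                                /* stands in for a Cyrus user/mailbox handle */
void user_lock(struct user *u);
void user_unlock(struct user *u);
int  expire_batch(struct user *u, int max); /* expunge up to max messages, return 1 if more remain */

void expire_user(struct user *u)
{
    int more = 1;
    while (more) {
        user_lock(u);
        more = expire_batch(u, BATCH_SIZE);
        user_unlock(u);                     /* deliveries get a turn here */
    }
}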
Fastmail really cares about the detail
Hopefully you can see from this that we’re right down in the weeds, looking at exactly what’s going on, and building components with the kind of reliability that allows us to keep the service up, and keep email flowing, even as we run into nasty edge cases with our systems.
See you again in the next one.