Dec 13: It’s knot DNS. There’s no way it’s DNS. It is DNS!
This is the thirteenth post in the Fastmail Advent 2024 series. The previous post was Dec 12: Following the Sun. The next post is Dec 14: On-Call Systems.
Ten years ago on December 13, 2014, I talked about how Fastmail had moved its DNS system from TinyDNS to PowerDNS.
In 2023, we made another big move, switching all our DNS serving from PowerDNS to Knot DNS. This turned out to be a fairly large change and we ended up going down a couple of different paths before landing on a solid implementation.
Where were we up to again?
As a reminder of where we left off in 2014: we'd moved DNS serving for our ns[12].messagingengine.com DNS servers from static TinyDNS files to PowerDNS with its pipe backend, generating content dynamically on each DNS request via internal logic in code. Modulo some caching, this removed the latency between making an update to your domain's DNS in our UI and those changes becoming visible at our DNS servers.
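For context, a PowerDNS pipe backend is just a separate process: PowerDNS writes each incoming query to it as a tab-separated line on stdin, and the backend writes DATA/END lines back on stdout. A minimal sketch of one (pipe backend ABI version 1; purely illustrative, not our actual backend) looks like:

#!/usr/bin/perl
# Minimal sketch of a PowerDNS pipe backend (ABI version 1).
# Illustrative only -- not Fastmail's actual backend.
use strict;
use warnings;

$| = 1;    # answers must not be buffered

my $helo = <STDIN>;               # "HELO\t1" handshake from PowerDNS
print "OK\texample backend\n";

while (my $line = <STDIN>) {
    chomp $line;
    my ($tag, $qname, $qclass, $qtype, $id, $remote_ip) = split /\t/, $line;
    if (defined $tag && $tag eq 'Q') {
        # Real lookup logic (database, internal rules, etc) would go here;
        # a hard-coded A record is used as a placeholder.
        if ($qtype eq 'A' || $qtype eq 'ANY') {
            print join("\t", 'DATA', $qname, $qclass, 'A', 300, $id, '192.0.2.1'), "\n";
        }
    }
    print "END\n";
}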
If you own your own domain and host it at Fastmail, it’s easy to customise the DNS for it. Just go to Settings -> Domains and click Edit next to the domain.
Then Customise DNS.
By default we generate a number of records to make using your domain with email easy. We recommend leaving these as is, but you have full control and it’s easy to add or change additional records.
# dig +short a.b.c.uberengineer.com TXT
"txt for a.b.c"
While this solved the original latency problem we had, it introduced a few others.
The pipe backend was considerably more expensive computationally. Under normal DNS load this wasn't a problem and the system was scaled appropriately to handle it just fine. However, if we got hit with excessive load, mostly due to some form of DDoS attack, the system could easily come under strain and start to fail. When that happened, people couldn't reliably access our website or our IMAP/POP/SMTP servers, and worst of all, other sites might have problems working out which servers to deliver Fastmail customers' email to.
To protect our servers from DDoS attacks, we had put them behind Cloudflare’s DNS firewall product, which is a global distributed DNS cache.
The Cloudflare DNS firewall product works great at edge caching if there's a flood of DNS queries to a particular domain (or small set of domain names) and solved most of the DDoS flooding issues we saw. However we experienced cases where we were flooded with DNS queries of the form $randomdomain.fastmail.com (called a pseudo-random prefix attack or random subdomain attack). If every DNS query is for a different random sub-domain, then Cloudflare has to pass all those queries straight through to us as it has nothing cached. In theory Cloudflare say they can mitigate this. Unfortunately we felt their mitigation didn't work particularly well and still resulted in a significant overload of incoming DNS queries.
Again, this flood of queries could overload the PowerDNS pipe backend processes and cause visible DNS downtime. We couldn't find any sensible tuning that would allow PowerDNS and our pipe backend to operate correctly under one of these sustained floods.
Mitigating random prefix attacks
At the time we needed a quick solution to this problem, so we ended up splitting our DNS in two. We noted that basically all the random prefix attacks were against our system domains like fastmail.com, not user domains. Since DNS for our system domains doesn't change much at all, we backtracked and put our system domains onto separate nameservers running TinyDNS with a mostly static database of DNS records. Although this felt hacky, it worked: TinyDNS was able to absorb the higher query load in these attack situations quite well.
This however complicated our DNS setup even more. User domains weren't covered by the TinyDNS system, so if one of them experienced an attack, it could overload the PowerDNS server and take out not just that one user domain, but all user domains using our DNS. We wanted a simpler, better, and more permanent solution.
Replacing all DNS with Knot DNS server
The main observation about DNS is that in general it doesn't actually change that often. So what we wanted was a solution where each of the hundreds of thousands of domains in our system is a DNS zone that can be individually built into a static database, and each zone can be added/updated/deleted separately without having to reload/rebuild all zones.
We looked around at a few servers and went with Knot DNS.
- It looks like a “standard” DNS server with zone files, replication, etc, so it’s easier for new staff to understand
- A single zone can be easily added/updated/deleted at runtime into its live database
- It’s a known high performance server out of the box
In theory then what we want isn’t actually that hard:
- Whenever a domain is added/updated/deleted, we create/replace/remove a zone file for that domain
- We tell the DNS server to add/update/drop the corresponding zone
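In Knot terms, those steps map onto a handful of knotc commands. Roughly (the zone name and workflow here are illustrative, not our exact tooling):

# Add a new zone at runtime (its zone file has already been written out)
knotc conf-begin
knotc conf-set 'zone[uberengineer.com]'
knotc conf-commit

# Pick up changes after rewriting an existing zone file
knotc zone-reload uberengineer.com

# Drop a zone's data and remove it from the configuration
knotc -f zone-purge uberengineer.com
knotc conf-begin
knotc conf-unset 'zone[uberengineer.com]'
knotc conf-commit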
The devil turned out to be in many small details and misadventures.
Let's not unknot the knot
First things first. Although Knot as a DNS server works great, we've found its name to be a bit annoying. It's a play on the fact that the most common DNS server on the internet is BIND, so… knot. Unfortunately, you'll find yourself at some point saying something Bernard Woolley-esque like "it was not obvious that it did not work because knot was not bound to the knot IPs", which is easier to understand when you read it than when you hear it. We try to keep discussions sane by explicitly saying "ka-not" whenever we refer to the server/software.
Detecting zone changes
There were two main ways we could do this:
1. Catch every place in application code where we insert/update/delete records in the Domains, CustomDNS or DKIMRecords tables
2. Use DB triggers to do the same thing
We ended up going with (2) because it felt like the right choice for rock solid data reliability.
We update the DB for domain-related changes in a number of places: from code, to scripts, to cron jobs. Each of those might need DNS zones to be updated, and if we miss something unexpected now or in the future, it might create subtle bugs such as "domain DNS didn't get updated" or worse, "domain SOA serial number didn't get bumped, so the Knot replica server didn't get the updated domain data, so we're serving different DNS from two different nameservers".
These are the sorts of problems that cause really hard-to-debug customer issues, which then randomly disappear when the domain gets touched in some other way and everything gets updated correctly again, making them extremely hard to track down.
DNS is already hard enough for many customers to understand. We wanted to be 100% sure that our DNS servers are rock solid and the data they are generating is completely consistent with what users see in our UI.
Actually getting the triggers working turned out to involve a number of issues and unexpected edge cases, but we also ended up with some nice results.
- We moved the logic that bumps an SOA serial number into the trigger. Originally this was done on the “active primary” server, and we had to make sure this happened before the non-active failover primary rebuilt the zone. By doing this in the trigger, we ensure that the serial can never be out of date after a change.
- We ended up with a nice design to track changed domains.
We have a KnotDomainsChanged table that looks like:
+--------------+--------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+--------------+--------------+------+-----+---------+-------+
| Server | varchar(255) | NO | PRI | NULL | |
| DomainId | int | NO | PRI | NULL | |
| Domain | varchar(255) | YES | | NULL | |
| NeedsRebuild | tinyint | YES | | 1 | |
| ErrorCount | int | YES | | 0 | |
+--------------+--------------+------+-----+---------+-------+
There is another table, KnotServers, that has a list of all currently running Knot servers. Whenever a domain is added/updated/deleted, a trigger executes this query:
INSERT INTO KnotDomainsChanged (Server, DomainId, Domain)
SELECT Server, BumpDomainId, BumpDomain
FROM KnotServers
ON DUPLICATE KEY UPDATE
NeedsRebuild = 1,
ErrorCount = 0;
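Here BumpDomainId and BumpDomain stand for the id and name of the domain row being changed. As a rough illustration of how such a trigger could be wired up in MySQL (a sketch with assumed column names, not our actual trigger):

DELIMITER //
CREATE TRIGGER Domains_NotifyKnot AFTER UPDATE ON Domains
FOR EACH ROW
BEGIN
  -- Queue this domain for a rebuild on every currently running Knot server
  INSERT INTO KnotDomainsChanged (Server, DomainId, Domain)
  SELECT Server, NEW.DomainId, NEW.Domain
    FROM KnotServers
  ON DUPLICATE KEY UPDATE
    NeedsRebuild = 1,
    ErrorCount = 0;
END //
DELIMITER ;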
With this, the KnotDomainsChanged table effectively maintains a set of changed domains for each primary Knot server to pick up and build. By using (Server, DomainId) as the primary key, we ensure this table can't grow without bound even if a primary Knot server is down for a while.
The NeedsRebuild flag allows us to correctly rebuild a zone without a race condition. The process for keeping zones up-to-date is effectively to run the following in an infinite loop:
- Fetch and iterate over all KnotDomainsChanged records for this Server
- Set NeedsRebuild = 0 for this (Server, DomainId) record
- If the domain exists in the Domains table, add/rebuild the zone into Knot
- If the domain does not exist in the Domains table, purge the zone from Knot
- Delete the (Server, DomainId) record from KnotDomainsChanged iff NeedsRebuild = 0
So if the domain changes while a rebuild is in progress, the trigger sets NeedsRebuild = 1 again, which means the record won't be deleted from the table after the zone build finishes, and the domain will be picked up again the next time the sync loop runs a few seconds later. This avoids the race where a domain is changed while it is in the process of being rebuilt.
Using standard DNS AXFR/IXFR for replication
Our initial plan was to use a more “standard” DNS setup. We would have an internal hidden primary server and a number of secondary servers that pull from the primary via AXFR/IXFR. Only the secondary servers would handle DNS queries from the world.
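On the secondary side, the Knot configuration for that looks something like this (a sketch with illustrative names and addresses, not our real config):

# knot.conf sketch for a secondary pulling from the hidden primary
remote:
  - id: hidden_primary
    address: 192.0.2.10@53

acl:
  - id: notify_from_primary
    address: 192.0.2.10
    action: notify

template:
  - id: default
    master: hidden_primary
    acl: notify_from_primary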
This turned out to have a number of annoying edge cases:
- It required two completely separate Knot server setups with completely separate and quite different configurations. Although this is a more "standard" DNS management approach, it reduced some of the benefit of moving to a single DNS server.
- Using AXFR for updates caused unexpected problems. AXFR connections aren't reused, so every domain transferred required a separate TCP connection. When a lot of domains needed to be updated at once, we ran out of TCP socket tuples because of sockets in TIME_WAIT state. This caused AXFRs to the secondaries to start failing. We fixed this by enabling the kernel tcp_tw_reuse tunable, but it felt… hacky.
- At Fastmail, whenever we have a singleton service (e.g. the Knot primary), we want to make sure that we can take the machine it's running on down safely. To allow that, we have the service run on at least two separate servers and use a failover IP that is bound to the current up/active server.
Unfortunately this, combined with the way Knot does catalog zones, completely broke AXFR/IXFR replication. What is a catalog zone? It's a standard way for primary and secondary DNS servers to keep the complete list of zones they manage in sync.
The content of a DNS zone is synchronized among its primary and
secondary nameservers using AXFR and IXFR. However, the list of
zones served by the primary (called a catalog in [RFC1035]) is not
automatically synchronized with the secondaries. To add or remove a
zone, the administrator of a DNS nameserver farm not only has to add
or remove the zone from the primary, they must also add/remove
configuration for the zone from all secondaries. This can be both
inconvenient and error-prone; in addition, the steps required are
dependent on the nameserver implementation.
It’s a classic example of taking a system that already has a way of storing data (DNS records) and replicating that data (AXFR/IXFR), and reusing those mechanisms to sync something else. In this case, a specially configured catalog zone that itself contains a list of all other member zones managed by the server in a standard defined format.
The basic format of a catalog zone is that each member zone managed by the server exists in the catalog zone as the RDATA of a PTR record. Then, because each PTR record needs a unique owner name, a unique identifier is used as a sub-domain of the catalog zone itself, e.g. unique-N.zones.catalog.zone. PTR member-domain.org.
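To make that concrete, a minimal catalog zone (per RFC 9432) looks roughly like this, with one PTR record per member zone (contents illustrative):

catalog.dns.internal.                        0 SOA invalid. noc.invalid. 1 3600 600 2147483646 0
catalog.dns.internal.                        0 NS  invalid.
version.catalog.dns.internal.                0 TXT "2"
8815a670d8fa9032.zones.catalog.dns.internal. 0 PTR uberengineer.com.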
The problem here is that the unique identifiers are not generated in a consistent way if you have multiple different primary servers!
k1: 8815a670d8fa9032.zones.catalog.dns.internal. 0 PTR uberengineer.com.
k2: 09b6b01fb75e81e6.zones.catalog.dns.internal. 0 PTR uberengineer.com.
During testing, k1 was the entry on our first server (the active primary) and k2 the entry on the second server (the backup primary).
So when we did a failover from the current active primary to promote the backup primary to active, the catalog zone on the new active primary was completely out of sync with the catalog zones on all the downstream secondary servers. This caused a massive amount of resyncing and general Knot confusion. There didn't seem to be an easy solution to this problem without ultimately having some singleton source-of-truth server, which is what we wanted to avoid.
Switching to a single server type
After this, we decided to dump the whole Knot primary/secondary system and AXFR/IXFR replication, and instead have just a single type of Knot server. This ended up having a number of advantages.
- Only one type of Knot server and one type of Knot server configuration. Fewer configurations to understand.
- No catalog zone. Fewer concepts to learn about.
- No AXFR/IXFR. AXFR/IXFR is harder to reason about and less visible to operators.
- No worry about zone serial numbers going backwards if you add -> update -> delete -> re-add a domain, which can cause AXFR/IXFR replication weirdness.
Doing this definitely felt like a better solution. Additionally, it all "just worked", because we had already built the trigger system that could keep an arbitrary number of primary servers (originally for failover) up to date. Now it was just keeping all our Knot instances up to date.
Testing the new system
Since DNS is so critical and we have hundreds of thousands of domains, we wanted to test as carefully as possible that the new Knot system would generate the same results as the existing PowerDNS system.
The Net::Pcap, Net::Frame and Net::DNS modules in Perl made this straightforward. We were able to write a script that captured packets with libpcap, unpacked them into DNS queries and responses, and then replayed them against the new Knot servers to compare the results. This let us check with real-world query data that we would get back the same responses.
my $resolver = Net::DNS::Resolver->new(nameservers => [ '...existing DNS ip...' ]);

... libpcap setup ...

my $link_class;
if ($linktype == Net::Pcap::DLT_EN10MB) {
    $link_class = 'Net::Frame::Layer::ETH';
} elsif ($linktype == Net::Pcap::DLT_LINUX_SLL) {
    $link_class = 'Net::Frame::Layer::SLL';
} else {
    die "unknown link layer: $linktype\n";
}

... run libpcap loop ...

sub process_packet ($user_data, $header, $packet) {
    my $p_link = $link_class->new(raw => $packet);
    $p_link->unpack;
    my $p_ip4 = Net::Frame::Layer::IPv4->new(raw => $p_link->payload);
    $p_ip4->unpack;
    my $p_udp = Net::Frame::Layer::UDP->new(raw => $p_ip4->payload);
    $p_udp->unpack;
    my $p_dns = Net::DNS::Packet->decode( \$p_udp->payload );

    if (my @a = $p_dns->answer) {
        my @q = $p_dns->question;
        my $q = $q[0];

        my $dns_q = Net::DNS::Packet->new();
        $dns_q->push(question => $q);
        my $kres = $resolver->send($dns_q);
        my @ka = $kres->answer;

        ... compare @a (pdns answer) vs @ka (knot answer) modulo some known differences ...
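The comparison itself is elided above; one simple approach (an illustrative sketch, not our actual script) is to canonicalise each answer set to sorted text and compare, with extra normalisation for the known differences:

# Illustrative only: canonicalise each answer set to sorted, lowercased
# presentation format and compare the resulting strings. A real comparison
# would also normalise TTLs and other known differences.
sub rrset_key {
    my (@rrs) = @_;
    return join "\n", sort map { lc $_->string } @rrs;
}

if (rrset_key(@a) ne rrset_key(@ka)) {
    print "MISMATCH for ", $q->string, "\n";
}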
Mostly it showed that everything was working as expected, though there were a few interesting edge cases that ended up needing to be dealt with.
Non-terminal nodes and wildcards
The biggest subtle difference we discovered was around non-terminal nodes and wildcards.
This behaviour is subtly documented for the tinydns axfr-get program:
axfr-get does not precisely simulate BIND’s handling of *.dom. Under BIND, records for *.dom do not apply to y.dom or anything.y.dom if there is a normal record for x.y.dom. With axfr-get and tinydns, the records apply to y.dom and anything.y.dom except x.y.dom.
Knot DNS follows the traditional BIND and RFC 1034 interpretation of wildcard records. Our PowerDNS backend was built to follow the TinyDNS model, because that's what we were migrating from at the time. This can cause subtle differences in DNS results if you have any wildcard domains.
An example: if you have a domain example.com set up at Fastmail with our standard DNS configuration, we add default A records for *.example.com. If you then add a DNS record for the subdomain *.foo, that creates a record for *.foo.example.com. However, the implied existence of foo.example.com means that the *.example.com wildcard no longer applies to foo.example.com. At least, that's true in the Knot/BIND DNS implementation, but not in our PowerDNS backend implementation, so you get subtly different results.
In quite a few cases this won't be a problem. We see queries like _domainkey.example.com A, which return an IP under PowerDNS but won't under Knot, and that's fine: nothing should really be using that IP anyway, it's probably just some gateway device somewhere doing DNS queries for domains passively seen in email headers or the like. But in some cases it might matter.
We initially thought we could fix this automatically. For everyone that has a foo.bar.example.com subdomain without a bar.example.com subdomain, if there are any wildcard *.example.com records, we make copies of them at bar.example.com. This should make everything "just work".
The problem is, there’s actually a deeper problem that affects basically every domain and every single subdomain in a subtle way. As noted, by default we publish *.example.com
records. However these effectively work for all sub-sub domains as well. For example I have a domain uberengineer.com
and the standard *.uberengineer.com
wildcard means that all sub-domains, sub-sub-domains, etc resolve:
# dig +short that.uberengineer.com
103.168.172.37
103.168.172.52
# dig +short this.that.uberengineer.com
103.168.172.37
103.168.172.52
Now I have a TXT record at test.uberengineer.com, so that hides the A records:
# dig +short test.uberengineer.com
#
But in our PowerDNS implementation, that doesn't hide sub-domains of test.uberengineer.com from the original wildcard:
# dig +short foo.test.uberengineer.com
103.168.172.37
103.168.172.52
But with Knot, it does:
# dig +short foo.test.uberengineer.com @knottest.internal
#
So to fix users automatically, for every single subdomain X they have configured, if there's a wildcard at an ancestor level, we'd have to explicitly create a copy of all those wildcard records at *.X as well. This was going to be just way too much magic, especially for a rare edge case that possibly no one was even relying on anyway!
In the end, we analysed the DNS queries we were seeing and also all the email deliveries we saw. We looked at the email logs on our MX servers for every RCPT TO address, checked whether the domain being delivered to is hosted by us, and compared whether it would resolve under Knot as well. Combining that data convinced us that no one was going to be actively affected by this change.
Conclusion
This has now been running in production for over a year and has been working extremely well. We were able to remove two existing systems and replace them with a single consistent Knot DNS based system for all our Fastmail domains and hundreds of thousands of user domains. The triggers that track zones needing a rebuild work reliably and consistently. The new system performs enormously better than the old PowerDNS pipe backend system, and we can easily scale it by adding additional servers if needed.
This all fits with a mantra we've been working with recently: "fewer better ways". We've been running an email service for over 25 years, and it's easy to accumulate a plethora of different services and systems with varying levels of polish and performance. Revisiting what you're doing and running, and consolidating to a smaller number of systems working in better ways, can reduce long-term debt. It makes systems more reliable and easier to manage, and allows new staff to understand them more quickly.