October 27, 2011

TCP keepalive, iOS 5 and NAT routers

Historical

Rob Mueller

Founder & CTO

This post contains some very technical information. For users just interested in the summary:

If over the next week you experience an increase in frozen, non-responding or broken IMAP connections, please contact our support team (use the “Support” link at the top of the http://www.fastmail.fm homepage) with details. Please make sure you include your operating system, email software, how you connect to the internet, and what modem/router/network connection you use in your report.

The long story: The IMAP protocol is designed as a long lived connection protocol. That is, your email client connects from your computer to the server, and stays connected for as long as possible.

In many cases, the connection remains open, but completely idle for extended periods of time while your email client is running but you are doing other things.

In general while a connection is idle, no data at all is sent between the server and the client, but they both know the connection still exists, so as soon as data is available on one side, it can send it to the other just fine.

There is a problem in some cases though. If you have a home modem and wireless network, then you are usually using a system called NAT that allows multiple devices on your wireless network to connect to the internet through one connection. For NAT to work, your modem/router must keep a mapping for every connection from any device inside your network to any server on the internet.

The problem is some modems/routers have very poor NAT implementations that “forget” the NAT mapping for any connection that’s been idle for 5 minutes or more (some appear to be 10 minutes or more). What this means is that if an IMAP connection remains idle with no communication for 5 minutes, then the connection is broken.

In itself this wouldn’t be so bad, but the way the connection is broken is that rather than telling the client “this connection has been closed”, packets from the client or server just disappear which causes some nasty user visible behaviour.

The effect is that if you leave your email client idle for 5 minutes and the NAT connection is lost, if you then try and do something with the client (e.g. read or move an email), the client tries to send the appropriate command to the server. But the TCP packets that contain the command never arrive at the server, but neither are RST packets sent back that would tell the client that there’s any problem with the connection, the packets just disappear. So the local computer tries to send again after a timeout period, and again a few more times, until usually about 30 seconds later, it finally gives up and marks the connection as dead, and finally sends that information back up to the email client, which shows some “connection was dropped by the server” type message.

From a user perspective, it’s a really annoying failure mode that looks like a problem with our server, even though it’s really because of a poor implementation of NAT in their modem.

However this is a workaround for this. At the TCP connection level, there’s a feature called keepalive that allows the operating system to send regular “is this connection still open?” type packets back and forth between the server and the client. By default keepalive isn’t turned on for connections, but it is possible to turn it on via a socket option. nginx, our frontend IMAP proxy, allows you to turn this on via a so_keepalive configuration option.

However even after you’ve enabled this option, the default time between keepalive “ping” packets is 2 hours. Fortunately again, there’s a Linux kernel tuneable net.ipv4.tcp_keepalive_time that lets you control this value.

By lowering this value to 4 minutes, it causes TCP keepalive packets to be sent over open but idle IMAP connections from the server to the client every 4 minutes. The packets themselves don’t contain any data, but what they do do is cause any existing NAT connection to be marked as “alive” on the users modem/router. So poor routers with NAT connections that would normally timeout after 5 minutes of inactivity are kept alive, so the user doesn’t see the nasty broken connection problem described above, and neither is there a visible downside to the user either.

So this is how things have been for the last 4-5 years, which has worked great.

Unfortunately, there’s a new and recent problem that has now appeared.

iOS 5 now uses long lived persistent IMAP connections (apparently previous versions only used short lived connections). The problem is that our ping packets every 4 minutes mean that the device (iPhone/iPad/iPod) is “woken up” every 4 minutes as well. This means the device never goes into a deeper sleep mode, which causes significantly more battery drain when you setup a connection to the FastMail IMAP server on iOS 5 devices.

Given the rapid increase in use of mobile devices like iPhones, and the big difference in battery life it can apparently cause, this is a significant issue.

So we’ve decided to re-visit the need for enabling so_keepalive in the first place. Given the original reason was due to poor NAT routers with short NAT table timeouts, that was definitely an observed problem a few years back, but we’re not sure how much of a problem it is now. It’s possible that the vast majority of modems/routers available in the last few years have much better NAT implementations. Unfortunately there’s no way to easily test this, short of actually disabling keepalive, and waiting for users to report issues.

So we’ve done that now on mail.messagingengine.com, and we’ll see over the next week what sort of reports we get. Depending on the number, there’s a few options we have:

If there’s lots of problem reports, we’d re-enable keepalive by
default, but setup an alternate server name like
mail-mobile.messagingengine.com that has keepalive disabled, and
tell mobile users to use that server name instead. The problem with
this is many devices now have auto configuration systems enabled, so
users don’t even have to enter a server name, so we’d have to work
out how to get that auto configuration to use a different server
name
If there’s not many problem reports, we’d leave keepalive off by
default, but setup an alternative server name like
mail-keepalive.messagingengine.com that has keepalive enabled, and
for users that report connection “freezing” problems, we’d tell them
to switch to using that server name instead
Ideally, we’d detect what sort of client was connecting, and turn
on/off keepalive as needed. This might be possible using software
like p0f, but integrating that
with nginx would require a bit of work, and still leaves you with
the problem of an iPhone user that is usually in their office/home
all day and uses a wireless network with a poor NAT router, would
they prefer the longer battery life, or better connectivity
experience.

I’ll update this post in a week or two when we have some more data.