May 28, 2013

Push events, NAT TCP connection timeouts, and device sleep

Historical

Rob Mueller

Founder & CTO

This is a technical post. Regular FastMail users subscribed to receive email updates from the FastMail blog can just ignore this post.

When we released the new user interface last year, one of the improvements included was push updates when new emails arrived.

In theory, push events are conceptually quite easy to do. We open a connection from your web browser to the server (see this blog post for details), then when a new email arrives, we send a message down the open connection to let the browser know. It then fetches the details about the new email(s) and refreshes the display.

Unfortunately, in the real world, it’s not quite that easy. The biggest problem is that when a mailbox is mostly idle (no new mail arriving), the connection from the browser to the server will be idle. While this shouldn’t be a problem, it turns out it often is.

As we have noted before, some of our users are behind NAT gateways/stateful firewalls that have short state timeouts. If you leave a TCP connection idle for too long (variable from 2 to 30 minutes depending on the device), these start dropping any new packets on the connection.

In the case of a push connection from the server to the client, this is particularly bad. When a new email arrives, the server will try and send data to the client, and then be told the connection is dead at that point. That’s fine for the server, it can then clean up the connection. However, the client will never see any data from the server, and neither will the client ever know that the gateway/firewall has broken the connection. The client will think it is still connected to the server and has no way of knowing that the connection has actually been broken. This is purely a consequence of the way the TCP protocol works. The only way for the client to be able to tell the connection is broken is to send some data down the connection, and there are only 2 ways that can happen.

If the client has enabled TCP keepalive on the socket. Currently
only Chrome on Windows does
this.
If the client sent some data down the connection to the server.
Unfortunately the eventsource
specification doesn’t provide
any way to do this; it basically assumes the underlying TCP
connection is always reliable and only the server can send to the
client.

One way to try and work around this issue is for the server to send regular “ping” events to the client, sufficiently often that the gateway/firewall knows the connection is still alive. This is relatively straightforward to do, but causes other problems.

If the ping events come too fast, it can cause some clients to never go into sleep mode. For instance, we used to send ping events every 60 seconds. It was noted that on an iPad if you left the FastMail webpage open in Safari and put the iPad down, the iPad itself would never actually go to sleep. The screen would stay on, draining the battery very quickly.

Because of that, we decided to go the other direction and disabled the ping events, but that ends up back at the other end of the scale where sometimes push just seems to randomly stop working.

As there’s no perfect solution to this problem, we’re now changing again to a new trade off.

The server will send regular “ping” events to the client at 5 minute
intervals. This should be enough for most gateways/firewalls to keep
the connection open, but long enough apart to allow devices to go to
sleep.
If the client doesn’t see a ping event after 6 minutes, it assumes
the connection has died, drops the existing connection and creates a
new one. This should at least allow push events to work to some
extent on connections with gateways/firewalls with low timeouts.

This change has now been rolled out everywhere. Based on initial testing, we think that this time we’ve got the balance between theory and reality right.