The Websocket Deploy Bug
I’ve run into the same bug three times, at three different companies: users get disconnected from long-running websocket sessions on every new deploy. If you build an app with websockets, you’re quite likely to hit it too. Here are some quick notes on it.
When you deploy a change to your app, virtually every modern platform uses some form of blue-green deploy, i.e.:
- There’s a load balancer that sends requests to one or more containers (“blue” servers)
- A new deploy gets kicked off, triggering a build and launching one or more containers running the new code (“green” servers)
- All traffic continues going to the blue servers until the green servers pass health checks (a minimal health-check endpoint is sketched after this list).
- Then, the load balancer points to the green servers instead, no longer sending requests to the old blue servers. The green servers now become the new blue servers.
- The old blue servers stay alive in a “draining” or “deregistering” state: they remain online for a grace period to finish existing work, but receive no new requests.
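For concreteness, the health check in the third step is typically just an HTTP endpoint that the load balancer polls until it returns a success status. A minimal Node/TypeScript sketch, with a hypothetical /healthz path:

```ts
import { createServer } from "node:http";

// A minimal health-check endpoint. The load balancer polls this until it
// returns 200, at which point the green server is considered ready.
const server = createServer((req, res) => {
  if (req.url === "/healthz") {
    // Only report ready once the app has finished its own startup work
    // (DB connections, caches, etc.).
    res.writeHead(200, { "Content-Type": "text/plain" });
    res.end("ok");
    return;
  }
  res.writeHead(404);
  res.end();
});

server.listen(8080);
```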
For standard requests, the cap on the draining phase (which usually defaults to somewhere between 60 and 300 seconds) is fine, but apps that rely on long-lived websockets will see spontaneous disconnects here: any connection still open when the grace period ends gets cut off.
For example:
- You have an AI voice companion that users talk to with audio & video in their browser, via websockets. Users frequently talk to the agent for up to 30 minutes.
- You have an LLM research agent that runs many steps of research, taking up to 20 minutes, sending back results to the user in realtime. You might even have a much longer persistent websocket that encompasses many research tasks.
This bug’s symptoms can be confusing: users report random disconnects or crashes some time after you push an update. It can be hard to detect, since your server-side logs probably don’t report it as an error, and you might ascribe it to random network flakiness.
So, how do you fix this?
Option 1: Just reconnect
Depending on your task, simply reconnecting often works: when you reconnect, you land on one of the new (green) servers. You need to be robust to random network failures anyway, so this code path likely exists already.
This isn’t great. Your users are probably going to see hiccups or error messages, unless your application lends itself to this sort of reconnection really seamlessly.
One other note: this requires you to be reasonably stateless. You’ll lose any complex state stored on the server itself, so hopefully you weren’t storing it there, or you have a way for the new server to rehydrate the appropriate state.
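As a rough illustration, here’s a minimal browser-side reconnect loop with capped exponential backoff; the URL is a placeholder, and the rehydration step is left as a comment since it’s specific to your app:

```ts
function connect(url: string, attempt = 0): void {
  const ws = new WebSocket(url);

  ws.onopen = () => {
    // Reset the backoff once connected, and rehydrate whatever state the
    // new server needs (session id, resume token, ...).
    attempt = 0;
  };

  ws.onmessage = (event) => {
    // Handle application messages as usual.
    console.log("message", event.data);
  };

  ws.onclose = () => {
    // Fires for both random network drops and deploy-time disconnects:
    // retry on a fresh connection with capped exponential backoff.
    const delay = Math.min(1000 * 2 ** attempt, 30_000);
    setTimeout(() => connect(url, attempt + 1), delay);
  };
}

connect("wss://example.com/socket");
```

The same close handler covers random network drops and deploy-time disconnects alike, which is why this path often exists already.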
Option 2: Lengthen the draining phase
This is quite an easy fix: it usually just means changing a config to increase how long a server is kept alive after a new version has superseded it. It doesn’t truly resolve the issue unless websocket lifetimes are bounded.
However, it usually does ameliorate the problem. For example, if you can raise the limit to an hour, then only sessions that have gone on for over an hour will experience disruption. This, combined with the “just reconnect” approach for a reasonably stateless app, can mostly suppress the bug.
Concretely, on AWS’s Elastic Load Balancer, the default is 5 minutes, but it can be raised to an hour (link). Render, a smaller container platform, is similar: it defaults to 30 seconds and can be raised to 5 minutes (link).
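As a sketch of the AWS side, here’s how you might raise the deregistration delay on an ALB target group to an hour with the AWS SDK for JavaScript v3. The target group ARN is a placeholder, and in practice you’d more likely set this in the console or in infrastructure-as-code:

```ts
import {
  ElasticLoadBalancingV2Client,
  ModifyTargetGroupAttributesCommand,
} from "@aws-sdk/client-elastic-load-balancing-v2";

const client = new ElasticLoadBalancingV2Client({});

// Raise the deregistration (draining) delay on the target group to 1 hour.
await client.send(
  new ModifyTargetGroupAttributesCommand({
    TargetGroupArn:
      "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-app/abc123", // placeholder
    Attributes: [
      { Key: "deregistration_delay.timeout_seconds", Value: "3600" },
    ],
  })
);
```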
Option 3: Reconnect at a checkpoint
Certain apps have websockets that span both periods where it’s safe to reconnect and periods where reconnecting would be disruptive.
For example: your server is streaming a chat response to the client, and a reconnection would disrupt it. However, after the client has received the full message, and before the user types their next one, it’s safe to reconnect.
Since the server is usually told that it’s entering the draining phase via a SIGTERM signal, you can listen for SIGTERM and send a websocket message telling the client to reconnect when it’s next idle. Combine this with a longer draining phase, and all clients should be able to gracefully move over to an updated instance.
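A minimal server-side sketch, assuming Node with the ws package; the reconnect_when_idle message type is made up for this example:

```ts
import { WebSocketServer, WebSocket } from "ws";

const wss = new WebSocketServer({ port: 8080 });

wss.on("connection", (socket) => {
  // ... normal application message handling goes here ...
});

// The platform sends SIGTERM when this instance enters the draining phase.
process.on("SIGTERM", () => {
  for (const client of wss.clients) {
    if (client.readyState === WebSocket.OPEN) {
      // Tell each client to reconnect at its next safe checkpoint.
      client.send(JSON.stringify({ type: "reconnect_when_idle" }));
    }
  }
  // Don't exit immediately: keep serving in-flight sessions and let the
  // platform stop the instance when the draining window runs out.
});
```

On the client, set a flag when this message arrives, and once the current exchange finishes, close the socket and reuse the reconnect logic from Option 1.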