We built a live polling app. It worked great for 50 users. Then a streamer with 50,000 viewers used it. Our servers melted in 3 seconds. This is the post-mortem.
The Problem with Stateful Connections
Unlike REST API requests, which are stateless (request -> response -> close), a WebSocket connection stays open. Each connection consumes a file descriptor on the server and RAM to keep the heartbeat alive.
A single Node.js server can handle maybe 2k-5k concurrently active connections before event-loop lag becomes noticeable.
Challenge 1: The Load Balancer Trap
We use Nginx. By default, Nginx round-robins requests. But WebSocket connections are long-lived: once a user connects to Server A, they stay pinned to Server A for hours.
The Issue: If Server A crashes and restarts, all 5,000 of its users try to reconnect instantly. Nginx sends them all to Server B (because A is dead), and Server B crashes under the load. This is the classic “Thundering Herd.”
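A standard client-side mitigation for this is exponential backoff with jitter, so a dead server's clients spread their reconnects out instead of stampeding at once. A minimal sketch (the base/cap values and the usage snippet are illustrative, not our actual client code):

```javascript
// Sketch: exponential backoff with "full jitter" for WebSocket reconnects.
function reconnectDelayMs(attempt, baseMs = 500, capMs = 30_000) {
  // Exponential growth in the ceiling, capped so retries never wait forever.
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  // Full jitter: pick uniformly in [0, ceiling) so clients desynchronize.
  return Math.random() * ceiling;
}

// Hypothetical usage in a client reconnect loop:
// socket.on('close', () => {
//   setTimeout(connect, reconnectDelayMs(attempt++));
// });
```

The jitter is the important part: pure exponential backoff still lands every client's retry on the same schedule.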
Solution: Architecture Overhaul
We couldn’t rely on sticky sessions. We needed a pub/sub layer.
Layer 1: The Edge (terminators)
We deployed 10 small Node.js instances whose ONLY job is to hold the WebSocket connections. They use almost no CPU; they just forward messages.
Layer 2: Redis Pub/Sub
When a user votes, the Edge Server publishes a message to Redis: PUBLISH poll:123 "vote_A".
Layer 3: The Workers
Backend workers subscribe to Redis. They process the vote, write to the database, and then publish the updated count back to Redis.
The “Fan-Out” Problem
Here is where it gets tricky. If 10k users are connected to the same poll, and the count updates, we need to send 10k messages instantly.
Redis Pub/Sub is fast, but JSON.stringify in Node.js is synchronous CPU work. Stringifying the same message 10,000 times blocked our event loop.
Optimization: We call Buffer.from(JSON.stringify(msg)) once and send the same binary buffer to all sockets. Pre-serializing the message saved us 80ms of latency per broadcast.
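The fan-out then looks roughly like this. The fake recording sockets below are a stand-in for real WebSocket clients (libraries like the `ws` package accept Buffers in send()):

```javascript
// Sketch: serialize the broadcast payload once, reuse the Buffer for every socket.
function broadcast(sockets, msg) {
  const payload = Buffer.from(JSON.stringify(msg)); // stringify exactly once
  for (const ws of sockets) {
    ws.send(payload);
  }
  return payload;
}

// Fake sockets that record what they were sent:
const sockets = Array.from({ length: 3 }, () => ({
  sent: [],
  send(buf) { this.sent.push(buf); },
}));

const payload = broadcast(sockets, { poll: 123, counts: { A: 2, B: 1 } });
// Every socket received the *same* Buffer instance: zero re-serialization.
```

With 10k sockets the loop itself is cheap; it was the 10k redundant stringify calls that hurt.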
Ephemeral vs. Persistent State
Do we need to save every single vote immediately to Postgres? No. We moved to an “Eventual Consistency” model.
Votes are incremented in Redis (Atomic INCR). A background cron job syncs the final tally to Postgres every 10 seconds. If the server crashes, we might lose 10 seconds of data, but the app stays alive. User experience > 100% data durability for a polling app.
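In outline, the hot path is a single atomic increment and the durable write happens off to the side. A sketch with in-memory stand-ins for Redis and Postgres (the names redisIncr and flushToPostgres are illustrative, not our real code):

```javascript
// Redis stand-in: key -> count. In production: redis.incr(key), which is atomic.
const counters = new Map();
function redisIncr(key) {
  counters.set(key, (counters.get(key) || 0) + 1);
}

// Postgres stand-in. In production: UPSERT the running tally per poll option.
const db = new Map();
function flushToPostgres() {
  for (const [key, count] of counters) {
    db.set(key, count);
  }
}

// Hot path: every vote is one in-memory increment, no DB round-trip.
redisIncr('poll:123:A');
redisIncr('poll:123:A');
redisIncr('poll:123:B');

// Background job, e.g. setInterval(flushToPostgres, 10_000);
flushToPostgres();
```

The database sees one write per key per sync window instead of one write per vote.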
Conclusion
Vertical scaling (bigger servers) fails for WebSockets. You must design for horizontal scaling from Day 1 using a Pub/Sub architecture.