39.7 Scaling WebSocket Services: Sticky Sessions and Redis Pub/Sub

Right, so you’ve built a single-server WebSocket handler. It’s beautiful. It works. You feel like a genius. Then you try to deploy a second instance behind a load balancer, and suddenly User A on server 1 is sending messages into the void, trying to reach User B who is happily connected to server 2. Your brilliant real-time app has become a masterclass in disappointment. Welcome to the distributed systems party; it’s messy, and everyone’s here.

The core problem is state. Your WebSocket connections are stateful: each one is an open TCP connection held in the memory of one specific server. A standard load balancer is stateless and dumb (in the kindest way); it just routes each new request according to its algorithm (round-robin, least connections, etc.). It has no idea that an HTTP upgrade request is for a long-lived connection that needs to stick to a specific backend.

The Sticky Session: A Necessary Evil

The quickest fix is to make your load balancer stateful. This is called “sticky sessions” or “session affinity.” You configure your load balancer (NGINX, HAProxy, your cloud provider’s ALB) to send all requests from a given user session to the same backend server.

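In NGINX, you might use the ip_hash directive or, better, a cookie-based approach (note that the sticky cookie directive ships with the commercial NGINX Plus, not the open-source build):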

upstream backend {
    server server1.example.com;
    server server2.example.com;
    sticky cookie srv_id expires=1h domain=.example.com path=/;
}

This makes the load balancer set a cookie telling it which server to use. It’s simple and it works. But let’s be honest, it’s a bit of a kludge. You’re papering over the fundamental statelessness of HTTP. What happens when that specific server dies? The user’s connection drops, their sticky cookie is now a lie, and they get routed to a new server that has no memory of them. Poof. There goes the user experience. It’s better than nothing, but it’s not truly resilient.
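
To make the failure mode concrete, here is a minimal Go sketch of hash-based affinity, similar in spirit to ip_hash. It is not NGINX’s actual algorithm; pickBackend and withoutBackend are hypothetical names used only for illustration. The same client key always lands on the same backend while the backend list is unchanged, and the client is silently remapped once its server is removed.

```go
package main

import "hash/fnv"

// pickBackend maps a client key (an IP address or a session ID) to one of the
// backends by hashing it. For a fixed backend list the same key always yields
// the same backend, which is the essence of session affinity.
func pickBackend(clientKey string, backends []string) string {
    if len(backends) == 0 {
        return ""
    }
    h := fnv.New32a()
    h.Write([]byte(clientKey)) // Write on a hash.Hash never returns an error
    return backends[h.Sum32()%uint32(len(backends))]
}

// withoutBackend returns a copy of backends with the dead one removed,
// simulating a server that failed its health check.
func withoutBackend(backends []string, dead string) []string {
    alive := make([]string, 0, len(backends))
    for _, b := range backends {
        if b != dead {
            alive = append(alive, b)
        }
    }
    return alive
}
```

Calling pickBackend("203.0.113.7", backends) repeatedly returns the same server. Pass the result through withoutBackend and call it again, and the client ends up on a different server that holds none of its in-memory state, which is exactly the gap the rest of this section addresses.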

The Right Way: Sharing State with Redis Pub/Sub

To build something that doesn’t crumble when a server dies, you need to externalize the communication between your application servers. You need a shared message bus. This is where Redis and its Publish/Subscribe feature become your new best friend.

The architecture is elegantly simple:

Each application server instance connects to the same central Redis instance.
When a message comes in from a WebSocket connection on Server A, instead of trying to send it directly to a connection on Server B (which it can’t see), it publishes the message to a Redis channel.
Every server (including Server A) is subscribed to that same Redis channel.
When a message is published, Redis pushes it to all subscribed servers.
Each server then checks whether the intended recipient is connected to it. If so, it sends the message down the local WebSocket. If not, it silently ignores the message.

It’s like a town crier yelling a message for “Bob” in the town square. Every house (server) hears it, but only the house where Bob actually lives will respond.
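
The town-crier flow can be simulated in plain Go without Redis, which makes the fan-out-then-filter behavior easy to see. This is only a sketch: bus, server, and envelope are hypothetical stand-ins for the Redis channel, an application instance, and a pub/sub message, and "delivery" is recorded in a map instead of being written to a socket.

```go
package main

import "sync"

// envelope mirrors the idea of a pub/sub message addressed to one user.
type envelope struct {
    TargetUserID string
    Content      string
}

// server stands in for one application instance and the users connected to it.
type server struct {
    name      string
    mu        sync.Mutex
    local     map[string]bool     // user IDs with a WebSocket on this instance
    delivered map[string][]string // user ID -> messages "sent" to that user
}

func newServer(name string, users ...string) *server {
    s := &server{name: name, local: map[string]bool{}, delivered: map[string][]string{}}
    for _, u := range users {
        s.local[u] = true
    }
    return s
}

// onMessage is what every instance does when the bus pushes it a message:
// deliver if the recipient is local, otherwise drop the message silently.
func (s *server) onMessage(e envelope) {
    s.mu.Lock()
    defer s.mu.Unlock()
    if s.local[e.TargetUserID] {
        s.delivered[e.TargetUserID] = append(s.delivered[e.TargetUserID], e.Content)
    }
}

// bus plays the role of the Redis channel: publish fans out to all subscribers.
type bus struct {
    mu   sync.RWMutex
    subs []*server
}

func (b *bus) subscribe(s *server) {
    b.mu.Lock()
    b.subs = append(b.subs, s)
    b.mu.Unlock()
}

func (b *bus) publish(e envelope) {
    b.mu.RLock()
    defer b.mu.RUnlock()
    for _, s := range b.subs {
        s.onMessage(e) // every server hears it, including the sender's
    }
}
```

With alice on one server and bob on another, publishing an envelope for bob reaches both servers, but only bob’s server records a delivery. A message for a user who is connected nowhere is dropped by everyone, just as in the real design.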

Here’s how you wire this up in Go. First, let’s model a central Hub to manage local connections and the Redis connection.

package main

import (
    "context"
    "encoding/json"
    "log"
    "sync"

    "github.com/go-redis/redis/v8"
    "github.com/gorilla/websocket"
)

// Message defines the structure of our pub/sub messages
type Message struct {
    TargetUserID string `json:"targetUserId"`
    Content      string `json:"content"`
}

// Hub holds the state of our application
type Hub struct {
    // Local connections on this server
    connections map[string]*websocket.Conn
    connLock    sync.RWMutex

    // Redis client for pub/sub
    redisClient *redis.Client
}

func NewHub(redisAddr string) *Hub {
    rdb := redis.NewClient(&redis.Options{Addr: redisAddr})
    h := &Hub{
        connections: make(map[string]*websocket.Conn),
        redisClient: rdb,
    }
    // Start listening for messages from Redis in a goroutine
    go h.subscribeToMessages()
    return h
}

// Handle a new WebSocket connection
func (h *Hub) HandleConnection(conn *websocket.Conn, userID string) {
    // 1. Add the connection to our local map
    h.connLock.Lock()
    h.connections[userID] = conn
    h.connLock.Unlock()

    // 2. Handle incoming messages from this WS connection
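    defer func() {
        h.connLock.Lock()
        // Only remove the entry if it still points at this connection, so a
        // reconnect by the same user is not wiped out by the old handler.
        if h.connections[userID] == conn {
            delete(h.connections, userID)
        }
        h.connLock.Unlock()
        conn.Close()
    }()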

    for {
        _, p, err := conn.ReadMessage()
        if err != nil {
            break
        }
        // 3. Incoming message! Let's publish it to Redis.
        var msg Message
        if err := json.Unmarshal(p, &msg); err != nil {
            log.Printf("Invalid message format: %v", err)
            continue
        }
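        // The sender is implicitly this user, so we might set that here too.
        payload, err := json.Marshal(msg)
        if err != nil {
            log.Printf("Failed to marshal message: %v", err)
            continue
        }
        // Publish returns a command object; check it so failures are not silent.
        if err := h.redisClient.Publish(context.Background(), "websocket_messages", payload).Err(); err != nil {
            log.Printf("Failed to publish to Redis: %v", err)
        }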
    }
}

// subscribeToMessages listens to Redis and dispatches messages locally
func (h *Hub) subscribeToMessages() {
    ctx := context.Background()
    sub := h.redisClient.Subscribe(ctx, "websocket_messages")
    defer sub.Close()

    for {
        // This is a blocking call
        msg, err := sub.ReceiveMessage(ctx)
        if err != nil {
            log.Printf("Redis sub error: %v", err)
            return
        }

        var message Message
        if err := json.Unmarshal([]byte(msg.Payload), &message); err != nil {
            log.Printf("Failed to unmarshal pub/sub message: %v", err)
            continue
        }

        // Check if the target user is connected to THIS server
        h.connLock.RLock()
        targetConn, exists := h.connections[message.TargetUserID]
        h.connLock.RUnlock()

        if exists {
            // Send the message down the local WebSocket connection
            if err := targetConn.WriteJSON(message); err != nil {
                log.Printf("Failed to send to local connection: %v", err)
            }
        }
        // If the user isn't here, we just ignore the message. No problem.
    }
}
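
Note that subscribeToMessages returns on the first Redis error, so the server stops receiving messages for good. One possible way to wrap such a loop in a retry with exponential backoff is sketched below. It assumes a variant of the subscribe function that takes a context and returns its error instead of logging it; runWithBackoff, backoffDelay, and demoRetry are hypothetical names, and the fake subscriber in demoRetry only stands in for a real Redis subscription.

```go
package main

import (
    "context"
    "errors"
    "time"
)

// backoffDelay returns base doubled once per failed attempt, capped at max.
func backoffDelay(attempt int, base, max time.Duration) time.Duration {
    d := base
    for i := 0; i < attempt && d < max; i++ {
        d *= 2
    }
    if d > max {
        return max
    }
    return d
}

// runWithBackoff calls run until ctx is cancelled. run is expected to block
// while the subscription is healthy and to return when it breaks. After each
// return the loop waits, doubling the wait per consecutive failure.
func runWithBackoff(ctx context.Context, base, max time.Duration, run func(context.Context) error) {
    attempt := 0
    for ctx.Err() == nil {
        started := time.Now()
        err := run(ctx)
        if err == nil || time.Since(started) > max {
            attempt = 0 // it ran cleanly or stayed up for a while; start over
        }
        select {
        case <-ctx.Done():
            return
        case <-time.After(backoffDelay(attempt, base, max)):
        }
        attempt++
    }
}

// demoRetry drives runWithBackoff with a fake subscriber that fails on every
// call and cancels the context on the third; it returns the number of calls.
func demoRetry() int {
    ctx, cancel := context.WithCancel(context.Background())
    defer cancel()
    calls := 0
    runWithBackoff(ctx, time.Millisecond, 8*time.Millisecond, func(context.Context) error {
        calls++
        if calls == 3 {
            cancel()
        }
        return errors.New("connection lost")
    })
    return calls
}
```

In the Hub, the goroutine started in NewHub would call runWithBackoff with the error-returning subscribe function rather than calling the subscribe loop directly. Adding random jitter to the delay is a common refinement when many servers reconnect at once.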

The Devil’s in the Details: Pitfalls and Best Practices

This code is the blueprint, but production code is harder. Here’s what you need to watch for:

Serialization: We used JSON for clarity, but for high-throughput systems, consider a binary format like Protocol Buffers. Every byte counts when you’re broadcasting to N servers.
Channel Design: One giant channel ("websocket_messages") is simple but inefficient. Every server gets every message. For massive scale, consider sharding channels by user ID or room ID (e.g., "user:userID123"). This requires more sophisticated subscription management but drastically reduces useless traffic.
Redis Resilience: Your Redis instance is now a critical Single Point of Failure. Use Redis Sentinel or a managed cloud offering to handle failover. If Redis goes down, your entire messaging bus goes down. No pressure.
Connection Management: The subscribeToMessages goroutine exits as soon as the Redis subscription returns an error, for example when the connection drops. You need a retry mechanism with exponential backoff to reconnect and resubscribe. The code above does not have this, which is why it’s an example and not a production deployment.
Memory Leaks: The defer in HandleConnection ensures users are removed from the map when they disconnect. Forgetting this is a classic mistake that leads to memory leaks and attempts to write to closed sockets.
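
The “more sophisticated subscription management” from the Channel Design item could look roughly like this. The sketch assumes per-user channels named "user:<id>" and allows a user to hold more than one connection on a server; subscriptions, userChannel, and recordingSubscriber are hypothetical names. The channelSubscriber interface is written to match the Subscribe and Unsubscribe methods on go-redis’s *redis.PubSub, so a real PubSub could be plugged in, but here a recording fake is used so the logic runs without Redis.

```go
package main

import (
    "context"
    "sync"
)

// channelSubscriber is the subset of a pub/sub client this sketch needs.
type channelSubscriber interface {
    Subscribe(ctx context.Context, channels ...string) error
    Unsubscribe(ctx context.Context, channels ...string) error
}

// userChannel builds the per-user channel name.
func userChannel(userID string) string { return "user:" + userID }

// subscriptions counts local connections per channel so the server subscribes
// on a user's first connection and unsubscribes when the last one closes.
type subscriptions struct {
    mu    sync.Mutex
    sub   channelSubscriber
    count map[string]int
}

func newSubscriptions(sub channelSubscriber) *subscriptions {
    return &subscriptions{sub: sub, count: make(map[string]int)}
}

func (s *subscriptions) userConnected(ctx context.Context, userID string) error {
    s.mu.Lock()
    defer s.mu.Unlock()
    ch := userChannel(userID)
    if s.count[ch] == 0 {
        if err := s.sub.Subscribe(ctx, ch); err != nil {
            return err
        }
    }
    s.count[ch]++
    return nil
}

func (s *subscriptions) userDisconnected(ctx context.Context, userID string) error {
    s.mu.Lock()
    defer s.mu.Unlock()
    ch := userChannel(userID)
    if s.count[ch] == 0 {
        return nil // nothing tracked for this user
    }
    s.count[ch]--
    if s.count[ch] > 0 {
        return nil // other local connections still need the channel
    }
    delete(s.count, ch)
    return s.sub.Unsubscribe(ctx, ch)
}

// recordingSubscriber is a fake that logs calls instead of talking to Redis.
type recordingSubscriber struct{ calls []string }

func (r *recordingSubscriber) Subscribe(_ context.Context, channels ...string) error {
    for _, c := range channels {
        r.calls = append(r.calls, "subscribe "+c)
    }
    return nil
}

func (r *recordingSubscriber) Unsubscribe(_ context.Context, channels ...string) error {
    for _, c := range channels {
        r.calls = append(r.calls, "unsubscribe "+c)
    }
    return nil
}

// demoSharding shows one user opening two connections and closing both.
func demoSharding() []string {
    ctx := context.Background()
    rec := &recordingSubscriber{}
    subs := newSubscriptions(rec)
    subs.userConnected(ctx, "alice")    // first connection: subscribes
    subs.userConnected(ctx, "alice")    // second connection: already subscribed
    subs.userDisconnected(ctx, "alice") // one left: stays subscribed
    subs.userDisconnected(ctx, "alice") // last one gone: unsubscribes
    return rec.calls
}
```

The publisher side changes to match: instead of publishing to "websocket_messages", it publishes to userChannel(msg.TargetUserID), so only servers that currently host that user receive the message.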

The combination of sticky sessions for connection affinity and Redis Pub/Sub for message distribution is the pragmatic, industry-standard way to scale WebSockets. It accepts the inherent chaos of distributed systems and builds a robust, understandable structure on top of it. Now go build something that doesn’t break when you need it most.
