Your chat app works fine with 100 users. At 20k concurrent connections, everything breaks. Here's why.
Node.js WebSocket Limits
Node.js handles WebSockets through its event loop. No threading required, which is great until it's not.
Where it breaks:
- File descriptor limit (the Linux default is 1024; raise it with `ulimit -n 65536`)
- Memory usage around 100KB per connection
- Event loop blocks on slow database queries
- Garbage collection pauses get longer
Most Node.js apps die somewhere between 15k and 25k connections. I've seen this exact pattern on AWS, GCP, and bare metal servers.
The death spiral always starts the same way: event loop lag spikes to 500ms+, garbage collection pauses hit 2-3 seconds, then the OS starts killing processes with OOM. Your monitoring shows everything is fine until suddenly it's not.
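You can see the spiral coming from inside the process. Here's a minimal sketch using Node's built-in perf_hooks and process.memoryUsage() - the 200ms warning threshold and 5-second interval are just examples, tune them against your own baseline.

```js
// Log event loop lag and heap usage so the death spiral is visible before OOM.
const { monitorEventLoopDelay } = require('perf_hooks');

const histogram = monitorEventLoopDelay({ resolution: 20 });
histogram.enable();

setInterval(() => {
  const lagMs = histogram.max / 1e6; // histogram values are in nanoseconds
  const heapMb = process.memoryUsage().heapUsed / 1024 / 1024;

  console.log(`event loop lag (max): ${lagMs.toFixed(1)}ms, heap: ${heapMb.toFixed(0)}MB`);

  if (lagMs > 200) {
    console.warn('event loop lag over 200ms - blocking work or long GC pauses');
  }
  histogram.reset();
}, 5000);
```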
Connection Problems at Scale
What happens around 20k connections:
- Random connection timeouts
- Messages start getting dropped
- 10-30 second server freezes during garbage collection
- Memory usage climbs to 2GB+ and keeps going
- Response times go from 50ms to 5+ seconds
The event loop can't keep up with all the connections, and garbage collection kills whatever performance is left.
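A chunk of that memory growth is backpressure: slow clients stop draining their sockets, so every outgoing message sits in your process until they catch up. A rough guard using the ws library's bufferedAmount - the 1MB cutoff is an arbitrary example, not a recommendation:

```js
const { WebSocket } = require('ws');

const MAX_BUFFERED_BYTES = 1024 * 1024; // 1MB of unsent data per socket - example cutoff

function safeSend(ws, payload) {
  if (ws.readyState !== WebSocket.OPEN) return false;

  if (ws.bufferedAmount > MAX_BUFFERED_BYTES) {
    // This client isn't reading fast enough; dropping it beats eating your heap.
    ws.close(1008, 'slow consumer');
    return false;
  }

  ws.send(payload);
  return true;
}
```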
Socket.IO vs Native WebSockets
Socket.IO is easier:
- Handles reconnections automatically
- Falls back to long polling when WebSockets are blocked
- Built-in room management
- Event-based messaging
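For reference, here's roughly what that looks like - a minimal Socket.IO server where the room and event names are made up for the example:

```js
const { Server } = require('socket.io');

const io = new Server(3000);

io.on('connection', (socket) => {
  // Built-in room management instead of hand-rolled maps
  socket.on('join-room', (room) => {
    socket.join(room);
    io.to(room).emit('user-joined', socket.id);
  });

  // Event-based messaging with serialization handled for you
  socket.on('chat-message', ({ room, text }) => {
    io.to(room).emit('chat-message', { from: socket.id, text });
  });

  // Reconnection and the long-polling fallback are handled by the client library
});
```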
But Socket.IO has problems:
- Dies around 15k connections (vs 25k+ for native)
- Uses 2-3x more memory per connection
- Requires sticky sessions
- Extra protocol overhead on every message
Native WebSockets:
- Can handle 25k+ connections if done right
- Lower memory and CPU usage
- No sticky sessions needed
- Direct browser support
But you have to build everything:
- Room management
- Reconnection logic
- Heartbeat/keepalive
- All the debugging
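Heartbeat/keepalive is the piece people skip most often. This is the ping/pong pattern the ws library documents, more or less verbatim - the 30-second interval matches the heartbeat suggestion later in this post:

```js
const { WebSocketServer } = require('ws');

const wss = new WebSocketServer({ port: 8080 });

wss.on('connection', (ws) => {
  ws.isAlive = true;
  ws.on('pong', () => { ws.isAlive = true; });
});

// Every 30 seconds, terminate connections that never answered the last ping.
const interval = setInterval(() => {
  for (const ws of wss.clients) {
    if (!ws.isAlive) {
      ws.terminate(); // dead peer - free the file descriptor and its memory
      continue;
    }
    ws.isAlive = false;
    ws.ping();
  }
}, 30000);

wss.on('close', () => clearInterval(interval));
```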
Multi-Server Architecture
When one Node.js process can't handle the load, run multiple:
Load Balancer → Node.js Instance 1 (15k connections)
              → Node.js Instance 2 (15k connections)
              → Node.js Instance 3 (15k connections)
Each instance → Redis (shared state + pub/sub) → Database
You need:
- HAProxy or nginx for load balancing
- Multiple Node.js processes (10-15k connections each)
- Redis for sharing messages between servers
- Health checks to detect failed nodes
- Monitoring because things will break
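The health checks can be as simple as an HTTP endpoint on the same server that the load balancer polls. A sketch - the /healthz path and the 15k ceiling are assumptions you'd match to your HAProxy or nginx config:

```js
const http = require('http');
const { WebSocketServer } = require('ws');

const MAX_CONNECTIONS = 15000; // per-process ceiling from the sizing above

const server = http.createServer((req, res) => {
  if (req.url === '/healthz') {
    const connections = wss.clients.size;
    // A 503 tells the load balancer to stop routing new users to this instance
    res.writeHead(connections < MAX_CONNECTIONS ? 200 : 503, {
      'Content-Type': 'application/json',
    });
    res.end(JSON.stringify({ connections }));
    return;
  }
  res.writeHead(404).end();
});

const wss = new WebSocketServer({ server });

server.listen(8080);
```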
The State Management Problem
This is where most scaling attempts fail. When User A on Server 1 sends a message, users on Server 2 need to see it.
I watched a chat app completely fall apart during a product demo because they stored room state in Node.js memory. Users kept sending messages to empty rooms and getting confused why nobody responded.
Problems:
- Messages lost between servers (no shared state)
- Users see different room member counts (each server tracks separately)
- Connection state vanishes when servers restart (everything stored in RAM)
- Race conditions when users join/leave rooms simultaneously
Solutions:
- Store room membership in Redis sets
- Use Redis pub/sub to broadcast messages
- Heartbeat every 30 seconds to detect dead connections
- Clean up state when users disconnect
- Connection pooling for Redis
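Here's a rough sketch of that Redis layer using ioredis. The key and channel names are invented for the example, and note that a connection in subscriber mode can't run normal commands, so you need two clients:

```js
const Redis = require('ioredis');

const redis = new Redis(); // commands: sets, cleanup
const sub = new Redis();   // pub/sub only

// Sockets this process owns, grouped by room. Cross-server state lives in Redis.
const localRooms = new Map(); // room -> Set of sockets

async function joinRoom(ws, userId, room) {
  await redis.sadd(`room:${room}:members`, userId); // shared membership, survives restarts
  if (!localRooms.has(room)) localRooms.set(room, new Set());
  localRooms.get(room).add(ws);
}

async function leaveRoom(ws, userId, room) {
  await redis.srem(`room:${room}:members`, userId);
  localRooms.get(room)?.delete(ws);
}

function broadcast(room, message) {
  // Every server, including this one, picks it up through the subscription below.
  return redis.publish('chat', JSON.stringify({ room, message }));
}

sub.subscribe('chat');
sub.on('message', (_channel, raw) => {
  const { room, message } = JSON.parse(raw);
  for (const ws of localRooms.get(room) ?? []) {
    ws.send(message);
  }
});
```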
What Actually Works in Production
After debugging this shit at multiple companies, here's what actually keeps WebSocket servers alive:
Start with native WebSockets if you can handle building reconnection logic yourself. Plan for horizontal scaling from day one - don't wait until you hit 20k connections and everything breaks.
Use Redis for all shared state. Every other approach I've tried has race conditions or fails during server restarts. Redis pub/sub is the only thing that works reliably across multiple Node processes.
Monitor everything obsessively - connection counts, memory usage per server, Redis performance. The problems show up in these metrics before users complain.
Test with realistic load. Those 100 concurrent test users won't trigger garbage collection pauses or file descriptor limits. Load test with 25k connections or you're wasting your time.
Memory Leaks to Watch For
Common leak sources:
- Not cleaning up connection metadata when users disconnect
- Event listeners that never get removed
- Storing connection references in global maps that never expire
Use a `WeakMap` for connection metadata and always clean up in disconnect handlers.
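A sketch of that cleanup pattern - the metadata shape is just an example:

```js
const { WebSocketServer } = require('ws');

const wss = new WebSocketServer({ port: 8080 });

// Keyed by the socket itself: once the socket is unreachable, the entry is
// eligible for garbage collection even if you forget to delete it.
const connectionMeta = new WeakMap();

wss.on('connection', (ws) => {
  connectionMeta.set(ws, { userId: null, rooms: new Set(), connectedAt: Date.now() });

  const heartbeat = setInterval(() => ws.ping(), 30000);

  ws.on('close', () => {
    // Timers and listeners still hold references, so clean up explicitly anyway.
    clearInterval(heartbeat);
    connectionMeta.delete(ws);
  });
});
```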
Database Integration
WebSocket apps hit the database differently than HTTP APIs:
- Lots of small real-time queries instead of batch operations
- Connection pooling becomes critical (200+ connections is normal)
- Database becomes the bottleneck before WebSocket connections do
Consider Redis for high-frequency data and PostgreSQL for persistent data.
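A minimal sketch of the PostgreSQL side with the pg library's built-in pooling. The table and pool size are examples - size the pool against your database's max_connections, not against your WebSocket count:

```js
const { Pool } = require('pg');

// One shared pool per process - never open a client per WebSocket connection.
const pool = new Pool({
  connectionString: process.env.DATABASE_URL,
  max: 20,                 // per-process cap; 10 processes x 20 = 200 DB connections
  idleTimeoutMillis: 30000,
});

async function saveMessage(roomId, userId, text) {
  // pool.query checks out a client, runs the query, and returns it automatically.
  const { rows } = await pool.query(
    'INSERT INTO messages (room_id, user_id, body) VALUES ($1, $2, $3) RETURNING id',
    [roomId, userId, text]
  );
  return rows[0].id;
}
```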
Performance Monitoring
Track these metrics:
- Active connection count per server
- Memory usage per connection
- Event loop lag
- Database connection pool usage
- Redis pub/sub latency
Set up alerts for connection count > 15k per server and event loop lag > 10ms.
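Redis pub/sub latency is the easiest of those to miss. A crude probe is to publish a timestamp on a side channel and measure how long it takes to come back - the channel name, interval, and 50ms threshold are all arbitrary examples:

```js
const Redis = require('ioredis');

const pub = new Redis();
const sub = new Redis();

sub.subscribe('latency-probe');
sub.on('message', (_channel, sentAt) => {
  const latencyMs = Date.now() - Number(sentAt);
  if (latencyMs > 50) {
    console.warn(`redis pub/sub latency ${latencyMs}ms - message fan-out is falling behind`);
  }
});

// Publish a timestamp every 10 seconds; the round trip approximates delivery latency.
setInterval(() => pub.publish('latency-probe', String(Date.now())), 10000);
```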
When to Use Managed Services
Building WebSocket infrastructure is a pain. Consider managed services like Ably, Pusher, or AWS API Gateway WebSockets if you want to focus on your app instead of infrastructure.
The break-even point is around 10k concurrent connections - below that, DIY is cheaper. Above that, managed services start making financial sense.
Real Production Experience
Most WebSocket scaling problems happen in production with real users, real network conditions, and real load patterns.
Load testing with Artillery or similar tools helps, but you won't catch everything until you have actual users generating unpredictable traffic patterns.
Plan for 3x your expected peak load and design your system to fail gracefully when it gets overwhelmed.
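Failing gracefully can be as blunt as refusing new sockets once a process is at capacity, so the users already connected stay responsive instead of everyone degrading together. A sketch - the limit is an example, and 1013 is the standard "try again later" close code:

```js
const { WebSocketServer } = require('ws');

const MAX_CONNECTIONS = 15000;
const wss = new WebSocketServer({ port: 8080 });

wss.on('connection', (ws) => {
  if (wss.clients.size > MAX_CONNECTIONS) {
    // 1013 = "try again later"; well-behaved clients back off and retry another instance.
    ws.close(1013, 'server at capacity');
    return;
  }
  // ...normal connection handling
});
```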