
Why WebSocket Apps Die Around 20k Connections

Your chat app works fine with 100 users. At 20k concurrent connections, everything breaks. Here's why.

Node.js WebSocket Limits

Node.js handles WebSockets through its event loop. No threading required, which is great until it's not.

Where it breaks:

  • File descriptor limit (Linux default is 1024, set ulimit -n 65536)
  • Memory usage around 100KB per connection
  • Event loop blocks on slow database queries
  • Garbage collection pauses get longer

Most Node.js apps die somewhere between 15k and 25k connections. I've seen this exact pattern on AWS, GCP, and bare metal servers.

The death spiral always starts the same way: event loop lag spikes to 500ms+, garbage collection pauses hit 2-3 seconds, then the OOM killer starts taking out processes. Your monitoring shows everything is fine until suddenly it's not.
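
If you want to see the spiral coming, a few lines of timer drift measurement will surface it early. A minimal sketch (the 500ms threshold just matches the symptom above - in production you'd ship the number to your metrics system instead of console.warn):

// Quick-and-dirty event loop lag check: if a 1-second timer fires late,
// the difference is time the event loop spent blocked
let lastTick = Date.now();

setInterval(() => {
  const lag = Date.now() - lastTick - 1000;
  if (lag > 500) {
    console.warn(`Event loop lag: ${lag}ms - GC pause or blocking work`);
  }
  lastTick = Date.now();
}, 1000);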

Connection Problems at Scale

What happens around 20k connections:

  • Random connection timeouts
  • Messages start getting dropped
  • 10-30 second server freezes during garbage collection
  • Memory usage climbs to 2GB+ and keeps going
  • Response times go from 50ms to 5+ seconds

The event loop can't keep up with processing all the connections, and garbage collection kills whatever performance is left.

Socket.IO vs Native WebSockets

Socket.IO is easier:

  • Handles reconnections automatically
  • Falls back to long polling when WebSockets are blocked
  • Built-in room management
  • Event-based messaging

But Socket.IO has problems:

  • Dies around 15k connections (vs 25k+ for native)
  • Uses 2-3x more memory per connection
  • Requires sticky sessions
  • Extra protocol overhead on every message

Native WebSockets:

  • Can handle 25k+ connections if done right
  • Lower memory and CPU usage
  • No sticky sessions needed
  • Direct browser support

But you have to build everything:

  • Room management
  • Reconnection logic (a client-side sketch follows this list)
  • Heartbeat/keepalive
  • All the debugging
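
Reconnection alone ends up looking something like this - a minimal browser-side sketch with exponential backoff and jitter (the endpoint URL is a placeholder):

// Client-side reconnect: exponential backoff capped at 30s, plus jitter
// so thousands of clients don't reconnect in lockstep after a server restart
function connect(url, attempt = 0) {
  const ws = new WebSocket(url);

  ws.onopen = () => {
    attempt = 0; // reset the backoff once we're connected again
  };

  ws.onclose = () => {
    const delay = Math.min(1000 * 2 ** attempt, 30000) + Math.random() * 1000;
    setTimeout(() => connect(url, attempt + 1), delay);
  };

  return ws;
}

connect('wss://example.com/ws'); // placeholder endpoint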

Multi-Server Architecture

When one Node.js process can't handle the load, run multiple:

Load Balancer → Node.js Instance 1 (15k connections) ─┐
              → Node.js Instance 2 (15k connections) ─┼→ Redis → Database
              → Node.js Instance 3 (15k connections) ─┘

You need:

  • HAProxy or nginx for load balancing
  • Multiple Node.js processes (10-15k connections each)
  • Redis for sharing messages between servers
  • Health checks to detect failed nodes
  • Monitoring because things will break

The State Management Problem

This is where most scaling attempts fail. When User A on Server 1 sends a message, users on Server 2 need to see it.

I watched a chat app completely fall apart during a product demo because they stored room state in Node.js memory. Users kept sending messages to empty rooms and getting confused why nobody responded.

Problems:

  • Messages lost between servers (no shared state)
  • Users see different room member counts (each server tracks separately)
  • Connection state vanishes when servers restart (everything stored in RAM)
  • Race conditions when users join/leave rooms simultaneously

Solutions:

  • Store room membership in Redis sets
  • Use Redis pub/sub to broadcast messages
  • Heartbeat every 30 seconds to detect dead connections
  • Clean up state when users disconnect (sketched after this list)
  • Connection pooling for Redis
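
The cleanup piece is the one everyone skips. A rough sketch, assuming ioredis and a hypothetical user:{userId}:rooms set that gets updated on every join/leave:

const Redis = require('ioredis');
const redis = new Redis();

// Assumes a hypothetical `user:${userId}:rooms` set maintained on join/leave
async function cleanupOnDisconnect(userId) {
  const rooms = await redis.smembers(`user:${userId}:rooms`);

  // Remove the user from every room they were in, then drop the tracking set
  await Promise.all(rooms.map(roomId => redis.srem(`room:${roomId}`, userId)));
  await redis.del(`user:${userId}:rooms`);

  // Tell the other servers so member counts stay in sync
  await redis.publish('room_leave', JSON.stringify({ userId, rooms }));
}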

What Actually Works in Production

After debugging this shit at multiple companies, here's what actually keeps WebSocket servers alive:

Start with native WebSockets if you can handle building reconnection logic yourself. Plan for horizontal scaling from day one - don't wait until you hit 20k connections and everything breaks.

Use Redis for all shared state. Every other approach I've tried has race conditions or fails during server restarts. Redis pub/sub is the only thing that works reliably across multiple Node processes.

Monitor everything obsessively - connection counts, memory usage per server, Redis performance. The problems show up in these metrics before users complain.

Test with realistic load. Those 100 concurrent test users won't trigger garbage collection pauses or file descriptor limits. Load test with 25k connections or you're wasting your time.

Memory Leaks to Watch For

Common leak sources:

  • Not cleaning up connection metadata when users disconnect
  • Event listeners that never get removed
  • Storing connection references in global maps that never expire

Use WeakMap for connection metadata and always clean up in disconnect handlers.
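
A rough sketch of the WeakMap pattern, assuming a ws server instance named wss - metadata is keyed by the socket object itself, so once the socket is gone and dereferenced, its metadata is collectable too:

// Metadata keyed by the socket object: no global map to forget to clean up
const connectionMeta = new WeakMap();

wss.on('connection', (ws, req) => {
  connectionMeta.set(ws, {
    connectedAt: Date.now(),
    ip: req.socket.remoteAddress,
    rooms: new Set()
  });

  ws.on('close', () => {
    // Still clean up anything stored outside the WeakMap (Redis, timers, global maps)
    connectionMeta.delete(ws);
  });
});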

Database Integration

WebSocket apps hit the database differently than HTTP APIs:

  • Lots of small real-time queries instead of batch operations
  • Connection pooling becomes critical (200+ connections is normal)
  • Database becomes the bottleneck before WebSocket connections do

Consider Redis for high-frequency data and PostgreSQL for persistent data.
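
In practice that usually means a read-through cache in front of Postgres. A minimal sketch, assuming ioredis and a pg Pool named pool like the one shown later in this post:

const Redis = require('ioredis');
const redis = new Redis();

// Hit Redis first, fall back to Postgres, cache the row for 60 seconds
async function getUser(userId) {
  const cached = await redis.get(`user:${userId}`);
  if (cached) return JSON.parse(cached);

  const { rows } = await pool.query('SELECT * FROM users WHERE id = $1', [userId]);
  if (rows[0]) {
    await redis.set(`user:${userId}`, JSON.stringify(rows[0]), 'EX', 60);
  }
  return rows[0];
}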

Performance Monitoring

Track these metrics:

  • Active connection count per server
  • Memory usage per connection
  • Event loop lag
  • Database connection pool usage
  • Redis pub/sub latency

Set up alerts for connection count > 15k per server and event loop lag > 10ms.
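
Node ships an event loop delay histogram in perf_hooks, which is less hacky than measuring timer drift yourself. A minimal sketch wired to the 10ms threshold above:

const { monitorEventLoopDelay } = require('perf_hooks');

const loopDelay = monitorEventLoopDelay({ resolution: 20 });
loopDelay.enable();

setInterval(() => {
  // The histogram reports nanoseconds
  const p99ms = loopDelay.percentile(99) / 1e6;
  if (p99ms > 10) {
    console.warn(`Event loop p99 delay ${p99ms.toFixed(1)}ms`);
  }
  loopDelay.reset();
}, 10000);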

When to Use Managed Services

Building WebSocket infrastructure is a pain. Consider managed services like Ably, Pusher, or AWS API Gateway WebSockets if you want to focus on your app instead of infrastructure.

The break-even point is around 10k concurrent connections - below that, DIY is cheaper. Above that, managed services start making financial sense.

Real Production Experience

Most WebSocket scaling problems happen in production with real users, real network conditions, and real load patterns.

Load testing with Artillery or similar tools helps, but you won't catch everything until you have actual users generating unpredictable traffic patterns.

Plan for 3x your expected peak load and design your system to fail gracefully when it gets overwhelmed.

WebSocket Libraries Comparison

Library            | Connection Limit | Difficulty | Worth It?
Socket.IO          | ~15k             | Easy       | Great until you hit real scale, then it's garbage
Native WebSocket   | ~25k             | Medium     | My go-to when Socket.IO inevitably fails
ws library         | ~25k             | Medium     | Rock solid - never let me down
uWebSockets.js     | 50k+             | Hard       | Fastest, but fuck those C++ build issues

Production WebSocket Code That Works

Most WebSocket tutorials show happy path code with 10 users. Here's what you actually need for 20k connections.

This code comes from debugging WebSocket apps that handle millions of messages per day. Every line exists because something broke in production without it.

Connection Cleanup

Every WebSocket connection needs cleanup when it dies. Miss this and your server leaks memory.

I debugged a WebSocket server that was restarting every 6 hours. Turns out mobile clients were dropping connections without sending close frames, so the cleanup code never ran. Memory usage grew from 200MB to 4GB over 6 hours until the OS killed it.

class ConnectionManager {
  constructor() {
    this.connections = new Map();
    this.startHeartbeat();
  }

  addConnection(userId, ws) {
    ws.userId = userId;
    ws.isAlive = true;
    this.connections.set(userId, ws);

    ws.on('close', () => {
      this.connections.delete(userId);
      this.cleanupUserData(userId);
    });

    ws.on('pong', () => {
      ws.isAlive = true;
    });
  }

  startHeartbeat() {
    setInterval(() => {
      this.connections.forEach((ws) => {
        if (!ws.isAlive) {
          ws.terminate();
          this.connections.delete(ws.userId);
          return;
        }

        ws.isAlive = false;
        ws.ping();
      });
    }, 30000);
  }

  cleanupUserData(userId) {
    // Clean up any user-specific data
    // This prevents memory leaks
  }
}
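
Wiring it up looks roughly like this - getUserIdFromRequest is a placeholder for whatever auth you run during the upgrade request:

const WebSocket = require('ws');

const wss = new WebSocket.Server({ port: 3000 });
const connectionManager = new ConnectionManager();

wss.on('connection', (ws, req) => {
  // Placeholder: resolve the user from a cookie or token on the upgrade request
  const userId = getUserIdFromRequest(req);

  if (!userId) {
    ws.close(1008, 'Unauthorized'); // 1008 = policy violation
    return;
  }

  connectionManager.addConnection(userId, ws);
});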

Room Management with Redis

When users can join/leave rooms, you need Redis to track membership across servers.

I learned this the hard way when our WebSocket servers lost sync during a Redis restart. Half the users thought they were in empty rooms while the other half saw ghost users who had already left. Always store room state externally.

const Redis = require('ioredis'); // ioredis: promise-returning commands and the subscribe/on('message') pattern used below
const redis = new Redis();

class RoomManager {
  async joinRoom(userId, roomId) {
    await redis.sadd(`room:${roomId}`, userId);
    await redis.publish('room_join', JSON.stringify({ userId, roomId }));
  }

  async leaveRoom(userId, roomId) {
    await redis.srem(`room:${roomId}`, userId);
    await redis.publish('room_leave', JSON.stringify({ userId, roomId }));
  }

  async broadcastToRoom(roomId, message) {
    const members = await redis.smembers(`room:${roomId}`);
    await redis.publish('room_message', JSON.stringify({
      roomId,
      message,
      members
    }));
  }
}

Message Broadcasting Between Servers

Use Redis pub/sub to share messages between your Node.js instances:

const WebSocket = require('ws');

const subscriber = new Redis(); // a subscribed ioredis client can't run other commands, so keep publisher separate
const publisher = new Redis();

subscriber.subscribe('room_message');
subscriber.on('message', (channel, data) => {
  const { roomId, message, members } = JSON.parse(data);

  members.forEach(userId => {
    const ws = connectionManager.connections.get(userId);
    if (ws && ws.readyState === WebSocket.OPEN) {
      ws.send(JSON.stringify(message));
    }
  });
});

// Send message to all servers
function broadcastMessage(roomId, message) {
  publisher.publish('room_message', JSON.stringify({
    roomId,
    message,
    timestamp: Date.now()
  }));
}

Error Handling

WebSocket errors are different from HTTP errors. Handle them properly.

WebSocket errors don't give you HTTP status codes. You get generic Error: connection reset by peer messages and have to figure out what went wrong. Mobile networks dropping connections look identical to users closing their browser.

ws.on('error', (error) => {
  console.error('WebSocket error:', error.message);
  connectionManager.connections.delete(ws.userId);
});

ws.on('message', (data) => {
  try {
    const message = JSON.parse(data);
    handleMessage(ws, message);
  } catch (error) {
    console.error('Invalid message:', error.message);
    ws.send(JSON.stringify({ error: 'Invalid message format' }));
  }
});

Database Queries

WebSocket apps query the database constantly. Use connection pooling.

A chat app with 10k concurrent users will hit your database 50-100 times per second just for authentication and room membership checks. Without connection pooling, you'll exhaust your Postgres connection limit (default 100) in minutes.

const { Pool } = require('pg');
const pool = new Pool({
  host: 'localhost',
  database: 'chat',
  max: 200, // Much higher than HTTP apps
  idleTimeoutMillis: 30000,
});

async function getUserData(userId) {
  const client = await pool.connect();
  try {
    const result = await client.query('SELECT * FROM users WHERE id = $1', [userId]);
    return result.rows[0];
  } finally {
    client.release();
  }
}

Rate Limiting

Prevent message spam:

const rateLimits = new Map(); // userId -> { count, resetTime }

function checkRateLimit(userId) {
  const now = Date.now();
  const userLimit = rateLimits.get(userId);

  if (!userLimit || now > userLimit.resetTime) {
    rateLimits.set(userId, { count: 1, resetTime: now + 60000 }); // 1 minute
    return true;
  }

  if (userLimit.count >= 100) { // 100 messages per minute
    return false;
  }

  userLimit.count++;
  return true;
}
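
// Purge expired entries periodically so this Map doesn't become its own memory leak
setInterval(() => {
  const now = Date.now();
  rateLimits.forEach((limit, userId) => {
    if (now > limit.resetTime) rateLimits.delete(userId);
  });
}, 60000);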

ws.on('message', (data) => {
  if (!checkRateLimit(ws.userId)) {
    ws.send(JSON.stringify({ error: 'Rate limit exceeded' }));
    return;
  }

  // Process message
});

Health Checks

Your load balancer needs to know if your WebSocket server is healthy.

I've seen load balancers keep sending traffic to WebSocket servers that were completely dead - 20k connections but not processing any messages. The health check endpoint was still responding because HTTP was fine, but WebSocket processing had locked up.

const express = require('express');
const app = express();

app.get('/health', (req, res) => {
  const activeConnections = connectionManager.connections.size;
  const memoryUsage = process.memoryUsage();

  if (activeConnections > 20000 || memoryUsage.heapUsed > 2000000000) {
    return res.status(503).json({
      status: 'unhealthy',
      connections: activeConnections,
      memory: memoryUsage.heapUsed
    });
  }

  res.json({
    status: 'healthy',
    connections: activeConnections,
    uptime: process.uptime()
  });
});

app.listen(3001); // Health check port

Monitoring Metrics

Track what matters in production:

const metrics = {
  activeConnections: 0,
  messagesPerSecond: 0,
  errorCount: 0,
  redisLatency: 0
};

setInterval(() => {
  metrics.activeConnections = connectionManager.connections.size;
  console.log('Metrics:', metrics);

  // Send to your monitoring service
  // sendToDatadog(metrics);
}, 10000);

Performance Tips

Memory optimization:

  • Use WeakMap for connection metadata that can be garbage collected
  • Clean up event listeners in disconnect handlers
  • Limit message history per room (delete old messages)

CPU optimization:

  • Batch Redis operations when possible
  • Use connection pooling for all external services
  • Parse JSON once and reuse the object

Network optimization:

  • Compress WebSocket messages if they're large (config sketch after this list)
  • Use binary format for high-frequency data
  • Implement message queuing if Redis gets overwhelmed
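
For the compression bullet above, the ws library supports permessage-deflate out of the box - just remember it costs CPU and per-connection memory, which is exactly what you're short on at 20k connections. A minimal sketch:

const WebSocket = require('ws');

// Only compress payloads above ~1KB; tiny chat messages aren't worth the CPU
const wss = new WebSocket.Server({
  port: 3000,
  perMessageDeflate: {
    threshold: 1024
  }
});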

This code won't handle every edge case, but it's a solid foundation that can scale to 20k+ connections per server. Test it with real load before going to production.

WebSocket FAQ

Q: My server dies at 20k connections. Is that normal?

A: Yes. A decent server (4 cores, 16GB RAM) typically handles:

  • 15k connections for chat apps
  • 25k for mostly idle connections
  • 8k if processing heavy messages

Beyond that, you hit file descriptor limits and the event loop gets overwhelmed. Scale horizontally instead of trying to squeeze more connections per server.

Q: Why do connections randomly drop?

A: Corporate firewalls are the worst. They kill idle WebSocket connections because they think anything quiet for 60 seconds must be dead. Mobile carriers do the same shit.

Your users will blame your app for "being buggy" when it's actually their network infrastructure being aggressive about killing connections.

Fix with heartbeat pings:

setInterval(() => {
  wss.clients.forEach(ws => {
    if (!ws.isAlive) {
      ws.terminate();
      return;
    }
    ws.isAlive = false;
    ws.ping();
  });
}, 30000);

// Inside wss.on('connection', (ws) => { ... }):
ws.on('pong', () => {
  ws.isAlive = true;
});

Q: Socket.IO vs native WebSockets?

A: Socket.IO for prototyping, native WebSockets for production scale.

Socket.IO handles reconnections and rooms automatically but uses more memory and dies around 15k connections. Native WebSockets handle 25k+ but you build everything yourself.

Q: How do I share state between servers?

A: Use Redis. Store room membership in Redis sets, use pub/sub for message broadcasting:

// Join room
await redis.sadd(`room:${roomId}`, userId);

// Broadcast message
await redis.publish('room_message', JSON.stringify({ roomId, message }));

Q: Why is my memory usage so high?

A: Each WebSocket connection uses ~100KB of memory. At 20k connections, that's 2GB just for connection overhead.

Memory leaks happen when you don't clean up connection metadata on disconnect. Always clean up in the close event handler.

Q: How do I handle connection spikes?

A: Set connection limits and reject new connections when you hit them:

wss.on('connection', (ws, req) => {
  if (wss.clients.size > MAX_CONNECTIONS) {
    ws.close(1013, 'Server overloaded');
    return;
  }
  // Handle connection
});

Better to reject new connections than crash the server.

Q: Database performance is terrible. Why?

A: WebSocket apps absolutely hammer the database. Unlike HTTP APIs that do a few big queries per request, chat apps do hundreds of tiny queries: "Is user X online? What room is user Y in? Who's typing in room Z?"

Your database connection pool gets exhausted fast:

  • Use connection pooling (200+ connections is normal)
  • Cache frequently accessed data in Redis
  • Batch database updates when possible

Q: How do I debug connection issues?

A: Add logging to connection events:

wss.on('connection', (ws, req) => {
  console.log('Connection opened:', req.socket.remoteAddress);

  ws.on('close', (code, reason) => {
    console.log('Connection closed:', code, reason.toString());
  });

  ws.on('error', (error) => {
    console.error('WebSocket error:', error);
  });
});

Track connection counts, message rates, and error rates. Set up alerts for unusual patterns.

Q: Should I use managed WebSocket services?

A: If you have 10k+ concurrent connections and don't want to manage infrastructure, yes.

Services like Ably, Pusher, or AWS API Gateway handle scaling, monitoring, and reliability. They're more expensive per connection but cheaper than hiring a team to build and maintain the infrastructure.

Q: How do I test WebSocket performance?

A: Use Artillery or similar tools to simulate concurrent connections:

config:
  target: 'ws://localhost:3000'
  phases:
    - duration: 300
      arrivalRate: 100
scenarios:
  - name: 'WebSocket connections'
    engine: ws
    flow:
      - send: '{"type":"ping"}'
      - think: 30

Test with realistic message patterns, not just idle connections. Most problems show up under message load, not just connection load.

Testing 10k idle connections tells you nothing. You need connections that actually send messages, join/leave rooms, and simulate real user behavior. Otherwise you'll hit production and wonder why everything's on fire.
