Your chat app works fine with 100 users. At 20k concurrent connections, everything breaks. Here's why.
Node.js WebSocket Limits
Node.js handles WebSockets through its event loop. No threading required, which is great until it's not.
Where it breaks:
- File descriptor limit (the Linux default is 1024; raise it with `ulimit -n 65536`)
- Memory usage around 100KB per connection
- Event loop blocks on slow database queries
- Garbage collection pauses get longer
Most Node.js apps die somewhere between 15k and 25k connections. I've seen this exact pattern on AWS, GCP, and bare metal servers.
The death spiral always starts the same way: event loop lag spikes to 500ms+, garbage collection pauses hit 2-3 seconds, then the OS starts killing processes with OOM. Your monitoring shows everything is fine until suddenly it's not.
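You can see the spiral coming from inside the process. Here's a minimal sketch using Node's built-in perf_hooks and process.memoryUsage() - the 200ms warning threshold and 5-second interval are just examples, tune them against your own baseline.

```js
// Log event loop lag and heap usage so the death spiral is visible before OOM.
const { monitorEventLoopDelay } = require('perf_hooks');

const histogram = monitorEventLoopDelay({ resolution: 20 });
histogram.enable();

setInterval(() => {
  const lagMs = histogram.max / 1e6; // histogram values are in nanoseconds
  const heapMb = process.memoryUsage().heapUsed / 1024 / 1024;

  console.log(`event loop lag (max): ${lagMs.toFixed(1)}ms, heap: ${heapMb.toFixed(0)}MB`);

  if (lagMs > 200) {
    console.warn('event loop lag over 200ms - blocking work or long GC pauses');
  }
  histogram.reset();
}, 5000);
```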
Connection Problems at Scale
What happens around 20k connections:
- Random connection timeouts
- Messages start getting dropped
- 10-30 second server freezes during garbage collection
- Memory usage climbs to 2GB+ and keeps going
- Response times go from 50ms to 5+ seconds
The event loop can't keep up with all the connections, and garbage collection kills whatever performance is left.
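A chunk of that memory growth is backpressure: slow clients stop draining their sockets, so every outgoing message sits in your process until they catch up. A rough guard using the ws library's bufferedAmount - the 1MB cutoff is an arbitrary example, not a recommendation:

```js
const { WebSocket } = require('ws');

const MAX_BUFFERED_BYTES = 1024 * 1024; // 1MB of unsent data per socket - example cutoff

function safeSend(ws, payload) {
  if (ws.readyState !== WebSocket.OPEN) return false;

  if (ws.bufferedAmount > MAX_BUFFERED_BYTES) {
    // This client isn't reading fast enough; dropping it beats eating your heap.
    ws.close(1008, 'slow consumer');
    return false;
  }

  ws.send(payload);
  return true;
}
```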
Socket.IO vs Native WebSockets
Socket.IO is easier:
- Handles reconnections automatically
- Falls back to long polling when WebSockets are blocked
- Built-in room management
- Event-based messaging
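For reference, here's roughly what that looks like - a minimal Socket.IO server where the room and event names are made up for the example:

```js
const { Server } = require('socket.io');

const io = new Server(3000);

io.on('connection', (socket) => {
  // Built-in room management instead of hand-rolled maps
  socket.on('join-room', (room) => {
    socket.join(room);
    io.to(room).emit('user-joined', socket.id);
  });

  // Event-based messaging with serialization handled for you
  socket.on('chat-message', ({ room, text }) => {
    io.to(room).emit('chat-message', { from: socket.id, text });
  });

  // Reconnection and the long-polling fallback are handled by the client library
});
```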
But Socket.IO has problems:
- Dies around 15k connections (vs 25k+ for native)
- Uses 2-3x more memory per connection
- Requires sticky sessions
- Extra protocol overhead on every message
Native WebSockets:
- Can handle 25k+ connections if done right
- Lower memory and CPU usage
- No sticky sessions needed
- Direct browser support
But you have to build everything:
- Room management
- Reconnection logic
- Heartbeat/keepalive
- All the debugging
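Heartbeat/keepalive is the piece people skip most often. This is the ping/pong pattern the ws library documents, more or less verbatim - the 30-second interval matches the heartbeat suggestion later in this post:

```js
const { WebSocketServer } = require('ws');

const wss = new WebSocketServer({ port: 8080 });

wss.on('connection', (ws) => {
  ws.isAlive = true;
  ws.on('pong', () => { ws.isAlive = true; });
});

// Every 30 seconds, terminate connections that never answered the last ping.
const interval = setInterval(() => {
  for (const ws of wss.clients) {
    if (!ws.isAlive) {
      ws.terminate(); // dead peer - free the file descriptor and its memory
      continue;
    }
    ws.isAlive = false;
    ws.ping();
  }
}, 30000);

wss.on('close', () => clearInterval(interval));
```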
Multi-Server Architecture
When one Node.js process can't handle the load, run multiple:
Load Balancer → Node.js Instance 1 (15k connections)
              → Node.js Instance 2 (15k connections)
              → Node.js Instance 3 (15k connections)
Each instance → Redis (shared state + pub/sub) → Database
You need:
- HAProxy or nginx for load balancing
- Multiple Node.js processes (10-15k connections each)
- Redis for sharing messages between servers
- Health checks to detect failed nodes
- Monitoring because things will break
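The health checks can be as simple as an HTTP endpoint on the same server that the load balancer polls. A sketch - the /healthz path and the 15k ceiling are assumptions you'd match to your HAProxy or nginx config:

```js
const http = require('http');
const { WebSocketServer } = require('ws');

const MAX_CONNECTIONS = 15000; // per-process ceiling from the sizing above

const server = http.createServer((req, res) => {
  if (req.url === '/healthz') {
    const connections = wss.clients.size;
    // A 503 tells the load balancer to stop routing new users to this instance
    res.writeHead(connections < MAX_CONNECTIONS ? 200 : 503, {
      'Content-Type': 'application/json',
    });
    res.end(JSON.stringify({ connections }));
    return;
  }
  res.writeHead(404).end();
});

const wss = new WebSocketServer({ server });

server.listen(8080);
```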
The State Management Problem
This is where most scaling attempts fail. When User A on Server 1 sends a message, users on Server 2 need to see it.
I watched a chat app completely fall apart during a product demo because they stored room state in Node.js memory. Users kept sending messages to empty rooms and getting confused why nobody responded.
Problems:
- Messages lost between servers (no shared state)
- Users see different room member counts (each server tracks separately)
- Connection state vanishes when servers restart (everything stored in RAM)
- Race conditions when users join/leave rooms simultaneously
Solutions:
- Store room membership in Redis sets
- Use Redis pub/sub to broadcast messages
- Heartbeat every 30 seconds to detect dead connections
- Clean up state when users disconnect
- Connection pooling for Redis
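Here's a rough sketch of that Redis layer using ioredis. The key and channel names are invented for the example, and note that a connection in subscriber mode can't run normal commands, so you need two clients:

```js
const Redis = require('ioredis');

const redis = new Redis(); // commands: sets, cleanup
const sub = new Redis();   // pub/sub only

// Sockets this process owns, grouped by room. Cross-server state lives in Redis.
const localRooms = new Map(); // room -> Set of sockets

async function joinRoom(ws, userId, room) {
  await redis.sadd(`room:${room}:members`, userId); // shared membership, survives restarts
  if (!localRooms.has(room)) localRooms.set(room, new Set());
  localRooms.get(room).add(ws);
}

async function leaveRoom(ws, userId, room) {
  await redis.srem(`room:${room}:members`, userId);
  localRooms.get(room)?.delete(ws);
}

function broadcast(room, message) {
  // Every server, including this one, picks it up through the subscription below.
  return redis.publish('chat', JSON.stringify({ room, message }));
}

sub.subscribe('chat');
sub.on('message', (_channel, raw) => {
  const { room, message } = JSON.parse(raw);
  for (const ws of localRooms.get(room) ?? []) {
    ws.send(message);
  }
});
```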
What Actually Works in Production
After debugging this shit at multiple companies, here's what actually keeps WebSocket servers alive:
Start with native WebSockets if you can handle building reconnection logic yourself. Plan for horizontal scaling from day one - don't wait until you hit 20k connections and everything breaks.
Use Redis for all shared state. Every other approach I've tried has race conditions or fails during server restarts. Redis pub/sub is the only thing that works reliably across multiple Node processes.
Monitor everything obsessively - connection counts, memory usage per server, Redis performance. The problems show up in these metrics before users complain.
Test with realistic load. Those 100 concurrent test users won't trigger garbage collection pauses or file descriptor limits. Load test with 25k connections or you're wasting your time.
Memory Leaks to Watch For
Common leak sources:
- Not cleaning up connection metadata when users disconnect
- Event listeners that never get removed
- Storing connection references in global maps that never expire
Use a `WeakMap` for connection metadata and always clean up in disconnect handlers.
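A sketch of that cleanup pattern - the metadata shape is just an example:

```js
const { WebSocketServer } = require('ws');

const wss = new WebSocketServer({ port: 8080 });

// Keyed by the socket itself: once the socket is unreachable, the entry is
// eligible for garbage collection even if you forget to delete it.
const connectionMeta = new WeakMap();

wss.on('connection', (ws) => {
  connectionMeta.set(ws, { userId: null, rooms: new Set(), connectedAt: Date.now() });

  const heartbeat = setInterval(() => ws.ping(), 30000);

  ws.on('close', () => {
    // Timers and listeners still hold references, so clean up explicitly anyway.
    clearInterval(heartbeat);
    connectionMeta.delete(ws);
  });
});
```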
Database Integration
WebSocket apps hit the database differently than HTTP APIs:
- Lots of small real-time queries instead of batch operations
- Connection pooling becomes critical (200+ connections is normal)
- Database becomes the bottleneck before WebSocket connections do
Consider Redis for high-frequency data and PostgreSQL for persistent data.
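A minimal sketch of the PostgreSQL side with the pg library's built-in pooling. The table and pool size are examples - size the pool against your database's max_connections, not against your WebSocket count:

```js
const { Pool } = require('pg');

// One shared pool per process - never open a client per WebSocket connection.
const pool = new Pool({
  connectionString: process.env.DATABASE_URL,
  max: 20,                 // per-process cap; 10 processes x 20 = 200 DB connections
  idleTimeoutMillis: 30000,
});

async function saveMessage(roomId, userId, text) {
  // pool.query checks out a client, runs the query, and returns it automatically.
  const { rows } = await pool.query(
    'INSERT INTO messages (room_id, user_id, body) VALUES ($1, $2, $3) RETURNING id',
    [roomId, userId, text]
  );
  return rows[0].id;
}
```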
Performance Monitoring
Track these metrics:
- Active connection count per server
- Memory usage per connection
- Event loop lag
- Database connection pool usage
- Redis pub/sub latency
Set up alerts for connection count > 15k per server and event loop lag > 10ms.
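Redis pub/sub latency is the easiest of those to miss. A crude probe is to publish a timestamp on a side channel and measure how long it takes to come back - the channel name, interval, and 50ms threshold are all arbitrary examples:

```js
const Redis = require('ioredis');

const pub = new Redis();
const sub = new Redis();

sub.subscribe('latency-probe');
sub.on('message', (_channel, sentAt) => {
  const latencyMs = Date.now() - Number(sentAt);
  if (latencyMs > 50) {
    console.warn(`redis pub/sub latency ${latencyMs}ms - message fan-out is falling behind`);
  }
});

// Publish a timestamp every 10 seconds; the round trip approximates delivery latency.
setInterval(() => pub.publish('latency-probe', String(Date.now())), 10000);
```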
When to Use Managed Services
Building WebSocket infrastructure is a pain. Consider managed services like Ably, Pusher, or AWS API Gateway WebSockets if you want to focus on your app instead of infrastructure.
The break-even point is around 10k concurrent connections - below that, DIY is cheaper. Above that, managed services start making financial sense.
Real Production Experience
Most WebSocket scaling problems happen in production with real users, real network conditions, and real load patterns.
Load testing with Artillery or similar tools helps, but you won't catch everything until you have actual users generating unpredictable traffic patterns.
Plan for 3x your expected peak load and design your system to fail gracefully when it gets overwhelmed.
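Failing gracefully can be as blunt as refusing new sockets once a process is at capacity, so the users already connected stay responsive instead of everyone degrading together. A sketch - the limit is an example, and 1013 is the standard "try again later" close code:

```js
const { WebSocketServer } = require('ws');

const MAX_CONNECTIONS = 15000;
const wss = new WebSocketServer({ port: 8080 });

wss.on('connection', (ws) => {
  if (wss.clients.size > MAX_CONNECTIONS) {
    // 1013 = "try again later"; well-behaved clients back off and retry another instance.
    ws.close(1013, 'server at capacity');
    return;
  }
  // ...normal connection handling
});
```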