Redis Cluster Production Issues - AI-Optimized Technical Reference
Critical Failure Scenarios
Split-Brain Scenarios
What it is: Multiple Redis nodes simultaneously act as masters for the same hash slots during network partitions
Severity: CRITICAL - causes immediate data divergence and application failures
Recovery time: 30+ minutes of manual intervention required
Data loss: HIGH - must choose which partition's data to keep
Real-world triggers:
- Switch failures splitting nodes across failure domains
- AWS/Azure availability zone networking failures (documented incidents)
- Kubernetes CNI plugin crashes during pod restarts
- Service mesh proxy (Istio/Linkerd) timeout cascades
Detection patterns:
- CLUSTER INFO shows cluster_state:fail
- CLUSTER NODES shows multiple masters for the same slots
- Inconsistent cluster_slots_assigned counts across nodes
Recovery process (a command-level sketch follows this list):
- Stop accepting writes immediately (read-only mode)
- Identify authoritative partition (most recent LASTSAVE)
- CLUSTER RESET HARD on minority partitions
- Manual reshard using redis-cli --cluster reshard
- Full validation before resuming service
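A minimal sketch of that sequence, assuming six nodes named redis-node-1 through redis-node-6, with redis-node-1 in the authoritative partition and redis-node-4 through redis-node-6 in the minority; adapt hostnames and targets to your topology:
# 1. Block writes on every reachable master (raising min-replicas-to-write
#    above the real replica count makes writes fail with NOREPLICAS)
for node in redis-node-{1..6}; do
  redis-cli -h "$node" CONFIG SET min-replicas-to-write 10
done

# 2. Compare last persistence times to choose the authoritative partition
for node in redis-node-{1..6}; do
  echo "$node: $(redis-cli -h "$node" LASTSAVE)"
done

# 3. Wipe and reset each node in the minority partition only
#    (CLUSTER RESET refuses masters that still hold keys, hence FLUSHALL)
for node in redis-node-{4..6}; do
  redis-cli -h "$node" FLUSHALL
  redis-cli -h "$node" CLUSTER RESET HARD
done

# 4. Re-add the reset nodes to the authoritative side and reshard
redis-cli --cluster add-node redis-node-4:6379 redis-node-1:6379
redis-cli --cluster reshard redis-node-1:6379

# 5. Re-enable writes once every node reports cluster_state:ok
for node in redis-node-{1..6}; do
  redis-cli -h "$node" CONFIG SET min-replicas-to-write 0
done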
Memory Fragmentation Cascade Failures
What happens: Slot migration doubles memory usage, triggering OOM kills during peak traffic
Frequency: Common during resharding operations under load
Impact: Entire cluster unavailable for 15-60 minutes
Critical thresholds (a check script follows this list):
- Memory fragmentation ratio >1.5 = immediate intervention required
- Replication offset lag >50MB = replica will never catch up
- Used RSS memory >80% = migrations will fail
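A hedged shell check against these thresholds; the hostnames in NODES are assumptions, and only the fragmentation ratio is sampled here:
# Flag any node whose fragmentation ratio crosses the 1.5 intervention threshold
NODES="redis-node-1 redis-node-2 redis-node-3"
for node in $NODES; do
  ratio=$(redis-cli -h "$node" INFO memory | tr -d '\r' | \
          awk -F: '/^mem_fragmentation_ratio/{print $2}')
  awk -v r="$ratio" -v n="$node" 'BEGIN { if (r+0 > 1.5) print n ": CRITICAL fragmentation " r }'
done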
CLUSTERDOWN Error States
Cause: Not enough available nodes to serve all 16384 hash slots
Common scenarios:
- Node failures during slot migration
- Network partitions with cluster-require-full-coverage=yes
- Failed migrations leaving orphaned slots
Diagnostic commands:
redis-cli CLUSTER NODES                 # Check slot assignments
redis-cli CLUSTER SLOTS                 # Verify slot coverage
redis-cli --cluster check <host>:<port> # Find uncovered and open slots
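CLUSTER INFO also exposes coverage counters directly; a quick sanity check, assuming any reachable node:
redis-cli -h redis-node-1 CLUSTER INFO | tr -d '\r' | \
  grep -E 'cluster_slots_(assigned|ok|pfail|fail)'
# cluster_slots_assigned and cluster_slots_ok should both read 16384;
# any nonzero cluster_slots_fail means CLUSTERDOWN when full coverage is required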
Production Configuration That Actually Works
Essential redis.conf Settings
# Prevent premature failovers on unstable networks
cluster-node-timeout 30000 # Default 15000 (15s) fails over too eagerly on flaky networks
cluster-replica-validity-factor 0 # 0 = replicas always qualify for failover, even with stale data
# Real-world replication buffer sizes
client-output-buffer-limit replica 512mb 128mb 60 # Default 256mb/64mb still overflows under heavy write fan-out
repl-backlog-size 128mb # Default 1MB causes constant full resyncs
repl-backlog-ttl 3600
# Memory management
maxmemory-policy allkeys-lru
cluster-require-full-coverage yes # Fail closed during partitions instead of serving partial data
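These settings can also be applied at runtime with CONFIG SET, which matters mid-incident when restarts are off the table; a sketch assuming the six-node naming used elsewhere in this reference:
for node in redis-node-{1..6}; do
  redis-cli -h "$node" CONFIG SET cluster-node-timeout 30000
  redis-cli -h "$node" CONFIG SET repl-backlog-size 128mb
  redis-cli -h "$node" CONFIG REWRITE   # persist so a restart doesn't silently revert
done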
Network Architecture Requirements
- Dedicated cluster network: Separate cluster bus (port+10000) from client traffic
- Multiple network paths: Redundant connections between nodes
- Firewall rules: Both client port AND cluster bus port must be accessible
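As a concrete illustration of the firewall point, assuming the default client port 6379 (and therefore bus port 16379) and a cluster subnet of 10.0.1.0/24:
# Both ports must be open between nodes; a missing bus-port rule produces
# a cluster that partially forms but never converges
iptables -A INPUT -p tcp -s 10.0.1.0/24 --dport 6379 -j ACCEPT
iptables -A INPUT -p tcp -s 10.0.1.0/24 --dport 16379 -j ACCEPT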
Memory Management Failure Patterns
Slot Migration Memory Doubling
Critical insight: During migration, both the source and target nodes hold copies of the same keys
Planning rule: Never migrate more than 25% of available memory at once (see the headroom check below)
Failure scenario: Target node OOMs halfway through migration, leaving orphaned slots
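A hedged headroom check before migrating slots into a target node (hostname assumed):
target=redis-node-2
used=$(redis-cli -h "$target" INFO memory | tr -d '\r' | awk -F: '/^used_memory:/{print $2}')
max=$(redis-cli -h "$target" INFO memory | tr -d '\r' | awk -F: '/^maxmemory:/{print $2}')
awk -v u="$used" -v m="$max" 'BEGIN {
  if (m+0 == 0) { print "maxmemory unset - budget against physical RAM instead"; exit }
  # keep planned inbound key volume under 25% of remaining headroom
  printf "migration budget: %.0f bytes\n", (m - u) * 0.25
}'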
Replication Buffer Growth
Default disaster: the 1MB default repl-backlog-size overflows within minutes under real load
E-commerce reality: Black Friday traffic requires 500MB+ replication buffers
Monitoring: a growing gap between master_repl_offset and slave_repl_offset indicates buffer overflow (see the check below)
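A hedged master-side check for that gap, assuming the standard slaveN line format in INFO replication (ip=...,offset=...):
# Alert when any replica trails the master by more than 50MB of offset
redis-cli -h redis-node-1 INFO replication | tr -d '\r' | awk -F'[:,=]' '
  /^master_repl_offset/ { master = $2 }
  /^slave[0-9]+:/       { for (i = 1; i <= NF; i++) if ($i == "offset") off[$3] = $(i+1) }
  END { for (ip in off) if (master - off[ip] > 50*1024*1024)
          print ip " lagging by " (master - off[ip]) " bytes" }'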
Cluster-Specific Memory Leaks
- Gossip protocol metadata: Accumulates failed node state, never garbage collected
- Failed migration artifacts: Orphaned keys invisible to normal operations
- Threading in Redis 8: More parallel allocation accelerates fragmentation
Performance Thresholds and Breaking Points
| Metric | Warning Level | Critical Level | Failure Consequence |
|---|---|---|---|
| Memory fragmentation ratio | >1.3 | >1.5 | Node OOM kills during operations |
| Replication lag | >10MB | >50MB | Full resync required, data loss risk |
| Cluster message rate | >500/sec | >1000/sec | Network saturation, gossip failures |
| Slot migration duration | >5 minutes | >15 minutes | Client timeout storms, read-only state |
| Node response time | >1 second | >5 seconds | Cluster declares node failed |
Diagnostic Command Sequences
Split-Brain Detection
# Run on ALL nodes to compare cluster state
for node in redis-node-{1..6}; do
  echo "=== $node ==="
  redis-cli -h "$node" CLUSTER NODES | grep master
  redis-cli -h "$node" CLUSTER INFO | grep cluster_state
done
Memory Fragmentation Analysis
# Cluster-wide memory health check
for node in $(redis-cli CLUSTER NODES | awk '{print $2}' | cut -d: -f1); do
  echo "=== Node $node ==="
  redis-cli -h "$node" INFO memory | grep -E "(used_memory_human|mem_fragmentation_ratio)"
  redis-cli -h "$node" MEMORY STATS | grep fragmentation
done
Slot Migration Monitoring
# Check for stuck migrations
redis-cli CLUSTER NODES | grep -E "(importing|migrating)"
# Find orphaned keys after a failed migration (samples up to 100 keys in the slot)
redis-cli CLUSTER GETKEYSINSLOT <slot> 100
Client Configuration Patterns
Redirection Handling
Problem: MOVED/ASK redirection storms during topology changes
Solution: Use cluster-aware clients that cache slot mappings and follow redirects automatically (the redis-cli example after this list shows the mechanics)
- Python: redis-py 4.1+ (the RedisCluster class absorbed the old redis-py-cluster package)
- Java: Lettuce with async cluster support
- Node.js: node-redis v4 with cluster mode
- .NET: StackExchange.Redis with cluster endpoints
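redis-cli demonstrates the same mechanics: without -c it surfaces the raw MOVED error, with -c it follows the redirect the way a cluster-aware library does (the key, slot number, and addresses below are illustrative):
$ redis-cli -h redis-node-1 GET user:1234
(error) MOVED 13748 10.0.1.3:6379      # raw redirect: ask the node that owns this slot
$ redis-cli -c -h redis-node-1 GET user:1234
-> Redirected to slot [13748] located at 10.0.1.3:6379
"cached-value"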
Circuit Breaker Integration
Pattern: Detect cluster instability and fall back to read-only mode
Implementation: Monitor for CLUSTERDOWN responses and apply exponential backoff, as sketched below
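A minimal shell sketch of the pattern; a real implementation belongs in the application client, and the hostname and intervals here are assumptions:
# Suspend writes while the cluster reports anything other than ok, with capped backoff
backoff=1
while true; do
  state=$(redis-cli -h redis-node-1 CLUSTER INFO 2>/dev/null | tr -d '\r' | \
          awk -F: '/^cluster_state/{print $2}')
  if [ "$state" = "ok" ]; then
    backoff=1        # circuit closed: resume normal writes
    sleep 5          # normal health-check interval
  else
    echo "cluster_state=${state:-unreachable} - writes suspended, retrying in ${backoff}s"
    sleep "$backoff"
    backoff=$(( backoff < 60 ? backoff * 2 : 60 ))   # exponential backoff, capped at 60s
  fi
done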
When Redis Cluster Is Wrong Choice
Data Consistency Requirements
Redis limitation: Eventually consistent, no strong consistency guarantees
Better alternatives:
- PostgreSQL with streaming replication for transactional data
- MongoDB replica sets with proper read concerns
- Cassandra with tunable consistency levels
Operational Complexity Thresholds
Team size: <3 experienced Redis engineers = consider managed solutions
Failure tolerance: Zero data loss requirements = Redis clustering inappropriate
Network reliability: Frequent partitions = choose consensus-based systems (etcd, Consul)
Monitoring and Alerting Setup
Critical Metrics for Production
# Memory pressure indicators
used_memory_rss > 80% of allocated
mem_fragmentation_ratio > 1.5
used_memory_lua > 5MB
# Cluster health indicators
cluster_state != "ok"
cluster_slots_fail > 0
cluster_known_nodes != expected_count
# Replication health
master_repl_offset - slave_repl_offset > 50MB
connected_slaves != expected_replica_count
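Those rules translate into a cron-able shell check; a partial sketch, with hostnames assumed:
NODES="redis-node-1 redis-node-2 redis-node-3"
for node in $NODES; do
  state=$(redis-cli -h "$node" CLUSTER INFO 2>/dev/null | tr -d '\r' | \
          awk -F: '/^cluster_state/{print $2}')
  [ "$state" != "ok" ] && echo "P1: $node cluster_state=${state:-unreachable}"
  frag=$(redis-cli -h "$node" INFO memory 2>/dev/null | tr -d '\r' | \
         awk -F: '/^mem_fragmentation_ratio/{print $2}')
  awk -v f="$frag" -v n="$node" 'BEGIN { if (f+0 > 1.5) print "P2: " n " fragmentation " f }'
done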
Alert Escalation Matrix
- P1 (immediate): CLUSTERDOWN errors, split-brain detection, >80% memory usage
- P2 (within 1 hour): High fragmentation, replication lag >10MB, migration timeouts
- P3 (next business day): Cluster message rate elevation, minor slot imbalances
Emergency Response Procedures
Cluster Completely Down
- Identify the node with the most recent data (highest LASTSAVE timestamp; see the ranking sketch after this list)
- Start that node in standalone mode temporarily
- Point applications to standalone instance for read operations
- Rebuild cluster from backup or remaining healthy nodes
- Validate data consistency before full restoration
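A hedged way to rank candidates for step 1, assuming the six-node naming used earlier:
# Highest LASTSAVE timestamp first; unreachable nodes sort to the bottom
for node in redis-node-{1..6}; do
  echo "$(redis-cli -h "$node" LASTSAVE 2>/dev/null || echo 0) $node"
done | sort -rn | head -3
# Start the winner standalone (command-line overrides assumed to match your packaging):
#   redis-server /etc/redis/redis.conf --cluster-enabled no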
Partial Cluster Failure
- Check which slots are affected:
redis-cli CLUSTER SLOTS
- If <50% of slots are affected, continue serving with the remaining nodes (requires cluster-require-full-coverage no)
- Manually migrate affected slots to healthy nodes (see the fix sketch below)
- Add replacement nodes and rebalance
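redis-cli ships a repair helper that automates most of these steps; a sketch assuming redis-node-1 is still healthy:
# Closes stuck importing/migrating slots and reassigns uncovered slots to live masters
redis-cli --cluster fix redis-node-1:6379
# Then spread data evenly across surviving and replacement masters
redis-cli --cluster rebalance redis-node-1:6379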
Data Recovery Priority
- Session data: Usually acceptable to lose, restart user sessions
- Cache data: Acceptable to lose, will rebuild from primary data store
- Primary data: MUST be recovered, may require point-in-time restore from backups
Resource Requirements and Scaling Limits
Minimum Production Setup
- 3 master nodes minimum for split-brain protection
- 1 replica per master for basic high availability
- 16GB RAM per node minimum for realistic workloads
- 10Gbps network between nodes for slot migration performance
Scaling Considerations
- Memory overhead: 15-25% additional RAM for cluster metadata and replication buffers
- Network bandwidth: Slot migrations can saturate 1Gbps links
- CPU overhead: Gossip protocol scales O(n²) with node count
- Operational complexity: Exponential growth beyond 12 nodes
Hard Limits
- Maximum realistic cluster size: 10-15 nodes before operational complexity overwhelms benefits
- Single key size limit: 512MB theoretical, 10MB practical for migration performance
- Slot migration time: >1 hour indicates serious infrastructure problems
- Network partition tolerance: >30 seconds triggers cluster reconfiguration
This reference prioritizes actionable information for production Redis cluster deployment, troubleshooting, and recovery scenarios based on real-world operational experience.
Useful Links for Further Investigation
Essential Redis Clustering Resources for Production Issues
| Link | Description |
|---|---|
| Redis Cluster Specification | The authoritative source for understanding hash slots, the gossip protocol, and failover mechanics. Essential reading before deploying clusters. |
| Redis Cluster Tutorial | Step-by-step cluster setup guide. Good for understanding the basics, but doesn't cover production failure scenarios. |
| Redis Memory Optimization Guide | Official memory management documentation. Covers eviction policies and fragmentation monitoring. |
| redis-cli Cluster Commands Reference | Complete reference for cluster management commands. CLUSTER NODES, CLUSTER INFO, and CLUSTER SLOTS are essential for debugging. |
| Redis Troubleshooting Guide - Site24x7 | Comprehensive troubleshooting guide covering cluster instability, network partitions, and performance bottlenecks with specific error codes. |
| RedisInsight | Official GUI tool for Redis monitoring. Visualizes cluster topology, memory usage, and slow queries. Essential for production monitoring. |
| Prometheus Redis Exporter | Open-source monitoring solution. Provides detailed metrics for cluster health, memory usage, and replication lag. |
| Grafana Redis Dashboard | Pre-built dashboard for visualizing Redis cluster metrics. Shows cluster state, slot distribution, and node health. |
| Datadog Redis Integration | Enterprise monitoring with alerting on cluster failures, memory pressure, and replication issues. |
| Redis Anti-Patterns Guide | Official guide covering common production mistakes. Includes sections on hot keys, memory management, and cluster configuration issues. |
| AWS ElastiCache for Redis Best Practices | Production deployment patterns for managed Redis. Covers network configuration, security, and failover scenarios. |
| Azure Cache for Redis Documentation | Microsoft's managed Redis documentation with clustering configuration and troubleshooting guides. |
| Redis Google Group | Active community forum where Redis core developers answer production issues. Search for specific error messages and failure patterns. |
| Redis Stack Overflow | Community Q&A for specific clustering problems. Good source for real-world troubleshooting solutions. |
| Redis GitHub Issues | Official issue tracker. Search for clustering bugs and known issues in your Redis version. |
| redis-py-cluster | Python client with full cluster support, now merged into redis-py 4.1+. Handles MOVED/ASK redirects and slot mapping automatically. |
| Lettuce (Java) | High-performance Java Redis client with async cluster support. Includes connection pooling and automatic failover. |
| node-redis v4 | Node.js client with built-in cluster support. Handles topology changes and provides connection health monitoring. |
| StackExchange.Redis (.NET) | .NET client with cluster support and configuration string management. Good for Azure-based deployments. |
| redis-benchmark | Official benchmarking tool. Use the --cluster flag to test cluster performance and identify bottlenecks. |
| memtier_benchmark | Advanced benchmarking tool with cluster support. Better for realistic load testing than redis-benchmark. |
| Redis Data Recovery Procedures | Official guide for recovering from data corruption, failed migrations, and cluster state issues. |
| Cluster Recovery Scripts | Collection of scripts for cluster recovery scenarios. Includes slot migration utilities and health checks. |
| Redis Security Guide | Comprehensive security configuration including ACLs, TLS, and network isolation for clusters. |
| Redis Network Troubleshooting | Network configuration guide covering cluster bus ports, firewall rules, and DNS requirements. |
| Redis vs Alternatives Comparison | When Redis clustering isn't the right solution. Covers alternatives like Memcached, Hazelcast, and database caching. |
| Migration from Redis OSS to Redis Enterprise | Guide for migrating to Redis Enterprise when open-source clustering becomes too complex to manage. |