Database Replication: AI-Optimized Technical Reference
Overview
Database replication copies data to multiple servers in real-time to prevent single points of failure. When the primary database fails, traffic switches to replicas. Essential for production systems but introduces significant operational complexity.
Critical Failure Scenarios
Why Databases Will Fail
- Hard drives die - Physical hardware failure inevitable
- Servers catch fire - Environmental disasters occur
- Network cables unplugged - Human error (janitors, maintenance)
- AWS outages - Cloud providers fail regularly
- Black Friday incident example: MySQL master failure during peak traffic = 2 hours downtime, $50k lost sales
Performance Impact Reality Check
- Synchronous replication: 10-30% performance reduction typical, up to 40-60% in practice
- Asynchronous replication: 3-8% performance reduction
- Semi-synchronous: 5-15% performance reduction
- Cross-region sync: Response times 50ms → 300ms due to network latency
Replication Types and Trade-offs
Master-Slave (Primary-Replica)
Configuration:
- One server handles all writes
- Changes copied to read-only replicas via binary log
- Most reliable starting point
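As a concrete starting point, a minimal MySQL primary-replica setup looks roughly like the sketch below. This assumes MySQL 8.0.23+ syntax (older versions use CHANGE MASTER TO / START SLAVE); the hostnames, user, and password are placeholders.

```sql
-- On the primary: a dedicated replication account (placeholder names and subnet).
CREATE USER 'repl'@'10.0.0.%' IDENTIFIED BY 'change-me';
GRANT REPLICATION SLAVE ON *.* TO 'repl'@'10.0.0.%';

-- On the replica: point it at the primary and start applying the binary log.
CHANGE REPLICATION SOURCE TO
  SOURCE_HOST = '10.0.0.10',
  SOURCE_USER = 'repl',
  SOURCE_PASSWORD = 'change-me',
  SOURCE_AUTO_POSITION = 1;   -- requires GTID mode enabled on both servers
START REPLICA;

-- Verify: Replica_IO_Running and Replica_SQL_Running should both say Yes.
SHOW REPLICA STATUS\G
```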
Critical Issues:
- MySQL replication randomly stops with unhelpful errors: "Error reading packet from server: Connection reset by peer (2013)"
- Half the failures: replica ran out of disk space
- Other half: MySQL 8.0.28 networking issues
Multi-Master
Warning: Avoid Unless Forced
- Allows writes to multiple databases simultaneously
- High failure risk: Write conflicts create data inconsistency
- Real example: User accounts randomly disappeared due to simultaneous deletions on different masters
- Error logs show "duplicate key" with no useful context
- Complexity outweighs benefits for most use cases
Synchronous vs Asynchronous
| Type | Added Latency / Lag | Data Loss | Performance Impact | Real-World Use |
|---|---|---|---|---|
| Synchronous | High (5-15ms per commit) | Zero | 10-30% reduction (40-60% in practice) | Financial systems only |
| Asynchronous | Low commit latency, replica lag of seconds | Minimal-moderate | 3-8% reduction | Most production systems |
| Semi-Synchronous | Medium (1-5ms per commit) | Very low | 5-15% reduction | Recommended sweet spot |
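To land in the semi-synchronous sweet spot on MySQL, the setup is roughly the sketch below. Plugin names assume MySQL 8.0.26+ (older versions use rpl_semi_sync_master / rpl_semi_sync_slave); the 1-second timeout is illustrative, not a recommendation.

```sql
-- On the primary: load and enable the semi-sync source plugin.
INSTALL PLUGIN rpl_semi_sync_source SONAME 'semisync_source.so';
SET GLOBAL rpl_semi_sync_source_enabled = ON;
SET GLOBAL rpl_semi_sync_source_timeout = 1000;  -- ms to wait before falling back to async

-- On each replica: load the replica-side plugin and restart the IO thread
-- so it re-registers with the primary as semi-synchronous.
INSTALL PLUGIN rpl_semi_sync_replica SONAME 'semisync_replica.so';
SET GLOBAL rpl_semi_sync_replica_enabled = ON;
STOP REPLICA IO_THREAD;
START REPLICA IO_THREAD;
```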
Network Latency Constraints
Physics Limitations
- Speed of light limit: ~1ms per 100 miles
- New York to London: ~3,500 miles ≈ 35ms minimum one-way latency
- Cross-region synchronous replication: effectively ruled out; every commit pays the full round trip
- Sub-10ms required for usable synchronous replication
- Over 10ms: Database performance severely degraded
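Working through the rule of thumb above for the New York-London case, every synchronous commit waits on at least one network round trip:

$$ t_{\text{one-way}} \approx 3500\ \text{miles} \times \frac{1\ \text{ms}}{100\ \text{miles}} = 35\ \text{ms}, \qquad t_{\text{round trip}} \approx 70\ \text{ms per commit} $$

which is why the practical limits below top out well before cross-country distances.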
Practical Limits
- Under 5ms: Synchronous replication usable
- 5-10ms: Performance degraded but functional
- Over 10ms: Asynchronous only
- 50ms+ (cross-country): Major performance issues
Database-Specific Implementation Reality
MySQL
Configuration Requirements:
- sync_binlog=1 (mandatory to prevent data loss)
- replica_parallel_workers=4 (not 16+, due to lock contention)
- Semi-synchronous replication recommended over pure async or full sync
- MySQL 8.0: Improved parallel replication but requires extensive tuning
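Assuming MySQL 8.0, these settings can be applied at runtime and persisted with SET PERSIST; a minimal sketch:

```sql
-- On the source: fsync the binary log on every commit.
SET PERSIST sync_binlog = 1;

-- On the replica: a handful of parallel apply workers, not dozens.
-- Stop the applier first so the worker-count change takes effect cleanly.
STOP REPLICA SQL_THREAD;
SET PERSIST replica_parallel_workers = 4;
SET PERSIST replica_preserve_commit_order = ON;  -- keep commits in source order
START REPLICA SQL_THREAD;
```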
Common Failures:
- Binary log position corruption
- Replication randomly stops with cryptic errors
- Network timeouts break replication state
- More replication threads ≠ better performance
PostgreSQL
Complexity Warning:
- Streaming replication solid but complex setup
- WAL files corrupt frequently
- 200-line log entries difficult to debug
- Logical replication: Replicates data changes but NOT schema changes
- Deploy new column → replica breaks
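For context, a bare-bones logical replication pair looks like the sketch below (PostgreSQL 10+; the connection string is a placeholder). The point about schema changes is the trap: DDL is not replicated, so apply the same ALTER TABLE on the subscriber before the publisher or the apply worker starts erroring.

```sql
-- On the publisher.
CREATE PUBLICATION app_pub FOR ALL TABLES;

-- On the subscriber (placeholder connection details).
CREATE SUBSCRIPTION app_sub
  CONNECTION 'host=10.0.0.10 dbname=app user=repl password=change-me'
  PUBLICATION app_pub;

-- Schema changes are NOT replicated: run new DDL on the subscriber first,
-- then on the publisher, or incoming rows for the new column will fail to apply.
```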
Critical Settings:
- max_wal_senders=3
- wal_keep_size=1GB
- shared_buffers = 25% of RAM
- 300+ configuration parameters in postgresql.conf
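These can be set without hand-editing postgresql.conf; a minimal sketch using ALTER SYSTEM (PostgreSQL 13+ for wal_keep_size; the 16GB value illustrates the 25%-of-RAM guidance on a 64GB box):

```sql
ALTER SYSTEM SET max_wal_senders = 3;
ALTER SYSTEM SET wal_keep_size = '1GB';
ALTER SYSTEM SET shared_buffers = '16GB';  -- roughly 25% of RAM on a 64GB server

-- wal_keep_size only needs a reload; max_wal_senders and shared_buffers need a restart.
SELECT pg_reload_conf();
```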
AWS Aurora
Marketing vs Reality:
- Advertised: Sub-second failover
- Actual: 30-60 seconds typical, up to 90 seconds during peak traffic
- Cross-region replicas: $1000+/month minimum
- Aurora Serverless: 15-30 second cold start kills performance benefits
- When Aurora breaks, customer stuck waiting for AWS support
Oracle Data Guard
Enterprise Cost Reality:
- Costs more than a small country's GDP
- $500k/year licenses don't include basic support
- 3+ hours hold time for support calls
- Works well on enterprise hardware, fails on AWS due to latency
Change Data Capture (CDC)
Debezium Implementation
Setup Complexity:
- Requires Kafka, Kafka Connect, Schema Registry
- 50+ interacting configuration parameters
- Processing overhead: 10x more events than expected
- MySQL binlog position tracking: Randomly corrupts
- Version 1.9.7 bug: Loses GTID positions after exactly 16,777,216 transactions
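Whichever CDC tool sits on top, the MySQL source has to produce row-based binlog events with stable positions; a sketch of the usual prerequisites, assuming MySQL 8.0 (gtid_mode can only move one step at a time):

```sql
-- Row-based events with full row images are what CDC connectors consume.
SET PERSIST binlog_format = 'ROW';
SET PERSIST binlog_row_image = 'FULL';

-- GTIDs make connector position tracking less fragile than file/offset pairs.
SET PERSIST enforce_gtid_consistency = ON;
SET PERSIST gtid_mode = OFF_PERMISSIVE;
SET PERSIST gtid_mode = ON_PERMISSIVE;
SET PERSIST gtid_mode = ON;
```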
Operational Issues:
- Replication stops with no error messages
- Debugging at 2am with useless logs: "consumer group rebalancing"
- Network issues corrupt CDC state
- Kafka cluster collapse under load
Performance Tuning Reality
Thread Configuration
- MySQL parallel replication: 4-8 threads maximum
- 32 threads slower than single-threaded due to coordination overhead
- More threads create lock contention, not performance gains
Compression and Batching
- LZ4 compression: Saves bandwidth, uses CPU - may worsen performance on CPU-limited instances
- Batch sizes: 100-500 transactions optimal
- Larger batches increase memory usage and lag
- Smaller batches waste network round-trips
Hardware Requirements
- SSDs mandatory: Spinning disks cannot keep up with transaction logs
- RAM: 70-80% of RAM for the database buffer pool (MySQL InnoDB guidance; PostgreSQL keeps shared_buffers near 25% and leaves the rest to the OS cache)
- Network: 1 Gbps minimum, 10 Gbps for high performance
- Never use WiFi for replication
Cost Analysis
Infrastructure Costs
- Basic master-slave: Double infrastructure costs minimum
- Cross-region replication: $1000-2000+/month for medium databases
- Aurora Global Database: $0.20/million write operations (adds up to $2000+/month for busy apps)
- Cloud egress fees: $0.09/GB for cross-region data transfer
Hidden Costs
- Human time: Debugging replication failures
- Operational complexity: 24/7 monitoring requirements
- Support costs: Enterprise database licensing and support
Critical Monitoring Requirements
Essential Metrics
- Replication lag > 30 seconds: Critical alert
- Replication lag > 5 minutes: System failure imminent
- Disk space on replicas: Transaction logs fill disk, kill database
- Network throughput: Saturated links cause lag spikes
- Error rates: MySQL replication stops randomly
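The lag numbers above are cheap to collect directly from the databases; a sketch of the usual checks (MySQL 8.0.22+ and PostgreSQL shown):

```sql
-- MySQL replica: Seconds_Behind_Source is the headline lag number
-- (older versions: SHOW SLAVE STATUS / Seconds_Behind_Master).
SHOW REPLICA STATUS\G

-- PostgreSQL standby: how far behind the last replayed transaction is.
SELECT now() - pg_last_xact_replay_timestamp() AS replay_lag;

-- PostgreSQL primary: per-standby lag in bytes of WAL.
SELECT client_addr, state,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;
```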
Monitoring Tools
- Percona Monitoring: Free, effective for MySQL/PostgreSQL
- DataDog: Paid, better alerting, fewer false positives
- pt-table-checksum: Verify replica data consistency
- MySQL Orchestrator: Automated MySQL failover
- pg_auto_failover: PostgreSQL automatic failover
Common Failure Patterns
Top 5 Failure Modes
- Disk space exhaustion: Transaction logs grow unbounded
- Network hiccups: 5-second connectivity blip corrupts replication state
- Schema changes: ALTER TABLE on master breaks replica mysteriously
- Time drift: Clock synchronization issues cause timestamp conflicts
- Memory leaks: Replication processes slowly consume all RAM
Failure Examples with Solutions
- MySQL replication stops: "Got fatal error 1236" → Set up automated restart scripts
- PostgreSQL WAL corruption: Monitor disk space and network stability
- Aurora failover delays: 30-60 seconds actual vs sub-second marketing
- CDC position corruption: Manual position reset required, data recovery needed
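For the disk-space exhaustion pattern above, the fix is usually boring retention limits rather than heroics; a sketch for MySQL 8.0 (the 3-day window is an assumption, size it to your disk and your slowest replica):

```sql
-- Expire binary logs automatically after 3 days.
SET PERSIST binlog_expire_logs_seconds = 259200;

-- Emergency lever when the disk is already nearly full
-- (only purge logs that every replica has already consumed).
PURGE BINARY LOGS BEFORE NOW() - INTERVAL 1 DAY;
```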
Security Configuration
Essential Security Measures
- TLS encryption: Mandatory for replication traffic, negligible performance impact
- Firewall rules: Limit replication traffic to specific IPs only
- Separate replication users: Minimal privileges, never use root
- Avoid VPNs: Unless compliance-required, direct encrypted connections better
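A sketch of a locked-down MySQL replication account matching these rules (subnet and password are placeholders):

```sql
-- Replication-only account, restricted to the replica subnet, TLS required.
CREATE USER 'repl'@'10.0.1.%' IDENTIFIED BY 'change-me' REQUIRE SSL;
GRANT REPLICATION SLAVE ON *.* TO 'repl'@'10.0.1.%';

-- No SELECT, no admin grants: the account can stream the binlog and nothing else.
SHOW GRANTS FOR 'repl'@'10.0.1.%';
```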
Disaster Recovery Procedures
Testing Requirements
- Monthly failover testing: Automated tools fail when needed most
- Manual procedures documented: Plain English instructions for 3am outages
- Chaos engineering: Randomly break staging to verify procedures
- RTO/RPO targets: Most apps tolerate 5 minutes downtime, 1 minute data loss
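The 3am manual procedure boils down to a handful of statements; a sketch for both engines (confirm lag is zero and application writes are stopped before running any of this):

```sql
-- MySQL: promote a replica once it has applied everything it received.
STOP REPLICA;
RESET REPLICA ALL;               -- forget the old source
SET GLOBAL super_read_only = OFF;
SET GLOBAL read_only = OFF;      -- start accepting writes

-- PostgreSQL 12+: promote a standby to primary.
SELECT pg_promote();
```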
Documentation Requirements
- Step-by-step failover procedures: Tested and updated monthly
- Emergency contact information: 24/7 availability
- Rollback procedures: When failover goes wrong
- Communication templates: Customer notifications, status updates
Cloud vs Self-Managed Trade-offs
Managed Services (Aurora, Cosmos DB, Cloud Spanner)
Pros:
- Hide operational complexity
- Automated failover and maintenance
- Enterprise support (when it works)
Cons:
- 2-3x cost premium
- Limited control during failures
- Vendor lock-in
- Support wait times during outages
Self-Managed
Pros:
- Full control over configuration
- Can debug and restart during failures
- Lower infrastructure costs
- No vendor lock-in
Cons:
- 24/7 operational responsibility
- Expertise requirements
- Manual failover procedures
- Complex monitoring setup
When to Avoid Certain Approaches
Multi-Master Replication
- Conflict resolution extremely complex
- Data corruption risk high
- Time spent debugging > time building features
- Use only when forced by requirements
Cross-Database Replication
- MySQL to PostgreSQL: Data type mapping failures
- AWS DMS: Terrible performance in production
- Schema changes break replication
- Performance degradation severe
Real-Time Analytics on Replicas
- Kills replication performance
- Use dedicated analytics databases instead
- Long analytical queries compete with the replication apply process and drive up lag
Recommended Starting Configuration
Simple Master-Slave Setup
- Start with one read replica in same region
- Use semi-synchronous replication
- Monitor replication lag and disk space
- Automated restart scripts for MySQL
- Monthly manual failover testing
Hardware Minimums
- SSD storage: Non-negotiable
- RAM: 32GB minimum for production
- Network: 1 Gbps dedicated connection
- CPU: Database-optimized instances
Essential Monitoring
- Replication lag alerts (30 second threshold)
- Disk space monitoring (80% threshold)
- Network throughput monitoring
- Error log analysis and alerting
Scaling Considerations
When to Add Replicas
- Read traffic exceeds primary capacity
- Geographic distribution requirements
- Disaster recovery requirements
- Analytical workload separation
Performance Limits
- Single master write bottleneck
- Network bandwidth saturation
- Replica lag increases with load
- Management complexity grows exponentially
This technical reference provides the operational intelligence needed to successfully implement and maintain database replication while avoiding common pitfalls that cause production failures.