MySQL replication copies data from one server (the "source" - Oracle changed the name from "master", because reasons) to other servers (called "replicas" - formerly "slaves"). The idea is simple: when your primary database writes data, the change gets copied to the replicas - asynchronously by default, which is not the same as "in real time", and that gap matters later. When your primary server dies, you promote a replica to become the new primary. Simple in theory, nightmare in practice.
I learned about replication the hard way on a Tuesday at 3:17am when our e-commerce site went down. Primary MySQL server had a hardware failure, and our "hot standby" turned out to be 4 hours behind due to replication lag. Spent the next 6 hours explaining to management why we lost $30k in orders while scrambling to restore from backups. That was my introduction to the wonderful world of MySQL replication.
The Binary Log: Your Friend and Enemy
Binary log formats at a glance:
- Statement-based logs the statement itself: INSERT INTO users VALUES (1, 'John', NOW())
- Row-based logs the resulting row: id=1, name='John', created='2025-09-06 15:42:33'
- Mixed switches between the two based on statement safety
MySQL replication works through the binary log (binlog) - a sequential record of every data change on your server. When you UPDATE a customer record, that change gets written to the binlog. Replica servers read these binlog entries and replay the same changes locally, keeping data in sync.
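If you want to see what's actually flowing through the binlog, you can inspect it directly. A quick sketch - the file name mysql-bin.000042 is just an example, use whatever SHOW BINARY LOGS reports on your server:
SHOW BINARY LOGS; -- list the binlog files the server currently has
SHOW BINARY LOG STATUS; -- current file and position (MySQL 8.2+; SHOW MASTER STATUS on older versions)
And from the shell, mysqlbinlog --verbose mysql-bin.000042 decodes row events into readable pseudo-SQL, which is invaluable when you're trying to figure out what a replica actually received.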
Here's the catch: the binlog has three formats, and choosing wrong will fuck you:
- Statement-based: Logs the actual SQL statements. Sounds logical until you realize NOW() or RAND() give different results on replicas than on the source
- Row-based: Logs the actual data changes. Works reliably but creates massive binlog files
- Mixed: Supposedly combines the best of both. In reality, it picks the worst option at the worst time
Use row-based (binlog_format=ROW) unless you enjoy debugging why your replica has different data than your source. Trust me on this one - I've been burned by statement-based replication more times than I care to count.
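To make it stick across restarts, set it in my.cnf. A minimal sketch - the server_id value is a placeholder, and every server in the topology needs a unique one:
[mysqld]
server_id = 1 # must be unique across the topology
log_bin = mysql-bin # binary logging (on by default in MySQL 8.x)
binlog_format = ROW # log row changes, not statements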
The modern MySQL 8.4 implementation has parallel replication that actually works (most of the time), compared to earlier versions where parallel replication was more of a marketing term than a working feature.
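The main knob is the number of applier threads on the replica. A hedged sketch - 4 is the 8.4 default, and more workers only helps if your workload actually has independent transactions to apply in parallel:
STOP REPLICA SQL_THREAD;
SET GLOBAL replica_parallel_workers = 8; -- applier threads; benchmark before committing to a number
START REPLICA SQL_THREAD;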
GTID: When MySQL Got Its Shit Together
GTID Transaction Flow:
Step 1: Source generates a unique GTID: 3E11FA47-71CA-11E1-9E33-C80AA9429562:1
Step 2: The transaction is written to the binlog with its GTID
Step 3: The replica receives the binlog event and checks its executed GTID set
Step 4: If the GTID hasn't been executed, the replica applies the transaction and marks it as executed
Step 5: If the GTID has already been executed, the replica skips it to prevent duplicate application
Global Transaction Identifiers (GTIDs) are MySQL's attempt at fixing the clusterfuck that is position-based replication. Instead of tracking mysterious binlog positions like mysql-bin.000042, position 1337, each transaction gets a unique ID like 3E11FA47-71CA-11E1-9E33-C80AA9429562:23.
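You can inspect that bookkeeping on any server; a quick sketch:
SELECT @@GLOBAL.gtid_executed; -- every transaction this server has applied
SELECT @@GLOBAL.gtid_purged; -- transactions already rotated out of the binlog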
Before GTID, promoting a replica to primary meant:
- Figure out where the replica stopped replicating
- Calculate binlog positions for other replicas to catch up
- Pray you got the math right
- Usually get it wrong and corrupt data
With GTID, promoting a replica is simple:
STOP REPLICA;
RESET REPLICA ALL;
Then point other replicas to the new source; MySQL figures out which transactions they're missing and catches them up automatically. It's fucking magic when it works.
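Repointing a surviving replica at the new source is one statement. A sketch - the host and credentials are placeholders for your own:
STOP REPLICA;
CHANGE REPLICATION SOURCE TO
SOURCE_HOST = 'new-primary.example.com',
SOURCE_USER = 'repl',
SOURCE_PASSWORD = 'repl_password',
SOURCE_AUTO_POSITION = 1; -- GTID auto-positioning: no binlog file/position math
START REPLICA;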
GTID has one major gotcha: you can't easily go back to position-based replication once you enable it. It's a one-way door, so test thoroughly first. MySQL 8.4's GTID improvements include better error handling and transaction tagging, which helps when you need to figure out which application broke replication at 2am.
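For the record, turning GTID on is two settings. A minimal my.cnf sketch - on a live server you'd walk gtid_mode through OFF_PERMISSIVE and ON_PERMISSIVE rather than flipping straight to ON:
[mysqld]
gtid_mode = ON
enforce_gtid_consistency = ON # reject transactions that can't safely be given a GTID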
Replication Flavors That'll Make Your Life Hell
Single Source with Multiple Replicas is what everyone starts with. One server takes writes, multiple replicas handle reads. Works great until your source dies and you realize your replicas are 20 seconds behind, meaning you just lost 20 seconds of customer orders. That's when you learn about semi-synchronous replication.
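Semi-sync makes the source wait until at least one replica has received (not necessarily applied) a transaction before acknowledging the commit to the client, which caps how much you can lose. A sketch using the stock plugins shipped with MySQL 8.x - verify the names against your version:
-- on the source
INSTALL PLUGIN rpl_semi_sync_source SONAME 'semisync_source.so';
SET GLOBAL rpl_semi_sync_source_enabled = 1;
SET GLOBAL rpl_semi_sync_source_timeout = 1000; -- ms to wait before degrading to async
-- on each replica
INSTALL PLUGIN rpl_semi_sync_replica SONAME 'semisync_replica.so';
SET GLOBAL rpl_semi_sync_replica_enabled = 1;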
Multi-Source Replication lets one replica receive data from multiple sources. Useful for data warehousing where you want to aggregate data from different databases. The fun starts when the same customer exists in both sources with different email addresses. MySQL doesn't resolve conflicts - that's your problem now.
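Mechanically, each upstream gets its own named replication channel. A sketch with hypothetical hosts source1 and source2 and made-up channel names (credentials omitted):
CHANGE REPLICATION SOURCE TO SOURCE_HOST = 'source1', SOURCE_AUTO_POSITION = 1 FOR CHANNEL 'orders';
CHANGE REPLICATION SOURCE TO SOURCE_HOST = 'source2', SOURCE_AUTO_POSITION = 1 FOR CHANNEL 'billing';
START REPLICA FOR CHANNEL 'orders';
START REPLICA FOR CHANNEL 'billing';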
Group Replication is Oracle's fancy distributed system that promises to solve all your problems. In reality, it works beautifully in controlled environments and falls apart spectacularly when network partitions happen or you have more than 3 nodes. It's built on a Paxos-based group communication protocol (XCom), which is great until split-brain scenarios leave you with two "primary" nodes both accepting writes.
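When you suspect a partition, ask each node what it thinks the group looks like - if two nodes both report themselves as PRIMARY, you've found your split-brain:
SELECT member_host, member_state, member_role
FROM performance_schema.replication_group_members;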
The dirty secret is that Facebook, GitHub, and YouTube don't use Group Replication for their critical systems. They use traditional async replication with custom tooling like Vitess or ProxySQL because it's predictable. When you're serving millions of users, predictable beats fancy every time.
Tools that actually help in production include MySQL Shell for administration, pt-heartbeat for lag monitoring, and orchestrator for failover automation. The MySQL replication troubleshooting guide is your bible when things break at 3am.
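pt-heartbeat deserves the mention because it measures lag by writing a timestamp row on the source and reading it back on the replica, rather than trusting Seconds_Behind_Source. A sketch - the database name, hosts, and credentials are placeholders:
# on the source: insert/refresh a heartbeat row every second
pt-heartbeat --update --database percona --create-table --host source-host --user monitor --password monitor_password
# on the replica: report how stale the heartbeat row is
pt-heartbeat --monitor --database percona --host replica-host --user monitor --password monitor_password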
Understanding these replication fundamentals is crucial, but the real education happens when you try to implement them in production. That's where theory meets the harsh realities of network latency, hardware failures, and applications that weren't designed for distributed systems.