I've implemented CDC at three companies. Here's what actually happens and why you'll end up doing it too.
The Problem Everyone Hits
Your data team starts with nightly ETL jobs. Works great until:
- Business wants "real-time" dashboards (they mean 5-minute refresh, you know it means hours of debugging)
- Someone changes a database column and your entire pipeline dies at 3AM
- Users complain data is "stale" when it's only 6 hours behind
- You need to sync data between 5 different systems and each sync takes longer than the last
How CDC Actually Works
Your database already logs every change to its transaction log - PostgreSQL calls it WAL, MySQL calls it binlog. CDC just taps into that stream and says "hey, this row changed, here's what happened." No queries hammering your production tables, no full table scans at 3am.
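To make that concrete, here's roughly what tapping the WAL looks like from Python, using psycopg2's logical replication support with the wal2json output plugin. The slot name, DSN, and plugin choice are my assumptions rather than a canonical setup, and the database needs wal_level=logical - treat it as a sketch, not production code.

```python
# Sketch of log-based CDC against PostgreSQL via logical decoding.
# Assumes wal_level=logical and the wal2json output plugin are installed;
# the slot name and DSN below are placeholders.
import psycopg2
import psycopg2.extras

conn = psycopg2.connect(
    "dbname=app user=cdc_reader",
    connection_factory=psycopg2.extras.LogicalReplicationConnection,
)
cur = conn.cursor()

# Create the replication slot once; it remembers our position in the WAL.
try:
    cur.create_replication_slot("my_cdc_slot", output_plugin="wal2json")
except psycopg2.errors.DuplicateObject:
    pass  # slot already exists from a previous run

cur.start_replication(slot_name="my_cdc_slot", decode=True)

def handle_change(msg):
    # msg.payload is a JSON string describing the inserts/updates/deletes
    print(msg.payload)
    # Acknowledge the LSN so PostgreSQL knows it can recycle old WAL segments.
    msg.cursor.send_feedback(flush_lsn=msg.data_start)

cur.consume_stream(handle_change)  # blocks, streaming changes as they commit
```

That send_feedback call matters more than it looks: it's what lets PostgreSQL throw away old WAL, which comes back to bite you in the retention section below.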
Three ways to actually implement this stuff:
Log-based CDC - The good shit. Read transaction logs directly. Works great if your database isn't ancient.
Trigger-based CDC - Database triggers fire on every change. Sure, it works everywhere, but watching your production queries slow to a crawl isn't fun.
Query-based CDC - Just poll for changes using timestamps. Simple as hell, but you'll miss deletes and it's not really real-time (see the polling sketch below).
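For contrast, here's the polling flavor in a few lines. The orders table and its updated_at column are made up for the example, and notice there's no way for this loop to ever see a deleted row:

```python
# Sketch of query-based CDC: poll a table for rows changed since the last run.
# Assumes a hypothetical `orders` table with an indexed `updated_at` column.
import time
import psycopg2

conn = psycopg2.connect("dbname=app user=cdc_reader")
last_seen = "1970-01-01 00:00:00"

while True:
    with conn.cursor() as cur:
        cur.execute(
            "SELECT id, status, updated_at FROM orders "
            "WHERE updated_at > %s ORDER BY updated_at",
            (last_seen,),
        )
        for row_id, status, updated_at in cur.fetchall():
            print(row_id, status, updated_at)
            last_seen = str(updated_at)  # checkpoint; persist this somewhere real
    conn.commit()
    time.sleep(60)  # "real-time" here actually means once a minute
```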
When CDC Actually Helps
CDC shines when:
- You have high-change tables that need to sync quickly
- Downstream systems can't wait for batch runs
- You need to replicate deletes (query-based polling can't see them at all)
- Source system can't handle heavy query load from ETL
When it doesn't help:
- Low-change tables (less than 1000 changes/day)
- Complex transformations (do those downstream)
- Compliance requires batch processing
- Legacy databases with shitty log access
The Real Implementation Pain Points
WAL Retention Hell: PostgreSQL WAL files will fill your disk if CDC falls behind. Set max_slot_wal_keep_size or you'll run out of space. I watched Ubuntu systems shit the bed when /var/lib/postgresql/data hits 95% - the server just stops responding and you're SSH'ing in at 2am to clean up WAL files. This Stack Overflow post shows the exact problem that made me lose a weekend.
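A cheap guard is a cron job that watches slot lag before the disk does it for you. Something like this sketch - the DSN and threshold are placeholders, and the wal_status column needs PostgreSQL 13+:

```python
# Sketch: alert when a replication slot is holding back too much WAL.
# Assumes PostgreSQL 13+ (for wal_status) and a placeholder DSN/threshold.
import psycopg2

MAX_RETAINED_BYTES = 10 * 1024**3  # 10 GiB of retained WAL before we page someone

conn = psycopg2.connect("dbname=app user=monitor")
with conn.cursor() as cur:
    cur.execute("""
        SELECT slot_name,
               active,
               wal_status,
               pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_bytes
        FROM pg_replication_slots
    """)
    for slot, active, wal_status, retained in cur.fetchall():
        if wal_status in ("unreserved", "lost") or (retained or 0) > MAX_RETAINED_BYTES:
            print(f"ALERT: slot {slot} (active={active}) retaining {retained} bytes, "
                  f"wal_status={wal_status}")
```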
MySQL Binlog Position Tracking: Lose track of the binlog position and you're either missing data or reprocessing everything. The Debezium MySQL connector docs explain position tracking, but good luck finding the relevant section.
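If you're rolling your own instead of using Debezium, the trick is to checkpoint the binlog file and position after every event you process, so a restart resumes instead of skipping or replaying. A rough sketch with the python-mysql-replication package - connection settings, server_id, and the checkpoint file are placeholders:

```python
# Sketch: tail the MySQL binlog and checkpoint the position after each event.
# Uses the python-mysql-replication package; connection details are placeholders.
import json
from pymysqlreplication import BinLogStreamReader
from pymysqlreplication.row_event import DeleteRowsEvent, UpdateRowsEvent, WriteRowsEvent

CHECKPOINT = "binlog_position.json"

try:
    with open(CHECKPOINT) as f:
        pos = json.load(f)            # {"log_file": "...", "log_pos": 12345}
except FileNotFoundError:
    pos = {"log_file": None, "log_pos": None}

stream = BinLogStreamReader(
    connection_settings={"host": "db", "port": 3306, "user": "cdc", "passwd": "secret"},
    server_id=4242,                   # must be unique among replicas
    resume_stream=True,
    log_file=pos["log_file"],
    log_pos=pos["log_pos"],
    only_events=[WriteRowsEvent, UpdateRowsEvent, DeleteRowsEvent],
    blocking=True,
)

for event in stream:
    for row in event.rows:
        print(event.table, type(event).__name__, row)
    # Durably record where we are in the binlog after handling the event.
    with open(CHECKPOINT, "w") as f:
        json.dump({"log_file": stream.log_file, "log_pos": stream.log_pos}, f)
```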
Schema Evolution: Adding a column is fine. Renaming or dropping columns will break your CDC pipeline in exciting ways. Debezium pain points blog covers what actually breaks in production.
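One cheap defense on the consumer side is to validate each change event against the columns you expect and park anything surprising instead of letting the whole pipeline crash. A sketch, assuming a flat dict-shaped event and a made-up column set:

```python
# Sketch: guard a CDC consumer against schema drift. The event shape here
# (a flat dict of column -> value) and the expected column set are assumptions.
EXPECTED_COLUMNS = {"id", "status", "amount", "updated_at"}

dead_letter = []

def apply_change(event: dict) -> None:
    cols = set(event.keys())
    missing = EXPECTED_COLUMNS - cols     # column dropped or renamed upstream
    extra = cols - EXPECTED_COLUMNS       # column added upstream
    if missing:
        # Don't guess: park the event and alert, rather than writing nulls downstream.
        dead_letter.append((event, f"missing columns: {sorted(missing)}"))
        return
    if extra:
        # New columns are usually benign; log them so someone updates the schema.
        print(f"note: ignoring new columns {sorted(extra)}")
    row = {c: event[c] for c in EXPECTED_COLUMNS}
    print("would upsert:", row)

apply_change({"id": 1, "status": "paid", "amount": 42, "updated_at": "2024-01-01"})
apply_change({"id": 2, "status": "paid", "amount": 42})  # simulated dropped column
```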
Network Partitions: When your CDC process can't reach Kafka for 30 minutes, fun things happen to your lag metrics. Kafka Connect troubleshooting has the monitoring queries you'll need.
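One basic check you can script yourself is polling the Kafka Connect REST API's /connectors/&lt;name&gt;/status endpoint and alerting on FAILED tasks. A minimal sketch with requests - the Connect URL and connector name are placeholders:

```python
# Sketch: poll the Kafka Connect REST API for connector/task health.
# The Connect URL and connector name below are placeholders.
import requests

CONNECT_URL = "http://kafka-connect:8083"
CONNECTOR = "inventory-cdc"

status = requests.get(f"{CONNECT_URL}/connectors/{CONNECTOR}/status", timeout=10).json()

connector_state = status["connector"]["state"]
failed_tasks = [t for t in status.get("tasks", []) if t["state"] == "FAILED"]

if connector_state != "RUNNING" or failed_tasks:
    print(f"ALERT: {CONNECTOR} connector={connector_state}, "
          f"failed tasks={[t['id'] for t in failed_tasks]}")
    # Optionally kick the failed tasks via the Connect REST restart endpoint.
    for t in failed_tasks:
        requests.post(f"{CONNECT_URL}/connectors/{CONNECTOR}/tasks/{t['id']}/restart",
                      timeout=10)
```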