Why CDC Exists (And Why You'll Eventually Need It)

I've implemented CDC at three companies. Here's what actually happens and why you'll end up doing it too.

The Problem Everyone Hits

Your data team starts with nightly ETL jobs. Works great until:

  • Business wants "real-time" dashboards (they mean 5-minute refresh, you know it means hours of debugging)
  • Someone changes a database column and your entire pipeline dies at 3AM
  • Users complain data is "stale" when it's only 6 hours behind
  • You need to sync data between 5 different systems and every added sync makes the batch window longer

How CDC Actually Works

Your database already logs every change to its transaction log - PostgreSQL calls it WAL, MySQL calls it binlog. CDC just taps into that stream and says "hey, this row changed, here's what happened." No queries hammering your production tables, no full table scans at 3am.
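
If you want to see what that stream actually looks like, you can open a throwaway logical replication slot yourself. A minimal PostgreSQL sketch, assuming wal_level = logical and the built-in test_decoding output plugin; the slot name is made up:

```sql
-- Requires wal_level = logical (set in postgresql.conf, needs a restart).
-- Create a throwaway logical slot using the built-in test_decoding plugin.
SELECT pg_create_logical_replication_slot('cdc_demo', 'test_decoding');

-- Change a row in another session, then peek at what the slot captured.
-- peek does not consume the changes; pg_logical_slot_get_changes() would.
SELECT lsn, xid, data
FROM pg_logical_slot_peek_changes('cdc_demo', NULL, NULL);

-- Drop the slot when you're done, otherwise it pins WAL on disk indefinitely.
SELECT pg_drop_replication_slot('cdc_demo');
```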

[Diagram: SCD Type 1 vs Type 2]

CDC Architecture Overview

Three ways to actually implement this stuff:

Log-based CDC - The good shit. Read transaction logs directly. Works great if your database isn't ancient.

Trigger-based CDC - Database triggers fire on every change. Sure, it works everywhere, but watching your production queries slow to a crawl isn't fun.

Query-based CDC - Just poll for changes using timestamps. Simple as hell, but you'll miss deletes and it's not really real-time.
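
The trigger-based and query-based flavors are easy to sketch in plain SQL. A rough sketch against a hypothetical orders table (trigger syntax is PostgreSQL 11+); the caveats described above show up directly in the code:

```sql
-- Trigger-based CDC, stripped to the bone: every write to orders also lands in an audit table.
-- This is exactly the per-write overhead that slows busy tables down.
CREATE TABLE orders_changes (
    change_id  bigserial   PRIMARY KEY,
    op         text        NOT NULL,              -- 'INSERT' / 'UPDATE' / 'DELETE'
    order_id   bigint      NOT NULL,
    changed_at timestamptz NOT NULL DEFAULT now()
);

CREATE OR REPLACE FUNCTION capture_order_change() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'DELETE' THEN
        INSERT INTO orders_changes (op, order_id) VALUES (TG_OP, OLD.id);
        RETURN OLD;
    ELSE
        INSERT INTO orders_changes (op, order_id) VALUES (TG_OP, NEW.id);
        RETURN NEW;
    END IF;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER orders_cdc
AFTER INSERT OR UPDATE OR DELETE ON orders
FOR EACH ROW EXECUTE FUNCTION capture_order_change();

-- Query-based CDC is just a poll on a timestamp column: hard deletes never show up,
-- and a row updated twice between polls only surfaces its latest state.
SELECT id, status, total, updated_at
FROM orders
WHERE updated_at > :last_watermark   -- watermark persisted by the poller between runs
ORDER BY updated_at;
```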

[Diagram: MySQL Binlog Configuration]

When CDC Actually Helps

CDC shines when:

  • You have high-change tables that need to sync quickly
  • Downstream systems can't wait for batch runs
  • You need to replicate deletes (triggers and polling struggle with this)
  • Source system can't handle heavy query load from ETL

When it doesn't help:

  • Low-change tables (less than 1000 changes/day)
  • Complex transformations (do those downstream)
  • Compliance requires batch processing
  • Legacy databases with shitty log access

The Real Implementation Pain Points

WAL Retention Hell: PostgreSQL WAL files will fill your disk if CDC falls behind. Set max_slot_wal_keep_size or you'll run out of space. I've watched Ubuntu systems shit the bed when /var/lib/postgresql/data hits 95% - the server just stops responding and you're SSH'ing in at 2am to clean up WAL files. This Stack Overflow post shows the exact problem that made me lose a weekend.
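
What saved us on PostgreSQL 13 and newer: cap how much WAL a lagging slot can pin, and watch slot health before the disk fills. A sketch; the 50GB figure is an example, size it to your volume:

```sql
-- Cap the WAL a lagging replication slot is allowed to retain (PostgreSQL 13+),
-- then reload the config. Better a dead slot than a dead server.
ALTER SYSTEM SET max_slot_wal_keep_size = '50GB';
SELECT pg_reload_conf();

-- Slot health check: wal_status flips to 'unreserved' or 'lost' as a slot
-- approaches (or passes) the point of being invalidated.
SELECT slot_name, active, wal_status, safe_wal_size
FROM pg_replication_slots;
```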

MySQL Binlog Position Tracking: Lose track of the binlog position and you're either missing data or reprocessing everything. Debezium MySQL connector docs explains position tracking but good luck finding the relevant section.
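
The sanity checks on the MySQL side are one-liners, worth running before you blame the connector:

```sql
-- Current binlog file and position: the coordinates a CDC connector has to track.
-- (Newer MySQL versions rename this to SHOW BINARY LOG STATUS.)
SHOW MASTER STATUS;

-- Which binlog files still exist on disk. If the connector's stored position
-- points into a purged file, you're doing a full re-snapshot.
SHOW BINARY LOGS;

-- CDC needs row-level events; STATEMENT or MIXED format won't cut it.
SHOW VARIABLES LIKE 'binlog_format';
```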

Schema Evolution: Adding a column is fine. Renaming or dropping columns will break your CDC pipeline in exciting ways. Debezium pain points blog covers what actually breaks in production.

Network Partitions: When your CDC process can't reach Kafka for 30 minutes, fun things happen to your lag metrics. Kafka Connect troubleshooting has the monitoring queries you'll need.

CDC Methods - What Actually Works vs What Sounds Good

| Method | Latency | Source Impact | Change Types | Complexity | Best For |
|---|---|---|---|---|---|
| Log-based CDC | Milliseconds | PostgreSQL WAL overhead is manageable; MySQL binlog can spike CPU | All (insert/update/delete) | High | Production systems, real-time analytics |
| Trigger-based CDC | Near real-time | Will kill your performance on busy tables | All (insert/update/delete) | Medium | Small-scale, audit requirements |
| Query-based CDC | Minutes to hours | Depends on query frequency; won't scale | Insert/update only | Low | Batch processing, simple use cases |

What Really Happens in Production

Those comparison tables look clean, but here's what actually happens when you implement CDC in production.

What "Minimal Performance Impact" Actually Means

The marketing says log-based CDC adds "1-3% overhead." In my experience:

  • PostgreSQL: WAL overhead is real but manageable if you size disk correctly
  • MySQL: Binlog I/O can spike during high-write periods
  • SQL Server: CDC can impact tempdb if you don't tune properly

You need monitoring. PostgreSQL CDC monitoring queries help track WAL generation rate and slot lag.
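
The query we actually alert on for PostgreSQL slot lag looks roughly like this (confirmed_flush_lsn applies to logical slots; physical slots report restart_lsn instead):

```sql
-- How far each replication slot trails the current WAL insert position.
-- A number that only ever grows means the consumer is down or can't keep up.
SELECT slot_name,
       active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) AS slot_lag
FROM pg_replication_slots;
```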

The Schema Change Minefield

[Diagrams: Debezium Server Architecture, Kafka-Based CDC Implementation, Redpanda-Based CDC Implementation]

"Automatic schema evolution" is mostly bullshit. Here's what actually works:

Adding columns: Usually fine, new columns appear as null
Renaming columns: Breaks everything. Plan downtime.
Changing data types: VARCHAR(50) to VARCHAR(100) works, INT to VARCHAR does not
Dropping columns: Some tools handle this, others shit the bed

[Diagram: SCD Type 2 Handling]

We learned to test schema changes on dev environments first. Revolutionary concept, I know.

Debugging CDC When It Breaks (And It Will Break)

Debezium Memory Issues: Debezium 1.9.x has memory leaks with large transactions. I learned this the hard way when our connector died during a 2M row batch update at 4am on a Saturday - took 3 hours to figure out it was a known issue. Upgrade to 2.x or restart connectors weekly. Don't be me.

Kafka Connect Restarts: Kafka Connect dies randomly. Crank the Connect worker's log4j level up to DEBUG and prepare for log diving.

WAL/Binlog Position Loss: This is your worst nightmare. Lost position means reprocessing everything or missing data. Took down prod for 2 hours when someone deleted our Kafka offsets topic. Debezium stores offsets in Kafka - monitor those topics like your life depends on it.

Lag Monitoring: Set up alerts on replication lag. When lag hits 10+ minutes, someone's phone should ring.

Tool Selection Reality

Debezium: Free but you'll spend 6 months learning it. Official documentation is scattered across 47 different pages. Debezium Slack community is where you'll actually get help.

Airbyte: Easier setup, costs money, connectors restart mysteriously. Your ops team will hate the random failures but love the UI.

AWS DMS: Slow as hell but it works. DMS CDC setup guide has everything you need. Your ops team already knows how to fix it when it breaks.

Fivetran: Expensive but actually works reliably. You pay for the pain you don't experience.

Cost Reality Check

"Open-source Debezium is free!" Sure, if you ignore:

  • Kafka cluster hosting ($2-5k/month depending on throughput)
  • Engineer time (20% of someone's job maintaining it, more during incidents)
  • Monitoring and alerting setup (because you WILL get paged at 3am)
  • The therapy costs from debugging Kafka Connect failures

Budget $50-100k/year for a production CDC setup including people, infrastructure, and your sanity.

When You Should Just Use Batch ETL

Don't let CDC become a golden hammer:

  • Tables with fewer than 10k changes/day - batch ETL is simpler
  • Heavy transformations - do them downstream, not in CDC
  • Compliance requirements that mandate batch processing
  • When your team doesn't have streaming expertise yet

Start simple, add complexity when you actually need it.

Questions From 3AM When Everything's Broken

Q: Why does my CDC lag keep increasing?

A: First things to check:

  1. WAL/binlog retention
    • PostgreSQL: SELECT pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) FROM pg_replication_slots;
  2. Kafka Connect memory
    • Default 1GB heap isn't enough for large transactions
  3. Network issues
    • Packet loss between database and CDC process
  4. Large transactions
    • Debezium struggles with 1M+ row transactions

Quick fix:

Restart Kafka Connect and monitor (yeah, "turn it off and on again" still works in 2025). Long-term: Add more memory and investigate why someone's running 5M row transactions at midnight.

Q: How do I debug Debezium when it just stops working?

A: Enable debug logging: set log4j.logger.io.debezium=DEBUG in your Connect worker's log4j config.

Check these logs:

  • Debezium connector logs for MySQL/PostgreSQL errors
  • Kafka Connect logs for serialization failures like "Value too large"
  • Database logs for replication slot issues

Common fixes:

  • Restart the connector (it's not embarrassing, it's Monday)
  • Increase max.request.size for large row changes
  • Check if someone dropped the replication slot
  • If you see FATAL: terminating connection due to conflict with recovery, someone fucked with WAL settings (check wal_level and max_wal_senders - took me 2 hours to figure this out the first time)

Q: What happens when my source database crashes during replication?

A:

PostgreSQL: Replication slots survive crashes but WAL files might get cleaned up. Check max_slot_wal_keep_size.

MySQL: Binlog position gets lost. Debezium stores this in Kafka topics, so as long as Kafka survived, you're fine.

Worst case: You'll need to do a fresh snapshot. Plan for 2-8 hours of downtime depending on table size. Yes, you'll be explaining to everyone why the "real-time" system needs 8 hours of downtime. Have coffee ready.

Q: Why do I have duplicate events in my target system?

A: CDC systems deliver "at-least-once" semantics.

Duplicates happen during:

  • Network failures between CDC and Kafka
  • Kafka Connect task rebalancing
  • Manual connector restarts

Fix: Implement idempotent downstream processing using primary keys (see the upsert sketch below) or add deduplication logic.
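
A minimal sketch of the idempotent-apply idea for a PostgreSQL target; the customers_target table and its columns are hypothetical, the point is that replaying the same change event twice converges to the same row:

```sql
-- Upsert keyed on the source primary key: duplicates and replays are harmless.
INSERT INTO customers_target (id, email, updated_at)
VALUES (:id, :email, :updated_at)
ON CONFLICT (id) DO UPDATE
  SET email      = EXCLUDED.email,
      updated_at = EXCLUDED.updated_at
  WHERE customers_target.updated_at <= EXCLUDED.updated_at;  -- ignore stale or out-of-order replays
```
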
Q: How do I handle schema changes without downtime?

A:

The safe way:

  1. Add new columns as nullable first
  2. Update the application to use the new schema
  3. Run a migration to populate the data
  4. Make the column non-nullable if needed

The dangerous way: Change schemas directly and pray your CDC tool handles it.

MySQL note: ALTER TABLE locks the table.

Use pt-online-schema-change for large tables.
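
Spelled out for PostgreSQL, the safe sequence looks roughly like this; table and column names are made up, and step 2 (updating the application) happens between the statements:

```sql
-- 1. Additive and nullable: existing CDC consumers keep working, the new column shows up as null.
ALTER TABLE orders ADD COLUMN delivery_notes text;

-- 3. Backfill (batch this on large tables instead of one giant UPDATE).
UPDATE orders SET delivery_notes = '' WHERE delivery_notes IS NULL;

-- 4. Only tighten the constraint once everything downstream understands the column.
ALTER TABLE orders ALTER COLUMN delivery_notes SET NOT NULL;
```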

Q: Why is my CDC setup consuming so much disk space?

A:

[Diagram: CDC Monitoring Dashboard]

PostgreSQL WAL accumulation: If CDC falls behind, WAL files pile up. Set max_slot_wal_keep_size to prevent disk space issues.

Kafka topic retention: Change events are stored in Kafka topics. Set appropriate retention policies or use compaction.

Monitoring disk: Set up alerts when WAL usage > 10GB or Kafka usage > 100GB per topic.
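
For the WAL side, one number worth alerting on (PostgreSQL 10+):

```sql
-- Total WAL currently sitting on disk; page someone well before the volume fills.
SELECT pg_size_pretty(sum(size)) AS wal_on_disk
FROM pg_ls_waldir();
```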
