Change Data Capture (CDC) - AI-Optimized Technical Reference
Technology Overview
Change Data Capture (CDC) streams database changes to other systems in real time by tapping into database transaction logs. Eliminates the 6-hour data lag typical of batch ETL processes and captures deletes that query-based extraction misses; schema changes still require careful handling (see Critical Failure Modes).
Implementation Methods
Log-Based CDC (Recommended for Production)
- Latency: Milliseconds
- Source Impact: 1-3% overhead on database
- Change Types: All (Insert/Update/Delete)
- Complexity: High
- Best For: Production systems, real-time analytics
Critical Configuration:
- PostgreSQL: Set max_slot_wal_keep_size to prevent disk space issues (see the example after this list)
- MySQL: Monitor binlog I/O during high-write periods
- SQL Server: Tune tempdb to prevent CDC impact
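A minimal sketch of the PostgreSQL setting (PostgreSQL 13+); the 10GB cap is an assumed value, size it to your disk headroom:
-- Cap how much WAL a lagging replication slot can force the server to retain
ALTER SYSTEM SET max_slot_wal_keep_size = '10GB';
SELECT pg_reload_conf();      -- setting is reloadable, no restart needed
SHOW max_slot_wal_keep_size;  -- confirm the new value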
Trigger-Based CDC
- Latency: Near real-time
- Source Impact: Severe performance degradation on busy tables
- Change Types: All (Insert/Update/Delete)
- Complexity: Medium
- Best For: Small-scale, audit requirements only
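For reference, a minimal sketch of the trigger-based approach on PostgreSQL: an audit table plus a row-level trigger (table and column names are hypothetical). Every write to the tracked table pays for the extra insert, which is where the performance hit on busy tables comes from.
-- Hypothetical change table and trigger for a "customers" table
CREATE TABLE customers_changes (
    change_id  bigserial PRIMARY KEY,
    op         char(1)     NOT NULL,            -- 'I', 'U', or 'D'
    changed_at timestamptz NOT NULL DEFAULT now(),
    row_data   jsonb       NOT NULL
);

CREATE OR REPLACE FUNCTION capture_customers_change() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'DELETE' THEN
        INSERT INTO customers_changes (op, row_data) VALUES ('D', to_jsonb(OLD));
    ELSE
        INSERT INTO customers_changes (op, row_data) VALUES (left(TG_OP, 1), to_jsonb(NEW));
    END IF;
    RETURN NULL;  -- return value is ignored for AFTER row triggers
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER customers_cdc
AFTER INSERT OR UPDATE OR DELETE ON customers
FOR EACH ROW EXECUTE FUNCTION capture_customers_change();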
Query-Based CDC
- Latency: Minutes to hours
- Source Impact: Depends on query frequency
- Change Types: Insert/Update only (misses deletes)
- Complexity: Low
- Best For: Batch processing, simple use cases
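A minimal sketch of the query-based approach: poll on an updated_at watermark (table and column names are hypothetical, :last_watermark is a bind parameter). Deleted rows simply stop appearing in the result, which is why deletes are missed.
-- Poll for rows changed since the last watermark
SELECT id, status, updated_at
FROM orders
WHERE updated_at > :last_watermark   -- max(updated_at) from the previous poll
ORDER BY updated_at
LIMIT 10000;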
Production Implementation Requirements
Resource Requirements
- Infrastructure Cost: $2-5k/month for Kafka cluster
- Engineering Time: 20% of one engineer's time for maintenance
- Total Budget: $50-100k/year including people, infrastructure, monitoring
Critical Failure Modes
WAL Retention Hell (PostgreSQL)
- Problem: WAL files fill disk when CDC falls behind
- Impact: Server stops responding at 95% disk usage
- Solution: Set max_slot_wal_keep_size, monitor with SELECT pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) FROM pg_replication_slots;
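If the WAL buildup comes from a slot whose consumer is gone for good (a decommissioned connector), dropping the slot is what actually releases the retained WAL; verify the slot is truly abandoned first. Sketch, with a placeholder slot name:
-- Find inactive slots that are pinning WAL
SELECT slot_name, active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
WHERE NOT active;

-- Only after confirming the consumer is permanently gone:
SELECT pg_drop_replication_slot('abandoned_slot_name');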
MySQL Binlog Position Loss
- Problem: Lose track of binlog position
- Impact: Missing data or full reprocessing required
- Solution: Monitor Kafka offset topics, backup position tracking
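On the MySQL side, a few commands worth keeping handy to confirm the binlog files behind the connector's saved position still exist; the retention value below is an assumed example:
-- Current write position and the binlog files still on disk
SHOW MASTER STATUS;
SHOW BINARY LOGS;

-- Binlog retention (MySQL 8.0): if files are purged before the connector
-- catches up, raise this value (seconds); otherwise a fresh snapshot is required
SHOW VARIABLES LIKE 'binlog_expire_logs_seconds';
SET GLOBAL binlog_expire_logs_seconds = 1209600;  -- example: 14 days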
Schema Evolution Breaks
- Safe: Adding nullable columns
- Dangerous: Renaming columns, changing data types (VARCHAR to INT)
- Deadly: Dropping columns
- Solution: Test all schema changes in dev environment first
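As a concrete illustration of the safe vs. dangerous split, in PostgreSQL syntax (table and column names are hypothetical):
-- Safe: new nullable column, existing consumers keep working
ALTER TABLE orders ADD COLUMN coupon_code varchar(32) NULL;

-- Dangerous: type change; downstream schemas no longer match the events
ALTER TABLE orders ALTER COLUMN order_ref TYPE int USING order_ref::int;

-- Deadly: consumers that expect the field break immediately
ALTER TABLE orders DROP COLUMN legacy_flag;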
Memory and Performance Issues
Debezium Memory Leaks
- Problem: Debezium 1.9.x has memory leaks with large transactions
- Impact: Connector dies during batch updates (2M+ rows)
- Solution: Upgrade to 2.x or restart connectors weekly
Kafka Connect Failures
- Problem: Random connector deaths
- Solution: Set connect.log.level=DEBUG, monitor connector status, and restart failed connectors automatically
Monitoring Requirements
Essential Alerts
- Replication lag > 10 minutes
- WAL usage > 10GB (PostgreSQL)
- Kafka topic size > 100GB per topic
- Disk usage > 95%
Debug Commands
-- PostgreSQL WAL monitoring
SELECT pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) FROM pg_replication_slots;
-- Check replication slot status
SELECT slot_name, database, active, restart_lsn FROM pg_replication_slots;
Tool Selection Matrix
Tool | Cost | Reliability | Setup Complexity | Operational Overhead
---|---|---|---|---
Debezium | Free | Medium | High | High (6-month learning curve)
Airbyte | Medium | Medium | Low | Medium (random failures)
AWS DMS | High | High | Medium | Low (slow but reliable)
Fivetran | Very High | Very High | Very Low | Very Low
Tool-Specific Issues
Debezium
- Learning Curve: 6 months to production readiness
- Documentation: Scattered across 47 pages
- Support: Slack community more useful than docs
- Memory: Default 1GB heap insufficient for large transactions
Airbyte
- Pros: Easy UI, faster setup
- Cons: Mysterious connector restarts, costs money
- Operations: Ops teams love UI, hate random failures
When NOT to Use CDC
Use Batch ETL Instead When:
- Tables with <10k changes/day
- Heavy transformations required
- Compliance mandates batch processing
- Team lacks streaming expertise
- <1000 changes/day total volume
Cost-Benefit Threshold
CDC becomes cost-effective when:
- Data freshness requirements <1 hour
- Multiple downstream systems need sync
- Source system can't handle ETL query load
- DELETE operations must be captured
Common Production Scenarios
Network Partition Recovery
- Scenario: CDC can't reach Kafka for 30+ minutes
- Impact: Lag metrics spike, potential data loss
- Recovery: Automatic catchup if WAL/binlog retained
Database Crash Recovery
- PostgreSQL: Replication slots survive, WAL files may be cleaned
- MySQL: Binlog position stored in Kafka topics
- Worst Case: 2-8 hours downtime for fresh snapshot
Duplicate Event Handling
- Cause: At-least-once delivery semantics
- Triggers: Network failures, connector restarts, rebalancing
- Solution: Implement idempotent downstream processing
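A minimal sketch of an idempotent sink in PostgreSQL syntax: upsert keyed on the source primary key so replayed events overwrite rather than duplicate (table and column names are hypothetical; :customer_id, :email, :source_ts are bind parameters taken from the change event):
-- Replay-safe apply of a change event into a target table
INSERT INTO customers_replica (customer_id, email, updated_at)
VALUES (:customer_id, :email, :source_ts)
ON CONFLICT (customer_id) DO UPDATE
SET email      = EXCLUDED.email,
    updated_at = EXCLUDED.updated_at
WHERE customers_replica.updated_at <= EXCLUDED.updated_at;  -- ignore stale or out-of-order replays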
Critical Warnings
Schema Change Disasters
- VARCHAR(50) to VARCHAR(100): Usually safe
- INT to VARCHAR: Will break CDC pipeline
- Column renames: Breaks everything, plan downtime
- ALTER TABLE on MySQL: Locks table, use pt-online-schema-change
Hidden Operational Costs
- 24/7 monitoring required (3am pages guaranteed)
- Kafka expertise mandatory for troubleshooting
- Database administrator involvement for WAL/binlog tuning
- DevOps overhead for connector lifecycle management
Performance Degradation Scenarios
- Large transactions (1M+ rows) cause memory issues
- High-frequency small transactions can overwhelm CDC
- Schema with many columns increases serialization overhead
- Network latency between database and Kafka affects throughput
Success Criteria
CDC implementation succeeds when:
- Replication lag consistently <5 minutes
- Schema changes deploy without CDC pipeline failures
- Ops team can troubleshoot common issues without escalation
- Cost per GB of data transferred <$0.10
- Downstream systems receive 99.9% of change events
Useful Links for Further Investigation
Shit That Actually Works
Link | Description |
---|---|
Debezium docs | Scattered across 47 pages but has the real info. Their PostgreSQL connector page saved me 6 hours of WAL retention debugging. |
This Kafka Connect troubleshooting guide | The only resource that helped when our connectors kept dying. Check the "Common Issues" section first. |
Debezium Slack community | Where you'll actually get answers at 2am when your CDC pipeline is fucked. More useful than the documentation. |
PostgreSQL replication slots monitoring | Essential for preventing WAL disk space disasters. Use the queries in the "Monitoring" section. |
Estuary's Debezium pain points article | Someone finally wrote down all the shit that breaks in production. Wish I'd found this earlier. |