Change Data Capture (CDC) Performance Optimization - AI Knowledge Base
Critical Configuration Settings
PostgreSQL WAL Management
Failure Mode: WAL files accumulate until the disk fills; expect complete database failure once disk usage passes roughly 95%
Required Settings:
- max_slot_wal_keep_size=4GB - CRITICAL: caps slot retention and prevents disk space disasters
- max_replication_slots=10 - default is inadequate for production (limit hit with just 3 connectors)
- wal_level=logical - required for CDC, but often accidentally reset to 'replica'
- shared_preload_libraries='wal2json' - better performance than pgoutput
- max_connections=300+ - default of 100 causes "too many clients" errors
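These can also be applied with ALTER SYSTEM instead of hand-editing postgresql.conf; a minimal sketch (wal_level and max_replication_slots still need a restart, max_slot_wal_keep_size only a reload):
ALTER SYSTEM SET wal_level = 'logical';
ALTER SYSTEM SET max_replication_slots = 10;
ALTER SYSTEM SET max_slot_wal_keep_size = '4GB';
SELECT pg_reload_conf();
-- Confirm wal_level was not silently reset to 'replica' after the restart
SHOW wal_level;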
Monitoring Query (saves production systems):
SELECT slot_name, active,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) as lag_size
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC;
MySQL Binlog Configuration
Failure Mode: Lost binlog position = missing data or complete reprocessing
Required Settings:
SET GLOBAL binlog_format = 'ROW';
SET GLOBAL binlog_row_image = 'FULL';
SET GLOBAL expire_logs_days = 7;
SET GLOBAL max_binlog_size = 1073741824;
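Note that expire_logs_days is deprecated on MySQL 8.0+, where binlog_expire_logs_seconds = 604800 is the equivalent. To confirm the settings actually took effect, a quick check like this helps:
SHOW VARIABLES LIKE 'binlog_format';
SHOW VARIABLES LIKE 'binlog_row_image';
-- Current binlog file and position, i.e. what a connector resumes from
-- (SHOW BINARY LOG STATUS on newer MySQL versions)
SHOW MASTER STATUS;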
Performance Reality vs Marketing Claims
Tool | Marketing | Single-AZ Reality | Multi-AZ Reality | First Failure Point |
---|---|---|---|---|
Debezium + PostgreSQL | "Sub-millisecond" | ~100ms | ~2 seconds | WAL disk space |
Confluent Cloud | "Real-time streaming" | ~300ms | ~1 minute | Budget constraints |
AWS DMS | "Low latency CDC" | ~5 seconds | ~30 seconds | Complex data types |
Airbyte CDC | "Near real-time" | ~30 seconds | ~5 minutes | Not actually streaming |
Fivetran | "Instant data sync" | ~3 minutes | ~10 minutes | Custom logic breaks |
Memory and Resource Requirements
Debezium Memory Configuration
Failure Mode: OutOfMemoryError during large transactions (2M+ rows)
Required Settings:
export KAFKA_HEAP_OPTS="-Xmx8g -Xms8g"
export KAFKA_JVM_PERFORMANCE_OPTS="-XX:+UseG1GC -XX:+HeapDumpOnOutOfMemoryError"
# Debezium connector tuning
max.queue.size=16000
max.batch.size=4096
TOAST Field Problem: PostgreSQL large JSON/TEXT fields load entirely into memory
Solution: Exclude large fields: column.exclude.list=user_profile.large_json_field
Resource Scaling Thresholds
- Tables <1000 changes/hour: Don't need optimization
- Tables >10K changes/day: Require dedicated connectors
- Bulk imports >1M rows: Will crash default configurations
- Single connector limit: ~10K events/hour regardless of downstream capacity
Network and Cross-AZ Performance Impact
AWS Multi-AZ Reality
- Cross-AZ latency: 2-3ms average, spikes to 50ms during peak hours
- Performance degradation:
- Single AZ: 200-300ms CDC latency
- Multi-AZ: 2-5 seconds average, 30+ seconds during AWS issues
- Cost: Data transfer $0.02-0.12 per GB between regions
Network Optimization Strategy
Deploy components in the same AZ despite the availability trade-off - consistent low latency beats theoretical high availability for most CDC use cases.
Common Production Failures and Solutions
WAL Accumulation (Most Common 3AM Alert)
Cause: One slow table holds up all replication slots
Solution: Heartbeat table with regular updates
CREATE TABLE cdc_heartbeat (id BIGINT PRIMARY KEY, last_updated TIMESTAMP);
-- Schedule updates every minute via pg_cron
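A minimal sketch of the scheduled update, assuming pg_cron is installed and the heartbeat table is included in the connector's table list:
INSERT INTO cdc_heartbeat (id, last_updated) VALUES (1, now())
  ON CONFLICT (id) DO UPDATE SET last_updated = now();
-- Run the update every minute so the replication slot keeps advancing even on quiet databases
SELECT cron.schedule('cdc-heartbeat', '* * * * *',
  $$UPDATE cdc_heartbeat SET last_updated = now() WHERE id = 1$$);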
Kafka Connect Rebalancing
Cause: Network glitches trigger rebalancing, stopping all processing for 30+ seconds
Mitigation:
- Static consumer group membership
- Increase session timeouts: session.timeout.ms=30000 (see the worker config sketch after this list)
- Monitor rebalancing frequency
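A hedged sketch of distributed-worker settings that dampen rebalancing churn (property names assume Kafka Connect 2.3+ with incremental cooperative rebalancing; verify against your version):
# Worker group coordination timeouts
session.timeout.ms=30000
heartbeat.interval.ms=10000
# Grace period before tasks are reassigned after a worker drops out
scheduled.rebalance.max.delay.ms=300000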
Schema Evolution Breaking Changes
Impact: Adding NOT NULL column causes full table scan, 3+ hour CDC lag
Testing Required: Always test schema changes with actual CDC load
Safe Patterns: Add nullable columns first, populate later
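A sketch of the safe pattern (table and column names are made up for illustration):
-- 1. Add the column as nullable first: metadata-only change, no table rewrite
ALTER TABLE orders ADD COLUMN region TEXT;
-- 2. Backfill in small batches so the connector can keep up with the generated change events
UPDATE orders SET region = 'unknown' WHERE region IS NULL AND id BETWEEN 1 AND 100000;
-- 3. Enforce NOT NULL only after the backfill, ideally off-peak (the validation still scans the table)
ALTER TABLE orders ALTER COLUMN region SET NOT NULL;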
Scaling Patterns
Single Connector Bottleneck
Problem: One thread per connector regardless of downstream capacity
Solution: Split high-volume tables into separate connectors
connector-users:
table.include.list: "users"
connector-orders:
table.include.list: "orders"
Deduplication Strategy
Problem: High-frequency tables generate 90% duplicate events
Solution: Buffer events in 30-60 second windows, keep latest per primary key
Result: 70-90% reduction in downstream writes
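Where the buffer lands in a PostgreSQL staging table, the "latest per key" step can be as simple as this sketch (table and column names are hypothetical):
-- Keep only the newest event per primary key from the buffered window
SELECT DISTINCT ON (primary_key) *
FROM cdc_events_staging
ORDER BY primary_key, emitted_at DESC;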
Critical Monitoring Setup
Essential Alerts
- WAL lag > 1GB: Immediate disk space risk
- Kafka consumer lag > 10K messages: Processing falling behind
- Disk usage > 80%: Before complete failure
- CDC latency > business SLA: Performance degradation
Monitoring Queries
-- PostgreSQL connection monitoring
SELECT count(*) FROM pg_stat_activity WHERE state = 'active';
-- Long-running queries during snapshots
SELECT pid, now() - query_start AS duration, query
FROM pg_stat_activity
WHERE state = 'active' AND now() - query_start > interval '5 minutes';
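One more check worth wiring into the disk-space alert, assuming PostgreSQL 10+:
-- Total on-disk size of pg_wal
SELECT pg_size_pretty(sum(size)) AS wal_dir_size FROM pg_ls_waldir();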
Cost Reality Check
Budget Planning (3-Year Total Cost)
- Infrastructure: $5-15K/month (Kafka, monitoring, storage)
- Engineering: 1-2 FTE for operations
- Hidden costs: Data transfer, backups, disaster recovery
- Vendor licenses: $50-200K/year for managed services
- Total realistic budget: $500K - $1.5M over 3 years
Cost Optimization
- Spot instances: 50-70% reduction for Kafka brokers
- Reserved instances: 30-50% reduction for PostgreSQL RDS
- Deduplication: Eliminates AWS RDS IOPS bottlenecks
When NOT to Use CDC
Inappropriate Use Cases
- Tables <10K changes/day: Batch ETL simpler and cheaper
- Analytics with 5+ minute tolerance: No need for real-time
- Compliance reporting: Usually requires batch processing anyway
Optimization Threshold
Don't optimize for imaginary scale: Most CDC problems solved by proper database configuration, not complex architectures.
Disaster Recovery Patterns
Recovery Time Objectives
- RTO <15 minutes: Multi-region active-active (expensive)
- RTO <2 hours: Standby infrastructure with manual failover
- RTO <24 hours: Full rebuild from snapshots (cheapest)
Recovery Validation
-- Row count validation after recovery
SELECT COUNT(*) FROM source_table WHERE updated_at > '2025-09-01';
-- Checksum validation for critical data (hashtext() is PostgreSQL-specific; use your engine's hash function)
SELECT SUM(hashtext(primary_key::text || updated_at::text)) AS checksum FROM critical_table;
Advanced Patterns (High Complexity)
Snapshot Management
Problem: Single-threaded snapshots take 18+ hours on large tables
Solution: Split by primary key ranges, parallel processing
Alternative: Lock-free snapshots (RisingWave approach)
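Debezium's incremental snapshots are one way to get the chunked, lock-light behavior; a sketch, assuming the connector is configured with a signal table (signal.data.collection):
-- Ask the running connector to re-snapshot one table in chunks, without stopping streaming
INSERT INTO debezium_signal (id, type, data)
VALUES ('adhoc-orders-1', 'execute-snapshot', '{"data-collections": ["public.orders"]}');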
Multi-Sink Architecture
Pattern: One Kafka topic serving multiple consumers
Benefit: Independent scaling without cross-impact
Implementation: Separate consumer groups with backpressure isolation
Cross-Region CDC
Active-Passive: Primary real-time, secondary batch every 5-15 minutes
Cost consideration: Cross-region transfer costs significant at scale
Technology-Specific Gotchas
Debezium Connector Limits
- Memory leaks: Versions before 2.x with large transactions
- Single-threaded processing: Cannot exceed source DB single-core performance
- TOAST field loading: Entire large fields loaded into memory
Kafka Connect Issues
- Default 1GB heap: Insufficient for production CDC workloads
- Rebalancing delays: All processing stops during consumer rebalancing
- GC tuning required: G1GC recommended for CDC workloads
AWS-Specific Problems
- Cross-AZ latency spikes: Unpredictable 50ms spikes during peak hours
- RDS connection limits: Default too low for CDC + application load
- Data transfer costs: Hidden cost that scales with volume
Performance Ceiling Recognition
Hard Limits
- Single-core database performance: Cannot exceed source DB capabilities
- Network bandwidth: Cross-region limited by WAN capacity
- Storage IOPS: WAL writes bounded by disk performance
- Memory constraints: Large transactions require proportional RAM
Architectural Solutions
Sometimes split databases rather than optimize CDC tools when hitting fundamental limits.
Essential Operational Knowledge
3AM Debugging Checklist
- Check WAL disk usage first
- Verify replication slot advancement
- Monitor Kafka consumer group lag
- Check for connector rebalancing
- Validate network connectivity between components
Production Readiness Requirements
- Monitoring: WAL, consumer lag, latency, error rates
- Alerting: Disk space, performance degradation, failure detection
- Runbooks: Recovery procedures for common failure modes
- Capacity planning: 3x initial estimates for realistic budgeting
Team Requirements
- At least one engineer capable of Kafka debugging at 3AM
- Database administration expertise for WAL management
- Network troubleshooting skills for cross-AZ issues
- Budget authority for scaling infrastructure during incidents
Success Criteria Definition
Performance Targets (Realistic)
- Sub-second latency: Only achievable in single-AZ deployments
- Multi-AZ tolerance: Plan for 2-5 second average latency
- Bulk operation handling: System survives 50M row imports
- Schema evolution: Changes complete without multi-hour lag
Business Value Realization
- Data lag reduction: From hours to seconds enables real-time features
- ETL elimination: Engineers stop fighting batch schedules
- Feature enablement: Real-time fraud detection, live dashboards possible
- Operational improvement: Reduced manual data synchronization effort
Critical Decision Points
Build vs Buy Analysis
- Self-managed: High complexity, lower ongoing cost, full control
- Managed services: Lower complexity, higher cost, vendor lock-in
- Hybrid approach: Critical components self-managed, peripherals outsourced
Technology Selection Criteria
- Existing team expertise: PostgreSQL vs MySQL knowledge
- Infrastructure constraints: Available resources and budget
- Performance requirements: Sub-second vs minute-level tolerance
- Operational capacity: 24/7 support availability
Implementation Phases
- Proof of concept: Single table, development environment
- Production pilot: Critical table with full monitoring
- Gradual rollout: Add tables based on business priority
- Full deployment: Complete CDC coverage with optimization
Useful Links for Further Investigation
Performance Resources That Don't Suck
Link | Description |
---|---|
PostgreSQL WAL Monitoring Queries | These SQL queries have saved my ass multiple times when WAL was about to fill up the disk. Copy-paste these into your monitoring setup before you get paged at 3AM. |
Debezium Metrics and Monitoring | The official monitoring guide that's actually useful. Use the JMX metrics sections - ignore the rest of the Debezium docs unless you hate yourself. |
Kafka Connect Monitoring Best Practices | Confluent's guide to monitoring Kafka Connect performance. Covers the metrics that actually matter for CDC workloads. |
Prometheus JMX Exporter Configuration | Configure JMX metrics export from Kafka and Debezium. Default configs are basic - customize for CDC-specific metrics. |
PostgreSQL Logical Replication | Official PostgreSQL documentation on logical replication. Covers WAL settings and replication slot management. |
MySQL Binlog Performance Tuning | MySQL's guide to binlog optimization. Essential reading if you're using MySQL as CDC source. |
WAL-G: PostgreSQL WAL Management Tool | Tool for managing PostgreSQL WAL files and backups. Useful for automating WAL cleanup in high-volume CDC scenarios. |
Siemens CDC Implementation Case Study | Real-world case study from Siemens showing CDC architecture simplification and performance improvements. |
Shopify's CDC at Scale | How Shopify handles CDC across their sharded architecture. Good patterns for multi-database CDC coordination. |
Pinterest's Real-Time Analytics Architecture | Pinterest's approach to CDC + real-time analytics. Covers performance optimization and scaling patterns. |
PayPal's Kafka Performance Benchmarks | Actual production performance data from PayPal. Rare honest benchmarking instead of vendor marketing. |
Debezium GitHub Issues | Search here first when everything goes to shit. The maintainers actually respond and help debug, unlike most open source projects. |
CDC Performance Optimization Medium Article | Practical guide to preventing PostgreSQL WAL accumulation in CDC pipelines. Based on real production experience. |
Israeli Tech Radar CDC Lessons Learned | Three-part series on building production CDC. Part 3 focuses on performance optimization and monitoring. |
Stack Overflow CDC Tags | Community Q&A for CDC problems. Quality varies but sometimes has the exact error you're fighting. |
Confluent Cloud Performance Best Practices | Confluent's guide to optimizing their managed Kafka service for CDC workloads. |
AWS DMS Performance Optimization | AWS best practices for DMS performance. Limited but useful if you're stuck with DMS. |
RisingWave CDC Performance Guide | Documentation for RisingWave's unified CDC approach. Shows performance benefits of eliminating multi-component architectures. |
Kafka Users Mailing List | Old-school but helpful. Real engineers solving real Kafka problems, including CDC use cases. |
Debezium Zulip Chat | Active community forum for Debezium. More responsive than GitHub issues for quick questions. |
Confluent Community Forum | Community discussions about Confluent Platform and Kafka Connect performance optimization. |
Data Engineering Slack Communities | Various Slack communities where data engineers share CDC war stories and solutions. |
Apache Kafka Benchmarking Tools | Tools for load testing Kafka and Kafka Connect configurations. Essential for capacity planning. |
PostgreSQL Performance Testing | pgbench for testing database performance under CDC load. Useful for baseline measurements. |
TiDB CDC Performance Benchmarks | One of the few places with actual CDC performance measurements and methodology. |
AWS Calculator for DMS | Calculate actual AWS DMS costs including data transfer and instance hours. |
Confluent Cloud Cost Estimator | Estimate Confluent Cloud costs. Remember to factor in data transfer and API calls. |
Open Source CDC Cost Analysis | Analysis of hidden costs in self-managed CDC infrastructure. |