Change Data Capture (CDC) Performance Optimization - AI Knowledge Base
Critical Configuration Settings
PostgreSQL WAL Management
Failure Mode: WAL files accumulate until the disk fills; expect complete database failure once disk usage passes roughly 95%
Required Settings:
- max_slot_wal_keep_size=4GB - CRITICAL: caps slot retention and prevents disk space disasters
- max_replication_slots=10 - default is inadequate for production (limit hit with just 3 connectors)
- wal_level=logical - required for CDC, but often accidentally reset to 'replica'
- shared_preload_libraries='wal2json' - better performance than pgoutput
- max_connections=300+ - default of 100 causes "too many clients" errors
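These can also be applied with ALTER SYSTEM instead of hand-editing postgresql.conf; a minimal sketch (wal_level and max_replication_slots still need a restart, max_slot_wal_keep_size only a reload):
ALTER SYSTEM SET wal_level = 'logical';
ALTER SYSTEM SET max_replication_slots = 10;
ALTER SYSTEM SET max_slot_wal_keep_size = '4GB';
SELECT pg_reload_conf();
-- Confirm wal_level was not silently reset to 'replica' after the restart
SHOW wal_level;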
Monitoring Query (saves production systems):
SELECT slot_name, active,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) as lag_size
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC;
MySQL Binlog Configuration
Failure Mode: Lost binlog position = missing data or complete reprocessing
Required Settings:
SET GLOBAL binlog_format = 'ROW';
SET GLOBAL binlog_row_image = 'FULL';
SET GLOBAL expire_logs_days = 7;
SET GLOBAL max_binlog_size = 1073741824;
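Note that expire_logs_days is deprecated on MySQL 8.0+, where binlog_expire_logs_seconds = 604800 is the equivalent. To confirm the settings actually took effect, a quick check like this helps:
SHOW VARIABLES LIKE 'binlog_format';
SHOW VARIABLES LIKE 'binlog_row_image';
-- Current binlog file and position, i.e. what a connector resumes from
-- (SHOW BINARY LOG STATUS on newer MySQL versions)
SHOW MASTER STATUS;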
Performance Reality vs Marketing Claims
Tool | Marketing | Single-AZ Reality | Multi-AZ Reality | First Failure Point |
---|---|---|---|---|
Debezium + PostgreSQL | "Sub-millisecond" | ~100ms | ~2 seconds | WAL disk space |
Confluent Cloud | "Real-time streaming" | ~300ms | ~1 minute | Budget constraints |
AWS DMS | "Low latency CDC" | ~5 seconds | ~30 seconds | Complex data types |
Airbyte CDC | "Near real-time" | ~30 seconds | ~5 minutes | Not actually streaming |
Fivetran | "Instant data sync" | ~3 minutes | ~10 minutes | Custom logic breaks |
Memory and Resource Requirements
Debezium Memory Configuration
Failure Mode: OutOfMemoryError during large transactions (2M+ rows)
Required Settings:
export KAFKA_HEAP_OPTS="-Xmx8g -Xms8g"
export KAFKA_JVM_PERFORMANCE_OPTS="-XX:+UseG1GC -XX:+HeapDumpOnOutOfMemoryError"
# Debezium connector tuning
max.queue.size=16000
max.batch.size=4096
TOAST Field Problem: PostgreSQL large JSON/TEXT fields load entirely into memory
Solution: Exclude large fields: column.exclude.list=user_profile.large_json_field
Resource Scaling Thresholds
- Tables <1000 changes/hour: Don't need optimization
- Tables >10K changes/day: Require dedicated connectors
- Bulk imports >1M rows: Will crash default configurations
- Single connector limit: ~10K events/hour regardless of downstream capacity
Network and Cross-AZ Performance Impact
AWS Multi-AZ Reality
- Cross-AZ latency: 2-3ms average, spikes to 50ms during peak hours
- Performance degradation:
- Single AZ: 200-300ms CDC latency
- Multi-AZ: 2-5 seconds average, 30+ seconds during AWS issues
- Cost: Data transfer $0.02-0.12 per GB between regions
Network Optimization Strategy
Deploy components in the same AZ despite the availability trade-off - consistent low latency beats theoretical high availability for most CDC use cases.
Common Production Failures and Solutions
WAL Accumulation (Most Common 3AM Alert)
Cause: One slow table holds up all replication slots
Solution: Heartbeat table with regular updates
CREATE TABLE cdc_heartbeat (id BIGINT PRIMARY KEY, last_updated TIMESTAMP);
-- Schedule updates every minute via pg_cron
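A minimal sketch of the scheduled update, assuming pg_cron is installed and the heartbeat table is included in the connector's table list:
INSERT INTO cdc_heartbeat (id, last_updated) VALUES (1, now())
  ON CONFLICT (id) DO UPDATE SET last_updated = now();
-- Run the update every minute so the replication slot keeps advancing even on quiet databases
SELECT cron.schedule('cdc-heartbeat', '* * * * *',
  $$UPDATE cdc_heartbeat SET last_updated = now() WHERE id = 1$$);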
Kafka Connect Rebalancing
Cause: Network glitches trigger rebalancing, stopping all processing for 30+ seconds
Mitigation:
- Static consumer group membership
- Increase session timeouts: session.timeout.ms=30000 (see the worker config sketch after this list)
- Monitor rebalancing frequency
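A hedged sketch of distributed-worker settings that dampen rebalancing churn (property names assume Kafka Connect 2.3+ with incremental cooperative rebalancing; verify against your version):
# Worker group coordination timeouts
session.timeout.ms=30000
heartbeat.interval.ms=10000
# Grace period before tasks are reassigned after a worker drops out
scheduled.rebalance.max.delay.ms=300000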
Schema Evolution Breaking Changes
Impact: Adding NOT NULL column causes full table scan, 3+ hour CDC lag
Testing Required: Always test schema changes with actual CDC load
Safe Patterns: Add nullable columns first, populate later
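A sketch of the safe pattern (table and column names are made up for illustration):
-- 1. Add the column as nullable first: metadata-only change, no table rewrite
ALTER TABLE orders ADD COLUMN region TEXT;
-- 2. Backfill in small batches so the connector can keep up with the generated change events
UPDATE orders SET region = 'unknown' WHERE region IS NULL AND id BETWEEN 1 AND 100000;
-- 3. Enforce NOT NULL only after the backfill, ideally off-peak (the validation still scans the table)
ALTER TABLE orders ALTER COLUMN region SET NOT NULL;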
Scaling Patterns
Single Connector Bottleneck
Problem: One thread per connector regardless of downstream capacity
Solution: Split high-volume tables into separate connectors
connector-users:
table.include.list: "users"
connector-orders:
table.include.list: "orders"
Deduplication Strategy
Problem: High-frequency tables generate 90% duplicate events
Solution: Buffer events in 30-60 second windows, keep latest per primary key
Result: 70-90% reduction in downstream writes
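Where the buffer lands in a PostgreSQL staging table, the "latest per key" step can be as simple as this sketch (table and column names are hypothetical):
-- Keep only the newest event per primary key from the buffered window
SELECT DISTINCT ON (primary_key) *
FROM cdc_events_staging
ORDER BY primary_key, emitted_at DESC;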
Critical Monitoring Setup
Essential Alerts
- WAL lag > 1GB: Immediate disk space risk
- Kafka consumer lag > 10K messages: Processing falling behind
- Disk usage > 80%: Before complete failure
- CDC latency > business SLA: Performance degradation
Monitoring Queries
-- PostgreSQL connection monitoring
SELECT count(*) FROM pg_stat_activity WHERE state = 'active';
-- Long-running queries during snapshots
SELECT pid, now() - query_start AS duration, query
FROM pg_stat_activity
WHERE state = 'active' AND now() - query_start > interval '5 minutes';
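One more check worth wiring into the disk-space alert, assuming PostgreSQL 10+:
-- Total on-disk size of pg_wal
SELECT pg_size_pretty(sum(size)) AS wal_dir_size FROM pg_ls_waldir();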
Cost Reality Check
Budget Planning (3-Year Total Cost)
- Infrastructure: $5-15K/month (Kafka, monitoring, storage)
- Engineering: 1-2 FTE for operations
- Hidden costs: Data transfer, backups, disaster recovery
- Vendor licenses: $50-200K/year for managed services
- Total realistic budget: $500K - $1.5M over 3 years
Cost Optimization
- Spot instances: 50-70% reduction for Kafka brokers
- Reserved instances: 30-50% reduction for PostgreSQL RDS
- Deduplication: Eliminates AWS RDS IOPS bottlenecks
When NOT to Use CDC
Inappropriate Use Cases
- Tables <10K changes/day: Batch ETL simpler and cheaper
- Analytics with 5+ minute tolerance: No need for real-time
- Compliance reporting: Usually requires batch processing anyway
Optimization Threshold
Don't optimize for imaginary scale: Most CDC problems solved by proper database configuration, not complex architectures.
Disaster Recovery Patterns
Recovery Time Objectives
- RTO <15 minutes: Multi-region active-active (expensive)
- RTO <2 hours: Standby infrastructure with manual failover
- RTO <24 hours: Full rebuild from snapshots (cheapest)
Recovery Validation
-- Row count validation after recovery
SELECT COUNT(*) FROM source_table WHERE updated_at > '2025-09-01';
-- Checksum validation for critical data (hashtext() is PostgreSQL-specific; use your engine's hash function)
SELECT SUM(hashtext(primary_key::text || updated_at::text)) AS checksum FROM critical_table;
Advanced Patterns (High Complexity)
Snapshot Management
Problem: Single-threaded snapshots take 18+ hours on large tables
Solution: Split by primary key ranges, parallel processing
Alternative: Lock-free snapshots (RisingWave approach)
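Debezium's incremental snapshots are one way to get the chunked, lock-light behavior; a sketch, assuming the connector is configured with a signal table (signal.data.collection):
-- Ask the running connector to re-snapshot one table in chunks, without stopping streaming
INSERT INTO debezium_signal (id, type, data)
VALUES ('adhoc-orders-1', 'execute-snapshot', '{"data-collections": ["public.orders"]}');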
Multi-Sink Architecture
Pattern: One Kafka topic serving multiple consumers
Benefit: Independent scaling without cross-impact
Implementation: Separate consumer groups with backpressure isolation
Cross-Region CDC
Active-Passive: Primary real-time, secondary batch every 5-15 minutes
Cost consideration: Cross-region transfer costs significant at scale
Technology-Specific Gotchas
Debezium Connector Limits
- Memory leaks: Versions before 2.x with large transactions
- Single-threaded processing: Cannot exceed source DB single-core performance
- TOAST field loading: Entire large fields loaded into memory
Kafka Connect Issues
- Default 1GB heap: Insufficient for production CDC workloads
- Rebalancing delays: All processing stops during consumer rebalancing
- GC tuning required: G1GC recommended for CDC workloads
AWS-Specific Problems
- Cross-AZ latency spikes: Unpredictable 50ms spikes during peak hours
- RDS connection limits: Default too low for CDC + application load
- Data transfer costs: Hidden cost that scales with volume
Performance Ceiling Recognition
Hard Limits
- Single-core database performance: Cannot exceed source DB capabilities
- Network bandwidth: Cross-region limited by WAN capacity
- Storage IOPS: WAL writes bounded by disk performance
- Memory constraints: Large transactions require proportional RAM
Architectural Solutions
Sometimes split databases rather than optimize CDC tools when hitting fundamental limits.
Essential Operational Knowledge
3AM Debugging Checklist
- Check WAL disk usage first
- Verify replication slot advancement
- Monitor Kafka consumer group lag
- Check for connector rebalancing
- Validate network connectivity between components
Production Readiness Requirements
- Monitoring: WAL, consumer lag, latency, error rates
- Alerting: Disk space, performance degradation, failure detection
- Runbooks: Recovery procedures for common failure modes
- Capacity planning: 3x initial estimates for realistic budgeting
Team Requirements
- At least one engineer capable of Kafka debugging at 3AM
- Database administration expertise for WAL management
- Network troubleshooting skills for cross-AZ issues
- Budget authority for scaling infrastructure during incidents
Success Criteria Definition
Performance Targets (Realistic)
- Sub-second latency: Only achievable in single-AZ deployments
- Multi-AZ tolerance: Plan for 2-5 second average latency
- Bulk operation handling: System survives 50M row imports
- Schema evolution: Changes complete without multi-hour lag
Business Value Realization
- Data lag reduction: From hours to seconds enables real-time features
- ETL elimination: Engineers stop fighting batch schedules
- Feature enablement: Real-time fraud detection, live dashboards possible
- Operational improvement: Reduced manual data synchronization effort
Critical Decision Points
Build vs Buy Analysis
- Self-managed: High complexity, lower ongoing cost, full control
- Managed services: Lower complexity, higher cost, vendor lock-in
- Hybrid approach: Critical components self-managed, peripherals outsourced
Technology Selection Criteria
- Existing team expertise: PostgreSQL vs MySQL knowledge
- Infrastructure constraints: Available resources and budget
- Performance requirements: Sub-second vs minute-level tolerance
- Operational capacity: 24/7 support availability
Implementation Phases
- Proof of concept: Single table, development environment
- Production pilot: Critical table with full monitoring
- Gradual rollout: Add tables based on business priority
- Full deployment: Complete CDC coverage with optimization
Useful Links for Further Investigation
Performance Resources That Don't Suck
Link | Description |
---|---|
PostgreSQL WAL Monitoring Queries | These SQL queries have saved my ass multiple times when WAL was about to fill up the disk. Copy-paste these into your monitoring setup before you get paged at 3AM. |
Debezium Metrics and Monitoring | The official monitoring guide that's actually useful. Use the JMX metrics sections - ignore the rest of the Debezium docs unless you hate yourself. |
Kafka Connect Monitoring Best Practices | Confluent's guide to monitoring Kafka Connect performance. Covers the metrics that actually matter for CDC workloads. |
Prometheus JMX Exporter Configuration | Configure JMX metrics export from Kafka and Debezium. Default configs are basic - customize for CDC-specific metrics. |
PostgreSQL Logical Replication | Official PostgreSQL documentation on logical replication. Covers WAL settings and replication slot management. |
MySQL Binlog Performance Tuning | MySQL's guide to binlog optimization. Essential reading if you're using MySQL as CDC source. |
WAL-G: PostgreSQL WAL Management Tool | Tool for managing PostgreSQL WAL files and backups. Useful for automating WAL cleanup in high-volume CDC scenarios. |
Siemens CDC Implementation Case Study | Real-world case study from Siemens showing CDC architecture simplification and performance improvements. |
Shopify's CDC at Scale | How Shopify handles CDC across their sharded architecture. Good patterns for multi-database CDC coordination. |
Pinterest's Real-Time Analytics Architecture | Pinterest's approach to CDC + real-time analytics. Covers performance optimization and scaling patterns. |
PayPal's Kafka Performance Benchmarks | Actual production performance data from PayPal. Rare honest benchmarking instead of vendor marketing. |
Debezium GitHub Issues | Search here first when everything goes to shit. The maintainers actually respond and help debug, unlike most open source projects. |
CDC Performance Optimization Medium Article | Practical guide to preventing PostgreSQL WAL accumulation in CDC pipelines. Based on real production experience. |
Israeli Tech Radar CDC Lessons Learned | Three-part series on building production CDC. Part 3 focuses on performance optimization and monitoring. |
Stack Overflow CDC Tags | Community Q&A for CDC problems. Quality varies but sometimes has the exact error you're fighting. |
Confluent Cloud Performance Best Practices | Confluent's guide to optimizing their managed Kafka service for CDC workloads. |
AWS DMS Performance Optimization | AWS best practices for DMS performance. Limited but useful if you're stuck with DMS. |
RisingWave CDC Performance Guide | Documentation for RisingWave's unified CDC approach. Shows performance benefits of eliminating multi-component architectures. |
Kafka Users Mailing List | Old-school but helpful. Real engineers solving real Kafka problems, including CDC use cases. |
Debezium Zulip Chat | Active community forum for Debezium. More responsive than GitHub issues for quick questions. |
Confluent Community Forum | Community discussions about Confluent Platform and Kafka Connect performance optimization. |
Data Engineering Slack Communities | Various Slack communities where data engineers share CDC war stories and solutions. |
Apache Kafka Benchmarking Tools | Tools for load testing Kafka and Kafka Connect configurations. Essential for capacity planning. |
PostgreSQL Performance Testing | pgbench for testing database performance under CDC load. Useful for baseline measurements. |
TiDB CDC Performance Benchmarks | One of the few places with actual CDC performance measurements and methodology. |
AWS Calculator for DMS | Calculate actual AWS DMS costs including data transfer and instance hours. |
Confluent Cloud Cost Estimator | Estimate Confluent Cloud costs. Remember to factor in data transfer and API calls. |
Open Source CDC Cost Analysis | Analysis of hidden costs in self-managed CDC infrastructure. |