What is Debezium and Why You'll Need It (Eventually)

Debezium captures database changes by reading transaction logs, which sounds simple until you actually try to set it up. I've been running this shit for 2 years now, and here's what you need to know before you dive in.

The Setup Reality Check

Debezium runs on Kafka Connect, which means you need Kafka first. If you don't already have a Kafka cluster, plan for 3 weeks of setup, not 3 hours. The documentation makes it look easy - it's not.

Kafka Connect Mode: This is what everyone uses in production. Your connector runs in a distributed cluster, survives single node failures, and scales horizontally. Setting it up properly took me 5 days because the memory settings are garbage by default.
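
For reference, this is roughly what registering a connector against a distributed Kafka Connect cluster looks like - the connector name, hosts, and credentials below are placeholders, so check the config reference for your Debezium version before copying anything:

# Register a Debezium PostgreSQL connector with a distributed Kafka Connect cluster.
# All names, hosts, and credentials are placeholders - adjust for your environment.
curl -X POST http://localhost:8083/connectors \
  -H "Content-Type: application/json" \
  -d '{
    "name": "inventory-connector",
    "config": {
      "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
      "database.hostname": "postgres.internal",
      "database.port": "5432",
      "database.user": "debezium",
      "database.password": "change-me",
      "database.dbname": "inventory",
      "topic.prefix": "inventory",
      "plugin.name": "pgoutput"
    }
  }'

Kafka Connect spreads the connector's tasks across the workers in the cluster, which is where the fault tolerance comes from.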

Debezium Server: Standalone mode that doesn't require Kafka. Sounds great, right? Wrong. You lose fault tolerance and horizontal scaling. I tried this first - lasted exactly 2 weeks before I gave up and went back to Kafka Connect.

Embedded Engine: Java library you embed in your app. Don't do this unless you enjoy debugging memory leaks at 3am. The embedded engine will eat your heap and you'll have no idea why.

Database Support (The Real Story)

Version 3.2.2.Final (September 2025) supports these databases, but "supports" is doing some heavy lifting:

PostgreSQL: Works great once you enable logical replication. Just make sure your wal_level is set to logical or you'll waste 4 hours debugging why nothing works.

[Diagram: PostgreSQL logical replication architecture]
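
Before you start the connector, it's worth a sanity check - a rough sketch, with placeholder host and credentials, assuming you can restart PostgreSQL for the setting to take effect:

# Must report 'logical' for Debezium to work.
psql -h <host> -U postgres -c "SHOW wal_level;"

# If it says 'replica', change it - the new value only applies after a restart.
psql -h <host> -U postgres -c "ALTER SYSTEM SET wal_level = 'logical';"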

MySQL: The binlog setup is straightforward, but row-based replication is required. Mixed or statement-based replication will silently fuck you over.
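
A quick check before you blame the connector - rough sketch, host and credentials are placeholders:

# Debezium needs ROW format; FULL row images avoid missing before/after state.
mysql -h <host> -u root -p -e "SHOW VARIABLES LIKE 'binlog_format'; SHOW VARIABLES LIKE 'binlog_row_image';"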

Oracle: Prepare for pain. LogMiner works but requires supplemental logging enabled. XStream is faster but costs extra licensing. Either way, you'll need a DBA who doesn't hate you.

MongoDB: Uses change streams (which ride on the replica set oplog), so you need a replica set. A standalone MongoDB instance won't work - learned that the hard way.

SQL Server: Transaction log capture works, but the CDC feature must be enabled at both database and table level. Miss one table and you'll be wondering why data isn't flowing.
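
Roughly what enabling CDC looks like on SQL Server - the schema and table names are placeholders; you need the database-level call once, then one call per table:

# Enable CDC for the database (run once per database).
sqlcmd -S <host> -d <database> -Q "EXEC sys.sp_cdc_enable_db;"

# Enable CDC for each table you want captured - miss one and that table never flows.
sqlcmd -S <host> -d <database> -Q "EXEC sys.sp_cdc_enable_table @source_schema = 'dbo', @source_name = 'orders', @role_name = NULL;"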

[Diagrams: Debezium CDC architecture, change data capture flow, Debezium Server architecture]

What Actually Works in Production

Latency: Usually sub-second, but can spike to minutes when your connector decides to shit the bed. Monitor your lag metrics or you'll be blind.

Database Impact: Minimal until it's not. Oracle LogMiner can peg a CPU core, and PostgreSQL replication slots will fill your disk if the connector stops consuming. I learned this during a 6-hour outage.
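
This is the query I wish I'd had before that outage - a rough sketch for PostgreSQL 10+ showing how much WAL each slot is pinning; host and credentials are placeholders:

# How much WAL is each replication slot holding back?
psql -h <host> -U postgres -c "
  SELECT slot_name, active,
         pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
  FROM pg_replication_slots;"

Alert when retained_wal keeps growing while active is false - that's the connector not consuming.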

Ordering: Per-partition ordering works, but if you're sharding data across multiple partitions, global ordering goes out the window. Design your partition keys carefully or accept eventual consistency.

Failure Recovery: Debezium stores offsets in Kafka, so recovery works well. But if you lose your offset data, you're starting over with a full snapshot. We've been there - 48 hours to catch up on a 500GB table.

[Diagram: log-based change data capture]

This whole setup lets you stop polling databases and writing triggers, which is worth the complexity. Just don't expect it to work perfectly on day one.

CDC Tools: The Good, Bad, and Expensive

| Feature | Debezium | AWS DMS | Oracle GoldenGate | Airbyte | Striim |
|---|---|---|---|---|---|
| Free to Use | Yes | No (pay-as-you-go) | No | Open-source core, paid cloud | No |
| Actually Works | Most of the time | When AWS wants it to | Yes, if you pay enough | Hit or miss | Usually |
| Setup Complexity | High (Kafka required) | Medium (AWS magic) | Nightmare | Low (nice UI) | Medium |
| Database Support | 8 that matter | 15+ with caveats | Oracle + extras | 10+ varying quality | 100+ (quantity ≠ quality) |
| Performance | Good when tuned | AWS throttles you | Excellent | Depends | Enterprise-grade |
| When It Breaks | Stack Overflow | AWS Support ticket | Call Oracle ($$) | GitHub issues | Enterprise support |
| Real-World Cost | Infrastructure only (~$500/month for our setup) | $$$+ ($2-5k/month) | $$$$$+ (Oracle tax = mortgage) | Freemium trap ($200-2k+/month) | Enterprise pricing (call for quote = expensive) |
| Learning Curve | Steep | Gentle slope | Mountain | Easy start | Manageable |
| Documentation | Decent | AWS-grade | Oracle-grade | Pretty good | Enterprise-grade |

Production Use Cases (And How They Break)

We've been running Debezium for 2 years across multiple services. Here's what actually works and what fails spectacularly at 3am.

Microservices Data Sync (The Original Sin)

Our order service writes to PostgreSQL, and we need inventory updates in real-time. Sounds simple, right? The outbox pattern works until it doesn't.

What works: Normal flow handles 10k orders/day fine. Event-driven architecture keeps services decoupled and happy.

What breaks: Schema changes. We added a column to the orders table and forgot to update the outbox schema. Spent 6 hours debugging why events stopped flowing. The connector didn't crash, it just silently ignored the new structure.
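
If you go the outbox route, the table itself is the easy part. This is just a sketch of the shape Debezium's outbox event router expects - the column names are conventions and can be remapped in the connector config; database name, host, and credentials are placeholders:

# A minimal outbox table sketch for PostgreSQL.
psql -h <host> -U postgres -d <database> -c "
  CREATE TABLE outbox (
    id            uuid PRIMARY KEY,
    aggregatetype varchar(255) NOT NULL,  -- used for topic routing, e.g. 'order'
    aggregateid   varchar(255) NOT NULL,  -- becomes the Kafka message key
    type          varchar(255) NOT NULL,  -- event type, e.g. 'OrderCreated'
    payload       jsonb NOT NULL          -- the event body
  );"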

Real-Time Analytics (Mostly Real-Time)

We stream database changes to ClickHouse for real-time dashboards. Works great for normal operations.

The failure: During a marketing campaign, order volume spiked 50x. Debezium couldn't keep up, lag increased to 20 minutes, and our "real-time" dashboards were showing stale data. Kafka partitioning saved us - increased partitions from 3 to 12 and scaled horizontally.
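
For the record, bumping partitions on an existing topic is a one-liner - topic name and broker are placeholders, and keep in mind that keys will hash to different partitions afterwards:

# Partition count can only go up, never down.
kafka-topics.sh --bootstrap-server <broker> --alter --topic <cdc-topic> --partitions 12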

Lesson learned: Load test your CDC pipeline. Monitor consumer lag religiously or you'll be flying blind.

Search Index Sync (When Elasticsearch Fights Back)

We use Debezium to keep Elasticsearch indexes in sync with PostgreSQL. The setup is straightforward until Elasticsearch decides to shit the bed.

Production incident: Elasticsearch cluster went down for maintenance. Debezium kept streaming to Kafka, but when ES came back up, we had 4 hours of events to replay. The bulk indexing overwhelmed ES and created a feedback loop - ES couldn't keep up, so more events backed up, making it worse.

Fix: Implement backpressure and circuit breakers. When the downstream system is struggling, slow down the connector instead of making it worse.
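
We never found a perfect backpressure mechanism, but pausing the connector through the Kafka Connect REST API is the blunt instrument that works - connector name and host are placeholders:

# Pause while the downstream system digs itself out, then resume.
curl -X PUT http://localhost:8083/connectors/<connector-name>/pause
curl -X PUT http://localhost:8083/connectors/<connector-name>/resume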

Cache Invalidation (The Hard Way)

Redis cache invalidation using Debezium events works well for simple cases. Complex cache dependencies will kill you.

The problem: User profiles are cached, but they reference multiple tables (users, preferences, subscriptions). When any related data changes, we need to invalidate the user cache. Sounds simple, but tracking all the relationships is a nightmare.

Current solution: We gave up on surgical cache invalidation and just TTL everything at 5 minutes. Not elegant, but it works. Sometimes the simple solution is the right solution.

[Diagram: Debezium production architecture]

Production monitoring is critical for Debezium deployments. Monitor connector lag, memory usage, and failure rates to prevent data loss and performance issues.

Configuration Hell (The Real Challenge)

Kafka Cluster: We run 5 brokers because 3 wasn't enough headroom. Replication factor of 3 with min.insync.replicas=2. Learned this during a broker failure that took down our entire pipeline.
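
If you pre-create your CDC topics, bake those settings in - a rough sketch, topic name and broker are placeholders:

# Replication factor 3 plus min.insync.replicas=2 survives a single broker failure.
kafka-topics.sh --bootstrap-server <broker> --create --topic <cdc-topic> \
  --partitions 12 --replication-factor 3 --config min.insync.replicas=2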

Memory Settings: Default JVM settings are a joke. We run connectors with 8GB heap because 2GB caused frequent GC pauses and connector restarts.
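
On a stock Kafka distribution the worker heap comes from an environment variable - roughly like this, assuming you start Connect with the bundled scripts:

# Give the Connect worker a real heap before starting it.
export KAFKA_HEAP_OPTS="-Xms8g -Xmx8g"
connect-distributed.sh config/connect-distributed.properties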

Database Config: PostgreSQL wal_level=logical and max_replication_slots=10. MySQL binlog_format=ROW and binlog_row_image=FULL. Oracle supplemental logging enabled - forgetting this cost us 4 hours of debugging.

Monitoring Stack: Prometheus + Grafana for metrics, JMX metrics exported via Kafka Connect. Critical alerts on connector lag > 60 seconds and connector failures.
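
A cheap health check we run alongside the dashboards - this assumes the Connect REST API is on localhost:8083 and jq is installed:

# Print any connector that isn't RUNNING.
curl -s "http://localhost:8083/connectors?expand=status" \
  | jq -r 'to_entries[] | select(.value.status.connector.state != "RUNNING") | .key'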

Schema Evolution (The Silent Killer)

The incident: Added a NOT NULL column to a table without a default value. PostgreSQL was fine (we added the column correctly), but the Debezium connector crashed during the next snapshot. Took down the entire pipeline for 3 hours while we figured out that schema changes need to be coordinated with connector restarts.

Current process: Schema changes require downtime windows and connector restarts. Not ideal, but better than random failures.

What Actually Works

  • Simple table changes (inserts, updates, deletes)
  • Single-database connectors
  • Well-partitioned Kafka topics
  • Monitoring everything with proper alerting

What Doesn't Work

  • Cross-database transactions
  • Complex schema evolution
  • Expecting sub-second latency under load
  • Running without proper monitoring

The Bottom Line

After 2 years of running Debezium in production, here's the truth: it's the best CDC solution available, but that's not saying much. The CDC space is full of half-baked solutions and enterprise bullshit.

Debezium works, but you'll earn every bit of that reliability through debugging sessions, memory tuning, and monitoring setup. The payoff is worth it - no more polling databases or writing triggers. Just clean, event-driven architecture that scales.

If you're considering Debezium, budget time for learning Kafka first. If you already have Kafka, Debezium is a no-brainer. If you don't want to deal with Kafka, maybe reconsider whether you actually need CDC.

The tool has saved us hundreds of hours of manual data syncing and prevented countless bugs from stale data. Just don't expect it to work perfectly on day one, and you'll be fine.

Frequently Asked Questions (The Real Answers)

Q: Why does my Debezium connector keep restarting randomly?

A: Memory leaks, that's why. The default heap size of 2GB is a joke for any real workload. We run 8GB minimum, and even that's tight with complex schemas.

Also check your Kafka Connect worker logs for OutOfMemoryError or GC thrashing. If you see java.lang.OutOfMemoryError: Java heap space, you need more memory. Period.

Q: My connector is "running" but no events are flowing. What's wrong?

A: First thing to check: are you actually making changes to the database? I've spent hours debugging a "broken" connector that was working fine - I just wasn't changing any data.

If data is changing, check the connector status via the REST API (see the sketch after this list). Look for FAILED tasks or check if the connector is paused. Common causes:

  • Database permissions (connector can't read transaction logs)
  • Network connectivity issues
  • Schema registry is down
  • Your database doesn't have the right logging enabled
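
Checking a single connector looks roughly like this - connector name and host are placeholders:

# Connector and task states in one shot.
curl -s http://localhost:8083/connectors/<connector-name>/status | jq .

# Restart a failed task (task 0 here) without touching the others.
curl -X POST http://localhost:8083/connectors/<connector-name>/tasks/0/restart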

Q: How do I fix "Failed to flush offsets to storage" errors?

A: This error means Kafka Connect can't write to the __connect-offsets topic. Usually it's because:

  • Kafka cluster is down or unreachable
  • Not enough brokers available (check min.insync.replicas)
  • Kafka Connect worker misconfigured

Increase offset.flush.timeout.ms to 60000 (60 seconds) if you're on a slow network. Default 5 seconds is too aggressive for most deployments.

Q: Why is my PostgreSQL connector failing with "replication slot does not exist"?

A: PostgreSQL dropped your replication slot, probably because:

  • Connector was down too long and slot was auto-dropped
  • Database restart without preserving slots
  • Someone manually dropped it (check with your DBA)

Create a new replication slot manually:

SELECT pg_create_logical_replication_slot('debezium_slot', 'pgoutput');

Or just restart the connector and let Debezium recreate the slot - keep slot.drop.on.stop=false so it isn't dropped again the next time the connector stops.

Q: MySQL connector says "binlog position no longer available" - now what?

A: Your MySQL binlogs rotated and the old position is gone. This happens when:

  • Connector was down longer than binlog retention period
  • MySQL binlog expiration is too short
  • Someone purged binlogs manually

You're fucked. You need to do a new snapshot, which means downtime and potential data loss for the gap period.

Set binlog_expire_logs_seconds to at least 7 days to avoid this.
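
On MySQL 8.0+ that's one statement - host and credentials are placeholders:

# 7 days = 604800 seconds; PERSIST survives a server restart.
mysql -h <host> -u root -p -e "SET PERSIST binlog_expire_logs_seconds = 604800;"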

Q: Oracle connector keeps crashing with LogMiner errors. Help?

A: Oracle LogMiner is a pain in the ass. Common issues:

  • Supplemental logging not enabled properly
  • Redo logs getting archived faster than LogMiner can read them
  • Memory issues (LogMiner is a memory hog)

Enable supplemental logging for all tables:

ALTER DATABASE ADD SUPPLEMENTAL LOG DATA (ALL) COLUMNS;

And pray to the Oracle gods that it works.

Q: How do I handle schema changes without breaking everything?

A: You don't. Schema evolution is Debezium's weakest point. Our process:

  1. Make the change backwards-compatible if possible
  2. Deploy to a test environment first
  3. Plan for connector downtime
  4. Make the schema change
  5. Restart the connector
  6. Test everything thoroughly

There's no magic bullet. Schema registry helps but doesn't solve the fundamental problem.

Q: Why are my events delayed by several minutes?

A: Usually it's downstream bottlenecks:

  • Consumer can't keep up with producer
  • Kafka cluster under load
  • Network issues between components
  • JMX metrics show high lag

Check consumer lag first: kafka-consumer-groups.sh --bootstrap-server <broker> --describe --group your-group. If lag is growing, your consumer is the bottleneck, not Debezium.

Q: Can I run Debezium without Kafka?

A: Yes, with Debezium Server, but you lose:

  • Fault tolerance
  • Horizontal scaling
  • Built-in offset management
  • The entire Kafka ecosystem

I tried it for 2 weeks. Went back to Kafka Connect and never looked back.

Q: How do I monitor this thing properly?

A: JMX metrics are your friend. Key metrics to monitor:

  • Connector lag (most important)
  • Connector status (running/failed)
  • Database connection health
  • Memory usage

We use Prometheus + Grafana with alerts on:

  • Connector lag > 60 seconds
  • Any connector failures
  • Memory usage > 80%

Q: What happens if I accidentally delete the offset topic?

A: You start over with a full snapshot. All your offset data is gone, so Debezium doesn't know where it left off.

Back up your __connect-offsets topic regularly if you care about recovery time. Otherwise, plan for a long weekend of full snapshots.
