What is Debezium and Why You'll Need It (Eventually)

Debezium captures database changes by reading transaction logs, which sounds simple until you actually try to set it up. I've been running this shit for 2 years now, and here's what you need to know before you dive in.

The Setup Reality Check

Debezium runs on Kafka Connect, which means you need Kafka first. If you don't already have a Kafka cluster, plan for 3 weeks of setup, not 3 hours. The documentation makes it look easy - it's not.

Kafka Connect Mode: This is what everyone uses in production. Your connector runs in a distributed cluster, survives single node failures, and scales horizontally. Setting it up properly took me 5 days because the memory settings are garbage by default.
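
For reference, this is roughly what registering a connector against a distributed Kafka Connect cluster looks like - the connector name, hosts, and credentials below are placeholders, so check the config reference for your Debezium version before copying anything:

# Register a Debezium PostgreSQL connector with a distributed Kafka Connect cluster.
# All names, hosts, and credentials are placeholders - adjust for your environment.
curl -X POST http://localhost:8083/connectors \
  -H "Content-Type: application/json" \
  -d '{
    "name": "inventory-connector",
    "config": {
      "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
      "database.hostname": "postgres.internal",
      "database.port": "5432",
      "database.user": "debezium",
      "database.password": "change-me",
      "database.dbname": "inventory",
      "topic.prefix": "inventory",
      "plugin.name": "pgoutput"
    }
  }'

Kafka Connect spreads the connector's tasks across the workers in the cluster, which is where the fault tolerance comes from.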

Debezium Server: Standalone mode that doesn't require Kafka. Sounds great, right? Wrong. You lose fault tolerance and horizontal scaling. I tried this first - lasted exactly 2 weeks before I gave up and went back to Kafka Connect.

Embedded Engine: Java library you embed in your app. Don't do this unless you enjoy debugging memory leaks at 3am. The embedded engine will eat your heap and you'll have no idea why.

Database Support (The Real Story)

Version 3.2.2.Final (September 2025) supports these databases, but "supports" is doing some heavy lifting:

PostgreSQL: Works great once you enable logical replication. Just make sure your wal_level is set to logical or you'll waste 4 hours debugging why nothing works.

[Diagram: PostgreSQL logical replication architecture]
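
Before you start the connector, it's worth a sanity check - a rough sketch, with placeholder host and credentials, assuming you can restart PostgreSQL for the setting to take effect:

# Must report 'logical' for Debezium to work.
psql -h <host> -U postgres -c "SHOW wal_level;"

# If it says 'replica', change it - the new value only applies after a restart.
psql -h <host> -U postgres -c "ALTER SYSTEM SET wal_level = 'logical';"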

MySQL: The binlog setup is straightforward, but row-based replication is required. Mixed or statement-based replication will silently fuck you over.
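
A quick check before you blame the connector - rough sketch, host and credentials are placeholders:

# Debezium needs ROW format; FULL row images avoid missing before/after state.
mysql -h <host> -u root -p -e "SHOW VARIABLES LIKE 'binlog_format'; SHOW VARIABLES LIKE 'binlog_row_image';"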

Oracle: Prepare for pain. LogMiner works but requires supplemental logging enabled. XStream is faster but costs extra licensing. Either way, you'll need a DBA who doesn't hate you.

MongoDB: Uses change streams (which ride on the replica set oplog), so you need a replica set. A standalone MongoDB instance won't work - learned that the hard way.

SQL Server: Transaction log capture works, but the CDC feature must be enabled at both database and table level. Miss one table and you'll be wondering why data isn't flowing.
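
Roughly what enabling CDC looks like on SQL Server - the schema and table names are placeholders; you need the database-level call once, then one call per table:

# Enable CDC for the database (run once per database).
sqlcmd -S <host> -d <database> -Q "EXEC sys.sp_cdc_enable_db;"

# Enable CDC for each table you want captured - miss one and that table never flows.
sqlcmd -S <host> -d <database> -Q "EXEC sys.sp_cdc_enable_table @source_schema = 'dbo', @source_name = 'orders', @role_name = NULL;"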

[Diagrams: Debezium CDC architecture, change data capture flow, Debezium Server architecture]

What Actually Works in Production

Latency: Usually sub-second, but can spike to minutes when your connector decides to shit the bed. Monitor your lag metrics or you'll be blind.

Database Impact: Minimal until it's not. Oracle LogMiner can peg a CPU core, and PostgreSQL replication slots will fill your disk if the connector stops consuming. I learned this during a 6-hour outage.
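
This is the query I wish I'd had before that outage - a rough sketch for PostgreSQL 10+ showing how much WAL each slot is pinning; host and credentials are placeholders:

# How much WAL is each replication slot holding back?
psql -h <host> -U postgres -c "
  SELECT slot_name, active,
         pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
  FROM pg_replication_slots;"

Alert when retained_wal keeps growing while active is false - that's the connector not consuming.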

Ordering: Per-partition ordering works, but if you're sharding data across multiple partitions, global ordering goes out the window. Design your partition keys carefully or accept eventual consistency.

Failure Recovery: Debezium stores offsets in Kafka, so recovery works well. But if you lose your offset data, you're starting over with a full snapshot. We've been there - 48 hours to catch up on a 500GB table.

[Diagram: log-based change data capture]

This whole setup lets you stop polling databases and writing triggers, which is worth the complexity. Just don't expect it to work perfectly on day one.

CDC Tools: The Good, Bad, and Expensive

| Feature | Debezium | AWS DMS | Oracle GoldenGate | Airbyte | Striim |
|---|---|---|---|---|---|
| Free to Use | Yes | No (pay-as-you-go) | No | Open-source core, paid cloud | No |
| Actually Works | Most of the time | When AWS wants it to | Yes, if you pay enough | Hit or miss | Usually |
| Setup Complexity | High (Kafka required) | Medium (AWS magic) | Nightmare | Low (nice UI) | Medium |
| Database Support | 8 that matter | 15+ with caveats | Oracle + extras | 10+ varying quality | 100+ (quantity ≠ quality) |
| Performance | Good when tuned | AWS throttles you | Excellent | Depends | Enterprise-grade |
| When It Breaks | Stack Overflow | AWS Support ticket | Call Oracle ($$) | GitHub issues | Enterprise support |
| Real-World Cost | Infrastructure only (~$500/month for our setup) | $$$+ ($2-5k/month) | $$$$$+ (Oracle tax = mortgage) | Freemium trap ($200-2k+/month) | Enterprise pricing (call for quote = expensive) |
| Learning Curve | Steep | Gentle slope | Mountain | Easy start | Manageable |
| Documentation | Decent | AWS-grade | Oracle-grade | Pretty good | Enterprise-grade |

Production Use Cases (And How They Break)

We've been running Debezium for 2 years across multiple services. Here's what actually works and what fails spectacularly at 3am.

Microservices Data Sync (The Original Sin)

Our order service writes to PostgreSQL, and we need inventory updates in real-time. Sounds simple, right? The outbox pattern works until it doesn't.

What works: Normal flow handles 10k orders/day fine. Event-driven architecture keeps services decoupled and happy.

What breaks: Schema changes. We added a column to the orders table and forgot to update the outbox schema. Spent 6 hours debugging why events stopped flowing. The connector didn't crash, it just silently ignored the new structure.
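
If you go the outbox route, the table itself is the easy part. This is just a sketch of the shape Debezium's outbox event router expects - the column names are conventions and can be remapped in the connector config; database name, host, and credentials are placeholders:

# A minimal outbox table sketch for PostgreSQL.
psql -h <host> -U postgres -d <database> -c "
  CREATE TABLE outbox (
    id            uuid PRIMARY KEY,
    aggregatetype varchar(255) NOT NULL,  -- used for topic routing, e.g. 'order'
    aggregateid   varchar(255) NOT NULL,  -- becomes the Kafka message key
    type          varchar(255) NOT NULL,  -- event type, e.g. 'OrderCreated'
    payload       jsonb NOT NULL          -- the event body
  );"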

Real-Time Analytics (Mostly Real-Time)

We stream database changes to ClickHouse for real-time dashboards. Works great for normal operations.

The failure: During a marketing campaign, order volume spiked 50x. Debezium couldn't keep up, lag increased to 20 minutes, and our "real-time" dashboards were showing stale data. Kafka partitioning saved us - increased partitions from 3 to 12 and scaled horizontally.
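
For the record, bumping partitions on an existing topic is a one-liner - topic name and broker are placeholders, and keep in mind that keys will hash to different partitions afterwards:

# Partition count can only go up, never down.
kafka-topics.sh --bootstrap-server <broker> --alter --topic <cdc-topic> --partitions 12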

Lesson learned: Load test your CDC pipeline. Monitor consumer lag religiously or you'll be flying blind.

Search Index Sync (When Elasticsearch Fights Back)

We use Debezium to keep Elasticsearch indexes in sync with PostgreSQL. The setup is straightforward until Elasticsearch decides to shit the bed.

Production incident: Elasticsearch cluster went down for maintenance. Debezium kept streaming to Kafka, but when ES came back up, we had 4 hours of events to replay. The bulk indexing overwhelmed ES and created a feedback loop - ES couldn't keep up, so more events backed up, making it worse.

Fix: Implement backpressure and circuit breakers. When the downstream system is struggling, slow down the connector instead of making it worse.
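
We never found a perfect backpressure mechanism, but pausing the connector through the Kafka Connect REST API is the blunt instrument that works - connector name and host are placeholders:

# Pause while the downstream system digs itself out, then resume.
curl -X PUT http://localhost:8083/connectors/<connector-name>/pause
curl -X PUT http://localhost:8083/connectors/<connector-name>/resume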

Cache Invalidation (The Hard Way)

Redis cache invalidation using Debezium events works well for simple cases. Complex cache dependencies will kill you.

The problem: User profiles are cached, but they reference multiple tables (users, preferences, subscriptions). When any related data changes, we need to invalidate the user cache. Sounds simple, but tracking all the relationships is a nightmare.

Current solution: We gave up on surgical cache invalidation and just TTL everything at 5 minutes. Not elegant, but it works. Sometimes the simple solution is the right solution.

[Diagram: Debezium production architecture]

Production monitoring is critical for Debezium deployments. Monitor connector lag, memory usage, and failure rates to prevent data loss and performance issues.

Configuration Hell (The Real Challenge)

Kafka Cluster: We run 5 brokers because 3 wasn't enough headroom. Replication factor of 3 with min.insync.replicas=2. Learned this during a broker failure that took down our entire pipeline.
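
If you pre-create your CDC topics, bake those settings in - a rough sketch, topic name and broker are placeholders:

# Replication factor 3 plus min.insync.replicas=2 survives a single broker failure.
kafka-topics.sh --bootstrap-server <broker> --create --topic <cdc-topic> \
  --partitions 12 --replication-factor 3 --config min.insync.replicas=2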

Memory Settings: Default JVM settings are a joke. We run connectors with 8GB heap because 2GB caused frequent GC pauses and connector restarts.
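
On a stock Kafka distribution the worker heap comes from an environment variable - roughly like this, assuming you start Connect with the bundled scripts:

# Give the Connect worker a real heap before starting it.
export KAFKA_HEAP_OPTS="-Xms8g -Xmx8g"
connect-distributed.sh config/connect-distributed.properties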

Database Config: PostgreSQL wal_level=logical and max_replication_slots=10. MySQL binlog_format=ROW and binlog_row_image=FULL. Oracle supplemental logging enabled - forgetting this cost us 4 hours of debugging.

Monitoring Stack: Prometheus + Grafana for metrics, JMX metrics exported via Kafka Connect. Critical alerts on connector lag > 60 seconds and connector failures.
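
A cheap health check we run alongside the dashboards - this assumes the Connect REST API is on localhost:8083 and jq is installed:

# Print any connector that isn't RUNNING.
curl -s "http://localhost:8083/connectors?expand=status" \
  | jq -r 'to_entries[] | select(.value.status.connector.state != "RUNNING") | .key'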

Schema Evolution (The Silent Killer)

The incident: Added a NOT NULL column to a table without a default value. PostgreSQL was fine (we added the column correctly), but the Debezium connector crashed during the next snapshot. Took down the entire pipeline for 3 hours while we figured out that schema changes need to be coordinated with connector restarts.

Current process: Schema changes require downtime windows and connector restarts. Not ideal, but better than random failures.

What Actually Works

  • Simple table changes (inserts, updates, deletes)
  • Single-database connectors
  • Well-partitioned Kafka topics
  • Monitoring everything with proper alerting

What Doesn't Work

  • Cross-database transactions
  • Complex schema evolution
  • Expecting sub-second latency under load
  • Running without proper monitoring

The Bottom Line

After 2 years of running Debezium in production, here's the truth: it's the best CDC solution available, but that's not saying much. The CDC space is full of half-baked solutions and enterprise bullshit.

Debezium works, but you'll earn every bit of that reliability through debugging sessions, memory tuning, and monitoring setup. The payoff is worth it - no more polling databases or writing triggers. Just clean, event-driven architecture that scales.

If you're considering Debezium, budget time for learning Kafka first. If you already have Kafka, Debezium is a no-brainer. If you don't want to deal with Kafka, maybe reconsider whether you actually need CDC.

The tool has saved us hundreds of hours of manual data syncing and prevented countless bugs from stale data. Just don't expect it to work perfectly on day one, and you'll be fine.

Frequently Asked Questions (The Real Answers)

Q: Why does my Debezium connector keep restarting randomly?

A: Memory leaks, that's why. The default heap size of 2GB is a joke for any real workload. We run 8GB minimum, and even that's tight with complex schemas.

Also check your Kafka Connect worker logs for OutOfMemoryError or GC thrashing. If you see java.lang.OutOfMemoryError: Java heap space, you need more memory. Period.

Q: My connector is "running" but no events are flowing. What's wrong?

A: First thing to check: are you actually making changes to the database? I've spent hours debugging a "broken" connector that was working fine - I just wasn't changing any data.

If data is changing, check the connector status via the REST API (see the sketch after this list). Look for FAILED tasks or check if the connector is paused. Common causes:

  • Database permissions (connector can't read transaction logs)
  • Network connectivity issues
  • Schema registry is down
  • Your database doesn't have the right logging enabled
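
Checking a single connector looks roughly like this - connector name and host are placeholders:

# Connector and task states in one shot.
curl -s http://localhost:8083/connectors/<connector-name>/status | jq .

# Restart a failed task (task 0 here) without touching the others.
curl -X POST http://localhost:8083/connectors/<connector-name>/tasks/0/restart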

Q: How do I fix "Failed to flush offsets to storage" errors?

A: This error means Kafka Connect can't write to the __connect-offsets topic. Usually it's because:

  • Kafka cluster is down or unreachable
  • Not enough brokers available (check min.insync.replicas)
  • Kafka Connect worker misconfigured

Increase offset.flush.timeout.ms to 60000 (60 seconds) if you're on a slow network. Default 5 seconds is too aggressive for most deployments.

Q: Why is my PostgreSQL connector failing with "replication slot does not exist"?

A: PostgreSQL dropped your replication slot, probably because:

  • Connector was down too long and slot was auto-dropped
  • Database restart without preserving slots
  • Someone manually dropped it (check with your DBA)

Create a new replication slot manually:

SELECT pg_create_logical_replication_slot('debezium_slot', 'pgoutput');

Or just restart the connector and let Debezium recreate the slot - keep slot.drop.on.stop=false so it isn't dropped again the next time the connector stops.

Q: MySQL connector says "binlog position no longer available" - now what?

A: Your MySQL binlogs rotated and the old position is gone. This happens when:

  • Connector was down longer than binlog retention period
  • MySQL binlog expiration is too short
  • Someone purged binlogs manually

You're fucked. You need to do a new snapshot, which means downtime and potential data loss for the gap period.

Set binlog_expire_logs_seconds to at least 7 days to avoid this.
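
On MySQL 8.0+ that's one statement - host and credentials are placeholders:

# 7 days = 604800 seconds; PERSIST survives a server restart.
mysql -h <host> -u root -p -e "SET PERSIST binlog_expire_logs_seconds = 604800;"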

Q: Oracle connector keeps crashing with LogMiner errors. Help?

A: Oracle LogMiner is a pain in the ass. Common issues:

  • Supplemental logging not enabled properly
  • Redo logs getting archived faster than LogMiner can read them
  • Memory issues (LogMiner is a memory hog)

Enable supplemental logging for all tables:

ALTER DATABASE ADD SUPPLEMENTAL LOG DATA (ALL) COLUMNS;

And pray to the Oracle gods that it works.

Q: How do I handle schema changes without breaking everything?

A: You don't. Schema evolution is Debezium's weakest point. Our process:

  1. Make the change backwards-compatible if possible
  2. Deploy to a test environment first
  3. Plan for connector downtime
  4. Make the schema change
  5. Restart the connector
  6. Test everything thoroughly

There's no magic bullet. Schema registry helps but doesn't solve the fundamental problem.

Q: Why are my events delayed by several minutes?

A: Usually it's downstream bottlenecks:

  • Consumer can't keep up with producer
  • Kafka cluster under load
  • Network issues between components
  • JMX metrics show high lag

Check consumer lag first: kafka-consumer-groups.sh --bootstrap-server <broker> --describe --group your-group. If lag is growing, your consumer is the bottleneck, not Debezium.

Q: Can I run Debezium without Kafka?

A: Yes, with Debezium Server, but you lose:

  • Fault tolerance
  • Horizontal scaling
  • Built-in offset management
  • The entire Kafka ecosystem

I tried it for 2 weeks. Went back to Kafka Connect and never looked back.

Q: How do I monitor this thing properly?

A: JMX metrics are your friend. Key metrics to monitor:

  • Connector lag (most important)
  • Connector status (running/failed)
  • Database connection health
  • Memory usage

We use Prometheus + Grafana with alerts on:

  • Connector lag > 60 seconds
  • Any connector failures
  • Memory usage > 80%

Q: What happens if I accidentally delete the offset topic?

A: You start over with a full snapshot. All your offset data is gone, so Debezium doesn't know where it left off.

Back up your __connect-offsets topic regularly if you care about recovery time. Otherwise, plan for a long weekend of full snapshots.
