Debezium CDC: Production Implementation Guide
Critical Configuration Requirements
Database Prerequisites
- PostgreSQL: `wal_level=logical`, `max_replication_slots=10` minimum
- MySQL: `binlog_format=ROW`, `binlog_row_image=FULL`, `binlog_expire_logs_seconds=604800` (7 days)
- Oracle: Supplemental logging enabled: `ALTER DATABASE ADD SUPPLEMENTAL LOG DATA (ALL) COLUMNS;`
- SQL Server: CDC enabled at database and table level
- MongoDB: Replica set required (single node will fail)
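A minimal SQL sketch of the checks above, assuming administrative access on each engine; the SQL Server table `dbo.orders` is a hypothetical example, and the PostgreSQL settings require a server restart to take effect. MongoDB is omitted because replica-set initiation happens in the mongo shell rather than SQL.

```sql
-- PostgreSQL: check, then enable logical decoding (requires a restart)
SHOW wal_level;
ALTER SYSTEM SET wal_level = 'logical';
ALTER SYSTEM SET max_replication_slots = 10;

-- MySQL: confirm the binlog settings Debezium needs
SHOW VARIABLES LIKE 'binlog_format';
SHOW VARIABLES LIKE 'binlog_row_image';
SHOW VARIABLES LIKE 'binlog_expire_logs_seconds';

-- Oracle: supplemental logging, as in the bullet above
ALTER DATABASE ADD SUPPLEMENTAL LOG DATA (ALL) COLUMNS;

-- SQL Server: enable CDC on the database, then on each captured table
EXEC sys.sp_cdc_enable_db;
EXEC sys.sp_cdc_enable_table
    @source_schema = N'dbo',
    @source_name   = N'orders',   -- hypothetical table
    @role_name     = NULL;
```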
Infrastructure Requirements
- Kafka Cluster: 5 brokers minimum (3 insufficient for production headroom)
- Replication Factor: 3 with `min.insync.replicas=2`
- Memory: 8GB heap minimum per connector (2GB default causes GC failures)
- Setup Time: 3 weeks for Kafka + Debezium (not 3 hours as documentation suggests)
Production Deployment Modes
Mode | Use Case | Fault Tolerance | Setup Complexity | Production Viability |
---|---|---|---|---|
Kafka Connect | Production standard | High (survives node failures) | 5 days | ✅ Recommended |
Debezium Server | Standalone without Kafka | None | 2 weeks trial period | ❌ Abandoned after 2 weeks |
Embedded Engine | Java library | None | High | ❌ Memory leaks at 3am |
Performance Thresholds
Latency Expectations
- Normal Operations: Sub-second latency
- Under Load: Can spike to minutes during connector failures
- Recovery Time: 48 hours for a full snapshot of a 500GB table
- Critical Threshold: Monitor lag > 60 seconds (alert immediately)
Breaking Points
- UI Failure: 1000+ spans make debugging distributed transactions impossible
- Volume Spike: 50x increase requires partition scaling (3→12 partitions)
- Memory Leak: Default 2GB heap causes frequent restarts
Critical Failure Scenarios
Schema Evolution (Silent Killer)
- Failure Case: NOT NULL column without default crashes connector during snapshot
- Impact: 3-hour pipeline downtime
- Prevention: Coordinate schema changes with connector restarts
- Process: Backwards-compatible changes only, test environment first, plan downtime
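A sketch of what "backwards-compatible changes only" can look like in practice, using PostgreSQL syntax and a hypothetical `customers.loyalty_tier` column; the exact DDL differs per database.

```sql
-- Risky: a NOT NULL column with no default can crash the connector mid-snapshot
-- ALTER TABLE customers ADD COLUMN loyalty_tier VARCHAR(20) NOT NULL;

-- Safer expand/contract sequence:
-- 1. Add the column as nullable with a default so existing rows stay valid
ALTER TABLE customers ADD COLUMN loyalty_tier VARCHAR(20) DEFAULT 'standard';

-- 2. Backfill any remaining NULLs while the connector keeps streaming
UPDATE customers SET loyalty_tier = 'standard' WHERE loyalty_tier IS NULL;

-- 3. Tighten the constraint last, inside the planned connector-restart window
ALTER TABLE customers ALTER COLUMN loyalty_tier SET NOT NULL;
```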
Offset Loss
- Trigger: Lost or corrupted `__connect-offsets` topic
- Impact: Full snapshot restart (48+ hours for large tables)
- Recovery: No automated recovery, manual full resync required
- Prevention: Regular offset topic backups
Replication Slot Failures (PostgreSQL)
- Cause: Connector down longer than slot retention
- Symptom: "replication slot does not exist" error
- Disk Impact: An inactive slot forces PostgreSQL to retain WAL, which can fill the database disk if the connector stops consuming
- Recovery: Manual slot recreation, potential data gap
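A PostgreSQL-side sketch for catching this before the disk fills, assuming PostgreSQL 10+ and a connector slot named `debezium_slot` (substitute whatever `slot.name` the connector was configured with).

```sql
-- How much WAL each slot is forcing the server to retain (watch inactive slots)
SELECT slot_name,
       active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;

-- Last resort for a truly orphaned slot: drop it and accept a full resync
SELECT pg_drop_replication_slot('debezium_slot');
```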
Binlog Position Loss (MySQL)
- Trigger: Binlog rotation during connector downtime
- Impact: Complete data resync required
- Prevention: `binlog_expire_logs_seconds=604800` minimum
- Recovery Time: Full snapshot duration (hours to days)
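A quick MySQL 8.0 sketch for checking retention before it bites (older versions use `expire_logs_days` instead).

```sql
-- Confirm retention is at least 7 days, and persist it across restarts
SHOW VARIABLES LIKE 'binlog_expire_logs_seconds';
SET PERSIST binlog_expire_logs_seconds = 604800;

-- Binlogs still on disk; if the connector's saved position predates the oldest
-- file listed here, the only recovery is a full snapshot
SHOW BINARY LOGS;
```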
Tool Comparison Matrix
Solution | Real Cost/Month | Setup Complexity | Reliability | Documentation Quality |
---|---|---|---|---|
Debezium | $500 (infrastructure) | High (Kafka required) | Good when tuned | Decent |
AWS DMS | $2-5k | Medium | AWS-dependent | AWS-grade |
Oracle GoldenGate | $$$$$+ (mortgage-level) | Nightmare | Excellent | Oracle-grade |
Airbyte | $200-2k+ (freemium trap) | Low | Hit or miss | Pretty good |
Striim | Enterprise pricing | Medium | Usually works | Enterprise-grade |
Production Use Case Failures
Microservices Data Sync
- Failure: Schema changes break outbox pattern
- Root Cause: Forgot to update outbox schema after column addition
- Duration: 6 hours of debugging (the connector didn't crash, it silently ignored the new column)
- Fix: Schema change coordination process required
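For reference, a hypothetical outbox table in PostgreSQL syntax; the column names follow the defaults Debezium's outbox event router typically looks for, but verify them against the router configuration for your connector version.

```sql
-- Written in the same transaction as the business change; any new column here
-- must be coordinated with the event router config and downstream consumers.
CREATE TABLE outbox (
    id            UUID          PRIMARY KEY,
    aggregatetype VARCHAR(255)  NOT NULL,  -- used to route events to topics
    aggregateid   VARCHAR(255)  NOT NULL,  -- becomes the Kafka message key
    type          VARCHAR(255)  NOT NULL,  -- event type for consumers
    payload       JSONB         NOT NULL   -- the serialized event body
);
```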
Real-Time Analytics
- Failure: 20-minute lag during marketing campaign
- Cause: 50x volume spike overwhelmed 3-partition setup
- Solution: Horizontal scaling to 12 partitions
- Lesson: Load test CDC pipeline, monitor consumer lag religiously
Search Index Sync
- Failure: Elasticsearch overwhelmed during 4-hour event replay
- Cause: Bulk indexing feedback loop after maintenance window
- Fix: Implement backpressure and circuit breakers
- Impact: Downstream system degradation affects entire pipeline
Cache Invalidation
- Complexity: Tracking relationships across multiple tables
- Solution: Abandoned surgical invalidation in favor of a 5-minute TTL
- Lesson: Simple solutions often beat complex ones
Monitoring Requirements
Critical Metrics
- Connector lag (alert > 60 seconds)
- Connector status (running/failed)
- Memory usage (alert > 80%)
- Database connection health
Infrastructure Stack
- Monitoring: Prometheus + Grafana
- Metrics Export: JMX via Kafka Connect
- Alerting: Connector failures and performance degradation
Common Production Issues
Random Connector Restarts
- Root Cause: Memory leaks with default 2GB heap
- Solution: 8GB minimum heap size
- Symptom: `OutOfMemoryError: Java heap space` in logs
Events Not Flowing
- Check: Whether changes are actually being committed to the source database
- Verify: Connector status via REST API
- Common Causes: Database permissions, network issues, schema registry down
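For the PostgreSQL case, one quick database-side check, assuming the connector uses a logical replication slot: if `confirmed_flush_lsn` stops advancing while `pg_current_wal_lsn()` keeps moving, changes are being written but nothing is consuming them.

```sql
-- Compare the slot's consumed position against the current WAL position
SELECT slot_name,
       active,
       confirmed_flush_lsn,
       pg_current_wal_lsn() AS current_wal_lsn
FROM pg_replication_slots;
```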
Offset Flush Failures
- Error: "Failed to flush offsets to storage"
- Cause: Kafka cluster unreachable or insufficient brokers
- Fix: Increase `offset.flush.timeout.ms` to 60000 (60 seconds)
Oracle LogMiner Crashes
- Requirements: Supplemental logging enabled
- Issues: Memory consumption, redo log archival speed
- Solution: Memory tuning and proper logging configuration
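A quick sanity check before blaming LogMiner, assuming a session with access to `v$database`.

```sql
-- Supplemental logging and archiving status
SELECT supplemental_log_data_min,
       supplemental_log_data_all,
       log_mode
FROM v$database;

-- Enable supplemental logging if it is off (as in the prerequisites section)
ALTER DATABASE ADD SUPPLEMENTAL LOG DATA;
ALTER DATABASE ADD SUPPLEMENTAL LOG DATA (ALL) COLUMNS;
```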
Resource Investment Reality
Time Requirements
- Kafka Setup: 3 weeks (not hours)
- Connector Configuration: 5 days for production-ready setup
- Schema Change Process: Manual coordination required
- Recovery Operations: 48+ hours for large table snapshots
Expertise Requirements
- Kafka Administration: Essential for troubleshooting
- Database Administration: Required for log configuration
- JVM Tuning: Necessary for memory optimization
- Monitoring Setup: Critical for operational visibility
Decision Criteria
Choose Debezium When
- Already have Kafka infrastructure
- Need open-source CDC solution
- Can invest in Kafka expertise
- Require horizontal scaling
Avoid Debezium When
- No Kafka experience
- Need immediate production deployment
- Cannot tolerate 3-week setup timeline
- Require 24/7 enterprise support
Bottom Line Assessment
After two years of production experience: Debezium is the best available CDC solution, but "best" reflects the poor state of the CDC market. It saved hundreds of hours of manual syncing and prevented countless stale-data bugs. Expect to earn that reliability through debugging, memory tuning, and monitoring setup. Budget 3 weeks of setup time, not 3 hours.
Useful Links for Further Investigation
Essential Debezium Resources
Link | Description |
---|---|
Debezium Official Documentation | Documentation that actually explains the hard parts, unlike most open source projects. Start here for connector configs that don't silently fail. |
Debezium Tutorial | Step-by-step tutorial that glosses over the hard parts. Good for getting started, but you'll be back here when it breaks. |
Architecture Overview | Architecture guide that explains why you need Kafka before you can even think about Debezium. Read this before you commit. |
Connector Configuration Reference | Config guides that don't skip the gotchas. These actually mention the settings that will save you from debugging hell. |
Debezium GitHub Repository | Source code and issue tracker. Read the closed issues before assuming your problem is unique - someone else hit it first. |
Debezium Examples Repository | Working examples that actually run. Copy these configs and modify - don't start from scratch like an idiot. |
Community Chat (Zulip) | Where to ask when Stack Overflow fails you. Actually helpful people who've made the same mistakes you're about to make. |
Debezium Blog | Release notes and war stories from production. Read the "lessons learned" posts to avoid repeating other people's pain. |
Change Data Capture with Apache Kafka | Comprehensive guide to CDC patterns and their implementation with Kafka and Debezium. |
Debezium Conference Talks - YouTube | Practical CDC presentations and demos. Skip the marketing fluff, watch the technical deep dives. |
QCon Presentation: Practical Change Data Streaming | Comprehensive presentation on real-world CDC use cases and implementation patterns. |
Apache Kafka Documentation | Essential reference for understanding Kafka architecture, configuration, and operations. |
Kafka Connect Documentation | Framework documentation for understanding how Debezium integrates with Kafka Connect. |
Confluent Schema Registry | Schema management solution for handling evolving data schemas with Debezium. |
Apicurio Registry | Open source schema registry alternative compatible with Debezium for schema evolution management. |
Debezium Monitoring Guide | Production monitoring best practices with JMX metrics, alerting, and operational visibility. |
Kubernetes Deployment Examples | Container orchestration patterns for deploying Debezium in Kubernetes environments. |
Docker Images and Compose Files | Official Docker images for quick deployment and development environment setup. |
Best CDC Tools Comparison (2025) | Comprehensive comparison of Debezium with alternative CDC solutions including Airbyte, AWS DMS, and Oracle GoldenGate. |
Red Hat Build of Debezium | Enterprise-supported version of Debezium with additional features and commercial support options. |