I've been on-call for CDC failures at three different companies over the past 6 years. These are the 5 disasters that will absolutely ruin your weekend - the real production failures that page you at 2am, not the theoretical edge cases in vendor documentation. If you're running CDC in production, you WILL hit these. It's not a matter of if, it's when.
PostgreSQL WAL Eats Your Entire Disk
This happened twice in my first year at a fintech startup - both times during major product launches when we couldn't afford downtime. First time was a Saturday morning during our Black Friday event. I woke up to 47 Slack notifications and angry messages from the CEO. WAL directory went from maybe 5GB to completely maxing out our 2TB drive in about 4 hours because some network hiccup caused the replication slot to get stuck, but PostgreSQL kept happily writing WAL files anyway.
Here's what actually happens: Debezium stops processing for some reason (network hiccup, memory issue, whatever), but PostgreSQL keeps writing WAL files. PostgreSQL can't clean them up because the replication slot is still there, and its restart_lsn pins every WAL segment since the last position the connector confirmed. Your disk fills up, the database stops accepting writes, and your phone starts buzzing at 3am.
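If you catch it while it's happening, you can watch the pile-up from inside the database before the OS even notices - a quick sketch (pg_ls_waldir() needs PostgreSQL 10+ and a superuser or pg_monitor role; the connection details are placeholders):
# How many WAL segments are sitting on disk and how much space they're eating.
psql -h your-db-host -U postgres -d yourdb -c \
  "SELECT count(*) AS wal_files, pg_size_pretty(sum(size)) AS wal_on_disk
   FROM pg_ls_waldir();"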
The Fix:
First, stop the bleeding. Check which replication slot is hogging all the WAL space:
SELECT slot_name,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS lag_size,
       active
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC;
If the lag_size is massive (like 40GB), just drop the slot - stop the connector first, because PostgreSQL refuses to drop a slot that still has an active connection attached. Yes, you'll lose some data. No, it's not worth your weekend:
SELECT pg_drop_replication_slot('debezium');
Then restart your Debezium connector. It'll create a new slot and take a fresh snapshot. Takes forever but at least your database is accepting writes again.
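If you deleted the connector to free up the slot (rather than just pausing it), recreating it is just a POST back to Kafka Connect with your original config. A rough sketch, assuming a Debezium 2.x PostgreSQL connector with pgoutput - on older 1.x versions the topic.prefix property is called database.server.name - and every host, credential, and name below is a placeholder:
# Sketch only - swap in your own hosts, credentials, and names.
curl -X POST http://<connect-host>:8083/connectors \
  -H "Content-Type: application/json" \
  -d '{
    "name": "your-connector",
    "config": {
      "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
      "plugin.name": "pgoutput",
      "database.hostname": "your-db-host",
      "database.port": "5432",
      "database.user": "debezium",
      "database.password": "********",
      "database.dbname": "yourdb",
      "topic.prefix": "yourdb",
      "slot.name": "debezium",
      "snapshot.mode": "initial"
    }
  }'
One gotcha: with snapshot.mode set to initial, Debezium only snapshots when it has no stored offsets for that connector name, so if it comes back up and refuses to re-snapshot, give the connector a fresh name or clear its stored offsets first.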
Prevention: Add this to postgresql.conf (PostgreSQL 13 and newer) so PostgreSQL invalidates a runaway slot instead of eating your disk:
max_slot_wal_keep_size = 5GB
I learned this the hard way after the second disk-full incident. Your future self will thank you.
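The other half of prevention is noticing the lag before the disk does. A cron-able sketch that yells when any slot is retaining too much WAL - the 10GB threshold and the psql connection details are assumptions, and the echo is a stand-in for whatever alerting you actually use:
#!/usr/bin/env bash
# Sketch: alert when any replication slot is retaining too much WAL.
# Threshold and connection settings are placeholders - tune for your setup.
set -euo pipefail

THRESHOLD_BYTES=$((10 * 1024 * 1024 * 1024))   # 10GB of retained WAL

LAG=$(psql -At -h your-db-host -U postgres -d yourdb -c \
  "SELECT coalesce(max(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)), 0)
   FROM pg_replication_slots;")

if [ "${LAG%.*}" -gt "$THRESHOLD_BYTES" ]; then
  # Replace with your real alerting (PagerDuty, Slack webhook, etc.)
  echo "WARNING: replication slot WAL lag is ${LAG} bytes" >&2
  exit 1
fi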
Check out the PostgreSQL documentation on replication slots for more details on how WAL management works. Also see PostgreSQL WAL configuration and monitoring WAL usage. For troubleshooting WAL issues specifically, check out this PostgreSQL wiki guide and EDB's replication troubleshooting guide.
Kafka Connect Lies About Being Healthy
Kafka Connect's status endpoint is a liar. It'll show "RUNNING" while doing absolutely nothing. I've seen this happen when connectors run out of memory, hit network timeouts, or just randomly decide to stop working.
The default 1GB heap is a joke for any real workload. I've seen connectors restart every 20 minutes because they can't fit a single large transaction in memory. The error messages are useless too - either an OutOfMemoryError or just a generic "connector failed" with no context.
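The first halfway honest signal is the task state, not the connector state - Connect will happily report the connector as RUNNING while every task under it is FAILED. A quick check with curl and jq (jq is the only extra assumption here); it prints each task's state plus the first line of the stack trace if a task has failed:
curl -s http://<connect-host>:8083/connectors/your-connector/status \
  | jq -r '.tasks[] | "task \(.id): \(.state)", (.trace // empty | split("\n")[0])'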
The Fix:
First thing: increase the heap to something reasonable:
export KAFKA_HEAP_OPTS="-Xmx4G -Xms4G"
Then restart the connector. Don't waste time debugging - 60% of Kafka Connect issues are fixed by restarting:
curl -X POST http://<connect-host>:8083/connectors/your-connector/restart
If it keeps happening, you probably have large transactions killing the connector. Find anything that's been running for more than 10 minutes:
SELECT pid, query_start, query, state
FROM pg_stat_activity
WHERE state = 'active'
  AND query_start < NOW() - INTERVAL '10 minutes';
Prevention: I have a cron job that restarts all connectors every Sunday at 2am. Yes, it's a total hack. No, I don't give a shit about "best practices" - this ugly solution has saved me from getting paged on more weekends than any fancy monitoring setup ever did. Confluent's support will tell you this is wrong, but Confluent's support has never had to debug a hung connector at 3am on Christmas morning.
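For what it's worth, the whole hack is about ten lines - a sketch of roughly what mine looks like (the Connect host is a placeholder, and on Kafka 3.0+ you can tack ?includeTasks=true onto the restart call to bounce the tasks as well):
#!/usr/bin/env bash
# The Sunday 2am hack: restart every connector on the cluster.
# <connect-host> is a placeholder; schedule it with cron, e.g.:
#   0 2 * * 0 /usr/local/bin/restart-connectors.sh
set -euo pipefail

CONNECT_URL="http://<connect-host>:8083"

for connector in $(curl -s "$CONNECT_URL/connectors" | jq -r '.[]'); do
  echo "restarting $connector"
  curl -s -X POST "$CONNECT_URL/connectors/$connector/restart" > /dev/null
done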
The Kafka Connect REST API documentation has all the endpoints you need for scripting connector management. Also useful: Apache Kafka Connect documentation, Confluent's Connect troubleshooting guide, monitoring Connect clusters, and Connect performance tuning.
Schema Changes Will Ruin Your Week
Three months into my current job, some brilliant backend engineer decided to add a NOT NULL created_by_user_id column to our 50-million-row users table. No migration. No heads up to the data team. Just merged it into master and deployed Friday at 4:30pm before leaving for a long weekend. Saturday morning I'm trying to enjoy coffee with my girlfriend when my phone starts going off with PagerDuty alerts. The CDC pipeline is completely fucked with AvroTypeException errors because the schema evolution just broke everything downstream.
This is my least favorite CDC failure because there's no quick fix. When schema evolution breaks, you usually have to drop everything and start over. The connector gets confused about what schema version it's using, and trying to fix it often makes things worse.
The Fix:
Stop the connector immediately before it pushes more broken events downstream - deleting it is fine here, since you're going to recreate it anyway:
curl -X DELETE http://<connect-host>:8083/connectors/your-connector
Drop the replication slot and start fresh:
SELECT pg_drop_replication_slot('debezium');
Recreate the connector. It'll take a new snapshot, which means a few hours of pipeline downtime while it reads your entire database again.
Prevention: Make schema changes require approval from whoever owns CDC. Set up a staging environment that actually tests schema changes with CDC running. Most teams skip this and learn the hard way. See Debezium schema evolution docs and Confluent Schema Registry compatibility for proper schema management. Also check database migration best practices and MySQL schema change strategies.
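If you're on Confluent Schema Registry, you can also make the check mechanical: before a schema change ships, ask the registry whether the new schema is compatible with what's already registered. A sketch - the registry host, the subject name, and the schema file are all placeholders:
# Ask Schema Registry whether the new schema is compatible with the latest
# registered version for the subject. Returns {"is_compatible": true|false}.
SCHEMA=$(jq -Rs . < new-users-value.avsc)   # JSON-escape the schema file
curl -s -X POST \
  -H "Content-Type: application/vnd.schemaregistry.v1+json" \
  --data "{\"schema\": $SCHEMA}" \
  http://<schema-registry-host>:8081/compatibility/subjects/yourdb.public.users-value/versions/latest
Wire that into CI for the repo that owns your migrations and the Friday-afternoon surprise becomes a failed build instead of a Saturday page.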
MySQL Binlog Position Disappears Into Thin Air
MySQL is worse than PostgreSQL for CDC. The binlog position gets lost randomly, especially after database restarts or network issues. When this happens, you either miss data or the connector tries to reprocess everything from the beginning.
I once spent most of a Sunday debugging this "mysterious" CDC lag that kept climbing from 10 seconds to 2 hours, then dropping back to zero, then climbing again. Turns out our MySQL DBA had changed the binlog rotation from 1GB files to 100MB files for "better backup performance" without telling anyone. Debezium couldn't keep up with files rotating every 15 minutes and kept losing its position and restarting from the beginning of each file. Six hours of my weekend gone because of a configuration change nobody documented.
The Fix:
Check if your binlog position still exists:
SHOW MASTER STATUS;
SHOW BINARY LOGS;
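To see which file Debezium thinks it's on, peek at Kafka Connect's offsets topic - a sketch assuming common defaults (the topic name is whatever offset.storage.topic is set to in your worker config, usually connect-offsets; the MySQL connector's stored offset is a small JSON blob that includes the binlog file and position):
# Dump the stored source offsets; look for your connector's key and the
# binlog file/position in the value.
kafka-console-consumer.sh \
  --bootstrap-server <kafka-broker>:9092 \
  --topic connect-offsets \
  --from-beginning \
  --property print.key=true \
  --timeout-ms 10000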
If the file that Debezium thinks it should read is gone, you're screwed. You'll have to restart with a fresh snapshot:
curl -X DELETE http://<connect-host>:8083/connectors/mysql-connector
Prevention: Increase binlog retention to at least 7 days (binlog_expire_logs_seconds is MySQL 8.0+; older versions use expire_logs_days), and put it in my.cnf as well so it survives a restart:
SET GLOBAL binlog_expire_logs_seconds = 604800;
Also, monitor binlog processing lag. If it gets behind by more than an hour, something's wrong. Read MySQL binlog management, replication monitoring, Debezium MySQL connector docs, and MySQL performance tuning for replication. For troubleshooting, check MySQL replication FAQ.
Network Issues Turn CDC Into a Nightmare
Your CDC pipeline works fine until there's a network hiccup between the database and Kafka. Then everything explodes. Lag goes from 100ms to 30 minutes. Connectors start rebalancing. Messages get duplicated or lost.
The worst part is that network issues are invisible to most monitoring. Everything shows "healthy" while your CDC lag climbs higher and higher.
The Fix:
First, actually check if there are network issues:
ping -c 100 your-kafka-broker | grep 'packet loss'
If you're seeing packet loss or high latency, that's your problem. Fix the network or move components closer together.
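Ping only proves ICMP works; it says nothing about the broker port or what a real handshake costs. A couple of extra checks worth running - the broker host and port are placeholders, and kafka-broker-api-versions.sh ships in the Kafka distribution's bin directory:
# Is the broker port actually reachable, and how long does a real Kafka
# round trip take?
nc -vz <kafka-broker> 9092
time kafka-broker-api-versions.sh --bootstrap-server <kafka-broker>:9092 > /dev/null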
Increase timeouts to handle network flakiness:
{
  "consumer.session.timeout.ms": "60000",
  "consumer.max.poll.interval.ms": "600000"
}
Reality check: Cross-AZ deployments look good on paper but create operational headaches. If you can put CDC components in the same AZ, do it. For network troubleshooting, see Kafka networking guide, PostgreSQL connection troubleshooting, AWS VPC networking best practices, and monitoring network performance.
What You've Learned (The Hard Way)
By now you should understand why I started this guide by saying CDC will definitely break. These five disasters - WAL accumulation, lying Kafka Connect status, schema changes, MySQL binlog position loss, and network issues - represent about 90% of the actual production failures I've debugged over the years.
The most important lesson I can share: stop trying to prevent every possible failure. That's an impossible goal that will drive you insane. Instead, focus on making failures recoverable. Monitor the basics religiously (disk space, connector status, lag), build runbooks for these common disasters, and accept that some data loss is better than spending your entire weekend trying to recover corrupted state.
The next sections will show you exactly how to debug the edge cases when these basic fixes don't work, and how to build monitoring that actually catches problems before they page you.
Quick reference for 3am debugging:
- Check disk space first - WAL accumulation kills everything
- Restart connectors - fixes most random issues
- Drop and recreate replication slots when in doubt
- Increase timeouts for network issues
- Don't try to save corrupted state - start fresh
The documentation makes CDC sound reliable. It's not. Plan accordingly. For comprehensive CDC guidance, read Debezium operations guide, Martin Kleppmann's data consistency article, distributed systems failure modes, Netflix's CDC architecture, and Uber's real-time data infrastructure.