What Kafka Connect Actually Is (And Why It'll Drive You Nuts)

Kafka Connect is supposed to solve the nightmare of writing custom ETL code that breaks every time someone sneezes on a database. The promise is simple: drop in a pre-built connector, configure some JSON, and watch your data flow magically between systems.

Reality check: you'll spend your first week figuring out why your perfectly valid JSON config gets rejected with cryptic error messages like WorkerSinkTaskThreadException: Task failed with WorkerSinkTaskThreadException.
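
For reference, here's roughly what that "simple JSON config" looks like when you register a connector through the REST API. This is a hedged sketch: the connector name, topic, and connection details are placeholders, and the JDBC sink options shown are just the common ones, not a complete config.

```bash
# Register a JDBC sink connector via the Connect REST API (default port 8083).
# Hostnames, credentials, and names below are illustrative.
curl -s -X POST http://localhost:8083/connectors \
  -H "Content-Type: application/json" \
  -d '{
    "name": "orders-jdbc-sink",
    "config": {
      "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
      "tasks.max": "2",
      "topics": "orders",
      "connection.url": "jdbc:postgresql://db.example.com:5432/warehouse",
      "connection.user": "connect_user",
      "connection.password": "change-me",
      "auto.create": "true",
      "insert.mode": "upsert",
      "pk.mode": "record_key"
    }
  }'
```

A 201 response means the config was accepted, not that data will flow - keep reading.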

Kafka Connect Architecture

Kafka Connect Worker Distribution

The Three-Headed Monster (Core Components)

Connector Model: Connectors are supposed to "define the integration" but what they actually do is hide the complexity until something breaks. Source connectors pull data from your database (when they feel like it), while sink connectors push data to destinations (and fail silently when the schema doesn't match). Each connector comes with 47 configuration options, of which exactly 3 are documented properly.

Worker Model: The distributed worker model sounds great until you realize it needs its own Kafka topics for coordination. So to connect to Kafka, you need... more Kafka. Workers "automatically coordinate" except when they don't, leading to split-brain scenarios where everyone thinks they're the leader. I learned this the hard way during a holiday morning rebalancing storm that took down our entire pipeline.

Data Model: Everything flows through Kafka as structured data with schemas. Except when it doesn't. The Schema Registry integration works beautifully until you need to evolve a schema, at which point your connectors start throwing SerializationException errors and you're debugging schema compatibility rules during your lunch break.
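
The serialization wiring lives in the converter settings. A minimal sketch, assuming the Confluent Avro converter and a Schema Registry at a placeholder URL; the worker config file path is also an assumption.

```bash
# Converter settings that tie Connect to Schema Registry (Avro). These keys normally
# live in the distributed worker properties file.
cat >> /etc/kafka/connect-distributed.properties <<'EOF'
key.converter=io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url=http://schema-registry.example.com:8081
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://schema-registry.example.com:8081
# Plain JSON without embedded schemas, if you skip the registry entirely:
# value.converter=org.apache.kafka.connect.json.JsonConverter
# value.converter.schemas.enable=false
EOF
```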

What's New in Kafka 4.1.0 (And What Still Sucks)

The Kafka 4.1.0 release finally fixed some long-standing pain points:

  • Enhanced Metrics Registration: KIP-877 lets you register custom metrics, which is great because the default metrics tell you everything except what you actually need to know. Now you can finally track why your connector keeps failing without parsing through 50GB of logs.

  • Multiple Connector Versions: KIP-891 allows running different versions of the same connector simultaneously. This exists because upgrading connectors in production is basically playing Russian roulette - one wrong version bump and your entire data pipeline stops working. Now you can test the new version while keeping the old one running. Genius.

Kafka Connect distributed mode workers

The Reality of "Reliable" Data Integration

Connect promises to "address enterprise integration challenges" but what it really does is move your problems from custom code to configuration hell. Instead of debugging Java exceptions, you're now debugging JSON configs that look perfectly fine but somehow break the entire cluster.

The distributed architecture is supposed to eliminate manual coordination, but you'll spend hours manually restarting failed tasks and wondering why the leader election keeps flip-flopping every 30 seconds. The "automatic fault recovery" works great until a connector gets stuck in FAILED state and refuses to restart without manual intervention.

Pro tip: Always run connectors with debug logging enabled from day one. When things break (and they will), the error messages are about as helpful as a chocolate teapot. You'll need every log line you can get when you're trying to figure out why your JDBC connector suddenly stopped writing data but still shows as RUNNING.
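
Connect workers (2.4+) expose a loggers endpoint on the REST API, so you can usually crank up logging without a restart. A sketch, assuming the default port and a couple of common runtime logger names; narrow the logger down once you know which connector is misbehaving.

```bash
# Bump a specific Connect runtime logger to DEBUG at runtime.
curl -s -X PUT http://localhost:8083/admin/loggers/org.apache.kafka.connect.runtime.WorkerSinkTask \
  -H "Content-Type: application/json" \
  -d '{"level": "DEBUG"}'

# See what log levels are currently in effect:
curl -s http://localhost:8083/admin/loggers
```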

Version-specific gotcha: Confluent Platform 7.2.0 has a nasty bug where JDBC sink connectors leak connections when the target database has case-sensitive table names. Upgrade to 7.2.1 or prepare to restart your database every few days when the connection pool gets exhausted.

Time estimate for your first production connector: 30 minutes if the demo gods smile upon you, probably 3-4 hours if you're normal, maybe 3 days (or longer, who knows) if you need custom serialization or have to deal with schema evolution bullshit.


Source vs Sink Connectors Comparison

| Aspect | Source Connectors | Sink Connectors |
| --- | --- | --- |
| Data Flow Direction | External System → Kafka Topics | Kafka Topics → External System |
| Primary Purpose | Ingest data from external sources | Export data to external destinations |
| Common Sources/Sinks | Databases, Log Files, Message Queues, APIs | Data Warehouses, Cloud Storage, Analytics Platforms |
| Offset Management | Track position in source system (DB transaction logs, file positions) | Track Kafka topic offsets for reliable delivery |
| Failure Recovery | Resume from last processed source position | Replay from last committed Kafka offset |
| Schema Evolution | Handle source schema changes, evolve topic schemas | Adapt to topic schema changes, update destination |
| Partitioning Strategy | Determine how to partition data across Kafka topics | Consume from topic partitions, write to destination |
| Popular Examples | Debezium CDC, JDBC Source | S3 Sink, Elasticsearch Sink |
| Latency Characteristics | Near real-time (milliseconds to seconds) | Configurable batching (seconds to minutes) |
| Data Transformation | Minimal - focus on faithful data capture | Format conversion for destination requirements |
| Monitoring Focus | Source system health, ingestion lag | Delivery success rate, destination system health |
| Scaling Considerations | Limited by source system capabilities | Limited by destination system write capacity |
| Configuration Complexity | Source connection details, polling intervals | Destination credentials, formatting options |
| Use Case Examples | Change Data Capture, Log Aggregation, IoT Data Ingestion | Data Warehousing, Search Indexing, Real-time Analytics |

Architecture Porn vs. Production Reality

Now that you understand what Connect promises versus what it delivers, let's dive into the architectural choices that seemed like good ideas to someone who never had to debug them during Sunday brunch.

Kafka Connect's architecture looks brilliant on paper - distributed workers, automatic coordination, fault tolerance. In practice, it's a complex beast that will teach you new ways to hate distributed systems.

Kafka Connect cluster architecture

The Worker Coordination Nightmare

The distributed worker model sounds amazing until you realize it's basically a mini distributed system running on top of your already complex Kafka cluster. Workers are "stateless" except for all the state they maintain in memory that gets lost when they restart.

Leader election - the source of many 3am pages. The leader worker is supposed to handle:

  • Distributing configs (except when network partitions cause split-brain scenarios)
  • Monitoring worker health (with a 30-second delay that makes failures feel eternal)
  • Managing task lifecycle (and getting stuck when tasks refuse to stop cleanly)
  • Coordinating rebalancing (which triggers more often than a smoke detector with a low battery)

Reality check: I've seen leader elections flip-flop every couple minutes because of minor network hiccups (like someone rebooting a switch during lunch), causing connector tasks to restart continuously. The "separation of concerns" becomes "separation anxiety" when you're debugging why the cluster thinks it has 3 leaders simultaneously, or sometimes no leader at all.

Pro tip: Set worker.sync.timeout.ms to something reasonable like 10000ms instead of the default 3000ms. Your sanity will thank you when workers stop dropping out during minor GC pauses.
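
A sketch of what that looks like in the distributed worker properties, with the related group-coordination timeouts thrown in. The values are the article's suggestion plus hedged starting points (not defaults), and the file path is an assumption.

```bash
# Worker-level coordination timeouts; tune against your actual GC pauses and network.
cat >> /etc/kafka/connect-distributed.properties <<'EOF'
worker.sync.timeout.ms=10000
# Give workers more slack before they're declared dead and a rebalance starts:
session.timeout.ms=30000
heartbeat.interval.ms=5000
rebalance.timeout.ms=90000
EOF
```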

Connector vs Task: The Hierarchy of Pain

The two-level hierarchy sounds elegant until you're debugging why half your tasks are FAILED and the other half are stuck in RUNNING but doing absolutely nothing.

Connector Level: The connector class is supposed to intelligently partition work. In reality, it's where you'll encounter gems like:

  • Database connectors that create one task per table, except when the table has a weird name that breaks the SQL generation
  • File connectors that crash when they encounter a directory symlink
  • Custom connectors that work perfectly in dev but explode when they hit production data volumes

Task Level: Where the actual work happens, and where everything goes to shit. Source tasks poll external systems every few seconds and sometimes just... stop polling. No error, no exception, just silence. Sink tasks consume from Kafka and write to destinations, except when the destination is unavailable for 0.3 seconds and the task decides to give up forever.

Kafka Connect task failure states

Kafka Connect Task Management

The "lightweight and stateless" myth: Tasks maintain connection pools, offset information, and internal state that gets lost when they restart. When a task fails and restarts, you're rolling the dice on whether it picks up where it left off or starts duplicating data.

Offset Management: The Source of Most 3AM Pages

Connect's "robust fault tolerance" through offset management is like a safety net made of tissue paper. It works great until you actually need it.

The framework stores metadata in three dedicated Kafka topics (yes, more Kafka dependencies):

connect-configs: Where connector configs live and occasionally get corrupted. I've seen configs disappear entirely during cluster restarts, leading to the delightful experience of manually reconfiguring 47 connectors from backup JSON files. Cluster-wide consistency is more like "eventual consistency, if you're lucky."

connect-offsets: The supposed source of truth for connector progress. Source connectors store their position in external systems here, while sink connectors track Kafka offsets. Sounds great until:

  • Offset corruption causes connectors to reprocess weeks of data
  • Schema changes break offset deserialization, forcing manual offset resets
  • The offset topic gets compacted aggressively and you lose tracking data
  • Exactly-once delivery works "when properly configured" (spoiler: it's never properly configured on the first try)

connect-status: Contains worker and task status that's about as reliable as a weather forecast. Tasks show as RUNNING while doing nothing, or show as FAILED when they're actually working fine. The "automatic failure recovery" usually means tasks get stuck in permanent FAILED state until you manually restart them.

Nuclear option: When offset corruption hits (and it will), you'll need to manually reset offsets using kafka-console-consumer.sh and pray you don't lose data or create duplicates. Keep those backup scripts handy.
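
Before going nuclear, at least look at what's in the offsets topic; recent Kafka versions (roughly 3.5/3.6+) also expose REST endpoints for reading and resetting connector offsets. A sketch with the default topic name and ports and a placeholder connector name.

```bash
# Inspect stored offsets (keys identify the connector, values are the positions).
kafka-console-consumer.sh --bootstrap-server localhost:9092 \
  --topic connect-offsets --from-beginning \
  --property print.key=true

# Newer Connect versions: stop the connector, then read or wipe its offsets via REST.
curl -s -X PUT http://localhost:8083/connectors/my-connector/stop
curl -s http://localhost:8083/connectors/my-connector/offsets
curl -s -X DELETE http://localhost:8083/connectors/my-connector/offsets   # full reset
```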

Schema Evolution: Where Data Goes to Die

The data model "abstracts away serialization concerns" until schema evolution hits and everything explodes:

  • Connect Data Types: Supports primitives, arrays, maps, and nested structures. Works beautifully until you add a new field and discover your connector doesn't handle schema changes gracefully.
  • Schema Registry Integration: Confluent Schema Registry provides "automatic schema evolution" that automatically breaks your pipeline when schemas change. Forward compatibility, backward compatibility, full compatibility - pick one, because you can't have all three.
  • Converter Hell: JSON converters lose type information, Avro converters are strict about schemas, Protobuf converters work great until someone changes a field from optional to required. Custom serialization formats? Good luck debugging those on a Tuesday afternoon when you just want to go home.

Schema evolution compatibility matrix

Pro tip: Always test schema changes in a staging environment that actually mirrors production. That test environment with 10 records? It won't catch the schema compatibility issues that surface when you have millions of records with slightly different schemas from 6 months of gradual evolution.
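
One cheap safeguard: ask Schema Registry whether the new schema is even compatible before any connector sees it. A sketch using the registry's compatibility endpoint; the subject name, registry URL, and the toy Avro schema are all placeholders.

```bash
SCHEMA_REGISTRY=http://schema-registry.example.com:8081
SUBJECT=orders-value

# What compatibility mode is this subject actually running?
curl -s "$SCHEMA_REGISTRY/config/$SUBJECT"

# Would the new schema register cleanly against the latest version?
curl -s -X POST "$SCHEMA_REGISTRY/compatibility/subjects/$SUBJECT/versions/latest" \
  -H "Content-Type: application/vnd.schemaregistry.v1+json" \
  -d '{"schema": "{\"type\":\"record\",\"name\":\"Order\",\"fields\":[{\"name\":\"id\",\"type\":\"long\"},{\"name\":\"discount\",\"type\":[\"null\",\"double\"],\"default\":null}]}"}'
```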

REST API: When Declarative Configuration Meets Reality

The Connect REST API is great for demos and terrible for production operations. Key pain points:

  • Lifecycle management: POST /connectors works great until you hit resource limits and the connector gets stuck in FAILED state with no useful error message
  • Status monitoring: GET /connectors/{name}/status returns optimistic status that doesn't match reality. Task status lags by 30+ seconds, so your monitoring thinks everything is fine while data stops flowing
  • Configuration updates: PUT /connectors/{name}/config is supposed to update configs seamlessly but often requires a full restart to take effect
  • Error handling: API errors are about as descriptive as "something went wrong" - you'll be digging through worker logs to find the actual problem

Reality check: You'll end up writing wrapper scripts around the REST API because the raw endpoints don't handle edge cases like "what if the connector is stuck and won't respond to stop requests."
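
A typical wrapper looks something like this: don't trust the HTTP response, poll the status endpoint until the connector actually reports RUNNING. Connector name, port, and timeout are illustrative, and it assumes jq is installed.

```bash
# Wait (up to ~2.5 minutes) for a connector to really reach RUNNING.
NAME=my-connector
for i in $(seq 1 30); do
  STATE=$(curl -s "http://localhost:8083/connectors/$NAME/status" | jq -r '.connector.state')
  echo "attempt $i: connector state = $STATE"
  [ "$STATE" = "RUNNING" ] && exit 0
  sleep 5
done
echo "gave up waiting for $NAME to reach RUNNING" >&2
exit 1
```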

Platform gotcha: On RHEL/CentOS systems (especially 8.x), the Connect worker process sometimes hangs when using systemd with default service limits. Set LimitNOFILE=65536 in your systemd unit file or watch your connectors mysteriously fail after exactly 1024 tasks. Found this out the hard way on a Friday afternoon.
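
A sketch of the systemd override, assuming the Confluent packaging's confluent-kafka-connect unit name; adjust to whatever unit your install actually uses.

```bash
# Raise the file-descriptor limit for a systemd-managed Connect worker.
sudo mkdir -p /etc/systemd/system/confluent-kafka-connect.service.d
sudo tee /etc/systemd/system/confluent-kafka-connect.service.d/limits.conf <<'EOF'
[Service]
LimitNOFILE=65536
EOF
sudo systemctl daemon-reload
sudo systemctl restart confluent-kafka-connect
```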


Real-World Use Cases (And Where Things Go Wrong)

So you've survived the architectural overview and still think Connect might work for you? Let's look at how real companies with real budgets and real deadlines actually use this thing in production.

Companies that actually use Kafka Connect in production will tell you a different story than the marketing materials. These are the battle-tested scenarios where Connect either saves your ass or ruins your weekend.

Kafka Connect data flow diagrams

Change Data Capture: The Database Stalker

CDC is where Connect supposedly shines - watching your database for changes and streaming them to Kafka. In practice, it's where you learn that "real-time" is more like "eventually-time."

Netflix runs Kafka Connect with Debezium connectors to capture database changes. Sounds smooth until you hit the reality:

  • Database locks from CDC queries slow down your OLTP workload
  • Connector falls behind during high write periods and never catches up
  • Schema changes break the connector and require manual intervention
  • "Near-zero latency" becomes "anywhere from 20 minutes to an hour behind during peak hours, maybe longer if something weird happens"

Netflix probably has a team of 20 engineers just for Kafka Connect. You probably don't.

The financial services nightmare: JPMorgan Chase processes millions of transactions through Connect pipelines. The "exactly-once delivery semantics" sounds great until you discover:

  • Connector restarts cause duplicate transactions that break downstream calculations
  • Offset corruption leads to missed transactions that regulators find in audits
  • Schema evolution breaks risk calculation systems during market volatility

Cloud Data Lakes: Where Money Goes to Die

"Modern data architecture" is code for "let's dump everything into S3 and hope the data scientists can make sense of it later." Spotify streams user activity to Google Cloud Storage, which works great until:

  • The S3 Sink Connector "automatically handles partitioning" by creating 50,000 tiny files that cost more to list than they're worth
  • Data arrives out of order because Kafka partitioning doesn't match your time-based partitioning scheme
  • JSON serialization bloats your storage costs by 3x compared to Parquet
  • "Efficient storage" becomes "paying Amazon something like $10k/month (maybe more) for mostly empty directories"

Tesla's telemetry pipeline: Tesla streams vehicle data through Connect. Millions of data points sounds impressive until you realize:

  • Network partitions cause cars to buffer telemetry data and flood the system when reconnected
  • Schema changes break the ingestion pipeline right when you need to deploy an OTA update
  • "Real-time predictive maintenance" becomes "we'll tell you your battery is dying after it's already dead"

Microservices: The Distributed Debugging Nightmare

Event-driven microservices sound amazing in architecture reviews. In production, they're how you turn a simple user profile update into a 6-service debugging session that lasts until 4am.

LinkedIn (where Kafka was born) uses Connect to sync profile changes across dozens of services. "Consistent updates" in theory, "eventual consistency with occasional data loss" in practice:

  • Profile updates trigger 47 downstream events that sometimes arrive out of order
  • Service outages cause event backlogs that replay old data over current state
  • Schema mismatches between services cause silent data corruption
  • "Real-time updates" become "updates that show up eventually, maybe"

E-commerce reality check: Walmart streams inventory changes between systems. Sounds great until Black Friday when:

  • High volume causes Connect workers to lag behind, showing items as in-stock when they're sold out
  • The JDBC Sink Connector handles "complex mapping" by silently dropping fields that don't fit the legacy schema
  • Database locks from sink operations slow down the POS system during peak traffic
  • "Accurate stock levels" become "mostly accurate, give or take a few hundred units (or more during Black Friday chaos)"

"Real-Time" Analytics (AKA Eventually-Time Analytics)

Stream processing with Connect sounds sophisticated until you realize "real-time" means "when the system feels like it."

The New York Times streams article events to Elasticsearch. "Low-latency pipeline" becomes high-anxiety debugging when:

  • Elasticsearch connector creates index mapping conflicts that break ingestion
  • Search indexing falls behind during traffic spikes from viral articles
  • "Minutes of publication" becomes "an hour later if you're lucky"
  • Schema changes break the analytics pipeline right when everyone's watching CNN

Gaming industry pain: Unity streams player behavior for "real-time game balancing." The "high-throughput event streams with ordering guarantees" works until:

  • Player events arrive out of order due to network issues, skewing analytics
  • Connector lag causes A/B tests to run on stale data
  • "Personalized content delivery" becomes "showing players ads for games they already own"

Analytics pipeline monitoring

Kafka Connect Pipeline Architecture

DevOps Monitoring: Watching the Watchers Fail

Connect for observability platforms is how you discover that monitoring your monitoring system requires another monitoring system.

Confluent's infrastructure uses Connect to aggregate logs from thousands of clusters. "Proactive issue detection" works great until:

  • High log volume causes connector memory leaks that crash the monitoring system
  • Log parsing failures cause silent data loss in your alerting pipeline
  • The monitoring system goes down right when you need it most

The Elasticsearch Sink Connector provides "efficient indexing" until:

  • Field mapping explosions crash Elasticsearch with too many unique fields
  • "Automatic schema detection" creates conflicts that block all log ingestion
  • You discover your logs are 6 hours behind when troubleshooting a production outage

Pro tip: Always have a backup monitoring system that doesn't depend on Kafka Connect. When your primary monitoring fails (and it will), you'll need something to tell you why.


Frequently Asked Questions (And Real Answers)

Q

What's the difference between Kafka Connect and custom Kafka clients?

A

Connect promises "standardized framework with built-in features" but what you get is configuration hell and mysterious failures. Custom clients require more code but when they break, you can actually debug them. Connect offers "distributed coordination" that constantly rebalances for no apparent reason and a REST API that returns optimistic status while your data pipeline quietly dies. Reality: Custom clients take maybe 2-3 weeks to write properly but you understand them. Connect takes 30 minutes to configure and 3 months to understand why it randomly stops working for no fucking reason.

Q

Can Kafka Connect handle schema evolution automatically?

A

"Automatic schema evolution" is marketing speak for "schema changes that break your pipeline in creative new ways." When integrated with Schema Registry, Connect supposedly adapts to schema changes automatically. What actually happens:

  • Source connectors capture schema changes and immediately crash with `SerializationException: Unknown magic byte!`
  • Sink connectors handle "backward and forward compatibility" by silently dropping fields that don't match
  • "Seamless evolution" becomes "2am debugging session figuring out why half your data disappeared"

Debug tip: Always test schema changes in staging with the exact same connector versions. Compatibility rules work differently in Connect 2.8.1 vs 3.4.0.

Q

How does Kafka Connect achieve exactly-once delivery?

A

Spoiler: it doesn't, reliably.

Connect claims "exactly-once semantics through careful offset management" but what you get is "mostly-once with occasional duplicates and rare data loss." The theory: Source connectors store offsets after producing records, sink connectors commit offsets after writing to destinations. Kafka transactions provide end-to-end guarantees. The reality:

  • Connector restarts between producing records and committing offsets cause duplicates
  • Transactional features require enable.idempotence=true and isolation.level=read_committed which nobody configures correctly
  • Offset corruption leads to reprocessing weeks of data or skipping records entirely
  • "End-to-end exactly-once" works until your sink system is down for 30 seconds and the connector gives up Debug command:

Check if your sink connector is actually committing offsets: bash kafka-console-consumer --bootstrap-server localhost:9092 --topic connect-offsets --from-beginning Kafka Connect Data Flow

Q

What happens when a Kafka Connect worker fails?

A

"Automatic redistribution" is optimistic.

When workers fail, you get to experience the joy of distributed systems coordination failing in real-time. What's supposed to happen: Heartbeat mechanisms detect failures, leader reassigns work to healthy workers, state preserved in Kafka topics. What actually happens:

  • Worker failures trigger rebalancing storms that take down the entire cluster
  • Leader election gets confused and you end up with 3 leaders or no leader
  • "Preserved state" gets corrupted and tasks restart from the beginning of time
  • "Seamless recovery" means 20 minutes of downtime while workers fight over who's in charge Debug tip: Check worker logs for `Worker

Coordinatormessages. If you see constant rebalancing, increaseworker.sync.timeout.msto 10000ms andworker.unsync.timeout.ms` to 6000ms.

Q

How do I choose between standalone and distributed mode?

A

Simple rule: use standalone mode unless you enjoy debugging distributed system failures at 3am.

Standalone mode stores config in local files and actually works. Distributed mode stores config in Kafka topics and introduces failure modes you never knew existed.

Use standalone when:

  • You want to sleep through the night
  • You have 1-3 connectors that don't need HA
  • You value simplicity over "scalability"

Use distributed when:

  • Your manager insists on "production-grade distributed architecture"
  • You need 10+ connectors and can afford a dedicated Connect ops team
  • You enjoy explaining to stakeholders why the data pipeline is down because of "leader election issues"

Reality check: I've seen teams spend 6+ months trying to make distributed mode stable when standalone would have solved their problem in like 1 day, maybe 2 if they hit some weird edge case.

Q

Why does my connector show RUNNING but no data flows?

A

Welcome to the most frustrating Kafka Connect bug.

Your connector claims it's RUNNING but hasn't moved data in hours. Common causes:

  • Source connector polling interval is too high (poll.interval.ms=5000 means 5-second delays)
  • Sink connector is blocked by the destination system but doesn't report the error
  • Schema incompatibility causes silent failures in data conversion
  • Connector is waiting for more records to hit the flush.size threshold
  • Task is stuck in an infinite retry loop with exponential backoff

Debug steps (sketched as a script below):

  1. Check task-level status: GET /connectors/{name}/tasks/{id}/status
  2. Look for errors in worker logs: grep "WorkerSinkTask\|WorkerSourceTask" connect.log
  3. Check if offsets are advancing: monitor the connect-offsets topic
  4. Restart the specific task: POST /connectors/{name}/tasks/{id}/restart

Nuclear option: delete and recreate the connector. Yes, really.
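
Here's that checklist as a rough script; connector name, task id, and log path are placeholders.

```bash
NAME=my-connector
TASK=0

# 1. Task-level status (connector-level status hides failing tasks)
curl -s "http://localhost:8083/connectors/$NAME/tasks/$TASK/status" | jq .

# 2. Worker-side errors for this task
grep -E "WorkerSinkTask|WorkerSourceTask" /var/log/kafka/connect.log | tail -n 50

# 3. Are offsets actually advancing? (exits after 10s of silence)
kafka-console-consumer.sh --bootstrap-server localhost:9092 \
  --topic connect-offsets --property print.key=true --timeout-ms 10000

# 4. Kick the stuck task
curl -s -X POST "http://localhost:8083/connectors/$NAME/tasks/$TASK/restart"
```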

Q

Can Kafka Connect transform data during transit?

A

Connect is basically useless for anything complex.

You get some basic Single Message Transforms (SMTs) that work fine for trivial stuff like adding headers or renaming fields, but anything real requires proper stream processing.

What SMTs can do: add timestamps, filter out fields, change data types, route to different topics (example below).

What SMTs can't do: complex joins, aggregations, windowing, or basically anything useful.

If you need real transformations, bite the bullet and use Kafka Streams or ksqlDB. Don't try to hack complex logic into SMTs - that way lies madness and debugging sessions that last until sunrise.
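
For completeness, here's what "trivial stuff" looks like in practice: a hedged example using two stock SMTs (InsertField and RegexRouter) on a placeholder JDBC sink config. Names, URLs, and the topic prefix are illustrative.

```bash
# Tag each record with an ingestion timestamp and prefix the target topic/table name.
curl -s -X PUT http://localhost:8083/connectors/my-connector/config \
  -H "Content-Type: application/json" \
  -d '{
    "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
    "topics": "orders",
    "connection.url": "jdbc:postgresql://db.example.com:5432/warehouse",
    "transforms": "addTs,route",
    "transforms.addTs.type": "org.apache.kafka.connect.transforms.InsertField$Value",
    "transforms.addTs.timestamp.field": "ingested_at",
    "transforms.route.type": "org.apache.kafka.connect.transforms.RegexRouter",
    "transforms.route.regex": "(.*)",
    "transforms.route.replacement": "warehouse_$1"
  }'
```
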
Q

How does Kafka Connect handle backpressure from slow sinks?

A

"Backpressure handling" is a fancy way of saying "everything grinds to a halt when your destination is slow." Connect will reduce polling from Kafka topics and eventually pause consumption when buffers fill up. What happens in practice:

  • Sink connector falls behind because Elasticsearch is choking on your JSON blobs
  • Connect buffers pile up until you hit buffer.memory=33554432 (the 32MB default)
  • The framework pauses consumption and your real-time pipeline becomes eventually-time
  • You spend Friday night tuning flush.size, linger.ms, and batch.size trying to find the magic combination (a tuning sketch follows below)

Pro tip: Don't rely on Connect's backpressure. Design your sink systems to actually handle the load, or use a proper stream processor that can drop data intelligently instead of just stopping everything.
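
The tuning sketch mentioned above, using the Confluent Elasticsearch sink's batching knobs as an example. Other sinks spell these differently, the values are starting points rather than recommendations, and consumer.override.* only works if the worker allows client config overrides.

```bash
curl -s -X PUT http://localhost:8083/connectors/logs-es-sink/config \
  -H "Content-Type: application/json" \
  -d '{
    "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
    "topics": "app-logs",
    "connection.url": "http://elasticsearch.example.com:9200",
    "batch.size": "500",
    "max.buffered.records": "10000",
    "flush.timeout.ms": "30000",
    "consumer.override.max.poll.records": "250"
  }'
```
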
Q

What's the performance overhead of using Kafka Connect?

A

Connect adds roughly 15-20% overhead compared to a well-written custom client (maybe more depending on your connector), plus whatever latency your connector adds on top.

The framework does a lot of reflection and JSON parsing that custom code can avoid. Real performance factors:

  • Connector quality: a JDBC connector with bad SQL can kill your database
  • Serialization overhead: JSON converters are slow, Avro is better but Schema Registry adds latency
  • Worker coordination: distributed mode spends time on rebalancing that could be spent processing data
  • SMT processing: each transform adds CPU overhead and potential bottlenecks

Reality check: If you're pushing millions of records per second, write custom clients. If you're processing thousands per second and value operational simplicity over raw speed, Connect is probably fine.

Q

How do I monitor Kafka Connect in production?

A

Connect monitoring is like watching a black box that occasionally lights up when things are already broken.

The JMX metrics tell you everything except what you actually need to know. Essential metrics that actually matter:

  • connector-failed-task-count: how many tasks are dead (not just "degraded")
  • sink-record-lag: how far behind your sink connectors are
  • source-record-poll-rate: if this drops to zero, your source is stuck
  • Task-level error counts: connector-level metrics hide which specific task is failing

Monitoring reality: the REST API status is optimistic bullshit. A connector can show as RUNNING while doing absolutely nothing for hours. Set up actual data validation - count records going in vs records coming out, because Connect won't tell you when it's silently losing data.

Tools that don't suck:

  • Prometheus JMX exporter for metrics
  • Custom health checks that verify data is actually flowing (a minimal version is sketched below)
  • Log aggregation, because when Connect breaks the answers are buried in worker logs
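
The "custom health check" can be embarrassingly small. A minimal sketch that counts FAILED tasks across every connector, assuming jq and the default REST port.

```bash
# Exit non-zero if any connector has a FAILED task; wire this into your alerting.
FAILED=0
for c in $(curl -s http://localhost:8083/connectors | jq -r '.[]'); do
  n=$(curl -s "http://localhost:8083/connectors/$c/status" \
        | jq '[.tasks[] | select(.state == "FAILED")] | length')
  [ "$n" -gt 0 ] && echo "$c has $n FAILED task(s)" && FAILED=1
done
exit $FAILED
```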

Q

How do I debug "Task failed with WorkerSinkTaskThreadException"?

A

This error message is about as helpful as "something went wrong somewhere." It's Connect's way of saying "a task died but I won't tell you why."

What it means: a sink task crashed and Connect caught the exception but lost the actual error details.

Debug steps:

  1. Check worker logs for the full stack trace before the exception (or pull it from the REST status, as sketched below)
  2. Look for schema compatibility errors: grep "SerializationException\|DeserializationException" *.log
  3. Check if the destination system is rejecting writes: database locks, permission errors, etc.
  4. Verify your converter configuration matches the data format
  5. Check if you hit resource limits: memory, disk space, connection pools

Common root causes:

  • Schema Registry is down but the error gets swallowed
  • Destination database has connectivity issues but the task doesn't report it properly
  • Memory leak in the connector causes OOM but only shows a generic exception

Fix: Restart the task and watch logs closely during startup. The real error usually appears in the first few attempts.
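
As noted in step 1, the full stack trace usually sits in the REST status's trace field even when the log only shows the generic exception. A one-liner sketch with a placeholder connector name.

```bash
# Print the stack trace for every FAILED task of a connector.
curl -s http://localhost:8083/connectors/my-connector/status \
  | jq -r '.tasks[] | select(.state == "FAILED") | .trace'
```
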
Q

Can I run multiple versions of the same connector?

A

KIP-891 in Kafka 4.1.0 lets you run multiple connector versions simultaneously. This exists because upgrading connectors in production is basically Russian roulette.

Why you need this: connector version 10.6.1 works great, version 10.6.2 has a memory leak that crashes your cluster. Now you can test the new version while keeping the old one running instead of rolling back at 2am when everything breaks.

Reality check: This feature solves a problem that shouldn't exist. If connectors were properly tested and backward compatible, we wouldn't need to run multiple versions.

Q

Why did my connector stop working after a JVM restart?

A

JVM restarts expose all the hidden state that "stateless" connectors actually maintain.

When the JVM comes back up, your connectors often start from completely wrong positions. Common issues after restart:

  • Connector picks up from the beginning of the source data instead of the last position
  • Sink connector reprocesses all Kafka topics from the start
  • Connection pools are reset and the connector can't reconnect to external systems
  • In-memory state about table schemas or API pagination is lost

Debug steps:

  1. Check if offset data is corrupted: kafka-console-consumer --topic connect-offsets
  2. Verify external system connectivity: can workers reach the databases/APIs?
  3. Look for classloader issues: ClassNotFoundException in worker logs
  4. Check if the connector plugin directory changed or became unreadable

Prevention: Always test connector restarts in staging. What works fine for weeks will break mysteriously after the first restart.

War story: Had a Debezium MySQL connector (version 1.9.2, I think) that worked perfectly for about 3 months, then died immediately after a planned server restart. Turns out the connector was relying on a specific MySQL binlog position that got reset during the MySQL service restart. Lost maybe 6 hours of CDC data and had to rebuild downstream aggregations from scratch. Took the whole damn weekend.

Q

How do I fix connector tasks stuck in FAILED state?

A

Tasks get stuck in FAILED state and refuse to restart automatically.

The "automatic failure recovery" only works in marketing materials. Manual restart commands:

  • Restart a single task: curl -X POST http://localhost:8083/connectors/my-connector/tasks/0/restart
  • Restart all tasks: curl -X POST "http://localhost:8083/connectors/my-connector/restart?includeTasks=true" (an automated version is sketched below)
  • Nuclear option: delete and recreate the connector

Why tasks get stuck:

  • Connector hit a non-retryable exception and gave up
  • Task consumed all retry attempts and entered permanent failure mode
  • Configuration error prevents the task from starting but doesn't get reported properly
  • Resource exhaustion (memory, file handles) that persists after restart

Monitoring tip: Set up alerts on task status. Don't rely on connector-level status - it lies.
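
If you'd rather not hand-restart tasks at 3am, the restart endpoint's onlyFailed flag (Kafka 3.0+, KIP-745) makes this cron-able. A sketch, assuming jq and the default port.

```bash
# Restart only the failed tasks of every connector; safe to run on a schedule.
for c in $(curl -s http://localhost:8083/connectors | jq -r '.[]'); do
  curl -s -X POST \
    "http://localhost:8083/connectors/$c/restart?includeTasks=true&onlyFailed=true" \
    > /dev/null
done
```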
