Why Event-Driven Systems Are Both Amazing and Terrible

Most tutorials show you how to send a "Hello World" message and call it event-driven architecture. That's like showing someone console.log() and claiming they know JavaScript.

I've spent 4 years dealing with event systems that process anywhere from 10K to 500K messages per second. Here's what actually happens when you build this stuff for real.

Martin Fowler's guide to event-driven architecture actually covers the real-world stuff, unlike most academic bullshit.

The Reality Check: What Event-Driven Actually Means

Event-driven architecture (EDA) means your services communicate by publishing events about what happened and consuming events to react to changes. But here's what separates production systems from playground demos:

Stuff that actually matters when you're getting paged at 3am:

  • Messages don't just vanish when a container restarts (way harder than it sounds, learned this the hard way)
  • Consumer lag doesn't spiral out of control during traffic spikes (hit 6+ hours once, was not fun)
  • Dead letter queues exist and someone actually fucking monitors them
  • Schema changes don't break every downstream service (one field removal broke 8 services once)
  • You can replay events without the entire system shitting itself
  • Circuit breakers actually work instead of making everything worse (ours all opened at once during one incident, still don't know why)

Why Traditional Request-Response Breaks at Scale

Synchronous REST APIs between microservices create a house of cards. When your payment service calls the inventory service, which calls the shipping service, which calls the notification service, you've created a distributed monolith with single points of failure everywhere.

The synchronous failure cascade:

Order API → Payment Service → Inventory Service → Shipping Service
     ↓              ↓              ↓              ↓
  200ms         timeout         503 error      service down

Result: A failed inventory check brings down your entire order flow. We learned this the hard way when a single Redis timeout cascaded through 8 services and killed our entire checkout process for 45 minutes during peak traffic.
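
To make the cascade concrete, here's a minimal sketch of that same chain as sequential awaits (hypothetical service clients, not our actual code) - one slow dependency and the whole request fails:

```typescript
// Hypothetical service clients - stand-ins for whatever HTTP SDKs you use
interface ServiceClient {
  call(path: string, payload: unknown): Promise<unknown>;
}

declare const paymentService: ServiceClient;
declare const inventoryService: ServiceClient;
declare const shippingService: ServiceClient;

// Every await is a point of failure; the chain is only as strong as its weakest link
async function placeOrder(order: { id: string; items: string[] }): Promise<void> {
  await paymentService.call('/charge', order);     // ~200ms on a good day
  await inventoryService.call('/reserve', order);  // a timeout here throws...
  await shippingService.call('/schedule', order);  // ...so this never runs
  // The caller sees a 5xx even though the payment may already have gone through.
}
```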

Event Patterns That Actually Work (And The Ones That Don't)

[Diagram: event-driven microservices communication]

Here's what actually works when shit hits the fan. There are 3 patterns that work, maybe 4 if you count the weird hybrid approach we tried.

1. Event Notification Pattern

Services emit lightweight events about state changes. Other services decide what to do with those notifications.

{
  "eventType": "OrderPlaced",
  "orderId": "ord_12345",
  "customerId": "cust_67890", 
  "timestamp": "2025-09-09T14:30:00Z",
  "amount": 299.99
  // TODO: add correlation ID, this shit is hard to debug without it
}

When to use: Real-time notifications, analytics pipelines, audit logging.
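
A minimal consumer sketch for this pattern (the handler class and the orderApi client are assumptions) - the thin event is enough for most consumers, and anyone who needs more calls back to the owning service:

```typescript
// Hypothetical consumer: the notification is lightweight, so each consumer
// decides for itself whether it needs to fetch full details.
interface OrderPlacedNotification {
  eventType: 'OrderPlaced';
  orderId: string;
  customerId: string;
  timestamp: string;
  amount: number;
}

declare function recordMetric(name: string, value: number): void;
declare function flagForReview(order: unknown): Promise<void>;

class AnalyticsConsumer {
  constructor(private orderApi: { getOrder(id: string): Promise<unknown> }) {}

  async onOrderPlaced(event: OrderPlacedNotification): Promise<void> {
    // Cheap path: most of the time the fields already in the event are enough
    recordMetric('orders_placed_total', 1);

    // Expensive path: only fetch the full order when this consumer actually needs it
    if (event.amount > 1000) {
      const fullOrder = await this.orderApi.getOrder(event.orderId);
      await flagForReview(fullOrder);
    }
  }
}
```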

2. Event-Carried State Transfer

Events contain complete state information, reducing the need for additional API calls.

{
  "eventType": "CustomerProfileUpdated",
  "customerId": "cust_67890",
  "profile": {
    "name": "Jane Smith",
    "email": "jane@example.com",
    "tier": "premium",
    "preferences": {...}
  },
  "version": 5,
  "timestamp": "2025-09-09T14:30:00Z"
}

When to use: When downstream services need complete context, reducing API chattiness.
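
Here's a rough sketch of a consumer keeping its own read model from these events (the class and cache shape are assumptions, not production code) - no callback to the customer service needed:

```typescript
// Hypothetical consumer maintaining a local replica of customer profiles
interface CustomerProfileUpdated {
  eventType: 'CustomerProfileUpdated';
  customerId: string;
  profile: { name: string; email: string; tier: string };
  version: number;
}

class ShippingProfileCache {
  private profiles = new Map<string, { version: number; profile: CustomerProfileUpdated['profile'] }>();

  handleCustomerProfileUpdated(event: CustomerProfileUpdated): void {
    const cached = this.profiles.get(event.customerId);
    // Drop stale or out-of-order updates using the version carried in the event
    if (cached && cached.version >= event.version) return;

    this.profiles.set(event.customerId, {
      version: event.version,
      profile: event.profile,
    });
  }
}
```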

3. Event Sourcing Pattern

Store events as the source of truth, derive current state by replaying events.

// Event store
const events = [
  { type: "AccountCreated", amount: 0, timestamp: "2025-01-01" },
  { type: "MoneyDeposited", amount: 1000, timestamp: "2025-01-02" },
  { type: "MoneyWithdrawn", amount: 200, timestamp: "2025-01-03" }
];

// Current balance = replay all events
const currentBalance = events.reduce((balance, event) => {
  switch(event.type) {
    case "MoneyDeposited": return balance + event.amount;
    case "MoneyWithdrawn": return balance - event.amount;
    default: return balance;
  }
}, 0); // Result: 800

When to use: Auditing requirements, complex business logic, need for temporal queries.
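
Since temporal queries are the main selling point, here's a small sketch extending the balance example above - replay only the events up to a cutoff (the balanceAt helper is an assumption, not a library call):

```typescript
// Temporal query sketch on the same event list: replay events up to a cutoff
// to answer "what was the balance on that date?"
type AccountEvent = {
  type: 'AccountCreated' | 'MoneyDeposited' | 'MoneyWithdrawn';
  amount: number;
  timestamp: string;
};

function balanceAt(events: AccountEvent[], cutoff: string): number {
  return events
    .filter(e => e.timestamp <= cutoff) // ISO-style dates compare correctly as strings
    .reduce((balance, e) => {
      switch (e.type) {
        case 'MoneyDeposited': return balance + e.amount;
        case 'MoneyWithdrawn': return balance - e.amount;
        default: return balance;
      }
    }, 0);
}

// balanceAt(events, '2025-01-02') === 1000; balanceAt(events, '2025-01-03') === 800
```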

Message Broker Selection: The 2025 Landscape

[Diagram: Kafka architecture]

After 4 years of running different brokers in production, here's what I've learned:

Kafka: The 800-pound gorilla that everyone uses. Handles serious throughput but you'll need someone who actually understands JVM tuning (good luck finding them). Expect to spend weekends debugging consumer rebalancing bullshit. Kafka 3.6.0 broke our consumers with a COORDINATOR_NOT_AVAILABLE error that took 8 hours to fix. I fucking hate managing Kafka, but it works.

NATS: Way simpler to operate than Kafka, which is refreshing. Good performance, smaller memory footprint. The catch? Way smaller ecosystem and fewer tools when shit breaks at 2am.

Pulsar: Tried this for 2 weeks, went back to Kafka. It's like Kafka with better multi-tenancy but more complex to operate. The community is tiny compared to Kafka's.


Redis Streams: Perfect for simple use cases and prototyping. Just don't expect strong durability out of the box - our Redis crashed once and we lost 3 hours of events.

How to Not Fuck This Up: Implementation Strategy

Don't be the team that implements event sourcing, CQRS, and saga patterns all at once and then spends 6 months debugging why nothing works. I've seen this pattern kill three different projects.

Start small or you'll hate your life. I began with user registration events because they're simple and won't take down payments if I fucked up. Get monitoring set up first - you'll need it when things break at 2am. Learn to hate consumer lag before you scale, trust me on this.

Then add a few more event types once you're not completely terrified. We added order events and payment events next. Watch your dead letter queues fill up with garbage you didn't expect. Implement schema versioning before you need it - learned this the hard way when one field change broke 15 services.

Finally, move to the advanced stuff only when you're comfortable with the basics. Event sourcing for the domain that actually needs it (billing for us). Saga patterns - prepare for debugging hell, I'm not kidding. CQRS if you really need read/write separation (spoiler: you probably don't).

Most teams skip the simple stuff and wonder why their event system is unreliable. Don't be those teams - I was one of those teams.

Next up: the actual implementation details and all the ways it breaks in production.

Message Broker Comparison: Production Reality Check (2025)

| Feature | Apache Kafka | NATS JetStream | Apache Pulsar | Redis Streams |
|---|---|---|---|---|
| Throughput | Usually 200K-300K msg/sec in our setup, YMMV | Pretty fast, maybe 200K+ (we hit 300K once) | Haven't tested much, heard it's good | Depends on Redis mood, maybe 100K |
| Latency | Under 20ms when things aren't broken | 2-5ms (actually impressive) | No idea, probably fine | Fast as hell when memory isn't full |
| Durability | Rock solid | Solid | Solid | Eh, good enough |
| Ops Complexity | Prepare for weekend debugging | Actually manageable | Heard it's a nightmare | Works until it doesn't |
| Multi-tenancy | Hack it with partitions | Actually built-in | Native (finally!) | Hack it with prefixes |
| Geo-replication | MirrorMaker (it's fine) | Built-in (nice) | Native (actually works) | Enterprise only ($$$) |
| Schema Evolution | Schema Registry (more moving parts) | Basic (but works) | Built-in (comprehensive) | You're on your own |
| Message Ordering | Per-partition (mostly works) | Per-stream (seems reliable) | Per-partition (can break) | Per-stream (if you configure it right) |
| Exactly-once | Marketing bullshit (see FAQ) | Also marketing bullshit | Also marketing bullshit | Nope, at-least-once only |
| Storage Model | Log-based (proven) | Log-based (simpler) | Tiered (complex but powerful) | Memory + disk (fast but fragile) |
| Query Capabilities | KSQL (powerful but heavy) | Stream processors (basic) | Pulsar Functions (decent) | Redis modules (limited) |
| Cloud Support | Confluent, MSK ($$) | NGS Cloud (reasonable) | StreamNative ($$$) | Redis Cloud (cheapest) |
| License | Apache 2.0 | Apache 2.0 | Apache 2.0 | BSD through 7.2 (newer Redis is no longer BSD) |
| Best For | Data pipelines that matter | Microservices that need to work | Multi-tenant SaaS platforms | Fast prototypes and caching |
| Avoid When | You don't need the complexity | You need heavy analytics | You have a small team | You need guaranteed durability |
| Monthly Cost | Around $2K-15K+ (and that's just AWS) | $500-5K (not too bad) | $3K-20K+ (enterprise pricing) | $200-3K (unless you really scale) |
| Team Size Needed | 2-5 platform engineers (who hate weekends) | 1-2 engineers (manageable) | 3-8 platform engineers (good luck hiring) | 1-2 engineers (until it breaks spectacularly) |

Implementation Patterns That Don't Break at 3AM

[Diagram: microservices event flow]

Theory is nice, but production systems require patterns that actually work when everything's on fire. Here's what I've learned after being paged at 2am way too many times.

The Microservices.io patterns catalog is actually useful, unlike most documentation.

Event Schema Design That Won't Destroy Your Weekend

[Diagram: event-driven e-commerce architecture]

Bad event schemas will make your life miserable. I've seen teams spend weeks fixing downstream consumers because someone removed a field. Here's how to not be those teams.

Schema Versioning Strategy

Use semantic versioning for events with backward compatibility - this is one of the few things that actually works:

{
  "schemaVersion": "2.1.0",
  "eventId": "evt_789012",
  "eventType": "OrderPlaced",
  "timestamp": "2025-09-09T14:30:00Z",
  "data": {
    "orderId": "ord_12345",
    "customerId": "cust_67890",
    "items": [
      {
        "productId": "prod_111",
        "quantity": 2,
        "price": 49.99
      }
    ],
    "totalAmount": 99.98,
    "currency": "USD",
    // New field in v2.1 - optional for compatibility
    "paymentMethod": "credit_card",
    "shippingAddress": {
      "street": "123 Main St",
      "city": "San Francisco",
      "state": "CA",
      "zipCode": "94105"
    }
  }
}

Rules that'll save your ass (learned the hard way) - there's a consumer-side sketch right after this list:

  • Never remove fields - just mark them deprecated and leave them there forever
  • New fields better be optional with defaults that won't break old consumers
  • Version your schemas or you'll be debugging schema mismatches at 2am
  • Always include eventId and timestamp - you'll need them when tracing why orders disappeared
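
Here's roughly what those rules look like on the consumer side (a minimal sketch; the normalizer function and default values are assumptions, reusing fields from the OrderPlaced example above):

```typescript
// Hypothetical normalizer enforcing the rules above: anything added after v1
// is optional on the wire and defaulted here, so old events keep working.
interface OrderPlacedData {
  orderId: string;
  totalAmount: number;
  currency: string;
  paymentMethod?: string; // added in 2.1 - must stay optional forever
}

function normalizeOrderPlaced(data: OrderPlacedData) {
  return {
    orderId: data.orderId,
    totalAmount: data.totalAmount,
    currency: data.currency,
    // Events published before 2.1 won't carry this field; default instead of crashing
    paymentMethod: data.paymentMethod ?? 'unknown',
  };
}
```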

Event Envelope Pattern

Wrap business events in a standard envelope for consistent handling:

interface EventEnvelope<T> {
  metadata: {
    eventId: string;
    eventType: string;
    schemaVersion: string;
    timestamp: string;
    correlationId: string;
    causationId?: string;
    source: string;
  };
  data: T;
}

This pattern enables cross-cutting concerns like tracing, auditing, and replay without touching business logic.
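
A quick sketch of what that looks like at the publish site (the wrapEvent helper and the broker client are assumptions, not a specific library API):

```typescript
import { randomUUID } from 'node:crypto';

// Assumed broker client - swap in whatever producer you actually use
declare const broker: { publish(topic: string, payload: string): Promise<void> };

function wrapEvent<T>(eventType: string, data: T, correlationId: string): EventEnvelope<T> {
  return {
    metadata: {
      eventId: randomUUID(),
      eventType,
      schemaVersion: '1.0.0',
      timestamp: new Date().toISOString(),
      correlationId,
      source: 'order-service',
    },
    data,
  };
}

// Every publisher goes through the same wrapper, so tracing and replay tooling
// can count on the metadata block being there.
async function publishOrderPlaced(orderId: string, correlationId: string): Promise<void> {
  const envelope = wrapEvent('OrderPlaced', { orderId }, correlationId);
  await broker.publish(envelope.metadata.eventType, JSON.stringify(envelope));
}
```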

Event Publishing: When Everything Goes Wrong

Event Publishing Pattern

Publishing events reliably requires handling network failures, broker downtime, and database transaction consistency.

Outbox Pattern Implementation

The outbox pattern ensures events are published even if the message broker is temporarily unavailable:

// In the same database transaction - this took us 3 weeks to get right
class OrderService {
  async createOrder(orderData: CreateOrderRequest): Promise<void> {
    await this.db.transaction(async (tx) => {
      // 1. Save business data first
      const order = await tx.orders.create(orderData);
      
      // 2. Save event to outbox table (same transaction - critical!)
      await tx.outbox.create({
        eventId: generateEventId(), // TODO: handle ID collision edge case, this broke once in staging
        eventType: 'OrderPlaced',
        aggregateId: order.id,
        eventData: JSON.stringify({
          orderId: order.id,
          customerId: order.customerId,
          totalAmount: order.totalAmount
          // TODO: add more fields without breaking consumers (last time this took out 3 services)
        }),
        createdAt: new Date(),
        published: false // track publishing status
      });
    });
    
    // 3. Separate outbox publisher handles this async
    // TODO: what happens if this never gets published? still figuring this out
  }
}

// Separate outbox publisher process
class OutboxPublisher {
  async processEvents(): Promise<void> {
    const events = await this.db.outbox.findUnprocessed();
    
    for (const event of events) {
      try {
        await this.messagebroker.publish(event.eventType, event.eventData);
        await this.db.outbox.markProcessed(event.id);
      } catch (error) {
        // Retry logic handles temporary failures
        await this.handlePublishFailure(event, error);
      }
    }
  }
}

This pattern gives you at-least-once publishing even during broker outages or database failures - not exactly-once, despite what the marketing says (see the FAQ). It also answers the obvious failure mode: rows are only marked processed after a successful publish, so if the publisher crashes halfway through a batch, the unpublished rows just get picked up on the next pass. Consumers see duplicates rather than gaps, which is exactly why the idempotency section below matters. More on outbox patterns and event sourcing.

Idempotency and Deduplication

Events may be delivered multiple times. Design consumers to handle duplicates gracefully:

class InventoryService {
  async handleOrderPlaced(event: OrderPlacedEvent): Promise<void> {
    // Use event ID for idempotency
    const alreadyProcessed = await this.processedEvents.exists(event.eventId);
    if (alreadyProcessed) {
      console.log(`Event ${event.eventId} already processed, skipping`);
      return;
    }
    
    try {
      // Business logic
      await this.reduceInventory(event.data.items);
      
      // Mark as processed only after successful completion
      await this.processedEvents.add(event.eventId, {
        processedAt: new Date(),
        eventType: event.eventType
      });
    } catch (error) {
      // Don't mark as processed on failure - allow retry
      throw error;
    }
  }
}

Circuit Breakers: When Your Shit Stops Working

Event-driven systems fail in complex ways. Implement intelligent failure handling to prevent cascading failures. Study circuit breaker patterns and resilience patterns.

Circuit Breaker Pattern

class CircuitBreaker {
  private state: 'CLOSED' | 'OPEN' | 'HALF_OPEN' = 'CLOSED';
  private failureCount = 0;
  private lastFailureTime?: Date;
  
  constructor(
    private threshold: number = 5,
    private timeout: number = 60000 // 1 minute
  ) {}
  
  async execute<T>(operation: () => Promise<T>): Promise<T> {
    if (this.state === 'OPEN') {
      if (this.shouldAttemptReset()) {
        this.state = 'HALF_OPEN';
      } else {
        throw new Error('Circuit breaker is OPEN');
      }
    }
    
    try {
      const result = await operation();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }
  
  private onSuccess(): void {
    this.failureCount = 0;
    this.state = 'CLOSED';
  }
  
  private onFailure(): void {
    this.failureCount++;
    this.lastFailureTime = new Date();
    
    if (this.failureCount >= this.threshold) {
      this.state = 'OPEN';
    }
  }
  
  // Decide when to let a trial call through after the cooldown window
  private shouldAttemptReset(): boolean {
    return this.lastFailureTime !== undefined &&
      Date.now() - this.lastFailureTime.getTime() >= this.timeout;
  }
}

// Usage in event consumer
class PaymentService {
  private circuitBreaker = new CircuitBreaker(5, 60000);
  
  async handleOrderPlaced(event: OrderPlacedEvent): Promise<void> {
    try {
      await this.circuitBreaker.execute(async () => {
        await this.externalPaymentProvider.processPayment(event.data);
      });
    } catch (error) {
      // Circuit breaker is open - use fallback or queue for later
      await this.queueForRetry(event);
    }
  }
}

Exponential Backoff Retry

class RetryHandler {
  async executeWithRetry<T>(
    operation: () => Promise<T>,
    maxRetries: number = 3,
    baseDelay: number = 1000
  ): Promise<T> {
    let lastError!: Error;
    
    for (let attempt = 0; attempt <= maxRetries; attempt++) {
      try {
        return await operation();
      } catch (error) {
        lastError = error as Error;
        
        if (attempt === maxRetries) {
          break; // No more retries
        }
        
        // Exponential backoff with jitter
        const delay = baseDelay * Math.pow(2, attempt) + Math.random() * 1000;
        await this.sleep(delay);
      }
    }
    
    throw lastError;
  }
  
  private sleep(ms: number): Promise<void> {
    return new Promise(resolve => setTimeout(resolve, ms));
  }
}
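
Usage is straightforward - a hedged sketch of wrapping a flaky publish call in the handler above (the broker client is an assumption):

```typescript
// Usage sketch: wrap a flaky broker publish with the retry handler above.
declare const broker: { publish(topic: string, payload: string): Promise<void> };

const retry = new RetryHandler();

async function publishWithRetry(topic: string, payload: object): Promise<void> {
  await retry.executeWithRetry(
    () => broker.publish(topic, JSON.stringify(payload)),
    3,    // maxRetries
    1000  // baseDelay (ms) -> waits of roughly 1s, 2s, 4s plus jitter
  );
}
```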

Saga Patterns: Distributed Transaction Hell

Complex business workflows spanning multiple services require saga patterns to maintain consistency.

Choreography vs Orchestration - I Still Get These Mixed Up Sometimes

Choreography (no central coordinator - services just react to events):

// Order saga - choreography style (tried this, it was a mess)
class OrderSaga {
  // 1. Order created
  async handleOrderPlaced(event: OrderPlacedEvent): Promise<void> {
    await this.publishEvent('ReserveInventory', {
      orderId: event.data.orderId,
      items: event.data.items
      // TODO: what if this event gets lost? happened once
    });
  }
  
  // 2. Inventory reserved successfully
  async handleInventoryReserved(event: InventoryReservedEvent): Promise<void> {
    await this.publishEvent('ProcessPayment', {
      orderId: event.data.orderId,
      amount: event.data.totalAmount
    });
  }
  
  // 3. Payment failed - compensate (this part always breaks)
  async handlePaymentFailed(event: PaymentFailedEvent): Promise<void> {
    await this.publishEvent('ReleaseInventory', {
      orderId: event.data.orderId,
      items: event.data.items
    });
    
    await this.publishEvent('CancelOrder', {
      orderId: event.data.orderId,
      reason: 'Payment failed'
      // TODO: add correlation ID for debugging, learned this the hard way
    });
  }
}

Orchestration (central coordinator - easier to debug when shit breaks):

class OrderOrchestrator {
  async processOrder(orderId: string): Promise<void> {
    const saga = await this.createSaga(orderId);
    
    try {
      // Step 1: Reserve inventory
      await this.executeStep(saga, 'ReserveInventory');
      
      // Step 2: Process payment  
      await this.executeStep(saga, 'ProcessPayment');
      
      // Step 3: Arrange shipping
      await this.executeStep(saga, 'ArrangeShipping');
      
      await this.completeSaga(saga);
    } catch (error) {
      // This compensation thing is tricky as hell
      await this.compensateSaga(saga, error);
    }
  }
  
  private async compensateSaga(saga: Saga, error: Error): Promise<void> {
    // Execute compensations in reverse order (copy first - reverse() mutates the saga state)
    const completedSteps = [...saga.completedSteps].reverse();
    
    for (const step of completedSteps) {
      await this.executeCompensation(step);
      // TODO: what if compensation fails? still figuring this out
    }
  }
}

Monitoring: Knowing When You're Fucked

Production event-driven systems require comprehensive monitoring to debug issues across distributed services.

Essential Metrics

// Track key metrics for event-driven systems
class EventMetrics {
  // Event publishing metrics
  @Counter('events_published_total')
  eventsPublished: Counter;
  
  @Histogram('event_processing_duration_seconds')
  processingDuration: Histogram;
  
  @Counter('events_failed_total')
  eventsFailed: Counter;
  
  // Consumer lag metrics
  @Gauge('consumer_lag_messages')
  consumerLag: Gauge;
  
  // Circuit breaker metrics
  @Counter('circuit_breaker_state_changes_total')
  circuitBreakerStateChanges: Counter;
}

Monitor these critical metrics:

  • Event publishing rate and success rate
  • Consumer lag across all topics/streams
  • Dead letter queue depth
  • Circuit breaker state changes
  • End-to-end event processing latency

Distributed Tracing

Use correlation IDs to trace events across service boundaries:

class EventTracing {
  async publishEventWithTrace(eventType: string, eventData: any): Promise<void> {
    const span = tracer.startSpan('publish_event');
    const correlationId = generateCorrelationId();
    
    try {
      await this.messagebroker.publish(eventType, {
        ...eventData,
        metadata: {
          correlationId,
          traceId: span.spanContext().traceId,
          spanId: span.spanContext().spanId
        }
      });
      
      span.setStatus({ code: SpanStatusCode.OK });
    } catch (error) {
      span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
      throw error;
    } finally {
      span.end();
    }
  }
}

These implementation patterns provide the foundation for production-ready event-driven systems. The key is starting with solid basics and adding complexity gradually as your system scales. Next, we'll look at specific tools and their configuration for different production scenarios.

Kafka Tutorial - Spring Boot Microservices by Amigoscode

## Spring Boot Kafka Event-Driven Microservices Tutorial

This is one of the few Java tutorials that doesn't suck. 5 hours sounds brutal but it actually covers production patterns instead of Hello World bullshit. Java Guides shows you event sourcing, CQRS, and saga patterns that actually work.

What you'll learn:
- Setting up Kafka with Spring Boot for production
- Implementing event sourcing patterns with proper error handling
- Building CQRS architectures for read/write separation
- Saga pattern implementation for distributed transactions
- Testing strategies for event-driven systems
- Production deployment considerations

Key timestamps:
- 0:15 - Project setup and Kafka configuration
- 1:30 - Event sourcing implementation with Spring Data
- 2:45 - CQRS pattern with separate read/write models
- 3:50 - Saga pattern for order processing workflow
- 4:20 - Error handling and retry mechanisms

Watch: Spring Boot Kafka Event-Driven Microservices Tutorial

Why this tutorial is valuable: Actually covers the stuff that breaks in production - error handling, retries, circuit breakers. Most tutorials show you the happy path and bail when things get complicated. This one doesn't.


The Stuff That Always Breaks (And Why I Drink)

Q: What's the deal with event ordering?

A: Event ordering will drive you fucking insane. I've tried every approach over 4 years. Global ordering kills performance dead - just don't. I partition by business key (customer ID, order ID, whatever) and let each partition handle its own timeline. Still working out some edge cases:

```typescript
// This is what works in prod, not the textbook version
const partition = hashFunction(event.customerId) % totalPartitions;

await producer.send({
  topic: 'customer-events',
  partition: partition,
  key: event.customerId,
  value: JSON.stringify(event)
});
```

Works great until your product manager decides they need global transaction ordering. That's when you start updating your resume.

Q: How do you handle consumer crashes?

A: Consumer crashes are inevitable. Learned this during a midnight deploy in November 2023 when our payment consumer died with `org.apache.kafka.common.errors.NotEnoughReplicasException` and we lost 3 hours of transactions. If you configured retention right, they'll replay. If not, start practicing your "it wasn't my fault" speech:

  • Consumer groups are mandatory or your instances will step on each other
  • Retention should be 7+ days (learned this when a weekend outage ate everything)
  • Consumer lag monitoring is not optional - it spikes during every deploy
  • Health checks save your ass when Kubernetes kills pods randomly

```yaml
# This config has saved me from getting paged at 3am
group.id: payment-service-group
enable.auto.commit: false      # Because I trust nothing to auto-commit
auto.offset.reset: earliest    # Better safe than sorry
max.poll.interval.ms: 300000   # 5 minutes before Kafka gives up on you
```

Q: What's your approach to testing event-driven systems?

A: Testing is a complete nightmare because everything's async and you can't predict timing. Spent 2 weeks tracking down a race condition that only happened in prod on Tuesdays. I've tried every approach and they all suck in different ways: unit testing with mocks is the only thing that doesn't make me want to quit. Contract testing with Pact prevents the schema disasters I've seen kill three different releases. Integration testing with Testcontainers works but makes your CI pipeline take 45 minutes instead of 5. End-to-end testing is where dreams go to die - only test the stuff that makes money.

```typescript
// This test has caught more production bugs than I care to admit
describe('OrderEventHandler', () => {
  it('should process order placed event correctly', async () => {
    const mockEvent = {
      eventType: 'OrderPlaced',
      data: { orderId: '123', customerId: '456' }
    };

    await orderHandler.handle(mockEvent);

    expect(inventoryService.reduceStock).toHaveBeenCalledWith('123');
    expect(shippingService.scheduleShipment).toHaveBeenCalledWith('123');
  });
});
```

Q: How do you manage schema evolution?

A: Schema evolution is where most teams die a slow, painful death. I've seen one schema change break 15 downstream services because nobody followed the rules (I suspect sloppy schema changes are also behind a lot of "exactly-once" failures, but I'm not 100% sure). Never remove required fields - just mark them deprecated and leave them there forever. New fields have to be optional with defaults that won't break old consumers. Semantic versioning is mandatory or you'll be debugging schema mismatches at 2am.

```json
{
  "schemaVersion": "2.0.0",
  "eventType": "UserRegistered",
  "data": {
    "userId": "123",
    "email": "user@example.com",
    "name": "John Doe",
    // This optional field took me 6 months to convince the team to add
    "preferences": {
      "newsletter": true,
      "notifications": false
    }
  }
}
```

Your consumers better handle missing fields gracefully or they'll crash when you deploy. I learned this when a missing preferences field brought down our entire notification service.

Q: What's the difference between EDA and event sourcing?

A: EDA vs event sourcing confuses everyone, including me sometimes. EDA is about service communication: your order service publishes "OrderPlaced" and your inventory service reacts. Event sourcing is a storage pattern where you store the events as the source of truth instead of just the current state.

You can absolutely do EDA without event sourcing (most teams do). Event sourcing is for when you need complete audit trails, temporal queries like "what was this account balance on March 15th", or complex business rules that benefit from replaying history. It's also way more complex than most teams need. We tried event sourcing for everything once - bad idea.

Q: How do you prevent duplicate processing?

A: Duplicate processing will fuck your data six ways from Sunday. I've seen double charges, double inventory reductions, and double email notifications, all because someone thought "at-least-once delivery" meant "exactly-once processing." You have to implement idempotency at the application level using event IDs. This saved my ass when Kafka decided to replay 50,000 payment events during a cluster rebalance (still not sure exactly why that happened):

```typescript
class PaymentService {
  async processPayment(event: PaymentEvent): Promise<void> {
    // Always check if we've already processed this specific event
    const existing = await this.processedEvents.findById(event.eventId);
    if (existing) {
      console.log('Event already processed, skipping');
      return;
    }

    try {
      // Process payment
      const result = await this.paymentGateway.charge(event.data);

      // Store result with event ID - this is what prevents duplicates
      await this.processedEvents.create({
        eventId: event.eventId,
        result: result,
        processedAt: new Date()
      });
    } catch (error) {
      // Critical: don't mark as processed on failure
      throw error;
    }
  }
}
```

Q: How do you debug distributed events?

A: Debugging distributed events is like trying to find a specific needle in a haystack made of needles. Correlation IDs are the only thing that kept me sane during a 6-hour incident in March where a customer's order got stuck between 4 different services:

```typescript
// Add correlation ID to every goddamn event
const correlationId = generateCorrelationId();

await eventPublisher.publish('OrderPlaced', {
  correlationId,
  orderId: '123',
  customerId: '456'
});

// Log everything with the correlation ID so you can trace the flow
await logger.info('Processing order', {
  correlationId,
  service: 'payment-service',
  action: 'charge-card'
});
```

Jaeger saved my career when I needed to trace why orders were taking 47 seconds to process instead of the expected 2 seconds. Without distributed tracing, you're basically debugging blind.

Q: When should I use synchronous vs asynchronous calls?

A: Synchronous vs async is the eternal question that splits teams in half. Use sync calls for real-time user interactions like login and search - users expect immediate feedback. Use events for business workflows like order fulfillment where eventual consistency is fine.

The anti-pattern I see constantly is teams that think events solve everything. They're wrong. Synchronous calls are simpler and work great for most scenarios. Don't event-source your user login system just because you read about event sourcing on Twitter.

Q: How do you handle poison messages?

A: Poison messages are events that fail processing over and over, clogging up your queues like a bad burrito. I've seen one malformed JSON event bring down an entire consumer group because it kept retrying forever. Dead letter queues with exponential backoff are your only salvation:

```typescript
class EventProcessor {
  async processEvent(event: Event): Promise<void> {
    const maxRetries = 3;
    const currentRetry = event.metadata.retryCount || 0;

    try {
      await this.businessLogic.process(event);
    } catch (error) {
      if (currentRetry < maxRetries) {
        // Exponential backoff prevents thundering herd problems
        const delay = Math.pow(2, currentRetry) * 1000;
        await this.scheduleRetry(event, delay);
      } else {
        // Finally give up and dump it for manual investigation
        await this.deadLetterQueue.send(event, error);
      }
    }
  }
}
```

Monitor your DLQ depth religiously. Anything sitting there means your code is broken or your data is fucked.

Q: Is exactly-once delivery truly possible?

A: Exactly-once delivery is marketing bullshit. I've been through this with three different message brokers - they all give you at-least-once and handwave the rest. Kafka's "exactly-once" semantics work until you have a network partition, then it's just fancy at-least-once with extra steps. No idea why vendors keep promising this when it's physically impossible.

Make your consumers idempotent instead: handle duplicates gracefully, use unique event IDs for deduplication, and only attempt the transactional outbox if you enjoy pain. Real exactly-once delivery requires distributed transactions, which are slow as hell and break in ways that'll make you question your career choices.

```typescript
// Transactional outbox - works but you'll hate maintaining it
await database.transaction(async (tx) => {
  // Business logic
  await tx.orders.create(orderData);

  // Event in same transaction - this ensures atomicity
  await tx.outbox.create({
    eventId: generateUniqueId(),
    eventType: 'OrderCreated',
    eventData: orderData
  });
});
```

Just handle duplicates and move on with your life.

Q: What should I monitor in event-driven systems?

A: You'll find out everything's broken from angry customers calling support unless you monitor the right shit. Consumer lag spikes during every deploy and will ruin your weekend if you don't watch it. Dead letter queue depth should be zero - anything sitting there means something's fucked. Event processing errors mean your code is broken (usually). End-to-end latency matters because users notice when checkout takes 30 seconds instead of 3. Circuit breaker trips tell you a downstream service is having a bad time.

```typescript
// These metrics have saved my ass more times than I can count
@Counter('events_published_total')
eventsPublished: Counter;

@Histogram('event_processing_duration_seconds')
processingDuration: Histogram;

@Gauge('consumer_lag_messages')
consumerLag: Gauge;
```

Alert on consumer lag over 1000 messages (adjust based on your volume), any messages in the DLQ, circuit breaker opens, and error rates over 5%. These thresholds took me 2 years to tune properly and I'm still adjusting them.

Q: How do event-driven systems affect performance and latency?

A: Event-driven systems will make some things slower and other things faster. Each async hop adds 10-100ms of latency, but you can handle way more concurrent users, and one service failing won't kill your entire application. Your checkout might take 500ms instead of 200ms, but you can process 10x more orders without the whole system collapsing. I've seen synchronous systems die under load while the event-driven version just kept chugging along.

Measure end-to-end latency for critical flows and cache aggressively where users expect immediate responses. The trade-offs are usually worth it once you hit real scale.

Resources That Actually Helped Me (Skip the Rest)