Look, I've been there. You've got fifteen microservices running in production, and when something breaks, you're frantically SSHing into different servers trying to piece together what happened from scattered log files. That's caveman shit.
The Data Flow: Raw application logs → Beats/Logstash (collection & parsing) → Elasticsearch (indexing & storage) → Kibana (visualization & search). Each component has a specific job, and when one breaks, the whole chain fails.
Here's what these components do (and how they break)
Elasticsearch: It's a distributed search engine, not a database, no matter what your architect says. Yeah, it can store data, but it'll corrupt itself if you look at it wrong. I've seen clusters go red because someone sneezed too hard. When it works, you can search through millions of log entries in milliseconds. When it doesn't, you'll be up at 3am figuring out why your heap exploded.
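To make the "millions of log entries in milliseconds" point concrete, here's roughly what a log search looks like against the REST API. This is a minimal sketch in Python with requests; the `logs-*` index pattern, the `level`/`message` field names, and the unauthenticated localhost:9200 address are assumptions, so adjust for your cluster:

```python
import requests

# Find ERROR-level entries mentioning "timeout" in the last 15 minutes.
# Assumes indices matching logs-* and that "level" is mapped as a keyword.
query = {
    "query": {
        "bool": {
            "must": [{"match": {"message": "timeout"}}],
            "filter": [
                {"term": {"level": "ERROR"}},
                {"range": {"@timestamp": {"gte": "now-15m"}}},
            ],
        }
    },
    "size": 20,
    "sort": [{"@timestamp": "desc"}],
}

resp = requests.post("http://localhost:9200/logs-*/_search", json=query, timeout=10)
resp.raise_for_status()
body = resp.json()
print(f"{body['hits']['total']['value']} matches in {body['took']}ms")
for hit in body["hits"]["hits"]:
    print(hit["_source"].get("message"))
```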
Logstash: This thing eats CPU like it's going out of style. It takes your logs and transforms them into something useful, but the pipeline config is its own Ruby-flavored DSL, with logstash.yml and pipelines.yml adding a layer of YAML hell on top. One wrong brace or indent and nothing works. I spent 3 hours debugging a pipeline once because of a fucking space. The DSL will make you question your life choices.
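If you've never written a grok filter, this is essentially the job Logstash does, shown here in plain Python so you can see the transformation without the DSL. The log format below is a made-up example:

```python
import json
import re

# A raw, unstructured line like your app probably writes today (hypothetical format).
raw = "2024-05-14 09:31:02,481 ERROR [payment-service] Order 8812 failed: card declined"

# The regex plays the role of a grok pattern: pull out the fields you want to query on.
pattern = re.compile(
    r"(?P<timestamp>\S+ \S+) (?P<level>\w+) \[(?P<service>[^\]]+)\] (?P<message>.*)"
)

match = pattern.match(raw)
if match:
    event = match.groupdict()
    # Logstash would also add metadata (host, tags) and ship this to Elasticsearch.
    print(json.dumps(event, indent=2))
else:
    # This is the part that costs you 3 hours: one stray space and nothing matches.
    print("grok parse failure")
```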
Kibana: Beautiful dashboards that randomly forget your work. I've lost count of how many times I've built the perfect dashboard only to have Kibana shit the bed during a deployment and lose everything. Pro tip: export your dashboards religiously. Trust me on this one - you'll thank me later when your perfect monitoring setup disappears.
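"Export your dashboards religiously" is scriptable. Kibana exposes a saved objects export API that returns NDJSON you can re-import later; here's a rough sketch assuming Kibana on localhost:5601 with no auth (add credentials as needed):

```python
from datetime import date

import requests

# The kbn-xsrf header is required on Kibana API calls.
resp = requests.post(
    "http://localhost:5601/api/saved_objects/_export",
    headers={"kbn-xsrf": "true"},
    json={"type": ["dashboard"], "includeReferencesDeep": True},
    timeout=30,
)
resp.raise_for_status()

backup = f"kibana-dashboards-{date.today()}.ndjson"
with open(backup, "wb") as f:
    f.write(resp.content)
print(f"Saved {backup}")
```

Stick that in a cron job or your CI pipeline and the next "Kibana forgot everything" incident becomes a five-minute restore.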
How to Actually Deploy This Thing
There are basically three ways to do this, and two of them will make you hate your life:
Direct Integration: Your app talks directly to Elasticsearch. Sounds simple, right? Wrong. When Elasticsearch goes down (and it will), your app starts throwing exceptions and your logs disappear into the void. Use this only if you hate yourself or you're doing a quick prototype.
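Here's that failure mode in miniature: a sketch of an app indexing its own log events straight into Elasticsearch (the index name is hypothetical). Note what happens in the except branch when the cluster is down:

```python
from datetime import datetime, timezone

import requests

def log_event(level, message):
    doc = {
        "@timestamp": datetime.now(timezone.utc).isoformat(),
        "level": level,
        "message": message,
    }
    try:
        # Direct integration: the app itself talks to the cluster.
        requests.post(
            "http://localhost:9200/logs-myapp/_doc", json=doc, timeout=2
        ).raise_for_status()
    except requests.exceptions.RequestException:
        # Elasticsearch is down or slow: the event is gone, and this
        # error handling starts leaking into your business code.
        pass

log_event("ERROR", "payment gateway timeout")
```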
Buffered Pipeline: Logs go through Logstash or Kafka first. This actually works most of the time, but now you have more moving parts to break. When traffic spikes, Logstash will fall over and take your monitoring with it. Good luck tuning that Java heap.
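With a buffer in between, the app only has to reach the collector. A sketch assuming a Logstash http input plugin listening on port 8080 (the hostname, port, and field names are assumptions; with Kafka the idea is the same, just with a producer as the front door):

```python
from datetime import datetime, timezone

import requests

# The app ships to Logstash (or a Kafka producer) instead of Elasticsearch.
# If the buffer is down you can queue locally or drop, but Elasticsearch
# outages no longer surface as exceptions inside your request handlers.
event = {
    "@timestamp": datetime.now(timezone.utc).isoformat(),
    "level": "WARN",
    "message": "slow query: 2.4s",
    "service": "checkout",
}

requests.post("http://logstash.internal:8080", json=event, timeout=2)
```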
Sidecar Pattern: In Kubernetes, you run Filebeat next to your app container. This is the least shitty option because when your app crashes, the logs still get collected. Plus, you get pod metadata for free, which is actually useful when debugging. Just remember that each sidecar eats about 200MB of RAM.
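The app's half of the sidecar pattern is trivially simple, which is the whole point: write structured JSON to stdout and let the Filebeat sidecar do the shipping. A minimal sketch (the field names are just a convention, not anything Filebeat requires):

```python
import json
import sys
from datetime import datetime, timezone

def log(level, message, **fields):
    # One JSON object per line on stdout; the Filebeat sidecar tails the
    # container log, adds pod metadata, and forwards it. The app never knows
    # whether Elasticsearch is up, down, or on fire.
    record = {
        "@timestamp": datetime.now(timezone.utc).isoformat(),
        "level": level,
        "message": message,
        **fields,
    }
    print(json.dumps(record), file=sys.stdout, flush=True)

log("INFO", "order created", order_id=8812, service="checkout")
```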
What Actually Breaks in Production
Cluster Health States: Green (all primaries and replicas assigned), Yellow (all primaries up but some replicas unassigned - still functional, but you're one node failure from trouble), Red (at least one primary shard unassigned - data is missing or unavailable). When you see Red, your weekend is fucked.
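Those states come straight from the cluster health API, so you can page yourself before a customer does. A quick sketch against localhost (authentication omitted):

```python
import requests

resp = requests.get("http://localhost:9200/_cluster/health", timeout=5)
resp.raise_for_status()
health = resp.json()

# status is "green", "yellow", or "red"; the shard counters tell you how bad it is.
print(health["status"], "-", health["unassigned_shards"], "unassigned shards")

if health["status"] == "red":
    # At least one primary is unassigned: searches on those indices fail
    # and writes to them are rejected or lost. Go cancel your weekend plans.
    ...
```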
Memory Issues: Give Elasticsearch too little heap and garbage collection grinds everything to a halt; give it too much and you starve the filesystem cache Lucene depends on, until the OS OOM killer takes the whole node out. The magic number is 50% of system RAM, kept well under 32GB so compressed object pointers stay enabled, but good luck figuring out the optimal heap size before your cluster melts down. I learned this the hard way when our production cluster died during Black Friday.
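You can at least watch the heap before it melts. The nodes stats API reports per-node JVM heap usage; a sketch:

```python
import requests

resp = requests.get("http://localhost:9200/_nodes/stats/jvm", timeout=5)
resp.raise_for_status()

for node_id, node in resp.json()["nodes"].items():
    heap_pct = node["jvm"]["mem"]["heap_used_percent"]
    # Sustained readings above ~85% usually mean GC is already struggling;
    # a sawtooth that never drops back down is the warning sign.
    print(f"{node['name']}: heap {heap_pct}%")
```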
Disk Space: This is always the problem. Your indices grow faster than you expect, and suddenly Elasticsearch hits the flood-stage disk watermark (95% by default) and flips your indices to read-only. Set up ILM policies or suffer. I've been called at 2am because someone forgot to configure log rotation.
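"Set up ILM policies or suffer" looks like this in practice: a sketch of a policy that rolls hot indices over and deletes old ones. The policy name and thresholds are examples, so tune them to your retention requirements:

```python
import requests

policy = {
    "policy": {
        "phases": {
            "hot": {
                "actions": {
                    # Start a new index once the current one gets big or old.
                    "rollover": {"max_primary_shard_size": "50gb", "max_age": "7d"}
                }
            },
            "delete": {
                # Drop indices 30 days after rollover so the disk never fills.
                "min_age": "30d",
                "actions": {"delete": {}},
            },
        }
    }
}

resp = requests.put(
    "http://localhost:9200/_ilm/policy/logs-30d-retention", json=policy, timeout=10
)
resp.raise_for_status()
print(resp.json())
```

The policy does nothing until your index template references it via index.lifecycle.name, so wire that up too.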
Version Hell: Don't even think about upgrading versions without testing everything. Version 8.1.3 has a memory leak in the ingest pipeline processor; skip it. I've seen entire clusters become unusable because someone upgraded Elasticsearch and the index mappings broke. Test your upgrades on a copy of prod data, not just the happy path demo data.
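Before an upgrade, the deprecation info API lists settings and mappings the next major version will reject. It won't catch a memory leak, but it catches the boring breakage. A sketch (response shape may vary slightly by version):

```python
import requests

# Lists cluster, node, and index-level deprecations that will break on upgrade.
resp = requests.get("http://localhost:9200/_migration/deprecations", timeout=10)
resp.raise_for_status()
report = resp.json()

for issue in report.get("cluster_settings", []) + report.get("node_settings", []):
    print(f"[{issue['level']}] {issue['message']}")

for index, issues in report.get("index_settings", {}).items():
    for issue in issues:
        print(f"[{issue['level']}] {index}: {issue['message']}")
```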
Network Partitions: When your cluster splits, you'll get split-brain and lose data. Configure your master nodes properly or watch everything burn. Run an odd number of dedicated master-eligible nodes (three is the usual answer). If you're stuck on 6.x, that means setting minimum_master_nodes to (total_masters / 2) + 1; from 7.x onward that setting is gone and the cluster handles quorum itself, so your job is to get cluster.initial_master_nodes right at first bootstrap and never run an even number of masters.
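You can sanity-check the master-eligible node count from the cat nodes API; an "m" in the role string means master-eligible, and you want an odd count. A quick sketch:

```python
import requests

resp = requests.get(
    "http://localhost:9200/_cat/nodes?format=json&h=name,node.role,master", timeout=5
)
resp.raise_for_status()
nodes = resp.json()

masters = [n for n in nodes if "m" in n["node.role"]]
print(f"{len(masters)} master-eligible nodes: {[n['name'] for n in masters]}")

if len(masters) % 2 == 0:
    # An even count buys you nothing except a better chance of losing quorum;
    # stick to 3 (or 5 if you really must).
    print("WARNING: even number of master-eligible nodes")
```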