ELK Stack for Microservices Logging - AI-Optimized Knowledge Base
Technology Overview
ELK Stack Components: Elasticsearch (distributed search engine), Logstash (log processing pipeline), Kibana (visualization interface), Filebeat/Beats (log collection agents)
Data Flow Architecture: Raw application logs → Beats/Logstash (collection & parsing) → Elasticsearch (indexing & storage) → Kibana (visualization & search)
Critical Failure Point: When any component in the chain fails, the entire logging pipeline breaks
Configuration That Actually Works
Elasticsearch Production Settings
Memory Configuration:
- Heap size: 50% of system RAM, never exceed 30.5GB
- Sweet spot: 26-30GB heap on 64GB machines
- Above 32GB: Loss of compressed ordinary object pointers (compressed oops) optimization
- Performance threshold: above roughly 26GB most JVMs also lose zero-based compressed oops (a smaller penalty than losing compressed oops entirely, but measurable)
ES_JAVA_OPTS: "-Xms8g -Xmx8g"  # min and max must be identical
# NOT "-Xms32g -Xmx32g" on a 64GB machine - that drops compressed oops and stretches GC pauses
Resource Requirements (Production-Tested):
replicas: 3
minimumMasterNodes: 2
resources:
  requests:
    memory: "8Gi"    # Start small, scale up
  limits:
    memory: "16Gi"   # Not 32Gi as documentation suggests
heap:
  max: "8g"          # Never exceed 50% of container memory
Cluster Health States:
- Green: All primary and replica shards allocated
- Yellow: Missing replica shards but functional
- Red: Missing primary shards, data loss imminent
Network Partition Prevention:
- Use odd number of master nodes
- Set minimum_master_nodes to (total_masters / 2) + 1 on 6.x and earlier (7.x+ manages the voting quorum automatically)
- Prevents split-brain syndrome and data loss
Storage Cost Optimization
Storage Strategy:
- Hot data (7-14 days): NVMe SSD - $15K/month for high-performance access
- Warm data (older logs): SATA drives - saves $8K/month compared to all-NVMe setup
- Start with 1TB NVMe for hot indices
Disk Space Management:
- Low watermark: 85% disk usage
- High watermark: 90% disk usage
- Flood stage: 95% - indices go read-only (recovery commands below) and writes fail with:
  cluster_block_exception: blocked by: [FORBIDDEN/12/index read-only / allow delete (api)]
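A recovery sketch once you have freed disk space or added capacity - these are the standard settings, shown with illustrative values:
# Remove the read-only block that flood stage put on the indices
curl -X PUT "localhost:9200/_all/_settings" -H 'Content-Type: application/json' \
  -d '{"index.blocks.read_only_allow_delete": null}'
# Optionally raise the flood-stage watermark temporarily while you clean up (transient, so it resets on restart)
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' \
  -d '{"transient": {"cluster.routing.allocation.disk.watermark.flood_stage": "97%"}}'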
Application Integration Patterns
| Pattern | Production Track Record | Cost / Overhead | Primary Failure Mode |
|---|---|---|---|
| Direct Integration | High failure rate | Low cost | App crashes when ES is down |
| Filebeat + Logstash | Most common in production | Medium cost | Logstash config/heap issues |
| Kafka + ELK | Enterprise grade | High cost | Kafka rebalances / ES heap |
| Sidecar Container | Recommended for Kubernetes | ~200MB RAM per pod | Pod memory limits |
| Agent-Based | Maintenance intensive | Low cost | Agent version conflicts |
Critical Failure Scenarios and Solutions
Elasticsearch Cluster Red Status
Root Cause Priority:
- Disk space exhaustion (95%+ usage)
- Memory issues (OOM heap space)
- Unassigned shards
- Network partitions
Diagnostic Commands:
# Always check disk space first
curl -X GET "localhost:9200/_cat/allocation?v&h=shards,disk.indices,disk.used,disk.avail,disk.percent,host"
# Cluster health details
curl -X GET "localhost:9200/_cluster/health?pretty"
# Find specific shard problems
curl -X GET "localhost:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason"
Memory-Related Failures
OutOfMemoryError Symptoms:
- Error: `java.lang.OutOfMemoryError: Java heap space`
- Cluster immediately goes red
- Excessive GC pauses (>10% CPU time)
Prevention:
- Monitor GC time with `jstat -gc [PID]` (see the sketch below)
- Keep GC time under 10% of CPU time
- Use identical min/max heap settings
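A rough way to check GC time without extra tooling: sample the GCT column (total GC seconds) from `jstat -gcutil` twice and compare against wall-clock time. The PID lookup assumes the Elasticsearch process shows up as "Elasticsearch" in jps - adjust to your environment:
PID=$(jps | awk '/Elasticsearch/ {print $1}')
GCT_BEFORE=$(jstat -gcutil "$PID" | awk 'NR==2 {print $NF}')   # total GC seconds so far
sleep 60
GCT_AFTER=$(jstat -gcutil "$PID" | awk 'NR==2 {print $NF}')
awk -v a="$GCT_AFTER" -v b="$GCT_BEFORE" 'BEGIN {printf "GC time in last 60s: %.1fs (>6s means >10%% - act on it)\n", a-b}'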
Logstash Pipeline Failures
Common Failure Indicators:
- Pipeline throughput drops to 0 events/second
- Input events: 0 (source connection broken)
- Output events: 0 (Elasticsearch down/rejecting documents)
Diagnostic API:
curl localhost:9600/_node/stats/pipelines?pretty
Configuration Pitfalls:
# BREAKS EVERYTHING - the conditional never opens a brace, so the braces don't balance
filter {
  if [field] == "value"   # missing "{" here
    mutate { add_tag => "broken" }
  }
}

# WORKS - proper Logstash config syntax, braces balanced
filter {
  if [field] == "value" {
    mutate { add_tag => ["working"] }
  }
}
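Logstash can catch exactly this kind of brace problem before you restart it; the config path below is illustrative:
bin/logstash --config.test_and_exit -f /etc/logstash/conf.d/pipeline.conf
# Exits non-zero if the config does not parse - it says nothing about whether the filters do what you want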
Search Performance Issues
Mapping Problems:
- Using `text` fields for exact matching (causes slow queries)
- Missing `keyword` fields for aggregations
- Incorrect field types for time-based queries
Optimal Field Mapping:
{
"properties": {
"service_name": { "type": "keyword" }, # Exact matches
"message": { "type": "text" }, # Full-text search
"timestamp": { "type": "date" } # Time range queries
}
}
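To make sure every new daily index picks up these types, bake them into an index template rather than relying on dynamic mapping. The template name and the logs-* pattern are illustrative:
curl -X PUT "localhost:9200/_index_template/logs-template" -H 'Content-Type: application/json' -d '
{
  "index_patterns": ["logs-*"],
  "template": {
    "mappings": {
      "properties": {
        "service_name": { "type": "keyword" },
        "message":      { "type": "text" },
        "timestamp":    { "type": "date" }
      }
    }
  }
}'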
Resource Requirements and Costs
Real-World Resource Consumption
Elasticsearch Cluster (Production):
- 3-node cluster: 24GB RAM per node minimum
- Storage: 1TB NVMe for hot data + cheaper SATA for cold
- CPU: 8-16 cores per node (I/O bound workload)
- Network: 10Gbps for inter-node communication
Logstash Resource Usage:
- CPU intensive: 4-8 cores minimum
- Memory: 8-16GB heap + system memory
- JVM tuning required for production loads
Filebeat Overhead:
- Sidecar pattern: 200MB RAM per pod
- Minimal CPU impact
- Disk I/O dependent on log volume
Cost Optimization Strategies
Storage Costs (Based on real deployments):
- All-NVMe approach: $15K/month (enterprise workload)
- Hot/warm strategy: $7K/month (roughly half the all-NVMe cost)
- ILM implementation: Additional 30-50% savings
Index Lifecycle Management (ILM):
PUT _ilm/policy/logs-policy
{
  "policy": {
    "phases": {
      "hot":    { "actions": { "rollover": { "max_size": "50gb", "max_age": "7d" } } },
      "delete": { "min_age": "30d", "actions": { "delete": {} } }
    }
  }
}
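The policy does nothing until indices reference it. A sketch for wiring it onto existing indices and checking progress - the logs-* pattern is illustrative, and for new indices you would set index.lifecycle.name in the index template instead:
curl -X PUT "localhost:9200/logs-*/_settings" -H 'Content-Type: application/json' \
  -d '{"index.lifecycle.name": "logs-policy"}'
# Note: the rollover action also needs index.lifecycle.rollover_alias (or a data stream) to work
# Which phase is each index in, and did ILM hit an error?
curl -X GET "localhost:9200/logs-*/_ilm/explain?pretty"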
Version-Specific Critical Issues
Elasticsearch 8.1.3
- Critical Bug: Memory leak in ingest pipeline processor
- Impact: Gradual heap exhaustion leading to OOM
- Solution: Skip version, upgrade to 8.2+
Upgrade Risks
- Index mapping incompatibilities between major versions
- Plugin compatibility breaks
- Performance regression in specific configurations
- Mitigation: Test upgrades on production data copy, not demo data
Production Monitoring Requirements
Essential Alerts
Critical Thresholds (probe sketch after this list):
- Elasticsearch cluster status != green for >2 minutes
- Disk usage >85% (waiting for 90% is too late)
- Logstash pipeline throughput drops by 50%
- JVM heap usage >75%
- Index creation failures (indicates mapping problems)
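A minimal cron-able probe for the first two thresholds. jq is assumed to be installed, and the echo lines stand in for whatever actually pages you:
STATUS=$(curl -s "localhost:9200/_cluster/health" | jq -r .status)
[ "$STATUS" != "green" ] && echo "ALERT: cluster status is $STATUS"
# Per-node disk usage - alert at 85%, before relocation and read-only blocks kick in
curl -s "localhost:9200/_cat/allocation?h=node,disk.percent" | \
  awk '$2+0 > 85 {print "ALERT: " $1 " disk at " $2 "%"}'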
Monitoring Tools
Recommended Stack:
- Cerebro: Better than built-in Elasticsearch monitoring
- ElastAlert2: Reliable alerting without vendor lock-in
- Prometheus + Grafana: Comprehensive metrics collection
Kibana Dashboard Backup:
# Export dashboards before they disappear (the saved objects export API is a POST)
curl -X POST "localhost:5601/api/saved_objects/_export" \
  -H 'Content-Type: application/json' \
  -H 'kbn-xsrf: true' \
  -d '{"type":"dashboard"}' \
  -o export.ndjson
Security Implementation
Minimal Security Configuration
TLS and Authentication:
xpack.security.enabled: true
xpack.security.transport.ssl.enabled: true
xpack.security.http.ssl.enabled: true
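The certificates those ssl settings expect can be generated with the bundled tool - a sketch using its default output file names:
bin/elasticsearch-certutil ca                               # writes elastic-stack-ca.p12
bin/elasticsearch-certutil cert --ca elastic-stack-ca.p12   # writes elastic-certificates.p12
# Point xpack.security.transport.ssl.keystore.path / truststore.path at the certificates file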
Alternative: NGINX reverse proxy with basic auth
- Often easier to implement than X-Pack security
- Lower complexity, fewer failure points
- Sufficient for most production environments
Security Failures
- Cryptolocker attacks on exposed Elasticsearch clusters
- Data exfiltration through unsecured Kibana instances
- Resource abuse from public access to cluster
Integration Architecture Patterns
Kafka Buffer Implementation
Why Kafka Buffer:
- Prevents log loss during Elasticsearch downtime
- Handles traffic spikes without pipeline failures
- Provides data durability and replay capability
Configuration:
# Filebeat to Kafka
output.kafka:
  hosts: ["kafka1:9092", "kafka2:9092", "kafka3:9092"]
  topic: 'logs-%{[service.name]}'
  required_acks: 1
  compression: gzip
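Create the topics up front with explicit partitions and replication instead of relying on broker auto-create; the topic name and sizing below are illustrative:
kafka-topics.sh --create --topic logs-payment-service \
  --partitions 6 --replication-factor 3 \
  --bootstrap-server kafka1:9092
required_acks: 1 above means the producer only waits for the partition leader, which is usually fine for logs; use -1 (all in-sync replicas) if losing a few seconds of logs during a broker failure is unacceptable.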
Application Logging Best Practices
JSON/Structured Logging Configuration (the Spring Boot pattern below adds the trace ID; for actual JSON output pair it with a JSON encoder such as logstash-logback-encoder):
logging:
  pattern:
    console: "%d{yyyy-MM-dd HH:mm:ss} [%thread] %-5level [%X{traceId:-}] %logger{36} - %msg%n"
Correlation ID Implementation:
spring:
  sleuth:
    sampler:
      probability: 0.1   # Don't trace everything - performance impact
  zipkin:
    enabled: false       # Enable only if a Zipkin collector is actually running
Troubleshooting Decision Trees
Cluster Red Status
- Check disk space (`_cat/allocation`)
- If disk is OK → check unassigned shards (`_cat/shards`)
- If shards are OK → check heap usage and GC time
- If memory is OK → check network connectivity between nodes, then retry failed allocations (command below)
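Elasticsearch gives up on a shard after five failed allocation attempts; once the underlying cause is fixed, tell it to try again:
curl -X POST "localhost:9200/_cluster/reroute?retry_failed=true&pretty"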
Performance Degradation
- Check field mappings for proper types
- Verify query patterns (avoid wildcard prefixes)
- Check index size and shard distribution
- Monitor resource utilization (CPU, memory, I/O)
Data Loss Prevention
- Implement ILM policies before disk fills
- Monitor cluster health continuously
- Backup critical indices regularly
- Test restore procedures (snapshot sketch below)
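A snapshot/restore sketch using a shared-filesystem repository. The repository name, path, and index names are illustrative, and the location must be listed under path.repo on every node:
# Register the repository once
curl -X PUT "localhost:9200/_snapshot/logs_backup" -H 'Content-Type: application/json' \
  -d '{"type": "fs", "settings": {"location": "/mnt/es-backups"}}'
# Take a snapshot and wait for it to finish
curl -X PUT "localhost:9200/_snapshot/logs_backup/snapshot-1?wait_for_completion=true"
# The only real test of a backup: restore it into renamed indices and query them
curl -X POST "localhost:9200/_snapshot/logs_backup/snapshot-1/_restore" -H 'Content-Type: application/json' \
  -d '{"indices": "logs-2024.01.14", "rename_pattern": "(.+)", "rename_replacement": "restored-$1"}'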
Real-World Deployment Lessons
Common Misconceptions
- "Elasticsearch is a database": It's a search engine; treat it as such
- "More heap = better performance": Above 32GB actually decreases performance
- "Official Helm charts are production-ready": They're resource-intensive and often over-engineered
Time Investment Requirements
- Initial setup: 2-3 weeks for production-grade deployment
- Performance tuning: 1-2 weeks of iterative optimization
- Monitoring setup: 1 week for comprehensive alerting
- Team training: 1 month for operational proficiency
Expertise Prerequisites
- Strong understanding of JVM tuning and garbage collection
- Kubernetes knowledge for containerized deployments
- Linux system administration for performance optimization
- Network troubleshooting for cluster communication issues
This knowledge base provides actionable intelligence for successful ELK Stack implementation while highlighting critical failure modes and their prevention strategies.
Useful Links for Further Investigation
Resources That'll Actually Help You
Link | Description |
---|---|
Elasticsearch Reference 8.15 | The official docs. Dense but accurate. Start with the "Get Started" section, ignore the marketing fluff. |
Logstash Configuration Reference | Real configs that work. Copy-paste these instead of trying to write from scratch. |
Cerebro - Elasticsearch Management | Better cluster management than Kibana's built-in tools. Install this first. |
ElastAlert Rules | Alerting that actually works. Don't use X-Pack alerting unless you hate yourself. |
Netflix Engineering Blog - Logging Posts | How they handle millions of events per second. Spoiler: lots of Kafka and custom tooling. |
Uber's Logging Platform | Their architecture evolution from ELK to custom solutions. Good read on scaling pain points. |
Shopify Engineering Blog | Real-world deployment challenges and solutions. They made every mistake so you don't have to. |
Airbnb's Logging Infrastructure | How they scaled from startup to enterprise logging. Includes actual cost numbers. |
Elasticsearch Common Issues | Community forum where people post actual errors. Search here when shit breaks. |
ELK Stack GitHub Issues | Where bugs get reported and (sometimes) fixed. Useful for version-specific problems. |
Stack Overflow ELK Questions | Filter by votes, ignore answers from 2015. Configuration has changed a lot. |
Elastic Official Docs - Troubleshooting | Official troubleshooting guide. Actually useful for common problems. |
Docker ELK Stack | The most starred Docker setup on GitHub. Use this for local development. |
Elastic Helm Charts | Official Kubernetes deployment for the Elastic Stack. It is heavy but provides a production-ready setup for complex environments. |
Filebeat Kubernetes Examples | Working Kubernetes configurations for Filebeat, providing practical examples without unnecessary marketing fluff. |
Logstash Patterns | A collection of pre-built grok patterns for Logstash, designed to save you time by eliminating the need to write regex from scratch. |
Elasticsearch Tuning Guide | Provides actually useful performance tips for Elasticsearch. It is recommended to follow these guidelines before troubleshooting slow performance. |
JVM Settings for Elasticsearch | Essential guidance on heap sizing and JVM options for Elasticsearch, ensuring cluster stability and optimal performance without crashes. |
Elasticsearch Examples Repository | A repository containing working Index Lifecycle Management (ILM) policies, which are crucial for saving disk space and maintaining cluster sanity. |
ELK Stack Security Hardening | Provides a minimal security setup for the ELK Stack that is proven to work effectively. This crucial guide should not be skipped. |
TLS Configuration Examples | Practical and working TLS configuration examples for Elasticsearch, addressing shortcomings found in official documentation examples. |
NGINX Reverse Proxy for ELK | A simple NGINX reverse proxy setup for authenticating access to the ELK Stack, often proving easier to implement than X-Pack security. |
Elastic Discord Community | Access real-time help and support from experienced community members who have faced similar challenges, often more responsive than traditional forums. |
Elastic Stack Community Forum | A community forum for honest discussions about problems and solutions related to the Elastic Stack, offering insights beyond corporate messaging. |
Elasticsearch Users Mailing List | An old-school but still active mailing list where senior engineers and experienced users discuss Elasticsearch topics and provide support. |
Related Tools & Recommendations
Kafka + MongoDB + Kubernetes + Prometheus Integration - When Event Streams Break
When your event-driven services die and you're staring at green dashboards while everything burns, you need real observability - not the vendor promises that go
EFK Stack Integration - Stop Your Logs From Disappearing Into the Void
Elasticsearch + Fluentd + Kibana: Because searching through 50 different log files at 3am while the site is down fucking sucks
Splunk - Expensive But It Works
Search your logs when everything's on fire. If you've got $100k+/year to spend and need enterprise-grade log search, this is probably your tool.
Prometheus + Grafana + Jaeger: Stop Debugging Microservices Like It's 2015
When your API shits the bed right before the big demo, this stack tells you exactly why
GitOps Integration Hell: Docker + Kubernetes + ArgoCD + Prometheus
How to Wire Together the Modern DevOps Stack Without Losing Your Sanity
Kibana - Because Raw Elasticsearch JSON Makes Your Eyes Bleed
Stop manually parsing Elasticsearch responses and build dashboards that actually help debug production issues.
Should You Use TypeScript? Here's What It Actually Costs
TypeScript devs cost 30% more, builds take forever, and your junior devs will hate you for 3 months. But here's exactly when the math works in your favor.
RAG on Kubernetes: Why You Probably Don't Need It (But If You Do, Here's How)
Running RAG Systems on K8s Will Make You Hate Your Life, But Sometimes You Don't Have a Choice
Connecting ClickHouse to Kafka Without Losing Your Sanity
Three ways to pipe Kafka events into ClickHouse, and what actually breaks in production
Fix Your Broken Kafka Consumers
Stop pretending your "real-time" system isn't a disaster
Grafana - The Monitoring Dashboard That Doesn't Suck
integrates with Grafana
Set Up Microservices Monitoring That Actually Works
Stop flying blind - get real visibility into what's breaking your distributed services
Your Elasticsearch Cluster Went Red and Production is Down
Here's How to Fix It Without Losing Your Mind (Or Your Job)
Kafka + Spark + Elasticsearch: Don't Let This Pipeline Ruin Your Life
The Data Pipeline That'll Consume Your Soul (But Actually Works)
Elastic APM - Track down why your shit's broken before users start screaming
Application performance monitoring that won't break your bank or your sanity (mostly)
Python vs JavaScript vs Go vs Rust - Production Reality Check
What Actually Happens When You Ship Code With These Languages
JavaScript Gets Built-In Iterator Operators in ECMAScript 2025
Finally: Built-in functional programming that should have existed in 2015
Fluentd - Ruby-Based Log Aggregator That Actually Works
Collect logs from all your shit and pipe them wherever - without losing your sanity to configuration hell
Fluentd Production Troubleshooting - When Shit Hits the Fan
Real solutions for when Fluentd breaks in production and you need answers fast
Pinecone Production Reality: What I Learned After $3200 in Surprise Bills
Six months of debugging RAG systems in production so you don't have to make the same expensive mistakes I did
Recommendations combine user behavior, content similarity, research intelligence, and SEO optimization