When Fluentd Goes to Hell (And How to Fix It)

I've been firefighting Fluentd production issues for years. Here's what you need to know when everything breaks and you need to fix it fast. Skip the theory - this is battlefield medicine for your logging infrastructure.

The Nuclear Option (Try This First)

When Fluentd is fucked and you need it working now (see the official troubleshooting guide for more context):

## Stop everything
sudo systemctl stop fluentd
## Kill any stuck processes
sudo pkill -9 fluentd
## Clear buffer files (ONLY if you can lose recent logs)
sudo rm -rf /var/log/fluentd/buffer/*
## Start with verbose logging
sudo systemctl start fluentd
sudo journalctl -u fluentd -f

Time to fix: 2 minutes if you're lucky, 2 hours if the config is broken.

Memory is Growing Like Cancer

This is the #1 production killer. Fluentd starts at 100MB and grows to 2GB+ until your pods get OOMKilled. Here's why (explained in detail in the buffer configuration docs):

Root cause: Buffer overflow + memory leaks in plugins + Ruby garbage collection not keeping up.

Real example from production:

2025-09-10 03:42:18 +0000 [error]: temporarily failed to flush the buffer.
next_retry=2025-09-10 03:42:28 +0000 retry_times=5

That innocent error? It means your output destination is down and Fluentd is buffering everything in memory. After 10 minutes, you're out of RAM.

The fix that actually works:

<system>
  log_level debug
  workers 2  # Multi-worker saves your ass
</system>

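## Put this <buffer> section inside your output <match> block - it's not a top-level directive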
<buffer>
  @type file  # Get it out of memory NOW
  path /var/log/fluentd/buffer/
  chunk_limit_size 4MB
  total_limit_size 512MB  # Hard limit - will drop logs before OOM
  overflow_action drop_oldest_chunk  # Better than crashing
  flush_thread_count 2
  flush_interval 5s
  retry_max_times 3  # Don't retry forever
  retry_wait 10s
</buffer>

Pro tip: Set RUBY_GC_HEAP_OLDOBJECT_LIMIT_FACTOR=1.2 in your environment. Ruby's default GC settings are optimized for throughput, not memory usage. This forces more frequent garbage collection. See Ruby GC documentation for deeper understanding.
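If Fluentd runs under systemd, as the systemctl commands above assume, a drop-in file is one way to set that variable - a sketch only; adjust the unit name if your install differs (td-agent ships its own unit):

sudo mkdir -p /etc/systemd/system/fluentd.service.d
cat <<'EOF' | sudo tee /etc/systemd/system/fluentd.service.d/gc-tuning.conf
[Service]
## Force more frequent Ruby GC (see the pro tip above)
Environment="RUBY_GC_HEAP_OLDOBJECT_LIMIT_FACTOR=1.2"
EOF
sudo systemctl daemon-reload
sudo systemctl restart fluentd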

CPU Pegged at 100% (The Ruby GIL Problem)

Symptom: Fluentd CPU hits 100%, logs stop flowing, everything backs up.

Why this happens: Ruby's Global Interpreter Lock means Fluentd is basically single-threaded for CPU work. Heavy regex parsing or JSON transformation will peg one core and block everything else.

The PayU solution that worked:
Their team cut Fluentd resource usage by 48% using multi-workers:

<system>
  workers 3  # Use 2-3 workers, not more
</system>

## Only worker 0 handles file tailing (single source)
<worker 0>
  <source>
    @type tail
    path /var/log/app/*.log
    pos_file /var/log/fluentd/app.log.pos
    format json
    tag app.logs
  </source>
</worker>

## All workers handle output (parallel processing)
<match **>
  @type http
  endpoint \"https://ingress.coralogix.com/logs/v1/singles\"
  <buffer>
    flush_thread_count 4  # Parallel flushes
    chunk_limit_size 4MB
  </buffer>
</match>

Before: 30 single-worker pods consuming 0.8 CPU each
After: 15 multi-worker pods consuming 1 CPU each
Result: Same throughput, half the resource usage

Kubernetes DaemonSet Hell

The problem: Kubernetes 1.10+ changed log rotation, triggering memory leaks in file buffer mode. Fluentd memory grows forever until pods get killed.

Symptoms you'll see:

  • Memory usage climbing non-stop
  • kubectl get pods shows constant restarts
  • Error: signal: killed in pod logs

The working fix:

## In your DaemonSet
spec:
  template:
    spec:
      containers:
      - name: fluentd
        image: fluent/fluentd:v1.19.0
        resources:
          limits:
            memory: 1Gi  # Hard limit - pods will restart before OOM
          requests:
            memory: 400Mi
            cpu: 100m
        env:
        - name: RUBY_GC_HEAP_OLDOBJECT_LIMIT_FACTOR
          value: \"1.2\"  # Aggressive GC for containers
        - name: WORKERS
          value: \"2\"

Memory buffer config that doesn't leak:

<buffer>
  @type memory  # Counter-intuitive but file buffers leak in K8s
  chunk_limit_size 2MB
  total_limit_size 64MB
  flush_thread_count 2
  overflow_action drop_oldest_chunk
</buffer>

Real Error Messages (And What They Mean)

Error: buffer queue limit overflow
Translation: Your output destination is slow/down and buffers are full
Fix: Increase total_limit_size or fix the destination

Error: Fluent::ConfigError: Plugin 'tail' does not support multi workers
Translation: You tried to use tail input with multi-workers
Fix: Wrap it in a worker directive:

<worker 0>
  <source>
    @type tail
    # your tail config
  </source>
</worker>

Error: parsing failed
Translation: Your config syntax is fucked but Fluentd won't tell you where
Fix: Use fluentd --dry-run -c your-config.conf to validate

Error: Permission denied - /var/log/fluentd/buffer
Translation: Process can't write to buffer directory
Fix: chown -R fluentd:fluentd /var/log/fluentd or run as root (not recommended)

The 3AM Debug Workflow

When Fluentd breaks at 3AM and you need it fixed NOW:

  1. Check if it's actually running: ps aux | grep fluentd
  2. Look at recent logs: tail -100 /var/log/fluentd/fluentd.log
  3. Check memory usage: free -h && df -h
  4. Validate config: fluentd --dry-run -c /etc/fluentd/fluent.conf
  5. Nuclear restart: Stop, kill, clear buffers, start
  6. Monitor for 5 minutes: watch 'ps aux | grep fluentd && free -h'

Pro tip: Keep a working minimal config file handy. When shit's broken, switch to minimal config first, then add complexity back piece by piece.
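A bare-bones fallback along these lines works - a sketch only (forward input piped to stdout; swap in whatever source you actually rely on and adjust the path):

cat <<'EOF' | sudo tee /etc/fluentd/minimal.conf
## Bare-bones pipeline: accept forward-protocol events, print them to stdout
<source>
  @type forward
  port 24224
</source>
<match **>
  @type stdout
</match>
EOF
## Run it in the foreground so you see immediately whether the daemon itself is healthy
fluentd -c /etc/fluentd/minimal.conf -v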

Performance Numbers That Matter

From actual production deployments:

  • Single worker: ~3-4K events/sec before choking
  • Multi-worker (2-3): ~8-10K events/sec
  • Memory growth: 200-300MB with file buffers, 100-500MB with memory buffers
  • CPU usage: 0.5-1.0 cores per worker under normal load

When to scale out: If you're hitting >5K events/sec consistently, add more Fluentd instances rather than more workers per instance. Check the high availability configuration guide for scaling strategies and the performance tuning documentation for optimization techniques.
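To know which side of that 5K line you're on, you can estimate the current event rate from the monitor_agent endpoint - a rough sketch that assumes the monitor_agent source from the monitoring section further down is enabled, jq is installed, and your Fluentd version exposes emit_records:

## Sample total emitted records twice, 10 seconds apart, and divide
before=$(curl -s http://localhost:24220/api/plugins.json | jq '[.plugins[].emit_records // 0] | add')
sleep 10
after=$(curl -s http://localhost:24220/api/plugins.json | jq '[.plugins[].emit_records // 0] | add')
echo "approx events/sec: $(( (after - before) / 10 ))"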

Production Troubleshooting FAQ (The Questions You Actually Have)

Q: Why does Fluentd randomly stop processing logs?

A: Most likely: buffer overflow. Your output destination (Elasticsearch, S3, etc.) is slow or down, so Fluentd queues everything in memory until it runs out.

Check this: grep "buffer queue" /var/log/fluentd/fluentd.log

Fix: Add overflow_action drop_oldest_chunk and total_limit_size to your buffer config. Better to drop old logs than stop processing entirely.

Q: Memory usage keeps growing until the pod gets killed - what's happening?

A: Root cause: file buffer memory leak in Kubernetes + Ruby GC not keeping up with the allocation rate.

The leak pattern: memory grows steadily from 100MB to 1GB+ over hours/days, then a sudden restart.

Working solution: switch to memory buffers with hard limits:

<buffer>
  @type memory
  total_limit_size 128MB  # Hard stop
  overflow_action drop_oldest_chunk
</buffer>

Set RUBY_GC_HEAP_OLDOBJECT_LIMIT_FACTOR=1.2 to force more frequent garbage collection.

Q: How do I know if I need multi-workers?

A: Signs you need multi-workers:

  • CPU consistently over 80% on a single core
  • The fluentd process consuming 1+ full CPU cores
  • Logs backing up even with fast output destinations
  • Processing >3-4K events/sec

Don't use multi-workers if:

  • Memory usage is your main problem (workers multiply memory usage)
  • You're using plugins that don't support it (tail needs special config)
  • CPU usage is low - you'll just waste resources

Q: What's the optimal number of workers?

A: Production-tested answer: 2-3 workers max.

Why not more? Ruby's GIL limits the benefit. PayU found no improvement beyond 3 workers, but 2-3 workers gave a 48% resource reduction.

Memory trade-off: each worker consumes its own memory. 3 workers = ~3x memory usage.

Q: Kubernetes pods keep restarting with "signal: killed" - how do I fix it?

A: Translation: your pods are getting OOMKilled by the kernel.

Quick diagnosis:

kubectl describe pod your-fluentd-pod | grep -A 5 "Last State"
## Look for "OOMKilled" in the exit reason

Permanent fix:

  1. Set realistic memory limits: memory: 1Gi (not 512Mi)
  2. Use memory buffers with hard limits
  3. Enable aggressive garbage collection
  4. Monitor memory growth: kubectl top pods

Q: Error "Plugin 'tail' does not support multi workers" - what now?

A: Problem: some plugins don't work with multiple workers because they need exclusive access to resources.

Solution: use the <worker> directive:

<system>
  workers 3
</system>

<worker 0>
  <source>
    @type tail
    path /var/log/app/*.log
    # tail config here
  </source>
</worker>

## All workers can handle the output
<match **>
  @type elasticsearch
  # output config here
</match>

Q: How do I debug high CPU usage?

A: Step 1: Check if it's regex processing hell:

## Look for complex regex patterns
grep -i "regexp\|format" /etc/fluentd/fluent.conf

Step 2: Use perf to profile the Ruby process:

sudo perf record -g -p $(pgrep fluentd)
## Let it run for 30 seconds
sudo perf report

Quick fixes:

  • Simplify regex patterns in your config
  • Move heavy processing to filters, not parsers
  • Use gzip_command for S3 output to offload compression

Q: My config validates but Fluentd still won't start - why?

A: Hidden gotcha: Fluentd's --dry-run only checks syntax, not runtime requirements.

Common runtime failures:

  • Permissions on log files/directories
  • Network connectivity to output destinations
  • Missing plugins (install with fluent-gem install plugin-name)
  • Port conflicts with other services

Debug startup: run Fluentd in the foreground with verbose logging:

fluentd -c /etc/fluentd/fluent.conf -vv
## Watch for the actual error before it daemonizes

Q: Buffer files growing huge - should I delete them?

A: Danger zone: never delete buffer files while Fluentd is running. You'll corrupt the buffer state and lose logs.

Safe cleanup process:

  1. Stop Fluentd: sudo systemctl stop fluentd
  2. Check buffer size: du -sh /var/log/fluentd/buffer/
  3. If you can lose logs: rm -rf /var/log/fluentd/buffer/*
  4. Start Fluentd: sudo systemctl start fluentd

Better solution: configure total_limit_size to prevent runaway growth:

<buffer>
  total_limit_size 1GB
  overflow_action drop_oldest_chunk
</buffer>

Q: Performance is shit - what's the nuclear performance optimization?

A: The PayU playbook that cut resource usage by 48%:

  1. Multi-workers: workers 2 in the system config
  2. Parallel flushing: flush_thread_count 4
  3. Memory GC tuning: RUBY_GC_HEAP_OLDOBJECT_LIMIT_FACTOR=1.2
  4. External compression: store_as gzip_command for S3 output
  5. Buffer optimization: file buffers for persistence, memory buffers for speed

Result: same throughput with half the CPU/memory usage.

Q: How do I monitor Fluentd in production to prevent issues?

A: Critical metrics to watch:

  • Memory usage trending over time
  • Buffer queue length
  • Retry counts (indicates downstream issues)
  • Processing latency

Quick monitoring setup:

<source>
  @type monitor_agent
  bind 0.0.0.0
  port 24220
</source>

Check health: curl http://localhost:24220/api/plugins.json

Set alerts on:

  • Memory usage >80% of limit
  • Buffer queue >75% full
  • Retry rate >10% over 5 minutes
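If Prometheus isn't wired up yet, even a dumb cron check against that endpoint beats nothing - a minimal sketch (jq assumed; the 100-chunk threshold is an arbitrary example value):

## Sum buffer_queue_length across all output plugins and complain if it's high
queue=$(curl -s http://localhost:24220/api/plugins.json \
  | jq '[.plugins[].buffer_queue_length // 0] | add')
if [ "${queue:-0}" -gt 100 ]; then
  echo "WARNING: Fluentd buffer queue length is ${queue}" >&2
fi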

Kubernetes Production Setup (How to Not Fuck It Up)

Running Fluentd in Kubernetes is where most people screw up. I've seen teams burn weeks debugging memory leaks, pod restarts, and buffer issues that could have been avoided with the right setup from day one. The official Kubernetes deployment guide covers the basics, but here's what actually works in production.

DaemonSet Configuration That Actually Works

Here's the DaemonSet config I use in production. It's based on fixing every major issue I've encountered:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: fluentd
  template:
    metadata:
      labels:
        name: fluentd
    spec:
      tolerations:
      - key: node-role.kubernetes.io/master
        effect: NoSchedule
      containers:
      - name: fluentd
        image: fluent/fluentd:v1.19.0-debian-1.0
        resources:
          limits:
            memory: 1Gi      # Don't go lower - you'll get OOMKilled
            cpu: 1000m       # Full core for multi-workers
          requests:
            memory: 400Mi    # Realistic starting point
            cpu: 200m        # Conservative request
        env:
        - name: RUBY_GC_HEAP_OLDOBJECT_LIMIT_FACTOR
          value: "1.2"      # Force frequent GC in containers
        - name: WORKERS
          value: "2"        # Multi-worker for performance
        volumeMounts:
        - name: varlog
          mountPath: /var/log
        - name: varlibdockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
        - name: config-volume
          mountPath: /fluentd/etc
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
      - name: varlibdockercontainers
        hostPath:
          path: /var/lib/docker/containers
      - name: config-volume
        configMap:
          name: fluentd-config

Key differences from the default Kubernetes examples:

  • Memory limit 1GB: Most examples use 512Mi, which causes OOMKilled under load
  • CPU limit 1000m: Full core needed for multi-workers to be effective
  • RUBY_GC_HEAP_OLDOBJECT_LIMIT_FACTOR: Containers need aggressive GC
  • WORKERS=2: Multi-worker mode for better CPU utilization

Multi-Worker Configuration for Kubernetes

The trick with multi-workers in Kubernetes is handling log collection properly. You can't have multiple workers reading the same files (see the tail input documentation for why this causes issues):

<system>
  log_level info
  workers "#{ENV['WORKERS'] || 1}"
</system>

## Only worker 0 handles log file collection
<worker 0>
  <source>
    @type tail
    @id in_tail_container_logs
    path /var/log/containers/*.log
    pos_file /var/log/fluentd-containers.log.pos
    tag kubernetes.*
    read_from_head true
    <parse>
      @type json
      time_format %Y-%m-%dT%H:%M:%S.%NZ
    </parse>
  </source>
</worker>

## All workers handle processing and output
<filter kubernetes.**>
  @type kubernetes_metadata
  @id filter_kube_metadata
  kubernetes_url "#{ENV['KUBERNETES_SERVICE_HOST']}:#{ENV['KUBERNETES_SERVICE_PORT_HTTPS']}"
  verify_ssl false
  ca_file /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
  bearer_token_file /var/run/secrets/kubernetes.io/serviceaccount/token
</filter>

<match **>
  @type elasticsearch
  @id out_es
  @log_level info
  include_tag_key true
  host "#{ENV['FLUENT_ELASTICSEARCH_HOST']}"
  port "#{ENV['FLUENT_ELASTICSEARCH_PORT']}"
  logstash_format true
  <buffer>
    @type memory
    chunk_limit_size 4MB
    total_limit_size 128MB
    flush_thread_count 2
    flush_interval 5s
    overflow_action drop_oldest_chunk
    retry_max_times 3
  </buffer>
</match>
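Once that's deployed, sanity-check that the extra workers actually came up - each worker logs its own "starting fluentd worker" line at boot (DaemonSet name and namespace match the manifest above; adjust for your cluster):

## Count worker startup lines in one of the DaemonSet's pods
kubectl -n kube-system logs ds/fluentd | grep -c 'starting fluentd worker'
## Expect one line per worker (2 with WORKERS=2)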

The Kubernetes Log Rotation Problem

The issue: Kubernetes 1.10+ changed how log rotation works, causing memory leaks with file buffers. This is documented in the container log manager changes. Symptoms:

  • Memory usage climbs steadily over days
  • Eventually hits OOMKilled
  • Pod restarts, memory usage resets, cycle repeats

Root cause: Fluentd file buffers don't handle the new rotation mechanism properly, creating leaked file descriptors and buffer chunks.

Working solution: Use memory buffers with hard limits instead of file buffers:

<buffer>
  @type memory          # NOT file - file buffers leak in K8s
  total_limit_size 64MB # Hard limit prevents OOM
  chunk_limit_size 2MB
  overflow_action drop_oldest_chunk  # Drop logs vs crash
  flush_thread_count 2
  flush_interval 5s
</buffer>

Memory vs. file buffers trade-off:

  • Memory buffers: Fast, no leaks, but logs lost on restart
  • File buffers: Persistent, but leak memory in Kubernetes

For most use cases, losing a few seconds of logs on restart is better than constant pod restarts from memory leaks.

Performance Optimization (The PayU Method)

PayU's engineering team documented their 48% resource reduction using these performance optimizations. Their approach is backed by the multi-worker documentation and real production metrics:

Before optimization:

  • 30 Fluentd replicas
  • 0.8 CPU and 768Mi memory per pod
  • Single worker per pod
  • High CPU usage causing processing delays

After optimization:

  • 15 Fluentd replicas (50% reduction)
  • 1 CPU and 768Mi memory per pod
  • 2-3 workers per pod
  • Same throughput with better resource utilization

The key changes:

  1. Replace single-threaded plugins with multi-worker compatible ones:
## Before: Coralogix plugin (no multi-worker support)
<match **>
  @type coralogix
  # config
</match>

## After: HTTP plugin with Coralogix endpoint
<match **>
  @type http
  @id applications_json_http_to_coralogix
  endpoint "https://ingress.coralogix.com/logs/v1/singles"
  headers_from_placeholders {"Authorization":"Bearer ${$.privateKey}"}
  retryable_response_codes 503
  error_response_as_unrecoverable false
  <buffer $.privateKey>
    @type memory
    chunk_limit_size 4MB
    compress gzip
    flush_thread_count 4    # Parallel processing
    flush_interval 1s
    overflow_action throw_exception
    retry_max_times 10
    retry_type periodic
    retry_wait 8
  </buffer>
</match>
  2. Use ConfigMap for configuration management:
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluentd-config
data:
  fluent.conf: |
    <system>
      log_level "#{ENV["LOG_LEVEL"] || 'info'}"
      workers "#{ENV["WORKERS"] || 1}"
    </system>
    # rest of config here

This allows config changes without rebuilding Docker images.
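Rolling out a config change then looks roughly like this - the file name fluentd-configmap.yaml is just an example; the ConfigMap and DaemonSet names match the manifests above:

kubectl -n kube-system apply -f fluentd-configmap.yaml
## ConfigMap updates don't restart pods on their own - force a rolling restart
kubectl -n kube-system rollout restart daemonset/fluentd
kubectl -n kube-system rollout status daemonset/fluentd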

Monitoring and Alerting Setup

Essential metrics to monitor (see the monitoring documentation for complete setup):

<source>
  @type monitor_agent
  bind 0.0.0.0
  port 24220
</source>

<source>
  @type prometheus
  bind 0.0.0.0
  port 24231
  metrics_path /metrics
</source>

Critical alerts to set up:

  1. Memory usage trending up: Alert if memory usage increases >10% over 1 hour
  2. Buffer queue growing: Alert if buffer utilization >75%
  3. High retry rate: Alert if retry count >10% of total events
  4. Pod restart frequency: Alert if pod restarts >3 times in 1 hour

Monitoring dashboard metrics (configure using the Prometheus monitoring guide):

  • fluentd_output_status_retry_count - Output destination issues
  • fluentd_output_status_buffer_queue_length - Buffer utilization
  • Container memory usage from Kubelet metrics
  • Pod restart count from Kubernetes events

Common Kubernetes Production Failures

Problem: Fluentd can't read log files
Error: Permission denied - /var/log/containers/app.log
Fix: Add proper securityContext:

securityContext:
  runAsUser: 0  # Run as root to read system logs
  # Or use specific user with log file access

Problem: Fluentd overwhelms Kubernetes API
Error: 429 Too Many Requests from Kubernetes metadata plugin
Fix: Configure rate limiting:

<filter kubernetes.**>
  @type kubernetes_metadata
  kubernetes_url "#{ENV['KUBERNETES_SERVICE_HOST']}:#{ENV['KUBERNETES_SERVICE_PORT_HTTPS']}"
  cache_size 1000
  cache_ttl 3600
  watch false  # Disable watch to reduce API calls
</filter>

Problem: Out of disk space from buffer files
Fix: Use memory buffers or set total_limit_size on file buffers

Problem: Network policies blocking Fluentd output
Fix: Create network policies allowing egress to your log destination
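A starting point for that egress policy - a sketch that assumes the pod labels and namespace from the DaemonSet above and an HTTPS log destination; tighten the destination selector (and allow DNS) for your environment:

cat <<'EOF' | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: fluentd-egress
  namespace: kube-system
spec:
  podSelector:
    matchLabels:
      name: fluentd
  policyTypes:
  - Egress
  egress:
  - ports:
    - protocol: TCP
      port: 443    # HTTPS egress to the log destination
EOF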

Capacity Planning for Scale

Resource planning formula based on production data:

  • Memory: 400Mi base + 100Mi per 1K events/sec
  • CPU: 200m base + 100m per worker + 200m per 1K events/sec
  • Disk: Only needed for file buffers - 1GB per 10K events/sec buffer capacity
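Plugging that formula into shell arithmetic as a quick sanity check (the event rate and worker count below are example inputs - use your own numbers):

EVENTS_PER_SEC=3000   # example input - use your measured rate
WORKERS=2             # example input
mem_mi=$(( 400 + 100 * EVENTS_PER_SEC / 1000 ))
cpu_m=$(( 200 + 100 * WORKERS + 200 * EVENTS_PER_SEC / 1000 ))
echo "ballpark per pod: ${mem_mi}Mi memory, ${cpu_m}m CPU"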

Scaling thresholds:

  • Scale up workers: CPU >80% consistently
  • Scale out pods: Memory >80% or >5K events/sec per pod
  • Scale destination: Retry rate >5% consistently

Production-tested limits:

  • Max workers per pod: 3 (diminishing returns beyond this)
  • Max events per pod: ~10K/sec before worker scaling needed (based on performance tuning guidelines)
  • Max memory per pod: 2GB (beyond this, scale out instead)

Production Problem vs Solution Matrix

| Problem | Symptoms | Root Cause | Working Solution | Time to Fix |
|---|---|---|---|---|
| Memory Growing Forever | Pod restarts with OOMKilled | Buffer overflow + file buffer leaks in K8s | Switch to memory buffers with total_limit_size | 15 minutes |
| CPU Pegged at 100% | Logs backing up, high latency | Ruby GIL + heavy regex processing | Multi-workers + simplified parsing | 30 minutes |
| Logs Stopped Processing | No new logs in destination | Buffer queue full, output destination down | Add overflow_action drop_oldest_chunk | 5 minutes |
| Pod Restart Loop | Constant pod restarts in K8s | Memory limits too low for buffer size | Increase memory limit to 1GB + buffer limits | 10 minutes |
| Config Validates But Won't Start | Startup failure after syntax check | Runtime permissions/network/plugins | Run in foreground with -vv flag | 20 minutes |
| High Retry Rate | Slow processing, retry errors | Output destination slow/unreliable | Reduce retry_max_times, add flush_thread_count | 10 minutes |
| Plugin Multi-Worker Error | "Plugin 'tail' does not support multi workers" | Plugin doesn't support concurrent access | Wrap in <worker 0> directive | 5 minutes |
| Buffer Files Growing Huge | Disk space alerts, slow startup | Output destination down for extended period | Set total_limit_size + overflow_action | 15 minutes |
| Permission Denied Errors | Can't read log files | Process user lacks file access | Run as root or fix file permissions | 10 minutes |
| Network Policy Blocking | Connection timeout to output | K8s network policy blocking egress | Create network policy for log destination | 30 minutes |

Essential Troubleshooting Resources (Curated from Battle Experience)