I've been firefighting Fluentd production issues for years. Here's what you need to know when everything breaks and you need to fix it fast. Skip the theory - this is battlefield medicine for your logging infrastructure.
The Nuclear Option (Try This First)
When Fluentd is fucked and you need it working now (see the official troubleshooting guide for more context):
```bash
# Stop everything
sudo systemctl stop fluentd

# Kill any stuck processes
sudo pkill -9 fluentd

# Clear buffer files (ONLY if you can lose recent logs)
sudo rm -rf /var/log/fluentd/buffer/*

# Start with verbose logging
sudo systemctl start fluentd
sudo journalctl -u fluentd -f
```
Time to fix: 2 minutes if you're lucky, 2 hours if the config is broken.
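If you want that sequence as a single script, here's a rough sketch. It assumes a systemd unit named fluentd and a config at /etc/fluentd/fluent.conf (adjust for your install), and it refuses to wipe buffers if the config won't even parse:

```bash
#!/usr/bin/env bash
# nuke-fluentd.sh - hypothetical helper; unit name "fluentd" and config path
# are assumptions, adjust to your install.
set -euo pipefail

CONF=/etc/fluentd/fluent.conf

# Bail out early if the config won't parse - no point wiping buffers for that
fluentd --dry-run -c "$CONF"

sudo systemctl stop fluentd
sudo pkill -9 fluentd || true        # ignore "no process found"

# Destroys buffered logs - only acceptable if you can afford to lose them
sudo rm -rf /var/log/fluentd/buffer/*

sudo systemctl start fluentd
sudo journalctl -u fluentd -f
```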
Memory is Growing Like Cancer
This is the #1 production killer. Fluentd starts at 100MB and grows to 2GB+ until your pods get OOMKilled. Here's why (explained in detail in the buffer configuration docs):
Root cause: Buffer overflow + memory leaks in plugins + Ruby garbage collection not keeping up.
Real example from production:
```
2025-09-10 03:42:18 +0000 [error]: temporarily failed to flush the buffer. next_retry=2025-09-10 03:42:28 +0000 retry_times=5
```
That innocent error? It means your output destination is down and Fluentd is buffering everything in memory. After 10 minutes, you're out of RAM.
The fix that actually works:
```
<system>
  log_level debug
  workers 2                          # Multi-worker saves your ass
</system>

# This <buffer> section goes inside your <match> output section
<buffer>
  @type file                         # Get it out of memory NOW
  path /var/log/fluentd/buffer/
  chunk_limit_size 4MB
  total_limit_size 512MB             # Hard limit - will drop logs before OOM
  overflow_action drop_oldest_chunk  # Better than crashing
  flush_thread_count 2
  flush_interval 5s
  retry_max_times 3                  # Don't retry forever
  retry_wait 10s
</buffer>
```
Pro tip: Set `RUBY_GC_HEAP_OLDOBJECT_LIMIT_FACTOR=1.2` in Fluentd's environment. Ruby's default GC settings are tuned for throughput, not memory: the factor defaults to 2.0, and lowering it makes Ruby run major garbage collection more often, trading a little CPU for a smaller resident heap. See the Ruby GC documentation for deeper understanding.
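How you set that depends on how Fluentd runs. For a systemd-managed install, a drop-in like this is one way to do it (the unit name fluentd is an assumption; td-agent/fluent-package installs use a different unit name):

```bash
# Hypothetical systemd drop-in; assumes the unit is called "fluentd"
sudo mkdir -p /etc/systemd/system/fluentd.service.d
sudo tee /etc/systemd/system/fluentd.service.d/ruby-gc.conf <<'EOF'
[Service]
Environment="RUBY_GC_HEAP_OLDOBJECT_LIMIT_FACTOR=1.2"
EOF
sudo systemctl daemon-reload
sudo systemctl restart fluentd
```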
CPU Pegged at 100% (The Ruby GIL Problem)
Symptom: Fluentd CPU hits 100%, logs stop flowing, everything backs up.
Why this happens: Ruby's Global Interpreter Lock means Fluentd is basically single-threaded for CPU work. Heavy regex parsing or JSON transformation will peg one core and block everything else.
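Before reaching for multi-workers, confirm it really is one Ruby thread doing the damage. Something like this shows per-thread CPU for the process (the pgrep pattern is an assumption, adjust it to how your process is named):

```bash
# Per-thread CPU for a Fluentd process - with the GIL you'll typically see one
# thread pinned near 100% while the others sit idle.
# pgrep pattern is an assumption; run `ps aux | grep fluentd` first and pick
# the worker PID (the child), not the supervisor.
top -H -p "$(pgrep -f fluentd | tail -1)"
```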
The PayU solution that worked:
Their team reduced Fluentd instances by 48% using multi-workers:
```
<system>
  workers 3                  # Use 2-3 workers, not more
</system>

# Only worker 0 handles file tailing (single source)
<worker 0>
  <source>
    @type tail
    path /var/log/app/*.log
    pos_file /var/log/fluentd/app.log.pos
    tag app.logs
    <parse>
      @type json
    </parse>
  </source>
</worker>

# All workers handle output (parallel processing)
<match **>
  @type http
  endpoint "https://ingress.coralogix.com/logs/v1/singles"
  <buffer>
    flush_thread_count 4     # Parallel flushes
    chunk_limit_size 4MB
  </buffer>
</match>
```
Before: 30 single-worker pods consuming 0.8 CPU each
After: 15 multi-worker pods consuming 1 CPU each
Result: Same throughput, half the resource usage
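To sanity-check what each worker is actually costing you, something like this works (the -C ruby filter is an assumption; Fluentd workers show up as ruby processes on most installs, but td-agent/fluent-package ships its own ruby):

```bash
# CPU and resident memory per Fluentd process (supervisor + workers),
# sorted by CPU. "ruby" as the command name is an assumption.
ps -C ruby -o pid,ppid,pcpu,rss,etime,args --sort=-pcpu
```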
Kubernetes DaemonSet Hell
The problem: Kubernetes 1.10+ changed log rotation, triggering memory leaks in file buffer mode. Fluentd memory grows forever until pods get killed.
Symptoms you'll see:
- Memory usage climbing non-stop
- `kubectl get pods` shows constant restarts (confirm they're really OOM kills - see the check below)
- Error: `signal: killed` in pod logs
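To confirm the restarts are OOM kills and not something else (namespace and label selector are assumptions; match them to your DaemonSet):

```bash
# Print the last termination reason per Fluentd pod - "OOMKilled" means the
# memory limit is what's restarting them.
kubectl get pods -n logging -l app=fluentd \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}{end}'
```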
The working fix:
```yaml
# In your DaemonSet
spec:
  template:
    spec:
      containers:
      - name: fluentd
        image: fluent/fluentd:v1.19.0
        resources:
          limits:
            memory: 1Gi        # Hard limit - pods will restart before OOM
          requests:
            memory: 400Mi
            cpu: 100m
        env:
        - name: RUBY_GC_HEAP_OLDOBJECT_LIMIT_FACTOR
          value: "1.2"         # Aggressive GC for containers
        - name: WORKERS
          value: "2"
```
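The WORKERS env var doesn't do anything on its own - your config has to read it. Fluentd evaluates Ruby inside double-quoted values, so a sketch like this wires it up (the `|| 2` fallback is my assumption):

```
<system>
  # Double-quoted values are evaluated as Ruby, so the worker count follows
  # the WORKERS env var from the DaemonSet (fallback of 2 is an assumption).
  workers "#{ENV['WORKERS'] || 2}"
</system>
```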
Memory buffer config that doesn't leak:
```
<buffer>
  @type memory               # Counter-intuitive but file buffers leak in K8s
  chunk_limit_size 2MB
  total_limit_size 64MB
  flush_thread_count 2
  overflow_action drop_oldest_chunk
</buffer>
```
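After rolling that out, watch the pods for a while and make sure memory actually plateaus instead of climbing (namespace and label are assumptions, and `kubectl top` needs metrics-server installed):

```bash
# Sample Fluentd pod memory every 30s - you want it to flatten out, not grow.
watch -n 30 "kubectl top pods -n logging -l app=fluentd"
```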
Real Error Messages (And What They Mean)
Error: `buffer queue limit overflow`
Translation: Your output destination is slow/down and buffers are full
Fix: Increase `total_limit_size` or fix the destination
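To see how close you are before it overflows, Fluentd's built-in monitor_agent is handy. This sketch assumes you've added it as a source on its default port 24220 and that jq is installed:

```bash
# One-time setup in fluent.conf (restart required):
#   <source>
#     @type monitor_agent
#     bind 127.0.0.1
#     port 24220
#   </source>
#
# Then check buffer pressure per output plugin:
curl -s http://127.0.0.1:24220/api/plugins.json \
  | jq '.plugins[]
        | select(.buffer_total_queued_size != null)
        | {plugin_id: .plugin_id, type: .type,
           queue_length: .buffer_queue_length,
           queued_bytes: .buffer_total_queued_size,
           retries: .retry_count}'
```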
Error: `Fluent::ConfigError: Plugin 'tail' does not support multi workers`
Translation: You tried to use the `tail` input with multi-workers
Fix: Wrap it in a worker directive:

```
<worker 0>
  <source>
    @type tail
    # your tail config
  </source>
</worker>
```
Error: `parsing failed`
Translation: Your config syntax is fucked but Fluentd won't tell you where
Fix: Use `fluentd --dry-run -c your-config.conf` to validate

Error: `Permission denied - /var/log/fluentd/buffer`
Translation: The process can't write to the buffer directory
Fix: `chown -R fluentd:fluentd /var/log/fluentd` or run as root (not recommended)
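A quick way to check who the daemon runs as and whether it can actually write there (the fluentd user/group is an assumption; some packages use td-agent or _fluentd):

```bash
# Which user is Fluentd running as?
ps -o user= -p "$(pgrep -f fluentd | head -1)"

# Can that user write to the buffer directory? (user "fluentd" is an assumption)
sudo -u fluentd touch /var/log/fluentd/buffer/.writetest \
  && echo "writable" \
  && sudo rm /var/log/fluentd/buffer/.writetest

# If not, fix ownership:
sudo chown -R fluentd:fluentd /var/log/fluentd
```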
The 3AM Debug Workflow
When Fluentd breaks at 3AM and you need it fixed NOW:
- Check if it's actually running: `ps aux | grep fluentd`
- Look at recent logs: `tail -100 /var/log/fluentd/fluentd.log`
- Check memory and disk: `free -h && df -h`
- Validate config: `fluentd --dry-run -c /etc/fluentd/fluent.conf`
- Nuclear restart: Stop, kill, clear buffers, start
- Monitor for 5 minutes: `watch 'ps aux | grep fluentd && free -h'`
Pro tip: Keep a working minimal config file handy. When shit's broken, switch to minimal config first, then add complexity back piece by piece.
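Something like this is the kind of minimal config worth keeping around - it only proves Fluentd itself is healthy (the forward source and its default port are assumptions; swap in whatever your real inputs are once it's stable):

```
# minimal.conf - sanity-check config, not for production.
# Accepts forward traffic on the default port and dumps everything to stdout.
<source>
  @type forward
  bind 127.0.0.1
  port 24224
</source>

<match **>
  @type stdout
</match>
```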
Performance Numbers That Matter
From actual production deployments:
- Single worker: ~3-4K events/sec before choking
- Multi-worker (2-3): ~8-10K events/sec
- Memory growth: 200-300MB with file buffers, 100-500MB with memory buffers
- CPU usage: 0.5-1.0 cores per worker under normal load
When to scale out: If you're hitting >5K events/sec consistently, add more Fluentd instances rather than more workers per instance. Check the high availability configuration guide for scaling strategies and the performance tuning documentation for optimization techniques.