

This is how I finally got it working after taking down prod twice and spending an entire weekend debugging why Prometheus was eating terabytes of disk space. The first time I half-assed it and ignored the docs. The second time I actually read them but still missed critical shit. It took until the fourth deployment to get it right, because that's how long it took me to stop being an idiot.
Prerequisites (The Shit That Will Break If You Skip It)
Don't be like me and skip these. I learned the hard way:
- Kubernetes 1.24+ or Docker (don't use Docker Swarm, it's dead)
- Kernel 4.18+ for eBPF - but seriously use 5.8+ or the modern eBPF probe will randomly fail (quick node check below)
- At least 100GB storage for Prometheus, but I'd start with 200GB because you'll always need more (learned this during a security incident when we ran out of space)
- Network policies that allow metric scraping on port 8765 - a strict deny-all policy will block it and you'll wonder why nothing works
Pro tip: If you have Kubernetes, use it. The Falco Helm charts actually work, which is shocking considering most Helm charts are broken trash maintained by people who've never run production systems. Bare metal deployments work too but you'll be hand-crafting service discovery configs like some kind of infrastructure caveman.
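Before you burn a deploy finding out the hard way, sanity-check a node or two. Nothing Falco-specific here, just plain Linux, so run it on an actual node rather than your laptop:
## Run this on a representative node
uname -r                            # need 4.18+, ideally 5.8+ for the modern eBPF probe
ls /sys/kernel/btf/vmlinux          # file exists = kernel has BTF, modern eBPF should work
ls /lib/modules/$(uname -r)/build   # kernel headers, only needed if you fall back to the kernel module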
Phase 1: Deploy Falco (Prepare for It to Break)
Step 1: Install Falco with Prometheus metrics. This config actually works:
## falco-values.yaml - don't copy from random blogs, they're all wrong
falco:
  grpc:
    enabled: true
  grpc_output:
    enabled: true
  http_output:
    enabled: false # disable if not using falcosidekick, causes issues
  webserver:
    prometheus_metrics_enabled: true # without this there's nothing to scrape on 8765
  ## Prometheus metrics (works since 0.38, stable in 0.41.0)
  metrics:
    enabled: true
    interval: 30s # don't use 1h like the docs say, you'll miss events
    resource_utilization_enabled: true
    rules_counters_enabled: true
    base_syscalls:
      enabled: false # generates too much noise in prod
Step 2: Deploy and immediately troubleshoot when it fails:
helm repo add falcosecurity https://falcosecurity.github.io/charts
helm install falco falcosecurity/falco \
--namespace falco-system \
--create-namespace \
--values falco-values.yaml
When it inevitably breaks:
- driver loading failed - your kernel is too old or missing BTF support. Ubuntu 18.04 users are particularly fucked here
- permission denied accessing /proc - SELinux is cockblocking everything, just set privileged: true and move on
- pod stays in Pending - check node resources and taints first, or you forgot to create the namespace because the --create-namespace flag is a lying piece of shit
Version 0.41 finally fixed that bug where metrics would randomly stop working for no reason. If you're on older versions, upgrade immediately or accept that you'll miss critical events during actual incidents.
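Before moving on, prove the metrics endpoint is actually serving something. A rough sketch, assuming chart defaults (pods labeled app.kubernetes.io/name=falco, web server on 8765); if your chart version labels things differently, adjust the selector:
## Is the DaemonSet up and did the driver load?
kubectl -n falco-system get pods -o wide
kubectl -n falco-system logs daemonset/falco --tail=50

## Can you actually pull metrics off port 8765?
FALCO_POD=$(kubectl -n falco-system get pods -l app.kubernetes.io/name=falco -o jsonpath='{.items[0].metadata.name}')
kubectl -n falco-system port-forward "$FALCO_POD" 8765:8765 &
sleep 2
curl -s http://localhost:8765/metrics | grep -c '^falco'   # anything above 0 means the endpoint is alive
kill $!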
Phase 2: Prometheus (Where Most People Fuck Up)
Here's the configuration that actually works without consuming terabytes of storage:
## prometheus-config.yaml - the version that works in production
global:
  scrape_interval: 30s # 30s is plenty as a global default, 15s everywhere is overkill
  evaluation_interval: 30s
scrape_configs:
  - job_name: 'falco'
    scrape_interval: 15s # but scrape Falco itself more often, these are the metrics you alert on
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names: ['falco-system']
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_name]
        action: keep
        regex: falco.*
      - source_labels: [__address__]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?
        replacement: $1:8765 # this port needs to be open
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: 'falco_k8s_audit.*' # drop k8s audit metrics, too noisy
        action: drop
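Before you reload Prometheus and wonder why targets vanished, let promtool yell at you, then confirm the falco job is actually healthy. $PROM_URL is a placeholder for wherever your Prometheus API answers, and jq is assumed to be installed:
## Validate the config file itself
promtool check config prometheus-config.yaml

## After a reload, check the falco targets are up
curl -s "$PROM_URL/api/v1/targets" | \
  jq '.data.activeTargets[] | select(.labels.job=="falco") | {instance: .labels.instance, health: .health, lastError: .lastError}'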
The metrics that actually matter (alert sketch for these below):
- falco_events_total - how many security events (if this is 0, something's broken)
- falco_outputs_queue_size - when this gets high, you're dropping events
- falco_kernel_module_loaded - 0 means your driver failed to load
- falco_rules_loaded - sanity check that rules are working
Pro tip: Stop trying to scrape falcosidekick unless you actually deployed it. Half the tutorial configs on the internet assume you're running it when you're probably not.
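Since those are the metrics worth watching, wire alerts to them directly. A minimal sketch only; the thresholds are starting points and the metric names are the ones listed above, so check them against what your Falco version actually exposes before trusting any of this:
## falco-alerts.yaml - reference it under rule_files: in prometheus-config.yaml
cat > falco-alerts.yaml <<'EOF'
groups:
  - name: falco
    rules:
      - alert: FalcoDriverNotLoaded
        expr: falco_kernel_module_loaded == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Falco driver failed to load on {{ $labels.instance }}"
      - alert: FalcoDroppingEvents
        expr: falco_outputs_queue_size > 1000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Falco output queue is backing up, events are about to be dropped"
      - alert: FalcoSilent
        # won't fire if the metric is absent entirely - pair with absent() if you're paranoid
        expr: sum(rate(falco_events_total[30m])) == 0
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "No Falco events at all - either nothing is happening or Falco is broken"
EOF
promtool check rules falco-alerts.yaml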
Phase 3: Grafana (The Part That Takes Forever to Get Right)

The official Falco dashboard exists but it looks like someone designed it in 2015 and the queries are optimized for demo environments where nothing actually happens.
Import process (because it's never straightforward):
- Navigate to Grafana UI (pray it loaded this time)
- Import dashboard 17319 from grafana.com
- Watch half the panels error out because the queries don't match your environment
- Spend the next 4 hours fixing queries that should have worked out of the box
- Build alerts that don't fire every time someone runs sudo
Queries that actually work in production:
## Events per minute (not per second, that's too granular)
rate(falco_events_total[1m]) * 60
## Queue size that indicates problems
falco_outputs_queue_size > 1000
## Rules that are actually firing (filter out test rules)
increase(falco_events_total{rule!~".*test.*"}[5m])
Testing (Use This to Confirm You're Not Just Watching a Broken System)

Don't trust that it works until you test it. The Falco event generator actually triggers security events:
## This will generate real security events to test your setup
kubectl run falco-event-generator \
--image=falcosecurity/event-generator:latest \
--rm -it --restart=Never -- run syscall
What should happen (and what fails):
- Event generator triggers container escapes and file system events
- Falco detects them and increments the falco_events_total metric
- Prometheus scrapes the new metric values (check the targets page)
- Grafana shows the events in dashboards (refresh manually, auto-refresh is broken)
- Alert rules fire and spam your Slack channel
If nothing happens: Check the Falco logs with kubectl logs -n falco-system daemonset/falco. 90% of the time it's either the driver failed to load or some permission got fucked up.
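To prove events made it all the way through to Prometheus and not just into Falco's logs, hit the query API after the generator finishes. $PROM_URL is again a placeholder, and jq is assumed:
## Should return non-zero values if the pipeline works end to end
curl -s "$PROM_URL/api/v1/query" \
  --data-urlencode 'query=increase(falco_events_total[10m])' | jq '.data.result'
## Empty result or all zeros = Falco saw nothing, go back to the driver and permission checks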
Resource allocation that works in the real world (helm snippet for this below):
- Start with 500MB RAM per node, tune down based on actual usage
- 0.2-0.5 CPU cores - database nodes need more, web servers need less
- If falco_outputs_queue_size stays above 1000, you need more resources or better buffer tuning
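Translating those numbers into the Helm release, assuming the chart's standard resources value (check your chart version's values.yaml); the limits here are guesses, watch actual usage before trusting them:
helm upgrade falco falcosecurity/falco \
  --namespace falco-system \
  --reuse-values \
  --set resources.requests.cpu=200m \
  --set resources.requests.memory=512Mi \
  --set resources.limits.cpu=500m \
  --set resources.limits.memory=1Gi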
Storage planning so your CFO doesn't murder you:
- Small cluster (10-50 nodes): maybe a few GB per month, depends on how noisy your workloads are
- Medium cluster (50-200 nodes): 15-30GB per month if you tune the rules properly
- Large cluster (200+ nodes): 80-150GB+ per month, definitely start sampling or capping retention at this point (flags below) or your disk will explode
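Whatever the monthly number ends up being, cap retention by both time and size so one noisy week can't fill the disk. These are the stock Prometheus flags; if you run kube-prometheus-stack, set the equivalent prometheusSpec.retention / retentionSize values instead:
prometheus \
  --config.file=prometheus-config.yaml \
  --storage.tsdb.path=/var/lib/prometheus \
  --storage.tsdb.retention.time=30d \
  --storage.tsdb.retention.size=150GB   # whichever limit hits first wins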
Grafana optimization (because waiting 30 seconds for a dashboard to load is bullshit):
- Use dashboard variables to filter by node/namespace
- Don't query more than 24 hours of data without aggregation unless you like watching spinners (recording rule sketch below)
- Enable query caching or prepare for dashboards that take forever to load
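For the aggregation point above, a recording rule keeps long-range panels cheap. A sketch only; it assumes the rule label you saw in the queries earlier, so group by whatever labels you actually have:
## falco-recording-rules.yaml - also goes under rule_files: in prometheus-config.yaml
cat > falco-recording-rules.yaml <<'EOF'
groups:
  - name: falco-recording
    interval: 1m
    rules:
      - record: falco:events:rate5m
        expr: sum by (rule) (rate(falco_events_total[5m]))
EOF
promtool check rules falco-recording-rules.yaml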
The Gotchas That Will Bite You
eBPF driver pain: Older kernels (Ubuntu 18.04, CentOS 7) don't support the modern eBPF stuff. Falco falls back to kernel modules, which need kernel headers that nobody ever installs by default. Cue three hours of package hunting.
Network policy nightmare: Strict network policies block port 8765. Symptoms: all Prometheus targets show as down, no metrics anywhere, and you'll spend 6 hours troubleshooting before realizing the firewall is eating everything.
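If that's your situation, an explicit allow rule fixes it. A sketch with assumptions baked in: Prometheus living in a monitoring namespace and the standard chart labels on the Falco pods; swap in whatever your cluster actually uses:
kubectl apply -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-prometheus-to-falco
  namespace: falco-system
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: falco   # assumed chart label
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: monitoring   # assumed Prometheus namespace
      ports:
        - protocol: TCP
          port: 8765
EOF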
Alert spam hell: Default rules trigger on every sudo command and container restart. Day one you'll get 50,000 alerts about normal system behavior. Budget a month minimum for rule tuning or your team will revolt.
Silent death: Resource limits cause event dropping with zero obvious symptoms. Watch falco_outputs_queue_size like your job depends on it - when it spikes, you're missing the actual security events you deployed this to catch.
This setup works in production if you actually tune it. Most teams deploy it, get flooded with alerts about every sudo command, give up after a week, and go back to paying Splunk. Don't be those teams.
Bottom line: Follow this guide, spend a month tuning the noise out, and you'll have security monitoring that catches real threats instead of charging you per gigabyte. The "free" part is licensing - you'll still pay with your time and sanity.
Reality check: This saves you $200-400k per year compared to enterprise vendors while catching runtime container threats that SIEMs completely miss. But it only works if you have engineers who understand Kubernetes, can debug eBPF issues, and won't quit when the initial alert flood hits.
That's the deployment reality. Now let's talk costs and what else will probably break.