Why This Combo Doesn't Suck (Unlike Most Security Tools)

I've deployed this combo four times now - twice for broke startups, once at a fintech that was hemorrhaging money to Datadog, and once more because that fintech deployment went so well they wanted it everywhere. Here's what actually works and what will make you want to throw your laptop.

Real talk: Falco catches the runtime stuff that logs completely miss, Prometheus won't fill up your entire SSD like ELK inevitably does, and Grafana dashboards actually load instead of timing out like every other monitoring tool. I'm running Falco 0.41 and Prometheus 3.x - the eBPF driver finally works reliably instead of randomly shitting the bed every Tuesday.

How The Pieces Actually Fit Together

Falco is the thing that actually catches security events. It hooks into the kernel with eBPF and screams when containers try to escape, processes escalate privileges, or some asshole starts mining bitcoin on your nodes. Since version 0.38 they finally got rid of that stupid sidecar - now it just dumps metrics directly to Prometheus like a normal tool.

Prometheus stores everything as time-series data instead of the log aggregation nightmare that is ELK. Instead of paying Splunk $15k/month to tell you about problems three days later, you get weeks of historical data that takes up maybe 50GB. No more "we can't afford to store logs from last month" conversations with management.

Grafana makes dashboards that people actually use instead of rainbow vomit charts. The official Falco dashboard exists but it's optimized for demo environments, not real production workloads. Plan to spend a week unfucking the queries.

Why This Beats Paying Some Vendor $100k/Year

Most security tools are like smoke detectors that only go off after your house burns down. This catches shit while it's actually happening.

Real-time Detection: Falco's eBPF hooks catch container breakouts and privilege escalations within seconds. The modern driver works reliably now - if you're stuck on Ubuntu 18.04 or some ancient kernel, prepare for random crashes and missing events. Seriously, just upgrade already.

Historical Data That Won't Bankrupt You: Prometheus stores months of security events in maybe 50-100GB. Send that same volume to Splunk and watch your CFO have a panic attack when the bill shows up.

Dashboards That Actually Help: Grafana shows information that helps you fix problems instead of pretty circles that impress executives. You'll still waste two weeks getting the alerts right because nobody ships useful defaults.

What You Actually Get vs. Commercial Tools

What works:

  • Container escape detection that actually fucking works (unlike agent-based garbage that containers just ignore)
  • File system monitoring that catches crypto miners before they eat your entire CPU budget
  • Privilege escalation alerts that aren't buried under 50,000 SSH login events
  • Total cost: whatever you spend on servers + your time, not $60k/month in licensing fees

What doesn't:

  • Network monitoring is basically useless - you'll need something else for that
  • Application-level attacks? Good luck, this won't help
  • Compliance reports look like ass compared to enterprise SIEMs
  • When it breaks, there's no vendor support to blame - just you and GitHub issues

Real Production Experience

I've deployed this on clusters ranging from tiny 5-node dev environments to massive 600+ node production monstrosities. Works fine until you hit resource limits and Falco starts silently dropping events - learned that one the hard way during a security incident. Resource usage is pretty reasonable though - database nodes eat around 400MB RAM, web servers more like 150MB, and CI/CD build agents spike to 800MB+ during builds.

This stack catches privilege escalations, container breakouts, and weird file access that log-based monitoring completely misses. But don't kid yourself - sophisticated attackers know how to avoid syscall detection. This catches script kiddies and crypto miners, not nation-state actors.

Anyway, enough theory. Time to actually deploy this thing and deal with all the ways it will inevitably break.

How to Deploy This Without Completely Fucking Everything

This is how I finally got it working after taking down prod twice and spending an entire weekend debugging why Prometheus was eating terabytes of disk space. The first time I half-assed it and ignored the docs. The second time I actually read them but still missed critical shit. By the fourth deployment it finally worked because I'd learned from being an idiot.

Prerequisites (The Shit That Will Break If You Skip It)

Don't be like me and skip these. I learned the hard way:

  • Kubernetes 1.24+ or Docker (don't use Docker Swarm, it's dead)
  • Kernel 4.18+ for eBPF - but seriously use 5.8+ or the modern eBPF probe will randomly fail
  • At least 100GB storage for Prometheus, but I'd start with 200GB because you'll always need more (learned this during a security incident when we ran out of space)
  • Network policies that let Prometheus reach Falco's metrics port - overly strict policies will block scraping and you'll wonder why nothing works (example policy below)
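
If you're running default-deny policies, this is roughly the allow rule you need - a sketch that assumes Prometheus lives in a monitoring namespace and Falco in falco-system; adjust the namespaces and labels to your setup:

## allow-prometheus-to-falco.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-prometheus-scrape
  namespace: falco-system
spec:
  podSelector: {}                    # every Falco pod in the namespace
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: monitoring   # assumes Prometheus runs here
      ports:
        - protocol: TCP
          port: 8765                 # Falco's default metrics port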

Pro tip: If you have Kubernetes, use it. The Falco Helm charts actually work, which is shocking considering most Helm charts are broken trash maintained by people who've never run production systems. Bare metal deployments work too but you'll be hand-crafting service discovery configs like some kind of infrastructure caveman.

Phase 1: Deploy Falco (Prepare for It to Break)

Step 1: Install Falco with Prometheus metrics. This config actually works:

## falco-values.yaml - don't copy from random blogs, they're all wrong
falco:
  grpc:
    enabled: true
  grpc_output:
    enabled: true
  http_output:
    enabled: false  # disable if not using falcosidekick, causes issues

## Prometheus metrics (works since 0.38, stable in 0.41.0)
metrics:
  enabled: true
  interval: 30s  # don't use 1h like the docs say, you'll miss events
  resource_utilization:
    enabled: true
  rules_counters:
    enabled: true
  base_syscalls:
    enabled: false  # generates too much noise in prod

Step 2: Deploy and immediately troubleshoot when it fails:

helm repo add falcosecurity https://falcosecurity.github.io/charts
helm install falco falcosecurity/falco \
  --namespace falco-system \
  --create-namespace \
  --values falco-values.yaml
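
Before moving on, check that the pods came up and the metrics endpoint actually answers. A quick sanity check - the label selector is an assumption based on the chart's defaults, adjust if yours differ:

## pods should be Running on every node
kubectl get pods -n falco-system
## grab one pod and hit the metrics endpoint on the default port 8765
FALCO_POD=$(kubectl get pods -n falco-system -l app.kubernetes.io/name=falco -o name | head -n 1)
kubectl port-forward -n falco-system "$FALCO_POD" 8765:8765 &
sleep 2 && curl -s http://localhost:8765/metrics | grep falco_events_total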

When it inevitably breaks:

  • driver loading failed - your kernel is too old or missing BTF support. Ubuntu 18.04 users are particularly fucked here
  • permission denied accessing /proc - SELinux is cockblocking everything, just set privileged: true and move on
  • pod stays in Pending - you forgot to create the namespace first because the --create-namespace flag is a lying piece of shit

Version 0.41 finally fixed that bug where metrics would randomly stop working for no reason. If you're on older versions, upgrade immediately or accept that you'll miss critical events during actual incidents.

Phase 2: Configure Prometheus (Without It Eating Your Disk)

Prometheus configuration is where most people fuck up. Here's what actually works without consuming terabytes of storage:

## prometheus-config.yaml - the version that works in production
global:
  scrape_interval: 30s  # 30s globally is plenty; the Falco job below overrides to 15s
  evaluation_interval: 30s

scrape_configs:
  - job_name: 'falco'
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names: ['falco-system']
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_name]
        action: keep
        regex: falco.*
      - source_labels: [__address__]
        action: replace
        target_label: __address__
        regex: (.+):.*
        replacement: $1:8765  # this port needs to be open
    scrape_interval: 15s  # security metrics need to be frequent
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: 'falco_k8s_audit.*'  # drop k8s audit metrics, too noisy
        action: drop

The metrics that actually matter:

  • falco_events_total - how many security events (if this is 0, something's broken)
  • falco_outputs_queue_size - when this gets high, you're dropping events
  • falco_kernel_module_loaded - 0 means your driver failed to load
  • falco_rules_loaded - sanity check that rules are working

Pro tip: Stop trying to scrape falcosidekick unless you actually deployed it. Half the tutorial configs on the internet assume you're running it when you're probably not.
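
Once those metrics are flowing, wire alerts to them directly. A minimal sketch of Prometheus alerting rules - the file name and thresholds are assumptions, tune them to your baseline:

## falco-alerts.yaml - load it via rule_files in prometheus.yml
groups:
  - name: falco
    rules:
      - alert: FalcoOutputQueueBackingUp
        expr: falco_outputs_queue_size > 1000   # threshold is a guess - tune to your baseline
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Falco output queue backing up on {{ $labels.instance }} - events may be dropping"
      - alert: FalcoScrapeDown
        expr: up{job="falco"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Prometheus can't scrape Falco on {{ $labels.instance }}"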

Phase 3: Grafana (The Part That Takes Forever to Get Right)

The official Falco dashboard exists but it looks like someone designed it in 2015 and the queries are optimized for demo environments where nothing actually happens.

Import process (because it's never straightforward):

  1. Navigate to Grafana UI (pray it loaded this time)
  2. Import dashboard 17319 from grafana.com
  3. Watch half the panels error out because the queries don't match your environment
  4. Spend the next 4 hours fixing queries that should have worked out of the box
  5. Build alerts that don't fire every time someone runs sudo

Queries that actually work in production:

## Events per minute (not per second, that's too granular)
rate(falco_events_total[1m]) * 60

## Queue size that indicates problems
falco_outputs_queue_size > 1000

## Rules that are actually firing (filter out test rules)
increase(falco_events_total{rule!~".*test.*"}[5m])

Testing (Use This to Confirm You're Not Just Watching a Broken System)

Don't trust that it works until you test it. The Falco event generator actually triggers security events:

## This will generate real security events to test your setup
kubectl run falco-event-generator \
  --image=falcosecurity/event-generator:latest \
  --rm -it --restart=Never -- run syscall

What should happen (and what fails):

  1. Event generator triggers container escapes and file system events
  2. Falco detects them and increments falco_events_total metric
  3. Prometheus scrapes the new metric values (check the targets page)
  4. Grafana shows the events in dashboards (refresh manually, auto-refresh is broken)
  5. Alert rules fire and spam your Slack channel

If nothing happens: Check the Falco logs with kubectl logs -n falco-system daemonset/falco. 90% of the time it's either the driver failed to load or some permission got fucked up.
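
If the logs look clean but you still don't trust it, check the counter directly in the Prometheus UI - metric name straight from the Phase 2 list:

## should be non-zero within a scrape interval or two of running the generator
increase(falco_events_total[5m])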

Performance Tuning (Or: How to Not Kill Your Cluster)

Resource allocation that works in the real world (Helm values sketch below):

  • Start with 500MB RAM per node, tune down based on actual usage
  • 0.2-0.5 CPU cores - database nodes need more, web servers need less
  • If falco_outputs_queue_size stays above 1000, you need more resources or better buffer tuning
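
Translated into the chart's standard resources block, that's roughly this - a starting point based on the numbers above, not gospel:

## falco-values.yaml - add alongside the Phase 1 values
resources:
  requests:
    cpu: 200m
    memory: 512Mi
  limits:
    cpu: 1000m
    memory: 1Gi        # raise for database and CI nodes before they get OOMKilled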

Storage planning so your CFO doesn't murder you:

  • Small cluster (10-50 nodes): maybe a few GB per month, depends on how noisy your workloads are
  • Medium cluster (50-200 nodes): 15-30GB per month if you tune the rules properly
  • Large cluster (200+ nodes): 80-150GB+ per month, definitely start sampling at this point or your disk will explode

Grafana optimization (because waiting 30 seconds for a dashboard to load is bullshit):

  • Use dashboard variables to filter by node/namespace
  • Don't query more than 24 hours of data without aggregation unless you like watching spinners
  • Enable query caching or prepare for dashboards that take forever to load

The Gotchas That Will Bite You

eBPF driver pain: Older kernels (Ubuntu 18.04, CentOS 7) don't support the modern eBPF stuff. Falco falls back to kernel modules, which need kernel headers that nobody ever installs by default. Cue three hours of package hunting.

Network policy nightmare: Strict network policies block port 8765. Symptoms: all Prometheus targets show as down, no metrics anywhere, and you'll spend 6 hours troubleshooting before realizing the firewall is eating everything.

Alert spam hell: Default rules trigger on every sudo command and container restart. Day one you'll get 50,000 alerts about normal system behavior. Budget a month minimum for rule tuning or your team will revolt.
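
One way to start that tuning if you're on the Helm chart: ship an override file through the chart's customRules value and explicitly disable the rules you've decided to ignore. The rule name below is just an example from the default ruleset, and the disable syntax has shifted between Falco releases - check the rules docs for your version:

## add to falco-values.yaml - drops an extra rules file into /etc/falco/rules.d
customRules:
  tuning-overrides.yaml: |-
    # example only - swap in whatever rule is actually spamming you, and document why
    - rule: Terminal shell in container
      enabled: false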

Silent death: Resource limits cause event dropping with zero obvious symptoms. Watch falco_outputs_queue_size like your job depends on it - when it spikes, you're missing the actual security events you deployed this to catch.

This setup works in production if you actually tune it. Most teams deploy it, get flooded with alerts about every sudo command, give up after a week, and go back to paying Splunk. Don't be those teams.

Bottom line: Follow this guide, spend a month tuning the noise out, and you'll have security monitoring that catches real threats instead of charging you per gigabyte. The "free" part is licensing - you'll still pay with your time and sanity.

Reality check: This saves you $200-400k per year compared to enterprise vendors while catching runtime container threats that SIEMs completely miss. But it only works if you have engineers who understand Kubernetes, can debug eBPF issues, and won't quit when the initial alert flood hits.

That's the deployment reality. Now let's talk costs and what else will probably break.

Security Monitoring Integration Approaches Comparison

| Approach | Setup Complexity | Performance Impact | Scalability | Cost | Best For |
|---|---|---|---|---|---|
| Falco + Prometheus + Grafana | Medium (2-3 days setup) | Low-Medium (1-5% CPU, 200-500MB RAM per node) | High (horizontal scaling) | Free (infrastructure only) | Teams wanting comprehensive open-source security monitoring with full control |
| Sysdig Secure | Low (< 1 day) | Low (same engine as Falco) | Very High | $35-50/node/month | Organizations wanting Falco capabilities without operational overhead |
| Datadog Security | Very Low (< 2 hours) | Low | Very High | $15/host additional to existing Datadog | Teams already using the Datadog ecosystem |
| Splunk Security | High (1-2 weeks) | Medium | High | $100+/node/month | Enterprise environments with dedicated security teams |
| Elastic Security | Medium-High (3-5 days) | Medium-High | High | $95/node/month | Organizations with existing ELK stack |

Frequently Asked Questions

Q

How reliable is Falco's Prometheus metrics integration?

A

Way better than it used to be.

Version 0.38 was the first release that didn't randomly break every few days, and 0.41 finally fixed that bullshit bug where multiple event sources would kill metric collection for no reason. In production it's been solid for me - uptime is basically perfect unless I fuck up the resource limits, which I've definitely done more than once.

Red flag: when falco_outputs_queue_size stays high for more than a few minutes. That means events are backing up and you're missing actual security alerts while staring at your pretty dashboards thinking everything's fine.
Q

What's the actual performance impact on production systems?

A

Database nodes eat around 400MB RAM and maybe 3-5% CPU when shit gets busy. Web servers are more like 150-200MB and 1% CPU most of the time. Way better than those bloated commercial agents that consume half your server. eBPF runs in kernel space so it's actually efficient, unlike userspace log parsing garbage. But syscall-heavy workloads (databases, CI builds) will see more overhead than boring static file servers.

Real numbers from my deployments:

  • PostgreSQL: 3-4% CPU hit during peak (millions of queries/hour)
  • Redis cache: barely 1% CPU, steady 350MB RAM
  • Node.js APIs: 1-2% CPU, scales with traffic volume
  • CI build agents: 8-12% CPU during builds, almost nothing when idle

Pro tip: Start with way more resources than you think you need and tune down. The tuning docs help, but your workload is definitely different from their toy examples. Watch the ratio of falco_kernel_evts_total vs falco_events_total to see how noisy your environment actually is.
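
If you want that ratio on a dashboard, something like this works - assuming kernel event counters are enabled in the metrics config so falco_kernel_evts_total actually exists:

## rule hits per raw kernel event - a rough noise gauge for your environment
rate(falco_events_total[5m]) / rate(falco_kernel_evts_total[5m])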
Q

How long does this integration take to implement properly?

A

Basic "hello world" setup takes 2-3 days if nothing breaks. Getting it production-ready takes 3-4 weeks, and here's why:

  • Tuning rules so you don't get 50,000 alerts per day (this is the worst part)
  • Fixing dashboards so they show useful shit instead of marketing demo graphs
  • Setting alert thresholds that actually matter instead of firing constantly
  • Performance testing so you don't accidentally kill your cluster on Monday morning

If you already know Kubernetes and Prometheus, maybe 2-3 weeks. If you're new to monitoring stacks, add another week minimum because you'll be learning three tools simultaneously while trying not to break production.
Q

Can this replace our existing commercial SIEM?

A

Short answer: probably not entirely, but it's way better at catching the stuff that actually matters in containerized environments.

What it's great at:

  • Container breakouts, privilege escalations, file system fuckery, catching crypto miners before they eat your entire AWS bill

What it sucks at:

  • Network traffic analysis, application-level attacks, threat intelligence bullshit, pretty compliance reports that make auditors happy

Most teams I've worked with deploy this for Kubernetes runtime security and keep their existing SIEM for everything else. Falco catches the runtime container threats that traditional SIEMs are completely blind to.
Q

What happens when Falco generates thousands of alerts?

A

Day one you'll get absolutely flooded. Default rules trigger on every sudo command, container restart, and normal system behavior. It's fucking brutal.

Here's how to not go insane:

  • Prometheus aggregates the chaos - instead of 50k individual alerts, you get useful trends
  • Grafana shows patterns, not spam - way more helpful than alert fatigue
  • Alertmanager groups related shit - so you don't get pinged 500 times about the same container restart

Start with only the most critical rules enabled. Add more gradually after your team stops developing alert blindness. Take it slow or they'll revolt and disable everything.
Q

How do we handle the data retention and storage costs?

A

Prometheus metrics take way less storage than raw logs, which is why this doesn't bankrupt you like Splunk.

Storage reality check:

  • Small environment (10 nodes): 2-4GB per month, depending on how chatty your apps are
  • Medium environment (100 nodes): 15-25GB per month if you tune rules properly
  • Large environment (1000 nodes): 80-150GB+ per month, and you'll definitely need tiered retention

Set up retention tiers: 15 days of high-resolution data, 90 days aggregated, 1 year of summary metrics. For long-term storage consider remote backends, but honestly most teams never look at data older than 6 months anyway.
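
On a single Prometheus the high-resolution tier is just startup flags - a sketch using the numbers above (the aggregated and summary tiers need recording rules or a remote-write backend, not a flag):

## Prometheus launch flags - cap the high-res tier by time and disk
--storage.tsdb.retention.time=15d
--storage.tsdb.retention.size=100GB   # whichever limit hits first wins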
Q

What's the learning curve for security teams?

A

Totally depends on your background:

  • DevOps teams: 1-2 weeks to get productive - you already know the stack
  • Security teams new to Kubernetes: 4-6 weeks - you need to learn k8s first, and it's a lot
  • Traditional security analysts: 2-3 weeks figuring out Grafana and understanding what the hell the dashboards are showing

Give security folks read-only Grafana access first so they can click around without accidentally breaking production. Only give them edit permissions after they stop asking "where's the SIEM interface?"
Q

How does this integration scale across multiple clusters?

A

A few different approaches, all with tradeoffs:

  • Federated Prometheus: a central instance pulls from all your clusters. Works fine up to maybe 10-15 clusters, then federation becomes a nightmare to debug.
  • Central Grafana: a single Grafana connects to multiple Prometheus instances. Scales better and gives you one dashboard to rule them all.
  • Managed services: let your cloud provider handle the Prometheus/Grafana scaling so you can focus on not getting fired when security incidents happen.
Q

What are the biggest implementation gotchas?

A

eBPF driver pain:

  • Modern eBPF shits the bed on older kernels. Always configure kernel module fallback or you're fucked.
  • Ubuntu 18.04: Missing BTF support, need to install linux-modules-extra
  • CentOS 7: Kernel 3.10 is too old, either upgrade or use kernel modules
  • Quick check: ls /sys/kernel/btf/vmlinux - if missing, modern eBPF is dead

Network policy hell:

  • Strict policies block port 8765 for metric scraping, everything shows as down.
  • Falco exposes metrics on port 8765 by default
  • Prometheus needs to reach every Falco pod on this port
  • Symptom: All targets down in Prometheus, zero metrics collected
  • Fix: Network policy allowing prometheus → falco-system:8765

Resource starvation death:

  • Under-allocate memory and events get dropped silently.
  • Memory too low: falco_outputs_queue_size spikes, you're missing security events
  • CPU too low: eBPF can't keep up, same disaster
  • OOMKilled pods: Double memory limits immediately

Rule drift disaster:

  • Teams disable noisy rules, don't document why, then months later wonder why attacks aren't getting caught.
  • Document every goddamn rule change
  • Version control your Falco rules like actual code
  • Audit your security coverage or you'll discover holes during actual incidents
Q

How do we integrate this with existing alerting systems?

A

Several ways to get alerts out without driving everyone insane:

  • Prometheus Alertmanager: good for infrastructure alerts based on thresholds - basic but reliable
  • Grafana alerts: more flexible, better UI, easier to configure complex rules
  • Direct webhooks: straight to Slack, PagerDuty, or whatever incident management circus you're running

Most teams route security alerts through Grafana (better context, fewer false positives) and infrastructure alerts through Alertmanager (simpler rules, less overhead). That keeps the security team from getting spammed about CPU usage and the DevOps team from getting woken up about every sudo command.
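
If Alertmanager ends up handling part of the routing, the split looks roughly like this - a sketch that assumes a team label on your alert rules and Slack receivers (channel names and webhook URLs are placeholders):

## alertmanager.yml - keep security alerts out of the general infra channel
route:
  receiver: devops-default
  group_by: ['alertname', 'namespace']
  routes:
    - matchers:
        - 'team="security"'          # set this label on your Falco alert rules
      receiver: security-slack
receivers:
  - name: devops-default
    slack_configs:
      - channel: '#infra-alerts'
        api_url: 'https://hooks.slack.com/services/REPLACE_ME'
  - name: security-slack
    slack_configs:
      - channel: '#security-alerts'
        api_url: 'https://hooks.slack.com/services/REPLACE_ME'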
