
After deploying security monitoring, you'll quickly learn that "it works in staging" has about as much predictive power as a weather forecast. Production load will find every weakness in your setup within 24 hours, usually during your lunch break. Here's how to optimize for reality and fix the disasters that always strike at the worst possible moment.
High-Cardinality Metrics Problem
Prometheus will eat your storage budget alive if you don't control metric cardinality. I've seen 2TB disappear in a weekend because someone deployed a service that exports pod IP addresses as metrics. Here's what breaks and how to fix it:
```yaml
# prometheus-optimization.yaml
global:
  scrape_interval: 30s          # Increase from the 15s default
  evaluation_interval: 30s

# Metric relabeling to reduce cardinality -- metric_relabel_configs belongs under
# each scrape job, not under global (targets/service discovery omitted here)
scrape_configs:
  - job_name: 'falco'
    metric_relabel_configs:
      # Drop high-cardinality Falco metrics
      - source_labels: [__name__]
        regex: 'falco_k8s_audit.*'
        action: drop
  - job_name: 'gatekeeper'
    metric_relabel_configs:
      # Keep only essential Gatekeeper metrics
      - source_labels: [__name__]
        regex: 'gatekeeper_violations_total|gatekeeper_audit_total'
        action: keep
  - job_name: 'kubernetes-cadvisor'
    metric_relabel_configs:
      # Aggregate container metrics by namespace instead of pod
      - source_labels: [__name__, container]
        regex: 'container_.*;;'
        target_label: container
        replacement: 'aggregated'

# Storage optimization -- retention and WAL compression are CLI flags
# (or retention/retentionSize/walCompression on the Prometheus CR when using the operator):
#   --storage.tsdb.retention.time=15d     # Reduce from 30d for high-volume clusters
#   --storage.tsdb.retention.size=100GB
#   --storage.tsdb.wal-compression        # Compress the write-ahead log
```
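Before and after applying the relabeling, check which metrics actually dominate your series count. A quick way to do that is Prometheus's TSDB status API; the service name below assumes a kube-prometheus-stack install, so adjust it to yours (jq runs on your workstation):

```bash
# Port-forward to Prometheus (service name assumed from kube-prometheus-stack)
kubectl port-forward -n security-monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090 &

# Top metric names by series count -- these are the cardinality offenders worth dropping or aggregating
curl -s http://localhost:9090/api/v1/status/tsdb | \
  jq -r '.data.seriesCountByMetricName[] | "\(.value)\t\(.name)"' | head -10
```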
Falco Resource Optimization
Falco can consume significant CPU on busy nodes. Tune it for production load; the project's performance tuning and production deployment guides cover the details, but the settings below handle the common cases:
```yaml
# falco-production-tuning.yaml (Helm values)
driver:
  kind: ebpf
  # eBPF probe settings
  ebpf:
    hostNetwork: false

falco:
  # Syscall filtering to reduce overhead
  syscall_event_drops:
    max_burst: 1000
    rate: 1000                  # events per second
  # Buffer tuning for high-throughput environments
  base_syscalls:
    repair: false

# Performance-optimized rule overrides, shipped as a local rules file
customRules:
  falco_rules.local.yaml: |-
    # Disable expensive matching in hot paths
    - rule: Contact K8S API Server From Container
      condition: >
        kevt and container and
        not user_known_contact_k8s_api_server_activities and
        not proc.name in (kubectl, helm, terraform, argocd)
      enabled: false            # Re-enable after adding specific exceptions
    # Keep only essential security rules active
    - rule: Container with sensitive mount
      enabled: true
    - rule: Privilege escalation
      enabled: true
    - rule: Container Run as Root User
      enabled: true

# Resource limits based on node size
resources:
  limits:
    cpu: 200m                   # Increase to 500m for nodes >32 cores
    memory: 512Mi               # Increase to 1Gi for nodes >64GB RAM
  requests:
    cpu: 100m
    memory: 256Mi
```
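After rolling this out, watch for dropped syscall events; drops mean the rate limits and buffers above are still too tight for the node's workload. A quick check against the Falco logs (the label selector matches the one used in the verification script later in this section):

```bash
# Falco logs a "Falco internal: syscall event drop" line when the kernel buffer overflows
kubectl logs -n security-monitoring -l app.kubernetes.io/name=falco --tail=500 | \
  grep -i "syscall event drop" || echo "No syscall drops logged recently"
```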
OPA Gatekeeper Scaling
Gatekeeper can become a bottleneck during deployment storms. Scale it to match your cluster size and deployment volume:
```yaml
# gatekeeper-scaling.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gatekeeper-controller-manager
  namespace: gatekeeper-system
spec:
  replicas: 3                    # Scale based on cluster size: roughly 1 per 100 nodes
  template:
    spec:
      containers:
        - name: manager
          resources:
            limits:
              cpu: 1000m         # Increase for policy-heavy environments
              memory: 512Mi
            requests:
              cpu: 100m
              memory: 256Mi
          env:
            # Webhook timeout tuning
            - name: WEBHOOK_TIMEOUT
              value: "10"        # Seconds; increase if policies are complex
            # Constraint evaluation caching
            - name: CONSTRAINT_EVALUATIONS_CACHE_SIZE
              value: "1000"
            # Disable dry-run validation for performance
            - name: DISABLE_DRY_RUN_VALIDATION
              value: "true"
---
# Configure the webhook to handle load
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: gatekeeper-validating-webhook-configuration
webhooks:
  - name: validation.gatekeeper.sh
    admissionReviewVersions: ["v1", "v1beta1"]
    timeoutSeconds: 10           # Increase if policies are slow
    failurePolicy: Fail          # Change to Ignore for non-critical environments
    namespaceSelector:
      matchExpressions:
        # Exclude kube-system and monitoring namespaces
        - key: kubernetes.io/metadata.name
          operator: NotIn
          values: ["kube-system", "security-monitoring", "gatekeeper-system"]
```
Common Production Issues and Solutions
Issue 1: Falco Driver Loading Fails Because Managed Node Groups Are a Special Kind of Hell
Symptoms: no events being generated, `Failed to load eBPF probe` in the Falco logs, and you're wondering if bartending school is still accepting applications.
Debugging:
```bash
# Check kernel compatibility (spoiler: it's probably broken)
uname -r
ls /lib/modules/$(uname -r)/build    # Should exist, but probably doesn't

# Check eBPF support (breaks randomly on kernel updates)
kubectl exec -n security-monitoring falco-xxx -- falco --list-syscall-events
# This might hang for like 30 seconds, don't panic

# Check for BTF support (if this fails, you're probably on some ancient kernel)
ls /sys/kernel/btf/vmlinux           # Should exist on modern kernels, but who knows

# When eBPF inevitably gives up, fall back to the kernel module
# This command will probably fail twice before it works
kubectl patch daemonset falco -n security-monitoring --type='json' \
  -p='[{"op": "replace", "path": "/spec/template/spec/containers/0/env", "value": [{"name": "FALCO_DRIVER_KIND", "value": "module"}]}]'

# Wait like 2-3 minutes for the pods to restart, grab a coffee
```
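Once the pods restart, confirm the module driver actually loaded before walking away; if the module is missing too, the pods will crash-loop and you're back to fixing the node image. A quick check, assuming the standard chart labels:

```bash
# Pods should be Running, and the startup logs should mention the driver that was loaded
kubectl get pods -n security-monitoring -l app.kubernetes.io/name=falco
kubectl logs -n security-monitoring -l app.kubernetes.io/name=falco --tail=50 | grep -iE "driver|module"
```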
Issue 2: Prometheus Decides to Eat Your Entire Storage Budget While You Sleep
Symptoms: Prometheus is out of disk space, `Cannot write to WAL` errors in the logs, and your AWS bill just made the CFO very unhappy.
Emergency response (at 3am when everything breaks):
```bash
# Emergency cleanup (this might take 10-15 minutes)
kubectl exec prometheus-prometheus-kube-prometheus-prometheus-0 -n security-monitoring -- sh -c '
  # Delete old WAL segments (cross your fingers)
  find /prometheus -name "*.tmp" -delete
  find /prometheus -name "wal" -type d -exec rm -rf {}/000* \;
  # Compact old data (this part usually fails the first time)
  promtool tsdb create-blocks-from-data --max-block-duration=2h /prometheus
'

# Increase storage immediately (AWS takes forever to provision)
kubectl patch pvc prometheus-prometheus-kube-prometheus-prometheus-db-prometheus-prometheus-kube-prometheus-prometheus-0 \
  -n security-monitoring --type='json' \
  -p='[{"op": "replace", "path": "/spec/resources/requests/storage", "value": "500Gi"}]'

# Then wait 10-15 minutes for the volume to resize, because cloud storage
```
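To avoid doing this at 3am again, alert before the volume actually fills. A minimal sketch of a Prometheus rule, assuming node_exporter filesystem metrics cover the Prometheus data volume; the mountpoint label is an assumption, so match it to your storage layout:

```yaml
# prometheus-disk-pressure-rules.yaml (sketch -- adjust the mountpoint to your data volume)
groups:
  - name: security-monitoring-capacity
    rules:
      - alert: PrometheusDiskFillingUp
        # Extrapolate the last 6h of usage 12h into the future
        expr: |
          predict_linear(node_filesystem_avail_bytes{mountpoint="/prometheus"}[6h], 12 * 3600) < 0
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Prometheus data volume projected to fill within 12 hours"
```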
Issue 3: Gatekeeper Becomes the Overzealous Security Guard That Nobody Likes
Symptoms: deployments hanging forever, `Webhook timeout exceeded` errors, and a CI/CD pipeline about as useful as a chocolate teapot.
Quick fixes:
```bash
# Temporarily bypass Gatekeeper for emergency deployments
# (the namespace may also need to be listed via --exempt-namespace for the label to be accepted)
kubectl label namespace production admission.gatekeeper.sh/ignore=true

# Increase the webhook timeout
kubectl patch validatingwebhookconfiguration gatekeeper-validating-webhook-configuration \
  --type='json' -p='[{"op": "replace", "path": "/webhooks/0/timeoutSeconds", "value": 30}]'

# Scale Gatekeeper for load
kubectl scale deployment gatekeeper-controller-manager -n gatekeeper-system --replicas=5
```
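Once the fire is out, remove the bypass so production doesn't stay permanently exempt from policy:

```bash
# The trailing '-' removes the label and re-enables Gatekeeper for the namespace
kubectl label namespace production admission.gatekeeper.sh/ignore-
```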
Issue 4: The Alert Apocalypse That Ruins Everyone's Tuesday
Symptoms: thousands of Falco alerts, Slack threatening to ban your webhook, and a team that has learned to ignore security notifications entirely.
Immediate mitigation:
```bash
# Implement alert grouping in Falcosidekick
kubectl patch configmap falcosidekick -n security-monitoring --type='json' \
  -p='[{"op": "add", "path": "/data/config.yaml", "value": "
slack:
  webhookurl: \"YOUR_WEBHOOK\"
  channel: \"#security-alerts\"
  minimumpriority: \"error\"
  messageformat: \"{{ range .Outputs }}{{ .Rule }}: {{ .Priority }}{{ end }}\"
alertmanager:
  hostport: \"alertmanager:9093\"
  minimumpriority: \"warning\"
webhook:
  address: \"http://alert-aggregator:8080/webhook\"
  minimumpriority: \"info\"
"}]'

# Temporarily raise the alert threshold
kubectl patch configmap falco-rules -n security-monitoring --type='json' \
  -p='[{"op": "replace", "path": "/data/falco_rules.local.yaml", "value": "
- rule: Privilege escalation
  condition: >
    spawned_process and container and
    (proc.auid != -1 and proc.auid != proc.uid) and
    count(spawned_process) > 5   # Only alert on repeated escalation
"}]'
```
Security Monitoring Health Checks
Create automated health checks so a silent monitoring failure doesn't leave you blind during a real incident:
```yaml
# monitoring-health-check.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: security-monitoring-health
  namespace: security-monitoring
spec:
  schedule: "*/5 * * * *"          # Every 5 minutes
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: health-checker
              image: curlimages/curl:8.0.1
              command:
                - sh
                - -c
                - |
                  set -e
                  echo "Checking security monitoring health..."

                  # Check the Falco metrics endpoint
                  if ! curl -s http://falco:8765/metrics | grep -q "falco_events_total"; then
                    echo "ERROR: Falco metrics not available"
                    exit 1
                  fi

                  # Check that Prometheus is scraping Falco (-g stops curl from mangling the
                  # braces; the curl image has no jq, so grep the instant-query JSON for a "1" sample)
                  if ! curl -sg "http://prometheus:9090/api/v1/query?query=up{job='falco'}" | grep -q ',"1"\]'; then
                    echo "ERROR: Prometheus not scraping Falco"
                    exit 1
                  fi

                  # Check Gatekeeper webhook health (-f makes curl fail on HTTP errors)
                  if ! curl -fsk https://gatekeeper-webhook-service.gatekeeper-system:443/v1/health; then
                    echo "ERROR: Gatekeeper webhook unhealthy"
                    exit 1
                  fi

                  # Verify recent security events (an empty result or a "0" value means nothing came in)
                  if curl -sg "http://prometheus:9090/api/v1/query?query=increase(falco_events_total[5m])" | \
                     grep -qE '"result":\[\]|,"0"\]'; then
                    echo "WARNING: No security events in last 5 minutes - check if monitoring is working"
                  fi

                  echo "Security monitoring health check passed"
          restartPolicy: OnFailure
```
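The health check only helps if someone notices when it fails. A sketch of an alert on the CronJob itself, assuming kube-state-metrics is scraped (it ships with kube-prometheus-stack):

```yaml
# health-check-failure-alert.yaml (sketch)
groups:
  - name: security-monitoring-health
    rules:
      - alert: SecurityMonitoringHealthCheckFailing
        # kube_job_status_failed comes from kube-state-metrics; job_name matches the CronJob's child jobs
        expr: |
          increase(kube_job_status_failed{namespace="security-monitoring", job_name=~"security-monitoring-health.*"}[15m]) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Security monitoring health checks are failing"
```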
Production Deployment Checklist
Before going live with security monitoring, verify these critical items:
Pre-deployment validation:
```bash
#!/bin/bash
# security-monitoring-preflight.sh

echo "=== Security Monitoring Pre-deployment Checklist ==="

# 1. Cluster capacity check
TOTAL_CPU=$(kubectl describe nodes | grep "cpu:" | awk '{sum += $2} END {print sum}')
TOTAL_MEMORY=$(kubectl describe nodes | grep "memory:" | awk '{sum += $2} END {print sum/1024/1024}')
echo "Cluster capacity: ${TOTAL_CPU} CPU, ${TOTAL_MEMORY}GB RAM"
if [ "$TOTAL_CPU" -lt 10 ]; then
  echo "WARNING: Low CPU capacity for security monitoring"
fi

# 2. Storage availability
AVAILABLE_STORAGE=$(kubectl get pvc -n security-monitoring -o jsonpath='{.items[*].status.capacity.storage}' | \
  awk '{sum += $1} END {print sum}')
echo "Available monitoring storage: ${AVAILABLE_STORAGE}Gi"

# 3. Network policy compatibility (an example allow-policy follows this script)
if kubectl get networkpolicy --all-namespaces | grep -q "default-deny"; then
  echo "WARNING: Default-deny network policies detected - ensure monitoring traffic is allowed"
fi

# 4. Kernel compatibility
KERNEL_VERSION=$(kubectl get nodes -o jsonpath='{.items[0].status.nodeInfo.kernelVersion}')
echo "Kernel version: $KERNEL_VERSION"

# 5. Container runtime check
RUNTIME=$(kubectl get nodes -o jsonpath='{.items[0].status.nodeInfo.containerRuntimeVersion}')
echo "Container runtime: $RUNTIME"

# 6. Security policies check (PodSecurityPolicy was removed in Kubernetes 1.25, so this only matters on older clusters)
if kubectl get psp 2>/dev/null | grep -q "restricted"; then
  echo "Pod Security Policies detected - ensure compatibility with monitoring components"
fi

echo "=== Preflight check complete ==="
```
Post-deployment verification:
```bash
#!/bin/bash
# security-monitoring-verification.sh

echo "=== Security Monitoring Verification ==="

# Wait for all components to be ready
kubectl wait --for=condition=ready pod -l app.kubernetes.io/name=falco -n security-monitoring --timeout=300s
kubectl wait --for=condition=ready pod -l app.kubernetes.io/name=prometheus -n security-monitoring --timeout=300s
kubectl wait --for=condition=ready pod -l control-plane=controller-manager -n gatekeeper-system --timeout=300s

# Test security event generation
echo "Testing security event generation..."
kubectl run security-test --image=busybox --rm -it --restart=Never -- sh -c '
  echo "Attempting to read sensitive file..."
  cat /etc/shadow 2>/dev/null || echo "Access denied (expected)"
' || true

# Wait for events to be processed
sleep 30

# Verify Falco detected the test (promtool query instant needs the server URL)
FALCO_EVENTS=$(kubectl exec -n security-monitoring prometheus-prometheus-kube-prometheus-prometheus-0 -- \
  promtool query instant http://localhost:9090 'increase(falco_events_total[1m])' | grep -o '[0-9]\+' | head -1)
if [ "${FALCO_EVENTS:-0}" -gt 0 ]; then
  echo "✓ Falco runtime detection working"
else
  echo "✗ Falco runtime detection not working"
  exit 1
fi

# Test policy enforcement: the privileged deployment below should be rejected
echo "Testing policy enforcement..."
if cat <<EOF | kubectl apply --dry-run=server -f - >/dev/null 2>&1
apiVersion: apps/v1
kind: Deployment
metadata:
  name: privileged-test
spec:
  replicas: 1
  selector:
    matchLabels:
      app: privileged-test
  template:
    metadata:
      labels:
        app: privileged-test
    spec:
      containers:
        - name: test
          image: nginx
          securityContext:
            privileged: true
EOF
then
  echo "✗ Gatekeeper did not block a privileged deployment"
  exit 1
else
  echo "✓ Gatekeeper policy enforcement working"
fi

echo "=== Security monitoring verification complete ==="
```
This production-hardened configuration catches real threats without destroying your cluster's performance. The monitoring scales with your growth and gives you actionable insights instead of the endless alert noise that teaches everyone to ignore security altogether.
How you know it's actually working:
- Security monitoring uses <5% of cluster resources (trust me, you'll feel it if it's more; the query after this list gives you a quick read)
- False positive rate drops below 10% after you've spent weeks tuning out the noise
- You catch real threats in under a minute instead of finding out from Twitter
- Policies block 95% of the questionable deployment decisions your developers try to sneak past
- Monitoring stays up when the rest of your infrastructure is having an existential crisis (mostly)
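For the first item you don't have to guess: here is a quick PromQL check of the monitoring namespace's share of cluster CPU, assuming the cAdvisor metrics that kube-prometheus-stack scrapes by default (run it against Prometheus, for example through the port-forward shown earlier; jq is only used locally):

```bash
# Fraction of cluster CPU burned by security-monitoring over the last 5 minutes; above ~0.05, start trimming
curl -s --data-urlencode \
  'query=sum(rate(container_cpu_usage_seconds_total{namespace="security-monitoring"}[5m])) / sum(rate(container_cpu_usage_seconds_total{namespace!=""}[5m]))' \
  http://localhost:9090/api/v1/query | jq -r '.data.result[0].value[1]'
```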