Lock Down Your K8s Cluster Before It Costs You $50k

The $47k Lesson: Why Your Cluster Is Already Compromised

Six months ago, I got called in to fix a startup's "minor security issue." Their Kubernetes cluster was mining Monero, their AWS bill went from $800/month to $47k, and they had no idea how it happened.

The attack path was embarrassingly simple: default service account with cluster-admin permissions, no network policies, and secrets stored as plain text environment variables. The attacker got in through a vulnerable nginx pod, escalated to the node, and then owned the entire cluster within 20 minutes.

The Real Attack Vector (It's Not What You Think)

Everyone obsesses over container image vulnerabilities, but that's not how clusters get owned. Here's what actually happens:

Step 1: Pod Compromise - Usually through a web app vulnerability or misconfigured ingress
Step 2: Privilege Escalation - Default service accounts with excessive permissions
Step 3: Lateral Movement - No network segmentation between namespaces
Step 4: Persistence - Deploy cryptominers as DaemonSets across all nodes

The kicker? All of this is preventable with proper RBAC, network policies, and admission controllers. But most teams skip these because "we'll add security later."

Current State Assessment (The Uncomfortable Truth)

Before fixing anything, run these commands to see how fucked your cluster really is:

## Check if default service accounts have admin access (spoiler: they probably do)
kubectl auth can-i '*' '*' --as=system:serviceaccount:default:default

## List pods that set runAsUser 0 or set no pod-level user at all (prepare to be horrified)
kubectl get pods --all-namespaces -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.spec.securityContext.runAsUser}{"\n"}{end}' | awk -F'\t' '$3 == "" || $3 == "0"'

## Check for secrets exposed as environment variables
kubectl get pods --all-namespaces -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.spec.containers[*].env[*].valueFrom.secretKeyRef.name}{"\n"}{end}' | awk -F'\t' '$3 != ""'

If the first command prints yes, or either of the others prints any rows, you've got work to do. An empty runAsUser only means nothing is set at the pod level, so check the image's USER and any container-level securityContext before panicking.

The Hardening Priority Matrix

Based on actual incident response, here's the order that prevents the most damage:

Week 1-2: Stop the Obvious Stuff

RBAC implementation (blocks privilege escalation)
Network policies (prevents lateral movement)
Pod security standards (limits container breakouts)

Week 3-4: Lock Down Data

Secrets management (stop broadcasting credentials)
Admission controllers (block dangerous deployments)
Runtime monitoring (detect active attacks)

Month 2+: Advanced Hardening

Node hardening
Supply chain security
Compliance frameworks

Don't try to do everything at once. I've seen teams get overwhelmed and abandon hardening halfway through, leaving their cluster in a worse state than when they started.

Reality Check: Resource Requirements

Hardening isn't free. Plan for:

Time Investment:

40-60 hours for initial hardening (split across 2-3 engineers over 2 weeks)
8-10 hours/week ongoing maintenance
20-30 hours when things break (and they will)

Performance Impact:

Network policies add 5-10ms latency per hop
Admission controllers can slow deployments by 30%
Runtime monitoring consumes 5-15% CPU per node

Cost Increases:

Security tools: $200-500/node/month for commercial solutions
Additional compute for monitoring: 10-20% overhead
Training and certification: $5k-15k per engineer

The alternative is that $47k surprise bill when your cluster gets compromised. Choose wisely.

Pre-Hardening Questions: What You Need to Know

How do I know if my cluster is already compromised?

Check for cryptocurrency mining processes first:

## Look for pods with miner-like names (this matches names only, so a renamed miner slips past)
kubectl get pods --all-namespaces -o wide | grep -E "(mining|xmrig|cpuminer|crypto)"

## Check CPU usage spikes
kubectl top nodes
kubectl top pods --all-namespaces | sort -k 3 -nr | head -10

Red flags: pods you don't recognize consuming 90%+ CPU, processes named things like kthreads or systemd-worker that aren't actually system processes, and mysterious network traffic to pool servers.
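
Name matching misses renamed miners, so a CPU filter makes a useful second pass. A minimal sketch: the function name high_cpu_pods and the 900-millicore cutoff are my own placeholders, not something kubectl ships, and 900m only approximates "90% of one core".

```shell
# high_cpu_pods THRESHOLD_MILLICORES
# Reads `kubectl top pods --all-namespaces --no-headers` rows on stdin
# (namespace, pod, cpu, memory) and prints the ones at or above the threshold.
high_cpu_pods() {
  awk -v limit="$1" '
    {
      cpu = $3
      if (cpu ~ /m$/) { sub(/m$/, "", cpu) }   # 950m becomes 950 millicores
      else            { cpu = cpu * 1000 }      # 2 becomes 2000 millicores
      if (cpu + 0 >= limit + 0) print $1 "/" $2, cpu "m"
    }'
}

# Usage against a live cluster:
#   kubectl top pods --all-namespaces --no-headers | high_cpu_pods 900
```

Anything this prints that you can't map to a workload you deployed deserves a closer look.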

Can I harden a cluster without breaking existing applications?

Yes, but test everything in staging first. The biggest gotcha is RBAC - applications that rely on overpermissioned service accounts will break immediately.

Start with audit mode for admission controllers and gradually tighten permissions. Budget 2-3 weeks for this process if you have legacy applications.

What's the minimum viable security for a production cluster?

Three non-negotiables:

RBAC with least privilege - No cluster-admin for application pods
Network policies - Default deny, explicit allow rules
Pod security standards - Block privileged containers and host mounts

Everything else is nice-to-have. These three prevent 80% of the attacks I've seen.

How much will commercial security tools actually cost?

Vendors lie about pricing. Here's the real math:

Aqua Security: Starts at $50/node/month, becomes $200+ with actual features
Twistlock/Prisma: $100-300/node/month depending on modules
Falco: Free, but you'll pay $5k+ for proper alert management
OPA Gatekeeper: Free, but plan for 40+ hours of policy development

For a 20-node cluster, budget $40k-80k annually for commercial tooling.
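
It's worth redoing that arithmetic with your own quote. A rough sketch: annual_tool_cost is a made-up helper name, and the rates passed in are sample figures from this article, not vendor pricing.

```shell
# annual_tool_cost NODES LOW_RATE HIGH_RATE
# Prints the yearly cost range in dollars for per-node monthly pricing.
annual_tool_cost() {
  nodes=$1; low=$2; high=$3
  echo "$((nodes * low * 12))-$((nodes * high * 12))"
}

# 20 nodes quoted at $200-300 per node per month
annual_tool_cost 20 200 300
```

At those rates the bill lands at $48k-72k a year, inside the budget above. Run it with the full $200-500 range quoted earlier and the top end climbs to $120k.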

Should I use managed Kubernetes security features?

EKS/GKE/AKS security add-ons are decent starting points but not sufficient for production. They handle the basics (cluster API authentication, network encryption) but miss application-layer security.

Use managed features for:

Cluster authentication and authorization
Network encryption between nodes
Basic vulnerability scanning

Still need third-party tools for:

Runtime threat detection
Advanced admission control
Secrets management
Compliance reporting

What happens when security hardening breaks my deployments?

Have a rollback plan. Most admission controller configs can be updated without cluster restarts:

## Disable a problematic validating webhook (re-apply it from version control once it's fixed)
kubectl delete validatingwebhookconfiguration <config-name>

## Quickly allow a specific deployment through (only works if your webhook is written to honor a bypass annotation like this one)
kubectl annotate deployment <deployment-name> admission.example.com/bypass="true"

Keep staging and production configs in version control. Test every change in staging with actual application deployments, not just hello-world examples.

Step-by-Step Production Hardening Implementation

Time to stop fucking around and actually lock down your cluster. This is the playbook I use when fixing compromised clusters - battle-tested configurations that work in real production environments.

Phase 1: RBAC Implementation (Week 1)

RBAC is your first line of defense. Most clusters are wide open because the default service account has way too many permissions.

Step 1: Audit Current Permissions

## Create a quick audit script to see the damage
cat << 'EOF' > rbac-audit.sh
#!/bin/bash
echo "=== Default Service Account Permissions ==="
kubectl auth can-i --list --as=system:serviceaccount:default:default

echo -e "\n=== Cluster Admin Bindings ==="
kubectl get clusterrolebindings -o json | jq -r '.items[] | select(.roleRef.name=="cluster-admin") | .metadata.name'

echo -e "\n=== Service Accounts with Admin Access ==="
kubectl get clusterrolebindings -o json | jq -r '.items[] | select(.roleRef.name=="cluster-admin") | .subjects[]? | select(.kind=="ServiceAccount") | "\(.name) in namespace \(.namespace)"'
EOF

chmod +x rbac-audit.sh && ./rbac-audit.sh

If you see the default service account with cluster-admin permissions, you're fucked. Fix it immediately.

Step 2: Implement Least Privilege RBAC

Create namespace-specific service accounts with minimal permissions:

## rbac-hardened.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: app-service-account
  namespace: production
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: production
  name: app-role
rules:
- apiGroups: [""]
  resources: ["pods", "configmaps", "secrets"]
  verbs: ["get", "list", "create", "update", "patch"]
- apiGroups: ["apps"]
  resources: ["deployments", "replicasets"]
  verbs: ["get", "list", "create", "update", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: app-role-binding
  namespace: production
subjects:
- kind: ServiceAccount
  name: app-service-account
  namespace: production
roleRef:
  kind: Role
  name: app-role
  apiGroup: rbac.authorization.k8s.io

Apply it and update your deployment manifests to use the new service account:

spec:
  serviceAccountName: app-service-account
  # Don't mount the API token unless the app actually calls the Kubernetes API.
  # If it does, set this to true so the Role above is usable.
  automountServiceAccountToken: false

Critical gotcha: PodSecurityPolicy was deprecated in Kubernetes 1.21 and removed in 1.25. If you're still using PSPs, migrate to Pod Security Standards before you upgrade, or your PSP manifests will fail to apply and nothing will enforce those rules afterwards.

Phase 2: Network Policies (Week 1-2)

Network policies are where most implementations fail. They look simple but have edge cases that will bite you in production.

Step 1: Default Deny All Traffic

## default-deny-all.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress

WARNING: This breaks everything immediately. Have your allow rules ready before applying this.
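
One cheap guard against applying default deny too early is to refuse unless some other policy already exists in the namespace. This is a sketch, not a guarantee: safe_to_default_deny is a hypothetical helper, and it only proves an allow policy exists, not that it is the right one.

```shell
# safe_to_default_deny NAMESPACE
# Succeeds only if the namespace already holds at least one NetworkPolicy
# other than default-deny-all, i.e. some allow rule is in place.
safe_to_default_deny() {
  kubectl get networkpolicy -n "$1" -o name \
    | grep -v '/default-deny-all$' \
    | grep -q .
}

# Usage:
#   safe_to_default_deny production && kubectl apply -f default-deny-all.yaml
```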

Step 2: Explicit Allow Rules

## web-app-network-policy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: web-app-policy
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: web-app
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          name: ingress-nginx
    ports:
    - protocol: TCP
      port: 8080
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          name: database
    ports:
    - protocol: TCP
      port: 5432
  # Allow DNS resolution (you'll forget this and wonder why everything breaks)
  # DNS falls back to TCP for large responses, so open both protocols
  - to: []
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53

Step 3: Test Network Policies

Network policies fail silently. Use netshoot for testing:

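## Deploy the test pod in the namespace, and with the labels, that the policy selects
kubectl run netshoot --rm -i --tty --image nicolaka/netshoot -n production --labels app=web-app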

## Inside the pod, test connectivity
nslookup google.com  # Should work
curl your-internal-service:8080  # Should work only if explicitly allowed
curl external-blocked-service:8080  # Should timeout

Common gotcha: CNI compatibility. Calico, Cilium, and Weave handle network policies differently, and some CNIs (plain Flannel, for example) don't enforce them at all. Test on your actual CNI, not Docker Desktop.

Phase 3: Pod Security Standards (Week 2)

Pod Security Standards, enforced by the built-in Pod Security Admission controller, replaced PodSecurityPolicy, which was removed in Kubernetes 1.25. The migration is painful but necessary.

Enable Pod Security Standards

## pod-security-config.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted

This rejects new pods that rely on any of the following (pods already running are untouched until they're recreated):

Privileged containers
Host network/PID/IPC access
Host path mounts
Running as root

Fix Your Deployments

Most applications will break. Update your manifests:

spec:
  template:
    spec:
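      securityContext:
        runAsNonRoot: true
        runAsUser: 1001
        fsGroup: 1001
        # The restricted profile also requires an explicit seccomp profile
        seccompProfile:
          type: RuntimeDefault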
      containers:
      - name: app
        securityContext:
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem: true
          capabilities:
            drop:
            - ALL
        # Add writable volume for temp files
        volumeMounts:
        - name: tmp
          mountPath: /tmp
      volumes:
      - name: tmp
        emptyDir: {}

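Time-saving tip: Use kubectl apply --dry-run=server -f <manifest> to test deployments against Pod Security Standards before applying. With the warn label set on the namespace, the API server prints every violation without creating anything.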
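
The same server-side dry run works for the namespace label itself, which shows what a level would reject among the pods already running. A small sketch, where psa_preview is a wrapper name I made up around a standard kubectl label call:

```shell
# psa_preview NAMESPACE LEVEL
# Asks the API server which existing pods would violate LEVEL
# (privileged, baseline, or restricted) without changing the namespace.
psa_preview() {
  kubectl label --dry-run=server --overwrite namespace "$1" \
    "pod-security.kubernetes.io/enforce=$2"
}

# Usage:
#   psa_preview production restricted
```

The API server answers with warnings naming the pods that would fail the level, and because it's a dry run the label is never written.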

Phase 4: Secrets Management (Week 2-3)

Stop putting secrets in environment variables. Seriously. That's how the $47k cluster got owned.

External Secrets Operator

## Install external-secrets
helm repo add external-secrets https://charts.external-secrets.io
helm install external-secrets external-secrets/external-secrets -n external-secrets-system --create-namespace

Configure for AWS Secrets Manager:

## secret-store.yaml
apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
  name: aws-secrets-manager
  namespace: production
spec:
  provider:
    aws:
      service: SecretsManager
      region: us-west-2
      auth:
        secretRef:
          accessKeyIDSecretRef:
            name: aws-credentials
            key: access-key-id
          secretAccessKeySecretRef:
            name: aws-credentials
            key: secret-access-key

Sealed Secrets Alternative

If you're sticking with in-cluster secrets, use Bitnami Sealed Secrets:

## Install sealed-secrets controller
kubectl apply -f https://github.com/bitnami-labs/sealed-secrets/releases/download/v0.24.0/controller.yaml

## Encrypt existing secrets
echo -n 'your-secret-password' | kubectl create secret generic mysecret --dry-run=client --from-file=password=/dev/stdin -o yaml | kubeseal -o yaml > mysealedsecret.yaml

Reality check: Secrets rotation is a pain in the ass. Plan for 4-6 hours per application to implement proper secrets management.

Validation and Testing

After each phase, run these verification commands:

## Verify RBAC
kubectl auth can-i --list --as=system:serviceaccount:production:app-service-account

## Test network policies
kubectl run test-pod --rm -i --tty --image nicolaka/netshoot

## Check pod security
kubectl get pods -n production -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.securityContext.runAsUser}{"\n"}{end}'

## Audit secrets
kubectl get secrets --all-namespaces -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,TYPE:.type | grep -v "kubernetes.io/"

If any of these fail, fix them before moving to the next phase. Half-implemented security is worse than no security - it gives you false confidence while leaving attack vectors open.

The next phase covers admission controllers and runtime monitoring, but get these basics rock-solid first.

Kubernetes Security Tools: Real Costs & Honest Assessments

Tool	Starting Price	Real Price (20 nodes)	What You Get	What's Missing	Verdict
Aqua Security	$50/node/month	$4k-8k/month	Runtime protection, vulnerability scanning, compliance	Workload identity management costs extra	Solid but expensive
Prisma Cloud (Twistlock)	$100/node/month	$6k-12k/month	Full security lifecycle, good compliance reporting	Complex licensing, steep learning curve	Enterprise-ready but overkill for most
Sysdig Secure	$60/node/month	$3.6k-7.2k/month	Runtime insights, forensics	Falco integration isn't seamless despite owning it	Good for incident response
StackRox (Red Hat)	$75/node/month	$4.5k-9k/month	Network visualization, declarative configs	Recently acquired, roadmap uncertain	Wait and see
Snyk Container	$25/node/month	$1.5k-3k/month	Great vulnerability database	Weak runtime protection	Good for dev teams

Advanced Hardening: Admission Controllers & Runtime Monitoring

You've got the basics locked down - RBAC, network policies, pod security standards. Now it's time for the advanced shit that separates production-ready clusters from the ones that get owned.

Admission Controllers: The Last Line of Defense

Admission controllers are your final chance to block dangerous deployments before they hit your cluster. Think of them as automated security reviews that never get tired, never approve shit just to make a deadline, and never forget to check for privilege escalation.

OPA Gatekeeper Implementation

Install Gatekeeper and prepare for some YAML hell:

## Install Gatekeeper
kubectl apply -f https://raw.githubusercontent.com/open-policy-agent/gatekeeper/release-3.14/deploy/gatekeeper.yaml

## Wait for it to be ready (this takes a few minutes)
kubectl wait --for=condition=Ready pod -l control-plane=controller-manager -n gatekeeper-system --timeout=300s

Create constraint templates for common security violations:

## privileged-containers-template.yaml
apiVersion: templates.gatekeeper.sh/v1beta1
kind: ConstraintTemplate
metadata:
  name: k8srequiredsecuritycontext
spec:
  crd:
    spec:
      names:
        kind: K8sRequiredSecurityContext
      validation:
        openAPIV3Schema:
          type: object
          properties:
            runAsNonRoot:
              type: boolean
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8srequiredsecuritycontext
        
        violation[{"msg": msg}] {
            container := input.review.object.spec.template.spec.containers[_]
            not container.securityContext.runAsNonRoot
            msg := "Container must run as non-root user"
        }
        
        violation[{"msg": msg}] {
            container := input.review.object.spec.template.spec.containers[_]
            container.securityContext.privileged
            msg := "Privileged containers are not allowed"
        }

Apply the constraint:

## privileged-containers-constraint.yaml
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredSecurityContext
metadata:
  name: must-run-as-nonroot
spec:
  match:
    kinds:
      - apiGroups: ["apps"]
        kinds: ["Deployment"]
    excludedNamespaces: ["kube-system", "gatekeeper-system"]
  parameters:
    runAsNonRoot: true

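Critical gotcha: a new constraint starts denying the moment it's applied. Set spec.enforcementAction: dryrun on the constraint first so violations are only recorded by the audit, review them, then switch to deny. Rehearse in a staging cluster as well, or you'll block legitimate deployments.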
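
One way to rehearse a constraint before it blocks anything, assuming your Gatekeeper version supports spec.enforcementAction: dryrun on constraints, is to let the audit run and only flip to deny once it reports zero violations. A sketch with two helper names of my own (constraint_violations, ready_to_enforce) that read the audit count from the constraint's status:

```shell
# constraint_violations KIND NAME
# Prints how many existing resources the audit found violating a constraint.
constraint_violations() {
  kubectl get "$1" "$2" -o jsonpath='{.status.totalViolations}'
}

# ready_to_enforce KIND NAME
# Succeeds only when the audit reports exactly zero violations.
# An empty status (audit has not run yet) counts as not ready.
ready_to_enforce() {
  [ "$(constraint_violations "$1" "$2")" = "0" ]
}

# Usage, with the constraint created in dryrun mode:
#   ready_to_enforce k8srequiredsecuritycontext must-run-as-nonroot \
#     && kubectl patch k8srequiredsecuritycontext must-run-as-nonroot \
#          --type=merge -p '{"spec":{"enforcementAction":"deny"}}'
```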

Custom Admission Webhooks

For complex policies, write custom admission webhooks. Here's a simple webhook that blocks images from untrusted registries:

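// admission-webhook.go
package main

import (
    "encoding/json"
    "fmt"
    "io"
    "log"
    "net/http"
    "strings"

    admissionv1 "k8s.io/api/admission/v1"
    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Entries end in "/" so "registry.company.com.evil.io" can't match as a prefix.
var trustedRegistries = []string{
    "registry.company.com/",
    "gcr.io/company-project/",
    "docker.io/library/", // Official Docker images
}

func admissionHandler(w http.ResponseWriter, r *http.Request) {
    body, err := io.ReadAll(r.Body)
    if err != nil {
        http.Error(w, "cannot read request body", http.StatusBadRequest)
        return
    }

    var admissionReview admissionv1.AdmissionReview
    if err := json.Unmarshal(body, &admissionReview); err != nil || admissionReview.Request == nil {
        http.Error(w, "malformed AdmissionReview", http.StatusBadRequest)
        return
    }

    req := admissionReview.Request
    var pod corev1.Pod
    if err := json.Unmarshal(req.Object.Raw, &pod); err != nil {
        http.Error(w, "cannot decode pod", http.StatusBadRequest)
        return
    }

    allowed := true
    message := ""

    // Init containers pull images too, so check them alongside the main containers.
    containers := append([]corev1.Container{}, pod.Spec.InitContainers...)
    containers = append(containers, pod.Spec.Containers...)
    for _, container := range containers {
        if !isImageTrusted(container.Image) {
            allowed = false
            message = fmt.Sprintf("Image %s from untrusted registry", container.Image)
            break
        }
    }

    admissionReview.Response = &admissionv1.AdmissionResponse{
        UID:     req.UID,
        Allowed: allowed,
        Result:  &metav1.Status{Message: message},
    }

    respBytes, err := json.Marshal(admissionReview)
    if err != nil {
        http.Error(w, "cannot encode response", http.StatusInternalServerError)
        return
    }
    w.Header().Set("Content-Type", "application/json")
    w.Write(respBytes)
}

// isImageTrusted does a plain prefix match, so short names like "nginx:1.25"
// are rejected unless written out as "docker.io/library/nginx:1.25".
func isImageTrusted(image string) bool {
    for _, registry := range trustedRegistries {
        if strings.HasPrefix(image, registry) {
            return true
        }
    }
    return false
}

func main() {
    http.HandleFunc("/validate", admissionHandler)
    // The API server only calls webhooks over HTTPS; the port and certificate paths are placeholders.
    log.Fatal(http.ListenAndServeTLS(":8443", "/certs/tls.crt", "/certs/tls.key", nil))
}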

Deployment reality: Custom webhooks are powerful but require TLS certificates, proper error handling, and monitoring. Budget 2-3 weeks for production-ready webhook implementation.

Runtime Monitoring: Catching Active Attacks

Static policies catch deployment mistakes. Runtime monitoring catches actual attacks in progress.

Falco Setup and Configuration

Install Falco with proper rule customization:

## Install Falco using Helm
helm repo add falcosecurity https://falcosecurity.github.io/charts
helm repo update

## Create custom values file
cat << 'EOF' > falco-values.yaml
falco:
  rules_file:
    - /etc/falco/falco_rules.yaml
    - /etc/falco/falco_rules.local.yaml
    - /etc/falco/k8s_audit_rules.yaml
    - /etc/falco/rules.d
  
  # Plugins required by the k8s audit rules file listed above
  load_plugins: [k8saudit, json]
  
  # Output to multiple destinations
  json_output: true
  json_include_output_property: true
  
  # File output for log aggregation
  file_output:
    enabled: true
    keep_alive: false
    filename: /var/log/falco.log

  # gRPC output for real-time alerts
  grpc:
    enabled: true
    bind_address: "0.0.0.0:5060"
    threadiness: 8

## Custom rules to reduce false positives
customRules:
  rules.yaml: |
    # Override default rules with production-tested versions
    # Add your application exceptions to the last line of the condition
    # (comments inside the folded condition would become part of the expression)
    - rule: Write below binary dir
      desc: Detect file writes below common binary directories
      condition: >
        bin_dir and evt.dir = < and open_write
        and not package_mgmt_procs
        and not exe_running_docker_save
        and not python_running_get_pip
        and not python_running_ms_oms
        and not user_known_write_below_binary_dir_activities
        and not nodejs_running_npm
        and not proc.name in (your-app-binary)
      output: >
        File below a known binary directory opened for writing
        (user=%user.name command=%proc.cmdline file=%fd.name)
      priority: ERROR
      tags: [filesystem, mitre_persistence]
EOF

helm install falco falcosecurity/falco -f falco-values.yaml -n falco --create-namespace

Critical tuning step: Falco will spam the hell out of you for the first week. Create custom rules to filter out your application's normal behavior:

## falco-custom-rules.yaml
## A Falco rule can only raise alerts, so "allowing" expected writes means defining a macro
## and adding "and not expected_app_file_writes" to the noisy rule's condition
## (the macro name is a placeholder)
- macro: expected_app_file_writes
  condition: >
    (proc.name = your-app-name and fd.name startswith /app/data)
    or (proc.name = nginx and fd.name startswith /var/log/nginx)
    or (proc.name = postgres and fd.name startswith /var/lib/postgresql)

- list: known_drop_and_execute_containers
  items: [your-ci-runner, your-build-container]

## Override default rule to exclude your containers
## (a full override also needs the desc, output, and priority copied from the default rule)
- rule: Drop and execute new binary in container
  condition: >
    spawned_process and container and
    (proc.is_exe_upper_layer = true or proc.is_exe_from_memfd = true) and
    not container.image.repository in (known_drop_and_execute_containers)
  append: false

Alert Management and Response

Connect Falco to your incident response system. Here's a webhook forwarder for Slack/PagerDuty:

## falco-sidekick-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: falco-sidekick-config
  namespace: falco
data:
  config.yaml: |
    slack:
      webhookurl: "https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK"
      channel: "#security-alerts"
      username: "falco"
      minimumpriority: "error"
    
    pagerduty:
      routingkey: "your-pagerduty-integration-key"
      minimumpriority: "critical"
    
    # Custom webhook for your SIEM
    webhook:
      address: "https://your-siem.company.com/api/events"
      minimumpriority: "warning"

Response playbook: When Falco fires alerts, don't just dismiss them. Follow this process:

Immediate: Check if it's a known false positive
5 minutes: Verify the pod/container is legitimate
15 minutes: Check network connections from the suspicious pod
30 minutes: Review audit logs for privilege escalation attempts

Runtime Security Beyond Falco

eBPF-based monitoring for deeper system visibility:

## Install Tetragon for network and process monitoring
helm repo add cilium https://helm.cilium.io/
helm install tetragon cilium/tetragon -n tetragon --create-namespace \
  --set tetragon.enableProcessCred=true \
  --set tetragon.enableProcessNs=true

File integrity monitoring using AIDE:

## aide-daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: aide-fim
  namespace: security
spec:
  selector:
    matchLabels:
      name: aide-fim
  template:
    metadata:
      labels:
        name: aide-fim
    spec:
      hostNetwork: true
      hostPID: true
      containers:
      - name: aide
        image: your-registry.com/aide:latest
        securityContext:
          privileged: true
        volumeMounts:
        - name: host-root
          mountPath: /host
          readOnly: true
        command: ["/usr/bin/aide"]
        args: ["--check", "--config=/etc/aide.conf"]
      volumes:
      - name: host-root
        hostPath:
          path: /

Putting It All Together

Your final security stack should include:

Prevention: RBAC, network policies, pod security standards, admission controllers
Detection: Runtime monitoring, file integrity, network analysis
Response: Automated alerts, incident playbooks, forensic capabilities

Operational overhead: Plan for 10-15 hours/week managing security tooling. More during initial tuning, less once it's stable.

Performance impact: Properly configured security adds 5-10% CPU overhead and 50-100ms request latency. The alternative is getting owned, so it's worth it.

Testing regimen: Run chaos engineering exercises monthly. Simulate compromised pods, privilege escalation attempts, and data exfiltration. If your security tools don't catch it in testing, they won't catch it during a real attack.

The next step is compliance frameworks and audit preparation, but nail the runtime monitoring first. You can't secure what you can't see.

Troubleshooting Common Security Hardening Failures

My deployments are failing after implementing Pod Security Standards. What's the quickest fix?

Check the security context first:

## See exactly what's being blocked
kubectl describe replicaset <failing-rs-name>

## Common fix - add security context to deployment
kubectl patch deployment <deployment-name> -p '{"spec":{"template":{"spec":{"securityContext":{"runAsNonRoot":true,"runAsUser":1001}}}}}'

## For immediate temporary fix, relax the namespace policy
kubectl label namespace <namespace> pod-security.kubernetes.io/enforce=baseline --overwrite

Time estimate: 5 minutes if you know what you're doing, 2 hours if you have to debug each container individually.

Network policies broke everything. How do I roll back quickly?

Delete the default-deny policy immediately:

## Emergency rollback - removes all network policies
kubectl delete networkpolicy --all -n <namespace>

## Selective rollback - remove just the problematic policy
kubectl delete networkpolicy default-deny-all -n <namespace>

## Check what's still broken
kubectl get pods -n <namespace> -o wide
kubectl logs <pod-name> -n <namespace> | grep -i "connection refused\|timeout"

Pro tip: Always implement network policies with explicit allow rules BEFORE adding default deny. I learned this the hard way when I took down production on Black Friday.

Falco is generating 10,000 alerts per day. How do I tune it without disabling security?

Start with these high-noise, low-value rules and disable them:

## Add to falco-rules-override.yaml
- rule: Read sensitive file trusted after startup
  enabled: false

- rule: Write below etc
  enabled: false

- rule: System procs network activity
  enabled: false

## Add your legitimate API clients to the last line of the condition
- rule: Contact K8S API Server From Container
  condition: >
    kevt and container and
    not allowed_k8s_users and
    not user_known_contact_k8s_api_server_activities
    and not proc.name in (kubectl, helm, terraform)

Tuning process: Week 1: Disable noisy rules. Week 2: Add application-specific exceptions. Week 3: Re-enable rules with better filtering. Budget 15-20 hours total.
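
To decide which rules to quiet first, count alerts per rule rather than guessing. A minimal sketch that assumes Falco's JSON output with one alert object per line and a "rule" field in each; noisiest_rules is my own helper name.

```shell
# noisiest_rules [N]
# Reads Falco JSON alerts (one object per line) on stdin and prints the
# N rules that fired most often, highest count first. N defaults to 10.
noisiest_rules() {
  grep -oE '"rule": *"[^"]*"' \
    | sed -E 's/^"rule": *"//; s/"$//' \
    | sort | uniq -c | sort -rn | head -n "${1:-10}"
}

# Usage, assuming you have copied the Falco log out of the pod:
#   noisiest_rules 5 < falco.log
```

The top few entries are usually where the week 1 disabling and week 2 exceptions pay off most.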

OPA Gatekeeper policies are too complex. Any simpler alternatives?

Register a ValidatingWebhookConfiguration that points at a small webhook service, and manage it with kustomize:

## simple-security-policy.yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: security-policy
webhooks:
- name: block-privileged.example.com
  clientConfig:
    service:
      name: security-webhook
      namespace: security-system
      path: /validate
    # caBundle: <base64-encoded CA that signed the webhook's serving certificate>
  rules:
  - operations: ["CREATE", "UPDATE"]
    apiGroups: ["apps"]
    apiVersions: ["v1"]
    resources: ["deployments", "daemonsets"]
  admissionReviewVersions: ["v1"]
  sideEffects: None
  failurePolicy: Fail

Reality check: Simpler policies mean less flexibility. If you need complex conditional logic, stick with OPA. For basic "no privileged containers" rules, admission webhooks work fine.

External secrets operator isn't syncing. How do I debug?

Check the operator logs and secret store connection:

## Check operator status
kubectl logs -n external-secrets-system deployment/external-secrets

## Verify secret store connection
kubectl describe secretstore <store-name> -n <namespace>

## Test secret retrieval manually
kubectl get externalsecret <external-secret-name> -n <namespace> -o yaml

## Common fix - service account permissions
kubectl describe serviceaccount external-secrets -n external-secrets-system

Common gotcha: AWS IAM permissions. The service account needs secretsmanager:GetSecretValue and secretsmanager:DescribeSecret permissions. Took me 3 hours to figure this out because AWS error messages are garbage.
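
For reference, those two permissions look like this as an IAM policy document. A sketch only: eso_read_policy is a hypothetical helper, and the ARN you pass is yours to fill in, so scope it to the secrets the operator actually needs.

```shell
# eso_read_policy SECRET_ARN
# Prints an IAM policy document granting read access to the given secret ARN.
eso_read_policy() {
  cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "secretsmanager:GetSecretValue",
        "secretsmanager:DescribeSecret"
      ],
      "Resource": "$1"
    }
  ]
}
EOF
}

# Usage (placeholder account ID and secret path):
#   eso_read_policy 'arn:aws:secretsmanager:us-west-2:123456789012:secret:prod/*'
```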

How do I troubleshoot RBAC permission denials?

Use kubectl auth can-i and audit logs:

## Test specific permissions
kubectl auth can-i create pods --as=system:serviceaccount:production:app-service-account -n production

## Check what permissions a service account actually has
kubectl auth can-i --list --as=system:serviceaccount:production:app-service-account -n production

## Look for denials in the API server logs (self-managed control planes only; on EKS/GKE/AKS use the provider's audit log export)
kubectl logs -n kube-system kube-apiserver-<node-name> | grep -i "forbidden"

Debug tip: Use kubectl auth can-i --list to see all permissions. If the output is huge, your service account has too many permissions. If it's tiny, it probably needs more.
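
"Huge" is easier to act on when you check for the one row that matters: every verb on every resource. A sketch that assumes the usual can-i --list table layout, where a cluster-admin style grant shows up as a row starting with *.* and ending in [*]; has_wildcard_access is a name I made up.

```shell
# has_wildcard_access
# Reads `kubectl auth can-i --list` output on stdin and succeeds if any row
# grants every verb on every resource (the cluster-admin pattern).
has_wildcard_access() {
  awk '$1 == "*.*" && $NF == "[*]" { found = 1 } END { exit !found }'
}

# Usage:
#   kubectl auth can-i --list --as=system:serviceaccount:production:app-service-account -n production \
#     | has_wildcard_access && echo "wildcard access: fix this first"
```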

My monitoring stack broke after implementing network policies. Quick fix?

Allow monitoring namespace to scrape metrics:

## monitoring-network-policy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-monitoring
  namespace: production
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          name: monitoring
    ports:
    - protocol: TCP
      port: 8080  # Your metrics port
    - protocol: TCP
      port: 9090  # Prometheus

Common mistake: Forgetting to label the monitoring namespace. kubectl label namespace monitoring name=monitoring

Container images failing security scans. How do I fix vulnerabilities quickly?

Use multi-stage builds to minimize attack surface:

## Before: Single stage with all build tools
FROM node:18
COPY . .
RUN npm install
CMD ["npm", "start"]

## After: Multi-stage build
FROM node:18-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev

FROM node:18-alpine AS runtime
RUN addgroup -g 1001 -S app && adduser -S app -u 1001 -G app
WORKDIR /app
USER app
COPY --from=builder --chown=app:app /app/node_modules ./node_modules
COPY --chown=app:app . .
CMD ["npm", "start"]

Priority fix order:

Remove unnecessary packages
Update base images
Fix high/critical CVEs
Add non-root user

What's the fastest way to check if my cluster is still vulnerable after hardening?

Run this security audit script:

#!/bin/bash
echo "=== Security Audit Script ==="

echo "1. Checking for privileged containers..."
kubectl get pods --all-namespaces -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.spec.containers[*].securityContext.privileged}{"\n"}{end}' | grep true

echo "2. Checking for root containers..."
kubectl get pods --all-namespaces -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.spec.securityContext.runAsUser}{"\n"}{end}' | awk -F'\t' '$3 == "" || $3 == "0"'

echo "3. Checking cluster admin bindings..."
kubectl get clusterrolebindings -o json | jq -r '.items[] | select(.roleRef.name=="cluster-admin") | .metadata.name'

echo "4. Checking default service account permissions..."
kubectl auth can-i '*' '*' --as=system:serviceaccount:default:default

echo "5. Checking for network policies..."
kubectl get networkpolicy --all-namespaces --no-headers | wc -l

Pass/fail criteria: Checks 1 and 2 should print nothing, check 3 should list only the built-in cluster-admin binding plus bindings you created on purpose, check 4 should print no, and check 5 should print a count above zero. Anything else means you're still vulnerable.

Time to fix after identifying issues: Budget 4-8 hours per critical finding, 1-2 hours per medium finding.

The worst part about Kubernetes security is that it's easy to think you're secure when you're not. Test everything, assume nothing, and prepare to spend way more time on this than you initially planned.
