What is KEDA and Why You Actually Need It

KEDA (Kubernetes Event-driven Autoscaler) is a CNCF graduated project originally created by Microsoft and Red Hat. The current v2.x releases (v2.17 at the time of writing) fix Kubernetes' biggest autoscaling fuckup by scaling on metrics that actually matter.

The Problem: CPU Metrics Are Bullshit

Traditional Kubernetes Horizontal Pod Autoscaler (HPA) only looks at CPU and memory. That's like judging a restaurant's popularity by how hot the kitchen gets. Your message queue could have 10,000 pending jobs, but if your workers aren't pegging CPU, HPA doesn't give a shit.

I learned this the hard way when our Redis queue backed up to somewhere around 40-50k messages (hard to say exactly) and HPA just sat there like a useless brick while our response times turned to absolute shit. CPU was fine, so why scale?

How KEDA Actually Works

[Figure: KEDA architecture diagram - operator, metrics server, and scalers]

KEDA has three components that don't suck:

KEDA Operator monitors external stuff - your message queues, databases, whatever - and watches the ScaledObjects and ScaledJobs you define, scaling your workloads up and down (including to zero) based on what it sees.

Metrics Server translates external metrics into something Kubernetes HPA can understand. It's basically the middleware that makes KEDA play nice with existing K8s autoscaling.

Scalers connect to 60+ external services including Apache Kafka, Redis, RabbitMQ, AWS SQS, Azure Service Bus, Prometheus, and tons more.

Scale-to-Zero: Actually Useful

KEDA can scale your pods to zero when there's no work. Not "minimum 1 replica" - actual zero. When messages show up in your queue, it spins up pods in about 30 seconds (don't believe the "within seconds" marketing BS).
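
The knobs that control this live on the ScaledObject spec. A minimal sketch of just those two fields (the full spec shows up later in this piece):

spec:
  minReplicaCount: 0    # actual zero, not "minimum 1 replica"
  cooldownPeriod: 300   # seconds with no events before KEDA drops to zero (300 is the default)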

This saved us maybe 60% on our staging environment costs, hard to say exactly. Production? That's where you learn the hard way that your startup time is actually like 45 seconds on a bad day, not the 5 seconds you thought, and users start bitching about timeouts.

Real Production Gotchas Nobody Tells You

KEDA operator resource usage: The docs say 200MB RAM but that's bullshit. Plan for at least 400-500MB RAM, probably more if your cluster decides to be an asshole. Each scaler hammers your APIs every 30 seconds - hope you like those sweet, sweet cloud provider API bills.

Scale-to-zero timing: That "within seconds" claim? Complete bullshit. Expect like 30-60 seconds for the first pod, maybe longer if your image is huge. If you need sub-5-second response times, scale-to-zero will piss off your users.

Event source failures: If your Redis/Kafka/whatever goes down, KEDA keeps your app at current scale. It doesn't freak out and scale to zero, which is actually pretty smart.

Authentication debugging: TriggerAuthentication fails silently like a passive-aggressive coworker who leaves you Post-it notes about your failures instead of just telling you. You'll spend 6 hours debugging why your ScaledObject sits there doing absolutely nothing, only to discover you fat-fingered the secret name. Again. Always check kubectl logs -l app=keda-operator -n keda first, not after you've already questioned your career choices.

When KEDA Actually Makes Sense

  • Event-driven workloads (message queues, batch processing)
  • Variable traffic patterns where CPU scaling is useless
  • Cost-sensitive environments where scale-to-zero matters
  • Integration with cloud services that HPA can't see

KEDA works with Deployments, StatefulSets, and Jobs, plus any custom resource that implements the /scale sub-resource. It plays nice with existing VPA and other Kubernetes tools.

Anyway, here's how KEDA compares to the other autoscaling options that probably aren't working for you either.

KEDA vs HPA vs VPA - Kubernetes Autoscaling Comparison

| Feature | KEDA | HPA (Horizontal Pod Autoscaler) | VPA (Vertical Pod Autoscaler) |
| --- | --- | --- | --- |
| Scaling Direction | Horizontal (pod replicas) + scale-to-zero | Horizontal (pod replicas) | Vertical (resource requests/limits) |
| Scaling Triggers | 60+ event sources including message queues, databases, cloud services, custom metrics | CPU, memory, custom metrics via Kubernetes Metrics API | CPU and memory utilization patterns |
| Scale-to-Zero | ✅ Built-in capability | ❌ Minimum 1 replica | ❌ Not applicable |
| Event Sources | Apache Kafka, RabbitMQ, Redis, AWS SQS, Azure Service Bus, Prometheus, PostgreSQL, MongoDB, Cron, HTTP, and more | Kubernetes metrics only | Resource usage patterns only |
| Complexity | Moderate (requires event source configuration) | Low (straightforward setup) | High (requires careful tuning) |
| Best Use Cases | Event-driven applications, microservices, batch processing, serverless workloads | Traditional web applications with predictable traffic patterns | Resource optimization for long-running applications |
| Workload Types | Deployments, StatefulSets, Jobs, custom resources with /scale | Deployments, ReplicaSets, StatefulSets | Deployments, StatefulSets, DaemonSets, Jobs |
| Cost Efficiency | Excellent (scale-to-zero + precise event-based scaling) | Good (horizontal scaling based on metrics) | Good (resource optimization) |
| Cloud Integration | ✅ Native cloud service scalers | Limited to metrics | Resource-focused only |
| Learning Curve | Moderate (event source concepts) | Low (standard Kubernetes) | High (resource analysis required) |
| Production Readiness | ✅ CNCF Graduated, widely adopted | ✅ Built into Kubernetes | ✅ Stable but requires expertise |
| Dependencies | KEDA operator installation | Metrics Server | VPA components installation |
| Kubernetes Version | 1.23+ recommended | Built-in (1.1+) | 1.11+ |
| Resource Overhead | Minimal | Very minimal | Low |

Actually Deploying KEDA (And What Goes Wrong)

I've deployed KEDA about a dozen times across different teams. Here's what actually works and what will waste your time.

Installation: Just Use Helm

Skip the YAML manifests unless you love pain. The official Helm chart handles all the CRD and certificate BS for you:

helm repo add kedacore https://kedacore.github.io/charts
helm install keda kedacore/keda --namespace keda --create-namespace

Takes about 2 minutes if your cluster doesn't hate you. The YAML manifests exist but you'll spend like 3 hours dealing with webhook certificates, RBAC bullshit, and admission controllers that some asshole forgot to document.
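
If you want to bump the operator's resources at install time (see the production numbers below), you can pass overrides to the chart. The exact value keys have moved around between chart versions, so treat this as a sketch and confirm against helm show values kedacore/keda before copying:

helm install keda kedacore/keda --namespace keda --create-namespace \
  --set resources.operator.requests.memory=400Mi \
  --set resources.operator.limits.memory=1Gi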

OpenShift users: Use OperatorHub instead. It handles the security context constraints that will otherwise make your life miserable.

ScaledObject: Where Most People Screw Up

Here's a Redis scaler that actually works in production:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: worker-scaler
spec:
  scaleTargetRef:
    name: worker-deployment
  minReplicaCount: 1    # Don't set to 0 in prod unless you like angry users
  maxReplicaCount: 20   # Set this or prepare for chaos
  triggers:
  - type: redis
    metadata:
      address: redis.default.svc.cluster.local:6379
      listName: work-queue
      listLength: '5'
    authenticationRef:
      name: redis-auth

Common fuckups:

  • Setting minReplicaCount: 0 in production without thinking about cold start times
  • Forgetting maxReplicaCount and watching your cluster explode
  • Using redis-server:6379 instead of the full Kubernetes service name
  • Not setting up TriggerAuthentication for anything that needs auth

Authentication: The Part That Always Breaks

TriggerAuthentication fails silently. When your ScaledObject isn't scaling, this is usually why:

apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: redis-auth
spec:
  secretTargetRef:
  - parameter: password
    name: redis-secret
    key: password
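
For completeness, the Secret that this TriggerAuthentication points at is just a normal Kubernetes Secret - the name and key have to match what you referenced above:

apiVersion: v1
kind: Secret
metadata:
  name: redis-secret
type: Opaque
stringData:
  password: change-me   # the actual Redis password goes here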

Debug with: kubectl logs -l app=keda-operator -n keda | grep -i auth

Pro tip: Test your auth separately before adding it to ScaledObject. I've seen teams spend fucking days debugging scaling when the problem was just a typo in the secret name.
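
One way to do that separate test, assuming the redis-secret and work-queue from the examples above and a namespace that lets you run throwaway pods:

# pull the password KEDA will use, then hit Redis with it directly
REDIS_PASSWORD=$(kubectl get secret redis-secret -o jsonpath='{.data.password}' | base64 -d)
kubectl run redis-check --rm -it --restart=Never --image=redis:7 -- \
  redis-cli -h redis.default.svc.cluster.local -a "$REDIS_PASSWORD" llen work-queue

If that returns the queue length, auth and connectivity are fine and your problem is in the ScaledObject.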

Scalers That Actually Work Well

Apache Kafka: Solid. Scales based on consumer lag. Works great for event processing.
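
A typical Kafka trigger (goes under triggers: in the ScaledObject) looks roughly like this - the consumer group and topic are placeholders, and lagThreshold is the per-partition lag that counts as "one pod's worth" of work:

- type: kafka
  metadata:
    bootstrapServers: kafka.default.svc.cluster.local:9092
    consumerGroup: order-processor   # hypothetical group
    topic: orders                    # hypothetical topic
    lagThreshold: '50'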

RabbitMQ: Reliable. Queue depth scaling works exactly as expected. Just don't forget the management plugin.

AWS SQS: Works well with IRSA. Approximate message count is good enough for most use cases.

Prometheus: Powerful but debugging PromQL queries in KEDA will make you hate life. Test your queries in Prometheus first.

Cron: Perfect for predictable workloads. We use it to pre-scale before our morning batch jobs.
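
The pre-scale trick is just a cron trigger with a time window - something like this, with the schedule and replica count made up:

- type: cron
  metadata:
    timezone: America/New_York
    start: "0 6 * * *"     # scale up before the morning batch
    end: "0 9 * * *"       # hand back to the other triggers afterwards
    desiredReplicas: '10'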

Production Reality Check

Here's the shit nobody tells you in the marketing demos:

Resource requirements: The docs say 200MB RAM, 100m CPU. That's a fucking joke. We're using way more than that, probably 400-500MB with around 20 ScaledObjects. Hit some crazy memory spike when everyone deployed at once - think it was like 700-800MB? That was a fun Friday. Some genius created like 200 ScaledObjects all polling every 30 seconds. Each scaler hammers your APIs constantly - hope you like those sweet, sweet AWS CloudWatch API calls at $0.01 per 1,000 requests.
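
If the API bill is the problem, the polling rate is set per ScaledObject - a minimal sketch of slowing it down:

spec:
  pollingInterval: 120   # seconds between scaler checks (default 30); fewer API calls, slower reaction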

Scaling storms: Someone's query was scanning the entire user table every 30 seconds. Took forever to figure out which one was doing it. Database completely shit the bed, took like 2 hours to recover, and my phone wouldn't stop buzzing. Turns out some dev forgot a WHERE clause in their monitoring script. Always test your queries under load, preferably not in production like we did.
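
If you scale off a database query (the PostgreSQL scaler, for example), keep the trigger query cheap and bounded. A rough sketch - the field names are from my reading of that scaler's docs and may not match your KEDA version exactly, and the host/table names are made up:

- type: postgresql
  metadata:
    host: postgres.default.svc.cluster.local   # hypothetical service
    port: '5432'
    dbName: app
    userName: keda_reader
    sslmode: disable
    query: "SELECT count(*) FROM jobs WHERE status = 'pending'"   # indexed and bounded, not a full table scan
    targetQueryValue: '20'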

KEDA operator restarts: When KEDA crashes (and it will crash - we've seen it die from memory pressure, certificate issues, and random network timeouts), all scaling just stops. Your apps keep running at whatever scale they were at, which is great until your morning batch jobs don't scale up and your CEO asks why reports are 3 hours late.

Scale-to-zero gotchas:

  • Pod startup time matters. If your app takes 30+ seconds to start, users will notice.
  • PersistentVolumeClaims don't get cleaned up automatically with ScaledJobs
  • Some cloud load balancers freak out when target groups go to zero

Monitoring You Actually Need

Prometheus integration is essential. Watch these metrics:

  • keda_scaler_errors_total - When your scalers are failing
  • keda_scaled_object_paused - When a ScaledObject has been paused and isn't scaling
  • keda_metrics_server_* - API server health

Set up alerts for scaler errors. Silent failures are KEDA's specialty - it'll fail quietly while your pods sit there doing nothing and your users get pissed.
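
A bare-bones Prometheus alerting rule for the first of those - metric names and labels have shifted between KEDA versions, so confirm what your install actually exports before relying on it:

groups:
- name: keda
  rules:
  - alert: KedaScalerErrors
    expr: increase(keda_scaler_errors_total[5m]) > 0
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "KEDA scaler errors detected - check the keda-operator logs"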

When KEDA Isn't Worth It

  • Simple web apps with predictable traffic - just use HPA
  • Stateful workloads that can't handle restarts
  • Real-time systems that need sub-second response times
  • Teams that don't understand their event sources

KEDA is great for event-driven stuff. For everything else, it's probably overkill.


Still have questions? Of course you do - KEDA is powerful but not always intuitive. Here are the questions everyone asks when they're trying to figure out if KEDA is right for their setup.

Questions People Actually Ask

Q: What the hell is KEDA and why should I care?

A: KEDA (Kubernetes Event-driven Autoscaler) is a CNCF graduated project that scales pods based on actual events instead of bullshit CPU metrics. Your message queue has like 800 pending jobs but CPU is at 20%? KEDA scales up anyway. HPA just sits there like an idiot. KEDA supports 60+ event sources and can scale to zero pods when there's no work, which is great for your cloud bill.

Q: Can I run this on my janky cluster?

A: KEDA needs Kubernetes v1.23+ for all features, though v1.17+ works with limitations. Your cluster needs CRDs and Metrics Server. Works on AKS, EKS, GKE, OpenShift, and whatever homebrew Kubernetes setup you're running.

Q: Is this production-ready or just another demo project?

A: KEDA v2.17+ is solid. Microsoft uses it, Reddit runs it, and it's CNCF graduated, which means real governance and security practices. I've run it in production for 2+ years across multiple companies without major issues.

Q: Does KEDA cost anything or is it another "free trial" scam?

A: KEDA is actually free. No hidden fees, no "enterprise features", no bullshit. It's a CNCF project and open source. The only cost is the resources KEDA uses (plan for like 400-500MB RAM, maybe more if you're unlucky).

Q: Can I run KEDA with my existing HPA setup?

A: No, don't be an idiot. KEDA and HPA will fight each other over the same deployment. KEDA creates its own HPA under the hood. If you need CPU/memory scaling, use KEDA's CPU and Memory scalers instead of running both.
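
If you do need CPU/memory behaviour under KEDA, the triggers look roughly like this (thresholds are placeholders):

triggers:
- type: cpu
  metricType: Utilization
  metadata:
    value: '60'
- type: memory
  metricType: Utilization
  metadata:
    value: '70'

Note that these two can't scale to zero on their own, so keep minReplicaCount at 1 or above unless you pair them with an event-based trigger.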

Q: How fast does scale-to-zero actually work?

A: KEDA marketing says "within seconds", which is complete bullshit. Expect like 30-60 seconds for the first pod to start from zero, maybe longer if your image is huge or your cluster decides to be an asshole that day. Scale-to-zero works great for batch jobs and development, but think twice before using it for user-facing APIs.

Q: Can I run multiple KEDA installations because I like chaos?

A: No, and stop asking. One KEDA per cluster, period. Kubernetes only lets one external metrics server run, and KEDA claims that spot. Install a second one and they'll fight over who gets to provide metrics. Your scaling will break in fun and mysterious ways.

Q: What happens when KEDA crashes?

A: When the KEDA operator dies (and it will eventually), your apps keep running at whatever scale they were at. New scaling events stop processing until KEDA recovers. The HPA uses stale metrics for a while, then gives up. Run KEDA with multiple replicas and proper resource limits, or prepare for 3am pages when your batch jobs don't scale.

Q: How do I scale based on HTTP requests without losing my mind?

A: KEDA has no native HTTP scaler because HTTP scaling is hard. Your options: the experimental HTTP Add-on (use at your own risk), the Prometheus scaler with HTTP metrics (prepare for PromQL debugging hell), or cloud-specific options like Azure Application Insights. For most HTTP workloads, just use regular HPA.

Q: Can I throw multiple triggers at one ScaledObject?

A: Yes, and you should. Multiple triggers in one ScaledObject work fine. KEDA scales when ANY trigger fires, and HPA picks the highest replica count. It's cleaner than managing multiple ScaledObjects that fight each other.
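
For example, extending the earlier Redis ScaledObject with a cron pre-scale window (the schedule and replica counts are placeholders):

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: worker-scaler
spec:
  scaleTargetRef:
    name: worker-deployment
  minReplicaCount: 1
  maxReplicaCount: 20
  triggers:
  - type: redis
    metadata:
      address: redis.default.svc.cluster.local:6379
      listName: work-queue
      listLength: '5'
    authenticationRef:
      name: redis-auth
  - type: cron
    metadata:
      timezone: UTC
      start: "0 6 * * *"    # pre-scale before the morning batch
      end: "0 9 * * *"
      desiredReplicas: '5'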

Q: How often does KEDA poll my event sources?

A: Default is every 30 seconds, configurable per scaler. This matters for scale-from-zero timing - your first pod won't appear until the next poll cycle. Scale-up/down after that uses HPA's faster polling. Some scalers support webhooks to reduce polling, but most just hammer your APIs relentlessly every 30 seconds like an impatient child.

Q: How does KEDA handle auth without exposing my secrets?

A: KEDA uses TriggerAuthentication resources to keep credentials separate from ScaledObjects. It supports AWS IAM roles, Azure Workload Identity, GCP Workload Identity, HashiCorp Vault, and plain Kubernetes secrets. The auth setup is usually where things break first.

Q: Can I run KEDA with paranoid security settings?

A: Yes, KEDA v2.10+ runs with readOnlyRootFilesystem=true by default. Earlier versions need custom certificate volumes. If your security team makes you jump through hoops, check the security guide for all the knobs you can turn.

Q: Why isn't my ScaledObject doing anything?

A: Nine times out of fucking ten, it's auth. Check kubectl logs -l app=keda-operator -n keda | grep ERROR. Common fuckups: wrong service name, typo in secret name, missing RBAC permissions, network can't reach your event source, or you forgot to create the TriggerAuthentication entirely.

Q: How do I debug why scaling is broken?

A: Start with kubectl get hpa and kubectl describe hpa [scaledObject-name]. Look for "unable to get metrics" or similar errors. Check KEDA operator logs with kubectl logs -l app=keda-operator -n keda. If you're using the Prometheus scaler, test your PromQL query directly first. The troubleshooting guide has more debugging steps that actually work.

Q: How do I migrate from HPA without breaking production?

A: Start by running KEDA alongside HPA on non-production workloads. Map your CPU/memory triggers to KEDA's CPU and Memory scalers. Once you trust it, delete the HPA and let KEDA take over. The migration guide has the step-by-step process. Don't rush this - HPA conflicts with ScaledObjects.
