KEDA - Kubernetes Event-driven Autoscaling: AI-Optimized Technical Reference
Overview
KEDA (Kubernetes Event-driven Autoscaling) is a CNCF graduated project that addresses a core Kubernetes autoscaling gap: out of the box, HPA scales only on CPU/memory metrics, not on actual workload events. KEDA scales pods based on real events (message queue depth, database state, HTTP requests) and can scale to zero when idle.
Latest Version: v2.17.2 (June 2025)
Kubernetes Requirement: v1.23+ (recommended), v1.17+ (limited features)
Core Problem & Solution
Critical HPA Limitation
- Traditional HPA scales only on CPU/memory metrics
- Real-world example: Redis queue with 40,000-50,000 messages, but low CPU means HPA won't scale
- Result: Response times degrade while HPA remains inactive
- Impact Severity: Production systems fail to scale under actual load conditions
KEDA's Approach
- Monitors 70+ external event sources directly
- Translates external metrics to Kubernetes-compatible format
- Enables scale-to-zero capability (actual zero pods, not minimum 1)
- Scale-up timing: 30-60 seconds from zero (not "within seconds" as marketed)
Architecture Components
KEDA Operator
- Monitors external event sources
- Reconciles ScaledObjects and ScaledJobs, creating and managing the underlying HPA objects
- Resource Requirements: 400-500MB RAM (not the documented 200MB); CPU varies with scaler count
Metrics Server
- Translates external metrics for Kubernetes HPA consumption
- Kubernetes allows only one external metrics API server per cluster, so only one KEDA installation can run per cluster
Scalers
- 70+ supported services including Kafka, Redis, RabbitMQ, AWS SQS, Azure Service Bus, Prometheus
- Polling frequency: Every 30 seconds default
- API cost impact: Each scaler generates constant API calls to event sources
Production Configuration
Resource Requirements (Actual vs Documented)
Component | Documented | Production Reality |
---|---|---|
RAM | 200MB | 400-500MB minimum; spikes to 700-800MB with 200+ ScaledObjects |
CPU | 100m | Variable, grows with scaler count |
Network | Not specified | Constant API polling every 30 seconds per scaler |
Critical Configuration Settings
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: worker-scaler
spec:
  scaleTargetRef:
    name: worker-deployment
  minReplicaCount: 1   # CRITICAL: don't set to 0 in production without cold start planning
  maxReplicaCount: 20  # MANDATORY: prevent resource exhaustion
  triggers:
    - type: redis
      metadata:
        address: redis.default.svc.cluster.local:6379  # use the full Kubernetes service name
        listName: work-queue
        listLength: '5'
      authenticationRef:
        name: redis-auth
```
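Under the hood, KEDA hands the trigger metric to HPA, which applies the standard ratio formula against the target (here, `listLength: '5'`). A rough sketch of that math (the function name and defaults are illustrative, not KEDA code):

```python
import math

def desired_replicas(queue_length: int, target_per_replica: int,
                     min_replicas: int = 1, max_replicas: int = 20) -> int:
    """Approximate the HPA external-metric formula KEDA feeds:
    desired = ceil(current_metric / target), clamped to the configured bounds."""
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    desired = math.ceil(queue_length / target_per_replica)
    return max(min_replicas, min(max_replicas, desired))

# With listLength '5' and a 42-message backlog, HPA asks for ceil(42/5) = 9 pods;
# a 40,000-message backlog is clamped to maxReplicaCount (20).
print(desired_replicas(42, 5))      # 9
print(desired_replicas(40_000, 5))  # 20
```

This is why `maxReplicaCount` is mandatory: a large backlog otherwise translates directly into thousands of requested pods.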
Critical Failure Scenarios
Silent Authentication Failures
- Problem: TriggerAuthentication fails without visible errors
- Impact: ScaledObjects remain inactive indefinitely
- Detection: `kubectl logs -l app=keda-operator -n keda | grep -i auth`
- Common causes: Typos in secret names, incorrect RBAC permissions, missing TriggerAuthentication
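A minimal working Secret plus TriggerAuthentication pair, matching the `redis-auth` reference in the ScaledObject above (the Secret name and key here are illustrative; the `name` under `secretTargetRef` must match the Secret exactly, because a mismatch fails silently):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: redis-secrets            # hypothetical Secret name
type: Opaque
stringData:
  redis_password: changeme
---
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: redis-auth               # referenced by authenticationRef.name in the ScaledObject
spec:
  secretTargetRef:
    - parameter: password        # the trigger parameter this value supplies
      name: redis-secrets        # a typo here produces no visible error
      key: redis_password
```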
Operator Crashes and Recovery
- Failure modes: Memory pressure, certificate issues, network timeouts
- Impact: All scaling stops at current replica count until recovery
- Detection: Monitor the `keda_scaler_errors_total` and `keda_scaled_object_paused` metrics
- Mitigation: Multiple replicas, proper resource limits, alerting
Scale-to-Zero Production Gotchas
- Cold start reality: 30-60 seconds minimum, longer with large images
- User impact: Timeouts on user-facing APIs
- Infrastructure issues: Some cloud load balancers fail with zero targets
- PVC cleanup: PersistentVolumeClaims not automatically cleaned with ScaledJobs
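If you do run scale-to-zero, the ScaledObject spec has knobs that soften these gotchas. A hedged fragment (the values are starting points to tune, not recommendations):

```yaml
spec:
  minReplicaCount: 0    # accept the 30-60 second cold start
  cooldownPeriod: 300   # wait 5 minutes after the last event before dropping to zero
  pollingInterval: 30   # default; shorter intervals multiply API calls to the event source
  fallback:             # hold a fixed replica count if the scaler itself starts failing
    failureThreshold: 3
    replicas: 2
```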
Scaler Reliability Assessment
Production-Ready Scalers
Scaler | Reliability | Common Issues | Notes |
---|---|---|---|
Apache Kafka | High | Consumer lag calculation | Solid for event processing |
RabbitMQ | High | Requires management plugin | Queue depth scaling reliable |
AWS SQS | High | IRSA configuration | Approximate message count sufficient |
Redis | Medium | Connection timeouts | Works well with proper auth |
Cron | High | Timezone configuration | Perfect for predictable workloads |
Problematic Scalers
Scaler | Issues | Alternative |
---|---|---|
Prometheus | PromQL debugging difficulty | Test queries in Prometheus first |
HTTP Add-on | Experimental status | Use cloud-specific solutions or HPA |
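If you use the Prometheus scaler anyway, debug the query in the Prometheus UI before wiring it into a trigger. A sketch (server address, metric, and threshold are placeholders):

```yaml
triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring.svc.cluster.local:9090
      # Paste this query into the Prometheus UI first and confirm it returns
      # a single scalar that behaves the way you expect under load
      query: sum(rate(http_requests_total{service="worker"}[2m]))
      threshold: '100'
```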
Cost Analysis
Cost Savings
- Scale-to-zero: ~60% reduction in staging environment costs
- Event-driven scaling: More precise resource allocation vs CPU-based scaling
- Hidden costs: Increased API calls to event sources, monitoring overhead
Cost Considerations
- AWS CloudWatch API: $0.01 per 1,000 requests (significant with many scalers)
- Increased monitoring complexity
- Operational overhead for scaler management
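The CloudWatch line item is easy to estimate up front. A back-of-the-envelope sketch using the $0.01 per 1,000 requests figure above (one API call per scaler per poll is a simplifying assumption; some scalers make several):

```python
def monthly_cloudwatch_cost(num_scalers: int,
                            polling_interval_s: int = 30,
                            price_per_1k_requests: float = 0.01) -> float:
    """Estimate monthly CloudWatch API cost from KEDA polling alone."""
    seconds_per_month = 30 * 24 * 3600  # 30-day month
    calls = num_scalers * seconds_per_month / polling_interval_s
    return calls * price_per_1k_requests / 1000

# 50 CloudWatch-backed scalers at the default 30-second interval:
print(f"${monthly_cloudwatch_cost(50):.2f}/month")  # $43.20/month
```

Small per cluster, but it scales linearly with scaler count and inversely with polling interval, so aggressive `pollingInterval` values add up across many clusters.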
Deployment Best Practices
Installation Method
```shell
# Recommended: use Helm, not raw YAML manifests
helm repo add kedacore https://kedacore.github.io/charts
helm install keda kedacore/keda --namespace keda --create-namespace
```
Time to a working install: ~2 minutes with Helm vs ~3 hours with raw YAML manifests, mostly spent debugging certificate and RBAC issues
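Given the real memory footprint noted earlier, it is worth overriding the chart's resource defaults at install time. A sketch of a values override (the key layout follows the kedacore chart's `resources` block, but verify against the chart version you install):

```yaml
# values.yaml for the kedacore/keda chart
resources:
  operator:
    requests:
      cpu: 100m
      memory: 400Mi   # match the observed baseline, not the documented 200MB
    limits:
      memory: 1Gi     # headroom for spikes at high ScaledObject counts
```

Apply it with `helm install keda kedacore/keda --namespace keda --create-namespace -f values.yaml`.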
OpenShift Specific
- Use OperatorHub instead of Helm
- Handles security context constraints automatically
Essential Monitoring
Required Prometheus metrics:
- `keda_scaler_errors_total`: scaler failures
- `keda_scaled_object_paused`: paused or broken scaling
- `keda_metrics_server_*`: metrics API server health
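A hedged Prometheus alerting sketch built on those metrics (rule names and thresholds are ours, and label names vary by KEDA version, so confirm them against your scrape output):

```yaml
groups:
  - name: keda-alerts
    rules:
      - alert: KedaScalerErrors
        expr: increase(keda_scaler_errors_total[5m]) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "KEDA scaler errors detected; check operator logs"
      - alert: KedaScaledObjectPaused
        expr: keda_scaled_object_paused > 0
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "A ScaledObject has been paused for 15 minutes"
```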
Decision Criteria
Use KEDA When
- Event-driven workloads with specific triggers
- Variable traffic patterns where CPU scaling fails
- Cost-sensitive environments requiring scale-to-zero
- Integration with external services HPA cannot monitor
Avoid KEDA When
- Simple web applications with predictable CPU-based scaling
- Real-time systems requiring sub-second response times
- Stateful workloads that cannot handle restarts
- Teams lacking understanding of event sources
Migration Strategy
- Test KEDA alongside HPA on non-production workloads
- Map existing CPU/memory triggers to KEDA equivalents
- Remove the existing HPA before deploying a ScaledObject for the same workload (KEDA creates its own HPA, and two HPAs targeting one deployment conflict)
- Monitor for scaling storms and resource exhaustion
Common Production Failures
Database Scaling Storm Example
- Scenario: Scaler with unbounded database query scanning entire user table every 30 seconds
- Impact: Database failure, 2-hour recovery time
- Root cause: Missing WHERE clause in monitoring script
- Prevention: Test all queries under load before production deployment
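For SQL-backed triggers, the fix is a bounded query against an indexed column. A sketch using the postgresql scaler (table, column, and auth names are hypothetical):

```yaml
triggers:
  - type: postgresql
    metadata:
      # Bounded: filters on an indexed status column instead of scanning the whole table
      query: "SELECT COUNT(*) FROM jobs WHERE status = 'pending'"
      targetQueryValue: '10'
    authenticationRef:
      name: postgres-auth
```

Remember this query runs every polling interval on every KEDA replica, so treat it like any other hot-path query: EXPLAIN it and load-test it first.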
Memory Pressure Incident
- Scenario: 200 ScaledObjects deployed simultaneously
- Impact: KEDA operator memory spike to 700-800MB, eventual crash
- Recovery: 3 hours, manual intervention required
- Prevention: Gradual rollout, resource monitoring, replica limits
Troubleshooting Workflow
- Check HPA status: `kubectl get hpa` and `kubectl describe hpa [name]`
- Verify KEDA operator logs: `kubectl logs -l app=keda-operator -n keda`
- Test authentication separately: Validate TriggerAuthentication before ScaledObject
- Verify network connectivity: Ensure cluster can reach event sources
- Monitor scaling metrics: Check for silent failures in Prometheus
Resource Links
- Official Documentation: Complete configuration reference
- Troubleshooting Guide: Production debugging steps
- Scalers Catalog: All 70+ supported event sources
- GitHub Issues: Real production problems and solutions
- KEDA Slack (#keda): Active community support
Useful Links for Further Investigation
Essential KEDA Resources
Link | Description |
---|---|
KEDA Official Documentation | Actually decent documentation that doesn't suck. Covers installation, configuration, all the scalers, and advanced topics for KEDA v2.17. They even have working examples. |
KEDA GitHub Repository | Main source code repository with 9.4k+ stars. Check the issues section - that's where you'll find the actual problems people are hitting in production. |
KEDA Samples Repository | Actually useful examples that work without modification. Start here if you want to copy-paste your way to success. |
KEDA Helm Chart | Official Helm chart for deploying KEDA with all the configuration options and production-ready defaults you actually need. |
KEDA Deployment Guide | Step-by-step installation instructions for Helm, YAML manifests, and operator-based deployments across different Kubernetes distributions. |
Quick Start with RabbitMQ and Go | Hands-on tutorial demonstrating KEDA autoscaling with a RabbitMQ message queue and Go consumer application. |
Azure Functions on Kubernetes | Example showing how to run and scale Azure Functions on Kubernetes using KEDA with Azure Storage Queue triggers. |
Complete Scalers Catalog | Detailed documentation for all 70+ available scalers including configuration examples, authentication methods, and troubleshooting tips. |
Prometheus Integration Guide | Instructions for monitoring KEDA metrics and scaling decisions using Prometheus and Grafana dashboards. |
KEDA HTTP Add-on (Experimental) | Experimental extension enabling HTTP-based autoscaling for web applications and APIs. |
KEDA Community Page | Information about community meetings, Slack channels, contributing guidelines, and user adoption stories. |
KEDA Slack Channel (#keda) | Active community where maintainers actually respond to questions. Much better than Stack Overflow for real-time troubleshooting. |
Community Meetings | Bi-weekly meetings (every other Tuesday at 17:00 CEST/CET) for community updates, feature discussions, and Q&A sessions. |
KEDA Videos and Talks | Collection of conference presentations, tutorials, and deep-dive technical sessions from KubeCon and other events. |
CNCF KEDA Graduation Announcement | Official announcement of KEDA's graduation to CNCF graduated project status, highlighting maturity and adoption milestones. |
KEDA Blog | Regular updates, feature announcements, case studies, and technical insights from the KEDA maintainers and community. |
KEDA Security Guide | Security best practices, authentication providers, and operational considerations for production deployments. |
Troubleshooting Guide | Solid troubleshooting docs covering common issues, debugging techniques, and performance optimization that actually work. |
KEDA Enterprise Solutions | Information about commercial support, professional services, and enterprise-grade deployments from KEDA ecosystem partners. |