Your cluster autoscaler takes forever to spin up nodes during traffic spikes. I've seen this problem enough times to recognize the patterns.
The Shit That's Actually Broken
Your scan interval is probably wrong. Most people leave it at the default 10 seconds, which sounds fast, but production clusters usually bump it to 15-30 seconds to avoid hammering the API server. The problem: at 30 seconds, your autoscaler can go half a minute without even noticing pending pods, and only then does the 5+ minute node provisioning process start.
Had a traffic spike last year where the autoscaler just sat there doing nothing. Took way too long to realize what was happening - it wasn't even scanning for new work.
AWS will throttle you during spikes. Auto Scaling Groups have undisclosed API rate limits that you'll hit during any real traffic event. The autoscaler just fails silently when this happens. No errors, no warnings, just...nothing.
Spent a few hours debugging this during an incident before realizing AWS was just rate limiting every scaling request.
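If you want to at least see it happening, two things help: the autoscaler dumps its view of the world into a cluster-autoscaler-status ConfigMap in kube-system, and bumping log verbosity surfaces the cloud-provider errors that otherwise get swallowed. Here's a rough sketch of the relevant Deployment args - the image tag and flag values are illustrative, not gospel:

```yaml
# Excerpt from the cluster-autoscaler Deployment in kube-system.
# Illustrative values - adjust the image tag and flags for your setup.
containers:
  - name: cluster-autoscaler
    image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.28.2  # example tag
    command:
      - ./cluster-autoscaler
      - --cloud-provider=aws
      - --v=4                          # verbose enough that AWS API errors show up in the logs
      - --logtostderr=true
      - --stderrthreshold=info
      - --write-status-configmap=true  # status lands in kube-system/cluster-autoscaler-status
```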
Too many node groups kills performance. I've seen clusters with 20+ node groups because someone thought they needed a separate group for every instance type. The autoscaler has to simulate every pending pod against every possible placement. With that many groups, it spends 30-60 seconds just thinking before doing anything.
The Config That Actually Matters
Forget the documentation - here's what works in production:
Scan intervals based on reality:
- Small clusters: `--scan-interval=10s` (if you're lucky)
- Medium clusters: `--scan-interval=15s` (still slow during emergencies)
- Large clusters: `--scan-interval=20s` (pray you don't need fast scaling)
Never go above 30 seconds unless you enjoy watching your app crash while pods stay pending.
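For the record, that flag lives in the autoscaler's container args. A sketch for a medium cluster using the numbers above - the provision timeout is just the default, shown to remind you where the other half of the latency comes from:

```yaml
# Args excerpt from the cluster-autoscaler container (medium cluster).
command:
  - ./cluster-autoscaler
  - --cloud-provider=aws
  - --scan-interval=15s            # how often pending pods even get noticed
  - --max-node-provision-time=15m  # default; how long to wait for a new node before giving up
```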
Resource limits that don't suck:
The default 100MB memory limit is a joke. I've seen autoscaler pods OOM during large scale events, which is peak irony.
```yaml
resources:
  requests:
    memory: "1Gi"   # Start here or suffer
    cpu: "500m"     # CPU-bound during simulation
  limits:
    memory: "2Gi"   # Give it room to breathe
    cpu: "1"        # More for large clusters
```
Node groups that make sense:
Stop creating a node group for every instance type. I consolidate down to 3-5 max:
- General compute (mixed instances)
- Memory-heavy (for your data hogs)
- GPU (if you're doing ML nonsense)
- Spot instances (cheap and disposable)
More groups = more placement simulations for every pending pod. The simulation overhead will kill you during a spike.
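On AWS, that consolidation looks like a handful of --nodes registrations (or auto-discovery tags on the ASGs) instead of a wall of them. A sketch with made-up ASG names, one per bucket above:

```yaml
# Args excerpt - one ASG per bucket, not one per instance type.
# ASG names and min/max counts are placeholders.
command:
  - ./cluster-autoscaler
  - --cloud-provider=aws
  - --nodes=2:50:general-compute-mixed   # min:max:asg-name
  - --nodes=0:20:memory-heavy
  - --nodes=0:10:gpu-workers
  - --nodes=0:100:spot-general
  # Or tag the ASGs and let the autoscaler find them:
  # - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/my-cluster
```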
The Weird Shit That Breaks Everything
Pod Disruption Budgets become simulation hell. Each PDB adds complexity to the scale-down calculations. I've seen clusters with tons of PDBs take 10+ minutes just to figure out if it's safe to remove a node.
One cluster I worked on had PDBs for everything, including services that could handle full outages. Removing unnecessary PDBs cut scale-down time significantly.
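The usual offender is a stateless service with maxUnavailable: 0 (or a minAvailable equal to the replica count), which makes every node running one of those pods effectively undrainable. What I'd rather see - names and numbers here are made up:

```yaml
# Hypothetical PDB for a stateless web tier running 10 replicas.
# Leaves the autoscaler room to evict pods and drain nodes.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-frontend-pdb
  namespace: prod
spec:
  maxUnavailable: 2        # evictions allowed at any one time
  selector:
    matchLabels:
      app: web-frontend
```

And for services that genuinely tolerate a full outage, the right PDB is usually no PDB at all.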
DaemonSets pile up and slow everything down. Every DaemonSet has to be considered during node operations. Enterprise clusters love their monitoring, security, and networking tools - I've seen 15+ DaemonSets that turned every node operation into a crawl.
Multi-zone setups create weird bottlenecks. AWS environments with 6+ availability zones see notable simulation overhead. Unless you have regulatory requirements, stick to 3 zones max.
The real fix isn't tweaking scan intervals - it's fixing your cluster architecture so the autoscaler has less shit to think about.