When your autoscaler breaks, it never breaks gracefully. Pods pile up in pending state while customers start calling, and the autoscaler just sits there logging unhelpful messages like "didn't trigger scale-up." Here's how to actually debug this mess.
Step 1: Confirm What's Actually Broken
Don't start by assuming the autoscaler itself is broken. Half the time it's working fine and the problem is somewhere else entirely.
## See what pods are actually pending and why
kubectl get pods --all-namespaces --field-selector=status.phase=Pending
## Check if it's really an autoscaler problem
kubectl get events --field-selector reason=FailedScheduling -o wide
If you see "Insufficient cpu" or "Insufficient memory" in the events, keep reading. If you see taint/toleration errors or node selector issues, that's not an autoscaler problem - fix your pod specs.
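If it's not obvious which case you're in, the pending pod's own events spell it out (pod name and namespace here are placeholders):
## Read the scheduling failure straight off the pod
kubectl describe pod <pending-pod-name> -n <namespace> | grep -A 10 Events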
Check if autoscaler is even running:
kubectl get pods -n kube-system -l app=cluster-autoscaler
kubectl logs -n kube-system -l app=cluster-autoscaler --tail=50
If the pod doesn't exist or is stuck in CrashLoopBackOff, you have a deployment problem, not a scaling problem.
Step 2: Understand Why Autoscaler Isn't Scaling
The autoscaler's simulation phase is where most problems hide. Before scaling, it simulates placing your pending pods on a hypothetical new node from each node group. If the simulation fails, no scaling happens.
Look for these log patterns:
## This means simulation worked, scaling started
I1028 15:23:45.123456 scale_up.go:123] Scale-up: group <your-node-group> -> 3 (max: 5)
## This means simulation failed - pod won't fit anywhere
I1028 15:23:45.123456 scale_up.go:789] Pod <namespace/pod-name> is unschedulable
## This means recent failures, autoscaler is backing off
I1028 15:23:45.123456 scale_up.go:456] Skipping node group <your-node-group> - not ready for scaleup
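If you don't want to eyeball the whole log, a rough grep for those phrases does the job (assuming the usual kube-system deployment and app=cluster-autoscaler label from above):
## Pull scale-up decisions and failures out of recent logs
kubectl logs -n kube-system -l app=cluster-autoscaler --tail=500 | grep -Ei "scale-up|unschedulable|skipping node group"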
Common simulation failures:
Resource requests too large: Your pod wants 64GB RAM but your node groups only go up to 32GB. The autoscaler knows new nodes won't help.
Affinity rules that can't be satisfied: Pod anti-affinity or node affinity that creates impossible scheduling constraints.
Taints without tolerations: Node groups with taints that your pods don't tolerate. Autoscaler won't scale groups that can't run your workload.
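A quick way to rule out the first and last of these: compare what the pod asks for with what a node can actually allocate, and check node taints against the pod's tolerations (pod and namespace names are placeholders):
## What the pending pod is requesting
kubectl get pod <pending-pod-name> -n <namespace> -o jsonpath='{.spec.containers[*].resources.requests}'
## What your nodes can actually allocate
kubectl get nodes -o custom-columns=NAME:.metadata.name,CPU:.status.allocatable.cpu,MEMORY:.status.allocatable.memory
## Node taints to check against your pod's tolerations
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints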
Step 3: Debug Cloud Provider Issues
AWS weirdness that breaks scaling:
## Check if you hit Auto Scaling Group limits
aws autoscaling describe-auto-scaling-groups --auto-scaling-group-names <your-asg-name>
## Look for capacity issues in specific AZs
aws ec2 describe-instance-type-offerings --location-type availability-zone \
--filters Name=instance-type,Values=m5.large
Most common AWS failures:
- Hit max instances in ASG (fix: increase max capacity, as shown below)
- No capacity for instance type in AZ (fix: use mixed instance types)
- IAM permissions missing for autoscaling actions (fix: add proper policies)
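For the first one, raising the ceiling is a one-liner (the ASG name and size here are examples, use your own):
## Give the autoscaler room to grow
aws autoscaling update-auto-scaling-group --auto-scaling-group-name <your-asg-name> --max-size 10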
GCP quirks:
- Project quotas that you didn't know existed (see the check below)
- Instance group in wrong region/zone
- Firewall rules blocking node registration
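For the quota one, gcloud shows regional quotas and usage without digging through the console (the region is an example):
## Look for CPUS, INSTANCES, and IN_USE_ADDRESSES running close to their limits
gcloud compute regions describe us-central1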
Azure randomness:
- VM Scale Sets just being slow for no reason
- Subscription limits that aren't documented anywhere (see the check below)
- Network security groups causing registration failures
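For the subscription limits, az will at least show current usage against quota (the location is an example):
## Check compute quota usage in the region your scale set lives in
az vm list-usage --location eastus -o table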
Step 4: Fix Node Registration Failures
Nodes launch but never become Ready. This creates a nightmare scenario where the autoscaler thinks it scaled successfully, but no capacity actually becomes available.
## Check for nodes that registered but never became Ready
kubectl get nodes | grep NotReady
## Look for kubelet issues on the failed nodes
kubectl describe node <stuck-node-name>
Common registration failures:
Network connectivity: Node can't reach API server due to security groups, NACLs, or VPC configuration.
Wrong kubelet args: Node launched with wrong cluster name, endpoint, or auth config.
CNI problems: Pod networking not configured correctly so kubelet can't start.
Resource constraints: Node runs out of memory during startup due to too many DaemonSets.
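The node's conditions and events usually tell you which of these you're dealing with (node name is a placeholder):
## Node conditions - NotReady almost always comes with a reason and message
kubectl get node <stuck-node-name> -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.message}{"\n"}{end}'
## Events recorded against the node
kubectl get events --all-namespaces --field-selector involvedObject.name=<stuck-node-name>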
Step 5: Debug Permission and RBAC Issues
Autoscaler needs specific permissions that often get broken during cluster upgrades or security hardening.
## Check autoscaler service account permissions
kubectl auth can-i get nodes --as=system:serviceaccount:kube-system:cluster-autoscaler
kubectl auth can-i patch nodes --as=system:serviceaccount:kube-system:cluster-autoscaler
kubectl auth can-i update nodes --as=system:serviceaccount:kube-system:cluster-autoscaler
Cloud provider permissions:
- AWS: EC2 and AutoScaling API permissions
- GCP: Compute Engine and Instance Groups API access
- Azure: Virtual Machine Scale Sets permissions
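For AWS specifically, you can dry-run the role's policies against the actions the autoscaler calls (the role ARN is a made-up example - use the role your autoscaler actually runs as):
## Simulate whether the autoscaler's role can touch the ASG
aws iam simulate-principal-policy \
  --policy-source-arn arn:aws:iam::123456789012:role/cluster-autoscaler \
  --action-names autoscaling:SetDesiredCapacity autoscaling:DescribeAutoScalingGroups autoscaling:TerminateInstanceInAutoScalingGroup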
Kubernetes RBAC: The autoscaler ClusterRole might be missing permissions added in newer versions.
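To see what the RBAC actually grants, dump the rules (the ClusterRole name varies by install method; cluster-autoscaler is the common default):
## Compare these rules against what your autoscaler version expects
kubectl get clusterrole cluster-autoscaler -o yaml
kubectl get clusterrolebinding -o wide | grep -i autoscaler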
Step 6: Handle the Weird Edge Cases
Multiple autoscaler pods fighting: Leader election breaks and you get split-brain scaling decisions.
## Check for duplicate autoscaler pods
kubectl get pods -n kube-system -l app=cluster-autoscaler -o wide
## Look for leader election errors
kubectl logs -n kube-system -l app=cluster-autoscaler | grep -i leader
PodDisruptionBudgets blocking scale-down: Autoscaler can't remove empty nodes because PDBs prevent pod eviction.
## Find PDBs that might be blocking
kubectl get pdb --all-namespaces
## Check PDB status
kubectl describe pdb <pdb-name> -n <namespace>
DaemonSets causing simulation failures: Every new node has to run the DaemonSet pods first, and if their resource requests aren't accounted for, the simulation overestimates how much room is left for your pending pods.
Mixed instance policies confusing simulation: The autoscaler assumes every node in a group looks the same, so a mixed instance policy with wildly different sizes gives the simulation a wrong picture of what a new node can actually hold.
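To see how much of every new node the DaemonSets will eat before your pending pods get a look in:
## Per-DaemonSet requests that land on every new node - add them up yourself
kubectl get daemonsets --all-namespaces -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,CPU:.spec.template.spec.containers[*].resources.requests.cpu,MEMORY:.spec.template.spec.containers[*].resources.requests.memory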
The Nuclear Option: When Everything's Broken
Sometimes everything's so broken you just restart and hope for the best:
## Delete autoscaler pod (deployment will recreate)
kubectl delete pod -n kube-system -l app=cluster-autoscaler
## Scale down problematic node groups manually
aws autoscaling update-auto-scaling-group --auto-scaling-group-name <name> --min-size 0 --desired-capacity 0
## Wait for the old instances to terminate, then scale back up
aws autoscaling update-auto-scaling-group --auto-scaling-group-name <name> --min-size 1 --desired-capacity 3
This fixes:
- Corrupted autoscaler state
- Stuck scale-up attempts
- Node groups in weird intermediate states
- Leadership election problems
Don't do this in production unless everything's already broken. You'll lose all running workloads on affected nodes.
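If you do go nuclear and want at least some grace, cordon and drain the nodes before the scale-down step above (node name is a placeholder; older kubectl calls the last flag --delete-local-data):
## Evict pods cleanly before the ASG terminates the instance
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data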
What Actually Breaks Most Often
Based on actual production incidents:
IAM permissions - 40% of cases. AWS permissions especially get broken during security reviews.
Node group configuration - 25% of cases. Wrong instance types, subnets, or AMIs.
Resource request mismatches - 20% of cases. Pods requesting more than any node can provide.
Cloud provider capacity - 10% of cases. AWS/GCP/Azure just don't have the instances you want.
Network/registration issues - 5% of cases. Nodes launch but can't join the cluster.
The autoscaler itself rarely breaks. It's usually everything around it that's misconfigured or broken. Start by checking the environment, not the autoscaler configuration.