When your autoscaler breaks, it never breaks gracefully. Pods pile up in pending state while customers start calling, and the autoscaler just sits there logging unhelpful messages like "didn't trigger scale-up." Here's how to actually debug this mess.
Step 1: Confirm What's Actually Broken
Don't start by assuming the autoscaler itself is broken. Half the time it's working fine and the problem is somewhere else entirely.
## See what pods are actually pending and why
kubectl get pods --all-namespaces --field-selector=status.phase=Pending
## Check if it's really an autoscaler problem
kubectl get events --field-selector reason=FailedScheduling -o wide
If you see "Insufficient cpu" or "Insufficient memory" in the events, keep reading. If you see taint/toleration errors or node selector issues, that's not an autoscaler problem - fix your pod specs.
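If it's not obvious which case you're in, the pending pod's own events spell it out (pod name and namespace here are placeholders):
## Read the scheduling failure straight off the pod
kubectl describe pod <pending-pod-name> -n <namespace> | grep -A 10 Events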
Check if autoscaler is even running:
kubectl get pods -n kube-system -l app=cluster-autoscaler
kubectl logs -n kube-system -l app=cluster-autoscaler --tail=50
If the pod doesn't exist or is stuck in CrashLoopBackOff, you have a deployment problem, not a scaling problem.
Step 2: Understand Why Autoscaler Isn't Scaling
The autoscaler's simulation phase is where most problems hide. Before scaling, it simulates placing your pending pods on a hypothetical new node from each node group. If the simulation fails, no scaling happens.
Look for these log patterns:
## This means simulation worked, scaling started
I1028 15:23:45.123456 scale_up.go:123] Scale-up: group <your-node-group> -> 3 (max: 5)
## This means simulation failed - pod won't fit anywhere
I1028 15:23:45.123456 scale_up.go:789] Pod <namespace/pod-name> is unschedulable
## This means recent failures, autoscaler is backing off
I1028 15:23:45.123456 scale_up.go:456] Skipping node group <your-node-group> - not ready for scaleup
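If you don't want to eyeball the whole log, a rough grep for those phrases does the job (assuming the usual kube-system deployment and app=cluster-autoscaler label from above):
## Pull scale-up decisions and failures out of recent logs
kubectl logs -n kube-system -l app=cluster-autoscaler --tail=500 | grep -Ei "scale-up|unschedulable|skipping node group"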
Common simulation failures:
Resource requests too large: Your pod wants 64GB RAM but your node groups only go up to 32GB. The autoscaler knows new nodes won't help.
Affinity rules that can't be satisfied: Pod anti-affinity or node affinity that creates impossible scheduling constraints.
Taints without tolerations: Node groups with taints that your pods don't tolerate. Autoscaler won't scale groups that can't run your workload.
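A quick way to rule out the first and last of these: compare what the pod asks for with what a node can actually allocate, and check node taints against the pod's tolerations (pod and namespace names are placeholders):
## What the pending pod is requesting
kubectl get pod <pending-pod-name> -n <namespace> -o jsonpath='{.spec.containers[*].resources.requests}'
## What your nodes can actually allocate
kubectl get nodes -o custom-columns=NAME:.metadata.name,CPU:.status.allocatable.cpu,MEMORY:.status.allocatable.memory
## Node taints to check against your pod's tolerations
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints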
Step 3: Debug Cloud Provider Issues
AWS weirdness that breaks scaling:
## Check if you hit Auto Scaling Group limits
aws autoscaling describe-auto-scaling-groups --auto-scaling-group-names <your-asg-name>
## Look for capacity issues in specific AZs
aws ec2 describe-instance-type-offerings --location-type availability-zone \
--filters Name=instance-type,Values=m5.large
Most common AWS failures:
- Hit max instances in ASG (fix: increase max capacity, as shown below)
- No capacity for instance type in AZ (fix: use mixed instance types)
- IAM permissions missing for autoscaling actions (fix: add proper policies)
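For the first one, raising the ceiling is a one-liner (the ASG name and size here are examples, use your own):
## Give the autoscaler room to grow
aws autoscaling update-auto-scaling-group --auto-scaling-group-name <your-asg-name> --max-size 10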
GCP quirks:
- Project quotas that you didn't know existed (see the check below)
- Instance group in wrong region/zone
- Firewall rules blocking node registration
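For the quota one, gcloud shows regional quotas and usage without digging through the console (the region is an example):
## Look for CPUS, INSTANCES, and IN_USE_ADDRESSES running close to their limits
gcloud compute regions describe us-central1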
Azure randomness:
- VM Scale Sets just being slow for no reason
- Subscription limits that aren't documented anywhere (see the check below)
- Network security groups causing registration failures
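For the subscription limits, az will at least show current usage against quota (the location is an example):
## Check compute quota usage in the region your scale set lives in
az vm list-usage --location eastus -o table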
Step 4: Fix Node Registration Failures
Nodes launch but never become Ready. This creates a nightmare scenario where the autoscaler thinks it scaled successfully, but no capacity actually becomes available.
## Check for nodes that registered but never became Ready
kubectl get nodes | grep NotReady
## Look for kubelet issues on the failed nodes
kubectl describe node <stuck-node-name>
Common registration failures:
Network connectivity: Node can't reach API server due to security groups, NACLs, or VPC configuration.
Wrong kubelet args: Node launched with wrong cluster name, endpoint, or auth config.
CNI problems: Pod networking not configured correctly so kubelet can't start.
Resource constraints: Node runs out of memory during startup due to too many DaemonSets.
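The node's conditions and events usually tell you which of these you're dealing with (node name is a placeholder):
## Node conditions - NotReady almost always comes with a reason and message
kubectl get node <stuck-node-name> -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.message}{"\n"}{end}'
## Events recorded against the node
kubectl get events --all-namespaces --field-selector involvedObject.name=<stuck-node-name>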
Step 5: Debug Permission and RBAC Issues
Autoscaler needs specific permissions that often get broken during cluster upgrades or security hardening.
## Check autoscaler service account permissions
kubectl auth can-i get nodes --as=system:serviceaccount:kube-system:cluster-autoscaler
kubectl auth can-i patch nodes --as=system:serviceaccount:kube-system:cluster-autoscaler
kubectl auth can-i update nodes --as=system:serviceaccount:kube-system:cluster-autoscaler
Cloud provider permissions:
- AWS: EC2 and AutoScaling API permissions
- GCP: Compute Engine and Instance Groups API access
- Azure: Virtual Machine Scale Sets permissions
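For AWS specifically, you can dry-run the role's policies against the actions the autoscaler calls (the role ARN is a made-up example - use the role your autoscaler actually runs as):
## Simulate whether the autoscaler's role can touch the ASG
aws iam simulate-principal-policy \
  --policy-source-arn arn:aws:iam::123456789012:role/cluster-autoscaler \
  --action-names autoscaling:SetDesiredCapacity autoscaling:DescribeAutoScalingGroups autoscaling:TerminateInstanceInAutoScalingGroup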
Kubernetes RBAC: The autoscaler ClusterRole might be missing permissions added in newer versions.
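To see what the RBAC actually grants, dump the rules (the ClusterRole name varies by install method; cluster-autoscaler is the common default):
## Compare these rules against what your autoscaler version expects
kubectl get clusterrole cluster-autoscaler -o yaml
kubectl get clusterrolebinding -o wide | grep -i autoscaler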
Step 6: Handle the Weird Edge Cases
Multiple autoscaler pods fighting: Leader election breaks and you get split-brain scaling decisions.
## Check for duplicate autoscaler pods
kubectl get pods -n kube-system -l app=cluster-autoscaler -o wide
## Look for leader election errors
kubectl logs -n kube-system -l app=cluster-autoscaler | grep -i leader
PodDisruptionBudgets blocking scale-down: Autoscaler can't remove empty nodes because PDBs prevent pod eviction.
## Find PDBs that might be blocking
kubectl get pdb --all-namespaces
## Check PDB status
kubectl describe pdb <pdb-name> -n <namespace>
DaemonSets causing simulation failures: Every new node has to run the DaemonSet pods first, and if their resource requests aren't accounted for, the simulation overestimates how much room is left for your pending pods.
Mixed instance policies confusing simulation: The autoscaler assumes every node in a group looks the same, so a mixed instance policy with wildly different sizes gives the simulation a wrong picture of what a new node can actually hold.
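To see how much of every new node the DaemonSets will eat before your pending pods get a look in:
## Per-DaemonSet requests that land on every new node - add them up yourself
kubectl get daemonsets --all-namespaces -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,CPU:.spec.template.spec.containers[*].resources.requests.cpu,MEMORY:.spec.template.spec.containers[*].resources.requests.memory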
The Nuclear Option: When Everything's Broken
Sometimes everything's so broken you just restart and hope for the best:
## Delete autoscaler pod (deployment will recreate)
kubectl delete pod -n kube-system -l app=cluster-autoscaler
## Scale down problematic node groups manually
aws autoscaling update-auto-scaling-group --auto-scaling-group-name <name> --min-size 0 --desired-capacity 0
## Wait for the old instances to terminate, then scale back up
aws autoscaling update-auto-scaling-group --auto-scaling-group-name <name> --min-size 1 --desired-capacity 3
This fixes:
- Corrupted autoscaler state
- Stuck scale-up attempts
- Node groups in weird intermediate states
- Leadership election problems
Don't do this in production unless everything's already broken. You'll lose all running workloads on affected nodes.
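If you do go nuclear and want at least some grace, cordon and drain the nodes before the scale-down step above (node name is a placeholder; older kubectl calls the last flag --delete-local-data):
## Evict pods cleanly before the ASG terminates the instance
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data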
What Actually Breaks Most Often
Based on actual production incidents:
IAM permissions - 40% of cases. AWS permissions especially get broken during security reviews.
Node group configuration - 25% of cases. Wrong instance types, subnets, or AMIs.
Resource request mismatches - 20% of cases. Pods requesting more than any node can provide.
Cloud provider capacity - 10% of cases. AWS/GCP/Azure just don't have the instances you want.
Network/registration issues - 5% of cases. Nodes launch but can't join the cluster.
The autoscaler itself rarely breaks. It's usually everything around it that's misconfigured or broken. Start by checking the environment, not the autoscaler configuration.