Kubernetes Cluster Autoscaler: AI-Optimized Troubleshooting Reference
Technology Overview
Function: Automatically scales Kubernetes cluster nodes based on pod resource demands
Critical Limitation: Debugging breaks down at scale - pending-pod views become unusable past 1000+ pods and simulation slows badly beyond 30+ node groups
Operational Reality: The autoscaler itself rarely breaks (5% of cases) - environment misconfiguration causes 95% of failures
Failure Mode Classification and Resolution
Production Impact Severity
Error Type | Business Impact | MTTR | Frequency | Root Cause Distribution |
---|---|---|---|---|
IAM/Permissions | Complete scaling failure | 10-30 min | 40% of cases | Security reviews break AWS permissions |
Node Group Config | Partial scaling failure | 15-60 min | 25% of cases | Wrong instance types/subnets/AMIs |
Resource Mismatches | Pod-specific failures | 2-5 min | 20% of cases | Pods request more than largest node provides |
Cloud Provider Capacity | Uncontrollable delays | Cannot fix | 10% of cases | AWS/GCP/Azure capacity exhaustion |
Network/Registration | Silent capacity loss | 30-60 min | 5% of cases | Nodes launch but cannot join cluster |
Critical Error Messages and Operational Intelligence
"NotTriggerScaleUp pod didn't trigger scale-up"
Actual Meaning: Pod resource requests exceed largest available instance type in node groups
Investigation Time: 5 minutes if configuration issue
Common Scenario: Pods requesting 64GB RAM in clusters with 32GB max nodes
Fix Difficulty: Easy - check pod requests vs node capacity
Hidden Cost: Autoscaler keeps running its simulation every scan interval only to conclude no node group can fit the pod
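A quick sanity check is comparing the pending pod's requests against what any node can actually offer (pod and namespace are placeholders):
# Dump the pending pod's resource requests
kubectl get pod <pending-pod> -n <namespace> -o jsonpath='{.spec.containers[*].resources.requests}'
# List allocatable CPU/memory per node to see the largest node you actually have
kubectl get nodes -o custom-columns=NAME:.metadata.name,CPU:.status.allocatable.cpu,MEMORY:.status.allocatable.memory
If the pod's request exceeds every allocatable value in the second output, no amount of scaling will help until the node groups change.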
"0/3 nodes available: Insufficient cpu/memory" + No Scaling
Actual Meaning: Autoscaler pod misconfigured or missing permissions
Investigation Command: kubectl get events --field-selector reason=FailedScheduling
Diagnostic Pattern: Scheduling failures present but no autoscaler activity
Time Investment: 15 minutes for permission issues
Success Indicator: Corresponding autoscaler activity appears after permission fix
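The autoscaler also publishes its own view of node group health in a status ConfigMap; a stale or empty one confirms the autoscaler isn't acting (name assumes the default status ConfigMap and a kube-system deployment):
# Per-node-group health and last scale-up/scale-down activity as the autoscaler sees it
kubectl get configmap cluster-autoscaler-status -n kube-system -o yaml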
"cluster_autoscaler_cluster_safe_to_autoscale=0"
Actual Meaning: Autoscaler disabled itself due to fundamental failure
Investigation Priority: Check for multiple autoscaler pods fighting for leadership
Common Causes:
- Node registration failures (nodes join but never become Ready)
- RBAC permissions missing for core operations
- Duplicate autoscaler deployments
Diagnostic Command: kubectl describe pod -n kube-system -l app=cluster-autoscaler
Resolution Time: 2 minutes for duplicates, 30+ minutes for registration issues
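To rule out duplicates fighting over leadership, count the autoscaler pods and check who holds the leader-election lock (the lease name assumes defaults; older versions use a ConfigMap or Endpoints lock instead):
# More than one pod here usually means duplicate deployments
kubectl get pods -n kube-system -l app=cluster-autoscaler -o wide
# Inspect the leader-election record (lease name is an assumption based on defaults)
kubectl get lease cluster-autoscaler -n kube-system -o yaml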
"failed to fix node group sizes"
Actual Meaning: Cloud provider API rejecting scaling requests
Primary Cause: Auto Scaling Group hit max size limits (most common oversight)
Secondary Causes: IAM permissions missing, cloud provider capacity exhaustion
Investigation Priority: Check ASG settings before complex debugging
Cost of Delay: Hours wasted on "broken" autoscaler when simple limit reached
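Before any deeper debugging, one query confirms whether the ASG is simply at its ceiling (group name is a placeholder):
# If Current == Max, the autoscaler isn't broken - the limit is
aws autoscaling describe-auto-scaling-groups --auto-scaling-group-names <name> \
  --query 'AutoScalingGroups[0].{Desired:DesiredCapacity,Max:MaxSize,Current:length(Instances)}' --output table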
Production Configuration That Actually Works
Resource Requirements
- Minimum RBAC Permissions: get/patch/create nodes, Auto Scaling Group modifications
- Backoff Timing: 3 minutes scale-up failures, 15 minutes node group problems, exponential up to 30 minutes
- API Rate Limit Mitigation: Reduce --max-concurrent-scale-ups during peak traffic
- Scale-down Protection: Set --scale-down-unneeded-time to 20-30 minutes (not the default 10) for small pod workloads
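A sketch of how flags like these are typically wired up through the community Helm chart - extraArgs is the chart's flag pass-through, but verify key names against your chart version, and the repo alias is an assumption:
# Assumes the kubernetes/autoscaler chart repo is added under the alias "autoscaler"
helm upgrade --install cluster-autoscaler autoscaler/cluster-autoscaler \
  --namespace kube-system \
  --set extraArgs.scale-down-unneeded-time=25m \
  --set extraArgs.max-node-provision-time=15m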
Critical Configuration Warnings
- Mixed Instance Policies: Autoscaler selects "best" instance type and uses only that type per scaling event
- Default Settings Fail in Production: 10-minute scale-down threshold causes node thrashing with small pods
- Simulation Requirements: Launch template must specify exact instance types autoscaler should consider
- Taint/Toleration: Autoscaler won't scale node groups with taints that pending pods don't tolerate
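To spot a taint/toleration mismatch quickly, compare what the nodes carry against what the pending pod tolerates (pod and namespace are placeholders):
# List taints per node
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
# Dump the pending pod's tolerations for comparison
kubectl get pod <pending-pod> -n <namespace> -o jsonpath='{.spec.tolerations}'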
Debugging Workflow (3AM Production Incidents)
Step 1: Confirm Problem Source (5 minutes)
# Verify pending pods and reasons
kubectl get pods --all-namespaces --field-selector=status.phase=Pending
kubectl get events --field-selector reason=FailedScheduling -o wide
# Confirm autoscaler running
kubectl get pods -n kube-system -l app=cluster-autoscaler
Decision Point: If no "Insufficient cpu/memory" errors, not an autoscaler problem
Step 2: Simulation Analysis (10 minutes)
Log Patterns for Successful Scaling:
Scale-up: group <node-group> -> 3 (max: 5)
Log Patterns for Simulation Failure:
Pod <namespace/pod-name> is unschedulable
Skipping node group <name> - not ready for scaleup
Resolution Time: 2 minutes if resource mismatch, 30+ minutes if node registration
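One way to pull those patterns out of the logs in a single pass (label and namespace assume a standard deployment):
# Surface scale-up decisions and skip reasons from recent scan intervals
kubectl logs -n kube-system -l app=cluster-autoscaler --tail=500 | grep -E "Scale-up|unschedulable|Skipping node group"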
Step 3: Cloud Provider Validation (15 minutes)
AWS Verification Commands:
aws autoscaling describe-auto-scaling-groups --auto-scaling-group-names <name>
aws ec2 describe-instance-type-offerings --location-type availability-zone
Common AWS Failures: Max instances hit (easy fix), no capacity in AZ (cannot fix), IAM permissions (medium difficulty)
Step 4: Permission Verification (10 minutes)
kubectl auth can-i get nodes --as=system:serviceaccount:kube-system:cluster-autoscaler
kubectl auth can-i patch nodes --as=system:serviceaccount:kube-system:cluster-autoscaler
Success Rate: 90% of permission issues obvious from first check
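The same check can be looped over the other verbs the autoscaler's RBAC typically needs (adjust the verb list to your actual policy; this is a convenience sketch, not the canonical RBAC spec):
# Walk the core node verbs for the autoscaler's service account
for verb in get list watch patch update delete; do
  kubectl auth can-i "$verb" nodes --as=system:serviceaccount:kube-system:cluster-autoscaler
done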
Advanced Failure Scenarios
Nodes Launch But Pods Stay Pending
Symptom: Autoscaler reports a successful scale-up, but pods still can't schedule
Root Cause: Nodes stuck in NotReady state due to network/kubelet issues
Investigation: kubectl get nodes | grep NotReady
Common Causes: Security groups blocking kubelet port 10250, wrong VPC configuration, CNI failures
Resolution Difficulty: Hard - requires network troubleshooting expertise
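To see why a node is stuck, pull its conditions and confirm the kubelet port is reachable (node name and IP are placeholders; run the port check from a host with network access to the node):
# Print each node condition with its status and message
kubectl get node <notready-node> -o jsonpath='{range .status.conditions[*]}{.type}={.status} {.message}{"\n"}{end}'
# Verify the kubelet port the control plane needs is open
nc -zv <node-private-ip> 10250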
Silent Scaling Failures During Traffic Spikes
Symptom: failed_scale_ups_total metric climbing, no errors in logs
Root Cause: Cloud provider API rate limiting
Workaround: Reduce concurrent scale-ups, spread load across regions
Business Impact: Can't fix directly - requires architectural changes
Recent Failover Backoff
Symptom: "skipped due to recent failover" in logs
Mechanism: Exponential backoff up to 30 minutes for failed node groups
Reset Method: Restart autoscaler pod (fixes state, not underlying issue)
Production Risk: Must fix root cause before reset or failure repeats
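The reset itself is a one-liner, but only run it after the underlying node group problem is fixed (deployment name assumes a standard install):
# Clears the in-memory backoff state; does nothing about the root cause
kubectl rollout restart deployment/cluster-autoscaler -n kube-system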
Recovery Procedures by Scenario
Nuclear Option (Last Resort)
Conditions: Everything broken, multiple system failures
Commands:
kubectl delete pod -n kube-system -l app=cluster-autoscaler
aws autoscaling update-auto-scaling-group --auto-scaling-group-name <name> --desired-capacity 0
# Wait for drain, then scale back up
Consequences: Loses all running workloads on affected nodes
Usage: Only when system already completely broken
PodDisruptionBudget Resolution
Symptom: Can't scale down empty nodes
Investigation: kubectl get pdb --all-namespaces
Common Culprits: Monitoring agents, logging DaemonSets with restrictive PDBs
Resolution Time: 10-20 minutes to identify and modify PDBs
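A quick filter for the PDBs that are actually blocking drain - the ones with zero allowed disruptions:
# PDBs with ALLOWED=0 are pinning pods to otherwise-empty nodes
kubectl get pdb --all-namespaces -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,ALLOWED:.status.disruptionsAllowed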
Multiple Autoscaler Conflict
Symptom: Conflicting scaling decisions, leader election errors
Investigation: kubectl get pods -n kube-system -l app=cluster-autoscaler -o wide
Resolution: Delete duplicate pods, ensure single deployment
Time to Fix: 2 minutes
Decision Criteria for Alternatives
When to Use Mixed Instance Types
Recommended: Use multiple node groups with different instance types
Avoid: Single node group with mixed instance policy
Reason: Autoscaler simulation complexity causes unpredictable scaling behavior
When to Restart vs Debug
Restart First: Leadership conflicts, corrupted autoscaler state
Debug First: Permission errors, node registration failures
Cost Consideration: Debugging time vs impact of lost workloads
Regional vs Multi-AZ Strategy
Single Region Risk: API rate limiting during traffic spikes
Multi-AZ Complexity: Network configuration overhead increases failure modes
Recommendation: Start single-AZ, expand after operational stability achieved
Resource Investment Analysis
Expertise Requirements
- Basic Operation: Kubernetes admin skills sufficient
- Production Debugging: Requires cloud provider networking knowledge
- Advanced Scenarios: Multi-cloud experience, deep Kubernetes internals
Time Investments by Problem Type
- Permission Issues: 10-30 minutes (medium difficulty)
- Resource Mismatches: 2-5 minutes (easy, obvious from logs)
- Node Registration: 30-60 minutes (hard, requires network troubleshooting)
- Cloud Capacity: Cannot fix (wait for provider recovery)
Monitoring Requirements
Essential Metrics:
- cluster_autoscaler_cluster_safe_to_autoscale: System health indicator
- failed_scale_ups_total: API rate limiting detection
- Node Ready condition: Registration failure detection
Alert Thresholds:
- Safe to autoscale = 0: Immediate escalation
- Failed scale-ups increasing: Rate limiting investigation
- Nodes NotReady > 15 minutes: Network/configuration issue
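If Prometheus isn't scraping yet, the same metrics can be spot-checked straight off the autoscaler's endpoint (8085 is the usual default metrics port; deployment name assumed):
# Port-forward the metrics endpoint and grep for the two metrics that matter
kubectl port-forward -n kube-system deployment/cluster-autoscaler 8085:8085 &
curl -s localhost:8085/metrics | grep -E "cluster_safe_to_autoscale|failed_scale_ups_total"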
Hidden Costs and Operational Reality
Undocumented Behaviors
- AWS: Service quotas exist but aren't visible until you hit them
- GCP: Instance availability changes without notification
- Azure: VM Scale Sets experience random slowness without clear cause
Breaking Points
- 1000+ pending pods: UI becomes unusable for debugging
- 30+ node groups: Simulation complexity causes significant delays
- Mixed instance policies: Unpredictable instance type selection
Community Support Quality
- Kubernetes Slack #sig-autoscaling: Maintainers respond to edge cases
- GitHub Issues: Required reading for complex scenarios
- Cloud Provider Docs: AWS most complete, GCP admits problems, Azure minimal
This reference enables automated decision-making for autoscaler troubleshooting by providing structured failure analysis, time estimates, and difficulty assessments for each scenario.
Useful Links for Further Investigation
Troubleshooting Resources That Don't Suck
Link | Description |
---|---|
Kubernetes Autoscaler Troubleshooting FAQ | The one FAQ that actually has answers instead of telling you to "check your configuration." Covers most of the weird edge cases you'll encounter. |
AWS EKS Troubleshooting Guide | Rare AWS doc that's based on actual customer pain instead of perfect-world scenarios. Includes the IAM permissions that AWS forgot to mention elsewhere. |
GCP Troubleshooting Scale-Up Issues | Google's debugging guide that admits their platform has problems. Shows how to check for capacity issues and quota limits that GCP loves to hide. |
Autoscaler Issue #3115 - "Not Ready for ScaleUp" | The definitive thread about why autoscaler sometimes refuses to scale even when pods are pending. Required reading for anyone debugging scaling failures. |
Issue #6452 - Taints and Status Problems | How taints can completely break autoscaler behavior and the workarounds that actually work. Includes the `--status-taint` flag fix. |
Issue #4893 - Scale From Zero Problems | Why scaling from zero nodes is harder than it should be and the configuration tricks to make it work reliably. |
Autoscaler Simulator | Test your autoscaler config without burning money on real infrastructure. Shows exactly why pods won't schedule on new nodes. |
kubectl-node-shell Plugin | Debug node registration issues by getting shell access to nodes that won't join the cluster. Essential for CNI and networking problems. |
Popeye - Cluster Sanitizer | Finds configuration problems that break autoscaling. Especially good at catching resource request issues and misconfigured PDBs. |
Autoscaler Grafana Dashboard | Shows when scaling fails and why. Skip the fancy metrics and focus on `failed_scale_ups_total` and function duration. |
AWS Node Termination Handler | Required if you use spot instances. Prevents the "node disappeared during scaling" disasters that confuse autoscaler state. |
AWS Auto Scaling Group Limits Documentation | The limits that will bite you during traffic spikes. API rate limits aren't documented but these capacity limits are. |
GCP Instance Group API Reference | When GCP's autoscaling mysteriously fails, this is where you find the actual error messages instead of the useless ones in the console. |
Azure VM Scale Set Troubleshooting | Microsoft's admission that VM Scale Sets randomly break. Includes the "restart and pray" methodology that somehow works. |
Cluster Autoscaler Helm Chart | The configuration that actually works instead of the broken shit you'll find in Medium articles. Use this or spend weeks debugging why your YAML is cursed. |
Autoscaler Priority Expander Guide | How to control which node groups get scaled first. Essential for cost optimization and avoiding expensive instances. |
Kubernetes Slack #sig-autoscaling | Where the autoscaler maintainers hang out. They've seen every weird edge case and usually have workarounds for bugs that aren't fixed yet. |
Stack Overflow - Kubernetes Autoscaler Tag | Real production problems and the hacks people used to fix them. Filter by "newest" to see issues with current versions. |