Why does VPA keep recommending insane amounts of memory?

VPA is completely clueless about your application. It stares at container limits like they're hieroglyphics and makes wild guesses based on startup spikes. Your Java app with a 4GB heap in an 8GB container? VPA thinks it needs 8GB. Set `maxAllowed` limits or prepare for bankruptcy. [GitHub issue #6705](https://github.com/kubernetes/autoscaler/issues/6705) has hundreds of people complaining about this exact bullshit.

Can I use VPA with HPA without everything breaking?

Don't use them on the same metrics or you'll create feedback loops that make your cluster act like a drunk person. VPA changes requests, HPA changes replicas - they fight each other. Use custom metrics for HPA (request latency, queue depth) and let VPA handle CPU/memory. [Official docs warn about this](https://kubernetes.io/docs/concepts/workloads/autoscaling/) for good reason.
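A minimal sketch of that split, assuming a Deployment named `my-app` and a per-pod metric called `queue_depth` served through a custom metrics adapter (for example prometheus-adapter). The names and the target value are hypothetical. The HPA scales replicas on queue depth only, so it never reacts to the CPU/memory requests that VPA is adjusting:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa              # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app                # hypothetical Deployment, also targeted by VPA
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Pods                # custom per-pod metric, not CPU or memory
      pods:
        metric:
          name: queue_depth     # hypothetical metric; requires a custom metrics adapter
        target:
          type: AverageValue
          averageValue: "30"    # illustrative threshold, tune for your workload
```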

Why does VPA keep killing my database pods?

Because you're using VPA on StatefulSets like a masochist. Don't do this. VPA evicts pods to resize them, which is fine for stateless apps but terrible for databases. If you must use VPA with databases, use "Initial" mode only and never "Recreate".
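For reference, an "Initial"-mode VPA for a database might look roughly like this. The StatefulSet name `postgres` is a placeholder, not something from a real cluster:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: postgres-vpa            # hypothetical name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: postgres              # hypothetical StatefulSet
  updatePolicy:
    updateMode: "Initial"       # requests are set only at pod creation; VPA does not evict running pods
```

With this mode, new requests only take effect the next time you restart the pod yourself, for example during a planned maintenance window.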

How long before VPA gives useful recommendations?

VPA needs [8 days of data](https://github.com/kubernetes/autoscaler/blob/master/vertical-pod-autoscaler/FAQ.md) to generate stable recommendations. New deployments get garbage suggestions for the first week because VPA is trying to do math with like 3 data points. I've seen it recommend 12GB for a hello-world container based on startup memory spikes.

Which workloads will make VPA lose its mind?

Anything that doesn't run the same way every day:

- **JVM apps** - VPA sees container-level memory, not heap usage
- **Batch jobs** - resource usage spikes confuse the algorithm
- **Machine learning training** - memory usage varies wildly between epochs
- **Single-replica apps** - VPA won't evict the last pod, so they never get resized
- **Databases** - unless you enjoy random restarts during peak traffic
- **Anything with scheduled traffic** - VPA doesn't understand "busy during business hours"

What happens when my pods OOM?

VPA is supposed to catch OOM events and increase memory recommendations, but there's a [nasty bug](https://github.com/kubernetes/autoscaler/issues/6705) where it creates OOM loops instead. Pod dies from OOM, VPA sees high memory usage and recommends more, new pod gets insufficient memory and dies again. Set proper `minAllowed` and `maxAllowed` boundaries or enjoy the infinite restart cycle.

How much memory does VPA itself need?

More than you'd expect. The VPA Recommender has [memory leaks](https://github.com/kubernetes/autoscaler/issues/6368) that get worse with cluster size, bad enough that we restart the component weekly like it's Windows XP. In large clusters (3000+ pods), I've seen it consume 5GB+ of RAM before getting OOMKilled. Plan on restarting VPA components on a regular schedule. The [Prometheus integration](https://github.com/kubernetes/autoscaler/tree/master/vertical-pod-autoscaler#quick-start) makes this 5x worse.

Can VPA use metrics other than CPU and memory?

No, VPA only looks at CPU and memory from the Metrics Server. It has no idea about network I/O, disk throughput, or custom application metrics. Your database is disk-bound but CPU/memory look fine? VPA thinks everything is perfect while your users wait 30 seconds for queries.

How often are VPA recommendations actually useful?

For boring, predictable workloads that do the same thing every day: pretty accurate. For everything else: wildly wrong. VPA's 8-day window misses weekly patterns, monthly cycles, and seasonal traffic. I've seen it recommend massive resources based on a one-time data migration spike from 3 weeks ago.

What if VPA recommends more resources than my nodes have?

VPA doesn't give a shit about node capacity. It'll happily recommend 64GB of RAM for a pod running on 32GB nodes. Your pods become unschedulable and you debug for hours before realizing VPA is the problem. Always set `maxAllowed` limits that actually fit on your nodes.

Is VPA actually production-ready?

VPA is production-ready if your definition of production includes:

- Regular memory leaks requiring component restarts
- Random pod evictions when resource requests change
- Admission controller webhooks that occasionally prevent all pod creation
- Recommendations that ignore application logic and business cycles

[Google runs it in GKE Autopilot](https://cloud.google.com/kubernetes-engine/docs/concepts/verticalpodautoscaler), so it can't be completely broken. Most sane people start with "Off" mode and spend months evaluating recommendations before trusting VPA to actually change anything.

Why is my VPA admission controller blocking pod creation?

The admission controller webhook is fragile as hell and occasionally decides to reject all pod creation requests with cryptic TLS errors. Usually means certificates expired or the webhook endpoint is unreachable. Pure joy to debug at 3am when your entire deployment is stuck in "Pending" state and you completely forgot VPA existed.

Vertical Pod Autoscaler (VPA) - AI-Optimized Technical Reference

Executive Summary

What VPA Does: Automatically adjusts Kubernetes pod CPU and memory requests based on actual usage patterns, eliminating resource guessing.

Critical Reality Check: VPA has severe production limitations including memory leaks, random pod evictions, and wildly inaccurate recommendations. Most organizations run it in "Off" mode indefinitely.

Production Readiness: Google uses it in GKE Autopilot, but most enterprises avoid auto-scaling modes due to operational risks.

Architecture Components

VPA Recommender

Function: Analyzes 8 days of historical usage data from Metrics Server
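Algorithm: Uses high percentiles of observed usage to absorb traffic spikes (with default recommender settings, roughly the 90th percentile for the target and the 95th for the upper bound)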
Critical Failure: Documented memory leaks consuming 200%+ of allocated resources
Performance Threshold: Becomes unreliable above 3,000 pods (5GB+ memory consumption)
Failure Impact: When Recommender crashes, no new recommendations generated
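The Recommender's output lands in the VPA object's status, which you can inspect with `kubectl get vpa <name> -o yaml`. The shape is sketched below. The container name and every number are made up for illustration, not taken from a real cluster:

```yaml
status:
  recommendation:
    containerRecommendations:
      - containerName: app        # hypothetical container
        lowerBound:               # below this, the pod is considered under-provisioned
          cpu: 50m
          memory: 200Mi
        target:                   # the requests VPA would apply
          cpu: 120m
          memory: 380Mi
        uncappedTarget:           # target before minAllowed/maxAllowed are applied
          cpu: 120m
          memory: 380Mi
        upperBound:               # above this, the pod is considered over-provisioned
          cpu: 900m
          memory: 1200Mi
```

A large gap between `target` and `uncappedTarget` is a hint that your boundaries are doing real work.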

VPA Updater

Function: Evicts pods requiring resource changes
Constraint: Respects Pod Disruption Budgets
Critical Limitation: Will not evict last pod in single-replica deployments
Operational Impact: Causes service disruption during pod restarts

VPA Admission Controller

Function: Injects new resource specifications into pods at creation time via a mutating admission webhook
Critical Failure Mode: When down, ALL pod creation fails
Debugging Difficulty: TLS certificate issues cause cryptic errors at 3am
Monitoring Requirement: Webhook latency must be monitored

Operational Modes

Mode	Behavior	Production Risk	Use Case
Off	Generates recommendations only	None	Production evaluation (most common)
Initial	Applies to new pods only	Low	Batch jobs, frequent restarts
Recreate	Kills pods to resize	High	Stateless apps only
Auto	Deprecated, same as Recreate	High	Legacy configurations

Resource Requirements & Costs

Time Investment

Initial Setup: 1-2 hours with cloud provider add-ons
Manual Installation: 4-8 hours including troubleshooting
Evaluation Period: 2-3 weeks minimum before trusting recommendations
Production Deployment: 2-3 months of gradual rollout

Infrastructure Costs

VPA Components: 1-2 CPU cores, 4-8GB memory for large clusters
Memory Leak Impact: Plan for weekly restarts, 5x memory with Prometheus integration
Cloud Cost Impact: Can significantly increase bills without proper boundaries

Expertise Requirements

Kubernetes Administration: Advanced level required
Monitoring Setup: Prometheus/Grafana knowledge essential
Troubleshooting Skills: Deep understanding of admission controllers and webhooks

Critical Configuration Requirements

Mandatory Resource Boundaries

resourcePolicy:
  containerPolicies:
    - containerName: "*"     # applies to every container in the pod
      minAllowed:
        memory: "100Mi"
        cpu: "100m"
      maxAllowed:
        memory: "2Gi"    # CRITICAL: Prevents bankruptcy
        cpu: "1"

Failure Consequence: Without boundaries, VPA can recommend far more than a workload needs (12GB for a hello-world container, for example) or more memory than any node in the cluster has.

Prerequisites

Metrics Server: Required, VPA fails silently without it
Kubernetes Version: 1.25+ required for stable operation
OpenSSL: 1.1.1+ required for certificate generation
Node Capacity: maxAllowed must fit actual node resources

Workload Compatibility Matrix

Suitable Workloads

CRUD APIs: Predictable resource patterns
Web Servers: Steady traffic patterns
Microservices: Consistent daily operations

Problematic Workloads

JVM Applications: VPA sees container-level memory, not heap usage
Batch Jobs: Resource spikes confuse recommendations
Machine Learning: Wildly variable memory usage per epoch
Single-Replica Apps: Never get resized because VPA won't evict the last pod
Databases: Random restarts during peak traffic

Failure Scenarios

OOM Loops: Insufficient memory → pod death → higher recommendation → repeat
Unschedulable Pods: Recommendations exceed node capacity
Startup Spikes: Initial 8-day period produces garbage recommendations

Integration Conflicts

VPA + HPA Combination

Conflict: Both compete for CPU/memory scaling decisions
Solution: Use custom metrics (latency, queue depth) for HPA
Failure Mode: Feedback loops causing cluster instability

Pod Disruption Budgets

Benefit: Prevents outages during resizing
Limitation: Blocks resizing of single-replica applications
Workaround: Use "Initial" mode for single-replica workloads
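A minimal PodDisruptionBudget that lets the Updater evict one pod at a time could look like the sketch below. The name and the `app: my-app` label are placeholders and must match your own pods:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb          # hypothetical name
spec:
  maxUnavailable: 1         # at most one pod may be evicted at any moment
  selector:
    matchLabels:
      app: my-app           # hypothetical label on the target pods
```

Note the flip side: a budget of `minAvailable: 1` on a single-replica workload never permits an eviction, which is exactly the limitation described above.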

Monitoring Requirements

Critical Metrics to Monitor

VPA Recommender Memory Usage: Alert above 2GB (leak indicator)
Admission Controller Latency: Pod creation fails if slow
Pod Eviction Rate: High rate indicates boundary issues
OOM Kill Rate: Should decrease after VPA implementation
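As a rough sketch of the first alert, here is a Prometheus rule file. It assumes cAdvisor metrics are scraped, that VPA runs in `kube-system`, and that the Recommender container is named `recommender`. Check all three against your own install before relying on it:

```yaml
groups:
  - name: vpa-health                      # hypothetical group name
    rules:
      - alert: VPARecommenderMemoryHigh   # hypothetical alert name
        # Working-set memory of the Recommender container above 2GiB
        expr: |
          container_memory_working_set_bytes{namespace="kube-system", container="recommender"}
            > 2 * 1024 * 1024 * 1024
        for: 15m                          # ignore short spikes
        labels:
          severity: warning
        annotations:
          summary: "VPA Recommender memory above 2GiB (possible leak)"
```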

Failure Indicators

Certificate Expiration: Admission controller TLS issues
Webhook Failures: Pod creation completely blocked
Recommendation Staleness: No updates in 24+ hours

Installation Procedures

Cloud Provider Methods

Google GKE: Enabled by default in Autopilot
Azure AKS: Add-on available since late 2023
Amazon EKS: Manual setup with IAM configuration required

Manual Installation

git clone https://github.com/kubernetes/autoscaler.git
cd autoscaler/vertical-pod-autoscaler/
./hack/vpa-up.sh

Critical Dependencies:

Metrics Server must be running first
Cluster-admin privileges required
Private registries need manifest updates

Common Failure Modes & Solutions

Memory Leak in Recommender

Symptoms: Gradual memory increase, eventual OOMKill
Frequency: Reported mainly in large clusters (3,000+ pods)
Solution: Weekly component restarts, upgrade to v0.13+

Wildly Inaccurate Recommendations

Cause: VPA only sees container-level memory usage, not what the application inside actually needs (a JVM holds on to its heap whether or not it is in use)
Example: Java app with 2GB heap in 4GB container gets 4GB recommendation
Solution: Set realistic maxAllowed boundaries
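One complementary mitigation, not part of VPA itself and offered here only as a sketch: size the JVM heap as a percentage of container memory, so the heap follows whatever limit the pod ends up with instead of being hard-coded. The container name, image, and sizes below are hypothetical. `-XX:MaxRAMPercentage` is a standard HotSpot flag on container-aware JVMs:

```yaml
# Fragment of a Deployment pod template (spec.template.spec)
containers:
  - name: app                                      # hypothetical container name
    image: registry.example.com/my-java-app:1.0    # hypothetical image
    env:
      - name: JAVA_TOOL_OPTIONS
        value: "-XX:MaxRAMPercentage=75.0"         # heap capped at 75% of the container memory limit
    resources:
      requests:
        memory: 2Gi                                # starting point; VPA may rewrite this
      limits:
        memory: 2Gi
```

You still want `maxAllowed` on the VPA side. This only keeps the heap and the container size from drifting apart.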

Admission Controller Failures

Symptoms: All pod creation fails with TLS errors
Root Cause: Certificate expiration or network issues
Prevention: Monitor webhook endpoint health

Single-Replica Resize Failures

Cause: VPA won't evict the last running pod, and a PodDisruptionBudget can block the eviction as well
Symptom: Recommendations generated but never applied
Solution: Use "Initial" mode or increase replica count
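If neither option fits, recent VPA releases also expose a per-object `minReplicas` setting under `updatePolicy`. The sketch below assumes your installed CRD has that field (verify with `kubectl explain vpa.spec.updatePolicy`) and uses placeholder names:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: singleton-vpa         # hypothetical name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: singleton           # hypothetical single-replica Deployment
  updatePolicy:
    updateMode: "Recreate"
    minReplicas: 1            # assumption: allows eviction even when only one replica is alive
```

This trades the "never resized" problem for a brief outage on every resize, so only use it where that is acceptable.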

Production Deployment Strategy

Phase 1: Observation (Weeks 1-3)

Deploy VPA in "Off" mode
Collect recommendations for all workloads
Identify obvious resource waste patterns
Set appropriate boundaries
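A Phase 1 object could look roughly like this. The Deployment name is a placeholder and the boundary values are illustrative; pick numbers that fit your nodes:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa              # hypothetical name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app                # hypothetical Deployment
  updatePolicy:
    updateMode: "Off"           # recommend only, never touch running pods
  resourcePolicy:
    containerPolicies:
      - containerName: "*"      # every container in the pod
        minAllowed:
          cpu: 100m
          memory: 100Mi
        maxAllowed:
          cpu: "1"
          memory: 2Gi           # keep this below what a single node can actually schedule
```

Switching to a later phase is then a one-line change to `updateMode`.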

Phase 2: Limited Testing (Weeks 4-8)

Enable "Initial" mode for new deployments
Focus on batch jobs and dev environments
Monitor for unexpected resource requests

Phase 3: Gradual Rollout (Weeks 9-16)

Enable "Recreate" mode for non-critical stateless apps
Implement comprehensive monitoring
Establish incident response procedures

Phase 4: Production Scale (Month 4+)

Expand to critical applications with careful monitoring
Maintain exclusion list for problematic workloads
Regular component health checks and restarts

Cost-Benefit Analysis

Benefits

Resource Optimization: 15-30% cost reduction for over-provisioned workloads
Eliminates Guessing: Data-driven resource allocation
Automatic Adjustment: Responds to application changes over time

Hidden Costs

Engineering Time: 2-3 months of careful deployment
Operational Overhead: Component monitoring and maintenance
Service Disruption: Pod restarts during business hours
Debugging Complexity: Admission controller failures are hard to diagnose

ROI Timeline

Break-even: 6-12 months depending on cluster size
Maximum Benefit: Organizations with severely over-provisioned workloads
Negative ROI: Small clusters or well-tuned applications

Critical Warnings

Financial Risks

Runaway Recommendations: VPA can recommend massive resources without boundaries
AWS Bill Shock: Documented cases of unexpected cost increases
Resource Waste: Overestimated recommendations common in first 8 days

Operational Risks

Service Disruption: Pod evictions during peak traffic
Cascade Failures: Admission controller down = no pod creation
Data Loss Risk: Database pod restarts without proper preparation

Technical Debt

Component Maintenance: Regular restarts required for memory leaks
Complex Troubleshooting: Multiple components can fail independently
Version Dependencies: Kubernetes version compatibility requirements

Decision Criteria

Use VPA When:

Cluster has 100+ pods with unknown resource requirements
Workloads have predictable, steady-state resource patterns
Engineering team can invest 3+ months in proper deployment
Cost optimization justifies operational complexity

Avoid VPA When:

Applications have well-tuned resource requests
Single-replica critical applications dominate workload
Limited operational expertise with Kubernetes admission controllers
Cannot tolerate pod restart disruptions

Alternative Approaches:

Manual Profiling: Use monitoring data to set requests manually
Resource Rightsizing Tools: Cloud provider recommendations
Cluster Autoscaler: Focus on node-level optimization instead
Custom Controllers: Build application-specific resource management

Useful Links for Further Investigation

Resources to Save Your Sanity

Link	Description
VPA GitHub Repository	The source of truth. Read the FAQ - it'll answer 80% of your questions and save you hours of debugging. The examples/ folder has working configs you can actually use.
Kubernetes Autoscaling Docs	Official docs that explain why VPA and HPA fight each other and how to make them play nice (spoiler: use different metrics).
VPA Design Proposal	The original design doc if you need to understand why VPA works the way it does (and why some design decisions seem insane).
Google Cloud VPA Docs	Google enables VPA by default in Autopilot because they're confident enough in their own code. Their docs don't suck, which is rare.
AWS EKS VPA Guide	Amazon makes you jump through IAM hoops but provides solid installation instructions. Follow them exactly or spend hours debugging permissions.
Azure AKS VPA Docs	Microsoft added VPA support in 2023. Their docs explicitly recommend starting with "Off" mode, which tells you everything about their confidence level.
KodeKloud VPA Architecture Guide	Good explanation of how VPA components work together. Has diagrams that actually make sense.
Densify VPA Tutorial	Practical examples with working YAML configs. They understand that VPA is complex and don't pretend otherwise.
StormForge VPA Guide	Comprehensive guide that covers the gotchas and limitations. Written by people who've obviously debugged VPA in production.
VPA GitHub Issues	Search for "memory leak" and you'll find hundreds of people reporting the same VPA Recommender issues. Good for confirming you're not insane.
Kubernetes Slack #sig-autoscaling	Ask questions about VPA here when GitHub issues aren't helpful. The maintainers sometimes respond.
Important VPA Issues to Watch	The infamous VPA Recommender memory leak issue. Follow this to track when (if?) they fix the most annoying VPA bug.
Stack Overflow VPA Questions	Real engineers asking why VPA is recommending insane memory amounts. More debugging stories than you can handle.
