Kubernetes Cost Optimization: Production Implementation Guide
Configuration
Phase 1: Assessment and Baseline (30-55% immediate savings potential)
Critical Prerequisites:
- Install cost monitoring tool BEFORE optimization attempts
- Collect a minimum of 2-4 weeks of usage data before making right-sizing decisions
- Document baseline metrics to prevent optimization failures
Cost Monitoring Tool Selection:
- KubeCost: Fastest deployment, commercial support, licensing limits
- OpenCost: CNCF project, no licensing restrictions, requires more setup
- Manual Prometheus: Use existing infrastructure, requires custom configuration
Installation Commands:
# KubeCost - Quick deployment
helm repo add kubecost https://kubecost.github.io/cost-analyzer/
helm install kubecost kubecost/cost-analyzer -n kubecost --create-namespace \
  --set prometheus.server.resources.requests.memory=4Gi \
  --set prometheus.server.resources.limits.memory=8Gi

# OpenCost - Free alternative
kubectl apply -f https://raw.githubusercontent.com/opencost/opencost/develop/kubernetes/opencost.yaml
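A quick sanity check once the chart is up: port-forward the UI and confirm cost data starts flowing. The deployment and port names below are the defaults from the KubeCost Helm chart; adjust them if you overrode the release name.

```bash
# KubeCost UI on http://localhost:9090 (allow some time after install for cost data to appear)
kubectl port-forward --namespace kubecost deployment/kubecost-cost-analyzer 9090
```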
Essential Data Collection:
# Identify over-provisioned resources
kubectl top pods --all-namespaces --sort-by=memory
kubectl get pods --all-namespaces -o jsonpath='{range .items[*]}{.metadata.namespace}{"/"}{.metadata.name}{"\t"}{.spec.containers[*].resources}{"\n"}{end}' | grep -v limits

# Find resource waste patterns (long-running pods worth reviewing)
kubectl get pods --all-namespaces --field-selector=status.phase=Running -o custom-columns="NAMESPACE:.metadata.namespace,NAME:.metadata.name,AGE:.metadata.creationTimestamp" | sort -k3
Baseline Assessment Results (Typical Findings):
- 40-60% of CPU requests unused
- 30-50% of memory requests unused
- Dev/staging consuming 40-70% of total spend
- 10-20% storage attached to deleted resources
- 5-15% load balancers serving zero traffic
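Two quick checks for the last two findings above (released volumes and idle load balancers); the grep filters are a rough sketch, not a full audit:

```bash
# Persistent volumes whose claims were deleted (Released) - cleanup candidates
kubectl get pv -o custom-columns="NAME:.metadata.name,STATUS:.status.phase,SIZE:.spec.capacity.storage,CLAIM:.spec.claimRef.name" | grep -E "NAME|Released"

# Every LoadBalancer service - cross-check each against traffic metrics before deleting
kubectl get svc --all-namespaces -o wide | grep -E "NAMESPACE|LoadBalancer"
```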
Phase 2: Implementation - Resource Right-Sizing and Spot Instances
Resource Right-Sizing Formula (Prometheus query sketch after this list):
- Memory requests: 80th percentile usage + 20% buffer
- CPU requests: 95th percentile usage (no buffer needed)
- Memory limits: 150-200% of requests (prevents OOM kills)
- CPU limits: Avoid - causes throttling issues
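A sketch of pulling those percentiles from Prometheus using the standard cAdvisor metrics; the namespace filter and the 14-day window are assumptions to adapt to your own retention and environments.

```promql
# Memory request candidate: 80th percentile of working-set usage over 14 days, plus 20% buffer
quantile_over_time(0.80, container_memory_working_set_bytes{container!="", namespace="production"}[14d]) * 1.2

# CPU request candidate: 95th percentile of per-container CPU usage (cores) over 14 days
quantile_over_time(0.95, rate(container_cpu_usage_seconds_total{container!="", namespace="production"}[5m])[14d:5m])
```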
Critical Failure Prevention:
- Monitor for OOMKilled events:
kubectl get events --all-namespaces --field-selector reason=OOMKilling
- Track restart counts: Increase >20% indicates over-aggressive sizing
- Validate 48 hours after changes before proceeding
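The events query above only sees kills still inside the event retention window. A complementary check, assuming jq is available, is to scan container statuses for a last termination reason of OOMKilled:

```bash
kubectl get pods --all-namespaces -o json | jq -r '
  .items[]
  | . as $pod
  | .status.containerStatuses[]?
  | select(.lastState.terminated.reason == "OOMKilled")
  | "\($pod.metadata.namespace)/\($pod.metadata.name) container=\(.name) exitCode=\(.lastState.terminated.exitCode)"'
```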
Spot Instance Architecture Requirements:
- Minimum 3 replicas for resilience
- Pod Disruption Budgets mandatory (minimal example after this list)
- Spread across availability zones
- Install AWS Node Termination Handler
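A minimal Pod Disruption Budget for a 3-replica spot workload; the name and label are placeholders to replace with your own.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb          # placeholder
spec:
  minAvailable: 2           # with 3 replicas, at most one pod can be evicted at a time
  selector:
    matchLabels:
      app: my-app           # must match the workload's pod labels
```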
Spot Instance Configuration (AWS EKS):
# Mixed node group configuration (eksctl nodeGroup fragment)
instancesDistribution:
  maxPrice: 0.50            # absolute cap in $/hour per instance, not a percentage of on-demand
  instanceTypes: ["m5.large", "m5.xlarge", "m5a.large", "m4.large"]
  onDemandBaseCapacity: 0
  onDemandPercentageAboveBaseCapacity: 0   # everything above the base runs on spot
  spotInstancePools: 4      # diversify across instance pools
Workload Tolerations for Spot:
# Pod template fragment for workloads allowed onto spot nodes
tolerations:
  - key: "spot-instance"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
nodeSelector:
  node-type: "spot"
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:            # without a selector the constraint matches nothing useful
      matchLabels:
        app: my-app           # placeholder - use the workload's own labels
Phase 3: Automation and Long-term Monitoring
Vertical Pod Autoscaler (VPA) Production Setup:
- Start with recommendation-only mode for 2 weeks (minimal manifest after this list)
- Graduate to automatic updates only after stable recommendations
- Monitor VPA-initiated restarts:
kubectl get events --field-selector reason=EvictedByVPA
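A minimal VPA object in recommendation-only mode, matching the rollout above; the target name is a placeholder.

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app            # placeholder - the workload to observe
  updatePolicy:
    updateMode: "Off"       # recommendation-only; graduate to "Auto" only after recommendations stabilize
```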
Cost Alert Configuration:
# Critical cost monitoring alerts (alerting-rules fragment; the kubecost_* metric names
# are as given - adjust them to whatever your cost exporter actually exposes)
- alert: KubernetesCostSpike
  expr: increase(kubecost_cluster_cost_total[7d]) > increase(kubecost_cluster_cost_total[7d] offset 7d) * 1.3
  for: 2h
- alert: ExpensivePodDetected
  expr: kubecost_pod_cpu_cost_total + kubecost_pod_memory_cost_total > 200
  for: 1h
Automated Environment Management:
kubectl does not expand wildcards in `-n`, so the scale commands have to iterate over matching namespaces (packaged as a CronJob in the sketch below):

# Development environment shutdown (7 PM weekdays) - schedule: "0 19 * * 1-5"
for ns in $(kubectl get ns -o jsonpath='{.items[*].metadata.name}' | tr ' ' '\n' | grep '^dev-'); do
  kubectl scale deployment --all --replicas=0 -n "$ns"
done

# Startup restoration (8 AM weekdays) - schedule: "0 8 * * 1-5"
# Note: this restores everything to 1 replica regardless of its original count
for ns in $(kubectl get ns -o jsonpath='{.items[*].metadata.name}' | tr ' ' '\n' | grep '^dev-'); do
  kubectl scale deployment --all --replicas=1 -n "$ns"
done
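One way to run the shutdown loop in-cluster is a CronJob. The namespace, service account (which needs RBAC permission to scale deployments in the dev-* namespaces), and kubectl image below are assumptions to adapt:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: dev-shutdown
  namespace: ops                         # placeholder namespace
spec:
  schedule: "0 19 * * 1-5"               # 7 PM weekdays, in the cluster's configured timezone
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: env-scaler # placeholder - needs RBAC to scale deployments in dev-* namespaces
          restartPolicy: Never
          containers:
            - name: scale-down
              image: bitnami/kubectl:latest
              command: ["/bin/sh", "-c"]
              args:
                - |
                  for ns in $(kubectl get ns -o jsonpath='{.items[*].metadata.name}' | tr ' ' '\n' | grep '^dev-'); do
                    kubectl scale deployment --all --replicas=0 -n "$ns"
                  done
```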
Resource Requirements
Implementation Timeline and Effort
| Phase | Duration | Engineering Hours | Risk Level | Expected Savings |
|---|---|---|---|---|
| Assessment & Baseline | 1-2 weeks | 20-40 hours | Low | 30-55% |
| Resource Right-sizing | 1-2 weeks | 20-30 hours | Medium | 20-40% |
| Spot Instance Integration | 2-4 weeks | 30-50 hours | Medium-High | 50-90% |
| Automation Setup | 2-3 weeks | 20-40 hours | Low | 5-15% |
| Total Implementation | 6-11 weeks | 90-160 hours | Medium | 50-70% |
Ongoing Maintenance Requirements
Monthly Maintenance (2-4 hours):
- Review cost monitoring alerts
- Validate VPA recommendations
- Clean up idle development environments
- Update spot instance configurations
Quarterly Reviews (4-8 hours):
- Assess new optimization opportunities
- Update resource quotas and policies
- Review and tune autoscaling parameters
- Validate cost monitoring accuracy
Tool-Specific Resource Costs
KubeCost Resource Requirements:
- Prometheus: 4-8GB memory, 2-4 CPU cores
- KubeCost analyzer: 2-4GB memory, 1-2 CPU cores
- Storage: 100-500GB for historical data
Operational Overhead:
- Initial setup: 40-80 hours engineering time
- Break-even point: ~1 month (typical ROI ~1,075%)
- Ongoing automation reduces manual effort to <5% of platform team time
Critical Warnings
What Official Documentation Doesn't Tell You
VPA Operational Reality:
- Will restart pods at inconvenient times in automatic mode
- Recommendations require 2+ weeks of stable traffic patterns
- Cannot handle workloads with strict availability requirements
- Kubernetes 1.25+ required, VPA v1.2.0+ for stability
Spot Instance Breaking Points:
- Single-replica deployments will experience downtime
- Stateful workloads (databases) should remain on on-demand instances
- Apps that need more than 30 seconds to shut down gracefully will fail
- Interruption rate: <5% per instance per month, but varies by region/type
Right-Sizing Failure Modes:
- Basing decisions on less than 2 weeks of data causes production outages
- Overly aggressive memory limits cause OOMKilled events
- CPU requests too low cause performance degradation during traffic spikes
- Batch jobs have different patterns than web applications
Resource Utilization Thresholds
Safe Operating Ranges:
- CPU utilization: 60-80% (above 80% risks performance degradation)
- Memory utilization: 70-85% (above 85% increases OOM risk)
- Storage utilization: <90% (above 90% causes pod evictions)
Danger Signals:
- Pod restart rate increase >20%
- Response time degradation >10%
- Any OOMKilled events in production
- CPU throttling visible in monitoring (query sketch after this list)
- Autoscaler failing to provision nodes within 2 minutes
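For the CPU throttling signal, one commonly used query is the ratio of throttled to total CFS periods from the cAdvisor metrics; the 25% threshold is a judgment call, not a hard rule.

```promql
# Fraction of CPU scheduling periods in which each pod was throttled; sustained values above ~0.25 are a red flag
sum(rate(container_cpu_cfs_throttled_periods_total{container!=""}[5m])) by (namespace, pod)
  /
sum(rate(container_cpu_cfs_periods_total{container!=""}[5m])) by (namespace, pod)
```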
Cost Monitoring Accuracy Issues
Common Billing Discrepancies:
- KubeCost may show $30k while AWS shows $25k due to:
  - Network transfer costs (separate AWS line items)
  - EBS storage and snapshots (billed under EC2-EBS)
  - Load balancer costs (separate ELB charges)
  - Reserved instance amortization differences
Bill Reconciliation Required:
# Enable AWS Cost and Usage Report integration (KubeCost Helm values fragment)
kubecostProductConfigs:
  athenaProjectID: my-athena-database   # as given; KubeCost's Athena integration expects the AWS account ID here
  athenaBucketName: aws-cur-bucket
  athenaRegion: us-east-1
Decision Criteria
Strategy Selection Matrix
| Strategy | Best For | Avoid When | Implementation Complexity |
|---|---|---|---|
| Resource Right-sizing | Over-provisioned workloads | Unknown traffic patterns | Low-Medium |
| Spot Instances | Stateless, fault-tolerant apps | Databases, single-replica services | Medium-High |
| Reserved Instances | Predictable workloads | Variable or temporary workloads | Low |
| Cluster Autoscaling | Variable traffic patterns | Consistent load, tight SLAs | Medium |
| Development Shutdown | Non-production environments | Always-on development requirements | Low |
Risk Assessment Framework
Low Risk (Implement First):
- Development environment shutdown
- Storage cleanup and optimization
- Reserved instance purchasing
- Basic resource quota implementation
Medium Risk (Implement with Monitoring):
- Resource right-sizing with gradual rollout
- Cluster autoscaling configuration
- Mixed on-demand/spot node groups
High Risk (Implement Last, Test Thoroughly):
- Full spot instance migration
- Aggressive VPA automatic mode
- Complex multi-tier autoscaling
ROI Calculation Framework
Monthly Savings Calculation:
Current monthly spend: $X
Optimization target: Y% reduction
Monthly savings: $X * (Y/100)
Annual savings: Monthly savings * 12
Implementation cost: Engineering hours * $150/hour
ROI = (Annual savings - Implementation cost) / Implementation cost * 100
Typical ROI Examples:
- $47k/month cluster, 50% reduction = $282k annual savings
- Implementation cost: $24k (160 hours)
- ROI: ~1,075% (payback in about 1 month)
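Plugging the example numbers into the formula above; a throwaway awk check using the already-stated assumptions ($150/hour, 160 hours):

```bash
awk 'BEGIN {
  monthly_spend = 47000; reduction = 0.50; hourly_rate = 150; hours = 160
  annual_savings = monthly_spend * reduction * 12                 # 282,000
  implementation_cost = hours * hourly_rate                       # 24,000
  roi = (annual_savings - implementation_cost) / implementation_cost * 100
  printf "Annual savings: $%d\nImplementation cost: $%d\nROI: %.0f%%\n", annual_savings, implementation_cost, roi
}'
```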
Implementation Patterns
Graduated Rollout Strategy
Week 1-2: Assessment Only
- Install monitoring tools
- Collect baseline data
- Identify obvious waste (idle resources, over-provisioning)
- Document current costs and utilization
Week 3-4: Low-Risk Optimizations
- Clean up unused resources
- Implement development environment schedules
- Right-size obviously over-provisioned workloads
- Target 20-30% cost reduction
Week 5-8: Spot Instance Integration
- Create mixed node groups
- Migrate stateless workloads to spot
- Implement interruption handling
- Target additional 30-50% reduction
Week 9-11: Automation and Monitoring
- Deploy VPA in recommendation mode
- Configure cost alerting
- Implement resource policy enforcement
- Establish ongoing optimization processes
Troubleshooting Common Failures
Right-sizing Causing OOM Kills:
- Increase memory requests by 25%
- Set memory limits to 150% of requests
- Monitor for 48 hours before further changes
- Check application memory leak patterns
Spot Instances Causing Downtime:
- Verify minimum 3 replicas for all spot workloads
- Confirm Pod Disruption Budgets exist
- Check topology spread constraints
- Validate graceful shutdown handling (<30 seconds)
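A pod-template sketch of that graceful-shutdown check: the grace period has to cover the preStop hook plus the application's own drain time. Image name and timings are placeholders.

```yaml
spec:
  terminationGracePeriodSeconds: 30                 # must exceed the preStop sleep plus in-app shutdown time
  containers:
    - name: my-app
      image: my-app:latest                          # placeholder
      lifecycle:
        preStop:
          exec:
            command: ["/bin/sh", "-c", "sleep 5"]   # give load balancers time to stop sending traffic
```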
Autoscaling Too Slow/Aggressive:
# Tune HPA behavior
behavior:
  scaleUp:
    stabilizationWindowSeconds: 30   # Faster response
    policies:
      - type: Percent
        value: 100
        periodSeconds: 15
  scaleDown:
    stabilizationWindowSeconds: 600  # Slower scale-down
Cost Monitoring Inaccuracy:
- Enable AWS Cost and Usage Report integration
- Wait 24-48 hours for bill reconciliation
- Validate node pricing data is current
- Check for missing resource types in monitoring
This guide provides the operational intelligence needed to successfully implement Kubernetes cost optimization while avoiding common failure modes that can impact production availability and performance.
Useful Links for Further Investigation
Tools I Actually Use (No Bullshit List)
| Link | Description |
|---|---|
| KubeCost | Official documentation for KubeCost, a commercial Kubernetes cost management tool with solid out-of-the-box monitoring and optimization. |
| OpenCost | Official documentation for OpenCost, the open-source (CNCF) cost monitoring option - no licensing restrictions, more setup work. |
| AWS Node Termination Handler | GitHub repository for the handler that drains nodes gracefully on spot interruption notices, preventing abrupt pod termination on AWS. |
| Karpenter | High-performance Kubernetes node autoscaler that launches and terminates nodes quickly in response to workload changes; a faster alternative to Cluster Autoscaler on AWS. |
| Production Kubernetes | O'Reilly book by Josh Rosso and co-authors - the definitive guide to building and operating Kubernetes clusters in production. |
| VPA troubleshooting docs | Vertical Pod Autoscaler troubleshooting documentation on GitHub, for the issues that come up during right-sizing. |
| Stack Overflow | Questions tagged kubernetes+cost - good for the specific error messages official docs don't cover. |