Why Your K8s Costs Are Out of Control (And How KubeCost Fixes It)

Every month it's the same shit: the AWS bill comes in 50% higher than expected and nobody knows why. Your EKS cluster is burning through money, but AWS Cost Explorer just shows you EC2 instance costs. Useless.

The \"My Bill Exploded\" Problem

Here's what actually happens: Your data team spins up a Jupyter notebook for some "quick analysis" and accidentally leaves a model training job running all weekend. Meanwhile, your dev teams are deploying pods with CPU requests of 2 cores that actually use 0.1 cores. Nobody notices until the monthly AWS bill shows up at $47k instead of $30k.


KubeCost Shows You Where the Money Goes

This is where KubeCost saves your sanity (and budget). Instead of staring at meaningless EC2 line items, you get actual answers.

KubeCost deploys as a pod in your cluster and scrapes Prometheus metrics to figure out exactly what's consuming resources. It then applies real AWS pricing data to show you costs down to the individual pod level.

The architecture is straightforward: KubeCost runs in your cluster, connects to Prometheus for resource metrics, pulls pricing data from cloud providers, and serves a web UI for cost visibility.

What you actually get:

  • Pod-level costs: "That Redis pod costs $47/month"
  • Namespace breakdown: "QA environment is costing $8k/month"
  • Idle resource detection: "You have $3k/month of unused CPU"
  • Network costs that AWS hides in separate line items

Works with GKE, AKS, and even on-premises clusters if you hate yourself enough to run K8s on bare metal.
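
Once it's running you don't even need the UI for quick answers; the cost data is exposed over an HTTP API. A minimal sketch, assuming the default kubecost-cost-analyzer service from the Helm chart and the Allocation API endpoint (verify the path and parameters against the API docs for your version):

# Expose the KubeCost API locally
kubectl port-forward -n kubecost svc/kubecost-cost-analyzer 9090:9090 &

# Last 7 days of cost, rolled up per namespace
curl -s "http://localhost:9090/model/allocation?window=7d&aggregate=namespace" | jq '.data'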

IBM Bought Them (September 2024)

IBM acquired KubeCost for their Apptio portfolio. Good news: Enterprise features actually work now. Bad news: Expect pricing to go up. The free tier still exists (up to 250 CPU cores) but enterprise features cost real money.


KubeCost vs The Competition (Honest Assessment)

| Feature | KubeCost | OpenCost | CloudZero | AWS Cost Explorer |
|---|---|---|---|---|
| Actually Works | ✅ (post-IBM acquisition) | ⚠️ (if you like YAML hell) | ✅ (enterprise-priced) | ❌ (useless for K8s) |
| Setup Reality | 10 min (5 min install + 5 min RBAC debugging) | 2-6 hours (manual Prometheus config) | $50k + 3 months consulting | Already there (but useless) |
| Pod-level Costs | ✅ Actually accurate | ✅ Close enough | ⚠️ Rolls up to services | ❌ Shows EC2 instances only |
| Multi-cluster | ✅ (Enterprise only, $$) | ❌ (manual aggregation hell) | ✅ (if you're Netflix-scale) | ❌ (per-account silos) |
| When It Breaks | Slack support + docs | GitHub issues + prayers | Actual support engineers | AWS Support (good luck) |
| Prometheus Requirements | Bundled or BYO | BYO + debugging | Managed | N/A |
| Memory Usage | 2-4GB (plan for 8GB) | 1-2GB (if configured right) | Unknown (they manage it) | N/A |
| Real Cost Accuracy | 95%+ (after bill reconciliation) | 85%+ (manual reconciliation) | 98%+ (they charge for accuracy) | 100% (but wrong granularity) |

What Actually Breaks When You Deploy KubeCost

(Diagram: KubeCost Prometheus integration)

Let's talk about what happens when you actually install this thing in production. The marketing says "5 minutes" but plan for 2 hours minimum because something will definitely break. Note: KubeCost 2.7 (current version as of September 2025) is much more stable than earlier 2.x releases.

Installation Reality Check

The Basic Install:

helm repo add kubecost https://kubecost.github.io/cost-analyzer/
helm install kubecost kubecost/cost-analyzer -n kubecost --create-namespace

What actually happens:

  1. Prometheus starts scraping and immediately runs out of memory
  2. RBAC permissions are wrong and pods can't read cluster metrics
  3. LoadBalancer service gets stuck in "Pending" because your cluster doesn't have one
  4. Cost data shows up as $0 for everything because cloud pricing API calls are failing
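
A rough post-install sanity check catches most of these before you waste an afternoon; the pod and service names assume the default release name used above:

# Are the pods actually running, or stuck in Pending/CrashLoopBackOff?
kubectl get pods -n kubecost

# Is the cost-analyzer pulling pricing data, or logging IAM/pricing errors?
kubectl logs -n kubecost deployment/kubecost-cost-analyzer --tail=50

# Skip the LoadBalancer drama entirely and port-forward to the UI
kubectl port-forward -n kubecost svc/kubecost-cost-analyzer 9090:9090
# then open http://localhost:9090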

Prometheus Memory Explosions

The "bundled Prometheus" is a lie. It's a minimal config that works for demo clusters but dies on real workloads. Here's what you need to know:

Resource requirements (real numbers):

  • Small cluster (< 100 pods): 4GB RAM minimum, plan for 8GB
  • Medium cluster (100-500 pods): 8GB RAM, 4 CPU cores
  • Large cluster (500+ pods): 16GB+ RAM, dedicated nodes

The most common failure mode: the bundled Prometheus starts scraping a real workload and immediately OOMKills itself.

Fix: point KubeCost at your existing Prometheus, or size the bundled one properly:

prometheus:
  server:
    resources:
      requests:
        memory: 8Gi
        cpu: 2000m
      limits:
        memory: 16Gi
    retention: "7d"
    storagePath: /data
    persistentVolume:
      size: 100Gi
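
Before trusting those numbers, watch what the bundled Prometheus actually consumes under real scrape load; this assumes metrics-server is working and the bundle runs in the kubecost namespace (the pod name is a placeholder):

# Actual memory consumption of everything in the namespace
kubectl top pods -n kubecost

# Has the Prometheus server pod already been OOMKilled and restarted?
kubectl get pods -n kubecost
kubectl describe pod -n kubecost <prometheus-server-pod> | grep -i -A3 "last state"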

Storage Grows Faster Than Expected

The "1GB per 1000 pods per month" estimate is bullshit. Plan for 3-5x that number.

What actually happens in production:

  • 200-pod cluster: 5-10GB/month
  • 1000-pod cluster: 50GB+/month
  • Multi-cluster federation: 2-3x single cluster storage needs
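
A quick way to check how fast you're actually eating the volume, assuming the persistent volume is mounted at /data as in the config above (pod name is a placeholder):

# How big is the claim, and is it bound?
kubectl get pvc -n kubecost

# How much of the mounted volume is actually used?
kubectl exec -n kubecost <prometheus-server-pod> -- df -h /data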

Cost Data Accuracy Problems

Why your numbers don't match AWS bills:

  1. Reserved Instance allocation is broken - KubeCost doesn't properly distribute RI discounts across workloads
  2. Network costs are estimated - AWS Data Transfer pricing is complex and KubeCost guesses wrong
  3. Spot instance pricing lags - Real spot prices change every 5 minutes, KubeCost updates hourly
  4. EBS volume costs are weird - gp3 IOPS pricing isn't handled correctly in older versions

Bill reconciliation fixes most issues but requires:

  • AWS Cost and Usage Reports configured properly
  • S3 bucket with proper IAM permissions
  • 24-48 hour delay for reconciled data to appear
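
The Helm side of that wiring looks roughly like the sketch below. The kubecostProductConfigs Athena keys have existed in the chart for a while, but treat every key name and value here as an assumption to verify against your chart's values.yaml and the KubeCost cloud-integration docs:

# Hedged sketch - key names may differ between chart versions
kubecostProductConfigs:
  athenaBucketName: "s3://your-athena-query-results"   # bucket Athena writes CUR query results to
  athenaRegion: "us-east-1"
  athenaDatabase: "athenacurcfn_your_cur_report"       # Glue database created by the CUR-to-Athena setup
  athenaTable: "your_cur_report"
  athenaProjectID: "123456789012"                      # your AWS account ID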

Multi-Cluster Federation (Enterprise)

The federated ETL setup is where things get really fun. IBM's docs make it sound simple but here's the reality:

What breaks:

  • ETL pipeline fails silently when one cluster has connectivity issues
  • Data deduplication doesn't work with different cluster naming schemes
  • Thanos integration requires custom Prometheus configs
  • Cross-cluster networking policies block federation traffic

Network requirements nobody tells you: every cluster needs outbound access to the shared object-store bucket the federated ETL writes to, and if you go the Thanos route, the primary cluster also has to reach each cluster's Thanos endpoints.

ARM64 Node Issues

Running KubeCost on ARM-based nodes (AWS Graviton, Apple Silicon) is broken in subtle ways:

Problems you'll hit:

  • Cost-analyzer pod crashes on ARM nodes with "exec format error"
  • Multi-arch images exist but Helm chart doesn't use them by default
  • Prometheus node-exporter metrics are missing CPU topology data
  • Network cost calculation fails on Graviton instances

Workaround: Pin KubeCost pods to AMD64 nodes:

nodeSelector:
  kubernetes.io/arch: amd64
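
The standard kubernetes.io/arch label is already on every node, so you can see exactly what mix you're scheduling onto before and after applying that selector:

# Show the CPU architecture of every node as an extra column
kubectl get nodes -L kubernetes.io/arch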

Performance at Scale

Memory usage scales badly:

  • 100 nodes: 2GB
  • 500 nodes: 8GB
  • 1000 nodes: 20GB+ (not the documented 4GB)
  • 2000+ nodes: You need the enterprise federated architecture or it falls over
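
If you're staying single-cluster, at least give the cost model room to breathe. A hedged sketch of the Helm values; the kubecostModel key path is how the chart has exposed these resources historically, so confirm it against your chart version before applying:

# Hedged sketch - verify the key path in your chart's values.yaml
kubecostModel:
  resources:
    requests:
      memory: 8Gi
      cpu: 1000m
    limits:
      memory: 16Gi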

Query timeouts become common at this scale too: large time ranges over lots of pods start hitting the UI's query timeout (see the FAQ below for workarounds).

After dealing with all these deployment issues, you might wonder if there are better alternatives. That's fair - KubeCost isn't the only game in town, and depending on your situation, something else might make more sense.


Questions Engineers Actually Ask

Q: Why does KubeCost show $30k but my AWS bill is $25k?

A: Because KubeCost doesn't allocate Reserved Instance discounts properly, and it includes networking costs you forgot about. Also, Spot instance pricing changes every 5 minutes but KubeCost only updates hourly.

Real fix: enable bill reconciliation with your AWS Cost and Usage Reports. It takes about 48 hours to sync, but after that it's accurate within 2-3%.

Q: Installation is stuck at "Gathering metrics" for 2 hours - what the hell?

A: Your Prometheus is probably broken or can't scrape cAdvisor metrics. Check whether these work:

kubectl get --raw "/api/v1/nodes/<node-name>/proxy/metrics/cadvisor" | head -5
kubectl top nodes

If those fail, your metrics-server is fucked or your cluster RBAC is blocking metric access.

Q: KubeCost killed my existing Prometheus - how do I fix this?

A: Yeah, it does that. KubeCost's Prometheus has aggressive scraping configs that can overwhelm smaller clusters. Either:

1. Use your existing Prometheus (preferred):

prometheus:
  server:
    enabled: false
# Point to your existing Prometheus
prometheusEndpoint: "http://your-prometheus:9090"

2. Or increase resources on KubeCost's Prometheus:

prometheus:
  server:
    resources:
      limits:
        memory: 16Gi
        cpu: 4000m

Q: Cost data is all $0 or missing - what's wrong?

A: Check these in order:

  1. Cloud pricing API calls failing (check IAM permissions)
  2. Prometheus can't reach the Kubernetes API
  3. Node pricing data isn't available (happens on custom instance types)
  4. Network policies blocking the cost-analyzer pod

Debug commands:

kubectl logs -n kubecost deployment/kubecost-cost-analyzer -f
kubectl get pods -n kubecost -o wide

Look for "failed to get pricing data" or "prometheus unreachable" errors.

Q: Multi-cluster federation shows duplicate costs - help!

A: This is a known bug when cluster names aren't unique. The ETL deduplication logic breaks when you have "production" in multiple regions.

Workaround: give each cluster a unique name in the federation config:

kubecostProductConfigs:
  clusterName: "prod-us-east-1"  # not just "production"

Q: ARM64 nodes cause KubeCost to crash with "exec format error"

A: KubeCost images don't properly support multi-architecture deployments. Pin to AMD64 nodes:

nodeSelector:
  kubernetes.io/arch: amd64
tolerations:
  - key: "kubernetes.io/arch"
    operator: "Equal"
    value: "arm64"
    effect: "NoSchedule"

Or use OpenCost, which has proper ARM support.

Q: Memory usage keeps growing until pods get OOMKilled

A: This is normal on large clusters. KubeCost's memory usage is roughly:

  • 100 pods: 2-4GB
  • 500 pods: 8GB+
  • 1000+ pods: 16GB+ (despite docs saying 4GB)

Set proper limits and use horizontal pod autoscaling:

resources:
  requests:
    memory: 8Gi
  limits:
    memory: 16Gi

Q: Why does the UI timeout on queries longer than 30 days?

A: Because the query optimization is shit. Large time ranges with high cardinality (lots of pods/namespaces) hit the 2-minute query timeout.

Workarounds:

  • Use the API directly with smaller time windows
  • Enable query caching (enterprise feature)
  • Aggregate by namespace instead of pod level
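
The first workaround in practice looks something like this; it assumes the same default service name and Allocation API endpoint as earlier, and that the API accepts comma-separated RFC3339 window ranges (check the API docs for your version):

# Pull one week per request instead of one giant 90-day query,
# aggregated by namespace to keep cardinality down
kubectl port-forward -n kubecost svc/kubecost-cost-analyzer 9090:9090 &
for window in "2025-06-01T00:00:00Z,2025-06-08T00:00:00Z" \
              "2025-06-08T00:00:00Z,2025-06-15T00:00:00Z"; do
  curl -s "http://localhost:9090/model/allocation?window=${window}&aggregate=namespace" | jq '.data'
done
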
Q: Free tier expired, now what?

A: You hit 250 CPU cores. Either:

  1. Try the KubeCost Enterprise Cloud free trial (available through the rest of 2025)
  2. Pay for enterprise (starts around $500/month)
  3. Switch to OpenCost (free forever, more setup work)
  4. Delete some dev/test namespaces to get under the limit

Check your usage (the limit counts cluster CPU cores, not what's currently in use):

kubectl get nodes -o jsonpath='{range .items[*]}{.status.capacity.cpu}{"\n"}{end}' | awk '{sum+=$1} END {print "Total CPU cores: " sum}'

Q: Network costs seem completely wrong

A: Network cost calculation is notoriously difficult, and KubeCost's estimates are often 50% off. AWS Data Transfer pricing has something like 20 different rates depending on direction, region, and service.

Only bill reconciliation makes this accurate. Or just ignore network costs if they're under 10% of your total bill.

Q: Performance is terrible on clusters >500 nodes

A: You need the enterprise federated architecture or a real database backend; the default SQLite storage doesn't scale to that size.

Q: What's new in KubeCost 2.7 (latest version)?

A: KubeCost 2.7, released April 2025, includes:

  • Enhanced cost visibility and diagnostics
  • Improved reporting flexibility
  • GPU cost insights (finally shows which ML workloads are expensive)
  • Better multi-cloud support
  • Granular RBAC controls

It's production-ready, unlike the earlier 2.x releases that had memory leaks.

Q: I want to try OpenCost instead - how do I migrate?

A: OpenCost is the open source version. You'll lose enterprise features but gain:

  • No licensing limitations
  • CNCF backing (won't disappear)
  • Better ARM64 support
  • More transparent development

Migration is manual - there's no data export/import. You start fresh but keep your Prometheus data.
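
If you do jump ship, a basic OpenCost install is one Helm command; the repo URL and chart name below are from memory, so double-check them against the OpenCost docs before running:

# Install OpenCost into its own namespace; point it at your existing Prometheus via chart values
helm install opencost opencost \
  --repo https://opencost.github.io/opencost-helm-chart \
  -n opencost --create-namespace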
