Your Kubernetes Cluster is a Black Box. This Tool Fixes That.

I've been running Kubernetes in production for years, and let me tell you - if you're not running kube-state-metrics, you're basically flying blind. Your cluster is doing all sorts of shit behind the scenes, and without this tool, you'll find out about problems way too late.

kube-state-metrics is deceptively simple: it watches the Kubernetes API and exports metrics about what your objects are actually doing. Not CPU usage (that's what metrics-server is for), but the important stuff like "why is my deployment stuck at 2/3 replicas?" or "which pods have been restarting every 30 seconds for the past hour?"

Monitoring Architecture: kube-state-metrics sits between the Kubernetes API server and your monitoring system (typically Prometheus). It watches API objects, converts them to metrics, and exposes them for scraping - creating a complete picture of cluster state.

The flow is simple: kube-state-metrics connects to the API server, maintains persistent watch connections on all objects, converts object states into Prometheus metrics, and exposes them on port 8080 for your monitoring system to scrape.

What Actually Happens When You Deploy This

The moment you install kube-state-metrics, it connects to your API server and starts watching everything. Every pod, deployment, service, configmap - you name it. It doesn't store anything or make API calls constantly (thank fuck), it just maintains a persistent watch connection and updates its internal state when things change.

The beauty is in what it exposes. Instead of guessing why your HPA isn't scaling, you get metrics like kube_deployment_status_replicas_available vs kube_deployment_spec_replicas. Boom - you can see exactly where the disconnect is.
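
Concretely, that check is a one-liner in PromQL. Here's a hedged alert-rule sketch (the 15-minute window and severity are assumptions - tune them for your rollout cadence):

groups:
  - name: deployment-replicas
    rules:
      - alert: DeploymentReplicasMismatch
        # Fires when a deployment has had fewer available replicas than it asked for
        # for 15 straight minutes. Label names match kube-state-metrics defaults.
        expr: kube_deployment_spec_replicas != kube_deployment_status_replicas_available
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.namespace }}/{{ $labels.deployment }} is stuck below its desired replica count"

Drop the same expression into the Prometheus query box during an incident and you'll see exactly which deployments are short.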

As of September 2025, the latest stable version is v2.17.0 (released September 1, 2025). This release adds crucial new metrics:

  • kube_pod_unscheduled_time_seconds - tracks how long pods sit unscheduled (finally!)
  • kube_deployment_deletion_timestamp - shows when deployments are being deleted
  • kube_deployment_status_condition now includes the reason label for better debugging

It builds with Go 1.24.6 and client-go v0.33.4, supporting all recent Kubernetes versions. Pick a kube-state-metrics release whose client-go lines up with your cluster version to avoid API compatibility issues - I learned this the hard way on 1.28 clusters.

Real-World Problem Solving

Here's what kube-state-metrics actually helps you debug in production:

Deployment Hell: Your pods keep dying and you don't know why? kube_pod_container_status_restarts_total will show you which containers are crashlooping. I've spent way too many hours manually running kubectl get pods in a loop when this metric would have told me immediately.

Resource Starvation: Pods stuck in Pending? Check kube_pod_status_phase and kube_pod_status_conditions. Often it's resource quotas or node capacity issues that aren't obvious from the standard kubectl output.

Job Failures: CronJobs silently failing? kube_job_status_failed and kube_job_status_succeeded will show you the pattern. I've seen production CronJobs fail for weeks because nobody was monitoring the actual job status.

Node Problems: Before a node completely dies, you'll see it in the metrics. kube_node_status_condition shows Ready, DiskPressure, MemoryPressure states before your pods start getting evicted.
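
If you'd rather get paged than run ad-hoc queries, here's a hedged set of starter alert rules tying those metrics together (thresholds, windows, and severities are assumptions - tune them before you trust them):

groups:
  - name: ksm-workload-alerts
    rules:
      - alert: ContainerCrashLooping
        # More than a few restarts in 15 minutes usually means a crashloop
        expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
        for: 5m
        labels:
          severity: warning
      - alert: PodStuckPending
        # kube_pod_status_phase is a 0/1 gauge per phase - filter on Pending
        expr: kube_pod_status_phase{phase="Pending"} == 1
        for: 30m
        labels:
          severity: warning
      - alert: JobFailed
        # kube_job_status_failed counts failed pods for each Job object
        expr: kube_job_status_failed > 0
        for: 5m
        labels:
          severity: warning
      - alert: NodeNotReady
        # One series per condition/status pair; Ready=true dropping to 0 means trouble
        expr: kube_node_status_condition{condition="Ready",status="true"} == 0
        for: 10m
        labels:
          severity: critical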

The Kubernetes SIG Instrumentation team maintains this thing, so it's not going anywhere. Unlike some random project that might disappear, this is officially supported Kubernetes infrastructure.

kube-state-metrics vs Everything Else You're Probably Running

| Feature | kube-state-metrics | Kubernetes Metrics Server | Prometheus Node Exporter | cAdvisor |
|---|---|---|---|---|
| What It Actually Does | Tells you why pods are fucked | Makes HPA work | Shows you when nodes are dying | Container resource usage (mostly useless) |
| Actually Works? | Yes, reliably | Works until it doesn't | Rock solid | Built-in, can't avoid it |
| Real Resource Usage | 200MB-800MB (grows with cluster) | 40MB (until it OOMs) | 20MB (never changes) | Whatever Kubelet uses |
| Setup Pain Level | Medium (RBAC will get you) | Easy (usually pre-installed) | Easy | None (already there) |
| When It Breaks | API server connectivity issues | Random OOM kills | Never breaks | Breaks with Kubelet |
| Debug Value | High - shows actual object states | Low - just resource numbers | High - real system metrics | Medium - good for container limits |
| Installation Reality | Helm chart works, manifests are painful | Usually already there | Standard DaemonSet, just works | Can't uninstall it |
| Prometheus Scraping | Just works on port 8080 | Need adapter for HPA metrics | Native on port 9100 | Native on Kubelet port |


How to Actually Deploy This Thing (And Not Fuck It Up)

The documentation makes this sound simple. It's not. Here's what actually works in production and what will bite you in the ass.

Deployment Reality: kube-state-metrics runs as a single pod (or multiple for sharding) that connects to your API server with read-only permissions. It's simple in concept but the RBAC and resource sizing will fuck you up.

Helm Chart - Just Use This

Skip the manual manifests. The Prometheus Community Helm chart works and saves you hours of RBAC debugging:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install kube-state-metrics prometheus-community/kube-state-metrics

But here's the shit they don't tell you:

Resource Limits Are Wrong: The default 250MB memory limit is bullshit for any real cluster. Start with 500MB minimum, 1GB for large clusters. I learned this when our deployment kept getting OOM killed because we have 2000+ pods.

RBAC Will Fuck You: The chart creates the ClusterRole and binding for you, but if you layer on Pod Security Standards or your own admission policies, double-check that the pod is actually allowed to run and that the service account really got read access to the resources you care about. I spent 4 hours debugging why metrics weren't showing up - turns out RBAC was blocking API server access.

Port 8080 Conflicts: Half the shit you're running probably uses port 8080. Change it in your values file:

service:
  port: 8081
  targetPort: 8081
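
Putting those fixes together, a starting values.yaml for a mid-sized cluster looks something like this (field names follow the prometheus-community chart as I remember them - verify against the chart's own values.yaml before shipping it):

# values.yaml - sketch for prometheus-community/kube-state-metrics
service:
  port: 8081            # move off 8080 if something else already owns it
resources:
  requests:
    cpu: 100m
    memory: 500Mi
  limits:
    memory: 1Gi         # the 250MB default gets OOM killed once you're past ~1000 pods

Pass it to the install command above with -f values.yaml and bump the memory limit as your pod count grows.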

Manual Installation (When Helm Isn't An Option)

Sometimes you can't use Helm (corporate policies, whatever). The official manifests work but you'll spend time fixing them.

Copy the manifests and fix these issues:

  • Memory limits are too low (bump to 500MB minimum)
  • Namespace restrictions if you can't use cluster-wide access
  • Security context if you have PSP/PSS enabled

The service account needs read access to basically everything. If you're paranoid about security, you can scope it to specific namespaces, but you'll lose cluster-wide visibility.
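
If you do go namespace-scoped, the shape is a Role plus RoleBinding per namespace instead of the ClusterRole, combined with the --namespaces flag. A minimal sketch (trim the resource list to what you actually monitor; names and namespaces here are examples):

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: kube-state-metrics
  namespace: production
rules:
  # kube-state-metrics only ever needs list and watch - never write verbs
  - apiGroups: [""]
    resources: ["pods", "services", "configmaps"]
    verbs: ["list", "watch"]
  - apiGroups: ["apps"]
    resources: ["deployments", "replicasets", "statefulsets"]
    verbs: ["list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: kube-state-metrics
  namespace: production
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: kube-state-metrics
subjects:
  - kind: ServiceAccount
    name: kube-state-metrics
    namespace: kube-system

Repeat per namespace and start kube-state-metrics with --namespaces=production,staging so it doesn't try (and fail) to list everything cluster-wide.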

Cloud Platform Reality Check

GKE: Google has built-in kube-state-metrics but it's limited and sends data to Cloud Monitoring. If you want Prometheus, install your own.

EKS: AWS doesn't include this by default. Use the Helm chart or their managed Prometheus service add-on.

AKS: Microsoft's Container Insights includes some kube-state-metrics data, but again, limited compared to running your own.

Large Cluster Scaling (When Everything Goes to Shit)

If you have 1000+ nodes or 10,000+ pods, you need horizontal sharding. This splits the monitoring load across multiple instances.

The autosharding examples use StatefulSets where each pod monitors a subset of objects. It works, but debugging which instance is monitoring what object is a pain.
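
The important part of the autosharding setup is the StatefulSet handing each pod its own identity so it can work out which slice of objects it owns. Roughly (a trimmed fragment, not a full manifest - flag names are from the upstream autosharding example, so double-check them against your version):

# Container spec fragment from the autosharding StatefulSet
containers:
  - name: kube-state-metrics
    image: registry.k8s.io/kube-state-metrics/kube-state-metrics:v2.17.0
    args:
      # Each replica derives its shard from its ordinal (kube-state-metrics-0, -1, ...)
      - --pod=$(POD_NAME)
      - --pod-namespace=$(POD_NAMESPACE)
    env:
      - name: POD_NAME
        valueFrom:
          fieldRef:
            fieldPath: metadata.name
      - name: POD_NAMESPACE
        valueFrom:
          fieldRef:
            fieldPath: metadata.namespace

If you'd rather pin shards manually, there are also static --shard and --total-shards flags, but then scaling the StatefulSet means editing args.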

Pro tip: Monitor the health metrics kube_state_metrics_list_total and kube_state_metrics_watch_total. If these stop incrementing, your API server connection is fucked and you'll lose visibility.

What a Real Deployment Looks Like

Visualization Layer: Most teams use Grafana to visualize the metrics that kube-state-metrics exposes. The official kube-state-metrics v2 dashboard gives you instant visibility into cluster health.

Here's what your actual resource configuration should look like for a medium cluster (100+ nodes):

resources:
  requests:
    cpu: 100m
    memory: 500Mi
  limits:
    cpu: 200m
    memory: 1Gi

And the telemetry endpoint configuration for proper monitoring:

telemetryPort: 8081
telemetryHost: "0.0.0.0"

Monitor these key health metrics to catch issues early:

  • kube_state_metrics_list_total - should increment regularly
  • kube_state_metrics_watch_total - tracks API watch connections
  • process_resident_memory_bytes - memory usage (should be stable)
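
A couple of hedged alerts on top of those (the job label assumes your scrape config names the job kube-state-metrics - match it to yours):

groups:
  - name: ksm-self-monitoring
    rules:
      - alert: KubeStateMetricsDown
        # No healthy scrape target at all - you're blind to cluster state
        expr: absent(up{job="kube-state-metrics"} == 1)
        for: 5m
        labels:
          severity: critical
      - alert: KubeStateMetricsMemoryHigh
        # Steady growth usually means the watch cache is ballooning - revisit limits or filtering
        expr: process_resident_memory_bytes{job="kube-state-metrics"} > 1.5e9
        for: 15m
        labels:
          severity: warning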

Questions You'll Actually Ask While Debugging This Shit

Q: Why do I need both of these damn things?

Look, I get it. You already have Metrics Server for autoscaling and now someone wants you to install kube-state-metrics too. Here's the deal:

  • Metrics Server: Shows resource usage (CPU/memory) for HPA/VPA
  • kube-state-metrics: Shows object states (why pods are failing, replica counts, etc.)

You need both because Metrics Server won't tell you why your deployment is stuck at 2/3 replicas. That's what kube-state-metrics does. Trust me, install both and stop asking questions.

Q: How much memory will this thing actually eat?

Forget the "250MB" bullshit in the docs. In production:

  • Small cluster (10-50 nodes): 300-500MB
  • Medium cluster (50-200 nodes): 500-800MB
  • Large cluster (200+ nodes): 800MB-1.5GB

I started with the recommended limits and watched it get OOM killed constantly until I bumped memory to 1GB. Performance docs are optimistic at best.

Q: Can I stop it from monitoring everything?

Yes, thank God. Use these flags to avoid metric explosion:

--resources=pods,deployments,services
--namespaces=production,staging
--metric-allowlist=kube_pod_status.*,kube_deployment_.*

The metric filtering docs are actually useful for once.

Q: How the hell do I get Prometheus to scrape this?

If you're using kube-prometheus-stack, it's automatic. Otherwise, add this to your scrape config:

- job_name: kube-state-metrics
  static_configs:
  - targets: ['kube-state-metrics.kube-system.svc.cluster.local:8080']

Port 8080 by default, port 8081 for telemetry metrics. Don't forget the telemetry - that's how you know if the thing is broken.
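
If you're wiring it by hand, scraping both endpoints looks roughly like this (service name, namespace, and ports are assumptions - match them to your install, and make sure the Service actually exposes the telemetry port):

- job_name: kube-state-metrics
  static_configs:
    - targets: ['kube-state-metrics.kube-system.svc.cluster.local:8080']
- job_name: kube-state-metrics-telemetry
  # The telemetry port serves the tool's own health metrics (kube_state_metrics_*, process_*)
  static_configs:
    - targets: ['kube-state-metrics.kube-system.svc.cluster.local:8081']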

Q: Is this secure enough for production?

It's read-only API access, runs as non-root, and the Kubernetes community maintains it. Security-wise it's fine.

The real issue is RBAC. You'll need cluster-wide read permissions or very specific role bindings. If your security team freaks out about ClusterRole, you can scope it per-namespace but you lose cluster-wide visibility.

Q: What happens when this breaks?

Your Prometheus scrapes fail and you lose real-time cluster state visibility. Historical data is fine, but you're flying blind on current issues.

The good news: it's stateless. Just restart the pod and it reconnects to the API server immediately. I've had zero data loss from restarts in 2+ years of running this.

Q: Why are my metrics missing?

90% of the time it's one of these:

  1. RBAC permissions: Check the ClusterRole allows reading the resources you want
  2. API server connectivity: Look at pod logs for connection errors
  3. Wrong Prometheus config: Verify the scrape target and port
  4. Resource filtering: You probably filtered out the metrics you want

Run kubectl port-forward and hit the /metrics endpoint directly. If you see metrics there, it's a Prometheus config problem.

Q: Will this scale to my massive cluster?

If you have 1000+ nodes or 10,000+ pods, you need horizontal sharding. It works but adds complexity.

For most clusters, a single instance with proper resource limits handles 100-500 nodes fine. I've run it on 300-node clusters without sharding.

Q: Can I monitor my custom CRDs?

Yeah, through Custom Resource State configuration. You define a YAML config mapping your CRD fields to metrics.

It's useful for monitoring operators like cert-manager or database operators. But expect to spend time figuring out the YAML syntax - the examples are minimal.
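
For a rough idea of the shape, a config for a hypothetical Database CRD might look like this (field names are from memory and the schema has changed between releases - treat it as a sketch and verify against the Custom Resource State docs):

kind: CustomResourceStateMetrics
spec:
  resources:
    - groupVersionKind:
        group: example.com          # hypothetical CRD group
        version: v1
        kind: Database
      metrics:
        - name: database_ready_replicas
          help: "Ready replicas reported by the Database operator"
          each:
            type: Gauge
            gauge:
              # Path into the custom resource's status to a numeric field
              path: [status, readyReplicas]

You point kube-state-metrics at it with the --custom-resource-state-config-file flag and extend the RBAC so it can read your CRD.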

Q: What's new in v2.17.0 that I actually care about?

The unscheduled pod tracking is huge - kube_pod_unscheduled_time_seconds finally shows you when pods are stuck in Pending for too long. I've waited years for this metric.

The deletion timestamp metrics (kube_deployment_deletion_timestamp, etc.) help track cleanup operations. And the enhanced reason labels on deployment conditions make debugging failed rollouts way easier.

Performance note: v2.17.0 also includes better memory management with automemlimit support, which helps prevent those random OOM kills in large clusters.

Visualization: Most people use Grafana dashboards to visualize this data. There are dozens of pre-built dashboards, though most are overcomplicated - start with something simple and build from there. Popular options include the Kubernetes cluster monitoring dashboard for cluster-level insights and specialized workload dashboards that focus on pod and deployment health.
