I've been running Kubernetes in production for years, and let me tell you - if you're not running kube-state-metrics, you're basically flying blind. Your cluster is doing all sorts of shit behind the scenes, and without this tool, you'll find out about problems way too late.
kube-state-metrics is deceptively simple: it watches the Kubernetes API and exports metrics about what your objects are actually doing. Not CPU usage (that's what metrics-server is for), but the important stuff like "why is my deployment stuck at 2/3 replicas?" or "which pods have been restarting every 30 seconds for the past hour?"
Monitoring Architecture: kube-state-metrics sits between the Kubernetes API server and your monitoring system (typically Prometheus). It watches API objects, converts them to metrics, and exposes them for scraping - creating a complete picture of cluster state.
The flow is simple: kube-state-metrics connects to the API server, maintains persistent watch connections on all objects, converts object states into Prometheus metrics, and exposes them on port 8080 for your monitoring system to scrape.
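If you run a plain Prometheus (no Operator), wiring it up can be as small as one scrape job. This is a sketch, not a canonical config - the service name and namespace are assumptions, so point the target at wherever you actually deployed it:

```yaml
# Hypothetical Prometheus scrape config.
# Assumes kube-state-metrics is exposed as a Service named
# "kube-state-metrics" in the "kube-system" namespace.
scrape_configs:
  - job_name: kube-state-metrics
    static_configs:
      - targets:
          - kube-state-metrics.kube-system.svc:8080
```

With the Prometheus Operator, a ServiceMonitor does the same job; the point is that it's a single scrape target, not per-node agents.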
What Actually Happens When You Deploy This
The moment you install kube-state-metrics, it connects to your API server and starts watching everything. Every pod, deployment, service, configmap - you name it. It doesn't store anything or make API calls constantly (thank fuck), it just maintains a persistent watch connection and updates its internal state when things change.
The beauty is in what it exposes. Instead of guessing why your HPA isn't scaling, you get metrics like `kube_deployment_status_replicas_available` vs `kube_deployment_spec_replicas`. Boom - you can see exactly where the disconnect is.
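That comparison is a one-liner in PromQL (assuming a standard Prometheus scraping kube-state-metrics):

```promql
# Deployments where desired replicas != available replicas
kube_deployment_spec_replicas != kube_deployment_status_replicas_available
```

Anything this returns is a deployment that hasn't converged - stuck rollouts, crashlooping pods, or scheduling problems.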
As of September 2025, the latest stable version is v2.17.0 (released September 1, 2025). This release adds crucial new metrics:

- `kube_pod_unscheduled_time_seconds` - tracks how long pods sit unscheduled (finally!)
- `kube_deployment_deletion_timestamp` - shows when deployments are being deleted
- `kube_deployment_status_condition` now includes the `reason` label for better debugging
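Assuming the new metric reads as a duration in seconds (double-check the semantics in the v2.17.0 release notes before alerting on it), catching stuck pods could look like this - the 300-second threshold is my own pick, not a recommendation:

```promql
# Hypothetical: pods that have sat unscheduled for more than 5 minutes
kube_pod_unscheduled_time_seconds > 300
```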
It builds with Go 1.24.6 and client-go v0.33.4, supporting all recent Kubernetes versions. Match your client-go version to your cluster version to avoid API compatibility issues - I learned this the hard way with 1.28 clusters.
Real-World Problem Solving
Here's what kube-state-metrics actually helps you debug in production:
Deployment Hell: Your pods keep dying and you don't know why? `kube_pod_container_status_restarts_total` will show you which containers are crashlooping. I've spent way too many hours manually running `kubectl get pods` in a loop when this metric would have told me immediately.
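Since the restart counter only ever goes up, what you actually want is the rate of change. A typical query (standard PromQL, nothing exotic):

```promql
# Containers that restarted in the last hour, worst offenders first
topk(10, increase(kube_pod_container_status_restarts_total[1h]))
```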
Resource Starvation: Pods stuck in Pending? Check `kube_pod_status_phase` and `kube_pod_status_conditions`. Often it's resource quotas or node capacity issues that aren't obvious from the standard kubectl output.
Job Failures: CronJobs silently failing? `kube_job_status_failed` and `kube_job_status_succeeded` will show you the pattern. I've seen production CronJobs fail for weeks because nobody was monitoring the actual job status.
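The fix is a dead-simple alert rule. This is a sketch - the group name, alert name, and thresholds are mine, so tune them to your tolerance for flapping:

```yaml
# Hypothetical Prometheus alerting rule for failed Jobs.
groups:
  - name: job-failures
    rules:
      - alert: KubeJobFailed
        expr: kube_job_status_failed > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Job {{ $labels.job_name }} in {{ $labels.namespace }} has failed"
```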
Node Problems: Before a node completely dies, you'll see it in the metrics. `kube_node_status_condition` shows Ready, DiskPressure, and MemoryPressure states before your pods start getting evicted.
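The metric exposes one series per condition with a `status` label ("true", "false", or "unknown"), so catching pressure early is a simple filter:

```promql
# Nodes reporting disk or memory pressure right now
kube_node_status_condition{condition=~"DiskPressure|MemoryPressure", status="true"} == 1
```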
The Kubernetes SIG Instrumentation team maintains this thing, so it's not going anywhere. Unlike some random project that might disappear, this is officially supported Kubernetes infrastructure.