Your Kubernetes Cluster is a Black Box. This Tool Fixes That.

I've been running Kubernetes in production for years, and let me tell you - if you're not running kube-state-metrics, you're basically flying blind. Your cluster is doing all sorts of shit behind the scenes, and without this tool, you'll find out about problems way too late.

kube-state-metrics is deceptively simple: it watches the Kubernetes API and exports metrics about what your objects are actually doing. Not CPU usage (that's what metrics-server is for), but the important stuff like "why is my deployment stuck at 2/3 replicas?" or "which pods have been restarting every 30 seconds for the past hour?"

Monitoring Architecture: kube-state-metrics sits between the Kubernetes API server and your monitoring system (typically Prometheus). It watches API objects, converts them to metrics, and exposes them for scraping - creating a complete picture of cluster state.

The flow is simple: kube-state-metrics connects to the API server, maintains persistent watch connections on all objects, converts object states into Prometheus metrics, and exposes them on port 8080 for your monitoring system to scrape.

What Actually Happens When You Deploy This

The moment you install kube-state-metrics, it connects to your API server and starts watching everything. Every pod, deployment, service, configmap - you name it. It doesn't store anything or make API calls constantly (thank fuck), it just maintains a persistent watch connection and updates its internal state when things change.

The beauty is in what it exposes. Instead of guessing why your HPA isn't scaling, you get metrics like kube_deployment_status_replicas_available vs kube_deployment_spec_replicas. Boom - you can see exactly where the disconnect is.
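
Concretely, that check is a one-liner in PromQL. Here's a hedged alert-rule sketch (the 15-minute window and severity are assumptions - tune them for your rollout cadence):

groups:
  - name: deployment-replicas
    rules:
      - alert: DeploymentReplicasMismatch
        # Fires when a deployment has had fewer available replicas than it asked for
        # for 15 straight minutes. Label names match kube-state-metrics defaults.
        expr: kube_deployment_spec_replicas != kube_deployment_status_replicas_available
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.namespace }}/{{ $labels.deployment }} is stuck below its desired replica count"

Drop the same expression into the Prometheus query box during an incident and you'll see exactly which deployments are short.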

As of September 2025, the latest stable version is v2.17.0 (released September 1, 2025). This release adds crucial new metrics:

  • kube_pod_unscheduled_time_seconds - tracks how long pods sit unscheduled (finally!)
  • kube_deployment_deletion_timestamp - shows when deployments are being deleted
  • kube_deployment_status_condition now includes the reason label for better debugging

It builds with Go 1.24.6 and client-go v0.33.4, supporting all recent Kubernetes versions. Pick a kube-state-metrics release whose client-go lines up with your cluster version to avoid API compatibility issues - I learned this the hard way on 1.28 clusters.

Real-World Problem Solving

Here's what kube-state-metrics actually helps you debug in production:

Deployment Hell: Your pods keep dying and you don't know why? kube_pod_container_status_restarts_total will show you which containers are crashlooping. I've spent way too many hours manually running kubectl get pods in a loop when this metric would have told me immediately.

Resource Starvation: Pods stuck in Pending? Check kube_pod_status_phase and kube_pod_status_conditions. Often it's resource quotas or node capacity issues that aren't obvious from the standard kubectl output.

Job Failures: CronJobs silently failing? kube_job_status_failed and kube_job_status_succeeded will show you the pattern. I've seen production CronJobs fail for weeks because nobody was monitoring the actual job status.

Node Problems: Before a node completely dies, you'll see it in the metrics. kube_node_status_condition shows Ready, DiskPressure, MemoryPressure states before your pods start getting evicted.
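
If you'd rather get paged than run ad-hoc queries, here's a hedged set of starter alert rules tying those metrics together (thresholds, windows, and severities are assumptions - tune them before you trust them):

groups:
  - name: ksm-workload-alerts
    rules:
      - alert: ContainerCrashLooping
        # More than a few restarts in 15 minutes usually means a crashloop
        expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
        for: 5m
        labels:
          severity: warning
      - alert: PodStuckPending
        # kube_pod_status_phase is a 0/1 gauge per phase - filter on Pending
        expr: kube_pod_status_phase{phase="Pending"} == 1
        for: 30m
        labels:
          severity: warning
      - alert: JobFailed
        # kube_job_status_failed counts failed pods for each Job object
        expr: kube_job_status_failed > 0
        for: 5m
        labels:
          severity: warning
      - alert: NodeNotReady
        # One series per condition/status pair; Ready=true dropping to 0 means trouble
        expr: kube_node_status_condition{condition="Ready",status="true"} == 0
        for: 10m
        labels:
          severity: critical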

The Kubernetes SIG Instrumentation team maintains this thing, so it's not going anywhere. Unlike some random project that might disappear, this is officially supported Kubernetes infrastructure.

kube-state-metrics vs Everything Else You're Probably Running

| Feature | kube-state-metrics | Kubernetes Metrics Server | Prometheus Node Exporter | cAdvisor |
|---|---|---|---|---|
| What It Actually Does | Tells you why pods are fucked | Makes HPA work | Shows you when nodes are dying | Container resource usage (mostly useless) |
| Actually Works? | Yes, reliably | Works until it doesn't | Rock solid | Built-in, can't avoid it |
| Real Resource Usage | 200MB-800MB (grows with cluster) | 40MB (until it OOMs) | 20MB (never changes) | Whatever Kubelet uses |
| Setup Pain Level | Medium (RBAC will get you) | Easy (usually pre-installed) | Easy | None (already there) |
| When It Breaks | API server connectivity issues | Random OOM kills | Never breaks | Breaks with Kubelet |
| Debug Value | High - shows actual object states | Low - just resource numbers | High - real system metrics | Medium - good for container limits |
| Installation Reality | Helm chart works, manifests are painful | Usually already there | Standard DaemonSet, just works | Can't uninstall it |
| Prometheus Scraping | Just works on port 8080 | Need adapter for HPA metrics | Native on port 9100 | Native on Kubelet port |


How to Actually Deploy This Thing (And Not Fuck It Up)

The documentation makes this sound simple. It's not. Here's what actually works in production and what will bite you in the ass.

Deployment Reality: kube-state-metrics runs as a single pod (or multiple for sharding) that connects to your API server with read-only permissions. It's simple in concept but the RBAC and resource sizing will fuck you up.

Helm Chart - Just Use This

Skip the manual manifests. The Prometheus Community Helm chart works and saves you hours of RBAC debugging:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install kube-state-metrics prometheus-community/kube-state-metrics

But here's the shit they don't tell you:

Resource Limits Are Wrong: The default 250MB memory limit is bullshit for any real cluster. Start with 500MB minimum, 1GB for large clusters. I learned this when our deployment kept getting OOM killed because we have 2000+ pods.

RBAC Will Fuck You: The chart creates the ClusterRole and binding for you, but if you layer on Pod Security Standards or your own admission policies, double-check that the pod is actually allowed to run and that the service account really got read access to the resources you care about. I spent 4 hours debugging why metrics weren't showing up - turns out RBAC was blocking API server access.

Port 8080 Conflicts: Half the shit you're running probably uses port 8080. Change it in your values file:

service:
  port: 8081
  targetPort: 8081
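
Putting those fixes together, a starting values.yaml for a mid-sized cluster looks something like this (field names follow the prometheus-community chart as I remember them - verify against the chart's own values.yaml before shipping it):

# values.yaml - sketch for prometheus-community/kube-state-metrics
service:
  port: 8081            # move off 8080 if something else already owns it
resources:
  requests:
    cpu: 100m
    memory: 500Mi
  limits:
    memory: 1Gi         # the 250MB default gets OOM killed once you're past ~1000 pods

Pass it to the install command above with -f values.yaml and bump the memory limit as your pod count grows.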

Manual Installation (When Helm Isn't An Option)

Sometimes you can't use Helm (corporate policies, whatever). The official manifests work but you'll spend time fixing them.

Copy the manifests and fix these issues:

  • Memory limits are too low (bump to 500MB minimum)
  • Namespace restrictions if you can't use cluster-wide access
  • Security context if you have PSP/PSS enabled

The service account needs read access to basically everything. If you're paranoid about security, you can scope it to specific namespaces, but you'll lose cluster-wide visibility.
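
If you do go namespace-scoped, the shape is a Role plus RoleBinding per namespace instead of the ClusterRole, combined with the --namespaces flag. A minimal sketch (trim the resource list to what you actually monitor; names and namespaces here are examples):

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: kube-state-metrics
  namespace: production
rules:
  # kube-state-metrics only ever needs list and watch - never write verbs
  - apiGroups: [""]
    resources: ["pods", "services", "configmaps"]
    verbs: ["list", "watch"]
  - apiGroups: ["apps"]
    resources: ["deployments", "replicasets", "statefulsets"]
    verbs: ["list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: kube-state-metrics
  namespace: production
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: kube-state-metrics
subjects:
  - kind: ServiceAccount
    name: kube-state-metrics
    namespace: kube-system

Repeat per namespace and start kube-state-metrics with --namespaces=production,staging so it doesn't try (and fail) to list everything cluster-wide.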

Cloud Platform Reality Check

GKE: Google has built-in kube-state-metrics but it's limited and sends data to Cloud Monitoring. If you want Prometheus, install your own.

EKS: AWS doesn't include this by default. Use the Helm chart or their managed Prometheus service add-on.

AKS: Microsoft's Container Insights includes some kube-state-metrics data, but again, limited compared to running your own.

Large Cluster Scaling (When Everything Goes to Shit)

If you have 1000+ nodes or 10,000+ pods, you need horizontal sharding. This splits the monitoring load across multiple instances.

The autosharding examples use StatefulSets where each pod monitors a subset of objects. It works, but debugging which instance is monitoring what object is a pain.
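
The important part of the autosharding setup is the StatefulSet handing each pod its own identity so it can work out which slice of objects it owns. Roughly (a trimmed fragment, not a full manifest - flag names are from the upstream autosharding example, so double-check them against your version):

# Container spec fragment from the autosharding StatefulSet
containers:
  - name: kube-state-metrics
    image: registry.k8s.io/kube-state-metrics/kube-state-metrics:v2.17.0
    args:
      # Each replica derives its shard from its ordinal (kube-state-metrics-0, -1, ...)
      - --pod=$(POD_NAME)
      - --pod-namespace=$(POD_NAMESPACE)
    env:
      - name: POD_NAME
        valueFrom:
          fieldRef:
            fieldPath: metadata.name
      - name: POD_NAMESPACE
        valueFrom:
          fieldRef:
            fieldPath: metadata.namespace

If you'd rather pin shards manually, there are also static --shard and --total-shards flags, but then scaling the StatefulSet means editing args.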

Pro tip: Monitor the health metrics kube_state_metrics_list_total and kube_state_metrics_watch_total. If these stop incrementing, your API server connection is fucked and you'll lose visibility.

What a Real Deployment Looks Like

Visualization Layer: Most teams use Grafana to visualize the metrics that kube-state-metrics exposes. The official kube-state-metrics v2 dashboard gives you instant visibility into cluster health.

Here's what your actual resource configuration should look like for a medium cluster (100+ nodes):

resources:
  requests:
    cpu: 100m
    memory: 500Mi
  limits:
    cpu: 200m
    memory: 1Gi

And the telemetry endpoint configuration for proper monitoring:

telemetryPort: 8081
telemetryHost: "0.0.0.0"

Monitor these key health metrics to catch issues early:

  • kube_state_metrics_list_total - should increment regularly
  • kube_state_metrics_watch_total - tracks API watch connections
  • process_resident_memory_bytes - memory usage (should be stable)
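
A couple of hedged alerts on top of those (the job label assumes your scrape config names the job kube-state-metrics - match it to yours):

groups:
  - name: ksm-self-monitoring
    rules:
      - alert: KubeStateMetricsDown
        # No healthy scrape target at all - you're blind to cluster state
        expr: absent(up{job="kube-state-metrics"} == 1)
        for: 5m
        labels:
          severity: critical
      - alert: KubeStateMetricsMemoryHigh
        # Steady growth usually means the watch cache is ballooning - revisit limits or filtering
        expr: process_resident_memory_bytes{job="kube-state-metrics"} > 1.5e9
        for: 15m
        labels:
          severity: warning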

Questions You'll Actually Ask While Debugging This Shit

Q: Why do I need both of these damn things?

Look, I get it. You already have Metrics Server for autoscaling and now someone wants you to install kube-state-metrics too. Here's the deal:

  • Metrics Server: Shows resource usage (CPU/memory) for HPA/VPA
  • kube-state-metrics: Shows object states (why pods are failing, replica counts, etc.)

You need both because Metrics Server won't tell you why your deployment is stuck at 2/3 replicas. That's what kube-state-metrics does. Trust me, install both and stop asking questions.

Q: How much memory will this thing actually eat?

Forget the "250MB" bullshit in the docs. In production:

  • Small cluster (10-50 nodes): 300-500MB
  • Medium cluster (50-200 nodes): 500-800MB
  • Large cluster (200+ nodes): 800MB-1.5GB

I started with the recommended limits and watched it get OOM killed constantly until I bumped memory to 1GB. Performance docs are optimistic at best.

Q: Can I stop it from monitoring everything?

Yes, thank God. Use these flags to avoid metric explosion:

--resources=pods,deployments,services
--namespaces=production,staging
--metric-allowlist=kube_pod_status.*,kube_deployment_.*

The metric filtering docs are actually useful for once.

Q: How the hell do I get Prometheus to scrape this?

If you're using kube-prometheus-stack, it's automatic. Otherwise, add this to your scrape config:

- job_name: kube-state-metrics
  static_configs:
  - targets: ['kube-state-metrics.kube-system.svc.cluster.local:8080']

Port 8080 by default, port 8081 for telemetry metrics. Don't forget the telemetry - that's how you know if the thing is broken.
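
If you're wiring it by hand, scraping both endpoints looks roughly like this (service name, namespace, and ports are assumptions - match them to your install, and make sure the Service actually exposes the telemetry port):

- job_name: kube-state-metrics
  static_configs:
    - targets: ['kube-state-metrics.kube-system.svc.cluster.local:8080']
- job_name: kube-state-metrics-telemetry
  # The telemetry port serves the tool's own health metrics (kube_state_metrics_*, process_*)
  static_configs:
    - targets: ['kube-state-metrics.kube-system.svc.cluster.local:8081']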

Q: Is this secure enough for production?

It's read-only API access, runs as non-root, and the Kubernetes community maintains it. Security-wise it's fine.

The real issue is RBAC. You'll need cluster-wide read permissions or very specific role bindings. If your security team freaks out about ClusterRole, you can scope it per-namespace but you lose cluster-wide visibility.

Q: What happens when this breaks?

Your Prometheus scrapes fail and you lose real-time cluster state visibility. Historical data is fine, but you're flying blind on current issues.

The good news: it's stateless. Just restart the pod and it reconnects to the API server immediately. I've had zero data loss from restarts in 2+ years of running this.

Q: Why are my metrics missing?

90% of the time it's one of these:

  1. RBAC permissions: Check the ClusterRole allows reading the resources you want
  2. API server connectivity: Look at pod logs for connection errors
  3. Wrong Prometheus config: Verify the scrape target and port
  4. Resource filtering: You probably filtered out the metrics you want

Run kubectl port-forward and hit the /metrics endpoint directly. If you see metrics there, it's a Prometheus config problem.

Q: Will this scale to my massive cluster?

If you have 1000+ nodes or 10,000+ pods, you need horizontal sharding. It works but adds complexity.

For most clusters, a single instance with proper resource limits handles 100-500 nodes fine. I've run it on 300-node clusters without sharding.

Q: Can I monitor my custom CRDs?

Yeah, through Custom Resource State configuration. You define a YAML config mapping your CRD fields to metrics.

It's useful for monitoring operators like cert-manager or database operators. But expect to spend time figuring out the YAML syntax - the examples are minimal.
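
For a rough idea of the shape, a config for a hypothetical Database CRD might look like this (field names are from memory and the schema has changed between releases - treat it as a sketch and verify against the Custom Resource State docs):

kind: CustomResourceStateMetrics
spec:
  resources:
    - groupVersionKind:
        group: example.com          # hypothetical CRD group
        version: v1
        kind: Database
      metrics:
        - name: database_ready_replicas
          help: "Ready replicas reported by the Database operator"
          each:
            type: Gauge
            gauge:
              # Path into the custom resource's status to a numeric field
              path: [status, readyReplicas]

You point kube-state-metrics at it with the --custom-resource-state-config-file flag and extend the RBAC so it can read your CRD.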

Q: What's new in v2.17.0 that I actually care about?

The unscheduled pod tracking is huge - kube_pod_unscheduled_time_seconds finally shows you when pods are stuck in Pending for too long. I've waited years for this metric.

The deletion timestamp metrics (kube_deployment_deletion_timestamp, etc.) help track cleanup operations. And the enhanced reason labels on deployment conditions make debugging failed rollouts way easier.

Performance note: v2.17.0 also includes better memory management with automemlimit support, which helps prevent those random OOM kills in large clusters.

Visualization: Most people use Grafana dashboards to visualize this data. There are dozens of pre-built dashboards, though most are overcomplicated - start with something simple and build from there. Popular options include the Kubernetes cluster monitoring dashboard for cluster-level insights and specialized workload dashboards that focus on pod and deployment health.
