cert-manager - Stops You From Getting Paged at 3AM Because Certs Expired Again

Why You Actually Need cert-manager (And Why Manual Certs Suck)

I discovered cert-manager after our SSL certs expired on Black Friday weekend and killed our entire e-commerce site for 4 hours. Nothing like watching revenue tank while you fumble with Let's Encrypt CLI tools at 3am, trying to explain to the CEO why "certificate expired" means customers can't buy anything.

cert-manager saves your ass when certificates expire and you're the one getting paged. Jetstack started it in 2017 as the successor to their kube-lego project, after getting tired of the same certificate management nightmare we all face. The CNCF graduated it on November 12, 2024 because literally everyone was using it anyway - might as well make it official.

The Three Things That Actually Matter

Certificate Resources: You define what domains need certs using Kubernetes custom resources. cert-manager watches these and handles the renewal dance automatically through the ACME protocol. Set it once, forget it exists until something breaks (which it rarely does).
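A minimal sketch of what such a Certificate can look like, assuming a ClusterIssuer named letsencrypt-prod already exists (one is defined in the Multi-Issuer section). The resource name, namespace, and example.com domains are placeholders, not anything from a real setup:

```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: example-com-tls        # placeholder name
  namespace: default           # placeholder namespace
spec:
  secretName: example-com-tls  # Secret where the private key and cert get written
  dnsNames:
  - example.com
  - www.example.com
  issuerRef:
    name: letsencrypt-prod     # assumes this ClusterIssuer exists
    kind: ClusterIssuer
```

Your workload or Ingress then mounts the Secret named in secretName; cert-manager keeps rewriting it as renewals happen.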

Issuer/ClusterIssuer Resources: Point to your certificate authority. Let's Encrypt for public stuff, HashiCorp Vault PKI for internal certificates, or whatever enterprise CA your security team is obsessing over this quarter. ClusterIssuer works cluster-wide; Issuer is namespace-scoped.

CertificateRequest Resources: Each one wraps a PKCS#10 certificate signing request for an X.509 certificate. You usually don't touch these - cert-manager creates them automatically when certificates need issuance or renewal. Only mess with these if you know what you're doing or are debugging failed certificate issuance.

The Numbers Don't Lie (Because Everyone's Been Burned)

The 500+ million monthly downloads sound like marketing bullshit but they're real - it's because we've all been there. Your site goes down because some certificate you forgot about expired. Kubernetes adoption surveys show 86% of production clusters run cert-manager because manual cert renewal is like playing Russian roulette with production.

Real talk: I've personally seen Let's Encrypt rate limits fuck over teams who waited until the last minute to renew 20+ domain certs. That limit of 50 certificates per registered domain per week hits hard when you're scrambling.

What Works (And What Doesn't)

HTTP-01 challenges work great until your ingress controller decides to shit the bed. Let's Encrypt tries to validate ownership by hitting yourdomain.com/.well-known/acme-challenge/some-token and if that returns 404 or times out, you're fucked. I've spent hours debugging ingress-nginx configuration just to get ACME working again.

DNS-01 challenges are your friend for internal services and wildcard certs, but DNS provider APIs are consistently terrible. cert-manager creates TXT records like _acme-challenge.example.com to prove domain ownership. Works with Route53, Cloudflare, Google Cloud DNS, and 50+ other providers. DNS propagation can take forever though - GoDaddy is especially slow as hell.

The latest version 1.18.2 from July 2, 2025 fixes private key rotation edge cases that caused certificates to randomly fail validation. Nothing revolutionary, just fewer "why the fuck did this break" moments. Always check the upgrade guide - cert-manager migrations have burned me before when webhook configurations changed between versions.

cert-manager vs The Alternatives (Reality Check)

What You Actually Get	cert-manager	Traefik Built-in	Manual ACME Scripts	Cloud Provider Certs	Vault PKI
Setup Complexity	Install once, works everywhere	Easy if you use Traefik	Pain in the ass bash scripting	AWS Console clicking	Vault expertise required
When It Breaks	Good error messages, docs actually help	Traefik community forums if you're lucky	You're fucked and on your own	AWS support plans cost extra	Read HashiCorp docs for hours
Certificate Authorities	Let's Encrypt, Vault, Venafi, custom CAs	Mainly Let's Encrypt	Let's Encrypt focused	Whatever AWS/Azure supports	Internal PKI only
Automation Reality	Set it and forget it	Works until it doesn't	Cron jobs and prayer	Mostly works	Someone has to click buttons
Wildcard Certificates	DNS-01 challenge handles it	Supported with DNS API	Requires DNS-01 scripting	Provider-dependent	Works fine
Multi-cluster	Install per cluster (ClusterIssuer is cluster-wide, not cross-cluster)	Configure per cluster	Custom tooling required	Per-cloud setup	Vault per cluster
Learning Curve	Official tutorials work	Low if you know Traefik	High (shell scripting + ACME protocol)	Easy for cloud natives	High (Vault is complicated)
Vendor Lock-in	None (it's CNCF)	Traefik-specific	None	Locked to your cloud	HashiCorp ecosystem
Cost	Free	Free	Free + your time debugging	$$ per certificate	Vault Enterprise $$$$
Production Failures	Rare if configured correctly	Sometimes Traefik updates break things	Your bash script fails at 3 AM	Usually stable	Vault downtime = no certs

How to Actually Deploy cert-manager (Without Breaking Everything)

Installation Reality Check

Skip the kubectl apply bullshit with raw YAML - that path leads to webhook validation failures and hours of debugging. The Helm chart actually works:

helm repo add jetstack https://charts.jetstack.io
helm repo update
helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --create-namespace \
  --set crds.enabled=true

That --set crds.enabled=true flag is critical - without it you have to manually install custom resource definitions and trust me, you'll fuck up the order. On charts older than v1.15 the same switch is called installCRDs, which is now deprecated. I spent 6 hours debugging webhook failures because I forgot this flag. The installation docs bury this detail for some reason.

You get three pods that actually do stuff: the main controller handles certificates, a validating webhook validates your YAML, and a CA injector patches webhook configurations with CA bundles. Resource requirements are reasonable - about 50MB RAM and 10m CPU each in normal operation. Scales well unless you're managing thousands of certificates across multiple clusters.

Multi-Issuer Setup (The Right Way)

Production environments need multiple certificate sources. Here's what actually works:

Let's Encrypt for public services - free, automated, works everywhere:

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: ops@yourcompany.com  # ACME account contact (Let's Encrypt stopped sending expiry emails in June 2025)
    privateKeySecretRef:
      name: letsencrypt-prod
    solvers:
    - http01:
        ingress:
          class: nginx  # or traefik, whatever you use

Internal CA for internal services - because security teams love internal PKI:

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: ca-issuer
spec:
  ca:
    secretName: ca-key-pair

Staging environment - use Let's Encrypt staging for testing to avoid rate limits:

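apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-staging
spec:
  acme:
    server: https://acme-staging-v02.api.letsencrypt.org/directory
    email: ops@yourcompany.com
    privateKeySecretRef:
      name: letsencrypt-staging  # separate account key from prod
    solvers:
    - http01:
        ingress:
          class: nginx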

DNS-01 Challenges (When HTTP-01 Won't Work)

Internal services and wildcard certificates need DNS-01. Route53 integration works best:

spec:
  acme:
    solvers:
    - dns01:
        route53:
          region: us-east-1
          accessKeyID: AKIAIOSFODNN7EXAMPLE  # Use IRSA instead
          secretAccessKeySecretRef:
            name: route53-credentials
            key: secret-access-key

Pro tip: Use IAM Roles for Service Accounts (IRSA) instead of hardcoded AWS access keys. Your security team will thank you, and AWS security best practices recommend avoiding long-term credentials.
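
Roughly what the IRSA version looks like, as a sketch rather than a drop-in config. It assumes an EKS cluster with an OIDC provider and an IAM role that can edit the hosted zone; the role ARN below is a placeholder. The first document is a Helm values fragment, the second is the solver part of the ClusterIssuer:

```yaml
# Helm values fragment: annotate the controller's ServiceAccount with the IAM role
serviceAccount:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/cert-manager-route53  # placeholder ARN
---
# ClusterIssuer solver fragment: no access keys, credentials come from the pod's role
spec:
  acme:
    solvers:
    - dns01:
        route53:
          region: us-east-1
```

This relies on ambient credentials, which cert-manager allows for ClusterIssuers by default but not for namespaced Issuers, so stick to a ClusterIssuer here unless you've changed that setting.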

Service Mesh Integration (If You Must)

istio-csr replaces Istio's built-in certificate management with cert-manager. Useful for consistent certificate policies across service mesh and ingress:

helm install -n istio-system cert-manager-istio-csr jetstack/cert-manager-istio-csr

It handles mTLS certificates for service-to-service communication. Works fine but adds complexity - only use if you need unified certificate management.

What Burns Down Production (Learn From My Pain)

Undersized resource limits will crash cert-manager under load. The Helm chart sets no requests or limits by default, so whatever your pods end up with usually comes from a namespace LimitRange or a copy-pasted values file, and those numbers tend to be too small. Our cert-manager died during a mass certificate renewal, and it took us 3 hours to figure out why:

resources:
  limits:
    cpu: 200m      # 100m got throttled during mass renewals
    memory: 256Mi   # 128Mi got OOM-killed under pressure
  requests:
    cpu: 50m       # Be realistic about CPU needs
    memory: 64Mi

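Webhook timeouts kill everything - when the API server can't get an answer from the cert-manager webhook in time, every create or update of a Certificate or Issuer fails with a timeout error. A 10-second timeout (the Kubernetes default for admission webhooks) is too short when cluster DNS or networking is being slow. This is a separate problem from DNS-01 propagation delays, which are covered below:

webhook:
  timeoutSeconds: 30  # The maximum Kubernetes allows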

Let's Encrypt rate limits are brutal - 50 certificates per registered domain per week. We hit this limit during a Kubernetes migration and couldn't get new certs for 3 days. Plan your certificate requests or you're fucked.

DNS propagation delays - DNS-01 challenges fail randomly when DNS changes take forever to propagate. cert-manager retries, but sometimes you have to manually kick your DNS provider. Cloudflare DNS is usually fast, GoDaddy DNS can take 30+ minutes for no good reason.

Monitoring (Because Things Break)

Prometheus metrics tell you when renewals fail. This alert saved my ass when our API gateway cert was about to expire during Christmas break:

- alert: CertManagerCertificateExpirySoon
  expr: certmanager_certificate_expiration_timestamp_seconds - time() < 86400 * 7
  for: 1h
  annotations:
    summary: "Certificate expires in less than 7 days"

Monitor certmanager_certificate_renewal_timestamp_seconds and certmanager_http_acme_client_request_count to catch issues early. Set up alerts for failed renewals too - trust me, you don't want to discover cert-manager is broken when customers start complaining.
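
For the failed-renewal side, one option is to alert on the Ready condition rather than the expiry timestamp. A sketch, assuming the certmanager_certificate_ready_status metric that the controller exposes; the alert name, 15-minute window, and severity label are arbitrary choices:

```yaml
- alert: CertManagerCertificateNotReady
  expr: certmanager_certificate_ready_status{condition="False"} == 1
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "Certificate {{ $labels.namespace }}/{{ $labels.name }} has not been Ready for 15 minutes"
```

This fires while there's still time to fix things, instead of a week before expiry when the renewal has already been failing for a month.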

Security Hardening

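RBAC permissions are broader than you need if you only use DNS-01. Recent chart versions let you drop the role that allows the controller to create HTTP-01 solver pods, services, and ingresses. Confirm the exact key with helm show values for your chart version before relying on it:

global:
  rbac:
    disableHTTPChallengesRole: true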

Private key storage - keys go in Kubernetes Secrets by default. For paranoid environments, use the CSI driver for ephemeral keys that never touch etcd.
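
What the CSI route looks like from the pod's side, as a rough sketch. It assumes the cert-manager csi-driver is installed separately (it is not part of the main chart) and reuses the ca-issuer ClusterIssuer from earlier; the pod name, image, and DNS name are placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: csi-demo               # hypothetical pod
  namespace: default
spec:
  containers:
  - name: app
    image: busybox:1.36
    command: ["sleep", "3600"]
    volumeMounts:
    - name: tls
      mountPath: /tls          # key and cert files show up here
      readOnly: true
  volumes:
  - name: tls
    csi:
      driver: csi.cert-manager.io
      readOnly: true
      volumeAttributes:
        csi.cert-manager.io/issuer-name: ca-issuer
        csi.cert-manager.io/issuer-kind: ClusterIssuer
        csi.cert-manager.io/dns-names: csi-demo.default.svc.cluster.local
```

The key is generated on the node for that pod and disappears with it, so there's no Secret to leak.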

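Certificate approval - approver-policy adds policy-based approval of CertificateRequests for regulated environments; if a human has to sign off, that's cmctl approve with the built-in auto-approver turned off. Slows things down but compliance teams love it.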

Questions People Actually Ask About cert-manager

Why does HTTP-01 keep shitting the bed even though my ingress works fine?

Because Let's Encrypt's validation servers are picky as fuck. They need to hit http://yourdomain.com/.well-known/acme-challenge/some-random-token and get a specific response, and every little thing can break this:

Your load balancer isn't actually public (happened to me with internal ALBs - spent 4 hours wondering why "public" meant "internal")
Wrong kubernetes.io/ingress.class annotation - it has to match exactly
Some firewall rule blocking port 80 (AWS security groups love to do this)
Multiple ingress controllers fighting over the same domain like toddlers over toys
Cloudflare proxy mode fucking with the ACME challenge requests

The error you actually want: kubectl describe certificaterequest <name> - ignore the useless controller logs.

How fucked am I when my certs expired even with cert-manager running?

Pretty fucked, but fixable. cert-manager tries to renew at 60 days (2/3 of Let's Encrypt's 90-day lifetime), so something went seriously wrong:

Let's Encrypt rate limits - 50 certs per domain per week and you probably hit it during a mass renewal (been there, waited 3 days for the counter to reset)
DNS propagation took forever for DNS-01 challenges (looking at you, GoDaddy - 45 minutes is not "instant")
Someone "improved" the ingress config and broke HTTP-01 validation
Resource limits strangling the cert-manager pods - limits sized for a toy cluster are pathetically low for production
Webhook timeouts if your cluster DNS is being shitty

The logs are garbage, but check: kubectl describe certificate <name> and look for events. That's where the real error lives.

Why are these logs complete garbage that tell me absolutely nothing?

Because the cert-manager developers apparently hate us. The controller logs are useless by design. Instead of actual error messages, check the Kubernetes resources where the real information lives:

kubectl describe certificate <name>
kubectl describe certificaterequest <name>
kubectl describe challenge <name>  # for ACME

Look for events and status conditions. The real error is usually in the Challenge or Order resource, not the main logs.

Can I use wildcard certificates with cert-manager?

Yes, but only with DNS-01 challenges. HTTP-01 doesn't support wildcards because Let's Encrypt can't verify *.example.com through HTTP.

You need DNS API access for your provider. Route53, Cloudflare, and others are supported.
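
A wildcard request is just a normal Certificate with a wildcard entry in dnsNames. A sketch, assuming a ClusterIssuer with a dns01 solver; the name letsencrypt-dns is hypothetical, as are the domain and Secret names:

```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: wildcard-example-com
  namespace: default
spec:
  secretName: wildcard-example-com-tls
  dnsNames:
  - "*.example.com"   # quote it, a bare * starts a YAML alias
  - example.com       # the wildcard does not cover the apex
  issuerRef:
    name: letsencrypt-dns   # hypothetical ClusterIssuer with a dns01 solver
    kind: ClusterIssuer
```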

Does this work with Traefik or will they fight?

It works but Traefik's built-in ACME might conflict. Pick one:

Use Traefik's built-in certificate management (simpler)
Use cert-manager with Traefik (more flexible)
Don't use both - they'll fight over the same certificates

If you choose cert-manager, disable Traefik's ACME and use cert-manager annotations on your Ingress resources.
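
Something like this is what the cert-manager side looks like on a Traefik-served Ingress. It's a sketch: the shop service, hostname, and Secret name are placeholders, and it assumes the letsencrypt-prod ClusterIssuer from earlier with its http01 solver class switched to traefik:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: shop                   # hypothetical app
  namespace: default
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  ingressClassName: traefik
  tls:
  - hosts:
    - shop.example.com
    secretName: shop-example-com-tls   # cert-manager creates and renews this Secret
  rules:
  - host: shop.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: shop
            port:
              number: 80
```

The annotation is what triggers cert-manager to create a Certificate for the hosts in the tls block, so leave Traefik's own certificate resolver off this route.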

Why does DNS-01 sit there like a dead fish for 20 minutes?

Because DNS is held together with hopes and prayers. DNS propagation delays will make you question your career choices:

Your Route53 credentials are wrong (happened to me when IAM keys expired)
The hosted zone doesn't actually match your domain (Route53 console lies sometimes)
DNS propagation takes forever - GoDaddy can take 30+ minutes, Namecheap isn't much better
cert-manager lacks IAM permissions to create TXT records
Cloudflare's API decided to be rate-limited right when you need it

I've lost entire weekends to DNS propagation delays. cert-manager checks that the TXT record is visible before asking Let's Encrypt to validate, but real-world DNS providers can stretch that wait far past the few minutes you'd expect.

How do I use this thing with internal domains?

You can't use Let's Encrypt for private domains - they only validate public internet domains. Options:

Use an internal CA issuer with your company's PKI
Use HashiCorp Vault for internal certificates
Set up a self-signed CA for development

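That only applies to names that exist purely in private DNS (.internal, .local, and friends). If your internal services use names under a public domain you own, Let's Encrypt with DNS-01 still works, because validation only needs a TXT record in the public zone, not a reachable service.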
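
For the Vault option, the issuer looks roughly like this. Treat it as a sketch with a pile of assumptions: the Vault address, the PKI mount and role in path, the Vault auth role, and the vault-issuer ServiceAccount are all placeholders you'd swap for whatever your Vault team actually set up:

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: vault-issuer                      # hypothetical name
spec:
  vault:
    server: https://vault.example.internal:8200   # placeholder address
    path: pki_int/sign/internal-services          # placeholder PKI mount and role
    auth:
      kubernetes:
        role: cert-manager                # placeholder Vault auth role
        mountPath: /v1/auth/kubernetes
        serviceAccountRef:
          name: vault-issuer              # placeholder ServiceAccount in the cert-manager namespace
```

cert-manager also needs permission to request tokens for that ServiceAccount, and Vault needs the Kubernetes auth method configured to trust it.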

What happens if cert-manager dies?

Existing certificates keep working - cert-manager doesn't proxy traffic. But renewals stop, so certificates that expire during downtime won't get renewed.

Run multiple replicas for high availability:

helm upgrade cert-manager jetstack/cert-manager \
  --reuse-values \
  --set replicaCount=2
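
The same idea as a values file, extended to the other components. A sketch with arbitrary replica counts; tune them to your cluster:

```yaml
replicaCount: 2          # controller; leader election keeps only one active at a time
webhook:
  replicaCount: 3        # the webhook serves live admission requests, so this one matters most
cainjector:
  replicaCount: 2
podDisruptionBudget:
  enabled: true          # keep a controller pod around during node drains
```

Extra controller replicas only buy you faster failover, not more throughput, since just the leader does the work.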

Monitor cert-manager health and certificate expiration dates.

Can I approve certificate requests manually?

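Yes. Turn off cert-manager's built-in auto-approver and approve each CertificateRequest with cmctl approve <name>. If what you really want is rules rather than a human in the loop, approver-policy approves or denies requests against policies you define. Both are useful for regulated environments.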

Warning: Manual approval breaks automation. Only use if compliance requires it.

Is this going to cost me a fortune to run?

cert-manager itself is free (Apache 2.0 license). Real costs:

Let's Encrypt certificates: Free
Compute resources: roughly 50-100MB RAM and 10m CPU per pod in normal operation
DNS API costs for DNS-01 challenges (Route53 charges per query)
Commercial CA certificates if you need them
Your time debugging when things break

Usually much cheaper than manual certificate management or commercial alternatives.

Should I use ClusterIssuer or Issuer?

Use ClusterIssuer for organization-wide CAs like Let's Encrypt that all teams use.

Use Issuer for namespace-specific CAs or when teams need different certificate authorities.

Most people start with ClusterIssuer and add namespace-specific Issuers later.
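
For comparison, a namespace-scoped Issuer is the same spec with a different kind and a namespace. A minimal sketch; the team-a namespace and Secret name are made up:

```yaml
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: team-a-ca            # hypothetical name
  namespace: team-a          # hypothetical namespace
spec:
  ca:
    secretName: team-a-ca-key-pair   # must live in the same namespace as the Issuer
```

Certificates in team-a reference it with kind: Issuer in issuerRef; Certificates in other namespaces can't use it at all, which is the whole point.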

How do I migrate from manual certificates to cert-manager?

Gradual migration works best:

Install cert-manager alongside existing certificates
Configure ClusterIssuer for your CA (Let's Encrypt, etc.)
Create Certificate resources for new services
Migrate existing services one at a time
Remove manual certificate management scripts

Don't try to migrate everything at once - you'll break production.
