Why You Actually Need cert-manager (And Why Manual Certs Suck)

I discovered cert-manager after our SSL certs expired on Black Friday weekend and killed our entire e-commerce site for 4 hours. Nothing like watching revenue tank while you fumble with Let's Encrypt CLI tools at 3am, trying to explain to the CEO why "certificate expired" means customers can't buy anything.

cert-manager saves your ass by renewing certificates before they expire, so you're not the one getting paged. Jetstack created it in 2017 as the successor to kube-lego, after getting tired of the same certificate management nightmare we all face. The CNCF graduated it on November 12, 2024 because literally everyone was using it anyway - might as well make it official.

The Three Things That Actually Matter

Certificate Resources: You define what domains need certs using Kubernetes custom resources. cert-manager watches these, requests the cert from the configured issuer (over the ACME protocol for Let's Encrypt), stores it in a Secret, and renews it before it expires. Set it once, forget it exists until something breaks (which it rarely does).

Issuer/ClusterIssuer Resources: Point to your certificate authority. Let's Encrypt for public stuff, HashiCorp Vault PKI for internal certificates, or whatever enterprise CA your security team is obsessing over this quarter. ClusterIssuer works cluster-wide; Issuer is namespace-scoped.

CertificateRequest Resources: One-shot requests that wrap a PEM-encoded PKCS#10 certificate signing request for an X.509 cert. You usually don't touch these - cert-manager creates them automatically whenever a certificate needs issuing or renewing. Only mess with these if you know what you're doing or you're debugging a failed issuance.
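
To make those three resources concrete, here is a minimal sketch of a Certificate. The hostname, namespace, and Secret name are placeholders, and it assumes a ClusterIssuer called letsencrypt-prod already exists in the cluster.

```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: shop-example-com          # placeholder name
  namespace: default
spec:
  secretName: shop-example-com-tls  # cert-manager writes tls.crt / tls.key here
  dnsNames:
    - shop.example.com            # placeholder hostname
  issuerRef:
    name: letsencrypt-prod        # assumed to exist already
    kind: ClusterIssuer
```

Applying this makes cert-manager create a CertificateRequest behind the scenes; you only look at that object when issuance fails.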

The Numbers Don't Lie (Because Everyone's Been Burned)

The 500+ million monthly downloads sound like marketing bullshit but they're real - it's because we've all been there. Your site goes down because some certificate you forgot about expired. Kubernetes adoption surveys show 86% of production clusters run cert-manager because manual cert renewal is like playing Russian roulette with production.

Real talk: I've personally seen Let's Encrypt rate limits fuck over teams who waited until the last minute to renew 20+ domain certs. That 50 certificates per domain per week limit hits hard when you're scrambling.

What Works (And What Doesn't)

HTTP-01 challenges work great until your ingress controller decides to shit the bed. Let's Encrypt tries to validate ownership by hitting yourdomain.com/.well-known/acme-challenge/some-token and if that returns 404 or times out, you're fucked. I've spent hours debugging ingress-nginx configuration just to get ACME working again.

DNS-01 challenges are your friend for internal services and wildcard certs, but DNS provider APIs are consistently terrible. cert-manager creates TXT records like _acme-challenge.example.com to prove domain ownership. Works with Route53, Cloudflare, Google Cloud DNS, and 50+ other providers. DNS propagation can take forever though - GoDaddy is especially slow as hell.
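
One way to get both challenge types out of a single issuer is to list two solvers and let a selector decide. This is a sketch, not a drop-in config: the issuer name, the example.com zone, and the cloudflare-api-token Secret are all assumptions you would replace with your own.

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-mixed          # hypothetical name
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: ops@yourcompany.com
    privateKeySecretRef:
      name: letsencrypt-mixed-account-key
    solvers:
    # Fallback: HTTP-01 through the ingress controller
    - http01:
        ingress:
          ingressClassName: nginx
    # Names under example.com (wildcards included) use DNS-01 instead
    - selector:
        dnsZones:
        - example.com
      dns01:
        cloudflare:
          apiTokenSecretRef:
            name: cloudflare-api-token   # Secret you create yourself
            key: api-token
```

The solver with the most specific matching selector wins, so names outside example.com still fall through to HTTP-01.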

The latest version 1.18.2 from July 2, 2025 fixes private key rotation edge cases that caused certificates to randomly fail validation. Nothing revolutionary, just fewer "why the fuck did this break" moments. Always check the upgrade guide - cert-manager migrations have burned me before when webhook configurations changed between versions.

cert-manager vs The Alternatives (Reality Check)

| What You Actually Get | cert-manager | Traefik Built-in | Manual ACME Scripts | Cloud Provider Certs | Vault PKI |
| --- | --- | --- | --- | --- | --- |
| Setup Complexity | Install once, works everywhere | Easy if you use Traefik | Pain in the ass bash scripting | AWS Console clicking | Vault expertise required |
| When It Breaks | Good error messages, docs actually help | Traefik community forums if you're lucky | You're fucked and on your own | AWS support plans cost extra | Read HashiCorp docs for hours |
| Certificate Authorities | Let's Encrypt, Vault, Venafi, custom CAs | Mainly Let's Encrypt | Let's Encrypt focused | Whatever AWS/Azure supports | Internal PKI only |
| Automation Reality | Set it and forget it | Works until it doesn't | Cron jobs and prayer | Mostly works | Someone has to click buttons |
| Wildcard Certificates | DNS-01 challenge handles it | Supported with DNS API | Requires DNS-01 scripting | Provider-dependent | Works fine |
| Multi-cluster | Install per cluster (ClusterIssuer spans namespaces, not clusters) | Configure per cluster | Custom tooling required | Per-cloud setup | Vault per cluster |
| Learning Curve | Official tutorials work | Low if you know Traefik | High (shell scripting + ACME protocol) | Easy for cloud natives | High (Vault is complicated) |
| Vendor Lock-in | None | Traefik-specific | None | Locked to your cloud | HashiCorp ecosystem |
| Cost | Free | Free | Free + your time debugging | $$ per certificate | Vault Enterprise $$$$ |
| Production Failures | Rare if configured correctly | Sometimes Traefik updates break things | Your bash script fails at 3 AM | Usually stable | Vault downtime = no certs |

How to Actually Deploy cert-manager (Without Breaking Everything)

Installation Reality Check

Skip the kubectl apply bullshit with raw YAML - that path leads to webhook validation failures and hours of debugging. The Helm chart actually works:

helm repo add jetstack https://charts.jetstack.io
helm repo update
helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --create-namespace \
  --set crds.enabled=true

That --set crds.enabled=true flag is critical (it replaced the older installCRDs=true, which is deprecated since chart v1.15) - without it you have to manually install custom resource definitions and trust me, you'll fuck up the order. I spent 6 hours debugging webhook failures because I forgot this flag. The installation docs bury this detail for some reason.

You get three pods that actually do stuff: the main controller handles certificates, a validating webhook validates your YAML, and a CA injector patches webhook configurations with CA bundles. Resource requirements are reasonable - about 50MB RAM and 10m CPU each in normal operation. Scales well unless you're managing thousands of certificates across multiple clusters.

Multi-Issuer Setup (The Right Way)

Production environments need multiple certificate sources. Here's what actually works:

Let's Encrypt for public services - free, automated, works everywhere:

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: ops@yourcompany.com  # ACME account contact (Let's Encrypt stopped sending expiry emails in June 2025)
    privateKeySecretRef:
      name: letsencrypt-prod
    solvers:
    - http01:
        ingress:
          class: nginx  # or traefik, whatever you use

Internal CA for internal services - because security teams love internal PKI:

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: ca-issuer
spec:
  ca:
    secretName: ca-key-pair  # for a ClusterIssuer this Secret must live in the cert-manager namespace
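
If you don't already have a CA key pair to put in that Secret, one common way to bootstrap it is a self-signed issuer plus a CA Certificate. This is a sketch for dev or lab use; the selfsigned-bootstrap and internal-root-ca names are made up, and a real internal PKI would normally hand you the root instead.

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: selfsigned-bootstrap       # hypothetical name
spec:
  selfSigned: {}
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: internal-root-ca           # hypothetical name
  namespace: cert-manager          # same namespace the ClusterIssuer reads Secrets from
spec:
  isCA: true
  commonName: internal-root-ca
  secretName: ca-key-pair          # matches the Secret the CA issuer references
  privateKey:
    algorithm: ECDSA
    size: 256
  issuerRef:
    name: selfsigned-bootstrap
    kind: ClusterIssuer
    group: cert-manager.io
```

Clients still have to trust this root themselves; cert-manager issues from it but does not distribute it to your workloads' trust stores.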

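Staging environment - use Let's Encrypt staging for testing to avoid rate limits (the certs are not browser-trusted, which is the point):

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-staging
spec:
  acme:
    server: https://acme-staging-v02.api.letsencrypt.org/directory
    email: ops@yourcompany.com
    privateKeySecretRef:
      name: letsencrypt-staging  # separate account key from prod
    solvers:
    - http01:
        ingress:
          class: nginx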

DNS-01 Challenges (When HTTP-01 Won't Work)

Internal services and wildcard certificates need DNS-01. Route53 integration works best:

spec:
  acme:
    solvers:
    - dns01:
        route53:
          region: us-east-1
          accessKeyID: AKIAIOSFODNN7EXAMPLE  # Use IRSA instead
          secretAccessKeySecretRef:
            name: route53-credentials
            key: secret-access-key

Pro tip: Use IAM Roles for Service Accounts (IRSA) instead of hardcoded AWS access keys. Your security team will thank you, and AWS security best practices recommend avoiding long-term credentials.
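
With IRSA the static key fields disappear from the solver entirely. The sketch below assumes an EKS cluster with an OIDC provider already set up; the role ARN and the issuer name are hypothetical, and the IAM role itself (with Route53 permissions on your hosted zone) has to be created separately.

```yaml
# 1) Helm values fragment for the cert-manager chart:
#    annotate the controller's ServiceAccount with the IAM role.
serviceAccount:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/cert-manager-route53  # hypothetical ARN
---
# 2) ClusterIssuer manifest: no accessKeyID, no secretAccessKeySecretRef.
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-route53        # hypothetical name
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: ops@yourcompany.com
    privateKeySecretRef:
      name: letsencrypt-route53-account-key
    solvers:
    - dns01:
        route53:
          region: us-east-1        # credentials come from the pod's IAM role
```

Ambient credentials like this are picked up by ClusterIssuers out of the box; namespaced Issuers need them switched on explicitly in the controller.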

Service Mesh Integration (If You Must)

istio-csr replaces Istio's built-in certificate management with cert-manager. Useful for consistent certificate policies across service mesh and ingress:

helm install -n istio-system cert-manager-istio-csr jetstack/cert-manager-istio-csr

It handles mTLS certificates for service-to-service communication. Works fine but adds complexity - only use if you need unified certificate management.

What Burns Down Production (Learn From My Pain)

Resource limits will crash cert-manager under load - the Helm chart ships with no requests or limits set, so you inherit whatever your namespace LimitRange or copy-pasted values hand you, and those are usually pathetically small. Our cert-manager died during a mass certificate renewal, took us 3 hours to figure out why:

resources:
  limits:
    cpu: 200m      # 100m was not enough for us
    memory: 256Mi   # 128Mi will OOM under pressure
  requests:
    cpu: 50m       # Be realistic about CPU needs
    memory: 64Mi

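Webhook timeouts kill everything - a 10-second admission webhook timeout is too short when the API server's path to the webhook pod is slow (flaky cluster DNS, CNI hiccups, private control planes). Note that this is the API server waiting on cert-manager's webhook, not Let's Encrypt waiting on your DNS; slow DNS-01 propagation is a separate problem, covered below. 30 seconds is the most Kubernetes allows:

webhook:
  timeoutSeconds: 30  # Maximum allowed for admission webhooks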
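
The two fragments above are Helm values, and it helps to see where they sit in one file. This is a sketch with a hypothetical file name; the webhook and cainjector numbers are starting-point guesses, not measured figures, so size them from your own metrics.

```yaml
# values.yaml (hypothetical file name), applied with something like:
#   helm upgrade cert-manager jetstack/cert-manager -n cert-manager -f values.yaml
resources:                 # top-level key = the controller pod
  requests:
    cpu: 50m
    memory: 64Mi
  limits:
    cpu: 200m
    memory: 256Mi
webhook:
  timeoutSeconds: 30       # API server wait time for the admission webhook
  resources:               # assumed starting point
    requests:
      cpu: 10m
      memory: 32Mi
    limits:
      memory: 128Mi
cainjector:
  resources:               # assumed starting point; grows with cluster size
    requests:
      cpu: 10m
      memory: 64Mi
    limits:
      memory: 256Mi
```

Keeping this in a versioned file instead of a pile of --set flags also means the next upgrade does not silently drop your limits.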

Let's Encrypt rate limits are brutal - 50 certificates per registered domain per week. We hit this limit during a Kubernetes migration and couldn't get new certs for 3 days. Plan your certificate requests or you're fucked.

DNS propagation delays - DNS-01 challenges fail randomly when DNS changes take forever to propagate. cert-manager retries, but sometimes you have to manually kick your DNS provider. Cloudflare DNS is usually fast, GoDaddy DNS can take 30+ minutes for no good reason.

Monitoring (Because Things Break)

Prometheus metrics tell you when renewals fail. This alert saved my ass when our API gateway cert was about to expire during Christmas break:

- alert: CertManagerCertificateExpirySoon
  expr: certmanager_certificate_expiration_timestamp_seconds - time() < 86400 * 7
  for: 1h
  annotations:
    summary: "Certificate expires in less than 7 days"

Monitor certmanager_certificate_renewal_timestamp_seconds and certmanager_http_acme_client_request_count to catch issues early. Set up alerts for failed renewals too - trust me, you don't want to discover cert-manager is broken when customers start complaining.
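
For the failed-renewal side, a rule on the Ready condition is a reasonable starting point. This is a sketch of a Prometheus rule file; the group name, alert name, and 15-minute window are assumptions, and it relies on the certmanager_certificate_ready_status metric exposed by the controller.

```yaml
groups:
  - name: cert-manager            # hypothetical group name
    rules:
      - alert: CertManagerCertificateNotReady
        # Fires when a Certificate has reported Ready=False for 15 minutes
        expr: max by (namespace, name) (certmanager_certificate_ready_status{condition="False"}) == 1
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Certificate {{ $labels.namespace }}/{{ $labels.name }} is not Ready"
```

This catches issuance that is stuck right now, while the expiry alert catches the slow-motion version of the same failure.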

Security Hardening

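RBAC permissions are overly broad by default, so turn off whatever you don't use. Version 1.18 lets you disable HTTP-01 challenges if you only use DNS-01; chart keys move between releases, so confirm this one shows up in helm show values jetstack/cert-manager for your version before you rely on it: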

global:
  disableHTTP01Solver: true

Private key storage - keys go in Kubernetes Secrets by default. For paranoid environments, use the CSI driver for ephemeral keys that never touch etcd.

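Certificate approval - approver-policy adds policy-based approval for regulated environments: requests that don't match a CertificateRequestPolicy get denied instead of auto-approved. If an actual human has to sign off, that's cmctl approve with the built-in auto-approver disabled. Slows things down but compliance teams love it.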
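
For the CSI driver option, the certificate is requested when the pod starts and lives only in the pod's volume. A minimal sketch, assuming the cert-manager csi-driver is installed separately and a ClusterIssuer named ca-issuer exists; the pod name and image are placeholders.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: csi-tls-demo               # placeholder name
  namespace: default
spec:
  containers:
    - name: app
      image: busybox:1.36          # placeholder image
      command: ["sleep", "3600"]
      volumeMounts:
        - name: tls
          mountPath: /tls          # tls.crt and tls.key show up here
          readOnly: true
  volumes:
    - name: tls
      csi:
        driver: csi.cert-manager.io
        readOnly: true
        volumeAttributes:
          csi.cert-manager.io/issuer-name: ca-issuer
          csi.cert-manager.io/issuer-kind: ClusterIssuer
          csi.cert-manager.io/dns-names: ${POD_NAME}.${POD_NAMESPACE}.svc.cluster.local
```

The private key is generated on the node for that pod and goes away with it, which is the whole appeal over a long-lived Secret.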

Questions People Actually Ask About cert-manager

Q

Why does HTTP-01 keep shitting the bed even though my ingress works fine?

A

Because Let's Encrypt's validation servers are picky as fuck. They need to hit http://yourdomain.com/.well-known/acme-challenge/some-random-token and get a specific response, and every little thing can break this:

  • Your load balancer isn't actually public (happened to me with internal ALBs - spent 4 hours wondering why "public" meant "internal")
  • Wrong ingress class (the old kubernetes.io/ingress.class annotation or spec.ingressClassName) - it has to match what the solver is configured for exactly
  • Some firewall rule blocking port 80 (AWS security groups love to do this)
  • Multiple ingress controllers fighting over the same domain like toddlers over toys
  • CloudFlare proxy mode fucking with the ACME challenge requests

The error you actually want: kubectl describe certificaterequest <name> - ignore the useless controller logs.

Q

How fucked am I when my certs expired even with cert-manager running?

A

Pretty fucked, but fixable. cert-manager tries to renew at 60 days (2/3 of Let's Encrypt's 90-day lifetime), so something went seriously wrong:

  • Let's Encrypt rate limits - 50 certs per domain per week and you probably hit it during a mass renewal (been there, waited 3 days for the counter to reset)
  • DNS propagation took forever for DNS-01 challenges (looking at you, GoDaddy - 45 minutes is not "instant")
  • Someone "improved" the ingress config and broke HTTP-01 validation
  • Resource limits strangling the cert-manager pods - whatever limits you inherited are probably too low for production
  • Webhook timeouts if your cluster DNS is being shitty

The logs are garbage, but check: kubectl describe certificate <name> and look for events. That's where the real error lives.
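
If the default renewal point is too tight for your on-call rotation, you can set the window explicitly. A sketch with placeholder names; note that Let's Encrypt decides the lifetime itself, so duration here documents the expectation while renewBefore is the knob that actually moves the renewal.

```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: api-example-com            # placeholder name
  namespace: default
spec:
  secretName: api-example-com-tls
  dnsNames:
    - api.example.com              # placeholder hostname
  duration: 2160h                  # 90 days (2160 / 24)
  renewBefore: 720h                # 30 days before expiry, i.e. renew at day 60
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
```

With 2160h and 720h the renewal lands at 1440h, which is the same day-60 mark as the default; bump renewBefore if you want more slack to notice failures.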

Q

Why are these logs complete garbage that tell me absolutely nothing?

A

Because the cert-manager developers apparently hate us. The controller logs are useless by design. Instead of actual error messages, check the Kubernetes resources where the real information lives:

kubectl describe certificate <name>
kubectl describe certificaterequest <name>
kubectl describe order <name>      # for ACME
kubectl describe challenge <name>  # for ACME

Look for events and status conditions. The real error is usually in the Challenge or Order resource, not the main logs.

Q

Can I use wildcard certificates with cert-manager?

A

Yes, but only with DNS-01 challenges. HTTP-01 doesn't support wildcards because Let's Encrypt can't verify *.example.com through HTTP.

You need DNS API access for your provider. Route53, Cloudflare, and others are supported.
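
A wildcard request looks like any other Certificate, just with a starred name and an issuer that has a DNS-01 solver. In this sketch the issuer name letsencrypt-dns01 and the example.com domain are assumptions.

```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: wildcard-example-com       # placeholder name
  namespace: default
spec:
  secretName: wildcard-example-com-tls
  dnsNames:
    - "*.example.com"              # quote it, or YAML treats * as an alias
    - example.com                  # the wildcard does not cover the apex
  issuerRef:
    name: letsencrypt-dns01        # hypothetical issuer with a dns01 solver
    kind: ClusterIssuer
```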

Q

Does this work with Traefik or will they fight?

A

It works but Traefik's built-in ACME might conflict. Pick one:

  • Use Traefik's built-in certificate management (simpler)
  • Use cert-manager with Traefik (more flexible)
  • Don't use both - they'll fight over the same certificates

If you choose cert-manager, disable Traefik's ACME and use cert-manager annotations on your Ingress resources.
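
The annotation route looks roughly like this. It is a sketch that assumes Traefik is running as an ingress controller with an IngressClass named traefik, its own ACME resolver is switched off, and a letsencrypt-prod ClusterIssuer exists; the host and backend Service are placeholders.

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: shop                       # placeholder name
  namespace: default
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  ingressClassName: traefik
  tls:
    - hosts:
        - shop.example.com         # placeholder hostname
      secretName: shop-example-com-tls   # cert-manager creates and fills this
  rules:
    - host: shop.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: shop         # placeholder Service
                port:
                  number: 80
```

cert-manager sees the annotation, creates a Certificate for the hosts in the tls block, and Traefik simply serves whatever ends up in the Secret.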

Q

Why does DNS-01 sit there like a dead fish for 20 minutes?

A

Because DNS is held together with hopes and prayers. DNS propagation delays will make you question your career choices:

  • Your Route53 credentials are wrong (happened to me when IAM keys expired)
  • The hosted zone doesn't actually match your domain (Route53 console lies sometimes)
  • DNS propagation takes forever - GoDaddy can take 30+ minutes, Namecheap isn't much better
  • cert-manager lacks IAM permissions to create TXT records
  • Cloudflare's API decided to be rate-limited right when you need it

I've lost entire weekends to DNS propagation delays. cert-manager checks that the TXT record is visible before it asks the CA to validate, but real-world DNS providers can stall that check far longer than you'd expect.

Q

How do I use this thing with internal domains?

A

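You can't use Let's Encrypt for names that aren't under a public domain (.local, .internal, bare hostnames) - they only validate public DNS names. Options:

  • Use an internal CA issuer with your company's PKI
  • Use HashiCorp Vault for internal certificates
  • Set up a self-signed CA for development

If your internal services sit under a public domain you control (say app.internal.example.com), Let's Encrypt still works through DNS-01: only the _acme-challenge TXT record in the public zone has to be reachable, not the service itself.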
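
For the Vault option, the issuer points at a PKI signing path and authenticates with a Kubernetes ServiceAccount. Treat this as a sketch: the Vault URL, PKI path, role, namespace, and ServiceAccount name are all hypothetical, and it assumes Vault's Kubernetes auth method and a PKI role are already configured on the Vault side.

```yaml
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: vault-issuer               # hypothetical name
  namespace: internal-apps         # hypothetical namespace
spec:
  vault:
    server: https://vault.example.internal:8200   # hypothetical Vault address
    path: pki_int/sign/internal-services          # hypothetical PKI mount and role
    auth:
      kubernetes:
        role: cert-manager                        # Vault role bound to the ServiceAccount
        mountPath: /v1/auth/kubernetes
        serviceAccountRef:
          name: vault-issuer                      # ServiceAccount in the same namespace
```

cert-manager also needs RBAC permission to mint tokens for that ServiceAccount, and a caBundle field if Vault's TLS certificate is signed by a private CA.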

Q

What happens if cert-manager dies?

A

Existing certificates keep working - cert-manager doesn't proxy traffic. But renewals stop, so certificates that expire during downtime won't get renewed.

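Run multiple replicas for high availability (the controller uses leader election, so the second replica is a hot standby, not extra throughput):

helm upgrade cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --reuse-values \
  --set replicaCount=2

Monitor cert-manager health and certificate expiration dates.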
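
The replicaCount flag above only covers the controller. A values-file sketch that also scales the webhook and CA injector is below; the replica counts are a suggestion, and the PodDisruptionBudget keys are worth confirming against your chart version before relying on them.

```yaml
# HA-oriented Helm values fragment (replica counts are assumptions)
replicaCount: 2            # controller: one leader, one standby
webhook:
  replicaCount: 2          # webhook replicas all serve traffic
  podDisruptionBudget:
    enabled: true
cainjector:
  replicaCount: 2          # leader-elected like the controller
podDisruptionBudget:
  enabled: true            # keeps a node drain from evicting every controller pod
```

The webhook is the piece worth scaling first: if it is down, the API server rejects every cert-manager resource you try to create or update.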

Q

Can I approve certificate requests manually?

A

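Yes. Disable the built-in auto-approver and approve requests by hand with cmctl approve, or use approver-policy so that only requests matching a written policy get approved. Useful for regulated environments where certificate issuance has to be gated.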

Warning: Manual approval breaks automation. Only use if compliance requires it.
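
If policy rather than a human is enough for your auditors, an approver-policy rule can look like the sketch below. The policy name and the internal.example.com domain are made up, it assumes approver-policy is installed and a ClusterIssuer named ca-issuer exists, and the requesting users or ServiceAccounts still need RBAC permission to use the policy.

```yaml
apiVersion: policy.cert-manager.io/v1alpha1
kind: CertificateRequestPolicy
metadata:
  name: internal-names-only        # hypothetical name
spec:
  allowed:
    dnsNames:
      values:
        - "*.internal.example.com" # only names under this domain pass
      required: true               # a request with no DNS names is denied
  selector:
    issuerRef:                     # applies to requests aimed at this issuer
      name: ca-issuer
      kind: ClusterIssuer
      group: cert-manager.io
```

Requests for anything else against that issuer are denied, which keeps renewals automatic for the names you have already signed off on.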

Q

Is this going to cost me a fortune to run?

A

cert-manager itself is free (Apache 2.0 license). Real costs:

  • Let's Encrypt certificates: Free
  • Compute resources: roughly 50-100MB RAM and 10m CPU per pod in normal operation
  • DNS API costs for DNS-01 challenges (Route53 charges per query)
  • Commercial CA certificates if you need them
  • Your time debugging when things break

Usually much cheaper than manual certificate management or commercial alternatives.

Q

Should I use ClusterIssuer or Issuer?

A

Use ClusterIssuer for organization-wide CAs like Let's Encrypt that all teams use.

Use Issuer for namespace-specific CAs or when teams need different certificate authorities.

Most people start with ClusterIssuer and add namespace-specific Issuers later.

Q

How do I migrate from manual certificates to cert-manager?

A

Gradual migration works best:

  1. Install cert-manager alongside existing certificates
  2. Configure ClusterIssuer for your CA (Let's Encrypt, etc.)
  3. Create Certificate resources for new services
  4. Migrate existing services one at a time
  5. Remove manual certificate management scripts

Don't try to migrate everything at once - you'll break production.
