Why does HTTP-01 keep shitting the bed even though my ingress works fine?

Because Let's Encrypt's validation servers are picky as fuck. They need to hit `http://yourdomain.com/.well-known/acme-challenge/some-random-token` and get a specific response, and every little thing can break this:

- Your load balancer isn't actually public (happened to me with internal ALBs - spent 4 hours wondering why "public" meant "internal")
- Wrong `kubernetes.io/ingress.class` annotation - it has to match exactly
- Some firewall rule blocking port 80 (AWS security groups love to do this)
- Multiple ingress controllers fighting over the same domain like toddlers over toys
- Cloudflare proxy mode fucking with the ACME challenge requests

The error you actually want: `kubectl describe certificaterequest <name>` - ignore the useless controller logs.
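To see roughly what Let's Encrypt sees, hit the challenge path yourself from outside your network. A minimal sketch: `acme_challenge_url` and `probe-token` are made-up names for illustration, and `yourdomain.com` is a placeholder.

```bash
# Build the URL that gets requested during HTTP-01 validation.
# Usage: acme_challenge_url <domain> <token>
acme_challenge_url() {
  printf 'http://%s/.well-known/acme-challenge/%s\n' "$1" "$2"
}

# Run this from a machine OUTSIDE your network (not from inside the cluster):
#   curl -sS -m 10 -o /dev/null -w '%{http_code}\n' \
#     "$(acme_challenge_url yourdomain.com probe-token)"
```

A 404 for a fake token still tells you port 80 is reachable and something is answering. A timeout or connection refused points at the load balancer or firewall, not cert-manager.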

How fucked am I when my certs expired even with cert-manager running?

Pretty fucked, but fixable. cert-manager tries to renew at 60 days (2/3 of Let's Encrypt's 90-day lifetime), so something went seriously wrong:

- Let's Encrypt rate limits - 50 certs per domain per week and you probably hit it during a mass renewal (been there, waited 3 days for the counter to reset)
- DNS propagation took forever for DNS-01 challenges (looking at you, GoDaddy - 45 minutes is not "instant")
- Someone "improved" the ingress config and broke HTTP-01 validation
- Resource limits strangling the cert-manager pods - default limits are pathetically low for production
- Webhook timeouts if your cluster DNS is being shitty

The logs are garbage, but check `kubectl describe certificate <name>` and look for events. That's where the real error lives.
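The 2/3 rule is plain arithmetic. A tiny sketch (the function name `renewal_day` is made up) for working out when renewal should have kicked in for a given lifetime:

```bash
# Day of the certificate's life on which renewal starts, assuming the
# "renew at 2/3 of the lifetime" behaviour described above.
# Usage: renewal_day <lifetime-in-days>
renewal_day() {
  echo $(( $1 * 2 / 3 ))
}

renewal_day 90   # prints 60, which leaves 30 days of retries before expiry
```

So if a 90-day cert actually expired, renewals had been failing for about a month, and the events should be full of errors.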

Why are these logs complete garbage that tell me absolutely nothing?

Because the cert-manager developers apparently hate us. The controller logs are useless by design. Instead of actual error messages, check the Kubernetes resources where the real information lives:

```bash
kubectl describe certificate <name>
kubectl describe certificaterequest <name>
kubectl describe challenge <name>  # for ACME
```

Look for events and status conditions. The real error is usually in the Challenge or Order resource, not the main logs.
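If you don't know the generated resource names, walk the chain from the Certificate down. A rough sketch: `cert_debug_commands` is a made-up helper that only prints the commands, `default` and `my-cert` are placeholders, and Order and Challenge only exist for ACME issuers.

```bash
# Print the kubectl commands for walking the issuance chain, top to bottom.
# Usage: cert_debug_commands <namespace> <certificate-name>
cert_debug_commands() {
  ns=$1
  cert=$2
  echo "kubectl -n $ns describe certificate $cert"
  echo "kubectl -n $ns get certificaterequest,order,challenge"
  echo "kubectl -n $ns describe order"
  echo "kubectl -n $ns describe challenge"
}

# Review the output first, then run it:
#   cert_debug_commands default my-cert | sh
```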

Can I use wildcard certificates with cert-manager?

Yes, but only with DNS-01 challenges. HTTP-01 doesn't support wildcards because Let's Encrypt can't verify `*.example.com` through HTTP. You need DNS API access for your provider. [Route53](https://cert-manager.io/docs/configuration/acme/dns01/route53/), [Cloudflare](https://cert-manager.io/docs/configuration/acme/dns01/cloudflare/), and [others are supported](https://cert-manager.io/docs/configuration/acme/dns01/).
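A minimal sketch of a wildcard request, assuming you already have a ClusterIssuer with a DNS-01 solver. The helper `wildcard_certificate`, the resource name `wildcard-cert` and the issuer name `letsencrypt-prod` are all illustrative, not anything cert-manager ships with.

```bash
# Emit a Certificate manifest covering both the apex and the wildcard.
# Usage: wildcard_certificate <domain> <cluster-issuer-name>
wildcard_certificate() {
  cat <<EOF
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: wildcard-cert
spec:
  secretName: wildcard-cert-tls
  dnsNames:
    - "$1"
    - "*.$1"
  issuerRef:
    name: $2
    kind: ClusterIssuer
EOF
}

# Apply with:
#   wildcard_certificate example.com letsencrypt-prod | kubectl apply -f -
```

Both names are listed because `*.example.com` does not cover `example.com` itself.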

Does this work with Traefik or will they fight?

It works but Traefik's built-in ACME might conflict. Pick one:

- Use Traefik's built-in certificate management (simpler)
- Use cert-manager with Traefik (more flexible)
- Don't use both - they'll fight over the same certificates

If you choose cert-manager, disable Traefik's ACME and use cert-manager annotations on your Ingress resources.
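Here's roughly what the cert-manager route looks like on an Ingress. This is a sketch under assumptions: the helper `traefik_ingress` is made up, your IngressClass is assumed to be named `traefik`, and the backend is assumed to listen on port 80.

```bash
# Emit an Ingress that asks cert-manager (not Traefik) for the certificate.
# Usage: traefik_ingress <host> <service-name> <cluster-issuer-name>
traefik_ingress() {
  cat <<EOF
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: $2
  annotations:
    cert-manager.io/cluster-issuer: $3
spec:
  ingressClassName: traefik
  tls:
    - hosts:
        - $1
      secretName: $2-tls
  rules:
    - host: $1
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: $2
                port:
                  number: 80
EOF
}

# Apply with:
#   traefik_ingress app.example.com web letsencrypt-prod | kubectl apply -f -
```

The annotation plus the `tls` block is what makes cert-manager create the Certificate and write it into the named secret.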

Why does DNS-01 sit there like a dead fish for 20 minutes?

Because DNS is held together with hopes and prayers. DNS propagation delays will make you question your career choices:

- Your Route53 credentials are wrong (happened to me when IAM keys expired)
- The hosted zone doesn't actually match your domain (Route53 console lies sometimes)
- DNS propagation takes forever - GoDaddy can take 30+ minutes, Namecheap isn't much better
- cert-manager lacks IAM permissions to create TXT records
- Cloudflare's API decided to be rate-limited right when you need it

I've lost entire weekends to DNS propagation delays. The ACME spec says wait up to 5 minutes, but real-world DNS providers laugh at that.
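Before blaming cert-manager, check whether the TXT record exists at all. A small sketch: `acme_txt_name` is a made-up helper, and `1.1.1.1` is just one public resolver you could ask.

```bash
# Name of the TXT record the CA looks up for DNS-01.
# A wildcard is validated at the parent name, so a leading "*." is stripped.
# Usage: acme_txt_name <domain>
acme_txt_name() {
  printf '_acme-challenge.%s\n' "${1#\*.}"
}

# Ask a public resolver, not your local cache:
#   dig +short TXT "$(acme_txt_name '*.example.com')" @1.1.1.1
```

Empty output means the record was never created (credentials, permissions, wrong zone). A record that shows up on one resolver but not another is propagation.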

How do I use this thing with internal domains?

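You can't use Let's Encrypt for names that only exist on your internal network - it only validates domains it can check over the public internet. Options:

- Use an internal CA issuer with your company's PKI
- Use HashiCorp Vault for internal certificates
- Set up a self-signed CA for development

The one exception: if your internal services live under a publicly registered domain, DNS-01 still works, because only the `_acme-challenge` TXT record has to be publicly resolvable, not the service itself.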
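For the self-signed CA option, the usual bootstrap is three resources: a self-signed issuer, a CA certificate signed by it, and a CA issuer that uses that certificate. A development-only sketch: every resource name here is made up, and it assumes cert-manager runs in the `cert-manager` namespace.

```bash
# Emit the manifests for a throwaway internal CA.
internal_ca_manifests() {
  cat <<'EOF'
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: selfsigned-bootstrap
spec:
  selfSigned: {}
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: internal-ca
  namespace: cert-manager
spec:
  isCA: true
  commonName: internal-ca
  secretName: internal-ca-key-pair
  issuerRef:
    name: selfsigned-bootstrap
    kind: ClusterIssuer
---
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: internal-ca
spec:
  ca:
    secretName: internal-ca-key-pair
EOF
}

# Apply with:
#   internal_ca_manifests | kubectl apply -f -
```

Certificates then reference the `internal-ca` ClusterIssuer. Clients have to trust the CA certificate themselves, which is why this is only sensible for development.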

What happens if cert-manager dies?

Existing certificates keep working - cert-manager doesn't proxy traffic. But renewals stop, so certificates that expire during downtime won't get renewed. Run multiple replicas for high availability:

```bash
helm upgrade cert-manager jetstack/cert-manager \
  --reuse-values \
  --set replicaCount=2
```

Monitor cert-manager health and certificate expiration dates.

Can I approve certificate requests manually?

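Yes. Every CertificateRequest has to be approved before it gets signed; cert-manager approves them automatically by default, and approver-policy lets you swap that for rules you define. Useful for regulated environments where humans need to control certificate issuance. Warning: Putting a human in the approval path breaks automation. Only use if compliance requires it.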

Is this going to cost me a fortune to run?

cert-manager itself is free (Apache 2.0 license). Real costs:

- Let's Encrypt certificates: Free
- Compute resources: ~100MB RAM, 10m CPU per pod
- DNS API costs for DNS-01 challenges (Route53 charges per query)
- Commercial CA certificates if you need them
- Your time debugging when things break

Usually much cheaper than manual certificate management or commercial alternatives.

Should I use ClusterIssuer or Issuer?

Use **ClusterIssuer** for organization-wide CAs like Let's Encrypt that all teams use. Use **Issuer** for namespace-specific CAs or when teams need different certificate authorities. Most people start with ClusterIssuer and add namespace-specific Issuers later.
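The two kinds take the same spec; only the `kind` and the scope differ. A sketch that emits either one against Let's Encrypt staging: the helper `acme_issuer`, the names, the emails and the `nginx` ingress class are assumptions, and older cert-manager releases spell the solver field `class` rather than `ingressClassName`.

```bash
# Emit a Let's Encrypt staging issuer as either kind.
# Usage: acme_issuer <Issuer|ClusterIssuer> <name> <email>
acme_issuer() {
  cat <<EOF
apiVersion: cert-manager.io/v1
kind: $1
metadata:
  name: $2
spec:
  acme:
    server: https://acme-staging-v02.api.letsencrypt.org/directory
    email: $3
    privateKeySecretRef:
      name: $2-account-key
    solvers:
      - http01:
          ingress:
            ingressClassName: nginx
EOF
}

# Cluster-wide:
#   acme_issuer ClusterIssuer letsencrypt-staging ops@example.com | kubectl apply -f -
# One namespace only:
#   acme_issuer Issuer team-a-issuer team-a@example.com | kubectl apply -n team-a -f -
```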

How do I migrate from manual certificates to cert-manager?

Gradual migration works best:

1. Install cert-manager alongside existing certificates
2. Configure ClusterIssuer for your CA (Let's Encrypt, etc.)
3. Create Certificate resources for new services
4. Migrate existing services one at a time
5. Remove manual certificate management scripts

Don't try to migrate everything at once - you'll break production.
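To track step 4, it helps to know which TLS secrets cert-manager doesn't own yet. A rough sketch: it assumes cert-manager stamps its secrets with the `cert-manager.io/certificate-name` annotation, `unmanaged_tls_secrets` is a made-up helper, and the jsonpath expression in the usage comment is untested against your kubectl version.

```bash
# Read "<namespace>/<name><TAB><cert-manager certificate name>" lines on stdin
# and print the secrets that have no cert-manager owner yet.
unmanaged_tls_secrets() {
  awk -F '\t' '$2 == "" { print $1 }'
}

# Feed it every TLS secret in the cluster:
#   kubectl get secrets -A --field-selector type=kubernetes.io/tls \
#     -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}{"\t"}{.metadata.annotations.cert-manager\.io/certificate-name}{"\n"}{end}' \
#     | unmanaged_tls_secrets
```

Whatever it prints is still on the manual path. When the list is empty, the old scripts can go.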

cert-manager: Kubernetes Certificate Management - AI Reference

Core Value Proposition

Problem Solved: Eliminates manual SSL certificate management that causes production outages during certificate expiration
Critical Failure Scenario: SSL certificate expiration during high-traffic periods (Black Friday example: 4-hour e-commerce outage, direct revenue impact)
Automation Benefit: Prevents 3 AM paging incidents from expired certificates

Technical Specifications

Core Components

Certificate Resources: Kubernetes custom resources defining domain certificate requirements
Issuer/ClusterIssuer Resources: Certificate Authority configuration (ClusterIssuer = cluster-wide, Issuer = namespace-scoped)
CertificateRequest Resources: X.509 standard certificate signing requests (auto-generated, rarely manually managed)

Resource Requirements (Production)

```yaml
resources:
  limits:
    cpu: 200m      # Default 100m insufficient under load
    memory: 256Mi   # Default 128Mi causes OOM during mass renewals
  requests:
    cpu: 50m
    memory: 64Mi
```

Challenge Methods

| Method | Use Case | Failure Mode | Requirements |
|---|---|---|---|
| HTTP-01 | Public services | Ingress controller conflicts, firewall blocks port 80 | Public domain, ingress controller |
| DNS-01 | Wildcard certs, internal services | DNS propagation delays (GoDaddy: 30+ minutes) | DNS API access |

Critical Configuration

Production-Ready Installation

```bash
helm repo add jetstack https://charts.jetstack.io
helm repo update
helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --create-namespace \
  --set installCRDs=true  # CRITICAL: nothing works without the CRDs (newer charts call this crds.enabled)
```

Multi-Issuer Setup

Let's Encrypt Production: Public certificates, free, 90-day lifecycle
Let's Encrypt Staging: Testing environment, avoids rate limits
Internal CA: Enterprise PKI integration
HashiCorp Vault: Internal certificate management
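The two Let's Encrypt issuers differ only in the ACME directory URL. A small sketch (the helper name `acme_directory_url` is made up):

```bash
# Map an environment name to its Let's Encrypt ACME directory URL.
# Usage: acme_directory_url <staging|prod>
acme_directory_url() {
  case "$1" in
    staging) echo "https://acme-staging-v02.api.letsencrypt.org/directory" ;;
    prod)    echo "https://acme-v02.api.letsencrypt.org/directory" ;;
    *)       echo "unknown environment: $1" >&2; return 1 ;;
  esac
}
```

The value goes in `spec.acme.server` of the Issuer. Staging certificates are not trusted by browsers, so switch to production only once issuance works end to end.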

DNS-01 Configuration (AWS Route53 Example)

```yaml
spec:
  acme:
    solvers:
    - dns01:
        route53:
          region: us-east-1
          # Use IRSA instead of hardcoded keys for security
```
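For context, a sketch of the full ClusterIssuer that fragment belongs to. Assumptions: the helper `route53_issuer` and all argument values are illustrative, and AWS credentials come from an IAM role bound to cert-manager's service account rather than from keys in the manifest.

```bash
# Emit a ClusterIssuer that solves DNS-01 through Route53.
# No access keys appear here: credentials are expected to come from the
# IAM role attached to cert-manager's service account (IRSA).
# Usage: route53_issuer <name> <email> <dns-zone> <aws-region>
route53_issuer() {
  cat <<EOF
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: $1
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: $2
    privateKeySecretRef:
      name: $1-account-key
    solvers:
      - selector:
          dnsZones:
            - "$3"
        dns01:
          route53:
            region: $4
EOF
}

# Apply with:
#   route53_issuer letsencrypt-dns ops@example.com example.com us-east-1 | kubectl apply -f -
```

The `selector.dnsZones` entry limits this solver to names under that zone, which matters once an issuer has more than one solver.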

Critical Failure Modes

Rate Limiting (Let's Encrypt)

Limit: 50 certificates per domain per week
Consequence: 3-7 day lockout during mass renewals
Prevention: Use staging environment for testing, plan certificate requests

Resource Exhaustion

Scenario: Mass certificate renewal causes cert-manager pod crashes
Root Cause: Default resource limits too low (100m CPU, 128Mi memory)
Solution: Increase limits to 200m CPU, 256Mi memory minimum

DNS Propagation Delays

Common Providers: GoDaddy (30+ minutes), Namecheap (slow), Cloudflare (fast)
Impact: DNS-01 challenge timeouts, renewal failures
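Mitigation: Confirm the TXT record against public resolvers first; if cert-manager's propagation check queries the wrong nameservers, set --dns01-recursive-nameservers on the controller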

Webhook Timeout Issues

```yaml
webhook:
  timeoutSeconds: 30  # Default 10s too short when cluster DNS is slow
```

Monitoring and Alerting

Essential Prometheus Metrics

certmanager_certificate_expiration_timestamp_seconds: Certificate expiration tracking
certmanager_certificate_renewal_timestamp_seconds: When each certificate is due for renewal
certmanager_http_acme_client_request_count: ACME API request monitoring

Critical Alert (7-day expiration warning)

```yaml
- alert: CertManagerCertificateExpirySoon
  expr: certmanager_certificate_expiration_timestamp_seconds - time() < 86400 * 7
  for: 1h
  annotations:
    summary: "Certificate expires in less than 7 days"
```
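Expiry is the last line of defence; a certificate stuck in a not-ready state is the earlier signal. A sketch of a companion rule, assuming your cert-manager version exports `certmanager_certificate_ready_status` with `condition`, `name` and `namespace` labels. The helper name, alert name and the 15-minute window are made up.

```bash
# Emit a Prometheus alerting rule for certificates that are not Ready.
cert_not_ready_rule() {
  cat <<'EOF'
- alert: CertManagerCertificateNotReady
  expr: certmanager_certificate_ready_status{condition="False"} == 1
  for: 15m
  annotations:
    summary: "Certificate {{ $labels.namespace }}/{{ $labels.name }} is not Ready"
EOF
}

# Append to an existing rule group file:
#   cert_not_ready_rule >> cert-manager-rules.yaml
```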

Technology Comparison Matrix

| Solution | Setup Complexity | Failure Support | CA Support | Automation | Multi-cluster | Vendor Lock-in |
|---|---|---|---|---|---|---|
| cert-manager | Medium | Good docs/community | Universal | Full | Yes | None |
| Traefik Built-in | Low | Community forums | Let's Encrypt focus | Limited | Per-cluster | Traefik |
| Manual ACME Scripts | High | Self-support | Let's Encrypt | Cron-based | Custom tooling | None |
| Cloud Provider | Low | Paid support | Provider-specific | Partial | Per-cloud | High |
| Vault PKI | High | Enterprise support | Internal only | Manual approval | Complex | HashiCorp |

Common Troubleshooting

HTTP-01 Challenge Failures

Debug Commands:

```bash
kubectl describe certificate <name>
kubectl describe certificaterequest <name>
kubectl describe challenge <name>
```

Root Causes:

Load balancer not publicly accessible
Incorrect ingress.class annotation
Firewall blocking port 80
Multiple ingress controllers conflict
Cloudflare proxy mode interference

DNS-01 Challenge Failures

Root Causes:

Invalid DNS API credentials
Hosted zone mismatch
DNS propagation delays (provider-specific)
Missing IAM permissions for DNS record creation
DNS provider API rate limiting

Security Considerations

Production Hardening

Skip the HTTP-01 solver if only using DNS-01: solvers are configured per Issuer/ClusterIssuer, so leave the http01 entry out of the solver list
Use IAM Roles for Service Accounts (IRSA) instead of hardcoded AWS keys
Implement certificate approval workflows with approver-policy for compliance
Consider CSI driver for ephemeral certificates in high-security environments

Resource Planning

Scale Considerations

Normal operation: 50MB RAM, 10m CPU per pod
Mass renewal periods: Resource requirements spike significantly
High availability: Run multiple replicas (replicaCount: 2)
Multi-cluster: A ClusterIssuer is scoped to a single cluster, so apply the same ClusterIssuer definition to every cluster for organization-wide policies

Cost Factors

Let's Encrypt certificates: Free
DNS API costs: Route53 charges per query for DNS-01
Compute resources: Minimal in normal operation
Commercial CA certificates: Variable pricing
Operational time: Significantly reduced vs manual management

Migration Strategy

From Manual to Automated

Install cert-manager alongside existing certificates
Configure ClusterIssuer for existing CA
Create Certificate resources for new services
Migrate existing services incrementally
Decommission manual certificate scripts

Critical: Never migrate all certificates simultaneously - gradual migration prevents production impact

Known Limitations

Let's Encrypt Constraints

Public domains only (no private/internal domains)
Rate limits enforce weekly planning requirements
ACME protocol dependencies on external validation

DNS Provider Reliability

DNS propagation inconsistency across providers
API reliability varies significantly
Some providers have extended propagation delays

Kubernetes Dependencies

Requires functional ingress controller for HTTP-01
Webhook validation adds cluster dependency
etcd storage for certificate keys (unless using CSI driver)

Useful Links for Further Investigation

Actually Useful cert-manager Links

| Link | Description |
|---|---|
| cert-manager Installation Guide | The only installation guide you need. Helm method works best. Skip the kubectl apply bullshit. |
| GitHub Repository | Check this for actual release notes and known issues. The README has useful examples. |
| cert-manager Troubleshooting | When things break (and they will), start here. Actually helpful unlike most Kubernetes troubleshooting docs. |
| Getting Started Tutorial | Basic nginx + Let's Encrypt setup. Works as advertised, which is rare for Kubernetes tutorials. |
| ACME HTTP-01 Challenges | For public-facing services. Simple but requires ingress controller cooperation. |
| ACME DNS-01 Challenges | For wildcard certs and internal services. More complex but more flexible. |
| Route53 DNS-01 Setup | Most common DNS-01 provider. Use IRSA for authentication, not hardcoded keys. |
| Cloudflare DNS-01 Setup | Popular alternative to Route53. API tokens work better than global API keys. |
| HashiCorp Vault Integration | For internal PKI. Complex setup but worth it for enterprise environments. |
| ACME Troubleshooting | Let's Encrypt-specific debugging. Actually tells you how to fix common problems. |
| Prometheus Metrics | Monitor certificate expiration and renewal failures. Essential for production. |
| cert-manager Slack | Active community. Real people answer real questions, usually quickly. |
| Common Issues on GitHub | Search existing issues before creating new ones. Maintainers are helpful but busy. |
| istio-csr for Service Mesh | Replaces Istio's built-in certificate management. Only use if you need unified cert policies. |
| approver-policy for Compliance | Manual certificate approval workflows. Breaks automation but compliance teams love it. |
| trust-manager | Distributes CA bundles across clusters. Useful for multi-cluster deployments. |
| CSI Driver | Ephemeral certificates that never touch etcd. For paranoid security environments. |
| Supported DNS Providers List | Full list of DNS-01 challenge providers. Most major providers supported. |
| Let's Encrypt Rate Limits | 50 certificates per domain per week. Plan accordingly for large deployments. |
| Let's Encrypt Staging Environment | Use this for testing to avoid hitting rate limits in production. |
| ACME Challenge Types | HTTP-01 vs DNS-01 explained by Let's Encrypt themselves. |
| CNCF Project Page | Official project status and governance. cert-manager graduated in November 2024. |
| Release Notes | What changed in each version. Usually minor fixes, occasionally breaking changes. |
| cert-manager Security Advisories | Security updates and CVE notifications. Essential reading for production users. |
| Traefik Built-in ACME | Works fine if you only use Traefik. Simpler than cert-manager for basic setups. |
| AWS Certificate Manager | Good if you're all-in on AWS. Tight ALB/CloudFront integration but locks you to AWS. |
| Certbot | Manual ACME client. Use for non-Kubernetes environments or when you like writing cron jobs. |
