cert-manager: Kubernetes Certificate Management - AI Reference
Core Value Proposition
Problem Solved: Eliminates manual SSL certificate management that causes production outages during certificate expiration
Critical Failure Scenario: SSL certificate expiration during high-traffic periods (Black Friday example: 4-hour e-commerce outage, direct revenue impact)
Automation Benefit: Prevents 3 AM paging incidents from expired certificates
Technical Specifications
Core Components
- Certificate Resources: Kubernetes custom resources defining domain certificate requirements
- Issuer/ClusterIssuer Resources: Certificate Authority configuration (ClusterIssuer = cluster-wide, Issuer = namespace-scoped)
- CertificateRequest Resources: X.509 standard certificate signing requests (auto-generated, rarely manually managed)
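As a rough illustration of how these resources fit together, the sketch below shows a Certificate that asks a ClusterIssuer for a certificate covering two hostnames. The names (example-com-tls, letsencrypt-prod, example.com) are placeholders rather than values from this reference, and the ClusterIssuer is assumed to exist already.

```yaml
# Hypothetical Certificate; every name here is a placeholder.
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: example-com-tls
  namespace: default
spec:
  # Secret that cert-manager creates and keeps renewed
  secretName: example-com-tls
  dnsNames:
  - example.com
  - www.example.com
  issuerRef:
    # Assumed to be defined elsewhere in the cluster
    name: letsencrypt-prod
    kind: ClusterIssuer
```

cert-manager generates the matching CertificateRequest itself, which is why that resource rarely needs to be written by hand.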
Resource Requirements (Production)
resources:
  limits:
    cpu: 200m # 100m is insufficient under load
    memory: 256Mi # 128Mi causes OOM during mass renewals
  requests:
    cpu: 50m
    memory: 64Mi
Challenge Methods
Method | Use Case | Failure Mode | Requirements |
---|---|---|---|
HTTP-01 | Public services | Ingress controller conflicts, firewall blocks port 80 | Public domain, ingress controller |
DNS-01 | Wildcard certs, internal services | DNS propagation delays (GoDaddy: 30+ minutes) | DNS API access |
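For the HTTP-01 row, a minimal ClusterIssuer sketch might look like the following. It assumes an ingress controller registered under the class name nginx and uses the Let's Encrypt staging endpoint; the issuer name, email address, and Secret name are illustrative only.

```yaml
# Hypothetical HTTP-01 issuer against Let's Encrypt staging.
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-staging
spec:
  acme:
    server: https://acme-staging-v02.api.letsencrypt.org/directory
    email: ops@example.com # placeholder contact address
    privateKeySecretRef:
      # Secret holding the ACME account key (created automatically)
      name: letsencrypt-staging-account-key
    solvers:
    - http01:
        ingress:
          # Assumed ingress class; older cert-manager releases use `class` instead
          ingressClassName: nginx
```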
Critical Configuration
Production-Ready Installation
helm repo add jetstack https://charts.jetstack.io
helm repo update
helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --create-namespace \
  --set installCRDs=true # CRITICAL: installs the CRDs cert-manager cannot run without; charts v1.15+ prefer crds.enabled=true
Multi-Issuer Setup
- Let's Encrypt Production: Public certificates, free, 90-day lifecycle
- Let's Encrypt Staging: Testing environment, avoids rate limits
- Internal CA: Enterprise PKI integration
- HashiCorp Vault: Internal certificate management
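The two internal options in this list could be sketched roughly as below: a CA-backed ClusterIssuer and a Vault-backed Issuer. All names, the Vault URL, the PKI path, and the Vault role are assumptions for illustration, and the CA key pair Secret is assumed to already exist in the cert-manager namespace.

```yaml
# Hypothetical internal CA issuer; the Secret must hold tls.crt and tls.key.
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: internal-ca
spec:
  ca:
    secretName: internal-ca-key-pair
---
# Hypothetical Vault issuer using Kubernetes auth; all values are placeholders.
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: vault-issuer
  namespace: default
spec:
  vault:
    server: https://vault.example.com
    path: pki_int/sign/example-dot-com
    auth:
      kubernetes:
        role: cert-manager
        mountPath: /v1/auth/kubernetes
        serviceAccountRef:
          name: vault-issuer
```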
DNS-01 Configuration (AWS Route53 Example)
spec:
  acme:
    solvers:
    - dns01:
        route53:
          region: us-east-1
          # Use IRSA instead of hardcoded keys for security
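Put into a complete manifest, the fragment above might look like this sketch, paired with a wildcard Certificate that uses it. The issuer name, email, zone, and Secret names are placeholders, and credentials are assumed to come from the pod's IAM role rather than from keys in the manifest.

```yaml
# Hypothetical DNS-01 issuer for Route53; no access keys are embedded.
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-dns
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: ops@example.com # placeholder
    privateKeySecretRef:
      name: letsencrypt-dns-account-key
    solvers:
    - selector:
        dnsZones:
        - example.com # only use this solver for names in this zone
      dns01:
        route53:
          region: us-east-1
---
# Wildcard names can only be validated with DNS-01.
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: wildcard-example-com
  namespace: default
spec:
  secretName: wildcard-example-com-tls
  dnsNames:
  - "*.example.com"
  issuerRef:
    name: letsencrypt-dns
    kind: ClusterIssuer
```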
Critical Failure Modes
Rate Limiting (Let's Encrypt)
- Limit: 50 certificates per registered domain per week (production endpoint)
- Consequence: 3-7 day lockout during mass renewals
- Prevention: Use staging environment for testing, plan certificate requests
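One way to keep test traffic off the production limit is to point test Ingresses at a staging issuer. The sketch below assumes a ClusterIssuer named letsencrypt-staging already exists and that the nginx ingress class is in use; the hostnames and Service name are hypothetical.

```yaml
# Hypothetical test Ingress that requests its certificate from staging.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: demo
  namespace: default
  annotations:
    # Assumed issuer name; swap for the production issuer once validated
    cert-manager.io/cluster-issuer: letsencrypt-staging
spec:
  ingressClassName: nginx
  tls:
  - hosts:
    - demo.example.com
    secretName: demo-example-com-tls
  rules:
  - host: demo.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: demo
            port:
              number: 80
```

Staging certificates are not trusted by browsers, so this is only suitable for verifying that issuance works end to end.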
Resource Exhaustion
- Scenario: Mass certificate renewal causes cert-manager pod crashes
- Root Cause: Resource limits set too low (for example 100m CPU, 128Mi memory)
- Solution: Increase limits to 200m CPU, 256Mi memory minimum
DNS Propagation Delays
- Common Providers: GoDaddy (30+ minutes), Namecheap (slow), Cloudflare (fast)
- Impact: DNS-01 challenge timeouts, renewal failures
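- Mitigation: If the propagation self-check is misled by cached or split-horizon DNS, point it at known resolvers with the controller flags --dns01-recursive-nameservers and --dns01-recursive-nameservers-only
Webhook Timeout Issues
Slow admission webhook responses are a separate problem from DNS delays. The timeout below applies to API server calls to the cert-manager webhook, and Kubernetes caps it at 30 seconds.
webhook:
  timeoutSeconds: 30 # Maximum Kubernetes allows for admission webhooks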
Monitoring and Alerting
Essential Prometheus Metrics
- certmanager_certificate_expiration_timestamp_seconds: Certificate expiration tracking
- certmanager_certificate_renewal_timestamp_seconds: Renewal success monitoring
- certmanager_http_acme_client_request_count: ACME API request monitoring
Critical Alert (7-day expiration warning)
- alert: CertManagerCertificateExpirySoon
  expr: certmanager_certificate_expiration_timestamp_seconds - time() < 86400 * 7
  for: 1h
  annotations:
    summary: "Certificate expires in less than 7 days"
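An expiry alert only fires late in a failure, so it may be worth pairing it with a readiness alert. The rule group below is a sketch under the assumption that cert-manager's certmanager_certificate_ready_status metric is being scraped; the group name, alert name, severity label, and 15-minute window are arbitrary choices.

```yaml
# Hypothetical Prometheus rule group; thresholds are illustrative.
groups:
- name: cert-manager
  rules:
  - alert: CertManagerCertificateNotReady
    # Fires when a Certificate has reported Ready=False for 15 minutes
    expr: certmanager_certificate_ready_status{condition="False"} == 1
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: "Certificate {{ $labels.namespace }}/{{ $labels.name }} is not Ready"
```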
Technology Comparison Matrix
Solution | Setup Complexity | Support Channels | CA Support | Automation | Multi-cluster | Vendor Lock-in |
---|---|---|---|---|---|---|
cert-manager | Medium | Good docs/community | Universal | Full | Yes | None |
Traefik Built-in | Low | Community forums | Let's Encrypt focus | Limited | Per-cluster | Traefik |
Manual ACME Scripts | High | Self-support | Let's Encrypt | Cron-based | Custom tooling | None |
Cloud Provider | Low | Paid support | Provider-specific | Partial | Per-cloud | High |
Vault PKI | High | Enterprise support | Internal only | Manual approval | Complex | HashiCorp |
Common Troubleshooting
HTTP-01 Challenge Failures
Debug Commands:
kubectl describe certificate <name>
kubectl describe certificaterequest <name>
kubectl describe challenge <name>
Root Causes:
- Load balancer not publicly accessible
- Incorrect ingress class (kubernetes.io/ingress.class annotation or ingressClassName field)
- Firewall blocking port 80
- Multiple ingress controllers conflict
- Cloudflare proxy mode interference
DNS-01 Challenge Failures
Root Causes:
- Invalid DNS API credentials
- Hosted zone mismatch
- DNS propagation delays (provider-specific)
- Missing IAM permissions for DNS record creation
- DNS provider API rate limiting
Security Considerations
Production Hardening
- If only DNS-01 is used, configure no http01 solver on any Issuer or ClusterIssuer so HTTP-01 challenges are never attempted
- Use IAM Roles for Service Accounts (IRSA) instead of hardcoded AWS keys
- Implement certificate approval workflows with approver-policy for compliance
- Consider CSI driver for ephemeral certificates in high-security environments
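For the IRSA item, the Helm values below sketch how the cert-manager service account might be bound to an IAM role on EKS. The account ID and role name are made up, and the role is assumed to already carry the Route53 permissions that DNS-01 needs.

```yaml
# Hypothetical Helm values for IRSA on EKS; the role ARN is a placeholder.
serviceAccount:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/cert-manager-route53
```

With this in place the Route53 solver needs only a region, so no AWS access keys have to be stored in a Secret.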
Resource Planning
Scale Considerations
- Normal operation: 50MB RAM, 10m CPU per pod
- Mass renewal periods: Resource requirements spike significantly
- High availability: Run multiple replicas (replicaCount: 2)
- Multi-cluster: Define the same ClusterIssuer in each cluster for organization-wide policies
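A Helm values sketch for the high-availability item could look like the following. It assumes a reasonably recent chart version in which the podDisruptionBudget value is available; the replica counts are illustrative.

```yaml
# Hypothetical HA values for the cert-manager chart.
replicaCount: 2          # controller; only the elected leader is active
podDisruptionBudget:
  enabled: true          # keep one controller pod through node drains
webhook:
  replicaCount: 2        # the webhook serves requests from every replica
cainjector:
  replicaCount: 2
```

Extra controller replicas mainly shorten failover, since leader election keeps a single instance doing the work at any time.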
Cost Factors
- Let's Encrypt certificates: Free
- DNS API costs: Route53 charges per query for DNS-01
- Compute resources: Minimal in normal operation
- Commercial CA certificates: Variable pricing
- Operational time: Significantly reduced vs manual management
Migration Strategy
From Manual to Automated
- Install cert-manager alongside existing certificates
- Configure ClusterIssuer for existing CA
- Create Certificate resources for new services
- Migrate existing services incrementally
- Decommission manual certificate scripts
Critical: Never migrate all certificates simultaneously - gradual migration prevents production impact
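Migrating a single service might start with a Certificate like the sketch below. The service name, namespace, issuer, and timings are all hypothetical. Writing to a new Secret name is a deliberate choice here: it leaves the manually managed Secret untouched until the service is switched over.

```yaml
# Hypothetical Certificate for one service being migrated.
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: billing-api
  namespace: billing
spec:
  # New Secret; the existing manual Secret stays in place during cutover
  secretName: billing-api-tls-managed
  dnsNames:
  - billing.example.com
  duration: 2160h    # 90 days; some CAs (Let's Encrypt) fix this themselves
  renewBefore: 720h  # start renewing 30 days before expiry
  issuerRef:
    name: internal-ca # assumed to exist
    kind: ClusterIssuer
```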
Known Limitations
Let's Encrypt Constraints
- Publicly registered domain names only (internal services need a public DNS name validated via DNS-01)
- Rate limits enforce weekly planning requirements
- ACME protocol dependencies on external validation
DNS Provider Reliability
- DNS propagation inconsistency across providers
- API reliability varies significantly
- Some providers have extended propagation delays
Kubernetes Dependencies
- Requires functional ingress controller for HTTP-01
- Webhook validation adds cluster dependency
- etcd storage for certificate keys (unless using CSI driver)
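To illustrate the CSI driver exception in the last item, the Pod below requests a certificate that is mounted directly into the pod instead of being stored as a Secret. It assumes the cert-manager CSI driver is installed and that a ClusterIssuer named internal-ca exists; the Pod name, image tag, and mount path are placeholders.

```yaml
# Hypothetical Pod using the cert-manager CSI driver.
apiVersion: v1
kind: Pod
metadata:
  name: csi-demo
  namespace: default
spec:
  containers:
  - name: app
    image: nginx:1.27
    volumeMounts:
    - name: tls
      mountPath: /tls
      readOnly: true
  volumes:
  - name: tls
    csi:
      driver: csi.cert-manager.io
      readOnly: true
      volumeAttributes:
        csi.cert-manager.io/issuer-name: internal-ca
        csi.cert-manager.io/issuer-kind: ClusterIssuer
        # The driver expands these variables per pod
        csi.cert-manager.io/dns-names: ${POD_NAME}.${POD_NAMESPACE}.svc.cluster.local
```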
Useful Links for Further Investigation
Actually Useful cert-manager Links
Link | Description |
---|---|
cert-manager Installation Guide | The only installation guide you need. Helm method works best. Skip the kubectl apply bullshit. |
GitHub Repository | Check this for actual release notes and known issues. The README has useful examples. |
cert-manager Troubleshooting | When things break (and they will), start here. Actually helpful unlike most Kubernetes troubleshooting docs. |
Getting Started Tutorial | Basic nginx + Let's Encrypt setup. Works as advertised, which is rare for Kubernetes tutorials. |
ACME HTTP-01 Challenges | For public-facing services. Simple but requires ingress controller cooperation. |
ACME DNS-01 Challenges | For wildcard certs and internal services. More complex but more flexible. |
Route53 DNS-01 Setup | Most common DNS-01 provider. Use IRSA for authentication, not hardcoded keys. |
Cloudflare DNS-01 Setup | Popular alternative to Route53. API tokens work better than global API keys. |
HashiCorp Vault Integration | For internal PKI. Complex setup but worth it for enterprise environments. |
ACME Troubleshooting | Let's Encrypt-specific debugging. Actually tells you how to fix common problems. |
Prometheus Metrics | Monitor certificate expiration and renewal failures. Essential for production. |
cert-manager Slack | Active community. Real people answer real questions, usually quickly. |
Common Issues on GitHub | Search existing issues before creating new ones. Maintainers are helpful but busy. |
istio-csr for Service Mesh | Replaces Istio's built-in certificate management. Only use if you need unified cert policies. |
approver-policy for Compliance | Manual certificate approval workflows. Breaks automation but compliance teams love it. |
trust-manager | Distributes CA bundles across clusters. Useful for multi-cluster deployments. |
CSI Driver | Ephemeral certificates that never touch etcd. For paranoid security environments. |
Supported DNS Providers List | Full list of DNS-01 challenge providers. Most major providers supported. |
Let's Encrypt Rate Limits | 50 certificates per domain per week. Plan accordingly for large deployments. |
Let's Encrypt Staging Environment | Use this for testing to avoid hitting rate limits in production. |
ACME Challenge Types | HTTP-01 vs DNS-01 explained by Let's Encrypt themselves. |
CNCF Project Page | Official project status and governance. cert-manager graduated in November 2024. |
Release Notes | What changed in each version. Usually minor fixes, occasionally breaking changes. |
cert-manager Security Advisories | Security updates and CVE notifications. Essential reading for production users. |
Traefik Built-in ACME | Works fine if you only use Traefik. Simpler than cert-manager for basic setups. |
AWS Certificate Manager | Good if you're all-in on AWS. Tight ALB/CloudFront integration but locks you to AWS. |
Certbot | Manual ACME client. Use for non-Kubernetes environments or when you like writing cron jobs. |