HashiCorp Vault + Kubernetes CI/CD: Production Implementation Guide
Configuration Requirements
Dynamic Database Credentials
- Vault creates actual database users with UUID suffixes and precise permissions
- Credentials expire automatically when their TTL ends; Vault revokes them with no manual cleanup
- Supported databases: PostgreSQL, MySQL, MongoDB, AWS RDS, dozens more
- Performance impact: Each secret request creates/revokes database users
- Connection limits are critical: plan for a Postgres max_connections of 200+ (the default of 100 is not enough)
- Real failure mode: 100 pods requesting secrets at once exhausted database connections while CPU sat at 95%
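The connection math behind these bullets can be sketched in a few lines. This is a rough estimate, not a sizing tool: the function names, the one-connection-per-pod figure, the 25% surge, and the five connections reserved for Vault are all illustrative assumptions to replace with your own measurements. The 70% headroom target matches the utilization goal listed under Production Success Indicators.

```python
import math


def peak_connections(pods: int, conns_per_pod: int, vault_conns: int = 5,
                     surge_fraction: float = 0.25) -> int:
    """Estimate peak database connections during a rolling deployment.

    Old and new pods overlap during a rollout, so up to
    pods * surge_fraction extra pods hold connections at the same time.
    vault_conns is an assumed allowance for Vault creating/revoking users.
    """
    surge_pods = math.ceil(pods * surge_fraction)
    return (pods + surge_pods) * conns_per_pod + vault_conns


def fits(max_connections: int, peak: int, headroom: float = 0.7) -> bool:
    """True if the estimated peak stays under the target utilization."""
    return peak <= max_connections * headroom


peak = peak_connections(pods=100, conns_per_pod=1)
print(peak)             # 130
print(fits(100, peak))  # False: the Postgres default is already exceeded
print(fits(200, peak))  # True: 130 is under 70% of 200
```

Even with a single connection per pod, a 100-pod rollout overshoots the default limit, which is consistent with the failure mode described above.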
Authentication Methods (Production Reality)
Method | RAM Usage | Failure Mode | Recovery |
---|---|---|---|
Vault Agent Injector | ~100MB per pod | Init:0/1 with useless logs | Delete pod, recreate |
External Secrets Operator | No pod overhead | Stops syncing silently for weeks | Restart operator |
Direct API Calls | Minimal overhead | OIDC tokens expire mid-deployment | Split long jobs |
CSI Driver | Low overhead | Mount failures kill pods | Restart affected pods |
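
The per-pod figure for the Agent Injector adds up quickly at cluster scale. A quick back-of-the-envelope helper, assuming a flat ~100 MiB per pod as in the table (the function name and the 500-pod example are illustrative):

```python
def injector_overhead_gib(pod_count: int, per_pod_mib: float = 100.0) -> float:
    """Total sidecar memory in GiB, assuming a flat per-pod cost."""
    return pod_count * per_pod_mib / 1024


print(f"{injector_overhead_gib(500):.1f} GiB")  # 48.8 GiB
```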
Network Policy Requirements
Critical connections to allow:
- Agent Injector → Vault (port 8200)
- Vault → Kubernetes API (typically port 443 or 6443 depending on the cluster, for token validation)
- Pods → Vault (if using direct API)
- ESO → Vault (if using External Secrets)
Common failure: Cilium 1.14.x randomly blocks cross-namespace calls with context deadline exceeded
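
When a network policy is suspected, a plain TCP probe run from inside the affected pod or namespace helps separate "blocked at the network layer" from "Vault rejected the request". A minimal sketch using only the standard library; the function name is illustrative and the hostname in the usage comment is a hypothetical in-cluster service name:

```python
import socket


def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, DNS failures, and timeouts.
        return False


# Example usage (hypothetical service name, run from the pod's namespace):
#   can_reach("vault.vault.svc.cluster.local", 8200)
```

A successful connect only proves the port is reachable; it says nothing about TLS or authentication, so treat it as the first check, not the last.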
Resource Requirements
Time Investments
- Initial setup: 2-3 days for basic implementation
- Network policy debugging: Full day when enabling security policies
- Production hardening: 1-2 weeks including monitoring and disaster recovery
- Authentication troubleshooting: 8+ hours for complex RBAC issues
Expertise Requirements
- Kubernetes RBAC: Essential for service account configuration
- Database administration: Understanding connection pooling and user management
- Network debugging: Required for policy troubleshooting
- OIDC/JWT knowledge: Critical for CI/CD pipeline authentication
Infrastructure Costs
- Vault cluster: 3+ nodes for HA, Enterprise for disaster recovery
- Database connections: 2-3x normal connection limits
- Monitoring: Prometheus, Grafana, audit log storage
- Backup storage: For Vault snapshots and disaster recovery
Critical Warnings
Token Expiration Issues
- GitHub Actions OIDC tokens: 5-minute lifespan, pipelines fail at final deploy
- Kubernetes service account tokens: 1-hour default in new clusters
- GitLab CI tokens: 10-minute maximum before becoming worthless
- Real impact: Entire Friday release pipeline died during Trivy security scan
External Secrets Operator Failures
- Silent failures: Shows "successfully refreshed" while secrets expire
- Duration: Can stop working for 3+ weeks without alerts
- Impact: 47+ services failing simultaneously at 3:15 AM
- Root cause: ESO lies in logs while service account tokens expire
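
Because the operator's own logs cannot be trusted here, a staleness check should look at the secret itself. One possible approach, sketched under the assumption that the secret holds dynamic credentials that must rotate at least once per lease: record a hash of the secret content and flag it when the content has not changed for longer than the lease. The class name is illustrative, and static secrets that legitimately never change need a different check.

```python
import hashlib
from datetime import datetime, timedelta, timezone
from typing import Optional


class RotationWatch:
    """Tracks when a secret's content last changed, independent of operator logs."""

    def __init__(self, max_age: timedelta):
        self.max_age = max_age
        self._digest: Optional[str] = None
        self._changed_at: Optional[datetime] = None

    def observe(self, secret_bytes: bytes, now: datetime) -> bool:
        """Record the current content; return True if it looks stale."""
        digest = hashlib.sha256(secret_bytes).hexdigest()
        if digest != self._digest:
            # Content changed (or first observation): reset the age clock.
            self._digest = digest
            self._changed_at = now
        return now - self._changed_at > self.max_age


watch = RotationWatch(max_age=timedelta(hours=4))
t0 = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(watch.observe(b"user-a:pw1", t0))                       # False
print(watch.observe(b"user-a:pw1", t0 + timedelta(hours=5)))  # True
```

Only the hash is kept in memory, so the watcher never stores the credential itself.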
Database Performance Degradation
- Connection exhaustion: FATAL: too many connections for role "vault"
- Concurrent user creation: Postgres 13 struggles, 14+ handles better
- Rolling deployments: 100 pods requesting secrets overwhelms database
- Recovery time: Weeks of connection limit alerts after fixes
Vault Downtime Consequences
- Complete CI/CD failure: All pipelines stop when Vault unavailable
- No automatic failover: Requires manual intervention or Enterprise features
- Network policy conflicts: Mysterious timeouts with zero useful errors
- Recovery complexity: Unsealing, token validation, service restoration
Implementation Strategies
Production-Ready Configuration
# Vault Agent memory limits (per pod overhead)
resources:
  limits:
    memory: "128Mi" # Actual usage ~100MB
    cpu: "100m"
  requests:
    memory: "64Mi"
    cpu: "50m"

# Database connection configuration
max_connections: 200 # Minimum for production clusters
connection_timeout: "30s"
lease_duration: "4h" # Balance between security and operational overhead
Monitoring Requirements
Essential alerts (prevent 3AM pages):
- Secret request latency > 5 seconds
- Authentication failure rate > 5%
- Token expiration warnings (10 minutes before)
- Vault seal status (alert when Vault becomes sealed)
- Database connection count approaching limits
Useful metrics:
- Secret request error rate
- Database user creation/deletion patterns
- Network policy blocking events
- OIDC token refresh failures
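
The essential alerts above translate directly into threshold checks. The sketch below is not tied to any monitoring system: the `Snapshot` fields and alert names are invented for illustration, and the 80% cutoff for "approaching limits" is an assumption since the list gives no number.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Snapshot:
    secret_latency_s: float    # time to fetch a secret
    auth_failure_rate: float   # fraction of failed logins, 0.0 to 1.0
    token_ttl_s: float         # seconds until the current token expires
    vault_sealed: bool
    db_connections: int
    db_max_connections: int


def firing_alerts(s: Snapshot, db_threshold: float = 0.8) -> List[str]:
    """Return the names of alerts whose thresholds are crossed."""
    alerts = []
    if s.secret_latency_s > 5:
        alerts.append("secret_latency")
    if s.auth_failure_rate > 0.05:
        alerts.append("auth_failures")
    if s.token_ttl_s < 10 * 60:
        alerts.append("token_expiring")
    if s.vault_sealed:
        alerts.append("vault_sealed")
    if s.db_connections >= s.db_max_connections * db_threshold:
        alerts.append("db_connections")
    return alerts


print(firing_alerts(Snapshot(0.5, 0.01, 3600, False, 50, 200)))  # []
```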
Failure Recovery Procedures
Agent Injector stuck in Init:0/1:
- Check service account has system:auth-delegator cluster role
- Verify Agent Injector webhook can reach Kubernetes API
- Confirm network policies allow Vault connectivity
- Nuclear option: Delete pod and recreate (40% success rate)
External Secrets stopped syncing:
- Restart ESO operator pods (fixes 90% of cases)
- Check Vault token expiration
- Verify network policy ingress rules
- Force refresh with shorter TTL values
CI/CD authentication failures:
- Verify OIDC token availability in pipeline
- Check Vault OIDC configuration (GitHub changed URLs in March 2024)
- Split long deployments into shorter jobs
- Implement caching between pipeline stages
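
Within a single long job, one way to avoid dying on an expired token is to wrap authentication in a small cache that logs in again when the remaining lifetime gets short. A minimal sketch: `TokenCache` is an illustrative name, and `login` is a hypothetical callable you supply that returns a token and its TTL in seconds. It caches in process only, so it does not carry tokens across separate pipeline jobs.

```python
import time
from typing import Callable, Optional, Tuple


class TokenCache:
    """Reuses a token until it is close to expiry, then logs in again."""

    def __init__(self, login: Callable[[], Tuple[str, float]],
                 min_remaining_s: float = 60.0,
                 clock: Callable[[], float] = time.time):
        self._login = login                  # hypothetical: returns (token, ttl_seconds)
        self._min_remaining_s = min_remaining_s
        self._clock = clock                  # injectable for testing
        self._token: Optional[str] = None
        self._expires_at = 0.0

    def get(self) -> str:
        now = self._clock()
        if self._token is None or self._expires_at - now < self._min_remaining_s:
            self._token, ttl = self._login()
            self._expires_at = now + ttl
        return self._token
```

Calling `cache.get()` immediately before each deploy step, rather than once at the start of the job, keeps a slow scan earlier in the pipeline from consuming the token's lifetime.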
Decision Criteria
When to Use Vault + Kubernetes
Worth the complexity if:
- Storing secrets in Git is unacceptable
- Compliance requires audit trails (SOC 2, HIPAA, FedRAMP)
- Dynamic credentials reduce rotation overhead
- Multiple environments need secret isolation
Not worth it if:
- Simple applications with few secrets
- Team lacks Kubernetes/Vault expertise
- Can't dedicate resources to ongoing maintenance
- Vault downtime would be catastrophic
Integration Method Selection
Use Agent Injector when:
- Applications can't be modified for API calls
- Memory overhead acceptable (~100MB per pod)
- Team comfortable debugging init container failures
Use External Secrets when:
- GitOps workflow required
- Standard Kubernetes Secret objects preferred
- Can tolerate sync delays and silent failures
Use Direct API when:
- Maximum performance required
- Team has OIDC/JWT expertise
- Can handle token expiration complexity
Alternative Solutions
Consider instead:
- Sealed Secrets: Simpler for Git-based workflows
- AWS Secrets Manager: If already on AWS
- Azure Key Vault: If already on Azure
- Google Secret Manager: If already on GCP
- SOPS: For encrypted secrets in Git
Breaking Points and Failure Modes
Scale Limitations
- Database user creation: Rate-limited by database performance
- Memory usage: Agent Injector scales linearly with pod count
- Network connections: Vault clustering required beyond single cluster
- Audit log volume: Can overwhelm logging infrastructure
Security Considerations
- Cluster-admin requirements: Initial setup needs elevated privileges
- Network exposure: Vault API accessible within cluster
- Secret leakage: Temporary files on disk during injection
- Audit gaps: Silent failures don't generate audit events
Operational Overhead
- 24/7 monitoring required: Vault outages impact all services
- Expertise requirements: Complex troubleshooting scenarios
- Disaster recovery complexity: Enterprise features for full automation
- Version compatibility: Kubernetes/Vault version matrix management
Production Success Indicators
- Secret request latency consistently < 2 seconds
- Authentication failure rate < 1%
- Zero silent External Secrets sync failures
- Database connection utilization < 70%
- Vault cluster availability > 99.9%
- Complete audit trail for compliance requirements
- Automated disaster recovery tested quarterly
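
To make the availability target concrete, it helps to convert it into a downtime budget. A small worked calculation, assuming a 30-day month (the function name is illustrative):

```python
def downtime_budget_minutes(availability: float, days: int = 30) -> float:
    """Minutes of allowed downtime for a given availability over `days`."""
    return (1.0 - availability) * days * 24 * 60


print(round(downtime_budget_minutes(0.999), 1))  # 43.2
```

At 99.9%, a single slow manual unseal can use up most of a month's budget.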
Useful Links for Further Investigation
Resources That Don't Suck
Link | Description |
---|---|
DevOpsCube Vault Integration Series | Actually useful step-by-step guide. Written by someone who's clearly debugged this shit in production. Includes the gotchas that HashiCorp's docs conveniently skip. |
External Secrets Operator Docs | The real MVP. Better documented than HashiCorp's own operator. Includes working examples that actually work. |
ArgoCD Vault Plugin Docs | Works but setup is a pain in the ass. Budget 2 hours minimum for configuration, more if you're new to ArgoCD. The docs are refreshingly honest about the complexity instead of pretending it's simple. |
Medium: Hands-On Vault in Kubernetes | Recent tutorial (2024) that actually works. Covers Helm installation and basic secret injection without the marketing fluff. |
Bank-Vaults Operator | Full-featured Vault operator. More complex than the official one but handles auto-unseal and backup properly. Worth the complexity if you're serious about production. |
Stakater Reloader | Essential companion tool. Automatically restarts pods when secrets change. Works with External Secrets Operator to handle secret rotation. |
Vault CLI | For debugging authentication issues. Test your setup manually before blaming the integration. |
kubectl-vault-sync Plugin | Handy for troubleshooting. Synchronizes secrets from Vault to Kubernetes. |
Vault GitHub Repository | For checking known issues and release notes. The issue tracker is more honest than the marketing materials. |
Vault Helm Chart | Official Helm chart. Read the values.yaml file - it's more informative than most documentation. |
Kubernetes Community Vault Discussions | Real problems from real people. Search for your specific error messages - someone's probably hit it before. |
Vault Kubernetes on Stack Overflow | Actual solutions to actual problems. Skip the blog spam and go straight to the answers that worked. |
CNCF Slack #vault Channel | Active community. Good place to ask "is this normal?" questions about weird behavior. |
Vault Metrics with Prometheus | The Prometheus integration actually works well. Set up these metrics before you go to production. |
Grafana Vault Dashboard | Community-maintained dashboard. Shows the metrics you actually care about for operations. |
Kubernetes Troubleshooting Guide | For when your pods are stuck in Init:0/1 and you need to debug the basics. |