Temporal Kubernetes Production Deployment: AI-Optimized Guide
Critical Configuration Requirements
Shard Count Configuration (IMMUTABLE DECISION)
- Default Setting: 4 shards (WILL FAIL in production)
- Production Minimum: 512 shards
- Critical Warning: Cannot be changed after deployment - requires complete cluster rebuild
- Failure Scenario: "shard ownership lost" errors at 1000+ workflows with default 4 shards
- Rebuild Cost: 2+ days downtime, complete data migration required
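A minimal Helm values sketch for pinning the shard count before the first install. It assumes the official temporalio/helm-charts layout, where this setting lives under server.config.numHistoryShards; confirm the key path against the chart version you actually deploy.

```yaml
# Sketch only: pin the shard count before the schema is first initialized.
# Assumes the official chart exposes server.config.numHistoryShards -- verify
# against your chart's values.yaml; the value is immutable once deployed.
server:
  config:
    numHistoryShards: 512
```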
Service Resource Requirements
History Service (Memory-Critical)
- Minimum Memory: 8GB per pod (12GB recommended)
- CPU: 2-4 cores per pod
- Scaling Pattern: Memory usage grows unpredictably with workflow patterns
- Failure Mode: OOMKilled when limits too low, corrupts workflow state
- Critical Error: Signal: killed (9), followed by Reason: OOMKilled in the pod status
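A sketch of requests/limits matching the figures above, written as a plain container resources stanza. Where it belongs in your values file (for example server.history.resources in the official chart) depends on the chart version, so treat the path as an assumption.

```yaml
# Sketch: History pod sizing per the minimums above (8Gi floor, 12Gi limit).
resources:
  requests:
    cpu: "2"
    memory: 8Gi
  limits:
    cpu: "4"
    memory: 12Gi
```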
Frontend Service (CPU-Bound)
- Memory: 2-4GB per pod
- CPU: 1-2 cores per pod
- Failure Symptom: context deadline exceeded errors during traffic spikes
- Scaling: Horizontal scaling required to absorb traffic spikes (see the HPA sketch below)
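One way to get that horizontal scaling is a CPU-based HorizontalPodAutoscaler. The sketch below assumes the Frontend Deployment is named temporal-frontend; adjust the target and replica bounds to your release.

```yaml
# Sketch: CPU-driven autoscaling for the CPU-bound Frontend service.
# The Deployment name is an assumption -- match it to your release.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: temporal-frontend
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: temporal-frontend
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```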
Database Connection Pool
- Default Setting: 20 connections per pod (WILL EXHAUST)
- Production Setting: 5 connections per pod maximum
- Critical Error: FATAL: remaining connection slots are reserved for non-replication superuser connections
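A hedged values sketch for capping per-pod connections. It assumes the official chart's server.config.persistence tree and Temporal's SQL persistence keys (maxConns, maxIdleConns, maxConnLifetime); verify the exact names for your version.

```yaml
# Sketch: keep (pod count x maxConns) comfortably below the database's limit.
server:
  config:
    persistence:
      default:
        sql:
          maxConns: 5
          maxIdleConns: 5
          maxConnLifetime: "1h"
      visibility:
        sql:
          maxConns: 5
          maxIdleConns: 5
          maxConnLifetime: "1h"
```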
Production Deployment Anti-Patterns
Never Use Default Helm Chart in Production
- Includes: Bundled Cassandra, Elasticsearch, Prometheus, Grafana
- Failure Timeline: 6 hours before system collapse
- Resource Exhaustion: Bundled services fail under any real load
- Fix: Disable all bundled services, use managed databases
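As a starting point, the bundled dependencies can be switched off in the values file. The toggles below follow the official chart's top-level enabled flags; confirm them against the values.yaml of the chart version you install.

```yaml
# Sketch: run only Temporal itself; point persistence at managed databases.
cassandra:
  enabled: false
elasticsearch:
  enabled: false
prometheus:
  enabled: false
grafana:
  enabled: false
```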
Database Selection Impact
PostgreSQL/MySQL (Recommended)
- Use: Managed services (RDS, Cloud SQL, Azure Database)
- Storage: SSD-backed only (gp3, Premium SSD, SSD persistent disks)
- Connection Pooling: PgBouncer required for PostgreSQL
- Schema Requirements: TWO separate databases (core + visibility)
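A sketch of the two-database split against one managed PostgreSQL endpoint. The field layout approximates the official chart's persistence tree, and the hostname and secret name are placeholders; verify every key against your chart version.

```yaml
# Sketch: separate core and visibility databases on one managed instance.
server:
  config:
    persistence:
      default:
        sql:
          driver: "postgres12"
          host: "temporal-db.example.internal"      # placeholder endpoint
          port: 5432
          database: "temporal"
          user: "temporal"
          existingSecret: "temporal-db-credentials" # keep passwords out of YAML
      visibility:
        sql:
          driver: "postgres12"
          host: "temporal-db.example.internal"
          port: 5432
          database: "temporal_visibility"
          user: "temporal"
          existingSecret: "temporal-db-credentials"
```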
Cassandra (High Complexity)
- Operational Overhead: Massive - ring topology, JVM tuning, compaction strategies
- Additional Requirement: Elasticsearch cluster for visibility data
- Management Cost: Two complex distributed systems vs one PostgreSQL instance
- Recommendation: Avoid unless proven scale requirements
Critical Metrics for Production Monitoring
Immediate Alert Thresholds
- Shard Lock Latency: >5ms (warning), >10ms (critical)
- Schedule-to-Start Latency: >200ms indicates worker capacity issues
- Database Connection Pool: alert at >80% utilization, before connections are exhausted
- Poll Sync Rate: <99% indicates task distribution failure
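A sketch of wiring the shard-lock threshold into a PrometheusRule. The metric name and unit (temporal_shard_lock_latency_bucket, seconds) are assumptions about your exporter prefix and histogram buckets; check your actual /metrics output before relying on the expression.

```yaml
# Sketch: alert when p95 shard lock latency exceeds the 10ms critical line.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: temporal-shard-alerts
spec:
  groups:
    - name: temporal.shards
      rules:
        - alert: TemporalShardLockLatencyCritical
          expr: |
            histogram_quantile(0.95,
              sum(rate(temporal_shard_lock_latency_bucket[5m])) by (le)) > 0.010
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Temporal shard lock p95 latency above 10ms"
```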
Memory Monitoring
- History Service Pattern: Grows unpredictably with workflow history size
- Current Production Example: 12GB per pod and increasing
- Correlation: Loosely correlates with active workflow count; 100 workflows with large histories can use more memory than 1,000 with small ones
Upgrade Procedures (High Risk)
Required Sequence (EXACT ORDER)
1. Database schema migration (temporal-sql-tool)
2. Worker services
3. Matching and Frontend services
4. History services (LAST - most sensitive)
Failure Prevention
- Rolling Updates: Required with pod disruption budgets
- Testing: The exact sequence must be rehearsed in staging under production-like load
- Never: Restart all History pods simultaneously
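A sketch of the rolling-update half of that pairing: replace History pods one at a time so the disruption budget shown in the next section can actually hold. It is a Deployment strategy fragment; where you set it depends on how the History workload is templated in your chart.

```yaml
# Sketch: never take more than one History pod down during a rollout.
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxUnavailable: 1
    maxSurge: 1
```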
High Availability Configuration
Pod Distribution
```yaml
# Anti-affinity - prevent single node failure
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
            - key: "app.kubernetes.io/component"
              operator: In
              values: ["history"]
        topologyKey: "kubernetes.io/hostname"
```
Pod Disruption Budgets
```yaml
# Maintain 50% availability during maintenance
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: temporal-history   # illustrative name
spec:
  minAvailable: "50%"
  selector:
    matchLabels:
      app.kubernetes.io/component: history
```
Disaster Recovery Requirements
Backup Strategy
- Database = Temporal: All workflow state lives only in the database, so backing up the database is backing up Temporal
- Point-in-Time Recovery: Required capability - saved multiple production incidents
- Configuration Storage: Helm values and K8s manifests in Git (GitOps)
- Recovery Time: 30 minutes (prepared) vs 6 hours (unprepared)
Certificate Management
- TLS Required: Production clusters mandate TLS for inter-service communication
- Automation: cert-manager with automatic rotation prevents 3am certificate expiry incidents
- Monitoring: Certificate expiry alerts with 30-day warning minimum
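A sketch of a cert-manager Certificate that renews a month before expiry, which is what makes the 30-day alert actionable. The issuer, namespace-qualified DNS names, and secret name are assumptions; substitute your own ClusterIssuer and service addresses.

```yaml
# Sketch: automated rotation for inter-service TLS material.
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: temporal-internode
spec:
  secretName: temporal-internode-tls
  duration: 2160h    # 90 days
  renewBefore: 720h  # rotate 30 days before expiry
  dnsNames:
    - "temporal-frontend.temporal.svc.cluster.local"
    - "temporal-history.temporal.svc.cluster.local"
    - "temporal-matching.temporal.svc.cluster.local"
  issuerRef:
    name: internal-ca   # assumption: an existing ClusterIssuer
    kind: ClusterIssuer
```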
Resource Planning Reality
Storage Requirements
- Growth Rate: 1-10GB per million workflow executions (highly variable)
- IOPS Critical: Database IOPS bottlenecks cause system-wide latency
- Storage Class: SSD-backed mandatory for production performance
Scaling Characteristics
- History Service: Memory-bound scaling, unpredictable growth patterns
- Frontend/Matching: CPU-bound, horizontal scaling required
- Worker Services: Task queue dependent, monitor schedule-to-start latency
Common Production Failures
Memory Exhaustion (History Service)
- Symptom: Gradual memory growth then sudden OOMKill
- Timeline: Can occur hours to days after deployment
- Prevention: Start with 8-12GB limits, monitor growth patterns
- Emergency Fix: Increase memory limits, restart affected pods
Database Connection Exhaustion
- Symptom: All operations fail simultaneously with connection errors
- Root Cause: Default 20 connections per pod * pod count exceeds database limits
- Fix: Reduce to 5 connections per pod, implement connection pooling
- Prevention: Monitor connection pool utilization
Shard Ownership Conflicts
- Symptom: "shard ownership lost" error floods
- Causes: OOMKilled History pods, CPU throttling, database connectivity issues
- Impact: Workflows stuck in Running state indefinitely
- Resolution: Increase History pod resources, stable database connections
Deployment Method Comparison
Method | Setup Complexity | Production Readiness | Scaling | Maintenance | Cost Reality |
---|---|---|---|---|---|
Temporal Cloud | Minimal | Production-ready | Automatic | Zero | $200/month → $2000+/month |
Official Helm | Moderate | Requires hardening | Manual | Medium + 3am incidents | Infrastructure + operational overhead |
Manual K8s | High | Full control | Highly customizable | High + 60-hour weeks | Infrastructure + extensive engineering time |
Critical Decision Points
Shard Count (One-Time Decision)
- Impact: Determines maximum cluster performance ceiling
- Cannot: Be changed without complete rebuild
- Conservative Choice: 512 shards (handles most production workloads)
- Aggressive Choice: 4096+ shards (enterprise scale, higher infrastructure cost)
Database Strategy
- Managed Services: Higher cost, lower operational risk
- Self-Managed: Lower cost, significantly higher operational complexity
- Recommendation: Managed services unless proven database expertise available
Monitoring Complexity
- Temporal Metrics: Hundreds available, most are noise
- Focus: 3-4 critical metrics during incidents
- Cardinality Warning: Can overwhelm smaller Prometheus instances
- Solution: Metric filtering and retention tuning required
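One way to implement that filtering is metricRelabelings on the ServiceMonitor, so Prometheus only ingests the families you alert on. The selector labels and the keep-list regex below are illustrative assumptions, not canonical Temporal metric names.

```yaml
# Sketch: drop everything except the handful of metric families you act on.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: temporal-server
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: temporal
  endpoints:
    - port: metrics
      metricRelabelings:
        - sourceLabels: [__name__]
          regex: "temporal_(shard_lock_latency|persistence_latency|.*schedule_to_start_latency).*"
          action: keep
```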
Production Hardening Checklist
Security
- TLS inter-service communication configured
- Certificate rotation automated (cert-manager)
- External secrets management (no passwords in YAML)
- Network policies for service isolation
- RBAC configured for operational access
Reliability
- Pod disruption budgets configured
- Anti-affinity rules prevent single-node failures
- Resource requests/limits based on production patterns
- Database connection pooling configured
- Retention policies prevent unbounded growth
Monitoring
- Shard lock latency alerts (>5ms warning, >10ms critical)
- Schedule-to-start latency monitoring (>200ms alert)
- Database connection pool utilization alerts (>80%)
- Memory usage trending for History services
- Certificate expiry monitoring (30-day warning)
Operational
- Upgrade procedures tested in staging
- Backup/restore procedures validated
- GitOps configuration management
- Incident response playbooks
- Capacity planning based on actual usage patterns
Resource References
Operational Tools
- temporal-sql-tool - Database schema management
- Kubernetes Resource Management
- Pod Disruption Budgets
- cert-manager - TLS automation
Useful Links for Further Investigation
Essential Resources for Temporal Kubernetes Production Deployments
Link | Description |
---|---|
Temporal Self-Hosted Guide | Comprehensive deployment guide covering all supported deployment methods and production considerations. |
Temporal Helm Charts Repository | Official Kubernetes deployment charts with extensive configuration examples and production-ready templates. |
Temporal Production Checklist | Critical configuration requirements and operational best practices for production deployments. |
Scaling Temporal: The Basics | Detailed performance optimization guide with load testing methodology and scaling strategies. |
Temporal Server Configuration Reference | Complete configuration options documentation for all Temporal services and deployment scenarios. |
Kubernetes Documentation | Essential reference for StatefulSets, Services, ConfigMaps, and other resources used in Temporal deployments. |
Helm Documentation | Package manager documentation for understanding chart customization and deployment automation. |
Kubernetes Resource Management | Critical guide for configuring CPU and memory requests/limits for Temporal services. |
Pod Disruption Budgets | Maintaining availability during cluster maintenance and upgrades. |
Kubernetes Security Best Practices | Security hardening guidelines applicable to Temporal production deployments. |
PostgreSQL High Availability | Database clustering and replication strategies for Temporal persistence requirements. |
MySQL Performance Tuning | Optimization techniques for MySQL-backed Temporal deployments. |
temporal-sql-tool Usage | Database schema management and migration procedures for production upgrades. |
Cassandra Operations | Advanced database management for high-scale Temporal deployments using Cassandra. |
Prometheus Operator | Kubernetes-native monitoring stack deployment and configuration for Temporal metrics. |
Grafana Temporal Dashboards | Pre-built monitoring dashboards for operational visibility. |
Temporal Metrics Reference | Complete metrics documentation for monitoring production cluster health. |
OpenTelemetry Integration | Distributed tracing setup for complex workflow debugging and performance analysis. |
Temporal Samples Repository | Code examples and patterns for building production-ready workflows across multiple programming languages. |
Temporal Benchmarking Tools | Load testing utilities for validating cluster performance and scaling characteristics. |
Temporal CLI Documentation | Command-line tools for cluster administration, workflow management, and debugging. |
Temporal SDK Documentation | Language-specific guides for building applications that integrate with Kubernetes-deployed clusters. |
Temporal Community Forum | Active community discussions including production deployment experiences and troubleshooting advice. |
Temporal Blog | Regular technical articles covering advanced deployment patterns, performance optimization, and operational insights. |
Temporal GitHub Issues | Bug reports and feature discussions relevant to production deployments. |
Temporal Slack Community | Real-time community support for deployment questions and operational challenges. |
Amazon EKS Best Practices | AWS-specific Kubernetes optimization techniques applicable to Temporal deployments. |
Google GKE Production Readiness | GCP-focused production deployment guidelines and managed service integration patterns. |
Azure AKS Operations | Microsoft Azure Kubernetes Service optimization and operational best practices. |
Multi-Cloud Kubernetes Patterns | Cross-cloud deployment strategies and disaster recovery planning. |
Temporal Security Model | Authentication, authorization, and encryption capabilities for enterprise deployments. |
Kubernetes Network Policies | Network segmentation and traffic control for secure Temporal service communication. |
RBAC Configuration | Role-based access control setup for Temporal service accounts and operational access. |
Certificate Management | TLS certificate automation for encrypted communication between Temporal services. |
ArgoCD | GitOps deployment automation for Temporal cluster management and configuration drift prevention. |
Flux | Alternative GitOps tool for automated Temporal deployment management and synchronization. |
Terraform Kubernetes Provider | Infrastructure as Code approaches for reproducible Temporal cluster deployments. |
Ansible Kubernetes Collection | Configuration management automation for complex Temporal deployment scenarios. |