
Temporal Kubernetes Production Deployment: AI-Optimized Guide

Critical Configuration Requirements

Shard Count Configuration (IMMUTABLE DECISION)

  • Default Setting: 4 shards (WILL FAIL in production)
  • Production Minimum: 512 shards (set before the first install - values sketch below)
  • Critical Warning: Cannot be changed after deployment - requires complete cluster rebuild
  • Failure Scenario: "shard ownership lost" errors at 1000+ workflows with default 4 shards
  • Rebuild Cost: 2+ days downtime, complete data migration required
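
The shard count has to be fixed before the first install. A minimal sketch of the relevant Helm values, assuming the official temporalio/helm-charts layout (key paths can differ between chart releases):

# values.yaml - shard count is immutable once the schema is initialized
server:
  config:
    numHistoryShards: 512   # conservative production choice; the dev default is far too low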

Service Resource Requirements

History Service (Memory-Critical)

  • Minimum Memory: 8GB per pod, 12GB recommended (resource sketch below)
  • CPU: 2-4 cores per pod
  • Scaling Pattern: Memory usage grows unpredictably with workflow patterns
  • Failure Mode: OOMKilled when limits too low, corrupts workflow state
  • Critical Error: Signal: killed (9) followed by Reason: OOMKilled
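
A resource sketch for History pods along the lines above; the numbers are the starting points from this guide, not universal defaults - tune them against observed memory growth:

# History service resources - memory is the critical dimension
resources:
  requests:
    cpu: "2"
    memory: 8Gi
  limits:
    cpu: "4"
    memory: 12Gi   # headroom above the request; hitting this limit means OOMKill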

Frontend Service (CPU-Bound)

  • Memory: 2-4GB per pod
  • CPU: 1-2 cores per pod
  • Failure Symptom: context deadline exceeded errors during traffic spikes
  • Scaling: Horizontal scaling required for traffic spikes (HPA sketch below)
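
Since the Frontend is CPU-bound, horizontal scaling can be automated with a HorizontalPodAutoscaler keyed on CPU. A sketch, assuming the Frontend runs as a Deployment named temporal-frontend (adjust the name to whatever your chart actually creates):

# HPA for the CPU-bound Frontend service
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: temporal-frontend
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: temporal-frontend   # assumed Deployment name
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70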

Database Connection Pool

  • Default Setting: 20 connections per pod (WILL EXHAUST)
  • Production Setting: 5 connections per pod maximum (persistence sketch below)
  • Critical Error: FATAL: remaining connection slots are reserved for non-replication superuser connections
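
Connection limits live in the server's persistence configuration, per datastore. A sketch of the corresponding Helm values, assuming the standard SQL persistence block (exact key paths vary by chart version):

# Total connections = maxConns x pod count - keep the per-pod number low
server:
  config:
    persistence:
      default:
        sql:
          maxConns: 5
          maxIdleConns: 5
      visibility:
        sql:
          maxConns: 5
          maxIdleConns: 5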

Production Deployment Anti-Patterns

Never Use Default Helm Chart in Production

  • Includes: Bundled Cassandra, Elasticsearch, Prometheus, Grafana
  • Failure Timeline: roughly 6 hours of real traffic before the bundled stack collapses
  • Resource Exhaustion: Bundled services fail under any real load
  • Fix: Disable all bundled services and use managed databases (values sketch below)
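
In the official chart the bundled dependencies are toggled off in the values file and replaced with managed endpoints. A hedged sketch - the flag names follow temporalio/helm-charts and may change between releases:

# Disable bundled dependencies; point Temporal at managed services instead
cassandra:
  enabled: false
mysql:
  enabled: false
postgresql:
  enabled: false
elasticsearch:
  enabled: false
prometheus:
  enabled: false
grafana:
  enabled: false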

Database Selection Impact

PostgreSQL/MySQL (Recommended)

  • Use: Managed services (RDS, Cloud SQL, Azure Database)
  • Storage: SSD-backed only (gp3, Premium SSD, SSD persistent disks)
  • Connection Pooling: PgBouncer required for PostgreSQL
  • Schema Requirements: TWO separate databases, core + visibility (persistence sketch below)
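
The core and visibility stores are declared as two separate databases in the persistence section. A sketch assuming PostgreSQL reached through PgBouncer - hostnames, database names, and the driver string are placeholders to verify against your chart version:

# Two databases: core ("default") and visibility
server:
  config:
    persistence:
      default:
        driver: sql
        sql:
          driver: postgres12            # assumed plugin name for PostgreSQL 12+
          host: pgbouncer.db.svc        # placeholder - PgBouncer in front of the managed database
          port: 5432
          database: temporal
      visibility:
        driver: sql
        sql:
          driver: postgres12
          host: pgbouncer.db.svc
          port: 5432
          database: temporal_visibility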

Cassandra (High Complexity)

  • Operational Overhead: Massive - ring topology, JVM tuning, compaction strategies
  • Additional Requirement: Elasticsearch cluster for visibility data
  • Management Cost: Two complex distributed systems vs one PostgreSQL instance
  • Recommendation: Avoid unless proven scale requirements

Critical Metrics for Production Monitoring

Immediate Alert Thresholds

  • Shard Lock Latency: >5ms (warning), >10ms (critical)
  • Schedule-to-Start Latency: >200ms indicates worker capacity issues (alert rule sketch below)
  • Database Connection Pool: >80% utilization means connection exhaustion is imminent
  • Poll Sync Rate: <99% indicates task distribution failure
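
As a concrete example, the schedule-to-start threshold can be expressed as a PrometheusRule. The metric name below is an assumption - confirm the exact histogram name your server and SDK versions emit before relying on it:

# PrometheusRule sketch - verify metric names against your Temporal version
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: temporal-latency-alerts
spec:
  groups:
    - name: temporal
      rules:
        - alert: TemporalScheduleToStartLatencyHigh
          expr: |
            histogram_quantile(0.95,
              sum(rate(temporal_task_schedule_to_start_latency_bucket[5m])) by (le, task_queue)
            ) > 0.2
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Schedule-to-start p95 above 200ms - workers are likely under-provisioned"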

Memory Monitoring

  • History Service Pattern: Grows unpredictably with workflow history size
  • Current Production Example: 12GB per pod and increasing
  • Correlation: Only loosely correlated with active workflow count - 100 workflows with large histories can out-consume 1,000 simple ones

Upgrade Procedures (High Risk)

Required Sequence (EXACT ORDER)

  1. Database schema migration (temporal-sql-tool)
  2. Worker services
  3. Matching and Frontend services
  4. History services (LAST - most sensitive)

Failure Prevention

  • Rolling Updates: Required with pod disruption budgets (rolling-update sketch below)
  • Testing: Exact sequence must be tested in staging with production load
  • Never: Restart all History pods simultaneously
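
One way to enforce the "never all at once" rule is a conservative rolling-update strategy on the History Deployment, on top of the pod disruption budget shown in the next section. A minimal sketch:

# Roll History pods one at a time during upgrades
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 1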

High Availability Configuration

Pod Distribution

# Anti-affinity - prevent single node failure
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
            - key: "app.kubernetes.io/component"
              operator: In
              values: ["history"]
        topologyKey: "kubernetes.io/hostname"

Pod Disruption Budgets

# Maintain 50% availability during maintenance
spec:
  minAvailable: 50%
  selector:
    matchLabels:
      app.kubernetes.io/component: history

Disaster Recovery Requirements

Backup Strategy

  • Database = Temporal: all workflow state lives in the database and nowhere else - backing up the database backs up the cluster
  • Point-in-Time Recovery: Required capability - saved multiple production incidents
  • Configuration Storage: Helm values and K8s manifests in Git (GitOps)
  • Recovery Time: 30 minutes (prepared) vs 6 hours (unprepared)

Certificate Management

  • TLS Required: Production clusters mandate TLS for inter-service communication
  • Automation: cert-manager with automatic rotation prevents 3am certificate-expiry incidents (Certificate sketch below)
  • Monitoring: Certificate expiry alerts with 30-day warning minimum
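
With cert-manager installed, inter-service certificates can be declared as Certificate resources and rotated automatically. A sketch with placeholder names - the issuer, secret name, and DNS names are assumptions to adapt to your cluster:

# cert-manager Certificate for Temporal inter-service TLS
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: temporal-internode
  namespace: temporal
spec:
  secretName: temporal-internode-tls   # mounted by the Temporal services
  duration: 2160h                      # 90 days
  renewBefore: 720h                    # renew 30 days early, matching the alert window
  dnsNames:
    - temporal-frontend.temporal.svc.cluster.local
    - temporal-history.temporal.svc.cluster.local
  issuerRef:
    name: internal-ca                  # placeholder ClusterIssuer
    kind: ClusterIssuer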

Resource Planning Reality

Storage Requirements

  • Growth Rate: 1-10GB per million workflow executions (highly variable)
  • IOPS Critical: Database IOPS bottlenecks cause system-wide latency
  • Storage Class: SSD-backed mandatory for production performance (StorageClass sketch below)
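
Where any stateful component does run in-cluster, the storage class should be SSD-backed with explicit IOPS. An illustrative sketch for AWS EBS gp3 (the IOPS and throughput figures are placeholders - size them from measured database load):

# SSD-backed StorageClass (AWS EBS gp3 example)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: temporal-ssd
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "6000"         # placeholder - derive from measured IOPS
  throughput: "250"
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer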

Scaling Characteristics

  • History Service: Memory-bound scaling, unpredictable growth patterns
  • Frontend/Matching: CPU-bound, horizontal scaling required
  • Worker Services: Task queue dependent, monitor schedule-to-start latency

Common Production Failures

Memory Exhaustion (History Service)

  • Symptom: Gradual memory growth then sudden OOMKill
  • Timeline: Can occur hours to days after deployment
  • Prevention: Start with 8-12GB limits, monitor growth patterns
  • Emergency Fix: Increase memory limits, restart affected pods

Database Connection Exhaustion

  • Symptom: All operations fail simultaneously with connection errors
  • Root Cause: Default 20 connections per pod * pod count exceeds database limits
  • Fix: Reduce to 5 connections per pod, implement connection pooling
  • Prevention: Monitor connection pool utilization

Shard Ownership Conflicts

  • Symptom: "shard ownership lost" error floods
  • Causes: OOMKilled History pods, CPU throttling, database connectivity issues
  • Impact: Workflows stuck in Running state indefinitely
  • Resolution: Increase History pod resources, stable database connections

Deployment Method Comparison

Temporal Cloud
  • Setup Complexity: Minimal
  • Production Readiness: Production-ready
  • Scaling: Automatic
  • Maintenance: Zero
  • Cost Reality: $200/month → $2000+/month

Official Helm Chart
  • Setup Complexity: Moderate
  • Production Readiness: Requires hardening
  • Scaling: Manual
  • Maintenance: Medium, plus 3am incidents
  • Cost Reality: Infrastructure plus operational overhead

Manual Kubernetes Manifests
  • Setup Complexity: High
  • Production Readiness: Full control
  • Scaling: Highly customizable
  • Maintenance: High, plus 60-hour weeks
  • Cost Reality: Infrastructure plus extensive engineering time

Critical Decision Points

Shard Count (One-Time Decision)

  • Impact: Determines maximum cluster performance ceiling
  • Cannot: Be changed without complete rebuild
  • Conservative Choice: 512 shards (handles most production workloads)
  • Aggressive Choice: 4096+ shards (enterprise scale, higher infrastructure cost)

Database Strategy

  • Managed Services: Higher cost, lower operational risk
  • Self-Managed: Lower cost, significantly higher operational complexity
  • Recommendation: Managed services unless proven database expertise available

Monitoring Complexity

  • Temporal Metrics: Hundreds available, most are noise
  • Focus: 3-4 critical metrics during incidents
  • Cardinality Warning: Can overwhelm smaller Prometheus instances
  • Solution: Metric filtering and retention tuning required (scrape-config sketch below)
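
Cardinality can be trimmed at scrape time by dropping metrics you never alert on. A sketch using Prometheus metric_relabel_configs - the metric names in the regex are placeholders to replace with the high-cardinality series you actually see:

# Drop high-cardinality Temporal metrics at scrape time
scrape_configs:
  - job_name: temporal
    kubernetes_sd_configs:
      - role: pod
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: "temporal_(persistence_latency_bucket|service_latency_bucket)"   # placeholder names
        action: drop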

Production Hardening Checklist

Security

  • TLS inter-service communication configured
  • Certificate rotation automated (cert-manager)
  • External secrets management (no passwords in YAML)
  • Network policies for service isolation
  • RBAC configured for operational access

Reliability

  • Pod disruption budgets configured
  • Anti-affinity rules prevent single-node failures
  • Resource requests/limits based on production patterns
  • Database connection pooling configured
  • Retention policies prevent unbounded growth

Monitoring

  • Shard lock latency alerts (>5ms warning, >10ms critical)
  • Schedule-to-start latency monitoring (alert above 200ms)
  • Database connection pool utilization alerts (>80%)
  • Memory usage trending for History services
  • Certificate expiry monitoring (30-day warning)

Operational

  • Upgrade procedures tested in staging
  • Backup/restore procedures validated
  • GitOps configuration management
  • Incident response playbooks
  • Capacity planning based on actual usage patterns


Useful Links for Further Investigation

Essential Resources for Temporal Kubernetes Production Deployments

  • Temporal Self-Hosted Guide - Comprehensive deployment guide covering all supported deployment methods and production considerations.
  • Temporal Helm Charts Repository - Official Kubernetes deployment charts with extensive configuration examples and production-ready templates.
  • Temporal Production Checklist - Critical configuration requirements and operational best practices for production deployments.
  • Scaling Temporal: The Basics - Detailed performance optimization guide with load testing methodology and scaling strategies.
  • Temporal Server Configuration Reference - Complete configuration options documentation for all Temporal services and deployment scenarios.
  • Kubernetes Documentation - Essential reference for StatefulSets, Services, ConfigMaps, and other resources used in Temporal deployments.
  • Helm Documentation - Package manager documentation for understanding chart customization and deployment automation.
  • Kubernetes Resource Management - Critical guide for configuring CPU and memory requests/limits for Temporal services.
  • Pod Disruption Budgets - Maintaining availability during cluster maintenance and upgrades.
  • Kubernetes Security Best Practices - Security hardening guidelines applicable to Temporal production deployments.
  • PostgreSQL High Availability - Database clustering and replication strategies for Temporal persistence requirements.
  • MySQL Performance Tuning - Optimization techniques for MySQL-backed Temporal deployments.
  • temporal-sql-tool Usage - Database schema management and migration procedures for production upgrades.
  • Cassandra Operations - Advanced database management for high-scale Temporal deployments using Cassandra.
  • Prometheus Operator - Kubernetes-native monitoring stack deployment and configuration for Temporal metrics.
  • Grafana Temporal Dashboards - Pre-built monitoring dashboards for operational visibility.
  • Temporal Metrics Reference - Complete metrics documentation for monitoring production cluster health.
  • OpenTelemetry Integration - Distributed tracing setup for complex workflow debugging and performance analysis.
  • Temporal Samples Repository - Code examples and patterns for building production-ready workflows across multiple programming languages.
  • Temporal Benchmarking Tools - Load testing utilities for validating cluster performance and scaling characteristics.
  • Temporal CLI Documentation - Command-line tools for cluster administration, workflow management, and debugging.
  • Temporal SDK Documentation - Language-specific guides for building applications that integrate with Kubernetes-deployed clusters.
  • Temporal Community Forum - Active community discussions including production deployment experiences and troubleshooting advice.
  • Temporal Blog - Regular technical articles covering advanced deployment patterns, performance optimization, and operational insights.
  • Temporal GitHub Issues - Bug reports and feature discussions relevant to production deployments.
  • Temporal Slack Community - Real-time community support for deployment questions and operational challenges.
  • Amazon EKS Best Practices - AWS-specific Kubernetes optimization techniques applicable to Temporal deployments.
  • Google GKE Production Readiness - GCP-focused production deployment guidelines and managed service integration patterns.
  • Azure AKS Operations - Microsoft Azure Kubernetes Service optimization and operational best practices.
  • Multi-Cloud Kubernetes Patterns - Cross-cloud deployment strategies and disaster recovery planning.
  • Temporal Security Model - Authentication, authorization, and encryption capabilities for enterprise deployments.
  • Kubernetes Network Policies - Network segmentation and traffic control for secure Temporal service communication.
  • RBAC Configuration - Role-based access control setup for Temporal service accounts and operational access.
  • Certificate Management - TLS certificate automation for encrypted communication between Temporal services.
  • ArgoCD - GitOps deployment automation for Temporal cluster management and configuration drift prevention.
  • Flux - Alternative GitOps tool for automated Temporal deployment management and synchronization.
  • Terraform Kubernetes Provider - Infrastructure as Code approaches for reproducible Temporal cluster deployments.
  • Ansible Kubernetes Collection - Configuration management automation for complex Temporal deployment scenarios.
