Temporal Kubernetes Production Deployment: AI-Optimized Guide
Critical Configuration Requirements
Shard Count Configuration (IMMUTABLE DECISION)
- Default Setting: 4 shards (WILL FAIL in production)
- Production Minimum: 512 shards
- Critical Warning: Cannot be changed after deployment - requires complete cluster rebuild
- Failure Scenario: "shard ownership lost" errors at 1000+ workflows with default 4 shards
- Rebuild Cost: 2+ days downtime, complete data migration required
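A minimal Helm values sketch for pinning the shard count before the first install. It assumes the official temporalio/helm-charts layout, where this setting lives under server.config.numHistoryShards; confirm the key path against the chart version you actually deploy.

```yaml
# Sketch only: pin the shard count before the schema is first initialized.
# Assumes the official chart exposes server.config.numHistoryShards -- verify
# against your chart's values.yaml; the value is immutable once deployed.
server:
  config:
    numHistoryShards: 512
```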
Service Resource Requirements
History Service (Memory-Critical)
- Minimum Memory: 8GB per pod (12GB recommended)
- CPU: 2-4 cores per pod
- Scaling Pattern: Memory usage grows unpredictably with workflow patterns
- Failure Mode: OOMKilled when limits too low, corrupts workflow state
- Critical Error: Signal: killed (9), followed by Reason: OOMKilled in the pod status
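A sketch of requests/limits matching the figures above, written as a plain container resources stanza. Where it belongs in your values file (for example server.history.resources in the official chart) depends on the chart version, so treat the path as an assumption.

```yaml
# Sketch: History pod sizing per the minimums above (8Gi floor, 12Gi limit).
resources:
  requests:
    cpu: "2"
    memory: 8Gi
  limits:
    cpu: "4"
    memory: 12Gi
```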
Frontend Service (CPU-Bound)
- Memory: 2-4GB per pod
- CPU: 1-2 cores per pod
- Failure Symptom: context deadline exceeded errors during traffic spikes
- Scaling: Horizontal scaling required to absorb traffic spikes (see the HPA sketch below)
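One way to get that horizontal scaling is a CPU-based HorizontalPodAutoscaler. The sketch below assumes the Frontend Deployment is named temporal-frontend; adjust the target and replica bounds to your release.

```yaml
# Sketch: CPU-driven autoscaling for the CPU-bound Frontend service.
# The Deployment name is an assumption -- match it to your release.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: temporal-frontend
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: temporal-frontend
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```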
Database Connection Pool
- Default Setting: 20 connections per pod (WILL EXHAUST)
- Production Setting: 5 connections per pod maximum
- Critical Error: FATAL: remaining connection slots are reserved for non-replication superuser connections
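A hedged values sketch for capping per-pod connections. It assumes the official chart's server.config.persistence tree and Temporal's SQL persistence keys (maxConns, maxIdleConns, maxConnLifetime); verify the exact names for your version.

```yaml
# Sketch: keep (pod count x maxConns) comfortably below the database's limit.
server:
  config:
    persistence:
      default:
        sql:
          maxConns: 5
          maxIdleConns: 5
          maxConnLifetime: "1h"
      visibility:
        sql:
          maxConns: 5
          maxIdleConns: 5
          maxConnLifetime: "1h"
```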
Production Deployment Anti-Patterns
Never Use Default Helm Chart in Production
- Includes: Bundled Cassandra, Elasticsearch, Prometheus, Grafana
- Failure Timeline: 6 hours before system collapse
- Resource Exhaustion: Bundled services fail under any real load
- Fix: Disable all bundled services, use managed databases
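As a starting point, the bundled dependencies can be switched off in the values file. The toggles below follow the official chart's top-level enabled flags; confirm them against the values.yaml of the chart version you install.

```yaml
# Sketch: run only Temporal itself; point persistence at managed databases.
cassandra:
  enabled: false
elasticsearch:
  enabled: false
prometheus:
  enabled: false
grafana:
  enabled: false
```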
Database Selection Impact
PostgreSQL/MySQL (Recommended)
- Use: Managed services (RDS, Cloud SQL, Azure Database)
- Storage: SSD-backed only (gp3, Premium SSD, SSD persistent disks)
- Connection Pooling: PgBouncer required for PostgreSQL
- Schema Requirements: TWO separate databases (core + visibility)
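A sketch of the two-database split against one managed PostgreSQL endpoint. The field layout approximates the official chart's persistence tree, and the hostname and secret name are placeholders; verify every key against your chart version.

```yaml
# Sketch: separate core and visibility databases on one managed instance.
server:
  config:
    persistence:
      default:
        sql:
          driver: "postgres12"
          host: "temporal-db.example.internal"      # placeholder endpoint
          port: 5432
          database: "temporal"
          user: "temporal"
          existingSecret: "temporal-db-credentials" # keep passwords out of YAML
      visibility:
        sql:
          driver: "postgres12"
          host: "temporal-db.example.internal"
          port: 5432
          database: "temporal_visibility"
          user: "temporal"
          existingSecret: "temporal-db-credentials"
```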
Cassandra (High Complexity)
- Operational Overhead: Massive - ring topology, JVM tuning, compaction strategies
- Additional Requirement: Elasticsearch cluster for visibility data
- Management Cost: Two complex distributed systems vs one PostgreSQL instance
- Recommendation: Avoid unless proven scale requirements
Critical Metrics for Production Monitoring
Immediate Alert Thresholds
- Shard Lock Latency: >5ms (warning), >10ms (critical)
- Schedule-to-Start Latency: >200ms indicates worker capacity issues
- Database Connection Pool: alert at >80% utilization, before connections are exhausted
- Poll Sync Rate: <99% indicates task distribution failure
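A sketch of wiring the shard-lock threshold into a PrometheusRule. The metric name and unit (temporal_shard_lock_latency_bucket, seconds) are assumptions about your exporter prefix and histogram buckets; check your actual /metrics output before relying on the expression.

```yaml
# Sketch: alert when p95 shard lock latency exceeds the 10ms critical line.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: temporal-shard-alerts
spec:
  groups:
    - name: temporal.shards
      rules:
        - alert: TemporalShardLockLatencyCritical
          expr: |
            histogram_quantile(0.95,
              sum(rate(temporal_shard_lock_latency_bucket[5m])) by (le)) > 0.010
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Temporal shard lock p95 latency above 10ms"
```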
Memory Monitoring
- History Service Pattern: Grows unpredictably with workflow history size
- Current Production Example: 12GB per pod and increasing
- Correlation: Loosely correlates with active workflow count; 100 workflows with large histories can use more memory than 1,000 with small ones
Upgrade Procedures (High Risk)
Required Sequence (EXACT ORDER)
1. Database schema migration (temporal-sql-tool)
2. Worker services
3. Matching and Frontend services
4. History services (LAST - most sensitive)
Failure Prevention
- Rolling Updates: Required with pod disruption budgets
- Testing: The exact sequence must be rehearsed in staging under production-like load
- Never: Restart all History pods simultaneously
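A sketch of the rolling-update half of that pairing: replace History pods one at a time so the disruption budget shown in the next section can actually hold. It is a Deployment strategy fragment; where you set it depends on how the History workload is templated in your chart.

```yaml
# Sketch: never take more than one History pod down during a rollout.
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxUnavailable: 1
    maxSurge: 1
```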
High Availability Configuration
Pod Distribution
```yaml
# Anti-affinity - prevent single node failure
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
            - key: "app.kubernetes.io/component"
              operator: In
              values: ["history"]
        topologyKey: "kubernetes.io/hostname"
```
Pod Disruption Budgets
```yaml
# Maintain 50% availability during maintenance
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: temporal-history   # illustrative name
spec:
  minAvailable: "50%"
  selector:
    matchLabels:
      app.kubernetes.io/component: history
```
Disaster Recovery Requirements
Backup Strategy
- Database = Temporal: All workflow state lives only in the database, so backing up the database is backing up Temporal
- Point-in-Time Recovery: Required capability - saved multiple production incidents
- Configuration Storage: Helm values and K8s manifests in Git (GitOps)
- Recovery Time: 30 minutes (prepared) vs 6 hours (unprepared)
Certificate Management
- TLS Required: Production clusters mandate TLS for inter-service communication
- Automation: cert-manager with automatic rotation prevents 3am certificate expiry incidents
- Monitoring: Certificate expiry alerts with 30-day warning minimum
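A sketch of a cert-manager Certificate that renews a month before expiry, which is what makes the 30-day alert actionable. The issuer, namespace-qualified DNS names, and secret name are assumptions; substitute your own ClusterIssuer and service addresses.

```yaml
# Sketch: automated rotation for inter-service TLS material.
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: temporal-internode
spec:
  secretName: temporal-internode-tls
  duration: 2160h    # 90 days
  renewBefore: 720h  # rotate 30 days before expiry
  dnsNames:
    - "temporal-frontend.temporal.svc.cluster.local"
    - "temporal-history.temporal.svc.cluster.local"
    - "temporal-matching.temporal.svc.cluster.local"
  issuerRef:
    name: internal-ca   # assumption: an existing ClusterIssuer
    kind: ClusterIssuer
```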
Resource Planning Reality
Storage Requirements
- Growth Rate: 1-10GB per million workflow executions (highly variable)
- IOPS Critical: Database IOPS bottlenecks cause system-wide latency
- Storage Class: SSD-backed mandatory for production performance
Scaling Characteristics
- History Service: Memory-bound scaling, unpredictable growth patterns
- Frontend/Matching: CPU-bound, horizontal scaling required
- Worker Services: Task queue dependent, monitor schedule-to-start latency
Common Production Failures
Memory Exhaustion (History Service)
- Symptom: Gradual memory growth then sudden OOMKill
- Timeline: Can occur hours to days after deployment
- Prevention: Start with 8-12GB limits, monitor growth patterns
- Emergency Fix: Increase memory limits, restart affected pods
Database Connection Exhaustion
- Symptom: All operations fail simultaneously with connection errors
- Root Cause: Default 20 connections per pod * pod count exceeds database limits
- Fix: Reduce to 5 connections per pod, implement connection pooling
- Prevention: Monitor connection pool utilization
Shard Ownership Conflicts
- Symptom: "shard ownership lost" error floods
- Causes: OOMKilled History pods, CPU throttling, database connectivity issues
- Impact: Workflows stuck in Running state indefinitely
- Resolution: Increase History pod resources, stable database connections
Deployment Method Comparison
Method | Setup Complexity | Production Readiness | Scaling | Maintenance | Cost Reality |
---|---|---|---|---|---|
Temporal Cloud | Minimal | Production-ready | Automatic | Zero | $200/month → $2000+/month |
Official Helm | Moderate | Requires hardening | Manual | Medium + 3am incidents | Infrastructure + operational overhead |
Manual K8s | High | Full control | Highly customizable | High + 60-hour weeks | Infrastructure + extensive engineering time |
Critical Decision Points
Shard Count (One-Time Decision)
- Impact: Determines maximum cluster performance ceiling
- Cannot: Be changed without complete rebuild
- Conservative Choice: 512 shards (handles most production workloads)
- Aggressive Choice: 4096+ shards (enterprise scale, higher infrastructure cost)
Database Strategy
- Managed Services: Higher cost, lower operational risk
- Self-Managed: Lower cost, significantly higher operational complexity
- Recommendation: Managed services unless proven database expertise available
Monitoring Complexity
- Temporal Metrics: Hundreds available, most are noise
- Focus: 3-4 critical metrics during incidents
- Cardinality Warning: Can overwhelm smaller Prometheus instances
- Solution: Metric filtering and retention tuning required
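One way to implement that filtering is metricRelabelings on the ServiceMonitor, so Prometheus only ingests the families you alert on. The selector labels and the keep-list regex below are illustrative assumptions, not canonical Temporal metric names.

```yaml
# Sketch: drop everything except the handful of metric families you act on.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: temporal-server
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: temporal
  endpoints:
    - port: metrics
      metricRelabelings:
        - sourceLabels: [__name__]
          regex: "temporal_(shard_lock_latency|persistence_latency|.*schedule_to_start_latency).*"
          action: keep
```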
Production Hardening Checklist
Security
- TLS inter-service communication configured
- Certificate rotation automated (cert-manager)
- External secrets management (no passwords in YAML)
- Network policies for service isolation
- RBAC configured for operational access
Reliability
- Pod disruption budgets configured
- Anti-affinity rules prevent single-node failures
- Resource requests/limits based on production patterns
- Database connection pooling configured
- Retention policies prevent unbounded growth
Monitoring
- Shard lock latency alerts (>5ms warning, >10ms critical)
- Schedule-to-start latency monitoring (>200ms alert)
- Database connection pool utilization alerts (>80%)
- Memory usage trending for History services
- Certificate expiry monitoring (30-day warning)
Operational
- Upgrade procedures tested in staging
- Backup/restore procedures validated
- GitOps configuration management
- Incident response playbooks
- Capacity planning based on actual usage patterns
Resource References
Operational Tools
- temporal-sql-tool - Database schema management
- Kubernetes Resource Management
- Pod Disruption Budgets
- cert-manager - TLS automation
Useful Links for Further Investigation
Essential Resources for Temporal Kubernetes Production Deployments
Link | Description |
---|---|
Temporal Self-Hosted Guide | Comprehensive deployment guide covering all supported deployment methods and production considerations. |
Temporal Helm Charts Repository | Official Kubernetes deployment charts with extensive configuration examples and production-ready templates. |
Temporal Production Checklist | Critical configuration requirements and operational best practices for production deployments. |
Scaling Temporal: The Basics | Detailed performance optimization guide with load testing methodology and scaling strategies. |
Temporal Server Configuration Reference | Complete configuration options documentation for all Temporal services and deployment scenarios. |
Kubernetes Documentation | Essential reference for StatefulSets, Services, ConfigMaps, and other resources used in Temporal deployments. |
Helm Documentation | Package manager documentation for understanding chart customization and deployment automation. |
Kubernetes Resource Management | Critical guide for configuring CPU and memory requests/limits for Temporal services. |
Pod Disruption Budgets | Maintaining availability during cluster maintenance and upgrades. |
Kubernetes Security Best Practices | Security hardening guidelines applicable to Temporal production deployments. |
PostgreSQL High Availability | Database clustering and replication strategies for Temporal persistence requirements. |
MySQL Performance Tuning | Optimization techniques for MySQL-backed Temporal deployments. |
temporal-sql-tool Usage | Database schema management and migration procedures for production upgrades. |
Cassandra Operations | Advanced database management for high-scale Temporal deployments using Cassandra. |
Prometheus Operator | Kubernetes-native monitoring stack deployment and configuration for Temporal metrics. |
Grafana Temporal Dashboards | Pre-built monitoring dashboards for operational visibility. |
Temporal Metrics Reference | Complete metrics documentation for monitoring production cluster health. |
OpenTelemetry Integration | Distributed tracing setup for complex workflow debugging and performance analysis. |
Temporal Samples Repository | Code examples and patterns for building production-ready workflows across multiple programming languages. |
Temporal Benchmarking Tools | Load testing utilities for validating cluster performance and scaling characteristics. |
Temporal CLI Documentation | Command-line tools for cluster administration, workflow management, and debugging. |
Temporal SDK Documentation | Language-specific guides for building applications that integrate with Kubernetes-deployed clusters. |
Temporal Community Forum | Active community discussions including production deployment experiences and troubleshooting advice. |
Temporal Blog | Regular technical articles covering advanced deployment patterns, performance optimization, and operational insights. |
Temporal GitHub Issues | Bug reports and feature discussions relevant to production deployments. |
Temporal Slack Community | Real-time community support for deployment questions and operational challenges. |
Amazon EKS Best Practices | AWS-specific Kubernetes optimization techniques applicable to Temporal deployments. |
Google GKE Production Readiness | GCP-focused production deployment guidelines and managed service integration patterns. |
Azure AKS Operations | Microsoft Azure Kubernetes Service optimization and operational best practices. |
Multi-Cloud Kubernetes Patterns | Cross-cloud deployment strategies and disaster recovery planning. |
Temporal Security Model | Authentication, authorization, and encryption capabilities for enterprise deployments. |
Kubernetes Network Policies | Network segmentation and traffic control for secure Temporal service communication. |
RBAC Configuration | Role-based access control setup for Temporal service accounts and operational access. |
Certificate Management | TLS certificate automation for encrypted communication between Temporal services. |
ArgoCD | GitOps deployment automation for Temporal cluster management and configuration drift prevention. |
Flux | Alternative GitOps tool for automated Temporal deployment management and synchronization. |
Terraform Kubernetes Provider | Infrastructure as Code approaches for reproducible Temporal cluster deployments. |
Ansible Kubernetes Collection | Configuration management automation for complex Temporal deployment scenarios. |