Why Your First Temporal Deployment Will Fail

Temporal on Kubernetes isn't like deploying a web app. I thought it was. The first time I deployed it, I used the default Helm chart, didn't change anything, and pushed it to prod. Big mistake.

The system seemed fine for about 6 hours. Then at 2am, everything broke. Workflows stopped progressing. The History pods were OOM-killing themselves. Database connections were maxed out. Our on-call engineer (me) spent the next 4 hours figuring out what went wrong.

The Four Services That Will Ruin Your Night

Temporal has four core services and each one has its own special way of breaking in production:

Frontend - The API gateway that looks innocent but will bottleneck you at scale. It's CPU-bound, so when traffic spikes, it just... stops responding. You'll see context deadline exceeded errors everywhere and wonder why your perfectly good workflows are hanging. Scale this first when things get weird. I learned this at 2:30am when our entire workflow system froze because one Frontend pod couldn't handle the load from our batch job processing.
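
If you take one thing from that night: put an autoscaler on Frontend before you need it. A minimal HPA sketch, assuming the chart produced a Deployment named temporal-frontend (adjust to whatever your release actually created):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: temporal-frontend-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: temporal-frontend   # assumption: match your actual Deployment name
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # Frontend is CPU-bound, so CPU is the right signal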

History - This is the one that will eat all your RAM and ask for seconds. History services cache workflow execution data and they're greedy as hell. Each History pod can easily consume 8GB+ of memory in production. The kicker? The shard count is set at deployment and cannot be changed. Ever. Choose wrong and you rebuild your entire cluster.

I learned this the hard way when we deployed with 4 shards (the default) and hit 1000+ workflows. History pods were fighting over shard ownership, causing "shard ownership lost" errors. We had to rebuild everything with 512 shards. Two days of downtime.

Matching - Handles task queues and if you tune it wrong, workflows just... sit there. Forever. Tasks get queued but never picked up. Workers are idle but tasks aren't being delivered. It's infuriating to debug because everything looks healthy until you dig into the queue metrics. I once spent 4 hours chasing a bug where workflows would start but never progress past the first activity - turned out the Matching service couldn't keep up with the poll requests from our 50 worker pods.

Worker - These aren't part of the Temporal server, but they're what executes your actual workflow code. If the ratio of workers to tasks is wrong, you'll either have idle resources burning money or backed-up queues making users angry.

Database Choices That Matter (And Ones That Don't)

Temporal needs a database. Pick from PostgreSQL, MySQL, Cassandra, or SQLite (SQLite is dev only, obviously). Your choice affects everything else.

PostgreSQL/MySQL - Go with managed services like Amazon RDS, Google Cloud SQL, or Azure Database. Trust me on this. Running your own database in K8s sounds cool until it breaks at 3am and you're trying to recover data from persistent volumes while your CEO asks when workflows will work again.

Running PostgreSQL on Kubernetes is possible with operators like Zalando's Postgres Operator or CrunchyData PGO, but the operational overhead is massive. You need backup strategies, connection pooling with PgBouncer, monitoring with pg_stat_statements, WAL archiving, and performance tuning. Plus proper security configurations, upgrade procedures, and monitoring setup with PostgreSQL Exporter. Managed services handle all this bullshit for you.

The temporal-sql-tool handles schema setup. It's straightforward but you need TWO databases - one for core Temporal data and another for visibility (search) data. Yes, two. I know it's annoying.
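
Here's roughly how I run that setup as a one-off Kubernetes Job against both databases. Treat it as a sketch: the image tag, Secret name, and schema paths are assumptions, and temporal-sql-tool flags shift between versions, so confirm with temporal-sql-tool --help inside the admin-tools image before trusting it.

apiVersion: batch/v1
kind: Job
metadata:
  name: temporal-schema-setup
spec:
  backoffLimit: 3
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: schema-setup
          image: temporalio/admin-tools:1.25.1   # match your server version
          env:
            - name: SQL_PLUGIN
              value: "postgres12"                # older servers use "postgres"
            - name: SQL_HOST
              value: "your-actual-postgres-host.rds.amazonaws.com"
            - name: SQL_PORT
              value: "5432"
            - name: SQL_USER
              value: "temporal_user"
            - name: SQL_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: temporal-db-credentials  # assumed Secret, shown later
                  key: password
          command: ["/bin/sh", "-c"]
          args:
            - |
              # Assumes the temporal and temporal_visibility databases already exist (e.g. created in RDS).
              # Both need their own schema; verify the schema directory for your version.
              SQL_DATABASE=temporal temporal-sql-tool setup-schema -v 0.0
              SQL_DATABASE=temporal temporal-sql-tool update-schema -d /etc/temporal/schema/postgresql/v12/temporal/versioned
              SQL_DATABASE=temporal_visibility temporal-sql-tool setup-schema -v 0.0
              SQL_DATABASE=temporal_visibility temporal-sql-tool update-schema -d /etc/temporal/schema/postgresql/v12/visibility/versioned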

Cassandra - Only choose this if you hate yourself or actually need the scale. Cassandra in Kubernetes is a nightmare. Sure, there's the Cassandra Operator and K8ssandra for enterprise deployments, but you'll spend more time managing the database than your actual workflows. You need proper ring topology, JVM tuning, compaction strategies, and monitoring with cassandra-exporter.

Plus, Cassandra can't handle visibility data, so you need Elasticsearch with proper cluster setup, index management, and monitoring with Elasticsearch Exporter too. Now you're managing two complex distributed systems instead of one simple PostgreSQL instance. I tried this approach once - spent 3 weeks getting Cassandra stable only to have Elasticsearch nodes randomly dying during peak load. Ended up switching back to RDS Postgres and sleeping better at night.

Resource Planning (AKA Guessing Until It Works)

Here's the dirty truth: nobody knows the exact resources you'll need until you hit production load. You can follow load-measure-scale methodology all you want, but production always surprises you.

Memory - History services are memory hogs. Start with 4GB per History pod, but watch it grow. Our largest History pod consumes 12GB and counting. Memory usage correlates with active workflow count and history size, but only loosely. We've seen 100 simple workflows use more RAM than 1000 complex ones.

Configure resource requests and limits properly or face the OOMKilled nightmare. Use Vertical Pod Autoscaler to automatically adjust memory limits based on actual usage, but don't trust it blindly - VPA can kill pods during adjustment. Consider Horizontal Pod Autoscaling for Frontend and Matching services, and monitor with Kubernetes Resource Recommender and cAdvisor metrics.
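
If you want VPA's numbers without its evictions, run it in recommendation-only mode and apply the suggestions yourself. A sketch, assuming your History Deployment is named temporal-history:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: temporal-history-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: temporal-history   # assumption: match your actual Deployment name
  updatePolicy:
    updateMode: "Off"        # recommendations only - no automatic pod evictions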

CPU - Frontend and Matching are CPU-bound. History uses both CPU and memory. Start with 1-2 CPU cores per pod and scale horizontally when things slow down. Vertical scaling only goes so far.

Storage - Get fast disks. Use SSD-backed StorageClasses or your database will be the bottleneck. We burned through 3 days debugging "slow" Temporal before realizing our database was on spinning rust.

For AWS, use gp3 volumes with provisioned IOPS. For Azure, go with Premium SSD. Google Cloud's SSD persistent disks are solid too. Don't cheap out on storage - database IOPS bottlenecks will ruin your day.
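
If anything stateful does live in-cluster on AWS, this is roughly what a gp3-backed StorageClass looks like - the IOPS and throughput figures are placeholders, size them from measured load:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "6000"        # provisioned IOPS beyond the gp3 baseline
  throughput: "250"   # MiB/s
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer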

Now that you understand what each service does and how they'll fail, let's talk about the configuration that actually works. Because the default Helm chart will screw you faster than you can say "production deployment."

The Configuration That Actually Works in Production

The official Temporal Helm charts are a trap. They look official, they look complete, but they'll screw you over in production faster than you can say "shard ownership lost."

Don't Use the Default Helm Chart (Seriously)

The default chart deploys Cassandra, Elasticsearch, Prometheus, and Grafana alongside Temporal. It's a nice development setup. For production? It's a disaster waiting to happen. These bundled services will break under any real load.

I deployed the default Helm chart once in production with Temporal version 1.25.1. The bundled Cassandra ran out of memory after 6 hours with just 200 workflows. Elasticsearch threw java.lang.OutOfMemoryError: Java heap space errors. Prometheus couldn't scrape metrics fast enough and started dropping them. It was a clusterfuck that taught me to never trust defaults in production.

Here's the configuration that won't leave you debugging at 3am. First, disable ALL the bundled crap and point to managed services:

cassandra:
  enabled: false
elasticsearch:
  enabled: false
prometheus:
  enabled: false  
grafana:
  enabled: false

server:
  config:
    persistence:
      default:
        driver: "sql"
        sql:
          driver: "postgres"
          host: "your-actual-postgres-host.rds.amazonaws.com"
          port: 5432
          database: "temporal"
          user: "temporal_user"
          # Add the password via secret, not here
      visibility:
        driver: "sql"
        sql:
          driver: "postgres"
          host: "your-actual-postgres-host.rds.amazonaws.com"  
          port: 5432
          database: "temporal_visibility"
          user: "temporal_user"
    numHistoryShards: 512  # THIS CANNOT BE CHANGED LATER
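
About that password comment: keep the credential in a Kubernetes Secret (ideally synced from a secret manager via External Secrets Operator) and reference it from the chart. Exactly which values key wires in an existing secret varies by chart version, so check your chart's values.yaml. A minimal Secret sketch:

apiVersion: v1
kind: Secret
metadata:
  name: temporal-db-credentials
type: Opaque
stringData:
  password: "use-a-real-secret-manager-not-git"   # placeholder - inject from Vault/ESO in practice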

The Shard Count Decision That Will Haunt You - You get exactly one chance to set the number of shards. Choose wrong and you rebuild your cluster from scratch.

Start with 512 shards unless you absolutely know you need more. We started with 4 (the default) and regretted it within a week when we hit this lovely error message: failed to acquire shard ownership, shard 2 is already owned by another host. That error message became my nemesis. 4,096 shards sounds impressive but you'll spend more on infrastructure than most startups raise in Series A.

Resource Limits That Don't Suck

These resource limits look reasonable until your first memory spike teaches you how wrong you were:

server:
  frontend:
    replicaCount: 3
    resources:
      requests:
        cpu: "1"
        memory: "2Gi"
      limits:
        cpu: "2"       # Don't forget CPU limits 
        memory: "4Gi"  # Frontend is usually well-behaved
  history:
    replicaCount: 4
    resources:
      requests:
        cpu: "2"
        memory: "6Gi"  # Start higher than you think
      limits:
        cpu: "4"
        memory: "12Gi" # History pods will eat this and ask for more
  matching:
    resources:
      requests:
        cpu: "1"
        memory: "2Gi"
      limits:
        cpu: "2"
        memory: "4Gi"

Pod Disruption Budgets - These keep your cluster from nuking all History pods during maintenance. Configure PDBs or watch your workflows fail during routine K8s updates:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: temporal-history-pdb
spec:
  minAvailable: 50%  # Keep half running during updates
  selector:
    matchLabels:
      app.kubernetes.io/component: history

Anti-Affinity - Spread pods across nodes or watch everything fail when one node dies:

server:
  history:
    affinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
                - key: "app.kubernetes.io/component"
                  operator: In
                  values: ["history"]
            topologyKey: "kubernetes.io/hostname"  # Different nodes

The Metrics That Actually Matter

Temporal spits out hundreds of metrics. Most are useless noise. Here are the ones that will save your ass:

History shard lock latency - Keep this under 5ms or your workflows will start lagging. Above 10ms and you're in trouble. Monitor this religiously with proper Prometheus configuration.

Schedule-to-start latency - How long tasks wait in queues. Above 200ms means workers aren't keeping up or your poll sync rates are fucked. Check task queue metrics documentation for proper monitoring setup and worker tuning guides for optimization.

Database connection pool exhaustion - The silent killer. Everything looks fine until suddenly all connections are gone and Temporal can't do anything. Configure proper database connection limits and monitor with database-specific metrics. For PostgreSQL, use PgBouncer connection pooling and monitor with postgres_exporter.
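
Here's a hedged sketch of alert rules for those three as a Prometheus Operator PrometheusRule. The metric names are assumptions that vary by server version, SDK, and metrics config - verify them against your /metrics output before paging anyone on them:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: temporal-core-alerts
spec:
  groups:
    - name: temporal.critical
      rules:
        - alert: TemporalShardLockLatencyHigh
          # metric name is an assumption - confirm against your server's /metrics output
          expr: histogram_quantile(0.95, sum(rate(shard_lock_latency_bucket[5m])) by (le)) > 0.01
          for: 10m
          labels:
            severity: page
          annotations:
            summary: "History shard lock p95 latency above 10ms"
        - alert: TemporalScheduleToStartHigh
          # SDK-side metric; exact name varies by SDK and version
          expr: histogram_quantile(0.95, sum(rate(temporal_activity_schedule_to_start_latency_bucket[5m])) by (le)) > 0.2
          for: 10m
          labels:
            severity: page
          annotations:
            summary: "Tasks waiting more than 200ms in queues - workers not keeping up"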

TLS Because Security Theater

Production clusters need TLS. It's annoying but required:

server:
  config:
    tls:
      internode:
        server:
          certFile: "/etc/temporal/config/certs/tls.crt"
          keyFile: "/etc/temporal/config/certs/tls.key"
        client:
          serverName: "temporal-frontend"

Use cert-manager to automate certificate issuance and rotation, or you'll be manually updating certs at 3am when they expire. Set up certificate expiry monitoring with alerts (certificate-expiry-monitor or cert-manager's own metrics work) and let cert-manager handle renewals. Don't be like me - I hard-coded a cert that expired on a Friday night, took down the entire workflow system, and spent the weekend explaining to stakeholders why "certificate renewal" isn't something we can just restart.
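
A minimal cert-manager Certificate sketch for the internode certs - the issuer name, namespace, and DNS names are assumptions you'll need to match to your actual serverName and service names. Mount the resulting Secret at the paths used in the TLS config above:

apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: temporal-internode
spec:
  secretName: temporal-internode-tls
  duration: 2160h      # 90 days
  renewBefore: 720h    # renew 30 days before expiry
  dnsNames:
    - temporal-frontend
    - temporal-frontend.temporal.svc.cluster.local   # assumption: "temporal" namespace
  issuerRef:
    name: internal-ca                                # assumption: your ClusterIssuer name
    kind: ClusterIssuer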

Backups (Because Disaster Will Strike)

Your database IS Temporal. Lose the database, lose everything. Managed database backups are your friend. Point-in-time recovery saved our asses when someone accidentally truncated a table. Follow Temporal database backup best practices and implement automated backup verification with backup monitoring tools.

Also backup your Helm values and K8s manifests. Store them in git with GitOps workflows and use Kubernetes backup solutions like Velero for complete cluster backup and Sealed Secrets for secure secret management. When shit hits the fan, you want to rebuild quickly, not reverse-engineer your configuration from a failing cluster.
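
If you're already running Velero, a nightly Schedule for the Temporal namespace is cheap insurance (namespace names here are assumptions). Note this covers manifests and PVC metadata, not your external database:

apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: temporal-nightly
  namespace: velero
spec:
  schedule: "0 3 * * *"   # nightly, because disasters prefer weekends
  template:
    includedNamespaces:
      - temporal
    ttl: 720h             # keep 30 days of cluster-state backups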

Upgrades Are Terrifying

Temporal upgrades require precise sequencing:

  1. Database schema migration with temporal-sql-tool using proper upgrade procedures and schema versioning.
  2. Worker services first
  3. Matching and Frontend services
  4. History services LAST

Get the order wrong and you'll corrupt workflow state. Test this in staging. Seriously. Don't wing it in production.

Speaking of choices that will bite you later, let me break down the different deployment approaches so you can pick your poison wisely.

Temporal Kubernetes Deployment Options Comparison

Temporal Cloud
  • Setup complexity: Minimal - SDK integration only
  • Production readiness: Actually works without 3am debugging
  • Scaling: Automatic horizontal scaling
  • Maintenance overhead: None - fully managed
  • Cost: ~$200/month quickly becomes ~$2000/month at scale

Official Helm Charts
  • Setup complexity: Moderate - requires Kubernetes expertise
  • Production readiness: Requires serious production hardening
  • Scaling: Manual scaling that you'll fuck up initially
  • Maintenance overhead: Medium - cluster management plus late-night outages
  • Cost: Infrastructure costs + your sanity

Manual Kubernetes Deployment
  • Setup complexity: High - masochists only
  • Production readiness: Full control over your own suffering
  • Scaling: Highly customizable ways to break things
  • Maintenance overhead: High - you own every failure
  • Cost: Infrastructure costs + 60-hour work weeks

Managed Kubernetes Services
  • Setup complexity: Moderate - platform-specific configuration
  • Production readiness: Good with proper configuration
  • Scaling: Platform-integrated scaling
  • Maintenance overhead: Medium - shared with cloud provider
  • Cost: Higher infrastructure costs but reduced complexity

Docker Compose
  • Setup complexity: Low - single command deployment
  • Production readiness: Development only - not production ready
  • Scaling: Limited vertical scaling only
  • Maintenance overhead: Low for development environments
  • Cost: Minimal for development, unsuitable for production

Questions I Wish Someone Had Answered Before My First Production Deployment

Q: How many shards should I use before everything breaks?

A: Here's the dirty truth: start with 512 shards unless you absolutely know you need more. I deployed with 4 shards (the default) because the docs made it sound fine. It wasn't fine.

After 1000 workflows, History pods were fighting over shard ownership. The exact error was: Error 1006: shard 2 ownership lost, current owner: temporal-history-abc123, new owner: temporal-history-def456. That error flooded our logs, workflows got stuck in Running state forever, and users couldn't create new workflows. We had to rebuild the entire cluster with 512 shards. Two days of downtime explaining to management why our "simple deployment" broke everything.

Monitor shard lock latency religiously. Above 5ms consistently? Your next cluster rebuild needs more shards. There's no in-place upgrade path for this. Zero.

Q: Why does my History pod keep eating all the memory and then dying?

A: Because History services are greedy bastards. They cache workflow execution data in memory and they never say no to more RAM. A History pod will happily consume whatever memory limit you give it, then ask for more.

The exact error you'll see is Signal: killed (9) followed by Reason: OOMKilled in your pod events - Kubernetes killed the pod because it exceeded its memory limit. Our largest History pod currently uses 12GB and growing. We started with 4GB limits (OOMKilled within 2 hours), then 8GB (lasted a day), now 12GB (stable for weeks).

We've seen simple workflows with long histories consume 6GB while complex workflows with short histories use 2GB. Memory usage is driven by workflow patterns and active workflow count, not just throughput.

Start with 8-12GB per History pod minimum. Set memory limits 20% higher than requests or watch them get OOMKilled during memory spikes. Been there, debugged that at 3am with angry users.

Q: How do I stop the database connection errors from ruining my life?

A: First, never trust the default connection pool settings. 10-20 connections per pod sounds reasonable until you have 12 pods and suddenly your database is drowning in connections. You'll get "too many connections" errors and everything stops working. The exact error looks like FATAL: remaining connection slots are reserved for non-replication superuser connections (PostgreSQL) or ERROR 1040 (HY000): Too many connections (MySQL). Fix this by setting maxConns: 5 per pod in your database configuration instead of the default 20.

Set up proper connection pooling and configure retry logic. Add readiness probes that actually test database connectivity - don't just check if the process is running. My probe failed miserably: it checked if the port was open but didn't verify the database was accepting connections.

Also, make sure your K8s cluster can actually reach your database. Sounds obvious, but I spent 2 hours debugging dial tcp 10.0.1.100:5432: connect: connection refused before realizing the security group rule was only allowing port 3306 (MySQL) while we were running PostgreSQL on 5432.

Use external secrets for credentials or you'll have database passwords scattered across YAML files like a security audit nightmare.
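
Those limits live in the same persistence block as the host settings. The key names below are my reading of the Temporal SQL persistence config - double-check them against the config reference for your server version:

server:
  config:
    persistence:
      default:
        sql:
          maxConns: 5            # per pod - multiply by replica count before blaming the database
          maxIdleConns: 5
          maxConnLifetime: "1h"  # recycle connections so failovers don't strand stale ones
      visibility:
        sql:
          maxConns: 5
          maxIdleConns: 5
          maxConnLifetime: "1h"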

Q: Why did the default Helm chart destroy my production cluster?

A: Because the default chart is a development environment, not a production deployment. It includes Cassandra, Elasticsearch, Prometheus, and Grafana - all configured for development with minimal resources.

I deployed the default chart to production once. The bundled Cassandra fell over within hours. Elasticsearch ran out of memory. Prometheus couldn't scrape metrics fast enough. It was a clusterfuck.

Never deploy the default Helm chart to production. Disable all bundled services, bump the shard count from 4 to 512+, configure real resource limits, add pod disruption budgets, set up anti-affinity rules, and connect to actual managed databases. The default chart is a trap.

Q: What metrics actually matter when everything's on fire?

A: Temporal exposes hundreds of metrics. Most are noise. When your pager goes off at 3am, check these three:

  1. Shard lock latency - Above 10ms means History pods are struggling. Above 20ms means workflows will start failing.
  2. Schedule-to-start latency - How long tasks sit in queues. Above 200ms means workers can't keep up or polling is broken.
  3. Database connection pool utilization - When this hits 100%, everything stops working and the errors are confusing as hell.

Set up Prometheus + Grafana with the included dashboards, but be warned: Temporal's metric cardinality can overwhelm smaller monitoring stacks. We crashed our Prometheus twice before tuning retention policies.

Q: What does "shard ownership lost" actually mean and how do I make it stop?

A: This error makes experienced engineers want to quit. It means History pods are fighting over who owns which shards, usually because:

  • History pods are getting OOMKilled due to memory limits that are too low
  • CPU throttling is making History pods too slow to maintain shard ownership
  • Network connectivity to the database is flaky
  • You're restarting pods too aggressively during deployments

Fix: Give History pods more memory, ensure stable database connections, and implement proper readiness probes. Also, don't restart all History pods at once during deployments - stagger the restarts or you'll trigger shard ownership battles.
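
For the "don't restart everything at once" part, a conservative rolling-update strategy plus a generous termination grace period does most of the work. Where these fields land in your Helm values depends on the chart version, so treat this as the shape of the Deployment spec rather than exact chart keys:

spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1   # move shard ownership one pod at a time
      maxSurge: 0
  template:
    spec:
      terminationGracePeriodSeconds: 120   # give shards time to hand off before the kill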

Q: How do I upgrade Temporal without destroying everything?

A: Upgrades are terrifying. Get the order wrong and you'll corrupt workflow state. Here's the sequence that won't ruin your week:

  1. Database schema migration with temporal-sql-tool FIRST
  2. Worker services
  3. Matching and Frontend services
  4. History services LAST (they're the most sensitive)

Use rolling updates with pod disruption budgets. Test this exact process in staging with production-like load. Don't assume it'll work fine in prod because it worked with zero traffic.

Q: My workers aren't picking up tasks - what's broken now?

A: High schedule-to-start latency usually means:

  • Not enough worker pods (scale horizontally first)
  • Worker polling configuration is fucked (tune the polling - 5-10 activity pollers, 10-20 workflow pollers per worker)
  • Matching service is CPU-bound (scale those pods too)

Monitor poll sync rate. Should be above 99%. Below that means something's broken in the task distribution chain.

Q: How much disk space does Temporal actually need?

A: Depends entirely on your retention policies and workflow patterns. Database storage grows roughly 1-10GB per million workflow executions, but that's a wild guess until you measure your actual usage.

Use SSD storage classes for the database or IOPS will be your bottleneck. Set up proper retention policies or your database will grow forever and your DBA will hate you.

Q: How do I recover from disaster when everything's on fire?

A: Your database IS Temporal. Everything else is stateless. Focus on database backup and recovery:

  • Use managed database point-in-time recovery (saved our asses multiple times)
  • Store your Helm values and K8s manifests in git (GitOps approach)
  • Test recovery procedures regularly - not when disaster strikes

When shit hits the fan, restore the database first, then rebuild the K8s cluster from your stored configs. Should take 30 minutes if you've prepared, 6 hours if you're winging it.
