Temporal on Kubernetes isn't like deploying a web app. I thought it was. The first time I deployed it, I used the default Helm chart, didn't change anything, and pushed it to prod. Big mistake.
The system seemed fine for about 6 hours. Then at 2am, everything broke. Workflows stopped progressing. The History pods were OOM-killing themselves. Database connections were maxed out. Our on-call engineer (me) spent the next 4 hours figuring out what went wrong.
The Four Services That Will Ruin Your Night
Temporal has four core services and each one has its own special way of breaking in production:
Frontend - The API gateway that looks innocent but will bottleneck you at scale. It's CPU-bound, so when traffic spikes, it just... stops responding. You'll see "context deadline exceeded" errors everywhere and wonder why your perfectly good workflows are hanging. Scale this first when things get weird. I learned this at 2:30am when our entire workflow system froze because one Frontend pod couldn't handle the load from our batch job processing.
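If you want a concrete starting point, here's a minimal sketch of a CPU-based HPA for the Frontend deployment. The name temporal-frontend is a guess at what the official Helm chart produces for a release called temporal - check your actual Deployment names and tune the thresholds to your own traffic.

```yaml
# Sketch: CPU-based HPA for the Frontend service.
# Assumes a Deployment named "temporal-frontend" and metrics-server installed.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: temporal-frontend
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: temporal-frontend
  minReplicas: 2          # never run a single Frontend pod in prod
  maxReplicas: 8
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out before CPU saturates
```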
History - This is the one that will eat all your RAM and ask for seconds. History services cache workflow execution data and they're greedy as hell. Each History pod can easily consume 8GB+ of memory in production. The kicker? The shard count is fixed when the cluster is first created and cannot be changed. Ever. Choose wrong and you rebuild your entire cluster.
I learned this the hard way when we deployed with 4 shards (the default) and hit 1000+ workflows. History pods were fighting over shard ownership, causing "shard ownership lost" errors. We had to rebuild everything with 512 shards. Two days of downtime.
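For reference, the shard count lives under server.config in the Helm chart's values. Here's a rough sketch of what I'd set today - key names can shift between chart versions, so diff this against the chart's own values.yaml before copying it:

```yaml
# values.yaml sketch - shard count and History resources.
# numHistoryShards is immutable once the cluster exists, so pick a number
# you can live with for years. The memory numbers are starting points.
server:
  config:
    numHistoryShards: 512
  history:
    resources:
      requests:
        cpu: "2"
        memory: 4Gi     # start around 4GB per History pod...
      limits:
        memory: 12Gi    # ...and leave headroom; our biggest pod sits at 12GB
```

512 is what we ended up rebuilding with. Erring high on shards costs you a little overhead; erring low costs you another rebuild.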
Matching - Handles task queues and if you tune it wrong, workflows just... sit there. Forever. Tasks get queued but never picked up. Workers are idle but tasks aren't being delivered. It's infuriating to debug because everything looks healthy until you dig into the queue metrics. I once spent 4 hours chasing a bug where workflows would start but never progress past the first activity - turned out the Matching service couldn't keep up with the poll requests from our 50 worker pods.
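One knob worth knowing about: task queues default to only a handful of partitions, which caps how much load Matching can spread out. Here's a hedged sketch of bumping the partition counts through the chart's dynamic config - verify these key names against your Temporal version, because they've changed across releases:

```yaml
# values.yaml sketch - give busy task queues more partitions so a big worker
# fleet isn't all hammering the same Matching partition. Read and write
# partition counts should normally match.
server:
  dynamicConfig:
    matching.numTaskqueueReadPartitions:
      - value: 8
        constraints: {}
    matching.numTaskqueueWritePartitions:
      - value: 8
        constraints: {}
```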
Worker - These aren't part of the Temporal server itself; they're your own pods running the SDK workers that execute your actual workflow and activity code. If the ratio of workers to tasks is wrong, you'll either have idle resources burning money or backed-up queues making users angry.
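Your workers are just Deployments you own and run yourself. A bare-bones sketch - the image, names, env var, and resource numbers here are placeholders, and 7233 is the Frontend's default gRPC port:

```yaml
# Sketch: a worker Deployment running alongside the Temporal cluster.
# "my-registry/order-worker" is a placeholder image containing your SDK worker code.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-worker
spec:
  replicas: 4               # tune against queue backlog, not a fixed number
  selector:
    matchLabels:
      app: order-worker
  template:
    metadata:
      labels:
        app: order-worker
    spec:
      containers:
        - name: worker
          image: my-registry/order-worker:1.0.0
          env:
            - name: TEMPORAL_ADDRESS            # read by your own worker code
              value: "temporal-frontend:7233"   # Frontend service, default gRPC port
          resources:
            requests:
              cpu: "1"
              memory: 1Gi
            limits:
              memory: 2Gi
```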
Database Choices That Matter (And Ones That Don't)
Temporal needs a database. Pick from PostgreSQL, MySQL, Cassandra, or SQLite (SQLite is dev only, obviously). Your choice affects everything else.
PostgreSQL/MySQL - Go with managed services like Amazon RDS, Google Cloud SQL, or Azure Database. Trust me on this. Running your own database in K8s sounds cool until it breaks at 3am and you're trying to recover data from persistent volumes while your CEO asks when workflows will work again.
Running PostgreSQL on Kubernetes is possible with operators like Zalando's Postgres Operator or CrunchyData PGO, but the operational overhead is massive. You need backup strategies, connection pooling with PgBouncer, WAL archiving, performance tuning, query insight via pg_stat_statements, and metrics via the PostgreSQL Exporter, plus proper security configuration and upgrade procedures. Managed services handle all this bullshit for you.
The temporal-sql-tool handles schema setup. It's straightforward but you need TWO databases - one for core Temporal data and another for visibility (search) data. Yes, two. I know it's annoying.
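For what it's worth, here's roughly what the two stores look like in the chart's values when you point them at an external Postgres. Treat it as a sketch - the exact keys (driver string, secret handling) vary by chart version, and the hostname and secret name below are placeholders:

```yaml
# values.yaml sketch - two separate databases on the same RDS instance:
# "temporal" for core data, "temporal_visibility" for search/visibility data.
server:
  config:
    persistence:
      default:
        driver: sql
        sql:
          driver: postgres12
          host: my-rds-instance.abc123.us-east-1.rds.amazonaws.com   # placeholder
          port: 5432
          database: temporal
          user: temporal
          existingSecret: temporal-db-password   # placeholder Secret name
          maxConns: 20    # keep Temporal from eating the whole RDS connection pool
      visibility:
        driver: sql
        sql:
          driver: postgres12
          host: my-rds-instance.abc123.us-east-1.rds.amazonaws.com   # placeholder
          port: 5432
          database: temporal_visibility
          user: temporal
          existingSecret: temporal-db-password
          maxConns: 10
```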
Cassandra - Only choose this if you hate yourself or actually need the scale. Cassandra in Kubernetes is a nightmare. Sure, there's the Cassandra Operator and K8ssandra for enterprise deployments, but you'll spend more time managing the database than your actual workflows. You need proper ring topology, JVM tuning, compaction strategies, and monitoring with cassandra-exporter.
Plus, Cassandra can't handle visibility data, so you also need Elasticsearch with proper cluster setup, index management, and its own monitoring via the Elasticsearch Exporter. Now you're managing two complex distributed systems instead of one simple PostgreSQL instance. I tried this approach once - spent 3 weeks getting Cassandra stable only to watch Elasticsearch nodes randomly die during peak load. I ended up switching back to RDS Postgres and sleeping better at night.
Resource Planning (AKA Guessing Until It Works)
Here's the dirty truth: nobody knows the exact resources you'll need until you hit production load. You can follow a load-measure-scale methodology all you want, but production always surprises you.
Memory - History services are memory hogs. Start with 4GB per History pod, but watch it grow. Our largest History pod consumes 12GB and counting. Memory usage correlates with active workflow count and history size, but only loosely. We've seen 100 simple workflows use more RAM than 1000 complex ones.
Configure resource requests and limits properly or face the OOMKilled nightmare. Use the Vertical Pod Autoscaler (VPA) to right-size memory based on actual usage, but don't trust it blindly - in its automatic modes, VPA evicts pods to apply new requests. Consider Horizontal Pod Autoscaling for the Frontend and Matching services, and keep an eye on resource recommendations and cAdvisor metrics.
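If you do run VPA against History, start it in recommendation-only mode so it can't evict pods out from under you, then bump limits by hand from what it suggests. A sketch, assuming the VPA CRDs are installed and a Deployment named temporal-history:

```yaml
# Sketch: VPA in recommendation-only mode for the History service.
# updateMode "Off" means it only publishes suggestions; nothing gets evicted.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: temporal-history
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: temporal-history
  updatePolicy:
    updateMode: "Off"
```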
CPU - Frontend and Matching are CPU-bound; History is heavy on both CPU and memory. Start with 1-2 CPU cores per pod and scale horizontally when things slow down. Vertical scaling only goes so far.
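For a concrete starting point on the CPU-bound services, this is roughly where I'd begin in the chart's values - per-service keys can shift between chart versions, and the memory numbers are placeholders, so tune from your own metrics:

```yaml
# values.yaml sketch - starting CPU/memory for the CPU-bound services.
# 1 core requested, 2 cores as the ceiling, per the "1-2 cores per pod" rule of thumb.
server:
  frontend:
    resources:
      requests:
        cpu: "1"
        memory: 512Mi
      limits:
        cpu: "2"
        memory: 1Gi
  matching:
    resources:
      requests:
        cpu: "1"
        memory: 512Mi
      limits:
        cpu: "2"
        memory: 1Gi
```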
Storage - Get fast disks. Use SSD-backed StorageClasses, or your database will be the bottleneck. We burned through 3 days debugging "slow" Temporal before realizing our database was on spinning rust.
For AWS, use gp3 volumes with provisioned IOPS. For Azure, go with Premium SSD. Google Cloud's SSD persistent disks are solid too. Don't cheap out on storage - database IOPS bottlenecks will ruin your day.
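For the AWS case, here's a sketch of a gp3 StorageClass with provisioned IOPS for anything stateful you do keep in-cluster. It assumes the EBS CSI driver is installed, and the IOPS and throughput numbers are starting points, not gospel:

```yaml
# Sketch: SSD-backed gp3 storage with provisioned IOPS.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-gp3
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "6000"
  throughput: "250"       # MiB/s
volumeBindingMode: WaitForFirstConsumer   # volume lands in the same zone as the pod
allowVolumeExpansion: true
reclaimPolicy: Retain
```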
Now that you understand what each service does and how it will fail you, let's talk about the configuration that actually works. Because the default Helm chart will screw you faster than you can say "production deployment."