
Temporal + Kubernetes + Redis: Production Microservices Architecture

Configuration Specifications

Temporal Server Production Settings

  • Version: Use 1.25.0 specifically (avoid 1.26.0 - contains memory leak bug)
  • Memory Requirements: 4Gi requests, 8Gi limits (2Gi causes OOM kills under load)
  • CPU Requirements: 2 requests, 4 limits (spikes during cluster resharding)
  • Critical Issue: History service consumes excessive memory during workflow replays
  • Production Capacity: ~80K concurrent workflows before degradation occurs
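The settings above can be written down as a Kubernetes resource spec. This is an illustrative sketch, not an excerpt from an official Temporal Helm chart; the deployment and container names are assumptions:

```yaml
# Hypothetical Deployment fragment for the Temporal history service,
# reflecting the numbers above (names are illustrative)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: temporal-history
spec:
  template:
    spec:
      containers:
        - name: temporal-history
          image: temporalio/server:1.25.0   # pinned; 1.26.0 has the memory leak
          resources:
            requests:
              memory: "4Gi"   # 2Gi causes OOM kills under load
              cpu: "2"
            limits:
              memory: "8Gi"   # headroom for workflow replays
              cpu: "4"        # spikes during cluster resharding
```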

Redis Configuration Requirements

  • Version: Redis 7.2 (7.4 breaks pub/sub, 7.3 has memory issues)
  • Memory Allocation: 2Gi minimum (unpredictable growth with pub/sub)
  • Performance Degradation: Sub-1ms until 85% memory utilization, then exponential slowdown
  • Failure Mode: Connection refused errors when OOM occurs
  • Database Separation: Use different Redis databases (0-15) for environment isolation
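A redis.conf sketch matching these requirements. The directives are standard Redis; the eviction policy is my assumption (note that pub/sub client buffers are not subject to key eviction, so `maxmemory` alone will not cap pub/sub growth):

```conf
# Hypothetical redis.conf excerpt reflecting the guidance above
maxmemory 2gb                   # 2Gi floor; pub/sub growth is unpredictable
maxmemory-policy allkeys-lru    # evict keys instead of erroring out near OOM
databases 16                    # default; gives databases 0-15 for env isolation
```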

Kubernetes Resource Specifications

# Production-tested resource limits
resources:
  requests:
    memory: "1Gi"    # Order service baseline
    cpu: "500m"
  limits:
    memory: "4Gi"    # Account for traffic spikes
    cpu: "2"

Critical Failure Modes and Solutions

Temporal Workflow Failures

  • Error Pattern: ACTIVITY_TASK_FAILED with RPC timeout messages
  • Behavior: Workflows appear dead but retry every 30 seconds automatically
  • Recovery Time: Can pause 4+ hours during infrastructure failures and resume normally
  • Critical Setting: Exhausting PostgreSQL connection limits surfaces as "pq: sorry, too many clients already" errors; size connection pools against max_connections
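The "too many clients" failure is usually arithmetic: every replica of every service holds its own pool, and the sum has to stay under PostgreSQL's max_connections. A back-of-envelope check (the service names and pool sizes below are illustrative, not from this deployment):

```python
def total_pg_connections(services):
    """Sum (replicas * pool_size) across services sharing one PostgreSQL."""
    return sum(replicas * pool_size for replicas, pool_size in services.values())

# Illustrative deployment: Temporal persistence plus app services
services = {
    "temporal-history":  (3, 20),   # 3 replicas, 20-connection pool each
    "temporal-matching": (3, 10),
    "order-service":     (5, 10),
}

max_connections = 100                         # PostgreSQL default
needed = total_pg_connections(services)       # 140: already over the limit
headroom = max_connections - needed           # negative means "too many clients"
```

With the default max_connections of 100, this hypothetical layout is already 40 connections over before admin sessions or replicas are counted, which is exactly when the pq error appears.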

Redis Memory Exhaustion

  • Warning Threshold: 80% memory usage triggers performance degradation
  • Failure Impact: 3am outages common when memory fragmentation occurs
  • Fallback Strategy: Services must rebuild state from databases when Redis unavailable
  • Recovery Pattern: 500ms response times vs normal 50ms during Redis outage
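The fallback strategy above is plain cache-aside with the database as source of truth. A minimal sketch with an in-memory stand-in for Redis (real code would catch redis.ConnectionError from an actual client; everything here is a stub):

```python
class CacheUnavailable(Exception):
    """Stands in for a Redis connection error when the server is OOM or down."""

class FlakyCache:
    """Dict-backed stub that can simulate a Redis outage."""
    def __init__(self, up=True):
        self.up = up
        self.store = {}
    def get(self, key):
        if not self.up:
            raise CacheUnavailable()
        return self.store.get(key)
    def set(self, key, value):
        if not self.up:
            raise CacheUnavailable()
        self.store[key] = value

def get_order(order_id, cache, db):
    """Cache-aside read: try the cache, fall back to the database on failure."""
    try:
        cached = cache.get(order_id)     # sub-1ms when Redis is healthy
        if cached is not None:
            return cached
    except CacheUnavailable:
        pass                             # degrade to DB reads (~50ms -> ~500ms)
    value = db[order_id]                 # database remains the source of truth
    try:
        cache.set(order_id, value)       # repopulate once Redis returns
    except CacheUnavailable:
        pass
    return value
```

The point of the sketch: a Redis outage changes latency, not correctness, because every read path ends at the database.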

Service Discovery Failures

  • DNS Resolution: Kubernetes DNS updates automatically when pods die
  • Network Patterns: Use service-name:port format, avoid hardcoded IPs
  • Failure Recovery: 200-800ms latency during deployments, 5+ seconds when nodes drain

Performance Baselines and Thresholds

Real-World Production Metrics

  • Daily Workflow Volume: 45-60K executions (peaks at 75K during marketing campaigns)
  • Service-to-Service Latency: 8-25ms optimal, 200-800ms during deployments
  • End-to-End Transaction Time: 150-800ms normal, 3-15 seconds when external APIs fail
  • System Uptime: 99.87% achieved (0.13% downtime primarily Redis memory issues)

Scaling Breakpoints

  • PostgreSQL Connection Limits: Temporal bottleneck at high traffic
  • Redis Operations: Millions ops/sec until memory fragmentation
  • Kubernetes Pod Scaling: Works until node capacity limits
  • Traffic Spike Handling: 10x traffic managed with database connection scaling (20→100 connections)

Resource Requirements and Costs

Infrastructure Costs (Production System ~50K workflows/day)

  • Managed Kubernetes: $800-900/month (3-node cluster with autoscaling spikes)
  • Managed Redis: $200-250/month (ElastiCache Multi-AZ)
  • Database (PostgreSQL): $300-400/month (includes read replica for scaling)
  • Load Balancing: $100-130/month
  • Monitoring/Logging: $120-200/month
  • Total Monthly Cost: $1,600-1,900 + engineering overhead

Learning Curve Investment

  • Team Onboarding: 3 months minimum to stop breaking production
  • Temporal Complexity: Hardest component (workflow vs HTTP handler mindset shift)
  • Implementation Risk: Teams spend 6 months over-architecting before processing first order

Architecture Decision Trade-offs

Technology Comparison Matrix

| Pattern | Complexity | Reliability | Performance | Operational Overhead | Critical Limitations |
|---|---|---|---|---|---|
| Temporal+K8s+Redis | Medium (3 systems to debug) | Excellent (auto-retry until success) | High (until Redis OOM) | Medium (managed services help) | Memory consumption unpredictable |
| Pure Kubernetes+gRPC | Low | Good (until pod crashes mid-request) | Excellent | Low | No state coordination |
| Event-Driven+Kafka | High | Good ("at-least-once" = duplicate processing) | Variable (fast until Kafka issues) | High (PhD-level complexity) | Event ordering nightmares |
| API Gateway+Database | Low | Poor (single point of failure) | Good (until connection limits) | Medium | Doesn't scale horizontally |

Implementation Patterns That Work

Service Communication Architecture

  1. Workflow Coordination: Long-running business processes via Temporal (order processing, payments, reconciliation)
  2. Real-Time Messaging: Fast inter-service communication via Redis Streams/pub-sub (inventory updates, notifications)
  3. Shared State: Session data, distributed locks, caching via Redis data structures
  4. Service Discovery: Kubernetes DNS for service location (http://payment-service:8080)
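Point 4 deserves one concrete rule: build URLs from Kubernetes DNS names, never pod IPs. A tiny helper sketch (the function is hypothetical; the DNS naming scheme is standard Kubernetes):

```python
def service_url(name, port, namespace=None):
    """Cluster-internal URL via Kubernetes DNS; never hardcode pod IPs.

    Same-namespace callers can use the bare service name; cross-namespace
    callers need the fully qualified <svc>.<ns>.svc.cluster.local form.
    """
    host = f"{name}.{namespace}.svc.cluster.local" if namespace else name
    return f"http://{host}:{port}"

service_url("payment-service", 8080)          # same-namespace shorthand
service_url("payment-service", 8080, "prod")  # fully qualified, cross-namespace
```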

Data Flow Pattern

  1. Client request → API Gateway (K8s ingress)
  2. Service validates → starts Temporal workflow
  3. Workflow coordinates → Payment/Inventory/Notification services
  4. Services communicate → Redis for state sharing
  5. K8s manages → health, scaling, routing
  6. Workflow completion → final notifications and audit
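The main draw of Temporal is that the flow above reads as straight-line code. This sketch shows the shape of the orchestration only; a real version would be a Temporal workflow invoking activities through the SDK, and `call` here is a stand-in for those activity invocations:

```python
def process_order(order, call):
    """Sequential orchestration sketch of the six-step flow above.

    `call(service, action, payload)` stands in for a Temporal activity;
    the SDK, not this code, handles retries and replay after crashes.
    """
    call("payment-service", "charge", order)        # step 3: coordinate services
    call("inventory-service", "reserve", order)
    call("notification-service", "confirm", order)  # step 6: final notification
    return {"order_id": order["id"], "status": "completed"}

# Usage with a recording stub instead of real HTTP/Redis calls:
calls = []
result = process_order({"id": "o-1"},
                       lambda svc, action, payload: calls.append((svc, action)))
```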

Critical Warnings and Gotchas

Production Deployment Issues

  • Database Connection Pools: Set 100+ connections for high traffic (default 20 causes failures)
  • Environment Isolation: Separate Redis databases/K8s namespaces (dev workflows processing live payments = disaster)
  • Memory Monitoring: Redis OOM at 85% usage causes cascade failures
  • Version Pinning: Avoid :latest tags (use specific versions for stability)

Debugging Requirements

  • Correlation IDs: Essential for tracing requests across all three systems
  • Centralized Logging: Required to reconstruct failures at 3am
  • Temporal Web UI: Only reliable way to debug stuck workflows
  • Redis Monitoring: RedisInsight necessary for memory usage tracking

Common Implementation Mistakes

  • Tool Misuse: Don't use Temporal for simple HTTP calls, Redis as primary database, or K8s for business logic
  • Over-Architecture: Build one working end-to-end workflow before adding complexity
  • Schema Changes: Require backward compatibility across distributed system
  • Monitoring Gaps: Need alerts on workflow failures, Redis memory >80%, pod crashes, error rate spikes
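The monitoring gaps in the last bullet reduce to a handful of threshold checks. A sketch of the decision logic; the metric names and the 1% error-rate threshold are illustrative assumptions, while the 80% Redis threshold matches the warning level used throughout this document:

```python
def alerts(metrics):
    """Return alert names for the monitoring conditions called out above."""
    fired = []
    if metrics.get("redis_memory_pct", 0) > 80:
        fired.append("redis-memory-high")      # degradation starts ~80%, OOM ~85%
    if metrics.get("workflow_failures", 0) > 0:
        fired.append("workflow-failures")
    if metrics.get("pod_crashes", 0) > 0:
        fired.append("pod-crashes")
    if metrics.get("error_rate_pct", 0) > 1:   # illustrative spike threshold
        fired.append("error-rate-spike")
    return fired
```

In practice these would live as Prometheus alerting rules rather than application code; the sketch just pins down the thresholds.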

Success Metrics and Validation

Production Readiness Indicators

  • Workflows survive 4+ hour infrastructure outages without data loss
  • Services handle Redis failures gracefully (fallback to database queries)
  • Zero manual intervention required for service coordination
  • Sub-100ms service calls during normal operations
  • Automatic recovery from pod crashes and node failures

Performance Validation

  • End-to-end transaction processing: 150-800ms
  • Inter-service communication: 8-25ms optimal
  • Redis operations: Sub-1ms for cache hits
  • System availability: 99.87%+ uptime achievable
  • Traffic scaling: 10x capacity with proper database tuning

Useful Links for Further Investigation

Essential Resources for Implementation

| Link | Description |
|---|---|
| Temporal Documentation | Holy shit, docs that are actually useful and don't assume you have a PhD in distributed systems |
| Temporal Web UI Guide | This UI saved me from a nervous breakdown when 500 workflows got stuck in limbo last Tuesday |
| RedisInsight Documentation | The only way to figure out why Redis is eating 8GB of RAM to cache 50MB of data |
| Redis Performance Optimization | Memory tuning that might prevent Redis from OOMing at 3am (no guarantees) |
| Kubernetes Resource Management | CPU and memory limits that prevent pods from eating all your resources |
| Microservices Communication Patterns | How services should talk to each other without breaking everything |
| Circuit Breaker Pattern | Stop cascade failures before they kill your entire system |
| Prometheus Redis Exporter | Track Redis memory before it eats your entire cluster |
| Distributed Tracing with Jaeger | Trace requests across services when everything's broken and you don't know why |
| Site Reliability Engineering | Google's SRE practices when you need to run this shit at scale |
| Temporal Community Slack | Where you go at 4am to ask "why is my workflow stuck in PENDING for 6 hours" |
| Redis Discord Community | Real-time help when Redis is eating all your memory and you don't know why |
| CNCF Microservices Working Group | Industry standards and emerging patterns in cloud-native architectures |
| Temporal Community Forum | Long-form discussions about complex workflow patterns and production experiences |
