OpenTofu state is locked and I can't deploy. What the fuck?

Error you'll see: `Error: Error locking state: Error acquiring the state lock: ConditionalCheckFailedException`

What happened: DynamoDB state lock got stuck. Someone's deploy failed and the lock didn't release.

Fix this shit (after checking that nobody is actually mid-apply):

```bash
# Find the lock in the DynamoDB console and delete the item manually
# Or force unlock (dangerous but sometimes necessary); the lock ID is printed in the error output
tofu force-unlock 1a2b3c4d-5e6f-7g8h-9i0j-k1l2m3n4o5p6
```

Prevention: Set up DynamoDB TTL on your lock table so locks older than 1 hour get auto-deleted. TTL only removes items that carry an expiry attribute holding an epoch timestamp, so check that your lock items actually have one. Learned this the hard way after a Friday deploy went sideways and locked our state until Monday.

Atlantis webhook stopped working and deployments are queued forever

Error you'll see: HTTP 500 on webhook delivery, or no webhook delivery at all.

What broke:
- SSL certificate expired (check with `curl -I https://your-atlantis.com`)
- GitHub webhook got deleted somehow
- Database is full and Atlantis crashed
- Your load balancer health check is failing

Emergency fix: Restart the Atlantis container and pray. Then debug properly:

```bash
# Check if webhooks are being received
docker logs atlantis 2>&1 | grep -i webhook

# Check SSL cert expiry (s_client alone doesn't print the dates, so pipe the cert through x509)
echo | openssl s_client -connect your-atlantis.com:443 -servername your-atlantis.com 2>/dev/null \
  | openssl x509 -noout -enddate
```

Scalr charges me for failed runs. Is this normal?

Short answer: No. Scalr only charges for successful runs.

What actually happened: Your "failed" run probably succeeded but threw warnings. Check the run status in the Scalr console.

Common gotcha: `terraform plan` shows no changes but still counts as a successful run. Yeah, it's annoying.

Migration from Terraform Cloud broke our providers. Now what?

Error you'll probably see: `provider registry.terraform.io/hashicorp/aws v4.x.x doesn't exist`

The problem: Provider version pinning got fucked during migration.

Fix it: Update your provider constraints:

```hcl
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0" # Update this version
    }
  }
}
```

Run `terraform init -upgrade` and pray to the terraform gods.

GitHub Actions keeps timing out on large terraform plans

Timeout error: `The job running on runner GitHub Actions 47 has exceeded the maximum execution time of 360 minutes.`

Reality: Your infrastructure got too big for GitHub Actions default timeouts. Our EKS cluster plan takes 4 hours because we have 200 worker nodes.

Solutions that actually work:

```yaml
# Increase timeout in workflow.
# Note: GitHub-hosted runners cap a job at 360 minutes, so a higher value
# only takes effect on self-hosted runners.
jobs:
  terraform:
    timeout-minutes: 480 # 8 hours

# Or split your monolith into smaller modules (better long-term)
```

Pro tip: Use `terraform plan -out=plan.tfplan` to save the plan so the apply step reuses it instead of planning again.

CloudFormation error: "UPDATE_ROLLBACK_FAILED" - what does this mean?

The most useless error message in AWS history.

What actually happened:
- Resource was manually modified outside CloudFormation
- IAM permissions changed after resource creation
- Resource has dependencies that prevent rollback

Debug steps:
1. Go to CloudFormation console → Events tab
2. Find the actual resource error (usually buried 20 lines down)
3. Google the real error message
4. Consider `aws cloudformation continue-update-rollback --stack-name <your-stack>` as the nuclear option

Our terraform state file is 500MB and deployments are slow as hell

Problem: Terraform loads the entire state into memory. Big state = slow everything. Our 500MB state file makes `terraform plan` take 8 minutes. Terraform state performance is more consistent than my WiFi but that's not saying much.

Why your state is huge:
- Too many resources in one state file
- State bloat from deleted resources that didn't get cleaned up
- Large JSON data in state (common with data sources)

Fixes that work:

```bash
# Remove unused resources from state (this only forgets them, it does not destroy anything)
terraform state rm aws_instance.deleted_thing

# Split state files by environment/service: move a resource into a separate local state file
terraform state mv -state-out=terraform-prod.tfstate aws_instance.prod aws_instance.prod
```

Nuclear option: Start fresh with new state files. Import existing resources. Plan for a long weekend.

Should I stay with expensive Terraform Cloud or migrate?

Stay if:
- You're paying under like $500/month
- Your team doesn't know Docker/AWS
- HashiCorp Vault integration is critical
- You have compliance requirements and no dedicated security team

Get the hell out if:
- Costs exceed a grand/month and growing
- You hit concurrent run limits during incidents (this happened to us 3 times)
- You're locked out of basic features due to tier restrictions
- Your CFO is asking why infrastructure tooling costs more than your compute

Migration reality check: Budget 3-4 weeks for the migration. Plan for bugs. Test everything twice. Have rollback plans. And maybe warn your family you'll be unavailable most evenings.

Terraform Alternatives: Technical Reference Guide

Executive Summary

HashiCorp's August 2023 license change and resource-based pricing model transformed Terraform from free to expensive (12x cost increases reported). Production teams need alternatives that provide state management, approval workflows, audit logs, and deployment reliability without surprise billing.

Critical Context: HashiCorp Pricing Disaster

License Change Impact (August 2023)

Cost Explosion: $200/month → $2,400/month (12x increase) for identical infrastructure
Resource Counting Scam: Internal resources counted separately
- EKS cluster = 40-50 billable resources (VPC, security groups, IAM roles, subnets)
- VPC = 15+ resources (subnets, route tables, gateways, NACLs)
- Modules count each internal resource separately
Billing Model: $0.00014/hour per resource, billed at peak hourly usage
Hidden Costs: Terraform dependency graph creates intermediate resources that count toward billing
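
The billing model above reduces to simple arithmetic. Here is a rough sketch, assuming a flat 730-hour month and a constant resource count. Real bills follow peak hourly usage, so treat the result as a floor. `estimate_monthly_cost` is a hypothetical helper name, not part of any CLI.

```bash
#!/usr/bin/env bash
# Estimate a monthly resource-based bill from a resource count.
# Assumptions (not taken from any official calculator): 730 hours per month,
# a constant resource count, and the $0.00014 per resource-hour rate quoted above.
estimate_monthly_cost() {
  local resources="$1"
  awk -v n="$resources" 'BEGIN { printf "%.2f\n", n * 0.00014 * 730 }'
}

# Usage: estimate_monthly_cost 500
```

Feeding it the output of `terraform state list | wc -l` gives a quick first guess before opening the pricing calculator.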

Production Failure Scenarios

Concurrent Run Limits: 20-minute deployment queues during production incidents
Artificial Throttling: Teams paying $800/month unable to deploy during outages
Enterprise Feature Paywall: Audit logs, RBAC, unlimited concurrent runs require premium tiers

Alternative Solutions Analysis

Comparison Matrix

| Solution | Pricing Model | Real Monthly Cost | Engineering Time Investment | Critical Failure Points |
|---|---|---|---|---|
| OpenTofu + S3 | Storage only (~$20/month) | $20 + 40 hours/month maintenance | HIGH: 3 weekends debugging state locks | State corruption during AWS DynamoDB hiccups, ConditionalCheckFailed errors |
| Scalr | $0.99/successful run (50 free/month) | $200/month (200 runs) vs $2,400 Terraform Cloud | LOW: Managed platform | Failed runs don't count, large terraform plans take longer |
| Atlantis | $0 + hosting costs | $150/month redundant AWS setup + 10 hours/month babysitting | HIGH: 2 weeks setup, ongoing maintenance | Webhook failures during incidents, SSL cert expiration, memory leaks, database issues |
| Digger | $39/user/month + GitHub Actions compute | $195/month (5 users) + $50 Actions compute | MEDIUM: Uses existing CI/CD | GitHub Actions logs poor for debugging, runner timeouts, 2-3 minute cold starts |
| CloudFormation | AWS compute costs only | ~$1/pipeline/month + compute | MEDIUM: YAML complexity | 3,000-line templates unmanageable, cryptic error messages, no version pinning |

Production Implementation Reality

OpenTofu Migration

State Migration: tofu init -migrate-state works reliably
Compatibility: Existing .tf files work unchanged
Real Costs: S3 backend $8/month, DynamoDB locking $2/month
Critical Failure: State corruption requires 4-hour restoration from backup
Prevention Requirement: DynamoDB TTL setup prevents stuck locks

Atlantis Production Setup

What Breaks in Production:

GitHub webhook failures during high-traffic deployments
Database disk space exhaustion (no log rotation)
SSL certificate expiration breaking webhook delivery
Memory leaks in version 0.19.x causing daily crashes

Setup Reality:

"Simple Docker deployment" = 2 weeks configuration
Requirements: Postgres with backups, reliable webhooks, SSL certificates, monitoring
Maintenance: 10 hours/month operational overhead

Scalr Enterprise Features

Unlimited concurrent runs: Critical for incident response
Policy enforcement: More reliable than Terraform Cloud
Drift detection: Identifies manual infrastructure changes
Cost estimation: Accurate vs HashiCorp's estimates
Transparent pricing: No resource counting, failed runs excluded
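
Per-run pricing is easy to model. The sketch below assumes the 50 free runs are subtracted before billing and that only successful runs are passed in; both are my reading of the figures in this guide, not a statement of Scalr's billing rules. `scalr_monthly_cost` is a hypothetical name.

```bash
#!/usr/bin/env bash
# Per-run pricing sketch: only successful runs beyond the free allowance are billed.
# The $0.99 rate and the 50-run allowance are the figures quoted in this guide.
scalr_monthly_cost() {
  local successful_runs="$1"
  awk -v r="$successful_runs" 'BEGIN {
    billable = r - 50
    if (billable < 0) billable = 0
    printf "%.2f\n", billable * 0.99
  }'
}

# Usage: scalr_monthly_cost 200
```

Under these assumptions 200 successful runs come to 148.50, a little under the rounded $200 in the comparison matrix, which corresponds to billing all 200 runs.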

Resource Requirements

Migration Time Investment

OpenTofu: 3 weeks full migration + ongoing weekend debugging
Atlantis: 2 weeks initial setup + 1-2 weeks production hardening
Scalr: Minimal migration time, managed platform
CloudFormation: Rewrite required, plan for long weekend

Engineering Expertise Required

State Management: Understanding of Terraform state, backup/restore procedures
Infrastructure: AWS/cloud provider deep knowledge for troubleshooting
CI/CD Integration: Webhook configuration, GitHub Actions optimization
Monitoring: Platform health monitoring, alert configuration

Critical Warnings & Failure Modes

State Lock Debugging (OpenTofu/Atlantis)

Error Pattern: ConditionalCheckFailedException in DynamoDB
Root Cause: Failed deployments don't release state locks
Resolution: Manual DynamoDB lock deletion or tofu force-unlock
Prevention: Configure DynamoDB TTL (1-hour auto-deletion for stuck locks)
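
One detail that is easy to get wrong with the TTL approach: DynamoDB TTL expects an absolute expiry time in epoch seconds, not a duration like 3600, and deletion happens in the background rather than at the exact second. A minimal sketch of computing that value follows. The table name `terraform-locks` and attribute name `TTL` are illustrative, and as far as I know the lock items written by the backend do not carry such an attribute on their own, so something in your workflow would have to add it.

```bash
#!/usr/bin/env bash
# Compute an absolute expiry timestamp (epoch seconds) for a DynamoDB TTL attribute.
# DynamoDB compares this value with the current time; it is not a duration.
ttl_epoch() {
  local lifetime_seconds="$1"
  local now="${2:-$(date +%s)}"   # optional second argument pins "now" for testing
  echo $(( now + lifetime_seconds ))
}

# Hypothetical usage (table and attribute names are assumptions):
#   aws dynamodb update-time-to-live --table-name terraform-locks \
#     --time-to-live-specification "Enabled=true, AttributeName=TTL"
#   expiry="$(ttl_epoch 3600)"   # one hour from now
```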

GitHub Actions Performance Issues (Digger)

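Timeout Errors: Jobs exceed the 360-minute limit
Large Infrastructure Impact: EKS with 200 worker nodes = 4-hour plan time
Workaround: Raise `timeout-minutes` to 480 (only honored on self-hosted runners; GitHub-hosted jobs are capped at 360 minutes) or split the configuration, and use `terraform plan -out=plan.tfplan` so apply reuses the saved plan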

CloudFormation Error Interpretation

"UPDATE_ROLLBACK_FAILED": Most common useless error message
Actual Causes: Manual resource modification, IAM permission changes, dependency conflicts
Debug Process: CloudFormation Events tab → find buried resource error → Google real error message
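
The "find the buried resource error" step can be done from the command line instead of scrolling the Events tab. A small sketch, assuming the AWS CLI is installed and configured; `failed_events` is a hypothetical helper and `my-stack` is a placeholder stack name.

```bash
#!/usr/bin/env bash
# Keep only failed stack events from tab-separated rows of:
#   timestamp <TAB> logical resource id <TAB> status <TAB> reason
failed_events() {
  awk -F'\t' '$3 ~ /FAILED/ { printf "%s  %s  %s\n", $2, $3, $4 }'
}

# Hypothetical usage:
#   aws cloudformation describe-stack-events --stack-name my-stack \
#     --query 'StackEvents[].[Timestamp,LogicalResourceId,ResourceStatus,ResourceStatusReason]' \
#     --output text | failed_events
```

The earliest failed row is usually the real cause. Later failures tend to be the rollback tripping over the same resource.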

Performance Degradation Thresholds

State File Size: 500MB+ causes 8-minute terraform plan execution
Resource Limits: 1000+ resources cause significant UI/API performance degradation
Concurrent Operations: Platform-specific limits cause deployment queuing

Decision Criteria Framework

Stay with Terraform Cloud If:

Monthly costs under $500
Team lacks Docker/AWS expertise
HashiCorp Vault integration critical
Compliance requirements without dedicated security team

Migrate If:

Costs exceed $1000/month and growing
Concurrent run limits hit during incidents (3+ occurrences)
Basic features locked behind tier restrictions
CFO questioning infrastructure tooling costs vs compute costs

Selection Criteria by Team Profile:

Budget-Conscious/High Engineering Capacity: OpenTofu + S3
Predictable Costs/Managed Platform: Scalr
AWS-Only/Cost-Sensitive: CloudFormation
Existing CI/CD Integration: Digger
Self-Hosting Preference: Atlantis (with maintenance budget)

Operational Troubleshooting Guide

Common Production Issues

Stuck State Locks

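```bash
# Emergency unlock (use carefully)
tofu force-unlock 1a2b3c4d-5e6f-7g8h-9i0j-k1l2m3n4o5p6

# DynamoDB TTL prevention: enable TTL on the lock table, keyed on a "TTL" attribute
aws dynamodb update-time-to-live --table-name terraform-locks \
  --time-to-live-specification "Enabled=true, AttributeName=TTL"
# The TTL value on each item must be an absolute epoch timestamp (now + 3600), not the number 3600
```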

Atlantis Webhook Failures

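```bash
# Check SSL certificate expiration (s_client alone doesn't print the dates)
echo | openssl s_client -connect your-atlantis.com:443 -servername your-atlantis.com 2>/dev/null \
  | openssl x509 -noout -enddate

# Monitor webhook delivery
docker logs atlantis 2>&1 | grep -i webhook
```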

Provider Version Conflicts

```hcl
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"  # Update version constraints
    }
  }
}
```

State File Optimization

```bash
# Remove unused resources
terraform state rm aws_instance.deleted_thing

# Split large state files: move a resource into a separate local state file
terraform state mv -state-out=terraform-prod.tfstate aws_instance.prod aws_instance.prod
```

Performance Optimization

Large State File Management

Problem: 500MB+ state files cause 8-minute plan times
Causes:

Too many resources in single state
Orphaned deleted resources
Large JSON data from data sources

Solutions:

Split state by environment/service
Regular state cleanup
Import existing resources to fresh state (nuclear option)
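
Before splitting anything, it helps to see what the state is actually made of. A minimal sketch that counts resources per type from `terraform state list` output; `count_by_type` is a hypothetical helper, and it assumes resource names contain no dots outside index brackets.

```bash
#!/usr/bin/env bash
# Count resources per type from `terraform state list` output on stdin.
count_by_type() {
  # Drop index suffixes such as [0] or ["key"], then treat the
  # second-to-last dot-separated segment as the resource type.
  sed 's/\[[^]]*\]//g' |
    awk -F. 'NF >= 2 { count[$(NF-1)]++ } END { for (t in count) print count[t], t }' |
    sort -k1,1nr -k2,2
}

# Usage: terraform state list | count_by_type
```

The types at the top of the list are the natural candidates for their own state file.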

Resource Links & Documentation

Migration Tools

OpenTofu Migration Guide: Reliable state migration process
tfmigrate: Complex state migration automation
Checkov: Pre-deployment security validation
Infracost: Cost estimation before deployment

Platform Documentation

Atlantis Production Setup: Production-ready deployment guide
Scalr Documentation: 50 free runs/month, transparent pricing
HCP Terraform Pricing Calculator: Resource counting cost estimation

Community Support

HashiCorp Discuss - Terraform: Active engineering community
OpenTofu GitHub Discussions: Migration and technical support
Gruntwork Infrastructure Blog: State management best practices

Key Operational Intelligence

Hidden Costs Analysis

"Free" OpenTofu: $20 storage + 40 hours/month engineering time = $4000+ actual cost
Terraform Cloud: Resource counting includes invisible dependency graph resources
GitHub Actions: Cold start overhead adds 2-3 minutes per deployment
Self-Hosting: SSL certificate management, database maintenance, monitoring setup
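
The "free tool, expensive engineers" arithmetic above can be made explicit. A sketch with whole-dollar inputs; the $100/hour default is an assumption chosen because it roughly reproduces the $4000+ figure, so substitute your own loaded rate. `true_monthly_cost` is a hypothetical name.

```bash
#!/usr/bin/env bash
# Fold engineering time into a tool's monthly cost (whole dollars only).
# hourly_rate defaults to 100, an assumed figure rather than one from the source.
true_monthly_cost() {
  local tool_cost="$1" hours="$2" hourly_rate="${3:-100}"
  echo $(( tool_cost + hours * hourly_rate ))
}

# Usage: true_monthly_cost 20 40     (OpenTofu + S3 figures from this guide)
#        true_monthly_cost 150 10    (Atlantis figures from this guide)
```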

Migration Risk Assessment

State corruption risk: Always backup before migration, test restore procedures
Provider compatibility: Pin versions, test all modules before production migration
Webhook reliability: Plan for manual deployment capabilities during platform outages
Team training: Budget 2-4 weeks for team familiarity with new platform

Success Metrics

Deployment reliability: Concurrent runs during incidents
Cost predictability: Transparent pricing vs resource counting
Engineering productivity: Time spent on platform maintenance vs feature development
Incident response: Deployment capabilities during production outages

Useful Links for Further Investigation

Actually Useful Links (No Bullshit)

| Link | Description |
|---|---|
| OpenTofu Migration Guide | Actually useful migration docs for once. `tofu init -migrate-state` and you're done (usually). |
| tfmigrate | For complex state migrations. Saved my ass when manual migration broke everything. |
| Atlantis Production Setup | Skip the "quick start" bullshit. This actually tells you what breaks in production. |
| HCP Terraform Pricing Calculator | Enter your resource count. Prepare to be horrified. Don't blame me when you see the numbers. |
| Scalr Pricing | $0.99/run. No surprise fees. Refreshingly honest compared to HashiCorp's resource counting scam. |
| Scalr Documentation | 50 free runs/month. Their docs don't suck, which is rare these days. |
| Digger | GitHub Actions for infrastructure. Actually clever, unlike most "innovative" DevOps tools. |
| Checkov | Finds the dumb security shit before it hits production. Saved me from several AWS bill disasters. |
| Infracost | Shows you how much your terraform changes will cost before you deploy. Wish I'd found this sooner. |
| HashiCorp Discuss - Terraform | Real engineers solving real problems. Way better than Stack Overflow's duplicate question hell. |
| OpenTofu GitHub Discussions | Active community where people actually help instead of marking everything as duplicate. |
| Terraform Internal Architecture | Documentation on Terraform's core components and how they interact to manage infrastructure. |
| Gruntwork Infrastructure Blog | Best practices and strategies for managing Terraform state across environments. |
| Spacelift Terraform Guides | How to configure and use an S3 backend for Terraform state, including setup and considerations. |
