Terraform State Corruption: Recovery & Prevention Guide
Critical Failure Modes
Primary Causes of State Corruption
- Network interruption during apply: Most common cause. WiFi/network failure during state upload to S3 results in partial JSON file
- Concurrent executions: Two users running terraform apply simultaneously without proper locking
- Disk space exhaustion: Docker volumes at 100% capacity truncate state files to 0 bytes
- Manual state editing: Single typo in JSON breaks entire state file
- Provider version incompatibility: AWS provider 5.x to 6.x migration broke state files with schema changes
Severity Assessment
- Level 1 - JSON Syntax Error: 30-minute fix with backups, catastrophic without
- Level 2 - Partial Corruption: 1 day of importing missing resources
- Level 3 - Total State Loss: 2-3 days for medium environments (50-500 resources), entire weekend cancelled for large deployments
Immediate Symptoms
- terraform plan fails with "Error: invalid character" JSON errors
- terraform state list returns empty despite running resources
- Resources show as "new" when they already exist
- Commands hang indefinitely with lock errors
Recovery Procedures
Option 1: Backup Restoration (20-60 minutes)
Local Backup Recovery
# Verify backup integrity
cat terraform.tfstate.backup | jq . > /dev/null
# Restore if valid
cp terraform.tfstate.backup terraform.tfstate
terraform plan # Verify functionality
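The two steps above can be combined into a guarded restore so a bad backup never overwrites the live state. A sketch, assuming jq is installed; the sample backup below is a stand-in for a real terraform.tfstate.backup, not a complete state file:

```shell
#!/bin/bash
# Sketch: validate a local backup before overwriting the live state.
set -euo pipefail

workdir=$(mktemp -d)
cd "$workdir"

# Stand-in backup (real backups also contain resources, outputs, lineage, etc.)
cat > terraform.tfstate.backup <<'EOF'
{"version": 4, "serial": 12, "lineage": "example-lineage", "resources": []}
EOF

# Only restore if the backup parses as JSON and looks like a v4 state file
if jq -e '.version == 4' terraform.tfstate.backup > /dev/null; then
  cp terraform.tfstate.backup terraform.tfstate
  echo "restored state with serial $(jq -r '.serial' terraform.tfstate)"
else
  echo "backup failed validation; do NOT overwrite live state" >&2
  exit 1
fi
```

The jq -e flag makes the exit code reflect the check, so a truncated or half-written backup stops the restore instead of propagating the corruption.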
S3 Versioning Recovery
# List all state file versions
aws s3api list-object-versions \
--bucket terraform-state-bucket \
--prefix prod/terraform.tfstate
# Download pre-corruption version
aws s3api get-object \
--bucket terraform-state-bucket \
--key prod/terraform.tfstate \
--version-id VERSION_ID \
terraform.tfstate.backup
# Restore to current state (state push checks lineage/serial; use -force only if you are certain)
terraform state push terraform.tfstate.backup
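Picking the right VERSION_ID by eye is error-prone. One approach is to save the list-object-versions output and let jq select the most recent version that is not the (possibly corrupt) current one. The sample JSON below stands in for real AWS output; the field names match the s3api response shape (Versions[].VersionId, IsLatest, LastModified):

```shell
# Sketch: pick the newest non-current version from a saved
# `aws s3api list-object-versions ... > versions.json` dump.
cat > versions.json <<'EOF'
{"Versions": [
  {"VersionId": "v3-corrupt", "IsLatest": true,  "LastModified": "2024-05-03T10:00:00Z"},
  {"VersionId": "v2-good",    "IsLatest": false, "LastModified": "2024-05-02T10:00:00Z"},
  {"VersionId": "v1-old",     "IsLatest": false, "LastModified": "2024-05-01T10:00:00Z"}
]}
EOF

# Newest version that is NOT the current (corrupt) one
PREV_VERSION=$(jq -r '[.Versions[] | select(.IsLatest | not)]
                      | sort_by(.LastModified) | last | .VersionId' versions.json)
echo "$PREV_VERSION"
```

The echoed value is what you pass to get-object --version-id. ISO-8601 timestamps sort correctly as strings, so sort_by(.LastModified) is safe here.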
Option 2: Manual Import (1-3 days)
Resource Discovery
# AWS inventory
aws ec2 describe-instances --output table
aws rds describe-db-instances --output table
aws s3api list-buckets
# Azure inventory
az resource list --output table
# GCP inventory
gcloud compute instances list
gcloud sql instances list
Import Process (Dependency Order)
- VPCs and networking - everything depends on these
- Security groups and IAM roles
- EC2 instances, RDS databases
- Load balancers and DNS
- Application-specific resources
# Create minimal resource configuration
resource "aws_instance" "web" {
ami = "ami-12345678"
instance_type = "t3.micro"
# Refine after import
}
# Import actual resource
terraform import aws_instance.web i-1234567890abcdef0
# Verify and fix configuration
terraform plan
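With dozens of resources, writing each import command by hand is where typos creep in. A hedged sketch that turns a saved aws ec2 describe-instances dump into import commands; deriving the Terraform resource name from the Name tag is an assumption, so adjust it to match your actual configuration blocks:

```shell
# Sketch: generate `terraform import` commands from a saved
# `aws ec2 describe-instances > instances.json` dump.
cat > instances.json <<'EOF'
{"Reservations": [{"Instances": [
  {"InstanceId": "i-0abc123", "Tags": [{"Key": "Name", "Value": "web"}]},
  {"InstanceId": "i-0def456", "Tags": [{"Key": "Name", "Value": "db"}]}
]}]}
EOF

# One import command per instance, named after its Name tag
jq -r '.Reservations[].Instances[]
       | ((.Tags[]? | select(.Key == "Name") | .Value) // "unnamed") as $name
       | "terraform import aws_instance.\($name) \(.InstanceId)"' \
   instances.json > import-commands.sh
cat import-commands.sh
```

Review the generated file before running it: every resource address must already exist as a (possibly minimal) resource block, per the example above.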
Option 3: Emergency Temporary State
For critical production issues requiring immediate deployment:
# Initialize new state
terraform init
# Import only critical resources
terraform import aws_instance.prod_web i-critical-instance-id
terraform import aws_security_group.prod sg-whatever
# Create minimal viable configuration
# Fix comprehensive state later
Prevention Configuration
Remote State Setup (Required)
S3 + DynamoDB Backend
terraform {
backend "s3" {
bucket = "company-terraform-state"
key = "production/terraform.tfstate"
region = "us-west-2"
encrypt = true
dynamodb_table = "terraform-locks"
# Note: versioning is NOT a backend argument -- enable it on the
# S3 bucket itself (critical for recovery)
}
}
Infrastructure Setup
# Enable S3 versioning (saves your career)
aws s3api put-bucket-versioning \
--bucket terraform-state-bucket \
--versioning-configuration Status=Enabled
# Create DynamoDB lock table
aws dynamodb create-table \
--table-name terraform-locks \
--attribute-definitions AttributeName=LockID,AttributeType=S \
--key-schema AttributeName=LockID,KeyType=HASH \
--billing-mode PAY_PER_REQUEST
Automated Backup System
Daily Backup Script
#!/bin/bash
# Run daily via cron: 0 2 * * *
DATE=$(date +%Y%m%d_%H%M%S)
BACKUP_BUCKET="terraform-backups"
for env in prod staging dev; do
cd "/path/to/${env}/terraform" || continue
# Pull and validate state
terraform state pull > "backup-${env}-${DATE}.json"
if jq . "backup-${env}-${DATE}.json" > /dev/null; then
aws s3 cp "backup-${env}-${DATE}.json" "s3://${BACKUP_BUCKET}/${env}/"
echo "${env} backup successful"
rm "backup-${env}-${DATE}.json"
else
echo "ERROR: ${env} backup corrupted"
fi
done
State Health Monitoring
#!/bin/bash
# Monitor state file integrity
CURRENT_SIZE=$(terraform state pull | wc -c)
if [ "$CURRENT_SIZE" -lt 1000 ]; then
echo "WARNING: State file suspiciously small (${CURRENT_SIZE} bytes)"
fi
RESOURCE_COUNT=$(terraform state list | wc -l)
if [ "$RESOURCE_COUNT" -eq 0 ]; then
echo "ERROR: State file contains no resources"
fi
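Size and count thresholds catch total loss but miss gradual shrinkage. Comparing the current state against the most recent backup catches that too. A sketch; the two sample files below stand in for terraform state pull output and yesterday's backup from the script above:

```shell
# Sketch: flag a drop in resource count between backup and current state.
cat > state-current.json <<'EOF'
{"version": 4, "serial": 40, "resources": [{"type": "aws_instance"}]}
EOF
cat > state-backup.json <<'EOF'
{"version": 4, "serial": 39,
 "resources": [{"type": "aws_instance"}, {"type": "aws_db_instance"}]}
EOF

CUR=$(jq '.resources | length' state-current.json)
PREV=$(jq '.resources | length' state-backup.json)
if [ "$CUR" -lt "$PREV" ]; then
  echo "WARNING: resource count dropped from ${PREV} to ${CUR}"
fi
```

A legitimate destroy also trips this check, so treat it as a prompt to investigate, not an automatic rollback trigger.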
Team Workflow Configuration
GitOps CI/CD Pipeline
# .github/workflows/terraform.yml
name: Terraform CI/CD
on:
pull_request:
paths: ['terraform/**']
push:
branches: [main]
jobs:
terraform:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- uses: hashicorp/setup-terraform@v2
with:
terraform_version: 1.5.7
- name: Terraform Init
run: terraform init
- name: Terraform Plan
if: github.event_name == 'pull_request'
run: terraform plan -no-color
- name: Terraform Apply
if: github.ref == 'refs/heads/main'
run: terraform apply -auto-approve
Environment Separation
terraform/
├── prod/ # Separate S3 keys
│ ├── main.tf
│ └── backend.tf (key = "prod/terraform.tfstate")
├── staging/
│ ├── main.tf
│ └── backend.tf (key = "staging/terraform.tfstate")
└── dev/
├── main.tf
└── backend.tf (key = "dev/terraform.tfstate")
Resource Requirements & Time Investment
Recovery Time Estimates
Small environment (<50 resources)
- Backup restore: 20 minutes
- Manual import: 4-6 hours
- Complete rebuild: 8+ hours
Medium environment (50-500 resources)
- Backup restore: 30-60 minutes
- Manual import: 1-2 days
- Complete rebuild: 3-5 days
Large environment (500+ resources)
- Backup restore: 1-2 hours
- Manual import: 3-7 days
- Complete rebuild: 1-2 weeks
Prevention Costs
- S3 + DynamoDB backend: $10-50/month depending on state size
- Daily backup storage: $5-20/month
- CI/CD platform: $0 (GitHub Actions) to $50+/month (commercial)
Human Resource Impact
- State corruption incident: 1-3 engineers blocked for days
- Prevention setup: 1 engineer for 1-2 days initial configuration
- Maintenance: 2-4 hours/month monitoring and updates
Common Failure Scenarios
Lock File Issues
Symptom: "Error acquiring the state lock" preventing all operations
Cause: Process killed during apply, leaving orphaned lock
Resolution:
# Identify lock holder
aws dynamodb scan --table-name terraform-locks
# Force unlock (ensure no active operations first)
terraform force-unlock LOCK_ID
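Before forcing the unlock, check who holds it and when it was taken. The Info attribute on the DynamoDB item is a JSON string, so it needs a second parse. A sketch; the sample below stands in for the scan output, with an Info payload shaped like the lock records Terraform writes (ID, Operation, Who, Created):

```shell
# Sketch: extract the lock ID and holder from a saved
# `aws dynamodb scan --table-name terraform-locks > scan.json` dump.
cat > scan.json <<'EOF'
{"Items": [{
  "LockID": {"S": "company-terraform-state/prod/terraform.tfstate"},
  "Info": {"S": "{\"ID\":\"f2d5-lock-id\",\"Operation\":\"OperationTypeApply\",\"Who\":\"alice@ci-runner\",\"Created\":\"2024-05-03T10:00:00Z\"}"}
}]}
EOF

# Info is a JSON string inside JSON, hence fromjson
LOCK_ID=$(jq -r '.Items[0].Info.S | fromjson | .ID' scan.json)
LOCK_WHO=$(jq -r '.Items[0].Info.S | fromjson | .Who' scan.json)
echo "lock ${LOCK_ID} held by ${LOCK_WHO}"
# Only after confirming no apply is running: terraform force-unlock "$LOCK_ID"
```

Note that real scans may also contain a digest item without an Info attribute; filter for items that have one before indexing.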
Provider Migration Failures
Symptom: Resources show as completely different types after provider upgrade
Cause: Schema changes between major provider versions
Prevention: Test upgrades in non-production first, maintain provider version constraints
Partial State Corruption
Symptom: Some resources missing from state, others intact
Cause: Interrupted writes, filesystem issues
Resolution: Selective import of missing resources rather than full rebuild
Critical Configuration Settings
Required S3 Bucket Configuration
- Versioning: Enabled (provides automatic backups)
- Encryption: AES256 or KMS (security compliance)
- Public access: Blocked (security requirement)
- Lifecycle policy: Retain 30+ versions
Required DynamoDB Table Settings
- Partition key: LockID (String type)
- Billing: Pay-per-request (cost optimization)
- Point-in-time recovery: Enabled (additional safety)
Terraform Configuration Requirements
# Minimum backend configuration
terraform {
required_version = ">= 1.0"
required_providers {
aws = {
source = "hashicorp/aws"
version = "~> 5.0" # Pin major version
}
}
backend "s3" {
encrypt = true
dynamodb_table = "terraform-locks"
# Other settings environment-specific
}
}
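The "Other settings environment-specific" comment is commonly implemented with partial backend configuration: keep the shared settings in HCL as above, and supply per-environment values from a separate file at init time. A sketch, with a hypothetical file name:

```hcl
# prod.s3.tfbackend (hypothetical file name), passed at init time via:
#   terraform init -backend-config=prod.s3.tfbackend
bucket = "company-terraform-state"
key    = "prod/terraform.tfstate"
region = "us-west-2"
```

This keeps one set of Terraform files reusable across environments and avoids committing environment-specific state paths into shared code.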
Emergency Procedures
Production State Corruption Response
- Immediate: Stop all Terraform operations across team
- Assessment: Determine corruption scope (partial vs total)
- Communication: Notify stakeholders of deployment freeze
- Recovery: Attempt backup restoration first, import as fallback
- Verification: Extensive testing before resuming operations
- Post-incident: Review prevention measures, update runbooks
Disaster Recovery Checklist
- State backups accessible and tested monthly
- Recovery procedures documented and practiced
- Team trained on emergency procedures
- Escalation paths defined for different severity levels
- Rollback procedures prepared for critical changes
Operational Intelligence
What Official Documentation Doesn't Tell You
- Workspace vs separate files: Workspaces are a development convenience, not a safe mechanism for environment separation
- State locking limitations: DynamoDB locks don't prevent all race conditions
- Import complexity: Complex resources require extensive configuration matching
- Backup timing: S3 versioning alone insufficient for high-frequency changes
Community Wisdom
- Terraformer tool: Generates configs but requires significant cleanup
- Import order matters: VPC resources first, applications last
- Large state performance: Files >100MB become unwieldy, split recommended
- Provider stability: Major version upgrades often break state compatibility
Breaking Points and Limitations
- UI performance: Terraform Cloud becomes unusable with 1000+ resources
- Apply duration: State files >50MB cause 10+ minute plan times
- Import limitations: Some resources impossible to import accurately
- Lock timeout: Default 20-minute timeout insufficient for large applies
Useful Links for Further Investigation
Actually Useful Links (Not a Link Farm)
| Link | Description |
|---|---|
| Terraform State Documentation | Comprehensive guide covering state concepts, remote backends, and best practices from Spacelift. |
| State Recovery Guidelines | AWS guide to state recovery and backup restoration procedures with practical examples. |
| S3 Backend Configuration | How to set up S3 + DynamoDB properly. Good step-by-step instructions. |
| Terraform State Management - Gruntwork | Solid best practices from people who actually use this stuff in production. |
| Terraformer | Generates Terraform config from existing infrastructure. Sometimes works, sometimes doesn't, but better than starting from nothing. |
| Terraform Debugging Guide | Practical debugging guide including TF_LOG=DEBUG and troubleshooting techniques. |
| HashiCorp Community Forum | When you have weird edge cases, search here first. Lots of people have been through the same pain. |
| Stack Overflow - Terraform State | Good for specific error messages and quick fixes. |
| Spacelift | Commercial platform that handles state management for you. Costs money but prevents you from spending weekends fixing broken state files. |
| Atlantis | Open-source GitOps for Terraform. Prevents people from running terraform apply from their laptops. |