
The Multi-Account DevOps Nightmare That Made Us Build This

Why Manual Multi-Account Deployment Is Hell

Deploying infrastructure changes across 15+ AWS accounts manually will destroy your team's productivity and sanity. We learned this the hard way after our fourth production outage caused by deployment inconsistencies between accounts.

Why We Finally Automated Everything

The Jenkins incident: Someone changed an environment variable that started dumping debug logs to prod. Our AWS bill went nuts - think it was like $800 or something because we were logging way too much. Took me forever to figure out why monitoring was broken and costs were crazy.

Version nightmare: Dev team upgraded Terraform, prod was still on the old version. State got messed up during a weekend deploy. Spent most of Sunday manually importing resources while everyone kept asking when the site would be back up.

Audit fun: "Show us your change log for the last 6 months." We had accounts running different versions of everything, some with random hotfixes that nobody bothered committing to Git. Our change tracking was basically Slack messages and hoping someone remembered what they did.

The accidental destroy: One of our devs ran terraform destroy thinking they were in dev. They were in dev, but our backup process was broken so we lost half a day rebuilding stuff.

That's when I decided to spend the next few months building actual automation. It took way longer than I thought, but this writeup should save you from going through the same pain. AWS Organizations best practices and multi-account strategy guides provide the foundation; GitHub Actions CI/CD patterns and OIDC integration examples show the secure automation approach; Terraform state management and disaster recovery strategies cover reliability.

The Architecture That Actually Works


After way too many months debugging broken deployments, here's what actually works:

Centralized Pipeline with Account-Specific State

The key insight: one pipeline, multiple backends. Each account gets its own Terraform state file, but all deployments flow through the same GitOps pipeline. This prevents state conflicts between accounts while maintaining deployment consistency.

We use Terragrunt because Terraform workspaces are broken for multi-account work: every workspace shares the same backend, which makes per-account access control impossible. Multi-account deployment patterns and Terragrunt authentication guides provide comprehensive examples for AWS account management, production-grade multi-environment setups demonstrate real-world patterns that scale, and the AWS multi-account best practices and Terragrunt Quick Start guides cover the foundational concepts.

## Example Terragrunt configuration that actually works
terraform {
  source = "../../../modules/vpc"
}

remote_state {
  backend = "s3"
  config = {
    bucket = "terraform-state-${get_env("ACCOUNT_ID")}"
    key    = "${path_relative_to_include()}/terraform.tfstate"
    region = "us-east-1"
    
    dynamodb_table = "terraform-locks-${get_env("ACCOUNT_ID")}"
    encrypt        = true
  }
}

inputs = {
  environment = get_env("ENVIRONMENT")
  account_id  = get_env("ACCOUNT_ID")
  vpc_cidr    = "10.${get_env("ACCOUNT_NUMBER")}.0.0/16"
}

Cross-Account Role Assumption That Doesn't Break

Managing IAM permissions across multiple accounts is where most teams give up and go back to manual deployments. The secret is using AWS OIDC integration with GitHub Actions instead of long-lived access keys that get leaked in Slack channels.

Each account has a deployment role that the CI/CD pipeline can assume. The roles are configured with trust policies that only allow specific GitHub repositories and branches to assume them. This prevents developers from accidentally deploying to production from their local machines.

AWS OIDC setup documentation sucks. You'll spend days debugging trust policies. The aws-actions/configure-aws-credentials docs don't mention that GitHub OIDC tokens need exact repository matches in your trust policy - learned that one the hard way.

Environment Promotion That Doesn't Suck

Most "multi-environment" pipelines are actually just the same code deployed with different variable files. This works until you need environment-specific customizations, then everything becomes a mess of conditional logic and feature flags.

Our approach: environment-specific branches with controlled promotion. Development changes merge to develop branch and auto-deploy to dev accounts. Staging changes merge to staging branch after dev validation. Production deployments require explicit promotion from staging branch to main branch with required approvals.

## GitHub Actions workflow that actually handles multi-account deployment
name: Multi-Account Infrastructure Deployment
on:
  push:
    branches: [develop, staging, main]
  pull_request:
    branches: [develop, staging, main]

permissions:
  id-token: write   # required for the OIDC token exchange
  contents: read

jobs:
  # Matrix excludes must be static mappings -- they can't evaluate github
  # context per entry -- so a setup job picks the accounts for this ref.
  matrix:
    runs-on: ubuntu-latest
    outputs:
      accounts: ${{ steps.set.outputs.accounts }}
    steps:
    - id: set
      run: |
        DEV='{"name":"dev","role":"arn:aws:iam::111111111111:role/GitHubDeployment"}'
        STAGING='{"name":"staging","role":"arn:aws:iam::222222222222:role/GitHubDeployment"}'
        PROD='{"name":"prod","role":"arn:aws:iam::333333333333:role/GitHubDeployment"}'
        case "${GITHUB_BASE_REF:-$GITHUB_REF_NAME}" in
          main)    echo "accounts=[$DEV,$STAGING,$PROD]" >> "$GITHUB_OUTPUT" ;;  # prod only from main
          staging) echo "accounts=[$DEV,$STAGING]"       >> "$GITHUB_OUTPUT" ;;  # staging from staging/main
          *)       echo "accounts=[$DEV]"                >> "$GITHUB_OUTPUT" ;;
        esac

  plan:
    needs: matrix
    strategy:
      matrix:
        account: ${{ fromJson(needs.matrix.outputs.accounts) }}

    runs-on: ubuntu-latest
    environment: ${{ matrix.account.name }}

    steps:
    - uses: actions/checkout@v4

    - name: Configure AWS Credentials
      uses: aws-actions/configure-aws-credentials@v4
      with:
        role-to-assume: ${{ matrix.account.role }}
        aws-region: us-east-1
        role-session-name: GitHubDeploy-${{ matrix.account.name }}

    - name: Terraform Plan
      working-directory: environments/${{ matrix.account.name }}
      run: |
        terragrunt run-all plan

The key insight: different branches deploy to different account combinations automatically. No manual intervention, no "remember to change the account variable," no accidentally deploying dev code to production.

Tool Integration That Doesn't Drive You Insane

Atlantis vs GitHub Actions vs Terraform Cloud

After wasting months testing every GitOps tool, here's what actually works:

Atlantis crashes every time you get busy. Don't bother.

GitHub Actions is what we use. OIDC setup takes 2 days and you'll hate every minute, but once it works it actually works.

HCP Terraform costs too much. $20+/user/month adds up fast and you're stuck with HashiCorp forever.

We went with GitHub Actions because we're already using GitHub and I'm not paying HashiCorp's enterprise tax.

State Management That Doesn't Corrupt

Multi-account Terraform state management is where most teams give up and go back to manual deployments. Here's what actually works:

Separate S3 buckets per account with DynamoDB locking per account. This prevents state corruption when multiple developers deploy simultaneously and isolates state file access by account boundaries.

## Terragrunt automatically generates backend configs like this:
terraform {
  backend "s3" {
    bucket         = "terraform-state-dev-123456789012"
    key            = "vpc/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-locks-dev-123456789012"
  }
}

Cross-region state replication for disaster recovery. S3 cross-region replication ensures state files survive region outages. We learned this lesson after a region outage corrupted our primary state bucket and we spent 8 hours rebuilding state from CloudFormation exports.
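If you haven't set replication up before, here's roughly what it looks like (bucket names, the replication role, and the destination are placeholders; both buckets need versioning enabled before AWS will accept the config):

## Hypothetical replication setup -- adjust bucket and role names to your accounts
cat > replication-config.json << 'EOF'
{
  "Role": "arn:aws:iam::111111111111:role/s3-state-replication",
  "Rules": [{
    "ID": "state-dr",
    "Status": "Enabled",
    "Priority": 1,
    "Filter": {},
    "DeleteMarkerReplication": { "Status": "Disabled" },
    "Destination": {
      "Bucket": "arn:aws:s3:::terraform-state-dev-111111111111-replica",
      "StorageClass": "STANDARD_IA"
    }
  }]
}
EOF

aws s3api put-bucket-replication \
  --bucket terraform-state-dev-111111111111 \
  --replication-configuration file://replication-config.json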

Terraform state encryption is mandatory. State files contain sensitive information like database passwords and API keys. Encrypt everything and use S3 bucket policies to restrict access to deployment roles only.
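A sketch of the kind of bucket policy that does this - the account ID and role names are the same placeholders used elsewhere in this post, and you want a break-glass admin role in the exception list or you'll lock yourself out of your own state:

## Hypothetical policy -- denies everything except the deployment role
cat > state-bucket-policy.json << 'EOF'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "DenyAllButDeploymentRole",
    "Effect": "Deny",
    "Principal": "*",
    "Action": "s3:*",
    "Resource": [
      "arn:aws:s3:::terraform-state-dev-111111111111",
      "arn:aws:s3:::terraform-state-dev-111111111111/*"
    ],
    "Condition": {
      "ArnNotLike": {
        "aws:PrincipalArn": [
          "arn:aws:iam::111111111111:role/GitHubDeploymentRole",
          "arn:aws:iam::111111111111:role/BreakGlassAdmin"
        ]
      }
    }
  }]
}
EOF

aws s3api put-bucket-policy \
  --bucket terraform-state-dev-111111111111 \
  --policy file://state-bucket-policy.json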

Monitoring That Actually Helps

Standard CloudWatch monitoring is useless for multi-account infrastructure deployments. You need deployment-specific metrics that show what's changing across accounts and regions.

We use GitHub Actions monitoring for pipeline health and custom CloudWatch alarms for deployment failures. The key insight: monitor deployment patterns, not just infrastructure health.

Deployment frequency by account: Track how often each account receives deployments. Accounts that haven't been updated in weeks usually have configuration drift.

Plan vs Apply failures: Terraform plans fail for different reasons than applies. Plan failures indicate code issues, apply failures indicate AWS API problems or permission issues.

State lock duration: Long-running state locks usually indicate stuck deployments or manual interventions. Alert when locks exceed 30 minutes.
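None of these exist as native CloudWatch metrics - you publish them yourself from the pipeline (the post-deploy script later in this post shows the pattern). Assuming a custom StateLockDuration metric in seconds, the 30-minute alert is one CLI call:

## Assumes you publish StateLockDuration (seconds) from your pipeline
aws cloudwatch put-metric-alarm \
  --alarm-name terraform-state-lock-stuck-prod \
  --namespace "Infrastructure/Deployments" \
  --metric-name StateLockDuration \
  --dimensions Name=Environment,Value=prod \
  --statistic Maximum \
  --period 300 \
  --evaluation-periods 1 \
  --threshold 1800 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:us-east-1:333333333333:infra-alerts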

What Actually Happens (Reality Check)

Forget the bullshit "2-4 weeks" you see in blog posts. Here's the real deal:

Month 1-2: AWS OIDC setup was a fucking nightmare. Trust policies kept failing with "AssumeRoleWithWebIdentity" errors that told you nothing. GitHub's docs don't mention half the shit that breaks. Spent like 3 weeks debugging random failures and wanted to quit.

Month 3-4, maybe 5: Converting all our existing crap to Terragrunt. Had to import a bunch of resources that someone created manually and never documented. Some imports worked, others failed for no reason I could figure out. VPCs were especially painful - took me like 8 tries to get one working properly. Lost count honestly.

Month 4-6: GitHub Actions kept breaking in new and creative ways. Rate limits, timeouts, random failures that made no sense. Just when I thought everything was stable, something else would break. Those "resource not found" errors were the worst - completely random and impossible to debug.

Still happening: People still try to bypass the automation when they're stressed about deadlines. Always find some way to "just this once" deploy manually when things get crazy and we're trying to push a hotfix.

Took us somewhere between five and six months all in - hard to say exactly because it was such a shit show. Plan for longer if you're starting from scratch or don't have someone who's done this before.

How to Actually Build This Without Losing Your Mind


Get Your Repo Structure Right (Or You'll Regret It Later)

Most enterprise setups are over-engineered messes that need 6 PRs to change one security group rule.

Here's what actually works without making you want to quit.

The Structure That Actually Works

terraform-multi-account/
├── modules/                    # Reusable infrastructure components
│   ├── vpc/
│   ├── eks-cluster/
│   └── rds-postgres/
├── environments/              # Account-specific configurations
│   ├── dev/
│   │   ├── terragrunt.hcl    # Backend and provider config
│   │   ├── vpc/
│   │   └── eks/
│   ├── staging/
│   └── prod/
├── .github/
│   └── workflows/
│       ├── plan.yml          # Run on PRs
│       └── deploy.yml        # Run on merge
└── scripts/
    ├── setup-oidc.sh        # Initial AWS OIDC setup
    └── validate-state.sh    # State health checks

Separate modules from environments or you'll debug circular dependencies until you die. Modules = what to build, environments = where to build it. Simple as that.

The environments/ directory structure mirrors your AWS account organization. Each environment gets its own directory with account-specific Terragrunt configuration. This prevents cross-environment contamination and makes it obvious which changes affect which accounts. Terraform module design patterns and best practices guides show proper organization. Multi-account repository structures and complex AWS setup management provide real examples.

Terragrunt Configuration That Doesn't Suck

Terragrunt solves the multi-account backend configuration problem that makes Terraform workspaces useless.

Each environment gets its own state file in its own S3 bucket with its own DynamoDB lock table.

## environments/dev/terragrunt.hcl
remote_state {
  backend = "s3"
  generate = {
    path      = "backend.tf"
    if_exists = "overwrite_terragrunt"
  }
  config = {
    bucket = "terraform-state-dev-${get_aws_account_id()}"
    key    = "${path_relative_to_include()}/terraform.tfstate"
    region = "us-east-1"

    dynamodb_table = "terraform-locks-dev-${get_aws_account_id()}"
    encrypt        = true

    # Terragrunt creates and versions this bucket automatically;
    # set skip_bucket_versioning = true only if you hate yourself
  }
}

## Generate provider configuration
generate "provider" {
  path      = "provider.tf"
  if_exists = "overwrite_terragrunt"
  contents  = <<EOF
terraform {
  required_version = ">= 1.9"
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 6.0"
    }
  }
}

provider "aws" {
  region = "us-east-1"

  # Role assumption for cross-account deployment
  assume_role {
    role_arn = "arn:aws:iam::${get_aws_account_id()}:role/TerraformExecution"
  }

  default_tags {
    tags = {
      Environment = "dev"
      ManagedBy   = "Terraform"
      Repository  = "terraform-multi-account"
      # Don't put timestamp() in here -- it forces a diff on every single plan
    }
  }
}
EOF
}

## Environment-specific inputs
inputs = {
  environment        = "dev"
  vpc_cidr           = "10.10.0.0/16"
  availability_zones = ["us-east-1a", "us-east-1b", "us-east-1c"]

  # EKS configuration
  cluster_version    = "1.28"
  node_instance_type = "t3.medium"
  min_nodes          = 2
  max_nodes          = 10

  # Database configuration
  db_instance_class    = "db.t3.micro"
  db_allocated_storage = 20
}

Terragrunt generates all this backend config dynamically so you don't copy-paste the wrong account ID into production.

Learned that lesson the hard way when someone deployed dev config to prod because they copied the wrong file.

AWS OIDC Setup (Prepare for 2 Days of Hell)

AWS OIDC integration with GitHub Actions is the security foundation that makes keyless authentication possible. The documentation is terrible and you'll spend days debugging trust policy syntax, but it's still better than managing long-lived access keys. GitHub's OIDC guide and [AWS IAM documentation](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_providers_create_oidc.html) explain the concepts, while community examples show working configurations.

Creating the OIDC Provider and Roles

Each AWS account needs an OIDC identity provider and a deployment role. Don't try to set these up manually through the console; script it once and run it against every account. The aws-actions marketplace actions handle authentication at deploy time, and the script below handles provider and role creation.

#!/bin/bash
## scripts/setup-oidc.sh -- automate the painful OIDC setup

GITHUB_ORG="your-org"
GITHUB_REPO="terraform-multi-account"
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
ROLE_NAME="GitHubDeploymentRole"

## Create OIDC Identity Provider
aws iam create-open-id-connect-provider \
  --url "https://token.actions.githubusercontent.com" \
  --client-id-list sts.amazonaws.com \
  --thumbprint-list 6938fd4d98bab03faadb97b34396831e3780aea1

## Create trust policy for GitHub Actions
cat > trust-policy.json << EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::${ACCOUNT_ID}:oidc-provider/token.actions.githubusercontent.com"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "token.actions.githubusercontent.com:aud": "sts.amazonaws.com"
        },
        "StringLike": {
          "token.actions.githubusercontent.com:sub": [
            "repo:${GITHUB_ORG}/${GITHUB_REPO}:ref:refs/heads/develop",
            "repo:${GITHUB_ORG}/${GITHUB_REPO}:ref:refs/heads/staging",
            "repo:${GITHUB_ORG}/${GITHUB_REPO}:ref:refs/heads/main"
          ]
        }
      }
    }
  ]
}
EOF

## Create the deployment role
aws iam create-role \
  --role-name $ROLE_NAME \
  --assume-role-policy-document file://trust-policy.json

## Attach broad permissions (restrict these in production)
aws iam attach-role-policy \
  --role-name $ROLE_NAME \
  --policy-arn arn:aws:iam::aws:policy/AdministratorAccess

Trust policy sub field must match repo and branch names exactly.

One typo = cryptic "AssumeRoleWithWebIdentity failed" errors. I spent two days debugging this because of an extra space in the repo name. Check it multiple times.

GitHub Actions Workflow Configuration

The GitHub Actions workflow handles multi-account deployment using matrix strategies. Each account gets its own job with environment-specific configuration and approval requirements.

## .github/workflows/deploy.yml
name: Multi-Account Infrastructure Deployment

on:
  push:
    branches: [develop, staging, main]
    paths: ['environments/**', 'modules/**']
  pull_request:
    branches: [develop, staging, main]
    paths: ['environments/**', 'modules/**']

permissions:
  id-token: write      # required for the OIDC token exchange
  contents: read
  pull-requests: write # lets the plan job comment on PRs

env:
  TERRAFORM_VERSION: "1.9.5"
  TERRAGRUNT_VERSION: "0.63.6"

jobs:
  # Matrix excludes have to be static mappings -- they can't evaluate github
  # context per entry -- so a setup job decides which accounts this ref touches.
  matrix:
    runs-on: ubuntu-latest
    outputs:
      accounts: ${{ steps.set.outputs.accounts }}
    steps:
      - id: set
        run: |
          DEV='{"environment":"dev","account_id":"111111111111","aws_role":"arn:aws:iam::111111111111:role/GitHubDeploymentRole"}'
          STAGING='{"environment":"staging","account_id":"222222222222","aws_role":"arn:aws:iam::222222222222:role/GitHubDeploymentRole"}'
          PROD='{"environment":"prod","account_id":"333333333333","aws_role":"arn:aws:iam::333333333333:role/GitHubDeploymentRole"}'
          # PRs plan against the branch they target; pushes deploy the branch itself
          case "${GITHUB_BASE_REF:-$GITHUB_REF_NAME}" in
            main)    echo "accounts=[$DEV,$STAGING,$PROD]" >> "$GITHUB_OUTPUT" ;;
            staging) echo "accounts=[$DEV,$STAGING]"       >> "$GITHUB_OUTPUT" ;;
            *)       echo "accounts=[$DEV]"                >> "$GITHUB_OUTPUT" ;;
          esac

  plan:
    if: github.event_name == 'pull_request'
    needs: matrix
    strategy:
      fail-fast: false
      matrix:
        account: ${{ fromJson(needs.matrix.outputs.accounts) }}

    runs-on: ubuntu-latest

    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: ${{ env.TERRAFORM_VERSION }}

      - name: Setup Terragrunt
        run: |
          wget -O terragrunt https://github.com/gruntwork-io/terragrunt/releases/download/v${{ env.TERRAGRUNT_VERSION }}/terragrunt_linux_amd64
          chmod +x terragrunt
          sudo mv terragrunt /usr/local/bin/

      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ matrix.account.aws_role }}
          aws-region: us-east-1
          role-session-name: GitHubDeploy-${{ matrix.account.environment }}-${{ github.run_id }}

      - name: Terragrunt Plan
        working-directory: environments/${{ matrix.account.environment }}
        run: |
          # Capture output so the comment step below can post it to the PR
          terragrunt run-all plan --terragrunt-non-interactive 2>&1 | tee plan.out

      - name: Comment PR with Plan
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const planOutput = fs.readFileSync('environments/${{ matrix.account.environment }}/plan.out', 'utf8');
            const body = [
              '## Terraform Plan - ${{ matrix.account.environment }}',
              '',
              '```',
              planOutput,
              '```'
            ].join('\n');
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: body
            });

  deploy:
    if: github.event_name == 'push'
    needs: matrix
    strategy:
      fail-fast: false
      matrix:
        account: ${{ fromJson(needs.matrix.outputs.accounts) }}

    runs-on: ubuntu-latest
    # GitHub environment protection rules (below) gate prod with required approvals
    environment:
      name: ${{ matrix.account.environment }}
      url: https://console.aws.amazon.com/console/home?region=us-east-1#

    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Setup Tools
        run: |
          # Pin the versions -- new releases break stuff randomly
          wget -O terraform.zip https://releases.hashicorp.com/terraform/${{ env.TERRAFORM_VERSION }}/terraform_${{ env.TERRAFORM_VERSION }}_linux_amd64.zip
          unzip terraform.zip
          sudo mv terraform /usr/local/bin/

          wget -O terragrunt https://github.com/gruntwork-io/terragrunt/releases/download/v${{ env.TERRAGRUNT_VERSION }}/terragrunt_linux_amd64
          chmod +x terragrunt
          sudo mv terragrunt /usr/local/bin/

      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ matrix.account.aws_role }}
          aws-region: us-east-1
          role-session-name: GitHubDeploy-${{ matrix.account.environment }}-${{ github.run_id }}

      - name: Validate State Health
        working-directory: environments/${{ matrix.account.environment }}
        run: |
          # Init fails fast on held state locks; validate before touching anything
          terragrunt run-all init --terragrunt-non-interactive
          terragrunt run-all validate --terragrunt-non-interactive

      - name: Deploy Infrastructure
        working-directory: environments/${{ matrix.account.environment }}
        run: |
          terragrunt run-all apply --terragrunt-non-interactive --auto-approve

      - name: Post-Deploy Verification
        working-directory: environments/${{ matrix.account.environment }}
        run: |
          # Verify deployment success
          terragrunt run-all output --terragrunt-non-interactive

          # Run basic connectivity tests if the script exists
          if [ -f "../../scripts/verify-deployment.sh" ]; then
            ../../scripts/verify-deployment.sh ${{ matrix.account.environment }}
          fi

Environment Promotion (The Part Everyone Screws Up)

The branch strategy determines which environments receive deployments automatically and which require manual promotion.

Most teams get this wrong and end up with either too much automation (prod deployments on every commit) or too little (manual deployments that break consistency).

Branch-Based Environment Mapping

develop branch  → dev account (automatic)
staging branch  → dev + staging accounts (automatic)  
main branch     → dev + staging + prod accounts (with approvals)

Merge entire staging branch to main, not individual commits. Otherwise prod gets different config than what you tested. Found this out when a "small fix" went directly to prod and broke everything because it skipped our staging validation.
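Concretely, promotion is one merge of the whole branch, never a cherry-pick:

## Promote everything staging validated, as one merge
git checkout main
git pull origin main
git merge --no-ff staging   # brings exactly what staging tested
git push origin main        # the push triggers the prod deployment workflow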

GitHub Environment Protection Rules

Configure environment protection in GitHub repository settings to require approvals for production deployments:

## Example environment configuration (GitHub UI)
Environment: prod
Required reviewers: DevOps team, Security team
Wait timer: 5 minutes  # Allow time to cancel if needed
Deployment branches: main only

Critical lesson: don't require approvals for dev and staging environments.

Developers need fast feedback loops for testing. Only gate production deployments where the blast radius justifies the overhead.

Monitoring That Doesn't Suck (Unlike CloudWatch Defaults)


Standard infrastructure monitoring doesn't help with deployment issues.

You need deployment-specific metrics that show what's changing and what's breaking. AWS CloudWatch documentation covers the basics, while GitHub Actions monitoring helps track pipeline health. Terraform state monitoring and Datadog integration guides provide additional observability options.

Custom CloudWatch Metrics

#!/bin/bash
## scripts/post-deploy-metrics.sh

ENVIRONMENT=$1
REGION="us-east-1"
DEPLOY_DURATION=${DEPLOY_DURATION:-0}  # exported by the pipeline before this script runs

## Send deployment success metrics
aws cloudwatch put-metric-data \
  --region "$REGION" \
  --namespace "Infrastructure/Deployments" \
  --metric-name DeploymentSuccess \
  --value 1 --unit Count \
  --dimensions Environment="$ENVIRONMENT"

aws cloudwatch put-metric-data \
  --region "$REGION" \
  --namespace "Infrastructure/Deployments" \
  --metric-name DeploymentDuration \
  --value "$DEPLOY_DURATION" --unit Seconds \
  --dimensions Environment="$ENVIRONMENT"

## Track the managed-resource count so drift between runs shows up on a graph
DRIFT_COUNT=$(terragrunt run-all show -json | \
  jq -s '[.[].values.root_module.resources[]? | select(.mode == "managed")] | length')
aws cloudwatch put-metric-data \
  --region "$REGION" \
  --namespace "Infrastructure/Drift" \
  --metric-name ResourceCount \
  --value "$DRIFT_COUNT" --unit Count \
  --dimensions Environment="$ENVIRONMENT"

Slack Integration for Real-Time Notifications

GitHub Actions can post deployment status to Slack channels automatically:

- name: Notify Slack on Deployment
  if: always()
  uses: 8398a7/action-slack@v3
  with:
    status: ${{ job.status }}
    channel: '#infrastructure'
    webhook_url: ${{ secrets.SLACK_WEBHOOK }}
    fields: repo,message,commit,author,action,eventName,ref,workflow
    custom_payload: |
      {
        text: "Infrastructure deployment ${{ job.status }} for ${{ matrix.account.environment }}",
        attachments: [{
          color: '${{ job.status }}' === 'success' ? 'good' : 'danger',
          fields: [{
            title: 'Environment',
            value: '${{ matrix.account.environment }}',
            short: true
          }, {
            title: 'Account',
            value: '${{ matrix.account.account_id }}',
            short: true
          }, {
            title: 'Branch',
            value: '${{ github.ref_name }}',
            short: true
          }]
        }]
      }

Disaster Recovery (For When Everything Goes to Shit)

Multi-account Terraform state corruption is a nightmare scenario that will happen to you eventually.

Plan for it before it ruins your life.

State Backup Strategy

#!/bin/bash
## scripts/backup-terraform-state.sh

for env in dev staging prod; do
  BUCKET="terraform-state-${env}-$(aws sts get-caller-identity --query Account --output text)"

  # Enable cross-region replication (replication-config.json defines the
  # destination bucket and replication role)
  aws s3api put-bucket-replication \
    --bucket "$BUCKET" \
    --replication-configuration file://replication-config.json

  # Enable point-in-time recovery for the DynamoDB lock table
  aws dynamodb update-continuous-backups \
    --table-name "terraform-locks-${env}" \
    --point-in-time-recovery-specification PointInTimeRecoveryEnabled=true
done

State Recovery Procedures

Document the state recovery process before you need it:

## Emergency state recovery procedure
## 1. Stop all deployments
## 2. Download state backup from replication bucket
## 3. Validate state integrity
## 4. Import any missing resources
## 5. Resume deployments

## Example recovery commands
aws s3 cp s3://terraform-state-backup/environments/prod/terraform.tfstate ./
terragrunt init -reconfigure
terragrunt import aws_instance.web i-1234567890abcdef0
terragrunt plan  # Verify no unexpected changes

You'll need this eventually and it'll happen at the worst possible time. Test it quarterly when people are awake and caffeinated, not during 3am production incidents when you're panicking and making terrible decisions.

DevOps Pipeline Tools Reality Check - What Actually Works for Multi-Account Deployment

| Platform | Setup Complexity | Multi-Account Support | Reality Check | Monthly Cost | Best For |
|---|---|---|---|---|---|
| GitHub Actions | High (2 days of OIDC setup) | Excellent with OIDC | Use this. OIDC setup sucks but it works | $0-500+ | Everyone not stuck with legacy crap |
| Atlantis | Medium (1 day) | Good with custom configs | Crashes when you get busy. Don't bother | $150/month infrastructure | No one |
| HCP Terraform | Low (2 hours) | Excellent native support | Easy but HashiCorp will bleed you dry | $20+/user/month | Teams with unlimited budget |
| GitLab Ultimate | Medium (4 hours) | Good but not great | Only if you're already stuck with GitLab | $99/user/month | GitLab victims |
| Jenkins | Very High (1-2 weeks) | Requires custom setup | For masochists only | $200-1000+/month | Sadists and legacy systems |

Evolving Your Infrastructure with Terraform by HashiCorp

Found this TechWorld with Nana video that explains the basics without being total garbage. Skip to 18:15 if you just want the CI/CD part - the first 18 minutes is mostly intro stuff.

Nana actually knows what she's talking about, which puts her ahead of 90% of YouTube DevOps content. The Terraform setup she shows works fine for single accounts. For multi-account stuff, you'll still need to deal with all the OIDC nightmare we covered above, but the basic pipeline concepts are solid.

Worth watching if you're new to this - she actually shows the commands instead of just hand-waving through everything.


Stuff That Breaks at the Worst Possible Time

Q

Help! GitHub Actions OIDC keeps failing with "AssumeRoleWithWebIdentity" errors

A

Trust policy doesn't match repo/branch exactly. AWS gives you the most useless error messages on earth. Shit that breaks:

  • Repo name typo: repo:your-org/terraform-multi-account:ref:refs/heads/main
  • Missing refs/heads/ prefix: use refs/heads/main not just main
  • Any typo in sub field = cryptic failures that tell you nothing

Check CloudTrail to see what GitHub is actually sending. This will ruin your day but it's the only way to debug it.
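Something like this pulls the recent federation attempts so you can compare the sub claim GitHub actually sent against what your trust policy expects:

## Recent AssumeRoleWithWebIdentity calls, including the failed ones
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventName,AttributeValue=AssumeRoleWithWebIdentity \
  --max-results 10 \
  --query 'Events[].CloudTrailEvent'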

Q

Someone just deployed to prod from their laptop - how do I prevent this?

A

Take away everyone's prod AWS keys right fucking now. Force all deployments through GitHub Actions. Use AWS OIDC integration instead of long-lived access keys that get pasted in Slack channels.

Configure IAM roles that can only be assumed by GitHub Actions from specific repositories and branches. The trust policy should look like this:

{
  "Version": "2012-10-17", 
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "Federated": "arn:aws:iam::ACCOUNT:oidc-provider/token.actions.githubusercontent.com"
    },
    "Action": "sts:AssumeRoleWithWebIdentity",
    "Condition": {
      "StringEquals": {
        "token.actions.githubusercontent.com:aud": "sts.amazonaws.com",
        "token.actions.githubusercontent.com:sub": "repo:org/repo:ref:refs/heads/main"
      }
    }
  }]
}

This prevents role assumption from anywhere except GitHub Actions running on the main branch.

Q

My Terraform state is corrupted across multiple accounts - now what?

A

You're fucked. This will ruin your weekend. Here's how to fix it:

Prevention (do this now or hate yourself later):

  • S3 versioning on state buckets
  • Cross-region replication
  • DynamoDB point-in-time recovery for locks
  • State validation in CI/CD

When it inevitably happens:

  1. Stop all deployments immediately and panic
  2. terraform plan to see how fucked you are
  3. Restore from S3 versioning if you set it up (you didn't)
  4. terraform import whatever got lost (this will take hours)
  5. Question your career choices
Q

How do I handle secrets and sensitive data across multiple AWS accounts?

A

Don't put secrets in Terraform code or state files, ever. Use AWS Secrets Manager or Parameter Store in each account for environment-specific secrets.

For cross-account shared secrets, create them in a dedicated security account and use cross-account IAM roles to access them. GitHub Actions can retrieve secrets during deployment using the same OIDC authentication that accesses AWS resources.

- name: Get Database Password
  run: |
    DB_PASSWORD=$(aws secretsmanager get-secret-value \
      --secret-id prod/database/password \
      --query SecretString --output text)
    echo "::add-mask::$DB_PASSWORD"
    echo "DB_PASSWORD=$DB_PASSWORD" >> $GITHUB_ENV

The ::add-mask:: directive prevents GitHub from logging the secret value in action outputs.

Q

Can I use the same Terraform modules across different accounts with different configurations?

A

Yes, this is the recommended approach. Store modules in a modules/ directory and use account-specific variable files to customize behavior per environment.

Use Terragrunt to manage variable files and backend configurations per account:

modules/vpc/           # Reusable VPC module
environments/
  dev/terragrunt.hcl   # Points to vpc module with dev-specific inputs
  prod/terragrunt.hcl  # Points to vpc module with prod-specific inputs

This prevents code duplication while allowing environment-specific customizations like instance sizes, CIDR ranges, and security policies.
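For reference, a minimal per-environment file looks something like this (paths and inputs are illustrative):

## environments/dev/vpc/terragrunt.hcl
terraform {
  source = "../../../modules/vpc"
}

include "root" {
  path = find_in_parent_folders()   # pulls in the account-level backend config
}

inputs = {
  environment = "dev"
  vpc_cidr    = "10.10.0.0/16"   # prod points at the same module with a different CIDR
}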

Q

Why do my deployments take 15+ minutes when deploying to multiple accounts?

A

You're probably deploying accounts sequentially instead of in parallel. GitHub Actions matrix jobs can run accounts simultaneously, but AWS doesn't like too many API calls at once and will rate limit you to hell.

Shit that might help:

  • GitHub Actions matrix jobs for parallel deployments
  • Smart dependency management (good luck figuring that out)
  • Caching Terraform providers between runs (doesn't work half the time)
  • Smaller modules (easier said than done)
  • Terraform parallelism settings (but not too high or AWS gets pissy)

Usually the bottleneck is AWS API limits being garbage, not Terraform. Plan for 10-15 minutes minimum no matter what you do. Sometimes it's 5 minutes, sometimes it takes 30 minutes because AWS is having a bad day. No rhyme or reason to it.

Q

How do I roll back a failed deployment across multiple accounts?

A

Terraform doesn't have rollback because HashiCorp hates us. You need to plan for this shit beforehand or you're fucked when production breaks.

Strategies that sometimes work:

  • Git-based rollback: Revert the commit and redeploy (takes 15+ minutes during an outage)
  • Blue-green deployments: Deploy new infrastructure alongside existing (costs double, management hates it)
  • Immutable infrastructure: Destroy and recreate (great until you have stateful services)
  • Feature flags: Disable new features without infrastructure changes (if you planned ahead, which you didn't)

For emergency rollbacks: revert to the last known good Git commit and trigger a new deployment. Pray it works because you're probably getting paged every 5 minutes while it runs.

Q

What's the best way to handle Terraform state locking conflicts across multiple developers?

A

DynamoDB handles the locking automatically. Real problem is developers running Terraform locally while CI/CD is trying to deploy. Recipe for disaster and someone will definitely fuck up production.

Prevent conflicts by:

  • Requiring all deployments to go through CI/CD (good luck enforcing this)
  • Using separate state files per environment/module to reduce lock contention
  • Implementing lock timeouts so stuck locks don't block deployments indefinitely
  • Creating "developer sandbox" accounts where people can break shit safely

If you get stuck with a persistent lock: figure out what process is holding it, kill that process, then terraform force-unlock. Don't force-unlock active deployments or you'll corrupt state and have the worst day of your life.
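The unlock dance, roughly (the lock ID comes from the error message the failed run prints):

## The failed run prints the lock ID, who holds it, and since when
terragrunt plan

## Only after confirming the owning process is actually dead:
terragrunt force-unlock <LOCK_ID>

## Better: make CI wait for locks instead of failing instantly
terragrunt apply -lock-timeout=10m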

Q

How do I test Terraform changes before they hit production?

A

Use a promotion pipeline: dev → staging → prod with identical infrastructure in each environment.

Testing strategies:

  • Automated validation: terraform validate, terraform plan, and tools like tfsec
  • Integration testing: Deploy to dev environment and run automated tests against actual infrastructure
  • Staging environment: Exact replica of production for final validation
  • Canary deployments: Roll out changes to subset of production infrastructure first

The key insight: test the same Terraform code that will run in production, not simplified versions. Use identical modules with different variable values per environment.
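As a sketch, the validation gate in CI can be a couple of steps like this (the tfsec action shown is the common community one; double-check its version and inputs before copying):

## Hypothetical pre-plan validation steps
- name: Lint and Validate
  run: |
    terraform fmt -check -recursive
    terragrunt run-all validate --terragrunt-non-interactive

- name: Static Security Scan
  uses: aquasecurity/tfsec-action@v1.0.0
  with:
    working_directory: environments/dev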

Q

Why does my multi-account setup work locally but fail in GitHub Actions?

A

Classic "works on my machine" problem. GitHub Actions runs in a completely different environment with different auth and config. Debugging this will make you question your life choices.

Usual suspects:

  • AWS profiles that exist on your laptop but not in CI/CD
  • Environment variables you set locally but forgot to add to GitHub secrets
  • Different tool versions between your machine and CI/CD runners
  • Network issues (VPC access, security groups, random AWS bullshit)

Turn on verbose logging and diff everything. Usually something stupidly obvious once you find it, but finding it takes 4 hours and several mental breakdowns.

Q

How do I handle Terraform provider version conflicts across multiple accounts?

A

Use version constraints in your provider configuration and lock files to ensure consistency:

terraform {
  required_version = ">= 1.9"
  required_providers {
    aws = {
      source  = "hashicorp/aws"  
      version = "~> 6.0"
    }
  }
}

Commit .terraform.lock.hcl files to Git so all developers and CI/CD use identical provider versions. Version conflicts happen when different environments use different provider versions that aren't backward compatible.

Q

What's the disaster recovery plan if our entire CI/CD pipeline is unavailable?

A

Have break-glass procedures ready before you need them:

  • Emergency IAM roles with temporary admin access that bypass normal OIDC authentication
  • Local deployment scripts that can run Terraform from developer machines in emergencies
  • State file backups stored outside the primary CI/CD system
  • Documentation for manual deployment procedures that doesn't rely on the broken CI/CD

Test these procedures quarterly when everyone is awake and caffeinated. Don't wait for a real incident to discover your emergency process doesn't work.
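A break-glass script can be as dumb as this sketch - the EmergencyDeploy role name is hypothetical, and the point is that it's a separate role whose every assumption fires an alarm:

#!/bin/bash
## Hypothetical break-glass deploy for when CI/CD is down
set -euo pipefail

CREDS=$(aws sts assume-role \
  --role-arn arn:aws:iam::333333333333:role/EmergencyDeploy \
  --role-session-name "break-glass-$(whoami)" \
  --query 'Credentials.[AccessKeyId,SecretAccessKey,SessionToken]' \
  --output text)

export AWS_ACCESS_KEY_ID=$(echo "$CREDS" | cut -f1)
export AWS_SECRET_ACCESS_KEY=$(echo "$CREDS" | cut -f2)
export AWS_SESSION_TOKEN=$(echo "$CREDS" | cut -f3)

cd environments/prod
terragrunt run-all plan   # plan first, even during an incident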

Q

How do I convince my team to adopt multi-account DevOps automation instead of manual deployments?

A

Show them the numbers:

  • Manual deployment time: 2-4 hours per account × number of accounts × deployment frequency
  • Automated deployment time: 15 minutes for all accounts, regardless of scale
  • Error rates: Manual deployments have 10-20% failure rates, automation has <5%
  • Audit compliance: Git history provides complete change tracking for free

Start with development environments where failure has low impact. Once they see automation working reliably for dev, production adoption becomes obvious. Don't try to automate everything at once - begin with simple, low-risk deployments and gradually add complexity.

Q

Is this overkill for small teams with only 3-5 AWS accounts?

A

Depends. Break-even is around 5 accounts or when manual deployment disasters start causing production outages weekly.

For small teams, consider:

  • Simple approach: GitHub Actions with basic multi-account deployment
  • Skip Terragrunt: Use simple Terraform with environment-specific variable files
  • Minimal tooling: Focus on OIDC authentication and basic state management
  • Manual approvals: Use GitHub environment protection for production instead of complex approval workflows

You can always add complexity later as you scale. Starting simple and adding features gradually is better than over-engineering from day one.

Q

How do I train my team on this new workflow without breaking production?

A

Start with sandbox accounts where people can break shit:

  • Training accounts for breaking things safely
  • Runbooks for common scenarios
  • Video recordings of the process
  • Pair programming for first deployments
  • Gradual rollout: dev → staging → prod

Make automation easier than the manual process or people will bypass it every damn time.

Q

Why does `terraform import` keep failing with "resource already exists"?

A

You're trying to import a resource that Terraform thinks it already manages. Check:

  1. Resource already in state: terraform state list | grep resource_name
  2. Resource address typo: aws_instance.web vs aws_instance.server
  3. Resource exists but state is fucked: terraform state rm then import

Copy-paste the exact resource address from your .tf file.

Q

GitHub Actions randomly failing with "rate limit exceeded"

A

AWS gets pissy when you hit their APIs too fast. Happens when deploying to multiple accounts at once. GitHub also rate limits git operations on big repos.

Fixes that usually work:

  • Add sleep 30 between account deployments (dumb but effective)
  • Reduce Terraform parallelism: terraform apply -parallelism=5
  • Split big modules into smaller ones
  • Use different regions if you can
  • For GitHub limits: use GITHUB_TOKEN with higher limits
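The dumb-but-effective version of the first two fixes looks like this (flag names are current Terragrunt/Terraform ones; older versions differ):

## Serialize accounts and throttle each one
for env in dev staging prod; do
  (cd "environments/$env" && \
    terragrunt run-all apply \
      --terragrunt-non-interactive \
      --terragrunt-parallelism 2 \
      -parallelism=5)
  sleep 30   # let AWS API throttling cool down between accounts
done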
Q

Terragrunt takes forever to download providers on every run

A

Provider caching is broken by default. This works about 80% of the time:

## Create shared plugin cache
mkdir -p ~/.terraform.d/plugin-cache
export TF_PLUGIN_CACHE_DIR=~/.terraform.d/plugin-cache

## Or in GitHub Actions
- name: Cache Terraform providers
  uses: actions/cache@v3
  with:
    path: ~/.terraform.d/plugin-cache
    key: terraform-providers-${{ hashFiles('**/.terraform.lock.hcl') }}

## Remember to export TF_PLUGIN_CACHE_DIR in the steps that actually run
## Terraform, or the cache gets restored and then ignored

Sometimes it still downloads everything anyway. No idea why.

Q

AWS SSO session expired in middle of deployment

A

Session timeout during long deployments. Add this to your profile:

## ~/.aws/config
[profile your-profile]
sso_session = your-session
sso_account_id = 123456789012
sso_role_name = YourRole
region = us-east-1

[sso-session your-session]
sso_start_url = https://example.awsapps.com/start
sso_region = us-east-1
sso_registration_scopes = sso:account:access

Run aws sso login --profile your-profile before deployments.

Q

Terraform randomly destroys resources it shouldn't touch

A

State drift, usually. Common causes:

  • Different Terraform versions handle things differently
  • Provider version conflicts (AWS provider updates break things regularly)
  • Import mistakes (imported something with wrong config)
  • Manual changes to resources that Terraform doesn't know about

Always check terraform plan before apply. If it wants to destroy production resources, don't run it. Figure out what's wrong first.

Had this happen with RDS once - someone manually changed a parameter group and Terraform wanted to destroy and recreate a 2TB database. Spent like 6 hours figuring out the import command that fixed it. Still have no idea why it worked but it did.
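When plan wants to replace something because of out-of-band changes and reality is actually what you want, a refresh-only apply folds the remote values back into state instead of recreating the resource (needs a reasonably recent Terraform):

## See exactly what forces the replacement before running anything
terraform plan

## Accept the remote, manually-changed values into state
terraform apply -refresh-only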
