Started with AWS because that's what everyone uses. Pretty standard setup - EC2 instances, RDS, the usual stuff. Two years later compliance drops the bomb that EU customer data has to live in Azure Ireland specifically. Something about Microsoft's GDPR coverage being more battle-tested than AWS Frankfurt.
Six months after that we acquire a startup running everything on GCP. Their ML pipeline is built around BigQuery and AutoML. Moving it would take months we don't have. So now I'm maintaining infrastructure on all three platforms, which is about as fun as getting a root canal while someone explains Azure naming conventions to you.
The Four Reasons You Might Actually Need This
Legal/Compliance: EU customer data had to move to Azure Ireland. Not AWS Frankfurt - specifically Azure. Something about Microsoft's GDPR compliance history. The lawyers made up their minds before I got involved. I just read the AWS GDPR docs to understand what they were talking about.
AWS Outages: December 2021, us-east-1 goes down for 6 hours. Our whole platform is dead. I'm in Slack at 2am trying to explain to pissed-off customers why their shit isn't working. CEO rolls out of bed asking why we don't have backup regions. Fair fucking question actually.
Different Strengths: AWS has decent EC2 pricing and the biggest ecosystem. GCP's ML stuff works pretty well. Azure integrates with Active Directory without too much pain.
Acquisitions: Bought a startup already running on GCP. Their ML pipeline is all BigQuery and custom models. Moving it would take forever and probably break things.
Three Ways I've Seen This Done (Two Are Terrible)
The Abstraction Layer Disaster
First thing I tried was making universal modules. Map everything to generic sizes - "small", "medium", "large". Let the module figure out whether that's a t3.medium or Standard_D2s_v3 or whatever GCP calls their instances.
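It looked something like this (reconstructed from memory - the map and names are illustrative, not our actual module):

# Generic "size" mapped to per-cloud instance types (illustrative sketch)
variable "size" {
  type    = string
  default = "medium"
}

locals {
  instance_types = {
    aws   = { small = "t3.small", medium = "t3.medium", large = "m5.large" }
    azure = { small = "Standard_B2s", medium = "Standard_D2s_v3", large = "Standard_D4s_v3" }
    gcp   = { small = "e2-small", medium = "e2-medium", large = "e2-standard-4" }
  }

  instance_type = local.instance_types[var.cloud_provider][var.size]
}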
Took forever to build and broke constantly. AWS launches new instance types weekly. Azure changes naming schemes randomly. GCP has custom machine types that don't map to anything standard.
Ended up spending more time fixing the abstraction than just writing separate configs would have taken. Also debugging was impossible - error says "medium instance failed" but which cloud? Which actual instance type? Go fuck yourself, that's which.
The Conditional Logic Nightmare (Also Bad)
My second try was putting all three clouds in the same Terraform config with conditional logic:
resource \"aws_instance\" \"web\" {
count = var.cloud_provider == \"aws\" ? var.instance_count : 0
# AWS-specific config
}
resource \"azurerm_virtual_machine\" \"web\" {
count = var.cloud_provider == \"azure\" ? var.instance_count : 0
# Azure-specific config
}
resource \"google_compute_instance\" \"web\" {
count = var.cloud_provider == \"gcp\" ? var.instance_count : 0
# GCP-specific config
}
This was cleaner than the abstraction layer but still sucked balls. Plan output was confusing as hell - 200+ resources showing up with count = 0. And when the Azure provider shit the bed (which happens weekly), Terraform would still try to initialize it even when we weren't using it.
Plus debugging was a nightmare. Error messages would reference all three resource types even when only one was actually being used.
What Actually Works: Separate Everything
After wasting a year on the previous approaches, I finally did what I should have done from the start - treated each cloud as completely independent infrastructure that occasionally talks to the others.
Here's our current setup:
- AWS: All the production web applications, databases, and general compute stuff
- Azure: EU compliance workloads and anything that needs to talk to Active Directory
- GCP: ML training jobs and BigQuery analytics (because honestly, GCP's data tools are just better)
Each cloud has its own Terraform root modules, its own state files, its own deployment pipelines. They're linked through VPN connections and shared tagging standards, not through Terraform dependencies.
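In practice that means a repo layout roughly like this (simplified, directory names are illustrative):

terraform/
  aws/      # own backend, providers, pipeline
    network/
    production/
  azure/    # own backend, providers, pipeline
    eu-compliance/
  gcp/      # own backend, providers, pipeline
    ml-platform/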
When AWS has an outage, Azure keeps running. When I need to debug Azure networking (which happens more often than I'd like), I'm not dealing with AWS resources cluttering up the plan output.
For cross-cloud networking, we use basic site-to-site VPNs. I looked at Transit Gateway and Virtual WAN integration but honestly, our traffic between clouds is minimal enough that VPNs work fine and cost way less.
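For reference, the AWS side of one of those tunnels is only a handful of resources. A rough sketch - the Azure/GCP ends mirror it, and the IP and ASN here are placeholders:

# AWS side of a site-to-site VPN (sketch; values are placeholders)
resource "aws_vpn_gateway" "main" {
  vpc_id = aws_vpc.main.id
}

resource "aws_customer_gateway" "azure" {
  bgp_asn    = 65000
  ip_address = "198.51.100.10" # public IP of the Azure VPN gateway
  type       = "ipsec.1"
}

resource "aws_vpn_connection" "to_azure" {
  vpn_gateway_id      = aws_vpn_gateway.main.id
  customer_gateway_id = aws_customer_gateway.azure.id
  type                = "ipsec.1"
  static_routes_only  = true
}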
State Management Failures That Made Me Paranoid
State files are scary enough with one cloud. With three clouds, they're terrifying.
Worst day was when Azure's API started throwing 429s during a routine plan on a Friday afternoon. Terraform tried to refresh the state, got confused, and marked half our AWS resources for destruction. I caught it before applying but spent the weekend untangling that clusterfuck and restoring from backup. Wife was not happy.
That's when I learned to keep state files completely separate.
One State File Per Cloud (Learned This The Hard Way): Each cloud gets completely separate state management. No shared state files, no cross-references, no cute attempts at unification.
# backend-aws.tf
terraform {
  backend "s3" {
    bucket = "mycompany-terraform-state-aws"
    key    = "production/terraform.tfstate"
    region = "us-east-1"
  }
}

# backend-azure.tf
terraform {
  backend "azurerm" {
    resource_group_name  = "terraform-state"
    storage_account_name = "mycompanyterraformstate"
    container_name       = "tfstate"
    key                  = "production.terraform.tfstate"
  }
}
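The GCP root modules get the same treatment with a GCS backend. A minimal sketch (the bucket name is made up):

# backend-gcp.tf (sketch; bucket name is an assumption)
terraform {
  backend "gcs" {
    bucket = "mycompany-terraform-state-gcp"
    prefix = "production"
  }
}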
Cross-Cloud Data Sources: When one cloud needs information from another, use terraform_remote_state data sources. Check out the Terraform Registry for examples.
data \"terraform_remote_state\" \"aws_network\" {
backend = \"s3\"
config = {
bucket = \"mycompany-terraform-state-aws\"
key = \"network/terraform.tfstate\"
region = \"us-east-1\"
}
}
## Use AWS VPC ID in GCP network peering
resource \"google_compute_network_peering\" \"aws_gcp\" {
name = \"aws-to-gcp\"
network = google_compute_network.vpc.id
peer_network = \"projects/aws-interconnect/global/networks/${data.terraform_remote_state.aws_network.outputs.vpc_id}\"
}
Authentication: Three Different Ways to Hate Your Life
Getting authentication working across all three clouds in CI/CD was easily the most frustrating part of this whole project.
AWS was straightforward - IAM roles with AssumeRole just work. Azure service principals were a pain to set up but work reliably once configured. GCP service account keys kept getting rotated automatically and breaking our builds until I figured out workload identity.
The real nightmare was trying to use the same GitHub Actions workflow for all three clouds. Spent two weeks making it "elegant" before saying fuck it and writing three separate workflows. Sometimes the ugly solution that works beats the pretty solution that doesn't.
Each cloud gets its own authentication setup in your CI/CD. Don't try to abstract this - just accept that you'll have three different ways to do the same thing.
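For the GCP piece, the thing that finally killed the key-rotation problem was Workload Identity Federation for GitHub Actions. Roughly, the Terraform for it looks like this - pool/provider names, the service account, and the repo path are placeholders, not our actual config:

# Workload identity pool + provider for GitHub Actions OIDC (sketch)
resource "google_iam_workload_identity_pool" "github" {
  workload_identity_pool_id = "github-actions"
}

resource "google_iam_workload_identity_pool_provider" "github" {
  workload_identity_pool_id          = google_iam_workload_identity_pool.github.workload_identity_pool_id
  workload_identity_pool_provider_id = "github-oidc"

  attribute_mapping = {
    "google.subject"       = "assertion.sub"
    "attribute.repository" = "assertion.repository"
  }
  attribute_condition = "assertion.repository_owner == \"mycompany\""

  oidc {
    issuer_uri = "https://token.actions.githubusercontent.com"
  }
}

# Let one repo impersonate the deploy service account, no keys involved
resource "google_service_account_iam_member" "github_deploy" {
  service_account_id = google_service_account.deploy.name
  role               = "roles/iam.workloadIdentityUser"
  member             = "principalSet://iam.googleapis.com/${google_iam_workload_identity_pool.github.name}/attribute.repository/mycompany/infrastructure"
}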
When Our AWS Bill Went Crazy
AWS bill is normally around $8K a month. Get an alert saying we're on track for $42K. I'm thinking the billing API is fucked or something.
Turns out some sync job between GCP and AWS got stuck in a loop. Kept copying the same 2TB dataset over and over for 5 days straight. GCP side was cheap, maybe $300 in compute. But AWS data egress? $11,000. Fucking brutal.
Each cloud bills differently and it's annoying. AWS charges for everything - data out, data between regions, data between AZs. Azure has these weird compute tiers. GCP's per-second billing is nice when you remember to use it.
What helped with costs:
- Set up billing alerts on every cloud (learned this after the $42K month, obviously)
- Used Infracost to catch expensive shit before deployment
- Tag everything consistently so you can actually track what's bleeding money
locals {
  common_tags = {
    Environment = var.environment
    Team        = var.team
    Project     = var.project
    Owner       = var.owner_email
    Cloud       = "aws" # so we know which bill this shows up on
  }
}
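One way to make sure those tags actually land on every AWS resource is the provider's default_tags block, so nobody can forget them on individual resources. A minimal sketch:

provider "aws" {
  region = "us-east-1"

  # Applied to every taggable resource this provider creates
  default_tags {
    tags = local.common_tags
  }
}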
Security and Compliance Challenges
Each cloud has different security primitives, compliance certifications, and audit requirements. What works:
Consistent Security Baselines: Use CIS Benchmarks adapted for each cloud provider. Checkov and Terrascan help automate this.
Network Security: Use cloud-native firewalls (Security Groups, Network Security Groups, Firewall Rules) but with consistent rule patterns. Document the translations between cloud security models - there's a sketch of what that looks like after this list.
Identity Federation: Use SAML/OIDC federation to connect all clouds to your central identity provider. Single sign-on across all environments.
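To make "consistent rule patterns" concrete, here's the same allow-HTTPS-from-the-office rule on AWS and GCP. Resource names and the CIDR are made up for illustration:

# AWS: security group rule
resource "aws_security_group_rule" "office_https" {
  type              = "ingress"
  from_port         = 443
  to_port           = 443
  protocol          = "tcp"
  cidr_blocks       = ["203.0.113.0/24"] # office range (example)
  security_group_id = aws_security_group.web.id
}

# GCP: the equivalent firewall rule
resource "google_compute_firewall" "office_https" {
  name          = "office-https"
  network       = google_compute_network.vpc.id
  source_ranges = ["203.0.113.0/24"]

  allow {
    protocol = "tcp"
    ports    = ["443"]
  }
}

Same intent, two different security models - which is exactly why the translation needs to be written down somewhere your on-call engineer can find it.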
What This Actually Costs
Infrastructure bills went up 67%. Not just because we're running more stuff, but data transfer between clouds, redundant load balancers, extra VPN gateways. Instead of $12K a month on AWS, we're at $20K spread across all three.
Everything takes forever. Used to spin up a new service in AWS in an afternoon. Now it's a week because I need to figure out the Azure equivalent, then the GCP version, then make sure they all talk to each other without breaking.
Team doubled. Two people could handle our AWS setup. Now we need four just to keep up with three different clouds shitting the bed in three different ways.
Learning curve blows. I used to actually know AWS pretty well. Now I'm mediocre at three clouds instead of good at one. My team has the same problem - we're all learning Azure and GCP on the fly while trying not to break production.
On-call is hell. AWS going down was bad enough. Now we get paged for Azure API timeouts at 3am, GCP quota limits during load tests, VPN tunnels dropping randomly. Three different ways for things to break while you're trying to sleep. Some Azure errors I still don't understand and Microsoft support is useless.
Worth it for us because compliance lawyers didn't give us a choice. But I wouldn't pick this clusterfuck if I had options.