How I Got Stuck Managing Three Fucking Cloud Providers

Started with AWS because that's what everyone uses. Pretty standard setup - EC2 instances, RDS, the usual stuff. Two years later compliance drops the bomb that EU customer data has to live in Azure Ireland specifically. Something about Microsoft's GDPR coverage being more battle-tested than AWS Frankfurt.

Six months after that we acquire a startup running everything on GCP. Their ML pipeline is built around BigQuery and AutoML. Moving it would take months we don't have. So now I'm maintaining infrastructure on all three platforms, which is about as fun as getting a root canal while someone explains Azure naming conventions to you.

The Four Reasons You Might Actually Need This

Legal/Compliance: EU customer data had to move to Azure Ireland. Not AWS Frankfurt - specifically Azure. Something about Microsoft's GDPR compliance history. The lawyers made up their minds before I got involved. I just read the AWS GDPR docs to understand what they were talking about.

AWS Outages: December 2021, us-east-1 goes down for 6 hours. Our whole platform is dead. I'm in Slack at 2am trying to explain to pissed-off customers why their shit isn't working. CEO rolls out of bed asking why we don't have backup regions. Fair fucking question actually.

Different Strengths: AWS has decent EC2 pricing and the biggest ecosystem. GCP's ML stuff works pretty well. Azure integrates with Active Directory without too much pain.

Acquisitions: Bought a startup already running on GCP. Their ML pipeline is all BigQuery and custom models. Moving it would take forever and probably break things.

Three Ways I've Seen This Done (Two Are Terrible)

The Abstraction Layer Disaster

First thing I tried was making universal modules. Map everything to generic sizes - "small", "medium", "large". Let the module figure out whether that's a t3.medium or Standard_D2s_v3 or whatever GCP calls their instances.

Took forever to build and broke constantly. AWS launches new instance types weekly. Azure changes naming schemes randomly. GCP has custom machine types that don't map to anything standard.

Ended up spending more time fixing the abstraction than just writing separate configs would have taken. Also debugging was impossible - error says "medium instance failed" but which cloud? Which actual instance type? Go fuck yourself, that's which.

The Conditional Logic Nightmare (Also Bad)

My second try was putting all three clouds in the same Terraform config with conditional logic:

resource \"aws_instance\" \"web\" {
  count = var.cloud_provider == \"aws\" ? var.instance_count : 0
  # AWS-specific config
}

resource \"azurerm_virtual_machine\" \"web\" {
  count = var.cloud_provider == \"azure\" ? var.instance_count : 0  
  # Azure-specific config
}

resource \"google_compute_instance\" \"web\" {
  count = var.cloud_provider == \"gcp\" ? var.instance_count : 0
  # GCP-specific config
}

This was cleaner than the abstraction layer but still sucked balls. Plan output was confusing as hell - 200+ resources showing count = 0. When the Azure provider shit the bed (which happened weekly), Terraform would still try to initialize it even when we weren't using it.

Plus debugging was a nightmare. Error messages would reference all three resource types even when only one was actually being used.

What Actually Works: Separate Everything

After wasting a year on the previous approaches, I finally did what I should have done from the start - treated each cloud as completely independent infrastructure that occasionally talks to the others.

Here's our current setup:

  • AWS: All the production web applications, databases, and general compute stuff
  • Azure: EU compliance workloads and anything that needs to talk to Active Directory
  • GCP: ML training jobs and BigQuery analytics (because honestly, GCP's data tools are just better)

Each cloud has its own Terraform root modules, its own state files, its own deployment pipelines. They're linked through VPN connections and shared tagging standards, not through Terraform dependencies.

When AWS has an outage, Azure keeps running. When I need to debug Azure networking (which happens more often than I'd like), I'm not dealing with AWS resources cluttering up the plan output.

For cross-cloud networking, we use basic site-to-site VPNs. I looked at Transit Gateway and Virtual WAN integration but honestly, our traffic between clouds is minimal enough that VPNs work fine and cost way less.

State Management Failures That Made Me Paranoid

State files are scary enough with one cloud. With three clouds, they're terrifying.

Worst day was when Azure's API started throwing 429s during a routine plan on a Friday afternoon. Terraform tried to refresh the state, got confused, and marked half our AWS resources for destruction. I caught it before applying but spent the weekend untangling that clusterfuck and restoring from backup. Wife was not happy.

That's when I learned to keep state files completely separate.

One State File Per Cloud (Learned This The Hard Way): Each cloud gets completely separate state management. No shared state files, no cross-references, no cute attempts at unification.

# backend-aws.tf
terraform {
  backend "s3" {
    bucket = "mycompany-terraform-state-aws"
    key    = "production/terraform.tfstate"
    region = "us-east-1"
  }
}

# backend-azure.tf
terraform {
  backend "azurerm" {
    resource_group_name  = "terraform-state"
    storage_account_name = "mycompanyterraformstate"
    container_name       = "tfstate"
    key                  = "production.terraform.tfstate"
  }
}

Cross-Cloud Data Sources: When one cloud needs information from another, use terraform_remote_state data sources. Check out the Terraform Registry for examples.

data \"terraform_remote_state\" \"aws_network\" {
  backend = \"s3\"
  config = {
    bucket = \"mycompany-terraform-state-aws\"
    key    = \"network/terraform.tfstate\"
    region = \"us-east-1\"
  }
}

## Use AWS VPC ID in GCP network peering
resource \"google_compute_network_peering\" \"aws_gcp\" {
  name         = \"aws-to-gcp\"
  network      = google_compute_network.vpc.id
  peer_network = \"projects/aws-interconnect/global/networks/${data.terraform_remote_state.aws_network.outputs.vpc_id}\"
}

Authentication: Three Different Ways to Hate Your Life

Getting authentication working across all three clouds in CI/CD was easily the most frustrating part of this whole project.

AWS was straightforward - IAM roles with AssumeRole just work. Azure service principals were a pain to set up but work reliably once configured. GCP service account keys kept getting rotated automatically and breaking our builds until I figured out workload identity.

The real nightmare was trying to use the same GitHub Actions workflow for all three clouds. Spent two weeks making it "elegant" before saying fuck it and writing three separate workflows. Sometimes the ugly solution that works beats the pretty solution that doesn't.

Each cloud gets its own authentication setup in your CI/CD. Don't try to abstract this - just accept that you'll have three different ways to do the same thing.
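
For reference, the GCP workload identity setup that replaced the rotating keys looks roughly like this. A sketch only - the pool, provider, repo, and the google_service_account.ci it references are placeholders, and newer google provider versions may also want an attribute_condition:

# Hypothetical sketch of GCP Workload Identity Federation for GitHub Actions.
# Pool/provider IDs and the repo are placeholders - adjust for your project.
resource "google_iam_workload_identity_pool" "github" {
  workload_identity_pool_id = "github-pool"
}

resource "google_iam_workload_identity_pool_provider" "github" {
  workload_identity_pool_id          = google_iam_workload_identity_pool.github.workload_identity_pool_id
  workload_identity_pool_provider_id = "github-provider"

  attribute_mapping = {
    "google.subject"       = "assertion.sub"
    "attribute.repository" = "assertion.repository"
  }

  oidc {
    issuer_uri = "https://token.actions.githubusercontent.com"
  }
}

# Let one repo impersonate the CI service account - no long-lived keys to rotate.
# Assumes a google_service_account.ci defined elsewhere.
resource "google_service_account_iam_member" "github_ci" {
  service_account_id = google_service_account.ci.name
  role               = "roles/iam.workloadIdentityUser"
  member             = "principalSet://iam.googleapis.com/${google_iam_workload_identity_pool.github.name}/attribute.repository/my-org/my-repo"
}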

When Our AWS Bill Went Crazy

AWS bill is normally around $8K a month. Get an alert saying we're on track for $42K. I'm thinking the billing API is fucked or something.

Turns out some sync job between GCP and AWS got stuck in a loop. Kept copying the same 2TB dataset over and over for 5 days straight. GCP side was cheap, maybe $300 in compute. But AWS data egress? $11,000. Fucking brutal.

Each cloud bills differently and it's annoying. AWS charges for everything - data out, data between regions, data between AZs. Azure has these weird compute tiers. GCP's per-second billing is nice when you remember to use it.

What helped with costs:

  • Set up billing alerts on every cloud (learned this after the $42K month, obviously) - see the budget sketch below
  • Used Infracost to catch expensive shit before deployment
  • Tag everything consistently so you can actually track what's bleeding money
locals {
  common_tags = {
    Environment = var.environment
    Team        = var.team
    Project     = var.project
    Owner       = var.owner_email
    Cloud       = "aws"  # so we know which bill this shows up on
  }
}
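
The billing alerts themselves can live in Terraform too. A minimal sketch of the AWS side using aws_budgets_budget - the limit and email are placeholders, and Azure/GCP have their own equivalents (azurerm_consumption_budget_subscription, google_billing_budget):

# Hypothetical AWS budget alert - fires at 80% of a $10k monthly limit.
resource "aws_budgets_budget" "monthly" {
  name         = "monthly-spend"
  budget_type  = "COST"
  limit_amount = "10000"
  limit_unit   = "USD"
  time_unit    = "MONTHLY"

  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 80
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_email_addresses = ["ops@example.com"]  # placeholder address
  }
}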

Security and Compliance Challenges

Each cloud has different security primitives, compliance certifications, and audit requirements. What works:

Consistent Security Baselines: Use CIS Benchmarks adapted for each cloud provider. Checkov and Terrascan help automate this.

Network Security: Use cloud-native firewalls (Security Groups, Network Security Groups, Firewall Rules) but with consistent rule patterns. Document the translations between cloud security models.
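
To make that concrete, here's the same "allow HTTPS from the LB range" rule in two of the three dialects - a sketch with placeholder CIDRs and assumed resource names (aws_security_group.web, azurerm_network_security_group.web), not our actual rules:

# AWS: security group rule (stateful, attached to instances/ENIs)
resource "aws_security_group_rule" "https_in" {
  type              = "ingress"
  from_port         = 443
  to_port           = 443
  protocol          = "tcp"
  cidr_blocks       = ["10.0.0.0/16"]  # placeholder LB range
  security_group_id = aws_security_group.web.id
}

# Azure: NSG rule (priority-ordered, attached to subnets/NICs)
resource "azurerm_network_security_rule" "https_in" {
  name                        = "allow-https"
  priority                    = 100
  direction                   = "Inbound"
  access                      = "Allow"
  protocol                    = "Tcp"
  source_port_range           = "*"
  destination_port_range      = "443"
  source_address_prefix       = "10.0.0.0/16"  # placeholder LB range
  destination_address_prefix  = "*"
  resource_group_name         = var.resource_group_name
  network_security_group_name = azurerm_network_security_group.web.name
}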

Identity Federation: Use SAML/OIDC federation to connect all clouds to your central identity provider. Single sign-on across all environments.

What This Actually Costs

Infrastructure bills went up 67%. Not just because we're running more stuff, but data transfer between clouds, redundant load balancers, extra VPN gateways. Instead of $12K a month on AWS, we're at $20K spread across all three.

Everything takes forever. Used to spin up a new service in AWS in an afternoon. Now it's a week because I need to figure out the Azure equivalent, then the GCP version, then make sure they all talk to each other without breaking.

Team doubled. Two people could handle our AWS setup. Now we need four just to keep up with three different clouds shitting the bed in three different ways.

Learning curve blows. I used to actually know AWS pretty well. Now I'm mediocre at three clouds instead of good at one. My team has the same problem - we're all learning Azure and GCP on the fly while trying not to break production.

On-call is hell. AWS going down was bad enough. Now we get paged for Azure API timeouts at 3am, GCP quota limits during load tests, VPN tunnels dropping randomly. Three different ways for things to break while you're trying to sleep. Some Azure errors I still don't understand and Microsoft support is useless.

Worth it for us because compliance lawyers didn't give us a choice. But I wouldn't pick this clusterfuck if I had options.

The Setup Process That Might Not Kill You (No Guarantees)

Alright, you're actually doing this.

Here's how I set it up, including all the shit that broke and made me question my career choices.

Step 1: Directory Structure That Won't Make You Want to Quit

I went through five different project structures before landing on this one. The key insight: each cloud is its own separate Terraform project.

No shared roots, no clever abstractions.

multicloud-terraform/
├── environments/
│   ├── production/
│   │   ├── aws/
│   │   │   ├── main.tf
│   │   │   ├── backend.tf
│   │   │   └── terraform.tfvars
│   │   ├── azure/
│   │   │   ├── main.tf  
│   │   │   ├── backend.tf
│   │   │   └── terraform.tfvars
│   │   └── gcp/
│   │       ├── main.tf
│   │       ├── backend.tf
│   │       └── terraform.tfvars
│   └── development/
│       └── [same structure]
├── modules/
│   ├── networking/
│   ├── compute/
│   └── storage/
└── shared/
    ├── variables.tf
    └── outputs.tf

Why this structure? Because when Azure's API decides to shit itself and return 500 errors for 6 hours (true story from last fucking month), your AWS and GCP deployments keep working. Each cloud fails independently instead of taking everything down in a beautiful cascade of failure.

Step 2: Provider Configs - Keep It Simple, Keep It Separate

First mistake: trying to put all three providers in the same main.tf.

Don't do this shit. Each cloud gets its own Terraform workspace with its own provider config. The Terraform documentation explains why this matters for provider initialization, if you like reading docs instead of learning from pain.

AWS Configuration (environments/production/aws/main.tf):

terraform {
  required_version = ">= 1.5"
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "aws" {
  region = var.primary_region
  
  default_tags {
    tags = local.common_tags
  }
}

provider "aws" {
  alias  = "secondary"
  region = var.secondary_region
  
  default_tags {
    tags = local.common_tags
  }
}

locals {
  common_tags = {
    Environment    = var.environment
    Application    = var.application
    ManagedBy     = "terraform"
    CloudProvider = "aws"
    Project       = "multicloud-demo"
  }
}

Azure Configuration (environments/production/azure/main.tf):

terraform {
  required_version = ">= 1.5"
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 3.0"
    }
  }
}

provider "azurerm" {
  features {}
  
  subscription_id = var.azure_subscription_id
}

locals {
  common_tags = {
    Environment    = var.environment
    Application    = var.application
    ManagedBy     = "terraform"
    CloudProvider = "azure"
    Project       = "multicloud-demo"
  }
}

GCP Configuration (environments/production/gcp/main.tf):

terraform {
  required_version = ">= 1.5"
  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "~> 4.0"
    }
  }
}

provider "google" {
  project = var.gcp_project_id
  region  = var.primary_region
}

locals {
  common_tags = {
    environment     = var.environment
    application     = var.application
    managed-by     = "terraform"
    cloud-provider = "gcp"
    project        = "multicloud-demo"
  }
}

Step 3: Networking - The Foundation That Makes or Breaks Everything

The brutal fucking reality: each cloud has completely different networking primitives that don't map onto each other. Don't try to abstract this away.

AWS Network Module (modules/networking/aws.tf):

resource "aws_vpc" "main" {
  cidr_block           = var.vpc_cidr
  enable_dns_hostnames = true
  enable_dns_support   = true
  
  tags = merge(var.common_tags, {
    Name = "${var.environment}-vpc"
  })
}

resource "aws_subnet" "public" {
  count = length(var.public_subnet_cidrs)
  
  vpc_id                  = aws_vpc.main.id
  cidr_block             = var.public_subnet_cidrs[count.index]
  availability_zone      = data.aws_availability_zones.available.names[count.index]
  map_public_ip_on_launch = true
  
  tags = merge(var.common_tags, {
    Name = "${var.environment}-public-${count.index + 1}"
    Type = "public"
  })
}

resource "aws_subnet" "private" {
  count = length(var.private_subnet_cidrs)
  
  vpc_id            = aws_vpc.main.id
  cidr_block        = var.private_subnet_cidrs[count.index]
  availability_zone = data.aws_availability_zones.available.names[count.index]
  
  tags = merge(var.common_tags, {
    Name = "${var.environment}-private-${count.index + 1}"
    Type = "private"
  })
}

resource "aws_internet_gateway" "main" {
  vpc_id = aws_vpc.main.id
  
  tags = merge(var.common_tags, {
    Name = "${var.environment}-igw"
  })
}

Azure Network Module (modules/networking/azure.tf):

resource "azurerm_virtual_network" "main" {
  name                = "${var.environment}-vnet"
  address_space       = [var.vnet_cidr]
  location           = var.location
  resource_group_name = var.resource_group_name
  
  tags = var.common_tags
}

resource "azurerm_subnet" "public" {
  count = length(var.public_subnet_cidrs)
  
  name                 = "${var.environment}-public-${count.index + 1}"
  resource_group_name  = var.resource_group_name
  virtual_network_name = azurerm_virtual_network.main.name
  address_prefixes     = [var.public_subnet_cidrs[count.index]]
}

resource "azurerm_subnet" "private" {
  count = length(var.private_subnet_cidrs)
  
  name                 = "${var.environment}-private-${count.index + 1}"
  resource_group_name  = var.resource_group_name
  virtual_network_name = azurerm_virtual_network.main.name
  address_prefixes     = [var.private_subnet_cidrs[count.index]]
}

GCP Network Module (modules/networking/gcp.tf):

resource "google_compute_network" "main" {
  name                    = "${var.environment}-vpc"
  auto_create_subnetworks = false
  routing_mode           = "GLOBAL"
}

resource "google_compute_subnetwork" "public" {
  count = length(var.public_subnet_cidrs)
  
  name          = "${var.environment}-public-${count.index + 1}"
  ip_cidr_range = var.public_subnet_cidrs[count.index]
  network       = google_compute_network.main.id
  region        = var.region
}

resource "google_compute_subnetwork" "private" {
  count = length(var.private_subnet_cidrs)
  
  name          = "${var.environment}-private-${count.index + 1}"
  ip_cidr_range = var.private_subnet_cidrs[count.index]
  network       = google_compute_network.main.id
  region        = var.region
  
  private_ip_google_access = true
}

Step 4: Cross-Cloud Connectivity (Where Dreams Die)

Option 1: VPN Connections (Cheap but slow)
Set up site-to-site VPNs between cloud providers.

AWS has VPN Gateway, Azure has VPN Gateway, GCP has Cloud VPN.

Pros: Cheap (usually under $50/month), encrypted, relatively simple
Cons: Bandwidth limited (usually 1-10 Gbps), latency issues, complex routing
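
For the record, the AWS side of one of those tunnels is only a handful of resources. A rough sketch with placeholder IPs and CIDRs - the GCP side needs a matching google_compute_vpn_gateway and tunnel:

# Hypothetical AWS side of a site-to-site VPN to GCP. All IPs/CIDRs are placeholders.
resource "aws_customer_gateway" "gcp" {
  bgp_asn    = 65000
  ip_address = "203.0.113.10"  # GCP Cloud VPN gateway public IP
  type       = "ipsec.1"
}

resource "aws_vpn_gateway" "main" {
  vpc_id = aws_vpc.main.id
}

resource "aws_vpn_connection" "to_gcp" {
  vpn_gateway_id      = aws_vpn_gateway.main.id
  customer_gateway_id = aws_customer_gateway.gcp.id
  type                = "ipsec.1"
  static_routes_only  = true
}

resource "aws_vpn_connection_route" "gcp_cidr" {
  destination_cidr_block = "10.20.0.0/16"  # GCP VPC range, placeholder
  vpn_connection_id      = aws_vpn_connection.to_gcp.id
}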

Option 2: Cloud Interconnects (Fast but expensive)
Use dedicated connections like AWS Direct Connect, Azure ExpressRoute, GCP Cloud Interconnect.

Check the pricing calculators before committing.

Pros: High bandwidth (up to 100 Gbps), low latency, predictable performance
Cons: Expensive ($1,000+ monthly), complex setup, requires co-location facilities

Option 3: Third-Party SD-WAN (Aviatrix, Alkira, etc.)
Use a third-party service to manage connectivity between clouds.

Pros: Simplified management, consistent policies, often better than doing it yourself
Cons: Additional cost, vendor dependency, still complex under the hood

We started with VPNs for dev and testing.

For production with heavy cross-cloud traffic, dedicated connections make sense but they're expensive.

Step 5: Compute Resources - The Part That Actually Matters

Here's a practical approach to compute that works across clouds:

Consistent Compute Module Interface:

module "web_servers" {
  source = "./modules/compute"
  
  cloud_provider    = var.cloud_provider
  environment      = var.environment
  application      = var.application
  
  instance_count   = var.web_server_count
  instance_size    = var.web_server_size  # "small", "medium", "large"
  
  vpc_id          = module.networking.vpc_id
  subnet_ids      = module.networking.private_subnet_ids
  security_groups = [module.security.web_security_group_id]
  
  user_data = file("${path.module}/scripts/web-server-setup.sh")
  
  tags = local.common_tags
}

Size Translation Logic:

# modules/compute/locals.tf
locals {
  # AWS instance type mapping
  aws_instance_types = {
    "small"  = "t3.micro"
    "medium" = "t3.small"  
    "large"  = "t3.medium"
    "xlarge" = "t3.large"
  }
  
  # Azure VM size mapping  
  azure_vm_sizes = {
    "small"  = "Standard_B1ls"
    "medium" = "Standard_B1s"
    "large"  = "Standard_B2s"
    "xlarge" = "Standard_B4ms"
  }
  
  # GCP machine type mapping
  gcp_machine_types = {
    "small"  = "e2-micro"
    "medium" = "e2-small"
    "large"  = "e2-medium" 
    "xlarge" = "e2-standard-2"
  }
  
  # Select appropriate instance type based on provider
  instance_type = (
    var.cloud_provider == "aws" ? local.aws_instance_types[var.instance_size] :
    var.cloud_provider == "azure" ? local.azure_vm_sizes[var.instance_size] :
    var.cloud_provider == "gcp" ? local.gcp_machine_types[var.instance_size] :
    "unknown"
  )
}

Step 6: Secrets and Configuration Management

Never put secrets in Terraform configurations. Use each cloud's native secret management. Read the Terraform security documentation and HashiCorp's best practices guide:

AWS Secrets Example:

resource "aws_secretsmanager_secret" "db_password" {
  name = "${var.environment}-db-password"
  tags = local.common_tags
}

resource "aws_secretsmanager_secret_version" "db_password" {
  secret_id     = aws_secretsmanager_secret.db_password.id
  secret_string = var.db_password
}

# Reference in RDS instance
resource "aws_db_instance" "main" {
  # ... other config
  manage_master_user_password = true
  master_user_secret_kms_key_id = aws_kms_key.main.arn
}

Azure Key Vault Example:

resource "azurerm_key_vault" "main" {
  name                = "${var.environment}-kv"
  location           = var.location
  resource_group_name = var.resource_group_name
  tenant_id          = data.azurerm_client_config.current.tenant_id
  sku_name           = "standard"
  
  tags = var.common_tags
}

resource "azurerm_key_vault_secret" "db_password" {
  name         = "db-password"
  value        = var.db_password
  key_vault_id = azurerm_key_vault.main.id
  
  tags = var.common_tags
}

GCP Secret Manager Example:

resource "google_secret_manager_secret" "db_password" {
  secret_id = "${var.environment}-db-password"
  
  labels = var.common_tags
  
  replication {
    automatic = true
  }
}

resource "google_secret_manager_secret_version" "db_password" {
  secret      = google_secret_manager_secret.db_password.id
  secret_data = var.db_password
}

Step 7: Monitoring and Observability

Each cloud has native monitoring, but you need unified visibility:

Standardized Alerting:

# CloudWatch (AWS)
resource "aws_cloudwatch_metric_alarm" "high_cpu" {
  alarm_name          = "${var.environment}-high-cpu"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = "2"
  metric_name         = "CPUUtilization"
  namespace           = "AWS/EC2"
  period              = "120"
  statistic           = "Average"
  threshold           = "80"
  alarm_description   = "This metric monitors ec2 cpu utilization"
  alarm_actions       = [aws_sns_topic.alerts.arn]
}

# Azure Monitor
resource "azurerm_monitor_metric_alert" "high_cpu" {
  name                = "${var.environment}-high-cpu"
  resource_group_name = var.resource_group_name
  scopes              = [azurerm_virtual_machine.main.id]
  description         = "High CPU utilization alert"
  
  criteria {
    metric_namespace = "Microsoft.

Compute/virtualMachines"
    metric_name      = "Percentage CPU"
    aggregation      = "Average"
    operator         = "GreaterThan"
    threshold        = 80
  }
  
  action {
    action_group_id = azurerm_monitor_action_group.alerts.id
  }
}
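
GCP's side would be an alert policy. A hedged sketch - the notification channel (google_monitoring_notification_channel.alerts) is assumed to exist elsewhere, and the filter string is the standard GCE CPU metric:

# GCP Monitoring alert - hypothetical sketch, the notification channel is assumed.
resource "google_monitoring_alert_policy" "high_cpu" {
  display_name = "${var.environment}-high-cpu"
  combiner     = "OR"

  conditions {
    display_name = "CPU above 80%"

    condition_threshold {
      filter          = "metric.type=\"compute.googleapis.com/instance/cpu/utilization\" AND resource.type=\"gce_instance\""
      comparison      = "COMPARISON_GT"
      threshold_value = 0.8
      duration        = "120s"

      aggregations {
        alignment_period   = "60s"
        per_series_aligner = "ALIGN_MEAN"
      }
    }
  }

  notification_channels = [google_monitoring_notification_channel.alerts.id]
}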

Unified Logging Strategy:

  • Ship all logs to a central location (ELK, Splunk, or Datadog)
  • Use consistent log formats across all clouds
  • Tag everything with cloud provider, environment, application

Step 8: Deployment Pipeline That Doesn't Suck

# .github/workflows/multicloud-deploy.yml
name: Multicloud Deployment

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  plan:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        cloud: [aws, azure, gcp]
        environment: [development, production]
    
    permissions:
      id-token: write  # required for OIDC role assumption (AWS)
      contents: read

    steps:
      - uses: actions/checkout@v4

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: 1.5.0

      - name: Configure AWS Credentials
        if: matrix.cloud == 'aws'
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
          aws-region: us-east-1

      - name: Configure Azure Credentials
        if: matrix.cloud == 'azure'
        uses: azure/login@v1
        with:
          creds: ${{ secrets.AZURE_CREDENTIALS }}

      - name: Configure GCP Credentials
        if: matrix.cloud == 'gcp'
        uses: google-github-actions/auth@v2
        with:
          credentials_json: ${{ secrets.GCP_CREDENTIALS }}

      - name: Terraform Init
        working-directory: environments/${{ matrix.environment }}/${{ matrix.cloud }}
        run: terraform init

      - name: Terraform Plan
        working-directory: environments/${{ matrix.environment }}/${{ matrix.cloud }}
        run: terraform plan -out=tfplan

      - name: Upload Plan
        uses: actions/upload-artifact@v4
        with:
          name: tfplan-${{ matrix.cloud }}-${{ matrix.environment }}
          path: environments/${{ matrix.environment }}/${{ matrix.cloud }}/tfplan

  deploy:
    needs: plan
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'
    strategy:
      matrix:
        cloud: [aws, azure, gcp]
        environment: [production]
    
    steps:
      - uses: actions/checkout@v4

      # Cloud credentials need configuring here too, same as the plan job
      # (omitted for brevity)

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: 1.5.0

      - name: Download Plan
        uses: actions/download-artifact@v4
        with:
          name: tfplan-${{ matrix.cloud }}-${{ matrix.environment }}
          path: environments/${{ matrix.environment }}/${{ matrix.cloud }}/

      - name: Terraform Init
        working-directory: environments/${{ matrix.environment }}/${{ matrix.cloud }}
        run: terraform init

      - name: Terraform Apply
        working-directory: environments/${{ matrix.environment }}/${{ matrix.cloud }}
        run: terraform apply tfplan

What Breaks at 3 AM (And Will Ruin Your Weekend)

Provider versions are fucking terrible. The AWS provider updates weekly and breaks random shit. EKS node group behavior changed in 5.17.0 and completely fucked our dev environment. Now I pin everything exactly because HashiCorp has no concept of backwards compatibility.

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "= 5.17.0"  # Never use ~>, always pin exactly or die
    }
  }
}

Azure's API is garbage. Random 429 errors, timeouts, stuff that works in East US but not West US for absolutely no fucking reason. On my M1 Mac the Azure CLI times out constantly. Works fine on Ubuntu though, so maybe it's just Apple hate.

GCP quota limits make no sense. Hit some random VPC limit in us-central1 during load testing. Error message just says "quota exceeded" but doesn't tell you which quota or how to increase it. Had to create a support ticket and wait 3 days to figure out it was the "routes per VPC" limit. Who the fuck tracks that?

Data transfer costs will bankrupt you. Left a backup job running between AWS and GCP for a week, cost $3,400 in egress fees. I have billing alerts set at $500 now instead of $5K because AWS will bleed you dry.

State corruption is inevitable. More clouds, more ways for Terraform state to get fucked up. S3 backend works fine, Azure backend is flaky as hell, GCP backend is okay but slower than molasses.

When AWS went down last month we did fail over to Azure in about 30 minutes. So there's that, I guess.

Multicloud Strategy Comparison: What Works vs What Sounds Good

| Strategy | Implementation Complexity | Vendor Lock-in Risk | Cost Impact | Time to Production | Best For | Avoid If |
|---|---|---|---|---|---|---|
| Provider Abstraction Modules | Very High (6-12 months) | Low | +25-40% infrastructure | 8-12 months | Large enterprises with dedicated platform teams | Small teams, rapid prototyping |
| Cloud-Agnostic Resources | High (3-6 months) | Medium | +15-25% infrastructure | 4-8 months | Teams with existing Terraform expertise | Complex cloud-native features needed |
| Federated Infrastructure | Medium (2-4 months) | Medium-High | +10-20% infrastructure | 2-6 months | Most production use cases | Need tight cross-cloud integration |
| Single Cloud + DR | Low (1-2 months) | High | +5-15% infrastructure | 1-3 months | Cost-conscious teams, simple failover needs | Regulatory multi-region requirements |
| Best-of-Breed Services | Very High (6+ months) | Low | +20-35% infrastructure | 6-12 months | AI/ML workloads, specialized requirements | Standardized enterprise environments |

Multicloud Terraform: The Questions You're Actually Asking (And Brutally Honest Answers)

Q: Should I really do multicloud or is this just resume padding?

A: Most multicloud projects die when someone adds up the real costs.

Do this if:

  • Lawyers say you have to (compliance, data residency)
  • You're buying companies that run on different clouds
  • Single cloud outages would kill your business
  • You actually need specific services from each cloud

Don't do this shit if:

  • You think it looks impressive on your resume
  • You're worried about vendor lock-in but totally fine with complexity lock-in
  • Some consultant told your boss it's "strategic" without explaining the engineering cost

Q: How do I handle Terraform state files across multiple clouds?

A: Keep them completely separate. Don't try to manage AWS, Azure, and GCP resources in the same state file - that's a recipe for disaster.

Use cloud-native backends:

  • AWS: S3 + DynamoDB locking
  • Azure: Azure Storage with Blob backend
  • GCP: GCS backend

When you need cross-cloud data, use terraform_remote_state data sources. Yes, it's more complex, but it prevents one cloud from destroying your entire infrastructure when things go wrong.
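
For reference, the AWS backend with locking looks something like this (bucket and table names are placeholders; the DynamoDB table needs a LockID string hash key):

# S3 backend with DynamoDB state locking - names are placeholders.
terraform {
  backend "s3" {
    bucket         = "mycompany-terraform-state-aws"
    key            = "production/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }
}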

Q: My Terraform apply fails randomly on Azure but works fine on AWS. Why?

A: Azure's API is weird, inconsistent, and generally fucking terrible. Common shit I've run into:

  • Resources get created out of order even with implicit dependencies because Azure doesn't give a fuck
  • Azure has picky naming rules and length limits that aren't documented anywhere useful
  • API throws 429 errors constantly, way more than AWS
  • Sometimes resources just... don't get created. No error, no explanation. Just nothing.

What actually helped:

  • Adding explicit depends_on everywhere (annoying as hell but works - sketch below)
  • Retry logic in CI/CD - sometimes just running it again magically works
  • TF_LOG=DEBUG shows you what's actually happening but creates 50MB log files
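
The depends_on workaround from the list above looks like this - a hypothetical example, since the references alone should already imply the ordering:

# Hypothetical: force Azure resource ordering explicitly, even though the
# virtual_network_name reference should be enough on its own.
resource "azurerm_subnet" "private" {
  name                 = "private"
  resource_group_name  = azurerm_resource_group.main.name
  virtual_network_name = azurerm_virtual_network.main.name
  address_prefixes     = ["10.1.2.0/24"]

  depends_on = [azurerm_virtual_network.main]
}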

Q: How much will this actually cost compared to single cloud?

A: Infrastructure costs went up maybe 20-30%. VPN gateways, data egress between clouds, redundant load balancers, separate monitoring setups.

Engineering costs suck. Everything takes forever. Instead of being good at AWS, everyone's mediocre at three clouds. Need more people on-call, more time to debug weird cross-cloud issues.

We were spending around $2.5M a year on AWS. Now it's $3.2M across all three, plus we needed to hire two more engineers the first year just to keep the lights on. Spoiler: the costs haven't leveled out yet.

Q: Should I use the same Terraform configuration for all clouds?

A: Don't. Each cloud has different resource types, naming schemes, capabilities. Trying to make them all use the same config leads to:

  • Dumbed-down architecture that uses the worst parts of each cloud
  • Conditional logic that's impossible to debug
  • Features that work great on AWS but terrible on Azure

Better approach: consistent patterns and naming, but separate configs for each cloud.

Q: How do I handle authentication across multiple clouds in CI/CD?

A: Use each cloud's native authentication - don't try to unify this:

AWS: OIDC federation with GitHub Actions/GitLab CI
Azure: Service Principal with certificate authentication
GCP: Workload Identity Federation

Store credentials as separate secrets in your CI/CD system and configure them per deployment job. Yes, it's more setup work, but it's more secure and reliable.
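
The AWS half of that OIDC federation, roughly - a sketch with a placeholder thumbprint and repo, not a copy-paste config (check AWS docs for the current GitHub thumbprint):

# Hypothetical sketch: GitHub Actions OIDC provider + assumable role in AWS.
resource "aws_iam_openid_connect_provider" "github" {
  url             = "https://token.actions.githubusercontent.com"
  client_id_list  = ["sts.amazonaws.com"]
  thumbprint_list = ["0000000000000000000000000000000000000000"]  # placeholder
}

data "aws_iam_policy_document" "github_assume" {
  statement {
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [aws_iam_openid_connect_provider.github.arn]
    }

    # Restrict which repo can assume the role - placeholder repo name.
    condition {
      test     = "StringLike"
      variable = "token.actions.githubusercontent.com:sub"
      values   = ["repo:my-org/my-repo:*"]
    }
  }
}

resource "aws_iam_role" "github_ci" {
  name               = "github-ci"
  assume_role_policy = data.aws_iam_policy_document.github_assume.json
}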

Q: My team doesn't want to learn three different cloud providers. How do I handle this?

A: Specialization is your friend. Don't make everyone an expert in everything because that's impossible:

  • Cloud Platform Team: Builds modules and handles cross-cloud integration
  • Application Teams: Use the modules, focus on business logic
  • SRE Team: Monitors and troubleshoots, needs broad knowledge

Alternatively, seriously consider if you actually need multicloud or if you're just solving a problem that doesn't fucking exist.

Q: What's the best way to handle networking between clouds?

A: Start simple: VPN connections between cloud VPCs work for most use cases:

  • AWS VPN Gateway ↔ Azure VPN Gateway: ~$100/month
  • AWS VPN Gateway ↔ GCP Cloud VPN: ~$100/month
  • Azure VPN Gateway ↔ GCP Cloud VPN: ~$100/month

When you outgrow VPNs: Look at dedicated connections (Direct Connect, ExpressRoute, Cloud Interconnect) but expect $1,000+ monthly per connection.

Third-party solutions (Aviatrix, Alkira) can simplify management but add another vendor and costs.

Q: How do I handle different resource naming conventions across clouds?

A: Create a naming standard and translate it per cloud:

locals {
  # Base naming
  base_name = "${var.environment}-${var.application}"
  
  # AWS allows hyphens, underscores
  aws_name = "${local.base_name}-${var.resource_type}"
  
  # Azure prefers no special chars for some resources
  azure_name = replace("${local.base_name}${var.resource_type}", "-", "")
  
  # GCP has specific requirements per resource type
  gcp_name = lower(replace("${local.base_name}-${var.resource_type}", "_", "-"))
}

Document these translations and use them consistently. Don't try to make every resource name identical across clouds.

Q: Should I use Terraform Cloud/HCP Terraform for multicloud?

A: Probably not at first. The pricing gets fucking expensive with lots of resources across multiple clouds. I think it's like $1,000+ a month for 10,000 resources but I haven't checked recently because I don't hate money. State management isn't really simpler either.

We use GitHub Actions with self-hosted runners. Works fine, but setup was a pain in the ass. Atlantis is probably better if you want something fancy and don't mind another thing to maintain.

Might try HCP Terraform later when we have more budget and less on fire.

Q: My Terraform plan shows changes every time I run it, even though nothing changed. How do I fix this?

A: Common causes in multicloud:

  • Provider version drift: Lock all provider versions
  • Clock skew: Azure and GCP resources sometimes have timestamp drift
  • API responses: Different clouds return data in different formats
  • Computed values: Some values are only known after apply

Solutions:

  • Run terraform refresh first
  • Check for provider version differences between team members
  • Use lifecycle { ignore_changes = [...] } for constantly changing values (example below)
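
The ignore_changes escape hatch from the list above, as a hypothetical example:

# Hypothetical: stop Terraform from flagging tag drift that some external
# process (a cost or scanning tool) keeps writing back onto the instance.
resource "aws_instance" "web" {
  ami           = var.ami_id
  instance_type = "t3.small"

  lifecycle {
    ignore_changes = [tags["LastScanned"]]
  }
}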

Q: How do I handle secrets management across multiple clouds?

A: Use native secret management in each cloud:

  • AWS Secrets Manager
  • Azure Key Vault
  • GCP Secret Manager

Don't try to unify this with HashiCorp Vault unless you already have Vault running. Native services integrate better with other cloud services and are easier to secure.

Reference secrets in your Terraform with data sources, never hardcode them.
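
On AWS that pattern looks roughly like this - the secret name is a placeholder, and note the value still ends up in state, so lock the state down:

# Hypothetical: read the secret at plan time instead of hardcoding it.
data "aws_secretsmanager_secret_version" "db_password" {
  secret_id = "production-db-password"  # placeholder name
}

resource "aws_db_instance" "main" {
  identifier          = "main-db"
  engine              = "postgres"
  instance_class      = "db.t3.micro"
  allocated_storage   = 20
  username            = "app"
  password            = data.aws_secretsmanager_secret_version.db_password.secret_string
  skip_final_snapshot = true
}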

Q: What monitoring strategy works for multicloud?

A: Two-tier approach:

  1. Native monitoring for cloud-specific metrics (CloudWatch, Azure Monitor, GCP Monitoring)
  2. Unified dashboards for application metrics (Datadog, New Relic, Grafana)

Ship all logs to a central location (ELK, Splunk, Datadog) with consistent tagging:

tags = {
  Environment   = var.environment
  Application   = var.application
  CloudProvider = "aws"  # or "azure" or "gcp"
}

Q: Should I run identical applications in all clouds or specialize per cloud?

A: Start with specialization based on each cloud's strengths:

  • AWS: General compute, enterprise services, broad service catalog
  • Azure: Microsoft ecosystem integration, hybrid cloud
  • GCP: ML/AI workloads, analytics, container orchestration

Move to identical deployments only after you've proven the multicloud architecture works. Don't try to do both simultaneously.

Q: How do I test multicloud Terraform configurations?

A: Testing pyramid approach:

  1. Unit tests: Test individual modules with Terratest
  2. Integration tests: Deploy to dedicated test environments per cloud
  3. End-to-end tests: Test cross-cloud connectivity and failover

Don't try to test everything - focus on critical paths and failure scenarios.

Q: When should I give up on multicloud?

A: Signs you need to abort this clusterfuck:

  • Everything takes 3x longer and it's not getting better after 6+ months
  • Infrastructure costs went up 50%+ but business doesn't see the value
  • Team spends more time fighting with infrastructure than building actual features
  • People are getting burned out from the complexity and starting to quit
  • Can't hire engineers fast enough to handle all the operational overhead
  • You're spending weekends fixing cross-cloud networking issues

It's perfectly fine to go back to single cloud with disaster recovery in another region. Sometimes the simple solution is way fucking better than the "strategic" one.
