Started with AWS because that's what everyone uses. Pretty standard setup - EC2 instances, RDS, the usual stuff. Two years later compliance drops the bomb that EU customer data has to live in Azure Ireland specifically. Something about Microsoft's GDPR coverage being more battle-tested than AWS Frankfurt.
Six months after that we acquire a startup running everything on GCP. Their ML pipeline is built around BigQuery and AutoML. Moving it would take months we don't have. So now I'm maintaining infrastructure on all three platforms, which is about as fun as getting a root canal while someone explains Azure naming conventions to you.
The Four Reasons You Might Actually Need This
Legal/Compliance: EU customer data had to move to Azure Ireland. Not AWS Frankfurt - specifically Azure. Something about Microsoft's GDPR compliance history. The lawyers made up their minds before I got involved. I just read the AWS GDPR docs to understand what they were talking about.
AWS Outages: December 2021, us-east-1 goes down for 6 hours. Our whole platform is dead. I'm in Slack at 2am trying to explain to pissed-off customers why their shit isn't working. CEO rolls out of bed asking why we don't have backup regions. Fair fucking question actually.
Different Strengths: AWS has decent EC2 pricing and the biggest ecosystem. GCP's ML stuff works pretty well. Azure integrates with Active Directory without too much pain.
Acquisitions: Bought a startup already running on GCP. Their ML pipeline is all BigQuery and custom models. Moving it would take forever and probably break things.
Three Ways I've Seen This Done (Two Are Terrible)
The Abstraction Layer Disaster
First thing I tried was making universal modules. Map everything to generic sizes - "small", "medium", "large". Let the module figure out whether that's a t3.medium or Standard_D2s_v3 or whatever GCP calls their instances.
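It looked something like this (reconstructed from memory - the map and names are illustrative, not our actual module):

# Generic "size" mapped to per-cloud instance types (illustrative sketch)
variable "size" {
  type    = string
  default = "medium"
}

locals {
  instance_types = {
    aws   = { small = "t3.small", medium = "t3.medium", large = "m5.large" }
    azure = { small = "Standard_B2s", medium = "Standard_D2s_v3", large = "Standard_D4s_v3" }
    gcp   = { small = "e2-small", medium = "e2-medium", large = "e2-standard-4" }
  }

  instance_type = local.instance_types[var.cloud_provider][var.size]
}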
Took forever to build and broke constantly. AWS launches new instance types weekly. Azure changes naming schemes randomly. GCP has custom machine types that don't map to anything standard.
Ended up spending more time fixing the abstraction than just writing separate configs would have taken. Also debugging was impossible - error says "medium instance failed" but which cloud? Which actual instance type? Go fuck yourself, that's which.
The Conditional Logic Nightmare (Also Bad)
My second try was putting all three clouds in the same Terraform config with conditional logic:
resource \"aws_instance\" \"web\" {
count = var.cloud_provider == \"aws\" ? var.instance_count : 0
# AWS-specific config
}
resource \"azurerm_virtual_machine\" \"web\" {
count = var.cloud_provider == \"azure\" ? var.instance_count : 0
# Azure-specific config
}
resource \"google_compute_instance\" \"web\" {
count = var.cloud_provider == \"gcp\" ? var.instance_count : 0
# GCP-specific config
}
This was cleaner than the abstraction layer but still sucked balls. Plan output was confusing as hell - 200+ resources showing up with count = 0. And when the Azure provider shit the bed (which happens weekly), Terraform would still try to initialize it even when we weren't using it.
Plus debugging was a nightmare. Error messages would reference all three resource types even when only one was actually being used.
What Actually Works: Separate Everything
After wasting a year on the previous approaches, I finally did what I should have done from the start - treated each cloud as completely independent infrastructure that occasionally talks to the others.
Here's our current setup:
- AWS: All the production web applications, databases, and general compute stuff
- Azure: EU compliance workloads and anything that needs to talk to Active Directory
- GCP: ML training jobs and BigQuery analytics (because honestly, GCP's data tools are just better)
Each cloud has its own Terraform root modules, its own state files, its own deployment pipelines. They're linked through VPN connections and shared tagging standards, not through Terraform dependencies.
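In practice that means a repo layout roughly like this (simplified, directory names are illustrative):

terraform/
  aws/      # own backend, providers, pipeline
    network/
    production/
  azure/    # own backend, providers, pipeline
    eu-compliance/
  gcp/      # own backend, providers, pipeline
    ml-platform/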
When AWS has an outage, Azure keeps running. When I need to debug Azure networking (which happens more often than I'd like), I'm not dealing with AWS resources cluttering up the plan output.
For cross-cloud networking, we use basic site-to-site VPNs. I looked at Transit Gateway and Virtual WAN integration but honestly, our traffic between clouds is minimal enough that VPNs work fine and cost way less.
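For reference, the AWS side of one of those tunnels is only a handful of resources. A rough sketch - the Azure/GCP ends mirror it, and the IP and ASN here are placeholders:

# AWS side of a site-to-site VPN (sketch; values are placeholders)
resource "aws_vpn_gateway" "main" {
  vpc_id = aws_vpc.main.id
}

resource "aws_customer_gateway" "azure" {
  bgp_asn    = 65000
  ip_address = "198.51.100.10" # public IP of the Azure VPN gateway
  type       = "ipsec.1"
}

resource "aws_vpn_connection" "to_azure" {
  vpn_gateway_id      = aws_vpn_gateway.main.id
  customer_gateway_id = aws_customer_gateway.azure.id
  type                = "ipsec.1"
  static_routes_only  = true
}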
State Management Failures That Made Me Paranoid
State files are scary enough with one cloud. With three clouds, they're terrifying.
Worst day was when Azure's API started throwing 429s during a routine plan on a Friday afternoon. Terraform tried to refresh the state, got confused, and marked half our AWS resources for destruction. I caught it before applying but spent the weekend untangling that clusterfuck and restoring from backup. Wife was not happy.
That's when I learned to keep state files completely separate.
One State File Per Cloud (Learned This The Hard Way): Each cloud gets completely separate state management. No shared state files, no cross-references, no cute attempts at unification.
# backend-aws.tf
terraform {
  backend "s3" {
    bucket = "mycompany-terraform-state-aws"
    key    = "production/terraform.tfstate"
    region = "us-east-1"
  }
}

# backend-azure.tf
terraform {
  backend "azurerm" {
    resource_group_name  = "terraform-state"
    storage_account_name = "mycompanyterraformstate"
    container_name       = "tfstate"
    key                  = "production.terraform.tfstate"
  }
}
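The GCP root modules get the same treatment with a GCS backend. A minimal sketch (the bucket name is made up):

# backend-gcp.tf (sketch; bucket name is an assumption)
terraform {
  backend "gcs" {
    bucket = "mycompany-terraform-state-gcp"
    prefix = "production"
  }
}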
Cross-Cloud Data Sources: When one cloud needs information from another, use terraform_remote_state data sources. Check out the Terraform Registry for examples.
data \"terraform_remote_state\" \"aws_network\" {
backend = \"s3\"
config = {
bucket = \"mycompany-terraform-state-aws\"
key = \"network/terraform.tfstate\"
region = \"us-east-1\"
}
}
## Use AWS VPC ID in GCP network peering
resource \"google_compute_network_peering\" \"aws_gcp\" {
name = \"aws-to-gcp\"
network = google_compute_network.vpc.id
peer_network = \"projects/aws-interconnect/global/networks/${data.terraform_remote_state.aws_network.outputs.vpc_id}\"
}
Authentication: Three Different Ways to Hate Your Life
Getting authentication working across all three clouds in CI/CD was easily the most frustrating part of this whole project.
AWS was straightforward - IAM roles with AssumeRole just work. Azure service principals were a pain to set up but work reliably once configured. GCP service account keys kept getting rotated automatically and breaking our builds until I figured out workload identity.
The real nightmare was trying to use the same GitHub Actions workflow for all three clouds. Spent two weeks making it "elegant" before saying fuck it and writing three separate workflows. Sometimes the ugly solution that works beats the pretty solution that doesn't.
Each cloud gets its own authentication setup in your CI/CD. Don't try to abstract this - just accept that you'll have three different ways to do the same thing.
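For the GCP piece, the thing that finally killed the key-rotation problem was Workload Identity Federation for GitHub Actions. Roughly, the Terraform for it looks like this - pool/provider names, the service account, and the repo path are placeholders, not our actual config:

# Workload identity pool + provider for GitHub Actions OIDC (sketch)
resource "google_iam_workload_identity_pool" "github" {
  workload_identity_pool_id = "github-actions"
}

resource "google_iam_workload_identity_pool_provider" "github" {
  workload_identity_pool_id          = google_iam_workload_identity_pool.github.workload_identity_pool_id
  workload_identity_pool_provider_id = "github-oidc"

  attribute_mapping = {
    "google.subject"       = "assertion.sub"
    "attribute.repository" = "assertion.repository"
  }
  attribute_condition = "assertion.repository_owner == \"mycompany\""

  oidc {
    issuer_uri = "https://token.actions.githubusercontent.com"
  }
}

# Let one repo impersonate the deploy service account, no keys involved
resource "google_service_account_iam_member" "github_deploy" {
  service_account_id = google_service_account.deploy.name
  role               = "roles/iam.workloadIdentityUser"
  member             = "principalSet://iam.googleapis.com/${google_iam_workload_identity_pool.github.name}/attribute.repository/mycompany/infrastructure"
}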
When Our AWS Bill Went Crazy
AWS bill is normally around $8K a month. Get an alert saying we're on track for $42K. I'm thinking the billing API is fucked or something.
Turns out some sync job between GCP and AWS got stuck in a loop. Kept copying the same 2TB dataset over and over for 5 days straight. GCP side was cheap, maybe $300 in compute. But AWS data egress? $11,000. Fucking brutal.
Each cloud bills differently and it's annoying. AWS charges for everything - data out, data between regions, data between AZs. Azure has these weird compute tiers. GCP's per-second billing is nice when you remember to use it.
What helped with costs:
- Set up billing alerts on every cloud (learned this after the $42K month, obviously)
- Used Infracost to catch expensive shit before deployment
- Tag everything consistently so you can actually track what's bleeding money
locals {
  common_tags = {
    Environment = var.environment
    Team        = var.team
    Project     = var.project
    Owner       = var.owner_email
    Cloud       = "aws" # so we know which bill this shows up on
  }
}
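One way to make sure those tags actually land on every AWS resource is the provider's default_tags block, so nobody can forget them on individual resources. A minimal sketch:

provider "aws" {
  region = "us-east-1"

  # Applied to every taggable resource this provider creates
  default_tags {
    tags = local.common_tags
  }
}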
Security and Compliance Challenges
Each cloud has different security primitives, compliance certifications, and audit requirements. What works:
Consistent Security Baselines: Use CIS Benchmarks adapted for each cloud provider. Checkov and Terrascan help automate this.
Network Security: Use cloud-native firewalls (Security Groups, Network Security Groups, Firewall Rules) but with consistent rule patterns. Document the translations between cloud security models - there's a sketch of what that looks like after this list.
Identity Federation: Use SAML/OIDC federation to connect all clouds to your central identity provider. Single sign-on across all environments.
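To make "consistent rule patterns" concrete, here's the same allow-HTTPS-from-the-office rule on AWS and GCP. Resource names and the CIDR are made up for illustration:

# AWS: security group rule
resource "aws_security_group_rule" "office_https" {
  type              = "ingress"
  from_port         = 443
  to_port           = 443
  protocol          = "tcp"
  cidr_blocks       = ["203.0.113.0/24"] # office range (example)
  security_group_id = aws_security_group.web.id
}

# GCP: the equivalent firewall rule
resource "google_compute_firewall" "office_https" {
  name          = "office-https"
  network       = google_compute_network.vpc.id
  source_ranges = ["203.0.113.0/24"]

  allow {
    protocol = "tcp"
    ports    = ["443"]
  }
}

Same intent, two different security models - which is exactly why the translation needs to be written down somewhere your on-call engineer can find it.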
What This Actually Costs
Infrastructure bills went up 67%. Not just because we're running more stuff, but data transfer between clouds, redundant load balancers, extra VPN gateways. Instead of $12K a month on AWS, we're at $20K spread across all three.
Everything takes forever. Used to spin up a new service in AWS in an afternoon. Now it's a week because I need to figure out the Azure equivalent, then the GCP version, then make sure they all talk to each other without breaking.
Team doubled. Two people could handle our AWS setup. Now we need four just to keep up with three different clouds shitting the bed in three different ways.
Learning curve blows. I used to actually know AWS pretty well. Now I'm mediocre at three clouds instead of good at one. My team has the same problem - we're all learning Azure and GCP on the fly while trying not to break production.
On-call is hell. AWS going down was bad enough. Now we get paged for Azure API timeouts at 3am, GCP quota limits during load tests, VPN tunnels dropping randomly. Three different ways for things to break while you're trying to sleep. Some Azure errors I still don't understand and Microsoft support is useless.
Worth it for us because compliance lawyers didn't give us a choice. But I wouldn't pick this clusterfuck if I had options.