The 50k Resource Wall of Pain
I've watched Terraform deployments slow to a crawl once you hit around 50k resources. When you get into the hundreds of thousands of resources, you're looking at maybe 2 operations per second even with parallelism maxed out. That's not a performance issue, that's a career-limiting problem.
The root cause? Terraform copies the entire state file for every resource change. With state files hitting hundreds of megabytes, half your time is spent in Go's garbage collector instead of actually building infrastructure. I learned this the hard way dealing with a massive disaster recovery deployment that took most of the day to plan.
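If you want to know how close you are to the wall, two commands tell you. A quick sketch (state.json is just a scratch file name, and `terraform state pull` works with local and most remote backends):

```sh
# Count managed resources and check raw state size.
terraform state pull > state.json
terraform state list | wc -l   # number of managed resources
du -h state.json               # state size on disk
```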
Performance Numbers That'll Ruin Your Day
Really Big State Files (Think Databricks Scale):
- Plan Time: 90 minutes to two hours, and that's on a good day
- Apply Time: an hour on a good run, three if shit goes sideways
- Daily Changes: thousands of resources modified every single day
- Actual Throughput: ~2 ops/sec at best; cranking parallelism barely moves it
These numbers come from an actual enterprise disaster recovery system managing workspace replications. Same shit happens with big multi-tenant platforms, user provisioning systems, or any Unity Catalog setup that got out of hand.
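Don't take my numbers on faith - timing your own cycle is one line each. A minimal sketch; the `-refresh=false` is there to separate Terraform's own graph and state overhead from cloud API round-trips:

```sh
# Time a plan with provider refresh disabled, so you're measuring
# Terraform's graph/state handling rather than provider API latency.
time terraform plan -refresh=false -out=tfplan

# Time the apply of that saved plan.
time terraform apply tfplan
```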
Why Everything Goes to Hell
The Day Terraform Decided to Take Forever
There was this N² complexity issue with how Terraform processed large resource graphs that made big deployments crawl - like, legitimately all day to plan changes. HashiCorp finally addressed the worst of it in 1.9, but Jesus, it took them way too fucking long to fix something that fundamental.
OK, enough ranting about HashiCorp. Here's the technical reality of why everything breaks:
Global State Lock: The Single Point of Failure
Terraform uses a global lock for state modifications. Every resource change waits in line like it's the goddamn DMV, then copies the entire state file. This is why cranking parallelism to 100 does jack shit for big deployments.
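You can watch this happen yourself: run the same plan at default and maxed-out concurrency and compare wall-clock time. A sketch - on a big state the difference is marginal, because the single-threaded graph and state work dominates:

```sh
# Default concurrency is 10. Plan never writes state, so it's safe to
# run twice and the comparison is apples to apples.
time terraform plan -parallelism=10
time terraform plan -parallelism=100
```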
JSON Waste That Costs You Money
Terraform pretty-prints its JSON state files, so whitespace alone accounts for about 25% of the file size. For states transmitted over network links, you're literally paying AWS transfer costs for indentation. Brilliant engineering choice there.
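This one's trivially measurable, assuming you have jq installed:

```sh
# Compare the pretty-printed state against a whitespace-stripped copy.
terraform state pull > state.json
wc -c < state.json           # bytes as Terraform serializes it
jq -c . state.json | wc -c   # same data, compacted
```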
Terraform 1.13: They Actually Fixed Some Shit
What Got Better (Eventually):
- The N² complexity thing got fixed in 1.9 - took them long enough
- State copying improved in 1.9 too
- Added `TF_STATE_PERSIST_INTERVAL` so it stops checkpointing every 30 seconds like a paranoid robot (see the sketch after this list)
- Some performance stuff in the latest 1.13.0 - haven't tested it much yet
- New experimental deferred actions in the 1.14.0 alpha - but it's alpha, so expect breakage
- Parallelism for containers got slightly less terrible
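The persist-interval knob is just an environment variable. For example (300 is an arbitrary value, in seconds):

```sh
# Checkpoint state every 5 minutes instead of the default, trading
# crash-recovery granularity for far fewer full-state serializations.
TF_STATE_PERSIST_INTERVAL=300 terraform apply
```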
The 1.13.0 performance improvements might help with big deployments, but Terraform's core architecture is still fundamentally broken by design.
When Your Weekend Gets Ruined
< 100 resources: Life is good. Plans take seconds.
100-1k resources: Plans start taking minutes. Still manageable.
1k-50k resources: Welcome to hell. Plans take forever, optimization becomes your full-time job.
50k+ resources: Non-linear performance cliff. You either split states or find a new career.
The jump from 49k to 51k resources isn't gradual - it's like falling off a cliff. Teams report 10x slowdowns crossing the 50k mark, making capacity planning critical unless you enjoy working weekends.
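When you do hit the cliff, the escape hatch is carving the monolith into smaller states. A minimal sketch of one split, assuming local state files and a placeholder module.networking - with a remote backend you'd wrap this in the pull/push shown below:

```sh
# Pull the oversized state down to a working file.
terraform state pull > full.tfstate

# Move one module's resources into their own state file.
terraform state mv \
  -state=full.tfstate \
  -state-out=networking.tfstate \
  'module.networking' 'module.networking'

# Push the now-smaller state back; networking.tfstate becomes the
# seed state for a new, separate root module.
terraform state push full.tfstate
```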