My AWS bill went up 15% last month and now I know why. States are planning to cut power to data centers during peak demand, which means cloud providers are going to pass those costs right back to us.
Texas utilities have always been able to disconnect big power users during emergencies - that's how the grid works. There's no special law targeting data centers, just the normal load-shedding procedures that every large industrial customer deals with. It's also the same state that spent the last decade bribing tech companies to relocate with tax breaks and cheap land.
I used to deploy everything to AWS us-east-1 in Virginia because it was fast and reliable. Now I'm looking at multi-region redundancy just in case PJM decides to unplug Amazon during a heat wave.
The Math Is Pretty Simple
A single NVIDIA H100 draws about 700 watts running flat out. Scale that to a decent-sized cluster and you're talking serious power: ten thousand GPUs is 7 MW at the chips alone, and a real facility lands in the tens of megawatts once you add cooling and networking.
OpenAI's GPT-4 training reportedly used something like 25,000 GPUs for the initial run - probably more, but they don't publish the real numbers. Do the math: even at H100-class 700 watts per card, that's about 17.5 MW at the chips, call it 25 MW for the whole facility, for one fucking model. And that was the last generation - the 100,000-GPU clusters going up now sit in the 100+ MW range, and the next wave is being planned in gigawatt terms. For context, most nuclear reactors generate about 1 GW.
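If you want the back-of-the-envelope in code, it's three lines. The GPU count and PUE below are assumptions for illustration, not anything OpenAI publishes:

```python
# Back-of-the-envelope cluster power. GPU_COUNT and PUE are assumed values.
GPU_COUNT = 25_000       # assumed cluster size
WATTS_PER_GPU = 700      # H100 SXM TDP
PUE = 1.4                # assumed facility overhead (cooling, networking, conversion losses)

chip_mw = GPU_COUNT * WATTS_PER_GPU / 1e6
facility_mw = chip_mw * PUE
print(f"chips: {chip_mw:.1f} MW, facility: {facility_mw:.1f} MW")  # ~17.5 MW / ~24.5 MW
```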
I ran some numbers on my company's training workloads. We're spending like $180K-220K/month on compute for a model that might never see production - hard to get exact numbers because AWS bills are fucking incomprehensible. That's energy equivalent to 700 average American homes running 24/7.
The real problem is the demand spikes. Grid operators planned for steady industrial loads, not workloads that spike from 5MW to 45MW when someone kicks off a training run.
What This Means for Your Infrastructure
I've already seen two production incidents this summer where AWS instances got terminated during peak usage hours. Amazon calls it "capacity optimization" but it's really just rationing. The first time it happened, I got paged at 2am because our main API was returning 503s - took me an hour of digging through CloudWatch to figure out that half our fucking fleet had just vanished. No warning, no notification, just gone.
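If you want to catch that faster than an hour of CloudWatch archaeology, the check itself is simple: compare what the Auto Scaling group thinks it should be running with what's actually InService. A sketch - the group name and the 20% threshold are made up, and the print is a stand-in for a real pager:

```python
# Sketch: alert when an Auto Scaling group loses a chunk of its fleet.
# "api-fleet" and the 20% threshold are hypothetical; wire the alert into your pager.
import boto3

ASG_NAME = "api-fleet"
GAP_THRESHOLD = 0.2  # alert if more than 20% of desired capacity is missing

asg = boto3.client("autoscaling")
group = asg.describe_auto_scaling_groups(AutoScalingGroupNames=[ASG_NAME])["AutoScalingGroups"][0]

desired = group["DesiredCapacity"]
in_service = sum(1 for i in group["Instances"] if i["LifecycleState"] == "InService")

if desired and (desired - in_service) / desired > GAP_THRESHOLD:
    print(f"ALERT: {ASG_NAME} has {in_service}/{desired} instances in service")
```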
The spot instance market is going absolutely batshit too. GPU instances that used to be 70% off are now maybe 30% off, and they get yanked with almost no warning. I had a training job that got interrupted 8 times in one day - each time the only heads-up was the standard spot interruption notice, which AWS surfaces through instance metadata and EventBridge about two minutes before reclaiming the box. Two minutes is barely enough time to save a checkpoint before everything dies.
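The only real mitigation is to watch for that notice yourself and checkpoint the moment it lands. A sketch of the pattern - save_checkpoint() is a placeholder for whatever your training loop exposes, and the 5-second poll interval is arbitrary:

```python
# Sketch: poll EC2 instance metadata (IMDSv2) for a spot interruption notice,
# checkpoint when one appears. save_checkpoint() is a placeholder.
import time
import requests

IMDS = "http://169.254.169.254/latest"

def imds_token() -> str:
    # IMDSv2 requires a short-lived session token for metadata reads.
    return requests.put(
        f"{IMDS}/api/token",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "300"},
        timeout=2,
    ).text

def interruption_pending(token: str) -> bool:
    # /spot/instance-action returns 404 until AWS schedules a reclaim,
    # then a JSON body with the action and termination time.
    resp = requests.get(
        f"{IMDS}/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": token},
        timeout=2,
    )
    return resp.status_code == 200

def save_checkpoint():
    ...  # hook into your training loop here

while True:
    if interruption_pending(imds_token()):
        save_checkpoint()
        break
    time.sleep(5)
```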
Google Cloud's preemptible instances have similar issues. The advertised 24-hour runtime is a ceiling, not a promise, and lately I'm seeing terminations after 2-3 hours when demand peaks.
Microsoft is handling it slightly better with Azure Spot VMs (what used to be Low Priority VMs), if only because they're more transparent about the rationing. Their capacity dashboard actually shows when regions are under power constraints.
The Infrastructure Tax
Data center operators are installing massive diesel generator arrays as backup power. These weren't designed for grid support - they're just to keep individual facilities running when everything else goes dark.
The costs get passed through. I'm already seeing higher instance prices during peak hours, and I expect explicit power surcharges to show up on bills soon. Amazon's not going to eat these costs forever.
Oracle's OCI pricing now includes footnotes about "power availability zones" where GPU instances cost 20% more but have guaranteed uptime during grid stress events.
The weird part is that renewable energy credits are getting expensive too. Companies want to claim their AI training is "carbon neutral" so they're bidding up solar and wind credits. My electricity bill at home went up because of REC market demand.
Practical Impact
I'm redesigning our entire ML pipeline to handle random instance terminations. Previously I assumed spot instances might die; now I'm planning for entire availability zones going offline. I had to rewrite our training script to handle SIGTERM gracefully - the normal Kubernetes shutdown is SIGTERM, a grace period, then SIGKILL, but when a node gets yanked out from under the kubelet there's zero fucking warning at all. Spent two weeks debugging corrupted checkpoints before I realized the process was getting SIGKILL'd mid-write.
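In the training script itself, the fix boiled down to two things, sketched below: treat SIGTERM as "checkpoint now", and make the checkpoint write atomic so a kill mid-write leaves a stray temp file instead of a corrupt checkpoint. The torch.save payload and paths are placeholders:

```python
# Sketch: SIGTERM-aware, atomic checkpointing. The state dict contents and
# file paths are placeholders for whatever your training loop actually saves.
import os
import signal
import tempfile

import torch

stop_requested = False

def _handle_sigterm(signum, frame):
    # Only set a flag here; do the actual I/O in the training loop, not in the handler.
    global stop_requested
    stop_requested = True

signal.signal(signal.SIGTERM, _handle_sigterm)

def save_checkpoint_atomic(state: dict, path: str) -> None:
    # Write to a temp file in the same directory, fsync, then atomically rename.
    # A process killed mid-write leaves a .tmp file behind, never a truncated checkpoint.
    dirname = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=dirname, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            torch.save(state, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, path)  # atomic rename on POSIX filesystems
    except BaseException:
        os.unlink(tmp_path)
        raise

# In the training loop: checkpoint on the usual cadence, and immediately if
# stop_requested flips to True, then exit before the grace period runs out.
```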
The other half of the solution is cadence and placement: checkpointing every 10 minutes instead of every hour, and spreading training across 3+ regions even for medium-sized jobs.
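The 10-minute number isn't sacred - it falls out of a quick expected-waste calculation (essentially the Young-Daly checkpoint-interval argument). The interruption rate and checkpoint cost below are assumptions; plug in your own:

```python
# Rough expected-waste model: each random interruption costs, on average, half a
# checkpoint interval of lost work; each checkpoint costs some fixed overhead.
# Both rates below are assumptions, not measurements.
interruptions_per_day = 8        # assumed; roughly a worst-case day
checkpoint_cost_min = 0.5        # assumed time to serialize and upload one checkpoint

for interval_min in (60, 30, 10, 5):
    lost_work = interruptions_per_day * interval_min / 2
    overhead = (24 * 60 / interval_min) * checkpoint_cost_min
    print(f"{interval_min:>2} min interval: ~{lost_work + overhead:.0f} wasted min/day "
          f"({lost_work:.0f} rework + {overhead:.0f} checkpoint overhead)")
```

With those assumed numbers the sweet spot lands around 10-15 minutes, and an hourly cadence throws away more than twice as much work.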
Kubernetes autoscaling is getting weird too. The cluster tries to scale up during peak hours but can't get nodes, so jobs just queue. I'm shifting more workloads to run between midnight and 6am local time when power demand is lower.
The irony is that we're optimizing our code to use less GPU time, which is what we should have been doing all along. My team reduced training time by 40% just by profiling memory usage and fixing inefficient data loading.
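A profiling pass like the one below is enough to see data-loading stalls - the model and dataset are toy stand-ins, and the DataLoader knobs (workers, pinned memory, prefetch) are the usual suspects rather than our exact settings:

```python
# Sketch: use torch.profiler to spot data-loading stalls. Model and dataset are
# toy stand-ins; the DataLoader arguments are the knobs that usually matter.
import torch
from torch import nn
from torch.profiler import profile, ProfilerActivity
from torch.utils.data import DataLoader, TensorDataset

model = nn.Linear(1024, 10).cuda()
dataset = TensorDataset(torch.randn(10_000, 1024), torch.randint(0, 10, (10_000,)))

loader = DataLoader(
    dataset,
    batch_size=64,
    num_workers=8,            # too few workers starves the GPU
    pin_memory=True,          # faster host-to-device copies
    prefetch_factor=4,        # keep batches queued ahead of the GPU
    persistent_workers=True,  # don't respawn workers every epoch
)

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             profile_memory=True, record_shapes=True) as prof:
    for step, (x, y) in enumerate(loader):
        loss = nn.functional.cross_entropy(model(x.cuda(non_blocking=True)), y.cuda())
        loss.backward()
        if step == 20:
            break

# If most of the wall time shows up as CPU/dataloader work with the GPU idle,
# the fix is in the input pipeline, not the model.
print(prof.key_averages().table(sort_by="self_cuda_time_total", row_limit=15))
```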