The Colossus Project
xAI crammed 200,000 H100 GPUs into Memphis and called it Colossus. For perspective, that's more GPUs than AWS, Google, or Microsoft allocate to any single customer. While normal companies spend 18-36 months building data centers (because permits, environmental studies, and not wanting to fuck up a billion-dollar project), xAI got the first 100,000 of them running in 122 days, then doubled the cluster a few months later.
How did they pull this off? They cheated. Instead of building from scratch like proper data center companies, they grabbed a dead Electrolux factory from 2020. All the boring shit like structural engineering, power grid connections, and environmental permits were already sorted. They just had to gut it and stuff it with GPUs.
Memphis: Because They Need Stupid Amounts of Power
150MW continuous draw. That's not a typo - they're sucking down enough juice to power a small city. The Tennessee Valley Authority can actually deliver that much power without the grid shitting itself, which rules out pretty much everywhere else in America.
The power bill? They're not telling, but 150MW around the clock at TVA industrial rates works out to something like $6-11 million a month just for electricity. Plus Memphis gets hot as balls in summer, so they're burning even more money keeping 200k H100s from melting into expensive paperweights.
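Napkin math, with the rate per kWh as the big unknown (the figures below are assumptions, not xAI's actual TVA contract):

```python
# Rough monthly electricity bill for a 150 MW continuous draw.
# The $/kWh rates are assumed industrial prices, not xAI's real contract.
DRAW_MW = 150
HOURS_PER_MONTH = 24 * 30

kwh_per_month = DRAW_MW * 1_000 * HOURS_PER_MONTH  # MW -> kW, then kWh: 108,000,000 kWh

for rate in (0.06, 0.08, 0.10):  # assumed $/kWh
    print(f"${rate:.2f}/kWh -> ${kwh_per_month * rate / 1e6:.1f}M per month")
# -> roughly $6.5M at $0.06/kWh, $10.8M at $0.10/kWh
```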
Memphis also doesn't give a shit about environmental reviews like California does. Try building this in San Francisco - you'd still be waiting for permits in 2027.
Networking 200k GPUs: A Fucking Nightmare
3.6 Tbps per server. That's more bandwidth than most ISPs have for their entire customer base. xAI's specs aren't joking around - when you need to sync gradients across 200k GPUs, every microsecond of network latency tanks your training performance. The cluster runs on NVIDIA's Spectrum-X Ethernet fabric with a topology designed specifically for this scale.
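To see why, here's crude napkin math for a bandwidth-optimal ring all-reduce. The model size, gradient precision, and one-400GbE-link-per-GPU figure are illustrative assumptions, not Grok's actual setup:

```python
# Crude lower bound on the time to all-reduce one set of gradients over a ring.
# Model size and link speed are illustrative guesses, not xAI's real numbers.
PARAMS = 1e12            # assume a 1-trillion-parameter model
BYTES_PER_GRAD = 2       # bf16 gradients
LINK_GBPS = 400          # assume one 400 GbE link per GPU

grad_bytes = PARAMS * BYTES_PER_GRAD        # ~2 TB of gradients
link_bytes_per_s = LINK_GBPS / 8 * 1e9      # 400 Gb/s -> 50 GB/s

# A ring all-reduce pushes ~2x the data through each GPU's link,
# regardless of how many GPUs are in the ring.
seconds = 2 * grad_bytes / link_bytes_per_s
print(f"~{seconds:.0f} s per naive full gradient sync at line rate")  # ~80 s
```

Eighty-ish seconds per naive full sync is obviously unworkable, which is why these clusters shard gradients, overlap communication with compute, and obsess over topology - and why any slow link in the path shows up directly in step time.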
Off-the-shelf networking gear would curl up in a corner and die. They had to build a custom topology because Cisco's best switches become the bottleneck when you're trying to move model weights between 200k GPUs. HPC networking at this scale means a dedicated RDMA fabric - InfiniBand or RoCE-style Ethernet like Spectrum-X - not enterprise gear. One misconfigured switch and your billion-dollar training run sits there doing nothing.
And shit breaks constantly. At this scale, you expect hardware to drop out every single day - dozens of GPU failures plus a steady stream of dying network cards, overheating switches, and cables that get unplugged by accident. The software has to just deal with it and keep training around the failures.
Cooling 150MW: Don't Let the Magic Smoke Escape
Each H100 dumps 700W of heat when it's working hard. Multiply that by 200k and you're generating enough heat to warm a small town - around 140MW worth. That's on top of all the networking gear, storage, and other shit that also gets hot.
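The arithmetic, with the non-GPU overhead as a rough guess:

```python
# GPU heat load, plus a rough allowance for everything else in the racks.
GPU_COUNT = 200_000
GPU_WATTS = 700                       # H100 SXM TDP

gpu_mw = GPU_COUNT * GPU_WATTS / 1e6  # 140 MW from the GPUs alone
overhead = 1.10                       # assumed +10% for CPUs, NICs, storage, switches
total_mw = gpu_mw * overhead

print(f"GPUs: {gpu_mw:.0f} MW, with overhead: ~{total_mw:.0f} MW")
# Every watt going in is a watt of heat the cooling loop has to carry back out.
```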
Air conditioning? Forget about it. You need liquid cooling that can handle the thermal output of a city. The cooling bill is probably 30% of their total operational costs. And if the cooling fails for even a few minutes, you just turned several billion dollars' worth of GPUs into very expensive e-waste. Supermicro built the liquid-cooled racks specifically for this deployment.
Memphis humidity doesn't help either. Sure, it's better than Arizona, but you're still fighting physics when you need to dump that much waste heat into the atmosphere.
What All This Hardware Actually Gets You
Grok 3 used 10x more compute than Grok 2 to train. That's not just throwing money at the problem - when you can dedicate 200k GPUs to a single training run, you can try model architectures that everyone else can only dream about.
The "Think" mode burns compute like crazy - maybe 50x more per query than normal inference. Most companies would go bankrupt running that on rented cloud GPUs. But when you own the hardware, you can afford to be wasteful with compute to get smarter answers.
OpenAI reportedly trained GPT-4 on something like 25k GPUs. Meta trained Llama 3 on about 16k H100s, with production clusters in the 24k range. Google spreads its TPUs across a bunch of different projects. xAI can throw all 200k at one model and see what happens.
When Everything's Always Broken
At this scale, something dies every few minutes. GPUs overheat, network cards shit the bed, cooling pumps break, power supplies fry themselves. It's not "if" hardware fails, it's "how many things broke while I was getting coffee." The whole facility has to be designed around a continuous hardware replacement cycle.
99% per-GPU availability still means 2,000 GPUs are dead at any given moment. That's not a problem, that's Tuesday. Most data centers freak out when three servers fail. Here, losing a few hundred GPUs during a training run is just background noise.
The software has to be bulletproof. You can't pause a billion-dollar training run because GPU #47,293 decided to take a dirt nap. The training code expects failures and routes around them automatically.
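A toy sketch of that principle (nothing like xAI's actual stack): checkpoint on a fixed cadence, and when a step blows up because some GPU died, roll back to the last checkpoint instead of aborting the run.

```python
import random

# Toy model of failure-tolerant training. The failure probability and
# checkpoint cadence are assumptions, not anything xAI has published.
GPU_COUNT = 200_000
STEP_FAILURE_PROB = 1e-7      # assumed chance any single GPU faults on a given step
CHECKPOINT_EVERY = 100        # steps between checkpoints
TOTAL_STEPS = 10_000

def step_failed() -> bool:
    """True if at least one of the GPUs faulted during this step (~2% per step here)."""
    return random.random() < 1 - (1 - STEP_FAILURE_PROB) ** GPU_COUNT

step, last_checkpoint, redone = 0, 0, 0
while step < TOTAL_STEPS:
    if step_failed():
        redone += step - last_checkpoint   # work since the last checkpoint is lost
        step = last_checkpoint             # roll back and keep going
        continue
    step += 1
    if step % CHECKPOINT_EVERY == 0:
        last_checkpoint = step

print(f"finished {TOTAL_STEPS} steps, re-did ~{redone} of them after failures")
```

The interesting knob is CHECKPOINT_EVERY: checkpoint too rarely and every failure throws away hours of work; checkpoint too often and you spend the run writing state to disk instead of training.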
The Million GPU Pipe Dream
They want 1 million GPUs eventually. That's completely insane. 750MW of power - half a nuclear plant's worth. TVA would need to build new transmission lines just for them.
Cooling 750MW of waste heat? You'd need to divert a river. Memphis would become the GPU capital of America by accident, and the locals would probably hate them for it.
And the failures... if 200k GPUs means 2k are always broken, 1 million means 10k dead GPUs at any moment. You'd need an entire warehouse just for replacement parts and a small army of techs swapping hardware 24/7.
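Running the same napkin math at a million GPUs, with the availability and replacement rates as assumptions:

```python
# Same arithmetic at 1,000,000 GPUs. Availability and swap-rate figures
# are assumptions, not anything xAI has published.
GPU_COUNT = 1_000_000
GPU_WATTS = 700
AVAILABILITY = 0.99              # assume 99% per-GPU availability
ANNUAL_HARD_FAILURE_RATE = 0.05  # assume 5% of GPUs need physical replacement per year

gpu_mw = GPU_COUNT * GPU_WATTS / 1e6
dead_now = GPU_COUNT * (1 - AVAILABILITY)
swaps_per_day = GPU_COUNT * ANNUAL_HARD_FAILURE_RATE / 365

print(f"{gpu_mw:.0f} MW of GPU draw before cooling and networking")  # 700 MW
print(f"~{dead_now:,.0f} GPUs down at any moment")                   # ~10,000
print(f"~{swaps_per_day:.0f} physical GPU swaps per day")            # ~137
```

A hundred-plus physical swaps a day is a permanent hardware-replacement operation, not an occasional maintenance ticket.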
This is either visionary or completely fucking crazy. Probably both.