Got a Slack message from our CFO at 7 AM on a Monday. Our AWS bill went from the usual 5K to 47 grand over the weekend. Took me three hours of digging through CloudTrail logs to figure out what the hell happened.
Someone left a hyperparameter search running on Friday night. The job spawned GPU instances until it hit our service limits, then kept trying. By Sunday morning we had hundreds of p3.16xlarge instances running the same useless grid search. Each one costs about $24/hour and they ran for maybe 60 hours straight.
The worst part? This was supposed to be a "quick experiment" to tune learning rates. Instead of spot instances, it used on-demand. Instead of one region, it somehow spread across three. Our quarterly ML budget got burned in 48 hours testing different values of 0.001 vs 0.0001.
Why Platform Pricing is Designed to Screw You
These platforms make money when you waste money. Their pricing calculators show best-case scenarios with perfect utilization and no mistakes. Reality is messier.
Training Jobs Cost More Than You Think
SageMaker's pricing examples quote training at "$31/hour per instance." Sounds reasonable until you realize that's the price of a single instance. Your actual job runs several instances in parallel, and the bill multiplies with each one.
We tried training a computer vision model last month. The job needed 8 GPU instances running together for distributed training. Each ml.p3.16xlarge instance costs around $24/hour. So our "simple training job" was actually burning $192/hour, not the $31 from their pricing page.
The job ran for 18 hours. Quick math: 18 × $192 = $3,456 for one model training run. We ran maybe 15 experiments that month trying different architectures. There goes $50K.
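Now I make people run that math in code before launching anything big. Nothing fancy - a rough sketch like this, where the hourly rate is a placeholder you'd swap for whatever the pricing page says today:

```python
# Back-of-the-envelope cost for a distributed training run.
# The hourly rate is a placeholder - pull the real number from the pricing page.
P3_16XLARGE_HOURLY = 24.00  # the ~$24/hour figure used in this post

def training_run_cost(hourly_rate: float, instance_count: int, hours: float) -> float:
    """instances x rate x hours - the math that gets skipped at 6 PM on a Friday."""
    return hourly_rate * instance_count * hours

one_run = training_run_cost(P3_16XLARGE_HOURLY, instance_count=8, hours=18)
print(f"one run: ${one_run:,.0f}")              # $3,456
print(f"15 experiments: ${15 * one_run:,.0f}")  # $51,840
```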
Storage Costs Sneak Up Fast
S3 storage looks cheap at about 2 cents per GB per month. The dataset starts small, maybe a few TB. Then you add more training data, keep model artifacts, store checkpoints from failed runs. Six months later you're storing 80TB and wondering why your S3 bill is $2K per month.
But storage isn't the real killer - it's moving data around. We had training data in us-west-2 but kept spinning up instances in us-east-1 because they were cheaper. Every training job pulled 50GB from the wrong region at 9 cents per GB. Do that three times a day for a month and you've burned roughly $400 on data transfer fees alone.
Databricks and Their Made-Up Currency
Databricks doesn't bill in dollars or hours like normal people. They invented "Databricks Units" - fake money that makes it impossible to predict your bill. Different workload types are billed at different rates per DBU:
- Interactive notebooks: around 55 cents per DBU
- Scheduled jobs: 30 cents per DBU
- SQL queries: 70 cents per DBU
Sounds cheap until you learn that a notebook cluster burns 4-6 DBUs per hour depending on its size. So your "55 cent" notebook actually costs $2.20-$3.30 per hour - and that's only the Databricks charge; the underlying cloud VMs get billed separately on top. Got 10 data scientists running notebooks all day? That's $22-$33 per hour, or $220-$330 over a 10-hour day, just for people doing exploratory work.
I still don't understand why they couldn't just use normal pricing. The DBU conversion feels like they're trying to hide something.
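For what it's worth, here's the conversion written out so you can sanity-check your own setup. The rate and the 4-6 DBUs/hour burn are the rough figures from above, not official numbers:

```python
# DBUs to dollars: rate per DBU x DBUs burned per hour x number of clusters.
# 0.55 and 4-6 DBUs/hour are the rough figures from this post, not official rates.
def dollars_per_hour(dbu_rate: float, dbus_per_hour: float, clusters: int = 1) -> float:
    return dbu_rate * dbus_per_hour * clusters

low = dollars_per_hour(0.55, 4, clusters=10)   # 10 notebooks on small clusters
high = dollars_per_hour(0.55, 6, clusters=10)  # 10 notebooks on bigger clusters
print(f"${low:.0f}-${high:.0f} per hour, ${low * 10:.0f}-${high * 10:.0f} over a 10-hour day")
```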
How Each Platform Screws You
AWS SageMaker - Nickel and Dimed
SageMaker bills by the hour with minimum charges. Start a training job that takes 10 minutes? You pay for the full hour. Their big GPU instances cost $25-30/hour which adds up fast when you're experimenting.
The real trap is SageMaker Autopilot. It's supposed to automate machine learning by trying different algorithms and hyperparameters. Sounds great until it spins up 200+ training jobs to test every possible combination. Each job runs on its own instance for an hour minimum. You wanted automation, you got a $15K bill.
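If you're going to use Autopilot anyway, at least cap it. Something like this with boto3 - the bucket, role, and limits are placeholders, and it's a sketch rather than a drop-in job definition, so check the CreateAutoMLJob docs for the exact config your SDK version expects:

```python
import boto3

# Cap how far an Autopilot job is allowed to explore before you launch it.
# Bucket names, role ARN, and the specific limits are placeholders.
sm = boto3.client("sagemaker")

sm.create_auto_ml_job(
    AutoMLJobName="churn-autopilot-capped",
    InputDataConfig=[{
        "DataSource": {"S3DataSource": {"S3DataType": "S3Prefix",
                                        "S3Uri": "s3://my-bucket/train/"}},
        "TargetAttributeName": "label",
    }],
    OutputDataConfig={"S3OutputPath": "s3://my-bucket/autopilot-output/"},
    RoleArn="arn:aws:iam::123456789012:role/SageMakerRole",
    AutoMLJobConfig={
        "CompletionCriteria": {
            "MaxCandidates": 20,                       # not 200+
            "MaxRuntimePerTrainingJobInSeconds": 3600,
            "MaxAutoMLJobRuntimeInSeconds": 6 * 3600,  # hard stop on the whole job
        }
    },
)
```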
Databricks - DBU Black Hole
Databricks clusters burn DBUs even when nobody's using them. Leave a notebook open over lunch? Still paying. Auto-scaling takes forever to scale down, so you pay for unused capacity.
I left auto-scaling enabled on a cluster once and came back Monday to find it had consumed 2,000 DBUs over the weekend. At 55 cents per DBU, that's $1,100 for doing absolutely nothing. Now I manually terminate everything on Friday.
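What would have saved me that $1,100 is auto-termination on the cluster itself. Roughly this, via the Clusters REST API - the host, token, runtime, and node type are placeholders, so treat it as a sketch and check the field names against your workspace's API version:

```python
import requests

# Create a cluster that kills itself after an hour of inactivity.
# Host, token, runtime, and node type are placeholders.
DATABRICKS_HOST = "https://my-workspace.cloud.databricks.com"
TOKEN = "dapi..."  # personal access token

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "cluster_name": "exploration",
        "spark_version": "14.3.x-scala2.12",
        "node_type_id": "i3.xlarge",
        "autoscale": {"min_workers": 1, "max_workers": 4},
        "autotermination_minutes": 60,  # idle for an hour -> terminated, no weekend burn
    },
)
resp.raise_for_status()
print(resp.json()["cluster_id"])
```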
Azure ML - The Microsoft Tax
Azure ML is basically regular Azure VMs with a 15% markup for the ML service layer. Their GPU instances cost more than AWS equivalents, and everything integrates with Office 365 whether you want it to or not.
They pitch it as "enterprise-ready" but it's just more expensive. The only advantage is if your company is already locked into Microsoft everything.
Google Vertex AI - Cheap Until You Leave
Google has the lowest compute prices, usually 10-15% cheaper than AWS. The catch is getting your data out. Data transfer costs 12 cents per GB compared to AWS's 9 cents. If you ever want to switch providers, moving 100TB costs $12,000 in transfer fees alone.
Their enterprise features also suck. Missing basic stuff like proper RBAC and audit logging. You end up needing additional Google Cloud services to fill the gaps.
The Stuff That Sneaks Up On You
Data Transfer Costs
Moving data between regions costs around $0.09 per GB. Doesn't sound like much until you accidentally configure training to pull from the wrong region. We had 50GB datasets getting pulled from us-west-2 to us-east-1 three times per day. That's $13.50 per day in transfer costs, or about $400/month just moving files around.
One team configured their training pipeline wrong and spent $8K in transfer fees before anyone noticed. The training data was in one region but the compute kept spinning up in another.
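Cheapest guard I've found: refuse to launch anything if the data bucket and the compute are in different regions. A small boto3 sketch, with the bucket name as a placeholder:

```python
import boto3

def assert_same_region(bucket: str, compute_region: str) -> None:
    """Refuse to launch training if the data bucket lives in another region."""
    s3 = boto3.client("s3")
    loc = s3.get_bucket_location(Bucket=bucket)["LocationConstraint"]
    bucket_region = loc or "us-east-1"  # the API returns None for us-east-1
    if bucket_region != compute_region:
        raise RuntimeError(
            f"{bucket} is in {bucket_region} but compute is in {compute_region}: "
            "every job will pay cross-region transfer on the dataset."
        )

# Bucket name is a placeholder; compute region comes from the current session.
assert_same_region("my-training-data", boto3.session.Session().region_name)
```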
Reserved Instance Trap
Reserved instances save 50-60% if you commit to 1-3 years. Great in theory. Problem is ML workloads change faster than your commitments. We reserved a bunch of ml.p3.2xlarge instances for training, then realized we needed ml.g4dn instances for inference.
Now we're stuck paying for reserved capacity we don't use while buying on-demand instances at full price for what we actually need. Brilliant.
Logging Costs Add Up
CloudWatch charges 50 cents per GB for log ingestion. ML models generate tons of logs - every prediction, feature transformation, error message. Our computer vision model logs about 500GB per month in prediction data. That's $250/month just for logs, not including storage or analysis.
What Actually Prevents Disasters
After blowing multiple budgets, here's what works:
Set Spending Alerts That Actually Work
AWS budget alerts are useless by default. They email you after you've already burned money. Set up alerts at 50%, 75%, and 90% of your monthly budget. Better yet, use AWS service limits to cap the number of instances you can spin up.
We learned this the hard way. Now we limit GPU instances to 20 per region and get Slack notifications when anyone requests more.
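Here's roughly what that looks like in boto3. The account ID, budget amount, and the SNS topic (ours feeds Slack) are placeholders - swap in your own numbers:

```python
import boto3

# Monthly cost budget with alerts at 50/75/90% of the limit.
# Account ID, amount, and SNS topic ARN are placeholders.
budgets = boto3.client("budgets")
ACCOUNT_ID = "123456789012"
ALERT_TOPIC = "arn:aws:sns:us-east-1:123456789012:ml-cost-alerts"  # feeds Slack

budgets.create_budget(
    AccountId=ACCOUNT_ID,
    Budget={
        "BudgetName": "ml-platform-monthly",
        "BudgetLimit": {"Amount": "20000", "Unit": "USD"},
        "TimeUnit": "MONTHLY",
        "BudgetType": "COST",
    },
    NotificationsWithSubscribers=[
        {
            "Notification": {
                "NotificationType": "ACTUAL",
                "ComparisonOperator": "GREATER_THAN",
                "Threshold": pct,  # percent of the budget
                "ThresholdType": "PERCENTAGE",
            },
            "Subscribers": [{"SubscriptionType": "SNS", "Address": ALERT_TOPIC}],
        }
        for pct in (50, 75, 90)
    ],
)
```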
Use Spot Instances for Training
Spot instances are 70-90% cheaper than on-demand. Perfect for training jobs that can be interrupted. A p3.16xlarge instance costs $24/hour on-demand but only $2.50/hour on spot.
The catch is they can disappear anytime. Enable checkpointing so jobs resume when spot capacity returns. Takes some setup but saves tons of money.
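With the SageMaker Python SDK it's a handful of arguments on the estimator. This is a sketch - the image, role, and S3 paths are placeholders, and your training script still has to write and restore checkpoints under /opt/ml/checkpoints for resumes to actually work:

```python
from sagemaker.estimator import Estimator

# Spot training with checkpointing via the SageMaker Python SDK.
# Image URI, role, and S3 paths are placeholders.
estimator = Estimator(
    image_uri="123456789012.dkr.ecr.us-west-2.amazonaws.com/my-training:latest",
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    instance_count=8,
    instance_type="ml.p3.16xlarge",
    use_spot_instances=True,   # spare capacity instead of on-demand
    max_run=18 * 3600,         # cap on billable training seconds
    max_wait=24 * 3600,        # how long to wait for spot capacity (must be >= max_run)
    checkpoint_s3_uri="s3://my-bucket/checkpoints/run-42/",  # synced from /opt/ml/checkpoints
    output_path="s3://my-bucket/models/",
)
estimator.fit({"train": "s3://my-bucket/train/"})
```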
Kill Everything on Weekends
Set up automated shutdowns for Friday evening. Nothing runs Saturday-Sunday except production services. The number of "quick experiments" left running over weekends is insane.
I wrote a Lambda function that terminates all SageMaker instances and Databricks clusters every Friday at 6 PM. It's saved us thousands.
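Mine is more or less the sketch below, minus error handling. It stops notebook instances and in-progress training jobs and terminates Databricks clusters; inference endpoints are left alone so production keeps running. Wire it to an EventBridge cron rule for Friday 6 PM; the Databricks host and token come in through environment variables:

```python
import os
import boto3
import requests  # not in the default Lambda runtime - bundle it or use a layer

# Friday 6 PM cleanup. First page of results only - add pagination if you run a lot.
def handler(event, context):
    sm = boto3.client("sagemaker")

    # Stop running notebook instances.
    for nb in sm.list_notebook_instances(StatusEquals="InService")["NotebookInstances"]:
        sm.stop_notebook_instance(NotebookInstanceName=nb["NotebookInstanceName"])

    # Stop in-progress training jobs.
    for job in sm.list_training_jobs(StatusEquals="InProgress")["TrainingJobSummaries"]:
        sm.stop_training_job(TrainingJobName=job["TrainingJobName"])

    # Terminate running Databricks clusters (endpoints and SQL warehouses untouched).
    host = os.environ["DATABRICKS_HOST"]
    headers = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}
    clusters = requests.get(f"{host}/api/2.0/clusters/list", headers=headers).json()
    for cluster in clusters.get("clusters", []):
        if cluster["state"] in ("RUNNING", "PENDING"):
            requests.post(f"{host}/api/2.0/clusters/delete",
                          headers=headers, json={"cluster_id": cluster["cluster_id"]})
```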
Monitor Who's Burning Money
Databricks has billing tables you can query to see which users are consuming the most DBUs. We run this weekly and send reports to team leads. Amazing how usage drops when people know they're being watched.
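The weekly report is basically one query against the system billing table. Here's a sketch with the Databricks SQL connector - the connection details are placeholders, and it's worth double-checking the system.billing.usage column names against your workspace since that schema has shifted over time:

```python
from databricks import sql  # pip install databricks-sql-connector

# Last 7 days of DBU consumption by user from the system billing table.
# Connection details are placeholders; verify the column names in your workspace.
QUERY = """
    SELECT identity_metadata.run_as AS user,
           SUM(usage_quantity)      AS dbus
    FROM system.billing.usage
    WHERE usage_date >= date_sub(current_date(), 7)
    GROUP BY 1
    ORDER BY dbus DESC
    LIMIT 20
"""

with sql.connect(server_hostname="my-workspace.cloud.databricks.com",
                 http_path="/sql/1.0/warehouses/abc123",
                 access_token="dapi...") as conn:
    with conn.cursor() as cursor:
        cursor.execute(QUERY)
        for user, dbus in cursor.fetchall():
            print(f"{user}: {dbus:,.0f} DBUs")
```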
For AWS, enable detailed billing reports and tag everything with project codes. At least then you can figure out which team caused the budget explosion.
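Once the tags exist (and are activated as cost allocation tags in the billing console), Cost Explorer will split the bill by project. A boto3 sketch - the tag key "project" is just our convention:

```python
import boto3
from datetime import date, timedelta

# Last 30 days of cost grouped by the "project" cost allocation tag.
# The tag key is our convention - use whatever you actually tag resources with.
ce = boto3.client("ce")
end = date.today()
start = end - timedelta(days=30)

resp = ce.get_cost_and_usage(
    TimePeriod={"Start": start.isoformat(), "End": end.isoformat()},
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "TAG", "Key": "project"}],
)

for period in resp["ResultsByTime"]:
    for group in period["Groups"]:
        tag = group["Keys"][0]  # e.g. "project$vision-team"
        cost = float(group["Metrics"]["UnblendedCost"]["Amount"])
        print(f"{tag}: ${cost:,.2f}")
```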
These platforms make money when you waste resources. They won't help you optimize costs. Set up your own guardrails or watch your budget disappear.