Three months after signing our Datadog contract, I got a call from finance asking why our "infrastructure monitoring" was costing more than our actual infrastructure. Turns out nobody told us that their pricing page is basically fiction once you start using the tool for real work.
The Data Ingestion Scam
Here's how they get you: every tool advertises some "generous" free tier. New Relic gives you 100GB of free ingest a month! Sounds amazing until your Rails app with decent logging burns through that in two days. One debug logging session we forgot to turn off generated 300GB in six hours. At the roughly $0.40/GB overage rate, that's a $120 mistake for something that should be free.
Datadog is worse. They start you at $0.10 per GB for logs but conveniently don't mention that APM traces, custom metrics, and those pretty database performance graphs all count as separate data streams with their own pricing. Our "simple" Node.js app was eating up data like crazy:
- Logs: about 80 gigs a month
- APM traces: another 120 gigs or so
- Custom metrics: maybe 45 gigs
- Infrastructure metrics we didn't even know about: 200 fucking gigs
That's roughly 445GB a month, and with every stream billed at its own rate it worked out to $445/month in data costs for ONE APPLICATION. Scale that across 15 services and you're looking at $6,000+ monthly just for the privilege of seeing your data.
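The math is easy to sanity-check yourself. Here's a back-of-the-envelope sketch: the volumes are ours from the list above, the $0.10/GB log rate is from our contract, and the other per-stream rates are hypothetical placeholders picked to land near our actual bill.

```python
def monthly_ingest_cost(volumes_gb: dict, rates_per_gb: dict) -> float:
    """Sum ingest cost across streams; each stream bills at its own rate."""
    return sum(gb * rates_per_gb.get(stream, 0.0)
               for stream, gb in volumes_gb.items())

# Observed volumes (GB/month) for one Node.js app, from the list above.
volumes = {"logs": 80, "apm_traces": 120,
           "custom_metrics": 45, "infra_metrics": 200}

# Only the log rate is contractual; the rest are illustrative guesses.
rates = {"logs": 0.10, "apm_traces": 1.80,
         "custom_metrics": 1.00, "infra_metrics": 0.90}

one_app = monthly_ingest_cost(volumes, rates)   # ~$449/month
fleet = one_app * 15                            # 15 services: $6,700+
```

Notice that logs, the only rate anyone quotes you up front, are the cheapest line item by an order of magnitude.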
The Professional Services Trap
Remember that $15/host/month pricing? That's if you want to monitor ping responses. Actually useful monitoring requires their "professional services" team to set up dashboards that don't suck. Dynatrace won't even talk to you about custom integrations unless you drop $25,000 upfront for their Professional Services.
We spent $40,000 on Datadog professional services to migrate from Nagios. Six months later, half the dashboards broke when they "upgraded" their API. The fix? Another $15,000 consulting engagement to rebuild what we already paid for.
Training Costs (Or: Learning Their Weird Query Language)
Every monitoring tool invented its own query language because apparently SQL wasn't hipster enough. Datadog has its monitor query syntax, New Relic has NRQL, Splunk has SPL. Want to write alerts that don't fire every five minutes? Time to send your engineers to $3,000 training courses.
I spent two weeks learning Datadog's query syntax just to write a simple alert for database connection pool exhaustion. The final query looked like this:

```
avg(last_5m):avg:postgresql.connections.active{environment:production} by {host} > 80
```
That's it. That simple alert cost us $6,000 in training time and consulting to get right because their documentation is garbage.
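Once you do have the query, you can at least stop clicking through their UI. Here's a hedged sketch of pushing that exact alert through Datadog's v1 Monitors API; the endpoint and payload shape are Datadog's, but the monitor name, message, and notification handle are placeholders for your own.

```python
import json
import os
import urllib.request

# The exact query from above, wrapped into a Datadog "metric alert" monitor.
QUERY = ("avg(last_5m):avg:postgresql.connections.active"
         "{environment:production} by {host} > 80")

def build_monitor_payload(query: str, threshold: float = 80.0) -> dict:
    """Build the JSON body for POST /api/v1/monitor."""
    return {
        "name": "Postgres connection pool nearing exhaustion",  # placeholder
        "type": "metric alert",
        "query": query,
        # "@slack-oncall" is a placeholder notification handle.
        "message": "Connection pool is running hot on {{host.name}} @slack-oncall",
        "options": {"thresholds": {"critical": threshold}},
    }

def create_monitor(payload: dict) -> None:
    """Send the monitor to Datadog. Needs DD_API_KEY / DD_APP_KEY set."""
    req = urllib.request.Request(
        "https://api.datadoghq.com/api/v1/monitor",
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "DD-API-KEY": os.environ["DD_API_KEY"],
            "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"],
        },
        method="POST",
    )
    urllib.request.urlopen(req)
```

Keeping monitors in code like this at least means the next API "upgrade" breaks something you can diff and redeploy, instead of a dashboard you paid consultants to click together.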
The Version Upgrade Nightmare
Monitoring tools love to "improve" their pricing models. Datadog switched from host-based to "container monitoring units" in 2019. Suddenly our Kubernetes cluster counted as 200 monitoring units instead of 20 hosts. Overnight cost increase: 300%.
New Relic pulled the same shit when they moved to "New Relic One" pricing. Our renewal quote was 5x higher because they decided every Lambda function counts as a separate "entity."
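The mechanics of these repricings are worth spelling out, because the unit change alone does the damage. A sketch using our Datadog numbers from above; the $6/unit rate is a hypothetical I picked to reproduce the 300% jump, not a published price.

```python
def monthly_bill(units: int, rate_per_unit: float) -> float:
    """Billing is always just units times rate; vendors change the units."""
    return units * rate_per_unit

# Before: our cluster billed as 20 hosts at the advertised $15/host.
before = monthly_bill(20, 15.00)    # $300/month

# After: the same cluster counts as 200 "container monitoring units".
# $6/unit is a hypothetical rate chosen for illustration.
after = monthly_bill(200, 6.00)     # $1,200/month

increase = (after - before) / before * 100    # 300.0% overnight
```

Nothing about the cluster changed. Only the denominator did.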
Infrastructure Overhead Nobody Talks About
Think self-hosting your monitoring makes all of this free? Our Prometheus setup requires:
- 3 dedicated servers ($600/month on AWS)
- 2TB of SSD storage ($400/month)
- A full-time engineer maintaining Grafana dashboards ($8,000/month)
- Disaster recovery setup because when monitoring breaks, everything breaks
That "free" Prometheus setup costs us $9,000/month to run properly. Sometimes the commercial solution is actually cheaper, which is terrifying.
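A quick way to see when self-hosting stops making sense, using our fixed costs above against the advertised $15/host SaaS rate. As covered earlier, that advertised rate understates what you actually pay, so treat the break-even as an upper bound.

```python
# Our fixed self-hosted costs from the list above (per month).
SELF_HOSTED = 600 + 400 + 8000      # servers + storage + engineer = $9,000

def saas_monthly(hosts: int, rate_per_host: float = 15.00) -> float:
    """SaaS cost scales with host count; self-hosting is mostly fixed."""
    return hosts * rate_per_host

# Host count at which the advertised SaaS price catches our fixed cost:
breakeven = SELF_HOSTED / 15.00     # 600 hosts
```

Below ~600 hosts the vendor's sticker price beats our Prometheus bill; factor in the real per-GB and per-unit charges from earlier and the crossover drops fast, but the point stands: "free" software has a very non-free floor.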