The comparison table above gives you the numbers, but numbers don't tell the whole story. Let me walk you through what it's actually like to migrate off OpenTelemetry - the good, bad, and ugly parts that never make it into the sales demos.
Look, I really wanted OpenTelemetry to work. Vendor-neutral observability sounds amazing on paper, and I hate being locked into expensive tools as much as anyone. But after two years of fighting with collectors and YAML files that make no fucking sense, I'm done.
What Actually Made Me Switch
Last month our collector crashed three times during business hours. Not because of application load - because of some memory leak that happens when you configure tail sampling wrong. Issue #9590, took them 6 months to acknowledge it was even a real problem. Anyway, I burned 6 hours on a Saturday reading GitHub issues trying to figure out why our "simple" setup was eating 8GB of RAM and dumping core files everywhere.
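For context, tail sampling is the feature where this bites you: the collector holds every in-flight trace in memory until the decision window closes, so a long wait plus a high trace cap is exactly where the gigabytes go. Here's a stripped-down sketch of the kind of config involved - not our actual setup, and the numbers are purely illustrative:

    processors:
      tail_sampling:
        # every trace gets buffered in memory for this long before a keep/drop decision
        decision_wait: 30s
        # max traces held in memory at once - this cap times your average trace size
        # is roughly the RAM you're signing up for
        num_traces: 200000
        policies:
          - name: keep-errors
            type: status_code
            status_code:
              status_codes: [ERROR]
          - name: keep-slow-requests
            type: latency
            latency:
              threshold_ms: 2000
          - name: sample-the-rest
            type: probabilistic
            probabilistic:
              sampling_percentage: 10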
Then our new junior dev asked me to explain our monitoring setup. I realized I was giving him a 45-minute lecture about processors, exporters, and receivers just so he could add one fucking metric to his service. That's when I knew we'd completely lost the plot.
Our collector config file ended up being like 237 lines just for basic functionality. Half of it was shit I copy-pasted from Stack Overflow and barely understood. Then collector v0.91.0 came out and broke our trace sampling. Spent two days figuring out that the probabilistic_sampler
processor changed its default behavior with zero fucking mention in the release notes. Error logs just showed "failed to process batch" - super helpful.
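If you're stuck on the collector for now, the one cheap lesson from that episode is to pin every knob you actually care about instead of trusting defaults. A minimal sketch - the percentage and endpoint are illustrative, not a recommendation:

    receivers:
      otlp:
        protocols:
          grpc: {}

    processors:
      probabilistic_sampler:
        # pin this explicitly so an upstream default change can't
        # silently alter how much you sample
        sampling_percentage: 25
      batch: {}

    exporters:
      otlp:
        endpoint: "your-backend:4317"   # placeholder

    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [probabilistic_sampler, batch]
          exporters: [otlp]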
More recent collector updates keep breaking our memory management in ways that make no sense - queries that used to take 200ms suddenly take 30+ seconds for no reason I can figure out. I think it's related to memory limiter changes, but honestly I'm just guessing at this point because the error messages are useless.
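If you do keep running a collector, at least spell the memory limiter out instead of inheriting whatever this week's defaults are. Sketch only - the numbers are placeholders you'd size for your own boxes:

    processors:
      memory_limiter:
        check_interval: 1s     # how often the limiter checks memory usage
        limit_mib: 1500        # hard limit on collector memory
        spike_limit_mib: 300   # data starts getting refused at limit_mib minus spike_limit_mib
      # memory_limiter should sit first in every pipeline's processor list,
      # before batch and anything expensive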
How I've Actually Migrated Teams Off OpenTelemetry
First thing I tried: Just swap the backend
Keep all your OpenTelemetry instrumentation, but send the data somewhere else instead of your collector setup. SigNoz and Uptrace can just eat OTLP data directly, so you don't have to change any application code.
This worked great for one team because their problem was specifically the collector crashing, not the SDKs. Took about a week to set up SigNoz and point all their services at it instead of their local collector.
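Mechanically it's a small change if your services already export OTLP: you repoint the standard SDK environment variables at the new backend and retire your collector later. Something like this - a docker-compose style sketch where the service name and hostnames are placeholders for whatever your setup uses (self-hosted SigNoz accepts OTLP on 4317 through its own bundled collector):

    services:
      checkout-service:
        environment:
          # was: http://our-otel-collector:4317
          OTEL_EXPORTER_OTLP_ENDPOINT: "http://signoz-otel-collector:4317"
          OTEL_EXPORTER_OTLP_PROTOCOL: "grpc"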
What usually works: Switch one service at a time
Pick your most annoying service - the one where you're always debugging why traces are missing or whatever. Replace the OpenTelemetry stuff with Datadog's agent or New Relic's agent or whatever you're trying.
Run both in parallel for a few weeks to make sure you're not losing important data. This is boring but it works. Most agents just auto-instrument everything without you having to configure processors and exporters and all that crap.
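If you'd rather not double-instrument everything during the overlap, there's another way to get the same parallel period: keep your existing collector around temporarily and have it export the same pipeline to both backends. Rough sketch using the Datadog exporter from collector-contrib - the endpoint is a placeholder, the site depends on your account, and the env-var expansion syntax depends on your collector version:

    # receivers/processors sections omitted - same as whatever you run today
    exporters:
      otlp:                        # your existing backend, unchanged
        endpoint: "jaeger:4317"    # placeholder
        tls:
          insecure: true
      datadog:
        api:
          key: ${env:DD_API_KEY}
          site: datadoghq.com

    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [batch]
          exporters: [otlp, datadog]   # same traces go to both while you compare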
The nuclear option: Burn it all down
Sometimes OpenTelemetry is so fucked that you just need to start over. I had one team where the collector was using 12GB of RAM and nobody understood why. We ripped out everything and installed Dynatrace OneAgent.
Yeah, it's dramatic, but when your monitoring is actively hurting your production environment, just fix it. Life's too short to debug YAML files on weekends.
The Real Cost of "Free" OpenTelemetry
Everyone focuses on the monthly cost of alternatives, but OpenTelemetry isn't actually free. You're paying in engineer time and weekend debugging sessions.
I tracked our time for a month - we were burning 9.5 hours per week just keeping OpenTelemetry running, probably more if you count the weekend debugging sessions and the random "why is this trace missing?" bullshit. That works out to roughly 40 hours a month - a full working week of one engineer going to infrastructure instead of features, call it 20-25% of their time. Even if SigNoz or Datadog costs $1,500/month, that's way cheaper than paying me to babysit collectors every fucking weekend.
SigNoz still needs some maintenance - you have to update it, scale it when you grow, deal with the occasional Docker issue. But it's like 2 hours a month instead of the 8-10 hours a week we were burning before.
With Datadog, we basically never think about the observability infrastructure. Install the agent, it just works. Costs more but honestly, sleeping through weekends is worth it.
How Long This Actually Takes
Sales demos make it look like you can migrate in an afternoon. Yeah, right.
Week 1: Test it on your laptop. Everything looks great, you're convinced this will be easy.
Weeks 2-3: First production service. Surprise! Your setup has some weird edge case that doesn't work with the new tool. Spent a week figuring out why half our traces were missing - turns out our internal service mesh was rewriting headers and breaking the trace context. Error message: "trace not found." Real helpful. (More on the header mess right after this timeline.)
Weeks 4-7: Each service you migrate has its own special problems. The one with custom spans breaks differently than the one with high cardinality metrics. You start questioning all your life choices and wondering why you didn't just become a product manager.
Weeks 8-12: Rebuilding dashboards. This is the absolute worst part. Every alert, every graph, every custom query has to be recreated from scratch. You can't just import this shit, and you realize you don't remember what half of your old dashboards were even for.
Weeks 13-16: Getting everyone trained on the new UI and query language. People keep going back to the old system because they actually know how to use it, and you're stuck being the "monitoring guy" who has to fix everyone's broken queries.
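About those missing traces in weeks 2-3: the first thing worth checking is which trace-context headers every hop actually expects and forwards - W3C traceparent, B3, vendor formats. The OTel SDKs let you pin that explicitly instead of relying on defaults. A sketch of the env-var approach (the combination below is just a common one; pick whatever your mesh and your new backend both understand):

    # environment for an instrumented service (deployment/compose style sketch)
    environment:
      # emit and accept W3C trace context plus B3 multi-header, so traces
      # survive hops that only forward one of the formats
      OTEL_PROPAGATORS: "tracecontext,baggage,b3multi"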
Just plan for 4-5 months even if you think it'll be quick. I've literally never seen a monitoring migration finish on time. Not once.
What Actually Breaks During Migration
Your historical data is basically gone - I've never successfully imported all our OpenTelemetry traces into another system. You can export some stuff, but realistically you're losing detailed history. Plan for this.
Every dashboard and alert - This is the part that sucks the most. That custom dashboard you spent hours perfecting? You get to rebuild it from scratch. PromQL queries don't magically become Datadog queries.
Your internal tooling - If you built any scripts or tools that read OpenTelemetry data directly, those are broken now. We had like 5 different internal scripts that assumed Jaeger trace format.
Everyone's muscle memory - Your team knows where to click in Grafana and how to write PromQL queries. New system means everyone's back to googling "how do I filter traces by status code" again.
What Actually Works Instead
Based on migrations I've done or watched other teams do:
If you just want it to work: Datadog - Yeah it's expensive, but the agent installs in one line and just works. Auto-discovers everything, dashboards are decent out of the box. We went from spending 8 hours/week on monitoring to maybe 30 minutes. Worth every penny.
If you're on a budget: SigNoz - Open source, can eat your existing OpenTelemetry data, way less complex than the collector setup. You'll still need to maintain it yourself but it's manageable. Their cloud offering is pretty cheap too.
If you're enterprise and have money: Dynatrace - The AI stuff actually works and automatically figures out what's wrong with your app. Expensive as hell but if you're a big company, the automation is legit.
If you're already using Prometheus: Grafana Cloud - Managed Prometheus + Tempo + Loki. Familiar interface, reasonable pricing, handles the operational stuff for you.
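For the Prometheus route specifically, the metrics half of the move is usually just pointing remote_write at Grafana Cloud. A sketch - the URL, username, and token below are placeholders, and the real values come from your Grafana Cloud stack settings:

    # prometheus.yml fragment
    remote_write:
      - url: "https://prometheus-prod-01-example.grafana.net/api/prom/push"
        basic_auth:
          username: "123456"                    # Grafana Cloud stack / instance ID
          password: "<grafana-cloud-api-token>"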
About Vendor Lock-In
Yeah, you're gonna get locked in somehow. OpenTelemetry promises vendor neutrality, but you're still locked into their complexity, their configuration format, their way of doing things.
With alternatives, you get locked into their pricing and data formats instead. But honestly, I'd rather be locked into Datadog's pricing than locked into spending my evenings debugging YAML files.
The question isn't "how do I avoid lock-in?" It's "what kind of lock-in can I actually live with?"
Just Pick Something That Works
This isn't really a technical decision - it's about what kind of pain you want to deal with.
Keep OpenTelemetry if you have someone who actually enjoys configuring collectors and doesn't mind getting paged when they break. Some teams have dedicated platform engineers who live for this stuff.
Switch to something else if you just want to ship features without thinking about your observability stack. Pay Datadog or New Relic or whoever and get on with your life.
Both choices are fine. The wrong choice is pretending that "free" observability doesn't cost you in operational overhead and weekend debugging sessions. Just pick something and stick with it - the perfect solution doesn't exist.