Four components that each have their own special way of failing spectacularly. Been running this nightmare in prod for 2+ years, so here's what you actually need to know.
Elastic APM sits on top of the ELK stack - that's Elasticsearch for storage, Logstash for data processing, and Kibana for dashboards that look pretty until they don't. The APM Server acts as the middleman, collecting traces from your apps via agents and shoving them into Elasticsearch where they'll consume RAM like it's free.
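To make that flow concrete, here's roughly what the agent side looks like for a Node.js service. This is a sketch, not gospel - the service name, server URL, and token are placeholders for whatever your own setup uses.

```typescript
// Minimal sketch of the agent -> APM Server -> Elasticsearch pipeline from the
// app's side. The agent has to load before anything else so it can hook http,
// pg, redis, and friends (in ESM setups it's usually loaded with
// `node -r elastic-apm-node/start`). All values below are placeholders.
import apm from 'elastic-apm-node';

apm.start({
  serviceName: 'checkout-service',            // how the service shows up in Kibana
  serverUrl: 'http://apm-server:8200',        // the APM Server "middleman"
  secretToken: process.env.APM_SECRET_TOKEN,  // or an API key, depending on setup
  environment: 'production',
});

// From here on, incoming HTTP requests become transactions, outgoing calls
// become spans, and everything gets shipped to the APM Server, which writes
// it into Elasticsearch.
```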
The Four Horsemen of Your Monitoring Apocalypse
APM Server: Handles incoming telemetry data. Crashes when you send it more data than expected (which is always). Memory usage scales linearly with trace volume, which sounds reasonable until your AWS bill jumps from $300 to $1200 overnight because someone decided to trace every database query during Black Friday.
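If you can't stop people from tracing every query, you can at least cap what the agent records before it ever reaches the APM Server. A hedged sketch using the Node.js agent - transactionMaxSpans and captureBody are real knobs, but the values here are made up and need tuning against your own traffic:

```typescript
import apm from 'elastic-apm-node';

apm.start({
  serviceName: 'checkout-service',
  serverUrl: 'http://apm-server:8200',
  // Drop spans beyond the first 100 per transaction instead of shipping
  // every single database query to the APM Server.
  transactionMaxSpans: 100,
  // Don't capture request bodies - they're big and rarely worth the storage.
  captureBody: 'off',
});
```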
Elasticsearch: Stores everything. Will happily eat 800GB of storage in three days if you don't configure index lifecycle management properly - that's roughly what it swallowed when I forgot to set retention policies during a particularly brutal week in March.
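For reference, a retention policy is basically one API call. A sketch with the official JS client - the policy name, rollover size, and 14-day window are illustrative only, and newer stacks ship managed ILM policies for the APM data streams that you may just want to edit instead:

```typescript
import { Client } from '@elastic/elasticsearch';

const es = new Client({ node: 'http://elasticsearch:9200' });

async function applyRetention(): Promise<void> {
  // Roll indices over daily or at 50GB, and delete anything older than 14 days.
  await es.ilm.putLifecycle({
    name: 'apm-traces-14d',
    policy: {
      phases: {
        hot: {
          actions: {
            rollover: { max_age: '1d', max_primary_shard_size: '50gb' },
          },
        },
        delete: {
          min_age: '14d',
          actions: { delete: {} },
        },
      },
    },
  });
}

applyRetention().catch(console.error);
```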
Kibana: The pretty UI that shows you colorful graphs of your failures. Service maps look impressive in demos, break consistently in production. The correlation features work great for finding obvious problems, fail miserably when you need to debug something actually complex.
APM Agents: Language-specific libraries that instrument your code. The Java agent adds 50-150MB of memory overhead per JVM, the Node.js agent sometimes breaks async/await error handling, and the .NET agent requires more XML configuration than anyone should have to write in 2024.
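When the Node agent loses track of an async error, the workaround is to report it yourself with apm.captureError. A rough sketch - chargeCard and the orderId tag are invented for illustration:

```typescript
import apm from 'elastic-apm-node'; // assumes the agent was started at boot

declare function chargeCard(orderId: string): Promise<void>; // hypothetical payment call

async function processPayment(orderId: string): Promise<void> {
  try {
    await chargeCard(orderId);
  } catch (err) {
    // The agent doesn't always attach errors thrown deep in async chains to
    // the right transaction, so capture it explicitly with something searchable.
    apm.captureError(err as Error, { custom: { orderId } });
    throw err;
  }
}
```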
Why Not Just Use Datadog?
Because Datadog costs more than my car payment once you hit any reasonable scale. Elastic APM starts free with the basic license - you can run the whole stack on-premise without paying Elastic a dime. Course, you'll pay in sleepless nights maintaining Elasticsearch clusters, but money's money.
The OpenTelemetry integration is actually solid. No vendor lock-in, standard instrumentation, and it works with their newer Elastic Distributions of OpenTelemetry (EDOT). Unlike Datadog's proprietary formats, New Relic's custom agents, or AppDynamics' controller architecture, you can export your data elsewhere if you get fed up.
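If you'd rather instrument with plain OpenTelemetry and keep the exit door open, the APM Server can ingest OTLP directly. A sketch with the standard OTel Node SDK - the endpoint path and bearer token are assumptions about a typical setup, so check them against your APM Server version:

```typescript
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';

// Vendor-neutral instrumentation exported to the APM Server's OTLP intake.
// Pointing the exporter at a different OTLP backend later means changing
// this config, not your application code.
const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: 'http://apm-server:8200/v1/traces',
    headers: { Authorization: `Bearer ${process.env.APM_SECRET_TOKEN}` },
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
```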
Real Talk: What Actually Works
Distributed tracing works well once you fight through the setup. Service dependency maps are pretty accurate for HTTP calls, less reliable for message queues and async processing. The machine learning anomaly detection catches obvious spikes but misses subtle degradation patterns that actually matter.
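For the async paths that auto-instrumentation misses, you can open transactions by hand so the work still shows up in traces and the service map. A rough sketch for a queue consumer using the Node agent's public API - the message shape and doWork are invented:

```typescript
import apm from 'elastic-apm-node'; // agent already started at boot

declare function doWork(body: string): Promise<void>; // hypothetical handler

async function handleMessage(msg: { id: string; body: string }): Promise<void> {
  // Auto-instrumentation won't start a transaction for a queue callback,
  // so create one manually and close it when the work is done.
  const transaction = apm.startTransaction('process order message', 'messaging');
  try {
    await doWork(msg.body);
    transaction?.setOutcome('success');
  } catch (err) {
    apm.captureError(err as Error);
    transaction?.setOutcome('failure');
    throw err;
  } finally {
    transaction?.end();
  }
}
```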
Log correlation between APM and Elastic Logs is genuinely useful - when your traces show slowdowns, you can jump directly to error logs from the same request. This feature alone saves hours of context switching between Splunk, Fluentd, or other logging solutions. The Elastic Common Schema (ECS) standardizes field names across logs, metrics, and traces for seamless correlation.
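The correlation hinges on your log lines carrying the same trace IDs as the APM data. A sketch with @elastic/ecs-pino-format - the exact import shape varies by package version, and the pino setup is an assumption about your logging stack:

```typescript
import pino from 'pino';
import { ecsFormat } from '@elastic/ecs-pino-format';

// ECS-shaped JSON logs. With the APM agent running in the same process,
// each line picks up trace.id / transaction.id / span.id, which is what
// lets Kibana jump from a slow trace straight to its error logs.
const logger = pino(ecsFormat({ apmIntegration: true }));

logger.error({ orderId: 'abc-123' }, 'payment provider timed out');
```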
Performance overhead stays reasonable if you tune sampling rates. Default configuration traces everything, which kills performance and fills storage. Set sampling to 10-20% for busy services, 100% for stuff that rarely gets traffic.
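Concretely, sampling is one agent setting. A sketch of 10% sampling in the Node agent - transactionSampleRate is the cross-agent knob, and the right value is something you tune per service:

```typescript
import apm from 'elastic-apm-node';

apm.start({
  serviceName: 'search-api',
  serverUrl: 'http://apm-server:8200',
  // Keep full trace detail for 10% of transactions; the rest are dropped or
  // sent without spans, depending on your agent and APM Server versions.
  transactionSampleRate: 0.1,
});

// For a low-traffic internal service you'd leave this at 1.0 and trace everything.
```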
Bottom line: Elastic APM works best when you're already using Elasticsearch for logs, search, or security (SIEM). If you're starting fresh, consider whether you want to become an Elasticsearch expert or just pay someone else to handle the infrastructure with Elastic Cloud or alternatives like Amazon OpenSearch.