Four components that each have their own special way of failing spectacularly. Been running this nightmare in prod for 2+ years, so here's what you actually need to know.
Elastic APM sits on top of the ELK stack - that's Elasticsearch for storage, Logstash for data processing, and Kibana for dashboards that look pretty until they don't. The APM Server acts as the middleman, collecting traces from your apps via agents and shoving them into Elasticsearch where they'll consume RAM like it's free.
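To make that flow concrete, here's roughly what the agent side looks like for a Node.js service. This is a sketch, not gospel - the service name, server URL, and token are placeholders for whatever your own setup uses.

```typescript
// Minimal sketch of the agent -> APM Server -> Elasticsearch pipeline from the
// app's side. The agent has to load before anything else so it can hook http,
// pg, redis, and friends (in ESM setups it's usually loaded with
// `node -r elastic-apm-node/start`). All values below are placeholders.
import apm from 'elastic-apm-node';

apm.start({
  serviceName: 'checkout-service',            // how the service shows up in Kibana
  serverUrl: 'http://apm-server:8200',        // the APM Server "middleman"
  secretToken: process.env.APM_SECRET_TOKEN,  // or an API key, depending on setup
  environment: 'production',
});

// From here on, incoming HTTP requests become transactions, outgoing calls
// become spans, and everything gets shipped to the APM Server, which writes
// it into Elasticsearch.
```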
The Four Horsemen of Your Monitoring Apocalypse
APM Server: Handles incoming telemetry data. Crashes when you send it more data than expected (which is always). Memory usage scales linearly with trace volume, which sounds reasonable until your AWS bill jumps from $300 to $1200 overnight because someone decided to trace every database query during Black Friday.
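If you can't stop people from tracing every query, you can at least cap what the agent records before it ever reaches the APM Server. A hedged sketch using the Node.js agent - transactionMaxSpans and captureBody are real knobs, but the values here are made up and need tuning against your own traffic:

```typescript
import apm from 'elastic-apm-node';

apm.start({
  serviceName: 'checkout-service',
  serverUrl: 'http://apm-server:8200',
  // Drop spans beyond the first 100 per transaction instead of shipping
  // every single database query to the APM Server.
  transactionMaxSpans: 100,
  // Don't capture request bodies - they're big and rarely worth the storage.
  captureBody: 'off',
});
```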
Elasticsearch: Stores everything. Will happily eat 800GB of storage in three days if you don't configure index lifecycle management properly - that's roughly what it swallowed when I forgot to set retention policies during a particularly brutal week in March.
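For reference, a retention policy is basically one API call. A sketch with the official JS client - the policy name, rollover size, and 14-day window are illustrative only, and newer stacks ship managed ILM policies for the APM data streams that you may just want to edit instead:

```typescript
import { Client } from '@elastic/elasticsearch';

const es = new Client({ node: 'http://elasticsearch:9200' });

async function applyRetention(): Promise<void> {
  // Roll indices over daily or at 50GB, and delete anything older than 14 days.
  await es.ilm.putLifecycle({
    name: 'apm-traces-14d',
    policy: {
      phases: {
        hot: {
          actions: {
            rollover: { max_age: '1d', max_primary_shard_size: '50gb' },
          },
        },
        delete: {
          min_age: '14d',
          actions: { delete: {} },
        },
      },
    },
  });
}

applyRetention().catch(console.error);
```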
Kibana: The pretty UI that shows you colorful graphs of your failures. Service maps look impressive in demos, break consistently in production. The correlation features work great for finding obvious problems, fail miserably when you need to debug something actually complex.
APM Agents: Language-specific libraries that instrument your code. The Java agent adds 50-150MB of memory overhead per JVM, the Node.js agent sometimes breaks async/await error handling, and the .NET agent requires more XML configuration than anyone should have to write in 2024.
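When the Node agent loses track of an async error, the workaround is to report it yourself with apm.captureError. A rough sketch - chargeCard and the orderId tag are invented for illustration:

```typescript
import apm from 'elastic-apm-node'; // assumes the agent was started at boot

declare function chargeCard(orderId: string): Promise<void>; // hypothetical payment call

async function processPayment(orderId: string): Promise<void> {
  try {
    await chargeCard(orderId);
  } catch (err) {
    // The agent doesn't always attach errors thrown deep in async chains to
    // the right transaction, so capture it explicitly with something searchable.
    apm.captureError(err as Error, { custom: { orderId } });
    throw err;
  }
}
```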
Why Not Just Use Datadog?
Because Datadog costs more than my car payment once you hit any reasonable scale. Elastic APM starts free with the basic license - you can run the whole stack on-premise without paying Elastic a dime. Course, you'll pay in sleepless nights maintaining Elasticsearch clusters, but money's money.
The OpenTelemetry integration is actually solid. No vendor lock-in, standard instrumentation, and it works with their newer Elastic Distributions of OpenTelemetry (EDOT). Unlike Datadog's proprietary formats, New Relic's custom agents, or AppDynamics' controller architecture, you can export your data elsewhere if you get fed up.
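If you'd rather instrument with plain OpenTelemetry and keep the exit door open, the APM Server can ingest OTLP directly. A sketch with the standard OTel Node SDK - the endpoint path and bearer token are assumptions about a typical setup, so check them against your APM Server version:

```typescript
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';

// Vendor-neutral instrumentation exported to the APM Server's OTLP intake.
// Pointing the exporter at a different OTLP backend later means changing
// this config, not your application code.
const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: 'http://apm-server:8200/v1/traces',
    headers: { Authorization: `Bearer ${process.env.APM_SECRET_TOKEN}` },
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
```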
Real Talk: What Actually Works
Distributed tracing works well once you fight through the setup. Service dependency maps are pretty accurate for HTTP calls, less reliable for message queues and async processing. The machine learning anomaly detection catches obvious spikes but misses subtle degradation patterns that actually matter.
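For the async paths that auto-instrumentation misses, you can open transactions by hand so the work still shows up in traces and the service map. A rough sketch for a queue consumer using the Node agent's public API - the message shape and doWork are invented:

```typescript
import apm from 'elastic-apm-node'; // agent already started at boot

declare function doWork(body: string): Promise<void>; // hypothetical handler

async function handleMessage(msg: { id: string; body: string }): Promise<void> {
  // Auto-instrumentation won't start a transaction for a queue callback,
  // so create one manually and close it when the work is done.
  const transaction = apm.startTransaction('process order message', 'messaging');
  try {
    await doWork(msg.body);
    transaction?.setOutcome('success');
  } catch (err) {
    apm.captureError(err as Error);
    transaction?.setOutcome('failure');
    throw err;
  } finally {
    transaction?.end();
  }
}
```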
Log correlation between APM and Elastic Logs is genuinely useful - when your traces show slowdowns, you can jump directly to error logs from the same request. This feature alone saves hours of context switching between Splunk, Fluentd, or other logging solutions. The Elastic Common Schema (ECS) standardizes field names across logs, metrics, and traces for seamless correlation.
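The correlation hinges on your log lines carrying the same trace IDs as the APM data. A sketch with @elastic/ecs-pino-format - the exact import shape varies by package version, and the pino setup is an assumption about your logging stack:

```typescript
import pino from 'pino';
import { ecsFormat } from '@elastic/ecs-pino-format';

// ECS-shaped JSON logs. With the APM agent running in the same process,
// each line picks up trace.id / transaction.id / span.id, which is what
// lets Kibana jump from a slow trace straight to its error logs.
const logger = pino(ecsFormat({ apmIntegration: true }));

logger.error({ orderId: 'abc-123' }, 'payment provider timed out');
```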
Performance overhead stays reasonable if you tune sampling rates. Default configuration traces everything, which kills performance and fills storage. Set sampling to 10-20% for busy services, 100% for stuff that rarely gets traffic.
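Concretely, sampling is one agent setting. A sketch of 10% sampling in the Node agent - transactionSampleRate is the cross-agent knob, and the right value is something you tune per service:

```typescript
import apm from 'elastic-apm-node';

apm.start({
  serviceName: 'search-api',
  serverUrl: 'http://apm-server:8200',
  // Keep full trace detail for 10% of transactions; the rest are dropped or
  // sent without spans, depending on your agent and APM Server versions.
  transactionSampleRate: 0.1,
});

// For a low-traffic internal service you'd leave this at 1.0 and trace everything.
```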
Bottom line: Elastic APM works best when you're already using Elasticsearch for logs, search, or security (SIEM). If you're starting fresh, consider whether you want to become an Elasticsearch expert or just pay someone else to handle the infrastructure with Elastic Cloud or alternatives like Amazon OpenSearch.