When Your Agents Stop Talking to Datadog (And Why)
The most brutal production failures happen when your monitoring goes dark exactly when you need it most. Your agents were sending data fine for months, then suddenly nothing. Your dashboards show flat lines while your infrastructure is melting down.
Here's what actually goes wrong and how to fix it before you get paged at 3am:
The "Agent Is Running But No Metrics" Nightmare
The Silent Agent Problem: Your monitoring dashboards show flat lines across all metrics - CPU, memory, network, everything at zero - but `sudo systemctl status datadog-agent` reports the service is running normally with no error messages. This one is fucking evil: systemd says everything's fine, your dashboards are empty, and the agent process is alive but completely braindead.
Check the agent status first: `sudo datadog-agent status` will tell you what's actually happening. Look for these sections in the output:
- Forwarder: should say "Running" with recent transactions
- Collector: should list active checks
- DogStatsD: should show recent metrics received
If the forwarder is failing, you've got network issues. If the collector shows no checks, your agent config is fucked.
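If you don't want to scroll the full output, a quick grep does the job; section headings can vary a little between agent versions, so treat the pattern below as a starting point:

```bash
# Dump the status once, then pull out the sections that matter.
# Heading names can differ slightly by agent version - adjust the pattern if needed.
sudo datadog-agent status > /tmp/dd-status.txt
grep -A 10 -E '^(Forwarder|Collector|DogStatsD)' /tmp/dd-status.txt

# Coarser check: "health" exits non-zero when the agent thinks it's unhealthy.
sudo datadog-agent health
```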
Common causes (quick checks for all three are sketched below):
- Clock skew: Datadog rejects metrics with timestamps >10 minutes off. Run `ntpdate -s time.nist.gov` and restart the agent
- API key rotation: Someone rotated keys but forgot to update agents. Check `/etc/datadog-agent/datadog.yaml` for the correct `api_key`
- Firewall changes: Your network team blocked egress to `https://app.datadoghq.com` without telling anyone. Use `curl -v https://app.datadoghq.com` to test connectivity
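Here's a rough set of one-liners covering all three; the `/api/v1/validate` endpoint is Datadog's standard API key check, and the clock commands depend on whether your hosts run chrony or ntp:

```bash
# 1. Clock skew: how far off is this host? (chrony or ntpdate, whichever you have)
chronyc tracking 2>/dev/null || ntpdate -q time.nist.gov

# 2. API key: ask Datadog whether the key in datadog.yaml is actually valid.
API_KEY=$(sudo awk '/^api_key:/ {print $2}' /etc/datadog-agent/datadog.yaml)
curl -s -H "DD-API-KEY: ${API_KEY}" https://api.datadoghq.com/api/v1/validate

# 3. Egress: can this host even reach Datadog through the firewall?
curl -sv https://app.datadoghq.com 2>&1 | tail -n 5
```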
The official troubleshooting guide covers the basics, but here's what they don't tell you:
Memory pressure kills agents: If your host is swapping, the agent will start dropping metrics silently. Monitor agent memory usage with `ps aux | grep datadog-agent` - if resident memory is >500MB, something's wrong.
Log rotation breaks everything: When logrotate runs, it can kill agent log handlers. Check `/var/log/datadog/agent.log` for sudden stops in logging. You'll see "ERROR: log file descriptor closed unexpectedly" followed by silence. Restart the agent after log rotation, or better yet, configure logrotate to send SIGHUP to the agent process instead of just rotating files out from under it.
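If logrotate owns /var/log/datadog on your hosts, a drop-in along these lines is one way to do it; verify that your agent build actually reopens its log on HUP before trusting the signal, and note that the path and schedule here are assumptions:

```bash
# Hypothetical logrotate drop-in for the agent's own logs.
sudo tee /etc/logrotate.d/datadog-agent <<'EOF' >/dev/null
/var/log/datadog/*.log {
    daily
    rotate 7
    compress
    missingok
    notifempty
    postrotate
        # Swap in "systemctl kill -s HUP datadog-agent.service" if your agent
        # build reopens its log on HUP; a restart is the blunt-but-reliable fallback.
        systemctl restart datadog-agent.service
    endscript
}
EOF
```

The agent also rotates its own logs via `log_file_max_size` and `log_file_max_rolls` in datadog.yaml, which may be the simpler fix if logrotate and the agent keep fighting over the same files.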
High Memory Usage: When Your Agent Becomes a Memory Hog
Datadog agents are supposed to be lightweight, but in production they can eat gigabytes of memory and nobody knows why. I've seen agents consume 8GB RAM on a 16GB host, essentially DOSing the applications they're monitoring.
Quick diagnosis: Run `sudo datadog-agent status` and look for:
- Forwarder queue size: Should be <1000. Higher means backlog
- Check collection time: Individual checks taking >30s indicate problems
- DogStatsD buffer: High buffer usage means metric flood
The high memory troubleshooting docs are actually useful here, but miss the real production culprits.
APM trace explosion: The biggest memory killer is applications sending massive traces. One shitty microservice generating 10,000-span traces will crash your agent. Check APM resource usage and configure trace sampling immediately.
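A hedged starting point in datadog.yaml; the option names below exist in agent 7 releases as far as I know, but they've shifted between versions, so confirm against your agent's own config reference:

```yaml
## /etc/datadog-agent/datadog.yaml
apm_config:
  enabled: true
  # Cap how many traces per second the agent samples locally (example value).
  max_traces_per_second: 10
  # Cap APM events per second as well (example value).
  max_events_per_second: 200
```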
Custom metrics cardinality bomb: When someone instruments metrics with `user_id` tags, you get millions of unique timeseries, and every one of them costs memory. Use `datadog-agent check <check_name>` (e.g. `system_core`) to see how many metric samples a single check generates, then go hunt the high-cardinality tags.
Log tailing gone wrong: If you're tailing massive log files, the agent buffers everything in memory. Configure log processing rules to filter at the agent level, not after ingestion.
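Agent-level filtering looks roughly like this in an integration's logs config; the path, service, and pattern below are placeholders, not anything your setup actually has:

```yaml
## conf.d/<your_integration>.d/conf.yaml
logs:
  - type: file
    path: /var/log/myapp/app.log      # placeholder path
    service: myapp                    # placeholder service
    source: custom
    log_processing_rules:
      # Drop chatty lines before they ever leave the host.
      - type: exclude_at_match
        name: drop_debug_noise
        pattern: 'DEBUG|healthcheck'
```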
Fix it before it kills your host:
- Set agent memory limits in systemd: `MemoryMax=2G` in `/etc/systemd/system/datadog-agent.service.d/memory.conf` (see the sketch after this list)
- Configure the DogStatsD buffer size to prevent metric floods
- Enable agent resource limits in the config
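A minimal sketch of the systemd route (MemoryMax wants a reasonably modern systemd; older boxes spell it MemoryLimit), plus a way to confirm it stuck:

```bash
# Write the drop-in, reload, restart.
sudo mkdir -p /etc/systemd/system/datadog-agent.service.d
sudo tee /etc/systemd/system/datadog-agent.service.d/memory.conf <<'EOF' >/dev/null
[Service]
MemoryMax=2G
EOF
sudo systemctl daemon-reload && sudo systemctl restart datadog-agent

# Confirm the limit actually applied.
systemctl show datadog-agent -p MemoryMax
```

For the DogStatsD side, the buffer and queue sizes live in datadog.yaml (`dogstatsd_buffer_size` and friends); leave the defaults alone until the status output tells you they're the problem.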
Kubernetes Agent Clusterfuck
Kubernetes Complexity Layers: The Datadog agent deploys as a DaemonSet (one per node) plus a cluster agent for metadata aggregation, all coordinating through the Kubernetes API server. When any layer fails, the whole monitoring stack goes dark, usually during your worst production incidents. Kubernetes adds layers of complexity that break in spectacular ways, and the cluster agent sounds great right up until it decides to crash in the middle of one of those incidents.
DaemonSet deployment problems: Despite Datadog's warnings, many teams still use manual DaemonSet deployment because "it's simpler." It's not. It breaks in ways that waste weeks.
Common DaemonSet failures:
- RBAC permissions: Agent pods fail to start with vague "forbidden" errors. Check cluster role bindings (a quick check is sketched after this list)
- Node selector conflicts: Agents don't deploy to new nodes because selector rules exclude them
- Resource limits: Kubernetes kills agent pods under memory pressure if limits are too low
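For the RBAC case, this separates permission failures from everything else in about ten seconds; the namespace, service account, and label below are assumptions - match them to your deployment:

```bash
# Can the agent's service account do what the agent needs?
kubectl auth can-i list pods  --as=system:serviceaccount:datadog:datadog-agent
kubectl auth can-i get  nodes --as=system:serviceaccount:datadog:datadog-agent

# "forbidden" also shows up in the pods' own events.
kubectl -n datadog describe pods -l app=datadog-agent | grep -iA 3 forbidden
```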
Use the fucking Operator: The Datadog Operator exists for a reason. It handles RBAC, node selectors, and resource management automatically.
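Getting onto the Operator is mostly boilerplate: install it with Helm, then hand it a DatadogAgent resource. The manifest below is a minimal sketch following the v2alpha1 CRD; double-check the apiVersion and field names against whatever Operator release you actually install:

```bash
# Install the Operator and give it credentials.
helm repo add datadog https://helm.datadoghq.com && helm repo update
helm install datadog-operator datadog/datadog-operator
kubectl create secret generic datadog-secret --from-literal=api-key=<YOUR_API_KEY>

# Minimal DatadogAgent resource - the Operator turns this into the DaemonSet,
# cluster agent, RBAC, and mounts so you don't hand-roll any of it.
kubectl apply -f - <<'EOF'
apiVersion: datadoghq.com/v2alpha1
kind: DatadogAgent
metadata:
  name: datadog
spec:
  global:
    credentials:
      apiSecret:
        secretName: datadog-secret
        keyName: api-key
EOF
```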
Cluster Agent crashes: The cluster agent is supposed to aggregate metadata and reduce API load, but it crashes when overwhelmed. Check the cluster agent logs for:
- `too many open files`: increase file descriptor limits
- `context deadline exceeded`: the Kubernetes API is slow; increase timeouts
- `authentication failed`: service account tokens expired
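To see which of these you're actually hitting, pull the cluster agent's own status and grep its recent logs; the namespace and deployment name below assume the defaults:

```bash
# Status from inside the cluster agent pod (default deployment name assumed).
kubectl -n datadog exec deploy/datadog-cluster-agent -- datadog-cluster-agent status

# Grep the last hour of logs for the three usual suspects.
kubectl -n datadog logs deploy/datadog-cluster-agent --since=1h \
  | grep -iE 'too many open files|context deadline exceeded|authentication failed'
```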
Container discovery problems: Agents discover containers automatically, but this breaks with:
- Custom networking: CNI plugins that hide containers from discovery - Calico with strict network policies makes agents return empty container lists
- Namespace restrictions: RBAC that prevents cross-namespace discovery - agents throw "forbidden: pods is forbidden" errors that don't show in agent status
- Container runtime changes: Docker to containerd migrations break agent configs - agent 7.41.0 has a known issue where containerd socket detection fails on Ubuntu 22.04 with non-standard socket paths
Fix by enabling container troubleshooting and checking agent autodiscovery rules.
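For the containerd case specifically, pointing the agent at the socket explicitly beats hoping detection works; the path below is the common default and may not be yours:

```yaml
## /etc/datadog-agent/datadog.yaml (or DD_CRI_SOCKET_PATH on the agent container)
# Tell the agent exactly where the containerd socket lives post-migration.
cri_socket_path: /run/containerd/containerd.sock
```

On Kubernetes the socket also has to be mounted into the agent pod - another thing the Operator handles so you don't have to.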
The Proxy Hell
Corporate networks require proxy servers, and Datadog agents hate proxies with the fury of a thousand suns. The proxy configuration docs make it sound easy. It's not.
SSL interception breaks everything: Your corporate proxy intercepts SSL and presents its own certificates. Datadog agents reject these and fail silently.
Fix with proxy bypass rules or certificate pinning workarounds:

```yaml
## /etc/datadog-agent/datadog.yaml
skip_ssl_validation: true  # Only for desperate situations
proxy:
  https: "http://proxy.company.com:8080"
  http: "http://proxy.company.com:8080"
```
Proxy authentication: NTLM authentication makes agents sad. Use basic auth if possible, or configure proxy bypass for Datadog endpoints.
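Bypass rules live in the same proxy block; `no_proxy` takes a list of hosts that skip the proxy entirely - whether your network team allows that direct egress is the real fight:

```yaml
## /etc/datadog-agent/datadog.yaml
proxy:
  https: "http://proxy.company.com:8080"
  http: "http://proxy.company.com:8080"
  no_proxy:
    # These hosts go direct instead of through the NTLM proxy.
    - app.datadoghq.com
    - api.datadoghq.com
```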
Split tunneling disasters: When VPN routes some traffic through proxy and some direct, agents get confused about which path to use. You'll see intermittent "connection refused" errors in agent logs but everything looks configured correctly. Create explicit routing rules for Datadog IPs, or better yet, avoid split tunnel VPNs entirely if you want reliable monitoring.
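If you go the explicit-routing route, Datadog publishes its ranges at ip-ranges.datadoghq.com; the sketch below assumes that JSON exposes an `agents.prefixes_ipv4` list and that you know which gateway is the direct path - verify both before scripting against it:

```bash
# Hypothetical: force Datadog agent traffic around the split tunnel.
GATEWAY="10.0.0.1"   # placeholder - your direct-egress gateway

curl -s https://ip-ranges.datadoghq.com \
  | jq -r '.agents.prefixes_ipv4[]' \
  | while read -r prefix; do
      sudo ip route replace "$prefix" via "$GATEWAY"
    done
```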
Performance That Destroys Production
Monitoring shouldn't make your production systems slower, but badly configured Datadog can destroy performance in subtle ways.
Check interval abuse: Running checks every 10 seconds sounds reasonable until you have 50 checks per host. Database connection checks that run too frequently can exhaust connection pools.
Configure sensible intervals:
```yaml
init_config:

instances:
  - host: localhost
    port: 5432
    min_collection_interval: 60  # seconds, not 10
```
Log tailing performance: Tailing high-volume log files kills disk I/O. Use log sampling aggressively and filter garbage logs at the agent level.
Network saturation: Agents sending metrics to Datadog consume bandwidth. In bandwidth-constrained environments, configure compression and adjust flush intervals to batch network calls.
The key insight: Datadog works great until it doesn't, and when it fails, it fails in production during incidents when you need it most. Build redundancy and monitoring for your monitoring. Keep a simple external check that verifies your agents are actually sending data, not just running processes.
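The external check can be as dumb as a cron job hitting the Datadog query API from a box outside the monitored fleet. `datadog.agent.running` is the heartbeat metric I'd query, but confirm it exists in your account; the keys and host name below are placeholders:

```bash
#!/usr/bin/env bash
# Page if a host hasn't reported datadog.agent.running in the last 10 minutes.
# DD_API_KEY / DD_APP_KEY / HOST are placeholders - inject your own.
NOW=$(date +%s)
RESP=$(curl -s -g \
  -H "DD-API-KEY: ${DD_API_KEY}" -H "DD-APPLICATION-KEY: ${DD_APP_KEY}" \
  "https://api.datadoghq.com/api/v1/query?from=$((NOW - 600))&to=${NOW}&query=avg:datadog.agent.running{host:${HOST}}")

# Crude string check: a series with data points means the agent reported recently.
if echo "${RESP}" | grep -q '"pointlist"'; then
  echo "OK: ${HOST} is reporting"
else
  echo "CRITICAL: no agent data from ${HOST} in the last 10 minutes"
  exit 2
fi
```

Wire the exit code into whatever secondary alerting you trust - anything that isn't Datadog itself.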
Next, let's look at the dashboard and UI performance problems that make troubleshooting even harder when you're already having a bad day.