# Datadog Production Troubleshooting: AI-Optimized Reference

## Critical Failure Scenarios and Recovery

### Agent Silent Failure Mode
- Problem: Agent process running but no metrics appearing in Datadog
- Detection: `sudo datadog-agent status` shows zero successful transactions despite "Running" status
- Critical Impact: Complete monitoring blindness during incidents
- Recovery Time: 15-30 minutes

Root Causes:
- Clock skew >10 minutes (Datadog rejects all metrics)
- API key rotation without agent updates
- Corporate firewall blocking https://app.datadoghq.com
- Proxy SSL interception breaking agent connections

Diagnostic Commands:

```bash
sudo datadog-agent status            # Check forwarder section
timedatectl status                   # Verify system clock
curl -v https://app.datadoghq.com    # Test connectivity
```
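
The nastiest of these is clock skew, because `datadog-agent status` keeps reporting "Running" while every payload gets rejected. A minimal sketch for catching it, assuming outbound HTTPS to app.datadoghq.com is allowed and Python is available on the host (the 10-minute tolerance mirrors the figure above):

```python
# clock_skew_check.py - compare local UTC time with the Date header returned
# by app.datadoghq.com; skew beyond ~10 minutes means metrics get rejected
# silently even though the agent reports "Running".
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from urllib.request import urlopen

MAX_SKEW_SECONDS = 600  # ~10 minute tolerance noted above

with urlopen("https://app.datadoghq.com", timeout=10) as resp:
    server_time = parsedate_to_datetime(resp.headers["Date"])

skew = abs((datetime.now(timezone.utc) - server_time).total_seconds())
print(f"clock skew vs Datadog: {skew:.1f}s")
if skew > MAX_SKEW_SECONDS:
    print("WARNING: skew exceeds tolerance - fix NTP before touching the agent")
```

A warning here points at NTP, not the agent, which saves the usual half hour of restarting things that were never broken.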
### Memory Explosion Scenarios
- Threshold: Agent consuming >500MB RAM (normal: ~200MB)
- Critical Impact: Host resource exhaustion, application performance degradation
- Failure Point: 8GB consumption observed on production systems

Primary Causes:
- APM trace buffer overflow from 10,000+ span traces
- Custom metrics cardinality explosion (UUID/user_id tags)
- Log tailing of high-volume files without filtering
- DogStatsD buffer accumulation during network issues

Prevention Configuration:

```ini
# systemd memory limits
MemoryMax=2G
MemoryHigh=1.5G
```

```yaml
# Agent limits
forwarder_timeout: 20
forwarder_retry_queue_max_size: 100
```
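
A cheap guardrail on top of the systemd limits is a watchdog that checks the agent's resident memory before it reaches the 8GB failure point. A sketch only: matching processes by "datadog" in their command line and the threshold value are assumptions to adapt per host:

```python
# agent_memory_watch.py - sum resident memory across Datadog agent processes
# and warn above the ~500MB threshold noted above.
import subprocess

THRESHOLD_MB = 500

ps = subprocess.run(["ps", "-eo", "rss=,args="],
                    capture_output=True, text=True, check=True)

total_kb = 0
for line in ps.stdout.splitlines():
    rss, _, args = line.strip().partition(" ")
    if rss.isdigit() and "datadog" in args:
        total_kb += int(rss)

total_mb = total_kb / 1024
print(f"datadog-agent processes: {total_mb:.0f} MB RSS")
if total_mb > THRESHOLD_MB:
    print("WARNING: above normal (~200MB); check APM buffers, log tailing, DogStatsD backlog")
```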
## Cost Explosion Patterns

### Custom Metrics Cardinality Bomb
- Cost Multiplier: Each unique tag combination = separate billable metric
- Example Disaster: 100k users × 5 devices × 10 browsers × 20 regions = 100M metrics
- Annual Cost Impact: $5M for single counter with naive tagging
- Detection Lag: Only visible in monthly billing (too late)

High-Risk Tags:
- `user_id`, `container_id`, `request_id` (UUID patterns)
- Geographic data with high precision
- Timestamp-based tags
- Session identifiers
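
Billable timeseries are the product of distinct values per tag key, not the sum, which is how the example above reaches 100M. A quick sketch for estimating cardinality before a metric ships (tag counts are illustrative):

```python
# cardinality_estimate.py - billable custom-metric timeseries are the product
# of distinct values per tag key; the counts mirror the example above
# (100k users x 5 devices x 10 browsers x 20 regions).
from math import prod

distinct_values_per_tag = {
    "user_id": 100_000,  # UUID-style tags are the usual culprit
    "device": 5,
    "browser": 10,
    "region": 20,
}

print(f"estimated billable timeseries: {prod(distinct_values_per_tag.values()):,}")
# -> estimated billable timeseries: 100,000,000
```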

Cost-Effective Alternatives:
- Replace `user_id:12345` with `user_tier:premium`
- Replace `container_id:abc123` with `service:user-api`
- Replace geographic coordinates with region groupings
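
One way to enforce these substitutions is a small remapping layer in front of the metric client, so unbounded identifiers never reach DogStatsD. A minimal sketch; `lookup_tier` and `lookup_service` are hypothetical placeholders for your own billing and orchestrator metadata:

```python
# tag_remap.py - swap unbounded identifiers for bounded categories before a
# metric is emitted.
def lookup_tier(user_id: str) -> str:
    return "premium"        # placeholder: query your billing/CRM data

def lookup_service(container_id: str) -> str:
    return "user-api"       # placeholder: query orchestrator labels

REMAP = {
    "user_id": lambda v: f"user_tier:{lookup_tier(v)}",
    "container_id": lambda v: f"service:{lookup_service(v)}",
}

def sanitize_tags(tags: list[str]) -> list[str]:
    safe = []
    for tag in tags:
        key, _, value = tag.partition(":")
        remap = REMAP.get(key)
        safe.append(remap(value) if remap else tag)
    return safe

print(sanitize_tags(["user_id:12345", "container_id:abc123", "env:prod"]))
# -> ['user_tier:premium', 'service:user-api', 'env:prod']
```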
### APM Span Cost Explosion
- Unit Cost: $0.0012 per span
- Hidden Multiplier: Microservice calls generate 50+ spans per user action
- Real Example: User signup flow generating 200 spans = $60k annually at 1M signups
- Critical Threshold: Services generating >1M spans monthly

Span Generation Patterns:
- HTTP request span (1)
- Database queries (5-20 per request)
- Service-to-service calls (3-10 per request)
- Cache operations (2-5 per request)
- Background job spawning (variable)
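
Multiplying that per-request breakdown by traffic shows quickly whether a service will cross the >1M spans/month threshold. A rough sketch using midpoints of the ranges above; the request volume is an assumption to replace with real numbers:

```python
# span_volume_estimate.py - rough monthly span volume from the per-request
# breakdown above; midpoints and the request volume are illustrative assumptions.
spans_per_request = {
    "http_request": 1,
    "database_queries": 12,   # midpoint of 5-20
    "service_to_service": 6,  # midpoint of 3-10
    "cache_operations": 3,    # midpoint of 2-5
}
requests_per_month = 500_000  # assumption: plug in real traffic

monthly_spans = sum(spans_per_request.values()) * requests_per_month
print(f"~{monthly_spans:,} spans/month")  # ~11,000,000
if monthly_spans > 1_000_000:
    print("over the 1M/month threshold - the sampling config below is not optional")
```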

Cost Control Configuration:

```yaml
# Smart sampling strategy
apm_config:
  trace_sample_rate: 0.1   # 10% baseline
  error_sample_rate: 1.0   # 100% errors
  priority_sampling: true
  filter_tags:
    reject:
      - "http.url:/health"
      - "resource.name:GET /ping"
```
### Log Volume Death Spiral
- Cost: $1.27 per million events
- Growth Pattern: Exponential - starts manageable, explodes with microservice proliferation
- Budget Killer: Debug logging in production = $231k annually for 50 services
- Detection Difficulty: Gradual growth masks critical threshold crossing

Volume Explosion Sources:
- Debug logs left enabled in production
- Health check logging (high frequency, low value)
- Authentication success logs (high volume, minimal insight)
- Microservice mesh communication logging

Cost-Effective Filtering:

```yaml
# Agent-level filtering
logs_config:
  processing_rules:
    - type: exclude_at_match
      name: exclude_health_checks
      pattern: "GET /health|GET /ping"
    - type: sampling
      name: sample_info_logs
      sampling_rate: 0.1   # 10% of INFO logs
```
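
Before arguing about filters, run the arithmetic: at $1.27 per million events the debug-logging figure above falls straight out. A sketch in which the per-service event rate and the filtering effectiveness are assumptions:

```python
# log_cost_estimate.py - annual ingestion cost from daily event volume, using
# the $1.27 per million events figure above.
COST_PER_MILLION = 1.27

services = 50
events_per_service_per_day = 10_000_000  # assumed: debug logging left on

annual_events = services * events_per_service_per_day * 365
annual_cost = annual_events / 1_000_000 * COST_PER_MILLION
print(f"{annual_events / 1e9:,.1f}B events/year -> ${annual_cost:,.0f}/year")
# -> 182.5B events/year -> $231,775/year

# Assume the exclusion + 10% INFO sampling rules above drop ~85% of volume:
print(f"after filtering: ${annual_cost * 0.15:,.0f}/year")
```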
## Performance Degradation Thresholds

### Dashboard Timeout Scenarios
- Trigger Conditions: >20 widgets per dashboard during incident traffic
- Failure Pattern: Timeouts occur when most needed (during outages)
- User Impact: Cannot troubleshoot incidents when monitoring UI fails
- Recovery: Requires simplified emergency dashboards

Timeout-Resistant Design:
- Limit dashboards to <15 widgets
- Use 1-hour time windows during incidents
- Pre-build emergency dashboards with basic metrics
- Avoid complex aggregations in incident dashboards
### Agent Resource Constraints
- CPU Impact: Check intervals <60s on busy hosts cause application slowdown
- Memory Impact: Unbounded buffering leads to OOM kills
- Network Impact: Uncompressed transmission saturates limited bandwidth
- I/O Impact: High-frequency log tailing degrades disk performance

Production Optimization:

```yaml
# Balanced performance configuration
min_collection_interval: 60   # seconds
compression: gzip
batch_max_concurrent_send: 10
log_file_max_size: 10MB
```
## Kubernetes-Specific Failure Modes

### Cluster Agent Crash Patterns
- Trigger: Large deployments overwhelming Kubernetes API
- Resource Requirements: Minimum 500m CPU, 512Mi memory
- Failure Symptoms: Container monitoring goes dark during deployments
- Recovery Complexity: Requires HA configuration across AZs

Common RBAC Failures:
- Missing cluster-wide read permissions
- Service account token expiration
- Network policy blocking API server access
### DaemonSet Deployment Issues
- Anti-Pattern: Manual YAML deployment ("it's simpler")
- Reality: Breaks in production with complex failure modes
- Recommended: Use Datadog Operator for automated management
- Failure Points: Node selectors, resource limits, RBAC permissions
## Cost Planning and Budget Protection

### Budget Explosion Timeline
- Immediate (0-7 days): Infrastructure discovery finds all resources
- Week 2-4: APM instrumentation spans multiply with microservice adoption
- Month 2-3: Custom metrics cardinality grows with feature development
- Month 3-6: Log volume explodes as debug logging accumulates
- Annual renewal: 3-10x original estimates common
### Automated Cost Controls
- Emergency Sampling: 50% trace reduction at 80% budget
- Log Filtering: Automatic exclusion at 90% budget
- Metric Freeze: Stop new custom metrics at 95% budget
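
Those three thresholds reduce to a small decision table that a scheduled job can evaluate against month-to-date spend. A sketch of just the threshold logic; how spend is fetched and how each action is applied is deliberately left out:

```python
# budget_controls.py - map month-to-date spend to the escalating actions above.
BUDGET_ACTIONS = [
    (0.80, "drop trace sampling to 50%"),
    (0.90, "enable automatic log exclusion filters"),
    (0.95, "freeze new custom metrics"),
]

def actions_for(spend: float, budget: float) -> list[str]:
    fraction = spend / budget
    return [action for threshold, action in BUDGET_ACTIONS if fraction >= threshold]

print(actions_for(41_000, 50_000))  # 82% spent -> ['drop trace sampling to 50%']
print(actions_for(48_000, 50_000))  # 96% spent -> all three actions fire
```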
Usage Attribution Strategy:
- Tag all resources with team/service ownership
- Implement chargeback based on actual usage
- Create team-level cost dashboards
- Require approval for high-cardinality metrics
### Growth Modeling Reality
- Linear Infrastructure: 20% customer growth = 20% more hosts
- Non-Linear Application: 20% more customers = 50% more containers (auto-scaling)
- Exponential Custom Metrics: One new microservice = 10x spans (service mesh)
- Feature Launch Impact: A/B testing = 5x custom metrics
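
Because these multipliers compound, renewals land well above the original estimate. A sketch applying them to a baseline footprint, where the baseline counts and the 20% growth scenario are illustrative assumptions:

```python
# growth_projection.py - apply the growth patterns above to a baseline footprint.
baseline = {"hosts": 200, "containers": 2_000, "custom_metrics": 50_000}

projected = {
    "hosts": baseline["hosts"] * 1.2,                  # linear: +20% with customers
    "containers": baseline["containers"] * 1.5,        # non-linear: auto-scaling adds +50%
    "custom_metrics": baseline["custom_metrics"] * 5,  # feature launches / A/B tests: 5x
}

for unit, value in projected.items():
    print(f"{unit}: {baseline[unit]:,} -> {value:,.0f}")
```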
## Critical Configuration Templates

### Production-Ready Agent Configuration

```yaml
# /etc/datadog-agent/datadog.yaml
api_key: ${DD_API_KEY}
site: datadoghq.com
hostname_fqdn: true

# Resource limits
forwarder_timeout: 20
forwarder_retry_queue_max_size: 100
log_file_max_size: 10MB
log_file_max_rolls: 5

# Performance optimization
min_collection_interval: 60
compression: gzip
batch_max_concurrent_send: 10

# APM configuration
apm_config:
  enabled: true
  trace_sample_rate: 0.1
  error_sample_rate: 1.0
  max_traces_per_second: 10

# Log configuration
logs_config:
  logs_dd_url: intake.logs.datadoghq.com:10516
  processing_rules:
    - type: sampling
      name: sample_info
      sampling_rate: 0.1
```
### Emergency Diagnostic Commands

```bash
# Agent health check
sudo datadog-agent status

# Memory and process inspection
ps aux | grep datadog-agent
sudo systemctl status datadog-agent

# Network connectivity
curl -v https://app.datadoghq.com
telnet intake.logs.datadoghq.com 10516

# Configuration validation
sudo datadog-agent configcheck
sudo datadog-agent check system_core

# Generate support bundle
sudo datadog-agent flare
```
## Severity Matrix for Incident Response
Issue Type | Detection Time | Fix Complexity | Business Impact | Blame Assignment |
---|---|---|---|---|
Agent stops sending metrics | Immediate (flat dashboards) | Low (restart/config) | Critical (blind monitoring) | Platform team |
Memory leak | Hours (gradual) | Medium (limits/restart) | High (app degradation) | Could be anyone |
Cost explosion | Monthly (billing) | Very High (code changes) | Critical (budget) | Developer tags |
Dashboard timeouts | During incidents | High (redesign) | Critical (can't debug) | Dashboard builder |
Trace sampling too aggressive | Days (missing data) | Medium (config) | Medium (debugging hard) | Config optimizer |
## Resource Requirements and Trade-offs

### Agent Resource Consumption
- Baseline: 200MB RAM, 100m CPU per agent
- High Volume: 2GB RAM, 500m CPU with APM + logs
- Breaking Point: 8GB RAM observed with trace buffer overflow
- Recovery: Requires host restart if swap exhaustion occurs
### Network Bandwidth Impact
- Baseline: 1-5 Mbps per 100 hosts
- High Cardinality: 50+ Mbps with verbose custom metrics
- Compression Benefit: 60-80% reduction with gzip enabled
- Batching Benefit: 40% reduction in connection overhead
### Storage Requirements for Retention
- Hot Tier (0-15 days): 60% of log costs, full search capability
- Frozen Tier (15+ days): 30% cost reduction, slower search
- Archive Tier (90+ days): 70% cost reduction, requires rehydration
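
To compare tiers for the same retained volume, apply the reductions above to a hot-tier baseline. A sketch; the baseline figure is a placeholder:

```python
# retention_tier_costs.py - relative monthly cost of keeping the same log
# volume in each tier, using the reductions above.
hot_cost = 10_000                     # assumed monthly cost if kept hot
frozen_cost = hot_cost * (1 - 0.30)   # frozen: ~30% cheaper, slower search
archive_cost = hot_cost * (1 - 0.70)  # archive: ~70% cheaper, needs rehydration

for tier, cost in [("hot", hot_cost), ("frozen", frozen_cost), ("archive", archive_cost)]:
    print(f"{tier:>7}: ${cost:,.0f}/month")
```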
## Critical Warnings and Breaking Points

### What Official Documentation Doesn't Tell You
- Clock skew breaks everything silently (>10 minute tolerance)
- Log rotation kills agent handlers without restart
- Proxy SSL interception causes random failures
- Container runtime changes break autodiscovery
- Large Kubernetes deployments crash cluster agents
- Debug logging can exceed compute costs
### Production Breaking Points
- UI Performance: >1000 spans makes debugging impossible
- Agent Memory: >4GB consumption crashes busy hosts
- Dashboard Load: >20 widgets timeout during incidents
- Network Saturation: Uncompressed metrics flood bandwidth
- API Rate Limits: Too many agents overwhelm Kubernetes API
### Hidden Operational Costs
- Human Time: Agent troubleshooting = 4-8 hours per incident
- Expertise Requirements: Kubernetes + Datadog knowledge = rare skillset
- Migration Pain: Major version upgrades break configurations
- Support Quality: Community forums more helpful than support tickets
- Breaking Changes: API changes require agent redeployment
This operational intelligence enables automated decision-making for Datadog deployment, troubleshooting, and cost optimization in production environments.
## Useful Links for Further Investigation

### Production-Ready Troubleshooting Resources
Link | Description |
---|---|
Agent Status and Health Checks | The first place to check when agents misbehave. Contains the datadog-agent status command reference and common failure scenarios. Actually useful unlike most vendor docs, but doesn't cover the weird edge cases you'll encounter in production. |
High Memory Usage Troubleshooting | When your agents start eating gigabytes of RAM. Covers the most common memory leak causes including APM buffer overflows and log tailing problems. Missing: how to prevent memory issues before they crash your hosts. |
Log Collection Troubleshooting Guide | Step-by-step debugging for when logs aren't reaching Datadog. Covers permissions, parsing errors, and pipeline failures. The port 10516 blocked issue happens more often than they admit. |
APM Troubleshooting Documentation | Debugging missing traces and incomplete spans. Essential when your distributed tracing shows gaps during production incidents. The span sampling section will save your APM budget. |
Container Troubleshooting Guide | Kubernetes and Docker-specific agent problems. Covers DaemonSet deployment issues, RBAC failures, and resource constraints. Use this when your container metrics disappear after platform updates. |
APM Resource Usage Analysis | How APM tracing impacts agent performance and memory usage. Critical for understanding why agents crash during high-traffic periods. Contains the resource limit recommendations that actually work. |
High Throughput DogStatsD Configuration | Tuning DogStatsD for applications that send massive volumes of metrics. The buffer configuration examples prevent metric drops during traffic spikes. Essential for high-volume production deployments. |
Trace Sampling Strategies | Real-world trace sampling configurations that balance visibility with cost control. The priority sampling examples are particularly useful for maintaining trace completeness while reducing volume. |
Agent Performance Improvements Blog | Technical deep-dive into agent performance optimizations. Contains actual benchmarks and configuration recommendations from Datadog's engineering team. Worth reading for understanding agent internals. |
Custom Metrics Billing Documentation | Understanding what drives custom metrics costs and how cardinality affects billing. The tag optimization examples can reduce costs by 70%+ without losing business insights. |
Usage Control and Limits | Setting up automated controls to prevent budget explosions. The emergency sampling configurations activate when costs spike unexpectedly. Set these up before your first surprise bill. |
Log Sampling and Filtering | Aggressive log cost control through intelligent sampling and exclusion rules. The regex examples for filtering health checks and debug logs will dramatically reduce ingestion costs. |
Metrics Without Limits Guide | Identifying high-cardinality metrics that destroy budgets. The cardinality analysis tools help find metrics tagged with user IDs or UUIDs that create millions of billable timeseries. |
Datadog Cluster Agent Setup | Proper cluster agent deployment to avoid the DaemonSet hell. The HA configuration prevents single points of failure during Kubernetes upgrades. Use this instead of manual YAML manifests. |
Datadog Operator Advanced Configuration | Enterprise-grade Kubernetes deployments using the operator instead of manual configurations. Handles RBAC, resource limits, and upgrade management automatically. Prevents most Kubernetes-related agent failures. |
Kubernetes Configuration Guide | Advanced Kubernetes monitoring configuration including namespace isolation, service discovery, and network policy considerations. Essential for multi-tenant clusters. |
Admission Controller Troubleshooting | Debugging the cluster agent's admission controller when it breaks pod instrumentation. The RBAC permission issues are particularly common during security hardening. |
Proxy Configuration Guide | Setting up agents behind corporate proxies without everything breaking. The SSL interception workarounds save weeks of debugging. Corporate networks hate this document. |
Network Requirements | Complete list of endpoints and ports that agents need to reach Datadog. Your firewall team will hate the number of required connections. Keep this handy for security reviews. |
Site Selection and Data Residency | Choosing the right Datadog site (US, EU, Gov) for compliance and performance. Data sovereignty requirements often mandate specific sites for regulated industries. |
Datadog Status Page | Check this first when everything seems broken - Datadog has outages too. Includes historical incident data showing their reliability patterns. Bookmark for 3am troubleshooting sessions. |
Emergency Troubleshooting Runbook | Sending support flares and gathering diagnostic information during incidents. The agent flare command collects everything support needs for troubleshooting. Use this before opening tickets. |
Database Monitoring Troubleshooting | Database-specific monitoring issues including connection problems, permission errors, and query sampling failures. Essential when database metrics suddenly disappear. |
Synthetic Monitoring Troubleshooting | Debugging synthetic test failures and private location issues. The network connectivity debugging steps help identify infrastructure problems affecting monitoring. |
Agent Configuration Files Reference | Complete agent configuration reference including all the options that aren't documented elsewhere. The resource limit configurations prevent most production issues. |
Terraform Datadog Provider | Infrastructure-as-code for Datadog configuration management. Essential for maintaining consistent configurations across environments and enabling disaster recovery of monitoring configs. |
Datadog API Documentation | Programmatic management of Datadog resources. The metrics and events API endpoints are particularly useful for custom monitoring and automation integration. |
Custom Agent Checks Development | Building custom monitoring for applications and services not covered by standard integrations. The Python check examples provide templates for monitoring proprietary systems. |
GitHub Datadog Agent Issues | Real production problems and solutions from other engineers. Search here when you encounter weird errors not covered in official documentation. Often has better solutions than support tickets. |
Stack Overflow Datadog Questions | Community solutions for common configuration and troubleshooting problems. The agent deployment questions often have practical workarounds for enterprise environments. |
Datadog Community Forums | Less active than Stack Overflow but sometimes has insights from Datadog engineers. Good for architectural questions and best practices discussions. |