
Discovery Agent Installation Hell: What They Don't Tell You

CPU Utilization Will Destroy Your Legacy Servers

AWS Discovery Agent Performance Impact: CPU usage spikes to 40%+ on legacy servers, memory consumption increases over time, and performance degrades on anything older than 5 years.

Installing AWS Application Discovery Agents sounds simple until you try it on that ancient CentOS 6.9 box running your payment system. The agent documentation says "minimal performance impact" but doesn't mention that "minimal" means 40% CPU usage on anything older than 2015.

Real scenario from production: Installed discovery agents on 12 servers during business hours. Three went offline because they couldn't handle the CPU load. The Windows 2008 R2 domain controller became unresponsive for 20 minutes while the agent attempted to inventory every single registry key.

The fix: Install agents during maintenance windows and test on non-critical systems first. On servers with less than 4GB RAM or older than 5 years, expect performance degradation. Monitor CPU usage for the first 2 hours - if it stays above 30%, kill the agent and try agentless discovery instead.
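If you want to automate that two-hour watch, here's a rough sketch. The service name aws-discovery-daemon matches the restart commands used elsewhere in this article, but the 30% cutoff and the 5-minute sampling interval are assumptions - tune them for your boxes.

```shell
#!/bin/sh
# Sample the discovery agent's CPU every 5 minutes for 2 hours and stop
# the service if it crosses the threshold.

over_threshold() {  # over_threshold CPU LIMIT -> success if CPU > LIMIT
    [ "${1%.*}" -gt "$2" ] 2>/dev/null
}

watch_agent() {
    limit=30
    samples=24                          # 24 x 300s = 2 hours
    while [ "$samples" -gt 0 ]; do
        cpu=$(ps -C aws-discovery-daemon -o %cpu= | awk '{s+=$1} END {print int(s)}')
        if over_threshold "${cpu:-0}" "$limit"; then
            echo "agent at ${cpu}% CPU - stopping it" >&2
            systemctl stop aws-discovery-daemon
            return 1
        fi
        sleep 300
        samples=$((samples - 1))
    done
}
```

Run watch_agent right after installing the agent; if it returns nonzero, the agent got killed and you should try agentless discovery instead.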

Memory Leaks That Kill Servers

Discovery agents have a known memory leak that AWS doesn't advertise. After running for 2-3 weeks, the agent process can consume 500MB+ of RAM on busy servers. On systems already running near capacity, this kills performance.

Error you'll see: Application timeouts, database connection failures, general system sluggishness. The agent shows as "healthy" in the console while your server dies.

The workaround: Restart the discovery agent weekly using a cron job:

0 2 * * 0 systemctl restart aws-discovery-daemon

Network Discovery: Missing the Obvious Connections

Network Dependency Mapping: Visualization showing server connections with arrows, but missing 20% of critical dependencies that only appear during monthly batch jobs or system failures.

The network visualization looks impressive until you realize it misses 20% of your critical connections. The agent only captures active network connections - if your backup job runs at 3 AM and you install the agent at 9 AM, that dependency won't show up.

War story: Migrated a web application that worked fine for 3 weeks. Then the monthly reporting job failed because it couldn't connect to an Oracle database that only gets accessed once per month. The dependency wasn't discovered because nobody thought to run all scheduled jobs during the discovery period.

How to actually map dependencies:

  1. Run discovery for at least 14 days to catch weekly/monthly jobs
  2. Manually trigger all scheduled tasks during discovery
  3. Check application logs for outbound connections the agent missed
  4. Document every custom service account - they often indicate hidden dependencies
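For step 3, you can snapshot established TCP connections while each job runs and diff the results against the agent's dependency map. A minimal sketch:

```shell
#!/bin/sh
# Snapshot established outbound TCP connections so they can be diffed
# against the discovery agent's dependency map. Run it while each
# scheduled job (backup, month-end batch) is executing.

list_remotes() {  # read `ss -tn` output on stdin, print unique remote endpoints
    awk 'NR > 1 {print $5}' | sort -u
}

snapshot() {
    out="/tmp/connections-$(date +%Y%m%d-%H%M).txt"
    ss -tn state established | list_remotes > "$out"
    echo "$out"
}
```

Collect snapshots around every scheduled job, merge them with sort -u, and compare against what Migration Hub drew. Every endpoint in the snapshots that's missing from the diagram is a dependency the agent never saw.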

Agentless Discovery Limitations

AWS Agentless Discovery Connector sounds perfect until you try to use it. It requires VMware vCenter 5.5+ and can only see what VMware knows about - which excludes most of your custom applications and all of your bare metal servers.

Reality check: Agentless discovery finds your servers but tells you nothing useful about what they do. You get basic specs (CPU, RAM, disk) but no process information, no network connections, and no application dependencies. It's basically an expensive version of vmware-toolbox-cmd stat hosttime.

Authentication Nightmares

Setting up the proper IAM roles for Migration Hub feels like navigating a Byzantine bureaucracy. The required policies documentation is outdated and doesn't mention half the permissions you actually need.

Permission error you'll hit: User is not authorized to perform: discovery:GetDiscoverySummary even though you followed the official setup guide. The IAM simulator says everything should work, but the console throws permission errors.

The actual permissions you need (beyond what AWS documents):

  • discovery:*
  • mgh:*
  • AWSApplicationMigrationAgentPolicy
  • AWSApplicationMigrationReplicationServerPolicy
  • Custom policy for CloudWatch logs access

Pro tip: Use AWS CloudTrail to see exactly which API calls are failing, then add those specific permissions. Don't trust the documentation - and don't fully trust the IAM Policy Simulator either, since it will happily pass calls that the console still rejects. For complex multi-account setups, Service Control Policies can also silently block permissions that look correct at the IAM level, so check those too.
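A quick way to run that CloudTrail audit from the CLI. The region is an example, and note that lookup-events only covers the last 90 days:

```shell
#!/bin/sh
# Count which error codes CloudTrail recorded for discovery API calls,
# so you know which permissions to add instead of guessing.

count_errors() {  # tally errorCode values from raw CloudTrail event JSON on stdin
    grep -o '"errorCode":"[^"]*"' | sort | uniq -c | sort -rn
}

audit_denials() {
    aws cloudtrail lookup-events \
        --region us-west-2 \
        --lookup-attributes AttributeKey=EventSource,AttributeValue=discovery.amazonaws.com \
        --max-results 50 \
        --query 'Events[].CloudTrailEvent' --output text \
      | count_errors
}
```

If audit_denials prints AccessDenied or UnauthorizedOperation lines, grab the eventName from those same events and add exactly those actions to the role.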

Home Region Confusion

You can only view migration data in your "home region" but AWS makes it unclear how to change this. If you accidentally set the wrong home region during setup, you're stuck with it unless you contact support.

The problem: Set up Migration Hub in us-east-1 but your infrastructure is in us-west-2. All your migration tracking data lives in the wrong region and you can't move it.

The solution: Before installing ANY agents, verify your home region in the Migration Hub console. If it's wrong, you need to contact AWS Support to reset it. This process takes 1-2 business days.

API Rate Limiting During Large Migrations

The Migration Hub APIs have undocumented rate limits that kick in when you're tracking 100+ servers. Your monitoring scripts start failing with HTTP 429 errors, but the AWS documentation doesn't mention any limits.

When this hits: During the data collection phase with 200+ discovery agents running. The console becomes unresponsive and API calls timeout. AWS Support's initial response: "Migration Hub is designed to scale automatically."

Workaround: Implement exponential backoff in your automation scripts and batch API calls where possible. Enable the AWS SDK's retry configuration for automatic backoff, and watch your CloudWatch metrics - if API errors spike, slow down your requests. For very large migrations, request limit increases through AWS Service Quotas before the data collection phase starts, not after the console stops responding.
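For shell-based automation, a minimal retry wrapper looks like this. The 5-attempt cap and 1-second base delay are arbitrary starting points:

```shell
#!/bin/sh
# Retry a command with exponential backoff - crude protection against the
# HTTP 429s Migration Hub throws during large collections.

with_backoff() {  # with_backoff CMD [ARGS...]
    delay=${BACKOFF_BASE:-1}
    attempt=1
    while [ "$attempt" -le 5 ]; do
        "$@" && return 0
        echo "attempt $attempt failed, sleeping ${delay}s" >&2
        sleep "$delay"
        delay=$((delay * 2))
        attempt=$((attempt + 1))
    done
    return 1
}

# Usage:
# with_backoff aws migrationhub list-migration-tasks --region us-east-1
```

Wrap every Migration Hub API call in your monitoring scripts with this; the doubling delay gives the throttled endpoint room to recover instead of hammering it harder.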

Discovery Agent Problems You'll Actually Encounter

Q

Why is the discovery agent maxing out CPU on my server?

A

The agent scans every process, connection, and file handle on the system every 15 minutes. On busy servers or those with hundreds of processes, this creates massive CPU spikes. Legacy servers with single-core CPUs become unresponsive during scans. Quick fix: Edit /opt/aws/discovery/config/agent.properties and change the collection interval from 900 seconds (15 minutes) to 3600 seconds (1 hour). Restart the agent: sudo systemctl restart aws-discovery-daemon.
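Assuming the interval lives in agent.properties under a key like collectionInterval - the exact property name is an assumption, so verify it in your file first - the edit looks like:

```shell
#!/bin/sh
# Rewrite the collection interval from 900s to 3600s. The property name
# "collectionInterval" is an assumption - check your agent.properties.

bump_interval() {  # rewrites the interval on stdin
    sed 's/^collectionInterval=900$/collectionInterval=3600/'
}

# Apply in place (keeps a .bak copy) and restart:
# sudo sed -i.bak 's/^collectionInterval=900$/collectionInterval=3600/' \
#     /opt/aws/discovery/config/agent.properties
# sudo systemctl restart aws-discovery-daemon
```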

Q

The agent shows "healthy" but I don't see any data in the console. What's wrong?

A

Usually means the agent can't reach the AWS endpoints due to firewall rules or proxy settings. The agent reports "healthy" because it's running, but it can't upload data. Check connectivity: curl -I https://application-discovery.us-west-2.amazonaws.com from the server. If this fails, you need to configure proxy settings in /opt/aws/discovery/config/agent.properties or open firewall ports 443 and 8888.

Q

Can I install the discovery agent on the same server as my database?

A

Technically yes, but don't. Database servers are already I/O intensive and adding discovery agent scanning makes everything worse. Use agentless discovery if possible, or install the agent during maintenance windows only.

Q

How do I uninstall this thing when it's breaking my server?

A
sudo systemctl stop aws-discovery-daemon
sudo systemctl disable aws-discovery-daemon
sudo /opt/aws/discovery/uninstall
sudo rm -rf /opt/aws/discovery

If the uninstaller fails (it often does), manually kill the processes: sudo pkill -f discovery and delete the directory.

Q

Why does the agent keep restarting every few hours?

A

Memory leak. The agent accumulates memory over time and hits system limits. On servers with limited RAM, the OOM killer terminates the agent process. AWS claims this is "fixed" in newer versions but it still happens. Workaround: Set up a weekly restart cron job or monitor memory usage and restart when it exceeds 500MB.
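A cron-able sketch of that memory monitor - restart the agent once its resident memory passes 500 MB instead of waiting for the OOM killer to do it at the worst possible moment. The service name matches the restart commands used elsewhere in this article:

```shell
#!/bin/sh
# Restart the discovery agent when its RSS exceeds a limit.

rss_exceeds() {  # rss_exceeds RSS_KB LIMIT_MB -> success if over the limit
    [ "${1:-0}" -gt $(($2 * 1024)) ]
}

check_and_restart() {
    rss=$(ps -C aws-discovery-daemon -o rss= | awk '{s+=$1} END {print s+0}')
    if rss_exceeds "$rss" 500; then
        logger "discovery agent at ${rss} kB RSS, restarting"
        systemctl restart aws-discovery-daemon
    fi
}

# Cron entry, every 30 minutes:
# */30 * * * * /usr/local/bin/check-discovery-agent.sh
```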

Migration Tracking Failures and Manual Fixes

When Migration Updates Disappear

Migration Status Tracking Interface: Dashboard showing green progress bars that turn red without warning when migrations fail silently in the background.

You start a migration using AWS Application Migration Service and everything looks good. Then suddenly, Migration Hub stops showing status updates. The migration is still running, but the tracking dashboard shows no data.

Why this happens: The mapping between your migration tool and the discovered servers breaks. AWS's automatic mapping works 60% of the time - the rest requires manual intervention.

The manual fix: Go to Migration Hub → Updates → find your missing migration → click "Edit" in the "Mapped servers" column → manually map it to the correct discovered server. The server names won't match exactly, so you'll need to cross-reference IP addresses or hostnames.

Application Groups That Break Everything

Migration Hub lets you group servers into applications, which sounds useful until the grouping logic fails. Servers get assigned to wrong applications, or the same server appears in multiple groups, confusing the migration tracking.

Real example: Grouped web servers with their database for an e-commerce application. The database server was also shared by a reporting system, so Migration Hub created overlapping applications. When we migrated the database, both applications showed as "partially migrated" even though one was complete.

How to fix broken groups:

  1. Delete all auto-generated application groups
  2. Manually create groups based on actual dependencies, not AWS's guesses
  3. One server per group only - shared services need separate groups
  4. Name groups descriptively: "ecommerce-web-tier" not "Application-1"

Network Diagram Lies

The pretty network visualization shows servers connected with nice arrows, but half the connections are missing or wrong. The agent only captures active connections during the sampling period, missing periodic jobs and backup processes.

What the diagram misses:

  • Scheduled batch jobs that run monthly/quarterly
  • Backup connections that only activate during failures
  • Management interfaces (IPMI, iDRAC) that aren't "application" traffic
  • Database replication traffic on non-standard ports
  • Load balancer health checks

How to get real dependency data: Cross-reference the Migration Hub network diagram with your monitoring tools (Nagios, Zabbix), firewall logs, and application configuration files. Every discrepancy represents a potential migration failure.

Migration Status Mapping Chaos

AWS's automatic status mapping between migration tools and Migration Hub fails constantly. Your AWS Database Migration Service task shows "completed" but Migration Hub still displays "in progress" or vice versa.

Error pattern: DMS finishes full load and starts CDC replication. Migration Hub shows the initial replication as "completed" but doesn't update when CDC starts, leaving you with incorrect status for weeks.

Manual tracking process:

  1. Don't trust Migration Hub status for mission-critical migrations
  2. Monitor the actual migration tools directly (DMS console, MGN console)
  3. Set up CloudWatch alarms for migration failure events
  4. Use the Migration Hub API to manually update status if needed: aws migrationhub notify-migration-task-state
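For step 4, the call looks roughly like this. The stream name, task name, and status detail are placeholders for whatever your migration tool registered:

```shell
#!/bin/sh
# Manually push a COMPLETED status into Migration Hub when the automatic
# mapping loses track. All names here are placeholders.

notify_complete() {
    aws migrationhub notify-migration-task-state \
        --progress-update-stream my-dms-stream \
        --migration-task-name ecommerce-db-task \
        --task 'Status=COMPLETED,StatusDetail=CDC cutover finished,ProgressPercent=100' \
        --update-date-time "$(date +%s)" \
        --next-update-seconds 3600 \
        --region us-east-1   # must be your Migration Hub home region
}
```

Find the actual stream and task names with aws migrationhub list-progress-update-streams and list-migration-tasks first - they rarely match what you'd expect.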

Performance Monitoring That Doesn't Monitor

Migration Hub collects performance data from discovery agents but the data is nearly useless for migration planning. The metrics are averaged over 15-minute intervals, missing peak loads and performance spikes that matter for sizing AWS instances.

What you actually need: Minute-by-minute CPU, memory, and I/O data for at least two weeks including month-end processing cycles. The discovery agent averages hide the fact that your server hits 90% CPU every night during backups.

Better monitoring approach:

  • Keep your existing monitoring tools running during discovery
  • Export performance data from VMware vCenter if available
  • Use AWS Systems Manager to collect more detailed metrics
  • Run stress tests to understand actual resource requirements

Multi-Region Disasters

If your migration spans multiple AWS regions, Migration Hub becomes a nightmare. You can only view data in your home region, but migration tools might be running in different regions.

Scenario: Home region is us-east-1, but you're migrating to us-west-2 for disaster recovery. The Application Migration Service is replicating to us-west-2, but all your tracking data is stuck in us-east-1. The status updates don't cross regions automatically.

Workaround: Use CloudWatch dashboards and custom scripts to aggregate migration status across regions - Migration Hub's single-region limitation makes it useless for multi-region migrations on its own. If you want it automated, a small Lambda function on an EventBridge schedule can poll the migration tools in each region and write a combined status somewhere central.
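A bare-bones version of that aggregation: poll the actual migration tools (here DMS, as one example) in every region involved and print one combined view. The region list is a placeholder for your setup:

```shell
#!/bin/sh
# Print DMS replication task status from every region in the migration,
# since the Migration Hub console only shows the home region.

cross_region_status() {
    for region in us-east-1 us-west-2; do
        aws dms describe-replication-tasks \
            --region "$region" \
            --query 'ReplicationTasks[].[ReplicationTaskIdentifier,Status]' \
            --output text 2>/dev/null \
          | sed "s/^/$region  /"
    done
}
```

The same loop works for aws mgn describe-source-servers if you're tracking server replication instead of databases.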

Migration Execution Horror Stories

Q

My migration shows "completed" but the application doesn't work. What happened?

A

Migration Hub tracks the server migration but knows nothing about application functionality. The server migrated successfully, but the application configuration is wrong (database connection strings, license servers, network routing).

Reality check: "Completed" means the files copied successfully, not that your application works. Plan for 2-4 weeks of post-migration troubleshooting for any non-trivial application.

Q

Application Migration Service replicated my server but it won't boot. Now what?

A

Boot failures happen 30% of the time, especially with Windows servers or custom Linux configurations. The replication copied the disk but didn't account for hardware differences, driver issues, or boot sector problems.

Emergency fix process:

  1. Launch the target instance and attach the replicated EBS volume as secondary disk
  2. Boot from a rescue AMI and mount the migrated volume
  3. Fix /etc/fstab (Linux) or registry entries (Windows) for new hardware
  4. Install AWS-compatible drivers before attempting to boot
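For step 3 on Linux, the usual culprit is fstab mounting by raw device path. A sketch of the rescue-instance fix - the /dev/xvdf1 device name is an assumption, so check lsblk on your rescue instance:

```shell
#!/bin/sh
# Comment out fstab entries that mount by raw device path - they point at
# hardware that no longer exists after replication. Remount by UUID instead.

neutralize_device_mounts() {  # filter an fstab on stdin
    sed 's|^/dev/sd|#&|'
}

# On the rescue instance:
# sudo mount /dev/xvdf1 /mnt/rescue
# neutralize_device_mounts < /mnt/rescue/etc/fstab > /tmp/fstab.new
# sudo cp /mnt/rescue/etc/fstab /mnt/rescue/etc/fstab.bak
# sudo cp /tmp/fstab.new /mnt/rescue/etc/fstab
# sudo umount /mnt/rescue
```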

Prevention: Test boot the migrated instance in a non-production environment first. Every time.

Q

How long should I wait for replication to finish?

A

AWS says "a few hours" but reality is days or weeks depending on data size and network speed. For a 2TB server over a 100Mbps connection, expect 48+ hours for initial replication.

Real timelines from production:

  • 500GB server: 6-12 hours
  • 2TB server with database: 2-3 days
  • 10TB file server: 1-2 weeks
  • Add 50% to any estimate for network hiccups and AWS throttling

Q

The network diagram shows my servers are connected but the application can't reach the database after migration. Why?

A

AWS doesn't migrate network configuration. Your on-premises network routing, VLANs, and firewall rules don't automatically translate to AWS VPC security groups and route tables.

What's missing after migration:

  • Security group rules for inter-server communication
  • Route table entries for subnet routing
  • NACLs that block traffic Migration Hub doesn't know about
  • Custom DNS configurations

Fix it before you migrate: Document every network flow and translate it to AWS networking before starting server migration.
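Each documented flow then becomes a security group rule. For example, letting the web tier reach Oracle on 1521 - the group IDs here are placeholders:

```shell
#!/bin/sh
# Create one ingress rule per documented network flow before cutover.
# Group IDs and port are placeholders for your environment.

allow_flow() {  # allow_flow TARGET_SG SOURCE_SG PORT
    aws ec2 authorize-security-group-ingress \
        --group-id "$1" \
        --protocol tcp \
        --port "$3" \
        --source-group "$2"
}

# Usage:
# allow_flow sg-0db1234example sg-0web5678example 1521
```

Scripting this from your network flow inventory also leaves you an audit trail of exactly which rules exist and why.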

Q

Can I pause a migration that's failing?

A

No. Once Application Migration Service starts the cutover process, you can't pause it. You can fail back to source, but that requires starting over. This is why testing is critical.

When to fail back:

  • Boot failures that you can't fix within your downtime window
  • Database corruption or data inconsistency
  • Application performance issues that make the system unusable
  • Network connectivity problems that prevent users from accessing the application

Q

How do I know if my migration actually worked?

A

Test everything. Migration Hub showing "completed" means nothing for application functionality. Run your full test suite on the migrated systems before declaring success.

Minimum testing checklist:

  • Application starts and responds to requests
  • Database connections work and data is accessible
  • File shares and network drives mount correctly
  • Scheduled jobs execute successfully
  • Monitoring and backup systems connect to migrated servers
  • End-user acceptance testing in production-like conditions

Time estimate: Plan for testing to take as long as the actual migration. A 4-hour server migration needs 4-8 hours of testing.

Resources for When Everything Goes Wrong

Related Tools & Recommendations

  • Azure Migrate - Microsoft's free migration tool for discovering on-premises inventory and estimating Azure costs (/tool/azure-migrate/overview)
  • AWS Application Migration Service (MGN) - replicates physical or virtual servers to AWS; expect networking headaches and licensing surprises (/tool/aws-application-migration-service/overview)
  • AWS MGN Enterprise Production Deployment - security hardening, governance, and automation at enterprise scale (/tool/aws-application-migration-service/enterprise-production-deployment)
  • AWS Database Migration Service - database migration that integrates with Migration Hub tracking (/tool/aws-database-migration-service/overview)