
Discovery Agent Installation Hell: What They Don't Tell You

CPU Utilization Will Destroy Your Legacy Servers

AWS Discovery Agent Performance Impact: CPU usage spikes to 40%+ on legacy servers, memory consumption increases over time, and performance degrades on anything older than 5 years.

Installing AWS Application Discovery Agents sounds simple until you try it on that ancient CentOS 6.9 box running your payment system. The agent documentation says "minimal performance impact" but doesn't mention that "minimal" means 40% CPU usage on anything older than 2015.

Real scenario from production: Installed discovery agents on 12 servers during business hours. Three went offline because they couldn't handle the CPU load. The Windows 2008 R2 domain controller became unresponsive for 20 minutes while the agent attempted to inventory every single registry key.

The fix: Install agents during maintenance windows and test on non-critical systems first. On servers with less than 4GB RAM or older than 5 years, expect performance degradation. Monitor CPU usage for the first 2 hours - if it stays above 30%, kill the agent and try agentless discovery instead.
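If you want to automate that two-hour watch, here's a rough sketch. The service name aws-discovery-daemon matches the restart commands used elsewhere in this article, but the 30% cutoff and the 5-minute sampling interval are assumptions - tune them for your boxes.

```shell
#!/bin/sh
# Sample the discovery agent's CPU every 5 minutes for 2 hours and stop
# the service if it crosses the threshold.

over_threshold() {  # over_threshold CPU LIMIT -> success if CPU > LIMIT
    [ "${1%.*}" -gt "$2" ] 2>/dev/null
}

watch_agent() {
    limit=30
    samples=24                          # 24 x 300s = 2 hours
    while [ "$samples" -gt 0 ]; do
        cpu=$(ps -C aws-discovery-daemon -o %cpu= | awk '{s+=$1} END {print int(s)}')
        if over_threshold "${cpu:-0}" "$limit"; then
            echo "agent at ${cpu}% CPU - stopping it" >&2
            systemctl stop aws-discovery-daemon
            return 1
        fi
        sleep 300
        samples=$((samples - 1))
    done
}
```

Run watch_agent right after installing the agent; if it returns nonzero, the agent got killed and you should try agentless discovery instead.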

Memory Leaks That Kill Servers

Discovery agents have a known memory leak that AWS doesn't advertise. After running for 2-3 weeks, the agent process can consume 500MB+ of RAM on busy servers. On systems already running near capacity, this kills performance.

Error you'll see: Application timeouts, database connection failures, general system sluggishness. The agent shows as "healthy" in the console while your server dies.

The workaround: Restart the discovery agent weekly using a cron job:

0 2 * * 0 systemctl restart aws-discovery-daemon

Network Discovery: Missing the Obvious Connections

Network Dependency Mapping: Visualization showing server connections with arrows, but missing 20% of critical dependencies that only appear during monthly batch jobs or system failures.

The network visualization looks impressive until you realize it misses 20% of your critical connections. The agent only captures active network connections - if your backup job runs at 3 AM and you install the agent at 9 AM, that dependency won't show up.

War story: Migrated a web application that worked fine for 3 weeks. Then the monthly reporting job failed because it couldn't connect to an Oracle database that only gets accessed once per month. The dependency wasn't discovered because nobody thought to run all scheduled jobs during the discovery period.

How to actually map dependencies:

  1. Run discovery for at least 14 days to catch weekly/monthly jobs
  2. Manually trigger all scheduled tasks during discovery
  3. Check application logs for outbound connections the agent missed
  4. Document every custom service account - they often indicate hidden dependencies
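For step 3, you can snapshot established TCP connections while each job runs and diff the results against the agent's dependency map. A minimal sketch:

```shell
#!/bin/sh
# Snapshot established outbound TCP connections so they can be diffed
# against the discovery agent's dependency map. Run it while each
# scheduled job (backup, month-end batch) is executing.

list_remotes() {  # read `ss -tn` output on stdin, print unique remote endpoints
    awk 'NR > 1 {print $5}' | sort -u
}

snapshot() {
    out="/tmp/connections-$(date +%Y%m%d-%H%M).txt"
    ss -tn state established | list_remotes > "$out"
    echo "$out"
}
```

Collect snapshots around every scheduled job, merge them with sort -u, and compare against what Migration Hub drew. Every endpoint in the snapshots that's missing from the diagram is a dependency the agent never saw.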

Agentless Discovery Limitations

AWS Agentless Discovery Connector sounds perfect until you try to use it. It requires VMware vCenter 5.5+ and can only see what VMware knows about - which excludes most of your custom applications and all of your bare metal servers.

Reality check: Agentless discovery finds your servers but tells you nothing useful about what they do. You get basic specs (CPU, RAM, disk) but no process information, no network connections, and no application dependencies. It's basically an expensive version of vmware-toolbox-cmd stat hosttime.

Authentication Nightmares

Setting up the proper IAM roles for Migration Hub feels like navigating a Byzantine bureaucracy. The required policies documentation is outdated and doesn't mention half the permissions you actually need.

Permission error you'll hit: User is not authorized to perform: discovery:GetDiscoverySummary even though you followed the official setup guide. The IAM simulator says everything should work, but the console throws permission errors.

The actual permissions you need (beyond what AWS documents):

  • discovery:*
  • mgh:*
  • AWSApplicationMigrationAgentPolicy
  • AWSApplicationMigrationReplicationServerPolicy
  • Custom policy for CloudWatch logs access

Pro tip: Use AWS CloudTrail to see exactly which API calls are failing, then add those specific permissions. Don't trust the documentation - and don't fully trust the IAM Policy Simulator either, since it will happily pass calls that the console still rejects. For complex multi-account setups, Service Control Policies can also silently block permissions that look correct at the IAM level, so check those too.
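A quick way to run that CloudTrail audit from the CLI. The region is an example, and note that lookup-events only covers the last 90 days:

```shell
#!/bin/sh
# Count which error codes CloudTrail recorded for discovery API calls,
# so you know which permissions to add instead of guessing.

count_errors() {  # tally errorCode values from raw CloudTrail event JSON on stdin
    grep -o '"errorCode":"[^"]*"' | sort | uniq -c | sort -rn
}

audit_denials() {
    aws cloudtrail lookup-events \
        --region us-west-2 \
        --lookup-attributes AttributeKey=EventSource,AttributeValue=discovery.amazonaws.com \
        --max-results 50 \
        --query 'Events[].CloudTrailEvent' --output text \
      | count_errors
}
```

If audit_denials prints AccessDenied or UnauthorizedOperation lines, grab the eventName from those same events and add exactly those actions to the role.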

Home Region Confusion

You can only view migration data in your "home region" but AWS makes it unclear how to change this. If you accidentally set the wrong home region during setup, you're stuck with it unless you contact support.

The problem: Set up Migration Hub in us-east-1 but your infrastructure is in us-west-2. All your migration tracking data lives in the wrong region and you can't move it.

The solution: Before installing ANY agents, verify your home region in the Migration Hub console. If it's wrong, you need to contact AWS Support to reset it. This process takes 1-2 business days.

API Rate Limiting During Large Migrations

The Migration Hub APIs have undocumented rate limits that kick in when you're tracking 100+ servers. Your monitoring scripts start failing with HTTP 429 errors, but the AWS documentation doesn't mention any limits.

When this hits: During the data collection phase with 200+ discovery agents running. The console becomes unresponsive and API calls timeout. AWS Support's initial response: "Migration Hub is designed to scale automatically."

Workaround: Implement exponential backoff in your automation scripts and batch API calls where possible. Enable the AWS SDK's retry configuration for automatic backoff, and watch your CloudWatch metrics - if API errors spike, slow down your requests. For very large migrations, request limit increases through AWS Service Quotas before the data collection phase starts, not after the console stops responding.
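For shell-based automation, a minimal retry wrapper looks like this. The 5-attempt cap and 1-second base delay are arbitrary starting points:

```shell
#!/bin/sh
# Retry a command with exponential backoff - crude protection against the
# HTTP 429s Migration Hub throws during large collections.

with_backoff() {  # with_backoff CMD [ARGS...]
    delay=${BACKOFF_BASE:-1}
    attempt=1
    while [ "$attempt" -le 5 ]; do
        "$@" && return 0
        echo "attempt $attempt failed, sleeping ${delay}s" >&2
        sleep "$delay"
        delay=$((delay * 2))
        attempt=$((attempt + 1))
    done
    return 1
}

# Usage:
# with_backoff aws migrationhub list-migration-tasks --region us-east-1
```

Wrap every Migration Hub API call in your monitoring scripts with this; the doubling delay gives the throttled endpoint room to recover instead of hammering it harder.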

Discovery Agent Problems You'll Actually Encounter

Q

Why is the discovery agent maxing out CPU on my server?

A

The agent scans every process, connection, and file handle on the system every 15 minutes. On busy servers or those with hundreds of processes, this creates massive CPU spikes. Legacy servers with single-core CPUs become unresponsive during scans. Quick fix: Edit /opt/aws/discovery/config/agent.properties and change the collection interval from 900 seconds (15 minutes) to 3600 seconds (1 hour). Restart the agent: sudo systemctl restart aws-discovery-daemon.
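Assuming the interval lives in agent.properties under a key like collectionInterval - the exact property name is an assumption, so verify it in your file first - the edit looks like:

```shell
#!/bin/sh
# Rewrite the collection interval from 900s to 3600s. The property name
# "collectionInterval" is an assumption - check your agent.properties.

bump_interval() {  # rewrites the interval on stdin
    sed 's/^collectionInterval=900$/collectionInterval=3600/'
}

# Apply in place (keeps a .bak copy) and restart:
# sudo sed -i.bak 's/^collectionInterval=900$/collectionInterval=3600/' \
#     /opt/aws/discovery/config/agent.properties
# sudo systemctl restart aws-discovery-daemon
```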

Q

The agent shows "healthy" but I don't see any data in the console. What's wrong?

A

Usually means the agent can't reach the AWS endpoints due to firewall rules or proxy settings. The agent reports "healthy" because it's running, but it can't upload data. Check connectivity: curl -I https://application-discovery.us-west-2.amazonaws.com from the server. If this fails, you need to configure proxy settings in /opt/aws/discovery/config/agent.properties or open firewall ports 443 and 8888.

Q

Can I install the discovery agent on the same server as my database?

A

Technically yes, but don't. Database servers are already I/O intensive and adding discovery agent scanning makes everything worse. Use agentless discovery if possible, or install the agent during maintenance windows only.

Q

How do I uninstall this thing when it's breaking my server?

A
sudo systemctl stop aws-discovery-daemon
sudo systemctl disable aws-discovery-daemon
sudo /opt/aws/discovery/uninstall
sudo rm -rf /opt/aws/discovery

If the uninstaller fails (it often does), manually kill the processes: sudo pkill -f discovery and delete the directory.

Q

Why does the agent keep restarting every few hours?

A

Memory leak. The agent accumulates memory over time and hits system limits. On servers with limited RAM, the OOM killer terminates the agent process. AWS claims this is "fixed" in newer versions but it still happens. Workaround: Set up a weekly restart cron job or monitor memory usage and restart when it exceeds 500MB.
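A cron-able sketch of that memory monitor - restart the agent once its resident memory passes 500 MB instead of waiting for the OOM killer to do it at the worst possible moment. The service name matches the restart commands used elsewhere in this article:

```shell
#!/bin/sh
# Restart the discovery agent when its RSS exceeds a limit.

rss_exceeds() {  # rss_exceeds RSS_KB LIMIT_MB -> success if over the limit
    [ "${1:-0}" -gt $(($2 * 1024)) ]
}

check_and_restart() {
    rss=$(ps -C aws-discovery-daemon -o rss= | awk '{s+=$1} END {print s+0}')
    if rss_exceeds "$rss" 500; then
        logger "discovery agent at ${rss} kB RSS, restarting"
        systemctl restart aws-discovery-daemon
    fi
}

# Cron entry, every 30 minutes:
# */30 * * * * /usr/local/bin/check-discovery-agent.sh
```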

Migration Tracking Failures and Manual Fixes

When Migration Updates Disappear

Migration Status Tracking Interface: Dashboard showing green progress bars that turn red without warning when migrations fail silently in the background.

You start a migration using AWS Application Migration Service and everything looks good. Then suddenly, Migration Hub stops showing status updates. The migration is still running, but the tracking dashboard shows no data.

Why this happens: The mapping between your migration tool and the discovered servers breaks. AWS's automatic mapping works 60% of the time - the rest requires manual intervention.

The manual fix: Go to Migration Hub → Updates → find your missing migration → click "Edit" in the "Mapped servers" column → manually map it to the correct discovered server. The server names won't match exactly, so you'll need to cross-reference IP addresses or hostnames.

Application Groups That Break Everything

Migration Hub lets you group servers into applications, which sounds useful until the grouping logic fails. Servers get assigned to wrong applications, or the same server appears in multiple groups, confusing the migration tracking.

Real example: Grouped web servers with their database for an e-commerce application. The database server was also shared by a reporting system, so Migration Hub created overlapping applications. When we migrated the database, both applications showed as "partially migrated" even though one was complete.

How to fix broken groups:

  1. Delete all auto-generated application groups
  2. Manually create groups based on actual dependencies, not AWS's guesses
  3. One server per group only - shared services need separate groups
  4. Name groups descriptively: "ecommerce-web-tier" not "Application-1"

Network Diagram Lies

The pretty network visualization shows servers connected with nice arrows, but half the connections are missing or wrong. The agent only captures active connections during the sampling period, missing periodic jobs and backup processes.

What the diagram misses:

  • Scheduled batch jobs that run monthly/quarterly
  • Backup connections that only activate during failures
  • Management interfaces (IPMI, iDRAC) that aren't "application" traffic
  • Database replication traffic on non-standard ports
  • Load balancer health checks

How to get real dependency data: Cross-reference the Migration Hub network diagram with your monitoring tools (Nagios, Zabbix), firewall logs, and application configuration files. Every discrepancy represents a potential migration failure.

Migration Status Mapping Chaos

AWS's automatic status mapping between migration tools and Migration Hub fails constantly. Your AWS Database Migration Service task shows "completed" but Migration Hub still displays "in progress" or vice versa.

Error pattern: DMS finishes full load and starts CDC replication. Migration Hub shows the initial replication as "completed" but doesn't update when CDC starts, leaving you with incorrect status for weeks.

Manual tracking process:

  1. Don't trust Migration Hub status for mission-critical migrations
  2. Monitor the actual migration tools directly (DMS console, MGN console)
  3. Set up CloudWatch alarms for migration failure events
  4. Use the Migration Hub API to manually update status if needed: aws migrationhub notify-migration-task-state
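For step 4, the call looks roughly like this. The stream name, task name, and status detail are placeholders for whatever your migration tool registered:

```shell
#!/bin/sh
# Manually push a COMPLETED status into Migration Hub when the automatic
# mapping loses track. All names here are placeholders.

notify_complete() {
    aws migrationhub notify-migration-task-state \
        --progress-update-stream my-dms-stream \
        --migration-task-name ecommerce-db-task \
        --task 'Status=COMPLETED,StatusDetail=CDC cutover finished,ProgressPercent=100' \
        --update-date-time "$(date +%s)" \
        --next-update-seconds 3600 \
        --region us-east-1   # must be your Migration Hub home region
}
```

Find the actual stream and task names with aws migrationhub list-progress-update-streams and list-migration-tasks first - they rarely match what you'd expect.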

Performance Monitoring That Doesn't Monitor

Migration Hub collects performance data from discovery agents but the data is nearly useless for migration planning. The metrics are averaged over 15-minute intervals, missing peak loads and performance spikes that matter for sizing AWS instances.

What you actually need: Minute-by-minute CPU, memory, and I/O data for at least two weeks including month-end processing cycles. The discovery agent averages hide the fact that your server hits 90% CPU every night during backups.

Better monitoring approach:

  • Keep your existing monitoring tools running during discovery
  • Export performance data from VMware vCenter if available
  • Use AWS Systems Manager to collect more detailed metrics
  • Run stress tests to understand actual resource requirements

Multi-Region Disasters

If your migration spans multiple AWS regions, Migration Hub becomes a nightmare. You can only view data in your home region, but migration tools might be running in different regions.

Scenario: Home region is us-east-1, but you're migrating to us-west-2 for disaster recovery. The Application Migration Service is replicating to us-west-2, but all your tracking data is stuck in us-east-1. The status updates don't cross regions automatically.

Workaround: Use CloudWatch dashboards and custom scripts to aggregate migration status across regions - Migration Hub's single-region limitation makes it useless for multi-region migrations on its own. If you want it automated, a small Lambda function on an EventBridge schedule can poll the migration tools in each region and write a combined status somewhere central.
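A bare-bones version of that aggregation: poll the actual migration tools (here DMS, as one example) in every region involved and print one combined view. The region list is a placeholder for your setup:

```shell
#!/bin/sh
# Print DMS replication task status from every region in the migration,
# since the Migration Hub console only shows the home region.

cross_region_status() {
    for region in us-east-1 us-west-2; do
        aws dms describe-replication-tasks \
            --region "$region" \
            --query 'ReplicationTasks[].[ReplicationTaskIdentifier,Status]' \
            --output text 2>/dev/null \
          | sed "s/^/$region  /"
    done
}
```

The same loop works for aws mgn describe-source-servers if you're tracking server replication instead of databases.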

Migration Execution Horror Stories

Q

My migration shows "completed" but the application doesn't work. What happened?

A

Migration Hub tracks the server migration but knows nothing about application functionality. The server migrated successfully, but the application configuration is wrong (database connection strings, license servers, network routing).

Reality check: "Completed" means the files copied successfully, not that your application works. Plan for 2-4 weeks of post-migration troubleshooting for any non-trivial application.

Q

Application Migration Service replicated my server but it won't boot. Now what?

A

Boot failures happen 30% of the time, especially with Windows servers or custom Linux configurations. The replication copied the disk but didn't account for hardware differences, driver issues, or boot sector problems.

Emergency fix process:

  1. Launch the target instance and attach the replicated EBS volume as secondary disk
  2. Boot from a rescue AMI and mount the migrated volume
  3. Fix /etc/fstab (Linux) or registry entries (Windows) for new hardware
  4. Install AWS-compatible drivers before attempting to boot
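For step 3 on Linux, the usual culprit is fstab mounting by raw device path. A sketch of the rescue-instance fix - the /dev/xvdf1 device name is an assumption, so check lsblk on your rescue instance:

```shell
#!/bin/sh
# Comment out fstab entries that mount by raw device path - they point at
# hardware that no longer exists after replication. Remount by UUID instead.

neutralize_device_mounts() {  # filter an fstab on stdin
    sed 's|^/dev/sd|#&|'
}

# On the rescue instance:
# sudo mount /dev/xvdf1 /mnt/rescue
# neutralize_device_mounts < /mnt/rescue/etc/fstab > /tmp/fstab.new
# sudo cp /mnt/rescue/etc/fstab /mnt/rescue/etc/fstab.bak
# sudo cp /tmp/fstab.new /mnt/rescue/etc/fstab
# sudo umount /mnt/rescue
```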

Prevention: Test boot the migrated instance in a non-production environment first. Every time.

Q

How long should I wait for replication to finish?

A

AWS says "a few hours" but reality is days or weeks depending on data size and network speed. For a 2TB server over a 100Mbps connection, expect 48+ hours for initial replication.

Real timelines from production:

  • 500GB server: 6-12 hours
  • 2TB server with database: 2-3 days
  • 10TB file server: 1-2 weeks
  • Add 50% to any estimate for network hiccups and AWS throttling

Q

The network diagram shows my servers are connected but the application can't reach the database after migration. Why?

A

AWS doesn't migrate network configuration. Your on-premises network routing, VLANs, and firewall rules don't automatically translate to AWS VPC security groups and route tables.

What's missing after migration:

  • Security group rules for inter-server communication
  • Route table entries for subnet routing
  • NACLs that block traffic Migration Hub doesn't know about
  • Custom DNS configurations

Fix it before you migrate: Document every network flow and translate it to AWS networking before starting server migration.
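Each documented flow then becomes a security group rule. For example, letting the web tier reach Oracle on 1521 - the group IDs here are placeholders:

```shell
#!/bin/sh
# Create one ingress rule per documented network flow before cutover.
# Group IDs and port are placeholders for your environment.

allow_flow() {  # allow_flow TARGET_SG SOURCE_SG PORT
    aws ec2 authorize-security-group-ingress \
        --group-id "$1" \
        --protocol tcp \
        --port "$3" \
        --source-group "$2"
}

# Usage:
# allow_flow sg-0db1234example sg-0web5678example 1521
```

Scripting this from your network flow inventory also leaves you an audit trail of exactly which rules exist and why.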

Q

Can I pause a migration that's failing?

A

No. Once Application Migration Service starts the cutover process, you can't pause it. You can fail back to source, but that requires starting over. This is why testing is critical.

When to fail back:

  • Boot failures that you can't fix within your downtime window
  • Database corruption or data inconsistency
  • Application performance issues that make the system unusable
  • Network connectivity problems that prevent users from accessing the application

Q

How do I know if my migration actually worked?

A

Test everything. Migration Hub showing "completed" means nothing for application functionality. Run your full test suite on the migrated systems before declaring success.

Minimum testing checklist:

  • Application starts and responds to requests
  • Database connections work and data is accessible
  • File shares and network drives mount correctly
  • Scheduled jobs execute successfully
  • Monitoring and backup systems connect to migrated servers
  • End-user acceptance testing in production-like conditions

Time estimate: Plan for testing to take as long as the actual migration. A 4-hour server migration needs 4-8 hours of testing.

Resources for When Everything Goes Wrong

Related Tools & Recommendations

  • Azure Migrate - Microsoft's free migration tool for discovering on-premises inventory and estimating Azure costs (/tool/azure-migrate/overview)
  • AWS Application Migration Service (MGN) - replicates physical or virtual servers to AWS; expect networking headaches and licensing surprises (/tool/aws-application-migration-service/overview)
  • AWS MGN Enterprise Production Deployment - security hardening, governance, and automation at enterprise scale (/tool/aws-application-migration-service/enterprise-production-deployment)
  • AWS Database Migration Service - database migration that integrates with Migration Hub tracking (/tool/aws-database-migration-service/overview)