
What ECS Actually Is (and Why You Might Want It)

AWS ECS Fargate Architecture

Amazon ECS is AWS's attempt to make running Docker containers less of a pain in the ass. Instead of manually provisioning EC2 instances, installing Docker, configuring clustering, and then crying when everything breaks at 3 AM, ECS handles the infrastructure bits while you deal with your actual application.

Here's what Amazon won't tell you: ECS is for teams who want to ship code, not become infrastructure experts. You're already paying AWS for RDS and S3, so why not let them handle container orchestration too? It's Docker management for people with deadlines.

How ECS Actually Works (The Good and Bad)

ECS has three main pieces that you need to understand:

Control Plane: AWS runs the brain that decides where your containers go and monitors if they're still alive. This is actually pretty nice because you don't have to maintain master nodes or deal with etcd corruption. The downside? You're locked into AWS's way of doing things, so good luck if you ever want to migrate.

Data Plane: Where your containers actually run. You've got three options: EC2 instances (you manage the servers), Fargate (AWS manages everything), or ECS Managed Instances (hybrid approach that launched September 2025). Each has its own special way of making your life difficult.

Task Definitions: JSON files that describe your containers. Think Docker Compose but more verbose and with AWS-specific nonsense sprinkled in. You'll spend hours tweaking CPU and memory limits when your container dies with exit code 137. Task definition docs have all the gory details.
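Here's roughly what a minimal Fargate task definition looks like - every name, ARN, and account ID below is a placeholder, so swap in your own:

```bash
# Minimal Fargate task definition, written out and registered with the CLI.
# The "memory" value is a hard limit: blow past it and the container gets
# OOM-killed with exit code 137.
cat > taskdef.json <<'EOF'
{
  "family": "my-api",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "512",
  "memory": "1024",
  "executionRoleArn": "arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
  "containerDefinitions": [
    {
      "name": "web",
      "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-api:latest",
      "essential": true,
      "portMappings": [{ "containerPort": 8080, "protocol": "tcp" }],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/my-api",
          "awslogs-region": "us-east-1",
          "awslogs-stream-prefix": "web"
        }
      }
    }
  ]
}
EOF

# Each registration creates a new revision (my-api:1, my-api:2, ...).
aws ecs register-task-definition --cli-input-json file://taskdef.json
```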

Launch Types (Pick Your Poison)

ECS Fargate vs EC2 Comparison

You get three ways to run containers in ECS, each with its own unique way to fuck up your day:

Fargate: AWS handles everything, you just pay through the nose. At around 4 cents per vCPU-hour, it's expensive but eliminates the "why did my EC2 instance randomly die" conversations. Fargate tasks take 1-3 minutes to start, which feels like forever when you're watching ResourcesNotReady errors during a production incident. Also, if you need anything that requires host-level access, you're fucked.

EC2 Launch Type: You manage the EC2 instances, ECS just schedules containers on them. Cheaper if you're smart about Reserved Instances and Spot, but now you're back to babysitting servers. Fun fact: when an EC2 instance dies, all containers on it die too. Hope your app handles that gracefully.

ECS Managed Instances: The new kid on the block (launched September 30, 2025). AWS promises to handle patching and scaling while giving you EC2 flexibility. Sounds great in theory, but it's so new that you'll be the beta tester. Pricing isn't public yet, but expect it to cost more than plain EC2.

The AWS Lock-in (Blessing and Curse)

AWS Services Overview

ECS plays really nice with other AWS services, which is great until you want to leave:

Security: IAM integration means you can lock down containers without learning a new auth system. Each task can have its own IAM role, which is genuinely useful. Just don't give every container Administrator access because you got tired of debugging permissions. GuardDuty will yell at you if something fishy happens, though it's another monthly charge.

Networking: Each Fargate task gets its own ENI, so you can apply security groups directly to containers. This is nice until you hit ENI limits and your deployments fail with ENI provisioning failed errors. I learned this when trying to deploy 200 containers and wondering why only 50 started. Service Connect is AWS's attempt at service mesh without the complexity tax.

Monitoring: CloudWatch integration is decent for basic stuff, but you'll probably want to ship logs somewhere else for serious analysis. Container Insights costs extra but gives you container-level metrics that actually help debug why your API is slow.

When ECS Makes Sense

ECS is perfect if you're already married to AWS and want containers without the Kubernetes learning curve. It's less good if you value portability or need advanced scheduling features. For deployments, just use rolling updates unless you have a specific reason not to. Blue/green is overkill for most use cases.

But knowing the basics isn't enough. Let's talk about what happens when you actually try to run this thing in production.

The Reality of Running ECS in Production

AWS ECS Console Dashboard

Here's where the AWS marketing bullshit meets cold, hard reality. ECS works fine for demos and simple apps, but production has a way of exposing every gotcha.

Task Placement (When It Works and When It Doesn't)

ECS has three placement strategies that sound great in theory:

Spread Placement: Tries to spread your containers across AZs. Works fine until you have an uneven number of containers and one AZ gets overloaded. I learned this the hard way when all my cache containers ended up in us-east-1a during a deployment that went sideways.

Binpack Placement: Crams containers onto fewer instances to save money. Great until one instance dies and takes down half your application. The "intelligent co-location" often means your CPU-intensive and memory-intensive containers end up fighting each other for resources.

Random Placement: Does what it says. Use this when you don't care and just want things to run somewhere.

The custom placement constraints are useful for GPU workloads, but the syntax is annoying and easy to get wrong. Expect to spend time debugging why your ML containers keep landing on CPU-only instances.
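For what it's worth, here's the CLI shape of those strategies plus a GPU-ish constraint. Cluster and service names are invented, and placement options only apply to the EC2 launch type (Fargate ignores them):

```bash
# Spread across AZs first, then binpack on memory within each AZ.
# The memberOf constraint keeps tasks off anything that isn't a g4dn
# GPU instance (cluster query language syntax, easy to typo).
aws ecs create-service \
  --cluster prod \
  --service-name ml-workers \
  --task-definition my-api \
  --desired-count 6 \
  --launch-type EC2 \
  --placement-strategy \
      type=spread,field=attribute:ecs.availability-zone \
      type=binpack,field=memory \
  --placement-constraints \
      'type=memberOf,expression=attribute:ecs.instance-type =~ g4dn.*'
```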

Scaling (The Good, Bad, and "Why Is This So Slow?")

ECS scaling has multiple layers that can each fail in exciting ways:

Service Auto Scaling: Watches CloudWatch metrics and adjusts task count. Sounds simple, but CloudWatch metrics lag by 5+ minutes, so you're always scaling after the damage is done. I watched our API response times hit 10 seconds before ECS decided to scale out. Pro tip: set scale-out to be aggressive and scale-in to be conservative (there's a sketch after this list), or you'll be refreshing Grafana wondering why everything is slow.

Capacity Provider Scaling: Supposed to add EC2 instances automatically when you need more capacity. In reality, it takes 2-5 minutes to provision new instances, so your containers sit in PENDING state with InsufficientCapacity errors while AWS slowly spins up infrastructure. This is fine for batch jobs, terrible for Black Friday traffic spikes.

Cluster Auto Scaling: The marketing says it "optimizes capacity," but it's really just capacity provider scaling with extra steps.
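The sketch mentioned above: a target-tracking policy with aggressive scale-out and conservative scale-in. The `service/prod/workers` resource ID is made up:

```bash
# First declare how far the service may scale.
aws application-autoscaling register-scalable-target \
  --service-namespace ecs \
  --scalable-dimension ecs:service:DesiredCount \
  --resource-id service/prod/workers \
  --min-capacity 2 \
  --max-capacity 20

# Track 50% average CPU; scale out after 60s, but wait 5 minutes
# before scaling in so a brief lull doesn't gut your capacity.
aws application-autoscaling put-scaling-policy \
  --service-namespace ecs \
  --scalable-dimension ecs:service:DesiredCount \
  --resource-id service/prod/workers \
  --policy-name cpu-target-tracking \
  --policy-type TargetTrackingScaling \
  --target-tracking-scaling-policy-configuration '{
    "TargetValue": 50.0,
    "PredefinedMetricSpecification": {
      "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
    },
    "ScaleOutCooldown": 60,
    "ScaleInCooldown": 300
  }'
```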

The official limits say 1,000 services per cluster and 5,000 tasks per service, but there's a catch: if you use service discovery, you're limited to 1,000 tasks per service because of Cloud Map restrictions. Found this out the hard way trying to scale a worker service.

Networking (Where Things Get Weird)

ECS networking is where the magic happens, and by magic I mean "things break in unexpected ways":

Task Networking: Fargate gives each task its own ENI, which is great for security but means you can hit ENI limits on your VPC. Each Fargate task also gets a private IP, so make sure your subnets are big enough. EC2 tasks can use bridge mode (containers share the host network) or awsvpc mode (each task gets an ENI). Bridge mode is simpler but less secure; awsvpc mode is more secure but more complex.

Service Discovery: Cloud Map integration sounds cool but has quirks. DNS propagation can take 30+ seconds, so don't expect instant service discovery. Also, it costs $0.50 per million queries, which adds up if you have chatty services.

Load Balancing: ALB integration is solid once you figure out the target group configuration. Dynamic port mapping works but can be confusing when debugging. NLB is faster but less flexible. Pro tip: ALB health checks have their own timeout settings that can fail your deployments if you're not careful.

Security: Each task can have its own IAM role, which is genuinely useful. ECS Exec lets you shell into running containers, but you need to enable it at the service level and it uses SSM Session Manager. Expect to waste an hour figuring out why aws ecs execute-command returns "Session could not be started" the first time you try it.
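If it saves you that hour, this is the rough enable-and-connect sequence. Cluster, service, and task IDs are placeholders, and the task role also needs the `ssmmessages` permissions:

```bash
# Flip on ECS Exec for the service; already-running tasks won't pick
# up the flag, so force a new deployment.
aws ecs update-service \
  --cluster prod \
  --service workers \
  --enable-execute-command \
  --force-new-deployment

# Then shell into a container through SSM Session Manager.
aws ecs execute-command \
  --cluster prod \
  --task 0123456789abcdef0 \
  --container web \
  --interactive \
  --command "/bin/sh"
```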

Cost Management (Prepare Your Wallet)

AWS Pricing Calculator

ECS costs can sneak up on you if you're not careful:

Fargate Spot: Up to 70% cheaper than regular Fargate, but your tasks can get killed with two minutes' notice. Great for batch jobs, terrible for user-facing services. The interruption rate varies wildly by region and time. A capacity provider sketch follows below.

EC2 Spot: Can save up to 90% on compute costs, but spot interruptions will test how resilient your application actually is. ECS handles the draining gracefully, but your app needs to handle shutdowns properly.

Resource-Based Pricing: Fargate bills per second with a 1-minute minimum, which sounds great until you realize you're paying for the resources you request, not what you use. If you allocate 2GB RAM but only use 500MB, you still pay for 2GB. Size your containers carefully.

Regional Differences: Fargate costs vary dramatically by region. São Paulo costs $0.0696 per vCPU-hour while US East is $0.04048. If you're running global workloads, this adds up fast.
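The Fargate Spot sketch promised above: a capacity provider strategy that keeps a small on-demand floor and pushes the rest to Spot. IDs are placeholders, and the cluster needs the FARGATE and FARGATE_SPOT capacity providers attached:

```bash
# base=2 keeps the first two tasks on regular Fargate; beyond that,
# capacity splits 1:3 in favor of Spot. Note you can't combine
# --capacity-provider-strategy with --launch-type.
aws ecs create-service \
  --cluster prod \
  --service-name batch-workers \
  --task-definition my-api \
  --desired-count 8 \
  --capacity-provider-strategy \
      capacityProvider=FARGATE,base=2,weight=1 \
      capacityProvider=FARGATE_SPOT,weight=3 \
  --network-configuration 'awsvpcConfiguration={subnets=[subnet-0aaa111],securityGroups=[sg-0bbb222],assignPublicIp=DISABLED}'
```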

Hidden Costs That Will Bite You

Don't forget about CloudWatch logs ($0.50 per GB ingested), NAT Gateway costs for Fargate internet access, and data transfer charges. I've seen monthly bills jump 30% because someone enabled verbose logging in production.

So now you know how ECS actually behaves in production. The question is: when does it make sense to put up with all this?

When ECS Actually Makes Sense (And When It Doesn't)

Despite all the gotchas and hidden costs, ECS has its place. Here's when it's worth the pain.

Application Modernization (The Good and Ugly)

Docker Deployment Workflow

ECS is decent for containerizing existing apps without a complete rewrite, but let's be honest about what that looks like:

Lift-and-Shift: You can dockerize your legacy Java monolith and throw it on ECS. It'll run, but you're not getting most of the benefits of containers. You've just traded VM problems for container problems. At least deployments become more consistent, and horizontal scaling gets easier.

Microservices Migration: ECS works for gradually breaking apart monoliths, but the service discovery has a learning curve. You'll spend time figuring out why services can't find each other, especially during network partitions. The load balancer integration is solid though.

Hybrid Cloud: ECS Anywhere at $0.01025 per hour per instance sounds cheap until you realize you're paying AWS to manage containers running on your own hardware. It works, but you're essentially paying for the control plane complexity you wanted to avoid.

Batch Processing (Where ECS Actually Shines)

ECS is genuinely good for batch workloads and background processing:

Scientific Computing: Running genomics pipelines on ECS with AWS Batch works well because batch jobs can tolerate the 2-5 minute startup time. GPU instance integration is solid, though you'll pay premium prices for those instances. The automatic instance selection saves you from figuring out optimal instance types. Perfect for when you need to process terabytes of data overnight.

Financial Processing: Risk calculations and end-of-day processing are perfect for ECS. You can scale from 0 to 1000+ containers for the nightly batch run, then scale back down. Just make sure your jobs can handle spot interruptions gracefully - learned this when a spot interruption corrupted a 6-hour risk calculation.

Media Processing: Video transcoding works great on spot instances since the jobs are resumable. I've seen 80%+ cost savings using spot instances for media workflows. Just build proper checkpointing into your processing logic.

AI/ML Workloads (Hit or Miss)

ECS for AI/ML has some wins and some major limitations:

Model Inference: Fargate works for smaller models, but the 1-3 minute cold start time is brutal for inference workloads. You'll want to keep instances warm or use EC2 launch type for production inference. GPU instances on Fargate aren't available yet, so GPU inference means managing EC2 instances yourself.

Model Training: ECS can orchestrate distributed training, but honestly, SageMaker is usually better for this. If you're doing training on ECS, the EFS integration for shared model storage works, but expect network bottlenecks with large datasets. For most teams, SageMaker batch transform is less painful.

AI Agents: The security isolation is nice for AI workloads that might run untrusted code, but the startup time can be a problem for real-time agents. Works better for asynchronous AI workflows.

Why Companies Actually Choose ECS

Here's what I've seen in the wild:

Operational Simplicity: Teams choose ECS because they don't want to become Kubernetes experts. Managing etcd, dealing with CNI plugins, and debugging pod networking issues gets old fast. ECS is boring in a good way - it mostly just works.

AWS Lock-In Acceptance: If you're already using RDS, ElastiCache, and Lambda, ECS fits naturally. You're already locked into AWS anyway, so the additional lock-in doesn't matter.

Cost Reality: The "20-50% cost reduction" claims are misleading. You save on not running Kubernetes control plane nodes, but Fargate is expensive. Real savings come from not needing dedicated DevOps engineers who understand Kubernetes deeply.

Industry Patterns (What Actually Happens)

Healthcare: ECS works for HIPAA compliance because AWS handles most of the infrastructure concerns. The audit logging through CloudTrail is comprehensive, but you'll still need to design your applications properly for compliance. AWS covers the infrastructure under their Business Associate Agreement, but your app logic is still your problem.

Financial Services: Regulated environments like ECS because the attack surface is smaller than managing your own Kubernetes cluster. The downside is you're trusting AWS with critical infrastructure, which some compliance teams struggle with. AWS handles most regulatory frameworks, but you still need to audit your application code.

E-commerce: ECS auto-scaling works for traffic spikes, but the 2-5 minute scale-out time means you need to pre-scale for known events like Black Friday. The CloudFront integration is solid though.

When ECS Doesn't Make Sense

Don't use ECS if you need advanced scheduling features, have complex multi-tenancy requirements, or plan to migrate off AWS someday. Kubernetes is more portable and configurable, just more complex to operate.

So how does ECS stack up against the alternatives? Let's break it down.

ECS vs. Container Orchestration Alternatives

| Feature | Amazon ECS | Amazon EKS | Google GKE | Azure Container Instances | Docker Swarm |
|---|---|---|---|---|---|
| Management Overhead | Low (AWS handles it) | Medium (you manage workers) | Medium (Google handles some) | Very Low (fully managed) | High (you handle everything) |
| Control Plane Cost | Free | $0.10/hour per cluster | $0.10/hour per cluster | Free | Free (but you manage it) |
| Learning Curve | Gentle | Steep AF | Steep + GCP quirks | Very gentle | Moderate |
| Pain Level | Low | High | High | Very Low | Medium |
| Lock-in Factor | Total AWS lock-in | Portable Kubernetes | Portable Kubernetes | Total Azure lock-in | Highly portable |
| Debugging Difficulty | Medium | Hard | Hard | Easy | Medium |
| Auto-Scaling Reality | Works but slow (2-5 min) | Works well | Advanced features | Basic but fast | Barely works |
| Serverless Options | Fargate (expensive) | Fargate (even more expensive) | Cloud Run (decent) | Native (good) | None |
| Security Model | IAM per task (nice) | RBAC + IAM (complex) | IAM + RBAC (complex) | Azure AD (simple) | Basic Docker |
| Networking Gotchas | ENI limits, slow DNS | CNI plugin hell | Works well | VNET complexity | Overlay issues |
| Storage Pain | EFS is slow, EBS is hard | CSI drivers are finicky | Works smoothly | Azure Files are slow | Volumes are basic |
| Monitoring Reality | CloudWatch costs add up | Need multiple tools | Integrated but expensive | Azure Monitor is decent | DIY everything |
| Cost Reality | Fargate is pricey | Control plane + compute | Control plane + compute | Pay per second | Cheapest but hidden costs |
| Community Support | AWS forums | Huge K8s community | Good but less than EKS | Limited | Dying community |
| When to Use | AWS shops wanting simple | K8s expertise + portability | GCP shops, ML workloads | Azure shops, simple needs | Legacy Docker migration |
| When to Avoid | Multi-cloud plans | Simple web apps | AWS-heavy environments | Complex orchestration | New projects |

Questions Real Engineers Actually Ask

Q: ECS vs EKS - which one should I pick?

A: If you're already on AWS and just want containers to work without learning Kubernetes, use ECS. If you need portability or your team knows K8s, use EKS. ECS has no control plane costs but locks you into AWS. EKS costs $0.10/hour per cluster but gives you standard Kubernetes.

Q: Why does my Fargate task take forever to start?

A: Fargate has a 1-3 minute cold start because AWS needs to provision the underlying infrastructure. This is just how it works. If you need faster startup, use the EC2 launch type with pre-warmed instances, or keep your services scaled to at least 1 task so you have warm containers ready.

Q: How much is this actually going to cost me?

A: Fargate pricing at $0.04048 per vCPU-hour and $0.004445 per GB-hour adds up fast. A small container (0.5 vCPU, 1GB RAM) running 24/7 costs about $18/month: (0.5 × $0.04048 + 1 × $0.004445) × 730 hours ≈ $18. Don't forget about CloudWatch logs ($0.50/GB), data transfer, and NAT Gateway costs for internet access. I've seen bills double because of logging.

Q: Can I run Windows containers on ECS?

A: Yes - on EC2 instances with Windows Server AMIs, and Fargate has supported Windows containers since late 2021 (at a price premium that covers the Microsoft licensing). Also, Windows containers are about as fun as debugging JavaScript in Internet Explorer - they work, but you'll question your life choices.
Q: My task just says "PENDING" forever, what's wrong?

A: Usually it's one of these: insufficient CPU/memory capacity in your cluster, ENI limits in your subnet, image pull failures (`CannotPullContainerError`), or security group issues blocking the ALB health check. I spent 2 hours once debugging this before realizing my security group wasn't allowing traffic on port 80. Check the ECS console events tab - it'll tell you exactly what's wrong instead of making you guess.
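Same checks from the CLI, if the console is being slow (cluster, service, and task names are placeholders):

```bash
# The last few service events usually name the exact problem
# (capacity, ENI limits, ports already in use).
aws ecs describe-services \
  --cluster prod \
  --services workers \
  --query 'services[0].events[:5].message' \
  --output text

# For a task that already died, stoppedReason says why.
aws ecs describe-tasks \
  --cluster prod \
  --tasks 0123456789abcdef0 \
  --query 'tasks[0].{task:stoppedReason,containers:containers[].reason}'
```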
Q: Why can't my containers talk to each other?

A: Service discovery DNS can take 30+ seconds to propagate, so your app might be trying to connect before the DNS record exists. Also check security groups - each Fargate task gets its own ENI, so the security group rules apply at the task level, not the instance level.
Q: How do I handle secrets in ECS?

A: Use Secrets Manager or Parameter Store and reference them in your task definition. ECS pulls secrets at runtime and injects them as environment variables. Don't put secrets directly in your task definition - they'll show up in the console and logs.
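A sketch of what that reference looks like inside `containerDefinitions` - the image and secret ARN are invented, and the task *execution* role needs permission to read the secret:

```bash
# Container definition fragment: ECS resolves the ARN at task start and
# injects the value as the DB_PASSWORD environment variable.
cat > web-container.json <<'EOF'
{
  "name": "web",
  "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-api:latest",
  "essential": true,
  "secrets": [
    {
      "name": "DB_PASSWORD",
      "valueFrom": "arn:aws:secretsmanager:us-east-1:123456789012:secret:prod/db-AbCdEf"
    }
  ]
}
EOF
```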
Q: Should I run databases in ECS?

A: No. Just use RDS or another managed database service. Running stateful services in containers is a pain in the ass - you'll spend more time managing storage and backups than solving actual problems. Save yourself the headache.
Q: ECS vs plain EC2 - what's the point?

A: ECS gives you health monitoring, rolling deployments, load balancer integration, and service discovery out of the box. You could build all this yourself on EC2, but why? ECS costs the same as plain EC2 (for the EC2 launch type) but handles all the orchestration complexity.

Q: My deployment keeps failing, what now?

A: Check the service events in the ECS console first - they usually tell you exactly what's wrong. Common issues: health check failures (check your ALB target group settings - a health check path of /health returning 404), resource constraints (task definition requesting 4GB but the instance only has 2GB free), or networking problems (security groups, subnets). The error messages are actually pretty helpful if you read them. I've debugged deployments that failed because the health check timeout was 5 seconds but the app took 8 seconds to start responding.
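If your app is the slow-to-boot kind, loosen the target group before fighting the deployment. The ARN and service names below are fake:

```bash
# Timeout comfortably above the app's worst-case response time, and
# interval x unhealthy-threshold long enough to survive a slow boot.
aws elbv2 modify-target-group \
  --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/web/0123456789abcdef \
  --health-check-path /health \
  --health-check-timeout-seconds 10 \
  --health-check-interval-seconds 15 \
  --unhealthy-threshold-count 3

# And give new tasks breathing room before ECS counts health checks
# (only valid for services behind a load balancer).
aws ecs update-service \
  --cluster prod \
  --service workers \
  --health-check-grace-period-seconds 60
```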
Q: What are the actual limits I'll hit?

A: The official limits say 1,000 services per cluster and 5,000 tasks per service, but there's a catch: service discovery limits you to 1,000 tasks per service because of Cloud Map restrictions. You'll hit ENI limits in your subnets before hitting most other limits.

Q: How do I do blue-green deployments?

A: ECS supports blue-green through CodeDeploy integration, but honestly, just use rolling deployments unless you have a specific reason not to. They're simpler and work fine for most use cases. Blue-green is overkill for most applications.
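The two rolling-update knobs that matter, shown with made-up names:

```bash
# With 10 desired tasks: maximumPercent=200 lets ECS start 10 new tasks
# before stopping old ones; minimumHealthyPercent=100 keeps full
# capacity during the roll (needs enough spare cluster capacity).
aws ecs update-service \
  --cluster prod \
  --service workers \
  --deployment-configuration 'maximumPercent=200,minimumHealthyPercent=100'
```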

Q: Can I use ECS on-premises?

A: ECS Anywhere lets you run ECS on your own hardware for $0.01025/hour per instance. It works, but you're paying AWS to manage containers on your own servers. If you want on-premises container orchestration, Kubernetes might make more sense.

Q: How do I debug what's happening in my containers?

A: Use ECS Exec to shell into running containers - it's like SSH but goes through AWS Session Manager. Enable [Container Insights](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/ContainerInsights.html) for detailed metrics, but be prepared for the CloudWatch costs to add up.

Q: What networking mode should I use?

A: Use awsvpc mode (the default and only mode for Fargate), where each task gets its own ENI. It's more secure and easier to understand than bridge mode. Host mode is only useful for special cases where you need direct host network access.
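In practice that just means every service carries a network configuration like this (subnet and security group IDs are placeholders):

```bash
# awsvpc mode: the task gets its own ENI in these subnets, and the
# security group applies to the task itself, not the host.
aws ecs create-service \
  --cluster prod \
  --service-name web \
  --task-definition my-api \
  --desired-count 2 \
  --launch-type FARGATE \
  --network-configuration 'awsvpcConfiguration={subnets=[subnet-0aaa111,subnet-0bbb222],securityGroups=[sg-0ccc333],assignPublicIp=DISABLED}'
```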
