Why is my bill so fucking high when CPU usage is low?

Data transfer costs will kill you. We had an $800 surprise bill because tasks were constantly pulling images from ECR in different regions. Hidden costs include:

- Cross-AZ data transfer: $0.01/GB (adds up fast with microservices)
- CloudWatch logs: $0.50/GB ingested (our chatty app cost $200/month just in logs)
- Load balancer hours: $16.43/month minimum per ALB
- NAT Gateway: $32.40/month plus data processing fees

**Fix:** Use ECR in the same region, filter logs in your app (not CloudWatch), and budget 40% more than what AWS tells you it'll cost.
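To see how these line items stack up, here is a rough Python sketch that adds them together. The rates are the ones quoted above; the function names and the example volumes are made up for illustration, and NAT data processing fees are left out because no rate is given here.

```python
# Rates quoted in the list above (USD); check them against current AWS pricing.
CROSS_AZ_PER_GB = 0.01
LOG_INGEST_PER_GB = 0.50
ALB_PER_MONTH = 16.43
NAT_GATEWAY_PER_MONTH = 32.40


def hidden_monthly_cost(cross_az_gb, log_gb, albs=1, nat_gateways=1):
    """Return the monthly hidden cost in USD, excluding NAT data processing fees."""
    total = (
        cross_az_gb * CROSS_AZ_PER_GB
        + log_gb * LOG_INGEST_PER_GB
        + albs * ALB_PER_MONTH
        + nat_gateways * NAT_GATEWAY_PER_MONTH
    )
    return round(total, 2)


def padded_budget(aws_estimate, padding=0.40):
    """Apply the 40% padding suggested above to an AWS cost estimate."""
    return round(aws_estimate * (1 + padding), 2)


# Hypothetical month: 1 TB of cross-AZ traffic and 400 GB of logs.
print(hidden_monthly_cost(cross_az_gb=1000, log_gb=400))
print(padded_budget(100.0))
```

At $0.50/GB, 400 GB of logs on its own comes to the $200/month mentioned above, which is why filtering in the app pays off quickly.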

Why do my containers keep failing to start with "CannotPullContainerError"?

90% of the time it's networking. Check this in order:

1. **Security group egress rules** - must allow outbound HTTPS (443) and HTTP (80)
2. **Subnet routing** - private subnets need NAT Gateway or VPC endpoints
3. **ECR permissions** - task execution role needs `ecr:GetAuthorizationToken`, `ecr:BatchCheckLayerAvailability`, `ecr:GetDownloadUrlForLayer`, and `ecr:BatchGetImage`

Copy this and adjust:

```bash
aws ecs describe-tasks --cluster my-cluster --tasks arn:aws:ecs:us-east-1:123456789012:task/abc
```

Look for the actual error message in the `stoppedReason` field. AWS error messages are cryptic but at least they're consistent.
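If you'd rather not eyeball the JSON, a small Python sketch like this can pull the stop reasons out of the `describe-tasks` output. The `stop_reasons` helper and the sample payload are made up for illustration; the sample only mimics the general shape of the real output (`tasks[].stoppedReason`, `tasks[].containers[].reason`).

```python
import json


def stop_reasons(describe_tasks_output):
    """Return (task ARN, stoppedReason, container reasons) for each stopped task."""
    data = json.loads(describe_tasks_output)
    results = []
    for task in data.get("tasks", []):
        if "stoppedReason" not in task:
            continue  # task is still running or has no recorded reason
        container_reasons = [
            c["reason"] for c in task.get("containers", []) if "reason" in c
        ]
        results.append((task["taskArn"], task["stoppedReason"], container_reasons))
    return results


# Made-up sample in the shape of `aws ecs describe-tasks` output.
SAMPLE = json.dumps({
    "tasks": [{
        "taskArn": "arn:aws:ecs:us-east-1:123456789012:task/abc",
        "stoppedReason": "CannotPullContainerError: (sample message)",
        "containers": [{"name": "api", "reason": "CannotPullContainerError"}],
    }],
    "failures": [],
})

for arn, reason, details in stop_reasons(SAMPLE):
    print(arn, "->", reason, details)
```

In practice you would feed it the JSON printed by the command above instead of `SAMPLE`.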

Task placement failed - what the hell does that mean?

Your subnet ran out of IP addresses. Each Fargate task eats one IP from your subnet. If you're trying to scale to 500 tasks in a /24 subnet (251 IPs), you'll hit this error.

**Quick fixes:**

1. Create a larger subnet (/20 gives you 4,091 IPs)
2. Spread tasks across multiple subnets
3. Use a VPC with more IP space (don't use the default VPC)

**Check available IPs:**

```bash
aws ec2 describe-subnets --subnet-ids subnet-12345678 --query 'Subnets[0].AvailableIpAddressCount'
```
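The subnet numbers above come from simple arithmetic: AWS reserves five addresses in every subnet. A minimal Python sketch of that math follows; the helper names and the `10.0.0.0` ranges are placeholders, and it ignores IPs already taken by load balancers or other ENIs.

```python
import ipaddress

# AWS reserves the first four addresses and the last one in every subnet.
AWS_RESERVED_IPS = 5


def usable_ips(cidr):
    """Number of addresses in a subnet that tasks can actually use."""
    return ipaddress.ip_network(cidr).num_addresses - AWS_RESERVED_IPS


def smallest_prefix_for(tasks):
    """Longest IPv4 prefix (smallest subnet) with room for `tasks` IPs."""
    for prefix in range(28, 15, -1):  # AWS subnets range from /28 to /16
        if 2 ** (32 - prefix) - AWS_RESERVED_IPS >= tasks:
            return prefix
    raise ValueError("more tasks than a single /16 subnet can hold")


print(usable_ips("10.0.0.0/24"))   # 251
print(usable_ips("10.0.0.0/20"))   # 4091
print(smallest_prefix_for(500))    # 23
```

So 500 tasks technically fit in a /23, but that leaves almost no headroom, which is why a /20 is the safer default.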

How do I debug networking issues between services?

**Step 1:** Check if traffic is reaching the load balancer:

```bash
aws elbv2 describe-target-health --target-group-arn arn:aws:elasticloadbalancing:...
```

**Step 2:** If targets are unhealthy, check security group rules. [This Stack Overflow answer](https://stackoverflow.com/questions/41732796/aws-fargate-unable-to-pull-image-cannotpullcontainererror) has the correct rules for most setups.

**Step 3:** Enable VPC Flow Logs to see where packets are being dropped:

```json
{
  "version": 2,
  "account-id": "123456789012",
  "interface-id": "eni-1235b8ca123456789",
  "srcaddr": "172.31.16.139",
  "dstaddr": "172.31.16.21",
  "srcport": 20641,
  "dstport": 22,
  "protocol": 6,
  "packets": 20,
  "bytes": 4249,
  "windowstart": 1418530010,
  "windowend": 1418530070,
  "action": "REJECT"
}
```

If you see REJECT, your security groups (or network ACLs) are fucked.
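Scanning flow logs by hand gets old fast. Here is a rough Python sketch that filters out the rejected flows, assuming the records have been exported as one JSON object per line in the shape shown above (flow logs are space-separated text by default, so that export step is an assumption). `rejected_flows` and the sample lines are made up for illustration.

```python
import json


def rejected_flows(lines):
    """Yield (srcaddr, dstaddr, dstport) for every REJECT record."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        if record.get("action") == "REJECT":
            yield record["srcaddr"], record["dstaddr"], record["dstport"]


# Two made-up records: one rejected SSH attempt, one accepted HTTPS flow.
SAMPLE = [
    '{"srcaddr": "172.31.16.139", "dstaddr": "172.31.16.21", "dstport": 22, "action": "REJECT"}',
    '{"srcaddr": "172.31.16.139", "dstaddr": "172.31.16.21", "dstport": 443, "action": "ACCEPT"}',
]

for src, dst, port in rejected_flows(SAMPLE):
    print(f"{src} -> {dst}:{port} rejected")
```

The destination port of each rejected flow tells you which rule is missing.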

Cold starts are killing my API performance - how do I fix this?

**Image optimization is not optional:**

1. Use Alpine Linux base (3MB vs 100MB+ for Ubuntu)
2. Multi-stage Docker builds to remove build tools
3. [Use ECR with zstd compression](https://aws.amazon.com/blogs/containers/reducing-aws-fargate-startup-times-with-zstd-compressed-container-images/)

**Our results:**

- Before: 2.1GB Node.js app, 5+ minute cold starts
- After: 280MB app, 45-second cold starts

**Dockerfile that actually works:**

```dockerfile
FROM node:16-alpine AS builder
WORKDIR /app
COPY package*.json ./
# This took me 3 hours to figure out
RUN npm ci --only=production

FROM node:16-alpine
WORKDIR /app
# Saves 90% of image size
COPY --from=builder /app/node_modules ./node_modules
COPY . .
EXPOSE 3000
# Don't use npm start, it's slower
CMD ["node", "server.js"]
```

Can I use regular EC2 security groups with Fargate?

Yes, but the mental model is different. Security group rules apply to the **task**, not the instance hosting it. Each task gets its own ENI with the security groups you specify.

**This catches people:**

- Outbound rules matter (tasks can't reach the internet without them)
- Rules apply at the task level, not the host level
- Source/destination IPs are task IPs, not EC2 instance IPs

How do I handle secrets without hardcoding them?

![ECS Task Definition Structure](https://d2908q01vomqb2.cloudfront.net/fc074d501302eb2b93e2554793fcaf50b3bf7291/2019/02/01/Fargate-1024x512.jpg)

**Use AWS Secrets Manager or Parameter Store**, not environment variables in task definitions. Here's the configuration:

```json
{
  "secrets": [
    {
      "name": "DATABASE_PASSWORD",
      "valueFrom": "arn:aws:secretsmanager:us-east-1:123456789012:secret:prod/api/db-AbCdEf"
    }
  ]
}
```

**Task execution role needs:**

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "secretsmanager:GetSecretValue"
      ],
      "Resource": "arn:aws:secretsmanager:us-east-1:123456789012:secret:*"
    }
  ]
}
```

Secrets are injected as environment variables at runtime. Your code just reads them normally.
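On the application side, "reads them normally" can look like this minimal Python sketch. `require_env` is a hypothetical helper, not part of any AWS SDK; it just fails fast when the variable is missing, which usually means the `secrets` block or the execution role is wrong.

```python
import os


def require_env(name):
    """Return an environment variable, failing fast if ECS did not inject it."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set; check the task definition's secrets block"
        )
    return value
```

At startup you would call `require_env("DATABASE_PASSWORD")`, using the name from the task definition above, so a broken secret shows up as a clear error instead of a failed database login later.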

Autoscaling is too slow - tasks take forever to start

**Default scaling policies suck.** Use these settings:

```json
{
  "targetValue": 70.0,
  "scaleOutCooldown": 30,
  "scaleInCooldown": 300,
  "metricType": "CPUUtilization"
}
```

- `scaleOutCooldown`: the default 300 seconds is way too slow
- `scaleInCooldown`: keep this high to avoid flapping
- `metricType`: memory scaling is even worse than CPU

**Better approach:** Scale on application metrics like queue depth or request latency. CPU utilization lags too much.

**Nuclear option:** Pre-warm with minimum task counts. Costs more but handles traffic spikes without the 2-minute delay.
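For the queue-depth approach, one common way to turn a backlog into a task count is "backlog per task": work out how many messages a single task can have waiting and still meet your latency target, then divide. The Python sketch below illustrates that arithmetic only; it is not how ECS computes anything internally, and the function name, latency figures, and 5/100 bounds are placeholder assumptions.

```python
import math


def desired_task_count(queue_depth, acceptable_latency_s, avg_processing_s,
                       min_tasks=5, max_tasks=100):
    """Tasks needed to drain `queue_depth` messages within the latency target."""
    # Messages one task can have waiting while still meeting the latency target.
    backlog_per_task = acceptable_latency_s / avg_processing_s
    needed = math.ceil(queue_depth / backlog_per_task)
    # Clamp to the service's scaling bounds.
    return max(min_tasks, min(max_tasks, needed))


print(desired_task_count(1500, acceptable_latency_s=10, avg_processing_s=0.5))   # 75
print(desired_task_count(0, acceptable_latency_s=10, avg_processing_s=0.5))      # 5
print(desired_task_count(10000, acceptable_latency_s=10, avg_processing_s=0.5))  # 100
```

The floor of 5 tasks is the pre-warming idea from above: an empty queue still keeps a minimum running so the first spike doesn't wait on cold starts.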

ECS vs EKS on Fargate - which one should I use?

**Use ECS unless you already know Kubernetes.**

**ECS pros:**

- Learning curve: 2 weeks
- AWS-native, better integration
- Simpler networking model
- No additional cluster costs

**EKS pros:**

- Learning curve: 2-3 months
- Kubernetes portability (theoretical)
- Better community ecosystem
- More complex networking options

**EKS costs an extra $74/month per cluster** just for the control plane. For most companies, ECS is the right choice.

Can I run databases on Fargate?

**Don't.** Just don't. Use RDS, DynamoDB, or ElastiCache instead.

**If you absolutely must:**

- Use EFS for persistent storage (not ephemeral storage)
- Expect terrible I/O performance compared to EC2
- Manual backup strategies (no automated snapshots)
- Networking complexity for clustering

**Better alternatives:**

- Small databases: RDS with t3.micro ($13/month)
- NoSQL: DynamoDB on-demand
- Caching: ElastiCache Redis

Platform version migrations broke my deployment - how do I prevent this?

**Pin your platform version in production.** `platformVersion` is set on the service (or the `run-task` call), not in the task definition:

```json
{
  "serviceName": "my-app",
  "launchType": "FARGATE",
  "platformVersion": "1.4.0"
}
```

**Current platform versions (late 2025):**

- Linux: 1.4.0 (stable)
- Windows: 1.0.0 (still missing features)

AWS will eventually force migrations for security updates, but pinning gives you control over when it happens.

How do I monitor costs and set billing alerts?

**Set up billing alerts immediately:**

1. Enable detailed billing in CloudWatch
2. Create alarms for monthly costs
3. Use cost allocation tags to track per-service costs

**Tags that matter:**

```json
{
  "Environment": "production",
  "Service": "api",
  "Owner": "team-backend"
}
```

**Our production setup:**

- Alert at $500/month (50% of budget)
- Alarm at $800/month (80% of budget)
- Emergency cutoff at $1000/month

**Pro tip:** Fargate costs can spike 10x overnight if something goes wrong. Monitoring is not optional.
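The three tiers above boil down to a simple threshold check. Here is a small Python sketch of that logic, assuming a $1000 monthly budget as in our setup; `budget_status` and its labels are made up for illustration, and wiring it to real billing data is left out.

```python
def budget_status(month_to_date, budget=1000.0):
    """Map month-to-date spend to the alert tiers listed above."""
    ratio = month_to_date / budget
    if ratio >= 1.0:
        return "cutoff"   # emergency cutoff at 100% of budget
    if ratio >= 0.8:
        return "alarm"    # alarm at 80% of budget
    if ratio >= 0.5:
        return "alert"    # alert at 50% of budget
    return "ok"


for spend in (120, 500, 800, 1000):
    print(spend, budget_status(spend))
```

Expressing the tiers as ratios means the same check keeps working when the budget changes.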


AWS Fargate: AI-Optimized Technical Reference

Executive Summary

AWS Fargate is a serverless container platform that costs 2-3x more than EC2 but eliminates infrastructure management. Critical breaking points include subnet IP exhaustion, 2+ minute cold starts for large images, and platform version migrations that break deployments without warning.

Configuration

Production-Ready Settings

Task Definition (Minimum Viable):

```json
{
  "family": "production-app",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "512",
  "memory": "1024",
  "networkMode": "awsvpc"
}
```

platformVersion: not a task definition field; pin "platformVersion": "1.4.0" on the service or run-task call to prevent breaking migrations
cpu: 0.25 vCPU (256) is barely usable for real apps
memory: allocations round up (2.1GB costs 4GB)

Autoscaling That Survives Traffic Spikes:

```json
{
  "targetValue": 70.0,
  "metricType": "CPUUtilization",
  "scaleOutCooldown": 60,
  "scaleInCooldown": 300,
  "minCapacity": 5,
  "maxCapacity": 100
}
```

scaleOutCooldown: default 300s is too slow for production
minCapacity: minimum to handle sudden load

Security Groups (Essential Egress):

```json
{
  "egress": [
    {"protocol": "tcp", "port": 443, "destination": "0.0.0.0/0"},
    {"protocol": "tcp", "port": 80, "destination": "0.0.0.0/0"}
  ]
}
```

Port 443 is HTTPS; port 80 is HTTP.

Critical Platform Specifications

| Component | Specification | Production Reality | Failure Consequences |
| --- | --- | --- | --- |
| CPU Range | 0.25-16 vCPU | 0.25 vCPU unusable for real apps | App timeouts, poor performance |
| Memory Range | 512MB-120GB | Allocations round up (2.1GB = 4GB cost) | 2x higher bills than expected |
| Cold Start | "30-60 seconds" | 2+ minutes for images >1GB | API timeouts, user frustration |
| Ephemeral Storage | Up to 200GB | Deleted when task dies | Data loss, failed deployments |
| Subnet IPs | 1 IP per task | Causes scaling failures | Cannot scale beyond subnet capacity |

Common Failure Modes and Solutions

Subnet IP Exhaustion (Most Common Production Issue):

Symptom: "ENI allocation failed" errors during scaling
Cause: Each task consumes one subnet IP address
Solution: Use /20 subnets (4,091 IPs) minimum for production
Prevention: Monitor available IPs: aws ec2 describe-subnets --subnet-ids subnet-12345678 --query 'Subnets[0].AvailableIpAddressCount'

Container Pull Failures:

Root Cause 90% of cases: Security group blocks outbound HTTPS/HTTP
Quick Fix: Verify egress rules allow ports 443 and 80
IAM Permissions Required: ecr:GetAuthorizationToken, ecr:BatchCheckLayerAvailability, ecr:GetDownloadUrlForLayer, ecr:BatchGetImage

Platform Version Breaks:

Trigger: AWS migrates platform versions without warning
Impact: Deployment scripts fail, health checks break
Prevention: Pin platform version in production task definitions

Resource Requirements

Real Cost Analysis

Baseline Comparison (t3.medium equivalent):

EC2: $24/month
Fargate: $58/month (2.4x premium)
Hidden costs add 40% (data transfer, CloudWatch, load balancer)

Cost Optimization Strategies:

Fargate Spot: 70% savings but interrupts every 4-6 hours
Image optimization: 2.1GB → 280MB = 5x faster cold starts
Regional ECR: 2-3x faster pulls than cross-region

Resource Investment Timeline:

Learning curve: 2 weeks (ECS) vs 2-3 months (EKS)
Image optimization: 1-2 days engineering time
Production troubleshooting: Budget 20% more ops time initially

Performance Thresholds

Breaking Points:

Subnet capacity: 251 IPs per /24 subnet = max concurrent tasks
Cold start performance: >1GB images = 2+ minute starts
Memory efficiency: 2.1GB allocation pays for 4GB
Network performance: Throttled compared to dedicated EC2 instances

Scaling Limitations:

Target tracking autoscaling: 2-3 minute lag for CPU-based scaling
Manual intervention required for sudden traffic spikes
Minimum task count essential: 3-5 tasks for production APIs

Critical Warnings

What Official Documentation Doesn't Tell You

Networking Gotchas:

Every task eats a subnet IP (not mentioned in pricing docs)
Security groups apply to tasks, not instances (different mental model)
Private subnets require NAT Gateway or VPC endpoints ($32.40/month minimum)
Cross-AZ data transfer charges apply between tasks

Hidden Cost Traps:

CloudWatch Container Insights: $150/month for medium app
Log ingestion: $200/month for chatty applications
Data transfer: $0.01/GB adds up with microservices
Load balancer minimum: $16.43/month per ALB

Platform Reliability Issues:

Platform version migrations break deployments without warning
ARM64 images: Half of Docker ecosystem won't work
Fargate Spot: Interrupts more frequently than advertised (every 4-6 hours peak)

Breaking Points and Failure Modes

Immediate Deployment Blockers:

Subnet IP exhaustion during traffic spikes
IAM permission errors for ECR access
Security group misconfiguration blocking container pulls
Image size >1GB causing timeout failures

Financial Breaking Points:

Steady 24/7 workloads: EC2 is 2-3x cheaper
GPU workloads: Not supported on Fargate
High-performance computing: Network throttling makes it unusable
Windows containers: Slow, expensive, limited ecosystem

Operational Complexity:

EKS control plane: Additional $74/month per cluster
Custom kernels/system access: Not possible
Database workloads: Terrible I/O performance
Long-running batch jobs (>4 hours): EC2 Spot 70% cheaper

Decision Criteria

When Fargate Makes Sense

Microservices APIs: Independent scaling per service
Batch jobs: Sporadic workloads with unpredictable timing
Development environments: Spin up/tear down testing
Background processing: Using Fargate Spot for 70% savings

When Fargate Will Fail You

GPU workloads: Not supported
High-performance computing: Network limitations
Steady 24/7 workloads: 2-3x cost premium unjustifiable
Database hosting: Use RDS/DynamoDB instead
Windows containers: Slow and expensive

Implementation Readiness Checklist

Before Production Deployment:

Subnet capacity planning (use /20 minimum)
Image optimization (<500MB target)
Platform version pinning
Cost monitoring and alerts configured
Security group egress rules verified
IAM roles for ECR access configured

Production Monitoring Requirements:

Billing alerts at 50% and 80% of budget
Container Insights or third-party monitoring
VPC Flow Logs for network troubleshooting
ECS Exec enabled for runtime debugging

Troubleshooting Quick Reference

Common Error Messages and Solutions

"CannotPullContainerError":

Check security group egress (ports 443, 80)
Verify subnet routing (NAT Gateway for private subnets)
Confirm ECR IAM permissions

"Task placement failed":

Check subnet available IP count
Create larger subnets or spread across multiple subnets
Monitor for subnet exhaustion patterns

"Service scaling failed":

Verify autoscaling policy settings
Check for platform capacity limits
Consider using Fargate Spot for burst capacity

Performance Optimization Actions

Image Optimization (Critical for Cold Starts):

```dockerfile
FROM node:16-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci --only=production

FROM node:16-alpine
WORKDIR /app
COPY --from=builder /app/node_modules ./node_modules
COPY . .
# Faster than npm start
CMD ["node", "server.js"]
```

Network Performance:

Use ECR in same region (2-3x faster pulls)
Enable zstd compression for images
Configure VPC endpoints for ECR access in private subnets

This reference contains the operational intelligence needed for automated decision-making about AWS Fargate implementation, including specific breaking points, real costs, and production-ready configurations.

