How This Shit Actually Works (And Where It Breaks)

Look, the idea is simple: push to main, magic happens, your app runs in production. Reality is messier.

The Happy Path That Never Happens

You push code. GitHub Actions kicks off a workflow. Docker builds your image. ECR stores it. ECS deploys it. Your users are happy. You sleep through the night.

Here's what actually happens when you first set this up:

Week 1: Your Docker build fails because you forgot to add node_modules to .dockerignore and your image is 2GB. GitHub Actions times out after 6 hours trying to push it.

Week 2: Build works, but your ECS task dies immediately with exit code 1. The logs show "Error: Cannot find module 'express'" because your multi-stage build is too clever and deleted the wrong dependencies.

Week 3: App runs but can't connect to the database. Your task definition has the wrong security group. You spend 4 hours learning that ECS networking is about as intuitive as quantum physics.

Docker: The Part That Should Be Easy But Isn't

Multi-stage builds are great in theory. They reduce image size from 1.5GB to 200MB. They also introduce a dozen new ways to break your dependencies.

# This looks clean but will bite you
FROM node:18-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci --only=production

# Your app works locally but breaks here
FROM node:18-alpine AS production
COPY --from=builder /app/node_modules ./node_modules
COPY . .
CMD ["npm", "start"]

Pro tip: `npm ci --only=production` (spelled `--omit=dev` on npm 9+) skips devDependencies, which breaks builds that need TypeScript or other build tools. You'll discover this at 11pm when your supposedly "production-ready" image crashes because TypeScript isn't installed.
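
The fix is to let the builder stage install everything (devDependencies included), compile, then copy only what production needs. A sketch, assuming your package.json has a `build` script that compiles to dist/ and your entry point ends up at dist/index.js — adjust for your project:

# Build stage: full install, devDependencies included, then compile
FROM node:18-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

# Runtime stage: production deps plus compiled output, nothing else
FROM node:18-alpine AS production
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev
COPY --from=builder /app/dist ./dist
USER node
CMD ["node", "dist/index.js"]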

ECS: Where Your Deployment Goes to Die

ECS task definitions are XML-level verbose. A simple Node.js app needs 150+ lines of JSON to define CPU (256-4096 units, because apparently AWS engineers hate round numbers), memory (must be specific combinations or ECS throws a tantrum), and networking (good luck).

Real talk: Fargate costs 3x more than EC2 but saves your sanity. You'll pay the premium after spending a weekend debugging why your container can't resolve DNS on a custom EC2 cluster.

GitHub Actions: The Good News

The only part of this stack that doesn't hate you. Actions for AWS are actually well-maintained. OIDC authentication works. The official actions don't randomly break.

But here's what nobody tells you: your first workflow will take 45 minutes to run because Docker layer caching is disabled by default and you're rebuilding everything from scratch every time.
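
Turning caching on is a two-step fix. A sketch using docker/setup-buildx-action and docker/build-push-action with the GitHub Actions cache backend — the ECR tag here is a placeholder matching the workflow shown later:

    - uses: docker/setup-buildx-action@v3

    - name: Build and push with layer cache
      uses: docker/build-push-action@v5
      with:
        context: .
        push: true
        tags: ${{ secrets.AWS_ACCOUNT_ID }}.dkr.ecr.us-east-1.amazonaws.com/my-app:${{ github.sha }}
        cache-from: type=gha
        cache-to: type=gha,mode=max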

Where The Money Goes

GitHub Actions charges $0.008/minute for Linux runners. Sounds cheap until you realize your inefficient Docker builds consume 15 minutes per deployment — about $0.12 per deploy. Deploy 10 times a day and you're burning 4,500 minutes a month, which blows past the free tier and lands around $36/month in CI minutes, more on larger runners.

ECR costs sneak up on you. $0.10/GB/month sounds reasonable until you accumulate 50 old images because you didn't set up lifecycle policies — multi-gigabyte images add up fast, and your "free" container registry quietly becomes a recurring line item.

Fargate pricing is $0.04048/vCPU/hour plus $0.004445/GB/hour for memory (us-east-1). A small app (0.25 vCPU, 512MB RAM) costs roughly $9/month if it runs 24/7. Scale to handle real traffic and you're looking at $100+/month just for compute.

The Real Architecture

Here's what actually happens in production:

  1. Developer pushes to main at 5:47pm on Friday (why do we do this to ourselves?)
  2. GitHub Actions starts. Build time: 12 minutes because someone added a 500MB dependency
  3. ECR image scan finds 47 "critical" vulnerabilities in base OS packages you can't control
  4. Deployment succeeds but health checks fail. Task keeps restarting
  5. You debug for 2 hours, discover the container port is 3000 but load balancer expects 80
  6. Fix that, redeploy. Health checks pass but users get 500 errors
  7. Turns out your database connection string is wrong. Environment variable was DATABASE_URL, your code expects DB_URL
  8. By 8:30pm everything works. You promise yourself you'll never deploy on Friday again
  9. You deploy on Friday again next week

The good news? Once it works, it really works. The bad news? Getting there requires sacrificing several weekends to the AWS documentation gods.

The Actual Setup That Works (After You Fix Everything That Doesn't)

Skip the AWS console. Seriously. Use Terraform or you'll be clicking through 47 different screens every time you need to change a CPU limit. Here's what actually works after you've debugged everything twice.

Start With ECR Because That's Easy

Create your ECR repo and enable image scanning. The scanning will find 200 vulnerabilities in your base image, 199 of which you can't fix because they're in Ubuntu packages. You'll learn to ignore them.

# This works, unlike half the AWS CLI examples
aws ecr create-repository --repository-name my-app --image-scanning-configuration scanOnPush=true
aws ecr put-lifecycle-policy --repository-name my-app --lifecycle-policy-text file://lifecycle.json

Set up lifecycle policies immediately or your ECR bill will be $200 next month because you kept every single build image.
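
A minimal lifecycle.json that keeps only the ten most recent images (the count is an arbitrary starting point — tune it):

{
  "rules": [
    {
      "rulePriority": 1,
      "description": "Expire everything beyond the 10 most recent images",
      "selection": {
        "tagStatus": "any",
        "countType": "imageCountMoreThan",
        "countNumber": 10
      },
      "action": {"type": "expire"}
    }
  ]
}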

ECS Cluster Setup (The Part That Will Frustrate You)

Create a Fargate cluster. Don't use EC2 unless you enjoy troubleshooting networking at midnight. Fargate costs more but your mental health is worth it.

Task definitions are where AWS decided to make developers suffer. A production-grade definition for a simple Node.js app sprawls well past 100 lines of JSON once you add secrets, volumes, and health checks. Here's the minimal version that actually works:

{
  "family": "my-app",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "256",
  "memory": "512",
  "executionRoleArn": "arn:aws:iam::account:role/ecsTaskExecutionRole",
  "containerDefinitions": [
    {
      "name": "my-app",
      "image": "account.dkr.ecr.region.amazonaws.com/my-app:latest",
      "portMappings": [{"containerPort": 3000}],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/my-app",
          "awslogs-region": "us-east-1",
          "awslogs-stream-prefix": "ecs"
        }
      }
    }
  ]
}

Notice how the memory is in MB but CPU is in "units"? AWS engineers apparently hate consistency.
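
Not every pairing is legal, either. For the CPU range above, Fargate accepts roughly these combinations (memory in 1 GB steps once you're past the smallest sizes):

  • 256 CPU (0.25 vCPU): 512 MB, 1 GB, or 2 GB
  • 512 CPU (0.5 vCPU): 1-4 GB
  • 1024 CPU (1 vCPU): 2-8 GB
  • 2048 CPU (2 vCPU): 4-16 GB
  • 4096 CPU (4 vCPU): 8-30 GB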

The GitHub Action That Actually Deploys

Forget the marketplace actions that half-work. Here's a workflow that handles the edge cases:

name: Deploy
on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    permissions:
      id-token: write
      contents: read
    
    steps:
    - uses: actions/checkout@v4
    
    - name: Configure AWS
      uses: aws-actions/configure-aws-credentials@v4
      with:
        role-to-assume: arn:aws:iam::${{ secrets.AWS_ACCOUNT_ID }}:role/github-actions-role
        role-session-name: GitHubActions
        aws-region: us-east-1
    
    - name: Login to ECR
      run: |
        aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin ${{ secrets.AWS_ACCOUNT_ID }}.dkr.ecr.us-east-1.amazonaws.com
    
    - name: Build and push
      run: |
        docker build -t my-app:${{ github.sha }} .
        docker tag my-app:${{ github.sha }} ${{ secrets.AWS_ACCOUNT_ID }}.dkr.ecr.us-east-1.amazonaws.com/my-app:${{ github.sha }}
        docker tag my-app:${{ github.sha }} ${{ secrets.AWS_ACCOUNT_ID }}.dkr.ecr.us-east-1.amazonaws.com/my-app:latest
        docker push ${{ secrets.AWS_ACCOUNT_ID }}.dkr.ecr.us-east-1.amazonaws.com/my-app:${{ github.sha }}
        docker push ${{ secrets.AWS_ACCOUNT_ID }}.dkr.ecr.us-east-1.amazonaws.com/my-app:latest
    
    - name: Update ECS service
      run: |
        aws ecs update-service --cluster my-cluster --service my-service --force-new-deployment --region us-east-1
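
One step worth appending: update-service with --force-new-deployment returns immediately, so the job goes green even if the new tasks never stabilize. The ECS waiter makes the workflow fail when the deployment does (same placeholder cluster and service names):

    - name: Wait for service stability
      run: |
        aws ecs wait services-stable --cluster my-cluster --services my-service --region us-east-1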

Docker Builds That Don't Suck

Your Dockerfile probably looks like this and takes 15 minutes to build:

FROM node:18
WORKDIR /app
COPY . .
RUN npm install
CMD ["npm", "start"]

Here's one that builds in 3 minutes after the first run:

FROM node:18-alpine

WORKDIR /app

# Copy package files first for better caching
COPY package*.json ./
RUN npm ci --only=production && npm cache clean --force

# Copy app source
COPY . .

# Don't run as root
USER node

EXPOSE 3000
CMD ["npm", "start"]

Add a .dockerignore file or your image will be 2GB because you included node_modules, .git, and your entire download folder:

node_modules
.git
*.log
.DS_Store
coverage/
.nyc_output/

OIDC Setup (Do This Once, Correctly)

OIDC eliminates long-lived AWS keys in your repo. Create the GitHub OIDC identity provider in IAM once, then attach this trust policy to the role your workflow assumes:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::ACCOUNT:oidc-provider/token.actions.githubusercontent.com"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "token.actions.githubusercontent.com:aud": "sts.amazonaws.com"
        },
        "StringLike": {
          "token.actions.githubusercontent.com:sub": "repo:yourusername/yourrepo:*"
        }
      }
    }
  ]
}

Get this wrong and you'll get cryptic "AssumeRole failed" errors that take 3 hours to debug.
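
For the record, the one-time provider and role setup looks roughly like this — the trust-policy.json filename is whatever you saved the JSON above as, and the thumbprint is the widely documented DigiCert value (newer AWS validation largely ignores it, but older CLI versions still demand the flag):

# One-time: register GitHub's OIDC provider in your account
aws iam create-open-id-connect-provider \
  --url https://token.actions.githubusercontent.com \
  --client-id-list sts.amazonaws.com \
  --thumbprint-list 6938fd4d98bab03faadb97b34396831e3780aea1

# Create the role with the trust policy above, then attach your ECR/ECS permissions
aws iam create-role \
  --role-name github-actions-role \
  --assume-role-policy-document file://trust-policy.json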

The Gotchas That Will Ruin Your Day

Health checks are not optional. ECS will restart your container every 30 seconds if health checks fail. Add this to your Express app:

app.get('/health', (req, res) => res.status(200).send('OK'));

Environment variable values must be strings in task definitions. This will break:

"environment": [
  {"name": "PORT", "value": 3000}  // Wrong! Must be string
]

This works:

"environment": [
  {"name": "PORT", "value": "3000"}  // String, not number
]

Security groups matter. Your container can't reach the internet if the security group blocks outbound traffic. Learn this the hard way when your app can't connect to external APIs.

Resource limits are enforced. Cap the Node.js heap or it'll eat all available RAM and get OOM-killed by ECS. Setting process.env.NODE_OPTIONS from inside your own code does nothing once the process is already running — set it in the task definition instead:

"environment": [
  {"name": "NODE_OPTIONS", "value": "--max-old-space-size=400"}
]

Deployment Strategies That Work

Rolling deployments are fine for most apps. Don't overcomplicate with blue-green unless you're running a bank.

Set your deployment configuration to:

  • Maximum percent: 200% (allows new tasks before killing old ones)
  • Minimum healthy percent: 100% (ensures zero downtime)
  • Health check grace period: 300 seconds (your app needs time to start)
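
Applied from the CLI, that comes out to the following (placeholder cluster and service names again; the grace period only applies when the service sits behind a load balancer):

aws ecs update-service \
  --cluster my-cluster \
  --service my-service \
  --deployment-configuration maximumPercent=200,minimumHealthyPercent=100 \
  --health-check-grace-period-seconds 300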

Monitoring That Matters

CloudWatch Logs are included. Set up log aggregation or you'll be grep-ing through individual log streams like a caveman.

Create CloudWatch alarms for:

  • Task count drops below desired (your app crashed)
  • CPU above 80% (time to scale)
  • Memory above 85% (about to get killed)
  • Error rate above 5% (something's broken)

Skip fancy monitoring until you have the basics working. You'll have plenty of real errors to debug first.
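
The CPU alarm from that list is one CLI call against the default AWS/ECS service metrics — a sketch assuming an existing SNS topic (the ARN is a placeholder):

aws cloudwatch put-metric-alarm \
  --alarm-name my-app-cpu-high \
  --namespace AWS/ECS \
  --metric-name CPUUtilization \
  --dimensions Name=ClusterName,Value=my-cluster Name=ServiceName,Value=my-service \
  --statistic Average \
  --period 300 \
  --evaluation-periods 2 \
  --threshold 80 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:my-alerts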

Timeline: How Long This Really Takes

  • Week 1: Get Docker building locally. Fight with dependencies.
  • Week 2: Get GitHub Actions pushing to ECR. Debug OIDC permissions.
  • Week 3: Get ECS task running. Discover security group issues.
  • Week 4: Get health checks passing. Fix environment variables.
  • Week 5: Production deploy works. Realize you forgot about database migrations.
  • Week 6: Add monitoring. Get paged at 2am because of false alarms.

Budget 6 weeks minimum. Anyone who says they did it in a day either had help or is lying.

Deployment Strategies: What Actually Happens vs What AWS Docs Say

| Strategy | What AWS Says | Reality | When Your App Breaks | Recommendation |
|----------|---------------|---------|----------------------|----------------|
| Rolling | "Gradual replacement" | Half your traffic hits new code immediately | Health checks fail, ECS kills everything | Use this unless you're Netflix |
| Blue-Green | "Zero downtime" | Costs 2x for 10 minutes, then works perfectly | Database migrations break everything | Skip unless you're actually handling money |
| Canary | "Risk mitigation" | You'll spend more time configuring than deploying | 5% of users get a broken experience | Only if you have a dedicated DevOps team |
| Recreate | "Simple strategy" | Your site is down for 3 minutes | Users notice, support tickets get filed | Never use this in production |

Questions Real Developers Actually Ask (And Honest Answers)

Q: Why does my Docker build work locally but fail in GitHub Actions?

A: Your local Docker setup probably has different layer caching, different architecture (ARM vs x86), or you're relying on files that aren't in your git repo. Common culprits:

  • Missing .dockerignore (includes .git and kills the build)
  • Hard-coded paths that work on macOS but break on Linux
  • Dependency on local environment variables
  • Multi-platform build issues (M1 Mac vs Intel runners)

Copy this to fix 90% of cases:

docker build --platform linux/amd64 -t my-app .

Q: My ECS task keeps dying with exit code 1. How do I figure out what's wrong?

A: ECS error messages are useless on purpose. Check CloudWatch logs first:

aws logs describe-log-groups --log-group-name-prefix "/ecs/"
aws logs get-log-events --log-group-name "/ecs/my-app" --log-stream-name "ecs/my-app/task-id"

Common causes:

  • Port mismatch: App listens on 3000, task definition expects 80
  • Missing environment variables: process.env.DATABASE_URL is undefined
  • Wrong working directory: App expects files in /usr/src/app, Dockerfile uses /app
  • Permission issues: Running as root locally, restricted in container

Q: How do I stop GitHub Actions from eating my wallet?

A: Your builds are probably inefficient. First things to fix:

  1. Add Docker layer caching (saves 5-10 minutes per build)
  2. Use npm ci instead of npm install (faster, predictable)
  3. Don't rebuild unchanged dependencies (structure your Dockerfile properly)
  4. Cache node_modules in GitHub Actions

Here's a build that doesn't suck:

- uses: actions/cache@v3
  with:
    path: ~/.npm
    key: ${{ runner.os }}-node-${{ hashFiles('**/package-lock.json') }}

Q: My app works but users can't reach it. What's wrong with the networking?

A: AWS networking is designed to make you suffer. Check these in order:

  1. Security groups: ECS task needs outbound rules for internet access
  2. Load balancer target groups: Health check path must exist (/health)
  3. Subnet routing: Public subnets for ALB, private for ECS tasks
  4. NAT Gateway: Private subnets need internet access for outbound calls

Quick health check endpoint that actually works:

app.get('/health', (req, res) => {
  res.status(200).json({ status: 'ok', timestamp: new Date().toISOString() });
});

Q: Database migrations keep breaking my deployments. What's the right way?

A: Don't run migrations in your app container. Ever. Create a separate task definition for migrations:

# run-task has no --wait flag; capture the task ARN and use the waiter
TASK_ARN=$(aws ecs run-task \
  --cluster my-cluster \
  --task-definition migration-task \
  --query 'tasks[0].taskArn' --output text)
aws ecs wait tasks-stopped --cluster my-cluster --tasks "$TASK_ARN"
# Non-zero exit code means the migration failed
aws ecs describe-tasks --cluster my-cluster --tasks "$TASK_ARN" \
  --query 'tasks[0].containers[0].exitCode'

If migration fails, your deployment fails. If migration succeeds, then deploy the app. Takes longer but saves you from the "app is up but database is fucked" scenario.

Q: Why is my AWS bill $400 when I thought ECS was cheap?

A: You probably have:

  • 50 old ECR images billed at $0.10/GB/month (set up lifecycle policies)
  • Load balancer running 24/7 ($16/month whether you use it or not)
  • Fargate compute for oversized containers (1GB RAM when you need 256MB)
  • Cross-AZ data transfer charges (put everything in one AZ for dev)
  • CloudWatch Logs retention set to "never expire" (change to 30 days)

Use AWS Cost Explorer to find what's actually costing money.
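
The log retention fix, at least, is a one-liner — assuming the /ecs/my-app log group from earlier:

aws logs put-retention-policy --log-group-name "/ecs/my-app" --retention-in-days 30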

Q: How do I deploy to staging and production without manual steps?

A: Set up GitHub Environments with different approval rules:

jobs:
  deploy-staging:
    environment: staging
    if: github.ref == 'refs/heads/main'
    # Auto-deploys on push to main
  
  deploy-production:
    environment: production
    if: github.ref == 'refs/heads/main'
    needs: deploy-staging
    # Requires manual approval

Staging deploys automatically. Production requires someone to click "Approve" in GitHub. Don't skip the approval step unless you want to debug production issues at 2am.

Q: My deployment succeeds but the app returns 500 errors. How do I debug this?

A: ECS doesn't care if your app works, just if the container runs. Check application logs in CloudWatch:

# Get the latest log stream
aws logs describe-log-streams --log-group-name "/ecs/my-app" --order-by LastEventTime --descending --max-items 1

# Read the logs
aws logs get-log-events --log-group-name "/ecs/my-app" --log-stream-name "stream-name"

Common issues:

  • Wrong environment variables: NODE_ENV=production but config expects NODE_ENV=prod
  • Database connection failures: Wrong security group, wrong connection string
  • Missing dependencies: Production build stripped dev dependencies your app actually needs
  • File system permissions: Can't write to /tmp or read config files

Q: Can I use this setup for a side project or is it overkill?

A: It's probably overkill. For side projects, consider:

  • Render.com: $7/month, connects to GitHub, just works
  • Railway: Similar to Render, good free tier
  • Vercel/Netlify: For static sites and serverless
  • Fly.io: More control, reasonable pricing

Only use ECS if you need:

  • Fine-grained control over container orchestration
  • Integration with other AWS services
  • Complex networking requirements
  • Experience with production-grade deployments

For a basic web app, ECS is like using a freight truck to deliver pizza.
