How This Shit Actually Works (And Where It Breaks)

Look, the idea is simple: push to main, magic happens, your app runs in production. Reality is messier.

The Happy Path That Never Happens

You push code. GitHub Actions kicks off a workflow. Docker builds your image. ECR stores it. ECS deploys it. Your users are happy. You sleep through the night.

Here's what actually happens when you first set this up:

Week 1: Your Docker build fails because you forgot to add node_modules to .dockerignore and your image is 2GB. GitHub Actions times out after 6 hours trying to push it.

Week 2: Build works, but your ECS task dies immediately with exit code 1. The logs show "Error: Cannot find module 'express'" because your multi-stage build is too clever and deleted the wrong dependencies.

Week 3: App runs but can't connect to the database. Your task definition has the wrong security group. You spend 4 hours learning that ECS networking is about as intuitive as quantum physics.

Docker: The Part That Should Be Easy But Isn't

Multi-stage builds are great in theory. They reduce image size from 1.5GB to 200MB. They also introduce a dozen new ways to break your dependencies.

# This looks clean but will bite you
FROM node:18-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci --only=production

# Your app works locally but breaks here
FROM node:18-alpine AS production
COPY --from=builder /app/node_modules ./node_modules
COPY . .
CMD ["npm", "start"]

Pro tip: `npm ci --only=production` (spelled `--omit=dev` on npm 9+) skips devDependencies, which breaks builds that need TypeScript or other build tools. You'll discover this at 11pm when your supposedly "production-ready" image crashes because TypeScript isn't installed.
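
The fix is to let the builder stage install everything (devDependencies included), compile, then copy only what production needs. A sketch, assuming your package.json has a `build` script that compiles to dist/ and your entry point ends up at dist/index.js — adjust for your project:

# Build stage: full install, devDependencies included, then compile
FROM node:18-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

# Runtime stage: production deps plus compiled output, nothing else
FROM node:18-alpine AS production
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev
COPY --from=builder /app/dist ./dist
USER node
CMD ["node", "dist/index.js"]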

ECS: Where Your Deployment Goes to Die

ECS task definitions are XML-level verbose. A simple Node.js app needs 150+ lines of JSON to define CPU (256-4096 units, because apparently AWS engineers hate round numbers), memory (must be specific combinations or ECS throws a tantrum), and networking (good luck).

Real talk: Fargate costs 3x more than EC2 but saves your sanity. You'll pay the premium after spending a weekend debugging why your container can't resolve DNS on a custom EC2 cluster.

GitHub Actions: The Good News

The only part of this stack that doesn't hate you. Actions for AWS are actually well-maintained. OIDC authentication works. The official actions don't randomly break.

But here's what nobody tells you: your first workflow will take 45 minutes to run because Docker layer caching is disabled by default and you're rebuilding everything from scratch every time.
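
Turning caching on is a two-step fix. A sketch using docker/setup-buildx-action and docker/build-push-action with the GitHub Actions cache backend — the ECR tag here is a placeholder matching the workflow shown later:

    - uses: docker/setup-buildx-action@v3

    - name: Build and push with layer cache
      uses: docker/build-push-action@v5
      with:
        context: .
        push: true
        tags: ${{ secrets.AWS_ACCOUNT_ID }}.dkr.ecr.us-east-1.amazonaws.com/my-app:${{ github.sha }}
        cache-from: type=gha
        cache-to: type=gha,mode=max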

Where The Money Goes

GitHub Actions charges $0.008/minute for Linux runners. Sounds cheap until you realize your inefficient Docker builds consume 15 minutes per deployment — about $0.12 per deploy. Deploy 10 times a day and you're burning 4,500 minutes a month, which blows past the free tier and lands around $36/month in CI minutes, more on larger runners.

ECR costs sneak up on you. $0.10/GB/month sounds reasonable until you accumulate 50 old images because you didn't set up lifecycle policies — multi-gigabyte images add up fast, and your "free" container registry quietly becomes a recurring line item.

Fargate pricing is $0.04048/vCPU/hour plus $0.004445/GB/hour for memory (us-east-1). A small app (0.25 vCPU, 512MB RAM) costs roughly $9/month if it runs 24/7. Scale to handle real traffic and you're looking at $100+/month just for compute.

The Real Architecture

Here's what actually happens in production:

  1. Developer pushes to main at 5:47pm on Friday (why do we do this to ourselves?)
  2. GitHub Actions starts. Build time: 12 minutes because someone added a 500MB dependency
  3. ECR image scan finds 47 "critical" vulnerabilities in base OS packages you can't control
  4. Deployment succeeds but health checks fail. Task keeps restarting
  5. You debug for 2 hours, discover the container port is 3000 but load balancer expects 80
  6. Fix that, redeploy. Health checks pass but users get 500 errors
  7. Turns out your database connection string is wrong. Environment variable was DATABASE_URL, your code expects DB_URL
  8. By 8:30pm everything works. You promise yourself you'll never deploy on Friday again
  9. You deploy on Friday again next week

The good news? Once it works, it really works. The bad news? Getting there requires sacrificing several weekends to the AWS documentation gods.

The Actual Setup That Works (After You Fix Everything That Doesn't)

Skip the AWS console. Seriously. Use Terraform or you'll be clicking through 47 different screens every time you need to change a CPU limit. Here's what actually works after you've debugged everything twice.

Start With ECR Because That's Easy

Create your ECR repo and enable image scanning. The scanning will find 200 vulnerabilities in your base image, 199 of which you can't fix because they're in Ubuntu packages. You'll learn to ignore them.

# This works, unlike half the AWS CLI examples
aws ecr create-repository --repository-name my-app --image-scanning-configuration scanOnPush=true
aws ecr put-lifecycle-policy --repository-name my-app --lifecycle-policy-text file://lifecycle.json

Set up lifecycle policies immediately or your ECR bill will be $200 next month because you kept every single build image.
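
A minimal lifecycle.json that keeps only the ten most recent images (the count is an arbitrary starting point — tune it):

{
  "rules": [
    {
      "rulePriority": 1,
      "description": "Expire everything beyond the 10 most recent images",
      "selection": {
        "tagStatus": "any",
        "countType": "imageCountMoreThan",
        "countNumber": 10
      },
      "action": {"type": "expire"}
    }
  ]
}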

ECS Cluster Setup (The Part That Will Frustrate You)

Create a Fargate cluster. Don't use EC2 unless you enjoy troubleshooting networking at midnight. Fargate costs more but your mental health is worth it.

Task definitions are where AWS decided to make developers suffer. A production-grade definition for a simple Node.js app sprawls well past 100 lines of JSON once you add secrets, volumes, and health checks. Here's the minimal version that actually works:

{
  "family": "my-app",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "256",
  "memory": "512",
  "executionRoleArn": "arn:aws:iam::account:role/ecsTaskExecutionRole",
  "containerDefinitions": [
    {
      "name": "my-app",
      "image": "account.dkr.ecr.region.amazonaws.com/my-app:latest",
      "portMappings": [{"containerPort": 3000}],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/my-app",
          "awslogs-region": "us-east-1",
          "awslogs-stream-prefix": "ecs"
        }
      }
    }
  ]
}

Notice how the memory is in MB but CPU is in "units"? AWS engineers apparently hate consistency.
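
Not every pairing is legal, either. For the CPU range above, Fargate accepts roughly these combinations (memory in 1 GB steps once you're past the smallest sizes):

  • 256 CPU (0.25 vCPU): 512 MB, 1 GB, or 2 GB
  • 512 CPU (0.5 vCPU): 1-4 GB
  • 1024 CPU (1 vCPU): 2-8 GB
  • 2048 CPU (2 vCPU): 4-16 GB
  • 4096 CPU (4 vCPU): 8-30 GB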

The GitHub Action That Actually Deploys

Forget the marketplace actions that half-work. Here's a workflow that handles the edge cases:

name: Deploy
on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    permissions:
      id-token: write
      contents: read
    
    steps:
    - uses: actions/checkout@v4
    
    - name: Configure AWS
      uses: aws-actions/configure-aws-credentials@v4
      with:
        role-to-assume: arn:aws:iam::${{ secrets.AWS_ACCOUNT_ID }}:role/github-actions-role
        role-session-name: GitHubActions
        aws-region: us-east-1
    
    - name: Login to ECR
      run: |
        aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin ${{ secrets.AWS_ACCOUNT_ID }}.dkr.ecr.us-east-1.amazonaws.com
    
    - name: Build and push
      run: |
        docker build -t my-app:${{ github.sha }} .
        docker tag my-app:${{ github.sha }} ${{ secrets.AWS_ACCOUNT_ID }}.dkr.ecr.us-east-1.amazonaws.com/my-app:${{ github.sha }}
        docker tag my-app:${{ github.sha }} ${{ secrets.AWS_ACCOUNT_ID }}.dkr.ecr.us-east-1.amazonaws.com/my-app:latest
        docker push ${{ secrets.AWS_ACCOUNT_ID }}.dkr.ecr.us-east-1.amazonaws.com/my-app:${{ github.sha }}
        docker push ${{ secrets.AWS_ACCOUNT_ID }}.dkr.ecr.us-east-1.amazonaws.com/my-app:latest
    
    - name: Update ECS service
      run: |
        aws ecs update-service --cluster my-cluster --service my-service --force-new-deployment --region us-east-1
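
One step worth appending: update-service with --force-new-deployment returns immediately, so the job goes green even if the new tasks never stabilize. The ECS waiter makes the workflow fail when the deployment does (same placeholder cluster and service names):

    - name: Wait for service stability
      run: |
        aws ecs wait services-stable --cluster my-cluster --services my-service --region us-east-1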

Docker Builds That Don't Suck

Your Dockerfile probably looks like this and takes 15 minutes to build:

FROM node:18
WORKDIR /app
COPY . .
RUN npm install
CMD ["npm", "start"]

Here's one that builds in 3 minutes after the first run:

FROM node:18-alpine

WORKDIR /app

# Copy package files first for better caching
COPY package*.json ./
RUN npm ci --only=production && npm cache clean --force

# Copy app source
COPY . .

# Don't run as root
USER node

EXPOSE 3000
CMD ["npm", "start"]

Add a .dockerignore file or your image will be 2GB because you included node_modules, .git, and your entire download folder:

node_modules
.git
*.log
.DS_Store
coverage/
.nyc_output/

OIDC Setup (Do This Once, Correctly)

OIDC eliminates long-lived AWS keys in your repo. Create the GitHub OIDC identity provider in IAM once, then attach this trust policy to the role your workflow assumes:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::ACCOUNT:oidc-provider/token.actions.githubusercontent.com"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "token.actions.githubusercontent.com:aud": "sts.amazonaws.com"
        },
        "StringLike": {
          "token.actions.githubusercontent.com:sub": "repo:yourusername/yourrepo:*"
        }
      }
    }
  ]
}

Get this wrong and you'll get cryptic "AssumeRole failed" errors that take 3 hours to debug.
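
For the record, the one-time provider and role setup looks roughly like this — the trust-policy.json filename is whatever you saved the JSON above as, and the thumbprint is the widely documented DigiCert value (newer AWS validation largely ignores it, but older CLI versions still demand the flag):

# One-time: register GitHub's OIDC provider in your account
aws iam create-open-id-connect-provider \
  --url https://token.actions.githubusercontent.com \
  --client-id-list sts.amazonaws.com \
  --thumbprint-list 6938fd4d98bab03faadb97b34396831e3780aea1

# Create the role with the trust policy above, then attach your ECR/ECS permissions
aws iam create-role \
  --role-name github-actions-role \
  --assume-role-policy-document file://trust-policy.json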

The Gotchas That Will Ruin Your Day

Health checks are not optional. ECS will restart your container every 30 seconds if health checks fail. Add this to your Express app:

app.get('/health', (req, res) => res.status(200).send('OK'));

Environment variable values must be strings in task definitions. This will break:

"environment": [
  {"name": "PORT", "value": 3000}  // Wrong! Must be string
]

This works:

"environment": [
  {"name": "PORT", "value": "3000"}  // String, not number
]

Security groups matter. Your container can't reach the internet if the security group blocks outbound traffic. Learn this the hard way when your app can't connect to external APIs.

Resource limits are enforced. Cap the Node.js heap or it'll eat all available RAM and get OOM-killed by ECS. Setting process.env.NODE_OPTIONS from inside your own code does nothing once the process is already running — set it in the task definition instead:

"environment": [
  {"name": "NODE_OPTIONS", "value": "--max-old-space-size=400"}
]

Deployment Strategies That Work

Rolling deployments are fine for most apps. Don't overcomplicate with blue-green unless you're running a bank.

Set your deployment configuration to:

  • Maximum percent: 200% (allows new tasks before killing old ones)
  • Minimum healthy percent: 100% (ensures zero downtime)
  • Health check grace period: 300 seconds (your app needs time to start)
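
Applied from the CLI, that comes out to the following (placeholder cluster and service names again; the grace period only applies when the service sits behind a load balancer):

aws ecs update-service \
  --cluster my-cluster \
  --service my-service \
  --deployment-configuration maximumPercent=200,minimumHealthyPercent=100 \
  --health-check-grace-period-seconds 300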

Monitoring That Matters

CloudWatch Logs are included. Set up log aggregation or you'll be grep-ing through individual log streams like a caveman.

Create CloudWatch alarms for:

  • Task count drops below desired (your app crashed)
  • CPU above 80% (time to scale)
  • Memory above 85% (about to get killed)
  • Error rate above 5% (something's broken)

Skip fancy monitoring until you have the basics working. You'll have plenty of real errors to debug first.
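
The CPU alarm from that list is one CLI call against the default AWS/ECS service metrics — a sketch assuming an existing SNS topic (the ARN is a placeholder):

aws cloudwatch put-metric-alarm \
  --alarm-name my-app-cpu-high \
  --namespace AWS/ECS \
  --metric-name CPUUtilization \
  --dimensions Name=ClusterName,Value=my-cluster Name=ServiceName,Value=my-service \
  --statistic Average \
  --period 300 \
  --evaluation-periods 2 \
  --threshold 80 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:my-alerts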

Timeline: How Long This Really Takes

  • Week 1: Get Docker building locally. Fight with dependencies.
  • Week 2: Get GitHub Actions pushing to ECR. Debug OIDC permissions.
  • Week 3: Get ECS task running. Discover security group issues.
  • Week 4: Get health checks passing. Fix environment variables.
  • Week 5: Production deploy works. Realize you forgot about database migrations.
  • Week 6: Add monitoring. Get paged at 2am because of false alarms.

Budget 6 weeks minimum. Anyone who says they did it in a day either had help or is lying.

Deployment Strategies: What Actually Happens vs What AWS Docs Say

| Strategy | What AWS Says | Reality | When Your App Breaks | Recommendation |
|----------|---------------|---------|----------------------|----------------|
| Rolling | "Gradual replacement" | Half your traffic hits new code immediately | Health checks fail, ECS kills everything | Use this unless you're Netflix |
| Blue-Green | "Zero downtime" | Costs 2x for 10 minutes, then works perfectly | Database migrations break everything | Skip unless you're actually handling money |
| Canary | "Risk mitigation" | You'll spend more time configuring than deploying | 5% of users get a broken experience | Only if you have a dedicated DevOps team |
| Recreate | "Simple strategy" | Your site is down for 3 minutes | Users notice, support tickets get filed | Never use this in production |

Questions Real Developers Actually Ask (And Honest Answers)

Q: Why does my Docker build work locally but fail in GitHub Actions?

A: Your local Docker setup probably has different layer caching, different architecture (ARM vs x86), or you're relying on files that aren't in your git repo. Common culprits:

  • Missing .dockerignore (includes .git and kills the build)
  • Hard-coded paths that work on macOS but break on Linux
  • Dependency on local environment variables
  • Multi-platform build issues (M1 Mac vs Intel runners)

Copy this to fix 90% of cases:

docker build --platform linux/amd64 -t my-app .

Q: My ECS task keeps dying with exit code 1. How do I figure out what's wrong?

A: ECS error messages are useless on purpose. Check CloudWatch logs first:

aws logs describe-log-groups --log-group-name-prefix "/ecs/"
aws logs get-log-events --log-group-name "/ecs/my-app" --log-stream-name "ecs/my-app/task-id"

Common causes:

  • Port mismatch: App listens on 3000, task definition expects 80
  • Missing environment variables: process.env.DATABASE_URL is undefined
  • Wrong working directory: App expects files in /usr/src/app, Dockerfile uses /app
  • Permission issues: Running as root locally, restricted in container

Q: How do I stop GitHub Actions from eating my wallet?

A: Your builds are probably inefficient. First things to fix:

  1. Add Docker layer caching (saves 5-10 minutes per build)
  2. Use npm ci instead of npm install (faster, predictable)
  3. Don't rebuild unchanged dependencies (structure your Dockerfile properly)
  4. Cache node_modules in GitHub Actions

Here's a build that doesn't suck:

- uses: actions/cache@v3
  with:
    path: ~/.npm
    key: ${{ runner.os }}-node-${{ hashFiles('**/package-lock.json') }}

Q: My app works but users can't reach it. What's wrong with the networking?

A: AWS networking is designed to make you suffer. Check these in order:

  1. Security groups: ECS task needs outbound rules for internet access
  2. Load balancer target groups: Health check path must exist (/health)
  3. Subnet routing: Public subnets for ALB, private for ECS tasks
  4. NAT Gateway: Private subnets need internet access for outbound calls

Quick health check endpoint that actually works:

app.get('/health', (req, res) => {
  res.status(200).json({ status: 'ok', timestamp: new Date().toISOString() });
});

Q: Database migrations keep breaking my deployments. What's the right way?

A: Don't run migrations in your app container. Ever. Create a separate task definition for migrations:

# run-task has no --wait flag; capture the task ARN and use the waiter
TASK_ARN=$(aws ecs run-task \
  --cluster my-cluster \
  --task-definition migration-task \
  --query 'tasks[0].taskArn' --output text)
aws ecs wait tasks-stopped --cluster my-cluster --tasks "$TASK_ARN"
# Non-zero exit code means the migration failed
aws ecs describe-tasks --cluster my-cluster --tasks "$TASK_ARN" \
  --query 'tasks[0].containers[0].exitCode'

If migration fails, your deployment fails. If migration succeeds, then deploy the app. Takes longer but saves you from the "app is up but database is fucked" scenario.

Q: Why is my AWS bill $400 when I thought ECS was cheap?

A: You probably have:

  • 50 old ECR images billed at $0.10/GB/month (set up lifecycle policies)
  • Load balancer running 24/7 ($16/month whether you use it or not)
  • Fargate compute for oversized containers (1GB RAM when you need 256MB)
  • Cross-AZ data transfer charges (put everything in one AZ for dev)
  • CloudWatch Logs retention set to "never expire" (change to 30 days)

Use AWS Cost Explorer to find what's actually costing money.
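
The log retention fix, at least, is a one-liner — assuming the /ecs/my-app log group from earlier:

aws logs put-retention-policy --log-group-name "/ecs/my-app" --retention-in-days 30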

Q: How do I deploy to staging and production without manual steps?

A: Set up GitHub Environments with different approval rules:

jobs:
  deploy-staging:
    environment: staging
    if: github.ref == 'refs/heads/main'
    # Auto-deploys on push to main
  
  deploy-production:
    environment: production
    if: github.ref == 'refs/heads/main'
    needs: deploy-staging
    # Requires manual approval

Staging deploys automatically. Production requires someone to click "Approve" in GitHub. Don't skip the approval step unless you want to debug production issues at 2am.

Q: My deployment succeeds but the app returns 500 errors. How do I debug this?

A: ECS doesn't care if your app works, just if the container runs. Check application logs in CloudWatch:

# Get the latest log stream
aws logs describe-log-streams --log-group-name "/ecs/my-app" --order-by LastEventTime --descending --max-items 1

# Read the logs
aws logs get-log-events --log-group-name "/ecs/my-app" --log-stream-name "stream-name"

Common issues:

  • Wrong environment variables: NODE_ENV=production but config expects NODE_ENV=prod
  • Database connection failures: Wrong security group, wrong connection string
  • Missing dependencies: Production build stripped dev dependencies your app actually needs
  • File system permissions: Can't write to /tmp or read config files

Q: Can I use this setup for a side project or is it overkill?

A: It's probably overkill. For side projects, consider:

  • Render.com: $7/month, connects to GitHub, just works
  • Railway: Similar to Render, good free tier
  • Vercel/Netlify: For static sites and serverless
  • Fly.io: More control, reasonable pricing

Only use ECS if you need:

  • Fine-grained control over container orchestration
  • Integration with other AWS services
  • Complex networking requirements
  • Experience with production-grade deployments

For a basic web app, ECS is like using a freight truck to deliver pizza.
