The 3AM Pipeline Debugging Hierarchy (What to Check First)

When your build is broken and everyone's breathing down your neck, here's the order that actually saves time:

1. Is It "Works on My Machine" Syndrome?

The most infuriating and common cause of pipeline failures. Your code runs perfectly locally but dies in CI with errors that make no fucking sense.

Quick test: Can you reproduce the failure locally? If not, it's an environment difference, and you're in for a long night.

## Copy the exact environment from CI
docker run -it --rm \
  -e NODE_ENV=production \
  -e DATABASE_URL=postgres://user:pass@db:5432/myapp \
  node:18-alpine \
  sh

Common gotchas I've learned the hard way:

  • Case sensitivity: Your MacBook doesn't care about file case, Linux does. import './Component' works locally but fails in CI when the file is component.tsx
  • Path separators: Windows uses \, everything else uses /. Hardcoded paths break across platforms
  • Environment variables: Your .env file works locally but CI doesn't have access to it
  • Node version drift: You're running Node 18.2.0 locally, CI is using 18.1.0, and that patch version broke something
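
A sixty-second sanity check covers most of those. Run the same commands locally and in a CI step, then diff the output (a minimal sketch):

## "true" means your filesystem is case-insensitive, so case bugs can hide locally
git config core.ignorecase

## Dump the environment in both places and diff the two files
env | sort > /tmp/env-local.txt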

Last month I spent 4 hours debugging why our tests passed locally but failed in CI with Cannot resolve module '@/components/Button'. Turns out the CI image was missing the TypeScript path mapping configuration. The fix was adding tsconfig.json to our Docker build context.

One line in the Dockerfile. One fucking line. Four hours of my life I'll never get back.

But you know what? Finding that bug felt incredible. Like solving a puzzle that's been taunting you. That moment when the build finally goes green after hours of red failures? Pure dopamine hit. This is why we do this job.

Recent gotcha: Newer Node versions sometimes change how module resolution works. If you're getting ERR_MODULE_NOT_FOUND errors that work fine in older versions, check if your CI is pulling a different Node version than expected:

## Check exactly what Node version CI is using
node --version
npm --version

## If using nvm in CI, pin the version
echo "22.9.0" > .nvmrc
nvm use

2. Check the Obvious Shit First

Has anything changed recently? I know, I know, "nothing changed" - but check anyway:

  • Package updates in the last 48 hours
  • Environment variable changes
  • Infrastructure updates (Node version, base Docker images)
  • New team members who might have merged something

## Check recent commits that might be cursed
git log --oneline --since="2 days ago"

## See what packages changed
git diff HEAD~5 package.json
git diff HEAD~5 package-lock.json

Are the dependencies actually installing?

  • npm install hanging for 30+ minutes usually means network issues or registry problems
  • pip install failing with SSL errors means your base image is too old
  • bundle install timing out means RubyGems is having a bad day
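
A quick way to tell whether it's the registry or your lockfile (a sketch using npm, but the same idea applies to pip and bundler):

## Round-trip to whatever registry npm is actually configured to use
npm ping

## A cached, audit-free install should be fast; if it still crawls, suspect the lockfile
time npm ci --prefer-offline --no-audit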

3. Resource Limits: The Silent Killer

Your pipeline was working fine until your codebase grew or your test suite got bigger. Now builds randomly fail with unhelpful error messages.

Memory exhaustion symptoms:

  • JavaScript heap out of memory during webpack builds
  • Killed with no other context (Linux OOM killer)
  • Tests that pass individually but fail when run together
  • Docker builds that hang during npm install

## Check if you're hitting memory limits
docker stats --no-stream

## Increase Node.js heap size for builds
export NODE_OPTIONS="--max-old-space-size=4096"

I learned this the hard way when our webpack build started failing after adding React Query to our app. The build would get to 90% compilation and just... die. Killed. No error message, no stack trace, no helpful context.

Six fucking hours. Six hours of my Saturday debugging this while my family went to the beach without me. Tried different Node versions, cleared all caches, rebuilt the Docker image from scratch. Nothing.

Finally ran docker stats and saw memory usage spike to 2GB and flatline. The CI runner was silently OOM-killing the process. Bumped memory to 4GB in our GitHub Actions config and boom - build worked perfectly.
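
If you suspect the same silent OOM kill, two signals confirm it (a sketch - dmesg usually needs sudo on CI runners):

## The kernel logs every OOM kill - look for your build process here
sudo dmesg | grep -iE 'out of memory|oom-kill|killed process' | tail -5
## The other giveaway: the build step exits with code 137 (128 + SIGKILL)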

Sometimes I hate this job. But sometimes it's kind of beautiful how simple the solution is once you find it.

4. Network Connectivity Hell

Your CI environment might not have access to the same resources as your development machine.

Common network failures:

  • Corporate firewalls blocking package registries
  • DNS resolution issues in containerized environments
  • Registry authentication failures
  • Proxy configuration problems

## Test network connectivity from your CI environment
curl -I https://registry.npmjs.org/
nslookup github.com
ping google.com

The registry authentication nightmare: Your personal access token works locally but fails in CI because you're using a different account or the token has different permissions.
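
A quick check of which identity CI is actually using (a sketch - NPM_TOKEN stands in for whatever secret your pipeline injects):

## Ask the registry who it thinks you are
npm whoami --registry=https://registry.npmjs.org/

## Or hit the whoami endpoint directly with the CI token
curl -s -H "Authorization: Bearer $NPM_TOKEN" https://registry.npmjs.org/-/whoami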

5. Timing and Race Conditions

Some failures only happen under specific timing conditions that are more likely in CI environments.

Classic race conditions:

  • Tests that depend on database seeds not being fully applied
  • File system operations that assume synchronous completion
  • Network requests without proper timeout handling
  • Process startup dependencies that aren't properly awaited

// This works locally but fails in CI
const server = require('./server');
const request = require('supertest');

// Server might not be ready yet
test('should return 200', async () => {
  const response = await request(server).get('/health');
  expect(response.status).toBe(200);
});

// Better: wait for server to be ready
test('should return 200', async () => {
  await server.ready(); // Wait for server startup
  const response = await request(server).get('/health');
  expect(response.status).toBe(200);
});
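
The same idea applies at the pipeline level: wait for dependencies instead of assuming they're up. A minimal shell sketch, assuming pg_isready is available and DB_HOST points at your CI database:

## Block until the database actually accepts connections, then run the suite
until pg_isready -h "${DB_HOST:-localhost}" -p 5432; do
  echo "waiting for postgres..."
  sleep 2
done
npm test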

The Nuclear Debugging Options

When nothing else works and you're getting desperate:

SSH into the CI runner (if possible):

## For self-hosted runners
docker exec -it runner-container bash

## For GitHub Actions with tmate
- name: Debug with tmate
  uses: mxschmitt/action-tmate@v3

Enable verbose logging everywhere:

## Docker build debug - plain, uncached output from BuildKit
docker build --progress=plain --no-cache .

## npm debug output
npm install --loglevel=verbose

## GitHub Actions debug - these must be set as repository secrets or variables,
## not exported inside a step (or re-run the failed job with debug logging enabled)
##   ACTIONS_STEP_DEBUG=true
##   ACTIONS_RUNNER_DEBUG=true

Reproduce the exact CI environment locally:

## Pull the exact same Docker image
docker pull node:18-alpine

## Run with the same environment variables
docker run -it --rm \
  -e CI=true \
  -e NODE_ENV=production \
  -v $(pwd):/app \
  -w /app \
  node:18-alpine \
  /bin/sh

6. New 2025 Debugging Scenarios (The Fresh Hell)

GitHub Actions with Apple Silicon Runners: GitHub now offers M1/M2 runners, but Docker builds targeting x86 fail in weird ways:

## This breaks on M1 runners with Docker buildx
- name: Build for production
  run: docker build --platform linux/amd64 .

## This works
- name: Build for production
  run: | 
    docker buildx create --use
    docker buildx build --platform linux/amd64 --load -t myapp .

Bun/Deno in CI: New JavaScript runtimes cause compatibility hell. I wasted a perfectly good weekend recently debugging why our Next.js app worked locally but failed in CI with import.meta is not defined.

Turned out our CI image had both Node and Bun installed, and some fucking dependency was calling Bun instead of Node to run our build script. Two days of my life debugging a build tool I wasn't even trying to use.

Fixed it by explicitly setting engines.node in package.json and removing Bun from the CI image. But honestly? Bun is pretty sweet when it works. Might try it again when it's more stable.
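
If you want the same guardrail, pinning the runtime is two commands (a sketch - adjust the version range to whatever you actually support):

## Declare which Node versions the project supports
npm pkg set engines.node=">=18.17 <19"

## Make npm refuse to install under anything else
echo "engine-strict=true" >> .npmrc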

pnpm Workspace Issues: pnpm is becoming popular, but workspace setups break differently than npm:

## This works locally but fails in CI
pnpm install

## CI needs workspace protocol understanding
pnpm install --frozen-lockfile --prefer-offline

Podman Instead of Docker: Some CI providers switched to Podman. Commands look the same but authentication breaks:

## Docker way
docker login -u $USER -p $PASS registry.com

## Podman way (different auth mechanism)
podman login --username $USER --password $PASS registry.com

The key to not losing your mind is being systematic. Most pipeline failures fall into these categories, and checking them in order saves hours of random debugging.

Real Error Messages and What They Actually Mean

Q

Docker build stuck at "RUN npm install" for 45 minutes, what the hell?

A

This is usually the npm registry being slow or your dependencies being huge. The build isn't frozen - it's just taking forever.

Quick fixes:

## Use npm ci instead of npm install (faster, more reliable)
RUN npm ci --only=production

## Or switch to yarn/pnpm which have better caching
RUN yarn install --frozen-lockfile --production

If that doesn't work: Your package-lock.json is probably fucked. Delete it and node_modules, then run npm install locally to generate a fresh lock file.

I had this exact problem last week - our build went from 3 minutes to 45 minutes overnight. Turns out someone committed a package-lock.json with registry URLs pointing to some internal npm proxy that was slow as hell.

Q

My tests pass locally but fail in CI with "Cannot connect to database"

A

The most common cause: your local database is running, but CI doesn't have one.

GitHub Actions fix:

services:
  postgres:
    image: postgres:14
    env:
      POSTGRES_PASSWORD: postgres
      POSTGRES_DB: test
    options: >-
      --health-cmd pg_isready
      --health-interval 10s
      --health-timeout 5s
      --health-retries 5

Docker Compose fix:

## Add this to your docker-compose.test.yml
services:
  db:
    image: postgres:14
    environment:
      POSTGRES_DB: testdb
      POSTGRES_PASSWORD: password

  app:
    depends_on:
      - db

The stupid thing to check first: Is your DATABASE_URL actually pointing to the CI database? I've seen people hardcode localhost:5432 which obviously doesn't work in containers.
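
A cheap sanity step to drop into the workflow before the tests run (a sketch - DB_HOST is a placeholder, and the right hostname depends on whether the job runs on the runner or inside a container):

## Print where the app thinks the database lives, then prove the port is open
echo "DATABASE_URL=${DATABASE_URL}"
## localhost works for jobs on the runner with a mapped port; use the service
## name (postgres:5432) when the job itself runs in a container
nc -zv "${DB_HOST:-localhost}" 5432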

Q

"JavaScript heap out of memory" during webpack build

A

Your build process needs more memory than the default Node.js heap limit (usually 2GB).

Immediate fix:

export NODE_OPTIONS="--max-old-space-size=4096"  # 4GB

In package.json:

{
  "scripts": {
    "build": "node --max-old-space-size=4096 ./node_modules/.bin/webpack"
  }
}

For GitHub Actions:

- name: Build
  run: npm run build
  env:
    NODE_OPTIONS: "--max-old-space-size=4096"

Root cause: Usually your bundle got bigger (more dependencies, larger assets) or you're building multiple targets simultaneously.
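
Worth finding out what actually grew before throwing RAM at it (a sketch):

## Biggest dependencies on disk
du -sh node_modules/* | sort -h | tail -20

## Dump webpack stats for a bundle analyzer
npx webpack --profile --json > stats.json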

Q

Kubernetes deployment fails with "CrashLoopBackOff" but logs show nothing useful

A

The pod is starting, crashing immediately, then Kubernetes keeps restarting it. The logs might be empty because the crash happens before your app has a chance to log anything.

Debug process:

## Get more details about why it's failing
kubectl describe pod <pod-name>

## Check events for clues
kubectl get events --sort-by=.metadata.creationTimestamp

## Look at the previous container logs (before the crash)
kubectl logs <pod-name> --previous

Common causes:

  1. Missing environment variables: App crashes on startup because DATABASE_URL is undefined
  2. Port mismatch: Your app listens on port 3000 but the container expects port 8080
  3. File permissions: Container runs as non-root but files are owned by root
  4. Health check too aggressive: Kubernetes kills the pod before your app finishes starting

Quick fix for port issues:

## Make sure EXPOSE matches what your app actually uses
EXPOSE 3000
## In your Kubernetes deployment
spec:
  containers:
  - ports:
    - containerPort: 3000  # Must match your app

Q

"ImagePullBackOff" - Kubernetes can't pull my Docker image

A

Kubernetes tried to download your container image but failed. Usually authentication or image name problems.

Debug steps:

## Check if the image exists and you can pull it manually
docker pull your-registry.com/your-image:tag

## Verify your image name in the deployment
kubectl describe pod <pod-name>

Common fixes:

  1. Typo in image name: my-app:latest vs myapp:latest
  2. Wrong registry: Pointing to Docker Hub instead of your private registry
  3. Authentication: Missing or expired registry credentials
  4. Image doesn't exist: You forgot to push it, or the build failed silently

For private registries:

## Create registry credentials
kubectl create secret docker-registry myregistrykey \
  --docker-server=DOCKER_REGISTRY_SERVER \
  --docker-username=DOCKER_USER \
  --docker-password=DOCKER_PASSWORD \
  --docker-email=DOCKER_EMAIL

## Reference in your deployment
spec:
  imagePullSecrets:
  - name: myregistrykey

Q

GitHub Action fails with "The operation was canceled" after exactly 6 hours

A

GitHub Actions has a 6-hour timeout per job. Your job is taking too long and getting killed.

Why this happens:

  • Huge test suites that take forever
  • Building massive Docker images
  • Installing dependencies on slow networks
  • Running without proper parallelization

Fixes:

## Set an explicit timeout - but 360 minutes (6 hours) is both the default and the hard cap on hosted runners
jobs:
  build:
    timeout-minutes: 360

## Or split into multiple jobs
jobs:
  test-unit:
    runs-on: ubuntu-latest
    steps: [...]

  test-integration:
    runs-on: ubuntu-latest
    steps: [...]

Better solution: Make your build faster instead of increasing timeouts.

Q

npm install fails with "ENOTFOUND registry.npmjs.org"

A

Network connectivity issues. Your CI environment can't reach the npm registry.

Quick tests:

## Can you reach npm registry?
curl -I https://registry.npmjs.org/

## DNS working?
nslookup registry.npmjs.org

Common causes:

  1. Corporate firewall: Company blocks external package registries
  2. DNS issues: Can't resolve registry.npmjs.org
  3. Proxy problems: Network proxy not configured correctly
  4. Registry down: npm is having issues (check npm status)

Workarounds:

## Use different registry
npm config set registry https://registry.yarnpkg.com

## Or use yarn
yarn install
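
If it's a proxy rather than DNS, tell npm about it explicitly (a sketch - swap in your real proxy host):

## Corporate proxy configuration for npm
npm config set proxy http://proxy.example.com:8080
npm config set https-proxy http://proxy.example.com:8080
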
Q

Docker build fails with "no space left on device"

A

Your CI runner is out of disk space. Usually caused by old Docker images, build cache, or large build artifacts.

Immediate fix:

## Clean up Docker stuff
docker system prune -a -f
docker builder prune -a -f

## Check disk usage
df -h
du -sh /var/lib/docker

For GitHub Actions:

- name: Free up space
  run: |
    sudo docker system prune -a -f
    sudo rm -rf /usr/share/dotnet
    sudo rm -rf /opt/ghc
    sudo rm -rf "/usr/local/share/boost"

Prevention: Use multi-stage Docker builds to keep final images smaller.

Q

"Permission denied" when trying to write files in Docker

A

Your container process doesn't have permission to write to the filesystem.

Quick fix:

## Run as root (not recommended for production)
USER root

## Better: fix permissions for your user
RUN chown -R node:node /app
USER node

For GitHub Actions with Docker:

- name: Fix permissions
  run: |
    sudo chown -R $USER:$USER .

Q

Tests randomly fail with "port already in use"

A

Multiple test processes trying to use the same port, or previous test processes not cleaning up properly.

Fix with random ports:

// Don't hardcode port 3000
const port = process.env.PORT || 3000 + Math.floor(Math.random() * 1000);

// Or use port 0 to get any available port
const server = app.listen(0, () => {
  const port = server.address().port;
  console.log(`Server running on port ${port}`);
});

For databases in tests:

// Use random database names
const dbName = `test_${Date.now()}_${Math.random().toString(36).slice(2, 8)}`;

Q

My pipeline worked yesterday, today it's completely broken and I didn't change anything

A

Something changed, even if you didn't change it. Dependencies update, base images update, external services change APIs.

Investigation checklist:

## Check if any dependencies updated
npm outdated
npm audit

## Check Docker base image changes
docker pull node:18-alpine
docker history your-image:latest

## Check for environment changes
env | sort

The usual suspects:

  1. Automatic dependency updates: Dependabot or Renovate updated something that broke
  2. Base image updates: node:18 pulled a newer version with breaking changes
  3. External API changes: Third-party service changed their API and your tests broke
  4. Infrastructure changes: CI provider updated their runners/environment

Pro tip: Pin your dependencies and base images to avoid surprise breakage. Yes, it's more maintenance, but it prevents 3am emergency debugging sessions.
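
Pinning a base image to a digest instead of a moving tag looks like this (a sketch):

## Resolve today's tag to an immutable digest
docker pull node:18-alpine
docker inspect --format='{{index .RepoDigests 0}}' node:18-alpine

## Then reference it in the Dockerfile:
## FROM node:18-alpine@sha256:<digest from above>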

Q

GitHub Actions failing with "Error: Dockerfile parse error" after working fine for months

A

This sometimes happens when GitHub updates their build environment. The issue is usually related to BuildKit version changes or Docker context problems.

Quick diagnosis:

## Check if your Dockerfile has Windows line endings (CRLF)
file Dockerfile
hexdump -C Dockerfile | head

## Look for ^M characters or 0d 0a byte sequences
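
If you do find CRLF endings, here's the fix plus the prevention (a sketch - GNU sed shown; on macOS use sed -i ''):

## Strip carriage returns from the Dockerfile
sed -i 's/\r$//' Dockerfile

## Stop Windows checkouts from reintroducing them
echo "Dockerfile text eol=lf" >> .gitattributes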

Common fixes:

## Force BuildKit version in your workflow
- name: Build Docker image
  run: |
    export DOCKER_BUILDKIT=1
    docker build -t myapp .
  env:
    BUILDKIT_PROGRESS: plain

## Or disable BuildKit entirely if you're desperate
- name: Build Docker image (legacy)
  run: |
    export DOCKER_BUILDKIT=0
    docker build --progress plain -t myapp .

Root cause: Either line ending issues from Windows developers, or BuildKit syntax that's no longer compatible. I spent a Tuesday morning debugging this exact error - turned out someone edited the Dockerfile on Windows and introduced CRLF line endings.

Q

My pnpm workspace build works locally but fails in CI with "could not find workspace root"

A

pnpm workspaces are pickier about directory structure in CI environments.

Debug steps:

## Check if pnpm can find the workspace root
pnpm --version
cat pnpm-workspace.yaml
ls -la packages/

## See what pnpm thinks the workspace structure is
pnpm list --depth 0

Common fixes:

## Make sure you're running commands from the right directory
- name: Install dependencies
  run: |
    cd $GITHUB_WORKSPACE
    pnpm install --frozen-lockfile

## Or be explicit about workspace root
- name: Install dependencies
  run: pnpm install --frozen-lockfile
  working-directory: .

Q

Bun install randomly fails with "registry request failed" but npm works fine

A

Bun's package resolution is more aggressive and sometimes fails on network hiccups that npm/yarn handle gracefully.

Immediate workaround:

## Add retry logic for Bun
- name: Install with Bun (with retries)
  run: |
    for i in 1 2 3; do
      bun install && break
      echo "Retry $i failed, trying again..."
      sleep 5
    done

Better fix:

## Configure a different registry for Bun (registry lives under [install] in bunfig.toml)
printf '[install]\nregistry = "https://registry.yarnpkg.com"\n' >> ~/.bunfig.toml

## Or fall back to npm for CI consistency
npm install  # Yes, it's slower but more reliable

Q

My Apple Silicon Mac builds work fine but Linux CI fails with "exec user process caused: exec format error"

A

You're building ARM64 images on your Mac but trying to run them on x86 CI runners.

Fix the build:

## Use buildx for multi-platform builds
FROM --platform=$BUILDPLATFORM node:18-alpine AS builder
## ... build steps

FROM node:18-alpine
COPY --from=builder /app/dist ./

Or force x86 builds:

- name: Build for CI (x86)
  run: docker build --platform linux/amd64 -t myapp .

I hit this when we got new MacBook Pros and suddenly our perfectly working pipeline started failing with cryptic exec format errors. The ARM64 images built fine but couldn't run on GitHub's x86 runners.

Q

AI code completion tools are breaking my CI builds by suggesting wrong code

A

GitHub Copilot, Cursor, and other AI tools sometimes suggest code that looks right but has subtle bugs that only show up in CI environments.

Real example I hit recently: AI suggested this Docker health check:

HEALTHCHECK --interval=30s CMD curl -f $APP_URL || exit 1

Looks fine, but fails in CI because the base image doesn't have curl. The correct version:

## AI suggestions don't always account for minimal base images
FROM node:18-alpine
RUN apk add --no-cache curl  # Add this!
HEALTHCHECK --interval=30s CMD curl -f $APP_URL || exit 1

## Or use wget which is usually available
HEALTHCHECK --interval=30s CMD wget --quiet --tries=1 --spider $APP_URL || exit 1

Pro tip: Always test AI-suggested code in your actual CI environment, not just locally. AI models were trained on a lot of examples that work on full Linux distributions but fail in minimal container images.

Docker and Kubernetes: When Container Dreams Become Deployment Nightmares

Container orchestration was supposed to make deployments easier. Instead, we traded one set of problems for a completely different set of problems that require specialized knowledge to debug. Here's how to fix the most common disasters.

Docker Build Performance Hell

Your Docker build takes 20 minutes and you're losing your mind:

The problem is usually layer caching and build context size. Docker has to send your entire project directory to the build daemon, then rebuild layers that could be cached.

## This Dockerfile will make you suffer
FROM node:18
COPY . /app
WORKDIR /app
RUN npm install
RUN npm run build

## This one won't ruin your day
FROM node:18-alpine as dependencies
WORKDIR /app
## Copy only package files first for better caching
COPY package*.json ./
RUN npm ci --only=production

FROM node:18-alpine as build
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

FROM node:18-alpine as runtime
WORKDIR /app
COPY --from=dependencies /app/node_modules ./node_modules
COPY --from=build /app/dist ./dist
COPY package*.json ./
EXPOSE 3000
CMD ["npm", "start"]

The .dockerignore file you forgot to create:

node_modules
npm-debug.log
.git
.DS_Store
*.md
.env
coverage/
.nyc_output

I learned this the hard way when our build went from 2 minutes to 15 minutes after someone added a logs/ directory with 2GB of log files. Docker was copying the entire thing to the build context every single time. Adding .dockerignore fixed it instantly.
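
Worth measuring before you guess (a sketch):

## How much is Docker about to ship to the daemon?
du -sh .
du -sh */ 2>/dev/null | sort -h | tail -10
## Anything big in that list the image doesn't need belongs in .dockerignore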

Multi-Platform Build Failures

Building for ARM64 (Apple Silicon) when your CI runs AMD64 causes weird failures:

## This breaks on ARM Macs
FROM node:18

## This works everywhere
FROM --platform=$BUILDPLATFORM node:18

## Or be explicit about what you want
FROM --platform=linux/amd64 node:18

If you need multi-platform builds:

## Use buildx for multi-platform images
FROM --platform=$BUILDPLATFORM node:18 as builder
## ... build steps ...

FROM node:18-alpine
COPY --from=builder /app/dist ./dist

Kubernetes Deployment Debugging: The Systematic Approach

When your deployment fails, Kubernetes gives you cryptic errors spread across multiple resources. Here's how to actually figure out what's wrong:

Step 1: Start with the deployment status

## See if pods are even being created
kubectl get deployments
kubectl describe deployment your-app

## Check replica sets (manages pods for deployments)
kubectl get replicasets
kubectl describe replicaset your-app-xxx

Step 2: Check pod status

## See what state your pods are in
kubectl get pods -l app=your-app

## Get detailed info about failed pods
kubectl describe pod your-app-xxx

## Check events (this is where the real info usually is)
kubectl get events --sort-by=.metadata.creationTimestamp

Step 3: Dive into logs

## Current container logs
kubectl logs your-app-xxx

## Previous container logs (if it crashed and restarted)
kubectl logs your-app-xxx --previous

## Follow logs in real-time
kubectl logs -f deployment/your-app
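
One extra command that often short-circuits steps 1-3 (a sketch):

## Tells you whether the rollout converged at all, and times out instead of hanging
kubectl rollout status deployment/your-app --timeout=120s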

The Most Common Kubernetes Fuckups

Resource limits that are too restrictive:

Your app needs 512MB to start but you set the limit to 128MB. Kubernetes kills it immediately.

## This will kill your Node.js app
resources:
  limits:
    memory: "128Mi"  # Too small for most Node apps
    cpu: "100m"      # Too small for build processes

## This actually works
resources:
  requests:
    memory: "256Mi"
    cpu: "100m"
  limits:
    memory: "512Mi"   # Room for memory spikes
    cpu: "500m"       # Room for CPU spikes

Liveness probes that are too aggressive:

Your app takes 30 seconds to start but the probe checks every 10 seconds with no grace period.

## This kills apps that take time to start
livenessProbe:
  httpGet:
    path: /health
    port: 3000
  initialDelaySeconds: 5    # Too soon!
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 2       # Only 2 failures = death

## This actually works
livenessProbe:
  httpGet:
    path: /health
    port: 3000
  initialDelaySeconds: 60   # Give it time to start
  periodSeconds: 30
  timeoutSeconds: 10
  failureThreshold: 3       # More forgiving

Readiness probes that never succeed:

Your health check endpoint never flips to ready (or claims ready before the app actually is), so Kubernetes either never routes traffic to your pod or sends it traffic the app can't handle yet.

// Bad health check - always returns ready
app.get('/health', (req, res) => {
  res.status(200).json({ status: 'ok' });
});

// Better health check - actually checks if app is ready
let isReady = false;

// Set ready after database connection, etc.
connectToDatabase().then(() => {
  isReady = true;
});

app.get('/health', (req, res) => {
  if (isReady) {
    res.status(200).json({ status: 'ready' });
  } else {
    res.status(503).json({ status: 'starting' });
  }
});

Networking Issues That Will Drive You Insane

Service can't reach other services:

DNS resolution in Kubernetes is weird. Services can reach each other by name within the same namespace, but cross-namespace requires FQDN.

## This works within the same namespace
DATABASE_URL: postgres://user:pass@postgres:5432/db

## This works across namespaces
DATABASE_URL: postgres://user:pass@postgres.database.svc.cluster.local:5432/db
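
When in doubt, test the name from inside the cluster instead of guessing (a sketch - the busybox tag is arbitrary):

## Resolve the service name exactly the way your app would
kubectl run dns-test --image=busybox:1.36 -it --rm --restart=Never -- \
  nslookup postgres.database.svc.cluster.local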

Ingress routing that just doesn't work:

Your ingress looks correct but requests get 404 or route to the wrong service.

## Make sure your service selector matches pod labels
apiVersion: v1
kind: Service
spec:
  selector:
    app: your-app          # Must match pod labels
  ports:
  - port: 80
    targetPort: 3000       # Must match container port

---
## And your ingress path must match service
apiVersion: networking.k8s.io/v1
kind: Ingress
spec:
  rules:
  - http:
      paths:
      - path: /api
        pathType: Prefix
        backend:
          service:
            name: your-app  # Must match service name
            port:
              number: 80   # Must match service port

ConfigMap and Secret Mount Issues

Your environment variables just aren't there:

## Wrong way - typos in key names kill you
env:
- name: DATABASE_URL
  valueFrom:
    secretKeyRef:
      name: app-secrets
      key: database-url    # Typo! Should be database_url

## Right way - check the actual secret keys first
## Debug ConfigMaps and Secrets
kubectl get secret app-secrets -o yaml
kubectl describe configmap app-config

File mounts that don't show up:

Your config file should be mounted at /etc/config/app.yml but the directory is empty.

## With subPath, mountPath is the full path of the mounted file, not just the directory
volumeMounts:
- name: config
  mountPath: /etc/config/app.yml   # Where the file shows up in the container
  subPath: app.yml                 # Specific file from the configmap

volumes:
- name: config
  configMap:
    name: app-config
    items:
    - key: app.yml               # Key in configmap
      path: app.yml              # File name in mount

The Nuclear Debug Options for Kubernetes

When you're truly desperate:

## Create a debug pod in the same network namespace
kubectl run debug --image=busybox -it --rm --restart=Never -- /bin/sh

## Or attach to running pod
kubectl exec -it your-app-xxx -- /bin/bash

## Port forward to access services locally
kubectl port-forward service/your-app 8080:80

## Check what's actually running in the container
kubectl exec your-app-xxx -- ps aux
kubectl exec your-app-xxx -- netstat -tlnp
kubectl exec your-app-xxx -- env

Copy files from failing pods:

## Get config files to see what's actually mounted
kubectl cp your-app-xxx:/etc/config ./debug-config

## Get logs from filesystem if kubectl logs doesn't work
kubectl cp your-app-xxx:/var/log ./debug-logs

Docker Registry Authentication Nightmares

ImagePullBackOff with private registries:

The image exists, you can pull it locally, but Kubernetes can't pull it.

## Create registry secret
kubectl create secret docker-registry myregistrykey \
  --docker-server=my-registry.com \
  --docker-username=myuser \
  --docker-password=mypassword

## Check if the secret is actually correct (decode the dockerconfigjson payload)
kubectl get secret myregistrykey -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d

For AWS ECR:

## ECR tokens expire every 12 hours
aws ecr get-login-password --region us-west-2 | \
  docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-west-2.amazonaws.com

## Set up automatic token refresh in your pipeline
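
A blunt but effective refresh step you can run before each deploy (a sketch - account ID, region, and secret name are placeholders):

## Recreate the pull secret with a fresh 12-hour ECR token
kubectl delete secret ecr-pull --ignore-not-found
kubectl create secret docker-registry ecr-pull \
  --docker-server=123456789012.dkr.ecr.us-west-2.amazonaws.com \
  --docker-username=AWS \
  --docker-password="$(aws ecr get-login-password --region us-west-2)"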

Container Startup Failures

Exit code 125: Docker daemon error
Usually a problem with the Docker image or runtime configuration.

Exit code 126: Container command not executable
The command you specified in CMD or ENTRYPOINT doesn't have execute permissions.

## Fix permissions in Dockerfile
COPY script.sh /app/
RUN chmod +x /app/script.sh
CMD ["/app/script.sh"]

Exit code 127: Container command not found
The command doesn't exist in the container.

## Make sure the command exists
RUN which node  # Check if node is available
CMD ["node", "app.js"]

The key insight: containers fail fast and with limited context. Enable verbose logging, check all your assumptions, and remember that what works in your terminal might not work in a minimal container environment.

CI/CD Error Messages: Quick Reference (Copy-Paste Fixes)

| Error Message | Platform | Actual Problem | Copy-Paste Fix | Time to Fix |
|---|---|---|---|---|
| "JavaScript heap out of memory" | Any Node.js build | Webpack/build needs more RAM | export NODE_OPTIONS="--max-old-space-size=4096" | 30 seconds |
| "npm install" hangs for 30+ minutes | Docker builds | Registry slow or package-lock fucked | RUN npm ci --only=production in Dockerfile | 2 minutes |
| "Cannot connect to database" | Test pipelines | No database service in CI | Add postgres service to your workflow YAML | 5 minutes |
| "CrashLoopBackOff" | Kubernetes | App crashes on startup | kubectl logs <pod> --previous then fix the actual error | 10-60 minutes |
| "ImagePullBackOff" | Kubernetes | Can't download container image | Check image name spelling and registry auth | 5-15 minutes |
| "The operation was canceled" | GitHub Actions | Hit 6-hour timeout | Split job or add timeout-minutes: 360 | 2 minutes |
| "ENOTFOUND registry.npmjs.org" | npm builds | Network/DNS issues | npm config set registry https://registry.yarnpkg.com | 1 minute |
| "Permission denied" | Docker containers | File ownership issues | RUN chown -R node:node /app in Dockerfile | 1 minute |
| "no space left on device" | CI runners | Disk full from old builds | docker system prune -a -f before build | 2 minutes |
| "port already in use" | Tests | Previous test didn't clean up | Use random ports: const port = 3000 + Math.random() * 1000 | 5 minutes |
| "Error: EPERM operation not permitted" | Windows CI | File lock/permission issue | Add retry logic or use different npm registry | 10 minutes |
| Exit code 137 | Docker/K8s | Process killed (usually OOM) | Increase memory limits in deployment YAML | 2 minutes |
| Exit code 125 | Docker | Image build failed | Check Dockerfile syntax and base image | 5-30 minutes |
| "dial tcp: lookup on 127.0.0.11" | Docker Compose | DNS resolution failed | Use service names, not localhost | 1 minute |
| "failed to solve with frontend dockerfile.v0" | Docker buildx | Multi-platform build issue | Add --platform=linux/amd64 to docker build | 1 minute |
| "Module not found" | Node.js builds | Missing dependency or path issue | Check import paths and package.json | 5-20 minutes |
| "Dockerfile parse error" | Docker builds (2025) | Line ending or BuildKit issues | export DOCKER_BUILDKIT=0 or fix CRLF | 2-10 minutes |
| "could not find workspace root" | pnpm workspaces | Wrong working directory in CI | cd $GITHUB_WORKSPACE && pnpm install | 5 minutes |
| "registry request failed" | Bun installs | Network issues with Bun's resolver | Use npm/yarn in CI or add retry logic | 2 minutes |
| "exec user process caused: exec format error" | Apple Silicon builds | ARM64 image on x86 CI | --platform linux/amd64 in docker build | 1 minute |
| "cannot use import statement" | Node.js (ESM) | Module type mismatch | Add "type": "module" to package.json | 2 minutes |
