The 3AM Pipeline Debugging Hierarchy (What to Check First)

When your build is broken and everyone's breathing down your neck, here's the order that actually saves time:

1. Is It "Works on My Machine" Syndrome?

The most infuriating and common cause of pipeline failures. Your code runs perfectly locally but dies in CI with errors that make no fucking sense.

Quick test: Can you reproduce the failure locally? If not, it's an environment difference, and you're in for a long night.

## Copy the exact environment from CI
docker run -it --rm \
  -e NODE_ENV=production \
  -e DATABASE_URL=postgres://user:pass@db:5432/myapp \
  node:18-alpine \
  sh

Common gotchas I've learned the hard way:

  • Case sensitivity: Your MacBook doesn't care about file case, Linux does. import './Component' works locally but fails in CI when the file is component.tsx
  • Path separators: Windows uses \, everything else uses /. Hardcoded paths break across platforms
  • Environment variables: Your .env file works locally but CI doesn't have access to it
  • Node version drift: You're running Node 18.2.0 locally, CI is using 18.1.0, and that patch version broke something
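
A sixty-second sanity check covers most of those. Run the same commands locally and in a CI step, then diff the output (a minimal sketch):

## "true" means your filesystem is case-insensitive, so case bugs can hide locally
git config core.ignorecase

## Dump the environment in both places and diff the two files
env | sort > /tmp/env-local.txt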

Last month I spent 4 hours debugging why our tests passed locally but failed in CI with Cannot resolve module '@/components/Button'. Turns out the CI image was missing the TypeScript path mapping configuration. The fix was adding tsconfig.json to our Docker build context.

One line in the Dockerfile. One fucking line. Four hours of my life I'll never get back.

But you know what? Finding that bug felt incredible. Like solving a puzzle that's been taunting you. That moment when the build finally goes green after hours of red failures? Pure dopamine hit. This is why we do this job.

Recent gotcha: Newer Node versions sometimes change how module resolution works. If you're getting ERR_MODULE_NOT_FOUND errors that work fine in older versions, check if your CI is pulling a different Node version than expected:

## Check exactly what Node version CI is using
node --version
npm --version

## If using nvm in CI, pin the version
echo "22.9.0" > .nvmrc
nvm use

2. Check the Obvious Shit First

Has anything changed recently? I know, I know, "nothing changed" - but check anyway:

  • Package updates in the last 48 hours
  • Environment variable changes
  • Infrastructure updates (Node version, base Docker images)
  • New team members who might have merged something

## Check recent commits that might be cursed
git log --oneline --since="2 days ago"

## See what packages changed
git diff HEAD~5 package.json
git diff HEAD~5 package-lock.json

Are the dependencies actually installing?

  • npm install hanging for 30+ minutes usually means network issues or registry problems
  • pip install failing with SSL errors means your base image is too old
  • bundle install timing out means RubyGems is having a bad day
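
A quick way to tell whether it's the registry or your lockfile (a sketch using npm, but the same idea applies to pip and bundler):

## Round-trip to whatever registry npm is actually configured to use
npm ping

## A cached, audit-free install should be fast; if it still crawls, suspect the lockfile
time npm ci --prefer-offline --no-audit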

3. Resource Limits: The Silent Killer

Your pipeline was working fine until your codebase grew or your test suite got bigger. Now builds randomly fail with unhelpful error messages.

Memory exhaustion symptoms:

  • JavaScript heap out of memory during webpack builds
  • Killed with no other context (Linux OOM killer)
  • Tests that pass individually but fail when run together
  • Docker builds that hang during npm install

## Check if you're hitting memory limits
docker stats --no-stream

## Increase Node.js heap size for builds
export NODE_OPTIONS="--max-old-space-size=4096"

I learned this the hard way when our webpack build started failing after adding React Query to our app. The build would get to 90% compilation and just... die. Killed. No error message, no stack trace, no helpful context.

Six fucking hours. Six hours of my Saturday debugging this while my family went to the beach without me. Tried different Node versions, cleared all caches, rebuilt the Docker image from scratch. Nothing.

Finally ran docker stats and saw memory usage spike to 2GB and flatline. The CI runner was silently OOM-killing the process. Bumped memory to 4GB in our GitHub Actions config and boom - build worked perfectly.
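
If you suspect the same silent OOM kill, two signals confirm it (a sketch - dmesg usually needs sudo on CI runners):

## The kernel logs every OOM kill - look for your build process here
sudo dmesg | grep -iE 'out of memory|oom-kill|killed process' | tail -5
## The other giveaway: the build step exits with code 137 (128 + SIGKILL)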

Sometimes I hate this job. But sometimes it's kind of beautiful how simple the solution is once you find it.

4. Network Connectivity Hell

Your CI environment might not have access to the same resources as your development machine.

Common network failures:

  • Corporate firewalls blocking package registries
  • DNS resolution issues in containerized environments
  • Registry authentication failures
  • Proxy configuration problems

## Test network connectivity from your CI environment
curl -I https://registry.npmjs.org/
nslookup github.com
ping google.com

The registry authentication nightmare: Your personal access token works locally but fails in CI because you're using a different account or the token has different permissions.
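
A quick check of which identity CI is actually using (a sketch - NPM_TOKEN stands in for whatever secret your pipeline injects):

## Ask the registry who it thinks you are
npm whoami --registry=https://registry.npmjs.org/

## Or hit the whoami endpoint directly with the CI token
curl -s -H "Authorization: Bearer $NPM_TOKEN" https://registry.npmjs.org/-/whoami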

5. Timing and Race Conditions

Some failures only happen under specific timing conditions that are more likely in CI environments.

Classic race conditions:

  • Tests that depend on database seeds not being fully applied
  • File system operations that assume synchronous completion
  • Network requests without proper timeout handling
  • Process startup dependencies that aren't properly awaited

// This works locally but fails in CI
const server = require('./server');
const request = require('supertest');

// Server might not be ready yet
test('should return 200', async () => {
  const response = await request(server).get('/health');
  expect(response.status).toBe(200);
});

// Better: wait for server to be ready
test('should return 200', async () => {
  await server.ready(); // Wait for server startup
  const response = await request(server).get('/health');
  expect(response.status).toBe(200);
});
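
The same idea applies at the pipeline level: wait for dependencies instead of assuming they're up. A minimal shell sketch, assuming pg_isready is available and DB_HOST points at your CI database:

## Block until the database actually accepts connections, then run the suite
until pg_isready -h "${DB_HOST:-localhost}" -p 5432; do
  echo "waiting for postgres..."
  sleep 2
done
npm test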

The Nuclear Debugging Options

When nothing else works and you're getting desperate:

SSH into the CI runner (if possible):

## For self-hosted runners
docker exec -it runner-container bash

## For GitHub Actions with tmate
- name: Debug with tmate
  uses: mxschmitt/action-tmate@v3

Enable verbose logging everywhere:

## Docker build debug - plain, uncached output from BuildKit
docker build --progress=plain --no-cache .

## npm debug output
npm install --loglevel=verbose

## GitHub Actions debug - these must be set as repository secrets or variables,
## not exported inside a step (or re-run the failed job with debug logging enabled)
##   ACTIONS_STEP_DEBUG=true
##   ACTIONS_RUNNER_DEBUG=true

Reproduce the exact CI environment locally:

## Pull the exact same Docker image
docker pull node:18-alpine

## Run with the same environment variables
docker run -it --rm \
  -e CI=true \
  -e NODE_ENV=production \
  -v $(pwd):/app \
  -w /app \
  node:18-alpine \
  /bin/sh

6. New 2025 Debugging Scenarios (The Fresh Hell)

GitHub Actions with Apple Silicon Runners: GitHub now offers M1/M2 runners, but Docker builds targeting x86 fail in weird ways:

## This breaks on M1 runners with Docker buildx
- name: Build for production
  run: docker build --platform linux/amd64 .

## This works
- name: Build for production
  run: | 
    docker buildx create --use
    docker buildx build --platform linux/amd64 --load -t myapp .

Bun/Deno in CI: New JavaScript runtimes cause compatibility hell. I wasted a perfectly good weekend recently debugging why our Next.js app worked locally but failed in CI with import.meta is not defined.

Turned out our CI image had both Node and Bun installed, and some fucking dependency was calling Bun instead of Node to run our build script. Two days of my life debugging a build tool I wasn't even trying to use.

Fixed it by explicitly setting engines.node in package.json and removing Bun from the CI image. But honestly? Bun is pretty sweet when it works. Might try it again when it's more stable.
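
If you want the same guardrail, pinning the runtime is two commands (a sketch - adjust the version range to whatever you actually support):

## Declare which Node versions the project supports
npm pkg set engines.node=">=18.17 <19"

## Make npm refuse to install under anything else
echo "engine-strict=true" >> .npmrc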

pnpm Workspace Issues: pnpm is becoming popular, but workspace setups break differently than npm:

## This works locally but fails in CI
pnpm install

## CI needs workspace protocol understanding
pnpm install --frozen-lockfile --prefer-offline

Podman Instead of Docker: Some CI providers switched to Podman. Commands look the same but authentication breaks:

## Docker way
docker login -u $USER -p $PASS registry.com

## Podman way (different auth mechanism)
podman login --username $USER --password $PASS registry.com

The key to not losing your mind is being systematic. Most pipeline failures fall into these categories, and checking them in order saves hours of random debugging.

Real Error Messages and What They Actually Mean

Q

Docker build stuck at "RUN npm install" for 45 minutes, what the hell?

A

This is usually the npm registry being slow or your dependencies being huge. The build isn't frozen - it's just taking forever.

Quick fixes:

## Use npm ci instead of npm install (faster, more reliable)
RUN npm ci --only=production

## Or switch to yarn/pnpm which have better caching
RUN yarn install --frozen-lockfile --production

If that doesn't work: Your package-lock.json is probably fucked. Delete it and node_modules, then run npm install locally to generate a fresh lock file.

I had this exact problem last week - our build went from 3 minutes to 45 minutes overnight. Turns out someone committed a package-lock.json with registry URLs pointing to some internal npm proxy that was slow as hell.

Q

My tests pass locally but fail in CI with "Cannot connect to database"

A

The most common cause: your local database is running, but CI doesn't have one.

GitHub Actions fix:

services:
  postgres:
    image: postgres:14
    env:
      POSTGRES_PASSWORD: postgres
      POSTGRES_DB: test
    options: >-
      --health-cmd pg_isready
      --health-interval 10s
      --health-timeout 5s
      --health-retries 5

Docker Compose fix:

## Add this to your docker-compose.test.yml
services:
  db:
    image: postgres:14
    environment:
      POSTGRES_DB: testdb
      POSTGRES_PASSWORD: password

  app:
    depends_on:
      - db

The stupid thing to check first: Is your DATABASE_URL actually pointing to the CI database? I've seen people hardcode localhost:5432 which obviously doesn't work in containers.
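
A cheap sanity step to drop into the workflow before the tests run (a sketch - DB_HOST is a placeholder, and the right hostname depends on whether the job runs on the runner or inside a container):

## Print where the app thinks the database lives, then prove the port is open
echo "DATABASE_URL=${DATABASE_URL}"
## localhost works for jobs on the runner with a mapped port; use the service
## name (postgres:5432) when the job itself runs in a container
nc -zv "${DB_HOST:-localhost}" 5432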

Q

"JavaScript heap out of memory" during webpack build

A

Your build process needs more memory than the default Node.js heap limit (usually 2GB).

Immediate fix:

export NODE_OPTIONS="--max-old-space-size=4096"  # 4GB

In package.json:

{
  "scripts": {
    "build": "node --max-old-space-size=4096 ./node_modules/.bin/webpack"
  }
}

For GitHub Actions:

- name: Build
  run: npm run build
  env:
    NODE_OPTIONS: "--max-old-space-size=4096"

Root cause: Usually your bundle got bigger (more dependencies, larger assets) or you're building multiple targets simultaneously.
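
Worth finding out what actually grew before throwing RAM at it (a sketch):

## Biggest dependencies on disk
du -sh node_modules/* | sort -h | tail -20

## Dump webpack stats for a bundle analyzer
npx webpack --profile --json > stats.json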

Q

Kubernetes deployment fails with "CrashLoopBackOff" but logs show nothing useful

A

The pod is starting, crashing immediately, then Kubernetes keeps restarting it. The logs might be empty because the crash happens before your app has a chance to log anything.

Debug process:

## Get more details about why it's failing
kubectl describe pod <pod-name>

## Check events for clues
kubectl get events --sort-by=.metadata.creationTimestamp

## Look at the previous container logs (before the crash)
kubectl logs <pod-name> --previous

Common causes:

  1. Missing environment variables: App crashes on startup because DATABASE_URL is undefined
  2. Port mismatch: Your app listens on port 3000 but the container expects port 8080
  3. File permissions: Container runs as non-root but files are owned by root
  4. Health check too aggressive: Kubernetes kills the pod before your app finishes starting

Quick fix for port issues:

## Make sure EXPOSE matches what your app actually uses
EXPOSE 3000
## In your Kubernetes deployment
spec:
  containers:
  - ports:
    - containerPort: 3000  # Must match your app

Q

"ImagePullBackOff" - Kubernetes can't pull my Docker image

A

Kubernetes tried to download your container image but failed. Usually authentication or image name problems.

Debug steps:

## Check if the image exists and you can pull it manually
docker pull your-registry.com/your-image:tag

## Verify your image name in the deployment
kubectl describe pod <pod-name>

Common fixes:

  1. Typo in image name: my-app:latest vs myapp:latest
  2. Wrong registry: Pointing to Docker Hub instead of your private registry
  3. Authentication: Missing or expired registry credentials
  4. Image doesn't exist: You forgot to push it, or the build failed silently

For private registries:

## Create registry credentials
kubectl create secret docker-registry myregistrykey \
  --docker-server=DOCKER_REGISTRY_SERVER \
  --docker-username=DOCKER_USER \
  --docker-password=DOCKER_PASSWORD \
  --docker-email=DOCKER_EMAIL

## Reference in your deployment
spec:
  imagePullSecrets:
  - name: myregistrykey

Q

GitHub Action fails with "The operation was canceled" after exactly 6 hours

A

GitHub Actions has a 6-hour timeout per job. Your job is taking too long and getting killed.

Why this happens:

  • Huge test suites that take forever
  • Building massive Docker images
  • Installing dependencies on slow networks
  • Running without proper parallelization

Fixes:

## Set an explicit timeout - but 360 minutes (6 hours) is both the default and the hard cap on hosted runners
jobs:
  build:
    timeout-minutes: 360

## Or split into multiple jobs
jobs:
  test-unit:
    runs-on: ubuntu-latest
    steps: [...]

  test-integration:
    runs-on: ubuntu-latest
    steps: [...]

Better solution: Make your build faster instead of increasing timeouts.

Q

npm install fails with "ENOTFOUND registry.npmjs.org"

A

Network connectivity issues. Your CI environment can't reach the npm registry.

Quick tests:

## Can you reach npm registry?
curl -I https://registry.npmjs.org/

## DNS working?
nslookup registry.npmjs.org

Common causes:

  1. Corporate firewall: Company blocks external package registries
  2. DNS issues: Can't resolve registry.npmjs.org
  3. Proxy problems: Network proxy not configured correctly
  4. Registry down: npm is having issues (check npm status)

Workarounds:

## Use different registry
npm config set registry https://registry.yarnpkg.com

## Or use yarn
yarn install
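
If it's a proxy rather than DNS, tell npm about it explicitly (a sketch - swap in your real proxy host):

## Corporate proxy configuration for npm
npm config set proxy http://proxy.example.com:8080
npm config set https-proxy http://proxy.example.com:8080
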
Q

Docker build fails with "no space left on device"

A

Your CI runner is out of disk space. Usually caused by old Docker images, build cache, or large build artifacts.

Immediate fix:

## Clean up Docker stuff
docker system prune -a -f
docker builder prune -a -f

## Check disk usage
df -h
du -sh /var/lib/docker

For GitHub Actions:

- name: Free up space
  run: |
    sudo docker system prune -a -f
    sudo rm -rf /usr/share/dotnet
    sudo rm -rf /opt/ghc
    sudo rm -rf "/usr/local/share/boost"

Prevention: Use multi-stage Docker builds to keep final images smaller.

Q

"Permission denied" when trying to write files in Docker

A

Your container process doesn't have permission to write to the filesystem.

Quick fix:

## Run as root (not recommended for production)
USER root

## Better: fix permissions for your user
RUN chown -R node:node /app
USER node

For GitHub Actions with Docker:

- name: Fix permissions
  run: |
    sudo chown -R $USER:$USER .

Q

Tests randomly fail with "port already in use"

A

Multiple test processes trying to use the same port, or previous test processes not cleaning up properly.

Fix with random ports:

// Don't hardcode port 3000
const port = process.env.PORT || 3000 + Math.floor(Math.random() * 1000);

// Or use port 0 to get any available port
const server = app.listen(0, () => {
  const port = server.address().port;
  console.log(`Server running on port ${port}`);
});

For databases in tests:

// Use random database names
const dbName = `test_${Date.now()}_${Math.random().toString(36).slice(2, 8)}`;

Q

My pipeline worked yesterday, today it's completely broken and I didn't change anything

A

Something changed, even if you didn't change it. Dependencies update, base images update, external services change APIs.

Investigation checklist:

## Check if any dependencies updated
npm outdated
npm audit

## Check Docker base image changes
docker pull node:18-alpine
docker history your-image:latest

## Check for environment changes
env | sort

The usual suspects:

  1. Automatic dependency updates: Dependabot or Renovate updated something that broke
  2. Base image updates: node:18 pulled a newer version with breaking changes
  3. External API changes: Third-party service changed their API and your tests broke
  4. Infrastructure changes: CI provider updated their runners/environment

Pro tip: Pin your dependencies and base images to avoid surprise breakage. Yes, it's more maintenance, but it prevents 3am emergency debugging sessions.
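
Pinning a base image to a digest instead of a moving tag looks like this (a sketch):

## Resolve today's tag to an immutable digest
docker pull node:18-alpine
docker inspect --format='{{index .RepoDigests 0}}' node:18-alpine

## Then reference it in the Dockerfile:
## FROM node:18-alpine@sha256:<digest from above>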

Q

GitHub Actions failing with "Error: Dockerfile parse error" after working fine for months

A

This sometimes happens when GitHub updates their build environment. The issue is usually related to BuildKit version changes or Docker context problems.

Quick diagnosis:

## Check if your Dockerfile has Windows line endings (CRLF)
file Dockerfile
hexdump -C Dockerfile | head

## Look for ^M characters or 0d 0a byte sequences
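
If you do find CRLF endings, here's the fix plus the prevention (a sketch - GNU sed shown; on macOS use sed -i ''):

## Strip carriage returns from the Dockerfile
sed -i 's/\r$//' Dockerfile

## Stop Windows checkouts from reintroducing them
echo "Dockerfile text eol=lf" >> .gitattributes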

Common fixes:

## Force BuildKit version in your workflow
- name: Build Docker image
  run: |
    export DOCKER_BUILDKIT=1
    docker build -t myapp .
  env:
    BUILDKIT_PROGRESS: plain

## Or disable BuildKit entirely if you're desperate
- name: Build Docker image (legacy)
  run: |
    export DOCKER_BUILDKIT=0
    docker build --progress plain -t myapp .

Root cause: Either line ending issues from Windows developers, or BuildKit syntax that's no longer compatible. I spent a Tuesday morning debugging this exact error - turned out someone edited the Dockerfile on Windows and introduced CRLF line endings.

Q

My pnpm workspace build works locally but fails in CI with "could not find workspace root"

A

pnpm workspaces are pickier about directory structure in CI environments.

Debug steps:

## Check if pnpm can find the workspace root
pnpm --version
cat pnpm-workspace.yaml
ls -la packages/

## See what pnpm thinks the workspace structure is
pnpm list --depth 0

Common fixes:

## Make sure you're running commands from the right directory
- name: Install dependencies
  run: |
    cd $GITHUB_WORKSPACE
    pnpm install --frozen-lockfile

## Or be explicit about workspace root
- name: Install dependencies
  run: pnpm install --frozen-lockfile
  working-directory: .

Q

Bun install randomly fails with "registry request failed" but npm works fine

A

Bun's package resolution is more aggressive and sometimes fails on network hiccups that npm/yarn handle gracefully.

Immediate workaround:

## Add retry logic for Bun
- name: Install with Bun (with retries)
  run: |
    for i in 1 2 3; do
      bun install && break
      echo "Retry $i failed, trying again..."
      sleep 5
    done

Better fix:

## Configure a different registry for Bun (registry lives under [install] in bunfig.toml)
printf '[install]\nregistry = "https://registry.yarnpkg.com"\n' >> ~/.bunfig.toml

## Or fall back to npm for CI consistency
npm install  # Yes, it's slower but more reliable

Q

My Apple Silicon Mac builds work fine but Linux CI fails with "exec user process caused: exec format error"

A

You're building ARM64 images on your Mac but trying to run them on x86 CI runners.

Fix the build:

## Use buildx for multi-platform builds
FROM --platform=$BUILDPLATFORM node:18-alpine AS builder
## ... build steps

FROM node:18-alpine
COPY --from=builder /app/dist ./

Or force x86 builds:

- name: Build for CI (x86)
  run: docker build --platform linux/amd64 -t myapp .

I hit this when we got new MacBook Pros and suddenly our perfectly working pipeline started failing with cryptic exec format errors. The ARM64 images built fine but couldn't run on GitHub's x86 runners.

Q

AI code completion tools are breaking my CI builds by suggesting wrong code

A

GitHub Copilot, Cursor, and other AI tools sometimes suggest code that looks right but has subtle bugs that only show up in CI environments.

Real example I hit recently: AI suggested this Docker health check:

HEALTHCHECK --interval=30s CMD curl -f $APP_URL || exit 1

Looks fine, but fails in CI because the base image doesn't have curl. The correct version:

## AI suggestions don't always account for minimal base images
FROM node:18-alpine
RUN apk add --no-cache curl  # Add this!
HEALTHCHECK --interval=30s CMD curl -f $APP_URL || exit 1

## Or use wget which is usually available
HEALTHCHECK --interval=30s CMD wget --quiet --tries=1 --spider $APP_URL || exit 1

Pro tip: Always test AI-suggested code in your actual CI environment, not just locally. AI models were trained on a lot of examples that work on full Linux distributions but fail in minimal container images.

Docker and Kubernetes: When Container Dreams Become Deployment Nightmares

Container orchestration was supposed to make deployments easier. Instead, we traded one set of problems for a completely different set of problems that require specialized knowledge to debug. Here's how to fix the most common disasters.

Docker Build Performance Hell

Your Docker build takes 20 minutes and you're losing your mind:

The problem is usually layer caching and build context size. Docker has to send your entire project directory to the build daemon, then rebuild layers that could be cached.

## This Dockerfile will make you suffer
FROM node:18
COPY . /app
WORKDIR /app
RUN npm install
RUN npm run build

## This one won't ruin your day
FROM node:18-alpine as dependencies
WORKDIR /app
## Copy only package files first for better caching
COPY package*.json ./
RUN npm ci --only=production

FROM node:18-alpine as build
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

FROM node:18-alpine as runtime
WORKDIR /app
COPY --from=dependencies /app/node_modules ./node_modules
COPY --from=build /app/dist ./dist
COPY package*.json ./
EXPOSE 3000
CMD ["npm", "start"]

The .dockerignore file you forgot to create:

node_modules
npm-debug.log
.git
.DS_Store
*.md
.env
coverage/
.nyc_output

I learned this the hard way when our build went from 2 minutes to 15 minutes after someone added a logs/ directory with 2GB of log files. Docker was copying the entire thing to the build context every single time. Adding .dockerignore fixed it instantly.
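
Worth measuring before you guess (a sketch):

## How much is Docker about to ship to the daemon?
du -sh .
du -sh */ 2>/dev/null | sort -h | tail -10
## Anything big in that list the image doesn't need belongs in .dockerignore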

Multi-Platform Build Failures

Building for ARM64 (Apple Silicon) when your CI runs AMD64 causes weird failures:

## This breaks on ARM Macs
FROM node:18

## This works everywhere
FROM --platform=$BUILDPLATFORM node:18

## Or be explicit about what you want
FROM --platform=linux/amd64 node:18

If you need multi-platform builds:

## Use buildx for multi-platform images
FROM --platform=$BUILDPLATFORM node:18 as builder
## ... build steps ...

FROM node:18-alpine
COPY --from=builder /app/dist ./dist

Kubernetes Deployment Debugging: The Systematic Approach

When your deployment fails, Kubernetes gives you cryptic errors spread across multiple resources. Here's how to actually figure out what's wrong:

Step 1: Start with the deployment status

## See if pods are even being created
kubectl get deployments
kubectl describe deployment your-app

## Check replica sets (manages pods for deployments)
kubectl get replicasets
kubectl describe replicaset your-app-xxx

Step 2: Check pod status

## See what state your pods are in
kubectl get pods -l app=your-app

## Get detailed info about failed pods
kubectl describe pod your-app-xxx

## Check events (this is where the real info usually is)
kubectl get events --sort-by=.metadata.creationTimestamp

Step 3: Dive into logs

## Current container logs
kubectl logs your-app-xxx

## Previous container logs (if it crashed and restarted)
kubectl logs your-app-xxx --previous

## Follow logs in real-time
kubectl logs -f deployment/your-app
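
One extra command that often short-circuits steps 1-3 (a sketch):

## Tells you whether the rollout converged at all, and times out instead of hanging
kubectl rollout status deployment/your-app --timeout=120s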

The Most Common Kubernetes Fuckups

Resource limits that are too restrictive:

Your app needs 512MB to start but you set the limit to 128MB. Kubernetes kills it immediately.

## This will kill your Node.js app
resources:
  limits:
    memory: "128Mi"  # Too small for most Node apps
    cpu: "100m"      # Too small for build processes

## This actually works
resources:
  requests:
    memory: "256Mi"
    cpu: "100m"
  limits:
    memory: "512Mi"   # Room for memory spikes
    cpu: "500m"       # Room for CPU spikes

Liveness probes that are too aggressive:

Your app takes 30 seconds to start but the probe checks every 10 seconds with no grace period.

## This kills apps that take time to start
livenessProbe:
  httpGet:
    path: /health
    port: 3000
  initialDelaySeconds: 5    # Too soon!
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 2       # Only 2 failures = death

## This actually works
livenessProbe:
  httpGet:
    path: /health
    port: 3000
  initialDelaySeconds: 60   # Give it time to start
  periodSeconds: 30
  timeoutSeconds: 10
  failureThreshold: 3       # More forgiving

Readiness probes that never succeed:

Your health check endpoint never flips to ready (or claims ready before the app actually is), so Kubernetes either never routes traffic to your pod or sends it traffic the app can't handle yet.

// Bad health check - always returns ready
app.get('/health', (req, res) => {
  res.status(200).json({ status: 'ok' });
});

// Better health check - actually checks if app is ready
let isReady = false;

// Set ready after database connection, etc.
connectToDatabase().then(() => {
  isReady = true;
});

app.get('/health', (req, res) => {
  if (isReady) {
    res.status(200).json({ status: 'ready' });
  } else {
    res.status(503).json({ status: 'starting' });
  }
});

Networking Issues That Will Drive You Insane

Service can't reach other services:

DNS resolution in Kubernetes is weird. Services can reach each other by name within the same namespace, but cross-namespace requires FQDN.

## This works within the same namespace
DATABASE_URL: postgres://user:pass@postgres:5432/db

## This works across namespaces
DATABASE_URL: postgres://user:pass@postgres.database.svc.cluster.local:5432/db
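
When in doubt, test the name from inside the cluster instead of guessing (a sketch - the busybox tag is arbitrary):

## Resolve the service name exactly the way your app would
kubectl run dns-test --image=busybox:1.36 -it --rm --restart=Never -- \
  nslookup postgres.database.svc.cluster.local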

Ingress routing that just doesn't work:

Your ingress looks correct but requests get 404 or route to the wrong service.

## Make sure your service selector matches pod labels
apiVersion: v1
kind: Service
spec:
  selector:
    app: your-app          # Must match pod labels
  ports:
  - port: 80
    targetPort: 3000       # Must match container port

---
## And your ingress path must match service
apiVersion: networking.k8s.io/v1
kind: Ingress
spec:
  rules:
  - http:
      paths:
      - path: /api
        pathType: Prefix
        backend:
          service:
            name: your-app  # Must match service name
            port:
              number: 80   # Must match service port

ConfigMap and Secret Mount Issues

Your environment variables just aren't there:

## Wrong way - typos in key names kill you
env:
- name: DATABASE_URL
  valueFrom:
    secretKeyRef:
      name: app-secrets
      key: database-url    # Typo! Should be database_url

## Right way - check the actual secret keys first
## Debug ConfigMaps and Secrets
kubectl get secret app-secrets -o yaml
kubectl describe configmap app-config

File mounts that don't show up:

Your config file should be mounted at /etc/config/app.yml but the directory is empty.

## With subPath, mountPath is the full path of the mounted file, not just the directory
volumeMounts:
- name: config
  mountPath: /etc/config/app.yml   # Where the file shows up in the container
  subPath: app.yml                 # Specific file from the configmap

volumes:
- name: config
  configMap:
    name: app-config
    items:
    - key: app.yml               # Key in configmap
      path: app.yml              # File name in mount

The Nuclear Debug Options for Kubernetes

When you're truly desperate:

## Create a debug pod in the same network namespace
kubectl run debug --image=busybox -it --rm --restart=Never -- /bin/sh

## Or attach to running pod
kubectl exec -it your-app-xxx -- /bin/bash

## Port forward to access services locally
kubectl port-forward service/your-app 8080:80

## Check what's actually running in the container
kubectl exec your-app-xxx -- ps aux
kubectl exec your-app-xxx -- netstat -tlnp
kubectl exec your-app-xxx -- env

Copy files from failing pods:

## Get config files to see what's actually mounted
kubectl cp your-app-xxx:/etc/config ./debug-config

## Get logs from filesystem if kubectl logs doesn't work
kubectl cp your-app-xxx:/var/log ./debug-logs

Docker Registry Authentication Nightmares

ImagePullBackOff with private registries:

The image exists, you can pull it locally, but Kubernetes can't pull it.

## Create registry secret
kubectl create secret docker-registry myregistrykey \
  --docker-server=my-registry.com \
  --docker-username=myuser \
  --docker-password=mypassword

## Check if the secret is actually correct (decode the dockerconfigjson payload)
kubectl get secret myregistrykey -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d

For AWS ECR:

## ECR tokens expire every 12 hours
aws ecr get-login-password --region us-west-2 | \
  docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-west-2.amazonaws.com

## Set up automatic token refresh in your pipeline
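
A blunt but effective refresh step you can run before each deploy (a sketch - account ID, region, and secret name are placeholders):

## Recreate the pull secret with a fresh 12-hour ECR token
kubectl delete secret ecr-pull --ignore-not-found
kubectl create secret docker-registry ecr-pull \
  --docker-server=123456789012.dkr.ecr.us-west-2.amazonaws.com \
  --docker-username=AWS \
  --docker-password="$(aws ecr get-login-password --region us-west-2)"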

Container Startup Failures

Exit code 125: Docker daemon error
Usually a problem with the Docker image or runtime configuration.

Exit code 126: Container command not executable
The command you specified in CMD or ENTRYPOINT doesn't have execute permissions.

## Fix permissions in Dockerfile
COPY script.sh /app/
RUN chmod +x /app/script.sh
CMD ["/app/script.sh"]

Exit code 127: Container command not found
The command doesn't exist in the container.

## Make sure the command exists
RUN which node  # Check if node is available
CMD ["node", "app.js"]

The key insight: containers fail fast and with limited context. Enable verbose logging, check all your assumptions, and remember that what works in your terminal might not work in a minimal container environment.

CI/CD Error Messages: Quick Reference (Copy-Paste Fixes)

| Error Message | Platform | Actual Problem | Copy-Paste Fix | Time to Fix |
|---|---|---|---|---|
| "JavaScript heap out of memory" | Any Node.js build | Webpack/build needs more RAM | export NODE_OPTIONS="--max-old-space-size=4096" | 30 seconds |
| "npm install" hangs for 30+ minutes | Docker builds | Registry slow or package-lock fucked | RUN npm ci --only=production in Dockerfile | 2 minutes |
| "Cannot connect to database" | Test pipelines | No database service in CI | Add postgres service to your workflow YAML | 5 minutes |
| "CrashLoopBackOff" | Kubernetes | App crashes on startup | kubectl logs <pod> --previous then fix the actual error | 10-60 minutes |
| "ImagePullBackOff" | Kubernetes | Can't download container image | Check image name spelling and registry auth | 5-15 minutes |
| "The operation was canceled" | GitHub Actions | Hit 6-hour timeout | Split job or add timeout-minutes: 360 | 2 minutes |
| "ENOTFOUND registry.npmjs.org" | npm builds | Network/DNS issues | npm config set registry https://registry.yarnpkg.com | 1 minute |
| "Permission denied" | Docker containers | File ownership issues | RUN chown -R node:node /app in Dockerfile | 1 minute |
| "no space left on device" | CI runners | Disk full from old builds | docker system prune -a -f before build | 2 minutes |
| "port already in use" | Tests | Previous test didn't clean up | Use random ports: const port = 3000 + Math.random() * 1000 | 5 minutes |
| "Error: EPERM operation not permitted" | Windows CI | File lock/permission issue | Add retry logic or use different npm registry | 10 minutes |
| Exit code 137 | Docker/K8s | Process killed (usually OOM) | Increase memory limits in deployment YAML | 2 minutes |
| Exit code 125 | Docker | Image build failed | Check Dockerfile syntax and base image | 5-30 minutes |
| "dial tcp: lookup on 127.0.0.11" | Docker Compose | DNS resolution failed | Use service names, not localhost | 1 minute |
| "failed to solve with frontend dockerfile.v0" | Docker buildx | Multi-platform build issue | Add --platform=linux/amd64 to docker build | 1 minute |
| "Module not found" | Node.js builds | Missing dependency or path issue | Check import paths and package.json | 5-20 minutes |
| "Dockerfile parse error" | Docker builds (2025) | Line ending or BuildKit issues | export DOCKER_BUILDKIT=0 or fix CRLF | 2-10 minutes |
| "could not find workspace root" | pnpm workspaces | Wrong working directory in CI | cd $GITHUB_WORKSPACE && pnpm install | 5 minutes |
| "registry request failed" | Bun installs | Network issues with Bun's resolver | Use npm/yarn in CI or add retry logic | 2 minutes |
| "exec user process caused: exec format error" | Apple Silicon builds | ARM64 image on x86 CI | --platform linux/amd64 in docker build | 1 minute |
| "cannot use import statement" | Node.js (ESM) | Module type mismatch | Add "type": "module" to package.json | 2 minutes |
