CI/CD Pipeline Troubleshooting: AI-Optimized Knowledge Base

Critical Debugging Hierarchy (Time-Saving Order)

1. Environment Differences (Most Common Cause)

Problem: Code works locally but fails in CI with nonsensical errors
Symptoms: Import/module resolution failures, missing dependencies, path issues
Time Investment: about 15 minutes with a systematic approach, up to 4 hours without one

Immediate Actions:

# Reproduce the CI environment: same image, same variables, project mounted at /app
docker run -it --rm \
  -v "$PWD":/app -w /app \
  -e NODE_ENV=production \
  -e DATABASE_URL=postgres://user:pass@db:5432/myapp \
  node:18-alpine \
  sh

Common Environment Gotchas:

  • Case sensitivity: macOS ignores file case, Linux doesn't
  • Path separators: Windows \ vs Unix /
  • Missing environment variables in CI
  • Node version drift between local and CI
  • TypeScript path mapping missing in Docker builds
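
The case-sensitivity gotcha can be caught before CI does. Below is a minimal sketch, assuming relative import targets have already been resolved to file paths; `findCaseMismatch` and `checkImportCase` are hypothetical helper names, not part of any tool, and only the last path segment is compared.

```javascript
const fs = require('node:fs');
const path = require('node:path');

// Returns the on-disk spelling when `wanted` matches an entry only
// case-insensitively, or null when it matches exactly or not at all.
function findCaseMismatch(wanted, entries) {
  if (entries.includes(wanted)) return null;
  const lower = wanted.toLowerCase();
  return entries.find((entry) => entry.toLowerCase() === lower) ?? null;
}

// Checks the file name of one resolved import path against the real
// directory listing (hypothetical helper; parent directories are not checked).
function checkImportCase(filePath) {
  const dir = path.dirname(filePath);
  const entries = fs.existsSync(dir) ? fs.readdirSync(dir) : [];
  return findCaseMismatch(path.basename(filePath), entries);
}
```

A non-null result means the import resolves on macOS but would fail on a Linux runner.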

Real-World Impact: Single missing tsconfig.json in Docker context = 4 hours debugging time

2. Resource Limits (Silent Killer)

Memory Exhaustion Indicators:

  • JavaScript heap out of memory during builds
  • Killed with no context (Linux OOM killer)
  • Tests pass individually, fail together
  • Docker builds hang at npm install
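
A bare "Killed" usually comes with an exit code above 128, which by shell convention means 128 plus a signal number. The sketch below decodes that; `describeExitCode` is a hypothetical helper, and the signal table comes from Node's `os.constants.signals` on the machine running it.

```javascript
const os = require('node:os');

// Shells report "terminated by signal N" as exit code 128 + N.
function describeExitCode(code) {
  if (code <= 128) return { code, signal: null };
  const signalNumber = code - 128;
  const match = Object.entries(os.constants.signals)
    .find(([, value]) => value === signalNumber);
  return { code, signal: match ? match[0] : null };
}
```

For exit code 137 this yields SIGKILL (137 - 128 = 9), which is what the Linux OOM killer sends. SIGKILL can also come from other sources, so treat it as a strong hint rather than proof of memory exhaustion.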

Production Thresholds:

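  • Node.js default heap: roughly 2GB on many 64-bit setups, varying with Node version and available memory (often insufficient for large webpack builds)
  • Required for medium React apps: 4GB (--max-old-space-size=4096)
  • Memory spikes past the runner's or container's limit (2GB+ on small runners) = the job is killed immediately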

Configuration:

export NODE_OPTIONS="--max-old-space-size=4096"
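
To confirm the setting actually reached the build process, it can help to print the heap ceiling V8 is using. A minimal sketch, with `heapLimitMb` as a hypothetical helper name; the reported number is typically a little above the configured old-space size because it includes other heap spaces.

```javascript
const v8 = require('node:v8');

// Heap ceiling V8 is using for this process, in megabytes.
function heapLimitMb() {
  const bytes = v8.getHeapStatistics().heap_size_limit;
  return Math.round(bytes / (1024 * 1024));
}

console.log(`V8 heap limit: ${heapLimitMb()} MB`);
```

Running this as a CI step before the build shows whether NODE_OPTIONS was picked up by that step's environment.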

3. Network Connectivity Issues

Failure Patterns:

  • Registry timeouts during dependency installation
  • DNS resolution failures in containers
  • Corporate firewall blocking package registries
  • Authentication failures with private registries

Diagnostic Commands:

curl -I https://registry.npmjs.org/
nslookup registry.npmjs.org
ping google.com

4. Timing and Race Conditions

High-Risk Scenarios:

  • Database seeds not fully applied before tests
  • Server startup timing in integration tests
  • Process dependencies without proper awaiting
  • File system operations assuming synchronous completion

Prevention Pattern:

// Wrong: assumes immediate server readiness
test('should return 200', async () => {
  const response = await request(server).get('/health');
  expect(response.status).toBe(200);
});

// Correct: explicit readiness check
test('should return 200', async () => {
  await server.ready();
  const response = await request(server).get('/health');
  expect(response.status).toBe(200);
});
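
When the server under test has no readiness hook of its own, a polling helper covers the same ground. This is a sketch under assumptions: `waitFor` is a hypothetical utility, and the check function passed to it is whatever "ready" means for the service (a health request, a database query).

```javascript
// Polls `check` until it returns a truthy value or the timeout elapses.
// A check that throws is treated as "not ready yet".
async function waitFor(check, { timeoutMs = 10000, intervalMs = 100 } = {}) {
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    let ok = false;
    try {
      ok = await check();
    } catch {
      ok = false;
    }
    if (ok) return;
    if (Date.now() >= deadline) {
      throw new Error(`Condition not met within ${timeoutMs} ms`);
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

In a test suite this would typically run once in a setup hook, so individual tests start from a known-ready state instead of sleeping for a fixed time.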

Docker Build Optimization

Performance Critical Factors

Build Time Killers:

  • Large build context (copying unnecessary files)
  • Poor layer caching strategy
  • Installing dependencies on every build
  • Single-stage builds without optimization

Multi-Stage Build Pattern (Reduces build time from 20 minutes to 3 minutes):

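# Stage 1: production dependencies only
FROM node:18-alpine AS dependencies
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev

# Stage 2: full install and build
FROM node:18-alpine AS build
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

# Stage 3: minimal runtime image
FROM node:18-alpine AS runtime
WORKDIR /app
# npm start reads the "start" script, so package.json must be in the image
COPY package*.json ./
COPY --from=dependencies /app/node_modules ./node_modules
COPY --from=build /app/dist ./dist
EXPOSE 3000
CMD ["npm", "start"]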

Essential .dockerignore:

node_modules
.git
.DS_Store
*.md
.env
coverage/
logs/

Multi-Platform Issues (Apple Silicon Era)

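Problem: Images built on ARM64 Macs fail to start on x86 CI runners and hosts
Symptoms: exec user process caused: exec format error
Solution:

# Pin the image to the architecture that CI and production run on
FROM --platform=linux/amd64 node:18
# In multi-platform (buildx) builds, $BUILDPLATFORM keeps a build stage on the
# builder's native architecture; it does not change the target by itself
FROM --platform=$BUILDPLATFORM node:18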

Kubernetes Debugging Methodology

Systematic Diagnosis Process

  1. Deployment Status: kubectl get deployments && kubectl describe deployment
  2. Pod Status: kubectl get pods -l app=your-app && kubectl describe pod
  3. Events: kubectl get events --sort-by=.metadata.creationTimestamp
  4. Logs: kubectl logs pod-name --previous

Critical Configuration Failures

Resource Limits (Production Values):

resources:
  requests:
    memory: "256Mi"
    cpu: "100m"
  limits:
    memory: "512Mi"    # Room for memory spikes
    cpu: "500m"        # Room for CPU spikes

Probe Configuration (Prevents Premature Kills):

livenessProbe:
  httpGet:
    path: /health
    port: 3000
  initialDelaySeconds: 60    # Critical: enough startup time
  periodSeconds: 30
  timeoutSeconds: 10
  failureThreshold: 3        # Same as the Kubernetes default; raise it for slow dependencies
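
How much startup time a probe configuration really allows is simple arithmetic. The sketch below is a rough estimate only: it assumes the probe fails on every attempt from the first one, ignores kubelet scheduling jitter, and uses a hypothetical helper name. The parameter defaults mirror the Kubernetes probe defaults.

```javascript
// Approximate seconds from container start until the kubelet restarts a
// container whose liveness probe never succeeds.
function livenessRestartBudgetSeconds({
  initialDelaySeconds = 0,
  periodSeconds = 10,
  timeoutSeconds = 1,
  failureThreshold = 3,
} = {}) {
  // First probe fires after the initial delay; each further failure is one
  // period later; the last failure may take up to the timeout to register.
  return initialDelaySeconds + (failureThreshold - 1) * periodSeconds + timeoutSeconds;
}

const budget = livenessRestartBudgetSeconds({
  initialDelaySeconds: 60,
  periodSeconds: 30,
  timeoutSeconds: 10,
  failureThreshold: 3,
});
console.log(`App must be healthy within roughly ${budget} s`);
```

With the values above the estimate is 60 + 2 × 30 + 10 = 130 seconds, compared with about 21 seconds under the defaults.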

Health Check Implementation:

let isReady = false;

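connectToDatabase()
  .then(() => {
    isReady = true;
  })
  .catch((err) => {
    // Fail fast so the container restarts instead of answering 503 forever
    console.error('Database connection failed:', err);
    process.exit(1);
  });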

app.get('/health', (req, res) => {
  if (isReady) {
    res.status(200).json({ status: 'ready' });
  } else {
    res.status(503).json({ status: 'starting' });
  }
});

Network Resolution Issues

Cross-Namespace Service Communication:

# Same namespace: service-name
DATABASE_URL: postgres://user:pass@postgres:5432/db

# Cross-namespace: service.namespace.svc.cluster.local
DATABASE_URL: postgres://user:pass@postgres.database.svc.cluster.local:5432/db
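
Building the host name in one place avoids hand-typed mistakes in the cross-namespace form. A minimal sketch, assuming the default cluster domain `cluster.local` (some clusters override it); `serviceHost` and `databaseUrl` are hypothetical helper names.

```javascript
// Builds the in-cluster DNS name for a Service. With no namespace, the short
// name is returned, which only resolves from within the same namespace.
function serviceHost(service, namespace, clusterDomain = 'cluster.local') {
  return namespace ? `${service}.${namespace}.svc.${clusterDomain}` : service;
}

// Assembles a Postgres connection URL, escaping credentials that contain
// characters with special meaning in URLs.
function databaseUrl({ user, password, service, namespace, port = 5432, database }) {
  const host = serviceHost(service, namespace);
  const credentials = `${encodeURIComponent(user)}:${encodeURIComponent(password)}`;
  return `postgres://${credentials}@${host}:${port}/${database}`;
}
```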

Error Reference Matrix

| Error | Platform | Root Cause | Time to Fix | Production Impact |
|---|---|---|---|---|
| JavaScript heap out of memory | Node.js builds | Insufficient heap size | 30 seconds | Build failure |
| npm install hangs 30+ minutes | Docker | Registry issues/large deps | 2 minutes | Pipeline timeout |
| Cannot connect to database | Test pipelines | Missing DB service | 5 minutes | Test failure |
| CrashLoopBackOff | Kubernetes | App startup failure | 10-60 minutes | Service down |
| ImagePullBackOff | Kubernetes | Registry auth/image missing | 5-15 minutes | Deployment blocked |
| Exit code 137 | Containers | OOM kill | 2 minutes | Service restart |
| ENOTFOUND registry.npmjs.org | npm | Network/DNS failure | 1 minute | Build blocked |
| port already in use | Tests | Process cleanup failure | 5 minutes | Test flakiness |
| exec format error | Multi-platform | Architecture mismatch | 1 minute | Runtime failure |
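
The matrix lends itself to a lookup that scans a failed job's log and names the likely root cause. This is a sketch, not an exhaustive classifier: the patterns are assumptions about how these errors appear in logs, `EADDRINUSE` is added as the error code Node reports for a port already in use, and `classifyLog` is a hypothetical helper.

```javascript
// Log patterns mapped to the root causes from the error reference matrix.
const KNOWN_ERRORS = [
  { pattern: /JavaScript heap out of memory/, cause: 'Insufficient heap size' },
  { pattern: /ENOTFOUND registry\.npmjs\.org/, cause: 'Network/DNS failure' },
  { pattern: /CrashLoopBackOff/, cause: 'App startup failure' },
  { pattern: /ImagePullBackOff/, cause: 'Registry auth/image missing' },
  { pattern: /exit code 137/i, cause: 'OOM kill' },
  { pattern: /exec format error/, cause: 'Architecture mismatch' },
  { pattern: /EADDRINUSE|port (is )?already in use/i, cause: 'Process cleanup failure' },
];

// Returns every root cause whose pattern appears in the log text.
function classifyLog(logText) {
  return KNOWN_ERRORS
    .filter(({ pattern }) => pattern.test(logText))
    .map(({ cause }) => cause);
}
```

An empty result means the failure falls outside the matrix and needs the full debugging hierarchy.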

Resource Requirements and Trade-offs

Build Resource Allocation

  • Minimum for Node.js builds: 2GB RAM, 2 CPU cores
  • Recommended for production: 4GB RAM, 4 CPU cores
  • Enterprise/large codebases: 8GB RAM, 8 CPU cores

Time Investment Patterns

  • "Works locally" debugging: 15 minutes - 4 hours
  • Docker build optimization: 2-8 hours initial, saves 15+ minutes per build
  • Kubernetes deployment issues: 30 minutes - 3 hours
  • Network/registry problems: 5 minutes - 2 hours

Technology Support Quality

  • Docker: Excellent documentation, large community
  • Kubernetes: Steep learning curve, complex debugging
  • GitHub Actions: Good docs, limited debugging tools
  • npm/Node.js: Mature ecosystem, occasional registry issues
  • Bun: Fast but unstable, limited CI support
  • pnpm: Good performance, workspace complications

Prevention Strategies

Dependency Management

{
  "engines": {
    "node": "18.19.0",
    "npm": "10.2.3"
  }
}
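
npm only warns about an `engines` mismatch unless `engine-strict` is enabled, so an explicit check at the start of the pipeline can make drift fail loudly. A minimal sketch that handles exact version pins like the ones above, not semver ranges; `findEngineDrift` is a hypothetical helper.

```javascript
// Compares exact version pins from package.json "engines" with the versions
// actually running. Tools missing from `actual` are skipped.
function findEngineDrift(engines, actual) {
  return Object.entries(engines)
    .filter(([name, wanted]) => actual[name] !== undefined && actual[name] !== wanted)
    .map(([name, wanted]) => `${name}: expected ${wanted}, found ${actual[name]}`);
}

// process.version looks like "v18.19.0"; strip the leading "v".
const drift = findEngineDrift({ node: '18.19.0' }, { node: process.version.slice(1) });
if (drift.length > 0) {
  console.warn(`Version drift detected: ${drift.join('; ')}`);
}
```

In a real pipeline the first argument would come from the project's package.json, and a non-empty result would fail the job.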

Container Image Pinning

# Avoid: floating tags (latest or a bare major version) that change underneath you
FROM node:18

# Use: specific versions for stability
FROM node:18.19.0-alpine

Test Isolation

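// Random ports reduce conflicts (PORT arrives as a string, so convert it)
const port = Number(process.env.PORT) || 3000 + Math.floor(Math.random() * 1000);

// Random database names; slice(2) drops the "0." prefix, which is not valid in an identifier
const dbName = `test_${Date.now()}_${Math.random().toString(36).slice(2)}`;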
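
Randomly chosen ports can still collide. An alternative is to let the operating system pick one by binding to port 0. The sketch below uses only Node's built-in `net` module; `getFreePort` is a hypothetical helper.

```javascript
const net = require('node:net');

// Asks the OS for a free port by binding to port 0, then releases it.
function getFreePort() {
  return new Promise((resolve, reject) => {
    const server = net.createServer();
    server.once('error', reject);
    server.listen(0, '127.0.0.1', () => {
      const { port } = server.address();
      server.close(() => resolve(port));
    });
  });
}
```

There is a small window between releasing the port and reusing it. Where the app under test can be started directly, passing port 0 to its own `listen` call and reading the port back from `server.address()` avoids that window.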

Nuclear Debugging Options

GitHub Actions Deep Debug

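# Set these as repository secrets or variables (Settings > Secrets and variables > Actions).
# Putting them in a workflow env: block does not turn on debug logging.
ACTIONS_STEP_DEBUG: true
ACTIONS_RUNNER_DEBUG: true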

Docker Verbose Logging

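# --progress=plain is a BuildKit option: print the full output of every step, without cache
DOCKER_BUILDKIT=1 docker build --progress=plain --no-cache .
# or set it once for the whole shell session
export BUILDKIT_PROGRESS=plain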

Kubernetes Emergency Access

# Debug pod in same network
kubectl run debug --image=busybox -it --rm --restart=Never -- /bin/sh

# Emergency port forward
kubectl port-forward service/your-app 8080:80

# Container filesystem access (Alpine-based images ship /bin/sh, not /bin/bash)
kubectl exec -it pod-name -- /bin/sh

2025 Technology Updates

New Failure Modes

  • Apple Silicon CI runners: Docker buildx complications
  • Bun/Deno in CI: Compatibility issues with Node-based tooling
  • pnpm workspaces: Different resolution behavior than npm
  • Podman adoption: Authentication differences from Docker

Updated Best Practices

  • Pin base images to prevent surprise updates
  • Use BuildKit for multi-platform builds
  • Implement proper health checks for container orchestration
  • Add retry logic for network-dependent operations
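
Retry logic for network-dependent steps can be as small as the sketch below, which uses exponential backoff. The attempt count and delays are illustrative assumptions, and `retry` is a hypothetical helper rather than part of any library.

```javascript
// Runs `operation` up to `attempts` times, doubling the wait after each
// failure (baseDelayMs, 2x, 4x, ...). Rethrows the last error if all fail.
async function retry(operation, { attempts = 3, baseDelayMs = 500 } = {}) {
  let lastError;
  for (let attempt = 1; attempt <= attempts; attempt += 1) {
    try {
      return await operation(attempt);
    } catch (error) {
      lastError = error;
      if (attempt < attempts) {
        const delayMs = baseDelayMs * 2 ** (attempt - 1);
        await new Promise((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  throw lastError;
}
```

Wrapping only the flaky call (a registry fetch, an artifact upload) keeps real failures visible instead of retrying the whole job.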

Critical Success Factors

  1. Systematic debugging approach prevents 4+ hour debugging sessions
  2. Environment reproduction eliminates "works on my machine" syndrome
  3. Resource monitoring prevents silent OOM failures
  4. Proper health checks prevent premature Kubernetes kills
  5. Dependency pinning prevents unexpected breakage

Bottom Line: 90% of CI/CD issues fall into these categories. Following this hierarchy saves 2-6 hours per incident compared to random debugging approaches.

Useful Links for Further Investigation

Links I Keep Open When Shit Breaks at 3AM

| Link | Description |
|---|---|
| GitHub Actions Debug Mode | First thing I enable when workflows fail mysteriously. Add ACTIONS_STEP_DEBUG=true to secrets and see what's actually happening. Saved me probably 100 hours of guessing. |
| Docker Exit Codes Cheat Sheet | When Docker says "Exited (137)" and you have no fucking clue what that means. 137 = killed with SIGKILL, almost always out of memory. 125 = docker command failed. Bookmark this, you'll need it. |
| kubectl Cheat Sheet | Copy-paste commands when Kubernetes is being Kubernetes. kubectl describe pod is my most-used command. kubectl logs -f saved my weekend more times than I can count. |
| Stack Overflow: "continuous-integration" Tag | Where I find solutions to problems that official docs pretend don't exist. Sort by votes, not date. The 2019 answers about Docker networking issues are still more helpful than current docs. |
