CI/CD Pipeline Troubleshooting: AI-Optimized Knowledge Base
Critical Debugging Hierarchy (Time-Saving Order)
1. Environment Differences (Most Common Cause)
Problem: Code works locally but fails in CI with nonsensical errors
Symptoms: Import/module resolution failures, missing dependencies, path issues
Time Investment: 15 minutes to 4 hours if not systematic
Immediate Actions:
# Reproduce CI environment exactly
docker run -it --rm \
-e NODE_ENV=production \
-e DATABASE_URL=postgres://user:pass@db:5432/myapp \
node:18-alpine \
sh
Common Environment Gotchas:
- Case sensitivity: macOS filesystems are case-insensitive by default, Linux is case-sensitive, so ./Utils resolves locally but fails in CI
- Path separators: Windows uses \, Unix uses /
- Missing environment variables in CI
- Node version drift between local and CI
- TypeScript path mapping missing in Docker builds
Real-World Impact: Single missing tsconfig.json in Docker context = 4 hours debugging time
2. Resource Limits (Silent Killer)
Memory Exhaustion Indicators:
- "JavaScript heap out of memory" during builds
- "Killed" with no context (Linux OOM killer)
- Tests pass individually, fail together
- Docker builds hang at npm install
Production Thresholds:
- Node.js default heap: ~2GB (insufficient for modern webpack builds)
- Required for medium React apps: 4GB (--max-old-space-size=4096)
- Memory spiking past the CI runner's allocation = immediate failure
Configuration:
export NODE_OPTIONS="--max-old-space-size=4096"
3. Network Connectivity Issues
Failure Patterns:
- Registry timeouts during dependency installation
- DNS resolution failures in containers
- Corporate firewall blocking package registries
- Authentication failures with private registries
Diagnostic Commands:
curl -I https://registry.npmjs.org/
nslookup registry.npmjs.org
ping google.com
4. Timing and Race Conditions
High-Risk Scenarios:
- Database seeds not fully applied before tests
- Server startup timing in integration tests
- Process dependencies without proper awaiting
- File system operations assuming synchronous completion
Prevention Pattern:
// Wrong: assumes immediate server readiness
test('should return 200', async () => {
const response = await request(server).get('/health');
expect(response.status).toBe(200);
});
// Correct: explicit readiness check
test('should return 200', async () => {
await server.ready();
const response = await request(server).get('/health');
expect(response.status).toBe(200);
});
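When the dependency doesn't expose a ready() hook, the same idea generalizes to polling with a deadline. A sketch; intervals and timeout are assumptions to tune per stack:

```javascript
// Generic readiness poller: retry a check until it passes or a deadline expires.
async function waitFor(check, { timeoutMs = 10000, intervalMs = 250 } = {}) {
  const deadline = Date.now() + timeoutMs;
  let lastError;
  while (Date.now() < deadline) {
    try {
      if (await check()) return;
    } catch (err) {
      lastError = err; // keep polling; the dependency may still be starting
    }
    await new Promise((r) => setTimeout(r, intervalMs));
  }
  throw new Error(`waitFor timed out after ${timeoutMs}ms: ${lastError ?? 'check never returned true'}`);
}
```

Usage in a test setup hook might look like `await waitFor(() => db.ping())`, where db.ping is whatever cheap health call your client exposes. Failing with a timeout error beats a cryptic ECONNREFUSED three assertions later.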
Docker Build Optimization
Performance Critical Factors
Build Time Killers:
- Large build context (copying unnecessary files)
- Poor layer caching strategy
- Installing dependencies on every build
- Single-stage builds without optimization
Multi-Stage Build Pattern (Reduces build time from 20 minutes to 3 minutes):
FROM node:18-alpine AS dependencies
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev
FROM node:18-alpine AS build
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build
FROM node:18-alpine AS runtime
WORKDIR /app
COPY package*.json ./
COPY --from=dependencies /app/node_modules ./node_modules
COPY --from=build /app/dist ./dist
EXPOSE 3000
CMD ["npm", "start"]
Essential .dockerignore:
node_modules
.git
.DS_Store
*.md
.env
coverage/
logs/
Multi-Platform Issues (Apple Silicon Era)
Problem: ARM64 builds on Mac fail on x86 CI
Symptoms: exec user process caused: exec format error
Solution:
FROM --platform=$BUILDPLATFORM node:18
# or explicit platform targeting
FROM --platform=linux/amd64 node:18
Kubernetes Debugging Methodology
Systematic Diagnosis Process
- Deployment Status:
kubectl get deployments && kubectl describe deployment
- Pod Status:
kubectl get pods -l app=your-app && kubectl describe pod
- Events:
kubectl get events --sort-by=.metadata.creationTimestamp
- Logs:
kubectl logs pod-name --previous
Critical Configuration Failures
Resource Limits (Production Values):
resources:
  requests:
    memory: "256Mi"
    cpu: "100m"
  limits:
    memory: "512Mi"  # Room for memory spikes
    cpu: "500m"      # Room for CPU spikes
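On the app side, a small in-process watcher can surface an approaching limit before the OOM killer acts. A sketch; MEMORY_LIMIT_MB is an assumed env var you would wire to the container limit (512Mi above → 512):

```javascript
// Warn when resident memory nears the container's limit (80% threshold is an assumption).
const LIMIT_MB = Number(process.env.MEMORY_LIMIT_MB || 512);

const timer = setInterval(() => {
  const rssMb = process.memoryUsage().rss / 1024 / 1024;
  if (rssMb > LIMIT_MB * 0.8) {
    console.warn(`RSS at ${Math.round(rssMb)} MB of a ${LIMIT_MB} MB limit - OOM kill risk`);
  }
}, 10000);
timer.unref(); // don't let monitoring keep the event loop alive
```

A warning in the logs turns a mystery exit code 137 into a known memory trend.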
Probe Configuration (Prevents Premature Kills):
livenessProbe:
  httpGet:
    path: /health
    port: 3000
  initialDelaySeconds: 60  # Critical: enough startup time
  periodSeconds: 30
  timeoutSeconds: 10
  failureThreshold: 3      # More forgiving than the default
Health Check Implementation:
let isReady = false;
connectToDatabase()
  .then(() => { isReady = true; })
  .catch((err) => {
    // Stay not-ready: the probe keeps returning 503 instead of the process crashing
    console.error('database connection failed:', err);
  });
app.get('/health', (req, res) => {
  if (isReady) {
    res.status(200).json({ status: 'ready' });
  } else {
    res.status(503).json({ status: 'starting' });
  }
});
Network Resolution Issues
Cross-Namespace Service Communication:
# Same namespace: service-name
DATABASE_URL: postgres://user:pass@postgres:5432/db
# Cross-namespace: service.namespace.svc.cluster.local
DATABASE_URL: postgres://user:pass@postgres.database.svc.cluster.local:5432/db
Error Reference Matrix
Error | Platform | Root Cause | Time to Fix | Production Impact |
---|---|---|---|---|
JavaScript heap out of memory | Node.js builds | Insufficient heap size | 30 seconds | Build failure |
npm install hangs 30+ minutes | Docker | Registry issues/large deps | 2 minutes | Pipeline timeout |
Cannot connect to database | Test pipelines | Missing DB service | 5 minutes | Test failure |
CrashLoopBackOff | Kubernetes | App startup failure | 10-60 minutes | Service down |
ImagePullBackOff | Kubernetes | Registry auth/image missing | 5-15 minutes | Deployment blocked |
Exit code 137 | Containers | OOM kill | 2 minutes | Service restart |
ENOTFOUND registry.npmjs.org | npm | Network/DNS failure | 1 minute | Build blocked |
port already in use | Tests | Process cleanup failure | 5 minutes | Test flakiness |
exec format error | Multi-platform | Architecture mismatch | 1 minute | Runtime failure |
Resource Requirements and Trade-offs
Build Resource Allocation
- Minimum for Node.js builds: 2GB RAM, 2 CPU cores
- Recommended for production: 4GB RAM, 4 CPU cores
- Enterprise/large codebases: 8GB RAM, 8 CPU cores
Time Investment Patterns
- "Works locally" debugging: 15 minutes - 4 hours
- Docker build optimization: 2-8 hours initial, saves 15+ minutes per build
- Kubernetes deployment issues: 30 minutes - 3 hours
- Network/registry problems: 5 minutes - 2 hours
Technology Support Quality
- Docker: Excellent documentation, large community
- Kubernetes: Steep learning curve, complex debugging
- GitHub Actions: Good docs, limited debugging tools
- npm/Node.js: Mature ecosystem, occasional registry issues
- Bun: Fast but unstable, limited CI support
- pnpm: Good performance, workspace complications
Prevention Strategies
Dependency Management
{
  "engines": {
    "node": "18.19.0",
    "npm": "10.2.3"
  }
}
Note: npm only enforces engines when engine-strict=true is set in .npmrc; otherwise it just warns and proceeds.
Container Image Pinning
# Avoid: latest tags that break unexpectedly
FROM node:18
# Use: specific versions for stability
FROM node:18.19.0-alpine
Test Isolation
// Random ports prevent conflicts
const port = Number(process.env.PORT) || 3000 + Math.floor(Math.random() * 1000);
// Random database names; slice(2) drops the "0." prefix from toString(36)
const dbName = `test_${Date.now()}_${Math.random().toString(36).slice(2)}`;
Nuclear Debugging Options
GitHub Actions Deep Debug
env:
ACTIONS_STEP_DEBUG: true
ACTIONS_RUNNER_DEBUG: true
Docker Verbose Logging
# Plain, uncached build output (--progress requires BuildKit, the default builder)
docker build --progress=plain --no-cache .
export BUILDKIT_PROGRESS=plain
# Or fall back to the legacy builder entirely
DOCKER_BUILDKIT=0 docker build --no-cache .
Kubernetes Emergency Access
# Debug pod in same network
kubectl run debug --image=busybox -it --rm --restart=Never -- /bin/sh
# Emergency port forward
kubectl port-forward service/your-app 8080:80
# Container filesystem access (alpine images ship /bin/sh, not bash)
kubectl exec -it pod-name -- /bin/sh
2025 Technology Updates
New Failure Modes
- Apple Silicon CI runners: Docker buildx complications
- Bun/Deno in CI: Compatibility issues with Node-based tooling
- pnpm workspaces: Different resolution behavior than npm
- Podman adoption: Authentication differences from Docker
Updated Best Practices
- Pin base images to prevent surprise updates
- Use BuildKit for multi-platform builds
- Implement proper health checks for container orchestration
- Add retry logic for network-dependent operations
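The retry-logic practice above can be sketched as exponential backoff around any network-dependent call; retry counts and delays are assumptions to tune:

```javascript
// Exponential-backoff retry: 500ms, 1s, 2s, ... then give up with the last error.
async function withRetry(fn, { retries = 3, baseDelayMs = 500 } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= retries) throw err;
      const delay = baseDelayMs * 2 ** attempt;
      await new Promise((r) => setTimeout(r, delay));
    }
  }
}
```

Wrap registry fetches, artifact uploads, or flaky API calls: `await withRetry(() => fetchFromRegistry(pkg))`, where fetchFromRegistry stands in for whatever network call your pipeline makes. Only retry operations that are safe to repeat.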
Critical Success Factors
- Systematic debugging approach prevents 4+ hour debugging sessions
- Environment reproduction eliminates "works on my machine" syndrome
- Resource monitoring prevents silent OOM failures
- Proper health checks prevent premature Kubernetes kills
- Dependency pinning prevents unexpected breakage
Bottom Line: 90% of CI/CD issues fall into these categories. Following this hierarchy saves 2-6 hours per incident compared to random debugging approaches.
Useful Links for Further Investigation
Links I Keep Open When Shit Breaks at 3AM
Link | Description |
---|---|
GitHub Actions Debug Mode | First thing I enable when workflows fail mysteriously. Add ACTIONS_STEP_DEBUG=true to secrets and see what's actually happening. Saved me probably 100 hours of guessing. |
Docker Exit Codes Cheat Sheet | When Docker says "Exited (137)" and you have no fucking clue what that means. 137 = out of memory. 125 = docker command failed. Bookmark this, you'll need it. |
kubectl Cheat Sheet | Copy-paste commands when Kubernetes is being Kubernetes. kubectl describe pod is my most-used command. kubectl logs -f saved my weekend more times than I can count. |
Stack Overflow: "continuous-integration" Tag | Where I find solutions to problems that official docs pretend don't exist. Sort by votes, not date. The 2019 answers about Docker networking issues are still more helpful than current docs. |