The Reality: Exit Code 1 Means Your App Said "Fuck This"

Kubernetes Pod Error States

Exit code 1 is the computer equivalent of rage-quitting. Your app starts, something pisses it off immediately, and it just gives up. Unlike the more descriptive exit codes (137 means it got killed, usually for running out of memory; 143 means it got a SIGTERM and shut down), exit code 1 just means "something went wrong and I'm done."

I've debugged this nightmare more times than I can count. Your container works perfectly on Docker Desktop, passes all your tests, deploys without errors - then immediately dies with exit code 1 and goes into that maddening CrashLoopBackOff cycle.

The worst part? The error message is usually useless: "Error: Error occurred" or just nothing at all.
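
Before anything else, confirm you're actually looking at exit code 1 and not one of its cousins - the pod status has it (pod name is a placeholder):

## Pull the exit code straight from the pod status
kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'

## Or eyeball it in the describe output
kubectl describe pod <pod-name> | grep -A 5 "Last State"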

The Three Things That Actually Cause Exit Code 1

After years of debugging this crap, 95% of exit code 1 crashes come down to three things:

1. Missing environment variable - Your app expects DATABASE_URL but Kubernetes doesn't have it
2. Can't connect to database/service - Postgres isn't ready yet, or the service name is wrong
3. File permissions are fucked - Your app can't read config files or write logs

Everything else is just variations of these three problems.

War Stories: When Exit Code 1 Ruined My Day

The Missing Environment Variable That Took Down Production

Last month, I deployed a Node.js API to production. Worked perfectly in staging. Five minutes after deployment, CrashLoopBackOff with exit code 1.

The logs showed: Error: JWT_SECRET environment variable is required

Turns out the staging ConfigMap had JWT_SECRET but production had JWT_TOKEN. Same fucking value, different key name. App couldn't start without it.

Took me 2 hours to figure this out because I kept assuming the ConfigMap was identical between environments.
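
What would have caught it in minutes: diff the config between environments instead of trusting your memory. A rough sketch - the ConfigMap name and namespaces are placeholders:

## Compare the same ConfigMap across namespaces (needs bash for process substitution)
diff <(kubectl get configmap app-config -n staging -o yaml) \
     <(kubectl get configmap app-config -n production -o yaml)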

The Database That "Was Ready" But Actually Wasn't

Init container checked that Postgres was accepting connections. Main app container started immediately after and died with "connection refused".

Postgres was accepting connections but wasn't actually ready to handle queries yet. Init container passed, main app failed. Spent 3 hours debugging this before adding a 10-second delay after the init container.

Learned the hard way: pg_isready doesn't mean "ready for your application queries."
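
If I were writing that init container again, I'd poll with a real query instead of pg_isready. A sketch, assuming the init image ships psql (the official postgres image does) and DATABASE_URL is set in its environment:

## Keep polling until an actual query succeeds, not just a TCP connect
until psql "$DATABASE_URL" -c 'SELECT 1' >/dev/null 2>&1; do
  echo "Postgres accepting connections but not ready for queries yet..."
  sleep 2
done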

The File Permissions Nightmare

Deployed with readOnlyRootFilesystem: true for security. App tried to create a temp file and crashed with "Permission denied". The filesystem was read-only but the app needed to write logs.

Had to mount an emptyDir volume at /tmp to give the app somewhere to write. Should have been obvious but it wasn't. Cost me a whole afternoon.

The Commands That Actually Work

When your app is stuck in CrashLoopBackOff with exit code 1, start with these:

## Get the actual error message (if there is one)
kubectl logs <pod-name> --previous

## Check if environment variables are missing  
kubectl exec <pod-name> -- env | grep -E "(DATABASE|API|SECRET)"

## Test connectivity to services your app needs
kubectl exec <pod-name> -- nc -zv database-service 5432

## Check file permissions if your app writes files
kubectl exec <pod-name> -- ls -la /app/
kubectl exec <pod-name> -- id

90% of the time, one of these commands shows you exactly what's wrong. The other 10% of the time, you're fucked and need to add debug logging to your app.
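
One timing gotcha: kubectl exec only works during the short window when the container is actually up, which in a tight crash loop can be a few seconds. Watching the pod helps you catch that window (pod name is a placeholder):

## Watch the pod so you catch the brief window when the container is running
kubectl get pods -w | grep <pod-name>

## In a second terminal, stream logs the moment it comes back up
kubectl logs -f <pod-name>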

Version-Specific Gotchas That Will Bite You

Kubernetes Version Compatibility Issues

Node.js 18.x changed environment variable behavior - If you're using dotenv, it now throws errors for missing variables that it used to ignore. Your app might work locally with Node 16 but crash with exit code 1 on Node 18.

Kubernetes 1.24 removed dockershim (Docker as the container runtime) - If you recently migrated from Docker to containerd, some volume mount permissions might behave differently now.

Python 3.11 import changes - Some packages that worked in Python 3.9 now fail to import in 3.11, causing immediate exit code 1 crashes.

Don't ask me how I know all of this.
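
A quick way to check which runtime version your image actually ships before blaming Kubernetes - the image name is a placeholder, and swap node --version for python3 --version or java -version as needed:

## Run the image once with the command overridden, just to print the runtime version
kubectl run version-check --image=<your-image> --rm -it --restart=Never --command -- node --version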

The Debug Process Flow That Actually Works

Troubleshooting Process Flow

  1. Look at the logs first - kubectl logs <pod-name> --previous
  2. Check environment variables - kubectl exec <pod-name> -- env
  3. Test connectivity - kubectl exec <pod-name> -- nc -zv service-name port
  4. Check file permissions - kubectl exec <pod-name> -- ls -la /app/
  5. Nuclear option - Override the command to keep container alive for debugging

This process catches 95% of exit code 1 issues in under 10 minutes.

Now that you understand what causes exit code 1 crashes and how to diagnose them, let's move on to the actual fixes that work in production.


The 3 Fixes That Actually Work

Kubernetes Debugging Workflow

Skip the theory. Here are the 3 fixes that solve 95% of exit code 1 crashes, ranked by how often they work.

Fix #1: Missing Environment Variables (70% of cases)

The 2-minute emergency fix:

## Find the missing variable from logs
kubectl logs <pod-name> --previous | grep -i "required\|missing\|undefined"

## Patch it directly into the deployment
kubectl patch deployment <deployment-name> -p='{"spec":{"template":{"spec":{"containers":[{"name":"<container-name>","env":[{"name":"DATABASE_URL","value":"postgres://user:pass@db:5432/myapp"}]}]}}}}'

## Force restart 
kubectl rollout restart deployment <deployment-name>

Why this happens: Your staging environment has JWT_SECRET but production has JWT_TOKEN. Same value, different key name. Or someone forgot to create the Secret in the new namespace.
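
Worth checking whether the Secret or ConfigMap even exists in the target namespace and which keys it actually carries - the names below are placeholders for whatever your deployment references:

## List which keys the Secret and ConfigMap actually have
kubectl describe secret app-secrets -n production
kubectl get configmap app-config -n production -o yaml

## Check what the deployment thinks it's injecting
kubectl get deployment <deployment-name> -o yaml | grep -A 10 "env:\|envFrom:"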

The permanent fix:
Add environment variable validation to your app startup:

// Add this to the top of your main file
const requiredEnvVars = ['DATABASE_URL', 'JWT_SECRET', 'API_KEY'];
const missing = requiredEnvVars.filter(env => !process.env[env]);
if (missing.length > 0) {
  console.error(`Missing required environment variables: ${missing.join(', ')}`);
  console.error('Available env vars:', Object.keys(process.env).filter(k => !k.includes('PASSWORD')).sort());
  process.exit(1);
}

This way you get useful error messages instead of "Error: Error occurred".
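
After patching, confirm the variable actually landed in the freshly restarted pod (the variable name is just the example from above):

## Verify the new pod really has the variable
kubectl exec <pod-name> -- printenv DATABASE_URL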

Fix #2: Database/Service Connection Failed (20% of cases)

The nuclear option that works:

## Delete the pod and let Kubernetes recreate it
kubectl delete pod <pod-name>

## If that doesn't work, restart the entire deployment
kubectl rollout restart deployment <deployment-name>

## Still broken? Check if the service actually exists
kubectl get svc | grep database
kubectl get endpoints database-service

Why this happens: Your app tries to connect to postgres-service:5432 but the actual service is named postgres. Or Postgres is accepting connections but not ready for queries yet.
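
A quick way to check whether the hostname your app is configured with actually resolves to anything - the service names here are placeholders:

## Does the name your app uses actually resolve inside the cluster?
kubectl exec <pod-name> -- nslookup postgres-service

## What services exist, and do they have endpoints backing them?
kubectl get svc -A | grep -i postgres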

The permanent fix that doesn't suck:

// Add retry logic to database connections
const connectWithRetry = async (retries = 10) => {
  for (let i = 0; i < retries; i++) {
    try {
      await database.connect();
      console.log(`Connected to database on attempt ${i + 1}`);
      return;
    } catch (error) {
      console.log(`Database connection attempt ${i + 1}/${retries} failed: ${error.message}`);
      if (i === retries - 1) {
        console.error('Max retries reached, exiting');
        process.exit(1);
      }
      await new Promise(resolve => setTimeout(resolve, 2000 * (i + 1))); // exponential backoff
    }
  }
};

Fix #3: File Permissions Are Fucked (5% of cases)

The "why didn't I think of this" fix:

## Check what user your app is running as
kubectl exec <pod-name> -- id

## Check file permissions
kubectl exec <pod-name> -- ls -la /app/

## Check if filesystem is read-only
kubectl exec <pod-name> -- touch /tmp/test && echo "Can write" || echo "Cannot write"

Why this happens: You deployed with readOnlyRootFilesystem: true but your app needs to write logs or temp files.

The fix:

## Add writable volumes for apps that need to write files
spec:
  template:
    spec:
      securityContext:
        runAsUser: 1000
        runAsGroup: 1000
        fsGroup: 1000
      containers:
      - name: app
        securityContext:
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem: true
        volumeMounts:
        - name: tmp
          mountPath: /tmp
        - name: logs
          mountPath: /app/logs
      volumes:
      - name: tmp
        emptyDir: {}
      - name: logs
        emptyDir: {}
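
After rolling that out, a quick sanity check that the writable mounts behave - the paths match the volumeMounts above:

## Confirm the app can actually write where it needs to
kubectl exec <pod-name> -- sh -c 'touch /tmp/write-test && touch /app/logs/write-test && echo "writable"'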

When All Else Fails: The Debug Container

Debug Container Architecture

If none of the above fixes work, you're probably dealing with something weird. Override the container command to keep it alive so you can debug:

## Keep the container alive
kubectl patch deployment <deployment-name> -p='{"spec":{"template":{"spec":{"containers":[{"name":"<container-name>","command":["sleep","3600"]}]}}}}'

## Then exec into it and run your app manually
kubectl exec -it <pod-name> -- /bin/sh
cd /app
node server.js  # or whatever your startup command is

This lets you see exactly what's failing without the restart loop masking the problem.
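
Once you've found the problem, don't leave the sleep override in place. Rolling back to the previous revision works, or just re-apply your real manifest:

## Undo the sleep override when you're done debugging
kubectl rollout undo deployment <deployment-name>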

Reality Check: Time Estimates

  • Fix #1 (env vars): 2-5 minutes if you know what you're doing
  • Fix #2 (database): 5-15 minutes, depending on how fucked your networking is
  • Fix #3 (permissions): 10-30 minutes, because file permissions are always confusing
  • Debug container: 30 minutes to 2 hours, depending on how deep the rabbit hole goes

These fixes handle the most common scenarios, but every production environment has its unique quirks. Let's tackle some of the specific questions and edge cases you'll run into.


Exit Code 1 Troubleshooting FAQ

Q

How do I know if this fucking CrashLoopBackOff is exit code 1?

A

Check the pod events and container status:

# Find the exit code in the pod description
kubectl describe pod <pod-name> | grep -A5 -B5 "Exit Code"

# Check recent events for the pod
kubectl get events --field-selector involvedObject.name=<pod-name> | grep -i exit

Look for "Exit Code: 1" in the pod description. Unlike exit code 137 (OOMKilled) or 143 (SIGTERM), exit code 1 means your application code explicitly called exit(1) or crashed with an unhandled error.

Key difference: Exit code 1 = application problem. Exit code 137 = resource problem. Exit code 143 = shutdown signal. Exit code 125 = Docker problem.

Q

My application works locally but gives exit code 1 in Kubernetes. What's different?

A

The three most common environmental differences:

  1. Environment variables:

     # Compare local vs Kubernetes environment
     docker run --rm <your-image> env | sort > local-env.txt
     kubectl exec <pod-name> -- env | sort > k8s-env.txt
     diff local-env.txt k8s-env.txt

  2. File paths and volumes:

     # Check if files exist where your app expects them
     kubectl exec <pod-name> -- ls -la /app/config/
     kubectl exec <pod-name> -- ls -la /data/
     kubectl exec <pod-name> -- pwd  # Check working directory

  3. Network connectivity:

     # Test database/API connections from inside the pod
     kubectl exec <pod-name> -- nc -zv database-service 5432
     kubectl exec <pod-name> -- nslookup api-service

Reality check: Most "works locally" issues are missing environment variables. Docker Compose sets defaults that Kubernetes doesn't inherit.

Q

My logs show "Error: Cannot find module 'xyz'" but the dependency is in package.json. Why?

A

This is a container build issue, not a runtime issue.

The dependency wasn't properly installed in the container image.

Diagnosis:

# Check if node_modules exists and has the missing module
kubectl exec <pod-name> -- ls -la /app/node_modules/
kubectl exec <pod-name> -- ls -la /app/node_modules/xyz/

# Check if npm install ran during build
kubectl exec <pod-name> -- ls -la /app/package*.json

Common causes:

  • Dockerfile doesn't run npm install or npm ci
  • .dockerignore excludes node_modules (which is usually correct)
  • Multi-stage build doesn't copy node_modules to the final stage
  • Wrong working directory during npm install

Fix in Dockerfile:

FROM node:18-alpine
WORKDIR /app
COPY package*.json ./
RUN npm ci --only=production   # This must run BEFORE copying app code
COPY . .
CMD ["node", "server.js"]

Q

My application exits with code 1 saying "Permission denied" when accessing files. How do I fix this?

A

This is a user permissions and security context issue.

Diagnosis:

# Check what user the container runs as
kubectl exec <pod-name> -- id
kubectl exec <pod-name> -- whoami

# Check file ownership and permissions
kubectl exec <pod-name> -- ls -la /app/
kubectl exec <pod-name> -- ls -la /data/
kubectl exec <pod-name> -- stat /app/config.json

Fix with proper security context:

apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      securityContext:
        runAsUser: 1000
        runAsGroup: 1000
        fsGroup: 1000  # This sets group ownership of mounted volumes
      containers:
      - name: app
        image: myapp:latest
        securityContext:
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem: true
        volumeMounts:
        - name: data
          mountPath: /data
        - name: tmp
          mountPath: /tmp  # Writable temp directory
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: app-data
      - name: tmp
        emptyDir: {}

Fix in Dockerfile:

FROM node:18-alpine

# Create app user
RUN addgroup -g 1000 -S appgroup && adduser -S appuser -u 1000 -G appgroup

WORKDIR /app
COPY --chown=appuser:appgroup . .
USER appuser

CMD ["node", "server.js"]

Q

My init container succeeds but the main container still gets exit code 1 with database connection errors. Why?

A

Race condition between init container completion and main container startup.

The database was available when the init container checked, but became unavailable by the time the main container started.

Better init container pattern:

initContainers:
- name: wait-for-dependencies
  image: busybox
  command:
  - sh
  - -c
  - |
    echo "Waiting for PostgreSQL..."
    until nc -z postgres-service 5432; do sleep 2; done
    echo "Waiting for Redis..."
    until nc -z redis-service 6379; do sleep 2; done
    # Wait a bit longer to ensure services are fully ready
    echo "Services responding, waiting 10 more seconds for stability..."
    sleep 10
    echo "All dependencies ready!"

Application-level retry logic (recommended):

// Add retry logic to your main application
async function connectWithRetry(maxAttempts = 10) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      await database.connect();
      console.log(`Database connected on attempt ${attempt}`);
      return;
    } catch (error) {
      console.log(`Connection attempt ${attempt} failed: ${error.message}`);
      if (attempt === maxAttempts) {
        console.error("Max connection attempts reached");
        process.exit(1);
      }
      await new Promise(resolve => setTimeout(resolve, 2000 * attempt));
    }
  }
}

Q

How do I debug exit code 1 when there are no logs or very minimal logs?

A

Silent failures usually indicate the application is crashing before logging initialization:

  1. Test the container image directly:

     # Run the exact same image with a shell
     kubectl run debug-pod --image=<failing-image> --rm -it --restart=Never -- sh

     # Inside the container, try running your application manually
     cd /app
     node server.js  # or python app.py, java -jar app.jar, etc.

  2. Override the container command to keep it alive:

     kubectl patch deployment <deployment-name> -p='{"spec":{"template":{"spec":{"containers":[{"name":"<container-name>","command":["sleep","3600"]}]}}}}'

  3. Add debug logging to your application startup:

     // Add this at the very beginning of your main file
     console.log("Application starting...");
     console.log("Node version:", process.version);
     console.log("Working directory:", process.cwd());
     console.log("Command line arguments:", process.argv);
     console.log("Environment variables:", Object.keys(process.env).filter(k => !k.includes('PASSWORD')));

     try {
       // Your application initialization here
       console.log("Loading dependencies...");
       const express = require('express');
       console.log("Express loaded successfully");

       console.log("Loading configuration...");
       // Configuration loading code
     } catch (error) {
       console.error("Error during application startup:", error);
       process.exit(1);
     }

Q

My application gets exit code 1 during health check failures. Is this normal?

A

No, this is usually a configuration problem.

Health check failures should not cause exit code 1. Kubernetes kills the container with SIGTERM (exit code 143) when liveness probes fail.

If you're seeing exit code 1 during health check issues:

  1. Your health endpoint is crashing:

     // Bad health endpoint that can cause exit code 1
     app.get('/health', (req, res) => {
       // This throws an unhandled exception if the database is down
       const result = database.query('SELECT 1');
       res.json({status: 'ok'});
     });

     // Good health endpoint
     app.get('/health', (req, res) => {
       try {
         // Check application components
         const dbStatus = database.isConnected() ? 'connected' : 'disconnected';
         const redisStatus = redis.ping() ? 'connected' : 'disconnected';

         if (dbStatus === 'connected' && redisStatus === 'connected') {
           res.status(200).json({status: 'healthy', database: dbStatus, redis: redisStatus});
         } else {
           res.status(503).json({status: 'unhealthy', database: dbStatus, redis: redisStatus});
         }
       } catch (error) {
         console.error('Health check error:', error);
         res.status(503).json({status: 'error', message: error.message});
       }
     });

  2. Check probe configuration:

     kubectl describe pod <pod-name> | grep -E -A10 "Liveness|Readiness"

Make sure:

  • initialDelaySeconds gives your app enough time to start
  • timeoutSeconds is reasonable (usually 5-10 seconds)
  • periodSeconds isn't too frequent (usually 10-30 seconds)
Q

Can I prevent exit code 1 by catching all exceptions in my application?

A

Yes, but be careful not to hide real problems.

Use global exception handlers as a last resort, not a primary error handling strategy:

// Node.js global exception handling
process.on('uncaughtException', (error) => {
  console.error('Uncaught Exception:', error);
  // Log to external monitoring system
  logger.error('Uncaught exception', { error: error.message, stack: error.stack });
  // Graceful shutdown
  process.exit(1);  // Still exit with code 1, but after proper logging
});

process.on('unhandledRejection', (reason, promise) => {
  console.error('Unhandled Rejection at:', promise, 'reason:', reason);
  logger.error('Unhandled promise rejection', { reason, promise });
  process.exit(1);
});

# Python global exception handling
import sys
import logging

def handle_exception(exc_type, exc_value, exc_traceback):
    if issubclass(exc_type, KeyboardInterrupt):
        sys.__excepthook__(exc_type, exc_value, exc_traceback)
        return
    logging.error("Uncaught exception", exc_info=(exc_type, exc_value, exc_traceback))
    # Exit with code 1 after proper logging
    sys.exit(1)

sys.excepthook = handle_exception

Better approach:

Fix the root causes instead of catching everything:

  • Validate inputs at application boundaries
  • Handle expected errors gracefully
  • Use circuit breakers for external dependencies
  • Implement proper retry logic
  • Validate configuration on startup
Q

How do I handle exit code 1 in batch jobs or one-time tasks?

A

For Kubernetes Jobs, exit code 1 has different implications than for long-running services:

apiVersion: batch/v1
kind: Job
spec:
  backoffLimit: 3              # Retry up to 3 times
  activeDeadlineSeconds: 300   # Job timeout
  template:
    spec:
      restartPolicy: Never     # Don't restart containers within the pod
      containers:
      - name: batch-job
        image: my-batch-job:latest
        command:
        - python
        - -c
        - |
          import sys
          import traceback
          try:
              # Your batch job logic here
              result = process_data()
              print(f"Job completed successfully: {result}")
              sys.exit(0)  # Success
          except ValidationError as e:
              print(f"Data validation failed: {e}")
              sys.exit(1)  # Retry this job
          except CriticalError as e:
              print(f"Critical error, do not retry: {e}")
              sys.exit(2)  # Don't retry this job
          except Exception as e:
              print(f"Unexpected error: {e}")
              traceback.print_exc()
              sys.exit(1)  # Retry this job

Exit code strategy for batch jobs:

  • exit(0): Success, job complete
  • exit(1): Temporary failure, retry the job
  • exit(2-255): Permanent failure, don't retry
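
To see how a Job is actually handling those retries - the job name is a placeholder:

## Watch retries and failures for the Job
kubectl get job <job-name>
kubectl describe job <job-name> | grep -i -A 5 "pods statuses\|events"
kubectl logs job/<job-name>
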
Q

My microservice gets exit code 1 when other services are temporarily unavailable. How do I make it more resilient?

A

Implement circuit breaker patterns and graceful degradation:

// Circuit breaker pattern for resilient microservices
const CircuitBreaker = require('opossum');

const options = {
  timeout: 5000,                 // 5 second timeout
  errorThresholdPercentage: 50,  // Open circuit at 50% failure rate
  resetTimeout: 30000            // Try again after 30 seconds
};

const breaker = new CircuitBreaker(callExternalService, options);

breaker.fallback(() => {
  // Return cached data or default response instead of crashing
  return { status: 'degraded', data: getCachedData() };
});

breaker.on('open', () => console.log('Circuit breaker opened'));
breaker.on('halfOpen', () => console.log('Circuit breaker half-open'));

async function callExternalService(data) {
  const response = await fetch(`${API_URL}/process`, {
    method: 'POST',
    body: JSON.stringify(data),
    timeout: 5000
  });
  if (!response.ok) {
    throw new Error(`API error: ${response.status}`);
  }
  return response.json();
}

// Use the circuit breaker instead of direct calls
app.post('/api/data', async (req, res) => {
  try {
    const result = await breaker.fire(req.body);
    res.json(result);
  } catch (error) {
    // Even if the external service fails, don't crash the app
    console.error('Service degraded:', error.message);
    res.status(503).json({
      status: 'service_unavailable',
      message: 'External service temporarily unavailable',
      fallback_data: getDefaultData()
    });
  }
});

Startup dependency management:

# Use init containers for critical dependencies only
initContainers:
- name: wait-for-database  # Database is critical - wait for it
  image: postgres:15-alpine
  command: ['sh', '-c', 'until pg_isready -h postgres; do sleep 2; done']
containers:
- name: app
  image: myapp:latest
  env:
  - name: EXTERNAL_API_URL
    value: "http://optional-service"
  - name: GRACEFUL_DEGRADATION
    value: "true"

// Application handles optional services gracefully
async function startApplication() {
  // Critical dependencies - fail fast if unavailable
  await connectToDatabase();

  // Optional dependencies - continue without them
  try {
    await connectToOptionalService();
    console.log('Optional service connected');
  } catch (error) {
    console.warn('Optional service unavailable, continuing with degraded functionality:', error.message);
    // Don't exit(1) for optional services!
  }

  startWebServer();
}

The key principle: Exit code 1 should only occur for truly unrecoverable errors. Temporary service unavailability, network hiccups, and optional dependency failures should be handled gracefully without crashing the application.

Now that we've covered how to fix exit code 1 when it happens, let's focus on preventing these crashes in the first place. Because honestly, debugging at 3am is no fun for anyone.

How to Stop Exit Code 1 From Ruining Your Life

Kubernetes Production Patterns

Let's be honest: debugging CrashLoopBackOff at 3am sucks. Here's how to prevent exit code 1 crashes before they happen.

The 5-Minute Startup Check

Add this to every app you deploy to Kubernetes. It catches 90% of environment issues before they become production disasters:

// startup-validation.js - Add this to your main file
const REQUIRED_ENV_VARS = [
  'DATABASE_URL',
  'JWT_SECRET', 
  'API_KEY'
];

const OPTIONAL_ENV_VARS = [
  'REDIS_URL',
  'EXTERNAL_API_URL'
];

function validateEnvironment() {
  console.log('🔍 Validating environment...');
  
  const missing = REQUIRED_ENV_VARS.filter(env => !process.env[env]);
  if (missing.length > 0) {
    console.error(`❌ Missing required environment variables: ${missing.join(', ')}`);
    console.error('Available variables:', Object.keys(process.env).filter(k => !k.includes('PASSWORD')).sort().join(', '));
    process.exit(1);
  }
  
  const optional = OPTIONAL_ENV_VARS.filter(env => !process.env[env]);
  if (optional.length > 0) {
    console.warn(`⚠️  Optional environment variables missing: ${optional.join(', ')}`);
    console.warn('App will run with reduced functionality');
  }
  
  console.log('✅ Environment validation passed');
}

validateEnvironment();

Why this works: Instead of a useless "Error: Error occurred" message, you get exactly which environment variable is missing. Saves hours of debugging.

Test Your Database Connection Before Starting

Don't let your app crash because the database isn't ready yet:

// Add this before starting your web server
async function waitForDatabase(maxAttempts = 10) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      await database.raw('SELECT 1');  // Simple connectivity test
      console.log(`✅ Database connected on attempt ${attempt}`);
      return;
    } catch (error) {
      console.log(`🔄 Database connection attempt ${attempt}/${maxAttempts} failed: ${error.message}`);
      
      if (attempt === maxAttempts) {
        console.error('❌ Max database connection attempts reached');
        console.error('Database configuration:', {
          host: process.env.DATABASE_HOST,
          port: process.env.DATABASE_PORT,
          database: process.env.DATABASE_NAME,
          // Don't log passwords!
        });
        process.exit(1);
      }
      
      // Exponential backoff: 2s, 4s, 8s, etc.
      await new Promise(resolve => setTimeout(resolve, 2000 * attempt));
    }
  }
}

await waitForDatabase();

Dockerfile Best Practices That Actually Matter

Docker Container Security

The container that won't crash:

FROM node:18-alpine

## Create a non-root user (prevents permission issues)
RUN addgroup -g 1001 -S app && adduser -S app -u 1001 -G app

WORKDIR /app

## Install dependencies first (better caching)
COPY package*.json ./
RUN npm ci --only=production --ignore-scripts

## Copy app code
COPY --chown=app:app . .

## Validate the app can start (catch issues during build)
RUN timeout 10s npm start -- --validate-only || (echo "App failed startup validation" && exit 1)

USER app

## Use exec form to properly handle signals
CMD ["node", "server.js"]

Why these details matter:

  • Non-root user prevents file permission crashes
  • Startup validation catches missing dependencies during build
  • Exec form CMD ensures your app receives shutdown signals properly
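
Before the image ever reaches the cluster, a local smoke test catches most of this. A rough sketch - the tag and env values are placeholders matching the validation list above:

## Build and run the image locally with production-like env vars
docker build -t my-app:test .
docker run --rm \
  -e DATABASE_URL="postgres://user:pass@host.docker.internal:5432/myapp" \
  -e JWT_SECRET="test-secret" \
  -e API_KEY="test-key" \
  my-app:test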

Kubernetes Deployment That Won't Fail

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      # Wait for dependencies (database, cache, etc.)
      initContainers:
      - name: wait-for-database
        image: postgres:15-alpine
        command:
        - sh
        - -c
        - |
          echo "Waiting for database..."
          until pg_isready -h $(DATABASE_HOST) -p $(DATABASE_PORT) -U $(DATABASE_USER); do
            echo "Database not ready, waiting..."
            sleep 2
          done
          echo "Database is ready!"
        env:
        - name: DATABASE_HOST
          value: "postgres-service"
        - name: DATABASE_PORT
          value: "5432"
        - name: DATABASE_USER
          value: "myapp"
          
      containers:
      - name: app
        image: my-app:latest
        
        # Give your app enough resources to start
        resources:
          requests:
            memory: "256Mi"
            cpu: "250m"
          limits:
            memory: "512Mi"  # Don't be stingy with memory
            cpu: "500m"
        
        # Security without breaking functionality
        securityContext:
          runAsUser: 1001
          runAsGroup: 1001
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem: true
          
        # Mount writable directories for logs and temp files
        volumeMounts:
        - name: tmp
          mountPath: /tmp
        - name: logs
          mountPath: /app/logs
          
        # Health checks that don't crash your app
        livenessProbe:
          httpGet:
            path: /health
            port: 3000
          initialDelaySeconds: 30  # Give app time to start
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 3
          
        readinessProbe:
          httpGet:
            path: /ready
            port: 3000
          initialDelaySeconds: 10
          periodSeconds: 5
          
      volumes:
      - name: tmp
        emptyDir: {}
      - name: logs
        emptyDir: {}
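
Roll it out and confirm it actually settles instead of assuming it worked - the file name and label come from the manifest above:

## Apply and wait for the rollout to finish
kubectl apply -f deployment.yaml
kubectl rollout status deployment/my-app --timeout=120s
kubectl get pods -l app=my-app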

The Health Check That Won't Kill Your App

Health Check Pattern

// health-check.js
app.get('/health', (req, res) => {
  try {
    // Check critical dependencies without crashing
    const checks = {
      database: database.isConnected(),
      memory: process.memoryUsage().heapUsed < 500 * 1024 * 1024, // Under 500MB
      uptime: process.uptime() > 5 // At least 5 seconds uptime
    };
    
    const healthy = Object.values(checks).every(check => check);
    
    if (healthy) {
      res.status(200).json({ status: 'healthy', checks });
    } else {
      res.status(503).json({ status: 'unhealthy', checks });
    }
  } catch (error) {
    // Don't let health checks crash your app!
    console.error('Health check error:', error);
    res.status(503).json({ 
      status: 'error', 
      message: error.message,
      timestamp: new Date().toISOString()
    });
  }
});
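
To check the endpoint the same way the kubelet does, hit it from inside the pod - this assumes wget (busybox) or curl exists in the image, and the port matches the probes above:

## Hit the health endpoint from inside the pod, like the probe does
kubectl exec <pod-name> -- wget -qO- http://localhost:3000/health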

Reality Check: What Actually Prevents Problems

Prevention vs Reaction

Things that work:

  • Environment variable validation on startup (catches 70% of issues)
  • Database connection retry logic (catches 20% of issues)
  • Proper resource limits (catches 5% of issues)
  • Non-root user with proper volumes (catches 4% of issues)

Things that don't work:

  • Complex init containers that check 20 different things
  • Health checks that crash when dependencies are down
  • Overly restrictive security contexts without writable volumes
  • "Comprehensive" validation that takes 30 seconds to run

The best prevention is boring, simple validation that catches the common problems.

With these prevention strategies in place, you'll spend less time debugging and more time shipping features. But when things do go wrong (and they will), having the right resources at hand makes all the difference.
