The Production Horror Stories (And How to Fix Them)

Memory Leaks: The Silent Killers

Your app starts at 200MB RAM. Six hours later it's at 1.8GB and climbing. The default V8 heap limit is roughly 2GB on many 64-bit setups (newer Node versions size it from available memory, so yours may differ) - hit it and your app dies with FATAL ERROR: Reached heap limit. Use heap profiling tools and Chrome DevTools to track down the leaks before they kill your production server.
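
If you want something you can open in Chrome DevTools' Memory tab, a snapshot-on-signal hook is a cheap starting point. A minimal sketch (v8.writeHeapSnapshot pauses the process while it writes, so be careful on very large heaps):

// Sketch: dump a heap snapshot on SIGUSR2 for Chrome DevTools (Node 11.13+)
const v8 = require('v8');

process.on('SIGUSR2', () => {
    const file = v8.writeHeapSnapshot(); // returns the generated .heapsnapshot filename
    console.log(`Heap snapshot written to ${file}`);
});

// Trigger with: kill -USR2 <pid>

Take one snapshot early and another after the heap has grown, then compare them in DevTools to see what's accumulating.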

Most Common Culprits:


Global variables that never get cleared:

// WRONG - creates a memory leak
const userCache = new Map();
app.get('/users/:id', async (req, res) => {
    const userData = await fetchUser(req.params.id); // fetchUser: whatever loads the user
    userCache.set(req.params.id, userData); // Never cleaned up, grows forever
    res.json(userData);
});

// RIGHT - use TTL cache with cleanup
const NodeCache = require('node-cache');
const userCache = new NodeCache({ stdTTL: 600 }); // 10-minute expiry
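
With the TTL cache in place, the handler checks the cache first and lets stale entries expire on their own. A sketch (fetchUser again stands in for whatever loads the user):

app.get('/users/:id', async (req, res) => {
    const cached = userCache.get(req.params.id);
    if (cached) return res.json(cached);

    const userData = await fetchUser(req.params.id); // hypothetical loader
    userCache.set(req.params.id, userData); // evicted automatically after stdTTL seconds
    res.json(userData);
});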

Event listeners that pile up:

// WRONG - adds listener on every request
app.get('/data', (req, res) => {
    req.on('close', handleClose); // Memory leak
});

// RIGHT - remove listeners
app.get('/data', (req, res) => {
    req.on('close', handleClose);
    res.on('finish', () => {
        req.removeListener('close', handleClose);
    });
});

Debugging Memory Leaks - Tools That Actually Work:

**Clinic.js Doctor** - Free and catches most leaks:

npm install -g clinic
clinic doctor -- node app.js
## Let it run for 10+ minutes under load
## Kill with Ctrl+C and check the generated report

**0x Profiler** - Shows exactly where CPU time goes:

npm install -g 0x
0x -- node app.js
## Generate load, then kill the process with Ctrl+C
## Generates a flame graph showing CPU hotspots

Production Memory Monitoring:

// Real production memory monitoring
const memoryUsage = () => {
    const usage = process.memoryUsage();
    console.log({
        rss: Math.round(usage.rss / 1024 / 1024) + 'MB',
        heapUsed: Math.round(usage.heapUsed / 1024 / 1024) + 'MB',
        heapTotal: Math.round(usage.heapTotal / 1024 / 1024) + 'MB',
        external: Math.round(usage.external / 1024 / 1024) + 'MB'
    });
    
    // Kill process if heap usage > 1.5GB (before hitting 2GB limit)
    if (usage.heapUsed > 1.5 * 1024 * 1024 * 1024) {
        console.error('Memory usage too high, restarting...');
        process.exit(1);
    }
};

setInterval(memoryUsage, 30000); // Check every 30 seconds

Event Loop Blocking - When Everything Stops

The event loop is single-threaded. Block it and your entire API becomes unresponsive. I've seen 2-second API responses turn into 30-second timeouts because someone processed a CSV file synchronously.

Event Loop Lag Detection:

const { performance } = require('perf_hooks');

let previousNow = performance.now();
setInterval(() => {
    const now = performance.now();
    const lag = now - previousNow - 1000; // Expected 1000ms interval
    
    if (lag > 100) {
        console.warn(`Event loop lag: ${lag.toFixed(2)}ms`);
        
        // Note: this trace points at the monitor itself, not the blocking code -
        // use a profiler (Clinic.js, 0x) to find what actually blocked the loop
        console.trace('Event loop lag detected');
    }
    previousNow = now;
}, 1000);
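
Node also ships a built-in histogram for exactly this: perf_hooks.monitorEventLoopDelay (Node 11.10+). A minimal sketch, with less code to maintain than the hand-rolled interval above:

const { monitorEventLoopDelay } = require('perf_hooks');

const loopDelay = monitorEventLoopDelay({ resolution: 20 }); // sample every 20ms
loopDelay.enable();

setInterval(() => {
    // Histogram values are in nanoseconds
    console.log({
        mean: (loopDelay.mean / 1e6).toFixed(2) + 'ms',
        p99: (loopDelay.percentile(99) / 1e6).toFixed(2) + 'ms',
        max: (loopDelay.max / 1e6).toFixed(2) + 'ms'
    });
    loopDelay.reset();
}, 10000);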

Common Event Loop Blockers:

Synchronous file operations - Never use these in production:

// WRONG - blocks the event loop completely
const fs = require('fs');
const data = fs.readFileSync('./large-file.json'); // BLOCKS EVERYTHING

// RIGHT - async file operations (await needs an async function or an ES module)
const fs = require('fs').promises;
const data = await fs.readFile('./large-file.json'); // Non-blocking

JSON.parse() on large payloads:

// WRONG - blocks on large JSON
app.post('/upload', (req, res) => {
    const data = JSON.parse(req.body); // Can block for seconds
});

// RIGHT - stream processing or worker threads
const { Worker } = require('worker_threads');

app.post('/upload', (req, res) => {
    const worker = new Worker(`
        const { parentPort } = require('worker_threads');
        parentPort.on('message', (data) => {
            try {
                const parsed = JSON.parse(data);
                parentPort.postMessage({ success: true, data: parsed });
            } catch (error) {
                parentPort.postMessage({ success: false, error: error.message });
            }
        });
    `, { eval: true });
    
    worker.postMessage(req.body);
    worker.on('message', (result) => {
        res.json(result);
        worker.terminate();
    });
});


Database Connection Hell

Database connections are where most production Node.js apps die. Connection pools run out, queries hang forever, and suddenly your API returns 500 errors.

Connection Pool Debugging:

// Most apps get this wrong
const mysql = require('mysql2');

const pool = mysql.createPool({
    host: 'localhost',
    user: 'app',
    password: 'secret',
    database: 'production',
    connectionLimit: 10, // Too low for production load
    acquireTimeout: 60000, // mysql2 ignores these three options -
    timeout: 60000,        // they belong to the old mysql driver
    reconnect: true
});

// RIGHT - production-ready pool with monitoring
const pool = mysql.createPool({
    host: process.env.DB_HOST,
    user: process.env.DB_USER,
    password: process.env.DB_PASSWORD,
    database: process.env.DB_NAME,
    waitForConnections: true,
    connectionLimit: 50, // Higher limit for production
    queueLimit: 0,
    connectTimeout: 10000, // Fail fast on connection issues
    enableKeepAlive: true, // Keep idle connections healthy
    multipleStatements: false // Security
});

// Monitor pool health
setInterval(() => {
    // These pool fields are private mysql2 internals and may change between versions
    console.log('DB Pool Stats:', {
        connectionLimit: pool.config.connectionLimit,
        openConnections: pool._allConnections.length, // open = in-use + free
        freeConnections: pool._freeConnections.length,
        queuedRequests: pool._connectionQueue.length
    });
    
    // Alert if pool is running low
    const freeConnections = pool._freeConnections.length;
    const totalConnections = pool.config.connectionLimit;
    
    if (freeConnections / totalConnections < 0.2) {
        console.error('Database connection pool running low!');
    }
}, 30000);

Query Timeout Hell:

// Production query with proper timeout handling
const executeQuery = (query, params) => {
    return new Promise((resolve, reject) => {
        const timeout = setTimeout(() => {
            reject(new Error('Query timeout'));
        }, 15000); // 15 second timeout
        
        pool.execute(query, params, (error, results) => {
            clearTimeout(timeout);
            
            if (error) {
                console.error('Query failed:', {
                    query: query.substring(0, 100) + '...',
                    error: error.message,
                    code: error.code,
                    errno: error.errno
                });
                reject(error);
            } else {
                resolve(results);
            }
        });
    });
};

// Usage with error handling
app.get('/users/:id', async (req, res) => {
    try {
        const results = await executeQuery(
            'SELECT * FROM users WHERE id = ?',
            [req.params.id]
        );
        
        if (results.length === 0) {
            return res.status(404).json({ error: 'User not found' });
        }
        
        res.json(results[0]);
    } catch (error) {
        console.error('Database error:', error);
        
        if (error.message === 'Query timeout') {
            res.status(504).json({ error: 'Database timeout' });
        } else {
            res.status(500).json({ error: 'Database error' });
        }
    }
});

Process Crashes and Recovery

Your Node.js process will crash. The question is whether you'll recover gracefully or leave users staring at error pages.

Graceful Shutdown Handling:

// Production-ready graceful shutdown
const gracefulShutdown = (signal) => {
    console.log(`Received ${signal}, starting graceful shutdown...`);
    
    // Stop accepting new connections
    server.close((err) => {
        if (err) {
            console.error('Error during server close:', err);
            process.exit(1);
        }
        
        console.log('HTTP server closed');
        
        // Close database connections
        if (pool) {
            pool.end(() => {
                console.log('Database pool closed');
                process.exit(0);
            });
        } else {
            process.exit(0);
        }
    });
    
    // Force exit after 30 seconds
    setTimeout(() => {
        console.error('Forced exit after timeout');
        process.exit(1);
    }, 30000);
};

// Handle shutdown signals
process.on('SIGTERM', () => gracefulShutdown('SIGTERM'));
process.on('SIGINT', () => gracefulShutdown('SIGINT'));

// Handle uncaught exceptions
process.on('uncaughtException', (error) => {
    console.error('Uncaught Exception:', error);
    
    // Log to external service (Sentry, LogRocket, etc.)
    if (typeof logError === 'function') {
        logError(error);
    }
    
    // Graceful shutdown after logging
    setTimeout(() => {
        process.exit(1);
    }, 1000);
});

// Handle unhandled promise rejections
process.on('unhandledRejection', (reason, promise) => {
    console.error('Unhandled Promise Rejection at:', promise, 'reason:', reason);
    
    // Log the error but don't exit immediately
    if (typeof logError === 'function') {
        logError(reason);
    }
});

The key to production Node.js troubleshooting is preparation. Set up monitoring before things break, because when your app crashes at 3AM, you need data immediately - not time to install debugging tools.
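
A cheap way to have that data on hand is a health endpoint your load balancer (and your 3AM self) can hit. A minimal sketch:

app.get('/healthz', (req, res) => {
    const mem = process.memoryUsage();
    res.json({
        status: 'ok',
        uptimeSeconds: Math.round(process.uptime()),
        heapUsedMB: Math.round(mem.heapUsed / 1024 / 1024),
        rssMB: Math.round(mem.rss / 1024 / 1024),
        pid: process.pid
    });
});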

Production Troubleshooting FAQ: The Questions Asked at 3AM

Q: My Node.js app crashes with "heap out of memory" - what do I do RIGHT NOW?

A:

Your app hit the V8 heap limit (~2GB). Immediate fix: Restart with more memory:

## Emergency restart with 4GB heap
node --max-old-space-size=4096 app.js

## Or 8GB if you have the RAM
node --max-old-space-size=8192 app.js

But this is a band-aid. You have a memory leak. Use Clinic.js (doctor or heapprofiler) to find what's eating memory, then fix the leak instead of just raising the ceiling.
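
If you suspect the heap specifically, Clinic.js also ships a heap profiler. A sketch - run it in staging under realistic load:

npm install -g clinic
clinic heapprofiler -- node app.js
## Generate load, then Ctrl+C to write the report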

Q: My API responses went from 200ms to 10+ seconds overnight - WTF happened?

A:

The event loop is blocked. Something is doing synchronous work that's killing performance.

Quick diagnosis:

## Install Clinic.js
npm install -g clinic

## Run with doctor
clinic doctor -- node app.js
## Load test your app
## Kill with Ctrl+C and check the generated report

Common causes:

  • Someone added fs.readFileSync() in a request handler
  • JSON.parse() on large payloads (>10MB)
  • Unoptimized loops processing arrays
  • RegExp on user input (ReDoS attacks)
  • Synchronous crypto operations

Q: Database connections are timing out but the DB server is fine - what gives?

A:

Your connection pool is exhausted. Node.js apps create tons of concurrent connections and most pools default to 10 connections max.

Check pool health:

// Log pool stats every 30 seconds
setInterval(() => {
    console.log('Pool:', {
        open: pool._allConnections.length, // private mysql2 internals
        free: pool._freeConnections.length,
        queued: pool._connectionQueue.length
    });
}, 30000);

Fix: Increase connection limit to 50-100 for production. And always set query timeouts - I've seen queries hang for hours.

Q: My app works fine locally but crashes in Docker - what's different?

A:

Memory limits. Your local machine has 16GB RAM, your container has 512MB. Docker kills processes that exceed memory limits.

Debug memory in Docker:

## Add memory monitoring to your container
FROM node:20-alpine
## ... your setup
CMD ["node", "--max-old-space-size=400", "app.js"]

Check Docker memory limits:

docker stats your-container-name
## Shows actual memory usage vs limits

Also check if you're using --max-old-space-size - set it to 80% of container memory.
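
To make the two limits line up, cap both the container and the V8 heap. A sketch (image and container names are placeholders):

## Run with an explicit container memory limit
docker run --memory=512m --name your-container-name your-image

## Inside the image, keep the heap cap at ~80% of that limit
## CMD ["node", "--max-old-space-size=400", "app.js"]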

Q: How do I debug which dependency is causing the memory leak?

A:

Use clinic doctor to generate a diagnostic report, then look for suspicious patterns:

Steps that actually work:

  1. Install clinic: npm install -g clinic
  2. Run: clinic doctor -- node app.js
  3. Generate real load (not toy requests)
  4. Let it run for 10+ minutes
  5. Kill with Ctrl+C
  6. Open the HTML report

Look for functions consuming disproportionate memory. Usually it's:

  • Caching libraries (Redis client, memory caches)
  • Database connection libraries
  • WebSocket libraries
  • File upload handlers

Q: My Node.js process keeps getting killed in production with no error logs - why?

A:

The OS is killing your process due to OOM (Out of Memory). This happens before Node.js can log anything.

Check system logs:

## Linux
dmesg | grep -i "killed process"
journalctl -u your-service-name | grep -i oom

## Shows which process got killed and why

Prevention:

// Monitor memory and restart before OOM kill
const memoryMonitor = () => {
    const usage = process.memoryUsage();
    const heapUsedMB = Math.round(usage.heapUsed / 1024 / 1024);
    
    // Restart at 80% of container memory limit
    if (heapUsedMB > 1600) { // 80% of 2GB
        console.error(`Memory usage too high: ${heapUsedMB}MB`);
        process.exit(1); // PM2 or Docker will restart
    }
};

setInterval(memoryMonitor, 30000);

Q: How do I debug performance issues without taking the app offline?

A:

Use 0x profiler in production. It has minimal performance impact and shows real bottlenecks:

## Install globally
npm install -g 0x

## Profile production app (low overhead)
0x --output-dir /tmp/profile -- node app.js

## Generates a flamegraph after you kill the process with Ctrl+C
## Shows exactly which functions are slow

For memory issues, 0x won't help much - use a heap profiler like Clinic.js instead:

clinic heapprofiler -- node app.js

Q: My app randomly stops responding for 30+ seconds then recovers - what causes this?

A:

Garbage Collection pauses. When your heap gets large (>1GB), V8's garbage collector can pause the entire application for seconds.

Check GC stats:

node --trace-gc --trace-gc-verbose app.js
## Shows GC pause times - anything >100ms is problematic

Solutions:

  • Reduce memory usage (fix memory leaks)
  • Use streaming for large data processing (see the sketch after this list)
  • Implement request timeouts (clients give up waiting)
  • Consider cluster mode to distribute load
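
A minimal sketch of the streaming suggestion above - processing a large newline-delimited file one record at a time instead of loading it all into the heap (handleRecord stands in for your per-record work):

const fs = require('fs');
const readline = require('readline');

const processLargeFile = async (path) => {
    const rl = readline.createInterface({
        input: fs.createReadStream(path),
        crlfDelay: Infinity
    });

    let count = 0;
    for await (const line of rl) {
        handleRecord(line); // hypothetical per-record handler
        count++;
    }
    return count;
};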

Q: How do I know if my Node.js app is CPU-bound or I/O-bound?

A:

Simple test:

const { performance } = require('perf_hooks');

// Check event loop utilization
setInterval(() => {
    const usage = process.cpuUsage();
    const elu = performance.eventLoopUtilization();
    
    console.log({
        cpu: {
            user: usage.user / 1000, // Convert microseconds to ms
            system: usage.system / 1000
        },
        eventLoop: {
            utilization: (elu.utilization * 100).toFixed(2) + '%'
        }
    });
}, 5000);

CPU-bound signs:

  • Event loop utilization > 80%
  • High user CPU time
  • Slow response times under load

I/O-bound signs:

  • Low CPU usage but slow responses
  • Database/API timeouts
  • High system CPU time

Q: My logs show "ECONNRESET" and "EPIPE" errors - what are these?

A:

ECONNRESET: The client disconnected before the server finished responding. Usually not your fault - users close browsers, mobile connections drop, load balancers time out.

EPIPE: You tried to write to a closed connection. Usually follows ECONNRESET.

Handle gracefully:

app.get('/slow-endpoint', (req, res) => {
    // Check if connection is still alive
    req.on('close', () => {
        console.log('Client disconnected, stopping work...');
        // Stop any ongoing processing
    });
    
    // Your slow processing here
    doSlowWork()
        .then(result => {
            if (!res.headersSent) {
                res.json(result);
            }
        })
        .catch(error => {
            if (!res.headersSent) {
                res.status(500).json({ error: error.message });
            }
        });
});

Don't log these as errors - they're normal in production. Log them as warnings or ignore entirely.
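
One way to keep them out of your error logs is to filter by error code at the server level. A sketch (assumes `server` is your http.Server instance; adjust to your logger):

const EXPECTED_SOCKET_ERRORS = new Set(['ECONNRESET', 'EPIPE']);

server.on('clientError', (err, socket) => {
    if (EXPECTED_SOCKET_ERRORS.has(err.code)) {
        console.warn('Client dropped the connection:', err.code); // normal in production
    } else {
        console.error('Unexpected client error:', err);
    }
    socket.destroy(); // overriding 'clientError' means we must close the socket ourselves
});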

Monitoring and Alerting: Catch Problems Before Users Do

Production Monitoring That Actually Works

Most Node.js monitoring is garbage - generic dashboards that tell you your app is slow after users already hate you. You need specific metrics that catch problems before they become disasters.


Critical Metrics to Monitor

Memory Metrics:

// Custom monitoring that actually helps
const monitorHealth = () => {
    const mem = process.memoryUsage();
    const cpu = process.cpuUsage();
    
    // Track memory growth rate
    const heapUsedMB = Math.round(mem.heapUsed / 1024 / 1024);
    const rssUsedMB = Math.round(mem.rss / 1024 / 1024);
    
    // Alert thresholds
    const alerts = [];
    
    if (heapUsedMB > 1600) { // 80% of 2GB heap limit
        alerts.push({ type: 'MEMORY_HIGH', heap: heapUsedMB });
    }
    
    if (rssUsedMB > 1800) { // Near container limits
        alerts.push({ type: 'RSS_HIGH', rss: rssUsedMB });
    }
    
    // Track memory growth rate over time
    const now = Date.now();
    const timeDelta = now - (global.lastMemCheck || now);
    const memDelta = heapUsedMB - (global.lastHeapUsed || heapUsedMB);
    
    if (timeDelta > 0) {
        const growthRate = memDelta / (timeDelta / 1000); // MB per second
        
        if (growthRate > 5) { // Growing >5MB/second
            alerts.push({ 
                type: 'MEMORY_LEAK', 
                growthRate: growthRate.toFixed(2) 
            });
        }
    }
    
    global.lastMemCheck = now;
    global.lastHeapUsed = heapUsedMB;
    
    return {
        memory: {
            heap: heapUsedMB,
            rss: rssUsedMB,
            external: Math.round(mem.external / 1024 / 1024)
        },
        cpu: {
            user: Math.round(cpu.user / 1000), // Convert to ms
            system: Math.round(cpu.system / 1000)
        },
        alerts
    };
};

// Check health every 30 seconds
setInterval(() => {
    const health = monitorHealth();
    
    if (health.alerts.length > 0) {
        console.error('HEALTH ALERTS:', health.alerts);
        
        // Send to your alerting system (Slack, PagerDuty, etc.)
        health.alerts.forEach(alert => sendAlert(alert));
    }
    
    // Log metrics for dashboards
    console.log('Health:', JSON.stringify(health, null, 2));
}, 30000);
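
`sendAlert` above is whatever forwards alerts to your paging or chat system. A minimal sketch posting to a Slack-style incoming webhook (SLACK_WEBHOOK_URL is an assumption; global fetch needs Node 18+):

const sendAlert = async (alert) => {
    try {
        await fetch(process.env.SLACK_WEBHOOK_URL, {
            method: 'POST',
            headers: { 'Content-Type': 'application/json' },
            body: JSON.stringify({ text: `ALERT ${alert.type}: ${JSON.stringify(alert)}` })
        });
    } catch (err) {
        console.error('Failed to send alert:', err.message); // never let alerting crash the app
    }
};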

Database Connection Monitoring

Database issues kill more Node.js apps than bad code. Monitor pool health aggressively:

// Monitor database pool health
const monitorDbPool = (pool) => {
    const stats = {
        total: pool.config.connectionLimit,
        active: pool._allConnections ? pool._allConnections.length : 0, // all open connections (private internals)
        free: pool._freeConnections ? pool._freeConnections.length : 0,
        queued: pool._connectionQueue ? pool._connectionQueue.length : 0
    };
    
    // Calculate utilization percentage
    stats.utilization = Math.round((stats.active / stats.total) * 100);
    
    // Alert if pool is stressed
    const alerts = [];
    
    if (stats.utilization > 80) {
        alerts.push({
            type: 'DB_POOL_HIGH',
            utilization: stats.utilization,
            queued: stats.queued
        });
    }
    
    if (stats.queued > 10) {
        alerts.push({
            type: 'DB_QUEUE_BACKLOG',
            queued: stats.queued
        });
    }
    
    return { stats, alerts };
};

// Monitor every minute
setInterval(() => {
    const dbHealth = monitorDbPool(pool);
    
    if (dbHealth.alerts.length > 0) {
        console.error('DB ALERTS:', dbHealth.alerts);
        dbHealth.alerts.forEach(alert => sendAlert(alert));
    }
    
    console.log('DB Pool:', dbHealth.stats);
}, 60000);

Event Loop Monitoring

The event loop is your app's heartbeat. When it stops, everything dies:

// Event loop lag monitoring
const monitorEventLoop = () => {
    let previousHrtime = process.hrtime.bigint();
    
    setInterval(() => {
        const currentHrtime = process.hrtime.bigint();
        const delta = Number(currentHrtime - previousHrtime);
        const lag = Math.max(0, delta - 1000000000); // Expected 1 second
        const lagMs = lag / 1000000; // Convert to milliseconds
        
        previousHrtime = currentHrtime;
        
        if (lagMs > 100) { // Event loop lag > 100ms
            console.warn(`Event loop lag: ${lagMs.toFixed(2)}ms`);
            
            // Critical lag alerts
            if (lagMs > 1000) {
                sendAlert({
                    type: 'EVENT_LOOP_BLOCKED',
                    lagMs: lagMs.toFixed(2)
                });
            }
        }
        
        // Track event loop utilization
        const elu = performance.eventLoopUtilization();
        const utilization = (elu.active / (elu.active + elu.idle)) * 100;
        
        if (utilization > 90) {
            sendAlert({
                type: 'EVENT_LOOP_SATURATED',
                utilization: utilization.toFixed(1)
            });
        }
        
    }, 1000);
};

monitorEventLoop();

Performance Debugging in Production

Debugging performance issues in production without killing your app requires the right tools and techniques.

Using 0x Profiler Safely

0x is the only profiler I trust in production. It has low overhead and shows real bottlenecks:

## Install globally
npm install -g 0x

## Profile with minimal impact (flame graphs)
0x --open --output-dir ./profiles -- node app.js

## For memory allocation patterns, reach for a heap profiler instead
clinic heapprofiler -- node app.js

## Profile for a fixed duration, then stop
timeout -s INT 300 0x -- node app.js  # Profile for 5 minutes, then SIGINT so 0x writes the flame graph

The flame graphs it generates show exactly which functions consume CPU time. Look for:

  • Wide bars (functions that take lots of CPU time)
  • Deep stacks (excessive function call depth)
  • Red sections (hot paths that need optimization)

Request-Level Debugging

Track slow requests to identify bottlenecks:

// Request performance tracking
app.use((req, res, next) => {
    const startTime = process.hrtime.bigint();
    const startMemory = process.memoryUsage().heapUsed;
    
    // Track when request finishes
    res.on('finish', () => {
        const endTime = process.hrtime.bigint();
        const endMemory = process.memoryUsage().heapUsed;
        
        const duration = Number(endTime - startTime) / 1000000; // Convert to ms
        const memoryDelta = endMemory - startMemory;
        
        const logData = {
            method: req.method,
            url: req.url,
            statusCode: res.statusCode,
            duration: Math.round(duration),
            memory: Math.round(memoryDelta / 1024), // KB
            userAgent: req.get('User-Agent'),
            ip: req.ip
        };
        
        // Log slow requests
        if (duration > 1000) { // >1 second
            console.warn('SLOW REQUEST:', logData);
        }
        
        // Log memory-heavy requests
        if (memoryDelta > 50 * 1024 * 1024) { // >50MB
            console.warn('MEMORY-HEAVY REQUEST:', logData);
        }
        
        // Normal request logging
        console.log('REQUEST:', logData);
    });
    
    next();
});

Error Handling and Recovery Patterns

Production Node.js apps crash. The key is recovering gracefully and learning from failures.

Graceful Degradation

When external services fail, degrade gracefully instead of crashing:

// Circuit breaker pattern for external services
class CircuitBreaker {
    constructor(service, threshold = 5, resetTime = 60000) {
        this.service = service;
        this.failureThreshold = threshold;
        this.resetTime = resetTime;
        this.failureCount = 0;
        this.lastFailureTime = null;
        this.state = 'CLOSED'; // CLOSED, OPEN, HALF_OPEN
    }
    
    async call(request) {
        if (this.state === 'OPEN') {
            const timeSinceLastFailure = Date.now() - this.lastFailureTime;
            
            if (timeSinceLastFailure >= this.resetTime) {
                this.state = 'HALF_OPEN';
                this.failureCount = 0;
            } else {
                throw new Error('Circuit breaker is OPEN');
            }
        }
        
        try {
            const result = await this.service(request);
            
            // Success - reset failure count
            if (this.state === 'HALF_OPEN') {
                this.state = 'CLOSED';
            }
            this.failureCount = 0;
            
            return result;
        } catch (error) {
            this.failureCount++;
            this.lastFailureTime = Date.now();
            
            if (this.failureCount >= this.failureThreshold) {
                this.state = 'OPEN';
                console.error(`Circuit breaker opened after ${this.failureCount} failures`);
            }
            
            throw error;
        }
    }
}

// Usage with external API
const apiCircuitBreaker = new CircuitBreaker(
    async (request) => {
        const response = await fetch(request.url, {
            // fetch has no `timeout` option - AbortSignal.timeout() is the supported way (Node 17.3+)
            signal: AbortSignal.timeout(5000)
        });
        
        if (!response.ok) {
            throw new Error(`API error: ${response.status}`);
        }
        
        return response.json();
    },
    3, // Open after 3 failures
    30000 // Reset after 30 seconds
);

// Use with fallback
app.get('/api/user-data/:id', async (req, res) => {
    try {
        const userData = await apiCircuitBreaker.call({
            url: `${process.env.API_URL}/users/${req.params.id}`
        });
        
        res.json(userData);
    } catch (error) {
        console.warn('External API failed, using cached data:', error.message);
        
        // Fallback to cached data
        const cachedData = await getCachedUserData(req.params.id);
        
        if (cachedData) {
            res.json({
                ...cachedData,
                _cached: true,
                _warning: 'Using cached data due to service unavailability'
            });
        } else {
            res.status(503).json({
                error: 'User data temporarily unavailable',
                retryAfter: 30
            });
        }
    }
});

Centralized Error Reporting

Never debug production issues without centralized error reporting:

// Production error reporting
const reportError = (error, context = {}) => {
    const errorData = {
        message: error.message,
        stack: error.stack,
        timestamp: new Date().toISOString(),
        nodeVersion: process.version,
        pid: process.pid,
        memory: process.memoryUsage(),
        uptime: process.uptime(),
        ...context
    };
    
    // Log locally
    console.error('ERROR REPORTED:', JSON.stringify(errorData, null, 2));
    
    // Send to external service (Sentry, Bugsnag, etc.)
    try {
        // Replace with your error reporting service
        sendToErrorService(errorData);
    } catch (reportingError) {
        console.error('Failed to report error:', reportingError);
    }
};

// Global error handlers
process.on('uncaughtException', (error) => {
    reportError(error, {
        type: 'uncaughtException',
        fatal: true
    });
    
    // Graceful shutdown
    process.exit(1);
});

process.on('unhandledRejection', (reason, promise) => {
    reportError(reason, {
        type: 'unhandledRejection',
        promise: promise.toString(),
        fatal: false
    });
});

// Request error handling
app.use((error, req, res, next) => {
    reportError(error, {
        type: 'requestError',
        method: req.method,
        url: req.url,
        userAgent: req.get('User-Agent'),
        ip: req.ip,
        body: req.body
    });
    
    // Don't expose internal errors to users
    if (process.env.NODE_ENV === 'production') {
        res.status(500).json({
            error: 'Internal server error',
            requestId: req.id // Include request ID for support (req.id assumes request-id middleware)
        });
    } else {
        res.status(500).json({
            error: error.message,
            stack: error.stack
        });
    }
});

Production Node.js troubleshooting is 80% preparation and 20% panic. Set up monitoring, error reporting, and health checks before you need them. When your app crashes at 3AM, you'll thank yourself for the preparation.

Debugging Tools Comparison: What Works in Production vs What Doesn't

| Tool | Good For | Sucks At | Production Safe? | Learning Curve |
|------|----------|----------|------------------|----------------|
| Clinic.js | Memory leaks, event loop analysis | Real-time debugging | ✅ Yes | Easy |
| 0x Profiler | CPU bottlenecks, flame graphs | Memory analysis | ✅ Yes (low overhead) | Moderate |
| Node.js --inspect | Step debugging, heap snapshots | Production use | ❌ No (blocks app) | Hard |
| PM2 Monit | Process monitoring, auto-restart | Deep debugging | ✅ Yes | Easy |
| Chrome DevTools | Local development | Production debugging | ❌ No | Moderate |
| New Relic | APM monitoring, alerts | Detailed profiling | ✅ Yes ($$) | Easy |
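
If you lean on PM2's auto-restart from the table above, set its memory cap explicitly so workers recycle before the OS OOM-kills them. A sketch of an ecosystem.config.js (adjust the limit to your container):

// ecosystem.config.js
module.exports = {
    apps: [{
        name: 'app',
        script: 'app.js',
        instances: 'max',         // cluster mode across all CPU cores
        exec_mode: 'cluster',
        max_memory_restart: '1G', // restart a worker before it hits the OOM killer
        env_production: { NODE_ENV: 'production' }
    }]
};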
