Why is KrakenD returning 502 errors for backends that are clearly running?

Check the obvious stuff first:

- Is the backend URL spelled correctly in your config? A missing `http://` prefix is the most common mistake.
- Can KrakenD actually reach the backend? `kubectl exec -it krakend-pod -- curl backend-url`
- Are you using the right port? Confusing the service port with the container port is a frequent cause of failed deployments.

**Real fix**: Add health checks to your backend services and monitor connectivity from KrakenD pods. [Circuit breakers](https://www.krakend.io/docs/backends/circuit-breaker/) will help isolate failing services but they won't fix basic networking issues.

KrakenD keeps crashing with exit code 137 - what's killing it?

**It's almost always memory limits.** Exit code 137 means the container received SIGKILL; in Kubernetes that is usually the OOM killer enforcing the pod's memory limit. KrakenD can consume large amounts of memory during traffic spikes, so size the limit from real usage.

```yaml
resources:
  limits:
    memory: "1Gi"  # Set this based on your actual usage
  requests:
    memory: "256Mi"
```

**Debugging memory issues**: Check your [concurrent requests](https://www.krakend.io/docs/endpoints/concurrent-requests/) settings. `concurrent_calls` makes KrakenD send that many parallel copies of each request to the backend (the default is 1), so a high value multiplies in-flight work and memory during spikes.
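To see where 137 comes from: by shell convention, a process killed by signal N reports exit code 128 + N. The helper below is a minimal sketch of that arithmetic; `describe_exit_code` is a made-up name for illustration, not part of any KrakenD or Kubernetes tooling.

```shell
# Hypothetical helper: explain a container exit code.
# Codes above 128 mean "terminated by signal (code - 128)".
describe_exit_code() {
  code=$1
  if [ "$code" -le 128 ]; then
    echo "exited with status $code"
    return
  fi
  sig=$((code - 128))
  case $sig in
    9)  echo "killed by SIGKILL (signal 9): check for OOMKilled" ;;
    15) echo "terminated by SIGTERM (signal 15)" ;;
    *)  echo "terminated by signal $sig" ;;
  esac
}

describe_exit_code 137
```

To confirm that the kill was memory-related, `kubectl describe pod krakend-pod` shows `Reason: OOMKilled` under the container's last state.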

Config changes aren't taking effect - what am I doing wrong?

**Most likely culprits:**

1. **Config not mounted properly** - check your Kubernetes ConfigMap and volume mounts
2. **KrakenD didn't restart** - config changes require a restart unless you have [hot reload](https://www.krakend.io/docs/developer/hot-reload/) enabled
3. **Invalid JSON** - use `krakend check --config krakend.json` to validate
4. **Wrong config file path** - the official Docker image starts KrakenD with `/etc/krakend/krakend.json`

**Quick validation**: `kubectl exec -it krakend-pod -- cat /etc/krakend/krakend.json` to see what config KrakenD is actually using.

JWT validation is randomly failing - tokens work sometimes but not others

**Classic symptoms of clock drift or JWK caching issues.** A JWT's expiration time is checked against the local clock, so skew between the token issuer and the KrakenD nodes makes the same token pass on one node and fail on another.

**Immediate fixes:**

- Check NTP synchronization on all nodes
- Increase JWK cache TTL in your config if key fetches are failing intermittently (a shorter TTL helps instead when keys rotate often)
- Verify your auth service's JWK endpoint is always reachable
- Consider setting `cookie_key` if you're using [JWT signing](https://www.krakend.io/docs/authorization/jwt-signing/) with cookies
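It takes very little skew to reject a token near its expiry. The sketch below only illustrates the comparison and is not KrakenD's validator code; `token_expired` and its arguments are hypothetical names, with all times in Unix seconds.

```shell
# token_expired EXP NOW [LEEWAY]
# Succeeds (exit 0) when NOW is past EXP plus the allowed leeway.
token_expired() {
  exp=$1
  now=$2
  leeway=${3:-0}
  [ "$now" -gt $((exp + leeway)) ]
}

# A token expiring at t=1000, checked by a node whose clock reads 1005:
if token_expired 1000 1005; then echo "rejected"; else echo "accepted"; fi
# The same check when 10 seconds of leeway are tolerated:
if token_expired 1000 1005 10; then echo "rejected"; else echo "accepted"; fi
```

The first check prints `rejected` and the second prints `accepted`, which is why nodes with different clocks disagree about the same token.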

Rate limiting is blocking legitimate traffic - how do I tune it?

**KrakenD's rate limiting is per-instance, not global.** If you have 3 KrakenD pods, each gets the full rate limit allocation. This catches everyone off guard.

```json
{
  "qos/ratelimit/router": {
    "max_rate": 100,
    "capacity": 200,
    "every": "1s"
  }
}
```

**Start high and tune down** based on actual metrics. Use your [monitoring dashboard](https://www.krakend.io/docs/telemetry/grafana/) to see what's being rate limited before adjusting limits.

KrakenD won't start and logs show "bind: address already in use"

**Port conflict.** Another process is using port 8080 (KrakenD's default). This happens in Docker environments when you have multiple containers trying to use the same port.

**Quick fixes:**

- Change KrakenD's port in config: `"port": 8081`
- Check what's using the port: `lsof -i :8080` or `netstat -tulpn | grep 8080`
- In Kubernetes, check for port conflicts in your service definitions

Backend services are slow and KrakenD is timing out - how do I fix timeouts?

**Timeout hell is common with microservices.** KrakenD has multiple timeout settings that can conflict with each other.

**Timeout hierarchy** (from most specific to least):

1. Backend timeout: `"timeout": "30s"` in backend config
2. Endpoint timeout: `"timeout": "45s"` in endpoint config
3. Global timeout: `"timeout": "60s"` in root config

**Rule of thumb**: Backend timeout < Endpoint timeout < Global timeout. Give yourself buffer time for aggregation and processing.
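The rule of thumb is easy to check mechanically before a deploy. The following is a minimal sketch, assuming you have already pulled the three duration strings out of your config; `to_ms` and `timeouts_ordered` are hypothetical helper names, and only whole-number durations with an `ms`, `s`, `m`, or `h` suffix are handled.

```shell
# to_ms DURATION: convert "250ms", "30s", "2m" or "1h" to milliseconds.
to_ms() {
  case $1 in
    *ms) echo "${1%ms}" ;;
    *s)  echo $(( ${1%s} * 1000 )) ;;
    *m)  echo $(( ${1%m} * 60000 )) ;;
    *h)  echo $(( ${1%h} * 3600000 )) ;;
    *)   echo "unsupported duration: $1" >&2; return 1 ;;
  esac
}

# timeouts_ordered BACKEND ENDPOINT GLOBAL
# Succeeds when backend < endpoint < global.
timeouts_ordered() {
  b=$(to_ms "$1") && e=$(to_ms "$2") && g=$(to_ms "$3") || return 1
  [ "$b" -lt "$e" ] && [ "$e" -lt "$g" ]
}

timeouts_ordered 30s 45s 60s && echo "ordering ok"
```

A non-zero exit status from `timeouts_ordered` means at least one inner timeout is not shorter than the one enclosing it.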

Why can't I see detailed error messages from my backends?

**KrakenD sanitizes backend errors by default.** You're probably seeing generic 500 errors instead of the actual backend error messages.

**Enable detailed errors:**

```json
{
  "backend": [{
    "url_pattern": "/api/service",
    "host": ["http://backend:8080"],
    "extra_config": {
      "backend/http": {
        "return_error_details": "backend_alias"
      }
    }
  }]
}
```

**Security warning**: Don't enable this in production unless you're sure your backend errors don't leak sensitive information.

KrakenD is consuming all available memory during traffic spikes - how do I fix it?

This is usually unbounded concurrency combined with slow backends. KrakenD spawns goroutines for each in-flight request, so when backends are slow, requests pile up and consume memory.

Immediate fixes:

```json
{
  "endpoints": [{
    "concurrent_calls": 1,
    "extra_config": {
      "qos/ratelimit/router": {
        "max_rate": 100,
        "capacity": 200
      }
    }
  }]
}
```

Rate-limit each endpoint so a spike cannot queue unbounded work, and keep `concurrent_calls` low. It is the number of parallel copies of each request sent to the backend (default 1), not a cap on concurrency, so raising it multiplies backend load during traffic spikes.

Resource limits that actually work:

```yaml
resources:
  limits:
    memory: "2Gi"
    cpu: "1000m"
  requests:
    memory: "512Mi"
    cpu: "200m"
```

Configuration changes cause intermittent 404s - what's the deployment issue?

Rolling updates with configuration changes break routing temporarily. KrakenD loads configuration at startup, so config changes require pod restarts during rolling deployments.

Zero-downtime config updates:

1. Use [flexible configuration](https://www.krakend.io/docs/configuration/flexible-config/) with environment variables for values that change frequently
2. Deploy configuration changes as a separate step from image updates
3. Consider [hot reload](https://www.krakend.io/docs/developer/hot-reload/) for Enterprise edition

ConfigMap update strategy:

```yaml
spec:
  template:
    metadata:
      annotations:
        configHash: "{{ .Values.configHash }}"  # Force pod restart on config change
```
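One way to produce the `configHash` value is to hash the config file at deploy time, so the annotation changes whenever the file does. This is a minimal sketch, assuming `sha256sum` is available on the machine that runs the deploy; `config_hash` is a hypothetical helper name, and the chart path in the commented example is a placeholder.

```shell
# config_hash FILE: print the SHA-256 digest of FILE (digest only, no filename).
config_hash() {
  sha256sum "$1" | cut -d' ' -f1
}

# Example wiring (commented out because it needs a chart and a cluster):
# helm upgrade krakend ./chart --set configHash="$(config_hash krakend.json)"
```

Because the digest is part of the pod template, any change to the config produces a new template and triggers a normal rolling restart.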

Circuit breakers are opening unnecessarily during normal load - how do I tune them?

Default circuit breaker settings are too aggressive for most real-world scenarios. They're designed for protecting against cascading failures, not normal traffic variance.

```json
{
  "backend": [{
    "extra_config": {
      "qos/circuit-breaker": {
        "interval": 60,
        "max_errors": 10,
        "name": "backend-circuit-breaker",
        "timeout": 10
      }
    }
  }]
}
```

Tuning guidelines:

- `max_errors`: Start with 20-30 errors per interval, not the default 5
- `interval`: 60 seconds gives you enough data to make decisions
- `timeout`: How long to wait before trying again - start with 10 seconds

Monitor circuit breaker state in your dashboards and adjust based on actual failure patterns.

JWT tokens are being rejected with "signature verification failed" errors

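Usually a key rotation or algorithm mismatch issue. Your auth service rotated keys but KrakenD is still caching the old public key.

Debug JWT validation:

```bash
# Check what KrakenD is seeing
kubectl logs -f krakend-pod | grep -i "jwt"

# Decode a failing token's payload to inspect its claims (this does not verify the signature).
# JWT segments are base64url without padding, so convert the alphabet first;
# if base64 still reports invalid input, append "=" padding.
echo "your-jwt-token" | cut -d. -f2 | tr '_-' '/+' | base64 -d | jq .
```

Common fixes:

- Reduce the JWK cache duration: `"cache": true, "cache_duration": 300` (5 minutes) instead of the default 900 seconds (15 minutes)
- Verify algorithm matches: `"alg": "RS256"` vs what your auth service uses
- Check JWK endpoint accessibility from KrakenD pods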
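For algorithm and key-rotation problems, the token header is the part to look at: it carries `alg` and, usually, the `kid` that should match an entry in your JWK set. The helper below is a minimal sketch with a hypothetical name (`jwt_segment`); it handles the base64url alphabet and the missing padding that trip up a plain `base64 -d`. The sample token is the widely used HS256 demo token with its signature replaced by a placeholder.

```shell
# jwt_segment TOKEN N: decode segment N of a JWT (1 = header, 2 = payload).
# Segments are base64url without padding, so map the alphabet and re-pad first.
jwt_segment() {
  seg=$(printf '%s' "$1" | cut -d. -f"$2" | tr '_-' '/+')
  case $(( ${#seg} % 4 )) in
    2) seg="$seg==" ;;
    3) seg="$seg=" ;;
  esac
  printf '%s' "$seg" | base64 -d
}

token="eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiIxMjM0NTY3ODkwIiwibmFtZSI6IkpvaG4gRG9lIiwiaWF0IjoxNTE2MjM5MDIyfQ.sig"
jwt_segment "$token" 1   # prints {"alg":"HS256","typ":"JWT"}
```

Compare the decoded `alg` with the one configured in KrakenD. Decoding only reads the token; it does not verify the signature.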

Backend services are healthy but KrakenD shows connection failures

This is usually a DNS resolution problem in Kubernetes. It is especially common in multi-namespace deployments, where a short service name does not resolve from another namespace.

Debugging network connectivity:

```bash
# From KrakenD pod
kubectl exec -it krakend-pod -- nslookup backend-service
# Replace 'your-backend-service' with your actual service name
kubectl exec -it krakend-pod -- curl -v http://your-backend-service:8080/health
kubectl exec -it krakend-pod -- netstat -rn  # Check routing table
```

Service naming gotchas:

- Use full service DNS names: `backend-service.namespace.svc.cluster.local`
- Check service selector labels match backend pod labels
- Verify service ports match backend container ports

Rate limiting isn't working as expected - legitimate traffic gets blocked

Per-instance vs cluster-wide rate limiting confusion. KrakenD applies rate limits per instance, so 3 replicas = 3x the configured limit.

Rate limiting that makes sense:

```json
{
  "extra_config": {
    "qos/ratelimit/router": {
      "max_rate": 100,
      "capacity": 200,
      "every": "1s"
    }
  }
}
```

Here `max_rate` is the per-instance limit and `capacity` is the burst capacity.

Calculate actual limits: (max_rate × number_of_replicas) = cluster-wide limit

Consider [cluster rate limiting](https://www.krakend.io/docs/throttling/cluster/) if you need true global rate limits across replicas.
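The arithmetic works in both directions: from a configured `max_rate` to the effective cluster-wide limit, and from a target cluster-wide limit back to the value each instance needs. A minimal sketch with hypothetical helper names, assuming the load balancer spreads traffic evenly across replicas:

```shell
# cluster_limit MAX_RATE REPLICAS
# Effective cluster-wide limit when every replica enforces MAX_RATE on its own.
cluster_limit() {
  echo $(( $1 * $2 ))
}

# per_instance_rate TARGET REPLICAS
# max_rate to configure per instance so the cluster stays at or under TARGET
# (integer division rounds down).
per_instance_rate() {
  echo $(( $1 / $2 ))
}

cluster_limit 100 3        # prints 300
per_instance_rate 1000 3   # prints 333
```

Remember to recompute the per-instance value whenever the replica count changes, for example under autoscaling.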

KrakenD performance degrades over time - what's causing the memory leak?

Usually goroutine leaks from abandoned requests or connection pooling issues. Long-running KrakenD instances accumulate connections and goroutines over time.

Memory leak debugging:

```bash
# Check goroutine count over time (requires the telemetry/metrics extension,
# which serves /__stats on its own port, :8090 by default)
kubectl exec krakend-pod -- curl -s localhost:8090/__stats
# Monitor connection pools
kubectl exec krakend-pod -- netstat -an | grep ESTABLISHED | wc -l
```

Common causes:

- Backend services not properly closing connections
- Infinite timeout configurations allowing requests to hang forever
- Missing context cancellation in custom plugins

Mitigation strategies:

- Set reasonable timeouts at all levels
- Monitor goroutine count and restart pods when it grows too large
- Use connection pooling limits in your HTTP client configuration

KrakenD Production Troubleshooting Guide

Critical Failure Modes and Solutions

Memory Leaks and OOM Kills

Symptoms: Random container restarts, exit code 137, Kubernetes pod kills
Root Cause: Unlimited goroutine spawning during traffic spikes without proper resource limits
Severity: CRITICAL - Can cause complete service outage
Frequency: Common during traffic spikes

Production-Ready Fix:

resources:
  limits:
    memory: "1Gi"
    cpu: "1000m"
  requests:
    memory: "512Mi"
    cpu: "100m"

Critical Configuration:

Set explicit memory limits in Kubernetes deployment
Configure concurrent request limits per endpoint
Monitor goroutine count - restart pods when count grows excessively

Operational Intelligence: Without resource limits, memory problems can escalate from first symptom to OOM kill in about 60 seconds during peak traffic, and the cleanup can cost a team its weekend.

Configuration Validation Failures

Symptoms: Endpoints return 404s, config changes don't take effect, mysterious routing issues
Root Cause: Config mistakes that are still valid JSON, so startup succeeds but routing breaks
Severity: HIGH - Causes service availability issues
Time Investment: Can waste hours debugging 30-second fixes

Validation Strategy:

krakend check --config krakend.json

Common Gotchas:

Missing http:// in backend URLs
Trailing slashes in endpoint paths (/api/users/ vs /api/users)
Wrong url_pattern vs endpoint matching
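
The first two gotchas are simple to script as a pre-deploy step alongside `krakend check`. This is a minimal sketch with hypothetical helper names; it assumes you feed it the host and endpoint values extracted from your own config.

```shell
# host_has_scheme HOST: succeeds when a backend host starts with http:// or https://.
host_has_scheme() {
  case $1 in
    http://*|https://*) return 0 ;;
    *) return 1 ;;
  esac
}

# endpoint_has_trailing_slash PATH: succeeds for paths like /api/users/ (but not "/").
endpoint_has_trailing_slash() {
  case $1 in
    /)  return 1 ;;
    */) return 0 ;;
    *)  return 1 ;;
  esac
}

host_has_scheme "backend:8080" || echo "backend:8080: missing http:// prefix"
endpoint_has_trailing_slash "/api/users/" && echo "/api/users/: trailing slash"
```

Both example calls print a warning, which is the signal to fix the config before it ships.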

Operational Intelligence: KrakenD error messages for config issues are inadequate. Always validate before deployment.

Backend Service Discovery Failures

Symptoms: 502 errors, "connection refused", connectivity failures despite running services
Root Cause: DNS resolution issues in Kubernetes environments
Severity: HIGH - Breaks service communication

Debugging Commands:

kubectl exec -it krakend-pod -- nslookup backend-service
kubectl exec -it krakend-pod -- curl backend-service:8080/health

Kubernetes-Specific Solutions:

Use full service names: backend-service.namespace.svc.cluster.local:8080
Verify service selector labels match backend pods
Check port number conflicts (service port vs container port)

JWT Validation Failures

Symptoms: 401 errors for valid tokens, sporadic authentication failures
Root Cause: Clock drift, JWK endpoint unreachability, algorithm mismatches
Severity: HIGH - Breaks user authentication

Temporary Fix (for debugging only):

{
  "auth/validator": {
    "alg": "RS256",
    "jwk_url": "https://your-auth.com/.well-known/jwks.json",
    "cache": true,
    "cache_duration": 900,
    "disable_jwk_security": true
  }
}

Warning: disable_jwk_security lets KrakenD fetch keys over plain HTTP. It is for debugging only - never leave it in production

Rate Limiting Issues

Symptoms: Legitimate traffic getting 429 errors during normal load
Root Cause: Per-instance rate limiting with aggressive defaults
Operational Intelligence: Rate limits are per-instance, not global - 3 replicas = 3x configured limit

Working Configuration:

{
  "qos/ratelimit/router": {
    "max_rate": 1000,
    "capacity": 1000,
    "every": "1s"
  }
}

Production Monitoring Requirements

Essential Metrics (Life-Saving)

Response time percentiles per endpoint: P95/P99 latencies by endpoint
Circuit breaker state changes: Often the earliest warning, roughly 60 seconds before an incident escalates
Backend connection failures: Separate connection refused, timeout, DNS errors
Request queue depth: Growing queues indicate impending problems
Memory usage with goroutine count: Detect memory leaks early

Alert Thresholds (Tested in Production)

Error rate > 1% for 2+ minutes (not just spikes)
P95 response time > 2x baseline for 5 minutes
Circuit breaker open: Immediate alert (any duration)
Memory usage > 80% for 10 minutes
Backend connection failures > 10/minute: Per backend

Logging Configuration

{
  "extra_config": {
    "telemetry/logging": {
      "level": "INFO",
      "prefix": "[KRAKEND]",
      "syslog": false,
      "stdout": true,
      "format": "json"
    }
  }
}

Critical Log Patterns:

connection refused: Backend connectivity
context deadline exceeded: Timeout problems
jwt validation failed: Authentication issues
circuit breaker is open: Backend failures
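
During an incident it helps to tally these patterns rather than read raw logs. The sketch below maps each line to one of the four areas listed above; the helper names are hypothetical, matching is case-sensitive, and the exact log wording may differ between KrakenD versions.

```shell
# classify_log_line LINE: map a log line to the problem area it indicates.
classify_log_line() {
  case $1 in
    *"connection refused"*)        echo "backend connectivity" ;;
    *"context deadline exceeded"*) echo "timeout" ;;
    *"jwt validation failed"*)     echo "authentication" ;;
    *"circuit breaker is open"*)   echo "backend failures" ;;
    *)                             echo "other" ;;
  esac
}

# summarize_logs: read log lines on stdin and print a count per problem area,
# most frequent first. Example: kubectl logs krakend-pod | summarize_logs
summarize_logs() {
  while IFS= read -r line; do
    classify_log_line "$line"
  done | sort | uniq -c | sort -rn
}
```

A sudden jump in one category tells you which of the sections in this guide to start with.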

Performance Troubleshooting Workflow

Debugging Sequence (80% Success Rate)

Check backend health first: 80% of KrakenD issues are backend problems
Review traffic patterns: Correlate performance degradation with traffic spikes
Examine resource utilization: Scale horizontally before config tuning
Validate recent config changes: Most production issues stem from recent deployments
Analyze request flows: Use distributed tracing for expensive operations

Capacity Planning Reality

Memory Scaling Formula: 100MB base + (concurrent_requests × 1MB) for each endpoint
Example: 10 endpoints × 100 concurrent requests × 1MB + 100MB base ≈ 1.1GB memory usage under load
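
The estimate can be worked through in a few lines. This sketch assumes the formula is read as a single 100 MB base plus 1 MB per in-flight request summed over all endpoints, which is the reading that matches the example; `estimate_memory_mb` is a hypothetical helper name.

```shell
# estimate_memory_mb ENDPOINTS CONCURRENT_PER_ENDPOINT
# Rule of thumb: 100 MB base + 1 MB per in-flight request.
estimate_memory_mb() {
  echo $(( 100 + $1 * $2 ))
}

estimate_memory_mb 10 100   # prints 1100 (about 1.1 GB)
```

Treat the result as a starting point for the memory limit and adjust it against observed usage.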

CPU vs Network: KrakenD is CPU-efficient, so network I/O usually becomes the limit first. Add replicas for bandwidth before adding CPU cores.

Emergency Response (Quick Fixes)

Exit Code 137 (OOM Kill)

Immediate Action: Set memory limits in Kubernetes deployment
Root Cause: Unlimited concurrent requests exhausting memory

502 Errors with Running Backends

Check List:

Backend URL spelling (missing http:// prefix most common)
Network connectivity from KrakenD pods
Port conflicts (service vs container ports)

Config Changes Not Taking Effect

Debug Steps:

Verify ConfigMap mounting: kubectl exec -it krakend-pod -- cat /etc/krakend/krakend.json
Validate JSON syntax: krakend check --config krakend.json
Confirm pod restart after config changes

JWT Random Failures

Immediate Fixes:

Check NTP synchronization across nodes
Reduce JWK cache TTL to 5 minutes
Verify auth service JWK endpoint accessibility

Rate Limiting Blocking Legitimate Traffic

Solution: Start with generous limits, tune down based on metrics
Remember: Limits are per-instance - calculate total capacity across replicas

Advanced Production Problems

Memory Consumption During Traffic Spikes

Cause: No concurrency limits + slow backends = request pileup
Solution: Rate-limit each endpoint and keep concurrent_calls low (it sends that many parallel copies of each request to the backend; default 1)

{
  "endpoints": [{
    "concurrent_calls": 1,
    "extra_config": {
      "qos/ratelimit/router": {
        "max_rate": 100,
        "capacity": 200
      }
    }
  }]
}

Zero-Downtime Config Updates

Strategy:

Use flexible configuration with environment variables
Deploy config changes separately from image updates
Force pod restart on config changes with annotations

Circuit Breaker Tuning

Production Settings:

{
  "qos/circuit-breaker": {
    "interval": 60,
    "max_errors": 20,
    "timeout": 10
  }
}

Guidelines: Start with 20-30 errors per interval, not default 5

Network Connectivity Debug

kubectl exec -it krakend-pod -- nslookup backend-service
kubectl exec -it krakend-pod -- curl -v backend-service:8080/health
kubectl exec -it krakend-pod -- netstat -rn

Resource Requirements

Time Investments

Basic setup: 2-4 hours for production-ready configuration
Monitoring setup: 4-8 hours for comprehensive observability
Troubleshooting major issues: 2-6 hours without proper monitoring, 30 minutes with good observability

Expertise Requirements

Kubernetes networking knowledge: Essential for service discovery issues
JWT/OAuth understanding: Required for authentication troubleshooting
Observability tools: Prometheus, Grafana, distributed tracing experience

Breaking Points

1000+ concurrent requests per endpoint: UI becomes unusable for debugging
Traffic spikes without resource limits: Guaranteed OOM kills
Default circuit breaker settings: Too aggressive for real-world traffic variance
Missing health checks: Kubernetes restarts healthy pods

Community Resources

Response Time Expectations

GitHub Issues: 1-3 days for community response
Slack Community: Real-time during business hours
Stack Overflow: 6-24 hours for common problems

Quality Indicators

Documentation: Good for basics, limited for production edge cases
Community Support: Active, experienced users share real-world solutions
Enterprise Support: Available for mission-critical deployments

Critical Warnings

What Documentation Doesn't Tell You

Default settings will fail in production traffic
Error messages are often misleading or generic
Resource requirements scale non-linearly with traffic
Clock synchronization is critical for JWT validation
Per-instance rate limiting catches everyone off guard

Configuration Gotchas

Missing http:// prefix breaks backend connectivity
Trailing slashes create different endpoints
Circuit breaker defaults are too aggressive
JWT validation requires perfect timing synchronization
Health checks only test process status, not functionality

Useful Links for Further Investigation

Link	Description
KrakenD Check Command	Validate your configuration before deploying. Should be part of your CI/CD pipeline to catch config errors before they hit production.
KrakenD Debug Endpoint	Built-in endpoint that shows internal request/response flow. Essential for debugging routing issues and backend connectivity problems.
Grafana Dashboard for KrakenD	Pre-configured dashboard with all the metrics you need to debug performance issues. Import this before you have problems, not during them.
KrakenD Health Check Configuration	Configure proper health checks for Kubernetes liveness and readiness probes. Basic health checks miss most production issues.
OpenTelemetry Integration	Complete observability setup with traces, metrics, and logs. Required for debugging complex request flows through multiple services.
Prometheus Metrics Export	Expose KrakenD metrics for Prometheus scraping. Essential for alerting and monitoring production deployments.
Circuit Breaker Configuration	Protect your backends from cascading failures. Proper circuit breaker configuration prevents minor issues from becoming major outages.
Distributed Tracing with Jaeger	Track request flows across services to identify bottlenecks and failures. Critical for debugging performance issues in microservices environments.
Kubernetes Deployment Guide	Production-ready Kubernetes configurations with proper resource limits, health checks, and scaling parameters.
Flexible Configuration System	Template-based configuration management for multiple environments. Reduces configuration errors and simplifies deployments.
Docker Best Practices	Container optimization and security hardening for production KrakenD deployments. Includes resource optimization and security configurations.
Configuration Audit Tool	Automated security and performance analysis of your KrakenD configuration. Run this regularly to catch misconfigurations before they cause issues.
Load Balancing Strategies	Configure backend load balancing for optimal performance and failover. Essential for high-availability production deployments.
Connection Pooling and Timeouts	HTTP client optimization for backend connections. Poor connection pooling configuration causes most performance issues.
Rate Limiting Best Practices	Production-tested rate limiting configurations that protect your APIs without blocking legitimate traffic.
KrakenD GitHub Issues	Search existing issues before creating new ones. Most production problems have been encountered and solved by others.
KrakenD Community Forum	Active community support for troubleshooting and configuration questions. Response times are usually better than most paid support.
KrakenD Slack Community	Real-time chat support with KrakenD users and developers. Fastest way to get help during production incidents.
Stack Overflow KrakenD Tag	Searchable Q&A for common KrakenD problems. Good for finding solutions to specific error messages and configuration issues.
