The Most Common Ways KrakenD Breaks (And How to Fix Them)

KrakenD Production Troubleshooting

After dealing with production KrakenD deployments for years, these are the problems that wake you up at 3am. Most of them have simple fixes once you know what to look for.

Memory Leaks and OOM Kills

Symptom: KrakenD containers randomly restart, kubectl logs shows exit code 137, Kubernetes keeps killing your pods.

KrakenD Memory Usage Monitoring

This usually hits when you're handling concurrent requests without proper resource limits. KrakenD spawns goroutines like crazy during traffic spikes, and if you don't configure memory limits correctly, Kubernetes will murder your pods. Check the Kubernetes deployment guide and Docker deployment best practices for proper resource configuration.

The fix that actually works:

resources:
  limits:
    memory: "1Gi"
    cpu: "1000m"
  requests:
    memory: "512Mi"
    cpu: "100m"

Set explicit memory limits in your Kubernetes deployment. Don't trust the defaults. I've seen teams lose entire weekends because they skipped this basic step. Also check your circuit breaker configuration - aggressive timeouts can cause request pileups. For monitoring, set up Prometheus metrics and Grafana dashboards to catch these issues early.
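Beyond pod limits, you can bound how much work KrakenD itself takes on. A minimal sketch of root-level settings in krakend.json that cap idle connections and slow clients - these are the standard v2 service-level option names, but verify them against your version's documentation before relying on them:

{
  "version": 3,
  "timeout": "3s",
  "read_timeout": "5s",
  "write_timeout": "5s",
  "idle_timeout": "30s",
  "max_idle_connections": 250
}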

Configuration Validation Hell

Symptom: KrakenD starts but endpoints return 404s, config changes don't take effect, mysterious routing issues.

The number one cause: your JSON config has subtle syntax errors that don't break startup but break routing. KrakenD's error messages for config issues are... not great. You'll spend hours debugging what should be a 30-second fix. Use the configuration check command and configuration audit tool to catch these before deployment.

Check this first:

## Always validate before deploying
krakend check --config krakend.json
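If plain check passes but routing still misbehaves, the extra validation commands catch more. Exact flags vary by KrakenD release, so treat this as a sketch and confirm with krakend check --help:

## Schema lint plus route validation (flag availability depends on your version)
krakend check --config krakend.json --lint
## Best-practice audit, available in recent releases
krakend audit -c krakend.json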

Common gotchas:

  • Missing http:// in backend URLs (this one cost me 2 hours last month)
  • Trailing slashes in endpoint paths - /api/users/ vs /api/users are different endpoints
  • Wrong url_pattern vs endpoint matching - read the docs because this trips everyone up. Also check backend configuration and parameter forwarding guide.

Backend Service Discovery Failures

Symptom: 502 errors, "connection refused", KrakenD can't reach your services even though they're running.

In Kubernetes, this is usually DNS resolution. KrakenD tries to connect to backend-service:8080 but can't resolve the hostname because of namespace issues or service naming problems. Check the service discovery documentation and Kubernetes networking guide.

Debug the networking:

## From inside your KrakenD pod
kubectl exec -it krakend-pod -- nslookup backend-service
kubectl exec -it krakend-pod -- curl your-backend-service:8080/health
## Replace 'your-backend-service' with your actual service name and add http://

Kubernetes-specific fixes:

  • Use full service names: backend-service.namespace.svc.cluster.local:8080
  • Check your service selector labels match your backend pods
  • Verify port numbers - common mistake is exposing 80 but backend listens on 8080
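What the first fix looks like in the backend block - the namespace, service name, and port here are placeholders for your own:

{
  "endpoints": [{
    "endpoint": "/api/users",
    "backend": [{
      "url_pattern": "/users",
      "host": ["http://backend-service.my-namespace.svc.cluster.local:8080"]
    }]
  }]
}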

JWT Validation Breaking Authentication

Symptom: 401 errors for valid tokens, authentication works sporadically, users getting logged out randomly.

KrakenD's JWT validation is picky about token format and timing. Clock drift between services can cause valid tokens to be rejected as expired.

Common authentication failures:

  • JWK endpoint unreachable - KrakenD can't fetch public keys
  • Token algorithm mismatch - your auth service uses RS256, config says HS256
  • Clock synchronization issues - NTP drift causes timing validation failures

The nuclear option that usually works:

{
  "auth/validator": {
    "alg": "RS256",
    "jwk_url": "https://your-auth.com/.well-known/jwks.json",
    "cache_ttl": "15m",
    "disable_jwk_security": true
  }
}

disable_jwk_security is a temporary hack while you fix the real issue. Don't leave it in production.
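Before reaching for disable_jwk_security, confirm the algorithm-mismatch theory by decoding the token header and comparing it with what your JWK endpoint publishes. This is a quick-and-dirty check (URL-safe base64 without padding can make base64 -d complain):

## Inspect the token header: look at "alg" and "kid"
echo "your-jwt-token" | cut -d. -f1 | base64 -d 2>/dev/null | jq .
## Compare against the keys KrakenD fetches
curl -s https://your-auth.com/.well-known/jwks.json | jq '.keys[] | {kid, alg}'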

Rate Limiting Gone Wrong

Symptom: Legitimate traffic getting 429 errors, rate limits triggering during normal load, users complaining about blocked requests.

KrakenD's rate limiting configuration is confusing and the defaults are aggressive. Token bucket settings don't behave like most people expect.

Rate limiting that doesn't suck:

{
  "qos/ratelimit/token-bucket": {
    "max_rate": 1000,
    "capacity": 1000,
    "every": "1s"
  }
}

Start with generous limits and tune down based on actual traffic patterns. Monitor your rate limiting metrics in Grafana to see what's actually being blocked.
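To see how the bucket behaves under your configured limits, hammer one endpoint and tally the status codes - a rough sketch, with the URL and request count as placeholders:

## Fire a burst and count responses (expect 429s once the bucket drains)
for i in $(seq 1 200); do
  curl -s -o /dev/null -w "%{http_code}\n" http://krakend:8080/api/users
done | sort | uniq -c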

Emergency Debugging Questions (Quick Fixes)

Q

Why is KrakenD returning 502 errors for backends that are clearly running?

A

Check the obvious stuff first:

  • Is the backend URL spelled correctly in your config? Missing http:// prefix is the most common mistake
  • Can KrakenD actually reach the backend? kubectl exec -it krakend-pod -- curl backend-url
  • Are you using the right port? Service port vs container port confusion kills 30% of deployments

Real fix: Add health checks to your backend services and monitor connectivity from KrakenD pods. Circuit breakers will help isolate failing services but they won't fix basic networking issues.

Q

KrakenD keeps crashing with exit code 137 - what's killing it?

A

It's almost always memory limits. Exit code 137 means Kubernetes killed your pod for using too much RAM. KrakenD can consume massive amounts of memory during traffic spikes if you don't set proper resource limits.

resources:
  limits:
    memory: "1Gi"    # Set this based on your actual usage
  requests:
    memory: "256Mi"

Debugging memory issues: Check your concurrent requests settings. Default is unlimited, which will eat all available memory.
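To confirm it really was an OOM kill rather than a crash, check the pod's last state and watch live usage. The app=krakend label is an assumption about how your pods are labeled, and kubectl top needs metrics-server installed:

kubectl describe pod krakend-pod | grep -A 5 "Last State"   # look for Reason: OOMKilled
kubectl top pod -l app=krakend                               # live memory usage per pod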

Q

Config changes aren't taking effect - what am I doing wrong?

A

Most likely culprits:

  1. Config not mounted properly - check your Kubernetes ConfigMap and volume mounts
  2. KrakenD didn't restart - config changes require a restart unless you have hot reload enabled
  3. Invalid JSON - use krakend check --config krakend.json to validate
  4. Wrong config file path - KrakenD defaults to looking for /etc/krakend/krakend.json

Quick validation: kubectl exec -it krakend-pod -- cat /etc/krakend/krakend.json to see what config KrakenD is actually using.
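A minimal sketch of mounting the config through a ConfigMap so the file lands at /etc/krakend/krakend.json - names, labels, and the image tag are placeholders for your own:

apiVersion: v1
kind: ConfigMap
metadata:
  name: krakend-config
data:
  krakend.json: |
    { "version": 3, "port": 8080, "endpoints": [] }
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: krakend
spec:
  selector:
    matchLabels:
      app: krakend
  template:
    metadata:
      labels:
        app: krakend
    spec:
      containers:
        - name: krakend
          image: devopsfaith/krakend:2.7   # pin your actual version
          volumeMounts:
            - name: config
              mountPath: /etc/krakend
      volumes:
        - name: config
          configMap:
            name: krakend-config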

Q

JWT validation is randomly failing - tokens work sometimes but not others

A

Classic symptoms of clock drift or JWK caching issues. JWTs have expiration times that are sensitive to clock synchronization between services.

Immediate fixes:

  • Check NTP synchronization on all nodes
  • Increase JWK cache TTL in your config to reduce key fetching issues
  • Verify your auth service's JWK endpoint is always reachable
  • Consider setting cookie_key if your clients send the JWT in a cookie instead of the Authorization header
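A quick way to test the clock-drift theory: compare the gateway container's clock against the Date header your auth server returns (your-auth.com is the same placeholder used above):

kubectl exec -it krakend-pod -- date -u
curl -sI https://your-auth.com/.well-known/jwks.json | grep -i '^date:'
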
Q

Rate limiting is blocking legitimate traffic - how do I tune it?

A

KrakenD's rate limiting is per-instance, not global. If you have 3 KrakenD pods, each gets the full rate limit allocation. This catches everyone off guard.

{
  "qos/ratelimit/token-bucket": {
    "max_rate": 100,
    "capacity": 200,
    "every": "1s"
  }
}

Start high and tune down based on actual metrics. Use your monitoring dashboard to see what's being rate limited before adjusting limits.

Q

KrakenD won't start and logs show "bind: address already in use"

A

Port conflict. Another process is using port 8080 (KrakenD's default). This happens in Docker environments when you have multiple containers trying to use the same port.

Quick fixes:

  • Change KrakenD's port in config: "port": 8081
  • Check what's using the port: lsof -i :8080 or netstat -tulpn | grep 8080
  • In Kubernetes, check for port conflicts in your service definitions
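The port change from the first bullet is a one-line edit at the root of krakend.json:

{
  "version": 3,
  "port": 8081,
  "endpoints": []
}
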
Q

Backend services are slow and KrakenD is timing out - how do I fix timeouts?

A

Timeout hell is common with microservices. KrakenD has multiple timeout settings that can conflict with each other.

Timeout hierarchy (from most specific to least):

  1. Backend timeout: "timeout": "30s" in backend config
  2. Endpoint timeout: "timeout": "45s" in endpoint config
  3. Global timeout: "timeout": "60s" in root config

Rule of thumb: Backend timeout < Endpoint timeout < Global timeout. Give yourself buffer time for aggregation and processing.
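Putting the hierarchy together in one sketch, following the guide's rule of thumb - double-check that your KrakenD version honors a backend-level timeout before relying on it:

{
  "version": 3,
  "timeout": "60s",
  "endpoints": [{
    "endpoint": "/api/orders",
    "timeout": "45s",
    "backend": [{
      "url_pattern": "/orders",
      "host": ["http://orders-service:8080"],
      "timeout": "30s"
    }]
  }]
}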

Q

Why can't I see detailed error messages from my backends?

A

KrakenD sanitizes backend errors by default. You're probably seeing generic 500 errors instead of the actual backend error messages.

Enable detailed errors:

{
  "backend": [{
    "url_pattern": "/api/service",
    "host": ["http://backend:8080"],
    "extra_config": {
      "backend/http": {
        "return_error_details": "backend_alias"
      }
    }
  }]
}

Security warning: Don't enable this in production unless you're sure your backend errors don't leak sensitive information.

Production Monitoring That Actually Helps

When KrakenD breaks in production, you need monitoring that tells you what's wrong instead of just that something is wrong.

Most teams set up the basic metrics but miss the ones that matter during incidents. Check the telemetry overview and OpenTelemetry implementation guide for comprehensive monitoring setup.

Essential Metrics for Production KrakenD

Memory and CPU are basic - these are the ones that save your ass:

KrakenD Detailed Performance Dashboard

Response time percentiles by endpoint: P95 and P99 latencies tell you which endpoints are struggling before users complain. Monitor these per endpoint, not just globally. Set up Prometheus monitoring and Grafana dashboards for visualization.

Circuit breaker state changes: When circuit breakers start opening, you have maybe 60 seconds before the incident escalates. Set alerts on circuit breaker state transitions, not just error rates.

Backend connection failures: Track connection refused, timeout, and DNS resolution errors separately. Each indicates a different type of problem requiring different fixes.

JWT validation failures: Separate auth failures from other 401s. Clock drift and JWK endpoint issues show up here first.

Request queue depth: KrakenD queues requests during backend slowdowns. Queue depth growing means you're about to have a bad time.

Logging Configuration for Troubleshooting

Standard KrakenD logs are useless for production debugging. You need structured logging with the right log levels. Check the logging documentation and Graylog integration guide for structured logging setup.

{
  "extra_config": {
    "telemetry/logging": {
      "level": "INFO",
      "prefix": "[KRAKEND]",
      "syslog": false,
      "stdout": true,
      "format": "json"
    }
  }
}

Critical log patterns to alert on:

  • connection refused - backend connectivity issues
  • context deadline exceeded - timeout problems
  • jwt validation failed - authentication problems
  • circuit breaker is open - backend failures

Health Check Strategy

KrakenD's health endpoint only tells you if the process is running, not if it's working correctly. Better health checks test actual functionality:

{
  "endpoints": [{
    "endpoint": "/__health_detailed",
    "method": "GET",
    "backend": [{
      "url_pattern": "/health",
      "host": ["http://critical-backend:8080"]
    }]
  }]
}

Create a health endpoint that actually tests backend connectivity.

Use this for Kubernetes liveness probes instead of the default health endpoint.

Alert Thresholds That Work

Too many false positives train people to ignore alerts. These thresholds are based on real production experience:

  • Error rate > 1% for more than 2 minutes, not just a spike (see the example rule below)
  • P95 response time > 2x baseline for 5 minutes
  • Circuit breaker open for any duration (immediate alert)
  • Memory usage > 80% for 10 minutes (gives you time to scale before OOM kills)
  • Backend connection failures > 10/minute for any backend
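As a sketch, here is the first threshold expressed as a Prometheus alerting rule. The metric name krakend_router_response_total is a placeholder - substitute whatever your telemetry exporter actually exposes:

groups:
  - name: krakend
    rules:
      - alert: KrakenDHighErrorRate
        expr: |
          sum(rate(krakend_router_response_total{status=~"5.."}[2m]))
            / sum(rate(krakend_router_response_total[2m])) > 0.01
        for: 2m
        labels:
          severity: page
        annotations:
          summary: "KrakenD error rate above 1% for 2 minutes"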

Performance Troubleshooting Workflow

When KrakenD performance goes to shit, follow this debugging sequence:

  1. Check backend health first: 80% of KrakenD performance issues are actually backend issues. Look at backend response times and error rates before diving into gateway metrics.
  2. Review traffic patterns: Sudden traffic spikes break things in predictable ways. Check if the performance degradation correlates with traffic increases.
  3. Examine resource utilization: Memory and CPU spikes indicate resource constraints. Scale horizontally before trying to tune configuration.
  4. Validate configuration changes: Recent config deployments cause most production issues. Compare current config with the last known good configuration.
  5. Analyze request flows: Use distributed tracing to understand where requests are spending time. Request aggregation and data manipulation can be expensive. Also check Zipkin integration and AWS X-Ray setup for tracing.

Capacity Planning Reality Check

KrakenD scales differently than other services. Most teams underestimate memory requirements and overestimate CPU needs.

Check the server dimensioning guide and clustering documentation for scaling best practices.

Memory scaling: KrakenD uses roughly 100MB base + (concurrent_requests × 1MB) per endpoint. If you have 10 endpoints configured for 100 concurrent requests each, expect 1GB+ memory usage under load.

CPU scaling: KrakenD is CPU efficient until you hit network I/O limits. You'll usually need more replicas for network bandwidth before you need more CPU cores.

Network bandwidth: API gateways push a lot of data. Monitor network utilization as closely as CPU and memory.
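Turning that rule of thumb into pod resources for the 10-endpoint example above (the 1MB-per-concurrent-request figure is this guide's estimate, not an official benchmark - measure under load before committing):

resources:
  requests:
    memory: "1536Mi"   # ~100MB base + 10 endpoints x 100 concurrent x 1MB, plus headroom
  limits:
    memory: "2Gi"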

Advanced Production Problems (The Hard Stuff)

Q

KrakenD is consuming all available memory during traffic spikes - how do I fix it?

A

This is usually concurrent request limits combined with slow backends. KrakenD queues requests and spawns goroutines for each concurrent request. Slow backends mean requests pile up and consume memory.

Immediate fixes:

{
  "endpoints": [{
    "concurrent_calls": 10,
    "extra_config": {
      "qos/ratelimit/token-bucket": {
        "max_rate": 100,
        "capacity": 200
      }
    }
  }]
}

Set explicit concurrent call limits per endpoint. Default unlimited concurrency will exhaust memory during traffic spikes.

Resource limits that actually work:

resources:
  limits:
    memory: "2Gi"
    cpu: "1000m"
  requests:
    memory: "512Mi"
    cpu: "200m"

Q

Configuration changes cause intermittent 404s - what's the deployment issue?

A

Rolling updates with configuration changes break routing temporarily. KrakenD loads configuration at startup, so config changes require pod restarts during rolling deployments.

Zero-downtime config updates:

  1. Use flexible configuration with environment variables for values that change frequently (sketch below)
  2. Deploy configuration changes as a separate step from image updates
  3. Consider hot reload for the Enterprise edition

ConfigMap update strategy:

spec:
  template:
    metadata:
      annotations:
        configHash: "{{ .Values.configHash }}"  # Force pod restart on config change
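For step 1, flexible configuration renders krakend.json from templates plus per-environment settings at startup. A rough sketch assuming the standard FC_* variables - verify the exact names against the flexible configuration docs for your release:

## Render the config from a template and environment-specific settings
FC_ENABLE=1 \
FC_SETTINGS="/etc/krakend/settings" \
FC_TEMPLATES="/etc/krakend/templates" \
krakend run -c /etc/krakend/krakend.tmpl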

Q

Circuit breakers are opening unnecessarily during normal load - how do I tune them?

A

Default circuit breaker settings are too aggressive for most real-world scenarios. They're designed for protecting against cascading failures, not normal traffic variance.

{
  "backend": [{
    "extra_config": {
      "qos/circuit-breaker": {
        "interval": 60,
        "max_errors": 10,
        "name": "backend-circuit-breaker",
        "timeout": 10
      }
    }
  }]
}

Tuning guidelines:

  • max_errors: Start with 20-30 errors per interval, not the default 5
  • interval: 60 seconds gives you enough data to make decisions
  • timeout: How long to wait before trying again - start with 10 seconds

Monitor circuit breaker state in your dashboards and adjust based on actual failure patterns.

Q

JWT tokens are being rejected with "signature verification failed" errors

A

Usually a key rotation or algorithm mismatch issue. Your auth service rotated keys but KrakenD is still caching the old public key.

Debug JWT validation:

# Check what KrakenD is seeing
kubectl logs -f krakend-pod | grep "jwt"

# Manually decode a failing token's payload
echo "your-jwt-token" | cut -d. -f2 | base64 -d | jq .

Common fixes:

  • Reduce the JWK cache TTL: "cache_ttl": "5m" instead of the default 15 minutes
  • Verify the algorithm matches: "alg": "RS256" vs what your auth service uses
  • Check JWK endpoint accessibility from KrakenD pods

Q

Backend services are healthy but KrakenD shows connection failures

A

DNS resolution problems in Kubernetes. This is especially common in multi-namespace deployments where service discovery gets confused.

Debugging network connectivity:

# From the KrakenD pod
kubectl exec -it krakend-pod -- nslookup backend-service
kubectl exec -it krakend-pod -- curl -v your-backend-service:8080/health
# Replace 'your-backend-service' with your actual service name and add http://
kubectl exec -it krakend-pod -- netstat -rn   # Check the routing table

Service naming gotchas:

  • Use full service DNS names: backend-service.namespace.svc.cluster.local
  • Check service selector labels match backend pod labels
  • Verify service ports match backend container ports

Q

Rate limiting isn't working as expected - legitimate traffic gets blocked

A

Per-instance vs cluster-wide rate limiting confusion. KrakenD applies rate limits per instance, so 3 replicas = 3x the configured limit.

Rate limiting that makes sense:

{
  "extra_config": {
    "qos/ratelimit/token-bucket": {
      "max_rate": 100,
      "capacity": 200,
      "every": "1s"
    }
  }
}

Here max_rate is the per-instance limit and capacity is the burst allowance.

Calculate actual limits: (max_rate × number_of_replicas) = cluster-wide limit.

Consider cluster rate limiting if you need true global rate limits across replicas.

Q

KrakenD performance degrades over time - what's causing the memory leak?

A

Usually goroutine leaks from abandoned requests or connection pooling issues. Long-running KrakenD instances accumulate connections and goroutines over time.

Memory leak debugging:

# Check goroutine count over time
kubectl exec -it krakend-pod -- curl localhost:8080/__stats

# Monitor connection pools
kubectl exec -it krakend-pod -- netstat -an | grep ESTABLISHED | wc -l

Common causes:

  • Backend services not properly closing connections
  • Infinite timeout configurations allowing requests to hang forever
  • Missing context cancellation in custom plugins

Mitigation strategies:

  • Set reasonable timeouts at all levels
  • Monitor goroutine count and restart pods when it grows too large
  • Use connection pooling limits in your HTTP client configuration
