KrakenD Production Troubleshooting Guide

Critical Failure Modes and Solutions

Memory Leaks and OOM Kills

Symptoms: Random container restarts, exit code 137, Kubernetes pod kills
Root Cause: Unlimited goroutine spawning during traffic spikes without proper resource limits
Severity: CRITICAL - Can cause complete service outage
Frequency: Common during traffic spikes

Production-Ready Fix:

resources:
  limits:
    memory: "1Gi"
    cpu: "1000m"
  requests:
    memory: "512Mi"
    cpu: "100m"

Critical Configuration:

  • Set explicit memory limits in Kubernetes deployment
  • Bound in-flight requests per endpoint with rate limits and backend timeouts
  • Monitor goroutine count - restart pods when the count keeps growing after traffic subsides

Operational Intelligence: Teams without proper resource limits lose entire weekends to this. During peak traffic, memory pressure can escalate to an OOM kill in about 60 seconds.

Configuration Validation Failures

Symptoms: Endpoints return 404s, config changes don't take effect, mysterious routing issues
Root Cause: Configuration mistakes that are still valid JSON, so startup succeeds but routing breaks
Severity: HIGH - Causes service availability issues
Time Investment: Can waste hours debugging 30-second fixes

Validation Strategy:

krakend check --config krakend.json

Common Gotchas:

  • Missing http:// in backend host entries
  • Trailing slashes in endpoint paths (/api/users/ vs /api/users)
  • Confusing endpoint (the path KrakenD exposes) with url_pattern (the path requested from the backend)

Operational Intelligence: KrakenD error messages for config issues are inadequate. Always validate before deployment.
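The first two gotchas above are easy to catch mechanically before deployment. The following is a minimal sketch, not a replacement for krakend check: it assumes the standard layout of endpoints, each with a backend list whose host entries may also be inherited from a root-level host list. The function name find_config_gotchas and the sample values are hypothetical, and the scheme check assumes plain HTTP backends.

```python
import json


def find_config_gotchas(config: dict) -> list[str]:
    """Return warnings for common mistakes in a parsed KrakenD config."""
    warnings = []
    for endpoint in config.get("endpoints", []):
        path = endpoint.get("endpoint", "")
        # "/api/users/" and "/api/users" are different endpoints.
        if len(path) > 1 and path.endswith("/"):
            warnings.append(f"{path}: trailing slash in endpoint path")
        for backend in endpoint.get("backend", []):
            # A backend without its own "host" inherits the root-level list.
            hosts = backend.get("host", config.get("host", []))
            for host in hosts:
                if not host.startswith(("http://", "https://")):
                    warnings.append(
                        f"{path}: backend host '{host}' has no http:// or https:// scheme"
                    )
    return warnings


if __name__ == "__main__":
    # Hypothetical config fragment that contains both mistakes.
    sample = json.loads("""
    {
      "endpoints": [{
        "endpoint": "/api/users/",
        "backend": [{"host": ["backend-service:8080"], "url_pattern": "/users"}]
      }]
    }
    """)
    for warning in find_config_gotchas(sample):
        print(warning)
```

Run it in CI next to krakend check so a failing lint blocks the deployment.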

Backend Service Discovery Failures

Symptoms: 502 errors, "connection refused", connectivity failures despite running services
Root Cause: DNS resolution issues in Kubernetes environments
Severity: HIGH - Breaks service communication

Debugging Commands:

kubectl exec -it krakend-pod -- nslookup backend-service
kubectl exec -it krakend-pod -- curl backend-service:8080/health

Kubernetes-Specific Solutions:

  • Use full service names: backend-service.namespace.svc.cluster.local:8080
  • Verify service selector labels match backend pods
  • Check port number conflicts (service port vs container port)
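When nslookup and curl are not available, the same distinction between DNS errors, refused connections, and timeouts can be made with a few lines of Python. This is a sketch under assumptions: it needs to run from a pod in the same namespace that has Python installed (the KrakenD image itself may not), and probe_backend is a hypothetical helper name.

```python
import socket


def probe_backend(host: str, port: int, timeout: float = 2.0) -> str:
    """Classify a TCP connection attempt to a backend."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "ok"
    except socket.gaierror:
        # Name did not resolve: check the service name and namespace.
        return "dns error"
    except (socket.timeout, TimeoutError):
        # Resolved but no answer: check network policies and routing.
        return "timeout"
    except ConnectionRefusedError:
        # Resolved and reachable, but nothing listens on that port.
        return "connection refused"
    except OSError as exc:
        return f"other error: {exc}"


if __name__ == "__main__":
    # Service name taken from the example above; adjust to your cluster.
    print(probe_backend("backend-service.namespace.svc.cluster.local", 8080))
```

A "dns error" points at the service name, "connection refused" at a port mismatch or missing pods behind the selector.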

JWT Validation Failures

Symptoms: 401 errors for valid tokens, sporadic authentication failures
Root Cause: Clock drift, JWK endpoint unreachability, algorithm mismatches
Severity: HIGH - Breaks user authentication

Temporary Fix (for debugging only):

{
  "auth/validator": {
    "alg": "RS256",
    "jwk_url": "https://your-auth.com/.well-known/jwks.json",
    "cache": true,
    "cache_duration": 900,
    "disable_jwk_security": true
  }
}

Warning: disable_jwk_security lets KrakenD fetch keys over plain HTTP. It is for debugging only - never leave it enabled in production

Rate Limiting Issues

Symptoms: Legitimate traffic getting 429 errors during normal load
Root Cause: Per-instance rate limiting with aggressive defaults
Operational Intelligence: Rate limits are per-instance, not global - 3 replicas = 3x configured limit

Working Configuration:

{
  "qos/ratelimit/router": {
    "max_rate": 1000,
    "capacity": 1000,
    "every": "1s"
  }
}
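Because the limit applies per instance, the number to put in max_rate depends on the replica count. The arithmetic is sketched below, assuming the load balancer spreads traffic evenly across replicas; both function names are hypothetical.

```python
import math


def effective_global_rate(max_rate: int, replicas: int) -> int:
    """Total requests per interval the whole deployment will admit."""
    return max_rate * replicas


def per_instance_max_rate(global_limit: int, replicas: int) -> int:
    """max_rate to configure on each replica to approximate a global limit."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    # Round down so the combined limit never exceeds the target.
    return max(1, math.floor(global_limit / replicas))


if __name__ == "__main__":
    print(effective_global_rate(1000, 3))   # 3000
    print(per_instance_max_rate(1000, 3))   # 333
```

Recalculate whenever the replica count changes, including when an autoscaler changes it for you.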

Production Monitoring Requirements

Essential Metrics (Life-Saving)

  • Response time percentiles per endpoint: P95/P99 latencies by endpoint
  • Circuit breaker state changes: Often the earliest warning, roughly 60 seconds before an incident escalates
  • Backend connection failures: Separate connection refused, timeout, DNS errors
  • Request queue depth: Growing queues indicate impending problems
  • Memory usage with goroutine count: Detect memory leaks early

Alert Thresholds (Tested in Production)

  • Error rate > 1% for 2+ minutes (not just spikes)
  • P95 response time > 2x baseline for 5 minutes
  • Circuit breaker open: Immediate alert (any duration)
  • Memory usage > 80% for 10 minutes
  • Backend connection failures > 10/minute: Per backend
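Every threshold above is a sustained condition, not a single sample. The sketch below shows that logic in isolation, assuming metrics are sampled at a fixed interval; sustained_breach is a hypothetical helper. Most alerting systems have a built-in equivalent (Prometheus alerting rules have a for clause), so treat this as an illustration of the rule rather than something to deploy.

```python
def sustained_breach(samples: list[float], threshold: float, window: int) -> bool:
    """True if the most recent `window` samples all exceed `threshold`."""
    if window < 1:
        raise ValueError("window must be at least 1")
    if len(samples) < window:
        return False  # Not enough history yet to call it sustained.
    return all(value > threshold for value in samples[-window:])


if __name__ == "__main__":
    # Error-rate samples every 30 s, so 2 minutes is 4 samples.
    spike = [0.002, 0.05, 0.004, 0.003]
    outage = [0.02, 0.03, 0.02, 0.015]
    print(sustained_breach(spike, threshold=0.01, window=4))   # False
    print(sustained_breach(outage, threshold=0.01, window=4))  # True
```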

Logging Configuration

{
  "extra_config": {
    "telemetry/logging": {
      "level": "INFO",
      "prefix": "[KRAKEND]",
      "syslog": false,
      "stdout": true,
      "format": "logstash"
    }
  }
}

Critical Log Patterns:

  • connection refused: Backend connectivity
  • context deadline exceeded: Timeout problems
  • jwt validation failed: Authentication issues
  • circuit breaker is open: Backend failures
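During an incident, counting these patterns shows which failure mode dominates. This is a minimal sketch: the patterns and categories come from the list above, the sample log lines are invented for illustration, and classify_log_lines is a hypothetical name.

```python
from collections import Counter

# Substring -> failure category, taken from the list above.
LOG_PATTERNS = {
    "connection refused": "backend connectivity",
    "context deadline exceeded": "timeout",
    "jwt validation failed": "authentication",
    "circuit breaker is open": "backend failures",
}


def classify_log_lines(lines: list[str]) -> Counter:
    """Count log lines per failure category (case-insensitive match)."""
    counts = Counter()
    for line in lines:
        lowered = line.lower()
        for pattern, category in LOG_PATTERNS.items():
            if pattern in lowered:
                counts[category] += 1
                break  # Count each line once.
    return counts


if __name__ == "__main__":
    # Hypothetical log lines, not real KrakenD output.
    sample = [
        "ERROR dial tcp 10.0.0.5:8080: connect: connection refused",
        "ERROR context deadline exceeded",
        "INFO request served",
    ]
    for category, count in classify_log_lines(sample).most_common():
        print(category, count)
```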

Performance Troubleshooting Workflow

Debugging Sequence (80% Success Rate)

  1. Check backend health first: 80% of KrakenD issues are backend problems
  2. Review traffic patterns: Correlate performance degradation with traffic spikes
  3. Examine resource utilization: Scale horizontally before config tuning
  4. Validate recent config changes: Most production issues come from recent deployments
  5. Analyze request flows: Use distributed tracing for expensive operations

Capacity Planning Reality

Memory Scaling Formula: 100 MB base + (endpoints × concurrent requests per endpoint × 1 MB)
Example: 100 MB + (10 endpoints × 100 concurrent requests × 1 MB) = 1,100 MB, about 1.1 GB under load
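The formula is easy to turn into a sizing check. The sketch below uses the rule-of-thumb constants from the formula above (100 MB base, 1 MB per concurrent request); treat them as starting estimates to calibrate against your own measurements. The function name is hypothetical.

```python
BASE_MB = 100        # Rule-of-thumb baseline from the formula above.
PER_REQUEST_MB = 1   # Rule-of-thumb cost per in-flight request.


def estimate_memory_mb(endpoints: int, concurrent_per_endpoint: int) -> int:
    """Estimated memory under load, in MB."""
    return BASE_MB + endpoints * concurrent_per_endpoint * PER_REQUEST_MB


if __name__ == "__main__":
    needed = estimate_memory_mb(endpoints=10, concurrent_per_endpoint=100)
    limit_mb = 1024  # A 1Gi container limit.
    print(needed)             # 1100
    print(needed > limit_mb)  # True
```

At these numbers the estimate already exceeds a 1Gi limit, so either raise the limit or reduce the concurrency each pod accepts.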

CPU vs Network: KrakenD stays CPU-efficient until network I/O becomes the limit. Expect to add replicas for bandwidth before you need more CPU cores.

Emergency Response (Quick Fixes)

Exit Code 137 (OOM Kill)

Immediate Action: Set an explicit memory limit sized for peak load in the Kubernetes deployment, or raise it if one is already set
Root Cause: Unlimited concurrent requests exhausting memory

502 Errors with Running Backends

Check List:

  1. Backend URL spelling (missing http:// prefix most common)
  2. Network connectivity from KrakenD pods
  3. Port conflicts (service vs container ports)

Config Changes Not Taking Effect

Debug Steps:

  1. Verify ConfigMap mounting: kubectl exec -it krakend-pod -- cat /etc/krakend/krakend.json
  2. Validate JSON syntax: krakend check --config krakend.json
  3. Confirm pod restart after config changes

JWT Random Failures

Immediate Fixes:

  • Check NTP synchronization across nodes
  • Reduce JWK cache TTL to 5 minutes
  • Verify auth service JWK endpoint accessibility
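To tell clock drift apart from a genuinely bad token, inspect the time claims of a rejected token against the clock of the node that rejected it. The sketch below decodes the payload without verifying the signature, so it is for diagnostics only and never for authorization. The helper names and the leeway parameter are hypothetical.

```python
import base64
import json
import time


def decode_jwt_payload(token: str) -> dict:
    """Decode the payload segment WITHOUT verifying the signature."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("expected header.payload.signature")
    payload = parts[1]
    payload += "=" * (-len(payload) % 4)  # Restore stripped base64 padding.
    return json.loads(base64.urlsafe_b64decode(payload))


def time_claim_problems(claims: dict, now: float, leeway: float = 0.0) -> list[str]:
    """List time-related reasons a validator at time `now` would reject."""
    problems = []
    if "exp" in claims and now > claims["exp"] + leeway:
        problems.append("expired (exp is in the past)")
    if "nbf" in claims and now < claims["nbf"] - leeway:
        problems.append("not yet valid (nbf is in the future)")
    if "iat" in claims and now < claims["iat"] - leeway:
        problems.append("issued in the future (iat), likely clock drift")
    return problems


if __name__ == "__main__":
    import sys
    if len(sys.argv) > 1:
        claims = decode_jwt_payload(sys.argv[1])
        print(time_claim_problems(claims, now=time.time()) or "time claims look fine")
```

An iat or nbf in the future on a freshly issued token means the validating node's clock is behind the issuer's.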

Rate Limiting Blocking Legitimate Traffic

Solution: Start with generous limits, tune down based on metrics
Remember: Limits are per-instance - calculate total capacity across replicas

Advanced Production Problems

Memory Consumption During Traffic Spikes

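Cause: Unbounded in-flight requests + slow backends = request pileup
Solution: Bound the work each endpoint accepts with a rate limit and a short timeout, so slow backends shed load instead of accumulating it. Leave concurrent_calls at its default of 1: it does not cap concurrency. It sends that many parallel copies of each request to the backend and keeps the first successful response, so raising it multiplies backend load. The 2s timeout below is an illustrative value.

{
  "endpoints": [{
    "timeout": "2s",
    "extra_config": {
      "qos/ratelimit/router": {
        "max_rate": 100,
        "capacity": 200
      }
    }
  }]
}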

Zero-Downtime Config Updates

Strategy:

  1. Use flexible configuration with environment variables
  2. Deploy config changes separately from image updates
  3. Force pod restart on config changes with annotations
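Step 3 is usually done by putting a hash of the config into a pod template annotation, so any config change alters the template and triggers a rolling restart. A minimal sketch follows. The annotation key checksum/config is a common Helm convention rather than anything KrakenD-specific, and the function names are hypothetical.

```python
import hashlib
import json


def config_checksum(config_text: str) -> str:
    """SHA-256 of the rendered config, as a hex string."""
    return hashlib.sha256(config_text.encode("utf-8")).hexdigest()


def restart_annotation_patch(config_text: str) -> dict:
    """Deployment patch that changes whenever the config changes."""
    return {
        "spec": {
            "template": {
                "metadata": {
                    "annotations": {"checksum/config": config_checksum(config_text)}
                }
            }
        }
    }


if __name__ == "__main__":
    with open("krakend.json", encoding="utf-8") as handle:
        print(json.dumps(restart_annotation_patch(handle.read())))
```

The printed JSON can be passed to kubectl patch deployment with the -p flag. An unchanged config yields the same checksum, so no restart happens.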

Circuit Breaker Tuning

Production Settings:

{
  "qos/circuit-breaker": {
    "interval": 60,
    "max_errors": 20,
    "timeout": 10
  }
}

Guidelines: Start with 20-30 errors per interval rather than a low value such as 5. Both interval and timeout are expressed in seconds.

Network Connectivity Debug

kubectl exec -it krakend-pod -- nslookup backend-service
kubectl exec -it krakend-pod -- curl -v backend-service:8080/health
kubectl exec -it krakend-pod -- netstat -rn

Resource Requirements

Time Investments

  • Basic setup: 2-4 hours for production-ready configuration
  • Monitoring setup: 4-8 hours for comprehensive observability
  • Troubleshooting major issues: 2-6 hours without proper monitoring, 30 minutes with good observability

Expertise Requirements

  • Kubernetes networking knowledge: Essential for service discovery issues
  • JWT/OAuth understanding: Required for authentication troubleshooting
  • Observability tools: Prometheus, Grafana, distributed tracing experience

Breaking Points

  • 1000+ concurrent requests per endpoint: UI becomes unusable for debugging
  • Traffic spikes without resource limits: Guaranteed OOM kills
  • Default circuit breaker settings: Too aggressive for real-world traffic variance
  • Missing or misconfigured health checks: Kubernetes keeps routing to broken pods or restarts healthy ones

Community Resources

Response Time Expectations

  • GitHub Issues: 1-3 days for community response
  • Slack Community: Real-time during business hours
  • Stack Overflow: 6-24 hours for common problems

Quality Indicators

  • Documentation: Good for basics, limited for production edge cases
  • Community Support: Active, experienced users share real-world solutions
  • Enterprise Support: Available for mission-critical deployments

Critical Warnings

What Documentation Doesn't Tell You

  • Default settings will fail in production traffic
  • Error messages are often misleading or generic
  • Resource requirements scale non-linearly with traffic
  • Clock synchronization is critical for JWT validation
  • Per-instance rate limiting catches everyone off guard

Configuration Gotchas

  • Missing http:// prefix breaks backend connectivity
  • Trailing slashes create different endpoints
  • Circuit breaker defaults are too aggressive
  • JWT validation requires perfect timing synchronization
  • Health checks only test process status, not functionality

Useful Links for Further Investigation

  • KrakenD Check Command: Validate your configuration before deploying. Should be part of your CI/CD pipeline to catch config errors before they hit production.
  • KrakenD Debug Endpoint: Built-in endpoint that shows internal request/response flow. Essential for debugging routing issues and backend connectivity problems.
  • Grafana Dashboard for KrakenD: Pre-configured dashboard with all the metrics you need to debug performance issues. Import this before you have problems, not during them.
  • KrakenD Health Check Configuration: Configure proper health checks for Kubernetes liveness and readiness probes. Basic health checks miss most production issues.
  • OpenTelemetry Integration: Complete observability setup with traces, metrics, and logs. Required for debugging complex request flows through multiple services.
  • Prometheus Metrics Export: Expose KrakenD metrics for Prometheus scraping. Essential for alerting and monitoring production deployments.
  • Circuit Breaker Configuration: Protect your backends from cascading failures. Proper circuit breaker configuration prevents minor issues from becoming major outages.
  • Distributed Tracing with Jaeger: Track request flows across services to identify bottlenecks and failures. Critical for debugging performance issues in microservices environments.
  • Kubernetes Deployment Guide: Production-ready Kubernetes configurations with proper resource limits, health checks, and scaling parameters.
  • Flexible Configuration System: Template-based configuration management for multiple environments. Reduces configuration errors and simplifies deployments.
  • Docker Best Practices: Container optimization and security hardening for production KrakenD deployments. Includes resource optimization and security configurations.
  • Configuration Audit Tool: Automated security and performance analysis of your KrakenD configuration. Run this regularly to catch misconfigurations before they cause issues.
  • Load Balancing Strategies: Configure backend load balancing for optimal performance and failover. Essential for high-availability production deployments.
  • Connection Pooling and Timeouts: HTTP client optimization for backend connections. Poor connection pooling configuration causes most performance issues.
  • Rate Limiting Best Practices: Production-tested rate limiting configurations that protect your APIs without blocking legitimate traffic.
  • KrakenD GitHub Issues: Search existing issues before creating new ones. Most production problems have been encountered and solved by others.
  • KrakenD Community Forum: Active community support for troubleshooting and configuration questions. Response times are usually better than most paid support.
  • KrakenD Slack Community: Real-time chat support with KrakenD users and developers. Fastest way to get help during production incidents.
  • Stack Overflow KrakenD Tag: Searchable Q&A for common KrakenD problems. Good for finding solutions to specific error messages and configuration issues.
