KrakenD Production Troubleshooting Guide
Critical Failure Modes and Solutions
Memory Leaks and OOM Kills
Symptoms: Random container restarts, exit code 137, Kubernetes pod kills
Root Cause: Unlimited goroutine spawning during traffic spikes without proper resource limits
Severity: CRITICAL - Can cause complete service outage
Frequency: Common during traffic spikes
Production-Ready Fix:
resources:
  limits:
    memory: "1Gi"
    cpu: "1000m"
  requests:
    memory: "512Mi"
    cpu: "100m"
Critical Configuration:
- Set explicit memory limits in Kubernetes deployment
- Configure concurrent request limits per endpoint
- Monitor goroutine count - restart pods when count grows excessively
Operational Intelligence: Teams lose entire weekends to this when resource limits are missing. During peak traffic, memory growth can escalate to an OOM kill within about 60 seconds.
Configuration Validation Failures
Symptoms: Endpoints return 404s, config changes don't take effect, mysterious routing issues
Root Cause: Configuration that parses as valid JSON but is semantically wrong, so startup succeeds while routing breaks
Severity: HIGH - Causes service availability issues
Time Investment: Can waste hours debugging 30-second fixes
Validation Strategy:
krakend check --config krakend.json
Common Gotchas:
- Missing http:// in backend URLs
- Trailing slashes in endpoint paths (/api/users/ vs /api/users)
- Confusing url_pattern (the path requested on the backend) with endpoint (the path KrakenD exposes)
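These gotchas can be caught mechanically before deployment. A minimal sketch in Python, assuming the usual layout where each entry in endpoints has an endpoint path and a backend list with host and url_pattern. The function name find_config_gotchas is hypothetical and not part of KrakenD, and it complements krakend check rather than replacing it:

```python
import json


def find_config_gotchas(config: dict) -> list:
    """Return human-readable warnings for common KrakenD config mistakes."""
    warnings = []
    for ep in config.get("endpoints", []):
        path = ep.get("endpoint", "")
        # A trailing slash makes /api/users/ a different route than /api/users.
        if len(path) > 1 and path.endswith("/"):
            warnings.append(f"{path}: endpoint path has a trailing slash")
        for backend in ep.get("backend", []):
            for host in backend.get("host", []):
                # Assumption: hosts are plain URLs (no service discovery scheme).
                if not host.startswith(("http://", "https://")):
                    warnings.append(f"{path}: backend host '{host}' has no scheme")
            if not backend.get("url_pattern", "").startswith("/"):
                warnings.append(f"{path}: url_pattern should start with '/'")
    return warnings


# Hypothetical config fragment that contains two of the gotchas.
sample = json.loads("""
{
  "version": 3,
  "endpoints": [
    {
      "endpoint": "/api/users/",
      "backend": [{"host": ["backend-service:8080"], "url_pattern": "/users"}]
    }
  ]
}
""")

for warning in find_config_gotchas(sample):
    print(warning)
```

Running a check like this in CI next to krakend check keeps the 30-second fixes from turning into hours of debugging.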
Operational Intelligence: KrakenD error messages for config issues are inadequate. Always validate before deployment.
Backend Service Discovery Failures
Symptoms: 502 errors, "connection refused", connectivity failures despite running services
Root Cause: DNS resolution issues in Kubernetes environments
Severity: HIGH - Breaks service communication
Debugging Commands:
kubectl exec -it krakend-pod -- nslookup backend-service
kubectl exec -it krakend-pod -- curl backend-service:8080/health
Kubernetes-Specific Solutions:
- Use full service names:
backend-service.namespace.svc.cluster.local:8080
- Verify service selector labels match backend pods
- Check port number conflicts (service port vs container port)
JWT Validation Failures
Symptoms: 401 errors for valid tokens, sporadic authentication failures
Root Cause: Clock drift, JWK endpoint unreachability, algorithm mismatches
Severity: HIGH - Breaks user authentication
Temporary Fix (for debugging only):
{
  "auth/validator": {
    "alg": "RS256",
    "jwk_url": "https://your-auth.com/.well-known/jwks.json",
    "cache": true,
    "cache_duration": 900,
    "disable_jwk_security": true
  }
}
Warning: disable_jwk_security is for debugging only - never leave it in production
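Clock drift is the least obvious of the root causes above, so here is a small illustration of why it produces sporadic 401s. This is a sketch in Python of a generic time-claim check, not KrakenD's validator; the function name and the leeway parameter are assumptions made for the example:

```python
def check_token_times(claims: dict, now: float, leeway: float = 0.0) -> str:
    """Classify a token's time claims as 'ok', 'expired', or 'not_yet_valid'."""
    exp = claims.get("exp")
    nbf = claims.get("nbf")
    # Expired once the validator's clock passes exp (plus tolerated skew).
    if exp is not None and now > exp + leeway:
        return "expired"
    # Not yet valid while the validator's clock is still before nbf.
    if nbf is not None and now < nbf - leeway:
        return "not_yet_valid"
    return "ok"


issued_at = 1_700_000_000  # arbitrary Unix timestamp for the example
claims = {"nbf": issued_at, "exp": issued_at + 300}

# A node whose clock runs 20 seconds behind rejects a freshly issued token.
print(check_token_times(claims, now=issued_at - 20))             # not_yet_valid
# The same token passes when 30 seconds of skew are tolerated.
print(check_token_times(claims, now=issued_at - 20, leeway=30))  # ok
```

Only the nodes with drifting clocks reject the token, which is why the failures look random from the client side.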
Rate Limiting Issues
Symptoms: Legitimate traffic getting 429 errors during normal load
Root Cause: Per-instance rate limiting with aggressive defaults
Operational Intelligence: Rate limits are per-instance, not global - 3 replicas = 3x configured limit
Working Configuration:
{
  "qos/ratelimit/router": {
    "max_rate": 1000,
    "capacity": 1000,
    "every": "1s"
  }
}
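The per-instance arithmetic is easy to get wrong when replica counts change. A minimal sketch in Python, assuming the load balancer spreads traffic evenly across replicas; both function names are hypothetical:

```python
def aggregate_limit(per_instance_rate: int, replicas: int) -> int:
    """Total requests per interval the whole deployment will admit."""
    return per_instance_rate * replicas


def per_instance_rate_for(target_total: int, replicas: int) -> int:
    """max_rate to configure so the deployment-wide total stays at or below the target."""
    # Floor division keeps the aggregate from exceeding the target.
    return max(1, target_total // replicas)


print(aggregate_limit(1000, 3))        # 3000
print(per_instance_rate_for(1000, 3))  # 333
```

If traffic is not evenly balanced, a single busy replica can return 429s while the deployment as a whole is well under its aggregate limit.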
Production Monitoring Requirements
Essential Metrics (Life-Saving)
- Response time percentiles per endpoint: P95/P99 latencies by endpoint
- Circuit breaker state changes: Typically give about a 60-second warning window before an incident escalates
- Backend connection failures: Separate connection refused, timeout, DNS errors
- Request queue depth: Growing queues indicate impending problems
- Memory usage with goroutine count: Detect memory leaks early
Alert Thresholds (Tested in Production)
- Error rate > 1% for 2+ minutes (not just spikes)
- P95 response time > 2x baseline for 5 minutes
- Circuit breaker open: Immediate alert (any duration)
- Memory usage > 80% for 10 minutes
- Backend connection failures > 10/minute: Per backend
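The "for N minutes" part of these thresholds is what separates a real alert from a noisy one. In practice this lives in the alerting system, but the logic can be sketched in Python, assuming one metric sample per minute; the function name is hypothetical:

```python
def sustained_breach(samples: list, threshold: float, minutes: int) -> bool:
    """True if the most recent `minutes` one-minute samples all exceed the threshold."""
    if len(samples) < minutes:
        return False
    return all(value > threshold for value in samples[-minutes:])


# Error rate in percent, one sample per minute (made-up numbers).
sustained = [0.2, 3.5, 0.4, 1.2, 1.8]
spike_only = [0.2, 0.3, 4.0, 0.2, 0.1]

print(sustained_breach(sustained, threshold=1.0, minutes=2))   # True
print(sustained_breach(spike_only, threshold=1.0, minutes=2))  # False
```

The second series contains a larger peak than the first but never stays above 1% for two consecutive minutes, so it does not page anyone.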
Logging Configuration
{
  "extra_config": {
    "telemetry/logging": {
      "level": "INFO",
      "prefix": "[KRAKEND]",
      "syslog": false,
      "stdout": true,
      "format": "json"
    }
  }
}
Critical Log Patterns:
- connection refused: Backend connectivity
- context deadline exceeded: Timeout problems
- jwt validation failed: Authentication issues
- circuit breaker is open: Backend failures
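During an incident it helps to tally these patterns instead of reading raw logs. A minimal sketch in Python; the category labels and the sample log lines are made up for illustration, and real KrakenD log lines will carry more context around each pattern:

```python
from collections import Counter
from typing import Optional

# Substrings to look for, mapped to the problem category they indicate.
LOG_PATTERNS = {
    "connection refused": "backend connectivity",
    "context deadline exceeded": "timeout",
    "jwt validation failed": "authentication",
    "circuit breaker is open": "backend failure",
}


def classify(line: str) -> Optional[str]:
    """Return the category for a log line, or None if no pattern matches."""
    lowered = line.lower()
    for pattern, category in LOG_PATTERNS.items():
        if pattern in lowered:
            return category
    return None


def tally(lines: list) -> Counter:
    """Count how many lines fall into each category."""
    counts = Counter()
    for line in lines:
        category = classify(line)
        if category:
            counts[category] += 1
    return counts


sample_lines = [
    "dial tcp 10.0.0.12:8080: connect: connection refused",
    "context deadline exceeded",
    "GET /api/users 200",
    "dial tcp 10.0.0.13:8080: connect: connection refused",
]
print(tally(sample_lines).most_common())
# [('backend connectivity', 2), ('timeout', 1)]
```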
Performance Troubleshooting Workflow
Debugging Sequence (80% Success Rate)
- Check backend health first: 80% of KrakenD issues are backend problems
- Review traffic patterns: Correlate performance degradation with traffic spikes
- Examine resource utilization: Scale horizontally before config tuning
- Validate recent config changes: Most production issues come from recent deployments
- Analyze request flows: Use distributed tracing for expensive operations
Capacity Planning Reality
Memory Scaling Formula: 100MB base + (concurrent_requests × 1MB) per endpoint
Example: 10 endpoints × 100 concurrent requests = 1GB+ memory usage under load
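The formula as a quick calculator, sketched in Python. The 100MB base and 1MB per request are the guide's rule-of-thumb figures, not measured constants, and the function name is hypothetical:

```python
def estimate_memory_mb(endpoints: int, concurrent_per_endpoint: int,
                       base_mb: int = 100, per_request_mb: int = 1) -> int:
    """Rule-of-thumb memory estimate: base + 1MB per in-flight request per endpoint."""
    return base_mb + endpoints * concurrent_per_endpoint * per_request_mb


print(estimate_memory_mb(10, 100))  # 1100
print(estimate_memory_mb(10, 40))   # 500
```

The first case lands at roughly 1.1GB, which is already above a 1Gi memory limit, so either the limit or the admitted concurrency has to change.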
CPU vs Network: KrakenD stays CPU-efficient until network I/O becomes the bottleneck, so add replicas for bandwidth before adding CPU cores.
Emergency Response (Quick Fixes)
Exit Code 137 (OOM Kill)
Immediate Action: Set memory limits in Kubernetes deployment
Root Cause: Unlimited concurrent requests exhausting memory
502 Errors with Running Backends
Check List:
- Backend URL spelling (missing http:// prefix most common)
- Network connectivity from KrakenD pods
- Port conflicts (service vs container ports)
Config Changes Not Taking Effect
Debug Steps:
- Verify ConfigMap mounting:
kubectl exec -it krakend-pod -- cat /etc/krakend/krakend.json
- Validate JSON syntax:
krakend check --config krakend.json
- Confirm pod restart after config changes
JWT Random Failures
Immediate Fixes:
- Check NTP synchronization across nodes
- Reduce JWK cache TTL to 5 minutes
- Verify auth service JWK endpoint accessibility
Rate Limiting Blocking Legitimate Traffic
Solution: Start with generous limits, tune down based on metrics
Remember: Limits are per-instance - calculate total capacity across replicas
Advanced Production Problems
Memory Consumption During Traffic Spikes
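Cause: No cap on in-flight requests + slow backends = request pileup
Solution: Cap per-endpoint throughput with a rate limit so excess requests are rejected early instead of piling up
Note: concurrent_calls is not a concurrency cap. It sets how many parallel copies of each request KrakenD sends to the backends, so a value of 10 multiplies backend load. Keep it at 1 unless you want that behavior.
{
  "endpoints": [{
    "concurrent_calls": 1,
    "extra_config": {
      "qos/ratelimit/router": {
        "max_rate": 100,
        "capacity": 200
      }
    }
  }]
}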
Zero-Downtime Config Updates
Strategy:
- Use flexible configuration with environment variables
- Deploy config changes separately from image updates
- Force pod restart on config changes with annotations
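One common way to force the restart is to put a checksum of the config into a pod template annotation, so any config change also changes the Deployment spec and triggers a rollout. A minimal sketch in Python; the annotation key checksum/config is a convention rather than anything KrakenD or Kubernetes requires, and both function names are hypothetical:

```python
import hashlib


def config_checksum(config_text: str) -> str:
    """SHA-256 hex digest of the rendered KrakenD config."""
    return hashlib.sha256(config_text.encode("utf-8")).hexdigest()


def pod_template_annotations(config_text: str) -> dict:
    """Annotations to merge into the Deployment's pod template metadata."""
    return {"checksum/config": config_checksum(config_text)}


old = '{"version": 3, "timeout": "3s"}'
new = '{"version": 3, "timeout": "5s"}'

# A different config yields a different annotation, hence a new rollout.
print(pod_template_annotations(old) != pod_template_annotations(new))  # True
```

Because the rollout goes through the normal Deployment strategy, pods are replaced gradually rather than all at once.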
Circuit Breaker Tuning
Production Settings:
{
  "qos/circuit-breaker": {
    "interval": 60,
    "max_errors": 20,
    "timeout": 10
  }
}
Guidelines: Start with 20-30 errors per interval, not default 5
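To see why the threshold matters, here is a simplified model of the three settings in Python. It is a sketch for building intuition, not KrakenD's implementation: it assumes errors are counted consecutively within the interval window, that the circuit stays open for timeout seconds, and that one failed trial request reopens it. The class name is hypothetical:

```python
class CircuitBreakerSketch:
    """Toy breaker: opens after max_errors consecutive failures within interval seconds."""

    def __init__(self, interval: float, max_errors: int, timeout: float):
        self.interval = interval
        self.max_errors = max_errors
        self.timeout = timeout
        self.window_start = 0.0
        self.consecutive_errors = 0
        self.opened_at = None  # None means the circuit is closed

    def allow(self, now: float) -> bool:
        """True if a request may be sent to the backend at time `now`."""
        if self.opened_at is None:
            return True
        # After `timeout` seconds, let a trial request through.
        return now - self.opened_at >= self.timeout

    def record(self, now: float, success: bool) -> None:
        """Record the outcome of a backend call made at time `now`."""
        if now - self.window_start >= self.interval:
            # New counting window: forget errors from the previous one.
            self.window_start = now
            self.consecutive_errors = 0
        if success:
            self.consecutive_errors = 0
            self.opened_at = None
            return
        self.consecutive_errors += 1
        if self.opened_at is not None or self.consecutive_errors >= self.max_errors:
            self.opened_at = now  # open, or reopen after a failed trial


strict = CircuitBreakerSketch(interval=60, max_errors=5, timeout=10)
tuned = CircuitBreakerSketch(interval=60, max_errors=20, timeout=10)

# A short burst of 6 backend errors, one per second.
for second in range(6):
    strict.record(second, success=False)
    tuned.record(second, success=False)

print(strict.allow(6))  # False: the burst tripped the 5-error breaker
print(tuned.allow(6))   # True: the 20-error breaker rides it out
```

A brief error burst that a healthy backend recovers from on its own is enough to cut off traffic with the low threshold, which is the behavior the guideline is trying to avoid.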
Network Connectivity Debug
kubectl exec -it krakend-pod -- nslookup backend-service
kubectl exec -it krakend-pod -- curl -v backend-service:8080/health
kubectl exec -it krakend-pod -- netstat -rn
Resource Requirements
Time Investments
- Basic setup: 2-4 hours for production-ready configuration
- Monitoring setup: 4-8 hours for comprehensive observability
- Troubleshooting major issues: 2-6 hours without proper monitoring, 30 minutes with good observability
Expertise Requirements
- Kubernetes networking knowledge: Essential for service discovery issues
- JWT/OAuth understanding: Required for authentication troubleshooting
- Observability tools: Prometheus, Grafana, distributed tracing experience
Breaking Points
- 1000+ concurrent requests per endpoint: UI becomes unusable for debugging
- Traffic spikes without resource limits: Guaranteed OOM kills
- Default circuit breaker settings: Too aggressive for real-world traffic variance
- Missing health checks: Kubernetes restarts healthy pods
Community Resources
Response Time Expectations
- GitHub Issues: 1-3 days for community response
- Slack Community: Real-time during business hours
- Stack Overflow: 6-24 hours for common problems
Quality Indicators
- Documentation: Good for basics, limited for production edge cases
- Community Support: Active, experienced users share real-world solutions
- Enterprise Support: Available for mission-critical deployments
Critical Warnings
What Documentation Doesn't Tell You
- Default settings will fail in production traffic
- Error messages are often misleading or generic
- Resource requirements scale non-linearly with traffic
- Clock synchronization is critical for JWT validation
- Per-instance rate limiting catches everyone off guard
Configuration Gotchas
- Missing http:// prefix breaks backend connectivity
- Trailing slashes create different endpoints
- Circuit breaker defaults are too aggressive
- JWT validation requires perfect timing synchronization
- Health checks only test process status, not functionality
Useful Links for Further Investigation
Link | Description |
---|---|
KrakenD Check Command | Validate your configuration before deploying. Should be part of your CI/CD pipeline to catch config errors before they hit production. |
KrakenD Debug Endpoint | Built-in endpoint that shows internal request/response flow. Essential for debugging routing issues and backend connectivity problems. |
Grafana Dashboard for KrakenD | Pre-configured dashboard with all the metrics you need to debug performance issues. Import this before you have problems, not during them. |
KrakenD Health Check Configuration | Configure proper health checks for Kubernetes liveness and readiness probes. Basic health checks miss most production issues. |
OpenTelemetry Integration | Complete observability setup with traces, metrics, and logs. Required for debugging complex request flows through multiple services. |
Prometheus Metrics Export | Expose KrakenD metrics for Prometheus scraping. Essential for alerting and monitoring production deployments. |
Circuit Breaker Configuration | Protect your backends from cascading failures. Proper circuit breaker configuration prevents minor issues from becoming major outages. |
Distributed Tracing with Jaeger | Track request flows across services to identify bottlenecks and failures. Critical for debugging performance issues in microservices environments. |
Kubernetes Deployment Guide | Production-ready Kubernetes configurations with proper resource limits, health checks, and scaling parameters. |
Flexible Configuration System | Template-based configuration management for multiple environments. Reduces configuration errors and simplifies deployments. |
Docker Best Practices | Container optimization and security hardening for production KrakenD deployments. Includes resource optimization and security configurations. |
Configuration Audit Tool | Automated security and performance analysis of your KrakenD configuration. Run this regularly to catch misconfigurations before they cause issues. |
Load Balancing Strategies | Configure backend load balancing for optimal performance and failover. Essential for high-availability production deployments. |
Connection Pooling and Timeouts | HTTP client optimization for backend connections. Poor connection pooling configuration causes most performance issues. |
Rate Limiting Best Practices | Production-tested rate limiting configurations that protect your APIs without blocking legitimate traffic. |
KrakenD GitHub Issues | Search existing issues before creating new ones. Most production problems have been encountered and solved by others. |
KrakenD Community Forum | Active community support for troubleshooting and configuration questions. Response times are usually better than most paid support. |
KrakenD Slack Community | Real-time chat support with KrakenD users and developers. Fastest way to get help during production incidents. |
Stack Overflow KrakenD Tag | Searchable Q&A for common KrakenD problems. Good for finding solutions to specific error messages and configuration issues. |