Kubernetes Service Accessibility Troubleshooting Guide
Critical Service Failure Modes
1. Selector Mismatch (90% of Service Failures)
Symptoms: Service exists, shows healthy in kubectl get service, but returns 503 errors
Root Cause: Service selector doesn't match pod labels
Critical Impact: Complete service unavailability while appearing healthy in monitoring
Detection: kubectl get endpoints SERVICE-NAME shows <none>
Real-World Consequences:
- Production incident: Payment API 47-minute outage during Black Friday
- Financial impact: $180K in lost transactions, 2.3 hours downtime
- Cause: Label change from app: payment-api-v1 to app: payment-api-v2 without updating the service selector
Emergency Fix:
kubectl patch service my-service -p '{"spec":{"selector":{"app":"correct-label"}}}'
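After patching, verify that the selector now matches the pod labels and that endpoints repopulate; a minimal sketch, assuming the service is named my-service and the corrected label is app=correct-label:
# Confirm the selector the service is now using
kubectl get service my-service -o jsonpath='{.spec.selector}'
# Confirm pods actually carry that label
kubectl get pods --show-labels | grep correct-label
# Endpoints should repopulate within seconds once the selector matches
kubectl get endpoints my-service -w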
2. Port Configuration Hell
Symptoms: Endpoints exist but connections refused/timeout
Root Cause: Misalignment between containerPort, service port, and targetPort
Critical Impact: Service layer completely broken despite healthy pods
Configuration Error Pattern:
- Application listens on port 8080
- Service targetPort configured as 3000
- Port-forward works (bypasses service layer), masking the issue
Production Failure Example:
- React frontend on AKS: 2.5 hours checkout failures
- Damage: $240K abandoned carts
- Root cause: Next.js listening on 8080, service targeting 3000
Emergency Fix:
kubectl patch service my-service -p '{"spec":{"ports":[{"port":80,"targetPort":8080}]}}'
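Before reaching for the patch, confirm which port the application actually listens on; a rough sketch, assuming the pod is named my-pod and its image ships ss (netshoot does, many slim app images do not):
# What the container is actually listening on
kubectl exec my-pod -- ss -tlnp
# What the service currently maps port -> targetPort to
kubectl get service my-service -o jsonpath='{range .spec.ports[*]}{.port}{" -> "}{.targetPort}{"\n"}{end}'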
3. Network Policy Lockdown
Symptoms: Services work initially, then fail after security policies applied
Root Cause: Once any network policy selects a pod, all traffic in the covered direction that isn't explicitly allowed is denied
Critical Impact: Complete platform shutdown within minutes
Catastrophic Incident:
- Production EKS cluster: 6.5 hours to restore all services
- Financial damage: $2.8M lost revenue, 847 abandoned carts
- Cause: Default-deny network policy applied without allow rules
Policy Pattern That Breaks Everything:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
spec:
  podSelector: {}  # Affects ALL pods
  policyTypes:
  - Ingress
  - Egress
  # No allow rules = everything blocked
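When a freshly applied policy is the prime suspect, the fastest mitigation is usually to remove it while proper allow rules are written; a sketch, assuming the policy above was applied to the production namespace:
# See which policies exist and what they select
kubectl get networkpolicy -n production
kubectl describe networkpolicy default-deny-all -n production
# Emergency rollback: provided no other policy selects these pods, deleting it restores allow-all
kubectl delete networkpolicy default-deny-all -n production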
4. DNS Resolution Chaos
Symptoms: Intermittent service failures, connection works sometimes
Root Cause: CoreDNS pod failures, DNS query limits, cache corruption
Critical Impact: Unpredictable service availability
Failure Thresholds:
- CoreDNS crashes at 5000+ QPS without horizontal autoscaling
- Node-local DNS cache corruption in Kubernetes 1.33+
- DNS query timeout spikes during high pod churn
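To separate CoreDNS health problems from application failures, check the DNS pods first and then resolve a service name from inside the cluster; a sketch, noting that most clusters label CoreDNS pods k8s-app=kube-dns:
# Are the CoreDNS pods healthy, and are they logging errors?
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50
# Does resolution work from inside the cluster?
kubectl run dns-test --image=nicolaka/netshoot --rm -it --restart=Never -- \
  nslookup my-service.my-namespace.svc.cluster.local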
5. Readiness Probe Deception
Symptoms: Pods show "Running" but aren't receiving traffic
Root Cause: Readiness probes fail, pods removed from service endpoints
Critical Impact: Healthy pods sit idle while users get 503 errors
Production Example:
- PostgreSQL migration with 30-second readiness probe timeout
- Health check queries hang on table locks during migration
- Result: Pods marked unready, removed from load balancer rotation
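To confirm a readiness problem rather than guess, inspect the probe definition and look for probe failure events; a sketch, assuming a deployment named my-app with pods labeled app=my-app:
# Inspect the configured readiness probe
kubectl get deployment my-app -o jsonpath='{.spec.template.spec.containers[0].readinessProbe}'
# Look for failing probes and the Ready condition
kubectl describe pods -l app=my-app | grep -E "Readiness|Unhealthy"
kubectl get events --field-selector reason=Unhealthy --sort-by=.lastTimestamp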
Service Debugging Time Requirements
Cloud Provider Load Balancer Provisioning (2025 Standards)
- AWS ALB: 3-5 minutes
- GCP GLB: 2-4 minutes
- Azure Load Balancer: 5-8 minutes
Kubernetes Component Response Times
- Network Policy Changes: Immediate effect, CNI propagation 5-15 seconds
- DNS Propagation: 30-60 seconds for CoreDNS updates
- Ingress Controller Updates: NGINX (30-60s), Traefik (10-30s), Gateway API (60-120s)
- Pod Startup: 2-5 minutes for application initialization (longer for JVM apps)
When NOT to Wait (Immediate Investigation Required)
- Connection refused errors
- Service selector mismatches
- Missing endpoints
- HTTP 5xx errors from ingress
- Pod CrashLoopBackOff status
Systematic Debugging Workflow
Phase 1: Quick Triage (5 Minutes Maximum)
# 1. Verify service exists and has endpoints
kubectl get service my-service -o wide
kubectl get endpointslices -l kubernetes.io/service-name=my-service -o wide
# 2. Check pod readiness (not just "Running")
kubectl get pods -l app=my-app -o wide
kubectl describe pods -l app=my-app | grep -A 10 -B 2 "Conditions:"
# 3. Test internal connectivity
kubectl run debug-pod --image=nicolaka/netshoot --rm -it --restart=Never -- bash
# Inside debug pod:
nslookup my-service.my-namespace.svc.cluster.local
curl -v my-service:80 --connect-timeout 5
# 4. Check for blocking network policies
kubectl get networkpolicy --all-namespaces -o wide
Phase 2: Systematic Investigation
Service Configuration Validation:
# Verify selector matches pod labels
kubectl get service my-service -o yaml | grep -A 5 selector
kubectl get pods --show-labels | grep my-app
# Check port alignment
kubectl get service my-service -o jsonpath='{.spec.ports[*]}'
kubectl exec -it my-pod -- netstat -tlnp
EndpointSlice Analysis (Kubernetes 1.21+ Required):
# Modern endpoint debugging
kubectl get endpointslices -l kubernetes.io/service-name=my-service -o yaml
kubectl get endpointslices -l kubernetes.io/service-name=my-service -o jsonpath='{range .items[*].endpoints[*]}{.addresses[*]}{" - Ready: "}{.conditions.ready}{"\n"}{end}'
Direct Pod Testing:
# Test bypassing service layer
kubectl get pods -l app=my-app -o wide
kubectl run debug-pod --image=nicolaka/netshoot --rm -it -- curl POD-IP:8080/health
Critical Error Patterns
Connection Refused vs Connection Timeout
- Connection Refused: Port closed, wrong port config, app binding to localhost
- Connection Timeout: Network policies, firewall rules, CNI issues
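The two can be distinguished in one pass: a refused connection fails instantly, while a policy or CNI block hangs until the timeout. A sketch, where POD-IP is a placeholder for an address from kubectl get pods -o wide:
# Instant failure = refused (wrong port / localhost binding); hang until timeout = likely policy/CNI
kubectl run conn-test --image=nicolaka/netshoot --rm -it --restart=Never -- \
  curl -v --connect-timeout 5 POD-IP:8080
# If refused, check whether the app binds to 0.0.0.0 or only 127.0.0.1 inside the container
kubectl exec -it my-pod -- ss -tlnp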
Intermittent Failures
Debugging Pattern:
# Test connectivity over time
for i in {1..20}; do
  # -i attaches to the pod so the curl output is shown and --rm can clean up afterwards
  kubectl run connectivity-test-$i --image=nicolaka/netshoot -i --rm --restart=Never \
    -- timeout 10 curl -s -w "Response: %{http_code}, Time: %{time_total}s\n" \
    my-service.my-namespace.svc.cluster.local:80 || echo "Attempt $i: Connection failed"
  sleep 2
done
Port-Forward Works But Ingress Fails
Root Cause: Port-forward bypasses ingress controller, load balancer, and TLS termination
Debug Sequence:
# Test service layer directly
kubectl run debug-pod --image=nicolaka/netshoot --rm -it -- curl my-service.my-namespace.svc.cluster.local:80
# Test ingress controller directly
kubectl port-forward -n ingress-nginx service/ingress-nginx-controller 8080:80
curl -H "Host: my-domain.com" localhost:8080/my-path
# Check ingress logs
kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx --tail=100 | grep "my-domain.com"
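It is also worth confirming that the Ingress resource itself points at the right service and port; a quick sketch, assuming the resource is named my-ingress:
# Check host/path rules and the backend service:port the ingress routes to
kubectl describe ingress my-ingress
kubectl get ingress my-ingress -o jsonpath='{range .spec.rules[*]}{.host}{"\n"}{end}'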
Emergency Fix Commands
Selector Mismatch
kubectl patch service my-service -p '{"spec":{"selector":{"app":"correct-label"}}}'
Port Configuration
kubectl patch service my-service -p '{"spec":{"ports":[{"port":80,"targetPort":8080}]}}'
Disable Readiness Probe (Temporary)
kubectl patch deployment my-app -p '{"spec":{"template":{"spec":{"containers":[{"name":"my-container","readinessProbe":null}]}}}}'
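This removes the probe entirely, so traffic reaches pods whether they are ready or not; once the underlying issue is fixed, restore the previous spec. A sketch, assuming the prior revision still contained the probe:
# Roll back to the revision that still had the readiness probe
kubectl rollout undo deployment my-app
kubectl rollout status deployment my-app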
Network Policy Allow Rule
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-my-service
spec:
  podSelector:
    matchLabels:
      app: my-app
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: client-app
    ports:
    - protocol: TCP
      port: 8080
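Apply the policy and retest from a pod that matches the allowed client selector; a sketch, assuming the manifest is saved as allow-my-service.yaml (if a default-deny egress policy is also in place, the test pod additionally needs egress and DNS allow rules):
kubectl apply -f allow-my-service.yaml
# Retest from a pod labeled like the permitted client; adjust the port to the service's published port
kubectl run policy-test --image=nicolaka/netshoot --rm -it --restart=Never \
  --labels="app=client-app" -- curl -v --connect-timeout 5 my-service:8080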
Resource Requirements and Expertise Levels
Time Investment for Resolution
- Selector Mismatch: 5-15 minutes (junior engineer)
- Port Configuration: 10-30 minutes (requires container knowledge)
- Network Policy Issues: 30-120 minutes (requires security expertise)
- DNS Problems: 15-60 minutes (requires networking knowledge)
- Ingress Issues: 20-90 minutes (requires load balancer expertise)
Required Expertise Levels
- Basic Service Issues: Junior engineer with kubectl knowledge
- Network Policy Debugging: Senior engineer with security background
- DNS Troubleshooting: Platform engineer with networking expertise
- Multi-Component Failures: Senior SRE with production incident experience
Hidden Costs
- Learning Network Policies: 2-4 weeks for production proficiency
- Cloud Provider Specifics: 1-2 weeks per provider (AWS/GCP/Azure)
- Service Mesh Integration: 4-8 weeks for Istio/Linkerd proficiency
- Production Debugging Skills: 6-12 months of incident response experience
Production-Tested Tool Requirements
Essential Debug Tools
- netshoot container: nicolaka/netshoot (comprehensive networking tools)
- kubectl debug: Kubernetes 1.25+ enhanced debugging
- Cloud provider CLI: AWS CLI, gcloud, az CLI for load balancer debugging
Monitoring Requirements
- Prometheus: Service discovery and endpoint monitoring
- Grafana: Service health dashboards
- Jaeger: Distributed tracing for complex service interactions
Development Environment Differences
- Network Policies: Dev clusters permissive, prod restrictive
- Resource Limits: Prod enforces stricter CPU/memory limits, which can slow startup and trip readiness probes
- Scale Issues: Problems only appear under load
- Security Contexts: Prod runs non-root, dev runs root
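A quick way to check whether your environments actually differ in the ways listed above; a sketch, assuming kubectl contexts named dev and prod:
# Compare network policies between environments
kubectl --context dev get networkpolicy -A
kubectl --context prod get networkpolicy -A
# Compare the effective security context of a running pod
kubectl --context prod get pod my-pod -o jsonpath='{.spec.securityContext}{"  "}{.spec.containers[0].securityContext}{"\n"}'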
Failure Cost Analysis
Service Outage Financial Impact (Real Examples)
- Selector Mismatch: $180K in 47 minutes (payment API)
- Port Configuration: $240K in 2.5 hours (checkout failures)
- Network Policy Error: $2.8M in 6.5 hours (complete platform down)
Time to Recovery Patterns
- Single Component Issues: 15-45 minutes with systematic approach
- Multi-Component Failures: 2-8 hours requiring multiple teams
- Network Policy Disasters: 4-12 hours recreating all connectivity rules
Prevention vs Recovery Costs
- Proper Testing: 2-4 hours per release cycle
- Production Debugging: 20-80 hours per major incident
- Team Training: 40-80 hours initial investment, 95% incident reduction
Decision Framework
When to Use Each Debugging Approach
Problem Type | First Step | Time Investment | Success Rate |
---|---|---|---|
No endpoints | Check selectors/labels | 5-15 minutes | 95% |
Connection refused | Test direct pod connectivity | 10-30 minutes | 90% |
Intermittent failures | Monitor EndpointSlice stability | 30-60 minutes | 80% |
DNS issues | Test from debug pod | 15-45 minutes | 85% |
Network policy blocks | Check policies and test connectivity | 60-180 minutes | 70% |
Escalation Criteria
- 15 minutes: No obvious configuration issues found
- 30 minutes: Multiple debugging approaches attempted
- 45 minutes: Impact exceeds single service
- 60 minutes: Root cause unclear, need additional expertise
Useful Links for Further Investigation
Essential Kubernetes Service Troubleshooting Resources
Link | Description |
---|---|
Debug Services - Official Guide | The canonical guide to debugging service issues. Covers the systematic approach to service troubleshooting with step-by-step commands. |
Troubleshooting Applications | Comprehensive application-level debugging guide that covers pod, service, and ingress troubleshooting scenarios. |
Cluster Networking Concepts | Deep dive into Kubernetes networking fundamentals. Essential reading for understanding how service networking actually works. |
Troubleshooting Clusters | Cluster-level troubleshooting guide. Use when service issues might be related to cluster-wide problems. |
Network Policies | Official documentation on network policies. Critical for understanding and debugging network policy-related service accessibility issues. |
kubectl Reference Documentation | Complete kubectl command reference. Bookmark the troubleshooting sections for quick access during outages. |
Netshoot Container | The essential debugging container with all network troubleshooting tools pre-installed. Use with `kubectl debug` for comprehensive network diagnostics. |
kubectl-debug Plugin | Enhanced debugging capabilities for Kubernetes. Provides additional debugging features beyond standard kubectl debug. |
Popeye - Kubernetes Cluster Sanitizer | Scans your cluster for potential issues including service misconfigurations, selector problems, and resource inconsistencies. |
k9s - Terminal UI for Kubernetes | Interactive terminal UI that makes service debugging more efficient. Excellent for navigating service, pod, and endpoint relationships. |
stern - Multi-Pod Log Tailing | Tail logs from multiple pods simultaneously. Essential for debugging service issues that span multiple pod replicas. |
kube-score | Analyzes Kubernetes object configurations and identifies potential issues including service configuration problems. |
CNCF Kubernetes Troubleshooting Guide | Step-by-step troubleshooting methodology for common Kubernetes errors including service accessibility issues. |
Platform9 Kubernetes Networking Troubleshooting | Real-world networking issues and their solutions. Covers the most common service accessibility problems encountered in production. |
Komodor Kubernetes Networking Errors Guide | Practical guide to handling and preventing Kubernetes networking errors with specific focus on service-related issues. |
CloudSigma Kubernetes Network Inspection Guide | Tools and techniques for inspecting Kubernetes networking, with practical examples for service debugging. |
Spectro Cloud Kubernetes Errors Guide | Top 10 most common Kubernetes errors including service accessibility problems, with practical solutions. |
Kubernetes DNS Troubleshooting | Official guide to debugging DNS-related service issues. Essential for resolving service name resolution problems. |
CoreDNS Troubleshooting Guide | CoreDNS-specific troubleshooting documentation. Use when DNS resolution is failing for service names. |
Groundcover DNS Troubleshooting | Comprehensive guide to Kubernetes DNS issues with practical debugging steps and solutions. |
AWS EKS Troubleshooting Guide | AWS-specific service troubleshooting including load balancer, security group, and VPC networking issues. |
Google GKE Troubleshooting | GKE-specific networking and service troubleshooting guide with Google Cloud integration details. |
Azure AKS Troubleshooting | AKS-specific troubleshooting guide covering Azure networking and load balancer integration. |
DigitalOcean Kubernetes Guide | DOKS-specific Kubernetes guide covering cluster management and basic troubleshooting. |
Prometheus Kubernetes Monitoring | Set up Prometheus monitoring for Kubernetes services. Essential for proactive service health monitoring. |
Grafana Kubernetes Dashboards | Pre-built dashboards for monitoring Kubernetes service health and networking metrics. |
Jaeger Distributed Tracing | Implement distributed tracing to debug complex service communication issues across multiple microservices. |
Istio Service Mesh Debugging | Service mesh specific troubleshooting guide. Use when debugging services in Istio service mesh environments. |
Kubernetes Slack #troubleshooting | Real-time community support for Kubernetes troubleshooting. Join the troubleshooting channel for immediate help. |
Stack Overflow Kubernetes Service Tag | Community Q&A for Kubernetes service issues. Search existing questions before asking new ones. |
GitHub Kubernetes Issues | Official Kubernetes issue tracker for bug reports and troubleshooting discussions. |
Kubernetes Community Forums | Official community forum for longer-form discussions about Kubernetes troubleshooting approaches. |
Telepresence | Debug remote Kubernetes services from your local development environment. Useful for testing service connectivity during development. |
Skaffold | Local development workflow tool that can help identify service connectivity issues early in the development cycle. |
Tilt | Development environment tool that provides real-time feedback on service health during development. |
Linkerd Documentation | Service mesh documentation with debugging guides for service-to-service communication issues. |
Kubernetes Network Policy Recipes | Collection of network policy examples and patterns. Essential for understanding how network policies affect service accessibility. |
Falco Runtime Security | Runtime security monitoring that can help identify when network policies are blocking legitimate service communication. |
Open Policy Agent (OPA) Gatekeeper | Policy engine for Kubernetes that can help enforce proper service configuration to prevent accessibility issues. |
KillerCoda Kubernetes Scenarios | Interactive scenarios for learning Kubernetes networking and service debugging hands-on. |
Minikube | Local Kubernetes environment for practicing service troubleshooting techniques safely. |
Kubernetes Learning Path | Official tutorials including networking and service troubleshooting exercises. |
Kubernetes Networking | Comprehensive book covering Kubernetes networking concepts essential for understanding service accessibility issues. |
Kubernetes Best Practices | Configuration best practices that help prevent common service configuration issues. |
Kubernetes in Action | Comprehensive book covering Kubernetes concepts including systematic troubleshooting and service debugging approaches. |
Kubernetes Incident Response Guide | Framework for responding to Kubernetes incidents including service outages. |
SRE Workbook - Kubernetes | Site Reliability Engineering practices for Kubernetes including service reliability and incident response. |
Runbook Templates for Kubernetes | Template runbooks for common Kubernetes operational procedures including service troubleshooting. |