Currently viewing the AI version
Switch to human version

Kubernetes Service Accessibility Troubleshooting Guide

Critical Service Failure Modes

1. Selector Mismatch (90% of Service Failures)

Symptoms: Service exists, shows healthy in kubectl get service, but returns 503 errors
Root Cause: Service selector doesn't match pod labels
Critical Impact: Complete service unavailability while appearing healthy in monitoring
Detection: kubectl get endpoints SERVICE-NAME shows <none>

Real-World Consequences:

  • Production incident: Payment API 47-minute outage during Black Friday
  • Financial impact: $180K lost transactions, 2.3 hours downtime
  • Cause: Label change from app: payment-api-v1 to app: payment-api-v2 without updating service selector

Emergency Fix:

kubectl patch service my-service -p '{"spec":{"selector":{"app":"correct-label"}}}'

2. Port Configuration Hell

Symptoms: Endpoints exist but connections refused/timeout
Root Cause: Misalignment between containerPort, service port, and targetPort
Critical Impact: Service layer completely broken despite healthy pods

Configuration Error Pattern:

  • Application listens on port 8080
  • Service targetPort configured as 3000
  • Port-forward works (bypasses service layer), masking the issue

Production Failure Example:

  • React frontend on AKS: 2.5 hours checkout failures
  • Damage: $240K abandoned carts
  • Root cause: Next.js listening on 8080, service targeting 3000

Emergency Fix:

kubectl patch service my-service -p '{"spec":{"ports":[{"port":80,"targetPort":8080}]}}'

3. Network Policy Lockdown

Symptoms: Services work initially, then fail after security policies applied
Root Cause: Network policies are additive - any policy creates default-deny behavior
Critical Impact: Complete platform shutdown within minutes

Catastrophic Incident:

  • Production EKS cluster: 6.5 hours to restore all services
  • Financial damage: $2.8M lost revenue, 847 abandoned carts
  • Cause: Default-deny network policy applied without allow rules

Policy Pattern That Breaks Everything:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
spec:
  podSelector: {}  # Affects ALL pods
  policyTypes:
  - Ingress
  - Egress
  # No allow rules = everything blocked

4. DNS Resolution Chaos

Symptoms: Intermittent service failures, connection works sometimes
Root Cause: CoreDNS pod failures, DNS query limits, cache corruption
Critical Impact: Unpredictable service availability

Failure Thresholds:

  • CoreDNS crashes at 5000+ QPS without horizontal autoscaling
  • Node-local DNS cache corruption in Kubernetes 1.33+
  • DNS query timeout spikes during high pod churn

5. Readiness Probe Deception

Symptoms: Pods show "Running" but aren't receiving traffic
Root Cause: Readiness probes fail, pods removed from service endpoints
Critical Impact: Healthy pods sit idle while users get 503 errors

Production Example:

  • PostgreSQL migration with 30-second readiness probe timeout
  • Health check queries hang on table locks during migration
  • Result: Pods marked unready, removed from load balancer rotation

Service Debugging Time Requirements

Cloud Provider Load Balancer Provisioning (2025 Standards)

  • AWS ALB: 3-5 minutes
  • GCP GLB: 2-4 minutes
  • Azure ALB: 5-8 minutes

Kubernetes Component Response Times

  • Network Policy Changes: Immediate effect, CNI propagation 5-15 seconds
  • DNS Propagation: 30-60 seconds for CoreDNS updates
  • Ingress Controller Updates: NGINX (30-60s), Traefik (10-30s), Gateway API (60-120s)
  • Pod Startup: 2-5 minutes for application initialization (longer for JVM apps)

When NOT to Wait (Immediate Investigation Required)

  • Connection refused errors
  • Service selector mismatches
  • Missing endpoints
  • HTTP 5xx errors from ingress
  • Pod CrashLoopBackOff status

Systematic Debugging Workflow

Phase 1: Quick Triage (5 Minutes Maximum)

# 1. Verify service exists and has endpoints
kubectl get service my-service -o wide
kubectl get endpointslices -l kubernetes.io/service-name=my-service -o wide

# 2. Check pod readiness (not just "Running")
kubectl get pods -l app=my-app -o wide
kubectl describe pods -l app=my-app | grep -A 10 -B 2 "Conditions:"

# 3. Test internal connectivity
kubectl run debug-pod --image=nicolaka/netshoot --rm -it --restart=Never -- bash
# Inside debug pod:
nslookup my-service.my-namespace.svc.cluster.local
curl -v my-service:80 --connect-timeout 5

# 4. Check for blocking network policies
kubectl get networkpolicy --all-namespaces -o wide

Phase 2: Systematic Investigation

Service Configuration Validation:

# Verify selector matches pod labels
kubectl get service my-service -o yaml | grep -A 5 selector
kubectl get pods --show-labels | grep my-app

# Check port alignment
kubectl get service my-service -o jsonpath='{.spec.ports[*]}'
kubectl exec -it my-pod -- netstat -tlnp

EndpointSlice Analysis (Kubernetes 1.21+ Required):

# Modern endpoint debugging
kubectl get endpointslices -l kubernetes.io/service-name=my-service -o yaml
kubectl get endpointslices -l kubernetes.io/service-name=my-service -o jsonpath='{range .items[*].endpoints[*]}{.addresses[*]}{" - Ready: "}{.conditions.ready}{"\n"}{end}'

Direct Pod Testing:

# Test bypassing service layer
kubectl get pods -l app=my-app -o wide
kubectl run debug-pod --image=nicolaka/netshoot --rm -it -- curl POD-IP:8080/health

Critical Error Patterns

Connection Refused vs Connection Timeout

  • Connection Refused: Port closed, wrong port config, app binding to localhost
  • Connection Timeout: Network policies, firewall rules, CNI issues

Intermittent Failures

Debugging Pattern:

# Test connectivity over time
for i in {1..20}; do
  kubectl run connectivity-test-$i --image=nicolaka/netshoot --rm --restart=Never \
    -- timeout 10 curl -s -w "Response: %{http_code}, Time: %{time_total}s\n" \
    my-service.my-namespace.svc.cluster.local:80 || echo "Attempt $i: Connection failed"
  sleep 2
done

Port-Forward Works But Ingress Fails

Root Cause: Port-forward bypasses ingress controller, load balancer, and TLS termination

Debug Sequence:

# Test service layer directly
kubectl run debug-pod --image=nicolaka/netshoot --rm -it -- curl my-service.my-namespace.svc.cluster.local:80

# Test ingress controller directly
kubectl port-forward -n ingress-nginx service/ingress-nginx-controller 8080:80
curl -H "Host: my-domain.com" localhost:8080/my-path

# Check ingress logs
kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx --tail=100 | grep "my-domain.com"

Emergency Fix Commands

Selector Mismatch

kubectl patch service my-service -p '{"spec":{"selector":{"app":"correct-label"}}}'

Port Configuration

kubectl patch service my-service -p '{"spec":{"ports":[{"port":80,"targetPort":8080}]}}'

Disable Readiness Probe (Temporary)

kubectl patch deployment my-app -p '{"spec":{"template":{"spec":{"containers":[{"name":"my-container","readinessProbe":null}]}}}}'

Network Policy Allow Rule

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-my-service
spec:
  podSelector:
    matchLabels:
      app: my-app
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: client-app
    ports:
    - protocol: TCP
      port: 8080

Resource Requirements and Expertise Levels

Time Investment for Resolution

  • Selector Mismatch: 5-15 minutes (junior engineer)
  • Port Configuration: 10-30 minutes (requires container knowledge)
  • Network Policy Issues: 30-120 minutes (requires security expertise)
  • DNS Problems: 15-60 minutes (requires networking knowledge)
  • Ingress Issues: 20-90 minutes (requires load balancer expertise)

Required Expertise Levels

  • Basic Service Issues: Junior engineer with kubectl knowledge
  • Network Policy Debugging: Senior engineer with security background
  • DNS Troubleshooting: Platform engineer with networking expertise
  • Multi-Component Failures: Senior SRE with production incident experience

Hidden Costs

  • Learning Network Policies: 2-4 weeks for production proficiency
  • Cloud Provider Specifics: 1-2 weeks per provider (AWS/GCP/Azure)
  • Service Mesh Integration: 4-8 weeks for Istio/Linkerd proficiency
  • Production Debugging Skills: 6-12 months of incident response experience

Production-Tested Tool Requirements

Essential Debug Tools

  • netshoot container: nicolaka/netshoot (comprehensive networking tools)
  • kubectl debug: Kubernetes 1.25+ enhanced debugging
  • Cloud provider CLI: AWS CLI, gcloud, az CLI for load balancer debugging

Monitoring Requirements

  • Prometheus: Service discovery and endpoint monitoring
  • Grafana: Service health dashboards
  • Jaeger: Distributed tracing for complex service interactions

Development Environment Differences

  • Network Policies: Dev clusters permissive, prod restrictive
  • Resource Limits: Prod stricter CPU/memory affecting readiness probes
  • Scale Issues: Problems only appear under load
  • Security Contexts: Prod runs non-root, dev runs root

Failure Cost Analysis

Service Outage Financial Impact (Real Examples)

  • Selector Mismatch: $180K in 47 minutes (payment API)
  • Port Configuration: $240K in 2.5 hours (checkout failures)
  • Network Policy Error: $2.8M in 6.5 hours (complete platform down)

Time to Recovery Patterns

  • Single Component Issues: 15-45 minutes with systematic approach
  • Multi-Component Failures: 2-8 hours requiring multiple teams
  • Network Policy Disasters: 4-12 hours recreating all connectivity rules

Prevention vs Recovery Costs

  • Proper Testing: 2-4 hours per release cycle
  • Production Debugging: 20-80 hours per major incident
  • Team Training: 40-80 hours initial investment, 95% incident reduction

Decision Framework

When to Use Each Debugging Approach

Problem Type First Step Time Investment Success Rate
No endpoints Check selectors/labels 5-15 minutes 95%
Connection refused Test direct pod connectivity 10-30 minutes 90%
Intermittent failures Monitor EndpointSlice stability 30-60 minutes 80%
DNS issues Test from debug pod 15-45 minutes 85%
Network policy blocks Check policies and test connectivity 60-180 minutes 70%

Escalation Criteria

  • 15 minutes: No obvious configuration issues found
  • 30 minutes: Multiple debugging approaches attempted
  • 45 minutes: Impact exceeds single service
  • 60 minutes: Root cause unclear, need additional expertise

Useful Links for Further Investigation

Essential Kubernetes Service Troubleshooting Resources

LinkDescription
Debug Services - Official GuideThe canonical guide to debugging service issues. Covers the systematic approach to service troubleshooting with step-by-step commands.
Troubleshooting ApplicationsComprehensive application-level debugging guide that covers pod, service, and ingress troubleshooting scenarios.
Cluster Networking ConceptsDeep dive into Kubernetes networking fundamentals. Essential reading for understanding how service networking actually works.
Troubleshooting ClustersCluster-level troubleshooting guide. Use when service issues might be related to cluster-wide problems.
Network PoliciesOfficial documentation on network policies. Critical for understanding and debugging network policy-related service accessibility issues.
kubectl Reference DocumentationComplete kubectl command reference. Bookmark the troubleshooting sections for quick access during outages.
Netshoot ContainerThe essential debugging container with all network troubleshooting tools pre-installed. Use with `kubectl debug` for comprehensive network diagnostics.
kubectl-debug PluginEnhanced debugging capabilities for Kubernetes. Provides additional debugging features beyond standard kubectl debug.
Popeye - Kubernetes Cluster SanitizerScans your cluster for potential issues including service misconfigurations, selector problems, and resource inconsistencies.
k9s - Terminal UI for KubernetesInteractive terminal UI that makes service debugging more efficient. Excellent for navigating service, pod, and endpoint relationships.
stern - Multi-Pod Log TailingTail logs from multiple pods simultaneously. Essential for debugging service issues that span multiple pod replicas.
kube-scoreAnalyzes Kubernetes object configurations and identifies potential issues including service configuration problems.
CNCF Kubernetes Troubleshooting GuideStep-by-step troubleshooting methodology for common Kubernetes errors including service accessibility issues.
Platform9 Kubernetes Networking TroubleshootingReal-world networking issues and their solutions. Covers the most common service accessibility problems encountered in production.
Komodor Kubernetes Networking Errors GuidePractical guide to handling and preventing Kubernetes networking errors with specific focus on service-related issues.
CloudSigma Kubernetes Network Inspection GuideTools and techniques for inspecting Kubernetes networking, with practical examples for service debugging.
Spectro Cloud Kubernetes Errors GuideTop 10 most common Kubernetes errors including service accessibility problems, with practical solutions.
Kubernetes DNS TroubleshootingOfficial guide to debugging DNS-related service issues. Essential for resolving service name resolution problems.
CoreDNS Troubleshooting GuideCoreDNS-specific troubleshooting documentation. Use when DNS resolution is failing for service names.
Groundcover DNS TroubleshootingComprehensive guide to Kubernetes DNS issues with practical debugging steps and solutions.
AWS EKS Troubleshooting GuideAWS-specific service troubleshooting including load balancer, security group, and VPC networking issues.
Google GKE TroubleshootingGKE-specific networking and service troubleshooting guide with Google Cloud integration details.
Azure AKS TroubleshootingAKS-specific troubleshooting guide covering Azure networking and load balancer integration.
DigitalOcean Kubernetes GuideDOKS-specific Kubernetes guide covering cluster management and basic troubleshooting.
Prometheus Kubernetes MonitoringSet up Prometheus monitoring for Kubernetes services. Essential for proactive service health monitoring.
Grafana Kubernetes DashboardsPre-built dashboards for monitoring Kubernetes service health and networking metrics.
Jaeger Distributed TracingImplement distributed tracing to debug complex service communication issues across multiple microservices.
Istio Service Mesh DebuggingService mesh specific troubleshooting guide. Use when debugging services in Istio service mesh environments.
Kubernetes Slack #troubleshootingReal-time community support for Kubernetes troubleshooting. Join the troubleshooting channel for immediate help.
Stack Overflow Kubernetes Service TagCommunity Q&A for Kubernetes service issues. Search existing questions before asking new ones.
GitHub Kubernetes IssuesOfficial Kubernetes issue tracker for bug reports and troubleshooting discussions.
Kubernetes Community ForumsOfficial community forum for longer-form discussions about Kubernetes troubleshooting approaches.
TelepresenceDebug remote Kubernetes services from your local development environment. Useful for testing service connectivity during development.
SkaffoldLocal development workflow tool that can help identify service connectivity issues early in the development cycle.
TiltDevelopment environment tool that provides real-time feedback on service health during development.
Linkerd DocumentationService mesh documentation with debugging guides for service-to-service communication issues.
Kubernetes Network Policy RecipesCollection of network policy examples and patterns. Essential for understanding how network policies affect service accessibility.
Falco Runtime SecurityRuntime security monitoring that can help identify when network policies are blocking legitimate service communication.
Open Policy Agent (OPA) GatekeeperPolicy engine for Kubernetes that can help enforce proper service configuration to prevent accessibility issues.
KillerCoda Kubernetes ScenariosInteractive scenarios for learning Kubernetes networking and service debugging hands-on.
MinikubeLocal Kubernetes environment for practicing service troubleshooting techniques safely.
Kubernetes Learning PathOfficial tutorials including networking and service troubleshooting exercises.
Kubernetes NetworkingComprehensive book covering Kubernetes networking concepts essential for understanding service accessibility issues.
Kubernetes Best PracticesConfiguration best practices that help prevent common service configuration issues.
Kubernetes in ActionComprehensive book covering Kubernetes concepts including systematic troubleshooting and service debugging approaches.
Kubernetes Incident Response GuideFramework for responding to Kubernetes incidents including service outages.
SRE Workbook - KubernetesSite Reliability Engineering practices for Kubernetes including service reliability and incident response.
Runbook Templates for KubernetesTemplate runbooks for common Kubernetes operational procedures including service troubleshooting.

Related Tools & Recommendations

tool
Similar content

Google Cloud Run - Throw a Container at Google, Get Back a URL

Skip the Kubernetes hell and deploy containers that actually work.

Google Cloud Run
/tool/google-cloud-run/overview
100%
howto
Recommended

Set Up Microservices Monitoring That Actually Works

Stop flying blind - get real visibility into what's breaking your distributed services

Prometheus
/howto/setup-microservices-observability-prometheus-jaeger-grafana/complete-observability-setup
82%
tool
Similar content

Debug Kubernetes Issues - The 3AM Production Survival Guide

When your pods are crashing, services aren't accessible, and your pager won't stop buzzing - here's how to actually fix it

Kubernetes
/tool/kubernetes/debugging-kubernetes-issues
71%
troubleshoot
Similar content

Fix Kubernetes Pod CrashLoopBackOff - Complete Troubleshooting Guide

Master Kubernetes CrashLoopBackOff. This complete guide explains what it means, diagnoses common causes, provides proven solutions, and offers advanced preventi

Kubernetes
/troubleshoot/kubernetes-pod-crashloopbackoff/crashloop-diagnosis-solutions
54%
troubleshoot
Similar content

Fix Kubernetes ImagePullBackOff Error - The Complete Battle-Tested Guide

From "Pod stuck in ImagePullBackOff" to "Problem solved in 90 seconds"

Kubernetes
/troubleshoot/kubernetes-imagepullbackoff/comprehensive-troubleshooting-guide
52%
troubleshoot
Recommended

Docker Swarm Service Discovery Broken? Here's How to Unfuck It

When your containers can't find each other and everything goes to shit

Docker Swarm
/troubleshoot/docker-swarm-production-failures/service-discovery-routing-mesh-failures
51%
troubleshoot
Recommended

Docker Swarm Node Down? Here's How to Fix It

When your production cluster dies at 3am and management is asking questions

Docker Swarm
/troubleshoot/docker-swarm-node-down/node-down-recovery
51%
tool
Recommended

Docker Swarm - Container Orchestration That Actually Works

Multi-host Docker without the Kubernetes PhD requirement

Docker Swarm
/tool/docker-swarm/overview
51%
tool
Recommended

HashiCorp Nomad - Kubernetes Alternative Without the YAML Hell

competes with HashiCorp Nomad

HashiCorp Nomad
/tool/hashicorp-nomad/overview
49%
tool
Recommended

Amazon ECS - Container orchestration that actually works

alternative to Amazon ECS

Amazon ECS
/tool/aws-ecs/overview
49%
integration
Recommended

GitHub Actions + Docker + ECS: Stop SSH-ing Into Servers Like It's 2015

Deploy your app without losing your mind or your weekend

GitHub Actions
/integration/github-actions-docker-aws-ecs/ci-cd-pipeline-automation
49%
tool
Recommended

Fix Helm When It Inevitably Breaks - Debug Guide

The commands, tools, and nuclear options for when your Helm deployment is fucked and you need to debug template errors at 3am.

Helm
/tool/helm/troubleshooting-guide
49%
tool
Recommended

Helm - Because Managing 47 YAML Files Will Drive You Insane

Package manager for Kubernetes that saves you from copy-pasting deployment configs like a savage. Helm charts beat maintaining separate YAML files for every dam

Helm
/tool/helm/overview
49%
integration
Recommended

Making Pulumi, Kubernetes, Helm, and GitOps Actually Work Together

Stop fighting with YAML hell and infrastructure drift - here's how to manage everything through Git without losing your sanity

Pulumi
/integration/pulumi-kubernetes-helm-gitops/complete-workflow-integration
49%
integration
Recommended

GitOps Integration Hell: Docker + Kubernetes + ArgoCD + Prometheus

How to Wire Together the Modern DevOps Stack Without Losing Your Sanity

prometheus
/integration/docker-kubernetes-argocd-prometheus/gitops-workflow-integration
49%
integration
Recommended

Why Your Monitoring Bill Tripled (And How I Fixed Mine)

Four Tools That Actually Work + The Real Cost of Making Them Play Nice

Sentry
/integration/sentry-datadog-newrelic-prometheus/unified-observability-architecture
49%
alternatives
Recommended

Tired of GitHub Actions Eating Your Budget? Here's Where Teams Are Actually Going

integrates with GitHub Actions

GitHub Actions
/alternatives/github-actions/migration-ready-alternatives
44%
alternatives
Recommended

GitHub Actions is Fucking Slow: Alternatives That Actually Work

integrates with GitHub Actions

GitHub Actions
/alternatives/github-actions/performance-optimized-alternatives
44%
tool
Recommended

GitHub Actions - CI/CD That Actually Lives Inside GitHub

integrates with GitHub Actions

GitHub Actions
/tool/github-actions/overview
44%
troubleshoot
Recommended

CVE-2025-9074 Docker Desktop Emergency Patch - Critical Container Escape Fixed

Critical vulnerability allowing container breakouts patched in Docker Desktop 4.44.3

Docker Desktop
/troubleshoot/docker-cve-2025-9074/emergency-response-patching
44%

Recommendations combine user behavior, content similarity, research intelligence, and SEO optimization