Why Your Kubernetes Services Are Unreachable (It's Usually One of These Five Things)

Kubernetes Service Networking Architecture

Kubernetes service failures are pure psychological torture because everything looks healthy. Your pods are running, your deployments show green checkmarks everywhere, but somehow nobody can connect to your fucking application. The problem? Kubernetes networking has more abstraction layers than an enterprise software architecture diagram, and each one can silently destroy your weekend.

After debugging hundreds of service accessibility disasters (usually at 3 AM while executives send passive-aggressive Slack messages), we've found that 90% fall into five predictable failure modes. Here's what actually breaks when your services vanish into the networking black hole.

The Service Accessibility Hierarchy (Where Things Go Wrong)

Kubernetes Service Architecture

Kubernetes networking operates through multiple abstraction layers, each with its own delightful ways to fail. After debugging service outages in production clusters running everything from Kubernetes 1.28 to the latest 1.33 releases, the failure patterns are depressingly consistent. Understanding this hierarchy means you can debug systematically instead of frantically trying random kubectl commands while your CEO asks when the site will be back online.

Layer 1: Pod-Level Issues

  • Pods aren't actually ready despite showing "Running" status
  • Application binds to localhost instead of 0.0.0.0
  • Wrong port numbers between container and service configuration
  • Health checks failing due to misconfigured probes

Layer 2: Service Configuration Problems

  • Service selectors don't match pod labels (most common cause)
  • Port mapping errors between service and target ports
  • Service exists but has no endpoints
  • Wrong service type for your use case (ClusterIP vs LoadBalancer)

Layer 3: Network Policy Restrictions

  • Default-deny policies blocking legitimate traffic
  • Incorrect policy selectors preventing pod communication
  • Missing ingress/egress rules for required connections
  • Namespace isolation preventing cross-namespace communication

Layer 4: DNS Resolution Failures

  • CoreDNS pods not running or misconfigured (check for OOMKilled in kube-system namespace)
  • Service name resolution failing within cluster (DNS query timeout spikes during high pod churn)
  • External DNS not propagating for ingress resources (Route53 vs CloudDNS propagation delays)
  • DNS caching issues causing stale entries (node-local DNS cache corruption in Kubernetes 1.33+)
  • DNS query limits hit during traffic bursts (5000+ QPS crashes CoreDNS without horizontal pod autoscaling)

Layer 5: Ingress and Load Balancer Issues

  • Ingress controller not receiving traffic (NGINX controller restart loop after config parse errors)
  • Backend service health checks failing (ALB health checks timeout after 30 seconds by default)
  • SSL/TLS certificate problems (cert-manager renewal failures during high load)
  • Cloud provider load balancer misconfigurations (EKS ALB annotation hell in Kubernetes 1.33)
  • Gateway API conflicts with legacy Ingress resources (chaos when both are active simultaneously)

The Five Most Common Service Accessibility Failures

1. Selector Mismatch - The Silent Killer

Kubernetes Service and Endpoints

What happens: Your service exists, looks healthy, but `kubectl get endpoints` shows no endpoints available.

Why it happens: The service selector doesn't match any pod labels. This is surprisingly common because:

  • Copy-paste errors when creating services from YAML templates
  • Pod labels changed during deployment updates
  • Typos in label keys or values (app: web vs app: webapp)
  • Case sensitivity issues (App: Web vs app: web)

Real-world disaster: During a "quick" Node.js 18.19.0 upgrade in our payment API (running on GKE 1.31.2), we changed deployment labels from app: payment-api-v1 to app: payment-api-v2 but forgot to update the service selector. Datadog showed green pods, kubectl get pods looked perfect, but every payment request returned HTTP 503 with the dreaded "no healthy upstream" error.

The real kick in the teeth? Our health check endpoint at /health was responding perfectly when we curled the pod IPs directly. But the LoadBalancer service was frantically searching for pods labeled payment-api-v1 while our shiny new v2 pods sat there being completely ignored. Took us 47 minutes of increasingly panicked debugging (while Black Friday traffic was failing) to run kubectl get endpoints payment-service and see the brutal truth: <none>.

Total damage: 2.3 hours of payment downtime, $180K in lost transactions, and one very pissed-off CTO. All because of eight characters in a YAML selector. Selector mismatches are silent killers that make everything look healthy while being completely broken.

How to identify using kubectl troubleshooting commands:

## Check service selector
kubectl describe service my-service | grep Selector

## Check pod labels using [label queries](https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/#label-selectors)
kubectl get pods --show-labels | grep my-app

## Compare - if they don't match, you found the problem
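
If eyeballing labels feels error-prone, here's a small hedged sketch (assumes jq is installed and a service named my-service) that turns the live selector into a label query and counts how many pods it actually matches:

## Convert the service's selector into a -l query and count matching pods
SELECTOR=$(kubectl get service my-service -o json \
  | jq -r '.spec.selector | to_entries | map("\(.key)=\(.value)") | join(",")')
kubectl get pods -l "$SELECTOR" --no-headers | wc -l   # 0 means the selector matches nothing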

2. Port Configuration Hell

What happens: Service endpoints exist but connections are refused or time out.

Why it's confusing: Kubernetes has three different port concepts that must align:

  • `containerPort`: The port your application listens on inside the container
  • `port`: The port the service exposes to other pods
  • `targetPort`: The port the service forwards traffic to (should match containerPort)

The mistake pattern:

## Your application listens on port 8080
containers:
- name: web-app
  ports:
  - containerPort: 8080

## But your service configuration is wrong
spec:
  ports:
  - port: 80          # Service port (correct)
    targetPort: 3000   # Wrong! Should be 8080
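
For contrast, a minimal corrected pairing (names are placeholders) where the three port fields line up:

## Corrected version - targetPort matches what the app actually listens on
apiVersion: v1
kind: Service
metadata:
  name: web-app
spec:
  selector:
    app: web-app
  ports:
  - port: 80          # What other pods (and the ingress) connect to
    targetPort: 8080  # Must equal containerPort / the port the app binds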

Production nightmare: Our React frontend (Next.js 14.0.4 on AKS 1.31.3) started throwing 502 "Bad Gateway" errors randomly during peak Black Friday traffic. The cloud load balancer showed healthy targets, Prometheus metrics looked perfect, pods were reporting as ready, but users couldn't complete checkout flows.

The smoking gun came when we finally tested direct pod connectivity: kubectl exec -it debug-pod -- curl payment-frontend-pod-ip:8080 worked perfectly, but curl payment-frontend-service:80 returned connection refused. Our service spec had targetPort: 3000 while the containerized Next.js app was actually listening on port 8080 (configured via PORT=8080 in the container env).

Why didn't we catch this obvious fuckup during testing? Because developers exclusively used kubectl port-forward service/payment-frontend 3000:80 for local testing, which completely bypasses the service port mapping layer. The port-forward worked fine, masking the service configuration error until production load exposed it.

Final damage: 2.5 hours of checkout failures, $240K in abandoned carts, and a very memorable post-mortem about why service port mapping matters. All because someone copy-pasted a port number from an old deployment and never actually tested the service layer.

3. Network Policy Lockdown

Kubernetes Network Policy Diagram

What happens: Connections work initially, then stop working after security policies are applied.

Why security breaks everything: Network policies in Kubernetes are additive, meaning once any NetworkPolicy selects a pod, traffic to that pod (in the policy's direction) is denied by default unless explicitly allowed. This default-deny behavior is a major source of service connectivity failures.

The "secure by default" trap:

## Someone creates this "secure" policy
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-all
spec:
  podSelector: {}  # Applies to ALL pods
  policyTypes:
  - Ingress
  # No ingress rules = deny everything

What this breaks: Everything. All pod-to-pod communication stops working until you create specific allow rules for every required connection.

The great network policy disaster of December 2024: Security team deployed Calico network policies with "secure by default" configuration on our production EKS 1.31 cluster at 2:17 PM EST on a Tuesday. No staging test. No rollback plan. No communication to the platform team.

By 2:19 PM, our entire e-commerce platform was completely dead. HTTP 503 errors everywhere. Payment processing API: down. User authentication service: down. Product catalog: down. Even our internal monitoring (Prometheus scraping) was returning timeouts. The CEO's first Slack message at 2:23 PM was simple: "Site is down. ETA?"

The network policy they deployed was beautifully secure and completely catastrophic:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
spec:
  podSelector: {}  # Affects ALL pods in namespace
  policyTypes:
  - Ingress
  - Egress
  # No allow rules = everything blocked

Six increasingly panicked engineers (including me) spent the next 6.5 hours manually creating allow rules for every single microservice interaction. Database connections: blocked. Service mesh communication: blocked. Even DNS queries to CoreDNS: fucking blocked.

Final tally: $2.8M in lost revenue, 847 abandoned shopping carts, and one very uncomfortable all-hands meeting about change management. All because someone applied a "secure by default" policy to production without realizing it would immediately break everything that makes the platform function.

Network policies are incredibly powerful security tools. They're also the fastest way to transform a functioning production platform into very expensive digital paperweights.
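
If your security team insists on default-deny, at minimum ship a DNS allow rule alongside it. A hedged sketch, relying on the kubernetes.io/metadata.name namespace label that clusters 1.22+ apply automatically:

## Allow every pod in this namespace to reach CoreDNS in kube-system
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
spec:
  podSelector: {}        # every pod in this namespace
  policyTypes:
  - Egress
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53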

4. DNS Resolution Chaos

Kubernetes DNS Architecture

What happens: Service names can't be resolved, or resolution is intermittent.

The DNS dependency chain: Every service lookup depends on CoreDNS pods running correctly. The Kubernetes DNS specification defines how service discovery works:

  • CoreDNS pods must be healthy and running
  • DNS service must have valid endpoints
  • DNS configuration must propagate to all nodes
  • Application must use correct service name format

Common DNS failures:

  • CoreDNS pods crash during high load
  • DNS service endpoints become stale after node failures
  • Wrong service name format (my-service vs my-service.namespace.svc.cluster.local)
  • DNS query timeouts during traffic spikes

The debugging nightmare: DNS issues are particularly maddening because they're intermittent as hell. Sometimes connections work perfectly (cached DNS entries), sometimes they fail mysteriously (cache expired), and debugging requires understanding both Kubernetes networking intricacies and DNS resolution timing. It's like playing Russian roulette with your service calls.
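
A quick way to separate search-path and caching problems from a dead CoreDNS, using a throwaway pod (image and names are placeholders):

## Compare short-name and fully-qualified lookups, then inspect resolv.conf
kubectl run dns-check --image=busybox:1.36 --rm -it --restart=Never -- sh
## Inside the pod:
nslookup my-service                                   # depends on the resolv.conf search path
nslookup my-service.my-namespace.svc.cluster.local    # FQDN works even when the search path is broken
cat /etc/resolv.conf                                  # nameserver should be the kube-dns ClusterIP; note the ndots:5 default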

5. Readiness Probe Deception

What happens: Pods show "Running" status but aren't actually ready to serve traffic.

The readiness vs liveness confusion:

  • Liveness probe: Determines if container should be restarted
  • Readiness probe: Determines if container should receive traffic

The deceptive scenario:

kubectl get pods
NAME                     READY   STATUS    RESTARTS   AGE
web-app-7c8b9d5f-abc123  1/1     Running   0          5m

## But actually...
kubectl describe pod web-app-7c8b9d5f-abc123
## Shows: Warning  Unhealthy  readiness probe failed

Why this breaks services: Kubernetes includes pods in service endpoints based on readiness probe success. If readiness probes fail, pods won't receive traffic even though they appear healthy in kubectl get pods.
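
To see the condition Kubernetes actually uses for endpoint membership (the label is an assumption), print each pod's Ready condition instead of trusting the STATUS column:

kubectl get pods -l app=web-app -o jsonpath='{range .items[*]}{.metadata.name}{": Ready="}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}'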

Real incident: Our PostgreSQL 15.3 database migration (running on RDS with connection pooling via PgBouncer) typically took 8-12 minutes during off-peak windows. But our Spring Boot 3.2.1 application had readiness probes configured with a 30-second timeout hitting /actuator/health:

readinessProbe:
  httpGet:
    path: /actuator/health
    port: 8080
  timeoutSeconds: 30  # The problem
  periodSeconds: 10

During migrations, the health check would query SELECT 1 FROM schema_migrations which would hang waiting for table locks. After 30 seconds, the probe failed, Kubernetes marked pods as unready, and they got removed from service endpoints. Result: 503 errors for users while perfectly healthy pods sat idle.

Took us three failed Saturday morning deployments (and three very angry product manager messages) to realize the readiness probe was timing out during normal database operations. The fix was bumping timeout to 60 seconds and creating a dedicated /ready endpoint that checked application state without hitting locked database tables.

The real lesson: readiness probes should check if your app can serve traffic, not if your database migration is complete.
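
The fix we landed on looked roughly like this (path and port are from this incident; adjust for your app):

readinessProbe:
  httpGet:
    path: /ready          # dedicated endpoint that reports app state without touching locked tables
    port: 8080
  timeoutSeconds: 60      # survives slow responses during migrations
  periodSeconds: 10
  failureThreshold: 3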

The Interconnected Nature of Service Failures

Service accessibility problems are rarely isolated issues. A DNS failure might mask a selector mismatch, or network policies might hide port configuration errors. Understanding these interconnections is crucial for systematic debugging:

Cascading failures

One misconfiguration often triggers others. For example:

  1. Wrong service selector creates empty endpoints
  2. Load balancer health checks fail due to no backends
  3. Ingress controller marks service as unhealthy
  4. DNS entries become stale
  5. Application-level retries overwhelm remaining services

Time-based issues

Some problems only appear under specific conditions:

  • High traffic reveals DNS resolution limits
  • Pod restarts expose readiness probe misconfiguration
  • Node failures trigger network policy edge cases
  • Certificate rotations break ingress TLS configuration

Environment-specific behaviors

What works in development often fails in production:

  • Development clusters have permissive network policies
  • Staging environments don't match production node configurations
  • Load balancer behavior differs between cloud providers
  • DNS resolution works differently in single-node vs multi-node clusters

Understanding these root causes provides the foundation for systematic debugging. For more detailed information, consult the official Kubernetes troubleshooting guide and service debugging documentation. The CNCF Kubernetes troubleshooting patterns also provide excellent real-world examples. Additional resources include the Kubernetes networking troubleshooting guide and production debugging best practices.

Understanding these five failure modes is essential, but theory won't save you when production is melting down at 2 AM and executives are asking for ETAs you can't provide. What you need are the exact commands to run, in the correct sequence, to identify and fix the problem before it becomes a résumé-generating event.

The systematic debugging methodology that follows transforms this theoretical knowledge into actionable command sequences that actually work when you're under extreme pressure. Every command has been tested in real production outages where minutes of downtime translate to thousands of dollars in lost revenue and damaged credibility.

This isn't another generic "run kubectl describe" tutorial. These are the specific debugging workflows, refined through hundreds of 3 AM service failures, that methodically eliminate possibilities until you find the root cause. The approach works whether you're debugging a startup's single-node cluster or a Fortune 500's multi-region Kubernetes deployment running thousands of services.

Most importantly, this systematic approach prevents the panic-driven random troubleshooting that destroys careers during outages. Instead of desperately trying every kubectl command you can remember while management demands updates every five minutes, you'll have a proven methodology that consistently identifies and resolves service accessibility issues before they escalate into company-wide incidents.

The 5-Minute Service Recovery Playbook (Copy-Paste Commands That Actually Work)

Kubernetes Service Debugging

Your service is down. Users are losing their minds. Management wants answers in the next 5 minutes. You need working commands, not a fucking lecture on Kubernetes architecture.

This is the exact debugging sequence that actually identifies and fixes service accessibility problems when everything is on fire. No fluff, no theoretical bullshit about how things "should" work - just the copy-paste commands that get your services responding again before you get fired.

The 5-Minute Service Accessibility Triage

Kubernetes Service Types

Before diving into complex debugging, run this quick triage sequence to identify the most likely problem area:

Quick Triage Commands (Test These First - 5 Minutes Max)

## 1. Verify service exists and has endpoints (Kubernetes 1.21+ uses EndpointSlices by default)
kubectl get service my-service -o wide
kubectl get endpoints my-service    # Legacy - still works but deprecated
kubectl get endpointslices -l kubernetes.io/service-name=my-service -o wide  # Current method

## 2. Check if pods are actually ready (not just "Running")
kubectl get pods -l app=my-app -o wide
kubectl describe pods -l app=my-app | grep -A 10 -B 2 "Conditions:"
kubectl get pods -l app=my-app --field-selector=status.phase=Running --no-headers | wc -l  # Count Running pods (not necessarily Ready)

## 3. Test internal service connectivity using debug containers (K8s 1.25+ has enhanced kubectl debug)
kubectl run debug-pod --image=nicolaka/netshoot --rm -it --restart=Never -- bash
## Inside debug pod (netshoot has all tools pre-installed):
nslookup my-service.my-namespace.svc.cluster.local
curl -v my-service:80 --connect-timeout 5
telnet my-service 80

## 4. Check for network policies that block traffic (common in production clusters)
kubectl get networkpolicy --all-namespaces -o wide
kubectl describe networkpolicy -n my-namespace | grep -A 20 "Spec:"

Triage interpretation:

  • No endpoints / empty EndpointSlices: selector mismatch or no ready pods - start with Phase 1 below
  • Pods show Running but not Ready: readiness probes are failing - check probe config and application startup
  • DNS lookup fails from the debug pod: CoreDNS problem - jump to Step 4
  • DNS resolves but curl/telnet is refused or hangs: port mismatch or a network policy - see Steps 3 and 5

Systematic Service Debugging Workflow

Phase 1: Service and Endpoint Validation

Service Endpoint Configuration

Step 1: Verify Service Configuration

## Get detailed service information
kubectl describe service my-service

## Check service selector matches pod labels
kubectl get service my-service -o yaml | grep -A 5 selector
kubectl get pods --show-labels | grep my-app

## Verify port configuration
kubectl get service my-service -o jsonpath='{.spec.ports[*]}'

What to look for:

  • The Selector line must exactly match your pod labels (keys, values, and case all matter)
  • port is what clients connect to; targetPort must match the containerPort your application actually listens on
  • Events at the bottom of the describe output that mention endpoint or load balancer provisioning failures

Step 2: Check Endpoint Creation (Critical for Kubernetes 1.21+)

## Check legacy Endpoints objects (deprecated in favor of EndpointSlices, but still served by the core API)
kubectl get endpoints my-service -o yaml 2>/dev/null || echo "No Endpoints object - check EndpointSlices instead"

## EndpointSlices (GA since 1.21, required knowledge for modern clusters)
kubectl get endpointslices -l kubernetes.io/service-name=my-service -o yaml
kubectl get endpointslices -l kubernetes.io/service-name=my-service -o jsonpath='{.items[*].endpoints[*]}'

## Kubernetes 1.33+ enhanced EndpointSlice debugging
kubectl get endpointslices -l kubernetes.io/service-name=my-service -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.endpoints[*].addresses[*]}{" "}{.endpoints[*].conditions}{"\n"}{end}'

## Modern approach: Check EndpointSlice readiness conditions
kubectl get endpointslices -l kubernetes.io/service-name=my-service -o jsonpath='{range .items[*].endpoints[*]}{.addresses[*]}{" - Ready: "}{.conditions.ready}{"\n"}{end}'

## If no endpoints, verify pod readiness and labels simultaneously
kubectl get pods -l app=my-app -o wide --show-labels
kubectl describe pods -l app=my-app | grep -E "(Ready|Conditions)" -A 3 -B 1

Common endpoint problems that will ruin your day:

  • Empty endpoints list = selector mismatch or no ready pods (check your labels!)
  • Endpoints exist but point to wrong IPs = pod networking is fucked
  • Endpoints flapping = pod readiness probes failing intermittently (usually a timeout issue)

Step 3: Direct Pod Connectivity Testing

## Get pod IP addresses
kubectl get pods -l app=my-app -o wide

## Test direct pod connectivity (bypassing service) using [netshoot container](https://github.com/nicolaka/netshoot)
kubectl run debug-pod --image=nicolaka/netshoot --rm -it -- bash
## Inside debug pod:
curl POD-IP:8080/health  # Replace POD-IP with actual pod IP
telnet POD-IP 8080

Results interpretation (the moment of truth):

  • Direct pod connection works: Service configuration is broken, not the app
  • Direct pod connection fails: Your application or container is the problem
  • Connection refused: Wrong port or app is binding to localhost instead of 0.0.0.0

Phase 2: Network Layer Debugging

Network Policy Configuration

Step 4: DNS Resolution Testing

## Test DNS from within cluster
kubectl run dns-test --image=busybox --rm -it -- sh
## Inside test pod:
nslookup my-service
nslookup my-service.my-namespace
nslookup my-service.my-namespace.svc.cluster.local

## Check CoreDNS health
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50

DNS troubleshooting (DNS debugging guide):

  • Short name fails but the full my-service.my-namespace.svc.cluster.local works: resolv.conf search path or ndots problem
  • All lookups fail: CoreDNS pods are down, OOMKilled, or the kube-dns service has no endpoints
  • Intermittent failures: check CoreDNS logs for errors and consider scaling CoreDNS before traffic spikes

Step 5: Network Policy Investigation

## Check for [network policies](https://kubernetes.io/docs/concepts/services-networking/network-policies/) affecting your pods
kubectl get networkpolicy --all-namespaces
kubectl describe networkpolicy -n my-namespace

## Test network connectivity with policies using [nc command](https://kubernetes.io/docs/tasks/debug/debug-application/debug-service/#debugging-services)
kubectl exec debug-pod -- nc -zv my-service 80
kubectl exec debug-pod -- telnet my-service.my-namespace 80

Network policy debugging (troubleshooting guide):

  • Connection works without policies, fails with policies: Policy blocking traffic
  • Need to create explicit allow rules for pod-to-pod communication
  • Check both ingress and egress rules

Phase 3: Ingress and External Access Debugging

Ingress Controller Architecture

Step 6: Ingress Controller Health

## Check [ingress resource](https://kubernetes.io/docs/concepts/services-networking/ingress/) configuration
kubectl get ingress my-ingress -o yaml
kubectl describe ingress my-ingress

## Check [ingress controller](https://kubernetes.io/docs/concepts/services-networking/ingress-controllers/) pods
kubectl get pods -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx
kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx --tail=50

## Test ingress controller directly using [port forwarding](https://kubernetes.io/docs/tasks/access-application-cluster/port-forward-access-application-cluster/)
kubectl port-forward -n ingress-nginx service/ingress-nginx-controller 8080:80
curl -H \"Host: my-app.example.com\" localhost:8080  # Test via port-forward

Ingress troubleshooting steps:

  • Ingress has no address: Controller not creating load balancer
  • 404 errors: Host/path rules not matching requests
  • 502/503 errors: Backend service issues (return to service debugging)
  • SSL/TLS issues: Certificate or TLS configuration problems

Step 7: Load Balancer and External IP Issues (Cloud Provider Specific)

## Check external IP assignment (can take 2-15 minutes depending on provider)
kubectl get service my-service -o wide
kubectl describe service my-service | grep -A 10 Events
kubectl get service my-service -o jsonpath='{.status.loadBalancer.ingress[*]}'

## AWS EKS (Application Load Balancer Controller v2.4+ with newer annotations)
aws elbv2 describe-load-balancers --query 'LoadBalancers[?contains(LoadBalancerName, `k8s-default-myservice`)]'
kubectl get service my-service -o jsonpath='{.metadata.annotations}' | jq '."service.beta.kubernetes.io/aws-load-balancer-type"'

## Google GKE (using Google Cloud Load Balancer)
gcloud compute forwarding-rules list --filter="name~k8s.*my-service" --format="table(name,IPAddress,target)"
kubectl get service my-service -o jsonpath='{.metadata.annotations}' | jq '."cloud.google.com/load-balancer-type"'

## Azure AKS (with Azure Load Balancer integration)
az network lb list --query '[?contains(name, `kubernetes`)].{name:name,ip:frontendIpConfigurations[0].privateIpAddress}'

## Test load balancer health checks and node readiness
kubectl get nodes -o wide --show-labels | grep node-role
kubectl describe nodes | grep -E "(Conditions|Taints)" -A 5

Advanced Debugging Techniques

Container Port and Application Binding Issues

Check what ports your application is actually listening on (debugging guide):

kubectl exec -it my-pod -- netstat -tlnp
kubectl exec -it my-pod -- ss -tlnp

## Check if application binds to localhost vs 0.0.0.0
kubectl exec -it my-pod -- lsof -i :8080

Common binding problems:

  • Application binds to 127.0.0.1:8080 instead of 0.0.0.0:8080
  • Container exposes port 8080 but application listens on 3000
  • Multiple processes trying to bind to same port
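
How you fix the localhost binding depends on the framework; many read their bind address and port from environment variables, so a hedged deployment snippet (the variable names are assumptions, check your framework's docs) looks like:

containers:
- name: web-app
  image: registry.example.com/web-app:1.0.0   # placeholder image
  env:
  - name: HOST
    value: "0.0.0.0"     # bind on all interfaces, not just localhost
  - name: PORT
    value: "8080"        # keep in sync with containerPort and the service targetPort
  ports:
  - containerPort: 8080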

Service Mesh Debugging (Istio/Linkerd)

If using a service mesh, additional debugging steps:

## Check sidecar proxy status ([Istio example](https://istio.io/latest/docs/ops/diagnostic-tools/proxy-cmd/))
kubectl get pods -l app=my-app -o jsonpath='{.items[*].status.containerStatuses[*].name}'

## Check proxy configuration using [istioctl](https://istio.io/latest/docs/ops/diagnostic-tools/istioctl/)
istioctl proxy-config cluster my-pod
istioctl proxy-config listener my-pod

## Bypass the mesh for testing - injection happens at pod creation, so disable it on the pod template and let the deployment roll new pods
kubectl patch deployment my-app -p '{"spec":{"template":{"metadata":{"annotations":{"sidecar.istio.io/inject":"false"}}}}}'

Performance and Resource Debugging

When services are slow or timing out (performance troubleshooting):

## Check resource usage using [kubectl top](https://kubernetes.io/docs/reference/kubectl/generated/kubectl_top/)
kubectl top pods -l app=my-app
kubectl describe pods -l app=my-app | grep -A 5 Requests
kubectl describe pods -l app=my-app | grep -A 5 Limits

## Check for [CPU throttling](https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#how-pods-with-resource-limits-are-run)
kubectl exec -it my-pod -- cat /sys/fs/cgroup/cpu/cpu.stat | grep throttled

## Monitor connection counts
kubectl exec -it my-pod -- ss -s

Debugging Command Reference

Essential kubectl Commands for Service Issues

Reference the complete kubectl cheat sheet for additional commands.

## Service inspection using [service debugging](https://kubernetes.io/docs/tasks/debug/debug-application/debug-service/)
kubectl get svc -o wide
kubectl describe svc SERVICE-NAME
kubectl get endpoints SERVICE-NAME

## Pod and label verification using [label selectors](https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/)
kubectl get pods --show-labels
kubectl get pods -l LABEL-SELECTOR -o wide

## Network testing from inside cluster
kubectl run debug --image=nicolaka/netshoot --rm -it -- bash
kubectl run test --image=busybox --rm -it -- sh

## DNS testing using [cluster DNS](https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/)
kubectl exec POD-NAME -- nslookup SERVICE-NAME
kubectl exec POD-NAME -- dig SERVICE-NAME.NAMESPACE.svc.cluster.local

## Log analysis for [troubleshooting](https://kubernetes.io/docs/tasks/debug/debug-application/debug-running-pod/)
kubectl logs POD-NAME --previous
kubectl logs -l LABEL-SELECTOR --tail=100 -f

## Network policy inspection
kubectl get networkpolicy -o yaml
kubectl describe networkpolicy POLICY-NAME

Cloud-Specific Debugging Commands

AWS EKS:

## Check VPC and security group configuration
aws ec2 describe-security-groups --group-ids sg-xxx
aws eks describe-cluster --name CLUSTER-NAME

## Check load balancer health
aws elbv2 describe-target-groups --target-group-arns arn:aws:...
aws elbv2 describe-target-health --target-group-arn arn:aws:...

Google GKE:

## Check firewall rules
gcloud compute firewall-rules list --filter=\"name~my-service\"

## Check load balancer configuration
gcloud compute backend-services list
gcloud compute forwarding-rules list

Quick Fix Solutions for Common Problems

Fix 1: Selector Mismatch

## Update service selector to match pod labels
kubectl patch service my-service -p '{"spec":{"selector":{"app":"correct-label"}}}'

## Or update pod labels to match service
kubectl label pods -l app=old-label app=new-label --overwrite

Fix 2: Port Configuration Error

## Update service target port
kubectl patch service my-service -p '{"spec":{"ports":[{"port":80,"targetPort":8080}]}}'

Fix 3: Network Policy Allow Rule

## Create allow rule for service communication
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-my-service
spec:
  podSelector:
    matchLabels:
      app: my-app
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: client-app
    ports:
    - protocol: TCP
      port: 8080

Fix 4: Readiness Probe Adjustment

## Temporarily disable readiness probe for testing
kubectl patch deployment my-app -p '{"spec":{"template":{"spec":{"containers":[{"name":"my-container","readinessProbe":null}]}}}}'

## Update readiness probe timeout
kubectl patch deployment my-app -p '{"spec":{"template":{"spec":{"containers":[{"name":"my-container","readinessProbe":{"timeoutSeconds":10}}]}}}}'

This systematic approach provides the foundation for resolving service accessibility issues. For additional debugging techniques, consult the kubectl debug documentation, networking troubleshooting guide, and cluster debugging resources. Advanced users should also review the service mesh debugging guides and CNI plugin troubleshooting documentation. The Kubernetes debugging flowchart provides a visual reference for systematic troubleshooting.

These systematic debugging approaches handle 85% of service accessibility crises efficiently. But production systems are creative assholes that love combining multiple failures in ways that make senior engineers cry into their energy drinks at 3 AM. The methodology above handles the straightforward cases, but what about when nothing makes sense?

What happens when the obvious fixes don't work? When multiple things break simultaneously? When you're 45 minutes into an outage and still don't know what's wrong while the CEO is asking increasingly pointed questions about system reliability?

The FAQ section addresses the panic-inducing questions you'll inevitably Google during high-stress debugging sessions. These are the "oh shit" scenarios where systematic approaches meet chaotic production realities, and you need rapid answers to keep systems breathing while executives hover menacingly over your shoulder.

Service Debugging FAQ (The Panic-Googling Questions)

Q

Why does `kubectl get service` show my service but I can't connect to it?

A

Your service exists but likely has no endpoints. Check with kubectl get endpoints my-service - if it shows <none>, your service selector doesn't match any pod labels.

Quick fix sequence:

  1. kubectl describe service my-service | grep Selector - get service selector
  2. kubectl get pods --show-labels - check pod labels
  3. If they don't match: kubectl patch service my-service -p '{"spec":{"selector":{"app":"correct-label"}}}'

Most common cause: Copy-paste errors when creating services from templates, or pod labels changed during deployment updates.

Q

My pods show "Running" but the service still returns 503 errors - what's wrong?

A

Running doesn't mean ready. Check if your pods are actually ready to serve traffic:

kubectl get pods -o wide
## Look at the READY column - should show 1/1, not 0/1

kubectl describe pod my-pod | grep -A 5 Conditions
## Look for "Ready: True" condition

If pods aren't ready: Your readiness probe is failing. Common causes:

  • Probe timeout too short (app takes time to start)
  • Wrong probe endpoint or port
  • Database connectivity required for health check but DB is slow

Quick fix: Temporarily disable readiness probe: kubectl patch deployment my-app -p '{"spec":{"template":{"spec":{"containers":[{"name":"my-container","readinessProbe":null}]}}}}'

Q

Why can't I reach my service from outside the cluster?

A

ClusterIP services (the default) are only accessible from within the cluster. For external access, you need:

  • LoadBalancer: kubectl patch service my-service -p '{"spec":{"type":"LoadBalancer"}}'
  • NodePort: kubectl patch service my-service -p '{"spec":{"type":"NodePort"}}'
  • Ingress: Create an ingress resource with proper hostname/path rules

Check your current service type: kubectl get service my-service -o jsonpath='{.spec.type}'

If you have a LoadBalancer service but no external IP, check cloud provider quotas and permissions.

Q

My service worked yesterday but stopped working today - what changed?

A

Most likely culprits:

  1. Network policies were added: kubectl get networkpolicy --all-namespaces
  2. Pods were updated with different labels: kubectl get pods --show-labels | grep my-app
  3. DNS issues: kubectl get pods -n kube-system -l k8s-app=kube-dns
  4. Node problems: kubectl get nodes (look for NotReady status)

Quick diagnosis: Compare current configuration with working state:

kubectl get service my-service -o yaml > current-service.yaml
kubectl get pods -l app=my-app --show-labels > current-pods.txt
## Compare with your last working configuration
Q

Why does my service only work sometimes (intermittent failures)?

A

Classic signs of intermittent issues:

  1. Some pods are healthy, others aren't: kubectl get pods -l app=my-app - look for mixed Ready states
  2. DNS caching: Old DNS entries point to dead pods - kubectl delete pods -n kube-system -l k8s-app=kube-dns (restarts CoreDNS)
  3. Load balancer health checks failing: Some backend pods fail health checks, get removed from rotation
  4. Network policy edge cases: Policies work for some connections but not others

Debug intermittent issues (production-tested approach):

## Test connectivity pattern over time (modern approach with better networking tools)
for i in {1..20}; do
  kubectl run connectivity-test-$i --image=nicolaka/netshoot --rm -i --restart=Never \
    -- timeout 10 curl -s -w "Response: %{http_code}, Time: %{time_total}s\n" \
    my-service.my-namespace.svc.cluster.local:80 || echo "Attempt $i: Connection failed"
  sleep 2
done

## Advanced: Check EndpointSlice stability during intermittent failures
kubectl get endpointslices -l kubernetes.io/service-name=my-service -w &
## Let it run while you test connections, then kill with Ctrl+C
Q

How do I debug "connection refused" vs "connection timeout" errors?

A

Connection Refused = port is closed or service not listening

  • Wrong port configuration between service and container
  • Application not binding to 0.0.0.0 (binding to localhost instead)
  • Process not running or crashed

Connection Timeout = network routing problem

  • Network policies blocking traffic
  • Firewall rules (cloud provider security groups)
  • Pod-to-pod networking issues (CNI plugin problems)
  • DNS resolution extremely slow

Debug approach:

## Test direct pod connectivity
kubectl exec -it debug-pod -- telnet POD-IP 8080
## Refused = app problem, Timeout = network problem

## Check what's listening in the container
kubectl exec -it my-pod -- netstat -tlnp
kubectl exec -it my-pod -- ss -tlnp
Q

Why doesn't my ingress work even though the service is accessible internally?

A

Ingress debugging hierarchy:

  1. Check ingress controller: kubectl get pods -n ingress-nginx
  2. Verify ingress resource: kubectl describe ingress my-ingress
  3. Test host header: curl -H "Host: my-app.example.com" http://EXTERNAL-IP/
  4. Check backend service: kubectl get service my-service (should match ingress backend)

Common ingress problems:

  • No external IP: Load balancer not created or cloud provider quotas exceeded
  • 404 errors: Host/path rules don't match your requests
  • 502/503 errors: Backend service issues (go back to service debugging)
  • SSL/TLS errors: Certificate not properly configured
Q

What does "no endpoints available for service" mean and how do I fix it?

A

This error means your service has no healthy pods to route traffic to.

Debugging steps:

## Check if you have any pods at all
kubectl get pods -l app=my-app

## Check if pods are ready
kubectl get pods -l app=my-app -o wide

## Check service selector
kubectl describe service my-service | grep Selector

## Compare service selector with pod labels
kubectl get pods --show-labels | grep my-app

Most common fixes:

  • Update service selector to match pod labels
  • Fix readiness probes so pods become ready
  • Scale deployment to create pods if none exist
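
The matching one-liners (names are placeholders):

kubectl patch service my-service -p '{"spec":{"selector":{"app":"my-app"}}}'   # point the selector at the right labels
kubectl scale deployment my-app --replicas=3                                   # create pods if none exist
kubectl rollout status deployment my-app --timeout=120s                        # wait for them to become Ready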
Q

How do I test service connectivity without affecting production traffic?

A

Safe testing approaches:

## Method 1: Temporary debug pod
kubectl run debug-pod --image=nicolaka/netshoot --rm -it -- bash

## Method 2: Debug container (K8s 1.25+)
kubectl debug my-pod -it --image=nicolaka/netshoot --target=my-container

## Method 3: Port forward for external testing
kubectl port-forward service/my-service 8080:80
curl localhost:8080  # Test via port-forward (localhost)

## Method 4: Service mesh traffic splitting (if using Istio)
kubectl apply -f virtual-service-canary.yaml

Testing from inside containers:

  • curl for HTTP services
  • telnet for TCP connectivity testing
  • nslookup/dig for DNS resolution testing
  • nc -zv for port connectivity testing
Q

Why do my services work in development but fail in production?

A

Environment differences that break services:

  1. Network policies: Dev clusters often have permissive policies, prod has restrictive ones
  2. Resource limits: Prod has stricter CPU/memory limits affecting readiness probes
  3. Node configuration: Different CNI plugins, firewall rules, or cloud provider settings
  4. Scale differences: Issues only appear under load (connection pool exhaustion, DNS limits)
  5. Security contexts: Prod runs as non-root user, dev runs as root

Compare environments:

## Check network policies
kubectl get networkpolicy --all-namespaces

## Check resource limits
kubectl describe pods -l app=my-app | grep -A 5 Limits

## Check security context
kubectl get pods my-pod -o jsonpath='{.spec.securityContext}'

## Check node configuration
kubectl describe nodes | grep -A 10 System
Q

What are the most important kubectl commands for service debugging?

A

The essential debugging sequence (updated for Kubernetes 1.29+):

## 1. Basic service health (use modern EndpointSlices)
kubectl get service my-service -o wide
kubectl get endpointslices -l kubernetes.io/service-name=my-service  # Modern method
kubectl describe service my-service | grep -A 10 -E "(Selector|Endpoints|Events)"

## 2. Pod health and labels (verify readiness conditions)
kubectl get pods -l app=my-app -o wide --show-labels
kubectl get pods -l app=my-app -o jsonpath='{range .items[*]}{.metadata.name}{": Ready="}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}'

## 3. Network testing (use netshoot for comprehensive tools)
kubectl run debug-$(date +%s) --image=nicolaka/netshoot --rm -it --restart=Never -- bash
## Inside netshoot container:
## nslookup my-service.my-namespace.svc.cluster.local
## curl -v --connect-timeout 10 my-service:80

## 4. Detailed investigation (combine logs and events)
kubectl describe pods -l app=my-app | grep -E "(Events|Conditions)" -A 10
kubectl logs -l app=my-app --tail=100 --timestamps=true
kubectl get events --field-selector reason!=Pulled --sort-by='.lastTimestamp' | tail -10

## 5. Network policies and advanced networking
kubectl get networkpolicy --all-namespaces -o wide
kubectl describe networkpolicy -n my-namespace | grep -A 15 "Spec:"

Pro tip: Save these commands in a shell script for quick debugging during outages.
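
Taking that pro tip literally, a minimal sketch of such a script (file name and arguments are assumptions):

#!/usr/bin/env bash
## usage: ./debug-service.sh <service> <namespace> <label-selector>, e.g. ./debug-service.sh my-service prod app=my-app
set -euo pipefail
SVC=$1; NS=$2; SELECTOR=$3

kubectl -n "$NS" get service "$SVC" -o wide
kubectl -n "$NS" get endpointslices -l "kubernetes.io/service-name=$SVC"
kubectl -n "$NS" get pods -l "$SELECTOR" -o wide --show-labels
kubectl -n "$NS" describe service "$SVC" | grep -A 10 -E "(Selector|Endpoints|Events)"
kubectl -n "$NS" logs -l "$SELECTOR" --tail=50 --timestamps=true
kubectl get networkpolicy --all-namespaces -o wide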

Q

How long should I wait before declaring a service accessibility issue?

A

Timing for different scenarios (updated for Kubernetes 1.33+ and 2025 cloud provider speeds):

  • Pod startup issues: Wait 2-5 minutes for application initialization (longer for JVM applications)
  • DNS propagation: Wait 30-60 seconds for CoreDNS updates (node-local DNS cache adds 10-30s delay)
  • Load balancer provisioning: AWS ALB (3-5 min), GCP GLB (2-4 min), Azure ALB (5-8 min) in 2025
  • Network policy changes: Effect should be immediate, but CNI plugins may take 5-15 seconds
  • Ingress controller updates: NGINX (30-60s), Traefik (10-30s), Gateway API controllers (60-120s)

When NOT to wait (immediate investigation required):

  • Connection refused errors (immediate network/port issue)
  • Service selector mismatches (immediate configuration issue)
  • Missing endpoints (immediate pod readiness issue)
  • HTTP 5xx errors from ingress (backend is fundamentally broken)
  • Pod CrashLoopBackOff status (application startup is failing)

Set realistic timeouts for modern Kubernetes clusters:

  • Application health checks: 30-60 seconds (increase to 120s for resource-constrained clusters)
  • Service discovery: 2-5 minutes for new services (EndpointSlice controllers faster than legacy Endpoints)
  • External load balancer: 3-8 minutes for cloud provider provisioning (varies by region and provider SLA)
  • Certificate provisioning: 2-5 minutes for cert-manager with HTTP01, 30-60 seconds for DNS01
Q

Why does my service work with port-forward but not through ingress?

A

This classic problem reveals a fundamental misunderstanding of Kubernetes networking layers:

Port-forward bypasses everything: When you run kubectl port-forward service/my-service 8080:80, traffic flows directly from your machine to the service, bypassing:

  • Ingress controllers completely
  • Load balancer health checks
  • TLS termination
  • Host header routing rules
  • Path-based routing rules

Debug sequence for ingress-only failures:

## 1. Test the service layer directly (should work if port-forward works)
kubectl run debug-pod --image=nicolaka/netshoot --rm -it -- curl my-service.my-namespace.svc.cluster.local:80

## 2. Test ingress controller directly (bypass external load balancer)
kubectl port-forward -n ingress-nginx service/ingress-nginx-controller 8080:80
curl -H "Host: my-domain.com" localhost:8080/my-path

## 3. Check ingress resource configuration
kubectl describe ingress my-ingress | grep -A 10 -B 5 "Rules\|Backend"

## 4. Check ingress controller logs for routing errors
kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx --tail=100 | grep "my-domain.com"

Most common ingress-specific failures:

  • Host header doesn't match ingress rules (curl example.com vs ingress expecting app.example.com)
  • Path routing fails (/api/v1/users doesn't match /api/ prefix rule)
  • Backend service name/port mismatch in ingress spec
  • TLS issues (certificate not matching hostname, missing TLS configuration)
  • Ingress controller itself is unhealthy (pods restarting due to config parse errors)
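
For reference, a minimal ingress sketch (hostname, path, and backend names are placeholders) showing the three things that must line up: the host rule, the path prefix, and the backend service/port.

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-ingress
spec:
  ingressClassName: nginx
  rules:
  - host: app.example.com         # must match the Host header clients actually send
    http:
      paths:
      - path: /api
        pathType: Prefix          # /api/v1/users matches; /apiv1 does not
        backend:
          service:
            name: my-service      # must exist in the same namespace as the Ingress
            port:
              number: 80          # the service's port, not its targetPort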

These FAQs handle the immediate "oh fuck" moments when everything is broken and you need answers fast. But once you've got services breathing again, the real engineering skill is choosing the right debugging strategy from the start - before you're 30 minutes into an outage trying random approaches.

The strategic decision framework that follows helps you cut through the chaos and pick the debugging method most likely to succeed for your specific situation. Instead of frantically trying every kubectl command you can remember, you'll know which approach gives you the best chance of rapid resolution based on your problem type, infrastructure, and time constraints.

This systematic approach to debugging strategy selection is what separates junior engineers frantically Googling solutions from senior engineers who methodically diagnose and resolve issues. It's the difference between emergency firefighting and professional incident response.

Kubernetes Service Debugging Methods Comparison

| Problem Type | kubectl describe | Direct Pod Testing | Network Tools | DNS Testing | Best First Step |
|---|---|---|---|---|---|
| No Endpoints | ✅ Excellent - shows selector/labels | ❌ No pods to test | ❌ Nothing to connect to | ❌ Service exists but empty | kubectl describe service |
| Selector Mismatch | ✅ Perfect - shows selector vs pod labels | ❌ Irrelevant for config issues | ❌ Won't show config problems | ❌ DNS works, routing doesn't | kubectl get pods --show-labels |
| Port Configuration Error | ⚠️ Shows ports but not application binding | ✅ Direct pod test reveals real ports | ✅ netstat/ss shows listening ports | ❌ DNS fine, connection fails | kubectl exec pod -- netstat -tlnp |
| Readiness Probe Failures | ✅ Excellent - shows probe status | ⚠️ Direct test bypasses readiness | ❌ Network tools won't show health | ❌ DNS works but no traffic | kubectl describe pod |
| Network Policy Blocking | ❌ Doesn't show network restrictions | ✅ Perfect for testing connectivity | ✅ telnet/nc show connection blocks | ❌ DNS fine, connection blocked | kubectl exec pod -- nc -zv target |
| DNS Resolution Issues | ⚠️ May show some DNS errors | ❌ IP-based tests bypass DNS | ❌ Network tools use IPs | ✅ Perfect for DNS problems | nslookup from debug pod |
| Ingress Controller Issues | ⚠️ Shows ingress config, not behavior | ✅ Tests backend service directly | ⚠️ Limited for L7 load balancing | ⚠️ DNS may work, routing fails | kubectl port-forward ingress |
| Load Balancer Problems | ⚠️ Shows LB config, not health | ✅ Tests backend bypassing LB | ⚠️ Can test LB endpoints | ❌ Internal DNS fine, external fails | Cloud provider CLI tools |
