Kubernetes Network Troubleshooting: AI-Optimized Knowledge Base
Critical Failure Scenarios and Consequences
CNI Plugin Failures - Cluster-Wide Impact
Symptoms:
- Nodes stuck in NotReady with "CNI plugin not initialized"
- Pods stuck in Pending forever
- Random connectivity drops causing service degradation
- failed to create pod sandbox errors in kubelet logs
Critical Consequence: The entire cluster becomes unusable and all new pod deployments fail
Root Causes with Business Impact:
- CIDR Conflicts: Pod network overlaps with node/service networks → Complete cluster failure during weekend deployments
- Version Mismatches: CNI plugin incompatible with Kubernetes version → Silent failures that manifest under load
- IP Exhaustion: Insufficient CIDR allocation → Service unavailability during traffic spikes
Real-World Failure Example: Black Friday 2021 - an AWS us-east-1 outage forced a failover that stalled because the /24 pod CIDR provides only 254 usable IPs and 500 pods were needed. Checkout was down for 3 hours.
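To catch this class of failure before a peak event, check how the address space is actually carved up (a sketch; the cluster-cidr grep assumes a kubeadm-style control plane, and the kubernetes Service lookup only hints at the service range):
# Per-node pod CIDR allocations - a /24 per node caps that node at 254 usable pod IPs
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.podCIDR}{"\n"}{end}'
# Cluster-wide pod CIDR (shows up as a controller-manager flag on kubeadm-style clusters)
kubectl cluster-info dump | grep -m1 -- --cluster-cidr
# First service IP hints at the service CIDR - verify it doesn't overlap pod or node ranges
kubectl get svc kubernetes -o jsonpath='{.spec.clusterIP}{"\n"}'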
DNS Resolution Failures - Application-Level Breakdown
Symptoms:
- Service resolution works 70% of the time (intermittent failures)
- nslookup kubernetes.default returns SERVFAIL
- Apps can't find services that exist
- DNS works from some pods but not others
Critical Consequence: The microservices architecture becomes unreliable, with cascading failures across service dependencies
Resource Starvation Impact: Default CoreDNS limits (100m CPU) cause DNS throttling under any real load → Application timeouts and user-facing errors
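Before blaming the application, confirm CoreDNS is actually starved (a sketch; assumes metrics-server for kubectl top and the standard k8s-app=kube-dns label):
# Is CoreDNS pinned at its CPU limit or restarting from OOM kills?
kubectl top pods -n kube-system -l k8s-app=kube-dns
kubectl get pods -n kube-system -l k8s-app=kube-dns   # watch the RESTARTS column
# Upstream "i/o timeout" entries in the logs are the classic throttling signature
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=100 | grep -i "i/o timeout"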
Configuration That Actually Works in Production
CoreDNS Resource Requirements
Default Settings That Fail:
- CPU: 100m (insufficient for production)
- Memory: 170Mi (causes OOM under load)
Production-Tested Configuration:
- CPU: 500m minimum (handles real traffic)
- Memory: 512Mi minimum (prevents OOM kills)
Implementation:
kubectl patch deployment coredns -n kube-system -p '{"spec":{"template":{"spec":{"containers":[{"name":"coredns","resources":{"limits":{"cpu":"500m","memory":"512Mi"}}}]}}}}'
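A fuller variant, sketched here with illustrative request values (they are assumptions, not CoreDNS defaults): setting requests as well as limits makes the scheduler reserve the capacity, and extra replicas keep DNS up through a single OOM kill.
kubectl -n kube-system patch deployment coredns --type=strategic -p '
spec:
  template:
    spec:
      containers:
      - name: coredns
        resources:
          requests: {cpu: "250m", memory: "256Mi"}   # illustrative starting point
          limits: {cpu: "500m", memory: "512Mi"}
'
# More replicas so a single OOM kill or node drain does not take DNS down
kubectl -n kube-system scale deployment coredns --replicas=3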
Network Policy Configuration That Doesn't Break Everything
Critical Understanding: Adding ANY network policy to a namespace changes default from "allow all" to "deny all" for selected pods.
Essential DNS Policy (Must-Have):
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns
  namespace: your-namespace
spec:
  podSelector: {}
  policyTypes:
  - Egress
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
      podSelector:
        matchLabels:
          k8s-app: kube-dns
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53
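The namespace selector uses the kubernetes.io/metadata.name label that Kubernetes sets on every namespace automatically, so it works without hand-labeling kube-system. Before shipping anything else, prove the policy behaves as intended (a sketch; nicolaka/netshoot is assumed because it bundles curl and nslookup):
kubectl -n your-namespace run np-test --image=nicolaka/netshoot --restart=Never -- sleep 600
kubectl -n your-namespace wait --for=condition=Ready pod/np-test --timeout=60s
kubectl -n your-namespace exec np-test -- nslookup kubernetes.default   # should still resolve
kubectl -n your-namespace exec np-test -- curl -s --max-time 5 https://example.com && echo "egress NOT blocked" || echo "non-DNS egress blocked as expected"
kubectl -n your-namespace delete pod np-test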
Decision-Support Information
CNI Plugin Comparison with Operational Reality
CNI Plugin | Complexity | Failure Rate | Debug Difficulty | Production Readiness |
---|---|---|---|---|
Flannel | Low | Medium | Easy | Good for simple setups |
Calico | Medium | Low | Medium | Best for scale/policies |
Cilium | High | Medium | Very Hard | Powerful but complex |
Calico Trade-offs:
- Worth it despite: BGP complexity and debugging requirements
- Hidden cost: Requires network engineering expertise
- Breaking point: BGP session failures cause cross-node communication loss
Cilium Trade-offs:
- Worth it despite: eBPF debugging complexity requiring kernel knowledge
- Performance benefit: Only solution handling 50k+ pods without performance degradation
- Hidden cost: Requires deep Linux networking expertise
Service Mesh Decision Matrix
Istio Implementation Reality:
- Time investment: 3-6 months to operational maturity
- Expertise required: Deep Envoy and mTLS knowledge
- Common failure: Sidecar injection breaks 20% of deployments initially
- Performance impact: 10-15% latency increase, 200MB memory overhead per pod
Linkerd Comparison:
- Easier than: Istio configuration and debugging
- Harder than: Basic Kubernetes networking
- Sweet spot: Teams wanting service mesh benefits without Istio complexity
Critical Warnings and Operational Intelligence
What Official Documentation Doesn't Tell You
Kubernetes 1.25 Changes:
- Default CNI timeout increased from 10s to 30s
- Hidden impact: Masks real connection issues by making them appear successful
- Debugging implication: Timeouts that would fail fast now hang for 30s
Flannel and Platform-Specific Issues:
- Flannel 0.15.1: Corrupts routing tables on node restart (avoid completely)
- GKE Default CIDR: 10.0.0.0/14 conflicts with most corporate VPNs
- Production impact: Engineering team VPN access blocked after cluster upgrades
Network Policy Production Failures
Three Documented Production Outages:
August 2023 - "Defense in depth" deployment
- Cause: A single ingress policy deployed Friday at 4:30 PM
- Impact: Frontend couldn't reach the database; nobody noticed until Monday
- Root cause: Adding the policy flipped selected pods to deny-all, blocking database connections
- Resolution time: 4 hours (team assumed DNS and debugged the wrong layer)
Black Friday scaling failure
- Cause: /24 CIDR allocated for 500-pod requirement
- Impact: Checkout system down 3 hours during peak traffic
- Prevention: CIDR planning based on worst-case scaling scenarios
Network policy label mismatch
- Cause: Policy selectors matched the labels of the original pods rather than a stable service identity, so newly scaled pods didn't match
- Impact: Nightly batch processing blocked when pods scaled 10→200
- Hidden cost: 3 weeks of debugging because management wouldn't approve downtime for a fix
Common Misconceptions That Cause Failures
"Zero Network Policies = Secure Default"
- Reality: Zero policies = everything allowed
- Trap: Adding one policy = deny-all for non-matching traffic
- Fix timing: Plan comprehensive policies and deploy them atomically (see the default-deny sketch below)
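A minimal sketch of that atomic rollout: an explicit default-deny committed alongside the allow-dns policy above and applied in a single kubectl apply -f, so the deny-all behavior never lands without its allowances (the namespace is a placeholder):
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: your-namespace
spec:
  podSelector: {}        # selects every pod in the namespace
  policyTypes:
  - Ingress
  - Egress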
"Default Resource Limits Are Production-Ready"
- Reality: CoreDNS 100m CPU fails under any real load
- Impact: DNS throttling appears as application bugs
- Fix: 5x default limits minimum for production
"CNI Plugins Are Interchangeable"
- Reality: Each has specific failure modes and debugging requirements
- Migration cost: Complete cluster rebuild often required
- Expertise transfer: Team knowledge doesn't transfer between CNIs
Diagnostic Procedures with Time Investment
Systematic Network Debugging (15-30 minutes)
Layer-by-layer diagnosis approach:
CNI Health Check (2 minutes)
kubectl get nodes -o wide
kubectl describe nodes | grep Ready
Basic Connectivity Test (3 minutes)
kubectl run test --image=nicolaka/netshoot --restart=Never -- sleep 3600
kubectl wait --for=condition=Ready pod/test --timeout=60s
kubectl exec test -- ping -c 3 8.8.8.8
DNS Verification (2 minutes)
kubectl exec test -- nslookup kubernetes.default
Service Routing Test (3 minutes)
kubectl exec test -- curl -s service-name:8080
External Access Verification (5 minutes)
kubectl exec test -- curl -sI https://your-domain.com
kubectl delete pod test   # clean up when done
Time-saving rule: 90% of issues are found in steps 1-3; don't jump straight to complex debugging. A scripted version of the five checks follows.
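The five checks wrapped as one script for repeatability (a sketch: nicolaka/netshoot, SERVICE, and DOMAIN are assumptions/placeholders, not part of the original procedure):
#!/usr/bin/env bash
set -euo pipefail   # stop at the first failing layer - that's the layer to debug
SERVICE="${SERVICE:-service-name:8080}"   # placeholder - set to a real service:port
DOMAIN="${DOMAIN:-your-domain.com}"       # placeholder - set to a real external domain

echo "== 1. CNI health =="
kubectl get nodes -o wide

echo "== 2-5. in-cluster checks =="
kubectl run nettest --image=nicolaka/netshoot --restart=Never -- sleep 600
kubectl wait --for=condition=Ready pod/nettest --timeout=60s
kubectl exec nettest -- ping -c 3 8.8.8.8
kubectl exec nettest -- nslookup kubernetes.default
kubectl exec nettest -- curl -s --max-time 10 -o /dev/null -w 'service HTTP %{http_code}\n' "http://$SERVICE"
kubectl exec nettest -- curl -sI --max-time 10 "https://$DOMAIN" | head -1
kubectl delete pod nettest --wait=false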
CNI-Specific Debugging Time Investment
Calico Issues (30-60 minutes):
- BGP Status Check: 5 minutes with calicoctl
- IP Allocation Debug: 10 minutes understanding IPAM
- Policy Troubleshooting: 45 minutes for complex scenarios
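The commands behind those estimates, assuming calicoctl is installed and pointed at the cluster datastore (a sketch):
calicoctl node status                # run on a node; BGP peers should show "Established"
calicoctl get ippool -o wide         # which CIDRs Calico allocates from
calicoctl ipam show --show-blocks    # per-node address blocks - spot exhaustion or skew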
Cilium Issues (2-4 hours):
- eBPF Program Analysis: Requires kernel debugging skills
- Policy Tracing: Complex evaluation logic
- Performance Impact: Often requires cluster-level changes
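If you do have to go there, the usual starting points run inside the Cilium agent itself (a sketch; assumes the default DaemonSet name cilium in kube-system - the in-agent binary is cilium in older releases and cilium-dbg in newer ones):
kubectl -n kube-system exec ds/cilium -- cilium status              # agent and eBPF health
kubectl -n kube-system exec ds/cilium -- cilium endpoint list       # per-pod endpoint and policy state
kubectl -n kube-system exec ds/cilium -- cilium monitor --type drop # live view of dropped packets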
Resource Requirements and Expertise Costs
Human Time Investment by Problem Type
Problem Category | Initial Diagnosis | Full Resolution | Expertise Required |
---|---|---|---|
DNS Throttling | 5 minutes | 15 minutes | Basic kubectl |
Network Policy | 10 minutes | 2 hours | Label selector understanding |
CNI Failures | 30 minutes | 4 hours | Network engineering |
Service Mesh | 1 hour | 8 hours | Deep proxy knowledge |
Skill Prerequisites Not in Documentation
Network Policy Debugging:
- Required: Deep understanding of label selectors and namespace behavior
- Time to competency: 2-3 production incidents
- Common gap: Developers don't understand Kubernetes networking defaults
CNI Troubleshooting:
- Required: Linux networking, routing tables, iptables
- Time to competency: 6 months production experience
- Common gap: Cloud engineers lack on-premises networking knowledge
Service Mesh Operations:
- Required: TLS, proxy configuration, observability tools
- Time to competency: 3-6 months dedicated focus
- Common gap: Application developers lack infrastructure knowledge
Breaking Points and Failure Modes
Scale-Related Network Failures
CNI Performance Limits:
- Flannel: 100-200 nodes before the VXLAN overlay and route distribution become unstable
- Calico: 1000+ nodes with proper BGP configuration
- Cilium: 5000+ nodes but requires eBPF expertise
DNS Performance Breakdown:
- CoreDNS: Becomes bottleneck at 500+ QPS with default limits
- Symptom: Intermittent resolution failures under load
- Fix cost: Resource tuning (easy) vs DNS caching architecture (complex)
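To see whether you are anywhere near that range, the standard CoreDNS deployment exposes Prometheus metrics on port 9153 (a sketch; run the two commands in separate terminals and sample the counter twice to estimate QPS):
kubectl -n kube-system port-forward deployment/coredns 9153:9153
# in a second terminal: total queries served - sample twice and diff to estimate QPS
curl -s http://localhost:9153/metrics | grep '^coredns_dns_requests_total'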
Network Policy Complexity Limits
Management Overhead:
- 10-20 policies: Manageable with documentation
- 50+ policies: Requires automation and testing
- 100+ policies: Policy conflicts become undebuggable
Real-world breaking point: Teams abandon network policies after the third production outage caused by policy interactions
Tools Effectiveness Matrix
Debugging Tool Selection by Problem Type
Tool | Basic Connectivity | DNS Issues | Policy Debug | CNI Problems | Time to Result |
---|---|---|---|---|---|
kubectl logs | Limited | Good | Poor | Poor | 30 seconds |
kubectl describe | Good | Limited | Good | Good | 1 minute |
netshoot pod | Excellent | Excellent | Good | Limited | 2 minutes |
calicoctl | Poor | Poor | Excellent | Excellent | 5 minutes |
tcpdump | Excellent | Good | Poor | Excellent | 10 minutes |
Cost-Benefit Analysis of Debugging Approaches
Quick Wins (5-15 minutes):
- kubectl logs and describe commands
- Basic connectivity tests with busybox
- Resource limit verification
Medium Investment (30-60 minutes):
- Network policy analysis
- CNI-specific tooling
- Service mesh configuration review
Deep Debugging (2+ hours):
- Packet capture analysis
- eBPF program inspection
- Multi-cluster networking issues
ROI Guidance: Start with quick wins, escalate only when basic approaches fail
Migration and Change Management
Version Upgrade Risks
Kubernetes Version Changes:
- 1.24→1.25: CNI timeout behavior change masks issues
- 1.25→1.26: Network policy evaluation order changes
- Impact: Silent failures appearing weeks after upgrade
CNI Plugin Migrations:
- Flannel→Calico: Requires complete cluster rebuild
- Calico→Cilium: IP pool migration complexity
- Time investment: 2-4 weeks planning, 1 week execution
Operational Maturity Stages
Stage 1 - Basic Operations (0-6 months):
- Can debug DNS and basic connectivity
- Understands service networking
- Avoids network policies
Stage 2 - Intermediate (6-18 months):
- Deploys simple network policies safely
- Debugs CNI-specific issues
- Handles routine networking problems
Stage 3 - Advanced (18+ months):
- Designs complex network architectures
- Debugs service mesh issues
- Handles multi-cluster networking
Acceleration factors: Production incidents provide 10x learning rate compared to lab environments
Emergency Response Procedures
Network Policy Emergency Recovery
Immediate Action (1 minute):
kubectl delete networkpolicies --all -n affected-namespace
Verification (2 minutes):
kubectl run test --image=nicolaka/netshoot --rm -it --restart=Never -- curl -s --max-time 5 http://service-name:8080
Root Cause Analysis (15 minutes):
- Review policy selectors and namespace labels
- Test policy application with temporary pods
- Document policy interactions for future prevention
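For that root-cause step, the fastest check is comparing each policy's selector against the labels actually on your pods (a sketch; <policy-name> is a placeholder):
kubectl get networkpolicy -n affected-namespace         # POD-SELECTOR column shows what each policy targets
kubectl get pods -n affected-namespace --show-labels    # the labels those selectors must match
kubectl get networkpolicy <policy-name> -n affected-namespace -o jsonpath='{.spec.podSelector}{"\n"}{.spec.policyTypes}{"\n"}'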
CNI Failure Recovery
Emergency Pod Restart (2 minutes):
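# Label and namespace depend on how the CNI was installed - confirm first with:
# kubectl get pods -A -o wide | grep -iE 'flannel|calico|cilium'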
kubectl delete pods -n kube-system -l k8s-app=flannel
# Or for other CNIs:
kubectl rollout restart ds/calico-node -n calico-system
Node-Level Recovery (5 minutes):
# Check and restart kubelet if needed
systemctl status kubelet
systemctl restart kubelet
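# If kubelet is healthy but pods still report "failed to create pod sandbox",
# check its logs for CNI errors (assumes systemd/journald):
journalctl -u kubelet --since "15 min ago" --no-pager | grep -iE "cni|sandbox"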
Full CNI Reinstall (30 minutes):
- Deploy CNI manifests
- Verify node network configuration
- Test pod-to-pod connectivity
Quality and Support Indicators
Community and Vendor Support Quality
Calico/Tigera:
- Documentation quality: Excellent technical depth
- Community response: 24-48 hours for complex issues
- Enterprise support: Available with SLA guarantees
Cilium/Isovalent:
- Documentation quality: Good but assumes advanced knowledge
- Community response: Variable, depends on complexity
- Enterprise support: Required for production deployments
Flannel:
- Documentation quality: Basic, often outdated
- Community response: Slow, limited maintainer availability
- Enterprise support: None available
Tool Reliability Assessment
Production-Ready Tools:
- netshoot: Consistently reliable across environments
- calicoctl: Stable API, good backward compatibility
- kubectl: Core functionality stable, extensions variable
Experimental/Risky Tools:
- cilium CLI: Rapid development, breaking changes
- Custom network tools: Environment-specific reliability
- Alpha networking features: Not production suitable
This knowledge base provides the operational intelligence needed for AI systems to make informed decisions about Kubernetes networking troubleshooting, including understanding failure modes, resource requirements, and the real-world costs of different approaches.
Useful Links for Further Investigation
Resources That Don't Suck
Link | Description |
---|---|
netshoot | The only debugging container worth a damn. Has everything: tcpdump, curl, dig, netstat. I've used this on every cluster since 2019. |
k9s | Better than kubectl for debugging. Real-time updates, easy navigation. Makes finding broken pods less painful. |
stern | Tail logs from multiple pods at once. Essential when you're trying to figure out which pod is actually broken. |
Kubernetes Service Debug Guide | Actually useful step-by-step troubleshooting. Skip the first three results on Google, use this instead. |
Network Policies | The official docs are dry but accurate. Better than blog posts that are wrong half the time. |
Calico Troubleshooting | Comprehensive but the search function sucks. `calicoctl` commands actually work. |
Flannel GitHub Issues | More useful than their docs. Real people solving real problems. |
Kubernetes Community Discuss | Less corporate than official docs. People actually share war stories and working solutions. |
Stack Overflow | Hit or miss, but sometimes you find the exact error message you're seeing. Avoid answers from 2018, they're all wrong now. |