Kubernetes Network Policies: AI-Optimized Troubleshooting Guide
Critical Behavior Switch
Primary Failure Mode: Applying ANY network policy that selects a pod flips that pod from "allow everything" to "deny everything" for the policy types it lists. This is the #1 cause of production outages.
Impact Severity: Complete application stack failure - frontend loses API access, databases become unreachable, monitoring stops working.
Time to Detection: Immediate (within seconds of policy application)
Recovery Time: Hours if root cause unknown, minutes if understood
CNI Plugin Compatibility Matrix
Actually Enforce Policies
- Calico: ✓ Works but cryptic debugging
- Cilium: ✓ Best debugging tools when functional
- AWS VPC CNI: ✓ Requires aws-network-policy-agent addon (v1.14.0+)
- Azure CNI: ✓ Needs Network Policy Manager addon
Silently Ignore Policies
- Flannel: ✗ No support whatsoever
- Basic Docker networking: ✗ No enforcement
- Default cloud CNIs: ✗ Usually no support without addons
Verification Test: Apply a deny-all policy; if a test pod can still reach external sites, the CNI is ignoring policies.
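A minimal verification sketch (the namespace, pod, and policy names here are placeholders, not from any official recipe):
# Apply a deny-all policy in a throwaway namespace
kubectl create namespace np-test
cat <<'EOF' | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-all
  namespace: np-test
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
EOF
# If this connection still succeeds, the CNI is silently ignoring policies
kubectl run np-probe -n np-test --image=busybox --restart=Never --command -- sleep 3600
kubectl wait -n np-test --for=condition=Ready pod/np-probe --timeout=60s
kubectl exec -n np-test np-probe -- nc -zv 8.8.8.8 53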
Root Cause Analysis Priority
1. Label Mismatches (90% of Issues)
Common Failures:
- Typos: app: frontend vs. application: frontend
- Case sensitivity: App: Frontend vs. app: frontend
- Environment drift: staging uses env: dev, production uses environment: production
- Missing namespace labels for namespace selectors
Debugging Commands:
kubectl get pods --show-labels
kubectl get namespaces --show-labels
kubectl get pods -l app=frontend # Test selector matching
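To compare what a policy selects against what actually exists (the policy and namespace names are placeholders):
# Show the selector a policy uses
kubectl get networkpolicy <policy-name> -n <namespace> -o jsonpath='{.spec.podSelector.matchLabels}'
# Confirm that the same selector matches the pods you expect
kubectl get pods -n <namespace> -l app=frontend --show-labels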
2. Bidirectional Policy Requirements
Critical Understanding: Need TWO policies for every connection:
- Source pod: EGRESS permission to send
- Destination pod: INGRESS permission to receive
Failure Symptom: Connection timeouts (not refused connections)
3. DNS Policy Omission
Failure Mode: Pods can reach each other by IP but not by service name
Required Rules: Both UDP AND TCP port 53 to kube-system namespace
Why TCP: DNS switches to TCP for large responses; UDP-only policies cause intermittent failures
Essential DNS Policy:
egress:
- to:
  - namespaceSelector:
      matchLabels:
        name: kube-system
  ports:
  - protocol: UDP
    port: 53
  - protocol: TCP
    port: 53
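Note: the name: kube-system label used above is not applied automatically; on Kubernetes 1.21+ every namespace instead gets kubernetes.io/metadata.name set for it. Verify the label your policy selects on actually exists:
# Check kube-system's labels
kubectl get namespace kube-system --show-labels
# Add the label if missing (or switch the policy to kubernetes.io/metadata.name: kube-system)
kubectl label namespace kube-system name=kube-system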
CNI-Specific Failure Patterns
AWS VPC CNI Critical Issues
Prerequisites:
- VPC CNI version 1.14.0+ required
- aws-network-policy-agent addon must be installed
- PolicyEndpoints CRD must exist
- Specific IAM permissions required
Common Failures:
- Network policy agent container crashes silently
- Policies accepted but ignored (no error indication)
- Works in staging (Calico) but fails in production (VPC CNI)
Diagnostic Commands:
kubectl get crd policyendpoints.networking.k8s.aws
kubectl logs -n kube-system daemonset/aws-node -c aws-network-policy-agent
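Two more checks worth running (the cluster name is a placeholder; vpc-cni is the standard EKS addon name):
# Confirm the VPC CNI addon version supports network policies (1.14.0+)
aws eks describe-addon --cluster-name <cluster> --addon-name vpc-cni --query 'addon.addonVersion'
# If you have NetworkPolicies but no PolicyEndpoints, enforcement likely isn't happening
kubectl get policyendpoints -A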
Calico Debugging
Strengths: Actually enforces policies
Weaknesses: Cryptic error messages, complex iptables interactions
Diagnostic Commands:
kubectl exec -n kube-system <calico-pod> -- calicoctl node status
kubectl exec -n kube-system <calico-pod> -- calicoctl get networkpolicy -o wide
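If calicoctl isn't available inside the pod, the calico-node (Felix) logs are the next best signal; the label below is Calico's standard one, verify it on your install:
# Felix programs the dataplane; its logs surface policy rule-processing errors
kubectl logs -n kube-system -l k8s-app=calico-node -c calico-node --tail=100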
Cilium Advanced Debugging
Strengths: Real-time policy decision monitoring
Weaknesses: Complex eBPF dependencies, kernel version requirements
Diagnostic Commands:
kubectl exec -n kube-system <cilium-pod> -- cilium monitor --type=policy-verdict
kubectl exec -n kube-system <cilium-pod> -- cilium policy trace --src-k8s-pod=ns:pod --dst-k8s-pod=ns:pod
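Two quicker Cilium checks before reaching for a full trace:
# Agent health, including whether policy enforcement is active
kubectl exec -n kube-system <cilium-pod> -- cilium status
# Per-endpoint enforcement state (ingress/egress enabled or disabled)
kubectl exec -n kube-system <cilium-pod> -- cilium endpoint list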
Connection Testing Matrix
Systematic Testing Approach
# Test direct pod-to-pod (bypasses DNS)
kubectl exec -it source-pod -- nc -zv <target-ip> <port>
# Test service connectivity (includes DNS resolution)
kubectl exec -it source-pod -- nc -zv service.namespace.svc.cluster.local <port>
# Test DNS resolution separately
kubectl exec -it source-pod -- nslookup service.namespace.svc.cluster.local
# Test external connectivity (rule out total network failure)
kubectl exec -it source-pod -- nc -zv 8.8.8.8 53
Connection Failure Interpretation
- Connection timeout: Network policy blocking (expected for security)
- Connection refused: App not listening on port (configuration issue)
- DNS resolution failure: Missing DNS egress rules
- External connectivity failure: CNI or infrastructure problem
Production-Ready Policy Templates
Standard DNS Policy (Apply to Every Namespace)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: <NAMESPACE>
spec:
  podSelector: {}
  policyTypes:
  - Egress
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          name: kube-system
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53
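Since this policy has to exist in every namespace, a loop like this saves copy-pasting (sketch only; assumes the manifest above is saved as allow-dns-egress.yaml with <NAMESPACE> left as a placeholder):
# Stamp the DNS egress policy into every namespace
for ns in $(kubectl get namespaces -o jsonpath='{.items[*].metadata.name}'); do
  sed "s/<NAMESPACE>/$ns/" allow-dns-egress.yaml | kubectl apply -f -
done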
Bidirectional Service Communication
Frontend Egress Policy:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: frontend-to-backend
  namespace: frontend
spec:
  podSelector:
    matchLabels:
      app: web-frontend
  policyTypes:
  - Egress
  egress:
  - to:
    # namespaceSelector and podSelector in the SAME list item = AND:
    # pods labeled app=api-service in namespaces labeled name=backend.
    # A separate "- podSelector:" item would mean OR and only match the local namespace.
    - namespaceSelector:
        matchLabels:
          name: backend
      podSelector:
        matchLabels:
          app: api-service
    ports:
    - protocol: TCP
      port: 8080
Backend Ingress Policy:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: backend-from-frontend
  namespace: backend
spec:
  podSelector:
    matchLabels:
      app: api-service
  policyTypes:
  - Ingress
  ingress:
  - from:
    # Combined peer: pods labeled app=web-frontend in the frontend namespace
    - namespaceSelector:
        matchLabels:
          name: frontend
      podSelector:
        matchLabels:
          app: web-frontend
    ports:
    - protocol: TCP
      port: 8080
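To confirm the pair works end to end (assumes a Deployment named web-frontend behind the app: web-frontend label and a Service named api-service; adjust to your own names, and remember the frontend namespace still needs the DNS egress policy for the service name to resolve):
# Should succeed once both the egress and ingress halves are applied
kubectl exec -n frontend deploy/web-frontend -- nc -zv api-service.backend.svc.cluster.local 8080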
Default-Deny with Essential Services
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-with-basics
  namespace: <NAMESPACE>
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
  egress:
  # DNS (essential)
  - to:
    - namespaceSelector:
        matchLabels:
          name: kube-system
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53
  # Kubernetes API (health checks, service discovery)
  - to: []
    ports:
    - protocol: TCP
      port: 443
  # Common health check ports
  - to: []
    ports:
    - protocol: TCP
      port: 8080
    - protocol: TCP
      port: 9090
Performance Considerations
Policy Scaling Limits
Performance Degradation: 100+ individual pod policies cause significant packet processing delays
Optimization Strategy: Use namespace selectors instead of individual pod policies
Resource Impact: Each policy generates iptables rules or eBPF programs
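A quick way to spot namespaces accumulating policies:
# Count NetworkPolicies per namespace, busiest first
kubectl get networkpolicy -A --no-headers | awk '{print $1}' | sort | uniq -c | sort -rn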
Resource Monitoring
CNI Component CPU Usage:
- Calico Felix high CPU indicates rule processing overhead
- Cilium agent memory usage scales with policy complexity
- AWS VPC CNI network-policy-agent frequent restarts indicate resource constraints
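Spot-check CNI component usage with metrics-server installed (the labels below are the common defaults for each CNI; verify on your cluster):
kubectl top pods -n kube-system -l k8s-app=calico-node   # Calico Felix
kubectl top pods -n kube-system -l k8s-app=cilium        # Cilium agent
kubectl top pods -n kube-system -l k8s-app=aws-node      # AWS VPC CNI / network-policy-agent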
Emergency Recovery Procedures
Policy Rollback Strategy
# Emergency policy removal (nuclear option)
kubectl delete networkpolicy --all -n <namespace>
# Targeted policy removal (safer)
kubectl delete networkpolicy <policy-name> -n <namespace>
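Before either option, snapshot what exists so it can be restored (the file name is arbitrary):
# Back up all policies in the namespace before deleting anything
kubectl get networkpolicy -n <namespace> -o yaml > networkpolicies-backup.yaml
# Restore later with:
kubectl apply -f networkpolicies-backup.yaml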
Break-Glass Access Policy
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: emergency-allow-all
  namespace: <NAMESPACE>
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - {}
  egress:
  - {}
Testing and Validation
Policy Effectiveness Test
#!/bin/bash
# Verify policies actually enforce restrictions
kubectl run test-pod --image=busybox --command -- sleep 3600
kubectl wait --for=condition=Ready pod/test-pod --timeout=60s
kubectl exec test-pod -- nc -zv <protected-service> <port>
# Should fail if policies are working correctly
Automated Policy Testing
test_connection() {
  local source_pod=$1
  local target_host=$2
  local target_port=$3
  local expected_result=$4

  if kubectl exec "$source_pod" -- nc -zv "$target_host" "$target_port" 2>/dev/null; then
    actual="WORKS"
  else
    actual="BLOCKED"
  fi

  if [ "$actual" = "$expected_result" ]; then
    echo "✓ PASS: $actual (expected $expected_result)"
    return 0
  else
    echo "✗ FAIL: $actual (expected $expected_result)"
    return 1
  fi
}
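Example invocations (pod, host, and port values are placeholders): the first asserts an allowed path still works, the second asserts an unrelated service stays blocked.
test_connection frontend-pod api-service.backend.svc.cluster.local 8080 "WORKS"
test_connection frontend-pod payments.internal.svc.cluster.local 8443 "BLOCKED"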
Common Migration Pitfalls
Environment Consistency Issues
Risk: Staging uses different CNI than production
Impact: Policies work in staging, fail silently in production
Mitigation: Verify CNI plugin consistency across environments
Label Standardization Drift
Risk: Labels change over time without policy updates
Impact: Policies gradually select fewer resources, reducing security
Mitigation: Implement label validation webhooks and policy testing
NodeLocal DNSCache Considerations
Additional DNS Rules Required:
egress:
- to: []  # Allow access to any destination, including the node IPs NodeLocal DNSCache listens on
  ports:
  - protocol: UDP
    port: 53
  - protocol: TCP
    port: 53
Detection: DNS works sometimes but fails randomly
Root Cause: NodeLocal DNSCache uses node IPs, not kube-system pods
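To check whether NodeLocal DNSCache is even in play (the label is the one from the upstream node-local-dns manifest; verify on your cluster):
# If this returns pods, DNS queries hit a node-local IP instead of kube-system pods
kubectl get pods -n kube-system -l k8s-app=node-local-dns -o wide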
Monitoring and Alerting
Essential Metrics
- Policy count per namespace (performance indicator)
- Policy effectiveness rate (security indicator)
- DNS resolution success rate (functionality indicator)
- Cross-namespace connection success rate (application health)
Critical Alerts
- Network policy creation/deletion (change tracking)
- CNI component restarts (stability indicator)
- DNS resolution failures (immediate impact)
- Unexpected connection timeouts (policy misconfiguration)
Resource Requirements
Time Investment
- Initial policy setup: 2-4 days for complex microservices
- Debugging production issues: 1-6 hours per incident
- Migration between CNIs: 1-2 weeks including testing
Expertise Requirements
- Deep Kubernetes networking knowledge
- CNI-specific debugging skills
- iptables/eBPF understanding for advanced troubleshooting
- Label management and selector logic
Infrastructure Dependencies
- CNI plugin with network policy support
- Adequate cluster resources for policy processing
- Monitoring infrastructure for policy compliance
- Testing framework for policy validation
Useful Links for Further Investigation
Tools That Actually Help (When They're Not Broken)
Link | Description |
---|---|
Kubernetes Network Policies Documentation | The official docs that every tutorial references but nobody reads completely. Buries the important behavioral changes in paragraph 47. Essential reading if you enjoy pain. |
AWS EKS Network Policy Troubleshooting Guide | AWS's attempt at documenting their network policy implementation. Scattered across 12 different pages and half the links are broken. Good luck. |
Azure AKS Network Policy Best Practices | Microsoft's guide for AKS network policies. Actually more helpful than the AWS docs, which isn't saying much. |
Calico Debugging | Calico's debugging tools are powerful if you can figure out their cryptic command syntax. `calicoctl` is like iptables - works great once you memorize 47 different flags. |
Cilium Policy Tracing | Cilium has the best debugging tools when everything is working. `cilium monitor` actually shows you policy decisions in real-time, which is fucking magical when it works. |
Cilium Monitoring That Actually Works | Real-time policy decision monitoring. When this works, it's beautiful. When it doesn't, you're debugging the debugger. |
Network Policy Editor | Web-based policy editor that's prettier than vim. Generates policies that sometimes work. Better than writing YAML by hand, which isn't a high bar. |
Goldpinger - Bloomberg's Network Tool | Actually useful for visualizing what's talking to what. Bloomberg knows their shit about networking. |
kubectl-np-viewer | Kubectl plugin for visualizing policies. Works when you can get it installed, which is 50% of the time. |
knetvis - Policy Visualization | Graph-based policy visualization. Helps you see why your policies are fucked up in pretty colors. |
Network Policy Recipes | Collection of policies that actually work. Copy these instead of writing your own - you'll save days of debugging. |
Netshoot Container | Debugging container with every network tool you need. Like a Swiss Army knife for when your cluster networking is completely fucked. |
kubectl exec for Network Testing | Basic kubectl commands for testing connectivity. If you don't know these, you're not ready for network policies. |
Falco Network Policy Monitoring | Runtime security monitoring that can detect policy violations. Generates way too many alerts until you tune it properly. |
AWS CloudWatch for VPC CNI Logs | AWS's attempt at logging policy decisions. The documentation is scattered and contradictory, but the logs are sometimes useful. |
Prometheus NetworkPolicy Metrics | Metrics for monitoring network policy configurations. Useful for alerting when someone breaks everything. |
Kubernetes Slack #sig-network | Where network policy experts hang out. Ask here after you've tried everything else and read the docs. |
Stack Overflow Network Policy Tag | Search here first - someone has probably hit your exact problem before. |
CNCF Slack CNI Channels | CNI-specific help channels. The Cilium folks are particularly helpful when they're not busy. |
OPA Gatekeeper Policy Templates | Templates to prevent network policy misconfigurations. Set these up or you'll be fixing the same mistakes forever. |
Polaris Configuration Validation | Catches common network policy mistakes before they break production. Wish I'd known about this sooner. |
kubectl dry-run | Test your policies before applying them. Basic stuff, but you'd be surprised how many people skip this step. |