Kubernetes Network Policies: AI-Optimized Troubleshooting Guide
Critical Behavior Switch
Primary Failure Mode: Applying ANY network policy that selects a pod flips that pod from "allow everything" to "deny everything" for the policy types it lists. This is the #1 cause of production outages.
Impact Severity: Complete application stack failure - frontend loses API access, databases become unreachable, monitoring stops working.
Time to Detection: Immediate (within seconds of policy application)
Recovery Time: Hours if root cause unknown, minutes if understood
CNI Plugin Compatibility Matrix
Actually Enforce Policies
- Calico: ✓ Works but cryptic debugging
- Cilium: ✓ Best debugging tools when functional
- AWS VPC CNI: ✓ Requires aws-network-policy-agent addon (v1.14.0+)
- Azure CNI: ✓ Needs Network Policy Manager addon
Silently Ignore Policies
- Flannel: ✗ No support whatsoever
- Basic Docker networking: ✗ No enforcement
- Default cloud CNIs: ✗ Usually no support without addons
Verification Test: Apply a deny-all policy; if a test pod can still reach external sites, the CNI is ignoring policies.
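A minimal verification sketch (the namespace, pod, and policy names here are placeholders, not from any official recipe):
# Apply a deny-all policy in a throwaway namespace
kubectl create namespace np-test
cat <<'EOF' | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-all
  namespace: np-test
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
EOF
# If this connection still succeeds, the CNI is silently ignoring policies
kubectl run np-probe -n np-test --image=busybox --restart=Never --command -- sleep 3600
kubectl wait -n np-test --for=condition=Ready pod/np-probe --timeout=60s
kubectl exec -n np-test np-probe -- nc -zv 8.8.8.8 53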
Root Cause Analysis Priority
1. Label Mismatches (90% of Issues)
Common Failures:
- Typos: app: frontend vs. application: frontend
- Case sensitivity: App: Frontend vs. app: frontend
- Environment drift: staging uses env: dev, production uses environment: production
- Missing namespace labels for namespace selectors
Debugging Commands:
kubectl get pods --show-labels
kubectl get namespaces --show-labels
kubectl get pods -l app=frontend # Test selector matching
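To compare what a policy selects against what actually exists (the policy and namespace names are placeholders):
# Show the selector a policy uses
kubectl get networkpolicy <policy-name> -n <namespace> -o jsonpath='{.spec.podSelector.matchLabels}'
# Confirm that the same selector matches the pods you expect
kubectl get pods -n <namespace> -l app=frontend --show-labels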
2. Bidirectional Policy Requirements
Critical Understanding: Need TWO policies for every connection:
- Source pod: EGRESS permission to send
- Destination pod: INGRESS permission to receive
Failure Symptom: Connection timeouts (not refused connections)
3. DNS Policy Omission
Failure Mode: Pods can reach each other by IP but not by service name
Required Rules: Both UDP AND TCP port 53 to kube-system namespace
Why TCP: DNS switches to TCP for large responses; UDP-only policies cause intermittent failures
Essential DNS Policy:
egress:
- to:
  - namespaceSelector:
      matchLabels:
        name: kube-system
  ports:
  - protocol: UDP
    port: 53
  - protocol: TCP
    port: 53
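Note: the name: kube-system label used above is not applied automatically; on Kubernetes 1.21+ every namespace instead gets kubernetes.io/metadata.name set for it. Verify the label your policy selects on actually exists:
# Check kube-system's labels
kubectl get namespace kube-system --show-labels
# Add the label if missing (or switch the policy to kubernetes.io/metadata.name: kube-system)
kubectl label namespace kube-system name=kube-system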
CNI-Specific Failure Patterns
AWS VPC CNI Critical Issues
Prerequisites:
- VPC CNI version 1.14.0+ required
- aws-network-policy-agent addon must be installed
- PolicyEndpoints CRD must exist
- Specific IAM permissions required
Common Failures:
- Network policy agent container crashes silently
- Policies accepted but ignored (no error indication)
- Works in staging (Calico) but fails in production (VPC CNI)
Diagnostic Commands:
kubectl get crd policyendpoints.networking.k8s.aws
kubectl logs -n kube-system daemonset/aws-node -c aws-network-policy-agent
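Two more checks worth running (the cluster name is a placeholder; vpc-cni is the standard EKS addon name):
# Confirm the VPC CNI addon version supports network policies (1.14.0+)
aws eks describe-addon --cluster-name <cluster> --addon-name vpc-cni --query 'addon.addonVersion'
# If you have NetworkPolicies but no PolicyEndpoints, enforcement likely isn't happening
kubectl get policyendpoints -A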
Calico Debugging
Strengths: Actually enforces policies
Weaknesses: Cryptic error messages, complex iptables interactions
Diagnostic Commands:
kubectl exec -n kube-system <calico-pod> -- calicoctl node status
kubectl exec -n kube-system <calico-pod> -- calicoctl get networkpolicy -o wide
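If calicoctl isn't available inside the pod, the calico-node (Felix) logs are the next best signal; the label below is Calico's standard one, verify it on your install:
# Felix programs the dataplane; its logs surface policy rule-processing errors
kubectl logs -n kube-system -l k8s-app=calico-node -c calico-node --tail=100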
Cilium Advanced Debugging
Strengths: Real-time policy decision monitoring
Weaknesses: Complex eBPF dependencies, kernel version requirements
Diagnostic Commands:
kubectl exec -n kube-system <cilium-pod> -- cilium monitor --type=policy-verdict
kubectl exec -n kube-system <cilium-pod> -- cilium policy trace --src-k8s-pod=ns:pod --dst-k8s-pod=ns:pod
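Two quicker Cilium checks before reaching for a full trace:
# Agent health, including whether policy enforcement is active
kubectl exec -n kube-system <cilium-pod> -- cilium status
# Per-endpoint enforcement state (ingress/egress enabled or disabled)
kubectl exec -n kube-system <cilium-pod> -- cilium endpoint list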
Connection Testing Matrix
Systematic Testing Approach
# Test direct pod-to-pod (bypasses DNS)
kubectl exec -it source-pod -- nc -zv <target-ip> <port>
# Test service connectivity (includes DNS resolution)
kubectl exec -it source-pod -- nc -zv service.namespace.svc.cluster.local <port>
# Test DNS resolution separately
kubectl exec -it source-pod -- nslookup service.namespace.svc.cluster.local
# Test external connectivity (rule out total network failure)
kubectl exec -it source-pod -- nc -zv 8.8.8.8 53
Connection Failure Interpretation
- Connection timeout: Network policy blocking (expected for security)
- Connection refused: App not listening on port (configuration issue)
- DNS resolution failure: Missing DNS egress rules
- External connectivity failure: CNI or infrastructure problem
Production-Ready Policy Templates
Standard DNS Policy (Apply to Every Namespace)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: <NAMESPACE>
spec:
  podSelector: {}
  policyTypes:
  - Egress
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          name: kube-system
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53
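Since this policy has to exist in every namespace, a loop like this saves copy-pasting (sketch only; assumes the manifest above is saved as allow-dns-egress.yaml with <NAMESPACE> left as a placeholder):
# Stamp the DNS egress policy into every namespace
for ns in $(kubectl get namespaces -o jsonpath='{.items[*].metadata.name}'); do
  sed "s/<NAMESPACE>/$ns/" allow-dns-egress.yaml | kubectl apply -f -
done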
Bidirectional Service Communication
Frontend Egress Policy:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: frontend-to-backend
  namespace: frontend
spec:
  podSelector:
    matchLabels:
      app: web-frontend
  policyTypes:
  - Egress
  egress:
  - to:
    # namespaceSelector and podSelector in the SAME list item = AND:
    # pods labeled app=api-service in namespaces labeled name=backend.
    # A separate "- podSelector:" item would mean OR and only match the local namespace.
    - namespaceSelector:
        matchLabels:
          name: backend
      podSelector:
        matchLabels:
          app: api-service
    ports:
    - protocol: TCP
      port: 8080
Backend Ingress Policy:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: backend-from-frontend
  namespace: backend
spec:
  podSelector:
    matchLabels:
      app: api-service
  policyTypes:
  - Ingress
  ingress:
  - from:
    # Combined peer: pods labeled app=web-frontend in the frontend namespace
    - namespaceSelector:
        matchLabels:
          name: frontend
      podSelector:
        matchLabels:
          app: web-frontend
    ports:
    - protocol: TCP
      port: 8080
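To confirm the pair works end to end (assumes a Deployment named web-frontend behind the app: web-frontend label and a Service named api-service; adjust to your own names, and remember the frontend namespace still needs the DNS egress policy for the service name to resolve):
# Should succeed once both the egress and ingress halves are applied
kubectl exec -n frontend deploy/web-frontend -- nc -zv api-service.backend.svc.cluster.local 8080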
Default-Deny with Essential Services
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-with-basics
  namespace: <NAMESPACE>
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
  egress:
  # DNS (essential)
  - to:
    - namespaceSelector:
        matchLabels:
          name: kube-system
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53
  # Kubernetes API (health checks, service discovery)
  - to: []
    ports:
    - protocol: TCP
      port: 443
  # Common health check ports
  - to: []
    ports:
    - protocol: TCP
      port: 8080
    - protocol: TCP
      port: 9090
Performance Considerations
Policy Scaling Limits
Performance Degradation: 100+ individual pod policies cause significant packet processing delays
Optimization Strategy: Use namespace selectors instead of individual pod policies
Resource Impact: Each policy generates iptables rules or eBPF programs
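A quick way to spot namespaces accumulating policies:
# Count NetworkPolicies per namespace, busiest first
kubectl get networkpolicy -A --no-headers | awk '{print $1}' | sort | uniq -c | sort -rn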
Resource Monitoring
CNI Component CPU Usage:
- Calico Felix high CPU indicates rule processing overhead
- Cilium agent memory usage scales with policy complexity
- AWS VPC CNI network-policy-agent frequent restarts indicate resource constraints
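Spot-check CNI component usage with metrics-server installed (the labels below are the common defaults for each CNI; verify on your cluster):
kubectl top pods -n kube-system -l k8s-app=calico-node   # Calico Felix
kubectl top pods -n kube-system -l k8s-app=cilium        # Cilium agent
kubectl top pods -n kube-system -l k8s-app=aws-node      # AWS VPC CNI / network-policy-agent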
Emergency Recovery Procedures
Policy Rollback Strategy
# Emergency policy removal (nuclear option)
kubectl delete networkpolicy --all -n <namespace>
# Targeted policy removal (safer)
kubectl delete networkpolicy <policy-name> -n <namespace>
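Before either option, snapshot what exists so it can be restored (the file name is arbitrary):
# Back up all policies in the namespace before deleting anything
kubectl get networkpolicy -n <namespace> -o yaml > networkpolicies-backup.yaml
# Restore later with:
kubectl apply -f networkpolicies-backup.yaml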
Break-Glass Access Policy
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: emergency-allow-all
  namespace: <NAMESPACE>
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - {}
  egress:
  - {}
Testing and Validation
Policy Effectiveness Test
#!/bin/bash
# Verify policies actually enforce restrictions
kubectl run test-pod --image=busybox --command -- sleep 3600
kubectl wait --for=condition=Ready pod/test-pod --timeout=60s
kubectl exec test-pod -- nc -zv <protected-service> <port>
# Should fail if policies are working correctly
Automated Policy Testing
test_connection() {
  local source_pod=$1
  local target_host=$2
  local target_port=$3
  local expected_result=$4

  if kubectl exec "$source_pod" -- nc -zv "$target_host" "$target_port" 2>/dev/null; then
    actual="WORKS"
  else
    actual="BLOCKED"
  fi

  if [ "$actual" = "$expected_result" ]; then
    echo "✓ PASS: $actual (expected $expected_result)"
    return 0
  else
    echo "✗ FAIL: $actual (expected $expected_result)"
    return 1
  fi
}
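Example invocations (pod, host, and port values are placeholders): the first asserts an allowed path still works, the second asserts an unrelated service stays blocked.
test_connection frontend-pod api-service.backend.svc.cluster.local 8080 "WORKS"
test_connection frontend-pod payments.internal.svc.cluster.local 8443 "BLOCKED"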
Common Migration Pitfalls
Environment Consistency Issues
Risk: Staging uses different CNI than production
Impact: Policies work in staging, fail silently in production
Mitigation: Verify CNI plugin consistency across environments
Label Standardization Drift
Risk: Labels change over time without policy updates
Impact: Policies gradually select fewer resources, reducing security
Mitigation: Implement label validation webhooks and policy testing
NodeLocal DNSCache Considerations
Additional DNS Rules Required:
egress:
- to: []  # Allow access to any destination, including the node IPs NodeLocal DNSCache listens on
  ports:
  - protocol: UDP
    port: 53
  - protocol: TCP
    port: 53
Detection: DNS works sometimes but fails randomly
Root Cause: NodeLocal DNSCache uses node IPs, not kube-system pods
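To check whether NodeLocal DNSCache is even in play (the label is the one from the upstream node-local-dns manifest; verify on your cluster):
# If this returns pods, DNS queries hit a node-local IP instead of kube-system pods
kubectl get pods -n kube-system -l k8s-app=node-local-dns -o wide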
Monitoring and Alerting
Essential Metrics
- Policy count per namespace (performance indicator)
- Policy effectiveness rate (security indicator)
- DNS resolution success rate (functionality indicator)
- Cross-namespace connection success rate (application health)
Critical Alerts
- Network policy creation/deletion (change tracking)
- CNI component restarts (stability indicator)
- DNS resolution failures (immediate impact)
- Unexpected connection timeouts (policy misconfiguration)
Resource Requirements
Time Investment
- Initial policy setup: 2-4 days for complex microservices
- Debugging production issues: 1-6 hours per incident
- Migration between CNIs: 1-2 weeks including testing
Expertise Requirements
- Deep Kubernetes networking knowledge
- CNI-specific debugging skills
- iptables/eBPF understanding for advanced troubleshooting
- Label management and selector logic
Infrastructure Dependencies
- CNI plugin with network policy support
- Adequate cluster resources for policy processing
- Monitoring infrastructure for policy compliance
- Testing framework for policy validation
Useful Links for Further Investigation
Tools That Actually Help (When They're Not Broken)
Link | Description |
---|---|
Kubernetes Network Policies Documentation | The official docs that every tutorial references but nobody reads completely. Buries the important behavioral changes in paragraph 47. Essential reading if you enjoy pain. |
AWS EKS Network Policy Troubleshooting Guide | AWS's attempt at documenting their network policy implementation. Scattered across 12 different pages and half the links are broken. Good luck. |
Azure AKS Network Policy Best Practices | Microsoft's guide for AKS network policies. Actually more helpful than the AWS docs, which isn't saying much. |
Calico Debugging | Calico's debugging tools are powerful if you can figure out their cryptic command syntax. `calicoctl` is like iptables - works great once you memorize 47 different flags. |
Cilium Policy Tracing | Cilium has the best debugging tools when everything is working. `cilium monitor` actually shows you policy decisions in real-time, which is fucking magical when it works. |
Cilium Monitoring That Actually Works | Real-time policy decision monitoring. When this works, it's beautiful. When it doesn't, you're debugging the debugger. |
Network Policy Editor | Web-based policy editor that's prettier than vim. Generates policies that sometimes work. Better than writing YAML by hand, which isn't a high bar. |
Goldpinger - Bloomberg's Network Tool | Actually useful for visualizing what's talking to what. Bloomberg knows their shit about networking. |
kubectl-np-viewer | Kubectl plugin for visualizing policies. Works when you can get it installed, which is 50% of the time. |
knetvis - Policy Visualization | Graph-based policy visualization. Helps you see why your policies are fucked up in pretty colors. |
Network Policy Recipes | Collection of policies that actually work. Copy these instead of writing your own - you'll save days of debugging. |
Netshoot Container | Debugging container with every network tool you need. Like a Swiss Army knife for when your cluster networking is completely fucked. |
kubectl exec for Network Testing | Basic kubectl commands for testing connectivity. If you don't know these, you're not ready for network policies. |
Falco Network Policy Monitoring | Runtime security monitoring that can detect policy violations. Generates way too many alerts until you tune it properly. |
AWS CloudWatch for VPC CNI Logs | AWS's attempt at logging policy decisions. The documentation is scattered and contradictory, but the logs are sometimes useful. |
Prometheus NetworkPolicy Metrics | Metrics for monitoring network policy configurations. Useful for alerting when someone breaks everything. |
Kubernetes Slack #sig-network | Where network policy experts hang out. Ask here after you've tried everything else and read the docs. |
Stack Overflow Network Policy Tag | Search here first - someone has probably hit your exact problem before. |
CNCF Slack CNI Channels | CNI-specific help channels. The Cilium folks are particularly helpful when they're not busy. |
OPA Gatekeeper Policy Templates | Templates to prevent network policy misconfigurations. Set these up or you'll be fixing the same mistakes forever. |
Polaris Configuration Validation | Catches common network policy mistakes before they break production. Wish I'd known about this sooner. |
kubectl dry-run | Test your policies before applying them. Basic stuff, but you'd be surprised how many people skip this step. |