Kubernetes Network Policies: AI-Optimized Troubleshooting Guide

Critical Behavior Switch

Primary Failure Mode: The moment ANY network policy selects a pod, that pod flips from "allow everything" to "deny everything not explicitly allowed" for each direction the policy declares (Ingress, Egress, or both). This is the #1 cause of production outages.

Impact Severity: Complete application stack failure - frontend loses API access, databases become unreachable, monitoring stops working.

Time to Detection: Immediate (within seconds of policy application)

Recovery Time: Hours if root cause unknown, minutes if understood
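
A minimal illustration of the switch (namespace and labels here are placeholders): the policy below only intends to grant ingress from app: web pods, but the instant it selects the app: api pods, every OTHER ingress to them is denied.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: example-lockdown
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: api
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: web
# Side effect: monitoring scrapes, cross-namespace clients, and every
# other consumer of app=api pods are now denied ingress.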

CNI Plugin Compatibility Matrix

Actually Enforce Policies

  • Calico: ✓ Works but cryptic debugging
  • Cilium: ✓ Best debugging tools when functional
  • AWS VPC CNI: ✓ Requires aws-network-policy-agent addon (v1.14.0+)
  • Azure CNI: ✓ Needs Network Policy Manager addon

Silently Ignore Policies

  • Flannel: ✗ No support whatsoever
  • Basic Docker networking: ✗ No enforcement
  • Default cloud CNIs: ✗ Usually no support without addons

Verification Test: Apply deny-all policy; if test pod can still reach external sites, CNI ignores policies.
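
A sketch of that test with disposable, placeholder names (policy and pod both get cleaned up at the end):

# Apply a deny-all policy, then probe from a test pod
cat <<'EOF' | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: verify-deny-all
  namespace: default
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
EOF
kubectl run cni-test --image=busybox --restart=Never --command -- sleep 300
kubectl wait --for=condition=ready pod/cni-test --timeout=60s
# Probe by IP so DNS isn't a factor; -w 3 keeps the test from hanging
kubectl exec cni-test -- nc -zv -w 3 8.8.8.8 53 \
  && echo "CNI is IGNORING policies" \
  || echo "CNI enforces policies"
kubectl delete networkpolicy verify-deny-all -n default
kubectl delete pod cni-test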

Root Cause Analysis Priority

1. Label Mismatches (90% of Issues)

Common Failures:

  • Typos: app: frontend vs application: frontend
  • Case sensitivity: App: Frontend vs app: frontend
  • Environment drift: staging uses env: dev, production uses environment: production
  • Missing namespace labels for namespace selectors

Debugging Commands:

kubectl get pods --show-labels
kubectl get namespaces --show-labels
kubectl get pods -l app=frontend  # Test selector matching
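
To compare a policy's selector directly against what's actually running, pull the selector out of the policy itself (the policy name here is illustrative):

# Show exactly which labels a policy selects on (assumes a policy named frontend-policy)
kubectl get networkpolicy frontend-policy -o jsonpath='{.spec.podSelector.matchLabels}{"\n"}'
# Then confirm pods actually carry those labels
kubectl get pods --show-labels | grep frontend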

2. Bidirectional Policy Requirements

Critical Understanding: Once policies cover both ends, every connection needs TWO permissions:

  • Source pod: EGRESS permission to send
  • Destination pod: INGRESS permission to receive

Failure Symptom: Connection timeouts (not refused connections)
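
A quick audit of both ends, assuming namespaces named frontend and backend: list each policy and the directions it declares. If one side has only Ingress policies and the other only Egress, timeouts are the expected symptom.

# Which policies exist on each side, and what policyTypes do they declare?
for ns in frontend backend; do
  echo "--- $ns ---"
  kubectl get networkpolicy -n "$ns" \
    -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.spec.policyTypes}{"\n"}{end}'
done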

3. DNS Policy Omission

Failure Mode: Pods can reach each other by IP but not by service name
Required Rules: Both UDP AND TCP port 53 to kube-system namespace
Why TCP: DNS falls back to TCP when responses are truncated (large responses, many records); UDP-only policies cause intermittent failures

Essential DNS Policy:

egress:
- to:
  - namespaceSelector:
      matchLabels:
        name: kube-system
  ports:
  - protocol: UDP
    port: 53
  - protocol: TCP
    port: 53
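
To confirm the TCP path specifically, force a TCP lookup from a pod that has dig available (for example the netshoot image listed in the links section); a UDP-only policy passes the first command and fails the second:

# UDP path (default)
kubectl exec -it source-pod -- nslookup kubernetes.default.svc.cluster.local
# TCP path (what truncated/large responses use); requires an image with dig
kubectl exec -it source-pod -- dig +tcp kubernetes.default.svc.cluster.local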

CNI-Specific Failure Patterns

AWS VPC CNI Critical Issues

Prerequisites:

  • VPC CNI version 1.14.0+ required
  • aws-network-policy-agent addon must be installed
  • PolicyEndpoints CRD must exist
  • Specific IAM permissions required

Common Failures:

  • Network policy agent container crashes silently
  • Policies accepted but ignored (no error indication)
  • Works in staging (Calico) but fails in production (VPC CNI)

Diagnostic Commands:

kubectl get crd policyendpoints.networking.k8s.aws
kubectl logs -n kube-system -l k8s-app=aws-node -c aws-network-policy-agent
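
It's also worth verifying the addon actually has network policy enforcement turned on; a sketch using the AWS CLI (cluster name is a placeholder):

# Check the vpc-cni addon version and configuration
aws eks describe-addon --cluster-name <cluster> --addon-name vpc-cni \
  --query 'addon.[addonVersion,configurationValues]'
# Enforcement must be explicitly enabled: configurationValues
# should contain {"enableNetworkPolicy": "true"}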

Calico Debugging

Strengths: Actually enforces policies
Weaknesses: Cryptic error messages, complex iptables interactions

Diagnostic Commands:

kubectl exec -n kube-system <calico-pod> -- calicoctl node status
kubectl exec -n kube-system <calico-pod> -- calicoctl get networkpolicy -o wide
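
Felix (Calico's policy engine) logs rule-programming errors that calicoctl won't surface; grepping its logs is often faster (label assumes a standard Calico install):

kubectl logs -n kube-system -l k8s-app=calico-node -c calico-node --tail=200 | grep -iE 'policy|denied|error'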

Cilium Advanced Debugging

Strengths: Real-time policy decision monitoring
Weaknesses: Complex eBPF dependencies, kernel version requirements

Diagnostic Commands:

kubectl exec -n kube-system <cilium-pod> -- cilium monitor --type=policy-verdict
kubectl exec -n kube-system <cilium-pod> -- cilium policy trace --src-k8s-pod=ns:pod --dst-k8s-pod=ns:pod
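
Two more Cilium commands worth knowing: cilium endpoint list shows per-pod policy enforcement state, which tells you immediately whether a pod is even subject to policies.

kubectl exec -n kube-system <cilium-pod> -- cilium status --brief
kubectl exec -n kube-system <cilium-pod> -- cilium endpoint list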

Connection Testing Matrix

Systematic Testing Approach

# Test direct pod-to-pod (bypasses DNS); -w 3 fails fast instead of hanging
kubectl exec -it source-pod -- nc -zv -w 3 <target-ip> <port>

# Test service connectivity (includes DNS resolution)
kubectl exec -it source-pod -- nc -zv -w 3 service.namespace.svc.cluster.local <port>

# Test DNS resolution separately
kubectl exec -it source-pod -- nslookup service.namespace.svc.cluster.local

# Test external connectivity (rule out total network failure)
kubectl exec -it source-pod -- nc -zv -w 3 8.8.8.8 53

Connection Failure Interpretation

  • Connection timeout: Network policy blocking (expected for security)
  • Connection refused: App not listening on port (configuration issue)
  • DNS resolution failure: Missing DNS egress rules
  • External connectivity failure: CNI or infrastructure problem

Production-Ready Policy Templates

Standard DNS Policy (Apply to Every Namespace)

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: <NAMESPACE>
spec:
  podSelector: {}
  policyTypes:
  - Egress
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          name: kube-system
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53
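
Note: the name: kube-system label is NOT applied by default; you must add it yourself. Kubernetes 1.21+ automatically labels every namespace with kubernetes.io/metadata.name, which is safer to select on. Either label the namespace or swap the selector:

# Option 1: add the label the template expects
kubectl label namespace kube-system name=kube-system
# Option 2 (Kubernetes 1.21+): select on the automatic label instead
#   namespaceSelector:
#     matchLabels:
#       kubernetes.io/metadata.name: kube-system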

Bidirectional Service Communication

Frontend Egress Policy:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: frontend-to-backend
  namespace: frontend
spec:
  podSelector:
    matchLabels:
      app: web-frontend
  policyTypes:
  - Egress
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          name: backend
      podSelector:
        matchLabels:
          app: api-service
    ports:
    - protocol: TCP
      port: 8080

Backend Ingress Policy:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: backend-from-frontend
  namespace: backend
spec:
  podSelector:
    matchLabels:
      app: api-service
  policyTypes:
  - Ingress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          name: frontend
      podSelector:
        matchLabels:
          app: web-frontend
    ports:
    - protocol: TCP
      port: 8080
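
Note the indentation in both templates: namespaceSelector and podSelector share a single "-" list item, which means AND (pods labeled app: api-service in namespaces labeled name: backend). Splitting them into two list items changes the meaning to OR, and the standalone podSelector then matches pods in the policy's OWN namespace. One misplaced dash silently broadens the policy:

# AND - one peer, both conditions must match (usually what you want):
- namespaceSelector:
    matchLabels:
      name: backend
  podSelector:
    matchLabels:
      app: api-service
# OR - two peers; the podSelector now matches app=api-service
# pods in the policy's own namespace, not in backend:
- namespaceSelector:
    matchLabels:
      name: backend
- podSelector:
    matchLabels:
      app: api-service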

Default-Deny with Essential Services

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-with-basics
  namespace: <NAMESPACE>
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
  egress:
  # DNS (essential)
  - to:
    - namespaceSelector:
        matchLabels:
          name: kube-system
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53
  # Kubernetes API (health checks, service discovery); note: empty "to:" allows ALL destinations on 443
  - to: []
    ports:
    - protocol: TCP
      port: 443
  # Common health check ports
  - to: []
    ports:
    - protocol: TCP
      port: 8080
    - protocol: TCP
      port: 9090

Performance Considerations

Policy Scaling Limits

Performance Degradation: 100+ individual pod policies cause significant packet processing delays
Optimization Strategy: Use namespace selectors instead of individual pod policies
Resource Impact: Each policy generates iptables rules or eBPF programs
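
A quick way to see where policy counts are piling up:

# Policy count per namespace, highest first
kubectl get networkpolicy --all-namespaces --no-headers \
  | awk '{print $1}' | sort | uniq -c | sort -rn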

Resource Monitoring

CNI Component CPU Usage:

  • Calico Felix high CPU indicates rule processing overhead
  • Cilium agent memory usage scales with policy complexity
  • AWS VPC CNI network-policy-agent frequent restarts indicate resource constraints
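
Spot checks for the above, assuming metrics-server is installed and standard component labels:

kubectl top pod -n kube-system -l k8s-app=calico-node
kubectl top pod -n kube-system -l k8s-app=cilium
# Restart counts for the AWS agent (runs inside aws-node pods)
kubectl get pods -n kube-system -l k8s-app=aws-node \
  -o custom-columns=NAME:.metadata.name,RESTARTS:.status.containerStatuses[*].restartCount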

Emergency Recovery Procedures

Policy Rollback Strategy

# Emergency policy removal (nuclear option)
kubectl delete networkpolicy --all -n <namespace>

# Targeted policy removal (safer)
kubectl delete networkpolicy <policy-name> -n <namespace>
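
Before reaching for either option, snapshot what's there so you can restore it once the fire is out:

# Back up every policy in the namespace first
kubectl get networkpolicy -n <namespace> -o yaml > netpol-backup-$(date +%s).yaml
# ...then delete, and later restore with:
# kubectl apply -f netpol-backup-<timestamp>.yaml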

Break-Glass Access Policy

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: emergency-allow-all
  namespace: <NAMESPACE>
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - {}
  egress:
  - {}
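
This works because network policies are purely additive: there is no "deny" rule to outrank, so an allow-all policy unions with every existing policy and restores full connectivity without touching them. Delete it the moment the incident is over.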

Testing and Validation

Policy Effectiveness Test

#!/bin/bash
# Verify policies actually enforce restrictions
kubectl run test-pod --image=busybox --restart=Never --command -- sleep 3600
kubectl wait --for=condition=ready pod/test-pod --timeout=60s
kubectl exec test-pod -- nc -zv -w 3 <protected-service> <port>
# Should fail (timeout) if policies are working correctly
kubectl delete pod test-pod

Automated Policy Testing

test_connection() {
  local source_pod=$1
  local target_host=$2
  local target_port=$3
  local expected_result=$4

  # -w 3 caps the wait so blocked connections fail fast instead of hanging
  if kubectl exec "$source_pod" -- nc -zv -w 3 "$target_host" "$target_port" >/dev/null 2>&1; then
    actual="WORKS"
  else
    actual="BLOCKED"
  fi

  if [ "$actual" = "$expected_result" ]; then
    echo "✓ PASS: $actual (expected $expected_result)"
    return 0
  else
    echo "✗ FAIL: $actual (expected $expected_result)"
    return 1
  fi
}
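
Example invocations (pod names and targets are illustrative):

test_connection frontend-pod api-service.backend.svc.cluster.local 8080 WORKS
test_connection frontend-pod postgres.database.svc.cluster.local 5432 BLOCKED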

Common Migration Pitfalls

Environment Consistency Issues

Risk: Staging uses different CNI than production
Impact: Policies work in staging, fail silently in production
Mitigation: Verify CNI plugin consistency across environments
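
Checking which CNI a cluster actually runs takes one command; do it in every environment before trusting staging results:

# CNI components run as daemonsets in kube-system
kubectl get daemonset -n kube-system | grep -Ei 'calico|cilium|flannel|aws-node|azure'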

Label Standardization Drift

Risk: Labels change over time without policy updates
Impact: Policies gradually select fewer resources, reducing security
Mitigation: Implement label validation webhooks and policy testing
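
A sketch of a drift check, assuming matchLabels-only selectors and jq installed; a policy whose selector matches zero pods is doing nothing:

ns=backend; np=backend-from-frontend   # placeholders
selector=$(kubectl get networkpolicy "$np" -n "$ns" -o json \
  | jq -r '.spec.podSelector.matchLabels // {} | to_entries | map("\(.key)=\(.value)") | join(",")')
echo "Selector: $selector"
kubectl get pods -n "$ns" -l "$selector" --no-headers | wc -l   # 0 = drifted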

NodeLocal DNSCache Considerations

Additional DNS Rules Required:

egress:
- to: []  # Allow access to any node IP
  ports:
  - protocol: UDP
    port: 53
  - protocol: TCP
    port: 53

Detection: DNS works sometimes but fails randomly, or only on certain nodes
Root Cause: NodeLocal DNSCache intercepts DNS at a node-local IP (169.254.20.10 by default), so egress rules scoped to kube-system pods never match
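
To check whether NodeLocal DNSCache is in play at all (standard install labels assumed):

kubectl get pods -n kube-system -l k8s-app=node-local-dns
# If the node-local intercept IP appears in /etc/resolv.conf inside
# your pods, the node-scoped DNS rules above are required
kubectl exec -it source-pod -- cat /etc/resolv.conf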

Monitoring and Alerting

Essential Metrics

  • Policy count per namespace (performance indicator)
  • Policy effectiveness rate (security indicator)
  • DNS resolution success rate (functionality indicator)
  • Cross-namespace connection success rate (application health)

Critical Alerts

  • Network policy creation/deletion (change tracking)
  • CNI component restarts (stability indicator)
  • DNS resolution failures (immediate impact)
  • Unexpected connection timeouts (policy misconfiguration)
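
A sketch of one such alert as a Prometheus rule, assuming kube-state-metrics is scraped; thresholds and container names are illustrative:

groups:
- name: network-policy-alerts
  rules:
  - alert: CNIComponentRestarting
    expr: increase(kube_pod_container_status_restarts_total{namespace="kube-system",container=~"calico-node|cilium-agent|aws-node"}[15m]) > 0
    for: 5m
    annotations:
      summary: "CNI component restarting - network policy enforcement may be degraded"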

Resource Requirements

Time Investment

  • Initial policy setup: 2-4 days for complex microservices
  • Debugging production issues: 1-6 hours per incident
  • Migration between CNIs: 1-2 weeks including testing

Expertise Requirements

  • Deep Kubernetes networking knowledge
  • CNI-specific debugging skills
  • iptables/eBPF understanding for advanced troubleshooting
  • Label management and selector logic

Infrastructure Dependencies

  • CNI plugin with network policy support
  • Adequate cluster resources for policy processing
  • Monitoring infrastructure for policy compliance
  • Testing framework for policy validation

Useful Links for Further Investigation

Tools That Actually Help (When They're Not Broken)

  • Kubernetes Network Policies Documentation: The official docs that every tutorial references but nobody reads completely. Buries the important behavioral changes in paragraph 47. Essential reading if you enjoy pain.
  • AWS EKS Network Policy Troubleshooting Guide: AWS's attempt at documenting their network policy implementation. Scattered across 12 different pages and half the links are broken. Good luck.
  • Azure AKS Network Policy Best Practices: Microsoft's guide for AKS network policies. Actually more helpful than the AWS docs, which isn't saying much.
  • Calico Debugging: Calico's debugging tools are powerful if you can figure out their cryptic command syntax. `calicoctl` is like iptables - works great once you memorize 47 different flags.
  • Cilium Policy Tracing: Cilium has the best debugging tools when everything is working. `cilium monitor` actually shows you policy decisions in real-time, which is fucking magical when it works.
  • Cilium Monitoring That Actually Works: Real-time policy decision monitoring. When this works, it's beautiful. When it doesn't, you're debugging the debugger.
  • Network Policy Editor: Web-based policy editor that's prettier than vim. Generates policies that sometimes work. Better than writing YAML by hand, which isn't a high bar.
  • Goldpinger - Bloomberg's Network Tool: Actually useful for visualizing what's talking to what. Bloomberg knows their shit about networking.
  • kubectl-np-viewer: Kubectl plugin for visualizing policies. Works when you can get it installed, which is 50% of the time.
  • knetvis - Policy Visualization: Graph-based policy visualization. Helps you see why your policies are fucked up in pretty colors.
  • Network Policy Recipes: Collection of policies that actually work. Copy these instead of writing your own - you'll save days of debugging.
  • Netshoot Container: Debugging container with every network tool you need. Like a Swiss Army knife for when your cluster networking is completely fucked.
  • kubectl exec for Network Testing: Basic kubectl commands for testing connectivity. If you don't know these, you're not ready for network policies.
  • Falco Network Policy Monitoring: Runtime security monitoring that can detect policy violations. Generates way too many alerts until you tune it properly.
  • AWS CloudWatch for VPC CNI Logs: AWS's attempt at logging policy decisions. The documentation is scattered and contradictory, but the logs are sometimes useful.
  • Prometheus NetworkPolicy Metrics: Metrics for monitoring network policy configurations. Useful for alerting when someone breaks everything.
  • Kubernetes Slack #sig-network: Where network policy experts hang out. Ask here after you've tried everything else and read the docs.
  • Stack Overflow Network Policy Tag: Search here first - someone has probably hit your exact problem before.
  • CNCF Slack CNI Channels: CNI-specific help channels. The Cilium folks are particularly helpful when they're not busy.
  • OPA Gatekeeper Policy Templates: Templates to prevent network policy misconfigurations. Set these up or you'll be fixing the same mistakes forever.
  • Polaris Configuration Validation: Catches common network policy mistakes before they break production. Wish I'd known about this sooner.
  • kubectl dry-run: Test your policies before applying them. Basic stuff, but you'd be surprised how many people skip this step.
