The Nuclear Option: When All Pods Are Fucked

I've been paged at 3AM more times than I care to count because CNI plugins decided to shit the bed. Here's the playbook that's saved my ass repeatedly.

Step 1: Don't Panic (But Move Fast)

First rule of production debugging: check if you can schedule new pods. If you can't, you're dealing with cluster-wide CNI failure and you have minutes before people start screaming.

kubectl run test-pod --image=nginx --rm -it -- /bin/bash

If the pod hangs in ContainerCreating forever, your CNI is toast. If it comes up, the problem is localized to specific pods or nodes.
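
Before going further, it helps to see whether the stuck pods are piling up on one node - that points at a node-level CNI problem rather than a cluster-wide one. A quick way to narrow it down (worker-node-1 is just a stand-in for your node name):

## List stuck pods and note which nodes they're on
kubectl get pods -A -o wide | grep ContainerCreating

## Check what's running (or crashing) in kube-system on the suspect node
kubectl get pods -n kube-system -o wide --field-selector spec.nodeName=worker-node-1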

The 3-Minute Triage

Check these in order - don't waste time on logs until you know what you're dealing with:

  1. CNI plugin pods status: kubectl get pods -n kube-system | grep -E "(cilium|calico|flannel)"
  2. Node status: kubectl get nodes -o wide - look for NotReady nodes
  3. Recent pod failures: kubectl get events --sort-by='.lastTimestamp' | grep -i error | tail -10

The official Kubernetes troubleshooting guide covers this systematic approach, but here's the real-world version that actually works.

CNI debugging decision tree

\"failed to setup CNI\" - The Most Common Nightmare

This error means the kubelet can't reach your CNI plugin. 95% of the time it's one of these:

The Kubernetes CNI troubleshooting documentation covers the theory, but James Sturtevant's debugging guide has the practical commands that actually work.

Missing CNI Binary

## Check if the CNI binary exists on the failing node
kubectl get nodes
kubectl debug node/worker-node-1 -it --image=alpine
## The node's root filesystem is mounted at /host inside the debug pod
ls -la /host/opt/cni/bin/

If your CNI binary is missing (common after node updates), you need to reinstall:

## For Calico
kubectl rollout restart daemonset/calico-node -n kube-system

## For Cilium
kubectl rollout restart daemonset/cilium -n kube-system

## For Flannel
kubectl rollout restart daemonset/kube-flannel-ds -n kube-flannel

Corrupted CNI Config

Your CNI config lives in /etc/cni/net.d/ and if it's fucked, everything is fucked:

kubectl debug node/worker-node-1 -it --image=alpine
## The node's root filesystem is mounted at /host inside the debug pod
ls -la /host/etc/cni/net.d/
cat /host/etc/cni/net.d/*.conf*

Look for:

  • Invalid JSON (yes, this breaks everything silently - see the quick check below)
  • Wrong file permissions (needs to be readable by kubelet)
  • Multiple configs with conflicting priorities
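
A quick way to catch invalid JSON without eyeballing it - a sketch run from inside the node debug pod shown above, assuming the pod can install packages (otherwise use a debug image that already ships jq):

## Validate each config file; jq exits non-zero on a parse error
apk add --no-cache jq
for f in /host/etc/cni/net.d/*; do
    echo "== $f"
    jq empty "$f" && echo "valid JSON" || echo "INVALID JSON"
done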

The fix is usually nuking the bad config and letting your CNI operator recreate it:

## Delete the broken config (do this on the node)
rm /etc/cni/net.d/10-broken.conf
## Restart CNI pods to regenerate
kubectl delete pod -n kube-system -l k8s-app=your-cni-plugin

The Dreaded \"No Route to Host\"

This is where shit gets real. Your pods can schedule but can't reach each other or external services.

Network troubleshooting workflow

Quick Network Sanity Check

Get into a failing pod and run these commands:

kubectl exec -it failing-pod -- /bin/bash
ip route show
ping 8.8.8.8
nslookup kubernetes.default.svc.cluster.local

If ip route show is empty, your CNI plugin never set up routing. This usually means:

  • CNI plugin crashed during pod creation
  • IP address pool exhaustion
  • Node network configuration is fucked

If external ping fails but internal DNS works, check your egress/masquerading rules.

If DNS fails, your CoreDNS pods probably can't reach the Kubernetes API. This is almost always a network policy issue. The Container Solutions debugging guide covers DNS troubleshooting in detail.
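
If DNS is the suspect, a couple of quick checks before blaming the CNI (these assume the standard k8s-app=kube-dns label that CoreDNS ships with):

## Are the CoreDNS pods healthy, and what are they complaining about?
kubectl get pods -n kube-system -l k8s-app=kube-dns -o wide
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=20

## Can a throwaway pod resolve cluster names at all?
kubectl run dns-test --image=busybox:1.36 --rm -it --restart=Never -- \
    nslookup kubernetes.default.svc.cluster.local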

IP Address Exhaustion (The Silent Killer)

This one is sneaky - existing pods work fine, new ones get stuck in ContainerCreating. Check your IP allocation:

## For Calico
kubectl exec -n kube-system calico-node-xxxxx -- calicoctl ipam show

## For Cilium
kubectl exec -n kube-system cilium-xxxxx -- cilium status --verbose

## For AWS VPC CNI (the aws-node DaemonSet holds the IP allocation settings)
kubectl describe daemonset aws-node -n kube-system
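
A CNI-agnostic sanity check is to compare how many pods a node is running against how many it's allowed to run - on AWS the allocatable pod count is derived from the ENI/IP limits (worker-node-1 is a stand-in):

## How many pods is this node allowed to run?
kubectl get node worker-node-1 -o jsonpath='{.status.allocatable.pods}{"\n"}'

## How many is it actually running?
kubectl get pods -A -o wide --field-selector spec.nodeName=worker-node-1 --no-headers | wc -l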

If you're out of IPs, you have a few options:

  1. Expand your pod CIDR (cluster restart required - good luck)
  2. Enable IP prefix delegation (AWS only) - check the AWS EKS IP allocation guide
  3. Clean up unused pods and hope for the best

The Kubernetes troubleshooting guide has practical examples of IP exhaustion scenarios.

When Your CNI Plugin is Completely Fucked

Sometimes you just need to burn it all down and start over. Here's the nuclear option:

#!/bin/bash
## Save this script - you'll need it eventually
echo "Nuking CNI configuration..."

## Delete all CNI-related DaemonSets and Deployments
kubectl delete ds -n kube-system -l k8s-app=cilium
kubectl delete ds -n kube-system -l k8s-app=calico-node
kubectl delete ds -n kube-flannel -l app=flannel

## Clean up node-level networking (run this on each node)
for node in $(kubectl get nodes -o name); do
    kubectl debug $node -it --image=alpine -- chroot /host sh -c "
    rm -rf /etc/cni/net.d/*
    rm -rf /opt/cni/bin/*
    ip link delete cilium_host 2>/dev/null || true
    ip link delete cilium_net 2>/dev/null || true
    iptables -F -t nat
    iptables -F -t filter
    iptables -F -t mangle
    "
done

## Reinstall your CNI (example for Cilium)
helm upgrade --install cilium cilium/cilium --namespace kube-system

This is the equivalent of turning it off and on again, but for networking. It'll cause downtime, but sometimes it's the only way.

Pro Tips from the Trenches

  1. Always keep a debug pod running in each namespace with network tools installed. When shit hits the fan, you don't want to wait for image pulls (example after this list).

  2. Monitor CNI plugin resource usage. I've seen Cilium eat 4GB RAM and bring down nodes. Set proper limits. The container runtime documentation covers resource limits for both containerd and CRI-O.

  3. Test your CNI failure scenarios in dev. Most people only find out their monitoring is broken when production is on fire. For container runtime debugging, use the crictl debugging guide to interact directly with the container runtime.

  4. Keep the previous CNI version ready to deploy. Rollbacks are faster than debugging new bugs at 3AM.
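
For tip 1, something like the community nicolaka/netshoot image works well - a rough sketch (pod name, namespace, and sleep duration are up to you):

## Pre-provision a network debugging pod so it's ready before things break
kubectl run netdebug --image=nicolaka/netshoot --restart=Never -- sleep 86400

## When the pager goes off, you already have dig, tcpdump, curl, and ip available
kubectl exec -it netdebug -- bash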

The key to CNI debugging is systematic elimination. Start with the basics (can pods schedule?), then work your way up the network stack. Don't get caught up in fancy eBPF traces until you've confirmed the fundamentals work.

For comprehensive logging strategies, check the Kubernetes logging guide and basic troubleshooting fundamentals. The CNI Kubernetes guide also provides solid fundamentals for understanding networking failures.

Emergency CNI Debugging FAQ

Q

My pods are stuck in ContainerCreating forever. What now?

A

First, check if the CNI binary is actually there: kubectl debug node/your-node -it --image=alpine then ls /host/opt/cni/bin/ (the node's filesystem is mounted at /host). If it's missing, your CNI DaemonSet probably failed to deploy. I've seen this after kernel updates where the CNI pods crash on startup. Delete the failing CNI pods: kubectl delete pod -n kube-system -l k8s-app=cilium (or whatever your CNI is).

Q

I get "failed to setup CNI: cnisetup: no setup happened" - what the hell?

A

This error is deceptive as fuck. It usually means the CNI config file is corrupted or missing. Check /etc/cni/net.d/ on the node - if there's invalid JSON or the file permissions are wrong, kubelet can't read it. I've spent hours on this only to find someone accidentally saved a file with Windows line endings that broke the JSON parser.

Q

Pods can reach external internet but can't talk to other pods

A

Your inter-pod routing is fucked.

For Flannel, check if VXLAN is working: tcpdump -i flannel.1 -n.

For Calico, verify BGP peering: kubectl exec -n kube-system calico-node-xxx -- calicoctl node status.

For Cilium, check connectivity: kubectl exec -n kube-system cilium-xxx -- cilium status. Nine times out of ten, it's a firewall rule blocking the overlay network.

Q

"error getting ClusterInformation: connection is unauthorized"

A

Your CNI plugin can't authenticate to the Kubernetes API. Check if the service account exists and has proper RBAC permissions. This breaks after cluster upgrades when API server certificates change. Delete and recreate the CNI RBAC: kubectl delete clusterrolebinding cilium then reinstall your CNI.

Q

My AWS VPC CNI is failing with "no available IP addresses"

A

You've hit the ENI limit. Each instance type has a max number of ENIs and IPs per ENI. Enable IP prefix delegation or use a larger instance type. I've seen this kill clusters that were running fine for months until they hit the magic scaling threshold.

Q

Cilium pods are CrashLoopBackOff after kernel update

A

eBPF programs are kernel-version specific. Check dmesg for eBPF verification errors. You need a Cilium version that supports your new kernel. I keep a compatibility matrix because this happens every fucking time there's a security update. Rollback the kernel or upgrade Cilium - there's no middle ground.

Q

"CNI network not found" when starting containers

A

Your container runtime (Docker/containerd) can't find the CNI configuration. This happens when the CNI config file gets deleted or renamed incorrectly. The file in /etc/cni/net.d/ needs to start with a number (like 10-calico.conflist) and have the right permissions. If you have multiple files, the lowest number wins.

Q

CoreDNS pods are running but DNS resolution is broken

A

CNI network policies are probably blocking CoreDNS traffic.

Temporarily delete all NetworkPolicies: kubectl delete networkpolicies --all --all-namespaces and test. If DNS works, you've got a policy blocking port 53. This is why I always create CoreDNS allow rules first in production.
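
For reference, a minimal allow-DNS policy looks roughly like this - a sketch with a hypothetical namespace, meant to sit alongside whatever other egress policies you run; adjust the selectors to your cluster:

## Allow every pod in the my-app namespace to reach kube-dns on port 53
kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: my-app
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
EOF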

Q

How do I debug Calico when everything seems fine but networking is broken?

A

Use calicoctl to check the actual dataplane programming: kubectl exec -n kube-system calico-node-xxx -- calicoctl get nodes, then calicoctl get ipPool, then calicoctl get workloadEndpoints. If workload endpoints are missing, Calico isn't seeing your pods. If IP pools are wrong, your pod CIDR is fucked.

Q

Flannel works fine until I restart a node, then everything breaks

A

Flannel stores state in etcd/Kubernetes that gets out of sync when nodes restart ungracefully.

Delete the flannel subnet for that node: kubectl delete node old-node-name, then cordon and drain properly: kubectl cordon node && kubectl drain node --ignore-daemonsets. Don't just reboot nodes with Flannel - you'll create orphaned network state.

Q

I'm getting "failed to allocate for range" errors

A

Your IP address pool is exhausted or fragmented.

For Calico: kubectl exec -n kube-system calico-node-xxx -- calicoctl ipam show --show-blocks. Look for fragmentation where blocks are allocated but not fully used. Sometimes you need to expand the pod CIDR or enable IP recycling more aggressively.

Q

My network policies work sometimes but fail randomly

A

Policy programming takes time to propagate. Add a sleep 10 after applying policies before testing. Also check if you're hitting policy map limits - each CNI has different limits on how many policies can be active. Cilium especially gets grumpy when you have thousands of network policies.

Plugin-Specific Debugging (When Generic Fixes Don't Work)

Different CNI plugins fail in wonderfully unique ways. Here's how to debug the most common ones when the standard kubectl commands don't help.

Calico: When BGP Goes to Hell

Calico's BGP routing is powerful until it isn't. When Calico breaks, it usually takes the entire cluster networking with it.

Check BGP Peering Status

## Get into a Calico node pod
kubectl exec -n kube-system calico-node-xxxxx -it -- bash

## Check BGP neighbor status
calicoctl node status
## Should show "Established" for all peers

## If you see "Idle" or "Connect" states:
calicoctl get bgppeers
calicoctl get nodes --output=wide

Common BGP fuckups:

  • Wrong AS numbers: Each node needs unique AS numbers in full-mesh mode
  • Firewall blocking TCP 179: BGP protocol port must be open between nodes (quick check below)
  • IP-in-IP tunnel issues: Check if encapsulation mode matches across all nodes
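
A quick way to confirm TCP 179 isn't blocked is to probe a peer node directly from another node (or a node debug pod) - a sketch, assuming a netcat that supports -z and using an example peer IP:

## Can we open TCP 179 to the peer node?
nc -zv -w 3 10.0.1.23 179

## On the local node: are any BGP sessions actually established?
ss -tnp | grep ':179'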

The official Calico troubleshooting guide covers BGP debugging in detail, and Tigera's BGP troubleshooting article has step-by-step commands for common scenarios.

Calico IP Pool Exhaustion

This one's subtle - pods schedule but get no IP addresses:

## Check IP allocation
calicoctl ipam show --show-blocks

## Look for blocks that are "borrowed" but not "allocated"
## This means IP addresses are reserved but not actually used

## Nuclear option - reclaim unused IPs
calicoctl ipam release --from-report=ipam-report.json

The most frustrating Calico bug I've encountered: BGP routes getting into a loop where nodes can't reach themselves. Fix: restart all Calico pods simultaneously:

kubectl delete pods -n kube-system -l k8s-app=calico-node

Cilium: eBPF Mastery or Misery

Cilium is fast as hell when it works, but debugging eBPF issues requires kernel-level expertise most of us don't have.

Cilium eBPF datapath visualization

Essential Cilium Debug Commands

## Get into a Cilium pod
kubectl exec -n kube-system cilium-xxxxx -it -- bash

## Check connectivity status
cilium status --verbose

## Most important: end-to-end connectivity tests
## (run this one from your workstation with the cilium CLI, not inside the agent pod)
cilium connectivity test

## Check if eBPF programs loaded correctly
cilium bpf endpoint list
cilium bpf ct list global

When Cilium Status Shows Errors

"BPF filesystem not mounted": Your nodes don't have /sys/fs/bpf mounted. This breaks after certain kernel updates:

## Fix on each node
mount -t bpf bpf /sys/fs/bpf
echo "bpf /sys/fs/bpf bpf defaults 0 0" >> /etc/fstab

"Kubernetes APIs unavailable": Cilium can't reach the API server. Check if the API server endpoint is correct and if RBAC permissions are fucked.

The comprehensive Cilium troubleshooting documentation is actually useful, unlike most vendor docs. For advanced debugging, the 2025 Cilium troubleshooting guide has updated techniques.

Cilium Network Policy Debugging

Cilium policies are more powerful than standard Kubernetes NetworkPolicies, but they're also easier to fuck up:

## Check policy status
cilium endpoint list
cilium policy get

## Debug specific policy enforcement
cilium endpoint get <endpoint-id>
cilium monitor --type drop

The most common Cilium policy issue: FQDN-based policies that don't work because DNS resolution failed. Check cilium monitor --type drop for DNS-related drops.

For hands-on debugging examples, check the practical Cilium networking guide and the Cilium eBPF guide on AlmaLinux for distribution-specific issues.

Flannel: Simple Until It Isn't

Flannel is supposed to be the "simple" CNI, but it has its own special ways of breaking.

VXLAN Overlay Issues

Flannel uses VXLAN by default, and when the overlay breaks, cross-node communication dies:

## Check if VXLAN interface exists
ip link show flannel.1

## Check VXLAN neighbor entries
bridge fdb show dev flannel.1

## Test VXLAN connectivity manually
tcpdump -i flannel.1 -n icmp

Missing VXLAN neighbors: Usually means the flannel daemon can't reach the Kubernetes API to get node information. Check if the subnet.env file exists:

cat /run/flannel/subnet.env
## Should contain FLANNEL_NETWORK and FLANNEL_SUBNET

The Flannel GitHub issues are your best resource since the official docs are minimal. Look for similar VXLAN issues in the troubleshooting label.

Flannel Subnet Conflicts

I've seen this kill entire clusters - Flannel assigns overlapping subnets to different nodes:

## Check assigned subnets
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.podCIDR}{"\n"}{end}'

## If you see duplicates, you need to manually fix the node specs
## (podCIDR can only be set while it's empty - otherwise delete and re-register the node)
kubectl patch node <node-name> -p '{"spec":{"podCIDR":"10.244.X.0/24"}}'

AWS VPC CNI: Cloud-Specific Nightmares

The AWS VPC CNI integrates pods directly with VPC networking, which is great until you hit AWS limits.

ENI and IP Limits

Each EC2 instance type has limits on network interfaces and IPs per interface:

## Check the VPC CNI config on the aws-node DaemonSet (WARM_IP_TARGET, prefix delegation, etc.)
kubectl describe daemonset aws-node -n kube-system

## Enable IP prefix delegation (if not already)
kubectl set env daemonset aws-node -n kube-system ENABLE_PREFIX_DELEGATION=true
kubectl rollout restart daemonset/aws-node -n kube-system
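
If you don't know the limits for your instance type off-hand, you can look them up with the AWS CLI - a sketch, with m5.large as a stand-in:

## ENI count and IPv4 addresses per ENI for a given instance type
aws ec2 describe-instance-types --instance-types m5.large \
  --query 'InstanceTypes[0].NetworkInfo.[MaximumNetworkInterfaces,Ipv4AddressesPerInterface]'

## Without prefix delegation, max pods is roughly: ENIs * (IPs per ENI - 1) + 2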

"No available IP addresses" error: You've hit the ENI limit. Solutions in order of preference:

  1. Enable prefix delegation
  2. Use larger instance types
  3. Implement pod density limits per node

Security Group Issues

VPC CNI respects EC2 security groups, which can block pod-to-pod communication:

## Check if pods are getting IPs from the right subnet
kubectl get pods -o wide

## Verify security group rules allow pod CIDR
aws ec2 describe-security-groups --group-ids sg-xxxxx

The most subtle AWS VPC CNI bug: pods get IPs but can't reach the cluster DNS because security groups block port 53. Always add rules for the cluster's pod CIDR range.
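
The fix is a pair of security group rules - a sketch with a hypothetical group ID and CIDR; point them at whatever security group your nodes actually use and your real pod/VPC CIDR:

## Allow DNS (UDP and TCP 53) from the pod/VPC CIDR to the node security group
aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 --protocol udp --port 53 --cidr 10.0.0.0/16
aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 --protocol tcp --port 53 --cidr 10.0.0.0/16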

For AWS-specific debugging, check the NetScaler Calico configuration guide and Azorian's BGP peering guide for connecting Kubernetes to existing network infrastructure.

Multi-CNI Environments (Please Don't)

If you're running multiple CNI plugins (using Multus or similar), may God have mercy on your soul. The debugging approach is:

  1. Identify which CNI handled which interface: kubectl exec pod -- ip addr show
  2. Test each network separately: Each interface will have different routing tables
  3. Check CNI plugin logs individually: Each plugin logs to different places
  4. Pray to the networking gods: Because this shit is complicated

The golden rule of CNI debugging: start simple. Before diving into eBPF traces or BGP routing tables, make sure basic connectivity works. Can you ping the node? Can you reach the Kubernetes API? Can DNS resolve external names?

Most CNI issues are either:

  • Configuration errors (wrong CIDR ranges, missing files)
  • Resource exhaustion (out of IPs, hitting kernel limits)
  • Firewall/security group issues (ports blocked, wrong rules)

Fix these first before assuming you've found a bug in the CNI plugin itself.

For deeper networking analysis, explore Sigrid Jin's Calico networking modes article and check out Istio CNI troubleshooting if you're dealing with service mesh networking issues.

CNI Plugin Debugging Cheat Sheet

Error Message                 Most Likely CNI   First Thing to Check                    Nuclear Option
failed to setup CNI           Any               CNI binary missing from /opt/cni/bin/   Restart CNI DaemonSet
no available IP addresses     AWS VPC CNI       ENI limits hit                          Enable prefix delegation
BPF filesystem not mounted    Cilium            /sys/fs/bpf not mounted                 Mount BPF filesystem
BGP not established           Calico            TCP 179 blocked by firewall             Delete all Calico pods
VXLAN neighbors missing       Flannel           API server unreachable                  Check subnet.env file
connection is unauthorized    Any               RBAC permissions broken                 Recreate service accounts

CNI Debugging Resources That Actually Help