CNI Debugging Guide: AI-Optimized Reference
Critical Triage (3-Minute Emergency Response)
Immediate Assessment Commands
# Test pod scheduling capability - CRITICAL FIRST CHECK
kubectl run test-pod --image=nginx --rm -it -- /bin/bash
# Stuck in ContainerCreating = likely cluster-wide CNI failure
# Pod starts and you get a shell = CNI works; the issue is localized
Emergency Status Check Sequence
- CNI plugin status: kubectl get pods -n kube-system | grep -E "(cilium|calico|flannel)"
- Node readiness: kubectl get nodes -o wide (look for NotReady)
- Recent failures: kubectl get events --sort-by='.lastTimestamp' | grep -i error | tail -10
Time Criticality: If new pods cannot schedule, you have minutes before production impact escalates.
Common Failure Patterns and Root Causes
"failed to setup CNI" Error (95% occurrence rate)
Root Causes (in order of frequency):
- Missing CNI binary - /opt/cni/bin/ empty after node updates
- Corrupted CNI config - invalid JSON in /etc/cni/net.d/
- Permission issues - config files unreadable by kubelet
- Multiple conflicting configs - the runtime honors only one file, so priority conflicts fail silently (see the check below)
Diagnosis Commands:
# Check CNI binary existence (kubectl debug mounts the node's filesystem at /host)
kubectl debug node/worker-node-1 -it --image=alpine
ls -la /host/opt/cni/bin/
# Validate CNI configuration (.conf, .conflist, and .json are all loaded)
ls -la /host/etc/cni/net.d/
cat /host/etc/cni/net.d/*.conf*
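One subtle trap from the root-cause list: when multiple config files coexist, the container runtime picks the lexicographically first file in /etc/cni/net.d and ignores the rest, so a stale leftover can shadow the config you think is active. Quick check from the node debug shell above (00-stale.conf is an illustrative name):
# ls sorts lexically - the first file shown is the one the runtime actually uses
ls /host/etc/cni/net.d/ | head -1
# A leftover like 00-stale.conf sorts before 10-calico.conflist and wins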
Recovery Actions:
# Restart CNI DaemonSet (fixes 80% of cases)
kubectl rollout restart daemonset/calico-node -n kube-system
kubectl rollout restart daemonset/cilium -n kube-system
kubectl rollout restart daemonset/kube-flannel-ds -n kube-flannel
# Nuclear config fix (run the rm on the node; 10-broken.conf is a placeholder)
rm /etc/cni/net.d/10-broken.conf
kubectl delete pod -n kube-system -l k8s-app=your-cni-plugin
"No Route to Host" Network Failures
Diagnostic Workflow:
kubectl exec -it failing-pod -- /bin/bash
ip route show # Empty = CNI routing failure
ping 8.8.8.8 # External connectivity test
nslookup kubernetes.default.svc.cluster.local # Internal DNS test
Decision Tree:
- Empty routing table → CNI plugin crashed during setup
- External ping fails + DNS works → Egress/masquerading issue
- DNS fails → CoreDNS connectivity blocked (usually NetworkPolicy)
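For the DNS branch, two standard checks narrow it down fast (nothing cluster-specific assumed):
# Is CoreDNS itself healthy?
kubectl get pods -n kube-system -l k8s-app=kube-dns
# Any NetworkPolicies that might block UDP/TCP 53 to kube-system?
kubectl get networkpolicies --all-namespaces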
IP Address Exhaustion (Silent Killer Pattern)
Symptoms: Existing pods functional, new pods stuck in ContainerCreating
Detection Commands:
# Calico
kubectl exec -n kube-system calico-node-xxxxx -- calicoctl ipam show
# Cilium
kubectl exec -n kube-system cilium-xxxxx -- cilium status --verbose
# AWS VPC CNI - env vars on the aws-node DaemonSet control IP allocation
kubectl describe daemonset aws-node -n kube-system
Resolution Options (ranked by implementation difficulty):
- Clean unused pods (immediate, temporary)
- Enable IP prefix delegation (AWS only, requires restart)
- Expand pod CIDR (requires cluster restart - high risk)
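The first option is the only immediate one. A minimal sketch for reclaiming leaked addresses (terminated pods are safe to delete; the Calico audit assumes calicoctl v3.18+):
# Clear finished pods so their IPAM allocations are released
kubectl delete pods --all-namespaces --field-selector=status.phase=Succeeded
kubectl delete pods --all-namespaces --field-selector=status.phase=Failed
# Calico only: audit for allocations with no matching pod
kubectl exec -n kube-system calico-node-xxxxx -- calicoctl ipam check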
Plugin-Specific Debugging Intelligence
Calico BGP Failures
Critical Check: BGP peering status determines cluster connectivity
kubectl exec -n kube-system calico-node-xxxxx -it -- bash
calicoctl node status # Must show "Established" for all peers
Common BGP Failure Modes:
- Idle/Connect states → Firewall blocking TCP 179
- Wrong AS numbers → Full-mesh configuration error
- IP-in-IP issues → Encapsulation mode mismatch
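Blocked TCP 179 is the usual culprit behind Idle/Connect states, so verify reachability between nodes before touching Calico itself (<peer-ip> is a placeholder; run from any node):
# BGP rides on TCP 179 - a filtered port keeps sessions stuck in Idle/Connect
nc -zv <peer-ip> 179
# Or watch for BGP traffic directly if nc isn't installed
tcpdump -ni any tcp port 179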
BGP Loop Recovery (nuclear option):
kubectl delete pods -n kube-system -l k8s-app=calico-node
Cilium eBPF Diagnostics
Essential Status Checks:
kubectl exec -n kube-system cilium-xxxxx -it -- bash
cilium status --verbose
cilium bpf endpoint list # eBPF program verification
# Note: connectivity test belongs to the cilium-cli on your workstation, not the agent pod
cilium connectivity test # Comprehensive connectivity validation
Critical Failure Indicators:
- "BPF filesystem not mounted" →
/sys/fs/bpf
missing (kernel update issue) - "Kubernetes APIs unavailable" → API server connectivity/RBAC failure
eBPF Recovery:
# Node-level fix for BPF filesystem (run on the node as root)
mount -t bpf bpf /sys/fs/bpf
echo "bpf /sys/fs/bpf bpf defaults 0 0" >> /etc/fstab  # persist across reboots
AWS VPC CNI Resource Limits
ENI/IP Exhaustion Detection:
kubectl describe daemonset aws-node -n kube-system  # check WARM_IP_TARGET / ENABLE_PREFIX_DELEGATION
kubectl get events --all-namespaces | grep -i "failed to assign an IP"  # classic exhaustion signature
Scaling Solutions:
- Enable prefix delegation:
kubectl set env daemonset aws-node -n kube-system ENABLE_PREFIX_DELEGATION=true
- Use larger instance types (immediate capacity increase)
- Implement pod density limits (preventive measure)
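The ceiling you are scaling against: without prefix delegation the VPC CNI caps pods per node at ENIs × (IPs per ENI − 1) + 2. Worked example for an m5.large, which AWS rates at 3 ENIs with 10 IPv4 addresses each:
# max pods = ENIs × (IPs per ENI − 1) + 2
#   m5.large: 3 × (10 − 1) + 2 = 29 pods
# Prefix delegation swaps each secondary IP slot for a /28 (16 addresses)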
Hidden Gotcha: Security groups must allow pod CIDR range for cluster DNS (port 53)
Flannel VXLAN Overlay Issues
VXLAN Validation:
# Run on the affected node
ip link show flannel.1 # Interface existence
bridge fdb show dev flannel.1 # Neighbor discovery
tcpdump -i flannel.1 -n icmp # Traffic analysis
Subnet Conflict Detection:
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.podCIDR}{"\n"}{end}'
# Check for duplicate CIDR assignments
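Flannel's VXLAN backend defaults to UDP 8472, and a firewall that eats it kills the overlay silently. Quick capture on the underlay NIC (eth0 is an assumption - substitute your node's interface):
# Encapsulated pod-to-pod traffic between nodes rides UDP 8472
tcpdump -ni eth0 udp port 8472
# Silence here while pods ping across nodes = the underlay is dropping VXLAN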
Nuclear Recovery Procedures
Complete CNI Reset Script
#!/bin/bash
# CAUSES DOWNTIME - Use only when all else fails
# Delete CNI DaemonSets
kubectl delete ds -n kube-system -l k8s-app=cilium
kubectl delete ds -n kube-system -l k8s-app=calico-node
kubectl delete ds -n kube-flannel -l app=flannel
# Per-node cleanup - chroot into /host so we use the node's own binaries
# (--profile=sysadmin grants the privileges iptables needs; kubectl >= 1.27)
for node in $(kubectl get nodes -o name); do
kubectl debug $node -it --image=alpine --profile=sysadmin -- chroot /host sh -c "
rm -rf /etc/cni/net.d/*
rm -rf /opt/cni/bin/*
ip link delete cilium_host 2>/dev/null || true
ip link delete cilium_net 2>/dev/null || true
iptables -F -t nat
iptables -F -t filter
iptables -F -t mangle"
done
# Reinstall CNI (example for Cilium; add the Helm repo first if needed)
helm repo add cilium https://helm.cilium.io/ && helm repo update
helm upgrade --install cilium cilium/cilium --namespace kube-system
Production Operational Intelligence
Critical Monitoring Requirements
- Debug pods per namespace - Pre-deployed with network tools for emergency access
- CNI resource limits - Prevent Cilium consuming 4GB+ RAM and crashing nodes
- Previous CNI version availability - Rollback faster than debugging at 3AM
- Policy validation in development - Test failure scenarios before production
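A minimal version of the first item above - one long-lived pod with network tooling baked in (nicolaka/netshoot is a popular community image; swap in whatever your registry mirrors):
# Pre-deploy so you aren't pulling images mid-outage
kubectl run netdebug --image=nicolaka/netshoot --restart=Never -- sleep infinity
# During an incident: kubectl exec -it netdebug -- bash  (dig, tcpdump, curl ready to go)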
Failure Prevention Patterns
- Monitor IP pool utilization before exhaustion occurs
- Test kernel updates against CNI compatibility matrices
- Validate NetworkPolicies with temporary deletion during emergencies
- Maintain RBAC permissions for CNI service accounts after cluster upgrades
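For the temporary-deletion pattern, snapshot policies before removing them so the emergency fix doesn't become permanent (namespace and filename are placeholders):
# Back up everything, then strip policies from the suspect namespace
kubectl get networkpolicies --all-namespaces -o yaml > netpol-backup.yaml
kubectl delete networkpolicies --all -n suspect-namespace
# If traffic recovers, a policy was the culprit - restore and bisect:
kubectl apply -f netpol-backup.yaml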
Emergency Decision Matrix
| Symptom | Time to Impact | Primary Action | Fallback |
|---|---|---|---|
| New pods cannot schedule | Minutes | Restart CNI DaemonSet | Nuclear CNI reset |
| Inter-pod communication failing | Immediate | Check BGP/VXLAN status | Policy cleanup |
| DNS resolution broken | Immediate | Delete NetworkPolicies | CoreDNS restart |
| IP exhaustion warnings | Hours | Enable prefix delegation | Scale instance types |
Critical Resource Thresholds
- Calico IP blocks: Monitor fragmentation above 80% utilization
- AWS ENI limits: Track per-instance-type maximums
- Cilium memory usage: Alert above 2GB per node
- BGP session count: Scale considerations for large clusters
Common Misconfigurations with High Impact
- Multiple CNI configs with conflicting priorities - Silent failures
- Windows line endings in JSON configs - Parser failures
- Wrong API server endpoints - Post-upgrade authentication failures
- Security group rules missing pod CIDR - DNS resolution blocks
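The first two are checkable in seconds on the node (python3 is an assumption - most distros ship it; file is near-universal; 10-calico.conflist is an example name):
# JSON validity - one stray comma breaks CNI setup with no useful error
python3 -m json.tool /etc/cni/net.d/10-calico.conflist
# CRLF detection - 'with CRLF line terminators' in the output means trouble
file /etc/cni/net.d/*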
This operational intelligence represents field-tested knowledge from production CNI debugging scenarios, optimized for systematic troubleshooting and rapid problem resolution.
Useful Links for Further Investigation
CNI Debugging Resources That Actually Help
| Link | Description |
|---|---|
| Kubernetes Network Troubleshooting Approach | The most systematic approach I've found. Walks through the entire network stack from pod creation to external connectivity. |
| Platform9 CNI Troubleshooting Guide | Real production scenarios with actual fixes. These guys run Kubernetes for customers, so they've seen everything. |
| AWS EKS CNI Troubleshooting Guide | Essential if you're running on EKS and hit IP address limits. Covers ENI troubleshooting and prefix delegation. |
| Cilium Troubleshooting Documentation | The official docs are actually useful. Covers eBPF debugging, connectivity tests, and cluster mesh issues. |
| Calico Network Policy Troubleshooting | Has the calicoctl commands you need when policies aren't working. The BGP troubleshooting section saved my ass multiple times. |
| Flannel Troubleshooting GitHub Issues | Since Flannel docs are minimal, the GitHub issues are your best resource for weird edge cases. |
| Cilium CLI | The cilium connectivity test command is gold for testing all networking paths. Install this immediately if you use Cilium. |
| Calicoctl | Required for any Calico debugging. The node status and ipam show commands are essential. |
| Kubernetes Network Policy Recipes | Working examples of NetworkPolicies that actually work in production. Copy-paste and modify as needed. |
| Prometheus CNI Metrics | Scrape CNI plugin metrics. Essential for knowing when IP pools are exhausted before they break. |
| Grafana CNI Dashboards | Pre-built dashboards for Cilium, Calico, and generic CNI metrics. Don't build from scratch. |
| Samsung Ads: Calico to Cilium Migration | How they migrated CNI plugins in production without downtime. Covers all the gotchas. |
| Kubernetes Networking Deep Dive | The best explanation of how Kubernetes networking actually works. Read this first if you're new to CNI. |
| CNCF Slack #sig-network | Where CNI maintainers hang out. They'll actually help if you ask specific questions with logs. |
| Cilium Community | Active community for Cilium-specific issues. Join their Slack workspace and GitHub discussions for real-time help. |
| Stack Overflow CNI Questions | Filter by newest and highest voted. Real problems with real solutions, not documentation examples. |
| kubectl-debug | Creates debug containers on nodes without having to remember the long kubectl debug syntax. |
| Popeye | Scans your cluster for common configuration issues including CNI problems. |
| Network Policy Editor | Free tool from Cilium to create and visualize network policies. Essential when you have hundreds of policies and can't figure out why traffic is blocked. |