
CNI Debugging Guide: AI-Optimized Reference

Critical Triage (3-Minute Emergency Response)

Immediate Assessment Commands

# Test pod scheduling capability - CRITICAL FIRST CHECK
kubectl run test-pod --image=nginx --rm -it --restart=Never -- /bin/bash

# Pod stuck in ContainerCreating = likely cluster-wide CNI failure
# Pod starts and drops you into a shell = the issue is localized

Emergency Status Check Sequence

  1. CNI plugin status: kubectl get pods -n kube-system | grep -E "(cilium|calico|flannel)"
  2. Node readiness: kubectl get nodes -o wide (look for NotReady)
  3. Recent failures: kubectl get events --sort-by='.lastTimestamp' | grep -i error | tail -10

Time Criticality: If new pods cannot schedule, you have minutes before production impact escalates.
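
If you want the whole sequence in one shot, here is a minimal triage script built from the three checks above:

#!/bin/bash
# 3-minute triage: CNI pod health, node readiness, recent error events
kubectl get pods -n kube-system | grep -E "(cilium|calico|flannel)"
kubectl get nodes -o wide | grep -i notready || echo "All nodes Ready"
kubectl get events --sort-by='.lastTimestamp' | grep -i error | tail -10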

Common Failure Patterns and Root Causes

"failed to setup CNI" Error (95% occurrence rate)

Root Causes (in order of frequency):

  1. Missing CNI binary - /opt/cni/bin/ empty after node updates
  2. Corrupted CNI config - Invalid JSON in /etc/cni/net.d/
  3. Permission issues - Config files unreadable by kubelet
  4. Multiple conflicting configs - Priority conflicts between CNI files

Diagnosis Commands:

# Check CNI binary existence (in a node debug pod, the node's root
# filesystem is mounted at /host)
kubectl debug node/worker-node-1 -it --image=alpine
ls -la /host/opt/cni/bin/

# Validate CNI configuration (covers both .conf and .conflist files)
ls -la /host/etc/cni/net.d/
cat /host/etc/cni/net.d/*.conf*
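
Eyeballing the JSON can miss subtle breakage; here is a minimal syntax check, assuming python3 (or swap in jq) is available in your debug image:

# Parse every CNI config; print BROKEN for any file with invalid JSON
for f in /host/etc/cni/net.d/*.conf*; do
    python3 -m json.tool "$f" > /dev/null 2>&1 && echo "OK: $f" || echo "BROKEN: $f"
done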

Recovery Actions:

# Restart CNI DaemonSet (fixes 80% of cases) - run the line for your installed plugin
kubectl rollout restart daemonset/calico-node -n kube-system
kubectl rollout restart daemonset/cilium -n kube-system
kubectl rollout restart daemonset/kube-flannel-ds -n kube-flannel

# Nuclear config fix: remove the broken config on the node, then bounce the CNI pods
rm /etc/cni/net.d/10-broken.conf
kubectl delete pod -n kube-system -l k8s-app=your-cni-plugin

"No Route to Host" Network Failures

Diagnostic Workflow:

kubectl exec -it failing-pod -- /bin/bash
ip route show                    # Empty = CNI routing failure
ping 8.8.8.8                    # External connectivity test
nslookup kubernetes.default.svc.cluster.local  # Internal DNS test

Decision Tree:

  • Empty routing table → CNI plugin crashed during setup
  • External ping fails + DNS works → Egress/masquerading issue
  • DNS fails → CoreDNS connectivity blocked (usually NetworkPolicy)
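
For the DNS branch, two quick probes narrow it down - a sketch assuming the common default kube-dns service IP of 10.96.0.10 (substitute your cluster's DNS service IP):

# List policies that could be blocking egress to CoreDNS
kubectl get networkpolicy -A

# Query CoreDNS directly; if this works but normal resolution doesn't,
# suspect a NetworkPolicy between the pod and kube-dns
kubectl exec -it failing-pod -- nslookup kubernetes.default.svc.cluster.local 10.96.0.10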

IP Address Exhaustion (Silent Killer Pattern)

Symptoms: Existing pods functional, new pods stuck in ContainerCreating
Detection Commands:

# Calico
kubectl exec -n kube-system calico-node-xxxxx -- calicoctl ipam show

# Cilium  
kubectl exec -n kube-system cilium-xxxxx -- cilium status --verbose

# AWS VPC CNI (aws-node is a DaemonSet; check its IP/ENI env settings)
kubectl describe daemonset aws-node -n kube-system

Resolution Options (ranked by implementation difficulty):

  1. Clean unused pods (immediate, temporary)
  2. Enable IP prefix delegation (AWS only, requires restart)
  3. Expand pod CIDR (requires cluster restart - high risk)
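
Before picking an option, it helps to see how close each node is to its pod capacity - a quick sketch:

# Pod capacity per node vs pods actually scheduled there
kubectl get nodes -o custom-columns=NAME:.metadata.name,MAXPODS:.status.allocatable.pods
kubectl get pods -A -o wide --no-headers --field-selector=status.phase=Running | awk '{print $8}' | sort | uniq -c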

Plugin-Specific Debugging Intelligence

Calico BGP Failures

Critical Check: BGP peering status determines cluster connectivity

kubectl exec -n kube-system calico-node-xxxxx -it -- bash
calicoctl node status  # Must show "Established" for all peers

Common BGP Failure Modes:

  • Idle/Connect states → Firewall blocking TCP 179
  • Wrong AS numbers → Full-mesh configuration error
  • IP-in-IP issues → Encapsulation mode mismatch
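
To confirm the firewall theory from the affected node itself, a minimal sketch (<peer-node-ip> is a placeholder for another node's IP):

# BGP peers talk on TCP 179; a timeout here means something is filtering it
nc -zv -w 3 <peer-node-ip> 179

# Confirm the local BGP daemon (bird) is listening
ss -tlnp | grep ':179'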

BGP Loop Recovery (nuclear option):

kubectl delete pods -n kube-system -l k8s-app=calico-node

Cilium eBPF Diagnostics

Essential Status Checks:

kubectl exec -n kube-system cilium-xxxxx -it -- bash
cilium status --verbose
cilium bpf endpoint list     # eBPF program verification

# Run from your workstation with the cilium CLI (not inside the agent pod)
cilium connectivity test     # Comprehensive connectivity validation

Critical Failure Indicators:

  • "BPF filesystem not mounted"/sys/fs/bpf missing (kernel update issue)
  • "Kubernetes APIs unavailable" → API server connectivity/RBAC failure

eBPF Recovery:

# Node-level fix for BPF filesystem
mount -t bpf bpf /sys/fs/bpf
echo "bpf /sys/fs/bpf bpf defaults 0 0" >> /etc/fstab

AWS VPC CNI Resource Limits

ENI/IP Exhaustion Detection:

kubectl describe daemonset aws-node -n kube-system   # check WARM_IP_TARGET / prefix delegation settings
kubectl get events -A | grep -i 'failed to assign an ip'

Scaling Solutions:

  1. Enable prefix delegation: kubectl set env daemonset aws-node -n kube-system ENABLE_PREFIX_DELEGATION=true
  2. Use larger instance types (immediate capacity increase; capacity check below)
  3. Implement pod density limits (preventive measure)
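
Before resizing, check what each instance type can actually hold - a sketch assuming the AWS CLI is configured (m5.large is just an example):

# Max ENIs and IPv4 addresses per ENI determine pod capacity on the node
aws ec2 describe-instance-types --instance-types m5.large \
  --query 'InstanceTypes[].NetworkInfo.[MaximumNetworkInterfaces,Ipv4AddressesPerInterface]' \
  --output text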

Hidden Gotcha: Security groups must allow pod CIDR range for cluster DNS (port 53)

Flannel VXLAN Overlay Issues

VXLAN Validation:

ip link show flannel.1          # Interface existence
bridge fdb show dev flannel.1   # Neighbor discovery
tcpdump -i flannel.1 -n icmp   # Traffic analysis

Subnet Conflict Detection:

kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.podCIDR}{"\n"}{end}'
# Check for duplicate CIDR assignments
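
The duplicate check can be automated - any line this prints is a podCIDR assigned to more than one node:

# Print only podCIDRs that appear more than once
kubectl get nodes -o jsonpath='{range .items[*]}{.spec.podCIDR}{"\n"}{end}' | sort | uniq -d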

Nuclear Recovery Procedures

Complete CNI Reset Script

#!/bin/bash
# CAUSES DOWNTIME - Use only when all else fails

# Delete CNI DaemonSets
kubectl delete ds -n kube-system -l k8s-app=cilium
kubectl delete ds -n kube-system -l k8s-app=calico-node
kubectl delete ds -n kube-flannel -l app=flannel

# Per-node cleanup - runs a privileged debug pod on each node.
# Requires kubectl v1.27+ for --profile=sysadmin; the node's root
# filesystem is mounted at /host, so chroot into it first.
for node in $(kubectl get nodes -o name); do
    kubectl debug $node --profile=sysadmin --image=alpine -- chroot /host sh -c "
    rm -rf /etc/cni/net.d/*
    rm -rf /opt/cni/bin/*
    ip link delete cilium_host 2>/dev/null || true
    ip link delete cilium_net 2>/dev/null || true
    iptables -F -t nat
    iptables -F -t filter
    iptables -F -t mangle"
done

# Reinstall CNI (example for Cilium)
helm repo add cilium https://helm.cilium.io && helm repo update
helm upgrade --install cilium cilium/cilium --namespace kube-system

Production Operational Intelligence

Critical Monitoring Requirements

  1. Debug pods per namespace - Pre-deployed with network tools for emergency access
  2. CNI resource limits - Prevent Cilium consuming 4GB+ RAM and crashing nodes
  3. Previous CNI version availability - Rollback faster than debugging at 3AM
  4. Policy validation in development - Test failure scenarios before production

Failure Prevention Patterns

  • Monitor IP pool utilization before exhaustion occurs
  • Test kernel updates against CNI compatibility matrices
  • Validate NetworkPolicies with temporary deletion during emergencies
  • Maintain RBAC permissions for CNI service accounts after cluster upgrades

Emergency Decision Matrix

Symptom                         | Time to Impact | Primary Action            | Fallback
New pods cannot schedule        | Minutes        | Restart CNI DaemonSet     | Nuclear CNI reset
Inter-pod communication failing | Immediate      | Check BGP/VXLAN status    | Policy cleanup
DNS resolution broken           | Immediate      | Delete NetworkPolicies    | CoreDNS restart
IP exhaustion warnings          | Hours          | Enable prefix delegation  | Scale instance types

Critical Resource Thresholds

  • Calico IP blocks: Monitor fragmentation above 80% utilization
  • AWS ENI limits: Track per-instance-type maximums
  • Cilium memory usage: Alert above 2GB per node
  • BGP session count: Scale considerations for large clusters

Common Misconfigurations with High Impact

  1. Multiple CNI configs with conflicting priorities - Silent failures
  2. Windows line endings in JSON configs - Parser failures (quick check below)
  3. Wrong API server endpoints - Post-upgrade authentication failures
  4. Security group rules missing pod CIDR - DNS resolution blocks
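
For gotcha #2, carriage returns are invisible in cat output; this sketch flags any config file containing them:

# List CNI config files that contain Windows-style CR line endings
grep -l $'\r' /etc/cni/net.d/*.conf* 2>/dev/null && echo "CRLF found - fix with dos2unix or sed -i 's/\r$//'"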

This operational intelligence represents field-tested knowledge from production CNI debugging scenarios, optimized for systematic troubleshooting and rapid problem resolution.

Useful Links for Further Investigation

CNI Debugging Resources That Actually Help

  • Kubernetes Network Troubleshooting Approach - The most systematic approach I've found. Walks through the entire network stack from pod creation to external connectivity.
  • Platform9 CNI Troubleshooting Guide - Real production scenarios with actual fixes. These guys run Kubernetes for customers, so they've seen everything.
  • AWS EKS CNI Troubleshooting Guide - Essential if you're running on EKS and hit IP address limits. Covers ENI troubleshooting and prefix delegation.
  • Cilium Troubleshooting Documentation - The official docs are actually useful. Covers eBPF debugging, connectivity tests, and cluster mesh issues.
  • Calico Network Policy Troubleshooting - Has the calicoctl commands you need when policies aren't working. The BGP troubleshooting section saved my ass multiple times.
  • Flannel Troubleshooting GitHub Issues - Since Flannel docs are minimal, the GitHub issues are your best resource for weird edge cases.
  • Cilium CLI - The cilium connectivity test command is gold for testing all networking paths. Install this immediately if you use Cilium.
  • Calicoctl - Required for any Calico debugging. The node status and ipam show commands are essential.
  • Kubernetes Network Policy Recipes - Working examples of NetworkPolicies that actually work in production. Copy-paste and modify as needed.
  • Prometheus CNI Metrics - Scrape CNI plugin metrics. Essential for knowing when IP pools are exhausted before they break.
  • Grafana CNI Dashboards - Pre-built dashboards for Cilium, Calico, and generic CNI metrics. Don't build from scratch.
  • Samsung Ads: Calico to Cilium Migration - How they migrated CNI plugins in production without downtime. Covers all the gotchas.
  • Kubernetes Networking Deep Dive - The best explanation of how Kubernetes networking actually works. Read this first if you're new to CNI.
  • CNCF Slack #sig-network - Where CNI maintainers hang out. They'll actually help if you ask specific questions with logs.
  • Cilium Community - Active community for Cilium-specific issues. Join their Slack workspace and GitHub discussions for real-time help.
  • Stack Overflow CNI Questions - Filter by newest and highest voted. Real problems with real solutions, not documentation examples.
  • kubectl-debug - Creates debug containers on nodes without having to remember the long kubectl debug syntax.
  • Popeye - Scans your cluster for common configuration issues including CNI problems.
  • Network Policy Editor - Free tool from Cilium to create and visualize network policies. Essential when you have hundreds of policies and can't figure out why traffic is blocked.

Related Tools & Recommendations

tool
Similar content

Project Calico - The CNI That Actually Works in Production

Used on 8+ million nodes worldwide because it doesn't randomly break on you. Pure L3 routing without overlay networking bullshit.

Project Calico
/tool/calico/overview
100%
tool
Similar content

Cilium - Fix Kubernetes Networking with eBPF

Replace your slow-ass kube-proxy with kernel-level networking that doesn't suck

Cilium
/tool/cilium/overview
94%
tool
Similar content

Container Network Interface (CNI) - How Kubernetes Does Networking

Pick the wrong CNI plugin and your pods can't talk to each other. Here's what you need to know.

Container Network Interface
/tool/cni/overview
83%
troubleshoot
Similar content

When Kubernetes Network Policies Break Everything (And How to Fix It)

Your pods can't talk, logs are useless, and everything's broken

Kubernetes
/troubleshoot/kubernetes-network-policy-ingress-egress-debugging/connectivity-troubleshooting
66%
tool
Similar content

Debugging Istio Production Issues - The 3AM Survival Guide

When traffic disappears and your service mesh is the prime suspect

Istio
/tool/istio/debugging-production-issues
62%
troubleshoot
Similar content

Kubernetes Networking Breaks. Here's How to Fix It.

When nothing can talk to anything else and you're getting paged at 2am on a Sunday because someone deployed a \

Kubernetes
/troubleshoot/kubernetes-networking/network-troubleshooting-guide
60%
troubleshoot
Similar content

Your Kubernetes Cluster is Down at 3am: Now What?

How to fix Kubernetes disasters when everything's on fire and your phone won't stop ringing.

Kubernetes
/troubleshoot/kubernetes-production-crisis-management/production-crisis-management
57%
troubleshoot
Similar content

Fix Kubernetes ImagePullBackOff Error - The Complete Battle-Tested Guide

From "Pod stuck in ImagePullBackOff" to "Problem solved in 90 seconds"

Kubernetes
/troubleshoot/kubernetes-imagepullbackoff/comprehensive-troubleshooting-guide
55%
troubleshoot
Similar content

When Your Entire Kubernetes Cluster Dies at 3AM

Learn to debug, survive, and recover from Kubernetes cluster-wide cascade failures. This guide provides essential strategies and commands for when kubectl is de

Kubernetes
/troubleshoot/kubernetes-production-outages/cluster-wide-cascade-failures
55%
tool
Recommended

Google Kubernetes Engine (GKE) - Google's Managed Kubernetes (That Actually Works Most of the Time)

Google runs your Kubernetes clusters so you don't wake up to etcd corruption at 3am. Costs way more than DIY but beats losing your weekend to cluster disasters.

Google Kubernetes Engine (GKE)
/tool/google-kubernetes-engine/overview
55%
integration
Recommended

Temporal + Kubernetes + Redis: The Only Microservices Stack That Doesn't Hate You

Stop debugging distributed transactions at 3am like some kind of digital masochist

Temporal
/integration/temporal-kubernetes-redis-microservices/microservices-communication-architecture
55%
integration
Recommended

GitOps Integration Hell: Docker + Kubernetes + ArgoCD + Prometheus

How to Wire Together the Modern DevOps Stack Without Losing Your Sanity

kubernetes
/integration/docker-kubernetes-argocd-prometheus/gitops-workflow-integration
55%
troubleshoot
Similar content

Fix Kubernetes Pod CrashLoopBackOff - Complete Troubleshooting Guide

Master Kubernetes CrashLoopBackOff. This complete guide explains what it means, diagnoses common causes, provides proven solutions, and offers advanced preventi

Kubernetes
/troubleshoot/kubernetes-pod-crashloopbackoff/crashloop-diagnosis-solutions
54%
troubleshoot
Similar content

Docker Networking Is Broken (And So Is Your Sanity) - Here's What Actually Works

Docker networking drives me insane. After 6 years of debugging this shit, here's what I've learned about making containers actually talk to each other.

Docker
/troubleshoot/docker-performance/networking-connectivity-issues
53%
troubleshoot
Similar content

Docker Containers Can't Connect - Fix the Networking Bullshit

Your containers worked fine locally. Now they're deployed and nothing can talk to anything else.

Docker Desktop
/troubleshoot/docker-cve-2025-9074-fix/fixing-network-connectivity-issues
53%
tool
Recommended

containerd - The Container Runtime That Actually Just Works

The boring container runtime that Kubernetes uses instead of Docker (and you probably don't need to care about it)

containerd
/tool/containerd/overview
50%
news
Recommended

Google fout X et Instagram dans Discover - 18 septembre 2025

compatible with oci

oci
/fr:news/2025-09-18/google-discover-integration-sociale
50%
news
Recommended

Nepal Goes Nuclear on Social Media, Bans 26 Platforms Including Facebook and YouTube

Government Blocks Everything from TikTok to LinkedIn in Sweeping Censorship Crackdown

Microsoft Copilot
/news/2025-09-07/nepal-social-media-ban
50%
tool
Recommended

Red Hat OpenShift Container Platform - Enterprise Kubernetes That Actually Works

More expensive than vanilla K8s but way less painful to operate in production

Red Hat OpenShift Container Platform
/tool/openshift/overview
50%
howto
Similar content

Deploy Weaviate in Production Without Everything Catching Fire

So you've got Weaviate running in dev and now management wants it in production

Weaviate
/howto/weaviate-production-deployment-scaling/production-deployment-scaling
50%

Recommendations combine user behavior, content similarity, research intelligence, and SEO optimization