Why Calico Became the Default CNI Choice

I've deployed Calico on everything from 3-node dev clusters to 500+ node production environments. Here's why it's the CNI that doesn't make you want to throw your laptop out the window.

[Diagram: Calico architecture components]

The Problem with Other CNIs

Most CNI plugins are either too simple (Flannel has no network policies) or too clever for their own good (Cilium's eBPF-everything approach breaks half your debugging tools). Weave is dead, Antrea mostly matters if you're a VMware shop, and don't get me started on kube-router.

Calico hits the sweet spot: it actually implements Kubernetes network policies, performs well, and when things break, you can actually debug them.

How Calico Actually Works

Pure IP Routing: No VXLAN tunnels, no overlay networking overhead. Your pods get real routable IP addresses. When a packet needs to go from node A to node B, it just... goes there. Revolutionary concept, apparently.

BGP Distribution: Uses BGP to tell other nodes about local routes. Yes, the same BGP that powers the internet. It works because it's been battle-tested for decades.
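For reference, that behavior is controlled by a single BGPConfiguration resource. A minimal sketch of the defaults (full node-to-node mesh, Calico's default private AS number) - adjust and `calicoctl apply` it if your network team insists on their own ASN:

apiVersion: projectcalico.org/v3
kind: BGPConfiguration
metadata:
  name: default
spec:
  # Defaults shown explicitly: every node peers with every other node
  nodeToNodeMeshEnabled: true
  # Calico's default private AS number
  asNumber: 64512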

Multiple Dataplane Options: Start with iptables (it just works), upgrade to eBPF when you hit performance walls. Windows nodes? Supported. VPP for extreme performance? Available if you hate yourself enough to configure it.

What's New in Calico 3.30 (Released 2025)

[Diagram: Calico open source data planes]

The biggest improvement is that a bunch of previously paid features moved into the open source version - flow logs being the headline item (more on that below). About damn time.

The Components That Matter

[Diagram: Calico Felix architecture]

Felix: The agent that runs on every node and does the actual work. Programs iptables rules, manages network interfaces, handles route advertising. When Felix breaks, your networking dies.

BIRD: Handles BGP route distribution. Basically tells other nodes "hey, I have pods at these IPs, route traffic to me." Works in full mesh mode until your cluster gets big enough that you need route reflectors.

Typha: A caching fan-out proxy that sits between Felix and the datastore (the Kubernetes API server, or etcd). Required for clusters over ~100 nodes unless you want to watch your datastore melt under load.

Why Choose Calico Over The Alternatives

It works everywhere: Unlike CNIs tied to specific cloud providers or distributions, Calico runs the same on bare metal, AWS, GCP, Azure, and your crappy on-prem cluster.

Debugging doesn't make you cry: When network policies aren't working, you can actually figure out why. `calicoctl get workloadendpoints` and `iptables -L` still work.

Performance that doesn't suck: Native IP routing means packets don't get wrapped in three layers of tunneling. Your latency stays reasonable and throughput doesn't tank.

The 8+ million nodes running Calico worldwide aren't there because of marketing - they're there because when you're woken up at 3am by networking issues, Calico is the CNI that actually helps you debug and fix problems instead of forcing you to restart everything and hope for the best. When you need Kubernetes networking that just works, Calico delivers.

Real-World CNI Comparison

| Reality Check | Calico | Cilium | Flannel | Weave (Dead) |
|---|---|---|---|---|
| When It Breaks | calicoctl helps debug, iptables rules visible | Good luck debugging eBPF programs | Restart daemon, pray | 💀 Project abandoned |
| Network Policies | Actually work | L7 policies are cool but complex | None (use external firewall) | Basic K8s only |
| Performance | No tunnels = fast | eBPF fast but CPU hungry | VXLAN overhead hurts | Was decent |
| Day 1 Setup | 15 min if lucky, 2 hours if not | Plan for a full day | Works immediately | Doesn't matter |
| Windows Nodes | Full support (actually works) | Basically broken | Basic functionality | Nope |
| Debugging Tools | tcpdump, wireshark still work | Half your tools break | Standard tools work | N/A |
| Memory Usage | ~100MB per node | 200MB+ per node | ~50MB per node | Was ~80MB |
| Learning Curve | Moderate (BGP concepts help) | Steep (eBPF knowledge required) | Minimal | Easy |
| When You'll Regret It | Complex BGP scenarios | When eBPF breaks your monitoring | First security audit | Immediately |

What Actually Makes Calico Worth Using

Dataplane Options That Don't Suck

eBPF Mode (When You Need Speed)

The eBPF dataplane is genuinely fast - about 25% better throughput than iptables in high-connection scenarios. But here's the catch: when networking breaks, good luck debugging it. Your standard tooling won't help, and you'll be staring at eBPF bytecode at 3am.

Use eBPF when you're genuinely hitting throughput or connection-rate walls with iptables and your monitoring doesn't depend on iptables packet counters.
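If you do make the jump, the core switch is a single Felix setting. A rough sketch - note that a real migration also involves pointing Calico directly at the API server and disabling kube-proxy, so follow the official eBPF migration docs rather than applying this blindly:

apiVersion: projectcalico.org/v3
kind: FelixConfiguration
metadata:
  name: default
spec:
  # Switch the dataplane from iptables to eBPF
  bpfEnabled: true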

iptables Mode (The Reliable Choice)

Default dataplane that just works. When pods can't reach services, you can actually debug it with tools you already know. Performance is fine for most workloads - if you're hitting limits with iptables, you probably have bigger problems.

Windows Support (Actually Works)

Unlike every other CNI, Calico's Windows support isn't an afterthought. Network policies work, service discovery works, and it doesn't randomly break when Windows updates.

VPP Mode (For Masochists)

Vector Packet Processing delivers insane performance if you can figure out how to configure it. Documentation is sparse, debugging is hell, and you'll spend more time fighting VPP than benefiting from the performance.

BGP Routing (The Good and Bad)

[Diagram: Calico BGP network architecture]

The Good: No VXLAN tunnels means no encapsulation overhead. Packets flow at line rate because they're just regular IP packets. BGP has been running the internet for decades, so it's battle-tested.

The Bad: Your network team needs to understand what's happening. Some cloud providers (looking at you, GCP) make BGP more complex than it needs to be. Route reflectors become necessary around 100+ nodes.
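When you hit the route-reflector point, the usual pattern is to pick a few nodes as reflectors (set routeReflectorClusterID on their Node resource and give them a label of your choosing), then tell every node to peer with them. A sketch, assuming a hypothetical route-reflector=true label:

apiVersion: projectcalico.org/v3
kind: BGPPeer
metadata:
  name: peer-with-route-reflectors
spec:
  # All nodes peer with the nodes labeled as route reflectors
  nodeSelector: all()
  peerSelector: route-reflector == 'true'

Once the reflectors are in place, you'd normally also set nodeToNodeMeshEnabled to false in the default BGPConfiguration so the full mesh goes away.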

The Reality: BGP mode gives you the best performance but requires network knowledge. IP-in-IP mode works everywhere but adds ~10% overhead. VXLAN mode exists for environments where even IP-in-IP doesn't work (rare).
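Encapsulation is configured per IP pool, which makes the trade-off concrete. A sketch assuming the quickstart's default pod CIDR - CrossSubnet means plain BGP routing within a subnet and IP-in-IP only across subnet boundaries:

apiVersion: projectcalico.org/v3
kind: IPPool
metadata:
  name: default-ipv4-ippool
spec:
  cidr: 192.168.0.0/16
  # Encapsulate only when crossing subnet boundaries; route natively otherwise
  ipipMode: CrossSubnet
  vxlanMode: Never
  # SNAT pod traffic leaving the cluster
  natOutgoing: true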

Network Policies That Actually Work

The Reality of Network Policies

[Diagram: Calico network policy flow]

Most CNIs either don't support network policies (Flannel) or implement them poorly. Calico's policies actually do what they say they'll do.

# This will block everything except what you explicitly allow
apiVersion: projectcalico.org/v3
kind: NetworkPolicy
metadata:
  name: deny-all-default
spec:
  selector: all()
  types:
  - Ingress
  - Egress

Pro tip: Start with deny-all policies and add allow rules. The opposite approach (allow-all then deny) is security theater.
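To make that concrete, here's a hypothetical allow rule layered on top of the deny-all above - the app labels and port are made up, substitute your own:

# Allow pods labeled app == 'frontend' to reach pods labeled app == 'api' on TCP 8080
apiVersion: projectcalico.org/v3
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-api
  namespace: default
spec:
  selector: app == 'api'
  types:
  - Ingress
  ingress:
  - action: Allow
    protocol: TCP
    source:
      selector: app == 'frontend'
    destination:
      ports:
      - 8080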

Policy Debugging Hell (And How to Escape)

When network policies don't work as expected:

  1. `calicoctl get workloadendpoints` - see what Calico thinks your pods are
  2. `calicoctl get networkpolicies` - verify policies are actually loaded
  3. `iptables -L | grep cali` - see the actual iptables rules

The policy preview feature is a godsend - you can test what would be blocked before enabling enforcement.
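If your Calico version ships staged policies (the preview feature mentioned above), a staged variant is just the same spec under a different kind - it gets evaluated and shows up in flow logs, but nothing is actually blocked. A sketch:

apiVersion: projectcalico.org/v3
kind: StagedNetworkPolicy
metadata:
  name: staged-deny-all
  namespace: default
spec:
  selector: all()
  types:
  - Ingress
  - Egress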

WireGuard Encryption

Works out of the box with minimal performance impact (~5-10%). Way easier than setting up service mesh encryption, and it encrypts node-to-node traffic that service mesh misses.

Enable it once your cluster is stable - don't add encryption complexity to initial deployments.
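When you do flip it on, it's one Felix setting - Calico handles the per-node key management. A minimal sketch (there's a separate flag for the IPv6 dataplane):

apiVersion: projectcalico.org/v3
kind: FelixConfiguration
metadata:
  name: default
spec:
  # Encrypt node-to-node pod traffic with WireGuard
  wireguardEnabled: true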

Observability That Doesn't Suck

[Diagram: Calico ecosystem map]

Flow Logs (Finally Free)

Calico 3.30 moved flow logs from paid to free. About time. The Goldmane API gives you JSON logs of what's actually happening on your network, which beats guessing when debugging connectivity issues.

The Whisker web console is decent for filtering through logs - better than parsing gigabytes of JSON by hand.

Real Debugging Workflow

When networking is broken (and it will be):

  1. Check if pods are up: `kubectl get pods`
  2. Check if Calico sees them: `calicoctl get workloadendpoints`
  3. Check policy evaluation: `calicoctl get networkpolicies --show-labels`
  4. Look at flow logs for denied connections
  5. Check iptables rules: `iptables -L | grep cali`

The flow logs finally let you see why connections are failing instead of just that they are.

The Production Reality

When Calico Scales (And When It Doesn't)

Calico comfortably scales past 1,000 nodes, and the largest installations span millions of nodes across multiple clusters. But:

  • Route reflectors become mandatory around 100+ nodes
  • etcd performance becomes the bottleneck before Calico does
  • Felix memory usage grows with the number of endpoints
  • Policy evaluation time increases with policy complexity

Cloud Provider Gotchas

Managed clouds generally won't route your pod CIDRs natively, so expect to run IP-in-IP or VXLAN there, and GCP in particular makes BGP peering more complicated than it needs to be.

Operational Headaches

  • Upgrades: Usually smooth, but test policy evaluation after minor version bumps
  • Monitoring: Felix restarts are normal, but frequent restarts indicate configuration issues
  • Resource limits: Don't be cheap with Felix memory limits - OOMKilled Felix breaks networking
  • IPv6: Supported but doubles your debugging complexity

The reality is that once Calico is running correctly, it tends to stay running. The hard part is getting the initial configuration right, but that's where this guide comes in - we've walked through the production gotchas so you don't have to learn them at 3am when your cluster is down.

This is why Calico dominates the CNI landscape. It's not perfect, but it's predictable. When things break, you can actually fix them. When you need to scale, it scales. When you need security policies, they work. And when you're evaluating CNI options for your next cluster, remember: the best CNI is the one that doesn't wake you up at night.

The Questions You Actually Ask at 3AM

Q: Why is my pod not reaching other pods?

A: First, calm down and check the basics:

# Is the pod actually running?
kubectl get pods -o wide

# Does Calico see it?
calicoctl get workloadendpoints

# What does the routing table look like?
ip route

If routes are missing, Felix probably crashed. Check `kubectl logs -n kube-system -l k8s-app=calico-node`.

Q: Network policies aren't working and I'm losing my mind

A: You're not alone. Network policies are confusing. Here's the debugging order:

  1. Verify the policy exists: `calicoctl get networkpolicies`
  2. Check if your pod labels match the selector: `kubectl get pods --show-labels`
  3. Look at the actual iptables rules: `iptables -L | grep cali`
  4. Use staged policies to see what would be blocked

Q: Does Calico actually work on Windows?

A: Yes, and it's the only CNI where Windows isn't an afterthought. Network policies work, service discovery works, and it doesn't randomly break when Windows updates. Still Windows though, so expect the usual Windows nonsense.

Q: Should I pay for Calico Cloud/Enterprise?

A: The open source version covers 90% of use cases. Pay for the commercial version if you need:

  • Multi-cluster visibility across dozens of clusters
  • Compliance reporting that makes auditors happy
  • Someone to blame when things break (support)

The 3.30 release moved flow logs into open source, which killed off one of the main reasons to pay.

Q: eBPF vs iptables - which one won't ruin my weekend?

A: Start with iptables. It's slower but debuggable with tools you know. Move to eBPF later if you hit performance walls.

eBPF is 20-25% faster but debugging requires learning new tools. If your monitoring depends on iptables packet counters, eBPF will break it.

Q: How big can my cluster get before Calico melts?

A: Calico scales to 1000+ nodes easily. The pain points:

  • Around 100 nodes: Set up route reflectors or the BGP full mesh gets unwieldy
  • Around 500 nodes: Watch etcd performance and Felix memory usage
  • 1000+ nodes: You probably have bigger problems than CNI choice

Q: How do I migrate from Flannel/other CNI?

A: You don't migrate - you rebuild. CNI plugins are too deeply integrated to switch live. Options:

  • Rebuild the entire cluster (easiest)
  • Rolling node replacement (if you have the time)
  • Blue-green cluster deployment (if you're fancy)

Plan for downtime. Anyone claiming zero-downtime CNI migration is selling something.

Q: Calico broke and my cluster is down - what now?

A: Don't panic. Common fixes:

# Restart Calico pods
kubectl delete pods -n kube-system -l k8s-app=calico-node

# Check if Felix is OOMKilled
kubectl describe pods -n kube-system -l k8s-app=calico-node

# Nuclear option: remove and reinstall
kubectl delete -f calico.yaml && kubectl apply -f calico.yaml

Q: Does service mesh + Calico make sense?

A: Yes, they solve different problems:

  • Calico: L3/L4 network security and connectivity
  • Service mesh: L7 features, observability, traffic management

Use both for defense in depth. Don't use service mesh as a replacement for network policies.

Q: Air-gapped environments - does it work?

A: Yes, but you'll need to mirror all the container images to your internal registry. Core functionality has no external dependencies - the images are the only thing you have to get across the air gap.

Q: Network policies are slow - what's wrong?

A: Policy performance degrades with complexity. Common issues:

  • Too many overlapping policies (Calico has to evaluate all of them)
  • Complex selectors that scan all pods repeatedly
  • Missing connection tracking (new connections re-evaluate policies)

Keep policies simple and specific.

Q: What Kubernetes versions work?

A: Calico supports Kubernetes 1.27 through 1.31 as of 2025 and generally works on any CNCF-certified Kubernetes distribution. Check the compatibility matrix before upgrading - Calico sometimes breaks on beta Kubernetes releases.

Q: I'm new to this - where do I start?

A:
  1. Install on a test cluster with kubectl apply -f https://raw.githubusercontent.com/projectcalico/calico/v3.30.0/manifests/calico.yaml
  2. Deploy some test pods and verify they can communicate (a minimal example follows below)
  3. Create a simple deny-all network policy and test it
  4. Learn calicoctl commands for debugging
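
For step 2, a minimal smoke test could be two throwaway pods - names and image here are arbitrary; exec into one and ping the other's pod IP (grab it with kubectl get pod test-b -o wide):

apiVersion: v1
kind: Pod
metadata:
  name: test-a
spec:
  containers:
  - name: shell
    image: busybox:1.36
    command: ["sleep", "3600"]
---
apiVersion: v1
kind: Pod
metadata:
  name: test-b
spec:
  containers:
  - name: shell
    image: busybox:1.36
    command: ["sleep", "3600"]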

Don't start with production clusters.

Q: Where do I get help when things break?

A:
  • Calico Slack for real-time help from other users
  • GitHub issues for bugs
  • Stack Overflow for configuration questions
  • Pay Tigera if you need someone to blame
