How Cilium Actually Works

Cilium sticks eBPF programs at various points in the Linux kernel networking stack. Instead of your packets going through the normal Linux networking path (and hitting thousands of iptables rules), eBPF intercepts them early and makes forwarding decisions in the kernel.

eBPF: Skip the Bullshit

eBPF lets you run code in the kernel without recompiling it. Cilium uses this to do networking stuff faster than userspace solutions. The downside? Debugging is a nightmare when things break.

Traditional CNI plugins like Flannel create VXLAN tunnels and rely on iptables for everything. Once you hit about 1,000 services, those iptables rules become a performance bottleneck. Every packet has to traverse this massive rule chain to figure out where to go.

Cilium says "fuck that" and implements forwarding logic directly in eBPF. Your packets get processed by kernel code that knows exactly what to do with them, no rule traversal needed.

eBPF programs hook into different points in the Linux networking stack - from the XDP layer for early packet processing to TC hooks for policy enforcement. Check out the eBPF and XDP Reference Guide for technical details on how this works.
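If you want to see where those hooks actually land on a node, standard kernel tooling will show you. A quick sketch - the interface name is an example and the program names vary by Cilium version:

# List eBPF programs attached to network devices on this node
# (Cilium's show up on the XDP and tc hooks).
bpftool net show

# Inspect the tc ingress hook on one interface; Cilium attaches its
# programs to a clsact qdisc with names like cil_from_netdev.
tc filter show dev eth0 ingress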

Identity-Based Security That Actually Works

Instead of tracking IP addresses that change every 5 minutes in Kubernetes, Cilium tracks workload identities that stay consistent. Each pod gets a security identity based on its labels, and packets carry that identity information.

This matters because traditional Kubernetes network policies break when pods get rescheduled and IPs change. Cilium's approach means your security policies keep working even when everything's moving around. Read more about identity management and security identity allocation.
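You can see those identities straight from the agent. A rough sketch, assuming the usual kube-system install; on newer releases the in-pod binary is named cilium-dbg instead of cilium:

# List the security identities Cilium has allocated and the label sets
# they map to.
kubectl -n kube-system exec ds/cilium -- cilium identity list

# Show which identity each local endpoint (pod) was assigned.
kubectl -n kube-system exec ds/cilium -- cilium endpoint list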

The Agent Setup Reality

Every node runs a Cilium agent that compiles and loads eBPF programs. When this works, it's magical. When it doesn't, you're debugging kernel-level networking with tools that assume you have a PhD in eBPF.

The Cilium architecture consists of the agent (running on each node), the Cilium operator (cluster-wide management), the CNI plugin (pod networking), and Hubble (observability). These components work together to replace traditional iptables-based networking with eBPF programs.

The agent talks to the Kubernetes API to get policy updates, then translates them into eBPF code and loads it into the kernel. If your kernel doesn't support the eBPF features Cilium needs, you're fucked.

I learned this the hard way when trying to run Cilium on CentOS 7 with kernel 3.10. The agent just kept restarting with "invalid argument" errors. Recent releases document kernel 4.19 as the floor, but you really want 5.10+ to avoid random eBPF loading failures. Check the system requirements before installation or you'll waste hours debugging cryptic kernel errors.
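A minimal pre-flight and post-install check, assuming you have the cilium CLI installed locally (the config.gz path depends on your distro):

# Before installing: is the kernel new enough, and is BPF compiled in?
uname -r
zgrep CONFIG_BPF /proc/config.gz   # expect CONFIG_BPF=y

# After installing: summarize agent/operator health across the cluster.
cilium status --wait
kubectl -n kube-system get pods -l k8s-app=cilium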

kube-proxy Replacement

kube-proxy creates iptables rules for every service endpoint. With 1,000 services and 10 endpoints each, that's 10,000+ iptables rules your packets have to potentially traverse.

Cilium's eBPF load balancing uses hash tables for O(1) lookups. It can handle millions of services without the linear performance degradation you get with iptables. See the kube-proxy replacement guide and service load balancing documentation.
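Whether the replacement is actually active is worth verifying rather than assuming. A quick check, assuming a standard kube-system install (the in-pod CLI is cilium-dbg on newer versions):

# What the agent thinks it's doing about kube-proxy.
cilium config view | grep kube-proxy-replacement

# Per-node detail: which pieces (NodePort, socket LB, host ports) are
# handled in eBPF.
kubectl -n kube-system exec ds/cilium -- cilium status --verbose | grep -A 10 KubeProxyReplacement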

I've seen clusters where replacing kube-proxy with Cilium reduced average response latency by 30-40%, but I've also seen it completely break NodePort services on GKE because Google's load balancer expects kube-proxy iptables rules. Always test in a staging cluster that matches your production setup exactly.

Performance Reality Check

The official benchmarks show impressive numbers, but they're running on clean test clusters with optimal configurations. Real performance differences between eBPF and traditional iptables become dramatic as you scale - eBPF maintains consistent O(1) lookup times while iptables performance degrades linearly with the number of rules. Also check out the CNI performance benchmark blog and scalability testing.

In the real world, Cilium is definitely faster than traditional CNI plugins, but the performance gain depends on your workload. If you're doing mostly east-west traffic between microservices, the difference is significant. If you're just running a few CRUD apps, you might not notice.

The most dramatic improvement is connection setup time. eBPF load balancing eliminates the NAT operations that create bottlenecks during high connection churn. Read about connection tracking and socket acceleration.
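If you want to poke at the structures involved, the agent can dump its eBPF load-balancing and connection-tracking maps. A sketch, again assuming the standard kube-system DaemonSet:

# Service-to-backend mappings held in the eBPF LB maps (what replaces
# kube-proxy's iptables chains).
kubectl -n kube-system exec ds/cilium -- cilium bpf lb list

# The eBPF connection-tracking table for this node.
kubectl -n kube-system exec ds/cilium -- cilium bpf ct list global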

Multi-Cluster Networking (ClusterMesh)

ClusterMesh lets pods in different clusters talk to each other like they're in the same cluster. It's actually pretty cool when it works.

ClusterMesh creates a secure mesh between clusters: each cluster runs a clustermesh-apiserver that exposes local services to remote clusters, and the clusters establish mutual TLS connections and sync service discovery information across the mesh. Check the multi-cluster setup guide and troubleshooting documentation.

The catch? Setting it up requires understanding Kubernetes networking, BGP routing, and how Cilium's identity system works across clusters. I've spent entire weekends debugging certificate trust issues between clusters that worked fine individually. Budget 3x longer than you think for ClusterMesh setup.
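For reference, the happy path with the cilium CLI looks roughly like this. The context names are placeholders, and it assumes both clusters were installed with unique cluster names and IDs:

# Enable the clustermesh-apiserver in each cluster.
cilium clustermesh enable --context cluster-1
cilium clustermesh enable --context cluster-2

# Wire the two clusters together (exchanges certificates and endpoints).
cilium clustermesh connect --context cluster-1 --destination-context cluster-2

# Verify both sides agree before debugging anything deeper.
cilium clustermesh status --context cluster-1 --wait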

Current Status (September 2025)

Cilium has over 22k GitHub stars as of September 2025. The project graduated from CNCF in October 2023, so it won't disappear when a maintainer switches jobs.

They release pretty regularly but each version breaks something different. Version 1.15.7 had eBPF program loading issues on Ubuntu 22.04 with kernel 5.15. Version 1.16.2 fixed that but broke ClusterMesh between different minor versions. Version 1.17.x works well unless you're using AWS Load Balancer Controller v2.6+ - they conflict and pods randomly lose network connectivity.

I'm running 1.16.15 in production right now because 1.17 kept having weird identity allocation failures under heavy pod churn. Your mileage may vary but test thoroughly before upgrading.

Real Talk CNI Comparison

| Reality Check | Cilium | Calico | Flannel |
|---|---|---|---|
| What it actually does | eBPF kernel networking | iptables + BGP routing | VXLAN tunnels everywhere |
| Setup complexity | Complex as hell | Moderate learning curve | Just fucking works |
| When it breaks | Good luck debugging eBPF | Check BGP and iptables | Restart the pod |
| Performance | Fast once working | Decent, scales well | Fine for most workloads |
| Documentation | Comprehensive but assumes expertise | Pretty good | Basic but sufficient |
| Community | Active but niche | Large enterprise user base | Simple, stable community |
| Enterprise support | Isovalent (now Cisco) | Tigera (solid company) | Red Hat/CoreOS |

What Actually Runs on Your Nodes

Linux Kernel Networking

eBPF Programs Everywhere

Cilium installs eBPF programs at multiple points in your kernel networking stack. When it works, packets flow through these programs instead of the usual Linux networking path. When it breaks, you're debugging kernel code with tools that barely exist.

The 3AM eBPF Debugging Reality:

  • `bpftool` shows you what programs are loaded, but good luck understanding what they actually do to your packets
  • Error messages are cryptic kernel-level failures like "Invalid argument" - I've spent hours on GitHub issues trying to decode what that means
  • When eBPF programs fail to load, networking just stops. No graceful fallback, no helpful error message
  • Kernel logs give you gems like "BPF JIT compilation failed" at 3am when you need to fix production

Cilium Agent: The Thing That Breaks

The Cilium agent runs on every node and does all the heavy lifting: watching the Kubernetes API for policy and endpoint changes, translating them into eBPF programs, and loading those programs into the kernel.

What Actually Goes Wrong:

level=error msg="Unable to compile BPF program" error="cannot load program: permission denied"

This usually means your kernel doesn't support the eBPF features Cilium needs, or someone enabled a security policy that blocks BPF syscalls. I hit this exact error on RHEL 8.4 because they disabled unprivileged BPF by default. The "fix" was either downgrading security or running containers privileged, which defeats half the point of using Kubernetes.
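When I hit that class of error now, these are the first things I check. A sketch - the sysctl and SELinux bits only apply on distros that ship those restrictions:

# Is unprivileged BPF disabled? (1 or 2 = disabled; the agent runs
# privileged, but hardened policies sometimes block bpf() anyway.)
sysctl kernel.unprivileged_bpf_disabled

# SELinux enforcing? Denied bpf() calls end up in the audit log.
getenforce

# The kernel-side reason, if there is one.
dmesg | grep -iE 'bpf|selinux' | tail -20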

Layer 7 Policies: The HTTP Example That Works Sometimes

The textbook HTTP policy example looks clean:

apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: backend-allow-frontend-http   # example name
spec:
  endpointSelector:
    matchLabels:
      app: backend
  ingress:
  - fromEndpoints:
    - matchLabels:
        app: frontend
    toPorts:
    - ports:
      - port: "8080"
        protocol: TCP
      rules:
        http:
        - method: "GET"
          path: "/api/v1/.*"

The Reality:

  • Works great if your HTTP requests match exactly what you expect
  • Breaks when your frontend sends POST requests you forgot about
  • The regex "/api/v1/.*" might not match /api/v1/users?limit=10 depending on how your app constructs URLs
  • Debugging requires looking at Hubble flows to see which requests are getting denied, as shown below
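A sketch of those Hubble queries, assuming Hubble is enabled and the pod carries app=backend in the default namespace:

# Show HTTP flows to the backend that the policy dropped.
hubble observe --namespace default --pod backend --protocol http --verdict DROPPED

# Or watch everything it receives, with method/path/status visible.
hubble observe --namespace default --pod backend --protocol http --follow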

Performance: It's Complicated

Cilium is definitely faster than iptables-based solutions, but the performance gain depends on your specific workload and configuration.

Hash Table Lookups vs iptables Chains:
iptables rules form a linked list that packets traverse sequentially. With 10,000 service endpoints, that's potentially 10,000 rule evaluations per packet. Cilium uses kernel hash tables for O(1) lookups.

Socket-Level Load Balancing:
For connections between pods (east-west traffic), Cilium intercepts socket operations and picks endpoints before the connection is established. This eliminates the NAT translation overhead you get with kube-proxy.

XDP Support (If Your NICs Support It):
XDP processes packets before they hit the normal kernel networking stack. Most cloud provider NICs don't support all XDP features, so you might not get the full benefit.

Hubble: Actually Useful Observability

Network Observability

Hubble is one of Cilium's genuinely useful features. It captures network flow data using the same eBPF programs that handle networking, so you get comprehensive visibility without additional performance overhead.

The Hubble UI provides an interactive service map that visualizes network flows in real-time, showing which services are communicating, HTTP status codes, latency metrics, and policy decisions. It's like having a detailed network diagram that updates itself as your applications run.

What You Actually See:

  • Every network connection with source/destination pod identities
  • HTTP request methods, response codes, and latency
  • DNS queries and responses (useful for debugging service discovery issues)
  • Network policy decisions (which connections were allowed/denied)

Debugging Example:
Your frontend can't reach your backend. Hubble shows:

frontend-pod -> backend-service DENIED (Policy denied)

You can see exactly which policy rule blocked the connection and fix it. Without Hubble, you'd be guessing why connections are failing.
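That flow is the kind of thing you can pull up directly. A sketch, assuming the pods live in the default namespace and are named frontend and backend:

# Just the traffic between these two workloads, most recent 20 flows.
hubble observe --from-pod default/frontend --to-pod default/backend --last 20

# JSON output includes the verdict and the exact drop reason per flow.
hubble observe --from-pod default/frontend --verdict DROPPED -o json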

ClusterMesh: Cool But Complex

ClusterMesh lets services in different clusters discover and connect to each other. The architecture is actually elegant:

  • Each cluster runs a clustermesh-apiserver that exposes local services
  • Clusters connect using mutual TLS and sync service information
  • Pods can connect to services in remote clusters through shared global services: same DNS name, endpoints synced from every cluster

What Will Go Wrong:

  • Network connectivity between clusters - your firewall team will block random ports
  • Certificate trust issues that make no sense - clusters that worked yesterday suddenly can't talk
  • IP address conflicts when someone forgot to check CIDR ranges before creating clusters
  • Identity conflicts causing random 403s when the same service exists in multiple clusters

I spent a full day debugging a ClusterMesh setup where everything looked correct but services couldn't connect. Turned out the mutual TLS certificates had different CA chains and were silently failing validation. Budget 3x longer than the docs suggest.

BGP Mode: For On-Premises Reality

If you're running on bare metal or want to avoid overlay networks, Cilium supports BGP routing. Your pods get routable IP addresses, and Cilium announces routes to your network infrastructure.

In BGP mode, Cilium eliminates VXLAN tunneling overhead by having each node announce pod routes directly to your network switches. This gives you native performance and makes traffic visible to existing network monitoring tools.

Requirements:

  • Your network team needs to understand BGP
  • Your switches/routers need to be configured correctly
  • IP address planning becomes critical (no more overlapping pod CIDRs)

The Benefit:
No VXLAN tunnels means better performance and easier troubleshooting. Network packets look normal to your existing monitoring tools.
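For orientation, a minimal sketch of what BGP mode looks like with Cilium's BGP control plane: a helm flag enables it, and a CiliumBGPPeeringPolicy tells matching nodes who to peer with. The ASNs, peer address, and node label here are made up; newer releases also offer CiliumBGPClusterConfig resources instead.

# Enable the BGP control plane (assumes a helm-managed install).
helm upgrade cilium cilium/cilium -n kube-system --reuse-values \
  --set bgpControlPlane.enabled=true

# One virtual router per matching node, announcing the pod CIDR to a
# top-of-rack peer.
kubectl apply -f - <<'EOF'
apiVersion: cilium.io/v2alpha1
kind: CiliumBGPPeeringPolicy
metadata:
  name: rack0-peering
spec:
  nodeSelector:
    matchLabels:
      rack: rack0
  virtualRouters:
  - localASN: 64512
    exportPodCIDR: true
    neighbors:
    - peerAddress: "10.0.0.1/32"
      peerASN: 64512
EOF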

When Things Break: The Real Architecture

Cilium Agent Down:

  • New pods can't get IP addresses
  • Existing network policies stop being enforced
  • eBPF programs stay loaded but don't get updates

eBPF Programs Fail to Load:

  • Networking stops working entirely on that node
  • Kubernetes events usually show unhelpful error messages
  • Check kernel logs and dmesg for actual eBPF errors

Memory Pressure:

  • eBPF maps have size limits
  • High connection churn can exhaust map entries
  • Symptoms: new connections fail while existing ones work
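A rough way to spot that situation, assuming the standard kube-system DaemonSet (and cilium-dbg as the in-pod binary name on newer versions):

# How full is the connection-tracking map on this node?
kubectl -n kube-system exec ds/cilium -- cilium bpf ct list global | wc -l

# The agent also exports map pressure as a metric.
kubectl -n kube-system exec ds/cilium -- cilium metrics list | grep map_pressure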

The architecture is elegant when everything works correctly. When it doesn't, you need to understand Linux networking, eBPF, and Kubernetes all at once.

Questions People Actually Ask

Q

Why does my Cilium agent keep crashing with "permission denied"?

A

Your kernel doesn't support the eBPF features Cilium needs, or BPF syscalls are restricted. Check:

# Verify eBPF support
zgrep CONFIG_BPF /proc/config.gz
# Should show CONFIG_BPF=y

# Check kernel version
uname -r
# Need 4.19+, preferably 5.10+

Fix it by upgrading your kernel or adjusting security policies that restrict BPF operations. Some security-hardened distributions disable BPF by default.

Q

How do I debug "connection refused" errors with Cilium network policies?

A

Use Hubble to see what's actually happening:

hubble observe --follow --pod frontend-pod

You'll see output like:

frontend-pod:54321 -> backend-pod:8080 DENIED (Policy denied)

The policy denied the connection. Check your CiliumNetworkPolicy resources and make sure the selectors match your pod labels. Layer 7 policies are especially picky about HTTP method and path matching.

Q

Can I just replace kube-proxy with Cilium without breaking everything?

A

Sometimes, but expect shit to break in weird ways.

Enable kube-proxy replacement like this:

cilium install --set kubeProxyReplacement=strict

What will definitely break:

  • NodePort services on GKE, because Google's load balancer expects kube-proxy iptables rules
  • Any monitoring that scrapes kube-proxy metrics - those endpoints disappear
  • Legacy apps that make assumptions about iptables NAT behavior
  • Anything using hostNetwork pods with services - weird edge case, but it happens

Test in staging first. When it works, it's noticeably faster. When it doesn't, you'll spend hours figuring out why your load balancer health checks started failing.

Q

Why is Cilium using so much memory compared to Flannel?

A

Cilium does more stuff.

It maintains:

  • eBPF maps for connection tracking
  • Identity mappings for all pods
  • Policy state for Layer 7 filtering
  • Flow data for Hubble observability

Typical memory usage: 150-300MB per node vs 50MB for Flannel. The memory usage scales with the number of pods and network policies. If you're running a simple cluster, Flannel's lower overhead might be better.

Q

How do I know if my NIC supports XDP for better performance?

A

Check the driver:

ethtool -i eth0 | grep driver

Good XDP support: mlx5_core, i40e, ixgbe, veth (containers)
Limited XDP: most cloud provider NICs, virtio drivers

You can enable XDP mode in Cilium, but you might not see performance gains if your NIC driver doesn't support the required features.

Q

Why are my HTTP network policies not working?

A

Layer 7 policies are finicky.

Common issues:

  • Your pods aren't actually sending HTTP traffic (might be HTTPS)
  • The HTTP path regex doesn't match what your app actually sends
  • The policy is applied to the wrong pods (label selectors)
  • You forgot to allow the specific HTTP method (POST, PUT, etc.)

Debug with:

hubble observe --http-status 403 --pod your-pod-name

Q

Can I migrate from Calico to Cilium without downtime?

A

No.

Don't believe anyone who tells you otherwise. Both CNIs install kernel modules and eBPF programs that conflict. You need to:

  1. Backup your network policies (and rewrite them, because the syntax is different)
  2. Drain nodes completely and reinstall with Cilium
  3. Or build a new cluster and migrate workloads (safer but more work)

I tried the "gradual migration" approach once. Spent two days with half my nodes unable to route to the other half. Just plan for downtime and do it cleanly.

Q

What happens when the Cilium agent dies?

A

Existing connections: keep working (eBPF programs stay loaded)
New connections: fail after existing conntrack entries expire
New pods: can't get IP addresses or network policies
Policy updates: stop being enforced

The agent restart usually takes 30-60 seconds. If it keeps crashing, check the logs:

kubectl logs -n kube-system ds/cilium

Q

How do I troubleshoot ClusterMesh connectivity issues?

A

Start with the basics:

# Check clustermesh status
cilium clustermesh status

# Verify certificates
kubectl get secret -n kube-system clustermesh-apiserver-remote-cert

Common failures:

  • Network connectivity between cluster API servers
  • Certificate trust issues
  • IP address space conflicts between clusters
  • Firewall rules blocking clustermesh traffic (port 2379)
Q

Why is my Cilium installation slower than kube-proxy?

A

You probably have a configuration issue:

  • Running in overlay mode when you could use native routing
  • XDP mode enabled on NICs that don't support it properly
  • Unnecessary Layer 7 policies adding overhead
  • Hubble collecting too much flow data

Check your Cilium configuration:

cilium config view | grep -E "(routing-mode|kube-proxy-replacement)"

Q

Does Cilium work with my service mesh?

A

It can work alongside Istio/Linkerd, but you're duplicating functionality. Cilium provides its own service mesh features through eBPF.

With a traditional service mesh: you get double encryption and double observability overhead.
Cilium-only: lighter weight, but fewer advanced traffic management features.

Choose one approach. Running both is usually overkill unless you have specific requirements that only one can meet.

Q

How do I debug eBPF programs when things go wrong?

A

You mostly can't, and that's the real problem with eBPF. The debugging tools are hot garbage:

# See what programs are loaded (but not what they do)
bpftool prog list

# Check eBPF maps (cryptic output)
bpftool map list

# Kernel logs for eBPF errors (good luck)
dmesg | grep -i bpf

When eBPF programs fail to load, you get error messages like "Invalid argument" with zero context. I've had production outages where the only fix was rebooting nodes because some eBPF map got corrupted.

Your best bet is checking GitHub issues for your exact kernel version + Cilium version combo. Someone else probably hit the same cryptic error.

Q

What's new in Cilium for 2025?

A

The upcoming 1.19 release focuses on improved multi-cluster support and better cloud integration.

Key improvements include:

  • Enhanced XDP support for AWS Nitro and Google GKE Autopilot
  • Better integration with Kubernetes Gateway API v1.0
  • Improved eBPF program compilation for ARM64 architectures
  • Enhanced Tetragon integration for runtime security

The Cisco acquisition in 2024 brought enterprise focus without changing the open-source model. Expect more enterprise features in the commercial Isovalent distribution while keeping core functionality free.
