
Cilium: eBPF-Based Kubernetes CNI - AI-Optimized Technical Reference

Overview

Core Function: Replace traditional iptables/kube-proxy networking with eBPF kernel-level packet processing
CNCF Status: Graduated project (October 2023) - enterprise production ready
Primary Use Case: Scale Kubernetes networking beyond where traditional CNI plugins fail (1000+ services)

Technical Architecture

eBPF Implementation

  • Packet Processing: eBPF programs intercept packets in kernel space, eliminating userspace traversal
  • Performance Model: O(1) hash table lookups vs O(n) iptables rule traversal
  • Hook Points: XDP layer (early packet processing), TC hooks (policy enforcement), socket operations
  • Critical Dependency: Kernel 4.9+ minimum, 5.10+ recommended to avoid eBPF loading failures
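
A quick pre-flight check of these dependencies (a minimal sketch; the /boot/config-$(uname -r) path varies by distribution):

# Kernel version against the 4.9 minimum / 5.10 recommendation
uname -r

# Core eBPF options compiled in?
grep -E 'CONFIG_BPF=|CONFIG_BPF_SYSCALL=|CONFIG_BPF_JIT=' /boot/config-$(uname -r)

# Unprivileged BPF status (1 or 2 means restricted; flags hardened hosts)
sysctl kernel.unprivileged_bpf_disabled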

Core Components

Component | Function | Failure Impact
Cilium Agent | Compiles/loads eBPF programs per node | Complete networking failure on affected node
Cilium Operator | Cluster-wide management | Policy updates stop, new deployments fail
CNI Plugin | Pod networking setup | New pods can't get IP addresses
Hubble | Observability via eBPF flow capture | Loss of network visibility only
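
A minimal health check for each component, assuming the default kube-system namespace and standard labels:

# Agent (DaemonSet) and operator pods
kubectl -n kube-system get pods -l k8s-app=cilium
kubectl -n kube-system get pods -l name=cilium-operator

# Aggregated view from the cilium CLI, if installed locally
cilium status --wait

# Hubble relay, if observability is enabled
kubectl -n kube-system get pods -l k8s-app=hubble-relay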

Performance Characteristics

Scaling Limits

  • Traditional CNI Breaking Point: ~1000 services (iptables performance degradation)
  • Cilium Capacity: Millions of services with consistent O(1) lookup performance
  • Memory Usage: 150-300MB per node vs 50MB for Flannel
  • Connection Setup: 30-40% latency reduction for east-west traffic
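
To see the eBPF service map that replaces iptables rules, exec into an agent pod. A hedged sketch; in 1.15+ the in-pod binary is cilium-dbg, while older images expose the same subcommands as cilium:

# Per-node eBPF load-balancer (service) entries
kubectl -n kube-system exec ds/cilium -- cilium-dbg bpf lb list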

Hardware Requirements

  • XDP Support: mlx5_core, i40e, ixgbe drivers for optimal performance
  • Cloud Provider Limitations: Most cloud NICs have limited XDP feature support
  • Kernel Features: CONFIG_BPF=y required, unprivileged BPF may be disabled by default
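
A short host-side check before counting on XDP acceleration (eth0 is a placeholder for your NIC; config path varies by distro):

# Which driver backs the NIC? Compare against XDP-capable drivers (mlx5_core, i40e, ixgbe)
ethtool -i eth0

# XDP/AF_XDP-related kernel options
grep -E 'CONFIG_BPF=|CONFIG_XDP_SOCKETS=' /boot/config-$(uname -r)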

Critical Failure Modes

eBPF Program Loading Failures

Symptoms:

level=error msg="Unable to compile BPF program" error="cannot load program: permission denied"

Root Causes:

  • Kernel lacks required eBPF features
  • Security policies block BPF syscalls (common on RHEL 8.4+)
  • Memory pressure exhausting eBPF map entries

Resolution: Upgrade kernel or adjust security policies, potentially requiring privileged containers
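
A short triage sequence for this failure, assuming the default kube-system namespace:

# Kernel-side verifier or LSM denials
dmesg | grep -iE 'bpf|audit'

# Hardened hosts often set this to 1 or 2
sysctl kernel.unprivileged_bpf_disabled

# The agent's own view of the compile/load error
kubectl -n kube-system logs ds/cilium --tail=100 | grep -i bpf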

Agent Restart Scenarios

During Failure:

  • Existing connections: Continue working (eBPF programs persist)
  • New connections: Fail after conntrack expiry
  • New pods: Cannot obtain IP addresses
  • Policy updates: Stop processing

Recovery Time: 30-60 seconds for agent restart
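
When restarting agents deliberately (for example after a config change), a minimal sketch for watching recovery:

# Rolling restart of the agent DaemonSet
kubectl -n kube-system rollout restart daemonset/cilium
kubectl -n kube-system rollout status daemonset/cilium

# Wait until every agent reports healthy again
cilium status --wait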

Version Compatibility Issues

  • 1.15.7: eBPF loading issues on Ubuntu 22.04 with kernel 5.15
  • 1.16.2: ClusterMesh breaks between different minor versions
  • 1.17.x: Conflicts with AWS Load Balancer Controller v2.6+

Configuration Requirements

kube-proxy Replacement

Enable: cilium install --set kubeProxyReplacement=true (older releases used the strict/partial values, which newer versions no longer accept)
Breaking Changes:

  • NodePort services fail on GKE (Google LB expects kube-proxy iptables)
  • kube-proxy metrics endpoints disappear
  • Legacy apps assuming iptables NAT behavior break
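
After enabling it, verify what the agent actually activated rather than trusting the install flag (a sketch; older agent images use cilium instead of cilium-dbg):

# Replacement mode reported by a running agent
kubectl -n kube-system exec ds/cilium -- cilium-dbg status | grep -i kubeproxyreplacement

# Confirm kube-proxy is really gone before relying on Cilium for services
kubectl -n kube-system get ds kube-proxy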

Identity-Based Security

Mechanism: Workload identities derived from pod labels rather than from ever-changing IP addresses
Advantage: Policies survive pod rescheduling and IP changes
Implementation: Security identities carried in packet metadata
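
A minimal label-based policy sketch illustrating the model (the app=frontend/app=backend labels are hypothetical examples):

# Allow only pods labeled app=frontend to reach app=backend on TCP/8080
cat <<'EOF' | kubectl apply -f -
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: backend-allow-frontend
spec:
  endpointSelector:
    matchLabels:
      app: backend
  ingress:
  - fromEndpoints:
    - matchLabels:
        app: frontend
    toPorts:
    - ports:
      - port: "8080"
        protocol: TCP
EOF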

Multi-Cluster Networking (ClusterMesh)

Architecture

  • clustermesh-apiserver per cluster exposing local services
  • Mutual TLS connections between clusters
  • Service discovery sync across mesh
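
The cilium CLI drives the whole flow; a hedged outline assuming two kubectl contexts named cluster1 and cluster2:

# Deploy the clustermesh-apiserver in each cluster
cilium clustermesh enable --context cluster1
cilium clustermesh enable --context cluster2

# Connect the clusters (exchanges endpoints and CA material)
cilium clustermesh connect --context cluster1 --destination-context cluster2

# Verify both sides report ready before testing cross-cluster services
cilium clustermesh status --context cluster1 --wait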

Common Failures

  • Network Connectivity: Firewall blocks required ports
  • Certificate Trust: CA chain mismatches causing silent failures
  • IP Conflicts: Overlapping CIDR ranges between clusters
  • Identity Conflicts: Same service names causing random 403s

Setup Time: Budget roughly 3x what the documentation suggests

Debugging and Troubleshooting

Limited eBPF Debugging Tools

Available Tools:

  • bpftool prog list - confirms programs are loaded, not that they behave correctly
  • bpftool map list - displays eBPF maps (cryptic output)
  • dmesg | grep -i bpf - kernel error messages

Reality: Error messages like "Invalid argument" with zero context
Resolution Strategy: Search GitHub issues for exact kernel + Cilium version combinations
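
In practice, a sysdump plus the agent's verbose status gets further than raw bpftool output (a sketch; older agent images use cilium instead of cilium-dbg):

# One archive with agent logs, eBPF map dumps, and cluster state for bug reports
cilium sysdump

# Per-node endpoint and datapath health
kubectl -n kube-system exec ds/cilium -- cilium-dbg status --verbose

# Cilium's programs typically show up with a cil_ prefix
bpftool prog list | grep -i cil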

Hubble Observability (Actually Useful)

Capabilities:

  • Real-time network flow visualization
  • HTTP status codes and latency metrics
  • Network policy decision logging
  • Service communication mapping

Debug Example:

hubble observe --follow --pod frontend-pod
# Shows: frontend-pod:54321 -> backend-pod:8080 DENIED (Policy denied)

Layer 7 Policy Debugging

Common Issues:

  • HTTP path regex doesn't match actual URL patterns
  • Missing HTTP methods in policy (POST, PUT forgotten)
  • HTTPS traffic when policy expects HTTP
  • Wrong endpoint selectors
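
A sketch covering the usual fixes: spell out every method, match real URL paths, and confirm verdicts with Hubble (labels, port, and paths are hypothetical; check flag names against hubble observe --help):

# L7 rules must list every expected method and a regex that matches live paths
cat <<'EOF' | kubectl apply -f -
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: backend-l7-api
spec:
  endpointSelector:
    matchLabels:
      app: backend
  ingress:
  - fromEndpoints:
    - matchLabels:
        app: frontend
    toPorts:
    - ports:
      - port: "8080"
        protocol: TCP
      rules:
        http:
        - method: "GET"
          path: "/api/v1/.*"
        - method: "POST"
          path: "/api/v1/.*"
EOF

# Then watch HTTP verdicts to see whether the regex actually matches live traffic
hubble observe --protocol http --verdict DROPPED --last 50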

BGP Mode for On-Premises

Benefits

  • Eliminates VXLAN tunneling overhead
  • Native performance with existing network tools
  • Direct pod IP routing via BGP announcements

Requirements

  • Network team BGP expertise mandatory
  • Switch/router configuration coordination
  • Careful IP address space planning (no overlapping CIDRs)
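
A hedged sketch of the BGP control plane configuration; the ASNs, peer address, and rack label are placeholders for your environment:

# BGP control plane must be enabled at install time
cilium install --set bgpControlPlane.enabled=true

# Peering policy announcing pod CIDRs from nodes labeled rack=rack0
cat <<'EOF' | kubectl apply -f -
apiVersion: cilium.io/v2alpha1
kind: CiliumBGPPeeringPolicy
metadata:
  name: rack0-peering
spec:
  nodeSelector:
    matchLabels:
      rack: rack0
  virtualRouters:
  - localASN: 64512
    exportPodCIDR: true
    neighbors:
    - peerAddress: "10.0.0.1/32"
      peerASN: 64513
EOF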

Resource Planning

Time Investments

  • Basic Setup: 2-4 hours if following docs exactly
  • ClusterMesh: 3x documentation estimates (certificate issues)
  • Production Migration: Plan for downtime; there is no zero-downtime migration path from other CNIs
  • Debugging Skills: eBPF expertise required for deep troubleshooting

Expertise Requirements

  • Minimum: Kubernetes networking fundamentals
  • Operational: Linux kernel networking, eBPF concepts
  • Advanced: Kernel debugging, BGP routing (for on-premises)

Decision Criteria

Choose Cilium When:

  • Scaling beyond 1000 services
  • Need advanced Layer 7 policies
  • Want comprehensive network observability
  • Running multi-cluster architectures
  • Performance is critical for east-west traffic

Choose Alternatives When:

  • Simple cluster requirements (Flannel simpler)
  • Limited eBPF debugging expertise
  • Kernel version constraints (older distros)
  • Need guaranteed zero-downtime operations

Migration Reality

From Other CNIs

Process: Complete cluster rebuild or controlled downtime

  • Backup and rewrite network policies (different syntax)
  • Drain nodes completely before Cilium installation
  • No gradual migration possible (kernel module conflicts)

Testing Requirements

  • Staging cluster matching exact production environment
  • Load testing with realistic traffic patterns
  • Verify cloud provider compatibility (especially NodePort services)
  • Test all network policies and ingress controllers
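
The built-in connectivity test automates much of this checklist; a minimal run against the current kubectl context:

# End-to-end datapath, policy, and service checks (deploys test workloads)
cilium connectivity test

# Narrow to matching scenarios while iterating on a failure (test names vary by version)
cilium connectivity test --test pod-to-service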

Enterprise Considerations

Commercial Support

  • Isovalent (Cisco): Professional support, SLA coverage, enterprise features
  • Community Support: GitHub issues, Slack (invite often broken)
  • Training: Linux Foundation courses available (~$300)

Compliance and Security

  • Identity-based security model more resilient than IP-based
  • Layer 7 policies for application-level access control
  • Runtime security integration via Tetragon project
  • Comprehensive audit trails through Hubble

Current Status and Roadmap

Version Recommendations

  • Production: 1.16.15 (stable, tested)
  • Avoid: 1.17.x series (identity allocation issues under load)

2025 Developments

  • Enhanced cloud provider integration (AWS Nitro, GKE Autopilot)
  • Kubernetes Gateway API v1.0 support
  • Improved ARM64 eBPF compilation
  • Better Tetragon integration for runtime security

Critical Success Factors

  1. Kernel Compatibility: Verify eBPF feature support before deployment
  2. Testing Strategy: Comprehensive staging environment matching production exactly
  3. Expertise Planning: Ensure team has eBPF troubleshooting capabilities
  4. Migration Strategy: Plan for downtime, don't attempt zero-downtime migration
  5. Version Strategy: Stay on proven stable versions, test upgrades thoroughly

Useful Links for Further Investigation

Resources That Actually Help

  • Cilium Community: The main community hub. Slack invite is broken half the time - use GitHub issues instead. The maintainers actually respond there.
  • Cilium GitHub Issues: Search here first before posting questions. Someone has probably hit your exact eBPF loading error or ClusterMesh connectivity issue. The maintainers are responsive but expect you to have done basic troubleshooting.
  • Official Documentation: Actually decent for once. The installation guides work (if you follow them exactly) and the troubleshooting section has solutions to real problems instead of generic "check your configuration" bullshit.
  • Hubble GitHub: Source code and examples for the observability component. Useful if you want to understand how the flow monitoring actually works or need to customize the setup.
  • Isovalent Labs: Hands-on tutorials that mostly work. The "Getting Started" lab is actually better than the official quickstart because it tells you what to do when shit inevitably breaks. Free and runs in your browser without installing anything locally.
  • eBPF.io: Essential reading if you want to understand why Cilium is fast and why it's a pain in the ass to debug. Good primer on eBPF concepts without getting too deep into kernel internals.
  • Cilium Users List: Companies running Cilium in production. Useful for convincing your boss that it's not just experimental software. Google, Adobe, and other companies with real traffic use it.
  • CNCF Project Page: Graduated project status means it passed CNCF's due diligence. Not just marketing - they actually audit code quality, security, and governance.
  • Linux Foundation Course (LFS146): Structured learning if you prefer courses over trial-and-error. Decent introduction but assumes you understand Kubernetes networking basics. About $300.
  • Cilium Certified Associate: Official cert program. Useful if your company pays for training budgets or you need paper credentials. The exam covers real operational scenarios.
  • Performance Benchmarks: Actual benchmark data with methodology. Better than random blog posts claiming "Cilium is 10x faster." Shows realistic performance comparisons with test configurations.
  • eBPF Architecture Guide: Technical details about how eBPF programs are structured and loaded. Useful when you need to debug why eBPF programs fail to compile on your specific kernel version.
  • Scalability Documentation: Real scaling limits and tuning parameters. Covers what breaks when you hit 1000+ nodes and how to fix it.
  • Cilium Official Website: Pretty but useless. Generic marketing bullshit about "cloud-native transformation." Skip this and go straight to the docs or GitHub issues.
  • Isovalent Enterprise: Now owned by Cisco (acquired in 2024). Commercial support with SLAs, enterprise features, and professional services. Expensive as hell but worth it if you're running Cilium at scale and need someone to blame when networking shits the bed at 3am. Cisco hasn't ruined the open-source project yet.
  • Tetragon: Runtime security using eBPF. Complements Cilium by monitoring process execution and file access. Separate project but from the same team.
  • Gateway API: New Kubernetes ingress standard that Cilium implements. More flexible than Ingress resources but adds complexity.
