Got woken up last weekend because containers couldn't talk to each other. The API was throwing connection refused errors, users were pissed, and we were hemorrhaging money. I SSH'd into the host to debug and realized - of course - there was no tcpdump installed. So there I was, deciding between spending 20 minutes installing debugging tools while everything burned, or just running netshoot.
Nicola Kabar built this because he got tired of the same bullshit I was dealing with - debugging containers without proper tools. It's a 200MB container packed with actual networking utilities, so you can attach to any broken container's network namespace and figure out what's wrong.
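With plain Docker, the usual move is to attach netshoot to the broken container's network namespace - something like this, where "api" is just a stand-in for whatever your container is called:
# run netshoot inside the network namespace of the container named "api"
docker run -it --rm --network container:api nicolaka/netshoot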
Real Production Problems
Last month our API started throwing connection errors to PostgreSQL. It took about 2 hours to figure out the real error was ECONNREFUSED 127.0.0.1:5432 - the logs were garbage. The database looked fine - at least kubectl get pods said it was Running, which means absolutely nothing. I spent way too long chasing bullshit - first I thought it was the connection pool, then I blamed the AWS load balancer, then I restarted half our infrastructure.
Without netshoot, you're fucked - kubectl exec drops you into containers with zero useful tools, because ours are stripped down to nothing in the name of security.
With netshoot I could finally debug:
kubectl debug api-pod-xyz -it --image=nicolaka/netshoot
Turns out DNS was fucked somehow. Took me 2 hours and 47 minutes to figure out because DNS was the last thing I checked, like an idiot. It's always DNS but you never check DNS first.
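For what it's worth, the kind of check that exposes this from inside the debug container is just a lookup plus a connect - the service name here is made up, use whatever your app is actually dialing:
# does the database hostname resolve at all?
dig postgres.default.svc.cluster.local
# if it resolves, can we actually reach the port?
nc -vz postgres.default.svc.cluster.local 5432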
What's Actually In This Thing
Packet Analysis: tcpdump for seeing what's hitting the wire, termshark for a terminal-based Wireshark, and tshark for command-line packet inspection. I've caught MTU issues breaking jumbo frames and load balancers silently dropping connections with these tools.
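A sketch of the captures I mean - the interface name and port are assumptions, swap in whatever your container actually uses:
# watch traffic to the database port; -nn skips name lookups so the output stays readable
tcpdump -i eth0 -nn port 5432
# hunt for "fragmentation needed" ICMP, the classic sign of an MTU mismatch
tcpdump -i eth0 -nn 'icmp[icmptype] == 3 and icmp[icmpcode] == 4'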
Connectivity Testing: curl, telnet, nc (netcat), and nmap for testing whether services are reachable, plus ping and traceroute for layer 3 stuff. These saved me when Istio was randomly black-holing 5% of requests - turns out the service mesh config was screwed.
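The usual progression looks something like this (hostname, port, and path are all placeholders):
# is anything listening on the port?
nc -vz api.internal 8080
# does a full HTTP request make it through, and where does it stall?
curl -v http://api.internal:8080/healthz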
DNS Debugging: dig, nslookup, host, and drill for when DNS breaks. Which is constantly. Kubernetes DNS fails intermittently in ways that make you question your life choices.
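When it does break, it helps to compare what the pod's configured resolver says against the cluster DNS service directly - the 10.96.0.10 address below is a common kube-dns default, not a guarantee:
# what resolver and search domains is this pod actually using?
cat /etc/resolv.conf
# ask the cluster DNS service directly, bypassing search-path weirdness
dig @10.96.0.10 kubernetes.default.svc.cluster.local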
Performance Testing: iperf3 for bandwidth, fortio for HTTP load testing. Used these to prove our "network issues" were actually the application being slow, not the network.
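Proving that usually takes two netshoot containers, one as server and one as client - the pod IP below is obviously a placeholder:
# in the first netshoot container: listen as the iperf3 server
iperf3 -s
# in the second, pointed at the first one's pod IP: measure raw throughput
iperf3 -c 10.244.1.23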
Brendan Gregg's tool diagram - if you've seen this before, you know why netshoot exists
Container network isolation - each container gets its own network namespace
Here's why netshoot actually works: it shares the network namespace with whatever container you're debugging. If your app container can't reach the database, netshoot attached to that same container will have identical connectivity problems. No "works on my machine" bullshit - you're debugging the exact same network stack that's failing.
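A quick sanity check once you're attached: the interfaces, routes, and listening sockets you see belong to the app container, not to wherever you ran the command from.
ip addr     # same interfaces as the broken container
ip route    # same routing table
ss -tln     # same listening sockets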
Network namespace isolation - containers see their own isolated network stack
Netshoot runs on AMD64 and ARM64, works with ephemeral containers in Kubernetes 1.25+, and has 9,800+ GitHub stars from engineers who've spent their own weekend outages figuring out why containers can't talk to each other. You'll find it mentioned constantly in Stack Overflow threads and GitHub issues whenever container networking breaks, and it shows up as the go-to debugging image in Kubernetes troubleshooting writeups and plenty of cloud providers' guides.