What Actually Breaks When Docker Swarm Networking Dies

I've spent way too many 3am debugging sessions figuring out why Docker Swarm's networking shits the bed. Here's what actually happens when everything goes wrong.

Docker's Service Discovery Will Fuck You Over

[Diagram: Docker Swarm architecture]

Docker Swarm stacks about five networking layers on top of each other - embedded DNS, VIP load balancing via IPVS, the ingress routing mesh, VXLAN overlays, and the host's own iptables - and all of them have to work perfectly or nothing works at all. When one breaks, good luck figuring out which one. The official docs sure as hell won't prepare you for debugging this mess.

DNS is the First Thing to Break

Docker Swarm uses built-in DNS that works until it doesn't. Your containers hit up Docker's embedded DNS server at 127.0.0.11 to find other services.

Here's how it's supposed to work:

  • Service web gets a Virtual IP that magically load balances
  • tasks.web returns actual container IPs (when it feels like it)
  • Every container trusts Docker's DNS server completely

The thing that'll bite you: DNS happens inside the Docker daemon, not in /etc/hosts. When DNS breaks, it's usually because nodes can't talk to each other, or Docker's DNS server has gone completely insane with stale data.
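
If you want to see the split for yourself, here's a minimal check. It assumes a service literally named web and a container attached to the same overlay network - swap in your own names:

## VIP vs. per-task resolution, from inside any container on the same overlay
docker exec <container-id> nslookup web          ## should return exactly one virtual IP
docker exec <container-id> nslookup tasks.web    ## should return one IP per running replica
docker service inspect web --format '{{json .Endpoint.VirtualIPs}}'   ## the VIP Docker thinks it handed out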

Routing Mesh: The Magic That's Not Actually Magic

The routing mesh is supposed to let you hit any node and magically reach your service. Published ports like --publish 8080:80 go through the ingress network, which is supposed to route traffic to a node that actually runs the container.

What's actually happening under the hood:

  • IPVS load balancing: Linux kernel doing the heavy lifting
  • VXLAN tunneling: UDP port 4789 carries your traffic between nodes
  • iptables rules: A massive pile of firewall rules that nobody understands
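
If you'd rather see those layers than take my word for it, here's a rough peek on any swarm node. The DOCKER-INGRESS chain name and the 4789 port are Docker defaults at the time of writing, so treat them as assumptions and adjust for your setup:

## The iptables layer: published ports show up as DNAT rules in the nat table
sudo iptables -t nat -L DOCKER-INGRESS -n --line-numbers

## The VXLAN layer: cross-node overlay traffic rides UDP 4789
sudo tcpdump -ni any udp port 4789 -c 20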

The Ways This Shit Will Break in Production

MTU: The Problem That Ruins Weekends

Here's the thing that took me 6 hours to figure out: MTU mismatches are the devil. VXLAN encapsulation adds 50 bytes of overhead to every packet, so if your physical network MTU is 1500 and the overlay still assumes 1500, full-size packets no longer fit once they're wrapped. RFC 7348 explains why the overhead exists, but that doesn't help when production is down.

[Diagram: VXLAN network]

You'll see this bullshit:

  • Ping works fine (small packets)
  • HTTP requests just timeout (larger packets)
  • File uploads randomly fail
  • Database queries work until they don't
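
Before you go log-diving, here's a 30-second check with ping's don't-fragment flag (Linux iputils syntax; sizes assume a 1500-byte path, so adjust if yours differs):

## 1472 bytes of payload + 28 bytes of ICMP/IP headers = a full 1500-byte packet
## -M do sets the don't-fragment bit, so an MTU problem fails loudly instead of silently
ping -c 3 -M do -s 1472 <remote-node-ip>    ## fails if anything on the path is under 1500
ping -c 3 -M do -s 1422 <remote-node-ip>    ## a 1450-byte packet - should survive the VXLAN overhead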

I learned this the hard way during Black Friday 2022. Site went down intermittently - small API calls worked fine, larger ones just died with "connection reset by peer". Spent 3 hours checking logs, database, load balancer. Basically everything except the obvious thing. Turned out network team had changed switch MTU from 9000 to 1500 the week before without updating Docker configs. Who does that right before Black Friday? Our revenue was dropping $2000/minute while I traced packets. Anyway, that's when I learned to check MTU first - could've saved 3 hours and my sanity.

DNS Goes Insane Under Load

Docker's DNS server loses its mind when shit gets busy. I've seen tasks.<service> return complete garbage:

[Diagram: Docker service discovery]

  • Empty results when containers are definitely running
  • Only half your replicas show up in DNS
  • IPs of containers that died last Tuesday

The symptoms are fun:

  • Random "service not found" errors that make no sense
  • Load balancer happily sending traffic to dead containers
  • Half your app works, half doesn't, and you can't figure out why

Real example: We had a PostgreSQL connection pool in our Docker 20.10.8 setup using tasks.database to find all replicas. Under load, DNS would return stale IPs like 10.0.1.43 from containers that had been killed and rescheduled 6 hours earlier. 30% of connections failed with "ECONNREFUSED 10.0.1.43:5432" because they were trying to connect to ghost containers. Took me 2 fucking days to realize Docker's DNS was just lying to us. Turns out this is a known issue since Docker 19.03 - mentioned in one GitHub issue from 2020 that has 47 thumbs up and zero official response.
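
One way to catch the DNS lying is to diff what it returns against the addresses swarm actually assigned to running tasks. Rough sketch - the overlay name backend and service name database are placeholders, the overlay needs to have been created with --attachable for the docker run to join it, and the NetworksAttachments field name can shift between API versions:

## What the embedded DNS claims
docker run --rm --network backend alpine nslookup tasks.database

## What swarm actually has running (task objects carry their overlay addresses)
docker inspect $(docker service ps -q --filter desired-state=running database) \
  --format '{{.Status.State}} {{range .NetworksAttachments}}{{.Addresses}}{{end}}'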

VXLAN Tunnels: When Nodes Can't Talk

VXLAN tunnel failures are my favorite because the symptoms are completely counterintuitive:

  • Services work fine on the same node
  • Cross-node communication is fucked
  • docker service ps shows everything is healthy
  • You can ping between nodes just fine

Common causes that'll ruin your day:

  • Firewall blocking UDP port 4789 (of course)
  • Network gear that doesn't understand VXLAN properly (especially older switches)
  • VMware NSX using the same damn port as Docker
  • Cloud security groups configured by someone who doesn't understand networking
  • MTU mismatches between host and overlay networks causing packet fragmentation

Certificates: The Silent Killer

Docker Swarm uses mutual TLS for everything. When certificates expire or get corrupted, service discovery just dies. Certificate management becomes critical for large-scale deployments:

  • Services can't resolve across nodes (but local stuff works fine)
  • Manager nodes randomly show "Unavailable"
  • New services won't deploy and you get useless error messages
  • DNS queries timeout and you blame the network
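
Certificates are easy to rule out before you disappear down the networking rabbit hole. A quick check, assuming Docker's default certificate path (verify it on your install):

## When does this node's TLS cert expire?
sudo openssl x509 -in /var/lib/docker/swarm/certificates/swarm-node.crt -noout -enddate -issuer

## What does the daemon itself think, and is it already complaining?
docker info --format 'Swarm: {{.Swarm.LocalNodeState}}, managers: {{.Swarm.Managers}}'
journalctl -u docker --since "1 hour ago" | grep -i "x509\|certificate"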

Load Balancer Goes Crazy

Docker's IPVS load balancer is great until it isn't. It'll happily keep routing traffic to containers that don't exist anymore:

  • Some requests work, others randomly timeout
  • Scaling events make everything slower
  • Services look healthy but error rates spike
  • IPVS backend state corruption causes persistent routing failures
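
The catch when you go looking: swarm programs IPVS inside hidden network namespaces, not in the host's default namespace, so a plain ipvsadm on the host can look deceptively empty. A peek under the hood - the ingress_sbox namespace name is the usual default, but treat it as an assumption:

## List the network namespaces Docker created for its overlays
sudo ls /var/run/docker/netns/

## Inspect the IPVS table the routing mesh actually uses
sudo nsenter --net=/var/run/docker/netns/ingress_sbox ipvsadm -L -n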

When Everything Goes Wrong at Once

The real fun starts when multiple things break simultaneously. I've seen these combinations destroy entire weekends:

  1. MTU + DNS: MTU issues drop large DNS responses, so you only get partial service discovery
  2. Certificates + VXLAN: Expired certs prevent tunnels from re-establishing after network blips
  3. DNS + Load balancer: Stale DNS feeds bad IPs to the load balancer
  4. High availability + Scale: HA setups amplify networking failures
  5. Security + Performance: Security practices often conflict with network debugging needs

Resource Exhaustion: The Hidden Killer

Docker daemon resource usage will fuck you sideways:

  • Memory pressure: Docker daemon eating all RAM makes DNS slow as molasses
  • File descriptor limits: Hit the limit and new connections just die
  • CPU throttling: High load means DNS timeouts and health checks fail

Here's what nobody tells you: monitor dockerd itself, not just your pretty containers. I've seen a daemon eating 15GB out of 16GB RAM cause intermittent service discovery failures that looked like application bugs. The error logs? "context deadline exceeded" - thanks for nothing. Spent 3 days debugging overlay networks and DNS configs until I noticed dockerd was swapping like a Windows 95 machine. DNS timeout issues are probably resource starvation more often than network fuckery, but Docker's error messages sure don't help you figure that out.
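
Here's a quick dockerd health snapshot you can stuff into a cron job or your monitoring agent. It's plain Linux, nothing Docker-specific, so adjust paths for your distro:

pid=$(pidof dockerd)
ps -o pid,rss,pcpu,etime -p "$pid"              ## resident memory and CPU right now
sudo ls /proc/"$pid"/fd | wc -l                 ## open file descriptors
grep "open files" /proc/"$pid"/limits           ## the ceiling you're creeping toward
grep -E "VmSwap|VmRSS" /proc/"$pid"/status      ## is the daemon swapping?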

Why Dev Environment Tests Are Useless

Your local Docker setup won't prepare you for production reality:

  • Geographic distribution: Cross-region latency makes VXLAN tunnels unstable
  • Enterprise networking: Corporate firewalls and SDN that hate VXLAN
  • Scale: DNS performance turns to shit with hundreds of services
  • Load patterns: Burst traffic that destroys connection pools
  • VIP network complexity: Service discovery behaves differently under load
  • DNS resolver conflicts: Embedded DNS server issues with hundreds of errors per second in production

Docker's error messages are worse than useless: "Connection refused" could be MTU, DNS staleness, or certificate bullshit. "Service not found" might be DNS cache poisoning, and "Request timeout" could be anything from VXLAN tunnel failure to dockerd memory pressure.

Bottom line: Docker Swarm networking problems masquerade as application bugs when they're actually infrastructure failures. You can't debug this mess without understanding the entire fucking stack - from VXLAN tunnels to embedded DNS to IPVS load balancing to certificate rotation.

How to Actually Debug This Mess

When Docker Swarm networking is fucked, skip the fancy diagnostics and start with what actually works. Here's my 3am debugging playbook.

Before you ask - no, restarting Docker doesn't magically fix everything. I know half the Stack Overflow answers suggest systemctl restart docker as step 1, but those people have never debugged a 20-node production cluster.

First: Figure Out What's Actually Broken

Check if Your Services Are Even Running

Before you waste time on network traces, make sure your shit is actually deployed:

## Check service status and placement
docker service ls
docker service ps <service-name> --no-trunc

## Verify service configuration (this will be painful to read)
docker service inspect <service-name> --pretty

Red flags that mean you're fucked:

  • Fewer replicas running than you asked for
  • Tasks stuck in "Pending" or "Preparing" (forever)
  • Recent restarts that don't make sense
  • Placement constraints pointing to dead nodes

(Side note: placement constraints are the devil and whoever thought they were a good idea clearly never worked weekends.)

Quick Test: Is This Service-Specific or Everything?

Deploy a simple test service to see if the problem is everywhere:

## Deploy a simple test service across nodes
docker service create --name connectivity-test \
  --mode global \
  --publish 8999:80 \
  nginx:alpine

## Test from each node - if this fails, your whole cluster is fucked
curl http://<node1-ip>:8999
curl http://<node2-ip>:8999
curl http://<node3-ip>:8999

If this test service works but your app doesn't, it's probably your application's fault. If the test fails too, your cluster networking is broken.

Actually, scratch that - it could be any number of things. Docker Swarm is like that.

DNS Debugging (Because That's Usually What's Broken)

[Diagram: Docker DNS architecture]

Test DNS Patterns (Some Work, Some Don't)

Docker has different DNS patterns that break independently. Test all of them:

## From inside a container, test all DNS patterns
docker exec -it <container-id> sh

## Test service name resolution (VIP)
nslookup web
ping web

## Test tasks resolution (all individual IPs)  
nslookup tasks.web
dig tasks.web

## Test external DNS (verifies DNS server connectivity)
nslookup google.com

What this tells you:

  • Service name works, tasks. doesn't = VIP is fine, task discovery is fucked
  • Neither works = DNS is completely broken
  • External DNS fails = Container's DNS config is garbage

When You Need to Go Deeper

If basic DNS tests don't tell you enough:

## Check DNS server responses directly
docker exec <container> dig @127.0.0.11 web
docker exec <container> dig @127.0.0.11 tasks.web

## Verify DNS server is responsive
docker exec <container> telnet 127.0.0.11 53

## Compare DNS responses across nodes
docker exec <container-on-node1> dig tasks.web
docker exec <container-on-node2> dig tasks.web

Red flags that mean you're fucked:

  • DNS queries taking forever (> 5 seconds) - should be under 100ms unless dockerd is dying
  • Different answers from different nodes (node1 says "web" is 10.0.1.5, node2 says 10.0.1.8)
  • Missing records for containers that are definitely running - docker ps shows them, DNS returns NXDOMAIN
  • Ghost IPs from dead containers (10.0.1.43 still in DNS even though no running container owns that address anymore)
  • "server can't find tasks.web: NXDOMAIN" when you have 50+ replicas (Docker 20.x DNS scales like garbage)
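
If you suspect the nodes are disagreeing, hammer the embedded DNS from two of them and diff the answers - stale entries tend to show up as flapping results. Rough loop, container names are placeholders:

for i in $(seq 1 20); do
  docker exec <container-on-node1> dig +short tasks.web | sort > /tmp/node1.txt
  docker exec <container-on-node2> dig +short tasks.web | sort > /tmp/node2.txt
  diff /tmp/node1.txt /tmp/node2.txt > /dev/null || echo "nodes disagree on iteration $i"
  sleep 1
done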

Network Layer Debugging (The Fun Part)

VXLAN Testing (Because UDP is Evil)

This is where most cross-node failures happen:

[Diagram: Docker VXLAN overlay network]

## List all networks and their details
docker network ls
docker network inspect ingress
docker network inspect <custom-overlay-network>

## Check VXLAN tunnel status - netstat is garbage, use ss
sudo ss -u sport = :4789

## Test UDP connectivity for VXLAN (this usually fails)
## On node1:
sudo tcpdump -i any port 4789 -v
## On node2 (actually send something so the tcpdump above has a packet to catch):
echo "vxlan-test" | nc -u -w1 <node1-ip> 4789

MTU Testing (The Thing That'll Ruin Your Weekend)

MTU problems are sneaky as hell. Here's how to catch them:

## Check MTU on all interfaces (look for inconsistencies)
ip addr show | grep mtu

## Test with different packet sizes - this is the key test
## This should work:
ping -c 3 -s 1400 <remote-node-ip>
## This will fail if MTU is fucked:
ping -c 3 -s 1472 <remote-node-ip>

## Test from container to container across nodes
docker exec <container1> ping -s 1400 <container2-ip>
docker exec <container1> ping -s 1472 <container2-ip>

If small packets work but large ones fail, MTU is 100% your problem. VXLAN adds 50 bytes overhead, so stick with 1450 MTU and save yourself a weekend of debugging. Learned this during a Black Friday deployment - nothing like customer complaints to teach you about packet fragmentation.

Port and Firewall Verification

Oh, and another thing that'll bite you - firewall rules. Here's what ports need to actually work:

## Test required Docker Swarm ports
## From each node to every other node:

## Management communication (managers only)
telnet <manager-ip> 2377

## Node communication (all nodes)
telnet <node-ip> 7946
nc -u <node-ip> 7946

## Overlay network data (all nodes)  
nc -u <node-ip> 4789

## Application ports (published services)
telnet <node-ip> <published-port>

Load Balancer and Routing Mesh Diagnostics

IPVS State Inspection

Docker uses IPVS for internal load balancing. Corrupted IPVS state causes mysterious routing failures:

## Check IPVS configuration (requires root)
sudo ipvsadm -L -n

## Look for stale backend entries
sudo ipvsadm -L -n --stats

## Check for connection tracking issues
sudo conntrack -L | grep <service-port>

Warning signs:

  • Backend entries pointing to non-existent containers
  • Uneven connection distribution (some backends with 0 connections)
  • Connection tracking showing failed connection attempts

Ingress Network State Verification

The ingress network handles published port routing:

## Inspect ingress network thoroughly
docker network inspect ingress --format '{{json .IPAM}}'
docker network inspect ingress --format '{{json .Containers}}'

## List services with published ports (anything published rides the ingress network)
docker service ls --format '{{.Name}}: {{.Ports}}'

## Verify published port configuration
docker service inspect <service> --format '{{json .Endpoint.Ports}}'

Container-Level Network Diagnostics

From Inside Failing Containers

Sometimes the issue is visible only from the container's perspective:

## Check container network configuration
docker exec <container> ip addr show
docker exec <container> ip route show
docker exec <container> cat /etc/resolv.conf

## Test connectivity to specific services
docker exec <container> curl -v http://<service-name>:port
docker exec <container> telnet <service-name> <port>

## Check if container can reach its own service VIP
docker exec <container> curl http://<service-name>:port/health

Network Interface Analysis

Network interface problems cause subtle failures:

## Check all network interfaces in container
docker exec <container> netstat -i
docker exec <container> ethtool eth0

## Look for packet drops or errors
docker exec <container> cat /proc/net/dev

## Test interface-specific connectivity
docker exec <container> ping -I eth0 <target-ip>

Log-Based Troubleshooting

Docker Daemon Logs Analysis

The Docker daemon logs contain the networking error details you actually need - make log analysis your first debugging step:

## Check daemon logs for networking errors
journalctl -u docker --since "1 hour ago" | grep -i "network\|overlay\|vxlan\|dns"

## Look for specific error patterns
journalctl -u docker | grep -E "(context deadline exceeded|connection refused|no route to host)"

## Check for certificate issues
journalctl -u docker | grep -i "tls\|certificate\|x509"

Service and Container Logs

Application logs often reveal networking symptoms, and TLS problems in particular tend to show up there long before Docker's own errors make any sense:

## Service logs for connection errors
docker service logs <service-name> --tail 100 | grep -i "connect\|timeout\|refused"

## Container logs for DNS issues
docker logs <container-id> 2>&1 | grep -i "dns\|resolve\|lookup"

Performance Impact Assessment

Connection Timing Analysis

Measure the performance impact of networking issues. Slow service discovery is often the first sign that a lower network layer is in trouble:

## Time DNS resolution
time docker exec <container> nslookup <service-name>

## Measure connection establishment time
docker exec <container> curl -w "%{time_connect}\n" -o /dev/null -s http://<service>:port/

## Test concurrent connection capacity
docker exec <container> ab -n 100 -c 10 http://<service>:port/

Resource Utilization During Failures

Network failures often correlate with resource exhaustion on the nodes themselves - resource limits hit network stability harder than most people expect:

## Check Docker daemon resource usage
ps aux | grep dockerd
top -p $(pidof dockerd)

## Monitor network traffic during failures
iftop -i docker_gwbridge
netstat -s | grep -E "(dropped|error|timeout)"

Diagnostic Decision Tree

[Diagram: Docker overlay network]

If DNS resolution fails:

  1. Check embedded DNS server (127.0.0.11:53)
  2. Verify overlay network connectivity
  3. Test certificate validity
  4. Check for resource exhaustion

If DNS works but connections fail:

  1. Test MTU with large packets
  2. Check firewall rules and port access
  3. Verify IPVS/load balancer state
  4. Test application-level connectivity

If some connections work, others don't:

  1. Check for partial DNS responses
  2. Test load balancer backend health
  3. Verify container placement and constraints
  4. Look for intermittent network partitions

If external access fails but internal works:

  1. Check ingress network configuration
  2. Verify published port routing
  3. Test routing mesh functionality
  4. Check external load balancer configuration

Start at the application layer and work your way down through DNS, overlay networks, VXLAN tunnels, and physical network connectivity. Each layer fails differently - knowing the patterns saves you time, and checking layer by layer beats guessing.
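
If you want the first few minutes of that top-down walk scripted, here's a rough triage pass I'd run from a manager node. Service and network names are placeholders, and it only reads state - nothing destructive:

## Layer 1: are the nodes and services even there?
docker node ls
docker service ls

## Layer 2: is DNS answering inside the overlay?
docker run --rm --network <attachable-overlay-network> alpine nslookup tasks.<service-name>

## Layer 3: does the path between nodes pass full-size packets?
ping -c 3 -M do -s 1422 <other-node-ip>

## Layer 4: are the swarm ports reachable?
nc -z -w 2 <other-node-ip> 7946 && echo "7946/tcp ok"
nc -z -u -w 2 <other-node-ip> 4789 && echo "4789/udp probe sent"

## Layer 5: is dockerd itself struggling?
ps -o rss,pcpu -p "$(pidof dockerd)"
journalctl -u docker --since "30 min ago" | grep -ciE "deadline exceeded|connection refused"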

Fixing Docker Swarm Service Discovery and Routing Mesh Issues

Once you've diagnosed the root cause, here's how to fix the most common production failures.

Well, "fix" is a strong word. More like "worked for me once and might work for you. No guarantees."

MTU and VXLAN Configuration Fixes

[Diagram: Docker Swarm services]

MTU Optimization Strategy

MTU mismatches are the leading cause of "works sometimes" networking issues. VXLAN overhead requires 50 bytes, making 1450 the safe maximum for most networks.

Look, I know 1450 seems random, but trust me on this - I've debugged MTU issues for 4 years across AWS, Azure, and shitty on-premise networks where some genius set the switch MTU to 1500. 1450 just fucking works. Always.

Here's the command that'll save your ass:

## Set MTU on all Docker interfaces - this is hacky but works
sudo ip link set dev docker_gwbridge mtu 1450
sudo ip link set dev docker0 mtu 1450

## daemon.json "mtu" covers the default bridge (and this clobbers an existing daemon.json - merge by hand if you already have one)
sudo systemctl stop docker
echo '{"mtu": 1450}' | sudo tee /etc/docker/daemon.json
sudo systemctl start docker

## Overlay networks need the MTU set when they're created
docker network create -d overlay -o com.docker.network.driver.mtu=1450 <custom-overlay-network>

## Check if it actually worked
docker run --rm alpine ip link show eth0

Permanent MTU configuration:
Create /etc/docker/daemon.json with networking optimizations:

{
  "mtu": 1450,
  "live-restore": true,
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}

VXLAN Port Conflicts Resolution

When VXLAN port 4789 conflicts with other systems (especially VMware NSX):

## Recreate swarm with custom data path port
docker swarm leave --force
docker swarm init --data-path-port=7789 --advertise-addr <node-ip>

## For existing swarms, this requires complete rebuild (painful but necessary)
## Export service configurations first - you'll thank me later:
docker service inspect <service> > service-backup.json

VMware NSX specific fix:

## Disable VMware NSX on VXLAN port (don't ask me why this works, it just does)
sudo iptables -I OUTPUT -p udp --dport 4789 -j DROP
sudo iptables -I INPUT -p udp --sport 4789 -j DROP

## Use alternative data path port
docker swarm init --data-path-port=7789

DNS Resolution Repair Strategies

Embedded DNS Server Reset

When DNS becomes inconsistent, resetting the embedded DNS server often resolves stale entries:

## Restart Docker daemon on affected nodes
sudo systemctl restart docker

## For a more aggressive reset, recreate the affected overlay network
## (the default "bridge" network is pre-defined and can't be removed, so don't waste time trying)
docker network rm <custom-overlay-network>
docker network create -d overlay --attachable <custom-overlay-network>
sudo systemctl restart docker

## Verify DNS server responsiveness
docker run --rm --network <attachable-overlay-network> alpine nslookup tasks.<service-name>

DNS Cache Invalidation

Force DNS cache refresh across the cluster:

## Trigger DNS propagation by updating service labels
docker service update --label-add "dns-refresh=$(date +%s)" <service-name>

## Force service rescheduling (50/50 chance this works, but worth trying)
docker service update --force <service-name>

## For custom networks, recreate them
docker network rm <custom-overlay-network>
docker network create -d overlay <custom-overlay-network>

Advanced DNS Configuration

For persistent DNS issues, implement custom DNS configuration:

## Create service with custom DNS settings
docker service create \
  --name web \
  --dns 8.8.8.8 \
  --dns-search yourdomain.com \
  --replicas 3 \
  nginx:alpine

## For existing services, update DNS configuration
docker service update --dns-add 8.8.8.8 <service-name>

Overlay Network Reconstruction

Complete Overlay Network Reset

When overlay networks become corrupted, full reconstruction is often necessary:

## Step 1: Document current configuration
docker network ls --filter driver=overlay
docker network inspect ingress > ingress-backup.json

## Step 2: Remove all services from custom networks
docker service ls --format "table {{.Name}}\t{{.Ports}}"
docker service rm <service1> <service2> # or scale to 0

## Step 3: Remove and recreate overlay networks
docker network rm <custom-overlay-network>
docker network create -d overlay --attachable <custom-overlay-network>

## Step 4: Recreate ingress network (if necessary)
docker network rm ingress
docker network create \
  --driver overlay \
  --ingress \
  --subnet=10.255.0.0/16 \
  --gateway=10.255.0.1 \
  ingress

Ingress Network Troubleshooting

Ingress network problems affect published port access:

## Check if ingress network exists and is healthy
docker network inspect ingress --format '{{json .IPAM.Config}}'

## Verify no services are attached to ingress except routing mesh
docker network inspect ingress --format '{{json .Containers}}'

## Force ingress network recreation
docker service rm <all-services-with-published-ports>
docker network rm ingress
## Docker automatically recreates ingress on next service with published ports
docker service create --name test --publish 8080:80 nginx:alpine

Load Balancer and IPVS State Repair

[Diagram: Docker load balancing]

IPVS State Cleanup

Corrupted IPVS state causes traffic to route to dead backends:

## Clear all IPVS rules (requires service restart)
## Note: swarm keeps its IPVS tables inside per-network namespaces, so the host table may look empty - the restart below rebuilds everything either way
sudo ipvsadm -C

## Restart Docker to recreate IPVS rules
sudo systemctl restart docker

## Force service updates to regenerate load balancer configuration
docker service update --force <service-name>

Connection Tracking Reset

For connection tracking issues that cause persistent failures:

## Clear connection tracking table
sudo conntrack -F

## Check the connection tracking ceiling (only raise it - don't paste in a value lower than what's already there)
cat /proc/sys/net/netfilter/nf_conntrack_max
echo 65536 | sudo tee /proc/sys/net/netfilter/nf_conntrack_max

## Make persistent
echo "net.netfilter.nf_conntrack_max = 65536" | sudo tee -a /etc/sysctl.conf

Certificate and Authentication Fixes

Certificate Rotation and Renewal

Certificate issues prevent cross-node communication:

## Check certificate expiration
docker system info | grep -A 10 "Swarm:"

## Force certificate rotation (managers only)
docker swarm ca --rotate

## For expired certificates, rejoin nodes
docker node demote <node-id>
docker node rm <node-id>
## On affected node:
docker swarm leave --force
docker swarm join --token <worker-token> <manager-ip>:2377

TLS Configuration Reset

For TLS handshake failures:

## Reset swarm CA configuration
docker swarm update --cert-expiry 2160h0m0s  # 90 days

## For complete certificate reset (destructive)
docker swarm leave --force
sudo rm -rf /var/lib/docker/swarm
docker swarm init --advertise-addr <node-ip>

Firewall and Security Group Configuration

Docker Swarm Port Configuration

Ensure all required ports are open between nodes:

[Diagram: Docker Swarm architecture]

## Open required ports (adjust for your firewall)
## Management port (managers only)
sudo ufw allow from <manager-subnet> to any port 2377

## Node communication (all nodes)
sudo ufw allow from <cluster-subnet> to any port 7946

## Overlay network data (all nodes)
sudo ufw allow from <cluster-subnet> to any port 4789/udp

## Published service ports
sudo ufw allow <published-port>

Cloud Provider Security Group Rules

For AWS, Azure, or GCP deployments:

AWS Security Group Rules:

## Manager nodes
aws ec2 authorize-security-group-ingress \
  --group-id sg-manager \
  --protocol tcp \
  --port 2377 \
  --source-group sg-cluster

## All nodes
aws ec2 authorize-security-group-ingress \
  --group-id sg-cluster \
  --protocol tcp \
  --port 7946 \
  --source-group sg-cluster

aws ec2 authorize-security-group-ingress \
  --group-id sg-cluster \
  --protocol udp \
  --port 4789 \
  --source-group sg-cluster

Performance Optimization and Resource Fixes

Docker Daemon Resource Optimization

Resource constraints cause DNS timeouts and connection failures:

## Increase Docker daemon limits
## (/etc/default/docker is only read on non-systemd hosts - on systemd, use daemon.json or the override below)
echo 'DOCKER_OPTS="--max-concurrent-downloads=3 --max-concurrent-uploads=3"' | sudo tee -a /etc/default/docker

## Configure systemd resource limits
sudo mkdir -p /etc/systemd/system/docker.service.d
cat << EOF | sudo tee /etc/systemd/system/docker.service.d/override.conf
[Service]
LimitNOFILE=1048576
LimitNPROC=1048576
LimitCORE=infinity
TasksMax=infinity
EOF

sudo systemctl daemon-reload
sudo systemctl restart docker

Memory and Connection Tuning

Optimize for high-connection workloads:

## Increase network buffer sizes
echo 'net.core.rmem_max = 16777216' | sudo tee -a /etc/sysctl.conf
echo 'net.core.wmem_max = 16777216' | sudo tee -a /etc/sysctl.conf
echo 'net.ipv4.tcp_rmem = 4096 65536 16777216' | sudo tee -a /etc/sysctl.conf
echo 'net.ipv4.tcp_wmem = 4096 65536 16777216' | sudo tee -a /etc/sysctl.conf

## Apply immediately
sudo sysctl -p

Service Recovery and Validation

Service Health Validation

After applying fixes, validate service communication:

## Test cross-node DNS resolution (tasks.<name> only resolves on a user-defined overlay, so give the test service one)
docker network create -d overlay --attachable dns-test-net
docker service create --name dns-test --network dns-test-net --mode global alpine sleep 3600
docker exec $(docker ps -q -f "name=dns-test" | head -1) nslookup tasks.dns-test

## Test load balancing distribution
for i in {1..10}; do
  curl -s http://<service-endpoint> | grep hostname
done

## Performance test with concurrent connections
docker run --rm --network <overlay-network> \
  alpine/curl -s -w "%{time_total}\n" \
  -o /dev/null \
  http://<service-name>:port/

Monitoring Setup for Early Detection

Implement monitoring to catch future issues:

## Monitor DNS resolution times (attach the monitor to the same overlay as the service it watches; "web" is a placeholder)
docker service create \
  --name dns-monitor \
  --mode global \
  --network <overlay-network> \
  --restart-condition on-failure \
  alpine sh -c 'while true; do time nslookup tasks.web; sleep 60; done'

## Monitor service connectivity (plain alpine ships busybox wget, not curl)
docker service create \
  --name connectivity-monitor \
  --mode global \
  --network <overlay-network> \
  alpine sh -c 'while true; do wget -q -O /dev/null http://<service-name>:<port>/health || echo "FAIL"; sleep 30; done'

Emergency Recovery Procedures

Cluster-Wide Service Discovery Reset

When everything's fucked, start over:

## 1. Export all service configurations
docker service ls --format "{{.Name}}" | xargs -I {} docker service inspect {} > services-backup.json

## 2. Scale all services to 0 (don't remove)
docker service ls --format "{{.Name}}" | xargs -I {} docker service scale {}=0

## 3. Remove all custom overlay networks
docker network ls --filter driver=overlay --format "{{.Name}}" | grep -v ingress | xargs docker network rm

## 4. Restart Docker on all nodes (coordinate timing)
sudo systemctl restart docker

## 5. Recreate networks and scale services back up
## (Parse services-backup.json to restore configuration)

Split-Brain Recovery

For network partitions that cause split-brain scenarios (fun times):

## 1. Identify which partition has quorum
docker node ls  # Run on each partition

## 2. Force nodes to rejoin main cluster
## On isolated nodes:
docker swarm leave --force
docker swarm join --token <worker-token> <main-manager-ip>:2377

## 3. Verify cluster integrity
docker node ls
docker service ls

The key principle: always start with the least disruptive fix and escalate to more aggressive solutions. DNS cache refresh often resolves issues without service interruption, while complete network reconstruction requires planned downtime but guarantees clean state.

Track which solutions work for your specific environment—networking configurations vary widely, and some fixes that work in AWS might not apply to on-premises deployments with different network stacks.

Document this shit in your runbooks. You'll thank yourself when you get paged at 2am for the same DNS issue again. And you will get paged again. I've been woken up for the exact same "tasks.database returns empty results" issue 3 times in 6 months because we didn't document the fix properly. Learn from my sleep deprivation.
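
Even a bare-bones entry beats nothing. Something like this - the path and format are whatever your team already uses, this is just the shape:

cat >> runbooks/swarm-stale-dns.md << 'EOF'
Symptom: tasks.<service> returns empty or stale IPs
Fixes that worked, least disruptive first:
  1. docker service update --force <service>        (about 30s, no downtime)
  2. systemctl restart docker on the affected node  (drains that node briefly)
  3. recreate the overlay network                   (planned downtime)
Incidents: <date>, <date>, <date>
EOF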

Frequently Asked Questions

Q

Why can my containers ping each other but HTTP requests fail?

A

Classic MTU fragmentation bullshit.

ICMP ping packets are tiny (64 bytes) but HTTP requests with headers often exceed 1450 bytes. VXLAN adds 50 bytes of overhead, so if your network MTU is 1500, large packets get fragmented and dropped into the void. Fix: echo '{"mtu": 1450}' | sudo tee /etc/docker/daemon.json and restart Docker. Test with ping -s 1472 <target> - if this fails but smaller packets work, congratulations, you've confirmed MTU fuckery.

Q

How do I fix "service not found" errors when I can see the service is running?

A

Docker's embedded DNS server is shitting the bed again. First, test DNS directly: docker exec <container> nslookup <service-name>. If this pukes with "server can't find web: NXDOMAIN", the embedded DNS at 127.0.0.11 has gone AWOL. Try docker service update --force <service-name> to kick DNS in the teeth. Still broken? Nuclear option: sudo systemctl restart docker. I've seen the embedded DNS server lose its mind during high load (200+ concurrent requests) but Docker Engineering hasn't figured out why this happens in fucking 2024.

Q

Why does my service only work when accessed from the same node it's running on?

A

You have routing mesh failure, usually caused by ingress network problems or firewall blocking. Check if VXLAN port 4789/UDP is blocked: telnet <other-node-ip> 4789. Test published port routing: curl http://<any-node-ip>:<published-port> should work from any node. If routing mesh is completely broken, recreate the ingress network: docker network rm ingress then deploy a service with published ports to auto-recreate it.

Q

What causes intermittent "connection refused" errors in Docker Swarm?

A

Usually it's stale load balancer state - Docker's IPVS load balancer is happily sending traffic to ghost containers.

Check with sudo ipvsadm -L -n for entries pointing to containers that died 3 days ago. Clear this bullshit with sudo ipvsadm -C and restart Docker. Also check if DNS is lying: dig tasks.<service-name> should only return IPs of running containers. If you see IPs of containers that died last week, force DNS refresh with docker service update --force <service-name>. This happens embarrassingly often for a "production-ready" platform.

Q

How do I troubleshoot "context deadline exceeded" errors?

A

Timeout errors that mean your nodes can't talk to each other. Check the usual suspects: 2377/TCP (managers), 7946/TCP+UDP (all nodes), 4789/UDP (overlay). Test each: telnet <node-ip> 2377 and nc -u <node-ip> 4789. If those work, check certificates: docker system info | grep -A 10 Swarm to see if certs expired. Expired certs = nodes can't authenticate = everything times out. Fix with docker swarm ca --rotate but prepare for some downtime.

Q

Why do DNS queries return empty results for "tasks.service-name"?

A

The tasks. prefix returns all individual container IPs, and empty results mean Docker can't find running containers for that service.

This happens when: 1) the service has no running replicas - check docker service ps <service>; 2) the overlay network is partitioned - test cross-node container communication; 3) the DNS server has stale data - force a refresh with docker service update --force <service>. The regular service name (without tasks.) uses the VIP, which is more resilient than individual task IPs.

Q

How do I fix Docker Swarm when VXLAN port 4789 conflicts with other systems?

A

VMware NSX uses the same port 4789 for VXLAN, causing conflicts.

Recreate the swarm with a custom data path port: docker swarm leave --force then docker swarm init --data-path-port=7789 --advertise-addr <ip>. For existing clusters this requires a complete rebuild - export service configs first. Alternatively, if you can't change Docker's port, configure VMware NSX to use a different VXLAN port in your virtualization layer.

Q

What should I do when overlay networks show containers but they can't communicate?

A

This indicates VXLAN tunnel failure between nodes.

First verify UDP port 4789 is open and test: nc -u <node-ip> 4789.

Check MTU consistency across all nodes: ip addr show | grep mtu - all should match.

Test packet size limits: ping -s 1400 <node-ip> (should work) vs ping -s 1472 <node-ip> (might fail). If tunnels are failing, restart Docker on affected nodes or recreate overlay networks entirely.

Q

How do I diagnose load balancing not distributing traffic evenly?

A

Uneven load balancing usually indicates some backends are unhealthy or unreachable. Check load balancer state: sudo ipvsadm -L -n --stats to see connection counts per backend. Test each backend directly: docker exec <container> curl http://tasks.<service>:port to see all backend IPs, then test each IP individually. Remove unhealthy backends by scaling service down and up, or force refresh with docker service update --force <service>.

Q

Why do some Docker Swarm nodes show "Unknown" status intermittently?

A

"Unknown" status indicates heartbeat timeouts between nodes, usually from network latency or packet loss.

Check network stability: ping -i 0.1 -c 100 <node-ip> to test for packet loss.

Verify system clocks are synchronized: timedatectl status - clock skew causes heartbeat issues.

Also check for resource exhaustion on the Docker daemon: ps aux | grep dockerd - high CPU/memory usage causes delayed heartbeats. Consider raising the heartbeat timeout if your network has high latency.

Q

How do I recover when multiple nodes show "Down" but are actually running?

A

This is usually certificate expiration or cluster split-brain. Check certificate status: docker system info | grep -A 10 Swarm. If certificates expired, you need to rejoin nodes: docker swarm leave --force on workers, then docker swarm join --token <token> <manager-ip>:2377. For managers, demote first: docker node demote <node-id>, then rejoin as worker and promote back if needed. Save service configurations before attempting recovery.

Q

What causes Docker services to fail with "no suitable node" errors?

A

Placement constraints are preventing scheduling. Check service constraints: docker service inspect <service> --format '{{json .Spec.TaskTemplate.Placement}}'. Common issues: constraints pointing to dead nodes, resource requirements exceeding available capacity, or labels missing from nodes. Remove bad constraints: docker service update --constraint-rm 'node.hostname==dead-node' <service> or add required labels: docker node update --label-add key=value <node-id>.

Q

How do I fix "could not find an available, non-overlapping IPv4 address pool" errors?

A

Docker ran out of subnet space for overlay networks.

Check existing networks with docker network ls and their subnets with docker network inspect <network> --format '{{json .IPAM}}'. Remove unused networks with docker network rm <network>.

For persistent issues, specify custom subnets: docker network create -d overlay --subnet=172.20.0.0/16 mynetwork. Default Docker subnets can conflict with corporate networks - plan your IP space carefully.

Q

Why do container health checks pass but service discovery fails?

A

Health checks test the container's application, but service discovery depends on Docker's networking layer. A container can be healthy but unreachable due to overlay network issues, DNS problems, or load balancer failures. Test service discovery separately: docker exec <another-container> curl http://<service-name>:port/health. If this fails while direct container health checks pass, you have networking issues, not application problems.

Q

How do I troubleshoot when external load balancers can't reach Docker Swarm services?

A

External load balancers need specific configuration for Docker Swarm.

If you're using published ports, point the load balancer at the published port on your swarm nodes (any node works - the routing mesh handles the internal routing). For DNSRR services, configure the load balancer to discover individual container IPs using tasks.<service-name> DNS queries. And check that the external load balancer can actually reach the Docker nodes: telnet <docker-node> <published-port> from the load balancer host.

Q

What should I do when Docker Swarm clustering works but my application-specific service discovery fails?

A

Your application might be using its own service discovery mechanism that conflicts with Docker's. Many applications (like Consul, etcd, or Kafka) have built-in clustering that doesn't understand Docker networking. Configure your application to use Docker service names instead of IP addresses, or set up proper DNS resolution for your application's discovery protocol. Check your application logs for connection errors to specific IPs rather than service names.

Resources That Actually Help