Docker Swarm Service Discovery & Routing Mesh Failure Guide
Critical System Overview
Docker Swarm networking consists of 5 interdependent layers that must all function correctly:
- Embedded DNS server (127.0.0.11)
- VXLAN overlay networks (UDP port 4789)
- IPVS load balancing (Linux kernel)
- Certificate-based node authentication
- Routing mesh for published ports
Failure Impact: When any layer fails, entire distributed applications become unreachable despite individual containers showing healthy status.
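Before drilling into any single layer, a sixty-second triage from a manager node usually narrows the blast radius. A minimal sketch, assuming SSH and sudo access on a manager:
# Node membership and manager reachability - Down/Unknown nodes point at certificates or the control plane
docker node ls
# Ingress network sanity - the routing mesh depends on it
docker network inspect ingress --format '{{.Driver}} {{.Scope}}'
# Control-plane ports actually listening (UDP 4789 is terminated in the kernel and will not show up here)
sudo ss -lntu | grep -E ':2377|:7946'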
Common Failure Patterns
DNS Resolution Failures
Symptoms:
- "Service not found" errors when containers are running
- Empty results from tasks.<service-name> queries
- Intermittent connection failures (30-80% failure rate)
Root Causes:
- Embedded DNS server returning stale container IPs from dead containers
- DNS performance degradation under load (>200 concurrent requests)
- Cross-node communication failures preventing DNS synchronization
Resolution Time: 5-30 minutes for DNS cache refresh, 2-6 hours for overlay network reconstruction
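Stale entries show up fast if you compare what the scheduler believes with what DNS answers. A sketch, assuming any healthy task of the service has nslookup available (busybox-based images do):
# What the scheduler thinks is running
docker service ps <service-name> --filter desired-state=running
# What the embedded DNS actually answers from inside the network
docker exec <container-id> nslookup tasks.<service-name>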
MTU Fragmentation Issues
Critical Threshold: VXLAN adds 50 bytes overhead - networks with 1500 MTU will drop packets >1450 bytes
Symptoms:
- Ping works (64-byte packets) but HTTP requests fail
- File uploads randomly fail
- Database queries timeout intermittently
Production Impact: $2000/minute revenue loss during Black Friday traffic when large API responses fail
Fix: Set MTU to 1450 permanently in /etc/docker/daemon.json
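Verifying the setting actually landed takes ten seconds and saves a second outage. Interface and network names here are placeholders:
# Host-side bridge MTU
ip link show docker_gwbridge | grep -o 'mtu [0-9]*'
# Per-network MTU option, if one was set at creation time
docker network inspect -f '{{index .Options "com.docker.network.driver.mtu"}}' <overlay-network>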
VXLAN Tunnel Failures
Port Requirements:
- TCP 2377 (manager communication)
- TCP/UDP 7946 (node communication)
- UDP 4789 (overlay network data)
Conflict Sources:
- VMware NSX using same port 4789
- Corporate firewalls blocking UDP traffic
- Cloud security groups misconfigured
Workaround: Use --data-path-port=7789 when initializing the swarm (see the sketch below - the data path port is fixed at cluster creation)
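The catch: --data-path-port only exists at init time, so this is for new clusters. Manager IP and subnet are placeholders:
# Must be chosen at init - an existing swarm has to be recreated to change it
docker swarm init --advertise-addr <manager-ip> --data-path-port 7789
# Open the replacement port instead of 4789
sudo ufw allow from <cluster-subnet> to any port 7789 proto udp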
Diagnostic Decision Tree
Step 1: Service vs Infrastructure Failure
# Deploy test service across all nodes
docker service create --name connectivity-test --mode global --publish 8999:80 nginx:alpine
- Test passes: Application-specific issue
- Test fails: Infrastructure networking failure
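Because the routing mesh answers on every node, looping the published port across node IPs separates one sick node from a mesh-wide failure. Node IPs are placeholders:
for node in <node1-ip> <node2-ip> <node3-ip>; do
  curl -fsS -o /dev/null -w "%{http_code} $node\n" "http://$node:8999" || echo "FAIL $node"
done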
Step 2: DNS Layer Testing
# From inside container
nslookup <service-name> # VIP resolution
nslookup tasks.<service-name> # Individual container IPs
nslookup google.com # External DNS validation
Failure Patterns:
- VIP works, tasks fail = Task discovery broken
- Both fail = DNS completely broken
- External fails = Container DNS configuration corrupted
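When the application containers are too stripped-down to run these lookups, a throwaway netshoot container on the same network does the job - with the caveat that docker run can only join overlays created with --attachable:
# netshoot ships dig, nslookup, tcpdump, and friends
docker run --rm --network <overlay-network> nicolaka/netshoot dig tasks.<service-name>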
Step 3: Network Layer Validation
# Test packet size limits (-M do forbids fragmentation, so oversized packets fail instead of silently fragmenting)
ping -s 1400 -M do <remote-node-ip> # 1428 bytes on the wire (payload + 28 header bytes) - should work
ping -s 1472 -M do <remote-node-ip> # 1500 bytes on the wire - fails if the path MTU is below 1500
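tracepath (from iputils, present on most distros) reports where along the path the MTU actually drops, which beats guessing:
# Shows per-hop path MTU; look for the hop where pmtu falls below 1500
tracepath -n <remote-node-ip>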
Step 4: Load Balancer State Check
# Check for stale backend entries
sudo ipvsadm -L -n --stats
Red flags: Backends pointing to non-existent containers, uneven connection distribution
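One gotcha: Swarm programs IPVS inside per-network load-balancer namespaces, so a bare ipvsadm on the host often shows nothing at all. A sketch for walking the namespaces - paths and naming vary across Docker versions:
# lb_* namespaces live under /run/docker/netns on most installs
for ns in /run/docker/netns/lb_*; do
  echo "== $ns =="
  sudo nsenter --net="$ns" ipvsadm -L -n
done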
Resource Requirements
Time Investment by Issue Type
- DNS cache refresh: 5-15 minutes
- MTU reconfiguration: 15-30 minutes + service restart
- Overlay network recreation: 1-3 hours + planned downtime
- Certificate rotation: 30-60 minutes
- Complete cluster rebuild: 4-8 hours + full service migration
Expertise Requirements
- Basic troubleshooting: Understanding of DNS, TCP/IP fundamentals
- Advanced debugging: Linux networking, iptables, VXLAN protocol knowledge
- Cluster recovery: Docker Swarm architecture, certificate management
Infrastructure Dependencies
- Monitoring tools: netshoot container, Weave Scope for network visualization
- Root access: Required for IPVS commands, firewall configuration
- Network access: All diagnostic commands require SSH/direct access to nodes
Critical Configuration Settings
Production-Ready daemon.json
{
"mtu": 1450,
"live-restore": true,
"log-driver": "json-file",
"log-opts": {
"max-size": "10m",
"max-file": "3"
}
}
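After editing daemon.json, restart and read the value back rather than trusting the file. Note that live-restore only keeps standalone containers alive across daemon restarts; swarm tasks are rescheduled by the managers regardless:
sudo systemctl restart docker
docker info --format '{{.LiveRestoreEnabled}}' # expect: true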
Required Firewall Rules
# Manager nodes
ufw allow from <cluster-subnet> to any port 2377 proto tcp
# All nodes (7946 is used over both TCP and UDP, so no proto restriction)
ufw allow from <cluster-subnet> to any port 7946
ufw allow from <cluster-subnet> to any port 4789 proto udp
System Resource Limits
# /etc/systemd/system/docker.service.d/override.conf
[Service]
LimitNOFILE=1048576
LimitNPROC=1048576
TasksMax=infinity
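systemd ignores override files until a daemon-reload, so apply and verify explicitly:
sudo systemctl daemon-reload
sudo systemctl restart docker
systemctl show docker --property LimitNOFILE # expect: LimitNOFILE=1048576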
Failure Recovery Procedures
Emergency DNS Reset (5-10 minutes)
# Force DNS cache refresh
docker service update --force <service-name>
# Aggressive reset if needed
sudo systemctl restart docker
MTU Fix (15-30 minutes)
# Immediate fix (lost on reboot or when the interface is recreated)
sudo ip link set dev docker_gwbridge mtu 1450
# Permanent configuration - warning: tee replaces the whole file, so merge the key into any existing daemon.json instead of overwriting it
echo '{"mtu": 1450}' | sudo tee /etc/docker/daemon.json
sudo systemctl restart docker
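One caveat worth knowing: the daemon.json mtu key governs the default bridge, while each overlay network takes its MTU from a driver option at creation time - so overlays created before the fix may need to be rebuilt:
# Overlay networks inherit MTU from this option, not from daemon.json
docker network create -d overlay --opt com.docker.network.driver.mtu=1450 <overlay-network>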
Complete Network Reconstruction (2-4 hours)
# 1. Export service configurations (one JSON array instead of concatenated fragments)
docker service inspect $(docker service ls -q) > backup.json
# 2. Remove services - overlay networks cannot be deleted while services still reference them, so scaling to 0 is not enough
docker service rm $(docker service ls -q)
# 3. Remove overlay networks
docker network ls --filter driver=overlay --format "{{.Name}}" | grep -v ingress | xargs docker network rm
# 4. Restart Docker cluster-wide
sudo systemctl restart docker
# 5. Recreate networks and services from backup.json (sketch below)
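Step 5 has no one-liner: the create flags have to be reassembled from backup.json by hand or by script. A minimal sketch with placeholder names:
# 5a. Recreate networks with an explicit MTU
docker network create -d overlay --opt com.docker.network.driver.mtu=1450 <overlay-network>
# 5b. Recreate each service from its backed-up spec
docker service create --name <service-name> --replicas <n> --network <overlay-network> <image>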
Monitoring and Prevention
Early Warning Indicators
- DNS query response time >5 seconds (normal: <100ms)
- IPVS backend connection count showing 0 for healthy containers
- Docker daemon memory usage >80% of available RAM
- Certificate expiration within 30 days
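Certificate expiry is checkable directly on each node - this assumes the default Docker data root, so adjust the path if data-root is customized:
# Prints the notAfter date of this node's swarm certificate
sudo openssl x509 -noout -enddate -in /var/lib/docker/swarm/certificates/swarm-node.crt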
Automated Health Checks
# DNS resolution monitoring - the monitor must share a network with the target service to resolve it
docker service create --name dns-monitor --mode global --network <overlay-network> \
alpine sh -c 'while true; do time nslookup tasks.<service-name>; sleep 60; done'
# Cross-node connectivity testing - stock alpine has no curl, so use busybox wget
docker service create --name connectivity-monitor --mode global --network <overlay-network> \
alpine sh -c 'while true; do wget -qO- http://<service-name>:<port>/health || echo "FAIL $(date)"; sleep 30; done'
Known Breaking Points
Scale Thresholds
- DNS performance: Degrades significantly above 200 concurrent services
- IPVS state: Becomes unreliable with >1000 backend changes per hour
- Certificate rotation: Requires cluster downtime above 50 nodes
Environmental Limitations
- Geographic distribution: Cross-region latency makes VXLAN unstable
- Enterprise networks: Corporate firewalls frequently block UDP 4789
- Cloud platforms: Security group defaults often break node communication
- Virtualization: VMware NSX conflicts with Docker VXLAN implementation
Version-Specific Issues
- Docker 19.03+: DNS server memory leaks under high load
- Docker 20.10.8: Connection pool failures with tasks.<service-name> DNS queries
- Linux kernel 5.4+: IPVS connection tracking changes affect load balancing
Decision Criteria
When to Use Docker Swarm vs Alternatives
Use Docker Swarm when:
- <50 nodes in single datacenter
- Simple service discovery requirements
- Minimal networking customization needed
Consider alternatives when:
- Multi-region deployment required
- Complex networking policies needed
- High-availability SLA >99.9%
- Team lacks deep Docker networking expertise
Troubleshooting Cost-Benefit Analysis
Worth immediate fix:
- MTU configuration (high success rate, low risk)
- DNS cache refresh (quick, non-disruptive)
- Certificate rotation (addresses authentication failures)
Consider alternatives before attempting:
- Complete overlay network rebuild (high downtime risk)
- Multi-node certificate reset (requires coordination)
- Cluster-wide Docker restart (service interruption guaranteed)
Common Misconceptions
- "Restarting Docker fixes everything" - Only resolves DNS cache issues, not underlying network problems
- "Ping success means networking is fine" - ICMP uses small packets, doesn't test MTU or application protocols
- "Container health checks validate service discovery" - Health checks bypass Docker networking layer
- "Default settings work in production" - Default MTU 1500 causes VXLAN fragmentation in most environments
- "Load balancing is automatic" - IPVS state corruption requires manual intervention
Emergency Contacts and Escalation
When to Escalate
- Multiple overlay network reconstruction attempts failed
- Certificate issues affecting >50% of cluster
- Network partitions lasting >30 minutes
- Data corruption in IPVS state requiring kernel-level intervention
Required Information for Support
- Docker version and kernel version
- Network topology diagram
- Complete output of docker system info
- Service configurations and placement constraints
- Firewall rules and security group configurations
- Recent infrastructure changes or deployments
Useful Links for Further Investigation
Resources That Actually Help
Link | Description |
---|---|
Docker Swarm Networking Guide | Read this first to understand how service discovery and routing mesh are supposed to work in Docker's magical world. The docs won't prepare you for production reality, but at least you'll understand the theory before everything goes to shit. |
Docker Networking Guide | The official networking docs. Covers the basics but won't tell you why your production is down at 3am. |
Docker Community Forums - Swarm Networking | Where real users share their war stories. Search for your specific error message here - the community usually has better solutions than the official docs. |
GitHub Issues - Docker Networking | The bug tracker where you'll find that your "unique" problem is actually a known issue from 2019. Great for finding workarounds and seeing which bugs will never get fixed. |
Stack Overflow - Docker Swarm | Hit or miss. Half the answers are "restart Docker" (useless) but sometimes you'll find the exact error message you're debugging and a solution that isn't complete garbage. |
Understanding VXLAN and Overlay Networks | Actually explains why MTU issues happen and how VXLAN can fail. This saved my ass when debugging tunnel failures. |
Linux IPVS and Load Balancing | Kubernetes article but explains IPVS load balancing that Docker Swarm uses. Helps you understand why load balancer state gets corrupted. |
Docker Swarm PKI and Certificate Management | Certificate management guide. You'll need this when certs expire and your entire cluster stops talking to itself. |
Weave Scope | This tool saved my ass when I had a network partition in our 12-node cluster but all the monitoring looked green. Visual map shows what's actually talking to what - turned out 4 nodes couldn't reach the other 8 because some genius rebooted a core switch during "maintenance". |
cAdvisor | Use this to catch when dockerd is consuming all your memory. Docker's built-in stats are about as reliable as a weather forecast. |
netshoot: Network Troubleshooting Swiss Army Knife | Container with tcpdump, nmap, dig, curl, etc. Deploy this when you need to debug from inside the network. I keep this running on every cluster now. |
HAProxy with Docker Swarm Service Discovery | I've used this when Docker's built-in load balancer kept dying. HAProxy actually works, imagine that. |
Docker Swarm Security Best Practices | Security hardening guide; covers the same certificate-expiry failure mode as the PKI article above. |