When requests start failing in production, you need a systematic approach to identify whether the problem is in your application, the service mesh configuration, or the underlying infrastructure. Here's the debugging workflow that actually works when you're getting paged.
First Response: The 60-Second Health Check
Start with these commands before diving into complex debugging. Most service mesh issues fall into one of three categories: control plane failure, sidecar injection problems, or TLS configuration errors.
Istio Health Check:
kubectl get pods -n istio-system
istioctl proxy-status
istioctl version
Linkerd Health Check:
linkerd check
linkerd viz top deploy --namespace production
linkerd viz edges deployment --namespace production
If any of these commands show errors, you've found your starting point. Don't waste time on application-level debugging until the mesh itself is healthy.
Debugging the Data Plane: When Sidecars Attack
The most common production issues are sidecar proxy problems. These proxies intercept every network request, so when they malfunction, everything breaks in confusing ways.
Check Sidecar Injection Status:
kubectl describe pod <failing-pod> | grep -i "istio-proxy\|linkerd-proxy"
kubectl get pods -o jsonpath='{.items[*].spec.containers[*].name}' | grep proxy
If the sidecar isn't present, check your namespace labels and webhook configuration. If it's there but consuming excessive resources, you're dealing with a proxy resource leak that requires immediate investigation.
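A minimal sketch of those checks, assuming Istio's default injection label and webhook name (both vary by version and revision; Linkerd uses the linkerd.io/inject annotation instead):
## Confirm the namespace is labeled for injection
kubectl get namespace <namespace> --show-labels
## Confirm the injection webhook is registered
kubectl get mutatingwebhookconfiguration istio-sidecar-injector
## Check per-container resource usage for a suspected proxy leak (requires metrics-server)
kubectl top pod <failing-pod> -n <namespace> --containers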
Analyze Proxy Configuration:
## For Istio
istioctl proxy-config cluster <pod-name> -n <namespace>
istioctl proxy-config listeners <pod-name> -n <namespace>
## For Linkerd
linkerd viz stat deploy --namespace <namespace>
linkerd viz tap deploy/<deployment> --namespace <namespace>
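Tap output can be overwhelming on a busy deployment. The --to and --path flags narrow it to one upstream and one route; the names below are placeholders:
## Only show requests from the deployment to a specific upstream
linkerd viz tap deploy/<deployment> --namespace <namespace> --to deploy/<upstream>
## Filter to a specific request path
linkerd viz tap deploy/<deployment> --namespace <namespace> --path /api/orders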
The Envoy admin interface on port 15000 provides deep debugging capabilities. Check /config_dump for the full configuration and /stats for performance metrics.
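A quick way to reach the admin interface is a port-forward from your workstation; the stats grep below uses standard Envoy upstream counter names:
## Forward the Envoy admin port (Istio's default is 15000)
kubectl port-forward <pod-name> -n <namespace> 15000:15000
## In a second terminal: dump the full Envoy configuration
curl -s localhost:15000/config_dump > envoy-config.json
## Spot-check upstream failures and retries
curl -s localhost:15000/stats | grep -E "upstream_rq_(4xx|5xx|retry)"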
Certificate Hell: mTLS Debugging That Works
Certificate rotation failures cause the most dramatic production outages. Services that worked fine suddenly can't communicate, and error messages are cryptic. Here's how to diagnose TLS issues systematically.
Certificate Status Check:
## Istio certificate inspection
istioctl proxy-config secret <pod-name> -n <namespace>
## The root cert is mounted inside the sidecar container
kubectl exec <pod-name> -c istio-proxy -n <namespace> -- openssl x509 -in /var/run/secrets/istio/root-cert.pem -text -noout
## Linkerd certificate validation
linkerd check --proxy
linkerd viz edges deployment --namespace <namespace>
Common certificate problems include expired root certificates, clock skew between nodes, and SPIFFE certificate validation failures. The most brutal debugging scenario is when certificates expire during a weekend deploy and half your services can't authenticate.
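To rule out the expired-certificate case quickly, pull the live certificate chain from a sidecar and read its expiry date. A minimal sketch for Istio; the jq field path follows Istio's documented secret dump format, so verify it against your version:
## Extract the live certificate chain served by the sidecar
istioctl proxy-config secret <pod-name> -n <namespace> -o json \
  | jq -r '.dynamicActiveSecrets[0].secret.tlsCertificate.certificateChain.inlineBytes' \
  | base64 --decode > chain.pem
## Read the expiry date and compare it against node clocks
openssl x509 -in chain.pem -noout -enddate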
mTLS Troubleshooting Steps:
- Verify certificate chain validity with openssl verify (see the sketch after this list)
- Check system time synchronization across all nodes
- Inspect certificate Subject Alternative Names (SANs)
- Validate certificate rotation settings in your mesh configuration
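A sketch of the chain and SAN checks, assuming root-cert.pem was copied out of the sidecar and chain.pem was extracted as in the previous sketch:
## Copy the root certificate out of the sidecar
kubectl cp <namespace>/<pod-name>:var/run/secrets/istio/root-cert.pem root-cert.pem -c istio-proxy
## Verify the workload chain against the root
openssl verify -CAfile root-cert.pem chain.pem
## Inspect the SPIFFE identity in the Subject Alternative Names
openssl x509 -in chain.pem -noout -text | grep -A1 "Subject Alternative Name"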
Network Policy Conflicts: The Silent Killer
Service mesh policies interact with Kubernetes NetworkPolicies in ways that create impossible-to-debug connectivity issues. A service might work fine in testing but fail intermittently in production due to policy conflicts.
Policy Debugging Commands:
## Check effective policies
kubectl get networkpolicies --all-namespaces
kubectl describe networkpolicy <policy-name> -n <namespace>
## Istio-specific policy checking
istioctl analyze --all-namespaces
istioctl validate -f <istio-resource>.yaml
Network policy debugging requires understanding both Kubernetes-level and service mesh-level policy enforcement. When both layers are active, connectivity can fail in non-obvious ways.
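When you suspect a conflict, probe the path from inside the client pod so the request crosses the same enforcement points as application traffic. A minimal sketch, assuming curl exists in the app container and placeholder names throughout:
## Probe the target service from the client's app container
kubectl exec deploy/<client-deployment> -n <namespace> -c <app-container> -- \
  curl -s -o /dev/null -w "%{http_code}\n" --max-time 5 http://<target-service>:<port>/
A timeout with no HTTP code usually points at a policy drop at the network layer, while a 503 returned by the sidecar points at mesh routing or TLS.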
The most frustrating production issue is when a deployment works in staging but fails in production due to different network policies. Always check policy differences between environments before escalating to application teams.
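A quick way to surface those differences, assuming your kubeconfig defines staging and production contexts under those names:
## Export policies from both environments and diff them
kubectl --context staging get networkpolicies -n <namespace> -o yaml > staging-policies.yaml
kubectl --context production get networkpolicies -n <namespace> -o yaml > production-policies.yaml
diff staging-policies.yaml production-policies.yaml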