When you have 50+ microservices talking to each other, shit gets complicated fast. Istio tries to solve this by adding more complexity on top, which somehow actually works. I've deployed this nightmare in production three times now, and each time I question my sanity.
How This Thing Actually Works
Istio injects an Envoy proxy sidecar into every pod in your cluster - an extra container sitting right next to your app. These proxies intercept ALL network traffic - every HTTP request, every gRPC call, everything. The sidecars then phone home to istiod (the control plane) to get routing rules, security policies, and telemetry configuration.
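For reference, injection is usually switched on per namespace with a single label - the webhook does the rest (the namespace name here is a made-up example):

```yaml
# Label a namespace so Istio's webhook injects an Envoy sidecar
# into every new pod created there. Existing pods need a restart to pick it up.
apiVersion: v1
kind: Namespace
metadata:
  name: payments              # hypothetical namespace
  labels:
    istio-injection: enabled
```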
The data plane is where your traffic actually flows. Each Envoy sidecar is basically a programmable load balancer that sits between your app and the network. It handles retries, circuit breaking, load balancing, and collects metrics on every single request. The overhead is real - expect 50-200MB RAM per sidecar and 1-5ms latency per hop.
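If the per-sidecar overhead starts to hurt, you can cap it per workload. Istio honors these pod annotations in your Deployment's pod template - the numbers below are illustrative, not recommendations:

```yaml
# Per-pod overrides for the injected proxy's resource requests and limits.
metadata:
  annotations:
    sidecar.istio.io/proxyCPU: "100m"
    sidecar.istio.io/proxyMemory: "128Mi"
    sidecar.istio.io/proxyCPULimit: "500m"
    sidecar.istio.io/proxyMemoryLimit: "256Mi"
```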
The control plane (istiod) is the brain that tells all the sidecars what to do. It reads your Kubernetes services, your Istio configuration CRDs, and pushes updates to thousands of Envoy proxies. When istiod goes down, your traffic keeps flowing with the last known config, but you can't make changes until it comes back.
What You Actually Get
Traffic Management lets you do canary deployments without changing your application code. Want to send 10% of traffic to your new version? Write a VirtualService YAML and pray you didn't typo anything. Circuit breakers work great when you configure them right, which takes about 50 attempts.
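Here's roughly what that canary split looks like - a VirtualService for the weights plus a DestinationRule for the subsets and the circuit breaker. The service name, subset labels, and thresholds are placeholders to adapt, not a recipe:

```yaml
# Send 90% of traffic to v1, 10% to the canary (v2).
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: reviews
spec:
  hosts:
  - reviews
  http:
  - route:
    - destination:
        host: reviews
        subset: v1
      weight: 90
    - destination:
        host: reviews
        subset: v2
      weight: 10
---
# Subsets map to pod labels; trafficPolicy holds the circuit breaker.
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: reviews
spec:
  host: reviews
  subsets:
  - name: v1
    labels:
      version: v1
  - name: v2
    labels:
      version: v2
  trafficPolicy:
    connectionPool:
      http:
        http1MaxPendingRequests: 100
    outlierDetection:                 # the "circuit breaker" part
      consecutive5xxErrors: 5         # eject a host after 5 consecutive 5xx
      interval: 30s
      baseEjectionTime: 60s
```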
I learned this the hard way: enabling fault injection in production will make you very unpopular with your users. Test that shit in staging first.
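If you do want to play with fault injection, keep it scoped to a staging namespace. Something like this delays a slice of traffic and fails a tiny fraction outright - names and percentages are placeholders:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: ratings-chaos
  namespace: staging           # keep the blast radius out of prod
spec:
  hosts:
  - ratings
  http:
  - fault:
      delay:
        percentage:
          value: 10            # delay 10% of requests
        fixedDelay: 5s
      abort:
        percentage:
          value: 1             # fail 1% outright
        httpStatus: 503
    route:
    - destination:
        host: ratings
```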
Security is where Istio shines. Automatic mTLS between services works out of the box and actually makes your cluster more secure by default. Authorization policies let you control which services can talk to each other at the HTTP method level.
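The method-level control looks something like this - an AuthorizationPolicy that only lets one service account hit GET endpoints on a workload. The namespaces, labels, and service account are hypothetical:

```yaml
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: orders-readonly
  namespace: orders
spec:
  selector:
    matchLabels:
      app: orders              # applies to pods with this label
  action: ALLOW
  rules:
  - from:
    - source:
        principals: ["cluster.local/ns/frontend/sa/web"]   # caller's identity via mTLS
    to:
    - operation:
        methods: ["GET"]       # read-only access, nothing else
```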
Pro tip: Don't enable strict mTLS on day one unless you enjoy fixing broken legacy services at 3am.
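A saner rollout is to start in PERMISSIVE mode, watch for plaintext traffic, and only then flip to STRICT namespace by namespace. A sketch, with a made-up namespace:

```yaml
# PERMISSIVE accepts both mTLS and plaintext, so legacy clients keep working.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: legacy-apps       # hypothetical namespace
spec:
  mtls:
    mode: PERMISSIVE           # switch to STRICT once nothing breaks
```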
Observability is the real reason most people adopt Istio. Suddenly every service in your cluster has request metrics, distributed tracing, and access logs without touching application code. The Kiali dashboard looks pretty but becomes useless with more than 20 services - use Grafana instead.
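On recent versions the Telemetry API lets you dial tracing down instead of sampling everything - assuming you already have a tracing provider wired into the mesh config, something like this keeps the overhead in check:

```yaml
# Mesh-wide default: trace 1% of requests instead of 100%.
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: mesh-default
  namespace: istio-system      # root namespace, so it applies mesh-wide
spec:
  tracing:
  - randomSamplingPercentage: 1.0
```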
The Gotchas Nobody Tells You
Istio 1.20+ fixed most of the memory leaks that plagued earlier versions, but sidecars still crash occasionally. Kubernetes restarts the proxy container within seconds, but requests through that pod fail and you lose metrics until it comes back up.
Resource requirements are no joke. Small clusters need 2-4GB just for the control plane. Each sidecar uses 50-100MB minimum, but I've seen them hit 300MB+ with distributed tracing enabled. Do the math - 100 pods × 150MB = 15GB just for proxies.
The configuration complexity is real. You'll spend weeks learning the difference between VirtualServices, DestinationRules, and ServiceEntries. One misplaced hyphen in your YAML and traffic dies. Run istioctl analyze before you apply anything, and always test configs in staging first.
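For completeness, ServiceEntries are how you tell the mesh about things outside the cluster - without one, external calls bypass all the policies you just set up. A minimal sketch with a placeholder host:

```yaml
# Register an external HTTPS API so the mesh can route, retry, and meter calls to it.
apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: external-payments-api
spec:
  hosts:
  - api.example-payments.com   # placeholder external host
  location: MESH_EXTERNAL
  resolution: DNS
  ports:
  - number: 443
    name: https
    protocol: TLS
```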