A service mesh is either the solution to all your microservices problems or complexity you don’t need. Here’s how to tell which.
What a Service Mesh Does
A service mesh handles cross-cutting concerns for service-to-service communication:
- Traffic management — Load balancing, routing, retries
- Security — mTLS, authorization policies
- Observability — Metrics, tracing, logging
- Resilience — Circuit breakers, timeouts, fault injection
Instead of implementing these in every service, the mesh handles them at the infrastructure layer.
How It Works
Every service gets a sidecar proxy (usually Envoy). The proxy intercepts all traffic and applies policies. A control plane configures all the proxies.
Istio Quick Start
Install
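As a minimal sketch, assuming `istioctl` is already installed and you want the evaluation-oriented `demo` profile, an `IstioOperator` manifest along these lines describes the installation. The resource name and file name are placeholders, not anything Istio requires.

```yaml
# istio-install.yaml (hypothetical file name)
# Sketch only: the demo profile is intended for evaluation, not production.
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
metadata:
  name: demo-install          # assumed name
  namespace: istio-system
spec:
  profile: demo               # bundled profile with the core components enabled
```

Running something like `istioctl install -f istio-install.yaml` would then apply it to the current cluster.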
Deploy an App
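A hedged sketch of what deploying into the mesh can look like: label a namespace for automatic sidecar injection, then deploy as usual. The `demo` namespace, the `web` app, and the image are all hypothetical.

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: demo                        # hypothetical namespace
  labels:
    istio-injection: enabled        # asks Istio to inject the sidecar into new pods
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                         # hypothetical app
  namespace: demo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
        version: v1                 # version label, used later for subset routing
    spec:
      containers:
      - name: web
        image: example.com/web:1.0  # placeholder image
        ports:
        - containerPort: 8080
```

Pods created in the labeled namespace should come up with two containers: the app and the proxy.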
Traffic Management
Virtual Service (Routing Rules)
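A sketch of header-based routing, assuming a service named `reviews` and an `end-user` request header (both borrowed from Istio's Bookinfo sample, not stated here). The `v1` and `v2` subsets must also be declared in a DestinationRule for the same host.

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: reviews                 # assumed service name
spec:
  hosts:
  - reviews
  http:
  # Rules are evaluated in order; the first match wins.
  - match:
    - headers:
        end-user:               # assumed header carrying the user name
          exact: jason
    route:
    - destination:
        host: reviews
        subset: v2
  # Default route for all other requests.
  - route:
    - destination:
        host: reviews
        subset: v1
```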
User “jason” sees v2, everyone else sees v1.
Destination Rule (Load Balancing)
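A minimal sketch, assuming the same hypothetical `reviews` service with pods labeled `version: v1` and `version: v2`. The load-balancing algorithm shown is one valid choice, picked for illustration.

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: reviews                 # assumed service name
spec:
  host: reviews
  trafficPolicy:
    loadBalancer:
      simple: ROUND_ROBIN       # assumed; LEAST_REQUEST and RANDOM are also valid
  subsets:
  # Each subset selects pods by label; routing rules refer to these names.
  - name: v1
    labels:
      version: v1
  - name: v2
    labels:
      version: v2
```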
Canary Deployments
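A weighted split might look like this sketch, again assuming a `reviews` service with `v1` and `v2` subsets already declared in a DestinationRule.

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: reviews                 # assumed service name
spec:
  hosts:
  - reviews
  http:
  - route:
    # Weights across the destinations of one route should add up to 100.
    - destination:
        host: reviews
        subset: v1
      weight: 90
    - destination:
        host: reviews
        subset: v2
      weight: 10
```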
10% of traffic goes to v2.
Retries and Timeouts
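A sketch of per-route retries and a timeout. The service name and every number here are assumptions chosen for illustration; tune them to the latency your callers can tolerate.

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: reviews                 # assumed service name
spec:
  hosts:
  - reviews
  http:
  - route:
    - destination:
        host: reviews
    timeout: 10s                # overall deadline for the request, retries included
    retries:
      attempts: 3               # retry at most 3 times after the first try fails
      perTryTimeout: 2s         # deadline for each individual attempt
      retryOn: 5xx,connect-failure   # only retry on these failure classes
```

Keeping `perTryTimeout` well under `timeout` leaves room for the retries to actually happen before the overall deadline expires.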
Security
mTLS (Automatic Encryption)
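A minimal sketch of a mesh-wide strict mTLS policy, assuming Istio's root namespace is the default `istio-system`.

```yaml
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system       # root namespace, so the policy applies mesh-wide
spec:
  mtls:
    mode: STRICT                # reject plaintext traffic between workloads
```

Starting with `PERMISSIVE` mode and switching to `STRICT` once every workload has a sidecar is a common way to avoid breaking traffic during rollout.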
All service-to-service traffic is now encrypted and authenticated.
Authorization Policies
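A sketch of an allow-list policy, assuming the API pods carry the label `app: api`, both workloads live in the `default` namespace, and the caller runs under a service account named `web`. All of those names are hypothetical.

```yaml
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: api-allow-web           # assumed policy name
  namespace: default
spec:
  selector:
    matchLabels:
      app: api                  # assumed label on the API workload
  action: ALLOW
  rules:
  - from:
    - source:
        # Identity comes from the caller's mTLS certificate, so mTLS must be on.
        principals: ["cluster.local/ns/default/sa/web"]
```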
Only the web service account can call the API.
Observability
Metrics
Istio automatically collects:
- Request count, duration, size
- Response codes
- Connection metrics
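Access them via Prometheus, which scrapes the sidecar proxies. For example, `istio_requests_total` counts requests and `istio_request_duration_milliseconds` records latency.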
Distributed Tracing
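As a hedged sketch, Istio's Telemetry API can set a mesh-wide trace sampling rate. This assumes a tracing backend is already configured as the default provider in the mesh config; the 10% rate is an arbitrary example.

```yaml
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: mesh-default
  namespace: istio-system       # root namespace, so the setting applies mesh-wide
spec:
  tracing:
  - randomSamplingPercentage: 10.0   # assumed rate: sample roughly 1 in 10 requests
```

The proxies generate spans, but services still need to forward the trace headers they receive on outgoing calls for the spans to join into one trace.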
Traces show the full path of requests across services.
Kiali Dashboard
Kiali gives you a visual service graph, traffic flow, and health status. If the Kiali addon is installed, `istioctl dashboard kiali` opens it.
Circuit Breakers
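A sketch of a circuit breaker expressed as a DestinationRule, assuming the hypothetical `reviews` service. Only the threshold of 5 consecutive 5xx errors comes from the text; the connection limits, intervals, and ejection percentage are illustrative assumptions.

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: reviews-circuit-breaker # assumed name
spec:
  host: reviews                 # assumed service name
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100           # assumed cap on connections to the service
      http:
        http1MaxPendingRequests: 10   # assumed cap on queued requests
        maxRequestsPerConnection: 10
    outlierDetection:
      consecutive5xxErrors: 5   # eject a host after 5 consecutive 5xx responses
      interval: 30s             # how often hosts are evaluated
      baseEjectionTime: 30s     # minimum time an ejected host stays out
      maxEjectionPercent: 50    # never eject more than half of the hosts
```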
Ejects unhealthy instances after 5 consecutive 5xx errors.
When You Don’t Need a Service Mesh
Skip the mesh if:
- You have fewer than 10 services
- Traffic patterns are simple
- You can handle retries/timeouts in code
- You don’t need mTLS between services
- Observability tools already work
The overhead:
- Increased latency (small, but exists)
- More resource usage (sidecars need CPU/memory)
- Operational complexity
- Learning curve
Alternatives
Linkerd (Simpler)
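Assuming the Linkerd control plane is already installed in the cluster, opting a namespace into the mesh can be as small as one annotation. The namespace name is hypothetical.

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: demo                    # hypothetical namespace
  annotations:
    linkerd.io/inject: enabled  # Linkerd adds its proxy to new pods in this namespace
```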
Linkerd is lighter weight and easier to operate, but has fewer features.
No Mesh (Libraries)
Handle concerns in application code:
- Retries: tenacity (Python), resilience4j (Java)
- mTLS: Application-level certificates
- Tracing: OpenTelemetry SDK
- Metrics: Prometheus client libraries
More work per service, but no infrastructure overhead.
The Service Mesh Decision
| Need | Without Mesh | With Mesh |
|---|---|---|
| Retries | Code in each service | Config once |
| mTLS | Manual cert management | Automatic |
| Traffic splitting | Complex routing | Simple YAML |
| Observability | Instrument each service | Automatic |
| Authorization | Each service checks | Centralized policies |
Start without a mesh. Add one when you genuinely need:
- Zero-trust security (mTLS everywhere)
- Fine-grained traffic control
- Consistent observability across many services
- Policy enforcement at the infrastructure layer
The Service Mesh Checklist
Before adopting:
- Have 10+ services that communicate
- Need mTLS between all services
- Want canary/blue-green without code changes
- Need consistent retry/timeout policies
- Team has capacity to learn and operate
After adopting:
- Sidecar injection enabled
- mTLS mode configured
- Basic routing rules in place
- Observability dashboards accessible
- Team trained on troubleshooting
A service mesh is powerful, but power has a price. Make sure you’re buying something you’ll actually use.