When one service in your distributed system starts failing, what happens to everything else? Without proper safeguards, a single sick service can bring down your entire platform. Circuit breakers are the solution.
The Cascading Failure Problem
Picture this: Your payment service starts timing out because a third-party API is slow. Every request to your checkout service now waits 30 seconds for payments to respond. Your checkout service’s thread pool fills up. Users can’t complete purchases, so they refresh repeatedly. Your load balancer marks checkout as unhealthy. Traffic shifts to fewer instances. Those instances overload. Now your entire e-commerce platform is down — because of one slow API.
This is a cascading failure, and it’s devastatingly common in microservice architectures.
How Circuit Breakers Work
The circuit breaker pattern, popularized by Michael Nygard in “Release It!”, mimics electrical circuit breakers. When failures reach a threshold, the circuit “opens” and stops making requests to the failing service.
Three states define the circuit:
Closed (Normal) — Requests flow through normally. The breaker monitors for failures.
Open (Failing) — After enough failures, the breaker trips. All requests fail immediately without attempting the call. This gives the downstream service time to recover.
Half-Open (Testing) — After a timeout period, the breaker allows a few test requests through. If they succeed, it closes. If they fail, it opens again.
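The three states above can be sketched as a small state machine in plain Java. This is a deliberately minimal, single-threaded teaching sketch (the class name and threshold semantics are invented for illustration — production breakers use sliding windows, not a simple consecutive-failure counter):

```java
import java.time.Duration;
import java.time.Instant;
import java.util.function.Supplier;

public class MinimalCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private Instant openedAt;
    private final int failureThreshold;
    private final Duration recoveryTimeout;

    public MinimalCircuitBreaker(int failureThreshold, Duration recoveryTimeout) {
        this.failureThreshold = failureThreshold;
        this.recoveryTimeout = recoveryTimeout;
    }

    public <T> T call(Supplier<T> request) {
        if (state == State.OPEN) {
            if (Instant.now().isAfter(openedAt.plus(recoveryTimeout))) {
                state = State.HALF_OPEN;   // timeout elapsed: let a test request through
            } else {
                throw new IllegalStateException("circuit open: failing fast");
            }
        }
        try {
            T result = request.get();
            consecutiveFailures = 0;
            state = State.CLOSED;          // success closes the circuit
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
                state = State.OPEN;        // trip: stop calling the failing service
                openedAt = Instant.now();
            }
            throw e;
        }
    }

    public State state() { return state; }
}
```

Note the asymmetry: a single failure in HALF_OPEN re-opens the circuit, while CLOSED tolerates failures up to the threshold.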
Real-World Implementation with resilience4j
In production, use a battle-tested library. Here’s resilience4j in a Spring Boot application:
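A sketch of what that wiring might look like, assuming the `resilience4j-spring-boot2` starter is on the classpath (`PaymentClient`, `PaymentResult`, and the `paymentService` instance name are illustrative):

```java
import io.github.resilience4j.circuitbreaker.annotation.CircuitBreaker;
import org.springframework.stereotype.Service;

@Service
public class PaymentService {

    private final PaymentClient paymentClient; // illustrative client for the third-party API

    public PaymentService(PaymentClient paymentClient) {
        this.paymentClient = paymentClient;
    }

    // If the "paymentService" breaker is open, the call is skipped entirely and
    // the fallback below runs immediately instead of waiting on a timeout.
    @CircuitBreaker(name = "paymentService", fallbackMethod = "paymentFallback")
    public PaymentResult charge(Order order) {
        return paymentClient.charge(order);
    }

    private PaymentResult paymentFallback(Order order, Throwable cause) {
        return PaymentResult.deferred(order.id()); // e.g. queue for later processing
    }
}
```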
Configuration in application.yml:
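The matching configuration might look like this — the property names follow resilience4j's Spring Boot conventions, and the values are starting points, not universal answers:

```yaml
resilience4j:
  circuitbreaker:
    instances:
      paymentService:
        slidingWindowSize: 10
        failureRateThreshold: 50        # trip at 50% failures over the window
        waitDurationInOpenState: 30s    # stay open before probing again
        permittedNumberOfCallsInHalfOpenState: 3
        slowCallDurationThreshold: 2s   # calls slower than this count against the breaker
        slowCallRateThreshold: 50
```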
Tuning Your Thresholds
Getting thresholds wrong is worse than having no circuit breaker at all. Too sensitive and you’ll trip on normal variance. Too lenient and you won’t protect against real failures.
Start with these guidelines:
- Failure threshold: 50% over the last 10 requests is a reasonable default
- Recovery timeout: Long enough for the downstream service to recover (30-60 seconds)
- Slow call threshold: Based on your SLA requirements (p99 latency is a good starting point)
Monitor and adjust. Every system is different.
Combining with Other Patterns
Circuit breakers work best alongside:
Timeouts — Always set timeouts. A circuit breaker can’t help if requests hang forever.
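With the JDK's built-in `java.net.http` client, for example, both the connect timeout and a per-request timeout can be set explicitly (the URL is a placeholder):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.time.Duration;

public class PaymentHttp {
    // Bound how long we wait to establish a connection.
    static HttpClient client() {
        return HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(2))
                .build();
    }

    // Bound how long we wait for the whole response, per request.
    static HttpRequest chargeRequest() {
        return HttpRequest.newBuilder()
                .uri(URI.create("https://payments.example.com/charge")) // placeholder
                .timeout(Duration.ofSeconds(5))
                .build();
    }
}
```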
Retries with backoff — Retry transient failures, but with exponential backoff:
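A minimal sketch of exponential backoff in plain Java — jitter, which production retry loops should add to avoid synchronized retry storms, is omitted for brevity:

```java
import java.util.function.Supplier;

public class Retry {
    // Retries up to maxAttempts times, doubling the delay after each failure.
    public static <T> T withBackoff(Supplier<T> action, int maxAttempts, long initialDelayMillis)
            throws InterruptedException {
        long delay = initialDelayMillis;
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.get();
            } catch (RuntimeException e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(delay);
                    delay *= 2; // exponential backoff: 100ms, 200ms, 400ms, ...
                }
            }
        }
        throw last;
    }
}
```

Retries belong outside the breaker's failure accounting or inside it deliberately — retrying an already-failing service multiplies its load, which is exactly what the breaker exists to stop.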
Bulkheads — Isolate failures to specific service pools:
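A bulkhead can be approximated with a bounded semaphore per downstream dependency, so one slow service can't consume every thread. This is a simplified sketch; in practice resilience4j ships a `Bulkhead` module for this:

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

public class Bulkhead {
    // Each downstream service gets its own permit pool, isolating failures.
    private final Semaphore permits;

    public Bulkhead(int maxConcurrentCalls) {
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    public <T> T execute(Supplier<T> call) {
        if (!permits.tryAcquire()) {
            // Reject immediately rather than queueing; queues hide overload.
            throw new IllegalStateException("bulkhead full: rejecting call");
        }
        try {
            return call.get();
        } finally {
            permits.release();
        }
    }
}
```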
Observability Is Non-Negotiable
A circuit breaker you can’t monitor is a liability. Expose metrics:
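In a Spring Boot setup, resilience4j registers its metrics with Micrometer automatically; exposing them is mostly actuator configuration (a sketch assuming the actuator and a Prometheus registry are on the classpath):

```yaml
management:
  endpoints:
    web:
      exposure:
        include: health, metrics, prometheus
  health:
    circuitbreakers:
      enabled: true   # surface breaker state in /actuator/health
```

Breaker state then appears as the `resilience4j.circuitbreaker.state` gauge, alongside call counts and failure rates.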
Alert on:
- Circuit state changes (INFO level)
- Circuits staying open for extended periods (WARNING)
- Frequent state oscillation (ERROR — indicates threshold tuning needed)
The Fallback Strategy Matters
When the circuit opens, what happens? Your fallback strategy determines user experience:
- Return cached data — Serve stale but valid responses
- Degrade gracefully — Show limited functionality (“Recommendations unavailable”)
- Queue for later — Accept the request and process asynchronously
- Fail fast with clarity — Return a helpful error immediately
The worst option: hanging until timeout. That’s what we’re trying to prevent.
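A cached-data fallback might look like this sketch — the cache and service names are illustrative, and in a Spring setup this logic would typically live in the breaker's `fallbackMethod`:

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

public class RecommendationService {
    // Last known-good response per user: stale beats nothing when the circuit is open.
    private final Map<String, List<String>> lastGood = new ConcurrentHashMap<>();

    public List<String> recommendationsFor(String userId, Supplier<List<String>> liveCall) {
        try {
            List<String> fresh = liveCall.get();
            lastGood.put(userId, fresh);
            return fresh;
        } catch (RuntimeException circuitOpenOrFailure) {
            // Degrade gracefully: serve cached data, or an explicit empty state.
            return lastGood.getOrDefault(userId, List.of());
        }
    }
}
```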
Testing Circuit Breakers
You can’t wait for production failures to validate your circuit breakers. Test them:
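With resilience4j you can force state transitions in a unit test instead of engineering ten real failures — a JUnit-style sketch (the breaker name is illustrative):

```java
import io.github.resilience4j.circuitbreaker.CallNotPermittedException;
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import org.junit.jupiter.api.Test;

import static org.junit.jupiter.api.Assertions.assertThrows;

class CircuitBreakerTest {

    @Test
    void openCircuitFailsFast() {
        CircuitBreaker breaker = CircuitBreaker.ofDefaults("paymentService");

        // Force the breaker open rather than waiting for real failures.
        breaker.transitionToOpenState();

        // Calls through an open breaker are rejected immediately.
        assertThrows(CallNotPermittedException.class,
                () -> breaker.executeSupplier(() -> "should not run"));
    }
}
```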
Use chaos engineering tools like Chaos Monkey or Gremlin to validate behavior under real failure conditions.
Key Takeaways
Circuit breakers are essential infrastructure in distributed systems. They:
- Fail fast — Stop wasting resources on doomed requests
- Give services room to recover — Reduce load during outages
- Improve user experience — Fast failures beat slow timeouts
- Prevent cascading failures — Contain blast radius
Start with a library implementation, monitor religiously, and tune based on your system’s actual behavior. Your 3 AM self will thank you.
Circuit breakers won’t prevent failures — nothing will. But they’ll keep one failure from becoming a platform-wide outage. That’s resilience.