Circuit Breakers: Fail Fast, Recover Gracefully

When a downstream service is failing, continuing to call it makes everything worse. Circuit breakers stop the cascade.

The Pattern

Three states:

- Closed: normal operation, requests pass through
- Open: the service is failing, requests fail immediately
- Half-Open: testing whether the service has recovered

```
               failures β‰₯ threshold
  [CLOSED] ─────────────────────────▢ [OPEN]
      β–²                                 β”‚
      β”‚ successes                       β”‚ timeout
      β”‚                                 β–Ό
      └───────────────────────── [HALF-OPEN]
            (a failure here reopens the circuit)
```

Basic Implementation

```python
import time
from enum import Enum
from threading import Lock

class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 5,
        recovery_timeout: int = 30,
        half_open_max_calls: int = 3
    ):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.half_open_max_calls = half_open_max_calls
        self.state = State.CLOSED
        self.failure_count = 0
        self.success_count = 0
        self.last_failure_time = None
        self.lock = Lock()

    def can_execute(self) -> bool:
        with self.lock:
            if self.state == State.CLOSED:
                return True
            if self.state == State.OPEN:
                if time.time() - self.last_failure_time > self.recovery_timeout:
                    self.state = State.HALF_OPEN
                    self.success_count = 0
                    return True
                return False
            if self.state == State.HALF_OPEN:
                return self.success_count < self.half_open_max_calls
            return False

    def record_success(self):
        with self.lock:
            if self.state == State.HALF_OPEN:
                self.success_count += 1
                if self.success_count >= self.half_open_max_calls:
                    self.state = State.CLOSED
                    self.failure_count = 0
            else:
                self.failure_count = 0

    def record_failure(self):
        with self.lock:
            self.failure_count += 1
            self.last_failure_time = time.time()
            if self.state == State.HALF_OPEN:
                self.state = State.OPEN
            elif self.failure_count >= self.failure_threshold:
                self.state = State.OPEN
```

Using the Circuit Breaker

```python
payment_breaker = CircuitBreaker(failure_threshold=3, recovery_timeout=60)

def process_payment(order):
    if not payment_breaker.can_execute():
        raise ServiceUnavailable("Payment service circuit open")
    try:
        result = payment_service.charge(order)
        payment_breaker.record_success()
        return result
    except Exception:
        payment_breaker.record_failure()
        raise
```

Decorator Pattern

```python
from functools import wraps

def circuit_breaker(breaker: CircuitBreaker):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if not breaker.can_execute():
                raise CircuitOpenError(f"Circuit breaker open for {func.__name__}")
            try:
                result = func(*args, **kwargs)
                breaker.record_success()
                return result
            except Exception:
                breaker.record_failure()
                raise
        return wrapper
    return decorator

# Usage
payment_cb = CircuitBreaker()

@circuit_breaker(payment_cb)
def charge_customer(customer_id, amount):
    return payment_api.charge(customer_id, amount)
```

With Fallback

```python
def get_user_recommendations(user_id):
    if not recommendations_breaker.can_execute():
        # Fallback to cached or default recommendations
        return get_cached_recommendations(user_id) or DEFAULT_RECOMMENDATIONS
    try:
        result = recommendations_service.get(user_id)
        recommendations_breaker.record_success()
        return result
    except Exception:
        recommendations_breaker.record_failure()
        return get_cached_recommendations(user_id) or DEFAULT_RECOMMENDATIONS
```

Library: pybreaker

```python
import pybreaker

db_breaker = pybreaker.CircuitBreaker(
    fail_max=5,
    reset_timeout=30
)

@db_breaker
def query_database(sql):
    return db.execute(sql)

# Check state
print(db_breaker.current_state)  # 'closed', 'open', or 'half-open'
```

Combining with tenacity

tenacity handles retries, not circuit breaking β€” it does not ship a CircuitBreaker class. To combine the two patterns, wrap a pybreaker breaker with a tenacity retry:

```python
import pybreaker
import requests
from tenacity import retry, stop_after_attempt

cb = pybreaker.CircuitBreaker(fail_max=3, reset_timeout=60)

@retry(stop=stop_after_attempt(3))
@cb
def call_external_api():
    return requests.get("https://api.example.com/data")
```

Once the breaker opens, the remaining retry attempts fail immediately with CircuitBreakerError instead of hammering the dead service.

Per-Service Breakers

```python
class ServiceRegistry:
    def __init__(self):
        self.breakers = {}

    def get_breaker(self, service_name: str) -> CircuitBreaker:
        if service_name not in self.breakers:
            self.breakers[service_name] = CircuitBreaker()
        return self.breakers[service_name]

registry = ServiceRegistry()

def call_service(service_name: str, endpoint: str):
    breaker = registry.get_breaker(service_name)
    if not breaker.can_execute():
        raise ServiceUnavailable(f"{service_name} circuit is open")
    try:
        result = http_client.get(f"http://{service_name}/{endpoint}")
        breaker.record_success()
        return result
    except Exception:
        breaker.record_failure()
        raise
```

Monitoring

```python
from prometheus_client import Counter, Gauge

circuit_state = Gauge(
    'circuit_breaker_state',
    'Circuit breaker state (0=closed, 1=open, 2=half-open)',
    ['service']
)
circuit_failures = Counter(
    'circuit_breaker_failures_total',
    'Circuit breaker failure count',
    ['service']
)
circuit_rejections = Counter(
    'circuit_breaker_rejections_total',
    'Requests rejected by open circuit',
    ['service']
)

# Gauges take numbers, so map the enum states to the documented values
STATE_VALUES = {State.CLOSED: 0, State.OPEN: 1, State.HALF_OPEN: 2}

# Update metrics in circuit breaker
def record_failure(self, service_name):
    circuit_failures.labels(service=service_name).inc()
    # ... rest of failure logic
    circuit_state.labels(service=service_name).set(STATE_VALUES[self.state])
```

Configuration Guidelines

| Scenario                        | Threshold     | Timeout |
|---------------------------------|---------------|---------|
| Critical service, fast recovery | 3-5 failures  | 15-30s  |
| Non-critical, can wait          | 5-10 failures | 60-120s |
| Flaky external API              | 3 failures    | 30-60s  |
| Database                        | 5 failures    | 30s     |

Anti-Patterns

1. Single global breaker ...

February 28, 2026 Β· 5 min Β· 977 words Β· Rob Washington

Service Discovery: Finding Services in a Dynamic World

In static infrastructure, services live at known addresses. Database at 10.0.1.5, cache at 10.0.1.6. Simple, predictable, fragile. In dynamic infrastructure β€” containers, auto-scaling, cloud β€” services appear and disappear constantly. IP addresses change. Instances multiply and vanish. Hardcoded addresses become a liability. Service discovery solves this: how do services find each other when everything is moving?

The Problem

```python
# Hardcoded - works until it doesn't
DATABASE_URL = "postgres://10.0.1.5:5432/mydb"

# What happens when:
# - Database moves to a new server?
# - You add read replicas?
# - The IP changes after maintenance?
```

DNS-Based Discovery

The simplest approach: use DNS names instead of IPs. ...
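A minimal sketch of the DNS-based approach using only the standard library: resolve the service's name at call time instead of hardcoding an IP. The `resolve_service` helper and the `db.internal.example.com` name are illustrative, not part of the original post.

```python
import socket

def resolve_service(hostname: str, port: int) -> str:
    """Resolve a service's DNS name to an address at call time,
    so connection strings never hardcode an IP."""
    # getaddrinfo returns one entry per A/AAAA record; take the first
    infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    family, _, _, _, sockaddr = infos[0]
    return sockaddr[0]

# Demonstrate with a name every machine can resolve
print(resolve_service("localhost", 5432))  # e.g. '127.0.0.1'

# In real use the name would be something like:
# DATABASE_URL = f"postgres://{resolve_service('db.internal.example.com', 5432)}:5432/mydb"
```

The name stays stable while the address behind it changes, which is exactly the indirection the teaser describes.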

February 23, 2026 Β· 7 min Β· 1393 words Β· Rob Washington

Circuit Breaker Patterns: Failing Fast Without Failing Hard

Your payment service is down. Every request to it times out after 30 seconds. You have 100 requests per second hitting that endpoint. Do the math: within a minute, you’ve got 6,000 threads waiting on a dead service, and your entire application is choking. This is where circuit breakers earn their keep. The Problem: Cascading Failures In distributed systems, a single failing dependency can take down everything. Without protection, your system will: ...
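The arithmetic behind the opening numbers, as a quick sketch: at 100 requests/second with a 30-second timeout, 6,000 requests are issued per minute, and at steady state roughly rate Γ— timeout of them are blocked at any instant.

```python
rate = 100     # requests per second hitting the dead endpoint
timeout = 30   # seconds each call waits before giving up

requests_per_minute = rate * 60      # calls issued within one minute
concurrent_waiters = rate * timeout  # threads blocked at any given moment

print(requests_per_minute)   # 6000
print(concurrent_waiters)    # 3000
```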

February 21, 2026 Β· 8 min Β· 1535 words Β· Rob Washington

Distributed Tracing Essentials: Following Requests Across Services

A request hits your API gateway, bounces through five microservices, touches two databases, and returns an error. Logs say β€œsomething failed.” Which service? Which call? What was the payload? Distributed tracing answers these questions by connecting the dots across service boundaries.

The Core Concepts

Traces and Spans

A trace represents a complete request journey. A span represents a single operation within that journey.

[diagram: a trace timeline with a root API Gateway span and nested child spans β€” e.g. Auth, User, Order, Inventory, and Payment service calls, including a DB query]

Each span has: ...
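A minimal sketch of the trace/span relationship the teaser describes: every span shares its trace's ID and records its parent, which is what lets a backend reassemble the tree. The field names here are illustrative, not a specific tracing library's API.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Span:
    name: str
    trace_id: str                       # shared by every span in the trace
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    parent_id: Optional[str] = None     # links a child span to its caller
    start: float = field(default_factory=time.time)
    end: Optional[float] = None

root = Span(name="API Gateway", trace_id=uuid.uuid4().hex)
child = Span(name="Auth Service", trace_id=root.trace_id, parent_id=root.span_id)

# Same trace, distinct spans, parent/child link intact
print(child.trace_id == root.trace_id, child.parent_id == root.span_id)
```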

February 19, 2026 Β· 9 min Β· 1805 words Β· Rob Washington

Distributed Tracing: The Missing Piece of Your Observability Stack

When a request fails in a distributed system, the question isn’t if something went wrongβ€”it’s where. Logs tell you what happened. Metrics tell you how often. But tracing tells you the story. The Problem with Logs and Metrics Alone You’ve got 15 microservices. A user reports slow checkout. You check the logsβ€”thousands of entries. You check the metricsβ€”latency is up, but which service? You’re playing detective without a map. This is where distributed tracing shines. It connects the dots across service boundaries, showing you the exact path a request takes and where time is spent. ...

February 16, 2026 Β· 5 min Β· 930 words Β· Rob Washington

Circuit Breaker Pattern: Failing Fast to Stay Resilient

Learn how circuit breakers prevent cascade failures in distributed systems by detecting failures early and failing fast instead of waiting for timeouts.

February 15, 2026 Β· 7 min Β· 1459 words Β· Rob Washington

API Gateway Patterns: The Front Door to Your Microservices

Every request to your microservices should pass through a single front door. That door is your API gatewayβ€”and getting it right determines whether your architecture scales gracefully or collapses under complexity. Why API Gateways? Without a gateway, clients must: Know the location of every service Handle authentication with each service Implement retry logic, timeouts, and circuit breaking Deal with different protocols and response formats An API gateway centralizes these concerns: ...
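The routing half of the gateway's job can be sketched in a few lines: a single public entry point maps path prefixes to internal services, so clients never need to know service locations. The route table and service hostnames here are hypothetical.

```python
# Hypothetical route table: public path prefix -> internal upstream
ROUTES = {
    "/orders": "http://orders-service:8080",
    "/users": "http://users-service:8080",
}

def route(path: str) -> str:
    """Map an incoming request path to its upstream URL.
    Real gateways use longest-prefix or pattern matching; a simple
    prefix scan keeps the sketch short."""
    for prefix, upstream in ROUTES.items():
        if path.startswith(prefix):
            return upstream + path
    raise KeyError(f"No route for {path}")

print(route("/orders/123"))  # http://orders-service:8080/orders/123
```

Cross-cutting concerns (auth, retries, rate limiting) would wrap this lookup, which is why centralizing them in the gateway pays off.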

February 12, 2026 Β· 6 min Β· 1180 words Β· Rob Washington

Service Mesh: Traffic Management, Security, and Observability with Istio

When you have dozens of microservices talking to each other, managing traffic, security, and observability becomes complex. A service mesh handles this at the infrastructure layer, so your applications don’t have to. What Problems Does a Service Mesh Solve? Without a mesh, every service needs to implement: Retries and timeouts Circuit breakers Load balancing TLS certificates Metrics and tracing Access control With a mesh, the sidecar proxy handles all of this: ...

February 11, 2026 Β· 6 min Β· 1277 words Β· Rob Washington

Circuit Breakers: Building Systems That Fail Gracefully

In distributed systems, failures are inevitable. A single slow or failing service can cascade through your entire architecture, turning a minor issue into a major outage. Circuit breakers prevent this by detecting failures and stopping the cascade before it spreads. The Problem: Cascading Failures Imagine Service A calls Service B, which calls Service C. If Service C becomes slow: Requests to C start timing out Service B’s thread pool fills up waiting for C Service B becomes slow Service A’s threads fill up waiting for B Your entire system grinds to a halt One slow service just took down everything. ...

February 11, 2026 Β· 8 min Β· 1677 words Β· Rob Washington

Event-Driven Architecture: Building Reactive Systems That Scale

Traditional request-response architectures work well until they don’t. When your services grow, synchronous calls create tight coupling, cascading failures, and bottlenecks. Event-driven architecture (EDA) offers an alternative: systems that react to changes rather than constantly polling for them. What Is Event-Driven Architecture? In EDA, components communicate through events β€” immutable records of something that happened. Instead of Service A calling Service B directly, Service A publishes an event, and any interested services subscribe to it. ...
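The publish/subscribe idea in the teaser can be sketched with an in-process event bus β€” a toy stand-in for a real broker like Kafka or RabbitMQ, just to show the decoupling:

```python
from collections import defaultdict

# Toy in-process event bus: event type -> list of handlers
subscribers = defaultdict(list)

def subscribe(event_type, handler):
    subscribers[event_type].append(handler)

def publish(event_type, payload):
    # The publisher has no idea who is listening -- that's the loose coupling
    for handler in subscribers[event_type]:
        handler(payload)

received = []
subscribe("order.created", lambda event: received.append(event))
publish("order.created", {"order_id": 42})
print(received)  # [{'order_id': 42}]
```

Service A calls `publish` and returns; any number of consumers can `subscribe` later without Service A changing at all.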

February 11, 2026 Β· 6 min Β· 1185 words Β· Rob Washington