Circuit Breakers: Fail Fast, Recover Gracefully

When a downstream service is failing, continuing to call it makes everything worse. Circuit breakers stop the cascade. The Pattern Three states: Closed: Normal operation, requests pass through Open: Service is failing, requests fail immediately Half-Open: Testing if service recovered [ C L ┌ │ ▼ O ▲ │ └ ─ S ─ ─ E ─ ─ D ─ ─ ] ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ f ─ ─ a ─ ─ i ─ ─ l s ─ u u ─ r c ─ e c ─ e ─ t s ─ h s ─ r ─ ─ e ─ ─ s ─ ─ h ─ ─ o ─ ─ l ─ ─ d ─ ─ ─ ─ ─ ─ ─ ─ ▶ ─ ─ ─ ─ [ ─ ─ O ─ ─ P │ │ ┴ ─ E ─ ─ N ─ ─ ] f ─ a ─ ─ i ─ ─ l ─ t u ─ i r ─ m e ─ e ─ ┐ │ │ o ─ u ─ t ─ ─ ─ ─ │ │ ┘ ▶ [ H A L F - O P E N ] Basic Implementation 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 import time from enum import Enum from threading import Lock class State(Enum): CLOSED = "closed" OPEN = "open" HALF_OPEN = "half_open" class CircuitBreaker: def __init__( self, failure_threshold: int = 5, recovery_timeout: int = 30, half_open_max_calls: int = 3 ): self.failure_threshold = failure_threshold self.recovery_timeout = recovery_timeout self.half_open_max_calls = half_open_max_calls self.state = State.CLOSED self.failure_count = 0 self.success_count = 0 self.last_failure_time = None self.lock = Lock() def can_execute(self) -> bool: with self.lock: if self.state == State.CLOSED: return True if self.state == State.OPEN: if time.time() - self.last_failure_time > self.recovery_timeout: self.state = State.HALF_OPEN self.success_count = 0 return True return False if self.state == State.HALF_OPEN: return self.success_count < self.half_open_max_calls return False def record_success(self): with self.lock: if self.state == State.HALF_OPEN: self.success_count += 1 if self.success_count >= self.half_open_max_calls: self.state = State.CLOSED self.failure_count = 0 else: self.failure_count = 0 def record_failure(self): with self.lock: self.failure_count += 1 self.last_failure_time = time.time() if self.state == State.HALF_OPEN: self.state = State.OPEN elif self.failure_count >= self.failure_threshold: self.state = State.OPEN Using the Circuit Breaker 1 2 3 4 5 6 7 8 9 10 11 12 13 payment_breaker = CircuitBreaker(failure_threshold=3, recovery_timeout=60) def process_payment(order): if not payment_breaker.can_execute(): raise ServiceUnavailable("Payment service circuit open") try: result = payment_service.charge(order) payment_breaker.record_success() return result except Exception as e: payment_breaker.record_failure() raise Decorator Pattern 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 from functools import wraps def circuit_breaker(breaker: CircuitBreaker): def decorator(func): @wraps(func) def wrapper(*args, **kwargs): if not breaker.can_execute(): raise CircuitOpenError(f"Circuit breaker open for {func.__name__}") try: result = func(*args, **kwargs) breaker.record_success() return result except Exception as e: breaker.record_failure() raise return wrapper return decorator # Usage payment_cb = CircuitBreaker() @circuit_breaker(payment_cb) def charge_customer(customer_id, amount): return payment_api.charge(customer_id, amount) With Fallback 1 2 3 4 5 6 7 8 9 10 11 12 def get_user_recommendations(user_id): if not recommendations_breaker.can_execute(): # Fallback to cached or default recommendations return get_cached_recommendations(user_id) or DEFAULT_RECOMMENDATIONS try: result = recommendations_service.get(user_id) recommendations_breaker.record_success() return result except Exception: recommendations_breaker.record_failure() return get_cached_recommendations(user_id) or DEFAULT_RECOMMENDATIONS Library: pybreaker 1 2 3 4 5 6 7 8 9 10 11 12 13 import pybreaker db_breaker = pybreaker.CircuitBreaker( fail_max=5, reset_timeout=30 ) @db_breaker def query_database(sql): return db.execute(sql) # Check state print(db_breaker.current_state) # 'closed', 'open', or 'half-open' Library: tenacity (with circuit breaker) 1 2 3 4 5 6 7 8 from tenacity import retry, stop_after_attempt, CircuitBreaker cb = CircuitBreaker(failure_threshold=3, recovery_time=60) @retry(stop=stop_after_attempt(3)) @cb def call_external_api(): return requests.get("https://api.example.com/data") Per-Service Breakers 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 class ServiceRegistry: def __init__(self): self.breakers = {} def get_breaker(self, service_name: str) -> CircuitBreaker: if service_name not in self.breakers: self.breakers[service_name] = CircuitBreaker() return self.breakers[service_name] registry = ServiceRegistry() def call_service(service_name: str, endpoint: str): breaker = registry.get_breaker(service_name) if not breaker.can_execute(): raise ServiceUnavailable(f"{service_name} circuit is open") try: result = http_client.get(f"http://{service_name}/{endpoint}") breaker.record_success() return result except Exception: breaker.record_failure() raise Monitoring 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 from prometheus_client import Counter, Gauge circuit_state = Gauge( 'circuit_breaker_state', 'Circuit breaker state (0=closed, 1=open, 2=half-open)', ['service'] ) circuit_failures = Counter( 'circuit_breaker_failures_total', 'Circuit breaker failure count', ['service'] ) circuit_rejections = Counter( 'circuit_breaker_rejections_total', 'Requests rejected by open circuit', ['service'] ) # Update metrics in circuit breaker def record_failure(self, service_name): circuit_failures.labels(service=service_name).inc() # ... rest of failure logic circuit_state.labels(service=service_name).set(self.state.value) Configuration Guidelines Scenario Threshold Timeout Critical service, fast recovery 3-5 failures 15-30s Non-critical, can wait 5-10 failures 60-120s Flaky external API 3 failures 30-60s Database 5 failures 30s Anti-Patterns 1. Single global breaker ...

February 28, 2026 · 5 min · 977 words · Rob Washington

Caching Strategies: What to Cache and When to Invalidate

Cache invalidation is one of the two hard problems in computer science. Here’s how to make it less painful. The Caching Patterns Cache-Aside (Lazy Loading) 1 2 3 4 5 6 7 8 9 10 11 12 13 def get_user(user_id: str) -> dict: # Check cache first cached = redis.get(f"user:{user_id}") if cached: return json.loads(cached) # Cache miss: fetch from database user = db.query("SELECT * FROM users WHERE id = %s", user_id) # Store in cache for next time redis.setex(f"user:{user_id}", 3600, json.dumps(user)) return user Pros: Only caches what’s actually used Cons: First request always slow (cache miss) ...

February 28, 2026 · 5 min · 955 words · Rob Washington

Webhook Patterns: Receiving Events Reliably

Webhooks flip the API model: instead of polling, the service calls you. Here’s how to handle them without losing events. Basic Webhook Handler 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 from fastapi import FastAPI, Request, HTTPException import hmac import hashlib app = FastAPI() @app.post("/webhooks/stripe") async def stripe_webhook(request: Request): payload = await request.body() sig_header = request.headers.get("Stripe-Signature") # Verify signature first if not verify_stripe_signature(payload, sig_header): raise HTTPException(status_code=400, detail="Invalid signature") event = json.loads(payload) # Process event if event["type"] == "payment_intent.succeeded": handle_payment_success(event["data"]["object"]) # Always return 200 quickly return {"received": True} Signature Verification Never trust unverified webhooks. ...

February 28, 2026 · 5 min · 889 words · Rob Washington

SSL/TLS Certificates: From Let's Encrypt to Production

HTTPS is table stakes. Here’s how to set up certificates properly and avoid the 3am “certificate expired” panic. Let’s Encrypt with Certbot Standalone Mode (No Web Server) 1 2 3 4 5 6 7 8 9 # Install sudo apt install certbot # Get certificate (stops any service on port 80) sudo certbot certonly --standalone -d example.com -d www.example.com # Certificates stored in: # /etc/letsencrypt/live/example.com/fullchain.pem # /etc/letsencrypt/live/example.com/privkey.pem Webroot Mode (Server Running) 1 2 # Certbot verifies via http://example.com/.well-known/acme-challenge/ sudo certbot certonly --webroot -w /var/www/html -d example.com Nginx Plugin 1 sudo certbot --nginx -d example.com -d www.example.com Certbot modifies nginx config automatically. ...

February 28, 2026 · 4 min · 779 words · Rob Washington

DNS Troubleshooting: When dig Is Your Best Friend

DNS issues are deceptively simple. “It’s always DNS” is a meme because it’s true. Here’s how to actually debug it. The Essential Commands dig: Your Primary Tool 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 # Basic lookup dig example.com # Short answer only dig +short example.com # Specific record type dig example.com MX dig example.com TXT dig example.com CNAME # Query specific nameserver dig @8.8.8.8 example.com # Trace the full resolution path dig +trace example.com Understanding dig Output ; ; ; ; e ; ; ; ; ; e ; x ; ; ; ; x a Q a A m Q S W M U m N p u E H S E p S l e R E G S l W e r V N D T e E . y E : S i I . R c R I G O c o t : S Z N o S m i a E 9 m E . m 1 t . S . C e 9 1 E T : 2 F r 8 C I . e c . T O 2 1 b v 1 I N 3 6 d O : 8 2 : N m . 8 : s 1 5 e . 2 6 c 1 0 # : e 5 3 x 3 3 0 a 6 ( : m 0 1 0 p 0 9 0 l 2 e . E . 1 S c 6 T o I I 8 m N N . 2 1 0 . 2 1 6 ) A A 9 3 . 1 8 4 . 2 1 6 . 3 4 Key fields: ...

February 28, 2026 · 5 min · 1057 words · Rob Washington

Load Balancing: Beyond Round Robin

Round robin is the default, but it’s rarely the best choice. Here’s when to use each algorithm and why. The Algorithms Round Robin 1 2 3 4 5 upstream backend { server 192.168.1.1:8080; server 192.168.1.2:8080; server 192.168.1.3:8080; } Requests go 1→2→3→1→2→3. Simple, fair, ignores server load. Use when: All servers are identical and requests are uniform. Problem: A slow server gets the same traffic as a fast one. Weighted Round Robin 1 2 3 4 5 upstream backend { server 192.168.1.1:8080 weight=5; server 192.168.1.2:8080 weight=3; server 192.168.1.3:8080 weight=2; } Server 1 gets 50%, server 2 gets 30%, server 3 gets 20%. ...

February 28, 2026 · 4 min · 811 words · Rob Washington

Database Migrations That Won't Ruin Your Weekend

Database migrations are where deploys go to die. Here’s how to make them safe. The Golden Rule Every migration must be backward compatible. Your old code will run alongside the new schema during deployment. If it can’t, you’ll have downtime. Safe Operations These are always safe: Adding a nullable column Adding a table Adding an index (with CONCURRENTLY) Adding a constraint (as NOT VALID, then validate later) Dangerous Operations These need careful handling: ...

February 28, 2026 · 4 min · 770 words · Rob Washington

Feature Flags: Deploy Doesn't Mean Release

Separating deployment from release is one of the best things you can do for your team’s sanity. Feature flags make this possible. The Core Idea 1 2 3 4 5 6 7 8 9 10 11 12 # Without flags: deploy = release def checkout(): process_payment() send_confirmation() # With flags: deploy != release def checkout(): process_payment() if feature_enabled("new_confirmation_email"): send_new_confirmation() # Deployed but not released else: send_confirmation() Code ships to production. Flag decides if users see it. ...

February 28, 2026 · 5 min · 924 words · Rob Washington

Structured Logging: Stop Parsing Log Lines

Unstructured logs are technical debt. Structured logs are queryable, parseable, and actually useful when things break. The Problem # 2 2 2 0 0 0 U 2 2 2 n 6 6 6 s - - - t 0 0 0 r 2 2 2 u - - - c 2 2 2 t 8 8 8 u r 1 1 1 e 0 0 0 d : : : : 1 1 1 5 5 5 g : : : o 2 2 2 o 3 4 5 d I E I l N R N u F R F c O O O k R U R p s F e a e a q r r i u s l e i a e s n l d t g i c t c t e o o h m i l p p s o r l g o e g c t e e e d s d s i i n o n r f d 2 r e 3 o r 4 m m 1 s 1 2 9 3 2 4 . 5 1 : 6 8 c . o 1 n . n 1 e c t i o n t i m e o u t Regex hell when you need to extract user, IP, order ID, or duration. ...

February 28, 2026 · 4 min · 744 words · Rob Washington

The Twelve-Factor App: What Actually Matters

The Twelve-Factor methodology is from 2011 but remains relevant. Here’s what matters in practice, and what’s become outdated. The Factors, Ranked by Impact Critical (Ignore at Your Peril) III. Config in Environment 1 2 3 4 5 # Bad DATABASE_URL = "postgres://localhost/myapp" # hardcoded # Good DATABASE_URL = os.environ["DATABASE_URL"] Config includes credentials, per-environment values, and feature flags. Environment variables work everywhere: containers, serverless, bare metal. VI. Stateless Processes 1 2 3 4 5 6 7 8 9 10 11 # Bad: storing session in memory sessions = {} @app.post("/login") def login(user): sessions[user.id] = {"logged_in": True} # Dies with process # Good: external session store @app.post("/login") def login(user): redis.set(f"session:{user.id}", {"logged_in": True}) If your process dies, can another pick up the work? Statelessness enables horizontal scaling, rolling deploys, and crash recovery. ...

February 28, 2026 · 5 min · 864 words · Rob Washington