Reliability

Idempotency: Making APIs Safe to Retry

Networks fail. Clients timeout. Users double-click. If your API creates duplicate orders or charges cards twice when this happens, you have a problem. Idempotency is the solution—making operations safe to retry without side effects. What Is Idempotency? An operation is idempotent if performing it multiple times has the same effect as performing it once. # G P D # P P E U E O O I T T L N S S d E O T T e / / T T m u u E p s s i u o e e / d r s t r r u e d e e s s s m e r n / / e p r s t 1 1 r o s / : 2 2 s t 1 3 3 / e 2 S 1 n 3 a { 2 t / m . 3 : c e . h . D a r } i r e f g s f e u e l r t # # # e # # n e A A U t C C v l l s r h e w w e r e a r a a r e a r y y y s t g s s 1 u e e t 2 l s s i r s 3 t m e e a t e t t i e h u s s a N e r c E n u g h W c s s o a e n t o r u r e i r d s m d e 1 ( e e a r 2 a r g 3 l a 1 r e i 2 t e a n 3 o a c d h e t y a h t c i g i h s o m n e t s e i t m a = e t e s t i l l g o n e ) The Problem Client sends request → Server processes it → Response lost in transit: ...

Retry Patterns: When and How to Try Again

Not all failures are permanent. Retry patterns help distinguish transient hiccups from real problems. Exponential Backoff 1 2 3 4 5 6 7 8 9 10 11 12 13 14 import time import random def retry_with_backoff(func, max_retries=5, base_delay=1): for attempt in range(max_retries): try: return func() except Exception as e: if attempt == max_retries - 1: raise delay = base_delay * (2 ** attempt) jitter = random.uniform(0, delay * 0.1) time.sleep(delay + jitter) Each retry waits longer: 1s, 2s, 4s, 8s, 16s. Jitter prevents thundering herd. With tenacity 1 2 3 4 5 6 7 8 from tenacity import retry, stop_after_attempt, wait_exponential @retry( stop=stop_after_attempt(5), wait=wait_exponential(multiplier=1, min=1, max=60) ) def call_api(): return requests.get("https://api.example.com") Retry Only Transient Errors 1 2 3 4 5 6 7 8 from tenacity import retry, retry_if_exception_type @retry( retry=retry_if_exception_type((ConnectionError, TimeoutError)), stop=stop_after_attempt(3) ) def fetch_data(): return external_service.get() Don’t retry 400 Bad Request — that won’t fix itself. ...

Circuit Breakers: Fail Fast, Recover Gracefully

When a downstream service is failing, continuing to call it makes everything worse. Circuit breakers stop the cascade. The Pattern Three states: Closed: Normal operation, requests pass through Open: Service is failing, requests fail immediately Half-Open: Testing if service recovered [ C L ┌ │ ▼ O ▲ │ └ ─ S ─ ─ E ─ ─ D ─ ─ ] ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ f ─ ─ a ─ ─ i ─ ─ l s ─ u u ─ r c ─ e c ─ e ─ t s ─ h s ─ r ─ ─ e ─ ─ s ─ ─ h ─ ─ o ─ ─ l ─ ─ d ─ ─ ─ ─ ─ ─ ─ ─ ▶ ─ ─ ─ ─ [ ─ ─ O ─ ─ P │ │ ┴ ─ E ─ ─ N ─ ─ ] f ─ a ─ ─ i ─ ─ l ─ t u ─ i r ─ m e ─ e ─ ┐ │ │ o ─ u ─ t ─ ─ ─ ─ │ │ ┘ ▶ [ H A L F - O P E N ] Basic Implementation 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 import time from enum import Enum from threading import Lock class State(Enum): CLOSED = "closed" OPEN = "open" HALF_OPEN = "half_open" class CircuitBreaker: def __init__( self, failure_threshold: int = 5, recovery_timeout: int = 30, half_open_max_calls: int = 3 ): self.failure_threshold = failure_threshold self.recovery_timeout = recovery_timeout self.half_open_max_calls = half_open_max_calls self.state = State.CLOSED self.failure_count = 0 self.success_count = 0 self.last_failure_time = None self.lock = Lock() def can_execute(self) -> bool: with self.lock: if self.state == State.CLOSED: return True if self.state == State.OPEN: if time.time() - self.last_failure_time > self.recovery_timeout: self.state = State.HALF_OPEN self.success_count = 0 return True return False if self.state == State.HALF_OPEN: return self.success_count < self.half_open_max_calls return False def record_success(self): with self.lock: if self.state == State.HALF_OPEN: self.success_count += 1 if self.success_count >= self.half_open_max_calls: self.state = State.CLOSED self.failure_count = 0 else: self.failure_count = 0 def record_failure(self): with self.lock: self.failure_count += 1 self.last_failure_time = time.time() if self.state == State.HALF_OPEN: self.state = State.OPEN elif self.failure_count >= self.failure_threshold: self.state = State.OPEN Using the Circuit Breaker 1 2 3 4 5 6 7 8 9 10 11 12 13 payment_breaker = CircuitBreaker(failure_threshold=3, recovery_timeout=60) def process_payment(order): if not payment_breaker.can_execute(): raise ServiceUnavailable("Payment service circuit open") try: result = payment_service.charge(order) payment_breaker.record_success() return result except Exception as e: payment_breaker.record_failure() raise Decorator Pattern 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 from functools import wraps def circuit_breaker(breaker: CircuitBreaker): def decorator(func): @wraps(func) def wrapper(*args, **kwargs): if not breaker.can_execute(): raise CircuitOpenError(f"Circuit breaker open for {func.__name__}") try: result = func(*args, **kwargs) breaker.record_success() return result except Exception as e: breaker.record_failure() raise return wrapper return decorator # Usage payment_cb = CircuitBreaker() @circuit_breaker(payment_cb) def charge_customer(customer_id, amount): return payment_api.charge(customer_id, amount) With Fallback 1 2 3 4 5 6 7 8 9 10 11 12 def get_user_recommendations(user_id): if not recommendations_breaker.can_execute(): # Fallback to cached or default recommendations return get_cached_recommendations(user_id) or DEFAULT_RECOMMENDATIONS try: result = recommendations_service.get(user_id) recommendations_breaker.record_success() return result except Exception: recommendations_breaker.record_failure() return get_cached_recommendations(user_id) or DEFAULT_RECOMMENDATIONS Library: pybreaker 1 2 3 4 5 6 7 8 9 10 11 12 13 import pybreaker db_breaker = pybreaker.CircuitBreaker( fail_max=5, reset_timeout=30 ) @db_breaker def query_database(sql): return db.execute(sql) # Check state print(db_breaker.current_state) # 'closed', 'open', or 'half-open' Library: tenacity (with circuit breaker) 1 2 3 4 5 6 7 8 from tenacity import retry, stop_after_attempt, CircuitBreaker cb = CircuitBreaker(failure_threshold=3, recovery_time=60) @retry(stop=stop_after_attempt(3)) @cb def call_external_api(): return requests.get("https://api.example.com/data") Per-Service Breakers 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 class ServiceRegistry: def __init__(self): self.breakers = {} def get_breaker(self, service_name: str) -> CircuitBreaker: if service_name not in self.breakers: self.breakers[service_name] = CircuitBreaker() return self.breakers[service_name] registry = ServiceRegistry() def call_service(service_name: str, endpoint: str): breaker = registry.get_breaker(service_name) if not breaker.can_execute(): raise ServiceUnavailable(f"{service_name} circuit is open") try: result = http_client.get(f"http://{service_name}/{endpoint}") breaker.record_success() return result except Exception: breaker.record_failure() raise Monitoring 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 from prometheus_client import Counter, Gauge circuit_state = Gauge( 'circuit_breaker_state', 'Circuit breaker state (0=closed, 1=open, 2=half-open)', ['service'] ) circuit_failures = Counter( 'circuit_breaker_failures_total', 'Circuit breaker failure count', ['service'] ) circuit_rejections = Counter( 'circuit_breaker_rejections_total', 'Requests rejected by open circuit', ['service'] ) # Update metrics in circuit breaker def record_failure(self, service_name): circuit_failures.labels(service=service_name).inc() # ... rest of failure logic circuit_state.labels(service=service_name).set(self.state.value) Configuration Guidelines Scenario Threshold Timeout Critical service, fast recovery 3-5 failures 15-30s Non-critical, can wait 5-10 failures 60-120s Flaky external API 3 failures 30-60s Database 5 failures 30s Anti-Patterns 1. Single global breaker ...

Webhook Patterns: Receiving Events Reliably

Webhooks flip the API model: instead of polling, the service calls you. Here’s how to handle them without losing events. Basic Webhook Handler 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 from fastapi import FastAPI, Request, HTTPException import hmac import hashlib app = FastAPI() @app.post("/webhooks/stripe") async def stripe_webhook(request: Request): payload = await request.body() sig_header = request.headers.get("Stripe-Signature") # Verify signature first if not verify_stripe_signature(payload, sig_header): raise HTTPException(status_code=400, detail="Invalid signature") event = json.loads(payload) # Process event if event["type"] == "payment_intent.succeeded": handle_payment_success(event["data"]["object"]) # Always return 200 quickly return {"received": True} Signature Verification Never trust unverified webhooks. ...

Graceful Shutdown: Stop Dropping Requests

Every deployment is a potential outage if your application doesn’t shut down gracefully. Here’s how to do it right. The Problem 1 2 3 4 5 . . . . . K P Y I U u o o n s b d u - e e r f r r i l s n s a i e p g s t r p h e e e t e s m e o x r e s v i e r e e t q r n d s u o d e r s f i s s r m t S o m s d I m e u G d g r T s i e i E e a t n R r t g M v e c i l o " c y n z e n e e r e c o n t - d i d p o o o n w i n n r t t e i s s m e e t " d e p l o y s The fix: handle SIGTERM, finish existing work, then exit. ...

Container Health Check Patterns That Actually Work

Your container says it’s healthy. Your users say the app is broken. Sound familiar? Basic health checks only tell you if a process is running. Here’s how to build checks that catch real problems. Beyond “Is It Alive?” Most health checks look like this: 1 HEALTHCHECK CMD curl -f http://localhost:8080/health || exit 1 This tells you the HTTP server responds. It doesn’t tell you: Can the app reach the database? Is the cache connected? Are critical background workers running? Is the disk filling up? Layered Health Checks Implement three levels: ...

Webhook Reliability: Building Event Delivery That Actually Works

Webhooks are the internet’s way of saying “hey, something happened.” Simple in concept, surprisingly tricky in practice. The challenge: HTTP is unreliable, servers go down, networks flake out, and your webhook payload might arrive zero times, once, or five times. Building reliable webhook infrastructure means handling all of these gracefully. The Sender Side Retry with Exponential Backoff First attempt fails? Try again. But not immediately. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 RETRY_DELAYS = [60, 300, 900, 3600, 14400, 43200] # seconds def deliver_webhook(event, attempt=0): try: response = requests.post( event.webhook_url, json=event.payload, timeout=30, headers={ "Content-Type": "application/json", "X-Webhook-ID": event.id, "X-Webhook-Timestamp": str(int(time.time())), "X-Webhook-Signature": sign_payload(event.payload) } ) if response.status_code >= 200 and response.status_code < 300: mark_delivered(event) return if response.status_code >= 500: # Server error, retry schedule_retry(event, attempt) else: # Client error (4xx), don't retry mark_failed(event, response.status_code) except requests.Timeout: schedule_retry(event, attempt) except requests.ConnectionError: schedule_retry(event, attempt) def schedule_retry(event, attempt): if attempt >= len(RETRY_DELAYS): mark_failed(event, "max_retries") return delay = RETRY_DELAYS[attempt] # Add jitter to prevent thundering herd jitter = random.uniform(0, delay * 0.1) queue_at = time.time() + delay + jitter enqueue(event, attempt + 1, queue_at) Key decisions: ...

Error Handling Philosophy: Fail Gracefully, Recover Quickly

Errors are inevitable. Networks fail. Disks fill up. Services crash. Users input garbage. The question isn’t whether your system will encounter errors — it’s how it will behave when it does. Good error handling is the difference between “the system recovered automatically” and “we lost customer data.” Fail Fast, Fail Loud Detect problems early and report them clearly: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 # ❌ Bad: Silent failure def process_order(order): try: result = payment_service.charge(order) except Exception: pass # Swallow error, pretend it worked return {"status": "ok"} # ✅ Good: Fail fast with clear error def process_order(order): if not order.items: raise ValidationError("Order must have at least one item") if order.total <= 0: raise ValidationError("Order total must be positive") result = payment_service.charge(order) # Let it fail if it fails return {"status": "ok", "charge_id": result.id} Silent failures are debugging nightmares. If something goes wrong, make it obvious. ...

Retry Patterns: Exponential Backoff and Beyond

Networks fail. Databases hiccup. External APIs return 503. In distributed systems, transient failures are not exceptional — they’re expected. The question isn’t whether to retry, but how. Bad retry logic turns a brief outage into a cascading failure. Good retry logic absorbs transient issues invisibly. The Naive Retry (Don’t Do This) 1 2 3 4 5 6 7 # Immediate retry loop - a recipe for disaster def call_api(): while True: try: return requests.get(API_URL) except RequestException: continue # Hammer the failing service This creates a thundering herd. If the service is struggling, every client immediately retrying makes it worse. You’re not recovering from failure — you’re accelerating toward total collapse. ...

Zero-Downtime Deployments: Strategies That Actually Work

“We’re deploying, please hold” is not an acceptable user experience. Whether you’re running a startup or enterprise infrastructure, users expect services to just work. Here’s how to ship code without the maintenance windows. The Goal: Invisible Deploys A zero-downtime deployment means users never notice you’re deploying. No error pages, no dropped connections, no “please refresh” messages. The old version serves traffic until the new version is proven healthy. Strategy 1: Rolling Deployments The simplest approach. Replace instances one at a time: ...