Retry Patterns: When and How to Try Again

Not all failures are permanent. Retry patterns help distinguish transient hiccups from real problems. Exponential Backoff 1 2 3 4 5 6 7 8 9 10 11 12 13 14 import time import random def retry_with_backoff(func, max_retries=5, base_delay=1): for attempt in range(max_retries): try: return func() except Exception as e: if attempt == max_retries - 1: raise delay = base_delay * (2 ** attempt) jitter = random.uniform(0, delay * 0.1) time.sleep(delay + jitter) Each retry waits longer: 1s, 2s, 4s, 8s, 16s. Jitter prevents thundering herd. With tenacity 1 2 3 4 5 6 7 8 from tenacity import retry, stop_after_attempt, wait_exponential @retry( stop=stop_after_attempt(5), wait=wait_exponential(multiplier=1, min=1, max=60) ) def call_api(): return requests.get("https://api.example.com") Retry Only Transient Errors 1 2 3 4 5 6 7 8 from tenacity import retry, retry_if_exception_type @retry( retry=retry_if_exception_type((ConnectionError, TimeoutError)), stop=stop_after_attempt(3) ) def fetch_data(): return external_service.get() Don’t retry 400 Bad Request β€” that won’t fix itself. ...

February 28, 2026 Β· 2 min Β· 242 words Β· Rob Washington

Circuit Breakers: Fail Fast, Recover Gracefully

When a downstream service is failing, continuing to call it makes everything worse. Circuit breakers stop the cascade. The Pattern Three states: Closed: Normal operation, requests pass through Open: Service is failing, requests fail immediately Half-Open: Testing if service recovered [ C L β”Œ β”‚ β–Ό O β–² β”‚ β”” ─ S ─ ─ E ─ ─ D ─ ─ ] ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ f ─ ─ a ─ ─ i ─ ─ l s ─ u u ─ r c ─ e c ─ e ─ t s ─ h s ─ r ─ ─ e ─ ─ s ─ ─ h ─ ─ o ─ ─ l ─ ─ d ─ ─ ─ ─ ─ ─ ─ ─ β–Ά ─ ─ ─ ─ [ ─ ─ O ─ ─ P β”‚ β”‚ β”΄ ─ E ─ ─ N ─ ─ ] f ─ a ─ ─ i ─ ─ l ─ t u ─ i r ─ m e ─ e ─ ┐ β”‚ β”‚ o ─ u ─ t ─ ─ ─ ─ β”‚ β”‚ β”˜ β–Ά [ H A L F - O P E N ] Basic Implementation 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 import time from enum import Enum from threading import Lock class State(Enum): CLOSED = "closed" OPEN = "open" HALF_OPEN = "half_open" class CircuitBreaker: def __init__( self, failure_threshold: int = 5, recovery_timeout: int = 30, half_open_max_calls: int = 3 ): self.failure_threshold = failure_threshold self.recovery_timeout = recovery_timeout self.half_open_max_calls = half_open_max_calls self.state = State.CLOSED self.failure_count = 0 self.success_count = 0 self.last_failure_time = None self.lock = Lock() def can_execute(self) -> bool: with self.lock: if self.state == State.CLOSED: return True if self.state == State.OPEN: if time.time() - self.last_failure_time > self.recovery_timeout: self.state = State.HALF_OPEN self.success_count = 0 return True return False if self.state == State.HALF_OPEN: return self.success_count < self.half_open_max_calls return False def record_success(self): with self.lock: if self.state == State.HALF_OPEN: self.success_count += 1 if self.success_count >= self.half_open_max_calls: self.state = State.CLOSED self.failure_count = 0 else: self.failure_count = 0 def record_failure(self): with self.lock: self.failure_count += 1 self.last_failure_time = time.time() if self.state == State.HALF_OPEN: self.state = State.OPEN elif self.failure_count >= self.failure_threshold: self.state = State.OPEN Using the Circuit Breaker 1 2 3 4 5 6 7 8 9 10 11 12 13 payment_breaker = CircuitBreaker(failure_threshold=3, recovery_timeout=60) def process_payment(order): if not payment_breaker.can_execute(): raise ServiceUnavailable("Payment service circuit open") try: result = payment_service.charge(order) payment_breaker.record_success() return result except Exception as e: payment_breaker.record_failure() raise Decorator Pattern 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 from functools import wraps def circuit_breaker(breaker: CircuitBreaker): def decorator(func): @wraps(func) def wrapper(*args, **kwargs): if not breaker.can_execute(): raise CircuitOpenError(f"Circuit breaker open for {func.__name__}") try: result = func(*args, **kwargs) breaker.record_success() return result except Exception as e: breaker.record_failure() raise return wrapper return decorator # Usage payment_cb = CircuitBreaker() @circuit_breaker(payment_cb) def charge_customer(customer_id, amount): return payment_api.charge(customer_id, amount) With Fallback 1 2 3 4 5 6 7 8 9 10 11 12 def get_user_recommendations(user_id): if not recommendations_breaker.can_execute(): # Fallback to cached or default recommendations return get_cached_recommendations(user_id) or DEFAULT_RECOMMENDATIONS try: result = recommendations_service.get(user_id) recommendations_breaker.record_success() return result except Exception: recommendations_breaker.record_failure() return get_cached_recommendations(user_id) or DEFAULT_RECOMMENDATIONS Library: pybreaker 1 2 3 4 5 6 7 8 9 10 11 12 13 import pybreaker db_breaker = pybreaker.CircuitBreaker( fail_max=5, reset_timeout=30 ) @db_breaker def query_database(sql): return db.execute(sql) # Check state print(db_breaker.current_state) # 'closed', 'open', or 'half-open' Library: tenacity (with circuit breaker) 1 2 3 4 5 6 7 8 from tenacity import retry, stop_after_attempt, CircuitBreaker cb = CircuitBreaker(failure_threshold=3, recovery_time=60) @retry(stop=stop_after_attempt(3)) @cb def call_external_api(): return requests.get("https://api.example.com/data") Per-Service Breakers 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 class ServiceRegistry: def __init__(self): self.breakers = {} def get_breaker(self, service_name: str) -> CircuitBreaker: if service_name not in self.breakers: self.breakers[service_name] = CircuitBreaker() return self.breakers[service_name] registry = ServiceRegistry() def call_service(service_name: str, endpoint: str): breaker = registry.get_breaker(service_name) if not breaker.can_execute(): raise ServiceUnavailable(f"{service_name} circuit is open") try: result = http_client.get(f"http://{service_name}/{endpoint}") breaker.record_success() return result except Exception: breaker.record_failure() raise Monitoring 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 from prometheus_client import Counter, Gauge circuit_state = Gauge( 'circuit_breaker_state', 'Circuit breaker state (0=closed, 1=open, 2=half-open)', ['service'] ) circuit_failures = Counter( 'circuit_breaker_failures_total', 'Circuit breaker failure count', ['service'] ) circuit_rejections = Counter( 'circuit_breaker_rejections_total', 'Requests rejected by open circuit', ['service'] ) # Update metrics in circuit breaker def record_failure(self, service_name): circuit_failures.labels(service=service_name).inc() # ... rest of failure logic circuit_state.labels(service=service_name).set(self.state.value) Configuration Guidelines Scenario Threshold Timeout Critical service, fast recovery 3-5 failures 15-30s Non-critical, can wait 5-10 failures 60-120s Flaky external API 3 failures 30-60s Database 5 failures 30s Anti-Patterns 1. Single global breaker ...

February 28, 2026 Β· 5 min Β· 977 words Β· Rob Washington

Redis Patterns Beyond Simple Caching

Redis is often introduced as β€œa cache,” but that undersells it. Here are patterns that leverage Redis for rate limiting, sessions, queues, and real-time features. Pattern 1: Rate Limiting The sliding window approach: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 import redis import time r = redis.Redis() def is_rate_limited(user_id: str, limit: int = 100, window: int = 60) -> bool: """Allow `limit` requests per `window` seconds.""" key = f"ratelimit:{user_id}" now = time.time() pipe = r.pipeline() pipe.zremrangebyscore(key, 0, now - window) # Remove old entries pipe.zadd(key, {str(now): now}) # Add current request pipe.zcard(key) # Count requests in window pipe.expire(key, window) # Auto-cleanup results = pipe.execute() request_count = results[2] return request_count > limit Using a sorted set with timestamps gives you a true sliding window, not just fixed buckets. ...

February 28, 2026 Β· 6 min Β· 1071 words Β· Rob Washington

Circuit Breaker Patterns: Failing Fast Without Failing Hard

Your payment service is down. Every request to it times out after 30 seconds. You have 100 requests per second hitting that endpoint. Do the math: within a minute, you’ve got 6,000 threads waiting on a dead service, and your entire application is choking. This is where circuit breakers earn their keep. The Problem: Cascading Failures In distributed systems, a single failing dependency can take down everything. Without protection, your system will: ...

February 21, 2026 Β· 8 min Β· 1535 words Β· Rob Washington

LLM API Integration Patterns: Building Resilient AI Applications

Integrating Large Language Model APIs into production applications requires more than just calling an endpoint. Here are battle-tested patterns for building resilient, cost-effective LLM integrations. The Retry Cascade LLM APIs are notorious for rate limits and transient failures. A simple exponential backoff isn’t enough β€” you need a cascade strategy: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 import asyncio from dataclasses import dataclass from typing import Optional @dataclass class LLMResponse: content: str model: str tokens_used: int class LLMCascade: def __init__(self): self.providers = [ ("anthropic", "claude-sonnet-4-20250514", 3), ("openai", "gpt-4o", 2), ("anthropic", "claude-3-haiku-20240307", 5), ] async def complete(self, prompt: str) -> Optional[LLMResponse]: for provider, model, max_retries in self.providers: for attempt in range(max_retries): try: return await self._call_provider(provider, model, prompt) except RateLimitError: await asyncio.sleep(2 ** attempt) except ProviderError: break # Try next provider return None The cascade falls through primary to fallback models, attempting retries at each level before moving on. ...

February 17, 2026 Β· 5 min Β· 898 words Β· Rob Washington

Circuit Breaker Pattern: Failing Fast to Stay Resilient

Learn how circuit breakers prevent cascade failures in distributed systems by detecting failures early and failing fast instead of waiting for timeouts.

February 15, 2026 Β· 7 min Β· 1459 words Β· Rob Washington