Posts

Logging Levels: A Practical Guide to What Goes Where

Logging seems simple until you’re debugging production at 2 AM, scrolling through millions of lines trying to find the one that matters. Good logging practices make that experience less painful. Here’s how to think about log levels.

The Levels

Most logging frameworks use these standard levels, ordered from least to most severe:

DEBUG < INFO < WARN < ERROR < FATAL

In production, you typically run at INFO or WARN. Setting a level enables that level and every more severe one (INFO also emits WARN, ERROR, and FATAL). ...
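As a rough illustration of the threshold behaviour described above, here is a small sketch using Python’s standard `logging` module. The logger name `demo` and the messages are made up for the example, and note that Python spells the levels WARNING and CRITICAL rather than WARN and FATAL.

```python
import io
import logging

# Capture output in memory so the effect of the threshold is easy to inspect.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))

logger = logging.getLogger("demo")  # hypothetical logger name
logger.addHandler(handler)
logger.propagate = False            # keep the demo output out of the root logger
logger.setLevel(logging.INFO)       # threshold: INFO and anything more severe

logger.debug("cache lookup took 3ms")    # below the threshold, dropped
logger.info("order 42 created")          # emitted
logger.warning("retrying payment call")  # emitted
logger.error("payment call failed")      # emitted

emitted = stream.getvalue().splitlines()
print(emitted)
```

Lowering the threshold to `logging.DEBUG` would make the first message appear as well, which is usually what you want locally and rarely what you want in production.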

Database Connection Pooling: Stop Opening Connections for Every Query

Opening a database connection is expensive. TCP handshake, SSL negotiation, authentication, session setup: it all adds up. Do that for every query and your application crawls. Connection pooling fixes this by reusing connections. Here’s how to do it right.

The Problem

Without pooling, every request opens a new connection:

```python
import psycopg2

# BAD: New connection per request
def get_user(user_id):
    conn = psycopg2.connect(DATABASE_URL)  # ~50-100ms
    cursor = conn.cursor()
    cursor.execute("SELECT * FROM users WHERE id = %s", (user_id,))
    user = cursor.fetchone()
    conn.close()
    return user
```

At 100 requests per second, that’s 100 connections opening and closing per second. Your database server has a connection limit (typically 100-500). You’ll exhaust it fast. ...
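To make the idea of reuse concrete, here is a toy fixed-size pool. It is only a sketch: `SimplePool` and `run_query` are hypothetical names, and `sqlite3` stands in for a real database server so the example runs anywhere. A production application would normally rely on the pool provided by its driver or framework instead.

```python
import queue
import sqlite3
from contextlib import contextmanager

class SimplePool:
    """Toy fixed-size pool: connections are opened once and then reused."""

    def __init__(self, connect, size=2):
        self.opened = 0
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(connect())
            self.opened += 1

    @contextmanager
    def connection(self, timeout=5):
        # Blocks (up to `timeout` seconds) when every connection is in use.
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            self._idle.put(conn)  # hand it back instead of closing it

# sqlite3 stands in for a real server so the sketch is runnable as-is.
pool = SimplePool(lambda: sqlite3.connect(":memory:"), size=2)

def run_query(sql):
    with pool.connection() as conn:
        return conn.execute(sql).fetchone()

results = [run_query("SELECT 1 + 1") for _ in range(100)]
print(pool.opened, results[0])
```

One hundred queries run here, but only two connections are ever opened. The blocking `get` is also what protects the database: when the pool is exhausted, callers wait instead of opening more connections.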

API Error Handling That Helps Instead of Frustrates

Bad error handling wastes everyone’s time. A cryptic “Error 500” sends developers on a debugging odyssey. A well-designed error response tells them exactly what went wrong and how to fix it. Here’s how to build the latter.

The Anatomy of a Good Error

Every error response should answer three questions:

- What happened? (error code/type)
- Why? (human-readable message)
- How do I fix it? (actionable guidance)

```json
{
  "error": {
    "code": "VALIDATION_ERROR",
    "message": "Request validation failed",
    "details": [
      {
        "field": "email",
        "message": "Invalid email format",
        "received": "not-an-email"
      },
      {
        "field": "age",
        "message": "Must be a positive integer",
        "received": "-5"
      }
    ],
    "documentation_url": "https://api.example.com/docs/errors#VALIDATION_ERROR"
  },
  "request_id": "req_abc123"
}
```

Always include: ...
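One way to keep every endpoint producing this shape is to build it in a single helper. The sketch below is an assumption about how that might look: `error_response` and its parameters are hypothetical names, and the documentation URL simply follows the pattern shown in the example response.

```python
import json

def error_response(code, message, details=None, request_id=None,
                   docs_base="https://api.example.com/docs/errors"):
    """Build an error body that says what happened, why, and where to read more."""
    body = {
        "error": {
            "code": code,                      # what happened
            "message": message,                # why
            "details": details or [],          # per-field guidance on how to fix it
            "documentation_url": f"{docs_base}#{code}",
        }
    }
    if request_id is not None:
        body["request_id"] = request_id        # lets support correlate with logs
    return body

payload = error_response(
    "VALIDATION_ERROR",
    "Request validation failed",
    details=[{"field": "email", "message": "Invalid email format",
              "received": "not-an-email"}],
    request_id="req_abc123",
)
print(json.dumps(payload, indent=2))
```

Centralising the structure means a new error type only has to supply a code, a message, and details. The envelope stays consistent for clients.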

Configuration Management Patterns for Reliable Deployments

Configuration is where deployments go to die. A typo in an environment variable, a missing secret, a config file that works in staging but breaks in production. Here’s how to make configuration boring and reliable.

The Hierarchy of Configuration

Not all config is created equal. Layer it:

```
1. Defaults (in code)              Safest fallbacks
2. Config files (per environment)  Environment-specific
3. Environment variables           Runtime overrides
4. Command-line flags              One-off overrides
5. Feature flags (runtime)         Dynamic changes
```

Later layers override earlier ones. This lets you: ...
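The layering rule (later layers override earlier ones) can be sketched as a sequence of dictionary merges. This is a minimal illustration, not a full loader: `load_config`, the `APP_` prefix, and the sample keys are assumptions, and the runtime feature-flag layer is left out because it usually lives in a separate service.

```python
import os

def load_config(defaults, file_config, cli_args, env=None, env_prefix="APP_"):
    """Merge configuration layers; each later layer overrides the earlier ones."""
    env = os.environ if env is None else env

    config = dict(defaults)        # 1. defaults defined in code
    config.update(file_config)     # 2. per-environment config file

    # 3. environment variables: APP_PORT=9000 becomes {"port": "9000"}
    #    (values arrive as strings; type conversion is left to the caller)
    config.update({
        key[len(env_prefix):].lower(): value
        for key, value in env.items()
        if key.startswith(env_prefix)
    })

    # 4. command-line flags: only flags the user actually passed
    config.update({k: v for k, v in cli_args.items() if v is not None})
    return config

config = load_config(
    defaults={"port": 8000, "log_level": "INFO", "debug": False},
    file_config={"log_level": "WARN"},
    cli_args={"debug": True, "port": None},
    env={"APP_PORT": "9000", "HOME": "/root"},
)
print(config)
```

Here the port comes from the environment, the log level from the config file, and `debug` from the command line. Every value can be traced to exactly one layer.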

Rate Limiting Strategies That Protect Without Frustrating

Rate limiting is the bouncer at your API’s door. Too strict, and legitimate users get frustrated. Too loose, and one bad actor can take down your service. Here’s how to find the balance.

Why Rate Limit?

Without limits, a single client can:

- Exhaust your database connections
- Burn through your third-party API quotas
- Inflate your cloud bill
- Deny service to everyone else

Rate limiting isn’t about being restrictive; it’s about being fair. ...
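The excerpt does not say which algorithm the post uses, so take the following as one common option rather than the author’s choice: a token bucket, which allows short bursts while capping the sustained rate. The class name, parameters, and injectable clock are all assumptions made for the sketch.

```python
import time

class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate` tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)   # start full so new clients can burst
        self.updated = clock()

    def allow(self, cost=1):
        now = self.clock()
        elapsed = now - self.updated
        self.updated = now
        # Refill for the time that has passed, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

# A fake clock keeps the demo deterministic.
fake_now = [0.0]
bucket = TokenBucket(rate=1, capacity=3, clock=lambda: fake_now[0])

burst = [bucket.allow() for _ in range(5)]       # three pass, two are rejected
fake_now[0] += 2.0                               # two seconds later: two new tokens
after_wait = [bucket.allow() for _ in range(3)]
print(burst, after_wait)
```

In a real service you would keep one bucket per client (for example per API key) and answer rejected requests with HTTP 429.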

Background Job Patterns That Actually Scale

Every production system eventually needs background jobs. Email notifications, report generation, data syncing, webhook processing: the work that can’t (or shouldn’t) happen during a user request. Here’s what I’ve learned about making them reliable.

The Naive Approach (And Why It Breaks)

Most developers start with something like this:

```python
@app.route('/signup', methods=['POST'])  # the form is submitted via POST
def signup():
    user = create_user(request.form)
    send_welcome_email(user)  # Blocks the response
    return redirect('/dashboard')
```

This works until it doesn’t. The email service has a 5-second timeout. Now your signup page feels broken. Or the email service is down, and signups fail entirely. ...
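The basic fix is to enqueue the slow work and return immediately. The sketch below shows the shape of that idea with an in-process queue and a worker thread. All names are hypothetical, and `send_welcome_email` is a stand-in. An in-memory queue loses jobs when the process restarts, so a production system would normally use a durable queue instead.

```python
import queue
import threading

jobs = queue.Queue()
sent = []

def send_welcome_email(user):
    # Stand-in for a slow call to an email provider.
    sent.append(user["email"])

def worker():
    while True:
        func, args = jobs.get()
        try:
            func(*args)
        except Exception as exc:
            # A real system would retry or move the job to a dead-letter queue.
            print(f"job failed: {exc!r}")
        finally:
            jobs.task_done()

threading.Thread(target=worker, daemon=True).start()

def signup(form):
    user = {"email": form["email"]}
    jobs.put((send_welcome_email, (user,)))  # enqueue and return immediately
    return "redirect:/dashboard"

response = signup({"email": "ada@example.com"})
jobs.join()  # only for the demo: wait until the worker has drained the queue
print(response, sent)
```

The request handler no longer depends on the email service being fast or even available. A failure in the job is logged by the worker rather than surfacing as a failed signup.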

Retry Patterns: When and How to Try Again

Not all failures are permanent. Retry patterns help distinguish transient hiccups from real problems.

Exponential Backoff

```python
import time
import random

def retry_with_backoff(func, max_retries=5, base_delay=1):
    for attempt in range(max_retries):
        try:
            return func()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of attempts: surface the last error
            delay = base_delay * (2 ** attempt)
            jitter = random.uniform(0, delay * 0.1)
            time.sleep(delay + jitter)
```

Each retry waits longer: 1s, 2s, 4s, 8s. With the default of five attempts there are four waits, because the fifth failure is re-raised instead of sleeping. Jitter prevents a thundering herd.

With tenacity

```python
import requests
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(
    stop=stop_after_attempt(5),
    wait=wait_exponential(multiplier=1, min=1, max=60)
)
def call_api():
    return requests.get("https://api.example.com")
```

Retry Only Transient Errors

```python
from tenacity import retry, retry_if_exception_type, stop_after_attempt

@retry(
    retry=retry_if_exception_type((ConnectionError, TimeoutError)),
    stop=stop_after_attempt(3)
)
def fetch_data():
    return external_service.get()
```

Don’t retry a 400 Bad Request; it won’t fix itself. ...
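The same distinction applies to HTTP responses. Which status codes count as transient is a judgement call that depends on the API. The set below is a commonly used starting point, not something taken from the post, and `is_retryable` is a hypothetical helper name.

```python
# Assumption: these statuses usually signal a temporary condition.
# 408 Request Timeout, 429 Too Many Requests, and the 5xx gateway/server errors.
RETRYABLE_STATUS = frozenset({408, 429, 500, 502, 503, 504})

def is_retryable(status_code: int) -> bool:
    """Return True when trying again later has a realistic chance of succeeding."""
    return status_code in RETRYABLE_STATUS

checks = {code: is_retryable(code) for code in (400, 401, 404, 429, 503)}
print(checks)
```

Client errors such as 400, 401, and 404 describe a problem with the request itself, so repeating the identical request only adds load.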

Circuit Breakers: Fail Fast, Recover Gracefully

When a downstream service is failing, continuing to call it makes everything worse. Circuit breakers stop the cascade.

The Pattern

Three states:

- Closed: Normal operation, requests pass through
- Open: Service is failing, requests fail immediately
- Half-Open: Testing if service recovered

```
           failure threshold            timeout
[CLOSED] ─────────────────────▶ [OPEN] ─────────▶ [HALF-OPEN]
    ▲                             ▲                    │
    │                             └───── failure ──────┤
    └──────────────────── success ─────────────────────┘
```

Basic Implementation

```python
import time
from enum import Enum
from threading import Lock

class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 5,
        recovery_timeout: int = 30,
        half_open_max_calls: int = 3
    ):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.half_open_max_calls = half_open_max_calls

        self.state = State.CLOSED
        self.failure_count = 0
        self.success_count = 0
        self.last_failure_time = None
        self.lock = Lock()

    def can_execute(self) -> bool:
        with self.lock:
            if self.state == State.CLOSED:
                return True

            if self.state == State.OPEN:
                # After the recovery timeout, let trial requests through
                if time.time() - self.last_failure_time > self.recovery_timeout:
                    self.state = State.HALF_OPEN
                    self.success_count = 0
                    return True
                return False

            if self.state == State.HALF_OPEN:
                return self.success_count < self.half_open_max_calls

            return False

    def record_success(self):
        with self.lock:
            if self.state == State.HALF_OPEN:
                self.success_count += 1
                # Enough successful trial calls: close the circuit again
                if self.success_count >= self.half_open_max_calls:
                    self.state = State.CLOSED
                    self.failure_count = 0
            else:
                self.failure_count = 0

    def record_failure(self):
        with self.lock:
            self.failure_count += 1
            self.last_failure_time = time.time()

            # Any failure while half-open reopens the circuit immediately
            if self.state == State.HALF_OPEN:
                self.state = State.OPEN
            elif self.failure_count >= self.failure_threshold:
                self.state = State.OPEN
```

Using the Circuit Breaker

```python
payment_breaker = CircuitBreaker(failure_threshold=3, recovery_timeout=60)

def process_payment(order):
    if not payment_breaker.can_execute():
        # ServiceUnavailable is an application-defined exception
        raise ServiceUnavailable("Payment service circuit open")

    try:
        result = payment_service.charge(order)
        payment_breaker.record_success()
        return result
    except Exception:
        payment_breaker.record_failure()
        raise
```

Decorator Pattern

```python
from functools import wraps

class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""

def circuit_breaker(breaker: CircuitBreaker):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if not breaker.can_execute():
                raise CircuitOpenError(f"Circuit breaker open for {func.__name__}")

            try:
                result = func(*args, **kwargs)
                breaker.record_success()
                return result
            except Exception:
                breaker.record_failure()
                raise
        return wrapper
    return decorator

# Usage
payment_cb = CircuitBreaker()

@circuit_breaker(payment_cb)
def charge_customer(customer_id, amount):
    return payment_api.charge(customer_id, amount)
```

With Fallback

```python
def get_user_recommendations(user_id):
    if not recommendations_breaker.can_execute():
        # Fallback to cached or default recommendations
        return get_cached_recommendations(user_id) or DEFAULT_RECOMMENDATIONS

    try:
        result = recommendations_service.get(user_id)
        recommendations_breaker.record_success()
        return result
    except Exception:
        recommendations_breaker.record_failure()
        return get_cached_recommendations(user_id) or DEFAULT_RECOMMENDATIONS
```

Library: pybreaker

```python
import pybreaker

db_breaker = pybreaker.CircuitBreaker(
    fail_max=5,
    reset_timeout=30
)

@db_breaker
def query_database(sql):
    return db.execute(sql)

# Check state
print(db_breaker.current_state)  # 'closed', 'open', or 'half-open'
```

Library: tenacity (combined with pybreaker)

tenacity handles retries but does not ship a circuit breaker of its own, so pair it with pybreaker:

```python
import pybreaker
import requests
from tenacity import retry, retry_if_not_exception_type, stop_after_attempt

cb = pybreaker.CircuitBreaker(fail_max=3, reset_timeout=60)

# Retry ordinary failures, but give up at once when the breaker is open
@retry(
    stop=stop_after_attempt(3),
    retry=retry_if_not_exception_type(pybreaker.CircuitBreakerError)
)
@cb
def call_external_api():
    return requests.get("https://api.example.com/data")
```

Per-Service Breakers

```python
class ServiceRegistry:
    def __init__(self):
        self.breakers = {}

    def get_breaker(self, service_name: str) -> CircuitBreaker:
        if service_name not in self.breakers:
            self.breakers[service_name] = CircuitBreaker()
        return self.breakers[service_name]

registry = ServiceRegistry()

def call_service(service_name: str, endpoint: str):
    breaker = registry.get_breaker(service_name)

    if not breaker.can_execute():
        raise ServiceUnavailable(f"{service_name} circuit is open")

    try:
        result = http_client.get(f"http://{service_name}/{endpoint}")
        breaker.record_success()
        return result
    except Exception:
        breaker.record_failure()
        raise
```

Monitoring

```python
from prometheus_client import Counter, Gauge

circuit_state = Gauge(
    'circuit_breaker_state',
    'Circuit breaker state (0=closed, 1=open, 2=half-open)',
    ['service']
)

circuit_failures = Counter(
    'circuit_breaker_failures_total',
    'Circuit breaker failure count',
    ['service']
)

circuit_rejections = Counter(
    'circuit_breaker_rejections_total',
    'Requests rejected by open circuit',
    ['service']
)

# State.value is a string, so map each state to the number the gauge documents
STATE_CODES = {State.CLOSED: 0, State.OPEN: 1, State.HALF_OPEN: 2}

# Update metrics in circuit breaker
def record_failure(self, service_name):
    circuit_failures.labels(service=service_name).inc()
    # ... rest of failure logic
    circuit_state.labels(service=service_name).set(STATE_CODES[self.state])
```

Configuration Guidelines

| Scenario | Threshold | Timeout |
| --- | --- | --- |
| Critical service, fast recovery | 3-5 failures | 15-30s |
| Non-critical, can wait | 5-10 failures | 60-120s |
| Flaky external API | 3 failures | 30-60s |
| Database | 5 failures | 30s |

Anti-Patterns

1. Single global breaker ...

Caching Strategies: What to Cache and When to Invalidate

Cache invalidation is one of the two hard problems in computer science. Here’s how to make it less painful.

The Caching Patterns

Cache-Aside (Lazy Loading)

```python
import json

# `redis` and `db` are assumed to be already-configured client objects
def get_user(user_id: str) -> dict:
    # Check cache first
    cached = redis.get(f"user:{user_id}")
    if cached:
        return json.loads(cached)

    # Cache miss: fetch from database
    user = db.query("SELECT * FROM users WHERE id = %s", user_id)

    # Store in cache for next time (expires after one hour)
    redis.setex(f"user:{user_id}", 3600, json.dumps(user))

    return user
```

Pros: Only caches what’s actually used

Cons: First request always slow (cache miss) ...
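Cache-aside covers reads; the other half is deciding what happens on writes. A simple approach, sketched below under stated assumptions, is to delete the cached entry whenever the underlying record changes and let the next read repopulate it. `TTLCache`, the `DATABASE` dict, and the helper functions are all stand-ins so the example runs without Redis or a real database.

```python
import time

class TTLCache:
    """Tiny in-memory cache with per-entry expiry (stand-in for Redis)."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._data = {}

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self.clock() >= expires_at:
            del self._data[key]      # expired: treat as a miss
            return None
        return value

    def set(self, key, value):
        self._data[key] = (value, self.clock() + self.ttl)

    def delete(self, key):
        self._data.pop(key, None)

DATABASE = {"42": {"name": "Ada"}}   # stand-in for the real database
cache = TTLCache(ttl_seconds=3600)
db_reads = 0

def get_user(user_id):
    global db_reads
    cached = cache.get(f"user:{user_id}")
    if cached is not None:
        return cached
    db_reads += 1                          # cache miss: go to the database
    user = dict(DATABASE[user_id])
    cache.set(f"user:{user_id}", user)
    return user

def update_user(user_id, fields):
    DATABASE[user_id] = {**DATABASE[user_id], **fields}
    cache.delete(f"user:{user_id}")        # invalidate so readers never see stale data

first = get_user("42")
second = get_user("42")                    # served from cache
update_user("42", {"name": "Grace"})
third = get_user("42")                     # miss after invalidation
print(db_reads, first, third)
```

Deleting rather than updating the cached value keeps the write path simple and avoids caching data nobody reads. The cost is one extra miss after every write.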

Webhook Patterns: Receiving Events Reliably

Webhooks flip the API model: instead of polling, the service calls you. Here’s how to handle them without losing events.

Basic Webhook Handler

```python
from fastapi import FastAPI, Request, HTTPException
import hmac
import hashlib
import json

app = FastAPI()

@app.post("/webhooks/stripe")
async def stripe_webhook(request: Request):
    payload = await request.body()
    sig_header = request.headers.get("Stripe-Signature")

    # Verify signature first
    if not verify_stripe_signature(payload, sig_header):
        raise HTTPException(status_code=400, detail="Invalid signature")

    event = json.loads(payload)

    # Process event
    if event["type"] == "payment_intent.succeeded":
        handle_payment_success(event["data"]["object"])

    # Always return 200 quickly
    return {"received": True}
```

Signature Verification

Never trust unverified webhooks. ...
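The core of signature verification is recomputing an HMAC over the raw request body and comparing it in constant time. The sketch below shows that generic idea only. The function names and the plain hex-digest format are assumptions, and providers such as Stripe define their own header format, so follow the provider’s documentation or SDK for the real thing.

```python
import hashlib
import hmac

def sign(payload: bytes, secret: bytes) -> str:
    """Hex-encoded HMAC-SHA256 of the raw request body."""
    return hmac.new(secret, payload, hashlib.sha256).hexdigest()

def verify_signature(payload: bytes, signature, secret: bytes) -> bool:
    if not signature:
        return False                     # missing header: reject
    expected = sign(payload, secret)
    # compare_digest takes the same time whether or not the values match,
    # which avoids leaking information through response timing.
    return hmac.compare_digest(expected.encode(), signature.encode())

secret = b"whsec_example"                # hypothetical shared secret
body = b'{"type": "payment_intent.succeeded"}'
good_signature = sign(body, secret)

print(verify_signature(body, good_signature, secret))
print(verify_signature(body + b" ", good_signature, secret))
```

Verification must run on the exact bytes received. Parsing the JSON and re-serialising it before hashing will change whitespace or key order and make valid signatures fail.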