Backend

Message Queues: Decoupling Services Without Losing Your Mind

Synchronous request-response is simple: client asks, server answers, everyone waits. But some operations don’t fit that model. Sending emails, processing images, generating reports — these take time, and your users shouldn’t wait. Message queues decouple the “request” from the “work,” letting you respond immediately while processing happens in the background. The Basic Pattern 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 # API endpoint - returns immediately @app.post("/orders") def create_order(order_data): order = db.create_order(order_data) # Queue async work instead of doing it now queue.enqueue('process_order', order.id) return {"order_id": order.id, "status": "processing"} # Worker - processes queue in background def process_order(order_id): order = db.get_order(order_id) charge_payment(order) send_confirmation_email(order) notify_warehouse(order) update_inventory(order) The user gets a response in milliseconds. The actual work happens whenever the worker gets to it. ...

Caching Strategies: The Two Hardest Problems in Computer Science

Phil Karlton’s famous quote about hard problems in computer science exists because caching is genuinely difficult. Not the mechanics — putting data in Redis is easy. The hard part is knowing when that data is wrong. Get caching right and your application feels instant. Get it wrong and users see stale data, inconsistent state, or worse — data that was never supposed to be visible to them. The Cache-Aside Pattern (Lazy Loading) The most common pattern: check cache first, fall back to database, populate cache on miss. ...

Database Migrations Without Downtime: The Expand-Contract Pattern

Database migrations are terrifying. One wrong move and you’ve locked tables, broken queries, or corrupted data. The traditional approach — maintenance window, stop traffic, migrate, restart — works but costs you availability and customer trust. The expand-contract pattern lets you migrate schemas incrementally, with zero downtime, while your application keeps serving traffic. The Core Idea Never make breaking changes in a single step. Instead: Expand: Add the new structure alongside the old Migrate: Move data and update application code Contract: Remove the old structure Each step is independently deployable and reversible. ...

API Versioning: Breaking Changes Without Breaking Clients

Every API eventually needs to change in ways that break existing clients. Field removed, response format changed, authentication updated. The question isn’t whether you’ll have breaking changes — it’s how you’ll manage them. Good versioning gives you freedom to evolve while giving clients stability and migration paths. The Three Schools of Versioning URL Path Versioning G G E E T T / / a a p p i i / / v 1 2 / / u u s s e e r r s s / / 1 1 2 2 3 3 Pros: ...

Environment Variables Done Right: Configuration Without Chaos

The twelve-factor app methodology tells us: store config in the environment. It’s good advice. Environment variables separate config from code, making the same artifact deployable across environments. But “store config in environment variables” doesn’t mean “scatter random ENV vars everywhere and hope for the best.” Done poorly, environment configuration becomes untraceable, untestable, and unmaintainable. The Baseline Pattern Centralize environment variable access: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 import os from dataclasses import dataclass from typing import Optional @dataclass class Config: database_url: str redis_url: str api_key: str debug: bool max_workers: int log_level: str @classmethod def from_env(cls) -> 'Config': return cls( database_url=os.environ['DATABASE_URL'], redis_url=os.environ['REDIS_URL'], api_key=os.environ['API_KEY'], debug=os.environ.get('DEBUG', 'false').lower() == 'true', max_workers=int(os.environ.get('MAX_WORKERS', '4')), log_level=os.environ.get('LOG_LEVEL', 'INFO'), ) # Single source of truth config = Config.from_env() Benefits: ...

Retry Patterns: Exponential Backoff and Beyond

Networks fail. Databases hiccup. External APIs return 503. In distributed systems, transient failures are not exceptional — they’re expected. The question isn’t whether to retry, but how. Bad retry logic turns a brief outage into a cascading failure. Good retry logic absorbs transient issues invisibly. The Naive Retry (Don’t Do This) 1 2 3 4 5 6 7 # Immediate retry loop - a recipe for disaster def call_api(): while True: try: return requests.get(API_URL) except RequestException: continue # Hammer the failing service This creates a thundering herd. If the service is struggling, every client immediately retrying makes it worse. You’re not recovering from failure — you’re accelerating toward total collapse. ...

Graceful Shutdown: Finishing What You Started

When Kubernetes scales down your deployment or you push a new release, your running containers receive SIGTERM. Then, after a grace period, SIGKILL. The difference between graceful and chaotic shutdown is what happens in those seconds between the two signals. A request half-processed, a database transaction uncommitted, a file partially written — these are the artifacts of ungraceful shutdown. They create inconsistent state, failed requests, and debugging nightmares. The Signal Sequence 1 2 3 4 5 6 7 . . . . . . . S G P P P P I I r r r r r f G a o o o o T c c c c c s E e e e e e t R s s s s i M p s s s s l e l s r s s s e e i h h h x r n o o o o i u t d u u u t n l l l s n t c d d d i o o w n u s f c i g p n t i l t : r t o n o h o d p i s S c o s e c I e w a h o G s n c c d K s c i o e I b e n n L e p - n 0 L g t f e i i l c ( n n i t n s g g i o h o ( n t n m d e s e e w w r f o c c a w r l y u o k e ) l r a t k n : l y 3 0 s i n K u b e r n e t e s ) Basic Signal Handling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 import signal import sys shutdown_requested = False def handle_sigterm(signum, frame): global shutdown_requested print("SIGTERM received, initiating graceful shutdown...") shutdown_requested = True signal.signal(signal.SIGTERM, handle_sigterm) signal.signal(signal.SIGINT, handle_sigterm) # Ctrl+C # Main loop checks shutdown flag while not shutdown_requested: process_next_item() # Cleanup after loop exits cleanup() sys.exit(0) Web Servers: Stop Accepting, Finish Processing Most web frameworks have built-in graceful shutdown. The pattern: ...

Health Check Endpoints: More Than Just 200 OK

Every container orchestrator, load balancer, and monitoring system asks the same question: is this service healthy? The answer you provide determines whether traffic gets routed, containers get replaced, and alerts get fired. A health check that lies — always returning 200 even when the database is down — is worse than no health check at all. It creates false confidence while your users experience failures. The Three Types of Health Checks Liveness: “Is the process alive?” Liveness checks answer: should this container be killed and restarted? ...

Structured Logging: Stop Grepping, Start Querying

You’ve seen this log line before: 2 0 2 6 - 0 2 - 2 3 0 5 : 3 0 : 0 0 I N F O U s e r j o h n @ e x a m p l e . c o m l o g g e d i n f r o m 1 9 2 . 1 6 8 . 1 . 1 0 0 a f t e r 2 f a i l e d a t t e m p t s Human readable. Grep-able. And completely useless for answering questions like “how many users had failed login attempts yesterday?” or “what’s the P95 response time for requests from the EU region?” ...

Database Connection Pooling: The Performance Win You're Probably Missing

Database connections are expensive. Each new connection requires TCP handshake, authentication, session initialization, and memory allocation on both client and server. Do this for every web request and you’ve got a performance problem hiding in plain sight. The Problem: Connection Overhead A typical PostgreSQL connection takes 50-100ms to establish. For a web request that runs a 5ms query, you’re spending 10-20x more time on connection setup than actual work. 1 2 3 4 5 6 7 8 # The naive approach - connection per request def get_user(user_id): conn = psycopg2.connect(DATABASE_URL) # 50-100ms cursor = conn.cursor() cursor.execute("SELECT * FROM users WHERE id = %s", (user_id,)) # 5ms result = cursor.fetchone() conn.close() return result Under load, this creates connection storms. Each concurrent request opens its own connection. The database has connection limits. You hit those limits, new connections queue, latency spikes, everything falls apart. ...