Background Job Patterns That Won't Wake You Up at 3 AM

Background jobs are the janitors of your application. They handle the work that doesn’t need to happen immediately: sending emails, processing uploads, generating reports, syncing data. When they work, nobody notices. When they fail, everyone notices—usually at 3 AM. Here’s how to build jobs that let you sleep. The Fundamentals: Idempotency First Every background job should be safe to run twice. Network hiccups, worker crashes, queue retries—your job will execute more than once eventually. ...
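The "safe to run twice" principle from the excerpt can be sketched as a job that records completed work before anything else can re-trigger it. This is a minimal illustration, not the article's code: the in-memory ledger and the `send_welcome_email` helper are hypothetical stand-ins (production would use a durable store).

```python
# Idempotent job sketch: running the same job twice does the work once.
processed: set[str] = set()  # assumption: stands in for a durable ledger

def send_welcome_email(user_id: str) -> None:
    # Hypothetical side effect; imagine an SMTP call here.
    print(f"email sent to {user_id}")

def run_job(job_id: str, user_id: str) -> bool:
    """Return True if work was done, False if it had already been done."""
    if job_id in processed:
        return False  # retry or duplicate delivery: no-op
    send_welcome_email(user_id)
    processed.add(job_id)
    return True
```

The key property: a queue retry that redelivers `job_id` is harmless, because the second invocation observes the ledger and does nothing.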

March 10, 2026 · 6 min · 1257 words · Rob Washington

Graceful Degradation Patterns: When Dependencies Fail

Every production system has dependencies. APIs, databases, caches, third-party services. Each one can fail. The question isn’t if they’ll fail, but how your system behaves when they do. Graceful degradation means your system continues providing value—reduced, maybe, but value—when dependencies are unavailable. The opposite is cascade failure: one service dies, and everything dies with it. Here are the patterns that make the difference. The Hierarchy of Degradation Not all degradation is equal. Design for multiple levels: ...
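The "multiple levels of degradation" idea can be sketched as a fallback chain: try the full experience, fall back to a generic one, and finally return something minimal so the page still renders. The fetcher functions here are hypothetical, not from the post.

```python
# Fallback chain sketch: full -> degraded -> minimal, never a crash.
def get_recommendations(user_id, fetch_personalized, fetch_popular):
    try:
        return fetch_personalized(user_id)  # level 1: full experience
    except Exception:
        pass  # personalization service down; degrade
    try:
        return fetch_popular()              # level 2: generic list
    except Exception:
        return []                           # level 3: empty, page still works
```

Each level provides less value than the one above it, but none of them takes the whole response down with a failed dependency.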

March 9, 2026 · 6 min · 1208 words · Rob Washington

LLM API Patterns for Production Systems

Building toy demos with LLM APIs is easy. Building production systems that handle real traffic, fail gracefully, and don’t bankrupt you? That’s where it gets interesting. The Reality of Production LLM Integration Most tutorials show you curl to an API and celebrate. Real systems need to handle: API rate limits and throttling Transient failures and retries Cost explosion from runaway loops Latency variance (100ms to 30s responses) Model version changes breaking prompts Token limits exceeding input size Let’s look at patterns that actually work. ...
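Two items on that list, cost explosion from runaway loops and token limits exceeding input size, can be guarded with simple checks before any API call is made. This is a hedged sketch: the budget numbers and the 4-characters-per-token estimate are assumptions, not figures from the post.

```python
# Budget guard sketch: cap calls per request and reject oversized prompts.
MAX_CALLS_PER_REQUEST = 5   # assumption: tune per workload
MAX_INPUT_TOKENS = 8000     # assumption: below the model's context limit

def estimate_tokens(text: str) -> int:
    return len(text) // 4   # rough heuristic, not a real tokenizer

def guarded_call(prompt: str, calls_made: int) -> None:
    """Raise before spending money on a doomed or runaway request."""
    if calls_made >= MAX_CALLS_PER_REQUEST:
        raise RuntimeError("call budget exhausted for this request")
    if estimate_tokens(prompt) > MAX_INPUT_TOKENS:
        raise ValueError("prompt exceeds input token limit")
    # ...the actual LLM API call would go here...
```

Checking before the call means a retry loop that has gone wrong fails fast instead of billing five more completions.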

March 6, 2026 · 5 min · 1000 words · Rob Washington

Event-Driven Architecture: Decoupling Services the Right Way

Synchronous HTTP calls create tight coupling. Service A waits for Service B, which waits for Service C. One slow service blocks everything. One failure cascades everywhere. Event-driven architecture breaks this chain. The Core Idea Instead of direct calls, services communicate through events:

[diagram: the Order Service publishes "OrderCreated" events to a Message Broker; the Email, Inventory, and Shipping Services each subscribe and react independently]

The Order Service doesn’t know or care who’s listening. It just announces what happened. ...
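That decoupling can be shown with a minimal in-process sketch: the publisher hands an event to a broker and has no knowledge of its subscribers. The `Broker` class and topic name are illustrative, not the post's code; a real system would use a message bus like Kafka or RabbitMQ.

```python
# In-process pub/sub sketch: publisher and subscribers never meet.
from collections import defaultdict

class Broker:
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self.subscribers[topic].append(handler)

    def publish(self, topic, event):
        # The publisher calls this; it never sees the handlers directly.
        for handler in self.subscribers[topic]:
            handler(event)

broker = Broker()
received = []
broker.subscribe("OrderCreated", lambda e: received.append(("email", e)))
broker.subscribe("OrderCreated", lambda e: received.append(("inventory", e)))
broker.publish("OrderCreated", {"order_id": 42})
```

Adding a Shipping Service is one more `subscribe` call; the Order Service's code does not change.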

March 4, 2026 · 8 min · 1622 words · Rob Washington

API Versioning: Strategies That Won't Break Your Clients

You shipped v1 of your API. Clients integrated. Now you need to make breaking changes. How do you evolve without breaking everyone? API versioning is the answer—but there’s no single “right” approach. Let’s examine the tradeoffs. What Counts as a Breaking Change? Before versioning, understand what actually breaks clients: Breaking changes: Removing a field from responses Removing an endpoint Changing a field’s type ("price": "19.99" → "price": 19.99) Renaming a field Changing required request parameters Changing authentication methods Non-breaking changes: ...
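The breaking-change example in the list (`"price": "19.99"` becoming `"price": 19.99`) is exactly the situation URL-path versioning handles: old clients keep the old shape under `/v1`, new clients opt into `/v2`. This routing table is a hypothetical sketch of one strategy, not a recommendation of it over header or query-parameter versioning.

```python
# URL-path versioning sketch: both response shapes stay available.
def get_product_v1(pid):
    return {"id": pid, "price": "19.99"}  # legacy: price as a string

def get_product_v2(pid):
    return {"id": pid, "price": 19.99}    # breaking change: price as a number

ROUTES = {
    "/v1/products": get_product_v1,
    "/v2/products": get_product_v2,
}

def handle(path, pid):
    return ROUTES[path](pid)
```

Existing integrations against `/v1` never see the type change; the cost is maintaining both handlers until `/v1` can be retired.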

March 4, 2026 · 6 min · 1245 words · Rob Washington

Caching Strategies: When, Where, and How to Cache

Caching is one of the most powerful performance optimizations available. It’s also one of the easiest to get wrong. The classic joke—“there are only two hard things in computer science: cache invalidation and naming things”—exists for a reason. Let’s cut through the complexity. When to Cache Not everything should be cached. Before adding a cache, ask: Is this data read more than written? Caching write-heavy data creates invalidation nightmares. Is computing this expensive? Database queries, API calls, complex calculations—good candidates. Can I tolerate stale data? If not, caching gets complicated fast. Is this a hot path? Cache what’s accessed frequently, not everything.

```python
# Good cache candidate: expensive query, rarely changes
@cache(ttl=3600)
def get_product_catalog():
    return db.query("SELECT * FROM products WHERE active = true")

# Bad cache candidate: changes every request
@cache(ttl=60)  # Don't do this
def get_user_cart(user_id):
    return db.query("SELECT * FROM carts WHERE user_id = ?", user_id)
```

Where to Cache Caching happens at multiple layers. Each has tradeoffs. ...

March 4, 2026 · 6 min · 1147 words · Rob Washington

Feature Flags: Decoupling Deployment from Release

Deploying code and releasing features are not the same thing. Treating them as identical creates unnecessary risk, slows down development, and makes rollbacks painful. Feature flags fix this. The Problem with Deploy-Equals-Release Traditional deployment pipelines work like this: code merges, tests pass, artifact builds, deployment happens, users see the change. It’s linear and fragile. What happens when the feature works in staging but breaks in production? You roll back the entire deployment, potentially reverting unrelated fixes. What if you want to release to 5% of users first? You can’t — it’s all or nothing. ...
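The "release to 5% of users first" scenario can be sketched with a percentage-rollout flag that hashes the user ID, so each user gets a stable answer across requests. The flag name and rollout mechanics are illustrative assumptions; real systems typically use a flag service or library.

```python
# Percentage rollout sketch: stable per-user bucketing via hashing.
import hashlib

def is_enabled(flag: str, user_id: str, rollout_percent: int) -> bool:
    """Deterministically place each user in a 0-99 bucket per flag."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < rollout_percent
```

Because the hash is deterministic, a user who sees the new checkout today still sees it tomorrow, and raising `rollout_percent` only ever adds users, never flip-flops them.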

March 2, 2026 · 5 min · 1008 words · Rob Washington

Graceful Degradation: When Things Break, Keep Working

Your dependencies will fail. Database goes down, third-party API times out, cache disappears. The question isn’t whether this happens—it’s whether your users notice. Graceful degradation keeps things working when components fail. The Philosophy Instead of: “Redis is down → Application crashes” Think: “Redis is down → Features using Redis degrade → Core features work”

```python
# Brittle: Cache failure = Application failure
def get_user(user_id):
    cached = redis.get(f"user:{user_id}")  # Throws if Redis down
    if cached:
        return json.loads(cached)
    return db.query("SELECT * FROM users WHERE id = %s", user_id)

# Resilient: Cache failure = Slower, but working
def get_user(user_id):
    try:
        cached = redis.get(f"user:{user_id}")
        if cached:
            return json.loads(cached)
    except RedisError:
        logger.warning("Cache unavailable, falling back to database")
    return db.query("SELECT * FROM users WHERE id = %s", user_id)
```

Timeouts: The First Defense Never wait forever: ...
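The "never wait forever" rule the excerpt introduces can be sketched generically by bounding any call with a deadline and a fallback. This wrapper is an illustrative assumption (most clients expose their own timeout settings, which are preferable when available):

```python
# Deadline sketch: bound a slow call with a timeout and a fallback value.
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

def call_with_timeout(fn, timeout_s, fallback=None):
    """Run fn in a worker thread; give up after timeout_s seconds."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        return fallback  # degrade instead of hanging the request
    finally:
        pool.shutdown(wait=False)
```

A two-second cap with a cached fallback is almost always better than an unbounded wait that ties up a worker thread.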

March 1, 2026 · 5 min · 861 words · Rob Washington

Circuit Breakers: Preventing Cascading Failures in Distributed Systems

When one service in your distributed system starts failing, what happens to everything else? Without proper safeguards, a single sick service can bring down your entire platform. Circuit breakers are the solution. The Cascading Failure Problem Picture this: Your payment service starts timing out because a third-party API is slow. Every request to your checkout service now waits 30 seconds for payments to respond. Your checkout service’s thread pool fills up. Users can’t complete purchases, so they refresh repeatedly. Your load balancer marks checkout as unhealthy. Traffic shifts to fewer instances. Those instances overload. Now your entire e-commerce platform is down — because of one slow API. ...
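The fix the excerpt names can be sketched as a minimal breaker: after N consecutive failures it "opens" and fails fast instead of letting callers queue up behind a dead dependency, then allows a trial call after a cooldown. Thresholds and half-open behavior are simplified here, not the post's implementation.

```python
# Minimal circuit breaker sketch: open after N failures, fail fast while open.
import time

class CircuitBreaker:
    def __init__(self, threshold=3, reset_after=30.0):
        self.threshold = threshold        # consecutive failures before opening
        self.reset_after = reset_after    # seconds before allowing a trial call
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None         # half-open: permit one trial call
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0                 # success resets the count
        return result
```

In the cascading-failure story above, this is what keeps the checkout service's threads free: calls to the sick payment API fail in microseconds rather than waiting 30 seconds each.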

March 2, 2026 · 5 min · 896 words · Rob Washington

LLM API Integration Patterns for Production Applications

Integrating LLMs into production applications is deceptively simple. Call an API, get text back. But building reliable, cost-effective systems requires more thought. Here are patterns that work at scale. The Basic Call Every LLM integration starts here:

```python
import openai

def complete(prompt: str) -> str:
    response = openai.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content
```

This works for prototypes. Production needs more. Retry with Exponential Backoff LLM APIs have rate limits and occasional failures: ...
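The retry-with-exponential-backoff pattern the excerpt introduces can be sketched generically. The broad `except Exception`, delay values, and jitter are placeholder assumptions; real code would catch the provider's specific rate-limit and transient-error exceptions.

```python
# Retry sketch: exponential backoff (0.5s, 1s, 2s, ...) plus jitter.
import random
import time

def with_retries(fn, max_attempts=4, base_delay=0.5):
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            time.sleep(delay)
```

Doubling the delay gives a rate-limited API room to recover, and the random jitter keeps a fleet of workers from retrying in lockstep.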

March 1, 2026 · 5 min · 1002 words · Rob Washington