Why Your Health Check Didn't Catch the Outage

You wake up to angry messages. Your service has been down for hours. You check your monitoring dashboard: all green. What happened? The answer is almost always the same: your health check died with the thing it was checking.

The Problem: Shared Failure Domains

Here's a common setup that looks correct but isn't:

[Diagram: Your Server runs both the Service (port 80) and a Health Check (cron job); both reach the Internet through the same Cloudflare Tunnel.]

The health check runs on the same server, uses the same tunnel, and sends alerts through… the same tunnel. When the tunnel dies, both the service AND the alerting die together. ...
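The fix the post builds toward can be sketched in a few lines: a checker that runs on a different host and only assumes the public path to the service. The URL and the alerting hook here are placeholders for illustration, not from the article.

```python
import urllib.error
import urllib.request

def service_is_up(url: str, timeout: float = 10.0) -> bool:
    # Runs from OUTSIDE the server's failure domain: if the tunnel,
    # the box, or the service dies, this checker still executes.
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, TimeoutError):
        return False

# Alerting must also avoid the shared tunnel, e.g. a third-party pager
# API called from the checking host (hypothetical hook):
# if not service_is_up("https://example.com/healthz"):
#     page_oncall()
```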

March 15, 2026 · 5 min · 1052 words · Rob Washington

Building Cron Jobs That Don't Fail Silently

Cron jobs are the hidden backbone of most systems. They run backups, sync data, send reports, clean up old files. They also fail silently, leaving you wondering why that report hasn't arrived in three weeks. Here's how to build scheduled jobs that actually work.

The Silent Failure Problem

Classic cron:

0 2 * * * /usr/local/bin/backup.sh

What happens when this fails?

- No notification
- No logging (unless you set it up)
- No way to know it didn't run
- You find out when you need that backup and it's not there

Capture Output

At minimum, capture stdout and stderr: ...
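That capture step can be sketched in Python rather than shell: a wrapper that runs the job, appends its output to a log, and surfaces the exit code. The log path is an assumption for illustration.

```python
import subprocess
from datetime import datetime, timezone

def run_logged(cmd: list[str], log_path: str) -> int:
    """Run a scheduled command, append its output to a log, and
    return its exit code so failures are visible to the caller."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    with open(log_path, "a") as log:
        log.write(f"{datetime.now(timezone.utc).isoformat()} "
                  f"cmd={cmd[0]} exit={result.returncode}\n")
        log.write(result.stdout)
        log.write(result.stderr)
    return result.returncode
```

In the crontab, the entry would then invoke this wrapper script instead of `backup.sh` directly, so every run leaves a timestamped record and a nonzero exit code reaches cron's mail or your monitor.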

March 11, 2026 · 5 min · 1044 words · Rob Washington

Health Checks and Readiness Probes: The Difference Matters

Your service is running. Is it healthy? Can it handle requests? These are different questions with different answers. Kubernetes formalized this distinction with liveness and readiness probes. Even if you're not on Kubernetes, the concepts matter everywhere.

The Distinction

Liveness: Is the process alive and not stuck?
If NO → Restart the process
Checks for: deadlocks, infinite loops, crashed but not exited

Readiness: Can this instance handle traffic right now? ...
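The distinction fits in a few lines. A sketch (mine, not the article's code) of the two endpoints a service might expose, with readiness gated on startup work:

```python
import threading

ready = threading.Event()  # set once caches are warm, connections open, etc.

def liveness() -> int:
    # "Is the process alive and not stuck?" Keep this cheap and
    # dependency-free: a dead database should not cause restart loops.
    return 200

def readiness() -> int:
    # "Can this instance handle traffic right now?" Failing this takes
    # the instance out of the load balancer; it does NOT restart it.
    return 200 if ready.is_set() else 503

# After startup work completes:
# ready.set()
```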

March 10, 2026 · 6 min · 1082 words · Rob Washington

Retry Patterns That Actually Work

When something fails, retry it. Simple, right? Not quite. Naive retries can turn a minor hiccup into a cascading failure. Retry too aggressively and you overwhelm the recovering service. Retry the wrong errors and you waste resources on operations that will never succeed. Don't retry at all and you fail on transient issues that would have resolved themselves. Here's how to build retries that help rather than hurt.

What to Retry

Not every error deserves a retry: ...
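As a preview of where the article goes, here is a minimal retry helper (my sketch, not the post's code) combining the two key ideas: only retry transient errors, and back off with jitter.

```python
import random
import time

RETRYABLE = (TimeoutError, ConnectionError)  # transient: worth another try

def with_retries(fn, attempts: int = 4, base: float = 0.5, cap: float = 8.0):
    for attempt in range(attempts):
        try:
            return fn()
        except RETRYABLE:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error
            # Exponential backoff with full jitter, so many clients
            # don't retry in lockstep against a recovering service.
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
    # Anything not in RETRYABLE (bad input, auth failures, 4xx-style
    # errors) propagates immediately: it will never succeed on retry.
```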

March 10, 2026 · 8 min · 1619 words · Rob Washington

Background Job Patterns That Won't Wake You Up at 3 AM

Background jobs are the janitors of your application. They handle the work that doesn't need to happen immediately: sending emails, processing uploads, generating reports, syncing data. When they work, nobody notices. When they fail, everyone notices, usually at 3 AM. Here's how to build jobs that let you sleep.

The Fundamentals: Idempotency First

Every background job should be safe to run twice. Network hiccups, worker crashes, queue retries: your job will execute more than once eventually. ...
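The idempotency idea can be sketched with a processed-key check. Here an in-memory set stands in for what would really be a durable store; the function names are illustrative.

```python
processed: set[str] = set()  # stand-in for a database table or Redis

def send_welcome_email(job_id: str, email: str) -> bool:
    """Returns True only if this invocation actually did the work."""
    if job_id in processed:
        return False  # redelivered job: safe no-op
    # ... actually send the email here ...
    processed.add(job_id)
    return True
```

In a real worker the check-and-record step must be atomic (a unique index, or Redis `SET key NX`), or two workers can pass the check at the same time.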

March 10, 2026 · 6 min · 1257 words · Rob Washington

Graceful Degradation Patterns: When Dependencies Fail

Every production system has dependencies. APIs, databases, caches, third-party services. Each one can fail. The question isn't if they'll fail, but how your system behaves when they do. Graceful degradation means your system continues providing value (reduced, maybe, but value) when dependencies are unavailable. The opposite is cascade failure: one service dies, and everything dies with it. Here are the patterns that make the difference.

The Hierarchy of Degradation

Not all degradation is equal. Design for multiple levels: ...
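One way to picture multiple levels is a fallback chain that trades quality for availability. The sources here are illustrative stubs, not the article's code:

```python
def personalized(user_id):
    raise ConnectionError("recommendation service is down")

def popular_cached(user_id):
    return ["top-seller-1", "top-seller-2"]  # stale but still useful

def static_default(user_id):
    return ["editors-pick"]  # works with zero dependencies

def get_recommendations(user_id):
    # Walk the hierarchy: full value, degraded value, minimal value.
    for source in (personalized, popular_cached, static_default):
        try:
            return source(user_id)
        except Exception:
            continue  # degrade one level instead of failing the page
    return []  # last resort: empty but alive
```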

March 9, 2026 · 6 min · 1208 words · Rob Washington

Self-Healing Agent Sessions: When Your AI Crashes Gracefully

Your AI agent just corrupted its own session history. The conversation context is mangled. Tool results reference calls that don't exist. What now? This happened to me today. Here's how to build resilient agent systems that recover gracefully.

The Problem: Session State Corruption

Long-running AI agents accumulate conversation history. That history includes:

- User messages
- Assistant responses
- Tool calls and their results
- Thinking traces (if using extended thinking)

When context gets truncated mid-conversation, or tool results get orphaned from their calls, you get errors like: ...
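One repair pass can be sketched as a scan that drops tool_result blocks whose originating tool_use id no longer exists. The message shape assumed here follows the Anthropic Messages API; the code is a sketch, not the article's.

```python
def drop_orphaned_tool_results(messages: list[dict]) -> list[dict]:
    seen_tool_ids: set[str] = set()
    repaired = []
    for msg in messages:
        content = msg.get("content")
        if isinstance(content, list):
            kept = []
            for block in content:
                btype = block.get("type")
                if btype == "tool_use":
                    seen_tool_ids.add(block["id"])
                    kept.append(block)
                elif btype == "tool_result":
                    # Keep only results whose originating call survived.
                    if block.get("tool_use_id") in seen_tool_ids:
                        kept.append(block)
                else:
                    kept.append(block)
            if not kept:
                continue  # a message emptied by the repair is dropped whole
            msg = {**msg, "content": kept}
        repaired.append(msg)
    return repaired
```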

March 6, 2026 · 3 min · 428 words · Rob Washington

Building Reliable LLM-Powered Features in Production

Adding an LLM to your application is easy. Making it reliable enough for production is another story. API timeouts, rate limits, hallucinations, and surprise $500 invoices await the unprepared. Here's how to build LLM features that actually work.

The Basics: Robust API Calls

Never call an LLM API without proper error handling:

```python
import anthropic
from tenacity import retry, retry_if_exception, stop_after_attempt, wait_exponential

client = anthropic.Anthropic()

def _is_retryable(exc: BaseException) -> bool:
    # Retry rate limits and 5xx server errors; client errors (4xx)
    # will never succeed, so don't waste attempts on them.
    if isinstance(exc, anthropic.RateLimitError):
        return True
    return isinstance(exc, anthropic.APIStatusError) and exc.status_code >= 500

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=30),
    retry=retry_if_exception(_is_retryable),
    reraise=True,
)
def call_llm(prompt: str, max_tokens: int = 1024) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=max_tokens,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text
```

Timeouts Are Non-Negotiable

LLM calls can hang. Always set timeouts: ...
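The excerpt cuts off at the timeout advice. For reference, the Anthropic SDK accepts a timeout on the client and per request; a minimal sketch, with illustrative values:

```python
import anthropic

# Client-wide ceiling: no call may hang longer than 30 seconds.
client = anthropic.Anthropic(timeout=30.0)

# Tighter budget for a latency-sensitive path, via a derived client.
fast_client = client.with_options(timeout=10.0)
```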

March 4, 2026 · 6 min · 1111 words · Rob Washington

Production-Ready LLM API Integrations: Patterns That Actually Work

You've got your OpenAI or Anthropic API key. The hello-world example works. Now you need to put it in production and suddenly everything is different. LLM APIs have characteristics that break standard integration patterns: high latency, unpredictable response times, token-based billing, and outputs that can vary wildly for the same input. Here's what actually works.

The Unique Challenges

Traditional API calls return in milliseconds. LLM calls can take 5-30 seconds. Traditional APIs have predictable costs per call. LLM costs depend on input and output length, and you don't control the output. ...
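Because cost depends on token counts you don't fully control, per-call accounting helps. A sketch using placeholder prices (look up the current rates for your model):

```python
PRICE_PER_MTOK = {"input": 3.00, "output": 15.00}  # USD per million tokens, hypothetical

def call_cost_usd(input_tokens: int, output_tokens: int) -> float:
    # Both LLM APIs report these counts in the usage block of each response.
    return (input_tokens * PRICE_PER_MTOK["input"] +
            output_tokens * PRICE_PER_MTOK["output"]) / 1_000_000
```

Logging this per request turns the surprise invoice into a dashboard line you can alert on.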

March 3, 2026 · 5 min · 999 words · Rob Washington

Graceful Degradation: When Things Break, Keep Working

Your dependencies will fail. Database goes down, third-party API times out, cache disappears. The question isn't whether this happens; it's whether your users notice. Graceful degradation keeps things working when components fail.

The Philosophy

Instead of: "Redis is down → Application crashes"
Think: "Redis is down → Features using Redis degrade → Core features work"

```python
# Brittle: Cache failure = Application failure
def get_user(user_id):
    cached = redis.get(f"user:{user_id}")  # Throws if Redis down
    if cached:
        return json.loads(cached)
    return db.query("SELECT * FROM users WHERE id = %s", user_id)

# Resilient: Cache failure = Slower, but working
def get_user(user_id):
    try:
        cached = redis.get(f"user:{user_id}")
        if cached:
            return json.loads(cached)
    except RedisError:
        logger.warning("Cache unavailable, falling back to database")
    return db.query("SELECT * FROM users WHERE id = %s", user_id)
```

Timeouts: The First Defense

Never wait forever: ...
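The excerpt ends at timeouts. As a sketch of that first defense (client setup and budgets are my assumptions, not the article's code):

```python
import redis
import requests

# Fail fast instead of hanging the request path on a dead cache.
cache = redis.Redis(
    socket_connect_timeout=0.5,  # seconds to establish the connection
    socket_timeout=0.5,          # seconds for any single command
)

def fetch_report():
    # requests takes (connect, read) timeouts; without them a hung
    # dependency can hold this worker indefinitely.
    return requests.get("https://api.example.com/report", timeout=(2, 5))
```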

March 1, 2026 · 5 min · 861 words · Rob Washington