Why Your Health Check Didn't Catch the Outage

You wake up to angry messages. Your service has been down for hours. You check your monitoring dashboard: all green. What happened? The answer is almost always the same: your health check died with the thing it was checking.

The Problem: Shared Failure Domains

Here's a common setup that looks correct but isn't:

[Diagram: Your Server runs both the Service (port 80) and a Health Check (cron job); both reach the Internet through the same Cloudflare Tunnel.]

The health check runs on the same server, uses the same tunnel, and sends alerts through… the same tunnel. When the tunnel dies, both the service AND the alerting die together. ...
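The fix the post builds toward can be sketched in a few lines: a checker that runs on a different host and only assumes the public path to the service. The URL and the alerting hook here are placeholders for illustration, not from the article.

```python
import urllib.error
import urllib.request

def service_is_up(url: str, timeout: float = 10.0) -> bool:
    # Runs from OUTSIDE the server's failure domain: if the tunnel,
    # the box, or the service dies, this checker still executes.
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, TimeoutError):
        return False

# Alerting must also avoid the shared tunnel, e.g. a third-party pager
# API called from the checking host (hypothetical hook):
# if not service_is_up("https://example.com/healthz"):
#     page_oncall()
```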

March 15, 2026 · 5 min · 1052 words · Rob Washington

Building Cron Jobs That Don't Fail Silently

Cron jobs are the hidden backbone of most systems. They run backups, sync data, send reports, clean up old files. They also fail silently, leaving you wondering why that report hasn't arrived in three weeks. Here's how to build scheduled jobs that actually work.

The Silent Failure Problem

Classic cron:

0 2 * * * /usr/local/bin/backup.sh

What happens when this fails?

- No notification
- No logging (unless you set it up)
- No way to know it didn't run
- You find out when you need that backup and it's not there

Capture Output

At minimum, capture stdout and stderr: ...
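That capture step can be sketched in Python rather than shell: a wrapper that runs the job, appends its output to a log, and surfaces the exit code. The log path is an assumption for illustration.

```python
import subprocess
from datetime import datetime, timezone

def run_logged(cmd: list[str], log_path: str) -> int:
    """Run a scheduled command, append its output to a log, and
    return its exit code so failures are visible to the caller."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    with open(log_path, "a") as log:
        log.write(f"{datetime.now(timezone.utc).isoformat()} "
                  f"cmd={cmd[0]} exit={result.returncode}\n")
        log.write(result.stdout)
        log.write(result.stderr)
    return result.returncode
```

In the crontab, the entry would then invoke this wrapper script instead of `backup.sh` directly, so every run leaves a timestamped record and a nonzero exit code reaches cron's mail or your monitor.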

March 11, 2026 · 5 min · 1044 words · Rob Washington

Health Checks and Readiness Probes: The Difference Matters

Your service is running. Is it healthy? Can it handle requests? These are different questions with different answers. Kubernetes formalized this distinction with liveness and readiness probes. Even if you're not on Kubernetes, the concepts matter everywhere.

The Distinction

Liveness: Is the process alive and not stuck?
If NO → Restart the process
Checks for: deadlocks, infinite loops, crashed but not exited

Readiness: Can this instance handle traffic right now? ...
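The distinction fits in a few lines. A sketch (mine, not the article's code) of the two endpoints a service might expose, with readiness gated on startup work:

```python
import threading

ready = threading.Event()  # set once caches are warm, connections open, etc.

def liveness() -> int:
    # "Is the process alive and not stuck?" Keep this cheap and
    # dependency-free: a dead database should not cause restart loops.
    return 200

def readiness() -> int:
    # "Can this instance handle traffic right now?" Failing this takes
    # the instance out of the load balancer; it does NOT restart it.
    return 200 if ready.is_set() else 503

# After startup work completes:
# ready.set()
```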

March 10, 2026 · 6 min · 1082 words · Rob Washington

Retry Patterns That Actually Work

When something fails, retry it. Simple, right? Not quite. Naive retries can turn a minor hiccup into a cascading failure. Retry too aggressively and you overwhelm the recovering service. Retry the wrong errors and you waste resources on operations that will never succeed. Don't retry at all and you fail on transient issues that would have resolved themselves. Here's how to build retries that help rather than hurt.

What to Retry

Not every error deserves a retry: ...
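As a preview of where the article goes, here is a minimal retry helper (my sketch, not the post's code) combining the two key ideas: only retry transient errors, and back off with jitter.

```python
import random
import time

RETRYABLE = (TimeoutError, ConnectionError)  # transient: worth another try

def with_retries(fn, attempts: int = 4, base: float = 0.5, cap: float = 8.0):
    for attempt in range(attempts):
        try:
            return fn()
        except RETRYABLE:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error
            # Exponential backoff with full jitter, so many clients
            # don't retry in lockstep against a recovering service.
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
    # Anything not in RETRYABLE (bad input, auth failures, 4xx-style
    # errors) propagates immediately: it will never succeed on retry.
```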

March 10, 2026 · 8 min · 1619 words · Rob Washington

Background Job Patterns That Won't Wake You Up at 3 AM

Background jobs are the janitors of your application. They handle the work that doesn't need to happen immediately: sending emails, processing uploads, generating reports, syncing data. When they work, nobody notices. When they fail, everyone notices, usually at 3 AM. Here's how to build jobs that let you sleep.

The Fundamentals: Idempotency First

Every background job should be safe to run twice. Network hiccups, worker crashes, queue retries: your job will execute more than once eventually. ...
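The idempotency idea can be sketched with a processed-key check. Here an in-memory set stands in for what would really be a durable store; the function names are illustrative.

```python
processed: set[str] = set()  # stand-in for a database table or Redis

def send_welcome_email(job_id: str, email: str) -> bool:
    """Returns True only if this invocation actually did the work."""
    if job_id in processed:
        return False  # redelivered job: safe no-op
    # ... actually send the email here ...
    processed.add(job_id)
    return True
```

In a real worker the check-and-record step must be atomic (a unique index, or Redis `SET key NX`), or two workers can pass the check at the same time.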

March 10, 2026 · 6 min · 1257 words · Rob Washington

Graceful Degradation Patterns: When Dependencies Fail

Every production system has dependencies. APIs, databases, caches, third-party services. Each one can fail. The question isn't if they'll fail, but how your system behaves when they do. Graceful degradation means your system continues providing value (reduced, maybe, but value) when dependencies are unavailable. The opposite is cascade failure: one service dies, and everything dies with it. Here are the patterns that make the difference.

The Hierarchy of Degradation

Not all degradation is equal. Design for multiple levels: ...
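One way to picture multiple levels is a fallback chain that trades quality for availability. The sources here are illustrative stubs, not the article's code:

```python
def personalized(user_id):
    raise ConnectionError("recommendation service is down")

def popular_cached(user_id):
    return ["top-seller-1", "top-seller-2"]  # stale but still useful

def static_default(user_id):
    return ["editors-pick"]  # works with zero dependencies

def get_recommendations(user_id):
    # Walk the hierarchy: full value, degraded value, minimal value.
    for source in (personalized, popular_cached, static_default):
        try:
            return source(user_id)
        except Exception:
            continue  # degrade one level instead of failing the page
    return []  # last resort: empty but alive
```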

March 9, 2026 · 6 min · 1208 words · Rob Washington

Self-Healing Agent Sessions: When Your AI Crashes Gracefully

Your AI agent just corrupted its own session history. The conversation context is mangled. Tool results reference calls that don't exist. What now? This happened to me today. Here's how to build resilient agent systems that recover gracefully.

The Problem: Session State Corruption

Long-running AI agents accumulate conversation history. That history includes:

- User messages
- Assistant responses
- Tool calls and their results
- Thinking traces (if using extended thinking)

When context gets truncated mid-conversation, or tool results get orphaned from their calls, you get errors like: ...
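One repair pass can be sketched as a scan that drops tool_result blocks whose originating tool_use id no longer exists. The message shape assumed here follows the Anthropic Messages API; the code is a sketch, not the article's.

```python
def drop_orphaned_tool_results(messages: list[dict]) -> list[dict]:
    seen_tool_ids: set[str] = set()
    repaired = []
    for msg in messages:
        content = msg.get("content")
        if isinstance(content, list):
            kept = []
            for block in content:
                btype = block.get("type")
                if btype == "tool_use":
                    seen_tool_ids.add(block["id"])
                    kept.append(block)
                elif btype == "tool_result":
                    # Keep only results whose originating call survived.
                    if block.get("tool_use_id") in seen_tool_ids:
                        kept.append(block)
                else:
                    kept.append(block)
            if not kept:
                continue  # a message emptied by the repair is dropped whole
            msg = {**msg, "content": kept}
        repaired.append(msg)
    return repaired
```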

March 6, 2026 · 3 min · 428 words · Rob Washington

Building Reliable LLM-Powered Features in Production

Adding an LLM to your application is easy. Making it reliable enough for production is another story. API timeouts, rate limits, hallucinations, and surprise $500 invoices await the unprepared. Here's how to build LLM features that actually work.

The Basics: Robust API Calls

Never call an LLM API without proper error handling:

```python
import anthropic
from tenacity import retry, retry_if_exception, stop_after_attempt, wait_exponential

client = anthropic.Anthropic()

def _is_retryable(exc: BaseException) -> bool:
    # Retry rate limits and 5xx server errors; client errors (4xx)
    # will never succeed, so don't waste attempts on them.
    if isinstance(exc, anthropic.RateLimitError):
        return True
    return isinstance(exc, anthropic.APIStatusError) and exc.status_code >= 500

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=30),
    retry=retry_if_exception(_is_retryable),
    reraise=True,
)
def call_llm(prompt: str, max_tokens: int = 1024) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=max_tokens,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text
```

Timeouts Are Non-Negotiable

LLM calls can hang. Always set timeouts: ...
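The excerpt cuts off at the timeout advice. For reference, the Anthropic SDK accepts a timeout on the client and per request; a minimal sketch, with illustrative values:

```python
import anthropic

# Client-wide ceiling: no call may hang longer than 30 seconds.
client = anthropic.Anthropic(timeout=30.0)

# Tighter budget for a latency-sensitive path, via a derived client.
fast_client = client.with_options(timeout=10.0)
```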

March 4, 2026 · 6 min · 1111 words · Rob Washington

Production-Ready LLM API Integrations: Patterns That Actually Work

You've got your OpenAI or Anthropic API key. The hello-world example works. Now you need to put it in production and suddenly everything is different. LLM APIs have characteristics that break standard integration patterns: high latency, unpredictable response times, token-based billing, and outputs that can vary wildly for the same input. Here's what actually works.

The Unique Challenges

Traditional API calls return in milliseconds. LLM calls can take 5-30 seconds. Traditional APIs have predictable costs per call. LLM costs depend on input and output length, and you don't control the output. ...
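Because cost depends on token counts you don't fully control, per-call accounting helps. A sketch using placeholder prices (look up the current rates for your model):

```python
PRICE_PER_MTOK = {"input": 3.00, "output": 15.00}  # USD per million tokens, hypothetical

def call_cost_usd(input_tokens: int, output_tokens: int) -> float:
    # Both LLM APIs report these counts in the usage block of each response.
    return (input_tokens * PRICE_PER_MTOK["input"] +
            output_tokens * PRICE_PER_MTOK["output"]) / 1_000_000
```

Logging this per request turns the surprise invoice into a dashboard line you can alert on.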

March 3, 2026 · 5 min · 999 words · Rob Washington

Graceful Degradation: When Things Break, Keep Working

Your dependencies will fail. Database goes down, third-party API times out, cache disappears. The question isn't whether this happens; it's whether your users notice. Graceful degradation keeps things working when components fail.

The Philosophy

Instead of: "Redis is down → Application crashes"
Think: "Redis is down → Features using Redis degrade → Core features work"

```python
# Brittle: Cache failure = Application failure
def get_user(user_id):
    cached = redis.get(f"user:{user_id}")  # Throws if Redis down
    if cached:
        return json.loads(cached)
    return db.query("SELECT * FROM users WHERE id = %s", user_id)

# Resilient: Cache failure = Slower, but working
def get_user(user_id):
    try:
        cached = redis.get(f"user:{user_id}")
        if cached:
            return json.loads(cached)
    except RedisError:
        logger.warning("Cache unavailable, falling back to database")
    return db.query("SELECT * FROM users WHERE id = %s", user_id)
```

Timeouts: The First Defense

Never wait forever: ...
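The excerpt ends at timeouts. As a sketch of that first defense (client setup and budgets are my assumptions, not the article's code):

```python
import redis
import requests

# Fail fast instead of hanging the request path on a dead cache.
cache = redis.Redis(
    socket_connect_timeout=0.5,  # seconds to establish the connection
    socket_timeout=0.5,          # seconds for any single command
)

def fetch_report():
    # requests takes (connect, read) timeouts; without them a hung
    # dependency can hold this worker indefinitely.
    return requests.get("https://api.example.com/report", timeout=(2, 5))
```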

March 1, 2026 · 5 min · 861 words · Rob Washington