Posts

Caching Patterns That Actually Work

Caching seems simple. Add Redis, cache everything, go fast. Then you get stale data, cache stampedes, and bugs that only happen in production. Here’s how to cache correctly. When to Cache. Good candidates: expensive database queries (aggregations, joins); external API responses; computed values that don’t change often; static content (templates, configs); session data. Bad candidates: data that changes frequently; user-specific data with many variations; security-sensitive data; data where staleness causes real problems. Rule of thumb: cache when read frequency >> write frequency. ...
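To illustrate the read-heavy case in the rule of thumb, here is a minimal cache-aside sketch with a time-to-live. It is an assumption-laden illustration, not the post's implementation: it uses an in-process dict instead of Redis, and the names `TTLCache`, `get_or_compute`, and `ttl_seconds` are hypothetical.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """In-memory cache-aside helper (hypothetical); entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._store: Dict[str, Tuple[float, Any]] = {}
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, key: str, compute: Callable[[], Any]) -> Any:
        """Return the cached value if still fresh; otherwise compute and store it."""
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            self.hits += 1
            return entry[1]
        self.misses += 1
        value = compute()  # the expensive read, e.g. an aggregation query
        self._store[key] = (now + self.ttl_seconds, value)
        return value

    def invalidate(self, key: str) -> None:
        """Drop a key on write so readers do not see stale data."""
        self._store.pop(key, None)
```

The `hits` and `misses` counters give a rough hit ratio. If misses dominate, the data is probably changing too often to be worth caching.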

Message Queues: When and How to Use Them

Your API is slow because it’s doing too much synchronously. Here’s when to reach for a message queue, and how to implement it without overcomplicating everything. When You Need a Queue. Signs you need async processing: API response time is dominated by side effects (emails, webhooks, analytics); downstream service failures cascade to user-facing errors; traffic spikes overwhelm dependent services; you need to retry failed operations automatically; work needs to happen on a schedule or with a delay. Signs you don’t: ...

Documentation That Actually Gets Read

You wrote the docs. Nobody reads them. The same questions keep coming. Here’s how to write documentation people actually use. Why Most Docs Fail. Documentation fails for predictable reasons: wrong location (docs in a wiki nobody checks); wrong time (written once, never updated); wrong audience (too technical or not technical enough); wrong format (walls of text when a diagram would work); wrong content (explains what, not why or how). Fix these and docs become useful. ...

Infrastructure Health Checks That Actually Work

“Is everything working?” sounds like a simple question. It’s not. Here’s how to build health checks that give you real answers. What Health Checks Are For. Health checks answer one question: can this thing do its job right now? Not “is it running?” (that’s process monitoring). Not “did it work yesterday?” (that’s metrics). Not “will it work tomorrow?” (that’s capacity planning). Just: right now, can it serve traffic? The Levels of Health. Level 1: Process Running. Bare minimum: is the process alive? ...
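The difference between “is it running?” and “can it serve traffic right now?” can be sketched as a function that runs a probe for each dependency and reports unhealthy if any probe fails. This is a minimal sketch under assumptions: `run_health_checks` and the probe names are hypothetical, and real probes would ping a database, cache, or downstream service.

```python
from typing import Callable, Dict


def run_health_checks(checks: Dict[str, Callable[[], bool]]) -> dict:
    """Run each named probe; any failure or exception marks the service unhealthy."""
    results = {}
    for name, probe in checks.items():
        try:
            results[name] = "ok" if probe() else "fail"
        except Exception as exc:  # a crashing probe counts as a failed dependency
            results[name] = f"error: {exc}"
    healthy = all(status == "ok" for status in results.values())
    return {"status": "healthy" if healthy else "unhealthy", "checks": results}
```

Returning the per-check results alongside the overall status lets whoever reads the endpoint see which dependency failed, not only that something did.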

On-Call That Doesn't Burn You Out

On-call is necessary. Burnout isn’t. Here’s how to build an on-call rotation that actually works for humans. The Problems With Most On-Call. Alerts fire constantly, and most aren’t actionable; one person carries the pager for a week straight; there are no runbooks, so every incident is a mystery; escalation means “figure it out yourself”; there is no compensation or recognition. This leads to burnout, turnover, and worse incidents. Fix Your Alerts First. The #1 on-call killer: alert fatigue. ...

Testing Strategies That Actually Scale

Everyone agrees testing is important. Few teams do it well at scale. Here’s what separates test suites that help from ones that slow everyone down. The Testing Pyramid (And Why It’s Still Right). [Pyramid diagram, top to bottom: E2E (few), Integration (some), Unit (many)] This isn’t new, but teams still get it backwards: ...

Practical LLM Integration Patterns

You want to add LLM capabilities to your application. Not to build a chatbot, but to actually integrate AI into your product. Here are the patterns that work. The Naive Approach (And Why It Fails):

    import openai

    def process_user_input(text):
        response = openai.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": text}]
        )
        return response.choices[0].message.content

Problems: no error handling, no rate limiting, no caching, no fallbacks, no cost control, and vulnerability to prompt injection. Let’s fix each one. Pattern 1: The Robust Client. Wrap your LLM calls in a proper client: ...
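The post's own client is cut off in this excerpt, so what follows is not it. As one possible illustration of the missing error handling and fallbacks, here is a minimal sketch of retry with exponential backoff around any text-in, text-out callable. The names `call_with_retry`, `base_delay`, and `fallback` are hypothetical, and the LLM call is injected so that no particular SDK is assumed.

```python
import time
from typing import Callable, Optional


def call_with_retry(
    call: Callable[[str], str],
    text: str,
    max_attempts: int = 3,
    base_delay: float = 0.5,
    fallback: Optional[str] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call `call(text)`, retrying with exponential backoff; use fallback if all fail."""
    last_error: Optional[Exception] = None
    for attempt in range(max_attempts):
        try:
            return call(text)
        except Exception as exc:  # a real client would catch narrower error types
            last_error = exc
            if attempt < max_attempts - 1:
                sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, ...
    if fallback is not None:
        return fallback  # degrade gracefully instead of surfacing the error
    raise RuntimeError(f"LLM call failed after {max_attempts} attempts") from last_error
```

A wrapper like this could take `process_user_input` as its `call` argument. Rate limiting, caching, and cost control would be separate layers on top.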

Building Cron Jobs That Don't Fail Silently

Cron jobs are the hidden backbone of most systems. They run backups, sync data, send reports, clean up old files. They also fail silently, leaving you wondering why that report hasn’t arrived in three weeks. Here’s how to build scheduled jobs that actually work. The Silent Failure Problem. Classic cron:

    0 2 * * * /usr/local/bin/backup.sh

What happens when this fails? No notification. No logging (unless you set it up). No way to know it didn’t run. You find out when you need that backup and it’s not there. Capture Output. At minimum, capture stdout and stderr: ...
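The excerpt stops before showing how the post captures output, so this is not the author's method. As one illustration of the idea, here is a minimal Python wrapper that runs a job, captures stdout and stderr, and turns a failure into a visible result. The names `run_job` and `report` are hypothetical.

```python
import subprocess
import sys
from typing import List


def run_job(command: List[str], timeout_seconds: float = 3600) -> dict:
    """Run a scheduled command and capture its exit code, stdout, and stderr."""
    try:
        completed = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout_seconds
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        # The command could not start (e.g. missing binary) or ran too long.
        return {"ok": False, "returncode": None, "stdout": "", "stderr": str(exc)}
    return {
        "ok": completed.returncode == 0,
        "returncode": completed.returncode,
        "stdout": completed.stdout,
        "stderr": completed.stderr,
    }


def report(result: dict) -> int:
    """Print a failure summary to stderr and return an exit code for the scheduler."""
    if not result["ok"]:
        print(f"job failed (code {result['returncode']}): {result['stderr']}", file=sys.stderr)
        return 1
    return 0
```

A wrapper script that cron invokes could call `report(run_job([...]))` and exit with the returned code, so a failure leaves a trace even when the job itself logs nothing.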

Database Migrations in Production Without Downtime

Schema changes are scary. One bad migration can take down your entire application. Here’s how to evolve your database safely, even with users actively hitting it. The Core Problem. You need to rename a column. Sounds simple:

    ALTER TABLE users RENAME COLUMN name TO full_name;

But your application is running. The moment this executes, old code looking for name breaks, new code looking for full_name also breaks (until the deploy finishes), and you have a window of guaranteed failures. This is why database migrations need careful choreography. ...

API Versioning Without the Pain

You shipped v1 of your API. Users integrated it. Now you need breaking changes. How do you evolve without breaking everyone? API versioning seems simple until you actually do it. Here’s what works, what doesn’t, and how to pick the right strategy. The Core Problem. APIs are contracts. When you change the response format, rename fields, or alter behavior, you break that contract. Clients built against v1 stop working when you ship v2. ...