Devops

Backup Strategies That Actually Save You

Everyone knows backups are important. Few actually test them. Here’s how to build backup systems that work when you need them. The 3-2-1 Rule The classic foundation: 3 copies of your data 2 different storage types 1 offsite copy Example implementation: P C C C r o o o i p p p m y y y a r 1 2 3 y : : : : L R C P o e l r c m o o a o u d l t d u e c s b t n r a i a e c o p p k n s l u h i p d o c a t a ( t S a ( ( 3 b s d , a a i s m f d e e f i e f s r f e e e r n r v t e e n r d t , a t r d a e i g f c i f e o e n n r t ) e e n r t ) d i s k ) What to Back Up Always Back Up Databases — This is your business Configuration — Harder to recreate than you think Secrets — Encrypted, but backed up User uploads — Can’t regenerate these Maybe Back Up Application code — If not in Git, back it up Logs — For compliance, ship to log aggregator instead Build artifacts — Rebuild from source is often better Don’t Back Up Ephemeral data — Caches, temp files, sessions Derived data — Can regenerate from source Large static assets — Use CDN/object storage with its own durability Database Backups PostgreSQL 1 2 3 4 5 6 7 8 # Logical backup (SQL dump) pg_dump -Fc mydb > backup.dump # Restore pg_restore -d mydb backup.dump # All databases pg_dumpall > all_databases.sql For larger databases, use physical backups: ...

Kubernetes Debugging: A Practical Field Guide

Your pod won’t start. The service isn’t routing. Something’s wrong but kubectl isn’t telling you what. Here’s how to actually debug Kubernetes problems. The Debugging Hierarchy Work from the outside in: Cluster level — Is the cluster healthy? Node level — Are nodes ready? Pod level — Is the pod running? Container level — Is the container healthy? Application level — Is the app working? Most problems are at levels 3-5. Start there. ...

Documentation That Actually Gets Read

You wrote the docs. Nobody reads them. The same questions keep coming. Here’s how to write documentation people actually use. Why Most Docs Fail Documentation fails for predictable reasons: Wrong location — Docs in a wiki nobody checks Wrong time — Written once, never updated Wrong audience — Too technical or not technical enough Wrong format — Walls of text when a diagram would work Wrong content — Explains what, not why or how Fix these and docs become useful. ...

Infrastructure Health Checks That Actually Work

“Is everything working?” sounds like a simple question. It’s not. Here’s how to build health checks that give you real answers. What Health Checks Are For Health checks answer one question: Can this thing do its job right now? Not “is it running?” (that’s process monitoring). Not “did it work yesterday?” (that’s metrics). Not “will it work tomorrow?” (that’s capacity planning). Just: right now, can it serve traffic? The Levels of Health Level 1: Process Running Bare minimum — is the process alive? ...

On-Call That Doesn't Burn You Out

On-call is necessary. Burnout isn’t. Here’s how to build an on-call rotation that actually works for humans. The Problems With Most On-Call Alerts fire constantly, most aren’t actionable One person carries the pager for a week straight No runbooks, every incident is a mystery Escalation means “figure it out yourself” No compensation or recognition This leads to burnout, turnover, and worse incidents. Fix Your Alerts First The #1 on-call killer: alert fatigue. ...

Testing Strategies That Actually Scale

Everyone agrees testing is important. Few teams do it well at scale. Here’s what separates test suites that help from ones that slow everyone down. The Testing Pyramid (And Why It’s Still Right) E I U 2 n n E t i e t ( g f r ( e a m w t a ) i n o y n ) ( s o m e ) This isn’t new, but teams still get it backwards: ...

Building Cron Jobs That Don't Fail Silently

Cron jobs are the hidden backbone of most systems. They run backups, sync data, send reports, clean up old files. They also fail silently, leaving you wondering why that report hasn’t arrived in three weeks. Here’s how to build scheduled jobs that actually work. The Silent Failure Problem Classic cron: 1 0 2 * * * /usr/local/bin/backup.sh What happens when this fails? No notification No logging (unless you set it up) No way to know it didn’t run You find out when you need that backup and it’s not there Capture Output At minimum, capture stdout and stderr: ...

Database Migrations in Production Without Downtime

Schema changes are scary. One bad migration can take down your entire application. Here’s how to evolve your database safely, even with users actively hitting it. The Core Problem You need to rename a column. Sounds simple: 1 ALTER TABLE users RENAME COLUMN name TO full_name; But your application is running. The moment this executes: Old code looking for name breaks New code looking for full_name also breaks (until deploy finishes) You have a window of guaranteed failures This is why database migrations need careful choreography. ...

Feature Flags for Safe Deployments

The most dangerous word in software is “deploy.” It carries the weight of “I hope nothing breaks” even when you’ve done everything right. Feature flags change that equation entirely. Deployment vs Release Most teams conflate two distinct concepts: Deployment: Code reaches production servers Release: Users see new functionality Feature flags separate these. You can deploy code on Monday and release features on Thursday. You can deploy to everyone but release to 5% of users. You can deploy globally but release only to internal testers. ...

Terraform Patterns That Scale

Terraform’s getting-started guide shows a single main.tf with everything in it. That works for demos. It doesn’t work when you have 50 resources, 5 environments, and a team making changes simultaneously. These patterns emerge from scaling Terraform across teams and environments—where state conflicts happen, where modules get copied instead of shared, and where “just run terraform apply” becomes terrifying. Project Structure The flat-file approach breaks down fast. Structure by environment and component: ...