Monitoring

Why Your Health Check Didn't Catch the Outage

You wake up to angry messages. Your service has been down for hours. You check your monitoring dashboard: all green. What happened? The answer is almost always the same: your health check died with the thing it was checking.

The Problem: Shared Failure Domains

Here’s a common setup that looks correct but isn’t:

[Diagram: "Your Server" box with "Service (port 80)" and "Health Check (cron job)", connected via "Tunnel (cloudflared)" to "Internet"]

The health check runs on the same server, uses the same tunnel, and sends alerts through… the same tunnel. When the tunnel dies, both the service AND the alerting die together. ...

Observability Without Noise: Monitoring That Actually Helps

Most monitoring systems fail the same way: they’re either too noisy (you ignore them) or too quiet (you miss real problems). The goal isn’t more data; it’s better signal.

The Alert Fatigue Problem

I run infrastructure health checks every few hours. Here’s what I learned: the moment you start ignoring alerts, your monitoring is broken. Doesn’t matter how comprehensive it is. The failure mode isn’t technical. It’s human psychology. After the third false alarm at 3 AM, your brain learns to dismiss the notification sound. Real problems slip through because they look like everything else. ...

Monitoring Dashboards: Visualize What Actually Matters

Most monitoring dashboards are useless. Walls of graphs nobody looks at until something breaks, and then nobody knows which graph matters. Here’s how to build dashboards that actually help.

The Dashboard Hierarchy

    Level 1: Executive Overview (1 dashboard)
             "Is everything OK?"
                  ↓
    Level 2: Service Health (per service)
             "What's broken?"
                  ↓
    Level 3: Deep Dive (per component)
             "Why is it broken?"

Start at level 1, drill down when needed. ...

Observability Beyond Logs: Building Systems That Tell You What's Wrong

You’ve got logging. Great. Now your system is down and you’re grep’ing through 50GB of text trying to figure out why. Sound familiar?

Observability isn’t about collecting more data. It’s about collecting the right data and making it queryable. The goal: any engineer should be able to answer arbitrary questions about system behavior without deploying new code.

The Three Pillars (And Why They’re Not Enough)

You’ve heard this: logs, metrics, traces. The “three pillars of observability.” It’s a useful framework, but it misses something crucial: correlation. ...

Infrastructure Health Checks That Actually Work

“Is everything working?” sounds like a simple question. It’s not. Here’s how to build health checks that give you real answers.

What Health Checks Are For

Health checks answer one question: Can this thing do its job right now? Not “is it running?” (that’s process monitoring). Not “did it work yesterday?” (that’s metrics). Not “will it work tomorrow?” (that’s capacity planning). Just: right now, can it serve traffic?

The Levels of Health

Level 1: Process Running

Bare minimum: is the process alive? ...
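To make the gap between "is it running?" and "can it serve traffic?" concrete, here is a minimal Python sketch. It is not taken from the post: the function names are hypothetical, the PID check assumes a POSIX system, and a successful TCP connection is only a rough stand-in for "can serve traffic".

```python
import os
import socket


def is_process_alive(pid: int) -> bool:
    """Level 1: does a process with this PID exist? (POSIX only.)"""
    try:
        os.kill(pid, 0)  # signal 0 delivers nothing; it only checks the PID
    except ProcessLookupError:
        return False  # no such process
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def can_accept_connections(host: str, port: int, timeout: float = 2.0) -> bool:
    """A stricter check: does something accept TCP connections on host:port?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A process can pass the first check and fail the second, which is why a bare "is it alive?" check rarely answers the question that matters.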

Building Cron Jobs That Don't Fail Silently

Cron jobs are the hidden backbone of most systems. They run backups, sync data, send reports, clean up old files. They also fail silently, leaving you wondering why that report hasn’t arrived in three weeks. Here’s how to build scheduled jobs that actually work.

The Silent Failure Problem

Classic cron:

    0 2 * * * /usr/local/bin/backup.sh

What happens when this fails?

- No notification
- No logging (unless you set it up)
- No way to know it didn’t run
- You find out when you need that backup and it’s not there

Capture Output

At minimum, capture stdout and stderr: ...
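One way to capture a job's output is a small wrapper that runs the command and appends its stdout, stderr, and exit code to a log file. The Python sketch below is an illustration under assumptions, not the post's own approach: `run_and_log` and the log path are hypothetical names.

```python
import subprocess
from datetime import datetime, timezone


def run_and_log(command: list[str], log_path: str) -> int:
    """Run a command, append its output and exit code to log_path, return the exit code."""
    started = datetime.now(timezone.utc).isoformat()
    completed = subprocess.run(command, capture_output=True, text=True)
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(f"[{started}] {' '.join(command)} exited {completed.returncode}\n")
        log.write(completed.stdout)  # everything the job printed to stdout
        log.write(completed.stderr)  # everything the job printed to stderr
    return completed.returncode
```

Calling `run_and_log(["/usr/local/bin/backup.sh"], "/tmp/backup.log")` from the scheduled script (the log path here is an assumption) leaves a record of every run, including the failed ones.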

Monitoring That Actually Helps

Most monitoring dashboards are useless. Hundreds of metrics, dozens of graphs, all green, until something breaks and you’re scrambling through charts trying to find the one that matters. Good monitoring isn’t about collecting everything. It’s about knowing what to look at when things go wrong.

The Three Pillars

Observability has three pillars: metrics, logs, and traces. Each answers different questions.

Metrics: What is happening? (Aggregated numbers over time)

- Request rate, error rate, latency
- CPU, memory, disk usage
- Queue depth, connection count

Logs: Why did it happen? (Detailed event records) ...

Health Checks and Readiness Probes: The Difference Matters

Your service is running. Is it healthy? Can it handle requests? These are different questions with different answers. Kubernetes formalized this distinction with liveness and readiness probes. Even if you’re not on Kubernetes, the concepts matter everywhere.

The Distinction

Liveness: Is the process alive and not stuck?

- If NO → Restart the process
- Checks for: deadlocks, infinite loops, crashed but not exited

Readiness: Can this instance handle traffic right now? ...
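The liveness/readiness split can be sketched with two separate endpoints. This Python example uses only the standard library. The paths `/livez` and `/readyz`, the `AppState` class, and its `dependencies_ready` flag are assumptions chosen for illustration, not something the post or Kubernetes mandates.

```python
from dataclasses import dataclass
from http.server import BaseHTTPRequestHandler, HTTPServer


@dataclass
class AppState:
    # Hypothetical flag: set to True once the instance can do real work
    dependencies_ready: bool = False


STATE = AppState()


def probe_status(path: str, state: AppState) -> int:
    """Map a probe path to an HTTP status code."""
    if path == "/livez":
        return 200  # being able to answer at all is the liveness signal
    if path == "/readyz":
        return 200 if state.dependencies_ready else 503
    return 404


class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(probe_status(self.path, STATE))
        self.end_headers()


def serve(port: int = 8080) -> None:
    """Serve the two probe endpoints until interrupted."""
    HTTPServer(("127.0.0.1", port), ProbeHandler).serve_forever()
```

An instance that is alive but not ready answers 200 on `/livez` and 503 on `/readyz`: it should be kept out of rotation, not restarted.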

Observability vs Monitoring: The Distinction That Actually Matters

Monitoring and observability get used interchangeably. They shouldn’t. The distinction isn’t pedantic: it determines whether you can debug problems you’ve never seen before.

Monitoring answers: “Is the thing I expected to break, broken?”

Observability answers: “What is happening, even if I didn’t anticipate it?”

One is verification. The other is exploration.

The Dashboard Trap

Most teams start with dashboards. CPU usage, memory, request latency, error rates. Green means good, red means bad. ...

Automated Health Checks for Home Infrastructure

Your homelab is running smoothly, until it isn’t. Services crash at 3 AM, tunnels drop silently, containers exit with code 255. You wake up to discover your dashboard has been down for two days. The fix isn’t more monitoring dashboards. It’s automated health checks that fix what they can and only wake you when they can’t.

The Philosophy: Fix First, Alert Second

Most monitoring systems are built around one idea: detect problems and notify humans. But for home infrastructure, this creates alert fatigue. Every transient failure becomes a notification. ...
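The "fix first, alert second" idea can be sketched as a small control loop: check, attempt a fix, re-check, and notify a human only if the fix did not work. In this Python sketch the function name `check_fix_alert` and its `check`, `fix`, and `alert` callables are hypothetical placeholders for whatever a given service needs.

```python
from typing import Callable


def check_fix_alert(
    name: str,
    check: Callable[[], bool],
    fix: Callable[[], None],
    alert: Callable[[str], None],
    max_fix_attempts: int = 1,
) -> str:
    """Return "healthy", "fixed", or "alerted" for one service."""
    if check():
        return "healthy"
    for _ in range(max_fix_attempts):
        fix()  # e.g. restart a container or a tunnel
        if check():
            return "fixed"  # recovered without waking anyone
    alert(f"{name} is still failing after {max_fix_attempts} fix attempt(s)")
    return "alerted"
```

With this shape, a transient failure that a restart resolves never becomes a notification. Only the failures the automation could not repair reach a human.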