Why Your Health Check Didn't Catch the Outage

You wake up to angry messages. Your service has been down for hours. You check your monitoring dashboard: all green. What happened?

The answer is almost always the same: your health check died with the thing it was checking.

The Problem: Shared Failure Domains

Here’s a common setup that looks correct but isn’t:

[Diagram: Your Server, containing the Service (port 80), the Tunnel (localtunnel, cloudflared), and the Health Check (cron job); the Tunnel connects to the Internet.]

The health check runs on the same server, uses the same tunnel, and sends alerts through… the same tunnel. When the tunnel dies, both the service AND the alerting die together. ...
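One way out of a shared failure domain is to invert the check: the server sends a periodic heartbeat outward, and a monitor running somewhere else raises the alarm when the heartbeats stop. The sketch below is only an illustration of that idea, not the setup from the post; `HeartbeatMonitor`, `max_silence_s`, and the injectable `clock` are hypothetical names.

```python
import time


class HeartbeatMonitor:
    """Hypothetical sketch: runs OUTSIDE the monitored server's failure domain."""

    def __init__(self, max_silence_s: float, clock=time.monotonic):
        self.max_silence_s = max_silence_s
        self._clock = clock            # injectable so the logic can be tested
        self._last_beat = clock()      # treat construction time as the first beat

    def beat(self) -> None:
        """Record that a heartbeat arrived from the monitored server."""
        self._last_beat = self._clock()

    def is_stale(self) -> bool:
        """True when no heartbeat has arrived within the allowed silence."""
        return self._clock() - self._last_beat > self.max_silence_s
```

Because silence itself is the alert condition, a dead tunnel can no longer hide the outage: the monitor only needs to keep running on infrastructure that does not depend on that tunnel.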

March 15, 2026 · 5 min · 1052 words · Rob Washington

Infrastructure Health Checks That Actually Work

“Is everything working?” sounds like a simple question. It’s not. Here’s how to build health checks that give you real answers.

What Health Checks Are For

Health checks answer one question: Can this thing do its job right now?

Not “is it running?” (that’s process monitoring). Not “did it work yesterday?” (that’s metrics). Not “will it work tomorrow?” (that’s capacity planning). Just: right now, can it serve traffic?

The Levels of Health

Level 1: Process Running

Bare minimum: is the process alive? ...
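A Level 1 check can be as small as asking the operating system whether a PID still exists. The following is a minimal sketch for POSIX systems only; `process_alive` is a hypothetical helper name, and a `True` result says nothing about whether the process can actually serve traffic.

```python
import os


def process_alive(pid: int) -> bool:
    """Level 1 check (POSIX only): does a process with this PID exist?"""
    try:
        os.kill(pid, 0)  # signal 0 sends nothing; it only performs the existence check
    except ProcessLookupError:
        return False     # no such process
    except PermissionError:
        return True      # process exists but belongs to another user
    return True
```

This is why Level 1 is the bare minimum: a deadlocked process passes this check just as easily as a healthy one.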

March 11, 2026 · 5 min · 953 words · Rob Washington

Health Check Patterns: Liveness, Readiness, and Startup Probes

Your load balancer routes traffic to a pod that’s crashed. Or kills a pod that’s just slow. Or restarts a pod that’s still initializing.

Health checks prevent these failures, when configured correctly. Most teams get them wrong. Here’s how to get them right.

The Three Probe Types

Kubernetes offers three distinct probes, each with a different purpose:

Probe      Question                   Failure Action
Liveness   Is the process alive?      Restart container
Readiness  Can it handle traffic?     Remove from Service
Startup    Has it finished starting?  Delay other probes

Liveness: “Should I restart this?”

Detects when your app is stuck: deadlocked, infinite loop, unrecoverable state. ...

February 19, 2026 · 6 min · 1210 words · Rob Washington

Health Checks: Readiness, Liveness, and Startup Probes Explained

Your application says it’s running. But is it actually working?

Health checks answer that question. They’re the difference between “process exists” and “service is functional.” Get them wrong, and your orchestrator will either route traffic to broken instances or restart healthy ones.

Three Types of Probes

Liveness: “Is this process stuck?”

Liveness probes detect deadlocks, infinite loops, and zombie processes. If liveness fails, the container gets killed and restarted.

    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 30
      periodSeconds: 10
      failureThreshold: 3

What to check: ...
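To get a feel for the numbers in the liveness probe above, the delay between a hang and the restart decision can be estimated from `periodSeconds` and `failureThreshold`. This is a rough back-of-the-envelope sketch; the function name is hypothetical, and it ignores `timeoutSeconds` and scheduling jitter.

```python
def restart_window_seconds(period_s: int, failure_threshold: int) -> tuple[int, int]:
    """Rough (best, worst) seconds from a hang to the restart decision.

    Best case: the hang begins just before a probe fires, so the first
    failure is immediate and the remaining failures follow one period apart.
    Worst case: the hang begins just after a successful probe, so a full
    period passes before the first failure is even observed.
    """
    best = period_s * (failure_threshold - 1)
    worst = period_s * failure_threshold
    return best, worst


print(restart_window_seconds(period_s=10, failure_threshold=3))  # (20, 30)
```

With `periodSeconds: 10` and `failureThreshold: 3`, a stuck container is restarted roughly 20 to 30 seconds after it hangs. `initialDelaySeconds: 30` is separate: it only postpones the first probe after the container starts.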

February 16, 2026 · 6 min · 1131 words · Rob Washington

Health Checks Done Right: Liveness, Readiness, and Startup Probes

A health check that always returns 200 OK is worse than no health check at all. It gives you false confidence while your application silently fails.

Let’s build health checks that actually tell you when something’s wrong.

The Three Types of Probes

Kubernetes defines three probe types, each serving a distinct purpose:

Liveness Probe: “Is this process stuck?” If it fails, Kubernetes kills and restarts the container.

Readiness Probe: “Can this instance handle traffic?” If it fails, the instance is removed from load balancing but keeps running. ...
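As a contrast to an endpoint that always returns 200 OK, a health handler can derive its status code from real checks. The sketch below is one possible shape under stated assumptions: `health_response` and the check names are hypothetical, and each check is simply a callable that returns `True` when its dependency is usable.

```python
from typing import Callable, Dict, Tuple


def health_response(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, Dict[str, str]]:
    """Run every check and return (HTTP status, per-check results)."""
    results: Dict[str, str] = {}
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            ok = False  # a check that raises counts as a failure, not a crash
        results[name] = "ok" if ok else "fail"
    # 200 only when every check passed; otherwise 503 Service Unavailable
    status = 200 if all(v == "ok" for v in results.values()) else 503
    return status, results
```

Called as `health_response({"database": db_ping, "cache": cache_ping})` with hypothetical `db_ping` and `cache_ping` callables, it returns 503 as soon as either dependency fails, which is exactly the signal an always-200 endpoint hides.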

February 11, 2026 · 6 min · 1174 words · Rob Washington