Why Your Health Check Didn't Catch the Outage
You wake up to angry messages. Your service has been down for hours. You check your monitoring dashboard: all green. What happened?

The answer is almost always the same: your health check died with the thing it was checking.

The Problem: Shared Failure Domains

Here's a common setup that looks correct but isn't:

[Diagram: Your Server hosts both the Service (port 80) and the Health Check (cron job); the server connects through a Tunnel (localtunnel / cloudflared) to the Internet.]

The health check runs on the same server, uses the same tunnel, and sends alerts through... the same tunnel. When the tunnel dies, both the service AND the alerting die together.

...
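To make the shared failure domain concrete, here is a minimal sketch that models the setup above as a dependency map. The capability names (`service`, `health_check`, `alerting`) and the helper functions are hypothetical, chosen only to mirror the description; this is a toy model, not a monitoring tool.

```python
# Illustrative model only: the names below are assumptions that mirror
# the setup described in the text, not part of any real tool.
DEPENDS_ON = {
    "service": {"server", "tunnel"},       # users reach the service through the tunnel
    "health_check": {"server", "tunnel"},  # cron job on the same server, same tunnel
    "alerting": {"server", "tunnel"},      # alerts leave through the same tunnel
}


def is_up(capability, failed):
    """A capability is up only if none of its dependencies has failed."""
    return DEPENDS_ON[capability].isdisjoint(failed)


def outage_is_reported(failed):
    """True if the service is down AND the check and alert path still work."""
    service_down = not is_up("service", failed)
    monitoring_alive = is_up("health_check", failed) and is_up("alerting", failed)
    return service_down and monitoring_alive


def blind_spots():
    """Components whose failure takes down the service and the alert path together."""
    return DEPENDS_ON["service"] & DEPENDS_ON["alerting"]


if __name__ == "__main__":
    for component in ("tunnel", "server"):
        print(component, "fails -> outage reported:", outage_is_reported({component}))
    print("blind spots:", sorted(blind_spots()))
```

In this model, every dependency of the service is also a dependency of the alert path, so no single failure that takes the service down can ever produce an alert.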