“Is everything working?” sounds like a simple question. It’s not. Here’s how to build health checks that give you real answers.
## What Health Checks Are For
Health checks answer one question: Can this thing do its job right now?
Not “is it running?” (that’s process monitoring).
Not “did it work yesterday?” (that’s metrics).
Not “will it work tomorrow?” (that’s capacity planning).
Just: right now, can it serve traffic?
## The Levels of Health
### Level 1: Process Running
Bare minimum — is the process alive?
```bash
# Systemd
systemctl is-active myapp

# Docker
docker inspect --format='{{.State.Running}}' myapp

# Kubernetes
kubectl get pod myapp -o jsonpath='{.status.phase}'
```
This catches crashes but misses deadlocks, resource exhaustion, and dependency failures.
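If you need a Level 1 check from inside another process, the classic POSIX trick is sending signal 0, which performs the existence and permission checks without delivering anything. A minimal sketch (POSIX only; the helper name is mine):

```python
import os

def process_alive(pid: int) -> bool:
    """Level 1 check: is a process with this PID running? (POSIX only)"""
    try:
        os.kill(pid, 0)  # signal 0: existence/permission check, sends nothing
        return True
    except ProcessLookupError:
        return False     # no such process
    except PermissionError:
        return True      # exists, but owned by another user
```

Note this tells you only that *a* process has that PID, not that it is your app or that it is making progress.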
### Level 2: Port Responding
One step better — is something listening?
```bash
# TCP check
nc -z localhost 8080

# HTTP check
curl -sf http://localhost:8080/ > /dev/null
This catches binding failures but misses application-level issues.
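The same TCP check is easy to do from Python with only the standard library, which is handy when your monitoring runs as a script rather than shelling out to `nc` (the helper name is mine):

```python
import socket

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Level 2 check: is something accepting TCP connections on host:port?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, unreachable, DNS failure...
        return False
```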
### Level 3: Application Health Endpoint
The standard approach — dedicated health endpoint:
```python
@app.get("/health")
def health_check():
    return {"status": "ok"}
```
But this is too simple. It passes even when the database is down.
### Level 4: Deep Health Check
Check actual dependencies:
```python
import requests
from fastapi.responses import JSONResponse

@app.get("/health")
def health_check():
    checks = {}

    # Database
    try:
        db.execute("SELECT 1")
        checks["database"] = "ok"
    except Exception as e:
        checks["database"] = f"error: {e}"

    # Cache
    try:
        redis.ping()
        checks["cache"] = "ok"
    except Exception as e:
        checks["cache"] = f"error: {e}"

    # External API
    try:
        resp = requests.get("https://api.stripe.com/health", timeout=5)
        checks["stripe"] = "ok" if resp.ok else f"error: {resp.status_code}"
    except Exception as e:
        checks["stripe"] = f"error: {e}"

    all_ok = all(v == "ok" for v in checks.values())
    status_code = 200 if all_ok else 503
    return JSONResponse(
        {"status": "healthy" if all_ok else "degraded", "checks": checks},
        status_code=status_code,
    )
```
Now load balancers can route around sick instances.
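To make that routing concrete, here is a toy sketch of what a load balancer effectively does: rotate through instances and skip any whose probe doesn't come back 200. All names here are hypothetical, and `probe` stands in for a real HTTP probe:

```python
import itertools

class RoundRobinPool:
    """Toy load balancer: rotate through instances, skipping unhealthy ones."""

    def __init__(self, instances, probe):
        self.instances = instances
        self.probe = probe  # instance -> HTTP status code from its health check
        self._cycle = itertools.cycle(range(len(instances)))

    def next_target(self):
        # Try each instance at most once per call.
        for _ in range(len(self.instances)):
            inst = self.instances[next(self._cycle)]
            if self.probe(inst) == 200:
                return inst
        raise RuntimeError("no healthy instances")
```

A real balancer also caches probe results and re-adds instances once they recover, but the skip-on-503 behavior is the core idea.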
## Liveness vs Readiness
Kubernetes popularized this distinction:
- **Liveness**: Should this container be restarted?
- **Readiness**: Should this container receive traffic?
```yaml
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 10
  failureThreshold: 3   # Restart after 3 failures

readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  periodSeconds: 5
  failureThreshold: 1   # Remove from service immediately
```
Liveness endpoint: Basic — “am I deadlocked?” Keep it simple and fast.
```python
@app.get("/healthz")
def liveness():
    return {"status": "alive"}
```
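A handler that returns a literal can't by itself notice a stalled worker. One common pattern (sketched here with hypothetical names) is a heartbeat timestamp that the main work loop refreshes and the liveness handler inspects; if the loop deadlocks, the heartbeat goes stale and the probe starts failing:

```python
import time

class Heartbeat:
    """The work loop calls beat(); the /healthz handler calls is_alive()
    and reports failure once the loop has gone quiet for too long."""

    def __init__(self, max_age_seconds: float = 30.0):
        self.max_age = max_age_seconds
        self._last = time.monotonic()

    def beat(self):
        self._last = time.monotonic()

    def is_alive(self) -> bool:
        return (time.monotonic() - self._last) < self.max_age
```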
Readiness endpoint: Full dependency check — “can I serve requests?”
```python
@app.get("/ready")
def readiness():
    if not db.is_connected():
        raise HTTPException(503, "database unavailable")
    if cache.queue_depth > 1000:
        raise HTTPException(503, "cache backlogged")
    return {"status": "ready"}
```
## External Health Checks
Internal checks aren’t enough. Verify from outside:
```bash
#!/bin/bash
# health-check.sh
ENDPOINTS=(
  "https://myapp.com|200"
  "https://api.myapp.com/health|200"
  "https://cdn.myapp.com/test.txt|200"
)

FAILURES=()
for entry in "${ENDPOINTS[@]}"; do
  url="${entry%|*}"
  expected="${entry#*|}"
  # -w prints the status code; connection failures show up as 000
  status=$(curl -s -o /dev/null -w "%{http_code}" "$url" --max-time 10)
  if [ "$status" != "$expected" ]; then
    FAILURES+=("$url returned $status (expected $expected)")
  fi
done

if [ ${#FAILURES[@]} -gt 0 ]; then
  echo "Health check failures:"
  printf '%s\n' "${FAILURES[@]}"
  exit 1
fi

echo "All endpoints healthy"
exit 0
```
Run this from multiple locations — your data center, cloud regions, even your phone.
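If your external monitor is a Python scheduler rather than cron plus bash, the same logic translates directly. This sketch mirrors the script's `url|expected` entries, with the HTTP fetch injected as a parameter so it can be faked in tests (the helper names are mine):

```python
def parse_endpoint(entry: str):
    """Split a 'url|expected_status' entry, like the bash script's %|* / #*| trick."""
    url, _, expected = entry.rpartition("|")
    return url, int(expected)

def failing_endpoints(entries, fetch_status):
    """Return failure messages for each endpoint whose status doesn't match.

    fetch_status(url) -> int; in production this would wrap an HTTP client
    with a timeout, here it's injected so the logic is testable offline."""
    failures = []
    for entry in entries:
        url, expected = parse_endpoint(entry)
        status = fetch_status(url)
        if status != expected:
            failures.append(f"{url} returned {status} (expected {expected})")
    return failures
```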
## Common Health Check Mistakes
### 1. Checking Too Much
```python
# Bad: Health check does real work
@app.get("/health")
def health():
    users = db.query("SELECT COUNT(*) FROM users")  # Expensive!
    return {"users": users}
```
Health checks should be cheap. Don’t run expensive queries.
### 2. Checking Too Little
```python
# Bad: Always returns OK
@app.get("/health")
def health():
    return {"status": "ok"}  # Even if DB is down
```
A health check that never fails is useless.
### 3. No Timeouts
```python
# Bad: Hangs forever if DB is slow
def check_database():
    db.execute("SELECT 1")  # No timeout
```
Always set timeouts:
```python
def check_database():
    with timeout(5):  # placeholder: use your DB client's query/statement timeout
        db.execute("SELECT 1")
```
### 4. Cascading Failures
If Service A’s health check calls Service B, and B is down, A reports unhealthy. Now everything looks broken.
Solutions:
- Health checks should check direct dependencies only
- Cache dependency status briefly
- Distinguish “degraded” from “unhealthy”
```python
@app.get("/health")
def health():
    critical = check_database()      # Must work
    optional = check_external_api()  # Nice to have

    if not critical:
        return JSONResponse({"status": "unhealthy"}, 503)
    elif not optional:
        return JSONResponse({"status": "degraded"}, 200)  # Still serve traffic
    else:
        return JSONResponse({"status": "healthy"}, 200)
```
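"Cache dependency status briefly" deserves a concrete shape too: wrap each dependency check in a small TTL cache so that a burst of probes (load balancer, Kubernetes, monitoring) doesn't multiply load on an already-struggling dependency. A sketch, with the clock injected so the TTL logic is testable (helper names are mine):

```python
import time

def cached_check(check_fn, ttl_seconds=10.0, clock=time.monotonic):
    """Wrap a dependency check so repeated probes reuse a recent result."""
    state = {"at": None, "value": None}

    def wrapper():
        now = clock()
        if state["at"] is None or now - state["at"] >= ttl_seconds:
            state["value"] = check_fn()  # refresh only when stale
            state["at"] = now
        return state["value"]

    return wrapper
```

Keep the TTL short (seconds, not minutes): the point is to absorb probe bursts, not to hide a dependency that has actually gone down.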
## Health Check Monitoring
Don’t just check — track and alert:
```python
# Prometheus metrics
health_check_duration = Histogram(
    'health_check_duration_seconds',
    'Time spent in health check'
)
health_check_status = Gauge(
    'health_check_status',
    'Health check result (1=healthy, 0=unhealthy)',
    ['check_name']
)

@app.get("/health")
def health():
    with health_check_duration.time():
        results = run_all_checks()
    for name, passed in results.items():
        health_check_status.labels(check_name=name).set(1 if passed else 0)
    return results
```
Alert when:
- Health check fails N times consecutively
- Health check latency exceeds threshold
- Health check success rate drops below 99%
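The "fails N times consecutively" rule is usually expressed in your alerting system, but the logic itself is tiny. A minimal sketch (class name is mine) that fires only after a streak, so a single blip doesn't page anyone:

```python
class ConsecutiveFailureAlert:
    """Fire only after N consecutive failures; any success resets the streak."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.streak = 0

    def record(self, healthy: bool) -> bool:
        """Record one check result; return True when an alert should fire."""
        self.streak = 0 if healthy else self.streak + 1
        return self.streak >= self.threshold
```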
## The Health Check Checklist
A good health check is the first thing to tell you something’s wrong — and often the last thing teams invest in properly.
Health checks are your system’s vital signs. Check them often, trust their results, and fix them when they lie.