“Is everything working?” sounds like a simple question. It’s not. Here’s how to build health checks that give you real answers.
## What Health Checks Are For
Health checks answer one question: Can this thing do its job right now?
Not “is it running?” (that’s process monitoring).
Not “did it work yesterday?” (that’s metrics).
Not “will it work tomorrow?” (that’s capacity planning).
Just: right now, can it serve traffic?
## The Levels of Health
### Level 1: Process Running
Bare minimum — is the process alive?
```bash
# Systemd
systemctl is-active myapp

# Docker
docker inspect --format='{{.State.Running}}' myapp

# Kubernetes
kubectl get pod myapp -o jsonpath='{.status.phase}'
```
This catches crashes but misses deadlocks, resource exhaustion, and dependency failures.
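If you need a Level 1 check from inside another process, the classic POSIX trick is sending signal 0, which performs the existence and permission checks without delivering anything. A minimal sketch (POSIX only; the helper name is mine):

```python
import os

def process_alive(pid: int) -> bool:
    """Level 1 check: is a process with this PID running? (POSIX only)"""
    try:
        os.kill(pid, 0)  # signal 0: existence/permission check, sends nothing
        return True
    except ProcessLookupError:
        return False     # no such process
    except PermissionError:
        return True      # exists, but owned by another user
```

Note this tells you only that *a* process has that PID, not that it is your app or that it is making progress.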
### Level 2: Port Responding
One step better — is something listening?
```bash
# TCP check
nc -z localhost 8080

# HTTP check
curl -sf http://localhost:8080/ > /dev/null
This catches binding failures but misses application-level issues.
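The same TCP check is easy to do from Python with only the standard library, which is handy when your monitoring runs as a script rather than shelling out to `nc` (the helper name is mine):

```python
import socket

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Level 2 check: is something accepting TCP connections on host:port?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, unreachable, DNS failure...
        return False
```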
### Level 3: Application Health Endpoint
The standard approach — dedicated health endpoint:
```python
@app.get("/health")
def health_check():
    return {"status": "ok"}
```
But this is too simple. It passes even when the database is down.
### Level 4: Deep Health Check
Check actual dependencies:
```python
import requests
from fastapi.responses import JSONResponse

@app.get("/health")
def health_check():
    checks = {}

    # Database
    try:
        db.execute("SELECT 1")
        checks["database"] = "ok"
    except Exception as e:
        checks["database"] = f"error: {e}"

    # Cache
    try:
        redis.ping()
        checks["cache"] = "ok"
    except Exception as e:
        checks["cache"] = f"error: {e}"

    # External API
    try:
        resp = requests.get("https://api.stripe.com/health", timeout=5)
        checks["stripe"] = "ok" if resp.ok else f"error: {resp.status_code}"
    except Exception as e:
        checks["stripe"] = f"error: {e}"

    all_ok = all(v == "ok" for v in checks.values())
    status_code = 200 if all_ok else 503
    return JSONResponse(
        {"status": "healthy" if all_ok else "degraded", "checks": checks},
        status_code=status_code,
    )
```
Now load balancers can route around sick instances.
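To make that routing concrete, here is a toy sketch of what a load balancer effectively does: rotate through instances and skip any whose probe doesn't come back 200. All names here are hypothetical, and `probe` stands in for a real HTTP probe:

```python
import itertools

class RoundRobinPool:
    """Toy load balancer: rotate through instances, skipping unhealthy ones."""

    def __init__(self, instances, probe):
        self.instances = instances
        self.probe = probe  # instance -> HTTP status code from its health check
        self._cycle = itertools.cycle(range(len(instances)))

    def next_target(self):
        # Try each instance at most once per call.
        for _ in range(len(self.instances)):
            inst = self.instances[next(self._cycle)]
            if self.probe(inst) == 200:
                return inst
        raise RuntimeError("no healthy instances")
```

A real balancer also caches probe results and re-adds instances once they recover, but the skip-on-503 behavior is the core idea.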
## Liveness vs Readiness
Kubernetes popularized this distinction:
- **Liveness**: Should this container be restarted?
- **Readiness**: Should this container receive traffic?
```yaml
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 10
  failureThreshold: 3   # Restart after 3 failures

readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  periodSeconds: 5
  failureThreshold: 1   # Remove from service immediately
```
Liveness endpoint: Basic — “am I deadlocked?” Keep it simple and fast.
```python
@app.get("/healthz")
def liveness():
    return {"status": "alive"}
```
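A handler that returns a literal can't by itself notice a stalled worker. One common pattern (sketched here with hypothetical names) is a heartbeat timestamp that the main work loop refreshes and the liveness handler inspects; if the loop deadlocks, the heartbeat goes stale and the probe starts failing:

```python
import time

class Heartbeat:
    """The work loop calls beat(); the /healthz handler calls is_alive()
    and reports failure once the loop has gone quiet for too long."""

    def __init__(self, max_age_seconds: float = 30.0):
        self.max_age = max_age_seconds
        self._last = time.monotonic()

    def beat(self):
        self._last = time.monotonic()

    def is_alive(self) -> bool:
        return (time.monotonic() - self._last) < self.max_age
```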
Readiness endpoint: Full dependency check — “can I serve requests?”
```python
@app.get("/ready")
def readiness():
    if not db.is_connected():
        raise HTTPException(503, "database unavailable")
    if cache.queue_depth > 1000:
        raise HTTPException(503, "cache backlogged")
    return {"status": "ready"}
```
## External Health Checks
Internal checks aren’t enough. Verify from outside:
```bash
#!/bin/bash
# health-check.sh
ENDPOINTS=(
  "https://myapp.com|200"
  "https://api.myapp.com/health|200"
  "https://cdn.myapp.com/test.txt|200"
)

FAILURES=()
for entry in "${ENDPOINTS[@]}"; do
  url="${entry%|*}"
  expected="${entry#*|}"
  # -w prints the status code; connection failures show up as 000
  status=$(curl -s -o /dev/null -w "%{http_code}" "$url" --max-time 10)
  if [ "$status" != "$expected" ]; then
    FAILURES+=("$url returned $status (expected $expected)")
  fi
done

if [ ${#FAILURES[@]} -gt 0 ]; then
  echo "Health check failures:"
  printf '%s\n' "${FAILURES[@]}"
  exit 1
fi

echo "All endpoints healthy"
exit 0
```
Run this from multiple locations — your data center, cloud regions, even your phone.
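If your external monitor is a Python scheduler rather than cron plus bash, the same logic translates directly. This sketch mirrors the script's `url|expected` entries, with the HTTP fetch injected as a parameter so it can be faked in tests (the helper names are mine):

```python
def parse_endpoint(entry: str):
    """Split a 'url|expected_status' entry, like the bash script's %|* / #*| trick."""
    url, _, expected = entry.rpartition("|")
    return url, int(expected)

def failing_endpoints(entries, fetch_status):
    """Return failure messages for each endpoint whose status doesn't match.

    fetch_status(url) -> int; in production this would wrap an HTTP client
    with a timeout, here it's injected so the logic is testable offline."""
    failures = []
    for entry in entries:
        url, expected = parse_endpoint(entry)
        status = fetch_status(url)
        if status != expected:
            failures.append(f"{url} returned {status} (expected {expected})")
    return failures
```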
## Common Health Check Mistakes
### 1. Checking Too Much
```python
# Bad: Health check does real work
@app.get("/health")
def health():
    users = db.query("SELECT COUNT(*) FROM users")  # Expensive!
    return {"users": users}
```
Health checks should be cheap. Don’t run expensive queries.
### 2. Checking Too Little
```python
# Bad: Always returns OK
@app.get("/health")
def health():
    return {"status": "ok"}  # Even if DB is down
```
A health check that never fails is useless.
### 3. No Timeouts
```python
# Bad: Hangs forever if DB is slow
def check_database():
    db.execute("SELECT 1")  # No timeout
```
Always set timeouts:
```python
def check_database():
    with timeout(5):  # placeholder: use your DB client's query/statement timeout
        db.execute("SELECT 1")
```
### 4. Cascading Failures
If Service A’s health check calls Service B, and B is down, A reports unhealthy. Now everything looks broken.
Solutions:
- Health checks should check direct dependencies only
- Cache dependency status briefly
- Distinguish “degraded” from “unhealthy”
```python
@app.get("/health")
def health():
    critical = check_database()      # Must work
    optional = check_external_api()  # Nice to have

    if not critical:
        return JSONResponse({"status": "unhealthy"}, 503)
    elif not optional:
        return JSONResponse({"status": "degraded"}, 200)  # Still serve traffic
    else:
        return JSONResponse({"status": "healthy"}, 200)
```
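"Cache dependency status briefly" deserves a concrete shape too: wrap each dependency check in a small TTL cache so that a burst of probes (load balancer, Kubernetes, monitoring) doesn't multiply load on an already-struggling dependency. A sketch, with the clock injected so the TTL logic is testable (helper names are mine):

```python
import time

def cached_check(check_fn, ttl_seconds=10.0, clock=time.monotonic):
    """Wrap a dependency check so repeated probes reuse a recent result."""
    state = {"at": None, "value": None}

    def wrapper():
        now = clock()
        if state["at"] is None or now - state["at"] >= ttl_seconds:
            state["value"] = check_fn()  # refresh only when stale
            state["at"] = now
        return state["value"]

    return wrapper
```

Keep the TTL short (seconds, not minutes): the point is to absorb probe bursts, not to hide a dependency that has actually gone down.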
## Health Check Monitoring
Don’t just check — track and alert:
```python
# Prometheus metrics
health_check_duration = Histogram(
    'health_check_duration_seconds',
    'Time spent in health check'
)
health_check_status = Gauge(
    'health_check_status',
    'Health check result (1=healthy, 0=unhealthy)',
    ['check_name']
)

@app.get("/health")
def health():
    with health_check_duration.time():
        results = run_all_checks()
    for name, passed in results.items():
        health_check_status.labels(check_name=name).set(1 if passed else 0)
    return results
```
Alert when:
- Health check fails N times consecutively
- Health check latency exceeds threshold
- Health check success rate drops below 99%
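The "fails N times consecutively" rule is usually expressed in your alerting system, but the logic itself is tiny. A minimal sketch (class name is mine) that fires only after a streak, so a single blip doesn't page anyone:

```python
class ConsecutiveFailureAlert:
    """Fire only after N consecutive failures; any success resets the streak."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.streak = 0

    def record(self, healthy: bool) -> bool:
        """Record one check result; return True when an alert should fire."""
        self.streak = 0 if healthy else self.streak + 1
        return self.streak >= self.threshold
```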
## The Health Check Checklist
A good health check is the first thing to tell you something’s wrong — and often the last thing teams invest in properly.
Health checks are your system’s vital signs. Check them often, trust their results, and fix them when they lie.