Every modern service needs health check endpoints. Load balancers probe them. Kubernetes uses them. Monitoring systems scrape them. But a naive implementation—returning 200 OK if the process is running—tells you almost nothing useful. Here’s how to build health checks that actually help.
## Two Types of Health

- **Liveness**: Is the process alive and not deadlocked?
- **Readiness**: Can this instance handle requests right now?
These are different questions with different answers:
```python
# Liveness: Am I alive?
@app.get("/health/live")
def liveness():
    # If this returns, the process is alive
    return {"status": "alive"}

# Readiness: Can I serve traffic?
@app.get("/health/ready")
def readiness():
    checks = {
        "database": check_database(),
        "cache": check_cache(),
        "disk_space": check_disk_space(),
    }
    all_healthy = all(c["healthy"] for c in checks.values())
    return JSONResponse(
        status_code=200 if all_healthy else 503,
        content={"status": "ready" if all_healthy else "not_ready", "checks": checks},
    )
```
Why separate them?
- Liveness failure: Restart the container (it’s stuck)
- Readiness failure: Stop sending traffic (but don’t restart)
If you restart on every readiness failure, a database blip could cause a restart cascade across your entire fleet.
## Check What Matters

Only check dependencies that would prevent the service from working:
```python
def check_database():
    """Database is required—we can't serve requests without it."""
    try:
        db.execute("SELECT 1")
        return {"healthy": True, "latency_ms": measure_latency()}
    except Exception as e:
        return {"healthy": False, "error": str(e)}

def check_cache():
    """Cache is optional—we can fall back to the database."""
    try:
        cache.ping()
        return {"healthy": True}
    except Exception:
        # Degraded but functional
        return {"healthy": True, "degraded": True, "message": "Cache unavailable, using fallback"}

def check_external_api():
    """External API for a non-critical feature—don't fail health on it."""
    try:
        external_api.ping()
        return {"healthy": True}
    except Exception:
        # Log it, but don't affect health
        return {"healthy": True, "degraded": True}
```
Rule: Only mark unhealthy for dependencies that truly prevent the service from functioning.
## Include Useful Details
A good health response helps debugging:
```json
{
  "status": "ready",
  "version": "1.2.3",
  "uptime_seconds": 3600,
  "checks": {
    "database": {
      "healthy": true,
      "latency_ms": 2.3,
      "pool_size": 10,
      "pool_available": 8
    },
    "cache": {
      "healthy": true,
      "latency_ms": 0.5,
      "hit_rate": 0.92
    },
    "disk_space": {
      "healthy": true,
      "free_gb": 45.2,
      "threshold_gb": 10
    }
  }
}
```
This tells you not just “healthy” but “how healthy”—useful for spotting trends before they become outages.
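A response like this can be assembled in the readiness handler itself. A minimal sketch, where `APP_VERSION` and `START_TIME` are hypothetical module-level values (in practice, version usually comes from your build metadata):

```python
import time

# Assumed module-level state, set once at process start.
START_TIME = time.time()
APP_VERSION = "1.2.3"

def build_health_response(checks: dict) -> dict:
    """Assemble a readiness payload with version and uptime alongside checks."""
    all_healthy = all(c.get("healthy") for c in checks.values())
    return {
        "status": "ready" if all_healthy else "not_ready",
        "version": APP_VERSION,
        "uptime_seconds": round(time.time() - START_TIME),
        "checks": checks,
    }
```

The handler then only decides the HTTP status code; the payload shape stays consistent whether the service is healthy or not.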
## Kubernetes Configuration
```yaml
apiVersion: v1
kind: Pod
spec:
  containers:
  - name: app
    livenessProbe:
      httpGet:
        path: /health/live
        port: 8080
      initialDelaySeconds: 10   # Wait for startup
      periodSeconds: 10         # Check every 10s
      timeoutSeconds: 5         # Fail if no response in 5s
      failureThreshold: 3       # Restart after 3 failures
    readinessProbe:
      httpGet:
        path: /health/ready
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 5
      timeoutSeconds: 3
      failureThreshold: 2       # Remove from service after 2 failures
      successThreshold: 1       # Add back after 1 success
```
Tune these values:

- `initialDelaySeconds`: long enough for your app to actually start
- `periodSeconds`: a balance between responsiveness and probe overhead
- `failureThreshold`: higher means more tolerant of transient issues
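These settings also determine how long a broken instance keeps receiving traffic before it is pulled. A quick back-of-the-envelope bound, treating each failed probe cycle as one period plus one timeout (a conservative upper bound; actual Kubernetes timing is usually a bit tighter since probes fire on a fixed schedule):

```python
def worst_case_detection_seconds(period: int, timeout: int, failure_threshold: int) -> int:
    """Conservative upper bound on time-to-detection for a failing instance:
    each failed cycle waits up to `period` for the next probe, which then
    takes up to `timeout` to fail, repeated `failure_threshold` times."""
    return failure_threshold * (period + timeout)

# With the readiness settings above (period=5, timeout=3, failureThreshold=2):
# up to 2 * (5 + 3) = 16 seconds before the instance leaves rotation.
```

Running the same arithmetic on a candidate configuration before deploying it is a cheap way to check that your detection window matches your availability target.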
## Startup Probes
For slow-starting apps, use a startup probe:
```yaml
startupProbe:
  httpGet:
    path: /health/live
    port: 8080
  initialDelaySeconds: 0
  periodSeconds: 5
  failureThreshold: 30   # 30 * 5s = 150s max startup time
```
The startup probe runs first. Liveness and readiness probes only start after it succeeds. This prevents Kubernetes from killing slow-starting containers.
## Don’t Check Too Much
Health checks run frequently. Keep them fast:
```python
# BAD: Heavy query on every health check
def check_database():
    result = db.execute(
        "SELECT COUNT(*) FROM orders WHERE created_at > NOW() - INTERVAL '1 hour'"
    )
    return {"healthy": True, "recent_orders": result}

# GOOD: Minimal query
def check_database():
    db.execute("SELECT 1")
    return {"healthy": True}
```
If you need metrics, collect them separately on a longer interval.
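One common pattern for that separate collection is a background loop that runs the heavy query on its own schedule and stashes the latest value for handlers to read. A sketch, where `metrics`, `collect_once`, and `start_collector` are illustrative names rather than any particular library's API:

```python
import threading
import time

# Hypothetical in-process metrics store; in practice this might feed a
# Prometheus gauge or a StatsD counter instead.
metrics = {"recent_orders": None}

def collect_once(query_fn):
    """Run one heavy query and stash the result; never raise."""
    try:
        metrics["recent_orders"] = query_fn()
    except Exception:
        pass  # metrics are best-effort; a failure must not crash the collector

def start_collector(query_fn, interval_seconds=60):
    """Run heavy queries on their own interval, not on every health probe."""
    def loop():
        while True:
            collect_once(query_fn)
            time.sleep(interval_seconds)
    threading.Thread(target=loop, daemon=True).start()
```

The health endpoint then reads `metrics` for free; probe latency stays flat no matter how expensive the underlying query becomes.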
## Caching Health Checks
For expensive checks, cache the result:
```python
from time import time

class HealthCache:
    """Cache check results so frequent probes don't hammer dependencies."""

    def __init__(self, ttl_seconds=5):
        self.ttl = ttl_seconds
        self.cache = {}

    def get_or_compute(self, key, compute_fn):
        now = time()
        if key in self.cache:
            value, timestamp = self.cache[key]
            if now - timestamp < self.ttl:
                return value
        value = compute_fn()
        self.cache[key] = (value, now)
        return value

health_cache = HealthCache(ttl_seconds=5)

@app.get("/health/ready")
def readiness():
    db_health = health_cache.get_or_compute("database", check_database)
    cache_health = health_cache.get_or_compute("cache", check_cache)
    # ...
```
## Deep Health Checks
Sometimes you need a thorough check for debugging:
```python
import psutil

@app.get("/health/deep")
def deep_health():
    """Expensive checks—don't use for load balancer probes."""
    return {
        "database": {
            **check_database(),
            "replication_lag_ms": get_replication_lag(),
            "connection_count": get_connection_count(),
        },
        "cache": {
            **check_cache(),
            "memory_usage_mb": get_cache_memory(),
            "key_count": get_cache_key_count(),
        },
        "system": {
            "cpu_percent": psutil.cpu_percent(),
            "memory_percent": psutil.virtual_memory().percent,
            "disk_percent": psutil.disk_usage("/").percent,
        },
    }
```
Protect it: This endpoint should not be exposed to the internet or used by load balancers. It’s for operators, not infrastructure.
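Beyond network policy, the handler itself can refuse non-internal callers. A minimal application-layer guard, assuming private and loopback ranges cover your cluster network (adjust for your topology, and note that source IPs can be rewritten by proxies):

```python
import ipaddress

def is_internal(client_ip: str) -> bool:
    """Allow only private/loopback source addresses to reach /health/deep."""
    try:
        addr = ipaddress.ip_address(client_ip)
    except ValueError:
        return False  # unparseable address: deny
    return addr.is_private or addr.is_loopback
```

In the route, a failed check might return a 404 rather than a 403, so the endpoint's existence isn't advertised to outside callers.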
## Quick Reference

| Endpoint | Purpose | Check Frequency | Failure Action |
|---|---|---|---|
| `/health/live` | Process alive? | Every 10s | Restart |
| `/health/ready` | Can serve traffic? | Every 5s | Remove from LB |
| `/health/deep` | Detailed diagnostics | On demand | Alert/investigate |
The goal: Health checks should tell the truth. If a service can’t do its job, it should say so. If it can, it shouldn’t cry wolf. The infrastructure will handle the rest.