Your service is running. Is it healthy? Can it handle requests? These are different questions with different answers.
Kubernetes formalized this distinction with liveness and readiness probes. Even if you’re not on Kubernetes, the concepts matter everywhere.
## The Distinction
Liveness: Is the process alive and not stuck?
- If NO → Restart the process
- Checks for: deadlocks, infinite loops, crashed but not exited
Readiness: Can this instance handle traffic right now?
- If NO → Remove from load balancer, don’t send requests
- Checks for: database connectivity, cache warmth, dependency availability
A service can be live but not ready (starting up, lost database connection). It should stay running but shouldn’t receive traffic.
A service can be ready but about to die (memory leak approaching OOM). It’s handling requests fine… for now.
## Basic Implementation
```python
from fastapi import FastAPI, Response

app = FastAPI()

# State tracking
db_connected = False
cache_warmed = False
shutting_down = False


@app.get("/health/live")
async def liveness():
    """Am I alive and not stuck?"""
    # If we can respond at all, we're alive
    return {"status": "alive"}


@app.get("/health/ready")
async def readiness(response: Response):
    """Can I handle traffic?"""
    if shutting_down:
        response.status_code = 503
        return {"status": "shutting_down"}
    if not db_connected:
        response.status_code = 503
        return {"status": "database_disconnected"}
    if not cache_warmed:
        response.status_code = 503
        return {"status": "cache_warming"}
    return {"status": "ready"}
```
## What Liveness Should Check
Keep it minimal. Liveness failures trigger restarts—you don’t want false positives:
```python
import asyncio

from fastapi import Response

# Two alternatives for the same route — pick one.


# Option 1: Just respond (simplest)
@app.get("/health/live")
async def liveness():
    return {"status": "alive"}


# Option 2: Verify the event loop can still schedule work
@app.get("/health/live")
async def liveness_checked():
    try:
        await asyncio.wait_for(asyncio.sleep(0), timeout=1.0)
        return {"status": "alive"}
    except asyncio.TimeoutError:
        # Event loop is stuck
        return Response(status_code=503)
```
Don’t check external dependencies in liveness. If your database is down, restarting your app won’t fix it. You’ll just restart in a loop.
## What Readiness Should Check
Check everything needed to serve requests:
```python
# `db`, `redis`, and `feature_flag_client` are application clients
# initialized elsewhere (e.g. in the startup handler).


async def check_database() -> bool:
    try:
        await db.execute("SELECT 1")
        return True
    except Exception:
        return False


async def check_cache() -> bool:
    try:
        await redis.ping()
        return True
    except Exception:
        return False


async def check_dependencies() -> dict:
    return {
        "database": await check_database(),
        "cache": await check_cache(),
        "feature_flags": feature_flag_client.is_connected(),
    }


@app.get("/health/ready")
async def readiness(response: Response):
    checks = await check_dependencies()
    all_healthy = all(checks.values())
    if not all_healthy:
        response.status_code = 503
    return {
        "status": "ready" if all_healthy else "not_ready",
        "checks": checks,
    }
```
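Awaiting each check in turn makes readiness latency the sum of every check. A minimal sketch of running them concurrently with `asyncio.gather` and a per-check timeout — the check functions here are hypothetical stand-ins for the real ones above:

```python
import asyncio


# Hypothetical stand-ins for the real check functions
async def check_database() -> bool:
    await asyncio.sleep(0.01)
    return True


async def check_cache() -> bool:
    await asyncio.sleep(0.01)
    return True


async def check_dependencies(timeout: float = 2.0) -> dict:
    """Run all checks concurrently; a slow or failing check counts as unhealthy."""
    names = ["database", "cache"]
    coros = [check_database(), check_cache()]
    results = await asyncio.gather(
        *(asyncio.wait_for(c, timeout) for c in coros),
        return_exceptions=True,  # exceptions (incl. TimeoutError) become values
    )
    return {name: result is True for name, result in zip(names, results)}
```

With this shape, total probe latency is bounded by the slowest single check rather than the sum, and a hung dependency can never stall the endpoint past the timeout.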
## Startup Probes: The Third Kind
Some applications take a long time to start (loading ML models, warming caches, running migrations). Without a startup probe, Kubernetes might kill them for being “not live” during initialization.
```python
startup_complete = False


@app.on_event("startup")
async def startup():
    global startup_complete
    # Slow initialization
    await load_ml_model()   # 30 seconds
    await warm_cache()      # 20 seconds
    await run_migrations()  # 10 seconds
    startup_complete = True


@app.get("/health/startup")
async def startup_probe():
    if startup_complete:
        return {"status": "started"}
    return Response(status_code=503)
```
Kubernetes config:
```yaml
startupProbe:
  httpGet:
    path: /health/startup
    port: 8080
  failureThreshold: 30
  periodSeconds: 10
  # Allows up to 5 minutes for startup (30 * 10s)

livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
  periodSeconds: 10
  failureThreshold: 3
  # Only starts after the startup probe succeeds
```
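The readiness endpoint gets its own probe alongside these. A sketch with illustrative values, not prescriptive ones:

```yaml
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  periodSeconds: 5
  timeoutSeconds: 2
  failureThreshold: 3
  # Three consecutive failures (~15s) remove the pod from endpoints;
  # a later success adds it back without a restart
```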
## Graceful Shutdown
When shutting down, stop accepting new traffic before stopping the process:
```python
import asyncio
import signal
import sys

shutting_down = False


def handle_shutdown(signum, frame):
    global shutting_down
    shutting_down = True
    # Readiness probe now returns 503
    # Load balancer removes us from rotation
    # Give in-flight requests time to complete, then exit
    asyncio.get_event_loop().call_later(10, sys.exit, 0)


signal.signal(signal.SIGTERM, handle_shutdown)


@app.get("/health/ready")
async def readiness(response: Response):
    if shutting_down:
        response.status_code = 503
        return {"status": "shutting_down"}
    # ... rest of checks
```
The sequence:

1. SIGTERM received
2. Readiness returns 503
3. Load balancer stops sending traffic (takes a few seconds)
4. In-flight requests complete
5. Process exits
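The Kubernetes side has to leave room for this drain. A sketch of the relevant pod settings, with illustrative numbers:

```yaml
spec:
  # Must cover drain time plus the longest in-flight request,
  # or Kubernetes sends SIGKILL before the process exits cleanly
  terminationGracePeriodSeconds: 30
  containers:
    - name: app
      lifecycle:
        preStop:
          exec:
            # Small buffer so endpoint removal propagates before SIGTERM
            command: ["sleep", "5"]
```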
## Deep Health Checks
For debugging, expose more detail on a separate endpoint:
```python
import time
from datetime import datetime

import psutil

# APP_VERSION and START_TIME are assumed module-level constants
process = psutil.Process()


@app.get("/health/deep")
async def deep_health():
    """Detailed health for debugging. Don't use for probes."""
    return {
        "status": "healthy",
        "timestamp": datetime.utcnow().isoformat(),
        "version": APP_VERSION,
        "uptime_seconds": time.time() - START_TIME,
        "checks": {
            "database": {
                "connected": await check_database(),
                "pool_size": db.pool.size(),
                "pool_available": db.pool.freesize(),
                "latency_ms": await measure_db_latency(),
            },
            "cache": {
                "connected": await check_cache(),
                "memory_used_mb": await get_redis_memory(),
                "latency_ms": await measure_cache_latency(),
            },
            "memory": {
                "rss_mb": process.memory_info().rss / 1024 / 1024,
                "percent": process.memory_percent(),
            },
            "cpu_percent": process.cpu_percent(),
        },
    }
```
Don’t use this for Kubernetes probes. It’s too slow and too much information. Use it for dashboards and debugging.
## Common Mistakes
### Checking dependencies in liveness
```python
# BAD: Database down → restart loop
@app.get("/health/live")
async def liveness():
    await db.execute("SELECT 1")  # Don't do this
    return {"status": "alive"}
```
### No timeout on health checks
```python
# BAD: Slow database hangs the health check
@app.get("/health/ready")
async def readiness():
    await db.execute("SELECT 1")  # Could take forever
    return {"status": "ready"}


# GOOD: Timeout on checks
@app.get("/health/ready")
async def readiness():
    try:
        await asyncio.wait_for(db.execute("SELECT 1"), timeout=2.0)
        return {"status": "ready"}
    except asyncio.TimeoutError:
        return Response(status_code=503)
```
### Readiness that never recovers
```python
# BAD: Once marked unhealthy, stays unhealthy
cache_healthy = True


@app.get("/health/ready")
async def readiness():
    if not cache_healthy:
        return Response(status_code=503)
    return {"status": "ready"}

# Someone sets cache_healthy = False on error
# But nothing ever sets it back to True
```
Always check current state, not cached state from past failures.
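If a check is too expensive to run on every probe, cache the result with a TTL so a failure expires and can flip back to healthy. A minimal sketch — the `CachedCheck` name and the injectable clock are illustrative, not a standard API:

```python
import time


class CachedCheck:
    """Re-run an expensive check at most once per `ttl` seconds.

    Unlike a one-way 'unhealthy' flag, the cached result expires,
    so the probe recovers automatically once the dependency is back.
    """

    def __init__(self, check_fn, ttl=5.0, clock=time.monotonic):
        self.check_fn = check_fn
        self.ttl = ttl
        self.clock = clock  # injectable for testing
        self._result = False
        self._checked_at = None

    def healthy(self) -> bool:
        now = self.clock()
        if self._checked_at is None or now - self._checked_at >= self.ttl:
            self._result = self.check_fn()
            self._checked_at = now
        return self._result
```

The readiness handler calls `healthy()` on every probe; stale results are simply re-checked, so recovery is automatic.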
## The Checklist

Before deploying:

- Liveness checks only that the process responds, never external dependencies
- Readiness checks every dependency needed to serve requests
- Every dependency check has a timeout
- Slow-starting services use a startup probe
- SIGTERM flips readiness to 503 before the process exits
- Readiness reflects current state and can recover after a failure
Health checks are your service’s immune system. Get them right, and your infrastructure handles failures automatically. Get them wrong, and you’ll be debugging restart loops at 3 AM.