Your service is running. Is it healthy? Can it handle requests? These are different questions with different answers.

Kubernetes formalized this distinction with liveness and readiness probes. Even if you’re not on Kubernetes, the concepts matter everywhere.

The Distinction

Liveness: Is the process alive and not stuck?

  • If NO → Restart the process
  • Checks for: deadlocks, infinite loops, crashed but not exited

Readiness: Can this instance handle traffic right now?

  • If NO → Remove from load balancer, don’t send requests
  • Checks for: database connectivity, cache warmth, dependency availability

A service can be live but not ready (starting up, lost database connection). It should stay running but shouldn’t receive traffic.

A service can be ready but about to die (memory leak approaching OOM). It’s handling requests fine… for now.

Basic Implementation

from fastapi import FastAPI, Response
import asyncio

app = FastAPI()

# State tracking
db_connected = False
cache_warmed = False
shutting_down = False

@app.get("/health/live")
async def liveness():
    """Am I alive and not stuck?"""
    # If we can respond at all, we're alive
    return {"status": "alive"}

@app.get("/health/ready")
async def readiness(response: Response):
    """Can I handle traffic?"""
    if shutting_down:
        response.status_code = 503
        return {"status": "shutting_down"}
    
    if not db_connected:
        response.status_code = 503
        return {"status": "database_disconnected"}
    
    if not cache_warmed:
        response.status_code = 503
        return {"status": "cache_warming"}
    
    return {"status": "ready"}
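The readiness handler above is really a small decision table. Factoring it into a pure function (a sketch, not part of the original code; the name `readiness_status` is illustrative) makes the ordering easy to unit-test without spinning up the app:

```python
def readiness_status(db_connected: bool, cache_warmed: bool, shutting_down: bool):
    """Mirror the readiness endpoint as a pure function returning (http_status, body).

    Checks are ordered so that shutdown wins over everything else.
    """
    if shutting_down:
        return 503, {"status": "shutting_down"}
    if not db_connected:
        return 503, {"status": "database_disconnected"}
    if not cache_warmed:
        return 503, {"status": "cache_warming"}
    return 200, {"status": "ready"}
```

The endpoint then just sets `response.status_code` from the first tuple element and returns the second.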

What Liveness Should Check

Keep it minimal. Liveness failures trigger restarts—you don’t want false positives:

@app.get("/health/live")
async def liveness():
    # Option 1: just respond (simplest) — if this handler runs at all,
    # the process and its event loop are alive
    return {"status": "alive"}

# Option 2 (alternative body, not routed here): time-box a no-op await
# so a badly backed-up event loop becomes a 503 instead of a hung probe
async def liveness_with_loop_check():
    try:
        await asyncio.wait_for(asyncio.sleep(0), timeout=1.0)
        return {"status": "alive"}
    except asyncio.TimeoutError:
        # Event loop is too backed up to schedule even a no-op
        return Response(status_code=503)

Don’t check external dependencies in liveness. If your database is down, restarting your app won’t fix it. You’ll just restart in a loop.

What Readiness Should Check

Check everything needed to serve requests:

async def check_database() -> bool:
    try:
        await db.execute("SELECT 1")
        return True
    except Exception:
        return False

async def check_cache() -> bool:
    try:
        await redis.ping()
        return True
    except Exception:
        return False

async def check_dependencies() -> dict:
    return {
        "database": await check_database(),
        "cache": await check_cache(),
        "feature_flags": feature_flag_client.is_connected(),
    }

@app.get("/health/ready")
async def readiness(response: Response):
    checks = await check_dependencies()
    
    all_healthy = all(checks.values())
    
    if not all_healthy:
        response.status_code = 503
    
    return {
        "status": "ready" if all_healthy else "not_ready",
        "checks": checks
    }
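Running the checks one after another makes readiness latency the *sum* of the dependency latencies. A sketch of running them concurrently with per-check timeouts (pure asyncio; the helper names are illustrative, and any check that raises or times out simply counts as failed):

```python
import asyncio

async def with_timeout(check, timeout: float = 2.0) -> bool:
    """Run one async check, treating timeouts and errors as failure."""
    try:
        return bool(await asyncio.wait_for(check(), timeout=timeout))
    except Exception:
        return False

async def check_dependencies_concurrently(checks: dict, timeout: float = 2.0) -> dict:
    """Run all checks in parallel: total latency ~= slowest check, not the sum."""
    results = await asyncio.gather(
        *(with_timeout(check, timeout) for check in checks.values())
    )
    return dict(zip(checks, results))
```

Usage mirrors `check_dependencies` above: pass `{"database": check_database, "cache": check_cache}` and get back the same name-to-bool mapping.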

Startup Probes: The Third Kind

Some applications take a long time to start (loading ML models, warming caches, running migrations). Without a startup probe, Kubernetes might kill them for being “not live” during initialization.

startup_complete = False

@app.on_event("startup")  # newer FastAPI versions prefer lifespan handlers
async def startup():
    global startup_complete
    
    # Slow initialization
    await load_ml_model()  # 30 seconds
    await warm_cache()     # 20 seconds
    await run_migrations() # 10 seconds
    
    startup_complete = True

@app.get("/health/startup")
async def startup_probe():
    if startup_complete:
        return {"status": "started"}
    return Response(status_code=503)

Kubernetes config:

startupProbe:
  httpGet:
    path: /health/startup
    port: 8080
  failureThreshold: 30
  periodSeconds: 10
  # Allows up to 5 minutes for startup (30 * 10s)

livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
  periodSeconds: 10
  failureThreshold: 3
  # Only starts after startup probe succeeds
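For completeness, a matching readiness stanza (path and port assumed from the examples above) might look like this — note that a failing readiness probe only removes the pod from Service endpoints; it never restarts it:

```yaml
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 3
  # Failure → removed from endpoints; success → traffic resumes
```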

Graceful Shutdown

When shutting down, stop accepting new traffic before stopping the process:

import asyncio
import signal
import sys

shutting_down = False

def handle_shutdown(signum, frame):
    global shutting_down
    shutting_down = True
    # Readiness probe now returns 503
    # Load balancer removes us from rotation
    # Wait for in-flight requests to complete
    asyncio.get_event_loop().call_later(10, sys.exit, 0)

signal.signal(signal.SIGTERM, handle_shutdown)

@app.get("/health/ready")
async def readiness(response: Response):
    if shutting_down:
        response.status_code = 503
        return {"status": "shutting_down"}
    # ... rest of checks

The sequence:

  1. SIGTERM received
  2. Readiness returns 503
  3. Load balancer stops sending traffic (takes a few seconds)
  4. In-flight requests complete
  5. Process exits
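On the Kubernetes side, the pod spec needs to give this sequence time to play out. A sketch of the relevant fields (values are illustrative, not prescriptive):

```yaml
spec:
  terminationGracePeriodSeconds: 30   # SIGKILL only after this deadline
  containers:
    - name: app
      lifecycle:
        preStop:
          exec:
            # Runs before SIGTERM is sent; gives endpoint removal a
            # few seconds to propagate to load balancers
            command: ["sleep", "5"]
```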

Deep Health Checks

For debugging, expose more detail on a separate endpoint:

import time
from datetime import datetime, timezone

import psutil

process = psutil.Process()  # process-level memory/CPU stats
# APP_VERSION and START_TIME are set at module import

@app.get("/health/deep")
async def deep_health():
    """Detailed health for debugging. Don't use for probes."""
    return {
        "status": "healthy",
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "version": APP_VERSION,
        "uptime_seconds": time.time() - START_TIME,
        "checks": {
            "database": {
                "connected": await check_database(),
                "pool_size": db.pool.size(),
                "pool_available": db.pool.freesize(),
                "latency_ms": await measure_db_latency(),
            },
            "cache": {
                "connected": await check_cache(),
                "memory_used_mb": await get_redis_memory(),
                "latency_ms": await measure_cache_latency(),
            },
            "memory": {
                "rss_mb": process.memory_info().rss / 1024 / 1024,
                "percent": process.memory_percent(),
            },
            "cpu_percent": process.cpu_percent(),
        }
    }

Don’t use this for Kubernetes probes. It’s too slow and too much information. Use it for dashboards and debugging.

Common Mistakes

Checking dependencies in liveness

# BAD: Database down → restart loop
@app.get("/health/live")
async def liveness():
    await db.execute("SELECT 1")  # Don't do this
    return {"status": "alive"}

No timeout on health checks

# BAD: Slow database hangs the health check
@app.get("/health/ready")
async def readiness():
    await db.execute("SELECT 1")  # Could take forever
    return {"status": "ready"}

# GOOD: Timeout on checks
@app.get("/health/ready")
async def readiness():
    try:
        await asyncio.wait_for(db.execute("SELECT 1"), timeout=2.0)
        return {"status": "ready"}
    except asyncio.TimeoutError:
        return Response(status_code=503)

Readiness that never recovers

# BAD: Once marked unhealthy, stays unhealthy
cache_healthy = True

@app.get("/health/ready")
async def readiness():
    if not cache_healthy:
        return Response(status_code=503)
    return {"status": "ready"}

# Someone sets cache_healthy = False on error
# But nothing ever sets it back to True

Always check current state, not cached state from past failures.
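If re-checking on every probe is too expensive, a time-bounded cache gets the same recovery property: a past failure can never pin the service unhealthy, because the result expires and is re-probed. A sketch (the `FreshCheck` name is illustrative):

```python
import time

class FreshCheck:
    """Cache a health check result briefly, but always re-probe after the TTL,
    so a stale failure can never mark the service unhealthy forever."""

    def __init__(self, probe, ttl_seconds: float = 5.0):
        self._probe = probe              # callable returning bool (may raise)
        self._ttl = ttl_seconds
        self._value = False
        self._checked_at = float("-inf")  # force a probe on first call

    def healthy(self) -> bool:
        now = time.monotonic()
        if now - self._checked_at >= self._ttl:
            try:
                self._value = bool(self._probe())
            except Exception:
                self._value = False       # errors count as unhealthy
            self._checked_at = now
        return self._value
```

The readiness handler holds one `FreshCheck` per dependency and calls `healthy()` on each request; after a transient failure, the next probe past the TTL flips it back.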

The Checklist

Before deploying:

  • Liveness probe: minimal, checks process health only
  • Readiness probe: checks all dependencies needed to serve traffic
  • Startup probe: if initialization is slow (>30 seconds)
  • Graceful shutdown: readiness fails before process exits
  • Timeouts: all health check operations have timeouts
  • Deep health: detailed endpoint for debugging (not for probes)
  • No external dependencies in liveness
  • Recovery: readiness can return to healthy after transient failures

Health checks are your service’s immune system. Get them right, and your infrastructure handles failures automatically. Get them wrong, and you’ll be debugging restart loops at 3 AM.