Every modern service needs health check endpoints. Load balancers probe them. Kubernetes uses them. Monitoring systems scrape them. But a naive implementation—returning 200 OK if the process is running—tells you almost nothing useful. Here’s how to build health checks that actually help.

Two Types of Health

  • Liveness: Is the process alive and not deadlocked?
  • Readiness: Can this instance handle requests right now?

These are different questions with different answers:

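# Assuming FastAPI, which the decorators and JSONResponse usage suggest
from fastapi import FastAPI
from fastapi.responses import JSONResponse

app = FastAPI()

# Liveness: Am I alive?
@app.get("/health/live")
def liveness():
    # If this returns, the process is alive
    return {"status": "alive"}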

# Readiness: Can I serve traffic?
@app.get("/health/ready")
def readiness():
    checks = {
        "database": check_database(),
        "cache": check_cache(),
        "disk_space": check_disk_space(),
    }
    
    all_healthy = all(c["healthy"] for c in checks.values())
    
    return JSONResponse(
        status_code=200 if all_healthy else 503,
        content={"status": "ready" if all_healthy else "not_ready", "checks": checks}
    )

Why separate them?

  • Liveness failure: Restart the container (it’s stuck)
  • Readiness failure: Stop sending traffic (but don’t restart)

If you restart on every readiness failure, a database blip could cause a restart cascade across your entire fleet.
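The liveness handler above only proves that the HTTP layer can still respond. If "not deadlocked" should also cover a background worker, one option is a heartbeat that the worker updates and the liveness endpoint inspects. The following is a minimal sketch of that idea; `Heartbeat`, `max_age_seconds`, `liveness_status`, and the injectable clock are hypothetical names for illustration, not something the original code defines.

```python
import threading
import time


class Heartbeat:
    """Tracks when a worker loop last reported progress."""

    def __init__(self, max_age_seconds=30.0, clock=time.monotonic):
        self._max_age = max_age_seconds
        self._clock = clock  # injectable so the logic can be tested
        self._lock = threading.Lock()
        self._last_beat = clock()

    def beat(self):
        """Call this from the worker loop on every iteration."""
        with self._lock:
            self._last_beat = self._clock()

    def is_alive(self):
        """True if the worker reported progress recently enough."""
        with self._lock:
            return self._clock() - self._last_beat <= self._max_age


def liveness_status(heartbeat):
    """Return (http_status_code, body) for a liveness endpoint."""
    if heartbeat.is_alive():
        return 200, {"status": "alive"}
    return 503, {"status": "stalled"}
```

In the handler, the returned pair would map onto `JSONResponse(status_code=..., content=...)`. Pick `max_age_seconds` generously: a liveness failure triggers a restart, so a false positive is expensive.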

Check What Matters

Only fail the check for dependencies whose outage would prevent the service from working:

from time import perf_counter

def check_database():
    """Database is required: we can't serve requests without it."""
    start = perf_counter()
    try:
        db.execute("SELECT 1")
        # Time the probe query itself so the latency reflects this check
        latency_ms = (perf_counter() - start) * 1000
        return {"healthy": True, "latency_ms": round(latency_ms, 1)}
    except Exception as e:
        return {"healthy": False, "error": str(e)}

def check_cache():
    """Cache is optional: we can fall back to the database."""
    try:
        cache.ping()
        return {"healthy": True}
    except Exception:
        # Degraded but functional
        return {"healthy": True, "degraded": True, "message": "Cache unavailable, using fallback"}

def check_external_api():
    """External API for a non-critical feature: don't fail health on it."""
    try:
        external_api.ping()
        return {"healthy": True}
    except Exception:
        # Surface it as degraded (and log it), but don't affect health
        return {"healthy": True, "degraded": True}

Rule: Only report unhealthy when a dependency's failure truly prevents the service from functioning.
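The rule can also be applied when the individual results are combined into one response. Here is a small sketch, assuming each check returns a dict with a "healthy" flag and an optional "degraded" flag as the functions above do. The `summarize` helper and the extra "degraded" status string are illustrative assumptions, not part of the original handler.

```python
def summarize(checks):
    """Collapse per-dependency results into an HTTP code and a status.

    `checks` maps a dependency name to a dict with a "healthy" flag and
    an optional "degraded" flag.
    """
    if not all(c["healthy"] for c in checks.values()):
        # At least one required dependency is down: stop taking traffic
        return 503, "not_ready"
    if any(c.get("degraded", False) for c in checks.values()):
        # Still able to serve, so the probe passes; the status string
        # tells an operator that something optional is missing
        return 200, "degraded"
    return 200, "ready"
```

The probe only looks at the status code, so a degraded instance keeps receiving traffic while the response body still records what is wrong.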

Include Useful Details

A good health response helps debugging:

{
  "status": "ready",
  "version": "1.2.3",
  "uptime_seconds": 3600,
  "checks": {
    "database": {
      "healthy": true,
      "latency_ms": 2.3,
      "pool_size": 10,
      "pool_available": 8
    },
    "cache": {
      "healthy": true,
      "latency_ms": 0.5,
      "hit_rate": 0.92
    },
    "disk_space": {
      "healthy": true,
      "free_gb": 45.2,
      "threshold_gb": 10
    }
  }
}

This tells you not just “healthy” but “how healthy”—useful for spotting trends before they become outages.
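The readiness handler calls `check_disk_space()`, but the article never shows it. One possible shape, producing the `free_gb` and `threshold_gb` fields from the response above, is sketched here with the standard library's `shutil.disk_usage`. The default path and threshold are assumptions; adjust them to the volume your service actually writes to.

```python
import shutil


def check_disk_space(path="/", threshold_gb=10.0):
    """Report free space on `path`; unhealthy below `threshold_gb`."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as e:
        # Missing mount point, permission problem, and so on
        return {"healthy": False, "error": str(e)}

    # Binary gigabytes (GiB); use 1000**3 if you prefer decimal GB
    free_gb = usage.free / 1024**3
    return {
        "healthy": free_gb >= threshold_gb,
        "free_gb": round(free_gb, 1),
        "threshold_gb": threshold_gb,
    }
```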

Kubernetes Configuration

apiVersion: v1
kind: Pod
spec:
  containers:
  - name: app
    livenessProbe:
      httpGet:
        path: /health/live
        port: 8080
      initialDelaySeconds: 10  # Wait for startup
      periodSeconds: 10        # Check every 10s
      timeoutSeconds: 5        # Fail if no response in 5s
      failureThreshold: 3      # Restart after 3 failures
    
    readinessProbe:
      httpGet:
        path: /health/ready
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 5
      timeoutSeconds: 3
      failureThreshold: 2      # Remove from service after 2 failures
      successThreshold: 1      # Add back after 1 success

Tune these values:

  • initialDelaySeconds: Long enough for your app to actually start
  • periodSeconds: Balance between responsiveness and overhead
  • failureThreshold: Higher = more tolerant of transient issues
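To see what a given combination means in practice, it helps to estimate how long a failure can go unnoticed. The sketch below is a rough model only: it assumes probes fire at a fixed period and ignores kubelet scheduling jitter. The `detection_window` helper is a hypothetical name introduced for the arithmetic.

```python
def detection_window(period_seconds, failure_threshold, timeout_seconds):
    """Rough best and worst case, in seconds, from failure to action.

    Best case: the failure happens just before a probe fires and every
    failed probe returns immediately.
    Worst case: the failure happens just after a successful probe and
    the final failed probe runs until it times out.
    """
    best = period_seconds * (failure_threshold - 1)
    worst = period_seconds * failure_threshold + timeout_seconds
    return best, worst


# Values taken from the Pod spec above
liveness = detection_window(period_seconds=10, failure_threshold=3, timeout_seconds=5)
readiness = detection_window(period_seconds=5, failure_threshold=2, timeout_seconds=3)
print(liveness)   # (20, 35)
print(readiness)  # (5, 13)
```

Under this model, a stuck container is restarted after roughly 20 to 35 seconds, while an instance that cannot serve is pulled from rotation after roughly 5 to 13 seconds.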

Startup Probes

For slow-starting apps, use a startup probe:

startupProbe:
  httpGet:
    path: /health/live
    port: 8080
  initialDelaySeconds: 0
  periodSeconds: 5
  failureThreshold: 30  # 30 * 5s = 150s max startup time

The startup probe runs first. Liveness and readiness probes only start after it succeeds. This prevents Kubernetes from killing slow-starting containers.

Don’t Check Too Much

Health checks run frequently. Keep them fast:

# BAD: Heavy query on every health check
def check_database():
    result = db.execute("SELECT COUNT(*) FROM orders WHERE created_at > NOW() - INTERVAL '1 hour'")
    return {"healthy": True, "recent_orders": result}

# GOOD: Minimal query
def check_database():
    db.execute("SELECT 1")
    return {"healthy": True}

If you need metrics, collect them separately on a longer interval.
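Even a minimal query can hang when the dependency is unreachable, and a health check that outlasts the probe's timeout counts as a failure anyway. One way to bound each check is to run it on a worker thread with a deadline. This is a sketch using the standard library's `concurrent.futures`; `run_with_timeout`, the pool size, and the default timeout are illustrative assumptions.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

# Shared pool. A timed-out check keeps running in its worker thread,
# so the pool size caps how many stuck checks can pile up.
_executor = ThreadPoolExecutor(max_workers=4)


def run_with_timeout(check_fn, timeout_seconds=1.0):
    """Run a check, reporting unhealthy if it does not finish in time."""
    future = _executor.submit(check_fn)
    try:
        return future.result(timeout=timeout_seconds)
    except FutureTimeout:
        return {"healthy": False, "error": f"timed out after {timeout_seconds}s"}
    except Exception as e:
        # The check itself raised instead of returning a result dict
        return {"healthy": False, "error": str(e)}
```

For example, `run_with_timeout(check_database, timeout_seconds=0.5)` would replace a direct call to `check_database()`. For optional dependencies, the caller may prefer to translate a timeout into a degraded result rather than an unhealthy one.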

Caching Health Checks

For expensive checks, cache the result:

from time import monotonic

class HealthCache:
    def __init__(self, ttl_seconds=5):
        self.ttl = ttl_seconds
        self.cache = {}  # key -> (value, timestamp)

    def get_or_compute(self, key, compute_fn):
        # monotonic() is unaffected by wall-clock adjustments
        now = monotonic()
        if key in self.cache:
            value, timestamp = self.cache[key]
            if now - timestamp < self.ttl:
                return value

        # Missing or expired: recompute and store
        value = compute_fn()
        self.cache[key] = (value, now)
        return value

health_cache = HealthCache(ttl_seconds=5)

@app.get("/health/ready")
def readiness():
    db_health = health_cache.get_or_compute("database", check_database)
    cache_health = health_cache.get_or_compute("cache", check_cache)
    # ...

Deep Health Checks

Sometimes you need a thorough check for debugging:

import psutil

@app.get("/health/deep")
def deep_health():
    """Expensive checks: don't use for load balancer probes."""
    return {
        "database": {
            **check_database(),
            "replication_lag_ms": get_replication_lag(),
            "connection_count": get_connection_count(),
        },
        "cache": {
            **check_cache(),
            "memory_usage_mb": get_cache_memory(),
            "key_count": get_cache_key_count(),
        },
        "system": {
            "cpu_percent": psutil.cpu_percent(),
            "memory_percent": psutil.virtual_memory().percent,
            "disk_percent": psutil.disk_usage('/').percent,
        }
    }

Protect it: This endpoint should not be exposed to the internet or used by load balancers. It’s for operators, not infrastructure.
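How to protect it depends on your environment: a network policy, an internal-only port, or an authentication layer in front of the service are all reasonable. If none of those is available, a shared operator token checked in the handler is one fallback. The sketch below shows only the comparison step; the `is_authorized` helper is hypothetical, as are the `X-Health-Token` header and `HEALTH_TOKEN` environment variable mentioned after it.

```python
import hmac


def is_authorized(presented_token, expected_token):
    """Constant-time comparison of an operator token.

    Returns False when either value is missing, so an unset secret
    never silently opens the endpoint.
    """
    if not presented_token or not expected_token:
        return False
    return hmac.compare_digest(
        presented_token.encode("utf-8"), expected_token.encode("utf-8")
    )
```

A handler could read the presented value from a header such as `X-Health-Token`, compare it against a secret loaded from `HEALTH_TOKEN`, and return 403 when `is_authorized` is false.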

Quick Reference

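| Endpoint      | Purpose              | Check Frequency | Failure Action    |
|---------------|----------------------|-----------------|-------------------|
| /health/live  | Process alive?       | Every 10s       | Restart           |
| /health/ready | Can serve traffic?   | Every 5s        | Remove from LB    |
| /health/deep  | Detailed diagnostics | On demand       | Alert/investigate |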

The goal: Health checks should tell the truth. If a service can’t do its job, it should say so. If it can, it shouldn’t cry wolf. The infrastructure will handle the rest.