Your container says it’s healthy. Your users say the app is broken. Sound familiar?

Basic health checks only tell you if a process is running. Here’s how to build checks that catch real problems.

Beyond “Is It Alive?”

Most health checks look like this:

HEALTHCHECK CMD curl -f http://localhost:8080/health || exit 1

This tells you the HTTP server responds. It doesn’t tell you:

  • Can the app reach the database?
  • Is the cache connected?
  • Are critical background workers running?
  • Is the disk filling up?

Layered Health Checks

Implement three levels:

1. Liveness: “Should Kubernetes restart me?”

Only fail if the process is fundamentally broken:

@app.get("/health/live")
def liveness():
    # Can the process respond at all?
    return {"status": "alive"}

Keep this simple. If it fails, the container restarts. False positives here cause restart loops.

2. Readiness: “Should I receive traffic?”

Check dependencies before accepting requests:

@app.get("/health/ready")
async def readiness():
    checks = {
        "database": await check_db_connection(),
        "cache": await check_redis_connection(),
        "disk_space": check_disk_space() > 100_MB,
    }
    
    if all(checks.values()):
        return {"status": "ready", "checks": checks}
    
    raise HTTPException(503, {"status": "not_ready", "checks": checks})

Failing readiness removes the pod from the load balancer without restarting it.
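The readiness snippet above calls a `check_disk_space()` helper it never defines. A minimal sketch using only the standard library might look like this:

```python
import shutil

def check_disk_space(path: str = "/") -> int:
    """Return free bytes on the filesystem containing `path`."""
    return shutil.disk_usage(path).free
```

The endpoint then compares the result against a threshold, e.g. `check_disk_space() > 100 * 2**20` for 100 MB of headroom.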

3. Startup: “Am I done initializing?”

For slow-starting applications:

startupProbe:
  httpGet:
    path: /health/startup
    port: 8080
  failureThreshold: 30
  periodSeconds: 10

This gives the app 5 minutes to start before liveness checks begin.
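The probe above expects a `/health/startup` endpoint on the app side. Sketched here without framework wiring (the `app_state` flag and `startup_probe` name are illustrative, not from the config above): flip the flag once migrations, cache warm-up, and other init work finish.

```python
app_state = {"initialized": False}  # set True when init work completes

def startup_probe() -> tuple[int, dict]:
    """Return (HTTP status, body) for /health/startup."""
    if app_state["initialized"]:
        return 200, {"status": "started"}
    return 503, {"status": "starting"}
```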

The Dependency Problem

Don’t let one flaky dependency take down everything:

async def check_db_connection():
    try:
        # Bound the check; asyncio.timeout requires Python 3.11+
        async with asyncio.timeout(2.0):
            await db.execute("SELECT 1")
        return True
    except Exception:
        return False

@app.get("/health/ready")
async def readiness():
    checks = await asyncio.gather(
        check_db_connection(),
        check_redis_connection(),
        check_external_api(),
        return_exceptions=True
    )
    
    # Critical vs non-critical
    db_ok, cache_ok, api_ok = checks
    
    critical_ok = db_ok is True
    degraded = not (cache_ok is True and api_ok is True)
    
    if not critical_ok:
        raise HTTPException(503, "Critical dependency down")
    
    return {
        "status": "degraded" if degraded else "healthy",
        "checks": {
            "database": db_ok,
            "cache": cache_ok,
            "external_api": api_ok
        }
    }
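One refinement the handler above doesn't show: with several replicas probed every few seconds, every probe hits every dependency. A short TTL cache on check results, sketched here for synchronous checks (the `CachedCheck` name is ours), bounds that load:

```python
import time

class CachedCheck:
    """Cache a health check result for `ttl` seconds so frequent
    probes don't hammer the dependency behind it."""

    def __init__(self, check_fn, ttl: float = 5.0):
        self.check_fn = check_fn
        self.ttl = ttl
        self._result = None
        self._expires = 0.0

    def __call__(self) -> bool:
        now = time.monotonic()
        if now >= self._expires:
            self._result = self.check_fn()
            self._expires = now + self.ttl
        return self._result
```

Keep the TTL well below the probe interval's failure budget, or a pod can report stale health for one extra cycle.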

Health Check Anti-Patterns

❌ Checking too much in liveness

# BAD: Database down = restart loop
@app.get("/health/live")
def liveness():
    db.execute("SELECT 1")  # Don't do this
    return {"status": "alive"}

❌ No timeouts

# BAD: Hangs forever if DB is slow
def check_db():
    return db.execute("SELECT 1")  # Add timeout!
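A fixed version can be sketched as a generic wrapper (`with_timeout` is a name introduced here): any exception, including the deadline firing, reads as unhealthy instead of hanging the probe.

```python
import asyncio

async def with_timeout(coro, seconds: float = 2.0) -> bool:
    """Await a dependency check with a hard deadline."""
    try:
        await asyncio.wait_for(coro, timeout=seconds)
        return True
    except Exception:
        return False
```

Usage: `await with_timeout(db.execute("SELECT 1"))`.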

❌ Exposing sensitive info

# BAD: Leaking internal details
@app.get("/health")
def health():
    return {
        "db_host": DB_HOST,  # Don't expose this
        "redis_password": REDIS_PASS  # Definitely not this
    }
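A safer shape exposes only component names and pass/fail, never hosts or credentials. A sketch (the helper name is ours):

```python
def safe_health_response(checks: dict[str, bool]) -> dict:
    """Report per-component status without leaking configuration."""
    return {
        "status": "healthy" if all(checks.values()) else "degraded",
        "checks": {name: "ok" if ok else "fail" for name, ok in checks.items()},
    }
```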

Docker Compose Example

services:
  api:
    image: myapp:latest
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health/ready"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_healthy

  postgres:
    image: postgres:15
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 10s
      timeout: 5s
      retries: 5

  redis:
    image: redis:7
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 5s
      retries: 5
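One caveat about the `api` healthcheck above: it assumes `curl` exists in the image, which slim and distroless bases often omit. If the image ships Python (an assumption about your base image), a stdlib-only probe avoids the extra binary:

```yaml
    healthcheck:
      test: ["CMD", "python", "-c", "import urllib.request as u; u.urlopen('http://localhost:8080/health/ready', timeout=5)"]
      interval: 30s
      timeout: 10s
      retries: 3
```

`urlopen` raises on a 503 response, so a failing readiness check still exits nonzero.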

Kubernetes Probe Configuration

apiVersion: v1
kind: Pod
spec:
  containers:
  - name: app
    livenessProbe:
      httpGet:
        path: /health/live
        port: 8080
      initialDelaySeconds: 10
      periodSeconds: 15
      timeoutSeconds: 5
      failureThreshold: 3
    
    readinessProbe:
      httpGet:
        path: /health/ready
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 10
      timeoutSeconds: 5
      failureThreshold: 3
    
    startupProbe:
      httpGet:
        path: /health/startup
        port: 8080
      failureThreshold: 30
      periodSeconds: 10

Monitoring Health Check Results

Log health check failures with context:

import logging
from datetime import datetime, timezone

logger = logging.getLogger(__name__)

@app.get("/health/ready")
async def readiness():
    checks = await run_health_checks()
    
    failed = [k for k, v in checks.items() if not v]
    
    if failed:
        logger.warning(
            "Health check degraded",
            extra={
                "failed_checks": failed,
                "timestamp": datetime.now(timezone.utc).isoformat()
            }
        )
    
    return {"status": "degraded" if failed else "healthy", "checks": checks}

The Golden Rule

Liveness: Can the process function at all?
Readiness: Can the process serve requests correctly?
Startup: Has initialization completed?

Keep them separate. Keep them fast. Keep them honest.


A container that lies about its health is worse than one that admits it’s broken.