Your container says it’s healthy. Your users say the app is broken. Sound familiar?
Basic health checks only tell you if a process is running. Here’s how to build checks that catch real problems.
## Beyond “Is It Alive?”
Most health checks look like this:
```dockerfile
HEALTHCHECK CMD curl -f http://localhost:8080/health || exit 1
```
This tells you the HTTP server responds. It doesn’t tell you:
- Can the app reach the database?
- Is the cache connected?
- Are critical background workers running?
- Is the disk filling up?
## Layered Health Checks
Implement three levels:
### 1. Liveness: “Should Kubernetes restart me?”
Only fail if the process is fundamentally broken:
```python
@app.get("/health/live")
def liveness():
    # Can the process respond at all?
    return {"status": "alive"}
```
Keep this simple. If it fails, the container restarts. False positives here cause restart loops.
### 2. Readiness: “Should I receive traffic?”
Check dependencies before accepting requests:
```python
MIN_FREE_DISK = 100 * 1024 * 1024  # 100 MB

@app.get("/health/ready")
async def readiness():
    checks = {
        "database": await check_db_connection(),
        "cache": await check_redis_connection(),
        "disk_space": check_disk_space() > MIN_FREE_DISK,
    }
    if all(checks.values()):
        return {"status": "ready", "checks": checks}
    raise HTTPException(503, {"status": "not_ready", "checks": checks})
```
Failing readiness removes the pod from the load balancer without restarting it.
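The `check_disk_space` helper isn't defined in this post; a minimal sketch using the standard library's `shutil.disk_usage` could look like this (the path and return convention are assumptions, not part of the original):

```python
import shutil

def check_disk_space(path: str = "/") -> int:
    # Free bytes on the filesystem containing `path`
    return shutil.disk_usage(path).free
```

Returning raw bytes keeps the threshold comparison in the endpoint, where it's easy to tune.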
### 3. Startup: “Am I done initializing?”
For slow-starting applications:
```yaml
startupProbe:
  httpGet:
    path: /health/startup
    port: 8080
  failureThreshold: 30
  periodSeconds: 10
```
This gives the app 5 minutes (30 failures × 10 seconds) to start before liveness and readiness checks begin.
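The `/health/startup` endpoint itself can be a simple flag that flips once initialization finishes. A minimal sketch of that state logic, with the framework wiring omitted (`mark_startup_complete` is a hypothetical hook your init code would call after migrations, cache warming, etc.):

```python
import time

_started_at = time.monotonic()
_startup_complete = False

def mark_startup_complete() -> None:
    # Call once initialization (migrations, cache warm-up, ...) has finished
    global _startup_complete
    _startup_complete = True

def startup_status() -> tuple[int, dict]:
    # Returns (HTTP status, body) for a /health/startup handler to serve
    if _startup_complete:
        return 200, {"status": "started"}
    uptime = round(time.monotonic() - _started_at, 1)
    return 503, {"status": "starting", "uptime_seconds": uptime}
```

The probe keeps failing (503) until the flag flips, after which Kubernetes hands off to the liveness and readiness probes.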
## The Dependency Problem
Don’t let one flaky dependency take down everything:
```python
import asyncio

async def check_db_connection():
    try:
        async with asyncio.timeout(2.0):  # Python 3.11+; use the async-timeout package on older versions
            await db.execute("SELECT 1")
        return True
    except Exception:
        return False

@app.get("/health/ready")
async def readiness():
    results = await asyncio.gather(
        check_db_connection(),
        check_redis_connection(),
        check_external_api(),
        return_exceptions=True,
    )

    # Coerce to booleans: an exception from gather counts as a failure
    db_ok, cache_ok, api_ok = (r is True for r in results)

    # The database is critical; cache and the external API only degrade service
    if not db_ok:
        raise HTTPException(503, "Critical dependency down")

    degraded = not (cache_ok and api_ok)
    return {
        "status": "degraded" if degraded else "healthy",
        "checks": {
            "database": db_ok,
            "cache": cache_ok,
            "external_api": api_ok,
        },
    }
```
## Health Check Anti-Patterns

### ❌ Checking too much in liveness
```python
# BAD: Database down = restart loop
@app.get("/health/live")
def liveness():
    db.execute("SELECT 1")  # Don't do this
    return {"status": "alive"}
```
### ❌ No timeouts
```python
# BAD: Hangs forever if the DB is slow
def check_db():
    return db.execute("SELECT 1")  # Add a timeout!
```
### ❌ Exposing sensitive info
```python
# BAD: Leaking internal details
@app.get("/health")
def health():
    return {
        "db_host": DB_HOST,           # Don't expose this
        "redis_password": REDIS_PASS  # Definitely not this
    }
```
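A safer shape exposes only pass/fail per dependency. A small sketch of that idea (the function name and check labels are illustrative, not from the original):

```python
def safe_health_response(checks: dict[str, bool]) -> dict:
    # Expose only per-dependency pass/fail -- no hosts, ports, or credentials
    return {
        "status": "healthy" if all(checks.values()) else "degraded",
        "checks": {name: "ok" if ok else "failing" for name, ok in checks.items()},
    }
```

Anyone who can reach the endpoint learns which dependency is down, but nothing about how to reach it.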
## Docker Compose Example
```yaml
services:
  api:
    image: myapp:latest
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health/ready"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_healthy

  postgres:
    image: postgres:15
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 10s
      timeout: 5s
      retries: 5

  redis:
    image: redis:7
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 5s
      retries: 5
```
## Kubernetes Probe Configuration
```yaml
apiVersion: v1
kind: Pod
spec:
  containers:
    - name: app
      livenessProbe:
        httpGet:
          path: /health/live
          port: 8080
        initialDelaySeconds: 10
        periodSeconds: 15
        timeoutSeconds: 5
        failureThreshold: 3
      readinessProbe:
        httpGet:
          path: /health/ready
          port: 8080
        initialDelaySeconds: 5
        periodSeconds: 10
        timeoutSeconds: 5
        failureThreshold: 3
      startupProbe:
        httpGet:
          path: /health/startup
          port: 8080
        failureThreshold: 30
        periodSeconds: 10
```
## Monitoring Health Check Results
Log health check failures with context:
```python
import logging
from datetime import datetime, timezone

logger = logging.getLogger(__name__)

@app.get("/health/ready")
async def readiness():
    checks = await run_health_checks()
    failed = [name for name, ok in checks.items() if not ok]

    if failed:
        logger.warning(
            "Health check degraded",
            extra={
                "failed_checks": failed,
                "timestamp": datetime.now(timezone.utc).isoformat(),
            },
        )

    return {"status": "degraded" if failed else "healthy", "checks": checks}
```
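`run_health_checks` is left undefined above; one way to sketch it, reusing the gather pattern from the dependency section. The inner coroutines here are deterministic stubs standing in for the real `check_db_connection` / `check_redis_connection`:

```python
import asyncio

async def run_health_checks() -> dict[str, bool]:
    # Stubs standing in for the real dependency checks
    async def check_db() -> bool:
        return True

    async def check_cache() -> bool:
        return False

    names = ["database", "cache"]
    results = await asyncio.gather(check_db(), check_cache(), return_exceptions=True)
    # An exception (or any non-True result) counts as a failed check
    return {name: result is True for name, result in zip(names, results)}
```

Returning a plain name-to-bool dict keeps the endpoint's logging and status logic trivial.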
## The Golden Rule
- **Liveness:** Can the process function at all?
- **Readiness:** Can the process serve requests correctly?
- **Startup:** Has initialization completed?
Keep them separate. Keep them fast. Keep them honest.
A container that lies about its health is worse than one that admits it’s broken.