A health check that always returns 200 OK is worse than no health check at all. It gives you false confidence while your application silently fails. Let’s build health checks that actually tell you when something’s wrong.
## The Three Types of Probes
Kubernetes defines three probe types, each serving a distinct purpose:
- **Liveness Probe:** “Is this process stuck?” If it fails, Kubernetes kills and restarts the container.
- **Readiness Probe:** “Can this instance handle traffic?” If it fails, the instance is removed from load balancing but keeps running.
- **Startup Probe:** “Has this application finished starting?” Disables liveness/readiness checks until the app is ready.
Understanding when to use each is crucial.
## Basic Health Endpoint
Start with a simple health endpoint:
```python
from fastapi import FastAPI, Response
from datetime import datetime
import psycopg2
import redis

app = FastAPI()

@app.get("/health")
async def health():
    return {"status": "ok", "timestamp": datetime.utcnow().isoformat()}
```
This is a liveness check — it proves the process is running and can respond to HTTP. But it doesn’t tell you if the app can actually do its job.
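One failure mode even a bare liveness endpoint can catch is a wedged background worker: HTTP keeps answering while the worker loop is dead. A common remedy is a heartbeat the worker refreshes on every iteration and the liveness handler inspects. A minimal sketch (the `worker_heartbeat` dict and the 30-second threshold are assumptions, not part of the app above):

```python
import time

# Hypothetical heartbeat the background worker refreshes on every loop iteration
worker_heartbeat = {"last_beat": time.time()}
HEARTBEAT_STALE_AFTER = 30  # seconds; tune to the worker's loop interval

def worker_is_stuck() -> bool:
    """True if the worker has not checked in recently."""
    return time.time() - worker_heartbeat["last_beat"] > HEARTBEAT_STALE_AFTER

# The /health handler would return 503 when worker_is_stuck() is True,
# so Kubernetes restarts the pod and unwedges the worker.
```

This keeps the liveness probe focused on the process itself: it restarts the pod only for a condition a restart can actually fix.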
## Readiness: Check Your Dependencies
A readiness check should verify the app can handle requests:
```python
from typing import Callable, Dict, List, Tuple
import time
import requests

class HealthChecker:
    def __init__(self):
        self.checks: List[Tuple[str, Callable]] = []

    def add_check(self, name: str, check_fn: Callable):
        self.checks.append((name, check_fn))

    def run_checks(self) -> Dict:
        results = {}
        all_healthy = True
        for name, check_fn in self.checks:
            start = time.time()
            try:
                check_fn()
                results[name] = {
                    "status": "healthy",
                    "latency_ms": round((time.time() - start) * 1000, 2)
                }
            except Exception as e:
                all_healthy = False
                results[name] = {
                    "status": "unhealthy",
                    "error": str(e),
                    "latency_ms": round((time.time() - start) * 1000, 2)
                }
        return {
            "status": "healthy" if all_healthy else "unhealthy",
            "checks": results
        }

health_checker = HealthChecker()

# Database check (DATABASE_URL comes from your app's configuration)
def check_database():
    conn = psycopg2.connect(DATABASE_URL)
    cursor = conn.cursor()
    cursor.execute("SELECT 1")
    cursor.close()
    conn.close()

health_checker.add_check("database", check_database)

# Redis check
def check_redis():
    r = redis.from_url(REDIS_URL)
    r.ping()

health_checker.add_check("redis", check_redis)

# External API check (with timeout)
def check_payment_api():
    response = requests.get(
        "https://api.stripe.com/health",
        timeout=2
    )
    response.raise_for_status()

health_checker.add_check("payment_api", check_payment_api)

@app.get("/health/ready")
async def readiness(response: Response):
    result = health_checker.run_checks()
    if result["status"] != "healthy":
        response.status_code = 503
    return result
```
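To make the aggregation behavior concrete, here is a self-contained run of the same `HealthChecker` logic with one passing and one deliberately failing check (the class body is repeated so the snippet runs on its own; the check names are made up):

```python
import time
from typing import Callable, Dict, List, Tuple

class HealthChecker:
    def __init__(self):
        self.checks: List[Tuple[str, Callable]] = []

    def add_check(self, name: str, check_fn: Callable):
        self.checks.append((name, check_fn))

    def run_checks(self) -> Dict:
        results = {}
        all_healthy = True
        for name, check_fn in self.checks:
            start = time.time()
            try:
                check_fn()
                results[name] = {"status": "healthy",
                                 "latency_ms": round((time.time() - start) * 1000, 2)}
            except Exception as e:
                all_healthy = False
                results[name] = {"status": "unhealthy", "error": str(e),
                                 "latency_ms": round((time.time() - start) * 1000, 2)}
        return {"status": "healthy" if all_healthy else "unhealthy",
                "checks": results}

def failing_check():
    raise RuntimeError("connection refused")

checker = HealthChecker()
checker.add_check("always_ok", lambda: None)
checker.add_check("flaky", failing_check)

result = checker.run_checks()
# A single failing check flips the top-level status to "unhealthy",
# which the /health/ready handler turns into a 503.
```

Note that one failing check is enough to fail readiness outright; the graceful degradation pattern later in this post relaxes that for non-critical dependencies.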
## Kubernetes Configuration
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  replicas: 3
  template:
    spec:
      containers:
        - name: api
          image: myapp:latest
          ports:
            - containerPort: 8080
          # Liveness: restart if stuck
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 15
            timeoutSeconds: 5
            failureThreshold: 3
          # Readiness: remove from LB if dependencies down
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 2
          # Startup: wait for slow initialization
          startupProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 0
            periodSeconds: 5
            timeoutSeconds: 5
            failureThreshold: 30  # 30 * 5s = 150s max startup
```
## Common Mistakes

### 1. Checking Dependencies in Liveness Probes
```yaml
# DON'T do this
livenessProbe:
  httpGet:
    path: /health/ready  # Checks the database!
```
If your database goes down, Kubernetes will restart all your pods. They’ll come up, fail the liveness check (database still down), and restart again. Cascading failure.
Liveness should only check if the process itself is healthy — not external dependencies.
### 2. Timeouts Longer Than Probe Intervals
```yaml
# Broken configuration
readinessProbe:
  periodSeconds: 5
  timeoutSeconds: 10  # Timeout > period!
```
With a timeout longer than the period, a probe can still be in flight when the next one is due, so the effective interval stretches and failures are detected later and less predictably than the configuration suggests.
### 3. No Startup Probe for Slow Apps
```yaml
# App takes 60s to start, but liveness kills it at 45s
livenessProbe:
  initialDelaySeconds: 30
  failureThreshold: 3
  periodSeconds: 5
# 30 + (3 * 5) = 45s max before restart
```
Use startup probes for applications that need time to initialize:
```yaml
startupProbe:
  httpGet:
    path: /health
    port: 8080
  failureThreshold: 60
  periodSeconds: 5
# 60 * 5 = 300s (5 min) to start
```
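The arithmetic in these comments generalizes: the worst case before a probe gives up is roughly `initialDelaySeconds + failureThreshold * periodSeconds`. A tiny helper makes the two numbers above checkable (the function name is mine, not a Kubernetes API, and it ignores `timeoutSeconds` and kubelet jitter):

```python
def worst_case_seconds(initial_delay: int, failure_threshold: int, period: int) -> int:
    """Approximate worst-case time before the probe is considered failed.

    Ignores timeoutSeconds and the kubelet's probe jitter, so treat the
    result as a budget, not an exact deadline.
    """
    return initial_delay + failure_threshold * period

# Misconfigured liveness probe above: restarted after at most 45s
liveness_budget = worst_case_seconds(30, 3, 5)

# Startup probe above: the app gets up to 300s (5 min) to come up
startup_budget = worst_case_seconds(0, 60, 5)
```

Running this budget calculation against every probe in a Deployment is a cheap review step before shipping a config.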
## Advanced Patterns

### Graceful Degradation
Return partial health when non-critical dependencies fail:
```python
@app.get("/health/ready")
async def readiness(response: Response):
    checks = dict(health_checker.checks)  # checks is a list of (name, fn) pairs
    results = {}
    critical_healthy = True

    # Critical: must be healthy
    for name in ["database", "redis"]:
        try:
            checks[name]()
            results[name] = "healthy"
        except Exception as e:
            results[name] = f"unhealthy: {e}"
            critical_healthy = False

    # Non-critical: report but don't fail readiness
    for name in ["email_service", "analytics"]:
        try:
            checks[name]()
            results[name] = "healthy"
        except Exception as e:
            results[name] = f"degraded: {e}"
            # Deliberately do not set critical_healthy = False

    if not critical_healthy:
        response.status_code = 503
    return {"status": "healthy" if critical_healthy else "unhealthy", "checks": results}
```
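The critical/non-critical split can also be factored into a small pure function that maps per-check outcomes to an overall status, keeping the endpoint itself short. A sketch (the three-way `degraded` status and the check names are my additions, not part of the handler above):

```python
from typing import Dict, Set

def aggregate(check_results: Dict[str, bool], critical: Set[str]) -> str:
    """Overall status from per-check pass/fail results.

    'unhealthy' only when a critical check fails; failing non-critical
    checks downgrade to 'degraded' instead of failing readiness.
    """
    if any(not ok for name, ok in check_results.items() if name in critical):
        return "unhealthy"
    if not all(check_results.values()):
        return "degraded"
    return "healthy"
```

Readiness would return 503 only for "unhealthy"; "degraded" is better surfaced through metrics and alerts than through the load balancer.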
### Circuit Breaker Integration
Skip dependency checks if the circuit is open:
```python
from circuitbreaker import circuit, CircuitBreakerError

@circuit(failure_threshold=3, recovery_timeout=30)
def check_external_api():
    response = requests.get(EXTERNAL_API, timeout=2)
    response.raise_for_status()

def check_external_api_safe():
    try:
        check_external_api()
        return "healthy"
    except CircuitBreakerError:
        return "circuit_open"  # Breaker is open; don't count as a new failure
    except Exception as e:
        return f"unhealthy: {e}"
```
### Health Check Caching
Avoid hammering dependencies on every probe:
```python
from datetime import datetime, timedelta

class CachedHealthChecker:
    def __init__(self, ttl_seconds: int = 5):
        self.ttl = ttl_seconds
        self.last_check = None
        self.cached_result = None

    def check(self, checker: HealthChecker):
        now = datetime.utcnow()
        if (self.last_check is None or
                now - self.last_check > timedelta(seconds=self.ttl)):
            self.cached_result = checker.run_checks()
            self.last_check = now
        return self.cached_result

cached_checker = CachedHealthChecker(ttl_seconds=5)

@app.get("/health/ready")
async def readiness(response: Response):
    result = cached_checker.check(health_checker)
    if result["status"] != "healthy":
        response.status_code = 503
    return result
```
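To see the cache doing its job, here is a self-contained version of the same TTL idea with a counter standing in for an expensive dependency check (the `TTLCache` name and helper are illustrative, not a library API). It uses `time.monotonic()` rather than wall-clock time, so the TTL is immune to clock adjustments:

```python
import time

class TTLCache:
    """Caches a zero-argument function's result for ttl_seconds."""

    def __init__(self, fn, ttl_seconds: float):
        self.fn = fn
        self.ttl = ttl_seconds
        self.last_run = None
        self.cached = None

    def get(self):
        now = time.monotonic()
        if self.last_run is None or now - self.last_run > self.ttl:
            self.cached = self.fn()
            self.last_run = now
        return self.cached

calls = {"count": 0}

def expensive_check():
    calls["count"] += 1
    return {"status": "healthy"}

cached = TTLCache(expensive_check, ttl_seconds=5)
for _ in range(3):
    cached.get()
# Three probes inside the TTL hit the dependency only once
```

With three replicas each probed every 10 seconds, a 5-second TTL cuts dependency traffic roughly in half without meaningfully delaying failure detection.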
## Monitoring Health Checks
Export health check results as metrics:
```python
from prometheus_client import Gauge, Histogram

health_status = Gauge(
    'app_health_check_status',
    'Health check status (1=healthy, 0=unhealthy)',
    ['check_name']
)
health_latency = Histogram(
    'app_health_check_latency_seconds',
    'Health check latency',
    ['check_name']
)

def run_checks_with_metrics(checker: HealthChecker):
    result = checker.run_checks()
    for name, check_result in result["checks"].items():
        is_healthy = 1 if check_result["status"] == "healthy" else 0
        health_status.labels(check_name=name).set(is_healthy)
        health_latency.labels(check_name=name).observe(
            check_result["latency_ms"] / 1000
        )
    return result
```
## Summary
| Probe Type | Purpose | What to Check | On Failure |
|---|---|---|---|
| Liveness | Process alive? | Basic HTTP response | Container restart |
| Readiness | Can handle traffic? | Dependencies, resources | Remove from LB |
| Startup | Finished initializing? | Basic HTTP response | Wait longer |
Good health checks are the difference between “the system self-healed” and “we had a 3 AM incident.” Take the time to implement them properly.