Every container orchestrator, load balancer, and monitoring system asks the same question: is this service healthy? The answer you provide determines whether traffic gets routed, containers get replaced, and alerts get fired.
A health check that lies — always returning 200 even when the database is down — is worse than no health check at all. It creates false confidence while your users experience failures.
## The Three Types of Health Checks

### Liveness: “Is the process alive?”
Liveness checks answer: should this container be killed and restarted?
```python
@app.route('/health/live')
def liveness():
    return {'status': 'alive'}, 200
```
This should almost always return 200. The only time to fail liveness is when the process is fundamentally broken — deadlocked, corrupted state, unrecoverable error. A failing database connection is NOT a reason to fail liveness.
Kubernetes behavior: Failed liveness → container restart.
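One legitimate reason to fail liveness is a wedged or deadlocked process. A minimal sketch of a heartbeat-based liveness check — the `HEARTBEAT` variable and the 30-second staleness threshold are illustrative assumptions, not from the original:

```python
import time

# Hypothetical heartbeat: a background worker thread refreshes this
# timestamp each time it completes a unit of work.
HEARTBEAT = {'last_beat': time.time()}

def liveness_with_heartbeat():
    # Fail liveness only if the worker has gone silent for 30 seconds,
    # i.e. the process is likely deadlocked, not merely busy.
    if time.time() - HEARTBEAT['last_beat'] > 30:
        return {'status': 'dead'}, 503
    return {'status': 'alive'}, 200
```

This keeps the liveness contract honest: a slow database still returns 200, but a process that has stopped making progress eventually gets restarted.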
### Readiness: “Can this instance serve traffic?”
Readiness checks answer: should traffic be routed to this instance?
```python
@app.route('/health/ready')
def readiness():
    checks = {
        'database': check_database(),
        'cache': check_cache(),
        'queue': check_queue(),
    }
    if all(checks.values()):
        return {'status': 'ready', 'checks': checks}, 200
    else:
        return {'status': 'not_ready', 'checks': checks}, 503
```
Fail readiness when the instance can’t serve requests properly — database down, required cache unavailable, still warming up.
Kubernetes behavior: Failed readiness → removed from service endpoints (no restart).
### Startup: “Has initial setup completed?”
Startup checks are for slow-starting applications:
```python
startup_complete = False

@app.route('/health/startup')
def startup():
    if startup_complete:
        return {'status': 'started'}, 200
    return {'status': 'starting'}, 503

def on_startup():
    global startup_complete
    load_ml_models()  # Takes 60 seconds
    warm_caches()     # Takes 30 seconds
    startup_complete = True
```
Kubernetes behavior: Startup probe runs first; liveness/readiness don’t start until startup succeeds.
## Dependency Checks: The Nuance
Not all dependencies are equal. A failed write database is critical. A failed analytics service is not.
```python
def readiness():
    critical = {
        'primary_db': check_postgres(),
        'auth_service': check_auth(),
    }
    degraded = {
        'cache': check_redis(),
        'search': check_elasticsearch(),
    }
    # Fail only on critical dependencies
    if not all(critical.values()):
        return {
            'status': 'not_ready',
            'critical': critical,
            'degraded': degraded
        }, 503
    # Warn but stay ready if degraded
    status = 'ready' if all(degraded.values()) else 'degraded'
    return {
        'status': status,
        'critical': critical,
        'degraded': degraded
    }, 200
```
## Timeout Handling
Health checks that hang are useless. Set aggressive timeouts:
```python
import signal
from functools import wraps

class TimeoutError(Exception):
    pass

def timeout(seconds):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            def handler(signum, frame):
                raise TimeoutError(f"{func.__name__} timed out")
            signal.signal(signal.SIGALRM, handler)
            signal.alarm(seconds)
            try:
                return func(*args, **kwargs)
            finally:
                signal.alarm(0)  # Cancel the alarm
        return wrapper
    return decorator

@timeout(2)
def check_database():
    try:
        db.execute("SELECT 1")
        return True
    except Exception:  # Includes the TimeoutError raised by the alarm
        return False
```
If your database check takes more than 2 seconds, something is wrong whether or not it eventually succeeds.
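One caveat: `signal.SIGALRM` only works in the main thread on Unix, which threaded WSGI servers don't guarantee. A thread-safe alternative sketch using only the standard library — the 2-second default mirrors the budget above, and the single-worker pool is an illustrative choice:

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FuturesTimeout

# One reusable worker so we don't spawn a thread per health check.
_pool = ThreadPoolExecutor(max_workers=1)

def run_with_timeout(check_fn, seconds=2.0):
    """Run a health check callable, treating a timeout or error as failure."""
    future = _pool.submit(check_fn)
    try:
        return future.result(timeout=seconds)
    except FuturesTimeout:
        return False
    except Exception:
        return False
```

Note the trade-off: the timed-out check keeps running in the worker thread; the timeout bounds how long the caller waits, not the work itself.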
## Kubernetes Configuration
```yaml
apiVersion: v1
kind: Pod
spec:
  containers:
  - name: app
    livenessProbe:
      httpGet:
        path: /health/live
        port: 8080
      initialDelaySeconds: 10
      periodSeconds: 10
      timeoutSeconds: 3
      failureThreshold: 3
    readinessProbe:
      httpGet:
        path: /health/ready
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 5
      timeoutSeconds: 3
      failureThreshold: 2
    startupProbe:
      httpGet:
        path: /health/startup
        port: 8080
      periodSeconds: 5
      failureThreshold: 30  # 30 * 5s = 150s max startup time
```
Key settings:
- `initialDelaySeconds`: Wait before the first check
- `periodSeconds`: How often to check
- `timeoutSeconds`: Max time for a check to respond
- `failureThreshold`: Consecutive failures before action
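These settings combine into a worst-case detection time: an instance can fail just after passing a probe, then must fail `failureThreshold` consecutive probes, each of which may burn up to `timeoutSeconds` before being declared failed. A rough back-of-the-envelope sketch (an approximation, not a guarantee from the Kubernetes docs):

```python
def worst_case_detection_seconds(period, failure_threshold, timeout=0):
    # One full period of undetected failure, plus `failure_threshold`
    # probes that each wait up to `period + timeout` seconds.
    return period + failure_threshold * (period + timeout)

# Liveness values from the manifest above:
# periodSeconds=10, failureThreshold=3, timeoutSeconds=3
print(worst_case_detection_seconds(10, 3, timeout=3))  # → 49
```

Roughly 49 seconds can pass before a dead container is restarted — tune `periodSeconds` and `failureThreshold` with that budget in mind.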
## Load Balancer Health Checks
AWS ALB, nginx, HAProxy — all need health endpoints too:
```nginx
# nginx upstream health check
# Note: the active health_check directive requires NGINX Plus and lives in
# the proxying location; open-source nginx only supports passive checks
# via max_fails/fail_timeout on the server lines.
upstream backend {
    zone backend 64k;
    server app1:8080;
    server app2:8080;
}

server {
    location / {
        proxy_pass http://backend;
        health_check interval=5s fails=2 passes=2 uri=/health/ready;
    }
}
```
```hcl
# AWS ALB target group
resource "aws_lb_target_group" "app" {
  health_check {
    path                = "/health/ready"
    interval            = 30
    timeout             = 5
    healthy_threshold   = 2
    unhealthy_threshold = 3
    matcher             = "200"
  }
}
```
## Anti-Patterns to Avoid
Health checks that do real work:
```python
# Bad: Health check runs expensive query
def health():
    user_count = db.query("SELECT COUNT(*) FROM users")  # Slow!
    return {'users': user_count}, 200
```
Health checks run frequently. They should be fast and cheap.
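If a check is unavoidably expensive, one option is to cache its result for a few seconds so probe traffic doesn't multiply the cost. A minimal sketch — the 5-second TTL is an illustrative choice:

```python
import time

_cache = {'value': None, 'expires': 0.0}

def cached_check(check_fn, ttl=5.0):
    """Return a cached health result, re-running check_fn at most once per ttl."""
    now = time.time()
    if now >= _cache['expires']:
        _cache['value'] = check_fn()
        _cache['expires'] = now + ttl
    return _cache['value']
```

The trade-off is staleness: a failure can go unreported for up to `ttl` seconds, so keep the TTL shorter than your probe period.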
Cascading health failures:
```python
# Bad: Service A's health depends on calling Service B's health
def health():
    response = requests.get("http://service-b/health")
    if response.status_code != 200:
        return {'status': 'unhealthy'}, 503
    return {'status': 'healthy'}, 200
```
Now a network blip between services causes both to fail health checks. Check connection pools, not remote health endpoints.
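Checking your own client-side state avoids the cascade. A sketch assuming a hypothetical `pool` object exposing `active()` and `size()` — substitute whatever introspection your HTTP or database client actually provides:

```python
def check_service_b_connectivity(pool):
    """Healthy if we hold at least one live pooled connection to service B.

    `pool` is a hypothetical connection pool exposing active() (live
    connections) and size() (configured maximum).
    """
    try:
        return 0 < pool.active() <= pool.size()
    except Exception:
        return False
```

This reflects what actually matters to Service A — can it reach Service B — without making A's health a function of B's health endpoint.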
Swallowing errors:
```python
# Bad: Always healthy, even when nothing works
def health():
    try:
        check_everything()
    except:
        pass  # Ignore all errors
    return {'status': 'healthy'}, 200
```
If your health check can’t fail, it isn’t checking anything.
## The Deep Health Endpoint
For debugging, offer a detailed endpoint that isn’t used by orchestrators:
```python
@app.route('/health/deep')
def deep_health():
    return {
        'status': 'healthy',
        'version': app.config['VERSION'],
        'uptime_seconds': time.time() - START_TIME,
        'checks': {
            'database': {
                'status': 'connected',
                'latency_ms': measure_db_latency(),
                'pool_size': db.pool.size(),
                'pool_available': db.pool.available(),
            },
            'cache': {
                'status': 'connected',
                'hit_rate': cache.stats()['hit_rate'],
            },
            'memory': {
                'rss_mb': get_memory_usage(),
                'gc_collections': gc.get_stats(),
            }
        },
        'dependencies': {
            'auth_service': check_auth_service(),
            'payment_service': check_payment_service(),
        }
    }
```
Don’t expose this publicly — it contains operational details. But it’s invaluable for debugging.
Health checks are your service’s way of communicating with its environment. Lie, and the environment makes bad decisions. Be honest, and traffic flows to healthy instances, broken containers get replaced, and your monitoring tells you what’s actually happening.
Three endpoints, three purposes: liveness (should restart?), readiness (should receive traffic?), startup (has initialization completed?). Get these right and your infrastructure can help itself.