Your load balancer routes traffic to a pod that’s crashed. Or kills a pod that’s just slow. Or restarts a pod that’s still initializing. Health checks prevent these failures — when configured correctly.
Most teams get them wrong. Here’s how to get them right.
## The Three Probe Types
Kubernetes offers three distinct probes, each with a different purpose:
| Probe | Question | Failure Action |
|---|---|---|
| Liveness | Is the process alive? | Restart container |
| Readiness | Can it handle traffic? | Remove from Service |
| Startup | Has it finished starting? | Delay other probes |
## Liveness: “Should I restart this?”
Detects when your app is stuck — deadlocked, infinite loop, unrecoverable state.
```yaml
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3
```
The endpoint should be dead simple:
```python
@app.route('/healthz')
def liveness():
    # Can the process respond at all?
    return 'OK', 200
```
Don’t check dependencies here. If your database is down, restarting your app won’t fix it. You’ll just create a restart loop.
## Readiness: “Should I send traffic?”
Detects when your app can’t serve requests — overloaded, warming cache, waiting for dependency.
```yaml
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 3
  successThreshold: 1
```
This endpoint can check dependencies:
```python
@app.route('/ready')
def readiness():
    checks = {
        'database': check_db_connection(),
        'cache': check_cache_connection(),
        'config': config_loaded(),
    }
    if all(checks.values()):
        return jsonify({'status': 'ready', 'checks': checks}), 200
    else:
        return jsonify({'status': 'not_ready', 'checks': checks}), 503
```
When readiness fails, the pod is removed from the Service — no traffic, but no restart either. It stays alive and can recover.
## Startup: “Is it still booting?”
For slow-starting apps. Delays liveness/readiness probes until startup completes.
```yaml
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  failureThreshold: 30  # 30 * 10 = 300 seconds max startup
```
Without this, a Java app with a 2-minute startup gets killed by the liveness probe before it finishes loading.
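The kill window is just arithmetic: roughly `initialDelaySeconds + periodSeconds × failureThreshold` (ignoring probe timeouts and jitter). A throwaway helper — hypothetical, not part of any Kubernetes client library — to sanity-check your numbers:

```python
def max_probe_window_seconds(period_seconds, failure_threshold, initial_delay_seconds=0):
    """Rough upper bound on how long a container can fail its liveness
    probe before the kubelet restarts it (ignores timeout and jitter)."""
    return initial_delay_seconds + period_seconds * failure_threshold

# The liveness config above: 30 + 10 * 3
print(max_probe_window_seconds(10, 3, initial_delay_seconds=30))  # → 60

# The startup probe above: 10 * 30
print(max_probe_window_seconds(10, 30))  # → 300
```

Sixty seconds is well short of a two-minute boot, which is exactly why the startup probe exists.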
## Common Mistakes
### Mistake 1: Checking Dependencies in Liveness
```python
# DON'T DO THIS
@app.route('/healthz')
def liveness():
    db.execute('SELECT 1')  # If DB is down, pod restarts
    redis.ping()            # If Redis is down, pod restarts
    return 'OK'
```
When the database has a blip, all your pods restart simultaneously. Now you have an outage.
Fix: Only check if the process itself is healthy:
```python
@app.route('/healthz')
def liveness():
    # Check internal state only
    if app.is_shutting_down:
        return 'Shutting down', 503
    if main_thread_stuck():
        return 'Deadlocked', 503
    return 'OK', 200
```
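The `main_thread_stuck()` check is left undefined above. One common way to implement it is a heartbeat the main loop refreshes on every iteration; if the heartbeat goes stale, the process is likely wedged. A minimal sketch — the names and the 60-second threshold are assumptions, not from any library:

```python
import time

_last_heartbeat = time.monotonic()

def beat():
    """Call from the main loop / worker on every iteration."""
    global _last_heartbeat
    _last_heartbeat = time.monotonic()

def main_thread_stuck(max_silence_seconds=60):
    """True if the main loop hasn't checked in recently."""
    return time.monotonic() - _last_heartbeat > max_silence_seconds
```

Tune `max_silence_seconds` to comfortably exceed your longest legitimate loop iteration, or long GC pauses will read as deadlocks.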
### Mistake 2: Same Endpoint for Both Probes
```yaml
# PROBLEMATIC
livenessProbe:
  httpGet:
    path: /health
readinessProbe:
  httpGet:
    path: /health  # Same endpoint!
```
If /health checks dependencies, liveness failures cause unnecessary restarts. If it doesn’t, readiness doesn’t catch dependency issues.
Fix: Separate endpoints with different logic.
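A framework-free sketch of the split, with the two handlers as plain functions returning `(body, status)` tuples — the dependency checkers are assumed to exist elsewhere, and in practice you'd wire these into your framework's routing:

```python
def liveness():
    # Process-only check: no I/O, no dependencies.
    return 'OK', 200

def readiness(deps):
    """deps: mapping of name -> zero-arg callable returning True when healthy."""
    results = {name: check() for name, check in deps.items()}
    if all(results.values()):
        return results, 200
    return results, 503
```

The point is that liveness never touches `deps` at all, so a dependency outage can never trigger a restart.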
### Mistake 3: Too Aggressive Timing
```yaml
# TOO AGGRESSIVE
livenessProbe:
  periodSeconds: 1
  timeoutSeconds: 1
  failureThreshold: 1
```
One slow response = restart. GC pause = restart. Network hiccup = restart.
Fix: Give it room to recover:
```yaml
livenessProbe:
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3  # 3 failures = 30 seconds to recover
```
### Mistake 4: No Startup Probe for Slow Apps
```yaml
# Java app takes 90 seconds to start
livenessProbe:
  initialDelaySeconds: 30  # Not enough!
  periodSeconds: 10
  failureThreshold: 3
# Pod killed at 30 + (10 * 3) = 60 seconds
```
Fix: Use startup probe:
```yaml
startupProbe:
  httpGet:
    path: /healthz
  failureThreshold: 30
  periodSeconds: 10
  # Startup has 300 seconds to complete

livenessProbe:
  httpGet:
    path: /healthz
  periodSeconds: 10
  failureThreshold: 3
  # Only starts after startup probe succeeds
```
## Implementation Patterns
### Detailed Health Response
```python
from datetime import datetime
import threading

class HealthChecker:
    def __init__(self):
        self.start_time = datetime.utcnow()
        self.ready = False
        self.checks = {}

    def liveness(self):
        """Minimal check - is the process alive?"""
        return {
            'status': 'alive',
            # total_seconds(), not .seconds -- .seconds wraps after one day
            'uptime_seconds': int((datetime.utcnow() - self.start_time).total_seconds()),
            'threads': threading.active_count()
        }

    def readiness(self):
        """Full check - can we serve traffic?"""
        self.checks = {
            'database': self._check_db(),
            'cache': self._check_cache(),
            'disk_space': self._check_disk(),
            'memory': self._check_memory(),
        }
        all_healthy = all(c['healthy'] for c in self.checks.values())
        return {
            'status': 'ready' if all_healthy else 'not_ready',
            'checks': self.checks
        }, 200 if all_healthy else 503

    def _check_db(self):
        # `db` is your application's database handle
        try:
            db.execute('SELECT 1')
            return {'healthy': True, 'latency_ms': 5}
        except Exception as e:
            return {'healthy': False, 'error': str(e)}
```
### Graceful Shutdown Integration
```python
import signal

shutdown_requested = False

def handle_sigterm(signum, frame):
    global shutdown_requested
    shutdown_requested = True

signal.signal(signal.SIGTERM, handle_sigterm)

@app.route('/healthz')
def liveness():
    if shutdown_requested:
        # Fail liveness during shutdown to speed up termination
        return 'Shutting down', 503
    return 'OK', 200

@app.route('/ready')
def readiness():
    if shutdown_requested:
        # Fail readiness immediately to stop traffic
        return 'Shutting down', 503
    # ... normal checks
```
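On the Kubernetes side this pairs with a `preStop` sleep and a generous `terminationGracePeriodSeconds`, so the endpoints controller has time to pull the pod out of the Service before the app receives SIGTERM. A sketch with illustrative values:

```yaml
spec:
  terminationGracePeriodSeconds: 60
  containers:
    - name: app
      lifecycle:
        preStop:
          exec:
            # Delay SIGTERM so in-flight traffic drains first
            command: ["sleep", "10"]
```

Size the grace period to cover the preStop sleep plus your slowest in-flight request, or the kubelet will SIGKILL mid-request.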
### Circuit Breaker Pattern
```python
from datetime import datetime, timedelta

class DependencyCheck:
    def __init__(self, name, check_fn, timeout=timedelta(seconds=30)):
        self.name = name
        self.check_fn = check_fn
        self.timeout = timeout
        self.last_success = None
        self.last_failure = None
        self.consecutive_failures = 0
        self.circuit_open = False

    def check(self):
        if self.circuit_open:
            # Check if we should try again
            if datetime.utcnow() - self.last_failure > self.timeout:
                self.circuit_open = False
            else:
                return {'healthy': False, 'reason': 'circuit_open'}
        try:
            result = self.check_fn()
            self.last_success = datetime.utcnow()
            self.consecutive_failures = 0
            return {'healthy': True, 'result': result}
        except Exception as e:
            self.consecutive_failures += 1
            self.last_failure = datetime.utcnow()
            if self.consecutive_failures >= 3:
                self.circuit_open = True
            return {'healthy': False, 'error': str(e)}
```
## Load Balancer Health Checks
Outside Kubernetes, configure your load balancer similarly:
### AWS ALB
```hcl
resource "aws_lb_target_group" "app" {
  name     = "app-tg"
  port     = 8080
  protocol = "HTTP"
  vpc_id   = var.vpc_id

  health_check {
    enabled             = true
    path                = "/ready"
    port                = "traffic-port"
    healthy_threshold   = 2
    unhealthy_threshold = 3
    timeout             = 5
    interval            = 30
    matcher             = "200"
  }
}
```
### Nginx Upstream
```nginx
upstream backend {
    # Passive health checks
    server 10.0.0.1:8080 max_fails=3 fail_timeout=30s;
    server 10.0.0.2:8080 max_fails=3 fail_timeout=30s;
}

# Active health checks (Nginx Plus only)
location / {
    proxy_pass http://backend;
    health_check interval=10 fails=3 passes=2 uri=/ready;
}
```
## The Checklist
- Liveness = process health only — No dependency checks
- Readiness = traffic eligibility — Check everything needed to serve
- Startup = boot completion — For slow-starting apps
- Separate endpoints — Different logic for different purposes
- Conservative timing — Allow recovery before restarting
- Graceful shutdown aware — Fail probes when shutting down
- Detailed responses — Help debugging with check details
Health checks are your app telling the infrastructure “I’m okay” or “I need help.” Get the message right, and the infrastructure responds appropriately. Get it wrong, and you create the very outages you’re trying to prevent.
Check wisely. Restart sparingly. Recover gracefully.