Your load balancer routes traffic to a pod that’s crashed. Or kills a pod that’s just slow. Or restarts a pod that’s still initializing. Health checks prevent these failures — when configured correctly.

Most teams get them wrong. Here’s how to get them right.

The Three Probe Types

Kubernetes offers three distinct probes, each with a different purpose:

| Probe     | Question                  | Failure Action      |
|-----------|---------------------------|---------------------|
| Liveness  | Is the process alive?     | Restart container   |
| Readiness | Can it handle traffic?    | Remove from Service |
| Startup   | Has it finished starting? | Delay other probes  |

Liveness: “Should I restart this?”

Detects when your app is stuck — deadlocked, infinite loop, unrecoverable state.

livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3

The endpoint should be dead simple:

from flask import Flask

app = Flask(__name__)

@app.route('/healthz')
def liveness():
    # Can the process respond at all?
    return 'OK', 200

Don’t check dependencies here. If your database is down, restarting your app won’t fix it. You’ll just create a restart loop.

Readiness: “Should I send traffic?”

Detects when your app can’t serve requests — overloaded, warming cache, waiting for dependency.

readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 3
  successThreshold: 1

This endpoint can check dependencies:

@app.route('/ready')
def readiness():
    checks = {
        'database': check_db_connection(),
        'cache': check_cache_connection(),
        'config': config_loaded(),
    }
    
    if all(checks.values()):
        return jsonify({'status': 'ready', 'checks': checks}), 200
    else:
        return jsonify({'status': 'not_ready', 'checks': checks}), 503

When readiness fails, the pod is removed from the Service — no traffic, but no restart either. It stays alive and can recover.

Startup: “Is it still booting?”

For slow-starting apps. Holds off the liveness and readiness probes until the startup probe first succeeds.

startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  failureThreshold: 30  # 30 * 10 = 300 seconds max startup

Without this, a Java app with a 2-minute startup gets killed by the liveness probe before it finishes loading.

Common Mistakes

Mistake 1: Checking Dependencies in Liveness

# DON'T DO THIS
@app.route('/healthz')
def liveness():
    db.execute('SELECT 1')  # If DB is down, pod restarts
    redis.ping()             # If Redis is down, pod restarts
    return 'OK'

When the database has a blip, all your pods restart simultaneously. Now you have an outage.

Fix: Only check if the process itself is healthy:

@app.route('/healthz')
def liveness():
    # Check internal state only
    if app.is_shutting_down:
        return 'Shutting down', 503
    if main_thread_stuck():
        return 'Deadlocked', 503
    return 'OK', 200
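The main_thread_stuck() check above is left undefined. A common way to implement it is a heartbeat timestamp — a sketch, assuming a hypothetical beat() that your main work loop calls on every iteration:

```python
import time
import threading

_heartbeat = time.monotonic()
_heartbeat_lock = threading.Lock()

def beat():
    """Call this from the main work loop on every iteration."""
    global _heartbeat
    with _heartbeat_lock:
        _heartbeat = time.monotonic()

def main_thread_stuck(max_silence=30.0):
    """True if the main loop hasn't reported a heartbeat recently."""
    with _heartbeat_lock:
        return time.monotonic() - _heartbeat > max_silence
```

If the main loop deadlocks, the heartbeat goes stale, main_thread_stuck() turns true, and the liveness probe starts failing — which is exactly the situation a restart can fix.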

Mistake 2: Same Endpoint for Both Probes

# PROBLEMATIC
livenessProbe:
  httpGet:
    path: /health
readinessProbe:
  httpGet:
    path: /health  # Same endpoint!

If /health checks dependencies, liveness failures cause unnecessary restarts. If it doesn’t, readiness doesn’t catch dependency issues.

Fix: Separate endpoints with different logic.

Mistake 3: Too Aggressive Timing

# TOO AGGRESSIVE
livenessProbe:
  periodSeconds: 1
  timeoutSeconds: 1
  failureThreshold: 1

One slow response = restart. GC pause = restart. Network hiccup = restart.

Fix: Give it room to recover:

livenessProbe:
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3  # 3 failures = 30 seconds to recover
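The worst-case time from a pod getting stuck to its restart is roughly failureThreshold × periodSeconds, plus the timeout of the final attempt (actual kubelet scheduling varies slightly). A quick sanity check of the arithmetic:

```python
def worst_case_restart_delay(period_s, timeout_s, failure_threshold):
    """Approximate upper bound (seconds) from first failed probe to restart:
    probes fire every period_s, and the last attempt may hang for timeout_s."""
    return failure_threshold * period_s + timeout_s

# The conservative config: 3 * 10 + 5
print(worst_case_restart_delay(10, 5, 3))  # 35

# The aggressive config: 1 * 1 + 1
print(worst_case_restart_delay(1, 1, 1))   # 2
```

Two seconds gives a GC pause no room at all; thirty-five gives a genuinely stuck pod a bounded wait before recovery.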

Mistake 4: No Startup Probe for Slow Apps

# Java app takes 90 seconds to start
livenessProbe:
  initialDelaySeconds: 30  # Not enough!
  periodSeconds: 10
  failureThreshold: 3
# Pod killed at 30 + (10 * 3) = 60 seconds

Fix: Use startup probe:

startupProbe:
  httpGet:
    path: /healthz
  failureThreshold: 30
  periodSeconds: 10
# Startup has 300 seconds to complete

livenessProbe:
  httpGet:
    path: /healthz
  periodSeconds: 10
  failureThreshold: 3
# Only starts after startup probe succeeds

Implementation Patterns

Detailed Health Response

import time
import threading
from datetime import datetime

class HealthChecker:
    def __init__(self):
        self.start_time = datetime.utcnow()
        self.ready = False
        self.checks = {}
    
    def liveness(self):
        """Minimal check - is the process alive?"""
        return {
            'status': 'alive',
            # total_seconds(), not .seconds — .seconds wraps after one day
            'uptime_seconds': int((datetime.utcnow() - self.start_time).total_seconds()),
            'threads': threading.active_count()
        }
    
    def readiness(self):
        """Full check - can we serve traffic?"""
        self.checks = {
            'database': self._check_db(),
            'cache': self._check_cache(),
            'disk_space': self._check_disk(),
            'memory': self._check_memory(),
        }
        
        all_healthy = all(c['healthy'] for c in self.checks.values())
        
        return {
            'status': 'ready' if all_healthy else 'not_ready',
            'checks': self.checks
        }, 200 if all_healthy else 503
    
    def _check_db(self):
        try:
            start = time.perf_counter()
            db.execute('SELECT 1')
            latency_ms = round((time.perf_counter() - start) * 1000, 1)
            return {'healthy': True, 'latency_ms': latency_ms}
        except Exception as e:
            return {'healthy': False, 'error': str(e)}
Graceful Shutdown Integration

import signal

shutdown_requested = False

def handle_sigterm(signum, frame):
    global shutdown_requested
    shutdown_requested = True

signal.signal(signal.SIGTERM, handle_sigterm)

@app.route('/healthz')
def liveness():
    if shutdown_requested:
        # Fail liveness during shutdown to speed up termination
        return 'Shutting down', 503
    return 'OK', 200

@app.route('/ready')
def readiness():
    if shutdown_requested:
        # Fail readiness immediately to stop traffic
        return 'Shutting down', 503
    # ... normal checks

Circuit Breaker Pattern

from datetime import datetime, timedelta

class DependencyCheck:
    def __init__(self, name, check_fn, timeout=timedelta(seconds=30)):
        self.name = name
        self.check_fn = check_fn
        self.timeout = timeout
        self.last_success = None
        self.last_failure = None
        self.consecutive_failures = 0
        self.circuit_open = False
    
    def check(self):
        if self.circuit_open:
            # Check if we should try again
            if datetime.utcnow() - self.last_failure > self.timeout:
                self.circuit_open = False
            else:
                return {'healthy': False, 'reason': 'circuit_open'}
        
        try:
            result = self.check_fn()
            self.last_success = datetime.utcnow()
            self.consecutive_failures = 0
            return {'healthy': True, 'result': result}
        except Exception as e:
            self.consecutive_failures += 1
            self.last_failure = datetime.utcnow()
            
            if self.consecutive_failures >= 3:
                self.circuit_open = True
            
            return {'healthy': False, 'error': str(e)}

Load Balancer Health Checks

Outside Kubernetes, configure your load balancer similarly:

AWS ALB

resource "aws_lb_target_group" "app" {
  name     = "app-tg"
  port     = 8080
  protocol = "HTTP"
  vpc_id   = var.vpc_id
  
  health_check {
    enabled             = true
    path                = "/ready"
    port                = "traffic-port"
    healthy_threshold   = 2
    unhealthy_threshold = 3
    timeout             = 5
    interval            = 30
    matcher             = "200"
  }
}

Nginx Upstream

upstream backend {
    # Passive health checks: mark a server down after 3 failures for 30s
    server 10.0.0.1:8080 max_fails=3 fail_timeout=30s;
    server 10.0.0.2:8080 max_fails=3 fail_timeout=30s;
}

# Active health checks (Nginx Plus)
location / {
    proxy_pass http://backend;
    health_check interval=10 fails=3 passes=2 uri=/ready;
}

The Checklist

  1. Liveness = process health only — No dependency checks
  2. Readiness = traffic eligibility — Check everything needed to serve
  3. Startup = boot completion — For slow-starting apps
  4. Separate endpoints — Different logic for different purposes
  5. Conservative timing — Allow recovery before restarting
  6. Graceful shutdown aware — Fail probes when shutting down
  7. Detailed responses — Help debugging with check details

Health checks are your app telling the infrastructure “I’m okay” or “I need help.” Get the message right, and the infrastructure responds appropriately. Get it wrong, and you create the very outages you’re trying to prevent.

Check wisely. Restart sparingly. Recover gracefully.