Every container orchestrator, load balancer, and monitoring system asks the same question: is this service healthy? The answer you provide determines whether traffic gets routed, containers get replaced, and alerts get fired.

A health check that lies — always returning 200 even when the database is down — is worse than no health check at all. It creates false confidence while your users experience failures.

The Three Types of Health Checks

Liveness: “Is the process alive?”

Liveness checks answer: should this container be killed and restarted?

@app.route('/health/live')
def liveness():
    return {'status': 'alive'}, 200

This should almost always return 200. The only time to fail liveness is when the process is fundamentally broken — deadlocked, corrupted state, unrecoverable error. A failing database connection is NOT a reason to fail liveness.

Kubernetes behavior: Failed liveness → container restart.

Readiness: “Can this instance serve traffic?”

Readiness checks answer: should traffic be routed to this instance?

@app.route('/health/ready')
def readiness():
    checks = {
        'database': check_database(),
        'cache': check_cache(),
        'queue': check_queue(),
    }
    
    if all(checks.values()):
        return {'status': 'ready', 'checks': checks}, 200
    else:
        return {'status': 'not_ready', 'checks': checks}, 503

Fail readiness when the instance can’t serve requests properly — database down, required cache unavailable, still warming up.

Kubernetes behavior: Failed readiness → removed from service endpoints (no restart).

Startup: “Has initial setup completed?”

Startup checks are for slow-starting applications:

startup_complete = False

@app.route('/health/startup')
def startup():
    if startup_complete:
        return {'status': 'started'}, 200
    return {'status': 'starting'}, 503

def on_startup():
    global startup_complete
    load_ml_models()  # Takes 60 seconds
    warm_caches()     # Takes 30 seconds
    startup_complete = True

Kubernetes behavior: Startup probe runs first; liveness/readiness don’t start until startup succeeds.

Dependency Checks: The Nuance

Not all dependencies are equal. A failed write database is critical. A failed analytics service is not.

def readiness():
    critical = {
        'primary_db': check_postgres(),
        'auth_service': check_auth(),
    }
    
    degraded = {
        'cache': check_redis(),
        'search': check_elasticsearch(),
    }
    
    # Fail only on critical dependencies
    if not all(critical.values()):
        return {
            'status': 'not_ready',
            'critical': critical,
            'degraded': degraded
        }, 503
    
    # Warn but stay ready if degraded
    status = 'ready' if all(degraded.values()) else 'degraded'
    return {
        'status': status,
        'critical': critical,
        'degraded': degraded
    }, 200

Timeout Handling

Health checks that hang are useless. Set aggressive timeouts:

import signal
from functools import wraps

# Note: SIGALRM is available only on Unix and fires only in the main thread.
class TimeoutError(Exception):  # shadows the builtin TimeoutError
    pass

def timeout(seconds):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            def handler(signum, frame):
                raise TimeoutError(f"{func.__name__} timed out")
            
            signal.signal(signal.SIGALRM, handler)
            signal.alarm(seconds)
            try:
                return func(*args, **kwargs)
            finally:
                signal.alarm(0)
        return wrapper
    return decorator

@timeout(2)
def check_database():
    try:
        db.execute("SELECT 1")
        return True
    except Exception:
        return False

If your database check takes more than 2 seconds, something is wrong whether or not it eventually succeeds.
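Because `SIGALRM` doesn't work under most multithreaded WSGI servers or on Windows, a thread-based alternative is often more practical. A sketch using only the standard library:

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

_pool = ThreadPoolExecutor(max_workers=4)

def check_with_timeout(check_fn, seconds=2.0):
    """Run check_fn in a worker thread; treat a timeout or error as failure."""
    future = _pool.submit(check_fn)
    try:
        return bool(future.result(timeout=seconds))
    except FutureTimeout:
        # The check may still be running; we simply stop waiting for it.
        return False
    except Exception:
        return False
```

One caveat: a timed-out check keeps occupying a worker thread until it finishes, so keep `max_workers` above the number of checks you run per probe.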

Kubernetes Configuration

apiVersion: v1
kind: Pod
spec:
  containers:
  - name: app
    livenessProbe:
      httpGet:
        path: /health/live
        port: 8080
      initialDelaySeconds: 10
      periodSeconds: 10
      timeoutSeconds: 3
      failureThreshold: 3
    
    readinessProbe:
      httpGet:
        path: /health/ready
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 5
      timeoutSeconds: 3
      failureThreshold: 2
    
    startupProbe:
      httpGet:
        path: /health/startup
        port: 8080
      periodSeconds: 5
      failureThreshold: 30  # 30 * 5s = 150s max startup time

Key settings:

  • initialDelaySeconds: Wait before first check
  • periodSeconds: How often to check
  • timeoutSeconds: Max time for check to respond
  • failureThreshold: Consecutive failures before action
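These settings determine how quickly Kubernetes reacts. A failure that begins just after a successful probe takes roughly `periodSeconds * failureThreshold` (plus up to one `timeoutSeconds`) to trigger action. A quick back-of-envelope calculator using the manifest values above:

```python
def worst_case_detection(period_s, failure_threshold, timeout_s):
    # Failure can begin just after a passing probe, so up to one full
    # period elapses per required failure; the last probe may also take
    # timeout_s before it is counted as failed.
    return period_s * failure_threshold + timeout_s

# Liveness from the manifest: 10s period, 3 failures, 3s timeout
liveness_delay = worst_case_detection(10, 3, 3)   # ~33s before a restart
# Readiness: 5s period, 2 failures, 3s timeout
readiness_delay = worst_case_detection(5, 2, 3)   # ~13s before traffic is pulled
```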

Load Balancer Health Checks

AWS ALB, nginx, HAProxy — all need health endpoints too:

# nginx upstream health check (active checks like this require NGINX Plus)
upstream backend {
    server app1:8080;
    server app2:8080;
    
    health_check interval=5s fails=2 passes=2 uri=/health/ready;
}
# AWS ALB target group
resource "aws_lb_target_group" "app" {
  health_check {
    path                = "/health/ready"
    interval            = 30
    timeout             = 5
    healthy_threshold   = 2
    unhealthy_threshold = 3
    matcher             = "200"
  }
}

Anti-Patterns to Avoid

Health checks that do real work:

# Bad: Health check runs expensive query
def health():
    user_count = db.query("SELECT COUNT(*) FROM users")  # Slow!
    return {'users': user_count}, 200

Health checks run frequently. They should be fast and cheap.
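If a dependency check is unavoidably non-trivial, cache its result so frequent probes stay cheap. A minimal sketch (the 5-second TTL is an assumption; tune it to your probe period):

```python
import time

_result_cache = {}  # name -> (result, expiry timestamp)

def cached_check(name, check_fn, ttl=5.0):
    """Return a recent result if one exists; otherwise run the real check."""
    result, expiry = _result_cache.get(name, (None, 0.0))
    now = time.monotonic()
    if now < expiry:
        return result
    result = check_fn()
    _result_cache[name] = (result, now + ttl)
    return result
```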

Cascading health failures:

# Bad: Service A's health depends on calling Service B's health
def health():
    response = requests.get("http://service-b/health")
    if response.status_code != 200:
        return {'status': 'unhealthy'}, 503

Now a network blip between services causes both to fail health checks. Check connection pools, not remote health endpoints.
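One way to get a signal about Service B without probing its health endpoint is to track the outcomes of your own real calls to it, circuit-breaker style. A sketch — the window, threshold, and function names are illustrative:

```python
import collections
import time

# Record outcomes of actual calls to Service B instead of probing /health.
_recent = collections.deque(maxlen=50)

def record_call(ok):
    _recent.append((time.monotonic(), ok))

def service_b_looks_healthy(window_s=60, min_success_rate=0.5):
    now = time.monotonic()
    outcomes = [ok for (t, ok) in _recent if now - t <= window_s]
    if not outcomes:
        return True  # no recent traffic: don't guess, assume healthy
    return sum(outcomes) / len(outcomes) >= min_success_rate
```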

Swallowing errors:

# Bad: Always healthy, even when nothing works
def health():
    try:
        check_everything()
    except:
        pass  # Ignore all errors
    return {'status': 'healthy'}, 200

If your health check can’t fail, it isn’t checking anything.

The Deep Health Endpoint

For debugging, offer a detailed endpoint that isn’t used by orchestrators:

@app.route('/health/deep')
def deep_health():
    return {
        'status': 'healthy',
        'version': app.config['VERSION'],
        'uptime_seconds': time.time() - START_TIME,
        'checks': {
            'database': {
                'status': 'connected',
                'latency_ms': measure_db_latency(),
                'pool_size': db.pool.size(),
                'pool_available': db.pool.available(),
            },
            'cache': {
                'status': 'connected',
                'hit_rate': cache.stats()['hit_rate'],
            },
            'memory': {
                'rss_mb': get_memory_usage(),
                'gc_collections': gc.get_stats(),
            }
        },
        'dependencies': {
            'auth_service': check_auth_service(),
            'payment_service': check_payment_service(),
        }
    }

Don’t expose this publicly — it contains operational details. But it’s invaluable for debugging.
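One simple way to keep the deep endpoint internal is a shared-secret header check at the app layer (network-level restrictions are better when available). A sketch — the header name, env var, and framework-agnostic `headers` argument are all assumptions:

```python
import os
from functools import wraps

def internal_only(handler):
    """Reject requests lacking the internal ops token (hypothetical scheme)."""
    @wraps(handler)
    def wrapper(headers):
        expected = os.environ.get('OPS_TOKEN')
        if not expected or headers.get('X-Ops-Token') != expected:
            return {'error': 'forbidden'}, 403
        return handler(headers)
    return wrapper

@internal_only
def deep_health(headers):
    return {'status': 'healthy'}, 200
```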


Health checks are your service’s way of communicating with its environment. Lie, and the environment makes bad decisions. Be honest, and traffic flows to healthy instances, broken containers get replaced, and your monitoring tells you what’s actually happening.

Three endpoints, three purposes: liveness (should restart?), readiness (should receive traffic?), startup (has initialization completed?). Get these right and your infrastructure can help itself.