A health check that always returns 200 OK is worse than no health check at all. It gives you false confidence while your application silently fails. Let’s build health checks that actually tell you when something’s wrong.

The Three Types of Probes

Kubernetes defines three probe types, each serving a distinct purpose:

Liveness Probe: “Is this process stuck?” If it fails, Kubernetes kills and restarts the container.

Readiness Probe: “Can this instance handle traffic?” If it fails, the instance is removed from load balancing but keeps running.

Startup Probe: “Has this application finished starting?” Liveness and readiness probes are held off until the startup probe succeeds. If it never succeeds within its failure budget, the container is restarted.

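Each probe triggers a different action, so pointing one at the wrong endpoint can restart healthy pods or keep routing traffic to broken ones.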
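
As a rough mental model, the way the three probes interact can be sketched as one decision function. This is a simplification for illustration, not kubelet code: it ignores failureThreshold and timing, and the function name and return strings are made up for this example.

```python
def probe_outcome(startup_passed: bool, liveness_ok: bool, readiness_ok: bool) -> str:
    """Simplified model of what happens to one container on a probe cycle."""
    if not startup_passed:
        # Liveness and readiness are not evaluated until the startup probe succeeds.
        return "wait for startup"
    if not liveness_ok:
        return "restart container"
    if not readiness_ok:
        return "remove from service endpoints"
    return "serve traffic"


# A started, live container whose dependencies are down keeps running
# but stops receiving traffic.
print(probe_outcome(startup_passed=True, liveness_ok=True, readiness_ok=False))
```

Note the ordering: a failing liveness check wins over readiness, which is why the two should not share an endpoint.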

Basic Health Endpoint

Start with a simple health endpoint:

from fastapi import FastAPI, Response
from datetime import datetime, timezone
import psycopg2
import redis

app = FastAPI()

@app.get("/health")
async def health():
    # Timezone-aware UTC timestamp; datetime.utcnow() is deprecated since Python 3.12.
    return {"status": "ok", "timestamp": datetime.now(timezone.utc).isoformat()}

This is a liveness check — it proves the process is running and can respond to HTTP. But it doesn’t tell you if the app can actually do its job.

Readiness: Check Your Dependencies

A readiness check should verify the app can handle requests:

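import os
import time
from typing import Callable, Dict, List, Tuple

import requests

# Connection settings read from the environment; the fallback values are placeholders.
DATABASE_URL = os.environ.get("DATABASE_URL", "postgresql://localhost:5432/app")
REDIS_URL = os.environ.get("REDIS_URL", "redis://localhost:6379/0")

class HealthChecker:
    def __init__(self):
        self.checks: List[Tuple[str, Callable[[], None]]] = []

    def add_check(self, name: str, check_fn: Callable[[], None]):
        self.checks.append((name, check_fn))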
    
    def run_checks(self) -> Dict:
        results = {}
        all_healthy = True
        
        for name, check_fn in self.checks:
            start = time.time()
            try:
                check_fn()
                results[name] = {
                    "status": "healthy",
                    "latency_ms": round((time.time() - start) * 1000, 2)
                }
            except Exception as e:
                all_healthy = False
                results[name] = {
                    "status": "unhealthy",
                    "error": str(e),
                    "latency_ms": round((time.time() - start) * 1000, 2)
                }
        
        return {
            "status": "healthy" if all_healthy else "unhealthy",
            "checks": results
        }


health_checker = HealthChecker()

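# Database check
def check_database():
    # connect_timeout keeps an unreachable database from hanging the probe.
    conn = psycopg2.connect(DATABASE_URL, connect_timeout=2)
    try:
        with conn.cursor() as cursor:
            cursor.execute("SELECT 1")
    finally:
        conn.close()  # Close the connection even if the query raises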

health_checker.add_check("database", check_database)

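# Redis check
def check_redis():
    # Socket timeouts keep an unresponsive Redis from hanging the probe.
    r = redis.from_url(REDIS_URL, socket_connect_timeout=2, socket_timeout=2)
    r.ping()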

health_checker.add_check("redis", check_redis)

# External API check (with timeout)
def check_payment_api():
    response = requests.get(
        "https://api.stripe.com/health",
        timeout=2
    )
    response.raise_for_status()

health_checker.add_check("payment_api", check_payment_api)


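@app.get("/health/ready")
def readiness(response: Response):
    # Plain def: FastAPI runs it in a threadpool, so the blocking checks
    # don't stall the event loop that also serves /health.
    result = health_checker.run_checks()
    if result["status"] != "healthy":
        response.status_code = 503
    return result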
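
Because the checks run one after another, their timeouts add up and can exceed the probe's own timeoutSeconds. One possible way around that is to run them in parallel under a shared deadline. The sketch below uses only the standard library; `run_checks_with_deadline` and the three demo checks are hypothetical names, not part of the HealthChecker class above.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Callable, Dict


def run_checks_with_deadline(
    checks: Dict[str, Callable[[], None]], deadline_s: float
) -> Dict[str, str]:
    """Run all checks in parallel; a check that misses the deadline is unhealthy."""
    results: Dict[str, str] = {}
    pool = ThreadPoolExecutor(max_workers=max(len(checks), 1))
    futures = {name: pool.submit(fn) for name, fn in checks.items()}
    stop_at = time.monotonic() + deadline_s
    for name, future in futures.items():
        # Every check shares one deadline rather than getting its own.
        remaining = max(stop_at - time.monotonic(), 0.0)
        try:
            future.result(timeout=remaining)
            results[name] = "healthy"
        except FutureTimeout:
            results[name] = "unhealthy: timed out"
        except Exception as exc:
            results[name] = f"unhealthy: {exc}"
    # Don't wait for stragglers; their threads finish in the background.
    pool.shutdown(wait=False)
    return results


def ok() -> None:
    pass


def slow() -> None:
    time.sleep(0.5)


def broken() -> None:
    raise RuntimeError("connection refused")


print(run_checks_with_deadline({"ok": ok, "slow": slow, "broken": broken}, deadline_s=0.1))
```

The deadline only bounds how long the probe waits. A timed-out check keeps running in its thread, so each check still needs its own client-side timeout.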

Kubernetes Configuration

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
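spec:
  replicas: 3
  # apps/v1 Deployments require a selector that matches the pod template labels.
  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server
    spec: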
      containers:
      - name: api
        image: myapp:latest
        ports:
        - containerPort: 8080
        
        # Liveness: restart if stuck
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 15
          timeoutSeconds: 5
          failureThreshold: 3
        
        # Readiness: remove from LB if dependencies down
        readinessProbe:
          httpGet:
            path: /health/ready
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 2
        
        # Startup: wait for slow initialization
        startupProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 0
          periodSeconds: 5
          timeoutSeconds: 5
          failureThreshold: 30  # 30 * 5s = 150s max startup

Common Mistakes

1. Checking Dependencies in Liveness Probes

# DON'T do this
livenessProbe:
  httpGet:
    path: /health/ready  # Checks database!

If your database goes down, Kubernetes will restart all your pods. They’ll come up, fail the liveness check (database still down), and restart again. Cascading failure.

Liveness should only check if the process itself is healthy — not external dependencies.
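
One way to check "the process itself" without touching any dependency is a heartbeat: the main work loop records when it last made progress, and the liveness endpoint fails once that timestamp goes stale. The following is a minimal sketch under that assumption; the `Heartbeat` class and its parameters are hypothetical, and the clock is injectable only so the demo runs instantly.

```python
import time
from typing import Callable


class Heartbeat:
    """Tracks when the main work loop last made progress."""

    def __init__(self, max_age_s: float, clock: Callable[[], float] = time.monotonic):
        self.max_age_s = max_age_s
        self._clock = clock
        self._last_beat = clock()

    def beat(self) -> None:
        """Call from the work loop after each unit of work."""
        self._last_beat = self._clock()

    def is_alive(self) -> bool:
        """True if the loop reported progress within max_age_s."""
        return self._clock() - self._last_beat <= self.max_age_s


# Demo with a fake clock instead of real waiting.
fake_now = [0.0]
hb = Heartbeat(max_age_s=30, clock=lambda: fake_now[0])

fake_now[0] = 10.0
print(hb.is_alive())  # True: last beat was 10s ago

fake_now[0] = 45.0
print(hb.is_alive())  # False: no beat for 45s, the loop looks stuck

hb.beat()
print(hb.is_alive())  # True: the loop reported progress again
```

A /health handler built on this would return 503 when `is_alive()` is false, so a restart is triggered only by a stuck process and never by a database outage.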

2. Timeouts Longer Than Probe Intervals

# Broken configuration
readinessProbe:
  periodSeconds: 5
  timeoutSeconds: 10  # Timeout > period!

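A probe that hangs until its timeout holds up the next one, so the effective interval stretches past periodSeconds and failures are detected later than the configuration suggests. Keep timeoutSeconds at or below periodSeconds.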
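
This mistake is easy to catch mechanically. Below is a small sketch of a lint function that takes a probe spec as a plain dict, for example after loading the manifest with a YAML parser. The `lint_probe` helper is hypothetical; the fallback values are the Kubernetes defaults for omitted fields (periodSeconds 10, timeoutSeconds 1).

```python
from typing import Dict, List

# Kubernetes defaults when a field is omitted from the probe spec.
PROBE_DEFAULTS = {"periodSeconds": 10, "timeoutSeconds": 1}


def lint_probe(name: str, probe: Dict[str, int]) -> List[str]:
    """Return warnings for a probe spec given as a plain dict."""
    period = probe.get("periodSeconds", PROBE_DEFAULTS["periodSeconds"])
    timeout = probe.get("timeoutSeconds", PROBE_DEFAULTS["timeoutSeconds"])
    warnings: List[str] = []
    if timeout > period:
        warnings.append(
            f"{name}: timeoutSeconds ({timeout}) exceeds periodSeconds ({period})"
        )
    return warnings


# The broken configuration from above is flagged.
print(lint_probe("readinessProbe", {"periodSeconds": 5, "timeoutSeconds": 10}))
# A probe with period 15 and timeout 5 passes.
print(lint_probe("livenessProbe", {"periodSeconds": 15, "timeoutSeconds": 5}))
```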

3. No Startup Probe for Slow Apps

# App takes 60s to start, but liveness kills it at 45s
livenessProbe:
  initialDelaySeconds: 30
  failureThreshold: 3
  periodSeconds: 5
  # 30 + (3 * 5) = 45s max before restart

Use startup probes for applications that need time to initialize:

startupProbe:
  httpGet:
    path: /health
    port: 8080
  failureThreshold: 60
  periodSeconds: 5
  # 60 * 5 = 300s (5 min) to start
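
The arithmetic in these comments can be wrapped in a helper to sanity-check a probe against a known startup time. This is a sketch that follows the same approximation as the comments above (initial delay plus failureThreshold times periodSeconds) and ignores timeoutSeconds. The function names are hypothetical.

```python
def restart_deadline_seconds(initial_delay: int, period: int, failure_threshold: int) -> int:
    """Approximate time a container has before repeated probe failures restart it."""
    return initial_delay + failure_threshold * period


def survives_startup(app_start_s: int, initial_delay: int, period: int,
                     failure_threshold: int) -> bool:
    """True if the app finishes starting before the probe gives up."""
    return app_start_s <= restart_deadline_seconds(initial_delay, period, failure_threshold)


# Liveness-only config from the mistake above: 30 + 3 * 5 = 45s, too short for a 60s start.
print(restart_deadline_seconds(30, 5, 3), survives_startup(60, 30, 5, 3))

# Startup probe: 60 * 5 = 300s, enough for the same app.
print(restart_deadline_seconds(0, 5, 60), survives_startup(60, 0, 5, 60))
```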

Advanced Patterns

Graceful Degradation

Return partial health when non-critical dependencies fail:

@app.get("/health/ready")
def readiness(response: Response):
    # Plain def so FastAPI runs the blocking checks in its threadpool.
    # health_checker.checks is a list of (name, fn) tuples, so index it by name first.
    checks = dict(health_checker.checks)
    results = {}
    critical_healthy = True

    # Critical: must be healthy
    for name in ["database", "redis"]:
        try:
            checks[name]()
            results[name] = "healthy"
        except Exception as e:
            results[name] = f"unhealthy: {e}"
            critical_healthy = False

    # Non-critical: report as degraded but don't fail the probe.
    # Assumes "email_service" and "analytics" were registered with add_check();
    # a missing name raises KeyError and is reported as degraded too.
    for name in ["email_service", "analytics"]:
        try:
            checks[name]()
            results[name] = "healthy"
        except Exception as e:
            results[name] = f"degraded: {e}"
            # Don't set critical_healthy = False

    if not critical_healthy:
        response.status_code = 503

    return {"status": "healthy" if critical_healthy else "unhealthy", "checks": results}

Circuit Breaker Integration

Skip dependency checks if the circuit is open:

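import requests
from circuitbreaker import CircuitBreakerError, circuit

# Placeholder URL for illustration; point this at the real dependency.
EXTERNAL_API = "https://example.com/health"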

@circuit(failure_threshold=3, recovery_timeout=30)
def check_external_api():
    response = requests.get(EXTERNAL_API, timeout=2)
    response.raise_for_status()

def check_external_api_safe():
    try:
        check_external_api()
        return "healthy"
    except CircuitBreakerError:
        return "circuit_open"  # Don't count as failure
    except Exception as e:
        return f"unhealthy: {e}"
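
To make the behavior behind the decorator concrete, here is a minimal standard-library sketch of the circuit breaker idea. It is not the circuitbreaker package's implementation: `SimpleCircuitBreaker` and `CircuitOpenError` are hypothetical names, and the injectable clock exists only so the demo needs no real waiting.

```python
import time
from typing import Callable, Optional


class CircuitOpenError(Exception):
    """Raised when a call is skipped because the circuit is open."""


class SimpleCircuitBreaker:
    def __init__(self, failure_threshold: int = 3, recovery_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, fn: Callable[[], None]) -> None:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.recovery_timeout:
                raise CircuitOpenError("circuit open, call skipped")
            # Recovery timeout elapsed: fall through and allow one trial call.
        try:
            fn()
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None


now = [0.0]
attempts = []


def failing_dependency() -> None:
    attempts.append(now[0])
    raise ConnectionError("down")


breaker = SimpleCircuitBreaker(failure_threshold=3, recovery_timeout=30, clock=lambda: now[0])
for _ in range(5):
    try:
        breaker.call(failing_dependency)
    except CircuitOpenError:
        print("skipped")  # the dependency was not contacted
    except ConnectionError:
        print("failed")

print(len(attempts))  # 3: the last two probes never reached the dependency
```

Once the circuit is open, health checks stop adding load to a dependency that is already struggling, which is the reason for reporting "circuit_open" separately from a fresh failure.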

Health Check Caching

Avoid hammering dependencies on every probe:

import threading
import time

class CachedHealthChecker:
    def __init__(self, ttl_seconds: float = 5):
        self.ttl = ttl_seconds
        self.last_check = None
        self.cached_result = None
        self._lock = threading.Lock()

    def check(self, checker: HealthChecker):
        # The lock keeps concurrent probes from all refreshing at once.
        with self._lock:
            # Monotonic time is unaffected by wall-clock adjustments.
            now = time.monotonic()
            if self.last_check is None or now - self.last_check > self.ttl:
                self.cached_result = checker.run_checks()
                self.last_check = now
            return self.cached_result


cached_checker = CachedHealthChecker(ttl_seconds=5)

@app.get("/health/ready")
async def readiness(response: Response):
    result = cached_checker.check(health_checker)
    if result["status"] != "healthy":
        response.status_code = 503
    return result

Monitoring Health Checks

Export health check results as metrics:

from prometheus_client import Gauge, Histogram

health_status = Gauge(
    'app_health_check_status',
    'Health check status (1=healthy, 0=unhealthy)',
    ['check_name']
)

health_latency = Histogram(
    'app_health_check_latency_seconds',
    'Health check latency',
    ['check_name']
)

def run_checks_with_metrics(checker: HealthChecker):
    result = checker.run_checks()
    
    for name, check_result in result["checks"].items():
        is_healthy = 1 if check_result["status"] == "healthy" else 0
        health_status.labels(check_name=name).set(is_healthy)
        health_latency.labels(check_name=name).observe(
            check_result["latency_ms"] / 1000
        )
    
    return result

Summary

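| Probe Type | Purpose | What to Check | On Failure |
|---|---|---|---|
| Liveness | Process alive? | Basic HTTP response | Container restart |
| Readiness | Can handle traffic? | Dependencies, resources | Remove from LB |
| Startup | Finished initializing? | Basic HTTP response | Keep waiting (restart once failureThreshold is exceeded) |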

Good health checks are the difference between “the system self-healed” and “we had a 3 AM incident.” Take the time to implement them properly.