Your application says it’s running. But is it actually working?

Health checks answer that question. They’re the difference between “process exists” and “service is functional.” Get them wrong, and your orchestrator will either route traffic to broken instances or restart healthy ones.

Three Types of Probes

Liveness: “Is this process stuck?”

Liveness probes detect deadlocks, infinite loops, and other hung states the process can’t recover from on its own. If liveness fails failureThreshold times in a row, the container gets killed and restarted.

livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  failureThreshold: 3

What to check:

  • Can the process respond at all?
  • Are critical threads alive?
  • Has the event loop stalled?

What NOT to check:

  • Database connectivity (that’s readiness)
  • External dependencies (don’t restart because Stripe is down)

app.get('/healthz', (req, res) => {
  // Just prove we're not deadlocked
  res.status(200).json({ status: 'alive' });
});

Readiness: “Can this instance handle traffic?”

Readiness probes control load balancer routing. If readiness fails, traffic stops flowing to that instance, but the instance is not restarted.

readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
  failureThreshold: 3

What to check:

  • Database connections working?
  • Cache connected?
  • Required config loaded?
  • Warmup complete?

app.get('/ready', async (req, res) => {
  try {
    // Check critical dependencies
    await db.query('SELECT 1');
    await redis.ping();
    
    if (!configLoaded) {
      return res.status(503).json({ ready: false, reason: 'config not loaded' });
    }
    
    res.status(200).json({ ready: true });
  } catch (error) {
    res.status(503).json({ ready: false, reason: error.message });
  }
});

Startup: “Has this instance finished starting?”

Startup probes handle slow-starting containers. Until startup succeeds, liveness and readiness probes are disabled.

startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 10
  failureThreshold: 30  # 30 * 10 = 300 seconds max startup

Use startup probes when your app needs time to:

  • Load large datasets into memory
  • Build caches
  • Run database migrations
  • Complete warmup routines
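
One way to back a startup probe is to track named warmup tasks and report success only once all of them have finished. The sketch below is a minimal illustration under stated assumptions, not a prescribed implementation: `createStartupTracker`, the task names, and the `/startup` path in the usage comment are all hypothetical.

```javascript
// Hypothetical helper: tracks named startup tasks (names are illustrative).
function createStartupTracker(taskNames) {
  const pending = new Set(taskNames);
  return {
    // Call when a task such as cache warmup has finished.
    markDone(name) {
      pending.delete(name);
    },
    // True once every registered task has been marked done.
    isComplete() {
      return pending.size === 0;
    },
    // Remaining task names, useful in the probe's response body.
    remaining() {
      return [...pending];
    },
  };
}

const startup = createStartupTracker(['migrations', 'cache']);

// Usage inside an Express app (assumed, matching the handlers in this post):
// app.get('/startup', (req, res) => {
//   if (!startup.isComplete()) {
//     return res.status(503).json({ started: false, waitingOn: startup.remaining() });
//   }
//   res.status(200).json({ started: true });
// });
```

Each startup step would call `startup.markDone('cache')` (or similar) when it completes, so the probe flips to 200 only after the last one.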

The Dependency Question

Should readiness check external dependencies?

Conservative view: Yes—if the database is down, this instance can’t serve requests.

Pragmatic view: Be careful. If every instance checks the same external dependency, one database hiccup marks ALL instances unready simultaneously. That’s a self-inflicted outage.

Middle ground: Check dependencies, but with nuance:

app.get('/ready', async (req, res) => {
  const checks = await Promise.allSettled([
    checkDatabase(),
    checkCache(),
    checkMessageQueue(),
  ]);
  
  const results = {
    database: checks[0].status === 'fulfilled',
    cache: checks[1].status === 'fulfilled',
    queue: checks[2].status === 'fulfilled',
  };
  
  // Critical: database must work
  // Degraded: cache/queue can be down temporarily
  if (!results.database) {
    return res.status(503).json({ ready: false, checks: results });
  }
  
  // Ready but degraded
  res.status(200).json({ ready: true, degraded: !results.cache || !results.queue, checks: results });
});

Timeouts and Thresholds

readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5   # Wait before first check
  periodSeconds: 5         # Check every 5 seconds
  timeoutSeconds: 3        # Each check times out after 3s
  successThreshold: 1      # 1 success = ready
  failureThreshold: 3      # 3 failures = not ready

Failure threshold matters: One slow database query shouldn’t immediately remove an instance from rotation. Three consecutive failures mean something’s actually wrong.
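
The flip side of a higher threshold is slower detection. Under a simplified model (probes fire exactly every `periodSeconds`, and every failing probe runs to its full timeout), the worst-case delay before an instance leaves rotation can be estimated as below. `worstCaseDetectionSeconds` is a hypothetical helper, and real probe scheduling includes jitter, so treat the result as a rough figure.

```javascript
// Rough estimate only: assumes evenly spaced probes and that each failing
// probe takes the full timeout before it is counted as a failure.
function worstCaseDetectionSeconds({ periodSeconds, timeoutSeconds, failureThreshold }) {
  // The failure can begin just after a successful probe, so the first failed
  // probe starts up to one period later; the last one starts at
  // periodSeconds * failureThreshold and needs timeoutSeconds to fail.
  return periodSeconds * failureThreshold + timeoutSeconds;
}

const seconds = worstCaseDetectionSeconds({
  periodSeconds: 5,
  timeoutSeconds: 3,
  failureThreshold: 3,
});
console.log(seconds); // 18
```

With the values from the config above, a broken instance may keep receiving traffic for roughly 18 seconds. Lowering failureThreshold shortens that window at the cost of more false alarms.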

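Timeout matters: A probe that doesn’t answer within timeoutSeconds counts as a failure, and Kubernetes defaults that to 1 second. Inside the handler, bound every dependency call as well, so a hung connection turns into a fast 503 rather than a request that never returns. A hanging health check is worse than a failed one.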
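
A common way to keep a handler from hanging is to race each dependency call against a timer. This is a minimal sketch, assuming the dependency call returns a promise; `withTimeout` and its `label` parameter are hypothetical names, not part of any library.

```javascript
// Hypothetical helper: rejects if `promise` does not settle within `ms`.
function withTimeout(promise, ms, label = 'check') {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms}ms`)), ms);
  });
  // Whichever settles first wins; always clear the timer afterwards.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Usage inside a readiness handler (db is the app's own client):
// await withTimeout(db.query('SELECT 1'), 2000, 'database');
```

Keep the inner timeout below the probe’s timeoutSeconds so the handler can still send its 503. Note that this only stops waiting; it does not cancel the underlying query.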

Health Check Anti-Patterns

The Expensive Health Check

// DON'T: Full database query on every probe
app.get('/ready', async (req, res) => {
  const count = await db.query('SELECT COUNT(*) FROM orders');  // Expensive!
  res.json({ ready: true, orderCount: count });
});

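// DO: Lightweight connectivity check
app.get('/ready', async (req, res) => {
  try {
    await db.query('SELECT 1');  // Just prove connection works
    res.json({ ready: true });
  } catch (error) {
    // Report the failure explicitly instead of leaving the rejection unhandled
    res.status(503).json({ ready: false });
  }
});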
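
Even a cheap query adds up when several probes hit every instance every few seconds. One option is to reuse a recent result for a short time. The sketch below assumes a promise-returning check; `cacheCheck` is a hypothetical helper, and the 2-second TTL in the usage comment is an arbitrary example value.

```javascript
// Hypothetical helper: reuses the last result of `check` for `ttlMs`.
// `now` is injectable so the behaviour can be tested without waiting.
function cacheCheck(check, ttlMs, now = Date.now) {
  let lastRun = -Infinity;
  let lastResult = Promise.resolve(false);
  return function cachedCheck() {
    if (now() - lastRun >= ttlMs) {
      lastRun = now();
      // Normalise to true/false so callers never see a rejection.
      lastResult = Promise.resolve().then(check).then(() => true, () => false);
    }
    return lastResult;
  };
}

// Usage (db is the app's own client):
// const dbOk = cacheCheck(() => db.query('SELECT 1'), 2000);
// app.get('/ready', async (req, res) => {
//   const ok = await dbOk();
//   res.status(ok ? 200 : 503).json({ ready: ok });
// });
```

The trade-off is staleness: the probe can report an outcome up to `ttlMs` old, so the TTL should stay well below the probe period multiplied by the failure threshold.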

The External Dependency Cascade

// DON'T: Check every external API
app.get('/ready', async (req, res) => {
  await fetch('https://api.stripe.com/health');
  await fetch('https://api.sendgrid.com/health');
  await fetch('https://api.twilio.com/health');
  res.json({ ready: true });
});
// One external outage = your entire service is "unready"

Liveness Checking Dependencies

// DON'T: Restart when database is slow
app.get('/healthz', async (req, res) => {
  await db.query('SELECT 1');  // If this times out, container restarts
  res.json({ alive: true });
});

// DO: Liveness checks only local state
app.get('/healthz', (req, res) => {
  if (eventLoopBlocked()) {
    return res.status(503).json({ alive: false });
  }
  res.json({ alive: true });
});
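
The `eventLoopBlocked()` call above is left undefined. One possible way to implement it is to watch how late a repeating timer fires, since a late timer means the loop was busy. This is a sketch under assumptions: `createEventLoopMonitor`, its option names, and the default thresholds are hypothetical and would need tuning for a real service.

```javascript
// Hypothetical helper: measures how late a repeating timer fires.
// A timer that fires late means the event loop was busy or blocked.
function createEventLoopMonitor({ intervalMs = 500, maxLagMs = 1000, now = Date.now } = {}) {
  let lastTick = now();
  let lastLagMs = 0;

  const timer = setInterval(() => {
    const current = now();
    lastLagMs = Math.max(0, current - lastTick - intervalMs);
    lastTick = current;
  }, intervalMs);
  timer.unref(); // monitoring alone should not keep the process alive

  return {
    eventLoopBlocked() {
      // Also count a tick that is overdue right now, not only the last lag.
      const overdueMs = now() - lastTick - intervalMs;
      return Math.max(lastLagMs, overdueMs) > maxLagMs;
    },
    stop() {
      clearInterval(timer);
    },
  };
}

const monitor = createEventLoopMonitor();
// monitor.eventLoopBlocked() can then back a liveness handler.
```

If the loop is completely frozen, the handler never runs and the probe simply times out, which the orchestrator also counts as a failure. A monitor like this catches the in-between case where the loop keeps stalling but still answers occasionally.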

Missing Graceful Shutdown Integration

// DON'T: Stay "ready" during shutdown
process.on('SIGTERM', () => {
  // App starts shutting down but still claims to be ready
  setTimeout(() => process.exit(0), 30000);
});

// DO: Mark unready immediately
let shuttingDown = false;

process.on('SIGTERM', () => {
  shuttingDown = true;  // Readiness probe will now fail
  // ... graceful shutdown logic
});

app.get('/ready', (req, res) => {
  if (shuttingDown) {
    return res.status(503).json({ ready: false, reason: 'shutting down' });
  }
  // ... other checks
});
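
The shutdown logic itself is elided above. One plausible shape for it, offered as a sketch and not as the canonical approach, is to keep serving for a short drain period after readiness starts failing and then close the server. `createShutdownController`, `drainDelayMs`, and the 5-second default are hypothetical.

```javascript
// Hypothetical helper: `server` is anything with a Node-style close(callback),
// such as the http.Server returned by app.listen().
function createShutdownController(server, { drainDelayMs = 5000, exit = process.exit } = {}) {
  let shuttingDown = false;
  return {
    isShuttingDown: () => shuttingDown,
    begin() {
      if (shuttingDown) return; // ignore repeated signals
      shuttingDown = true;      // readiness should start failing now
      // Keep serving briefly so the failing readiness probe can take effect,
      // then stop accepting connections and exit once in-flight requests end.
      setTimeout(() => server.close(() => exit(0)), drainDelayMs);
    },
  };
}

// Usage (server and app are the application's own objects):
// const shutdown = createShutdownController(server);
// process.on('SIGTERM', () => shutdown.begin());
// app.get('/ready', (req, res) => {
//   if (shutdown.isShuttingDown()) return res.status(503).json({ ready: false });
//   res.status(200).json({ ready: true });
// });
```

The drain delay should be long enough for the readiness probe to fail and routing to update, and the whole sequence has to finish before the orchestrator’s termination grace period runs out.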

Deep Health Checks

For debugging, expose a detailed health endpoint (protected, not used by probes):

app.get('/health/detailed', authenticate, async (req, res) => {
  const health = {
    status: 'healthy',
    timestamp: new Date().toISOString(),
    uptime: process.uptime(),
    memory: process.memoryUsage(),
    checks: {},
  };
  
  // Database
  try {
    const start = Date.now();
    await db.query('SELECT 1');
    health.checks.database = { status: 'up', latencyMs: Date.now() - start };
  } catch (e) {
    health.checks.database = { status: 'down', error: e.message };
    health.status = 'unhealthy';
  }
  
  // Redis
  try {
    const start = Date.now();
    await redis.ping();
    health.checks.redis = { status: 'up', latencyMs: Date.now() - start };
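  } catch (e) {
    health.checks.redis = { status: 'down', error: e.message };
    // Don't downgrade 'unhealthy' to 'degraded' if the database is also down
    if (health.status === 'healthy') health.status = 'degraded';
  }
  
  // Queue depth
  try {
    const queueDepth = await messageQueue.getDepth();
    health.checks.queue = {
      status: queueDepth < 10000 ? 'healthy' : 'backlogged',
      depth: queueDepth,
    };
  } catch (e) {
    // A failing queue lookup should not take the whole endpoint down
    health.checks.queue = { status: 'unknown', error: e.message };
  }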
  
  res.json(health);
});

The Mental Model

Think of health checks like a doctor’s visit:

  • Liveness: “Is the patient breathing?” (Basic vital signs)
  • Readiness: “Can the patient work today?” (Functional capacity)
  • Startup: “Has the patient recovered from surgery?” (Post-operation recovery)

You don’t send someone home from the hospital just because they can breathe. And you don’t restart a server just because the database is slow.

Match the check to the question. Liveness for “should this exist?” Readiness for “should this serve traffic?”

Get it right, and your orchestrator becomes a partner. Get it wrong, and it becomes a source of cascading failures.