Your application says it’s running. But is it actually working?
Health checks answer that question. They’re the difference between “process exists” and “service is functional.” Get them wrong, and your orchestrator will either route traffic to broken instances or restart healthy ones.
Three Types of Probes
Liveness: “Is this process stuck?”
Liveness probes detect deadlocks, infinite loops, and other hung states where the process still exists but no longer makes progress. If liveness fails, the container gets killed and restarted.
```yaml
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  failureThreshold: 3
```
What to check:
- Can the process respond at all?
- Are critical threads alive?
- Has the event loop stalled?
What NOT to check:
- Database connectivity (that’s readiness)
- External dependencies (don’t restart because Stripe is down)
```javascript
app.get('/healthz', (req, res) => {
  // Just prove we're not deadlocked
  res.status(200).json({ status: 'alive' });
});
```
Readiness: “Can this instance handle traffic?”
Readiness probes control load balancer routing. If readiness fails, traffic stops flowing to that instance—but it doesn’t get restarted.
```yaml
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
  failureThreshold: 3
```
What to check:
- Database connections working?
- Cache connected?
- Required config loaded?
- Warmup complete?
```javascript
app.get('/ready', async (req, res) => {
  try {
    // Check critical dependencies
    await db.query('SELECT 1');
    await redis.ping();

    if (!configLoaded) {
      return res.status(503).json({ ready: false, reason: 'config not loaded' });
    }

    res.status(200).json({ ready: true });
  } catch (error) {
    res.status(503).json({ ready: false, reason: error.message });
  }
});
```
Startup: “Has this instance finished starting?”
Startup probes handle slow-starting containers. Until startup succeeds, liveness and readiness probes are disabled.
```yaml
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 10
  failureThreshold: 30 # 30 * 10 = 300 seconds max startup
```
Use startup probes when your app needs time to:
- Load large datasets into memory
- Build caches
- Run database migrations
- Complete warmup routines
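The arithmetic in the probe comment generalizes to any probe configuration. As a rough sketch (the helper name `startupBudgetSeconds` is made up for illustration, and the estimate ignores per-probe timeouts and scheduling jitter), the time a container has before the startup probe gives up is approximately the initial delay plus one period per allowed failure:

```javascript
// Hypothetical helper: estimate how long a container may take to start
// before the startup probe exhausts its failure budget.
// The defaults mirror the Kubernetes probe defaults.
function startupBudgetSeconds({
  initialDelaySeconds = 0,
  periodSeconds = 10,
  failureThreshold = 3,
}) {
  return initialDelaySeconds + periodSeconds * failureThreshold;
}

// The startup probe shown above: 300 seconds of probing plus the 10 second delay.
const budget = startupBudgetSeconds({
  initialDelaySeconds: 10,
  periodSeconds: 10,
  failureThreshold: 30,
});
console.log(budget); // 310
```

If your slowest observed startup gets close to this number, raising failureThreshold is usually the gentler fix, since a long initialDelaySeconds also delays detection for instances that start quickly.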
The Dependency Question
Should readiness check external dependencies?
Conservative view: Yes—if the database is down, this instance can’t serve requests.
Pragmatic view: Be careful. If every instance checks the same external dependency, one database hiccup marks ALL instances unready simultaneously. That’s a self-inflicted outage.
Middle ground: Check dependencies, but with nuance:
```javascript
app.get('/ready', async (req, res) => {
  const checks = await Promise.allSettled([
    checkDatabase(),
    checkCache(),
    checkMessageQueue(),
  ]);

  const results = {
    database: checks[0].status === 'fulfilled',
    cache: checks[1].status === 'fulfilled',
    queue: checks[2].status === 'fulfilled',
  };

  // Critical: database must work
  // Degraded: cache/queue can be down temporarily
  if (!results.database) {
    return res.status(503).json({ ready: false, checks: results });
  }

  // Ready but degraded
  res.status(200).json({
    ready: true,
    degraded: !results.cache || !results.queue,
    checks: results,
  });
});
```
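One way to keep the critical/optional split explicit as the number of dependencies grows is a table-driven variant of the same idea. This is only a sketch: `evaluateReadiness` and the `{ critical, run }` shape are hypothetical names chosen for illustration, not part of Express or any library.

```javascript
// Hypothetical table-driven readiness evaluation.
// `checks` maps a name to { critical: boolean, run: () => Promise<unknown> }.
async function evaluateReadiness(checks) {
  const names = Object.keys(checks);
  const settled = await Promise.allSettled(
    // Promise.resolve().then(...) also captures synchronous throws from run()
    names.map((name) => Promise.resolve().then(() => checks[name].run()))
  );

  const results = {};
  let ready = true;
  let degraded = false;

  settled.forEach((outcome, i) => {
    const name = names[i];
    const ok = outcome.status === 'fulfilled';
    results[name] = ok;
    if (!ok && checks[name].critical) ready = false; // critical failure: unready
    if (!ok && !checks[name].critical) degraded = true; // optional failure: degraded
  });

  return {
    statusCode: ready ? 200 : 503,
    body: { ready, degraded, checks: results },
  };
}
```

Inside a handler this could be used as `const { statusCode, body } = await evaluateReadiness({ database: { critical: true, run: checkDatabase } }); res.status(statusCode).json(body);`, which keeps the decision logic testable without an HTTP server.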
Timeouts and Thresholds
```yaml
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5   # Wait before first check
  periodSeconds: 5         # Check every 5 seconds
  timeoutSeconds: 3        # Each check times out after 3s
  successThreshold: 1      # 1 success = ready
  failureThreshold: 3      # 3 failures = not ready
```
Failure threshold matters: One slow database query shouldn’t immediately remove an instance from rotation. Three consecutive failures means something’s actually wrong.
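Timeout matters: Bound every health check. timeoutSeconds makes the orchestrator count a slow response as a failure, and each dependency call inside the handler should have its own, shorter deadline so the handler can still answer with a 503. A hanging health check is worse than a failed one.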
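On the application side, a dependency call can be given its own deadline so the handler never waits longer than the probe does. The following is a minimal sketch using `Promise.race`; `withTimeout` is a hypothetical helper name, and the underlying operation is not cancelled, only abandoned.

```javascript
// Hypothetical helper: reject if `promise` has not settled within `ms`.
// The original operation keeps running in the background; only the
// health check stops waiting for it.
function withTimeout(promise, ms, label = 'check') {
  let timer;
  const deadline = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${ms}ms`)),
      ms
    );
  });
  // Clear the timer either way so it does not keep the process alive
  return Promise.race([promise, deadline]).finally(() => clearTimeout(timer));
}
```

In a readiness handler this might look like `await withTimeout(db.query('SELECT 1'), 1000, 'database')`, with the deadline kept below the probe's timeoutSeconds (3 seconds in the example above).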
Health Check Anti-Patterns
The Expensive Health Check
```javascript
// DON'T: Full database query on every probe
app.get('/ready', async (req, res) => {
  const count = await db.query('SELECT COUNT(*) FROM orders'); // Expensive!
  res.json({ ready: true, orderCount: count });
});

// DO: Lightweight connectivity check
app.get('/ready', async (req, res) => {
  try {
    await db.query('SELECT 1'); // Just prove connection works
    res.json({ ready: true });
  } catch (error) {
    // Answer with a failure instead of leaving the probe hanging
    res.status(503).json({ ready: false });
  }
});
```
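If even a lightweight check adds up (many replicas, short probe periods), the result can be reused for a short time instead of hitting the dependency on every probe. This is a sketch under assumptions: `cachedCheck` is a hypothetical helper, and the injectable `now` parameter exists only to make the behavior easy to test.

```javascript
// Hypothetical helper: run `check` at most once per `ttlMs` and reuse
// the outcome (true = healthy, false = failed) for probes in between.
function cachedCheck(check, ttlMs, now = Date.now) {
  let lastRun = -Infinity;
  let lastResult = null; // Promise<boolean> of the most recent run

  return () => {
    if (now() - lastRun >= ttlMs) {
      lastRun = now();
      // Convert success/failure into a boolean so callers never see a rejection
      lastResult = Promise.resolve()
        .then(() => check())
        .then(() => true, () => false);
    }
    return lastResult;
  };
}
```

A handler could then call `const dbOk = await checkDb();` where `checkDb = cachedCheck(() => db.query('SELECT 1'), 2000)`. The trade-off is staleness: a failure can go unnoticed for up to the TTL, so it should stay at or below the probe's periodSeconds.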
The External Dependency Cascade
```javascript
// DON'T: Check every external API
app.get('/ready', async (req, res) => {
  await fetch('https://api.stripe.com/health');
  await fetch('https://api.sendgrid.com/health');
  await fetch('https://api.twilio.com/health');
  res.json({ ready: true });
});
// One external outage = your entire service is "unready"
```
Liveness Checking Dependencies
```javascript
// DON'T: Restart when database is slow
app.get('/healthz', async (req, res) => {
  await db.query('SELECT 1'); // If this times out, container restarts
  res.json({ alive: true });
});

// DO: Liveness checks only local state
app.get('/healthz', (req, res) => {
  if (eventLoopBlocked()) {
    return res.status(503).json({ alive: false });
  }
  res.json({ alive: true });
});
```
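The `eventLoopBlocked()` call above is left undefined. One possible way to implement it in Node.js, sketched here as an assumption rather than a standard API, is to schedule a repeating timer and measure how late it fires. `createLagMonitor` and its options are hypothetical names, and the `now` parameter is injectable only for testing.

```javascript
// Hypothetical event loop lag monitor for Node.js.
// A timer is scheduled every `intervalMs`; if it fires late, the event loop
// was busy for roughly the difference.
function createLagMonitor({ intervalMs = 500, maxLagMs = 1000, now = Date.now } = {}) {
  let lastTick = now();
  let lagMs = 0;

  const tick = () => {
    const current = now();
    lagMs = Math.max(0, current - lastTick - intervalMs);
    lastTick = current;
  };

  const timer = setInterval(tick, intervalMs);
  timer.unref(); // do not keep the process alive just for monitoring

  return {
    tick, // exposed so the logic can be driven manually in tests
    // Blocked if the last tick was late, or if a tick is long overdue
    eventLoopBlocked: () =>
      lagMs > maxLagMs || now() - lastTick > intervalMs + maxLagMs,
    stop: () => clearInterval(timer),
  };
}
```

Note the limits of this approach: a fully blocked loop cannot run the handler at all, so that case is caught by the probe's timeout rather than by this function. The monitor mainly helps when the loop is stalling repeatedly but still answers between stalls.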
Missing Graceful Shutdown Integration
```javascript
// DON'T: Stay "ready" during shutdown
process.on('SIGTERM', () => {
  // App starts shutting down but still claims to be ready
  setTimeout(() => process.exit(0), 30000);
});

// DO: Mark unready immediately
let shuttingDown = false;

process.on('SIGTERM', () => {
  shuttingDown = true; // Readiness probe will now fail
  // ... graceful shutdown logic
});

app.get('/ready', (req, res) => {
  if (shuttingDown) {
    return res.status(503).json({ ready: false, reason: 'shutting down' });
  }
  // ... other checks
});
```
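The elided "graceful shutdown logic" could take many forms. A minimal sketch, assuming a server object with a `close(callback)` method (such as Node's `http.Server`) and a drain window you pick yourself, might look like this. `createShutdownHandler`, `drainMs`, and `onExit` are hypothetical names for illustration.

```javascript
// Hypothetical shutdown coordinator.
// `server` is anything with close(callback), e.g. the value returned by app.listen().
function createShutdownHandler(server, { drainMs = 5000, onExit = () => process.exit(0) } = {}) {
  const state = { shuttingDown: false };

  const handleSignal = () => {
    if (state.shuttingDown) return; // ignore repeated signals
    state.shuttingDown = true; // the readiness endpoint should return 503 from now on

    // Keep serving during the drain window so in-flight requests finish and
    // the orchestrator has time to stop routing, then stop accepting connections.
    setTimeout(() => server.close(onExit), drainMs);
  };

  return { state, handleSignal };
}
```

It would be wired up with `const { state, handleSignal } = createShutdownHandler(server); process.on('SIGTERM', handleSignal);`, and the `/ready` handler would check `state.shuttingDown`. The drain window should be shorter than the orchestrator's termination grace period, otherwise the process is killed before it finishes draining.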
Deep Health Checks
For debugging, expose a detailed health endpoint (protected, not used by probes):
```javascript
app.get('/health/detailed', authenticate, async (req, res) => {
  const health = {
    status: 'healthy',
    timestamp: new Date().toISOString(),
    uptime: process.uptime(),
    memory: process.memoryUsage(),
    checks: {},
  };

  // Database (critical: a failure means 'unhealthy')
  try {
    const start = Date.now();
    await db.query('SELECT 1');
    health.checks.database = { status: 'up', latencyMs: Date.now() - start };
  } catch (e) {
    health.checks.database = { status: 'down', error: e.message };
    health.status = 'unhealthy';
  }

  // Redis (optional: a failure means 'degraded', but must not mask 'unhealthy')
  try {
    const start = Date.now();
    await redis.ping();
    health.checks.redis = { status: 'up', latencyMs: Date.now() - start };
  } catch (e) {
    health.checks.redis = { status: 'down', error: e.message };
    if (health.status === 'healthy') {
      health.status = 'degraded';
    }
  }

  // Queue depth (guarded so a queue error cannot leave the request hanging)
  try {
    const queueDepth = await messageQueue.getDepth();
    health.checks.queue = {
      status: queueDepth < 10000 ? 'healthy' : 'backlogged',
      depth: queueDepth,
    };
  } catch (e) {
    health.checks.queue = { status: 'unknown', error: e.message };
  }

  res.json(health);
});
```
The Mental Model
Think of health checks like a doctor’s visit:
- Liveness: “Is the patient breathing?” (Basic vital signs)
- Readiness: “Can the patient work today?” (Functional capacity)
- Startup: “Has the patient recovered from surgery?” (Post-operation recovery)
You don’t send someone home from the hospital just because they can breathe. And you don’t restart a server just because the database is slow.
Match the check to the question. Liveness for “should this exist?” Readiness for “should this serve traffic?”
Get it right, and your orchestrator becomes a partner. Get it wrong, and it becomes a source of cascading failures.