Your homelab is running smoothly—until it isn’t. Services crash at 3 AM, tunnels drop silently, containers exit with code 255. You wake up to discover your dashboard has been down for two days.
The fix isn’t more monitoring dashboards. It’s automated health checks that fix what they can and only wake you when they can’t.
## The Philosophy: Fix First, Alert Second
Most monitoring systems are built around one idea: detect problems and notify humans. But for home infrastructure, this creates alert fatigue. Every transient failure becomes a notification.
A better approach:
- **Detect** - Find the problem
- **Auto-fix** - Attempt automated remediation
- **Verify** - Check if the fix worked
- **Alert** - Only notify if auto-fix failed
This means your 3 AM container crash gets silently restarted, and you only hear about it if something is genuinely broken.
## Building the Health Check Script
Here’s a battle-tested health check script structure:
```bash
#!/bin/bash
set -euo pipefail

LOG_FILE="/var/log/health-check.log"
ALERT_SCRIPT="/usr/local/bin/send-alert.sh"

log() {
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" | tee -a "$LOG_FILE"
}

alert() {
    log "ALERT: $1"
    "$ALERT_SCRIPT" "$1" 2>/dev/null || true
}

# Track failures for end-of-run summary
FAILURES=()

check_and_fix() {
    local name="$1"
    local check_cmd="$2"
    local fix_cmd="$3"

    if eval "$check_cmd" >/dev/null 2>&1; then
        return 0
    fi

    log "FAILED: $name - attempting fix..."
    if eval "$fix_cmd" >/dev/null 2>&1; then
        sleep 5  # Give the service time to start
        if eval "$check_cmd" >/dev/null 2>&1; then
            log "FIXED: $name"
            return 0
        fi
    fi

    FAILURES+=("$name")
    return 1
}
```
## Checking External Endpoints
For services exposed to the internet, verify they’re actually reachable:
```bash
check_endpoint() {
    local name="$1"
    local url="$2"
    local expected="${3:-200}"
    local status

    status=$(curl -s -o /dev/null -w "%{http_code}" \
        --connect-timeout 10 \
        --max-time 30 \
        "$url") || status="000"

    if [[ "$status" == "$expected" ]]; then
        return 0
    fi

    FAILURES+=("$name (HTTP $status)")
    return 1
}

# Usage
check_endpoint "Dashboard" "https://dashboard.example.com" "200"
check_endpoint "API" "https://api.example.com/health" "200"
```
## Docker Container Health
Containers crash. The fix is usually just restarting them:
```bash
check_container() {
    local name="$1"
    local state

    state=$(docker inspect -f '{{.State.Status}}' "$name" 2>/dev/null) || state="missing"

    if [[ "$state" == "running" ]]; then
        return 0
    fi

    log "Container $name is $state - restarting..."
    docker start "$name" 2>/dev/null || docker compose up -d "$name" 2>/dev/null || {
        FAILURES+=("container:$name")
        return 1
    }

    sleep 10
    state=$(docker inspect -f '{{.State.Status}}' "$name" 2>/dev/null) || state="missing"
    if [[ "$state" == "running" ]]; then
        log "FIXED: Container $name restarted"
        return 0
    fi

    FAILURES+=("container:$name (restart failed)")
    return 1
}

# Usage
check_container "nginx"
check_container "postgres"
check_container "redis"
```
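One gap worth noting: if your images define a `HEALTHCHECK`, a container can be `running` but unhealthy, and the check above will wave it through. A variant that also catches that case might look like this — a sketch, not the article's script; `check_container_health` is a name of my choosing, and it reuses the `log` and `FAILURES` helpers from earlier:

```shell
check_container_health() {
    local name="$1"
    local health

    # Prints the health status if the image defines a HEALTHCHECK, else "none"
    health=$(docker inspect -f \
        '{{if .State.Health}}{{.State.Health.Status}}{{else}}none{{end}}' \
        "$name" 2>/dev/null) || return 1

    if [[ "$health" != "unhealthy" ]]; then
        return 0
    fi

    log "Container $name is unhealthy - restarting..."
    if docker restart "$name" >/dev/null 2>&1; then
        log "FIXED: Container $name restarted"
        return 0
    fi

    FAILURES+=("container:$name (unhealthy)")
    return 1
}
```

Containers without a `HEALTHCHECK` report `none` and pass through untouched, so it's safe to run against everything.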
## Tunnel Health (Cloudflare/SSH)
Tunnels are notorious for silent failures:
```bash
check_cloudflared() {
    if systemctl is-active --quiet cloudflared; then
        return 0
    fi

    log "cloudflared is down - restarting..."
    sudo systemctl restart cloudflared
    sleep 5

    if systemctl is-active --quiet cloudflared; then
        log "FIXED: cloudflared restarted"
        return 0
    fi

    FAILURES+=("cloudflared")
    return 1
}

check_ssh_tunnel() {
    local name="$1"
    local tmux_session="$2"

    if tmux has-session -t "$tmux_session" 2>/dev/null; then
        return 0
    fi

    FAILURES+=("tunnel:$name")
    return 1
}
```
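The SSH check above only detects a dead tunnel; restarting one is usually just relaunching it in a detached session. A sketch of an auto-fix variant — `check_ssh_tunnel_fix` is a name of my choosing, the session name and tunnel command are placeholders, and it reuses `log` and `FAILURES` from earlier:

```shell
check_ssh_tunnel_fix() {
    local name="$1"
    local tmux_session="$2"
    local tunnel_cmd="$3"  # e.g. "ssh -N -R 8080:localhost:80 user@vps.example.com"

    if tmux has-session -t "$tmux_session" 2>/dev/null; then
        return 0
    fi

    log "Tunnel $name is down - relaunching..."
    tmux new-session -d -s "$tmux_session" "$tunnel_cmd"
    sleep 5

    if tmux has-session -t "$tmux_session" 2>/dev/null; then
        log "FIXED: Tunnel $name relaunched"
        return 0
    fi

    FAILURES+=("tunnel:$name")
    return 1
}
```

For anything long-lived, `autossh` with `-M 0` plus `ServerAliveInterval`/`ServerAliveCountMax` tends to be sturdier than restarting plain `ssh` yourself.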
## System Resource Checks
Disk space and memory are the silent killers:
```bash
check_disk_space() {
    local threshold="${1:-80}"
    local usage

    usage=$(df / | tail -1 | awk '{print $5}' | tr -d '%')

    if [[ "$usage" -lt "$threshold" ]]; then
        return 0
    fi

    FAILURES+=("disk:${usage}%")
    return 1
}

check_memory() {
    local threshold="${1:-20}"  # Alert if less than 20% free
    local available

    available=$(free | grep Mem | awk '{printf "%.0f", $7/$2 * 100}')

    if [[ "$available" -gt "$threshold" ]]; then
        return 0
    fi

    FAILURES+=("memory:${available}% free")
    return 1
}
```
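Disk checks can get an auto-fix step too: try to reclaim space before alerting. A sketch under my own assumptions — `reclaim_disk` is a hypothetical helper, and the cleanup commands and retention windows should be adjusted to your setup:

```shell
reclaim_disk() {
    # Prune stopped containers and unused images older than a week
    if command -v docker >/dev/null 2>&1; then
        docker system prune -af --filter "until=168h" >/dev/null 2>&1 || true
    fi

    # Shrink the systemd journal to the last 7 days
    if command -v journalctl >/dev/null 2>&1; then
        journalctl --vacuum-time=7d >/dev/null 2>&1 || true
    fi

    # Drop rotated logs older than two weeks
    find /var/log -name "*.gz" -mtime +14 -delete 2>/dev/null || true
}
```

Wire it into the fix-then-verify pattern: only if `check_disk_space` still fails after `reclaim_disk` does the alert fire.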
## Putting It Together
```bash
#!/bin/bash
set -euo pipefail
# ... (include functions from above)

log "=== Health check starting ==="

# Each check records its own failure in FAILURES; the "|| true" keeps
# a failing check from aborting the whole run under set -e.

# External endpoints
check_endpoint "Dashboard" "https://dashboard.example.com" || true
check_endpoint "API" "https://api.example.com/health" || true

# Critical services
check_cloudflared || true
check_container "nginx" || true
check_container "postgres" || true
check_container "app" || true

# System resources
check_disk_space 80 || true
check_memory 20 || true

# Summary
if [[ ${#FAILURES[@]} -eq 0 ]]; then
    log "=== All checks passed ==="
else
    alert "Health check failures: ${FAILURES[*]}"
fi

exit 0
```
## Scheduling with Cron
Run hourly—frequent enough to catch issues, rare enough to avoid noise:
```bash
# /etc/cron.d/health-check
0 * * * * root /usr/local/bin/health-check.sh >> /var/log/health-check.log 2>&1
```
Or if you’re using an agent with cron capabilities, have it run the check and notify your preferred channel.
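One caveat: that cron line appends to `/var/log/health-check.log` forever. A logrotate drop-in keeps it bounded — a sketch; the path and retention are up to you:

```
# /etc/logrotate.d/health-check
/var/log/health-check.log {
    weekly
    rotate 4
    compress
    missingok
    notifempty
}
```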
## The Alert Function
Keep alerts focused. You want to know what failed, not get a wall of logs:
```bash
send_alert() {
    local message="$1"

    # Telegram example
    curl -s -X POST "https://api.telegram.org/bot${BOT_TOKEN}/sendMessage" \
        -d "chat_id=${CHAT_ID}" \
        -d "text=🔴 Health Check: ${message}" \
        >/dev/null

    # Or Slack, Discord, email, whatever works
}
```
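Even failed fixes can flap: a check that keeps failing re-sends the identical alert every hour. A small cooldown wrapper suppresses repeats — a sketch; `alert_with_cooldown`, the state directory, and the six-hour window are my choices, and it assumes a `send_alert` like the one above:

```shell
ALERT_STATE_DIR="${ALERT_STATE_DIR:-/var/tmp/health-alerts}"
ALERT_COOLDOWN="${ALERT_COOLDOWN:-21600}"  # seconds between repeat alerts (6h)

alert_with_cooldown() {
    local key="$1"
    local message="$2"
    local stamp="$ALERT_STATE_DIR/$key"
    local now last=0

    mkdir -p "$ALERT_STATE_DIR"
    now=$(date +%s)
    if [[ -f "$stamp" ]]; then
        last=$(cat "$stamp")
    fi

    if (( now - last < ALERT_COOLDOWN )); then
        return 0  # Same failure alerted recently - stay quiet
    fi

    echo "$now" > "$stamp"
    send_alert "$message"
}
```

Keying on the check name means a new, different failure still alerts immediately while a known one stays quiet until the cooldown lapses.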
## Real-World Lessons
**Fix silently, log everything.** Your logs are for debugging later. Your alerts are for action now.
**Give services time to restart.** A `sleep 5` after restarting a service prevents false positives.
**Test the fixes.** After attempting auto-remediation, verify it actually worked before marking it fixed.
**Fail gracefully.** If curl times out, if Docker isn't running, if the disk check syntax fails - handle it. `set -e` is good, but catching specific failures is better.
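Under `set -e`, a check function that returns non-zero at top level kills the script before the summary ever runs. A tiny wrapper keeps failures contained — `run_check` is a name of my choosing:

```shell
run_check() {
    # Swallow the check's exit status so set -e cannot abort the run;
    # the check has already recorded its own failure in FAILURES.
    "$@" || true
}

# Usage: run_check check_disk_space 80
```

Wrapping every call this way is tidier than sprinkling `|| true` through the script, and it gives you one place to add per-check logging later.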
**Keep the alert concise.** "Dashboard down, nginx restart failed, disk at 92%" is actionable. A stack trace is not.
The goal isn’t perfect monitoring. It’s fewer 3 AM wake-ups because your infrastructure fixes itself.