Your homelab is running smoothly—until it isn’t. Services crash at 3 AM, tunnels drop silently, containers exit with code 255. You wake up to discover your dashboard has been down for two days.

The fix isn’t more monitoring dashboards. It’s automated health checks that fix what they can and only wake you when they can’t.

The Philosophy: Fix First, Alert Second

Most monitoring systems are built around one idea: detect problems and notify humans. But for home infrastructure, this creates alert fatigue. Every transient failure becomes a notification.

A better approach:

  1. Detect - Find the problem
  2. Auto-fix - Attempt automated remediation
  3. Verify - Check if the fix worked
  4. Alert - Only notify if auto-fix failed

This means your 3 AM container crash gets silently restarted, and you only hear about it if something is genuinely broken.

Building the Health Check Script

Here’s a battle-tested health check script structure:

#!/bin/bash
set -euo pipefail

LOG_FILE="/var/log/health-check.log"
ALERT_SCRIPT="/usr/local/bin/send-alert.sh"

log() {
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" | tee -a "$LOG_FILE"
}

alert() {
    log "ALERT: $1"
    "$ALERT_SCRIPT" "$1" 2>/dev/null || true
}

# Track failures for end-of-run summary
FAILURES=()

check_and_fix() {
    local name="$1"
    local check_cmd="$2"
    local fix_cmd="$3"
    
    if eval "$check_cmd" >/dev/null 2>&1; then
        return 0
    fi
    
    log "FAILED: $name - attempting fix..."
    
    if eval "$fix_cmd" >/dev/null 2>&1; then
        sleep 5  # Give service time to start
        if eval "$check_cmd" >/dev/null 2>&1; then
            log "FIXED: $name"
            return 0
        fi
    fi
    
    FAILURES+=("$name")
    return 1
}
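To see the detect → fix → verify flow without touching real services, the wrapper can be exercised standalone. This condensed sketch stubs out log() and FAILURES (which come from the boilerplate above in the real script) and uses a flag file in place of a crashed service; every path here is illustrative:

```shell
#!/bin/bash
# Standalone demo of the check/fix/verify flow. log() and FAILURES
# are stubbed here; in the real script they come from the
# boilerplate above. The sleep between fix and verify is omitted
# since nothing real is starting up.
log() { echo "$1"; }
FAILURES=()

check_and_fix() {
    local name="$1" check_cmd="$2" fix_cmd="$3"
    if eval "$check_cmd" >/dev/null 2>&1; then return 0; fi
    log "FAILED: $name - attempting fix..."
    if eval "$fix_cmd" >/dev/null 2>&1 && eval "$check_cmd" >/dev/null 2>&1; then
        log "FIXED: $name"
        return 0
    fi
    FAILURES+=("$name")
    return 1
}

# A missing flag file stands in for a down service; "touch" is the fix
flag=$(mktemp -u)
check_and_fix "demo-service" "test -f $flag" "touch $flag" || true
rm -f "$flag"
```

Running it logs the FAILED and FIXED lines and leaves FAILURES empty, because the "fix" passed the re-check.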

Checking External Endpoints

For services exposed to the internet, verify they’re actually reachable:

check_endpoint() {
    local name="$1"
    local url="$2"
    local expected="${3:-200}"
    
    local status
    status=$(curl -s -o /dev/null -w "%{http_code}" \
        --connect-timeout 10 \
        --max-time 30 \
        "$url") || status="000"
    
    if [[ "$status" == "$expected" ]]; then
        return 0
    fi
    
    FAILURES+=("$name (HTTP $status)")
    return 1
}

# Usage
check_endpoint "Dashboard" "https://dashboard.example.com" "200"
check_endpoint "API" "https://api.example.com/health" "200"
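Transient blips (a DNS hiccup, a brief upstream restart) are the main source of false alarms for endpoint checks. A small retry wrapper declares failure only after several attempts; the attempt count and delay below are judgment calls, not prescriptions:

```shell
# Retry wrapper: run a check up to N times before treating it as a
# real failure. Attempt count and delay are examples -- tune them.
retry() {
    local attempts="$1"
    shift
    local i
    for ((i = 1; i <= attempts; i++)); do
        if "$@"; then
            return 0
        fi
        if (( i < attempts )); then
            sleep 2   # brief pause between attempts
        fi
    done
    return 1
}

# Usage:
# retry 3 check_endpoint "API" "https://api.example.com/health"
```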

Docker Container Health

Containers crash. The fix is usually just restarting them:

check_container() {
    local name="$1"
    
    local state
    state=$(docker inspect -f '{{.State.Status}}' "$name" 2>/dev/null) || state="missing"
    
    if [[ "$state" == "running" ]]; then
        return 0
    fi
    
    log "Container $name is $state - restarting..."
    
    docker start "$name" 2>/dev/null || docker compose up -d "$name" 2>/dev/null || {
        FAILURES+=("container:$name")
        return 1
    }
    
    sleep 10
    state=$(docker inspect -f '{{.State.Status}}' "$name" 2>/dev/null) || state="missing"
    
    if [[ "$state" == "running" ]]; then
        log "FIXED: Container $name restarted"
        return 0
    fi
    
    FAILURES+=("container:$name (restart failed)")
    return 1
}

# Usage
check_container "nginx"
check_container "postgres"
check_container "redis"

Tunnel Health (Cloudflare/SSH)

Tunnels are notorious for silent failures:

check_cloudflared() {
    if systemctl is-active --quiet cloudflared; then
        return 0
    fi
    
    log "cloudflared is down - restarting..."
    sudo systemctl restart cloudflared
    sleep 5
    
    if systemctl is-active --quiet cloudflared; then
        log "FIXED: cloudflared restarted"
        return 0
    fi
    
    FAILURES+=("cloudflared")
    return 1
}

check_ssh_tunnel() {
    local name="$1"
    local tmux_session="$2"
    
    if tmux has-session -t "$tmux_session" 2>/dev/null; then
        return 0
    fi
    
    FAILURES+=("tunnel:$name")
    return 1
}
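A live tmux session doesn't prove the tunnel is actually passing traffic. A stricter check probes the forwarded port directly; this sketch uses bash's /dev/tcp redirection, and the host and port in the usage line are placeholders for your own forward:

```shell
# Stricter tunnel check: verify the forwarded port answers, not just
# that a session exists. FAILURES is stubbed here for standalone use;
# in the real script it comes from the boilerplate above.
FAILURES=()

check_tunnel_port() {
    local name="$1"
    local host="$2"
    local port="$3"
    # /dev/tcp/<host>/<port> is a bash redirection target; the
    # subshell opened by `bash -c` closes the socket on exit
    if timeout 5 bash -c ">/dev/tcp/${host}/${port}" 2>/dev/null; then
        return 0
    fi
    FAILURES+=("tunnel:$name (port $port unreachable)")
    return 1
}

# Usage: probe the local end of a forward (illustrative host/port)
# check_tunnel_port "vps-tunnel" "127.0.0.1" "8080"
```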

System Resource Checks

Disk space and memory are the silent killers:

check_disk_space() {
    local threshold="${1:-80}"
    
    local usage
    usage=$(df / | tail -1 | awk '{print $5}' | tr -d '%')
    
    if [[ "$usage" -lt "$threshold" ]]; then
        return 0
    fi
    
    FAILURES+=("disk:${usage}%")
    return 1
}

check_memory() {
    local threshold="${1:-20}"  # Alert if less than 20% free
    
    local available
    available=$(free | awk '/^Mem:/ {printf "%.0f", $7/$2 * 100}')
    
    if [[ "$available" -gt "$threshold" ]]; then
        return 0
    fi
    
    FAILURES+=("memory:${available}% free")
    return 1
}
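The disk check can get a remediation step too. Before alerting, it's worth trying to reclaim space from the usual suspects; this sketch assumes systemd-journald, Docker, and apt are in use, and every command is best-effort so a missing tool is a harmless no-op:

```shell
# Hypothetical auto-fix for a full disk: reclaim the usual suspects
# before alerting. Every command is best-effort (`|| true`) so a
# missing tool doesn't abort the run under `set -e`.
reclaim_disk() {
    # Cap the systemd journal at 200 MB (tune to taste)
    journalctl --vacuum-size=200M >/dev/null 2>&1 || true
    # Drop stopped containers, dangling images, unused networks
    docker system prune -f >/dev/null 2>&1 || true
    # Clear the apt package cache on Debian-family hosts
    apt-get clean >/dev/null 2>&1 || true
}
```

To wire it in, call reclaim_disk when check_disk_space crosses the threshold, then re-read df before appending to FAILURES.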

Putting It Together

#!/bin/bash
set -euo pipefail

# ... (include functions from above)

log "=== Health check starting ==="

# Each check appends to FAILURES on error; the `|| true` keeps
# `set -e` from aborting the run before the summary is reached.

# External endpoints
check_endpoint "Dashboard" "https://dashboard.example.com" || true
check_endpoint "API" "https://api.example.com/health" || true

# Critical services
check_cloudflared || true
check_container "nginx" || true
check_container "postgres" || true
check_container "app" || true

# System resources
check_disk_space 80 || true
check_memory 20 || true

# Summary
if [[ ${#FAILURES[@]} -eq 0 ]]; then
    log "=== All checks passed ==="
else
    alert "Health check failures: ${FAILURES[*]}"
fi

exit 0

Scheduling with Cron

Run hourly—frequent enough to catch issues, rare enough to avoid noise:

# /etc/cron.d/health-check
0 * * * * root /usr/local/bin/health-check.sh >> /var/log/health-check.log 2>&1

Or if you’re using an agent with cron capabilities, have it run the check and notify your preferred channel.
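One caveat with the hourly schedule: if a run hangs (say, curl waiting on a dead tunnel), the next invocation starts on top of it. Wrapping the job in flock makes overlapping runs skip instead of stack; the lock file path is arbitrary:

```
# /etc/cron.d/health-check -- flock -n skips this run if the previous
# one still holds the lock
0 * * * * root flock -n /run/health-check.lock /usr/local/bin/health-check.sh >> /var/log/health-check.log 2>&1
```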

The Alert Function

Keep alerts focused. You want to know what failed, not get a wall of logs:

send_alert() {
    local message="$1"
    
    # Telegram example
    curl -s -X POST "https://api.telegram.org/bot${BOT_TOKEN}/sendMessage" \
        -d "chat_id=${CHAT_ID}" \
        -d "text=🔴 Health Check: ${message}" \
        >/dev/null
    
    # Or Slack, Discord, email, whatever works
}
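One refinement worth considering: if the same failure persists, an hourly cron means an identical alert every hour. A small state file can suppress repeats; the state directory and the 60-minute window here are arbitrary choices:

```shell
# Suppress duplicate alerts: skip sending if the same message went
# out within the last 60 minutes. Path and window are examples.
STATE_DIR="${STATE_DIR:-/var/tmp/health-alerts}"

should_alert() {
    local message="$1"
    mkdir -p "$STATE_DIR"
    # Hash the message so it can serve as a filename
    local key
    key=$(printf '%s' "$message" | md5sum | awk '{print $1}')
    local stamp="$STATE_DIR/$key"
    # Sent less than 60 minutes ago? Stay quiet.
    if [[ -n $(find "$stamp" -mmin -60 2>/dev/null) ]]; then
        return 1
    fi
    touch "$stamp"
    return 0
}

# Usage: should_alert "$message" && send_alert "$message"
```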

Real-World Lessons

Fix silently, log everything. Your logs are for debugging later. Your alerts are for action now.

Give services time to restart. A sleep 5 after restarting a service prevents false positives.

Test the fixes. After attempting auto-remediation, verify it actually worked before marking it fixed.

Fail gracefully. If curl times out, if docker isn’t running, if the disk check syntax fails—handle it. set -e is good, but catching specific failures is better.

Keep the alert concise. “Dashboard down, nginx restart failed, disk at 92%” is actionable. A stack trace is not.

The goal isn’t perfect monitoring. It’s fewer 3 AM wake-ups because your infrastructure fixes itself.