Cron jobs are the hidden backbone of most systems. They run backups, sync data, send reports, clean up old files. They also fail silently, leaving you wondering why that report hasn’t arrived in three weeks.

Here’s how to build scheduled jobs that actually work.

The Silent Failure Problem

Classic cron:

0 2 * * * /usr/local/bin/backup.sh

What happens when this fails?

  • No notification
  • No logging (unless you set it up)
  • No way to know it didn’t run
  • You find out when you need that backup and it’s not there

Capture Output

At minimum, capture stdout and stderr:

0 2 * * * /usr/local/bin/backup.sh >> /var/log/backup.log 2>&1

Better — timestamp each line (ts comes from the moreutils package):

0 2 * * * /usr/local/bin/backup.sh 2>&1 | ts '[%Y-%m-%d %H:%M:%S]' >> /var/log/backup.log

Best — rotate them too:

# /etc/logrotate.d/backup
/var/log/backup.log {
    daily
    rotate 14
    compress
    missingok
    notifempty
}

Alert on Failure

Cron emails any job output to the local user, but that only works with a configured mail setup, which most servers don't have. Instead, alert explicitly:

#!/bin/bash
set -euo pipefail

if ! /usr/local/bin/backup.sh; then
    curl -X POST -H 'Content-type: application/json' \
        "https://hooks.slack.com/..." \
        -d '{"text":"⚠️ Backup failed on server-01"}'
    exit 1
fi

Or use a dead man’s switch service (more on that below).

Use Proper Locking

What if a job takes longer than expected and overlaps with the next run?

# Bad: Two backups running simultaneously
0 * * * * /usr/local/bin/slow-backup.sh

# Good: Lock file prevents overlap
0 * * * * flock -n /tmp/backup.lock /usr/local/bin/slow-backup.sh

flock -n exits immediately if the lock is already held. Use flock -w 60 to wait up to 60 seconds for it instead.
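One wrinkle: flock -n itself exits with 1 when the lock is held, which looks identical to the job failing with exit 1. util-linux flock supports -E (--conflict-exit-code) to tell the two apart; a sketch (the lock path and echo messages are illustrative):

```shell
#!/bin/bash
# -E 99 makes "lock already held" exit with 99 instead of 1,
# so monitoring can tell a skipped run apart from a failed one
rc=0; flock -n -E 99 /tmp/backup.lock /usr/local/bin/slow-backup.sh || rc=$?
case "$rc" in
    0)  echo "backup ok" ;;
    99) echo "skipped: previous run still active" ;;
    *)  echo "backup failed (exit $rc)" ;;
esac
```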

For more control:

#!/bin/bash
LOCKFILE="/var/run/backup.lock"

# Respect an existing lockfile only if its PID is still alive
# (a stale lockfile from a crashed run is ignored)
if [ -f "$LOCKFILE" ]; then
    pid=$(cat "$LOCKFILE")
    if kill -0 "$pid" 2>/dev/null; then
        echo "Already running (PID $pid), exiting"
        exit 0
    fi
fi

echo $$ > "$LOCKFILE"
trap 'rm -f "$LOCKFILE"' EXIT

# Your actual job here
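
The PID check above has a small race: two instances can both pass the check before either writes its PID. If you want an atomic primitive without flock, mkdir works, since directory creation succeeds or fails as a single step. A sketch (the lock directory path is illustrative):

```shell
#!/bin/bash
LOCKDIR="/tmp/backup.lock.d"

# mkdir is atomic: exactly one concurrent caller can succeed
if ! mkdir "$LOCKDIR" 2>/dev/null; then
    echo "Already running, exiting"
    exit 0
fi
trap 'rmdir "$LOCKDIR"' EXIT

# Your actual job here
```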

Dead Man’s Switches

A dead man’s switch alerts you when a job doesn’t run:

  1. Job pings the service on success
  2. Service expects ping within time window
  3. No ping? Alert fires

Popular options:

  • Healthchecks.io — Free tier, simple
  • Cronitor — More features, monitoring dashboard
  • PagerDuty — Enterprise, integrates with incident management

With Healthchecks.io, the integration is a single ping at the end of the job:

#!/bin/bash
set -euo pipefail

# Run the job
/usr/local/bin/backup.sh

# Ping on success
curl -fsS --retry 3 https://hc-ping.com/your-uuid-here > /dev/null

If the job fails or doesn’t run, no ping — you get alerted.
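
Healthchecks.io's ping URL also accepts /start and /fail suffixes, which adds duration tracking and immediate failure alerts on the same check. A sketch that wraps any job (UUID placeholder as above):

```shell
#!/bin/bash
HC_URL="https://hc-ping.com/your-uuid-here"   # same check UUID as before

run_with_pings() {
    # Signal start (lets the service measure duration)
    curl -fsS --retry 3 "$HC_URL/start" > /dev/null || true
    if "$@"; then
        curl -fsS --retry 3 "$HC_URL" > /dev/null || true       # success ping
    else
        curl -fsS --retry 3 "$HC_URL/fail" > /dev/null || true  # explicit failure, alert fires immediately
        return 1
    fi
}
```

Then call run_with_pings /usr/local/bin/backup.sh from your cron wrapper.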

Better Alternatives to Cron

Systemd Timers

More features than cron, better logging:

# /etc/systemd/system/backup.service
[Unit]
Description=Daily backup

[Service]
Type=oneshot
ExecStart=/usr/local/bin/backup.sh
# /etc/systemd/system/backup.timer
[Unit]
Description=Run backup daily

[Timer]
OnCalendar=*-*-* 02:00:00
Persistent=true

[Install]
WantedBy=timers.target

Enable it and verify:

systemctl enable backup.timer
systemctl start backup.timer

# Check status
systemctl list-timers
journalctl -u backup.service

Benefits:

  • Logs go to journald automatically
  • Persistent=true runs missed jobs after reboot
  • Dependencies with other services
  • Resource limits (memory, CPU)
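
OnCalendar expressions are easy to get subtly wrong. systemd-analyze can validate one and preview the upcoming run times before you enable the timer:

```shell
# Validate an OnCalendar expression and show when it will next fire
systemd-analyze calendar "*-*-* 02:00:00"

# Preview the next few occurrences (supported since systemd 242)
systemd-analyze calendar --iterations=3 "Mon..Fri 02:00"
```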

Kubernetes CronJobs

For containerized workloads:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: backup
spec:
  schedule: "0 2 * * *"
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: backup
            image: backup-tool:latest
            command: ["/backup.sh"]

Benefits:

  • concurrencyPolicy: Forbid prevents overlap
  • History retention for debugging
  • Native k8s logging and monitoring
  • Easy to scale and manage

Task Queues

For complex scheduling, use a proper task queue:

Celery (Python):

from celery import Celery
from celery.schedules import crontab

app = Celery('tasks')

@app.task
def backup():
    # Your backup logic
    pass

app.conf.beat_schedule = {
    'daily-backup': {
        'task': 'tasks.backup',
        'schedule': crontab(hour=2, minute=0),
    },
}
# Note: schedules are dispatched by a separate "celery beat" process

Bull (Node.js):

const Queue = require('bull');
const backupQueue = new Queue('backup');  // backed by Redis (localhost:6379 by default)

backupQueue.process(async (job) => {
    await runBackup();
});

backupQueue.add({}, {
    repeat: { cron: '0 2 * * *' }
});

Benefits:

  • Retry logic built in
  • Job progress tracking
  • Distributed workers
  • Proper failure handling

Monitoring Patterns

The Three Signals

Monitor every scheduled job for:

  1. Did it run? (Dead man’s switch)
  2. Did it succeed? (Exit code, error logs)
  3. Did it finish in time? (Duration tracking)

A wrapper script that reports all three:

#!/bin/bash
START=$(date +%s)

if /usr/local/bin/backup.sh; then
    DURATION=$(($(date +%s) - START))
    
    # Report success + duration
    curl -X POST "https://metrics.example.com/ingest" \
        -d "backup.success=1,duration=${DURATION}"
    
    # Ping dead man's switch
    curl -fsS https://hc-ping.com/uuid > /dev/null
else
    curl -X POST "https://metrics.example.com/ingest" \
        -d "backup.success=0"
    
    # Alert on failure (don't ping dead man's switch)
    curl -X POST "https://slack.webhook/..." \
        -d '{"text":"Backup failed!"}'
    exit 1
fi

Dashboards

Track job health over time:

  • Success rate (should be ~100%)
  • Duration trends (increasing = problem)
  • Last successful run (stale = problem)

Common Mistakes

1. Assuming PATH

Cron has minimal PATH. Always use full paths:

# Bad
0 * * * * python /scripts/job.py

# Good  
0 * * * * /usr/bin/python3 /scripts/job.py

Or set PATH explicitly:

PATH=/usr/local/bin:/usr/bin:/bin
0 * * * * python3 /scripts/job.py
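
You can also reproduce cron's sparse environment locally before scheduling anything. env -i starts from an empty environment, so the command sees only what you pass in (cron typically sets just HOME, LOGNAME, SHELL, and a minimal PATH):

```shell
# Run a command the way cron would: empty environment plus cron's defaults
env -i HOME="$HOME" SHELL=/bin/sh PATH=/usr/bin:/bin \
    /bin/sh -c 'echo "PATH is: $PATH"'
# → PATH is: /usr/bin:/bin
```

If your script works under this, it will survive cron's environment.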

2. No Error Handling

# Bad: Continues after failure
backup_database
upload_to_s3

# Good: Stops on first failure
set -euo pipefail
backup_database
upload_to_s3
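
A quick demonstration of the difference, runnable in any shell:

```shell
# Without set -e, execution continues past the failure
/bin/sh -c 'false; echo "reached"'
# → reached

# With set -e, the script stops at the first failing command
rc=0; /bin/sh -c 'set -e; false; echo "reached"' || rc=$?
echo "exit: $rc"
# → exit: 1
```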

3. Hardcoded Times Without Considering Timezone

# Which timezone? (Whatever the server's local time is, often UTC)
0 2 * * * /backup.sh

# Be explicit: cronie-based crons support CRON_TZ for scheduling
CRON_TZ=America/New_York
0 2 * * * /backup.sh

4. No Timeout

Jobs that hang forever:

# Good: Kill after 1 hour
0 2 * * * timeout 3600 /usr/local/bin/backup.sh
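
Note that timeout sends SIGTERM by default, which a truly wedged process can ignore. GNU coreutils timeout takes -k to escalate to SIGKILL, and exits with status 124 when the limit was hit, which is easy to spot in logs:

```shell
# In the crontab, escalate to SIGKILL if the job ignores SIGTERM:
#   0 2 * * * timeout -k 300 3600 /usr/local/bin/backup.sh

# timeout exits with status 124 when the limit was hit:
rc=0; timeout 1 sleep 10 || rc=$?
echo "exit: $rc"
# → exit: 124
```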

The Reliable Cron Checklist

  • Output captured to logs
  • Logs rotated
  • Lock file prevents overlap
  • Dead man’s switch monitors execution
  • Alerts on failure
  • Duration tracked
  • Timeout set
  • Full paths used
  • Error handling (set -e)
  • Tested manually before scheduling

Your cron jobs run when you’re not looking. Make sure they tell you when something’s wrong.


The best cron job is one you forget exists — because it just works, and you’d definitely hear about it if it didn’t.