Cron jobs are the hidden backbone of most systems. They run backups, sync data, send reports, clean up old files. They also fail silently, leaving you wondering why that report hasn’t arrived in three weeks.
Here’s how to build scheduled jobs that actually work.
## The Silent Failure Problem
Classic cron:
```bash
0 2 * * * /usr/local/bin/backup.sh
```
What happens when this fails?
- No notification
- No logging (unless you set it up)
- No way to know it didn’t run
- You find out when you need that backup and it’s not there
## Capture Output
At minimum, capture stdout and stderr:
```bash
0 2 * * * /usr/local/bin/backup.sh >> /var/log/backup.log 2>&1
```
Better — timestamp each line (`ts` comes from the moreutils package):
```bash
0 2 * * * /usr/local/bin/backup.sh 2>&1 | ts '[%Y-%m-%d %H:%M:%S]' >> /var/log/backup.log
```
Best — rotate them too:
```bash
# /etc/logrotate.d/backup
/var/log/backup.log {
    daily
    rotate 14
    compress
    missingok
    notifempty
}
```
## Alert on Failure
Cron can email output via `MAILTO`, but that requires a working mail setup. Instead, alert explicitly:
```bash
#!/bin/bash
set -euo pipefail

if ! /usr/local/bin/backup.sh; then
    curl -X POST "https://hooks.slack.com/..." \
        -d '{"text":"⚠️ Backup failed on server-01"}'
    exit 1
fi
```
Or use a dead man’s switch service (more on that below).
## Use Proper Locking
What if a job takes longer than expected and overlaps with the next run?
```bash
# Bad: two backups can run simultaneously
0 * * * * /usr/local/bin/slow-backup.sh

# Good: lock file prevents overlap
0 * * * * flock -n /tmp/backup.lock /usr/local/bin/slow-backup.sh
```
`flock -n` exits immediately if the lock is held. Use `-w 60` to wait up to 60 seconds instead.
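The lock semantics are easy to verify interactively (a minimal sketch; `/tmp/demo.lock` is a throwaway path):

```shell
# Hold the lock on fd 9 in the current shell...
exec 9>/tmp/demo.lock
flock -n 9 && echo "lock acquired"

# ...then a second flock, like an overlapping cron run, fails fast
flock -n /tmp/demo.lock true || echo "second run would be skipped"

# Closing the fd releases the lock
exec 9>&-
```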
For more control:
```bash
#!/bin/bash
LOCKFILE="/var/run/backup.lock"

if [ -f "$LOCKFILE" ]; then
    pid=$(cat "$LOCKFILE")
    if kill -0 "$pid" 2>/dev/null; then
        echo "Already running (PID $pid), exiting"
        exit 0
    fi
fi

echo $$ > "$LOCKFILE"
trap 'rm -f "$LOCKFILE"' EXIT

# Your actual job here
```
## Dead Man’s Switches
A dead man’s switch alerts you when a job doesn’t run:
- Job pings the service on success
- Service expects ping within time window
- No ping? Alert fires
Popular options:
- Healthchecks.io — Free tier, simple
- Cronitor — More features, monitoring dashboard
- PagerDuty — Enterprise, integrates with incident management
```bash
#!/bin/bash
set -euo pipefail

# Run the job
/usr/local/bin/backup.sh

# Ping on success
curl -fsS --retry 3 https://hc-ping.com/your-uuid-here > /dev/null
```
If the job fails or doesn’t run, no ping — you get alerted.
## Better Alternatives to Cron

### Systemd Timers
More features than cron, better logging:
```ini
# /etc/systemd/system/backup.service
[Unit]
Description=Daily backup

[Service]
Type=oneshot
ExecStart=/usr/local/bin/backup.sh
```
```ini
# /etc/systemd/system/backup.timer
[Unit]
Description=Run backup daily

[Timer]
OnCalendar=*-*-* 02:00:00
Persistent=true

[Install]
WantedBy=timers.target
```
```bash
systemctl enable backup.timer
systemctl start backup.timer

# Check status
systemctl list-timers
journalctl -u backup.service
```
Benefits:
- Logs go to journald automatically
- `Persistent=true` runs missed jobs after a reboot
- Dependencies on other services
- Resource limits (memory, CPU)
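`OnCalendar` also accepts readable shorthands; the forms below are alternative `systemd.time(7)` expressions for a single trigger (a sketch, pick one per timer):

```ini
OnCalendar=daily              # 00:00:00 every day
OnCalendar=*-*-* 02:00:00     # 02:00 every day
OnCalendar=Mon..Fri 09:00     # weekdays at 09:00
```

`systemd-analyze calendar "Mon..Fri 09:00"` prints the normalized form and the next elapse time, which is handy before deploying.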
### Kubernetes CronJobs
For containerized workloads:
```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: backup
spec:
  schedule: "0 2 * * *"
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: backup
            image: backup-tool:latest
            command: ["/backup.sh"]
```
Benefits:
- `concurrencyPolicy: Forbid` prevents overlap
- History retention for debugging
- Native k8s logging and monitoring
- Easy to scale and manage
### Task Queues
For complex scheduling, use a proper task queue:
**Celery (Python):**
```python
from celery import Celery
from celery.schedules import crontab

app = Celery('tasks')

@app.task
def backup():
    # Your backup logic
    pass

app.conf.beat_schedule = {
    'daily-backup': {
        'task': 'tasks.backup',
        'schedule': crontab(hour=2, minute=0),
    },
}
```
**Bull (Node.js):**
```javascript
const Queue = require('bull');
const backupQueue = new Queue('backup');

backupQueue.process(async (job) => {
  await runBackup();
});

backupQueue.add({}, {
  repeat: { cron: '0 2 * * *' }
});
```
Benefits:
- Retry logic built in
- Job progress tracking
- Distributed workers
- Proper failure handling
## Monitoring Patterns

### The Three Signals
Monitor every scheduled job for:
1. Did it run? (Dead man’s switch)
2. Did it succeed? (Exit code, error logs)
3. Did it finish in time? (Duration tracking)
```bash
#!/bin/bash
START=$(date +%s)

if /usr/local/bin/backup.sh; then
    DURATION=$(($(date +%s) - START))

    # Report success + duration
    curl -X POST "https://metrics.example.com/ingest" \
        -d "backup.success=1,duration=${DURATION}"

    # Ping dead man's switch
    curl -fsS https://hc-ping.com/uuid > /dev/null
else
    curl -X POST "https://metrics.example.com/ingest" \
        -d "backup.success=0"

    # Alert on failure (don't ping dead man's switch)
    curl -X POST "https://slack.webhook/..." \
        -d '{"text":"Backup failed!"}'
    exit 1
fi
```
### Dashboards
Track job health over time:
- Success rate (should be ~100%)
- Duration trends (increasing = problem)
- Last successful run (stale = problem)
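The "last successful run" signal can be self-hosted with nothing but a marker file the job touches on success. A minimal sketch, assuming GNU `stat` (BSD `stat` uses different flags); the paths and the ~25-hour default are placeholders:

```shell
# is_stale MARKER [MAX_AGE_SECONDS]: succeeds if the marker file is
# missing or older than the threshold.
is_stale() {
    local marker="$1" max_age="${2:-90000}"   # default ~25 hours
    [ -f "$marker" ] || return 0              # never succeeded: stale
    local age=$(( $(date +%s) - $(stat -c %Y "$marker") ))
    [ "$age" -gt "$max_age" ]
}
```

The job runs `touch /var/run/backup.ok` as its last step; a second cron entry calls `is_stale /var/run/backup.ok` and alerts when it succeeds.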
## Common Mistakes

### 1. Assuming PATH
Cron runs jobs with a minimal `PATH`. Always use full paths:
```bash
# Bad
0 * * * * python /scripts/job.py

# Good
0 * * * * /usr/bin/python3 /scripts/job.py
```
Or set `PATH` explicitly at the top of the crontab:
```bash
PATH=/usr/local/bin:/usr/bin:/bin
0 * * * * python3 /scripts/job.py
```
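When a job works in your shell but not under cron, the environment is usually the culprit. A temporary crontab entry can capture exactly what cron provides (the output path is a placeholder; remove the entry once you have the file):

```bash
* * * * * env > /tmp/cron-env.txt 2>&1
```

Compare the result with `env` from your login shell to spot the missing variables.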
### 2. No Error Handling
```bash
# Bad: continues after failure
backup_database
upload_to_s3

# Good: stops on first failure
set -euo pipefail
backup_database
upload_to_s3
```
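The effect of `set -e` is easy to demonstrate in isolation (a minimal sketch using `false` as the failing step):

```shell
# The inner script stops at `false`; without set -e it would
# print both lines.
output=$(bash -c 'set -euo pipefail
echo "step 1"
false                 # fails; the script exits here with status 1
echo "never reached"' 2>&1 || true)

echo "$output"        # prints only: step 1
```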
### 3. Hardcoded Times Without Considering Timezone
```bash
# Which timezone? Whatever the system clock says
0 2 * * * /backup.sh

# Be explicit (cronie honors CRON_TZ for scheduling;
# plain TZ only sets the job's environment)
CRON_TZ=America/New_York
0 2 * * * /backup.sh
```
### 4. No Timeout
Guard against jobs that hang forever:
```bash
# Good: kill after 1 hour
0 2 * * * timeout 3600 /usr/local/bin/backup.sh
```
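`timeout` exits with status 124 when it had to kill the command, so a wrapper can tell a hung job apart from one that failed on its own (a small sketch; `sleep 10` stands in for a hung job):

```shell
timeout 1 sleep 10     # killed after 1 second
status=$?

if [ "$status" -eq 124 ]; then
    echo "job timed out"
elif [ "$status" -ne 0 ]; then
    echo "job failed (exit $status)"
else
    echo "job succeeded"
fi
```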
## The Reliable Cron Checklist

- Output captured (stdout and stderr) and logs rotated
- Alert fires on failure
- Dead man’s switch pings on success
- Locking prevents overlapping runs
- Full paths, or `PATH` set explicitly
- `set -euo pipefail` in every script
- Timezone explicit
- `timeout` bounds every job
- Success rate, duration, and last run on a dashboard

Your cron jobs run when you’re not looking. Make sure they tell you when something’s wrong.

The best cron job is one you forget exists — because it just works, and you’d definitely hear about it if it didn’t.