Cron jobs are the hidden backbone of most systems. They run backups, sync data, send reports, clean up old files. They also fail silently, leaving you wondering why that report hasn’t arrived in three weeks.

Here’s how to build scheduled jobs that actually work.

The Silent Failure Problem

Classic cron:

0 2 * * * /usr/local/bin/backup.sh

What happens when this fails?

  • No notification
  • No logging (unless you set it up)
  • No way to know it didn’t run
  • You find out when you need that backup and it’s not there

Capture Output

At minimum, capture stdout and stderr:

0 2 * * * /usr/local/bin/backup.sh >> /var/log/backup.log 2>&1

Better — timestamp each line (ts comes from the moreutils package):

0 2 * * * /usr/local/bin/backup.sh 2>&1 | ts '[%Y-%m-%d %H:%M:%S]' >> /var/log/backup.log

Best — rotate them too:

# /etc/logrotate.d/backup
/var/log/backup.log {
    daily
    rotate 14
    compress
    missingok
    notifempty
}

Alert on Failure

Cron emails any job output to the local user, but that only works with a configured mail setup, which most servers don't have. Instead, alert explicitly:

#!/bin/bash
set -euo pipefail

if ! /usr/local/bin/backup.sh; then
    curl -X POST -H 'Content-type: application/json' \
        "https://hooks.slack.com/..." \
        -d '{"text":"⚠️ Backup failed on server-01"}'
    exit 1
fi

Or use a dead man’s switch service (more on that below).

Use Proper Locking

What if a job takes longer than expected and overlaps with the next run?

# Bad: Two backups running simultaneously
0 * * * * /usr/local/bin/slow-backup.sh

# Good: Lock file prevents overlap
0 * * * * flock -n /tmp/backup.lock /usr/local/bin/slow-backup.sh

flock -n exits immediately if the lock is already held. Use flock -w 60 to wait up to 60 seconds for it instead.
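One wrinkle: flock -n itself exits with 1 when the lock is held, which looks identical to the job failing with exit 1. util-linux flock supports -E (--conflict-exit-code) to tell the two apart; a sketch (the lock path and echo messages are illustrative):

```shell
#!/bin/bash
# -E 99 makes "lock already held" exit with 99 instead of 1,
# so monitoring can tell a skipped run apart from a failed one
rc=0; flock -n -E 99 /tmp/backup.lock /usr/local/bin/slow-backup.sh || rc=$?
case "$rc" in
    0)  echo "backup ok" ;;
    99) echo "skipped: previous run still active" ;;
    *)  echo "backup failed (exit $rc)" ;;
esac
```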

For more control:

#!/bin/bash
LOCKFILE="/var/run/backup.lock"

# Respect an existing lockfile only if its PID is still alive
# (a stale lockfile from a crashed run is ignored)
if [ -f "$LOCKFILE" ]; then
    pid=$(cat "$LOCKFILE")
    if kill -0 "$pid" 2>/dev/null; then
        echo "Already running (PID $pid), exiting"
        exit 0
    fi
fi

echo $$ > "$LOCKFILE"
trap 'rm -f "$LOCKFILE"' EXIT

# Your actual job here
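
The PID check above has a small race: two instances can both pass the check before either writes its PID. If you want an atomic primitive without flock, mkdir works, since directory creation succeeds or fails as a single step. A sketch (the lock directory path is illustrative):

```shell
#!/bin/bash
LOCKDIR="/tmp/backup.lock.d"

# mkdir is atomic: exactly one concurrent caller can succeed
if ! mkdir "$LOCKDIR" 2>/dev/null; then
    echo "Already running, exiting"
    exit 0
fi
trap 'rmdir "$LOCKDIR"' EXIT

# Your actual job here
```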

Dead Man’s Switches

A dead man’s switch alerts you when a job doesn’t run:

  1. Job pings the service on success
  2. Service expects ping within time window
  3. No ping? Alert fires

Popular options:

  • Healthchecks.io — Free tier, simple
  • Cronitor — More features, monitoring dashboard
  • PagerDuty — Enterprise, integrates with incident management

With Healthchecks.io, the integration is a single ping at the end of the job:

#!/bin/bash
set -euo pipefail

# Run the job
/usr/local/bin/backup.sh

# Ping on success
curl -fsS --retry 3 https://hc-ping.com/your-uuid-here > /dev/null

If the job fails or doesn’t run, no ping — you get alerted.
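
Healthchecks.io's ping URL also accepts /start and /fail suffixes, which adds duration tracking and immediate failure alerts on the same check. A sketch that wraps any job (UUID placeholder as above):

```shell
#!/bin/bash
HC_URL="https://hc-ping.com/your-uuid-here"   # same check UUID as before

run_with_pings() {
    # Signal start (lets the service measure duration)
    curl -fsS --retry 3 "$HC_URL/start" > /dev/null || true
    if "$@"; then
        curl -fsS --retry 3 "$HC_URL" > /dev/null || true       # success ping
    else
        curl -fsS --retry 3 "$HC_URL/fail" > /dev/null || true  # explicit failure, alert fires immediately
        return 1
    fi
}
```

Then call run_with_pings /usr/local/bin/backup.sh from your cron wrapper.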

Better Alternatives to Cron

Systemd Timers

More features than cron, better logging:

# /etc/systemd/system/backup.service
[Unit]
Description=Daily backup

[Service]
Type=oneshot
ExecStart=/usr/local/bin/backup.sh
# /etc/systemd/system/backup.timer
[Unit]
Description=Run backup daily

[Timer]
OnCalendar=*-*-* 02:00:00
Persistent=true

[Install]
WantedBy=timers.target

Enable it and verify:

systemctl enable backup.timer
systemctl start backup.timer

# Check status
systemctl list-timers
journalctl -u backup.service

Benefits:

  • Logs go to journald automatically
  • Persistent=true runs missed jobs after reboot
  • Dependencies with other services
  • Resource limits (memory, CPU)
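
OnCalendar expressions are easy to get subtly wrong. systemd-analyze can validate one and preview the upcoming run times before you enable the timer:

```shell
# Validate an OnCalendar expression and show when it will next fire
systemd-analyze calendar "*-*-* 02:00:00"

# Preview the next few occurrences (supported since systemd 242)
systemd-analyze calendar --iterations=3 "Mon..Fri 02:00"
```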

Kubernetes CronJobs

For containerized workloads:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: backup
spec:
  schedule: "0 2 * * *"
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: backup
            image: backup-tool:latest
            command: ["/backup.sh"]

Benefits:

  • concurrencyPolicy: Forbid prevents overlap
  • History retention for debugging
  • Native k8s logging and monitoring
  • Easy to scale and manage

Task Queues

For complex scheduling, use a proper task queue:

Celery (Python):

from celery import Celery
from celery.schedules import crontab

app = Celery('tasks')

@app.task
def backup():
    # Your backup logic
    pass

app.conf.beat_schedule = {
    'daily-backup': {
        'task': 'tasks.backup',
        'schedule': crontab(hour=2, minute=0),
    },
}
# Note: schedules are dispatched by a separate "celery beat" process

Bull (Node.js):

const Queue = require('bull');
const backupQueue = new Queue('backup');  // backed by Redis (localhost:6379 by default)

backupQueue.process(async (job) => {
    await runBackup();
});

backupQueue.add({}, {
    repeat: { cron: '0 2 * * *' }
});

Benefits:

  • Retry logic built in
  • Job progress tracking
  • Distributed workers
  • Proper failure handling

Monitoring Patterns

The Three Signals

Monitor every scheduled job for:

  1. Did it run? (Dead man’s switch)
  2. Did it succeed? (Exit code, error logs)
  3. Did it finish in time? (Duration tracking)

A wrapper script that reports all three:

#!/bin/bash
START=$(date +%s)

if /usr/local/bin/backup.sh; then
    DURATION=$(($(date +%s) - START))
    
    # Report success + duration
    curl -X POST "https://metrics.example.com/ingest" \
        -d "backup.success=1,duration=${DURATION}"
    
    # Ping dead man's switch
    curl -fsS https://hc-ping.com/uuid > /dev/null
else
    curl -X POST "https://metrics.example.com/ingest" \
        -d "backup.success=0"
    
    # Alert on failure (don't ping dead man's switch)
    curl -X POST "https://slack.webhook/..." \
        -d '{"text":"Backup failed!"}'
    exit 1
fi

Dashboards

Track job health over time:

  • Success rate (should be ~100%)
  • Duration trends (increasing = problem)
  • Last successful run (stale = problem)

Common Mistakes

1. Assuming PATH

Cron has minimal PATH. Always use full paths:

# Bad
0 * * * * python /scripts/job.py

# Good  
0 * * * * /usr/bin/python3 /scripts/job.py

Or set PATH explicitly:

PATH=/usr/local/bin:/usr/bin:/bin
0 * * * * python3 /scripts/job.py
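
You can also reproduce cron's sparse environment locally before scheduling anything. env -i starts from an empty environment, so the command sees only what you pass in (cron typically sets just HOME, LOGNAME, SHELL, and a minimal PATH):

```shell
# Run a command the way cron would: empty environment plus cron's defaults
env -i HOME="$HOME" SHELL=/bin/sh PATH=/usr/bin:/bin \
    /bin/sh -c 'echo "PATH is: $PATH"'
# → PATH is: /usr/bin:/bin
```

If your script works under this, it will survive cron's environment.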

2. No Error Handling

# Bad: Continues after failure
backup_database
upload_to_s3

# Good: Stops on first failure
set -euo pipefail
backup_database
upload_to_s3
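
A quick demonstration of the difference, runnable in any shell:

```shell
# Without set -e, execution continues past the failure
/bin/sh -c 'false; echo "reached"'
# → reached

# With set -e, the script stops at the first failing command
rc=0; /bin/sh -c 'set -e; false; echo "reached"' || rc=$?
echo "exit: $rc"
# → exit: 1
```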

3. Hardcoded Times Without Considering Timezone

# Which timezone? (Whatever the server's local time is, often UTC)
0 2 * * * /backup.sh

# Be explicit: cronie-based crons support CRON_TZ for scheduling
CRON_TZ=America/New_York
0 2 * * * /backup.sh

4. No Timeout

Jobs that hang forever:

# Good: Kill after 1 hour
0 2 * * * timeout 3600 /usr/local/bin/backup.sh
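
Note that timeout sends SIGTERM by default, which a truly wedged process can ignore. GNU coreutils timeout takes -k to escalate to SIGKILL, and exits with status 124 when the limit was hit, which is easy to spot in logs:

```shell
# In the crontab, escalate to SIGKILL if the job ignores SIGTERM:
#   0 2 * * * timeout -k 300 3600 /usr/local/bin/backup.sh

# timeout exits with status 124 when the limit was hit:
rc=0; timeout 1 sleep 10 || rc=$?
echo "exit: $rc"
# → exit: 124
```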

The Reliable Cron Checklist

  • Output captured to logs
  • Logs rotated
  • Lock file prevents overlap
  • Dead man’s switch monitors execution
  • Alerts on failure
  • Duration tracked
  • Timeout set
  • Full paths used
  • Error handling (set -e)
  • Tested manually before scheduling

Your cron jobs run when you’re not looking. Make sure they tell you when something’s wrong.


The best cron job is one you forget exists — because it just works, and you’d definitely hear about it if it didn’t.