Everyone knows backups are important. Few actually test them. Here’s how to build backup systems that work when you need them.

The 3-2-1 Rule

The classic foundation:

  • 3 copies of your data
  • 2 different storage types
  • 1 offsite copy

Example implementation:

Primary: Production database
Copy 1:  Local backup (same server, different disk)
Copy 2:  Remote copy (different datacenter)
Copy 3:  Cloud snapshot (S3, different region)

What to Back Up

Always Back Up

  • Databases — This is your business
  • Configuration — Harder to recreate than you think
  • Secrets — Encrypted, but backed up
  • User uploads — Can’t regenerate these

Maybe Back Up

  • Application code — If not in Git, back it up
  • Logs — Back up only if compliance demands it; otherwise ship them to a log aggregator instead
  • Build artifacts — Rebuilding from source is often better

Don’t Back Up

  • Ephemeral data — Caches, temp files, sessions
  • Derived data — Can regenerate from source
  • Large static assets — Use CDN/object storage with its own durability
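The three buckets above translate naturally into an exclude list. A minimal sketch with illustrative paths (adjust to your layout); most tools accept one pattern per line:

```shell
# Example exclude list covering the "don't back up" categories above.
# These paths are illustrative, not a recommendation for your layout.
cat > /tmp/backup-excludes.txt <<'EOF'
/var/cache
/tmp
/app/tmp/sessions
/app/build
EOF

# One pattern per line works for most tools, e.g.:
# restic backup /data --exclude-file=/tmp/backup-excludes.txt
# rsync -a --exclude-from=/tmp/backup-excludes.txt /data/ dest/
```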

Database Backups

PostgreSQL

# Logical backup (SQL dump)
pg_dump -Fc mydb > backup.dump

# Restore
pg_restore -d mydb backup.dump

# All databases
pg_dumpall > all_databases.sql

For larger databases, use physical backups:

# Base backup + WAL archiving
pg_basebackup -D /backup/base -Fp -Xs -P

# Continuous archiving (in postgresql.conf)
archive_mode = on
archive_command = 'test ! -f /backup/wal/%f && cp %p /backup/wal/%f'
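With a base backup and archived WAL in place, point-in-time recovery is mostly configuration. A sketch for PostgreSQL 12+, assuming the paths above; the recovery target is an example:

```shell
# Restore the base backup into an empty data directory first, then:

# In postgresql.conf:
#   restore_command = 'cp /backup/wal/%f %p'
#   recovery_target_time = '2024-01-15 10:00:00'   # example target

# Signal recovery mode and start the server; PostgreSQL replays WAL
# up to the target, then acts per recovery_target_action.
touch /var/lib/postgresql/data/recovery.signal
pg_ctl start -D /var/lib/postgresql/data
```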

MySQL

# Logical backup
mysqldump --all-databases --single-transaction > backup.sql

# With compression
mysqldump mydb | gzip > backup.sql.gz

# Restore
mysql < backup.sql

For production, use physical backups:

# Percona XtraBackup (non-blocking)
xtrabackup --backup --target-dir=/backup/full

# Incremental
xtrabackup --backup --target-dir=/backup/inc \
  --incremental-basedir=/backup/full
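A raw xtrabackup directory is not restorable until it has been prepared. Assuming the two directories above, the documented order is: prepare the full with --apply-log-only, then apply the last incremental without it:

```shell
# Prepare the full backup, deferring rollback of uncommitted transactions
xtrabackup --prepare --apply-log-only --target-dir=/backup/full

# Apply the incremental (the last one in the chain: no --apply-log-only)
xtrabackup --prepare --target-dir=/backup/full \
  --incremental-dir=/backup/inc

# Restore: stop mysqld, empty the data directory, then
# xtrabackup --copy-back --target-dir=/backup/full
```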

MongoDB

# Logical backup
mongodump --out /backup/mongo

# Restore
mongorestore /backup/mongo

# Oplog for point-in-time recovery
mongodump --oplog --out /backup/mongo
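The oplog dump only pays off if it is replayed at restore time; --oplogReplay applies the operations captured while the dump ran, yielding a consistent point-in-time state:

```shell
# Replay the oplog captured during the dump
mongorestore --oplogReplay /backup/mongo
```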

File System Backups

rsync

The reliable workhorse:

# Basic sync
rsync -avz /data/ backup-server:/backup/data/

# With deletion (mirror)
rsync -avz --delete /data/ backup-server:/backup/data/

# Incremental with hard links (space efficient)
rsync -avz --link-dest=/backup/previous /data/ /backup/current/
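The --link-dest pattern extends to dated snapshots: each run hard-links unchanged files against the newest previous snapshot. A sketch with example paths (BACKUP_ROOT defaults to a demo directory, and the rsync call is printed rather than executed):

```shell
#!/bin/bash
# Dated snapshots with hard links; the directory layout is an example.
BACKUP_ROOT=${BACKUP_ROOT:-/tmp/backup-demo}
mkdir -p "$BACKUP_ROOT"
today=$(date +%Y-%m-%d)

# Newest existing snapshot, if any
prev=$(ls -1d "$BACKUP_ROOT"/????-??-?? 2>/dev/null | sort | tail -n 1)

# Only pass --link-dest when a previous snapshot exists
opts=(-a)
[ -n "$prev" ] && opts+=(--link-dest="$prev")

echo "rsync ${opts[*]} /data/ $BACKUP_ROOT/$today/"   # drop echo to run for real
```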

Restic

Modern, encrypted, deduplicated:

# Initialize repository
restic -r s3:s3.amazonaws.com/my-backup-bucket init

# Backup
restic -r s3:s3.amazonaws.com/my-backup-bucket backup /data

# List snapshots
restic -r s3:s3.amazonaws.com/my-backup-bucket snapshots

# Restore
restic -r s3:s3.amazonaws.com/my-backup-bucket restore latest --target /restore
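Repeating -r (and retyping the password) gets old; restic reads both from standard environment variables, which also makes the commands cron-friendly. The password file path is an example:

```shell
# Set once, e.g. in the cron job's environment or a sourced env file
export RESTIC_REPOSITORY="s3:s3.amazonaws.com/my-backup-bucket"
export RESTIC_PASSWORD_FILE="/etc/restic/password"   # example path

# Now the -r flag can be dropped:
# restic backup /data
# restic check   # periodically verify repository integrity
```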

BorgBackup

Deduplication + compression + encryption:

# Initialize
borg init --encryption=repokey /backup/borg-repo

# Backup
borg create /backup/borg-repo::backup-{now} /data

# Prune old backups (keep 7 daily, 4 weekly, 6 monthly)
borg prune /backup/borg-repo \
  --keep-daily=7 --keep-weekly=4 --keep-monthly=6
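Restoring from Borg means listing archives and extracting one. Note that borg extract unpacks into the current directory; the archive name below is an example of the backup-{now} format used above:

```shell
# See what archives exist
borg list /backup/borg-repo

# Extract into the current directory (cd somewhere safe first)
cd /restore
borg extract /backup/borg-repo::backup-2024-01-15T03:00:00
```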

Cloud Backups

S3 with Lifecycle

# Upload with storage class
aws s3 cp backup.tar.gz s3://my-bucket/backups/ \
  --storage-class STANDARD_IA

# Lifecycle policy (move to Glacier after 30 days)
aws s3api put-bucket-lifecycle-configuration \
  --bucket my-bucket \
  --lifecycle-configuration file://lifecycle.json
{
  "Rules": [{
    "ID": "MoveToGlacier",
    "Status": "Enabled",
    "Filter": {"Prefix": "backups/"},
    "Transitions": [{
      "Days": 30,
      "StorageClass": "GLACIER"
    }],
    "Expiration": {"Days": 365}
  }]
}

Automated S3 Backup Script

#!/bin/bash
set -euo pipefail

BUCKET="my-backup-bucket"
DATE=$(date +%Y-%m-%d)
RETENTION_DAYS=30

# Backup database
pg_dump -Fc mydb > /tmp/db-$DATE.dump

# Upload with encryption
aws s3 cp /tmp/db-$DATE.dump \
  s3://$BUCKET/database/$DATE.dump \
  --sse AES256

# Clean up local
rm /tmp/db-$DATE.dump

# Delete old backups
aws s3 ls s3://$BUCKET/database/ | while read -r line; do
  file_date=$(echo "$line" | awk '{print $1}')
  if [[ $(date -d "$file_date" +%s) -lt $(date -d "-$RETENTION_DAYS days" +%s) ]]; then
    file=$(echo "$line" | awk '{print $4}')
    aws s3 rm "s3://$BUCKET/database/$file"
  fi
done
done

echo "Backup complete: $DATE"

Testing Backups

A backup you haven’t tested is not a backup.

Automated Restore Testing

#!/bin/bash
# Run weekly via cron

# Restore to a fresh test database
createdb mydb_test
pg_restore -d mydb_test /backup/latest.dump

# Run validation queries (-t -A: no headers or padding)
psql -t -A mydb_test -c "SELECT COUNT(*) FROM users" > /tmp/count.txt

# Compare with production
PROD_COUNT=$(psql -t -A mydb -c "SELECT COUNT(*) FROM users")
TEST_COUNT=$(tr -d ' ' < /tmp/count.txt)

if [ "$PROD_COUNT" != "$TEST_COUNT" ]; then
  echo "BACKUP VALIDATION FAILED" | mail -s "Alert" admin@example.com
  exit 1
fi

# Clean up
dropdb mydb_test

echo "Backup validation passed"

Disaster Recovery Drills

Schedule regular drills:

## DR Drill Checklist

1. [ ] Pretend production is gone
2. [ ] Start timer
3. [ ] Locate latest backup
4. [ ] Restore to fresh environment
5. [ ] Verify application works
6. [ ] Document time taken
7. [ ] Document issues found
8. [ ] Update runbook

Retention Policies

Balance storage cost vs recovery options:

# Grandfather-Father-Son rotation
Daily:   Keep 7 days
Weekly:  Keep 4 weeks  
Monthly: Keep 12 months
Yearly:  Keep 7 years (if required)

Implement with pruning:

# Restic
restic forget \
  --keep-daily 7 \
  --keep-weekly 4 \
  --keep-monthly 12 \
  --keep-yearly 7 \
  --prune

Monitoring Backups

Alert on Failure

#!/bin/bash
if ! /usr/local/bin/backup.sh; then
  curl -X POST "https://hooks.slack.com/..." \
    -d '{"text":"🚨 Backup failed on db-server-01"}'
  exit 1
fi

# Ping dead man's switch on success
curl -fsS https://hc-ping.com/your-uuid

Track Backup Metrics

# Prometheus metrics (Python prometheus_client)
backup_last_success = Gauge('backup_last_success_timestamp', 
                             'Timestamp of last successful backup')
backup_size_bytes = Gauge('backup_size_bytes',
                          'Size of last backup')
backup_duration_seconds = Histogram('backup_duration_seconds',
                                     'Time to complete backup')
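If the backup job is a shell script rather than a Python process, the same metrics can be published through node_exporter's textfile collector: write a .prom file the exporter scrapes. Directory and file paths here are examples:

```shell
# Write metrics where node_exporter's textfile collector can read them.
# In production, point this at your --collector.textfile.directory.
METRICS_DIR=${METRICS_DIR:-/tmp/textfile_collector}
mkdir -p "$METRICS_DIR"

cat > "$METRICS_DIR/backup.prom" <<EOF
backup_last_success_timestamp $(date +%s)
backup_size_bytes $(stat -c %s /backup/latest.dump 2>/dev/null || echo 0)
EOF
```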

Alert when:

  • Backup hasn’t run in 25 hours
  • Backup size changed dramatically (>50%)
  • Backup duration increasing trend
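The first of those alerts needs no metrics stack at all: have the backup job touch a stamp file on success, and alert when the stamp goes stale. A sketch; the stamp path is an example, and stat -c assumes GNU coreutils:

```shell
#!/bin/bash
# Alert when the last successful backup is older than MAX_AGE_HOURS.
STAMP_FILE=${STAMP_FILE:-/backup/.last-success}   # touched by the backup job
MAX_AGE_HOURS=25

is_stale() {
  local stamp=$1 max_hours=$2
  local mtime age
  mtime=$(stat -c %Y "$stamp" 2>/dev/null || echo 0)
  age=$(( $(date +%s) - mtime ))
  [ "$age" -gt $(( max_hours * 3600 )) ]
}

if is_stale "$STAMP_FILE" "$MAX_AGE_HOURS"; then
  echo "backup stale"   # hook Slack/PagerDuty/etc. here
fi
```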

Common Mistakes

1. Not Testing Restores

# Bad: Hope it works
mysqldump mydb > backup.sql

# Good: Verify it works
mysqldump mydb > backup.sql
mysql testdb < backup.sql
mysql testdb -e "SELECT COUNT(*) FROM users"

2. Backups on Same System

# Bad: Backup on same disk
pg_dump mydb > /var/lib/postgresql/backup.dump

# Good: Backup offsite
pg_dump mydb | aws s3 cp - s3://backup-bucket/db.dump

3. Unencrypted Backups

# Bad: Plain text to S3
aws s3 cp backup.sql s3://bucket/

# Good: Encrypted
gpg --encrypt --recipient admin@example.com backup.sql
aws s3 cp backup.sql.gpg s3://bucket/

# Or use S3 server-side encryption
aws s3 cp backup.sql s3://bucket/ --sse AES256

4. No Retention Policy

# Bad: Keep everything forever (expensive)
# Bad: Delete too aggressively (can't recover)

# Good: Defined policy with automation
find /backup -mtime +30 -delete  # Delete >30 days

The Backup Checklist

  • 3-2-1 rule implemented
  • Automated daily backups
  • Encryption at rest and in transit
  • Retention policy defined and automated
  • Restore tested monthly
  • Monitoring and alerting configured
  • Runbook documented
  • DR drill scheduled

Your backup system is only as good as your last successful restore test.


The best time to test your backups was before the disaster. The second best time is today.