Everyone knows backups are important. Few actually test them. Here’s how to build backup systems that work when you need them.
## The 3-2-1 Rule

The classic foundation:

- 3 copies of your data
- 2 different storage types
- 1 offsite copy

Example implementation:

```text
Primary: Local production database (server disk)
Copy 1:  Remote replica (different server)
Copy 2:  Cloud snapshot (S3, different region)
Copy 3:  Cold backup (different disk)
```
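The rule is mechanical enough to verify in code. A minimal sketch; the inventory format and field names here are hypothetical, for illustration only:

```python
# Sketch: verify a backup inventory satisfies 3-2-1.
# The copy list and its fields are hypothetical, not from any real tool.

def satisfies_3_2_1(copies):
    """copies: list of dicts with 'medium' and 'offsite' keys."""
    enough_copies = len(copies) >= 3                        # 3 copies of the data
    enough_media = len({c["medium"] for c in copies}) >= 2  # 2 storage types
    has_offsite = any(c["offsite"] for c in copies)         # 1 offsite copy
    return enough_copies and enough_media and has_offsite

inventory = [
    {"medium": "local-disk", "offsite": False},  # primary
    {"medium": "nas",        "offsite": False},  # copy on different storage
    {"medium": "s3",         "offsite": True},   # offsite copy
]
print(satisfies_3_2_1(inventory))  # → True
```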
## What to Back Up

### Always Back Up

- Databases — This is your business
- Configuration — Harder to recreate than you think
- Secrets — Encrypted, but backed up
- User uploads — Can’t regenerate these

### Maybe Back Up

- Application code — If not in Git, back it up
- Logs — For compliance, ship to a log aggregator instead
- Build artifacts — Rebuilding from source is often better

### Don’t Back Up

- Ephemeral data — Caches, temp files, sessions
- Derived data — Can be regenerated from source
- Large static assets — Use CDN/object storage with its own durability

## Database Backups

### PostgreSQL
```bash
# Logical backup (custom-format dump)
pg_dump -Fc mydb > backup.dump

# Restore
pg_restore -d mydb backup.dump

# All databases
pg_dumpall > all_databases.sql
```
For larger databases, use physical backups:
```bash
# Base backup + WAL archiving
pg_basebackup -D /backup/base -Fp -Xs -P

# Continuous archiving (in postgresql.conf)
archive_mode = on
archive_command = 'cp %p /backup/wal/%f'
```
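A base backup alone is not enough; recovery replays the archived WAL. On PostgreSQL 12 and later, the restore side looks roughly like this (a sketch, assuming the `/backup/wal` archive path above):

```ini
# postgresql.conf on the restored server (PostgreSQL 12+)
restore_command = 'cp /backup/wal/%f %p'

# Then create an empty recovery.signal file in the data directory and
# start the server; PostgreSQL replays WAL until the archive runs out.
```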
### MySQL

```bash
# Logical backup
mysqldump --all-databases --single-transaction > backup.sql

# With compression
mysqldump mydb | gzip > backup.sql.gz

# Restore
mysql < backup.sql
```
For production, use physical backups:
```bash
# Percona XtraBackup (non-blocking)
xtrabackup --backup --target-dir=/backup/full

# Incremental
xtrabackup --backup --target-dir=/backup/inc \
  --incremental-basedir=/backup/full
```
### MongoDB

```bash
# Logical backup
mongodump --out /backup/mongo

# Restore
mongorestore /backup/mongo

# Oplog for point-in-time recovery
mongodump --oplog --out /backup/mongo
```
## File System Backups

### rsync

The reliable workhorse:

```bash
# Basic sync
rsync -avz /data/ backup-server:/backup/data/

# With deletion (mirror)
rsync -avz --delete /data/ backup-server:/backup/data/

# Incremental with hard links (space efficient)
rsync -avz --link-dest=/backup/previous /data/ /backup/current/
```
### Restic

Modern, encrypted, deduplicated:

```bash
# Initialize repository
restic -r s3:s3.amazonaws.com/my-backup-bucket init

# Backup
restic -r s3:s3.amazonaws.com/my-backup-bucket backup /data

# List snapshots
restic -r s3:s3.amazonaws.com/my-backup-bucket snapshots

# Restore
restic -r s3:s3.amazonaws.com/my-backup-bucket restore latest --target /restore
```
### BorgBackup

Deduplication + compression + encryption:

```bash
# Initialize
borg init --encryption=repokey /backup/borg-repo

# Backup
borg create /backup/borg-repo::backup-{now} /data

# Prune old backups (keep 7 daily, 4 weekly, 6 monthly)
borg prune /backup/borg-repo \
  --keep-daily=7 --keep-weekly=4 --keep-monthly=6
```
## Cloud Backups

### S3 with Lifecycle

```bash
# Upload with storage class
aws s3 cp backup.tar.gz s3://my-bucket/backups/ \
  --storage-class STANDARD_IA

# Lifecycle policy (move to Glacier after 30 days)
aws s3api put-bucket-lifecycle-configuration \
  --bucket my-bucket \
  --lifecycle-configuration file://lifecycle.json
```
```json
{
  "Rules": [{
    "ID": "MoveToGlacier",
    "Status": "Enabled",
    "Filter": { "Prefix": "backups/" },
    "Transitions": [{
      "Days": 30,
      "StorageClass": "GLACIER"
    }],
    "Expiration": { "Days": 365 }
  }]
}
```
### Automated S3 Backup Script

```bash
#!/bin/bash
set -euo pipefail

BUCKET="my-backup-bucket"
DATE=$(date +%Y-%m-%d)
RETENTION_DAYS=30

# Backup database
pg_dump -Fc mydb > /tmp/db-$DATE.dump

# Upload with encryption
aws s3 cp /tmp/db-$DATE.dump \
  s3://$BUCKET/database/$DATE.dump \
  --sse AES256

# Clean up local
rm /tmp/db-$DATE.dump

# Delete old backups
aws s3 ls s3://$BUCKET/database/ | while read -r line; do
  file_date=$(echo "$line" | awk '{print $1}')
  if [[ $(date -d "$file_date" +%s) -lt $(date -d "-$RETENTION_DAYS days" +%s) ]]; then
    file=$(echo "$line" | awk '{print $4}')
    aws s3 rm s3://$BUCKET/database/$file
  fi
done

echo "Backup complete: $DATE"
```
## Testing Backups

A backup you haven’t tested is not a backup.
### Automated Restore Testing

```bash
#!/bin/bash
# Run weekly via cron
set -euo pipefail

# Restore to a fresh test database
dropdb --if-exists mydb_test
createdb mydb_test
pg_restore -d mydb_test /backup/latest.dump

# Run validation queries
TEST_COUNT=$(psql mydb_test -t -c "SELECT COUNT(*) FROM users" | tr -d ' ')

# Compare with production
PROD_COUNT=$(psql mydb -t -c "SELECT COUNT(*) FROM users" | tr -d ' ')

if [ "$PROD_COUNT" != "$TEST_COUNT" ]; then
  echo "BACKUP VALIDATION FAILED" | mail -s "Alert" admin@example.com
  exit 1
fi

# Clean up
dropdb mydb_test
echo "Backup validation passed"
```
## Disaster Recovery Drills

Schedule regular drills:

```markdown
## DR Drill Checklist
1. [ ] Pretend production is gone
2. [ ] Start timer
3. [ ] Locate latest backup
4. [ ] Restore to fresh environment
5. [ ] Verify application works
6. [ ] Document time taken
7. [ ] Document issues found
8. [ ] Update runbook
```
## Retention Policies

Balance storage cost vs. recovery options:

```text
# Grandfather-Father-Son rotation
Daily:   Keep 7 days
Weekly:  Keep 4 weeks
Monthly: Keep 12 months
Yearly:  Keep 7 years (if required)
```
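The same rotation can be expressed in code. A simplified sketch (ignoring the yearly tier and time-zone edge cases):

```python
# Sketch: select which dated snapshots to keep under a simplified
# Grandfather-Father-Son policy (7 daily, 4 weekly, 12 monthly).
from datetime import date, timedelta

def gfs_keep(snapshots, daily=7, weekly=4, monthly=12):
    """snapshots: iterable of date objects. Returns the set to keep."""
    snaps = sorted(snapshots, reverse=True)   # newest first
    keep = set(snaps[:daily])                 # most recent N dailies
    weeks, months = set(), set()
    for d in snaps:
        wk = tuple(d.isocalendar()[:2])       # (ISO year, ISO week)
        if wk not in weeks and len(weeks) < weekly:
            weeks.add(wk)
            keep.add(d)                       # newest snapshot in each week
        mo = (d.year, d.month)
        if mo not in months and len(months) < monthly:
            months.add(mo)
            keep.add(d)                       # newest snapshot in each month
    return keep

# 120 days of daily snapshots: far fewer survive pruning
snaps = [date(2024, 6, 30) - timedelta(days=i) for i in range(120)]
kept = gfs_keep(snaps)
print(f"keep {len(kept)} of {len(snaps)} snapshots")
```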
Implement with pruning:
```bash
# Restic
restic -r s3:s3.amazonaws.com/my-backup-bucket forget \
  --keep-daily 7 \
  --keep-weekly 4 \
  --keep-monthly 12 \
  --keep-yearly 7 \
  --prune
```
## Monitoring Backups

### Alert on Failure

```bash
#!/bin/bash
if ! /usr/local/bin/backup.sh; then
  curl -X POST "https://hooks.slack.com/..." \
    -d '{"text":"🚨 Backup failed on db-server-01"}'
  exit 1
fi

# Ping dead man's switch on success
curl -fsS https://hc-ping.com/your-uuid
```
### Track Backup Metrics

```python
# Prometheus metrics
from prometheus_client import Gauge, Histogram

backup_last_success = Gauge('backup_last_success_timestamp',
                            'Timestamp of last successful backup')
backup_size_bytes = Gauge('backup_size_bytes',
                          'Size of last backup')
backup_duration_seconds = Histogram('backup_duration_seconds',
                                    'Time to complete backup')
```
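Turning recorded metrics into alert decisions can be as simple as a threshold check. A sketch with illustrative thresholds (25 hours, 50% size drift), independent of any real alerting stack:

```python
# Sketch: evaluate backup health from recorded metrics.
# Thresholds are illustrative, matching a common daily-backup setup.
import time

def backup_alerts(last_success_ts, prev_size, curr_size, now=None):
    """Return a list of alert messages; an empty list means healthy."""
    now = time.time() if now is None else now
    alerts = []
    if now - last_success_ts > 25 * 3600:
        alerts.append("backup overdue (no success in 25h)")
    if prev_size and abs(curr_size - prev_size) / prev_size > 0.5:
        alerts.append("backup size changed by more than 50%")
    return alerts

# A backup that last succeeded 30 hours ago and shrank 60% trips both alerts
now = time.time()
print(backup_alerts(now - 30 * 3600, prev_size=1000, curr_size=400, now=now))
```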
Alert when:

- Backup hasn’t run in 25 hours
- Backup size changed dramatically (>50%)
- Backup duration is trending upward

## Common Mistakes

### 1. Not Testing Restores

```bash
# Bad: Hope it works
mysqldump mydb > backup.sql

# Good: Verify it works
mysqldump mydb > backup.sql
mysql testdb < backup.sql
mysql testdb -e "SELECT COUNT(*) FROM users"
```
### 2. Backups on the Same System

```bash
# Bad: Backup on the same disk
pg_dump mydb > /var/lib/postgresql/backup.dump

# Good: Backup offsite
pg_dump mydb | aws s3 cp - s3://backup-bucket/db.dump
```
### 3. Unencrypted Backups

```bash
# Bad: Plain text to S3
aws s3 cp backup.sql s3://bucket/

# Good: Encrypted
gpg --encrypt --recipient admin@example.com backup.sql
aws s3 cp backup.sql.gpg s3://bucket/

# Or use S3 server-side encryption
aws s3 cp backup.sql s3://bucket/ --sse AES256
```
### 4. No Retention Policy

```bash
# Bad: Keep everything forever (expensive)
# Bad: Delete too aggressively (can't recover)

# Good: Defined policy with automation
find /backup -mtime +30 -delete  # Delete backups older than 30 days
```
## The Backup Checklist

Your backup system is only as good as your last successful restore test.

The best time to test your backups was before the disaster. The second best time is today.