Everyone has backups. Few have tested restores. The backup that fails during a crisis is worse than no backup — it gave you false confidence.
Backup strategy isn’t about the backup. It’s about the restore.
The 3-2-1 Rule#
A minimum viable backup strategy:
- 3 copies of your data
- 2 different storage media/types
- 1 copy offsite
Primary: Production database
Copy 1:  Local snapshot (same datacenter)
Copy 2:  Remote replica (different region)
Copy 3:  Offsite archive (different provider)
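The rule is easy to check mechanically once the copies are written down. The following is a minimal sketch, assuming a hypothetical inventory format of one copy per line as `<name> <media> <site>`, where `<site>` is `onsite` or `offsite`. Whether the primary counts as one of the three copies is a policy choice; here every listed line counts.

```shell
#!/usr/bin/env bash
# check_321: read an inventory of copies on stdin and test it against 3-2-1.
# Input format (an assumption for this sketch): "<name> <media> <site>" per line.
check_321() {
  awk '
    NF >= 3 { copies++; media[$2] = 1; if ($3 == "offsite") offsite++ }
    END {
      for (m in media) kinds++
      if (copies >= 3 && kinds >= 2 && offsite >= 1) { print "ok"; exit 0 }
      printf "fail: copies=%d media=%d offsite=%d\n", copies, kinds, offsite
      exit 1
    }'
}
```

For example, `printf '%s\n' 'snapshot disk onsite' 'replica disk offsite' 'archive object-storage offsite' | check_321` passes, while two onsite copies on the same media type fail all three conditions.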
What to Back Up#
Always:
- Databases (obviously)
- Configuration files
- SSL certificates and keys
- Application state/uploads
- Infrastructure as Code
Often forgotten:
- Secrets and credentials
- DNS records
- CI/CD pipeline configs
- Monitoring dashboards
- Documentation
# Document what's backed up and where
cat > /etc/backup-manifest.yml << EOF
databases:
  postgres_main:
    schedule: hourly
    retention: 30 days
    location: s3://backups/postgres/
configs:
  nginx:
    schedule: daily
    retention: 90 days
    location: s3://backups/configs/
secrets:
  vault_data:
    schedule: daily
    retention: 365 days
    location: s3://backups/vault/  # encrypted
EOF
Database Backups#
PostgreSQL#
# Logical backup (portable, slower)
pg_dump -h localhost -U postgres mydb | gzip > mydb_$(date +%Y%m%d_%H%M%S).sql.gz
# Physical backup (faster, same major version required)
pg_basebackup -h localhost -D /backup/base -Ft -z -P
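One caveat with piping a dump into `gzip`: by default the shell reports only the exit status of the last command, so a failed dump can still leave behind a valid-looking but empty archive. The wrapper below is a sketch of one way to guard against that; `safe_dump` is a hypothetical helper name, not part of any tool.

```shell
#!/usr/bin/env bash
# safe_dump <outfile> <command> [args...]
# Run a dump command, compress it, and publish the file only if every
# stage succeeded and the archive passes gzip's integrity test.
safe_dump() {
  local out="$1"; shift
  local tmp="${out}.partial"
  # Subshell so pipefail does not leak into the caller's shell options.
  if ( set -o pipefail; "$@" | gzip > "$tmp" ) && gzip -t "$tmp"; then
    mv "$tmp" "$out"      # rename is atomic on the same filesystem
  else
    rm -f "$tmp"
    echo "dump failed: $*" >&2
    return 1
  fi
}
```

Usage would look like `safe_dump "mydb_$(date +%Y%m%d_%H%M%S).sql.gz" pg_dump -h localhost -U postgres mydb`. Writing to a `.partial` name first means a crashed run never leaves a file that looks like a finished backup.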
Continuous Archiving (Point-in-Time Recovery)#
# postgresql.conf
archive_mode = on
archive_command = 'aws s3 cp %p s3://wal-archive/%f'
Combined with a base backup, the archived WAL lets you restore to any point in time after that base backup was taken, not just to the moments when snapshots ran.
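The restore side mirrors the archive side. As a rough sketch for PostgreSQL 12 and later, assuming the base backup has already been unpacked into the data directory and the server is stopped, recovery is configured by adding a `restore_command` and a target, then creating an empty `recovery.signal` file. The helper name `configure_pitr` is hypothetical; the bucket name is taken from the `archive_command` above.

```shell
#!/usr/bin/env bash
# configure_pitr <data_dir> <target_time>
# Prepare a restored base backup for point-in-time recovery (PostgreSQL 12+).
configure_pitr() {
  local data_dir="$1" target_time="$2"
  # Inverse of the archive_command: fetch WAL segment %f into path %p.
  cat >> "$data_dir/postgresql.conf" <<EOF
restore_command = 'aws s3 cp s3://wal-archive/%f %p'
recovery_target_time = '$target_time'
recovery_target_action = 'promote'
EOF
  # An empty recovery.signal tells the server to start in targeted recovery.
  touch "$data_dir/recovery.signal"
}
```

Calling `configure_pitr /var/lib/postgresql/data '2024-05-01 12:00:00'` and then starting the server would replay WAL up to that timestamp and promote. The data directory path and timestamp here are placeholders.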
MySQL#
# Logical backup (--single-transaction gives a consistent snapshot for InnoDB tables only)
mysqldump --all-databases --single-transaction | gzip > mysql_$(date +%Y%m%d).sql.gz
# Physical backup with Percona XtraBackup
xtrabackup --backup --target-dir=/backup/mysql/
File System Backups#
Restic (Recommended)#
# Initialize repository
restic -r s3:s3.amazonaws.com/my-backups init
# Backup
restic -r s3:s3.amazonaws.com/my-backups backup /var/www /etc/nginx
# Incremental by default, deduplication built-in
# Automate with cron
0 * * * * restic -r s3:s3.amazonaws.com/my-backups backup /var/www --quiet
rsync for Simple Cases#
# Mirror to backup server
rsync -avz --delete /var/www/ backup-server:/backups/www/
# With hard links for space-efficient versioning
rsync -avz --delete --link-dest=/backups/www/latest /var/www/ /backups/www/$(date +%Y%m%d)/
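The `--link-dest` approach only works if `latest` actually points at the previous snapshot, and rsync does not maintain that pointer itself. A small sketch of repointing it after each successful run, using a hypothetical helper `point_latest`:

```shell
#!/usr/bin/env bash
# point_latest <root> <name>
# Repoint <root>/latest at the snapshot directory <root>/<name>.
point_latest() {
  local root="$1" name="$2"
  [ -d "$root/$name" ] || { echo "no such snapshot: $root/$name" >&2; return 1; }
  # -n replaces the symlink itself instead of descending into its target.
  ln -sfn "$name" "$root/latest"
}
```

Run it only when rsync exits successfully, for example `rsync ... && point_latest /backups/www "$(date +%Y%m%d)"`, so that a failed or partial snapshot never becomes the hard-link base for the next one.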
Cloud-Native Backups#
AWS#
# RDS automated backups (retention is in days; RDS allows at most 35)
aws rds modify-db-instance \
  --db-instance-identifier mydb \
  --backup-retention-period 30
# EBS snapshots
aws ec2 create-snapshot \
  --volume-id vol-1234567890 \
  --description "Daily backup $(date +%Y%m%d)"
# S3 versioning (not a backup, but helps)
aws s3api put-bucket-versioning \
  --bucket my-bucket \
  --versioning-configuration Status=Enabled
Kubernetes#
# Velero for cluster backup (a real install typically also needs --plugins and --secret-file)
velero install --provider aws --bucket velero-backups
# Backup namespace
velero backup create app-backup --include-namespaces production
# Schedule daily backups
velero schedule create daily --schedule="0 2 * * *" --include-namespaces production
Retention Policies#
Keep more recent backups, fewer old ones:
retention:
  hourly: 24    # Last 24 hours
  daily: 30     # Last 30 days
  weekly: 12    # Last 12 weeks
  monthly: 12   # Last 12 months
  yearly: 5     # Last 5 years
# Restic retention
restic -r s3:s3.amazonaws.com/my-backups forget \
  --keep-hourly 24 \
  --keep-daily 30 \
  --keep-weekly 12 \
  --keep-monthly 12 \
  --keep-yearly 5 \
  --prune
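To build intuition for what a rule like `--keep-daily` retains, here is a simplified simulation: for the most recent N days that have at least one snapshot, keep only the newest snapshot of each day. This is a sketch of the idea, not restic's implementation, and it ignores how restic combines several `--keep-*` rules. `restic forget --dry-run` remains the authoritative preview.

```shell
#!/usr/bin/env bash
# keep_daily <n>
# Input (stdin): ISO 8601 timestamps, one per line, e.g. 2024-05-01T03:00:00
# Output: the snapshots a "keep daily n" rule would retain, newest first.
keep_daily() {
  sort -r | awk -v n="$1" '
    { day = substr($0, 1, 10) }                 # YYYY-MM-DD prefix
    day != last && kept < n { print; kept++ }   # newest snapshot of a new day
    { last = day }
  '
}
```

Feeding it five snapshots spread over three days with `keep_daily 2` keeps the newest snapshot from each of the two most recent days and drops the rest.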
Encryption#
Backups are a security risk. Encrypt everything:
# Restic encrypts by default (repository password)
restic -r s3:bucket/backups init
# Enter password for new repository:
# GPG for manual encryption
pg_dump mydb | gzip | gpg --encrypt -r backup@company.com > backup.sql.gz.gpg
Store encryption keys separately from backups. A backup you can’t decrypt is useless.
Testing Restores#
The backup you don’t test doesn’t exist.
#!/bin/bash
# Monthly restore drill
set -euo pipefail
echo "Starting restore test..."
# Restore to test environment
restic -r s3:bucket/backups restore latest --target /tmp/restore-test
# Verify database: load the dump into a fresh scratch database
dropdb --if-exists restore_test
createdb restore_test
pg_restore -d restore_test /tmp/restore-test/postgres/backup.dump
psql restore_test -c "SELECT count(*) FROM critical_table;"
# Record checksums of the restored files for comparison against the source
find /tmp/restore-test -type f -exec md5sum {} + > /tmp/restore-checksums.txt
echo "Restore test completed successfully"
Schedule restore tests. Document results. Fix failures before they matter.
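Recording checksums is only half of a comparison; something has to check them against a known-good list. A minimal sketch, assuming GNU coreutils and findutils, with hypothetical helper names: write a manifest from the source tree at backup time, then verify the restored tree against it during the drill.

```shell
#!/usr/bin/env bash
# write_manifest <dir> <manifest>
# Record SHA-256 sums with paths relative to <dir>.
# <manifest> must be an absolute path outside <dir>.
write_manifest() {
  ( cd "$1" && find . -type f -print0 | sort -z | xargs -0 -r sha256sum ) > "$2"
}

# verify_manifest <dir> <manifest>
# Succeed only if every file listed in the manifest exists in <dir> and matches.
verify_manifest() {
  ( cd "$1" && sha256sum --check --quiet --strict "$2" )
}
```

In a drill this might be `write_manifest /var/www /var/backups/www.sha256` on the source and `verify_manifest /tmp/restore-test/var/www /var/backups/www.sha256` after the restore; both paths are placeholders. Note that this catches missing or altered files but not extra files in the restored tree.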
Monitoring Backups#
Backups fail silently. Monitor them:
# Check last backup age (date -d requires GNU date)
last_backup=$(restic -r s3:bucket/backups snapshots --json | jq -r '.[-1].time')
age_hours=$(( ( $(date +%s) - $(date -d "$last_backup" +%s) ) / 3600 ))
if [ "$age_hours" -gt 24 ]; then
  echo "ALERT: Last backup is $age_hours hours old"
fi
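Age is only one signal. A backup that ran on time but shrank sharply often means a source was silently skipped. The following sketch compares two sizes in bytes; the helper name and the default 20% threshold are assumptions to tune for your data.

```shell
#!/usr/bin/env bash
# size_dropped <previous_bytes> <current_bytes> [max_drop_percent]
# Returns 0 (alert) when the current backup is more than max_drop_percent
# smaller than the previous one. The default of 20 is an arbitrary assumption.
size_dropped() {
  local prev="$1" cur="$2" max_drop="${3:-20}"
  [ "$prev" -gt 0 ] || return 1          # nothing to compare against yet
  # Integer math: alert if cur < prev * (100 - max_drop) / 100
  [ $(( cur * 100 )) -lt $(( prev * (100 - max_drop) )) ]
}
```

It slots into the same kind of check as the age test: `if size_dropped "$yesterday_bytes" "$today_bytes"; then echo "ALERT: backup size dropped"; fi`, where both variables come from wherever you record backup sizes.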
Alert on:
- Backup job failures
- Backup age exceeding threshold
- Backup size anomalies (sudden drop = problem)
- Restore test failures
Disaster Recovery Plan#
Document the full recovery process:
# Disaster Recovery Runbook
## Scenario: Complete datacenter loss
### Prerequisites
- AWS credentials in 1Password
- Terraform state in S3 (us-west-2)
- Backups in S3 (eu-west-1)
### Steps
1. Provision new infrastructure
   - `cd terraform/dr && terraform apply`
2. Restore database
   - Download latest backup: `restic restore latest --target /restore`
   - Import: `pg_restore -d production /restore/backup.dump`
3. Deploy application
   - `kubectl apply -f k8s/`
4. Update DNS
   - Point api.example.com to new load balancer
### Verification
- [ ] API responds at /health
- [ ] Database connectivity confirmed
- [ ] User can log in
### Expected RTO: 2 hours
### Expected RPO: 1 hour (hourly backups)
Practice the runbook. Time it. Improve it.
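Timing a drill is easier when each step reports its own duration. A minimal sketch with hypothetical helper names, using whole-second resolution from `date +%s`:

```shell
#!/usr/bin/env bash
# timed_step <label> <command> [args...]
# Runs the command and prints "<label>: <seconds>s", preserving its exit status.
timed_step() {
  local label="$1"; shift
  local start end status=0
  start=$(date +%s)
  "$@" || status=$?
  end=$(date +%s)
  echo "$label: $(( end - start ))s"
  return "$status"
}

# within_rto <elapsed_seconds> <rto_seconds>
# Succeeds if the drill finished within the recovery time objective.
within_rto() {
  [ "$1" -le "$2" ]
}
```

Wrapping each runbook step, for example `timed_step provision terraform apply`, gives a per-step breakdown, and `within_rto "$total_seconds" 7200` checks the total against the 2-hour RTO stated above. The slowest step is usually where the next improvement should go.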
RTO and RPO#
RPO (Recovery Point Objective): How much data can you lose? (Determines backup frequency)
RTO (Recovery Time Objective): How long can you be down? (Determines restore speed)
RPO          Backup Strategy
24 hours     Daily backups
1 hour       Hourly snapshots
5 minutes    Continuous replication
0            Synchronous replication
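The link between RPO and backup frequency can be made concrete with a conservative estimate: if a failure hits just before the next backup finishes, you fall back to the previous one, losing roughly one interval plus the time a backup takes to complete. The sketch below encodes that assumption; the function names are hypothetical.

```shell
#!/usr/bin/env bash
# worst_case_loss_minutes <interval_minutes> <backup_duration_minutes>
# Conservative estimate: one full interval plus the time the backup takes.
worst_case_loss_minutes() {
  echo $(( $1 + $2 ))
}

# meets_rpo <interval_minutes> <backup_duration_minutes> <rpo_minutes>
# Succeeds if the worst-case loss fits within the RPO.
meets_rpo() {
  [ "$(worst_case_loss_minutes "$1" "$2")" -le "$3" ]
}
```

Under this estimate, hourly backups that take 10 minutes give a worst case of 70 minutes, so `meets_rpo 60 10 60` fails while `meets_rpo 45 10 60` passes. If the RPO is a hard requirement, schedule backups somewhat more often than the RPO itself.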
Backups are insurance. You pay for them hoping to never need them. But unlike insurance, you can test whether they’ll actually pay out.
Back up everything. Encrypt everything. Test restores regularly. Document recovery procedures. The crisis that finds you prepared is the one you survive.