Everyone has backups. Few have tested restores. A backup that fails during a crisis is worse than no backup, because it gave you false confidence.

Backup strategy isn’t about the backup. It’s about the restore.

The 3-2-1 Rule

A minimum viable backup strategy:

  • 3 copies of your data
  • 2 different storage media/types
  • 1 copy offsite

Primary: Production database
Copy 1:  Local snapshot (same datacenter)
Copy 2:  Remote replica (different region)
Copy 3:  Offsite archive (different provider)
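
One way to keep the rule honest is to write the copies down and count them. The sketch below assumes a hypothetical inventory format (one copy per line: name, media type, onsite or offsite); `check_321` and the format are illustrative, not an existing tool. It counts every line it is given, so decide for yourself whether the primary belongs in the inventory.

```shell
#!/bin/sh
# check_321: read an inventory on stdin, one copy per line:
#   <name> <media-type> <onsite|offsite>
# Prints "ok" if the inventory satisfies 3-2-1, otherwise the first
# rule that fails. The inventory format is a made-up convention.
check_321() {
  awk '
    NF >= 3 && $1 !~ /^#/ {
      copies++
      media[$2] = 1
      if ($3 == "offsite") offsite++
    }
    END {
      n = 0
      for (m in media) n++
      if (copies < 3)  { print "fail: fewer than 3 copies"; exit 1 }
      if (n < 2)       { print "fail: fewer than 2 media types"; exit 1 }
      if (offsite < 1) { print "fail: no offsite copy"; exit 1 }
      print "ok"
    }
  '
}

# Example inventory matching the three copies listed above
check_321 <<'EOF'
local-snapshot   disk    onsite
remote-replica   disk    offsite
offsite-archive  object  offsite
EOF
```

Running a check like this from CI or cron turns the rule from a slogan into something that fails loudly when a copy is retired and nobody replaces it.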

What to Back Up

Always:

  • Databases (obviously)
  • Configuration files
  • SSL certificates and keys
  • Application state/uploads
  • Infrastructure as Code

Often forgotten:

  • Secrets and credentials
  • DNS records
  • CI/CD pipeline configs
  • Monitoring dashboards
  • Documentation

# Document what's backed up and where
cat > /etc/backup-manifest.yml << 'EOF'
databases:
  postgres_main:
    schedule: hourly
    retention: 30 days
    location: s3://backups/postgres/
  
configs:
  nginx:
    schedule: daily
    retention: 90 days
    location: s3://backups/configs/
    
secrets:
  vault_data:
    schedule: daily
    retention: 365 days
    location: s3://backups/vault/   # encrypted
EOF

Database Backups

PostgreSQL

# Logical backup (portable, slower)
pg_dump -h localhost -U postgres mydb | gzip > mydb_$(date +%Y%m%d_%H%M%S).sql.gz

# Physical backup (faster; restore requires the same major version)
pg_basebackup -h localhost -D /backup/base -Ft -z -P

Continuous Archiving (Point-in-Time Recovery)

# postgresql.conf
archive_mode = on
archive_command = 'aws s3 cp %p s3://wal-archive/%f'

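Combined with a base backup (such as the pg_basebackup output above), the archived WAL lets you restore to any point in time after that base backup, not just to the moment a snapshot was taken.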
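
Archiving is only half of point-in-time recovery: the restore side needs a `restore_command` that fetches WAL segments back, plus a target time. The following is a minimal sketch, assuming PostgreSQL 12 or later (where recovery settings live in postgresql.conf and a `recovery.signal` file triggers recovery), a base backup already unpacked into the data directory, and the same bucket as the `archive_command` above. The function name `write_pitr_config` is hypothetical.

```shell
#!/bin/sh
# write_pitr_config DATA_DIR TARGET_TIME
# Appends recovery settings to DATA_DIR/postgresql.conf and creates
# recovery.signal so the server enters targeted recovery on next start.
# Assumes PostgreSQL 12+ and a base backup already restored to DATA_DIR.
write_pitr_config() {
  data_dir=$1
  target_time=$2
  cat >> "$data_dir/postgresql.conf" <<EOF
# Point-in-time recovery settings
restore_command = 'aws s3 cp s3://wal-archive/%f %p'
recovery_target_time = '$target_time'
recovery_target_action = 'promote'
EOF
  # An empty recovery.signal file tells PostgreSQL to start in recovery
  : > "$data_dir/recovery.signal"
}
```

A call might look like `write_pitr_config /var/lib/postgresql/data '2024-01-15 14:30:00+00'` (the path and timestamp are examples), followed by starting the server and watching the log until it reports that recovery has reached the target.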

MySQL

# Logical backup (--single-transaction gives a consistent snapshot for InnoDB tables only)
mysqldump --all-databases --single-transaction | gzip > mysql_$(date +%Y%m%d).sql.gz

# Physical backup with Percona XtraBackup
xtrabackup --backup --target-dir=/backup/mysql/

File System Backups

# Initialize repository
restic -r s3:s3.amazonaws.com/my-backups init

# Backup
restic -r s3:s3.amazonaws.com/my-backups backup /var/www /etc/nginx

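# Incremental by default, deduplication built-in

# Automate with cron (hourly; cron cannot answer the password prompt,
# so set RESTIC_PASSWORD_FILE in the crontab or a wrapper script)
0 * * * * restic -r s3:s3.amazonaws.com/my-backups backup /var/www --quiet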

rsync for Simple Cases

# Mirror to backup server
rsync -avz --delete /var/www/ backup-server:/backups/www/

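# With hard links for space-efficient versioning
rsync -avz --delete --link-dest=/backups/www/latest /var/www/ /backups/www/$(date +%Y%m%d)/
# Point "latest" at the new snapshot so the next run links against it
ln -sfn /backups/www/$(date +%Y%m%d) /backups/www/latest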
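
Dated snapshot directories accumulate, so a hard-link scheme like this needs a pruning step. Here is a rough sketch, assuming snapshots are directories named YYYYMMDD directly under the backup root, as the rsync command above produces; `prune_snapshots` is a hypothetical helper, not part of rsync.

```shell
#!/bin/sh
# prune_snapshots ROOT KEEP
# Deletes all but the KEEP newest YYYYMMDD directories under ROOT.
# The names sort chronologically, so a plain reverse sort is enough.
# Anything not named like a date (for example a "latest" link) is ignored.
prune_snapshots() {
  root=$1
  keep=$2
  ls -1 "$root" \
    | grep -E '^[0-9]{8}$' \
    | sort -r \
    | tail -n +"$((keep + 1))" \
    | while read -r dir; do
        # ${root:?} aborts instead of expanding to "/<dir>" if root is empty
        rm -rf -- "${root:?}/$dir"
      done
}
```

For example, `prune_snapshots /backups/www 30` would keep the thirty newest days. Deleting an old snapshot is safe for the newer ones: a hard-linked file's data stays on disk until the last directory entry pointing at it is removed.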

Cloud-Native Backups

AWS

# RDS automated backups
aws rds modify-db-instance \
  --db-instance-identifier mydb \
  --backup-retention-period 30

# EBS snapshots
aws ec2 create-snapshot \
  --volume-id vol-1234567890 \
  --description "Daily backup $(date +%Y%m%d)"

# S3 versioning (not a backup on its own, but it lets you recover overwritten or deleted objects)
aws s3api put-bucket-versioning \
  --bucket my-bucket \
  --versioning-configuration Status=Enabled

Kubernetes

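# Velero for cluster backup
# (plugin image tag, region, and credentials file depend on your setup)
velero install --provider aws \
  --plugins "velero/velero-plugin-for-aws:$VELERO_AWS_PLUGIN_VERSION" \
  --bucket velero-backups \
  --backup-location-config "region=$AWS_REGION" \
  --secret-file ./credentials-velero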

# Backup namespace
velero backup create app-backup --include-namespaces production

# Schedule daily backups
velero schedule create daily --schedule="0 2 * * *" --include-namespaces production

Retention Policies

Keep more recent backups, fewer old ones:

retention:
  hourly: 24      # Last 24 hours
  daily: 30       # Last 30 days
  weekly: 12      # Last 12 weeks
  monthly: 12     # Last 12 months
  yearly: 5       # Last 5 years

# Restic retention (same policy; --prune also deletes the unreferenced data)
restic -r s3:s3.amazonaws.com/my-backups forget \
  --keep-hourly 24 \
  --keep-daily 30 \
  --keep-weekly 12 \
  --keep-monthly 12 \
  --keep-yearly 5 \
  --prune

Encryption

Backups are a security risk: each one is a complete copy of your data stored outside production. Encrypt everything:

# Restic encrypts by default (repository password)
restic -r s3:s3.amazonaws.com/my-backups init
# Prompts: enter password for new repository

# GPG for manual encryption
pg_dump mydb | gzip | gpg --encrypt -r backup@company.com > backup.sql.gz.gpg

Store encryption keys separately from backups. A backup you can’t decrypt is useless.

Testing Restores

The backup you don’t test doesn’t exist.

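#!/bin/bash
# Monthly restore drill
set -euo pipefail

echo "Starting restore test..."

# Restore to test environment
restic -r s3:s3.amazonaws.com/my-backups restore latest --target /tmp/restore-test

# Verify database (pg_restore expects a custom-format dump made with
# pg_dump -Fc, and an existing, empty target database)
dropdb --if-exists restore_test
createdb restore_test
pg_restore -d restore_test /tmp/restore-test/postgres/backup.dump
psql restore_test -c "SELECT count(*) FROM critical_table;"

# Record checksums of the restored files for comparison with the source
find /tmp/restore-test -type f -exec md5sum {} + > /tmp/restore-checksums.txt

echo "Restore test completed successfully"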

Schedule restore tests. Document results. Fix failures before they matter.
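
A checksum list only helps if there is something to compare it against. One option, sketched here on the assumption that GNU coreutils `sha256sum` is available, is to write a manifest alongside the data at backup time and verify it after each restore. The function names and the `MANIFEST.sha256` filename are hypothetical.

```shell
#!/bin/sh
# write_manifest DIR: record SHA-256 checksums of every file under DIR
# into DIR/MANIFEST.sha256, using paths relative to DIR so the manifest
# still works after the tree is restored somewhere else.
write_manifest() {
  (
    cd "$1" || exit 1
    find . -type f ! -name MANIFEST.sha256 -exec sha256sum {} + \
      | sort -k 2 > MANIFEST.sha256
  )
}

# verify_manifest DIR: re-check every file listed in the manifest.
# Returns non-zero if any file is missing or its content differs.
verify_manifest() {
  (
    cd "$1" || exit 1
    sha256sum --check --quiet MANIFEST.sha256
  )
}
```

In a drill, `write_manifest /var/www` would run just before the backup and `verify_manifest /tmp/restore-test/var/www` just after the restore (both paths are examples). A non-zero exit fails the drill.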

Monitoring Backups

Backups fail silently. Monitor them:

# Check last backup age (date -d is GNU date syntax)
last_backup=$(restic -r s3:s3.amazonaws.com/my-backups snapshots --json | jq -r 'max_by(.time).time')
age_hours=$(( ($(date +%s) - $(date -d "$last_backup" +%s)) / 3600 ))

if [ "$age_hours" -gt 24 ]; then
  echo "ALERT: Last backup is $age_hours hours old"
  exit 1
fi

Alert on:

  • Backup job failures
  • Backup age exceeding threshold
  • Backup size anomalies (sudden drop = problem)
  • Restore test failures
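
Of the items above, size anomalies are the easiest to overlook, because the job itself still reports success. A minimal sketch of the comparison, assuming you record each backup's size in bytes somewhere; the function name and the 50% default threshold are illustrative choices, not recommendations.

```shell
#!/bin/sh
# size_anomaly PREVIOUS_BYTES CURRENT_BYTES [MAX_DROP_PERCENT]
# Exits 0 and prints an alert when the current backup is smaller than
# the previous one by more than MAX_DROP_PERCENT (default 50).
# Exits 1 when the size looks normal or there is nothing to compare.
size_anomaly() {
  prev=$1
  cur=$2
  max_drop=${3:-50}
  # First run: no previous size to compare against
  [ "$prev" -gt 0 ] || return 1
  # Integer percentage by which the backup shrank (negative if it grew)
  drop=$(( (prev - cur) * 100 / prev ))
  if [ "$drop" -gt "$max_drop" ]; then
    echo "ALERT: backup shrank by ${drop}% (${prev} -> ${cur} bytes)"
    return 0
  fi
  return 1
}
```

Used as `if size_anomaly "$yesterday_bytes" "$today_bytes"; then ...; fi`, it slots into the same alerting path as the age check. A database dump that suddenly halves in size usually means a table went missing or the dump was cut short.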

Disaster Recovery Plan

Document the full recovery process:

# Disaster Recovery Runbook

## Scenario: Complete datacenter loss

### Prerequisites
- AWS credentials in 1Password
- Terraform state in S3 (us-west-2)
- Backups in S3 (eu-west-1)

### Steps
1. Provision new infrastructure
   - `cd terraform/dr && terraform apply`
   
2. Restore database
   - Download latest backup: `restic restore latest --target /tmp/restore` (repository taken from `RESTIC_REPOSITORY`)
   - Import: `pg_restore -d production backup.dump`
   
3. Deploy application
   - `kubectl apply -f k8s/`
   
4. Update DNS
   - Point api.example.com to new load balancer
   
### Verification
- [ ] API responds at /health
- [ ] Database connectivity confirmed
- [ ] User can log in

### Expected RTO: 2 hours
### Expected RPO: 1 hour (hourly backups)

Practice the runbook. Time it. Improve it.

RTO and RPO

  • RPO (Recovery Point Objective): How much data can you lose? (Determines backup frequency)
  • RTO (Recovery Time Objective): How long can you be down? (Determines restore speed)

RPO          Backup Strategy
24 hours     Daily backups
1 hour       Hourly snapshots
5 minutes    Continuous replication
0            Synchronous replication
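
The table above can be read as a lookup: given a target RPO, pick the cheapest strategy whose data-loss window still fits. The sketch below encodes only the table's four rows; the function name is hypothetical, and an RPO that falls between two rows is rounded to the stricter one, which is a conservative assumption rather than a rule.

```shell
#!/bin/sh
# strategy_for_rpo MINUTES
# Prints the least aggressive strategy from the table whose data-loss
# window fits within the given RPO (expressed in whole minutes).
strategy_for_rpo() {
  rpo=$1
  if   [ "$rpo" -ge 1440 ]; then echo "Daily backups"
  elif [ "$rpo" -ge 60 ];   then echo "Hourly snapshots"
  elif [ "$rpo" -ge 5 ];    then echo "Continuous replication"
  else                           echo "Synchronous replication"
  fi
}
```

For instance, `strategy_for_rpo 240` prints "Hourly snapshots": a four-hour RPO cannot be met by daily backups, so the next row down applies.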

Backups are insurance. You pay for them hoping to never need them. But unlike insurance, you can test whether they’ll actually pay out.

Back up everything. Encrypt everything. Test restores regularly. Document recovery procedures. The crisis that finds you prepared is the one you survive.