Everyone has backups. Few people have tested restores. Here’s how to build backup strategies that work when disaster strikes.

The 3-2-1 Rule

The foundation of backup strategy:

  • 3 copies of your data
  • 2 different storage types
  • 1 copy offsite

Example:

  1. Production database (primary)
  2. Local backup server (different disk)
  3. S3 in another region (offsite)

This survives disk failure, server failure, and site failure.

Database Backups

PostgreSQL

Logical backups (pg_dump):

# Full database
pg_dump -Fc mydb > backup.dump

# Specific tables
pg_dump -Fc -t users -t orders mydb > partial.dump

# Schema only
pg_dump -Fc --schema-only mydb > schema.dump

Physical backups (pg_basebackup):

pg_basebackup -D /backup/base -Fp -Xs -P

Physical backups are faster to restore for large databases.

Continuous archiving (WAL):

# postgresql.conf
archive_mode = on
archive_command = 'aws s3 cp %p s3://backups/wal/%f'

Combined with base backups, this enables point-in-time recovery (PITR).
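To recover to a point in time, restore a base backup and replay archived WAL up to a target. A sketch of the recovery settings (PostgreSQL 12+; the bucket path mirrors the archive_command above, and the target time is a placeholder):

```
# postgresql.conf on the recovery host
restore_command = 'aws s3 cp s3://backups/wal/%f %p'
recovery_target_time = '2024-06-01 03:00:00'

# then create an empty recovery.signal file in the data directory
# and start the server; PostgreSQL replays WAL up to the target time
```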

Automated PostgreSQL Backup Script

#!/bin/bash
set -euo pipefail

DB_NAME="production"
BACKUP_DIR="/backups"
S3_BUCKET="s3://company-backups/postgres"
RETENTION_DAYS=30

# Create timestamped backup
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
BACKUP_FILE="${BACKUP_DIR}/${DB_NAME}_${TIMESTAMP}.dump"

pg_dump -Fc "$DB_NAME" > "$BACKUP_FILE"

# Compress and upload (-Fc dumps are already compressed; gzip mostly normalizes naming)
gzip "$BACKUP_FILE"
aws s3 cp "${BACKUP_FILE}.gz" "${S3_BUCKET}/"

# Verify upload
if aws s3 ls "${S3_BUCKET}/${DB_NAME}_${TIMESTAMP}.dump.gz"; then
    echo "Backup verified in S3"
else
    echo "BACKUP VERIFICATION FAILED" >&2
    exit 1
fi

# Clean old local backups
find "$BACKUP_DIR" -name "*.dump.gz" -mtime +7 -delete

# Clean old S3 backups (lifecycle policy is better, but this works)
aws s3 ls "$S3_BUCKET/" | while read -r line; do
    file_date=$(echo "$line" | awk '{print $1}')
    if [[ $(date -d "$file_date" +%s) -lt $(date -d "-${RETENTION_DAYS} days" +%s) ]]; then
        file_name=$(echo "$line" | awk '{print $4}')
        aws s3 rm "${S3_BUCKET}/${file_name}"
    fi
done

echo "Backup complete: ${BACKUP_FILE}.gz"
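As the script's comment notes, an S3 lifecycle policy is the more robust way to expire old backups. A sketch, saved as lifecycle.json (bucket and prefix match the script above):

```
{
  "Rules": [
    {
      "ID": "expire-postgres-backups",
      "Filter": { "Prefix": "postgres/" },
      "Status": "Enabled",
      "Expiration": { "Days": 30 }
    }
  ]
}
```

Apply it once with `aws s3api put-bucket-lifecycle-configuration --bucket company-backups --lifecycle-configuration file://lifecycle.json` and drop the cleanup loop.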

MySQL/MariaDB

# Logical backup
mysqldump --single-transaction --routines --triggers \
  -u root -p mydb > backup.sql

# Binary backup with Percona XtraBackup
xtrabackup --backup --target-dir=/backup/full
xtrabackup --prepare --target-dir=/backup/full

MongoDB

# Logical backup
mongodump --uri="mongodb://localhost:27017" --out=/backup

# Continuous backup with oplog
mongodump --uri="mongodb://localhost:27017" --oplog --out=/backup

Application Data Backups

Files and Assets

# rsync to backup server
rsync -avz --delete /var/www/uploads/ backup-server:/backups/uploads/

# To S3
aws s3 sync /var/www/uploads/ s3://backups/uploads/ --delete

Configuration

# Git is your backup
cd /etc
git init
git add -A
git commit -m "Configuration backup $(date)"
git push origin main

Or use etckeeper for automatic commits.

Secrets

Never back up secrets in plain text. Use:

  • Vault export (encrypted)
  • AWS Secrets Manager (versioned automatically)
  • Sealed backups with age/SOPS
# Backup secrets encrypted
vault kv get -format=json secret/app | \
  age -r "$BACKUP_KEY" > secrets_backup.age

Kubernetes Backups

Velero

The standard for Kubernetes backup:

# Install Velero
velero install \
  --provider aws \
  --bucket velero-backups \
  --secret-file ./credentials

# Backup entire cluster
velero backup create full-backup

# Backup specific namespace
velero backup create app-backup --include-namespaces production

# Schedule daily backups
velero schedule create daily --schedule="0 2 * * *"

What to Back Up

# Velero backup spec
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: production-backup
spec:
  includedNamespaces:
    - production
    - monitoring
  includedResources:
    - deployments
    - services
    - configmaps
    - secrets
    - persistentvolumeclaims
  excludedResources:
    - events
    - pods
  storageLocation: default
  ttl: 720h  # 30 days

etcd Backup (Critical)

# Backup etcd
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-snapshot.db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Verify
ETCDCTL_API=3 etcdctl snapshot status /backup/etcd-snapshot.db
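Snapshots only help if they happen on a schedule. A cron sketch, assuming the etcdctl command above is wrapped in a script at /usr/local/bin/etcd-snapshot.sh:

```
# /etc/cron.d/etcd-backup -- snapshot every 6 hours
0 */6 * * * root /usr/local/bin/etcd-snapshot.sh
```

Ship each snapshot offsite as well; a local-only etcd backup dies with the node.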

Cloud-Native Backups

AWS RDS

# Create manual snapshot
aws rds create-db-snapshot \
  --db-instance-identifier mydb \
  --db-snapshot-identifier mydb-$(date +%Y%m%d)

# Enable automated backups
aws rds modify-db-instance \
  --db-instance-identifier mydb \
  --backup-retention-period 30 \
  --preferred-backup-window "03:00-04:00"

S3 Versioning

# Enable versioning
aws s3api put-bucket-versioning \
  --bucket my-bucket \
  --versioning-configuration Status=Enabled

# List versions
aws s3api list-object-versions --bucket my-bucket --prefix important-file

Cross-Region Replication

# Terraform
resource "aws_s3_bucket_replication_configuration" "backup" {
  bucket = aws_s3_bucket.primary.id
  role   = aws_iam_role.replication.arn

  rule {
    status = "Enabled"
    destination {
      bucket        = aws_s3_bucket.backup.arn
      storage_class = "STANDARD_IA"
    }
  }
}

Testing Restores

The most important part of backups: testing them.

Monthly Restore Drill

#!/bin/bash
# restore-test.sh
set -euo pipefail

echo "Starting monthly restore test..."

# Create test environment
createdb restore_test

# Restore latest backup
LATEST=$(aws s3 ls s3://backups/postgres/ | sort | tail -1 | awk '{print $4}')
aws s3 cp "s3://backups/postgres/${LATEST}" /tmp/
gunzip /tmp/"${LATEST}"
pg_restore -d restore_test /tmp/"${LATEST%.gz}"

# Run verification queries
psql restore_test -c "SELECT COUNT(*) FROM users;" > /tmp/restore_counts
psql production -c "SELECT COUNT(*) FROM users;" >> /tmp/restore_counts

# Compare (should be very close)
echo "User counts:"
cat /tmp/restore_counts

# Cleanup
dropdb restore_test

echo "Restore test complete. Review counts above."
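The "should be very close" comparison can be made mechanical. A small sketch (the function name and tolerance are mine) that succeeds when two row counts differ by no more than a threshold:

```shell
# Succeed if two counts differ by at most a tolerance (default 100 rows).
counts_close() {
  local a=$1 b=$2 tol=${3:-100}
  local diff=$(( a - b ))
  [ "${diff#-}" -le "$tol" ]  # ${diff#-} strips the sign: absolute value
}

if counts_close 100000 99950; then echo "counts OK"; fi          # prints: counts OK
if ! counts_close 100000 90000; then echo "counts DIVERGED"; fi  # prints: counts DIVERGED
```

Wired into the drill, this turns an eyeball check into a pass/fail exit code.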

Automated Restore Verification

# GitHub Actions
name: Backup Verification
on:
  schedule:
    - cron: '0 6 * * 0'  # Weekly

jobs:
  verify:
    runs-on: ubuntu-latest
    steps:
      - name: Download latest backup
        run: |
          aws s3 cp s3://backups/postgres/latest.dump.gz .
          gunzip latest.dump.gz
          
      - name: Restore to test database
        run: |
          createdb test_restore
          pg_restore -d test_restore latest.dump
          
      - name: Verify data integrity
        run: |
          psql test_restore -c "SELECT COUNT(*) FROM users;"
          psql test_restore -c "SELECT MAX(created_at) FROM orders;"
          
      - name: Alert on failure
        if: failure()
        run: |
          curl -X POST $SLACK_WEBHOOK -d '{"text":"⚠️ Backup verification failed!"}'

Disaster Recovery Runbook

## Database Recovery Procedure

### Prerequisites
- Access to AWS console or CLI
- Database credentials
- Latest backup location

### Steps

1. **Assess the damage**
   - What data is affected?
   - What's the recovery point objective (RPO)?
   - What's the recovery time objective (RTO)?

2. **Stop writes to prevent further damage**
   ```bash
   # Scale down application
   kubectl scale deployment app --replicas=0
   ```

3. **Identify the correct backup**
   ```bash
   aws s3 ls s3://backups/postgres/ | tail -10
   # Choose a backup from BEFORE the incident
   ```

4. **Restore to a new instance**
   ```bash
   createdb recovery_db
   pg_restore -d recovery_db backup.dump
   ```

5. **Verify the restored data**
   ```bash
   psql recovery_db -c "SELECT COUNT(*) FROM critical_table;"
   ```

6. **Swap databases**
   ```bash
   psql -c "ALTER DATABASE production RENAME TO production_damaged;"
   psql -c "ALTER DATABASE recovery_db RENAME TO production;"
   ```

7. **Restore the application**
   ```bash
   kubectl scale deployment app --replicas=3
   ```

8. **Verify application health**
   - Check error rates
   - Verify critical flows
   - Monitor for issues

9. **Post-incident**
   - Document what happened
   - Update the runbook if needed
   - Schedule a post-mortem
Monitoring Backups

Alert when backups silently stop succeeding. Prometheus rule:

# Fire when the last successful backup is more than 24 hours old
groups:
  - name: backups
    rules:
      - alert: BackupTooOld
        expr: time() - backup_last_success_timestamp > 86400
        labels:
          severity: critical
        annotations:
          summary: "Backup older than 24 hours"

Backup Dashboard

Track:

  • Last successful backup time
  • Backup size (sudden changes = problems)
  • Restore test results
  • Backup duration trends
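"Last successful backup time" is the number worth alerting on, because backups fail silently. A minimal shell sketch (function name and threshold are mine) that flags staleness from an epoch timestamp:

```shell
# Succeed if the last backup is older than max_age seconds (default 24h).
backup_is_stale() {
  local last_epoch=$1 max_age=${2:-86400}
  local now
  now=$(date +%s)
  [ $(( now - last_epoch )) -gt "$max_age" ]
}

two_days_ago=$(( $(date +%s) - 172800 ))
if backup_is_stale "$two_days_ago"; then
  echo "ALERT: backup is stale"   # fires here: 172800s > 86400s
fi
```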

The Checklist

  • 3-2-1 rule implemented
  • Automated backup schedule
  • Backups encrypted at rest
  • Cross-region/offsite copy
  • Retention policy defined
  • Restore tested monthly
  • Recovery runbook documented
  • Monitoring and alerting active
  • RTO/RPO defined and achievable

Start Here

  1. Today: Verify your current backups exist
  2. This week: Test a restore
  3. This month: Automate backup verification
  4. This quarter: Document and drill disaster recovery

The best backup is one you’ve restored. Everything else is just hope.


Backups are not about preventing data loss. They’re about recovering from it. The difference is testing.