The only backup that matters is the one you can restore. Everything else is wishful thinking with storage costs. Here’s how to build backup systems that work when disaster strikes.

The 3-2-1 Rule (Still Valid)

The classic backup rule holds up:

  • 3 copies of your data
  • 2 different storage types
  • 1 offsite location

Modern interpretation:

Primary:  Production data (SSD)
Local:    Daily snapshots (different volume)
Offsite:  S3/GCS/Azure Blob (different region)

This protects against:

  • Hardware failure (local backup)
  • Site disaster (offsite backup)
  • Corruption spreading to replicas (time-delayed copies)
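
The fan-out itself is a few lines of shell. A minimal sketch with hypothetical paths; the offsite copy is simulated with a second directory here, where a real setup would push to S3/GCS/Azure with `aws s3 cp` or rclone:

```shell
#!/bin/bash
# backup-321.sh - sketch of the 3-2-1 fan-out. The offsite step is
# simulated with a plain directory; in production it would be e.g.
#   aws s3 cp "$src" "s3://backups/$name.$stamp"
backup_321() {
    local src=$1 local_dir=$2 offsite_dir=$3
    local stamp name
    stamp=$(date +%Y-%m-%d_%H-%M)
    name=$(basename "$src")

    mkdir -p "$local_dir" "$offsite_dir"
    cp "$src" "$local_dir/$name.$stamp"      # copy 2: different volume
    cp "$src" "$offsite_dir/$name.$stamp"    # copy 3: offsite stand-in
}
```

Copy 1 is the live data itself; the function only creates the other two.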

What to Back Up

Databases

# PostgreSQL - logical backup
pg_dump -Fc dbname > backup.dump

# PostgreSQL - base backup for PITR
pg_basebackup -D /backup/base -Fp -Xs -P

# MySQL
mysqldump --single-transaction --routines dbname > backup.sql

# MongoDB
mongodump --archive=backup.archive --gzip

Critical: Include schema, stored procedures, and permissions—not just data.

Application State

# What to capture
/etc/                    # System config
/home/*/                 # User data
/var/lib/docker/volumes/ # Container data
/opt/app/uploads/        # User uploads

Don’t forget:

  • SSL certificates and keys
  • Environment files (.env)
  • Cron jobs (crontab -l > crons.txt)
  • systemd units
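
Those stragglers can be swept into a single archive on each run. A sketch, with the capture list passed in as arguments (on a real host they would be the paths listed above):

```shell
#!/bin/bash
# snapshot-config.sh - bundle the easy-to-forget pieces (certs, .env
# files, unit files) plus the current crontab into one tarball.
snapshot_config() {
    local archive=$1; shift
    local staging
    staging=$(mktemp -d)

    # Cron jobs are state, not files - export them explicitly
    crontab -l > "$staging/crontab.txt" 2>/dev/null || true

    # Copy each requested path into the staging area, skipping absentees
    local path
    for path in "$@"; do
        [ -e "$path" ] && cp -a "$path" "$staging/" || true
    done

    tar czf "$archive" -C "$staging" .
    rm -rf "$staging"
}
```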

Infrastructure as Code

# Terraform state
terraform state pull > terraform.tfstate.backup

# Or better: use remote state with versioning
terraform {
  backend "s3" {
    bucket         = "terraform-state"
    key            = "prod/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-locks"
  }
}

Backup Frequency

Match backup frequency to acceptable data loss:

Data Type      RPO       Backup Frequency
Transactions   Minutes   Continuous/WAL
User data      Hours     Hourly
Config         Daily     Daily
Archives       Weekly    Weekly

RPO = Recovery Point Objective = How much data loss is acceptable

# Example cron schedule
0 * * * *   /scripts/backup-hourly.sh   # User uploads
0 3 * * *   /scripts/backup-daily.sh    # Full database
0 4 * * 0   /scripts/backup-weekly.sh   # Everything

Backup Verification

Backups without verification are just hope:

Integrity Checks

#!/bin/bash
# verify-backup.sh

BACKUP_FILE=$1

# Check file exists and has size
if [[ ! -s "$BACKUP_FILE" ]]; then
    echo "ERROR: Backup file missing or empty"
    exit 1
fi

# Verify archive integrity
if [[ "$BACKUP_FILE" == *.gz ]]; then
    gzip -t "$BACKUP_FILE" || exit 1
fi

# For pg_dump custom format
if [[ "$BACKUP_FILE" == *.dump ]]; then
    pg_restore --list "$BACKUP_FILE" > /dev/null || exit 1
fi

echo "Backup verified: $BACKUP_FILE"

Restore Testing

#!/bin/bash
# test-restore.sh - Run weekly

# Spin up test database
docker run -d --name restore-test \
    -e POSTGRES_PASSWORD=test \
    postgres:15

sleep 10

# Restore latest backup
LATEST=$(ls -t /backups/*.dump | head -1)
pg_restore -h localhost -U postgres -d postgres "$LATEST"

# Run validation queries
psql -h localhost -U postgres -c "SELECT COUNT(*) FROM users;"
psql -h localhost -U postgres -c "SELECT COUNT(*) FROM orders;"

# Compare counts to production
# Alert if significantly different

# Cleanup
docker rm -f restore-test

Schedule restore tests. Monthly at minimum. Quarterly for full disaster recovery drills.

Retention Policies

Keep backups long enough, but not forever:

# retention.conf
hourly_keep=24      # Keep 24 hourly backups
daily_keep=7        # Keep 7 daily backups
weekly_keep=4       # Keep 4 weekly backups
monthly_keep=12     # Keep 12 monthly backups

Implementation:

#!/bin/bash
# cleanup-old-backups.sh

BACKUP_DIR=/backups

# Remove hourly backups older than 24 hours (-type f so the
# directories themselves are never matched by -delete)
find "$BACKUP_DIR/hourly" -type f -mtime +1 -delete

# Remove daily backups older than 7 days
find "$BACKUP_DIR/daily" -type f -mtime +7 -delete

# Remove weekly backups older than 28 days
find "$BACKUP_DIR/weekly" -type f -mtime +28 -delete

# Remove monthly backups older than 365 days
find "$BACKUP_DIR/monthly" -type f -mtime +365 -delete

Encryption

Backups contain your most sensitive data. Encrypt them:

# Encrypt with GPG
pg_dump dbname | gzip | gpg --encrypt -r backup@company.com > backup.sql.gz.gpg

# Encrypt with OpenSSL (symmetric)
pg_dump dbname | gzip | openssl enc -aes-256-cbc -salt -pbkdf2 \
    -pass file:/etc/backup-key > backup.sql.gz.enc

# Decrypt
openssl enc -d -aes-256-cbc -pbkdf2 \
    -pass file:/etc/backup-key < backup.sql.gz.enc | gunzip > backup.sql

Key management:

  • Store encryption keys separately from backups
  • Use a secrets manager (Vault, AWS Secrets Manager)
  • Have key recovery procedures documented and tested
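
One low-tech safeguard worth wiring into every encrypted backup job: refuse to run when the key is unreadable, instead of quietly producing something you can never decrypt. A sketch; the `BACKUP_KEY_FILE` variable and default path are illustrative names:

```shell
#!/bin/bash
# require-backup-key.sh - guard clause for encrypted backup jobs.
# BACKUP_KEY_FILE and the /etc/backup-key default are illustrative.
require_backup_key() {
    local key_file=${BACKUP_KEY_FILE:-/etc/backup-key}
    if [ ! -s "$key_file" ]; then
        echo "ERROR: encryption key missing or empty at $key_file" >&2
        return 1
    fi
    echo "OK: key present at $key_file"
}
```

Call it at the top of the backup script so a missing key fails the job loudly rather than downgrading it silently.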

Cloud Backup Patterns

S3 with Lifecycle Rules

# Upload with server-side encryption
aws s3 cp backup.tar.gz s3://backups/daily/ \
    --sse AES256 \
    --storage-class STANDARD_IA
{
  "Rules": [
    {
      "ID": "TransitionToGlacier",
      "Status": "Enabled",
      "Transitions": [
        {
          "Days": 30,
          "StorageClass": "GLACIER"
        }
      ],
      "Expiration": {
        "Days": 365
      }
    }
  ]
}

Cross-Region Replication

# Terraform - replicate to another region
resource "aws_s3_bucket_replication_configuration" "backup" {
  bucket = aws_s3_bucket.backup.id
  role   = aws_iam_role.replication.arn

  rule {
    id     = "replicate-all"
    status = "Enabled"

    destination {
      bucket        = aws_s3_bucket.backup_replica.arn
      storage_class = "STANDARD_IA"
    }
  }
}

Database-Specific Strategies

PostgreSQL Point-in-Time Recovery

# Enable WAL archiving in postgresql.conf
archive_mode = on
archive_command = 'aws s3 cp %p s3://wal-archive/%f'

# Recovery: restore base backup + replay WAL
pg_basebackup -D /var/lib/postgresql/data -Fp -Xs -P

# recovery.signal triggers recovery mode
touch /var/lib/postgresql/data/recovery.signal

# postgresql.auto.conf
restore_command = 'aws s3 cp s3://wal-archive/%f %p'
recovery_target_time = '2026-03-05 10:30:00'

MySQL Binary Log Backup

# Enable binary logging in my.cnf
[mysqld]
log_bin = /var/log/mysql/mysql-bin
binlog_expire_logs_seconds = 604800

# Backup binlogs
mysqlbinlog --read-from-remote-server \
    --host=db.example.com \
    --raw \
    --result-file=/backups/binlog/ \
    mysql-bin.000001

Monitoring Backups

Backups fail silently. Monitor them:

#!/bin/bash
# check-backup-freshness.sh

MAX_AGE_HOURS=25
BACKUP_DIR=/backups/daily

LATEST=$(ls -t "$BACKUP_DIR"/*.dump 2>/dev/null | head -1)

if [[ -z "$LATEST" ]]; then
    echo "CRITICAL: No backups found"
    exit 2
fi

AGE_HOURS=$(( ($(date +%s) - $(stat -c %Y "$LATEST")) / 3600 ))

if [[ $AGE_HOURS -gt $MAX_AGE_HOURS ]]; then
    echo "WARNING: Latest backup is ${AGE_HOURS}h old"
    exit 1
fi

echo "OK: Latest backup is ${AGE_HOURS}h old"
exit 0

Alert on:

  • Backup job failures
  • Backup age exceeding threshold
  • Backup size anomalies (suddenly much smaller = probably broken)
  • Restore test failures
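
The size-anomaly check deserves a sketch, since it is the one that catches silent truncation. Compare the newest backup against its predecessor and flag a sharp shrink; the 50% threshold is arbitrary, so tune it to your data's churn:

```shell
#!/bin/bash
# check-backup-size.sh - flag a backup that shrank suspiciously.
# A new backup under half the size of its predecessor is treated as
# probably broken; the 50% threshold is a judgment call.
check_backup_size() {
    local dir=$1
    local latest previous latest_size prev_size
    latest=$(ls -t "$dir" | sed -n 1p)
    previous=$(ls -t "$dir" | sed -n 2p)
    if [ -z "$previous" ]; then
        echo "OK: nothing to compare against yet"
        return 0
    fi
    latest_size=$(wc -c < "$dir/$latest")
    prev_size=$(wc -c < "$dir/$previous")
    if [ "$latest_size" -lt $(( prev_size / 2 )) ]; then
        echo "WARNING: $latest is ${latest_size}B; previous was ${prev_size}B"
        return 1
    fi
    echo "OK: $latest is ${latest_size}B"
}
```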

Disaster Recovery Runbook

Document the restore process before you need it:

# Database Restore Runbook

## Prerequisites
- Access to S3 backup bucket
- Database admin credentials
- Target server with PostgreSQL installed

## Steps

1. Identify latest good backup

aws s3 ls s3://backups/daily/ --recursive | sort | tail -5

2. Download backup

aws s3 cp s3://backups/daily/YYYY-MM-DD.dump /tmp/

3. Stop application servers

systemctl stop myapp

4. Restore database

pg_restore -h localhost -U postgres -d mydb --clean /tmp/YYYY-MM-DD.dump

5. Verify restore

psql -c "SELECT COUNT(*) FROM users;"

6. Restart application

systemctl start myapp

## Contacts
- DBA on-call:
- Cloud access:

Test the runbook. If you can’t follow it during a drill, you can’t follow it during a crisis.

Common Failures

Backup succeeds, restore fails

  • Test restores regularly
  • Verify backup integrity after creation

Backups exist but are corrupted

  • Checksums on every backup
  • Periodic integrity verification

Can’t find the right backup

  • Clear naming conventions: dbname_YYYY-MM-DD_HH-MM.dump
  • Metadata tags in cloud storage
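
A helper that emits the convention keeps every script agreeing on it. Trivial, but centralizing the format is the point; `orders` below is a placeholder database name:

```shell
#!/bin/bash
# backup-name.sh - single source of truth for the naming convention
# dbname_YYYY-MM-DD_HH-MM.dump, so cleanup scripts, freshness checks,
# and humans can all sort and parse backup names the same way.
backup_name() {
    local dbname=$1
    echo "${dbname}_$(date +%Y-%m-%d_%H-%M).dump"
}
```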

Restore takes too long

  • Know your RTO (Recovery Time Objective)
  • Test restore duration
  • Consider faster restore options (snapshots vs logical backups)

Encryption key lost

  • Key escrow or recovery procedures
  • Document key locations

The Test That Matters

Once a quarter, answer this question:

“If our production database disappeared right now, how long until we’re back online with yesterday’s data?”

If you don’t know the answer, your backup strategy has a hole. Fill it before you need it.