Everyone has backups. Almost no one tests them.

Don’t be everyone.

The 3-2-1 Rule

The foundation of backup strategy:

  • 3 copies of your data
  • 2 different storage types
  • 1 offsite location
Primary Data (production server)
  ├── Local Backup (different server)
  └── Offsite Backup (different region/provider)

Automated Database Backups

PostgreSQL

#!/bin/bash
# backup-postgres.sh
set -euo pipefail

DB_NAME="myapp"
BACKUP_DIR="/var/backups/postgres"
S3_BUCKET="s3://myapp-backups/postgres"
RETENTION_DAYS=30
TIMESTAMP=$(date +%Y%m%d_%H%M%S)

# Create backup
pg_dump -Fc "$DB_NAME" > "${BACKUP_DIR}/${DB_NAME}_${TIMESTAMP}.dump"

# Compress
gzip "${BACKUP_DIR}/${DB_NAME}_${TIMESTAMP}.dump"

# Upload to S3
aws s3 cp "${BACKUP_DIR}/${DB_NAME}_${TIMESTAMP}.dump.gz" \
    "${S3_BUCKET}/${DB_NAME}_${TIMESTAMP}.dump.gz" \
    --storage-class STANDARD_IA

# Clean old local backups
find "$BACKUP_DIR" -name "*.dump.gz" -mtime +$RETENTION_DAYS -delete

# Verify backup exists in S3
if aws s3 ls "${S3_BUCKET}/${DB_NAME}_${TIMESTAMP}.dump.gz"; then
    echo "Backup successful: ${DB_NAME}_${TIMESTAMP}.dump.gz"
else
    echo "Backup FAILED!" >&2
    exit 1
fi

Schedule with cron:

# Daily at 3 AM
0 3 * * * /opt/scripts/backup-postgres.sh >> /var/log/backup.log 2>&1
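A dump that is corrupted in transit restores to garbage, so it's worth recording a checksum beside each backup and verifying it before any restore. A minimal sketch (the file path is a stand-in, not part of the script above):

```shell
#!/bin/bash
# Sketch: write a SHA-256 checksum next to a backup, then verify it later.
set -e
BACKUP="/tmp/demo_backup.dump.gz"                        # stand-in for a real dump
printf 'pretend this is a compressed dump\n' > "$BACKUP"

# At backup time: record the checksum alongside the artifact
sha256sum "$BACKUP" > "${BACKUP}.sha256"

# At restore time (e.g. after downloading from S3): verify before restoring
if sha256sum -c "${BACKUP}.sha256" > /dev/null; then
    echo "checksum ok"
else
    echo "checksum MISMATCH" >&2
    exit 1
fi
```

Uploading the `.sha256` file with the same `aws s3 cp` pattern keeps the backup and its checksum together.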

MySQL/MariaDB

#!/bin/bash
# backup-mysql.sh
set -euo pipefail

DB_NAME="myapp"
BACKUP_DIR="/var/backups/mysql"
S3_BUCKET="s3://myapp-backups/mysql"
TIMESTAMP=$(date +%Y%m%d_%H%M%S)

# Backup with compression
mysqldump --single-transaction --routines --triggers \
    "$DB_NAME" | gzip > "${BACKUP_DIR}/${DB_NAME}_${TIMESTAMP}.sql.gz"

# Upload to S3 with encryption
aws s3 cp "${BACKUP_DIR}/${DB_NAME}_${TIMESTAMP}.sql.gz" \
    "${S3_BUCKET}/${DB_NAME}_${TIMESTAMP}.sql.gz" \
    --sse AES256
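Because the dump is piped straight into gzip, a mid-stream mysqldump failure can still leave a plausible-looking `.sql.gz` behind (`set -o pipefail` catches the exit code, `gzip -t` catches truncation). A sketch with throwaway files (names illustrative):

```shell
#!/bin/bash
set -e
# Make a small valid archive, then a deliberately truncated copy
printf 'CREATE TABLE demo (id INT);\n' | gzip > /tmp/demo_dump.sql.gz
head -c 10 /tmp/demo_dump.sql.gz > /tmp/demo_truncated.sql.gz

# gzip -t verifies the stream without decompressing to disk
if gzip -t /tmp/demo_dump.sql.gz; then
    echo "archive ok"
fi
if ! gzip -t /tmp/demo_truncated.sql.gz 2>/dev/null; then
    echo "truncated archive detected"
fi
```

Running `gzip -t` before the `aws s3 cp` step stops a broken archive from becoming the backup of record.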

Point-in-Time Recovery

PostgreSQL WAL Archiving

# postgresql.conf
archive_mode = on
archive_command = 'aws s3 cp %p s3://myapp-backups/wal/%f'
wal_level = replica
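PostgreSQL runs archive_command once per completed WAL segment, substituting %p (path to the segment) and %f (its file name), and treats any non-zero exit as "not archived, retry later". A local-directory stand-in (all paths hypothetical) shows the contract, including the rule that an existing segment must never be silently overwritten:

```shell
#!/bin/bash
# Simulates PostgreSQL invoking: archive_command = 'archive_wal %p %f'
set -e
ARCHIVE_DIR="/tmp/wal-archive-demo"
rm -rf "$ARCHIVE_DIR" /tmp/pg_wal_demo
mkdir -p "$ARCHIVE_DIR" /tmp/pg_wal_demo
printf 'fake WAL segment\n' > /tmp/pg_wal_demo/000000010000000000000001

archive_wal() {
    local p="$1" f="$2"
    # Return non-zero if the segment already exists -- PostgreSQL will retry and alert
    test ! -f "${ARCHIVE_DIR}/${f}" && cp "$p" "${ARCHIVE_DIR}/${f}"
}

archive_wal /tmp/pg_wal_demo/000000010000000000000001 000000010000000000000001
echo "archived: $(ls "$ARCHIVE_DIR")"
```

The same contract is why the `aws s3 cp` version above works: a failed upload exits non-zero, and PostgreSQL keeps the segment until it succeeds.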

Restore to a specific time (on PostgreSQL 12+ these settings go in postgresql.conf together with an empty recovery.signal file; a standalone recovery.conf only applies to version 11 and earlier):

# recovery.conf
restore_command = 'aws s3 cp s3://myapp-backups/wal/%f %p'
recovery_target_time = '2026-02-10 14:30:00'
recovery_target_action = 'promote'

AWS RDS Automated Backups

# Terraform
resource "aws_db_instance" "main" {
  identifier     = "myapp-db"
  engine         = "postgres"
  engine_version = "15"
  instance_class = "db.t3.medium"
  
  # Backup configuration
  backup_retention_period = 30
  backup_window          = "03:00-04:00"
  
  # Enable deletion protection
  deletion_protection = true
  
  # Enable automated snapshots
  copy_tags_to_snapshot = true
  
  # Multi-AZ for high availability
  multi_az = true
}

File System Backups

Using restic

# Initialize repository
restic -r s3:s3.amazonaws.com/myapp-backups/files init

# Backup script
#!/bin/bash
export AWS_ACCESS_KEY_ID="..."
export AWS_SECRET_ACCESS_KEY="..."
export RESTIC_PASSWORD="..."
export RESTIC_REPOSITORY="s3:s3.amazonaws.com/myapp-backups/files"

# Backup with exclusions
restic backup /var/www/app \
    --exclude="*.log" \
    --exclude="node_modules" \
    --exclude=".git" \
    --tag "daily"

# Prune old backups
restic forget \
    --keep-daily 7 \
    --keep-weekly 4 \
    --keep-monthly 12 \
    --prune

# Verify integrity
restic check

S3 Versioning and Lifecycle

resource "aws_s3_bucket" "backups" {
  bucket = "myapp-backups"
}

resource "aws_s3_bucket_versioning" "backups" {
  bucket = aws_s3_bucket.backups.id
  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_lifecycle_configuration" "backups" {
  bucket = aws_s3_bucket.backups.id

  rule {
    id     = "transition-to-glacier"
    status = "Enabled"

    transition {
      days          = 30
      storage_class = "GLACIER"
    }

    transition {
      days          = 90
      storage_class = "DEEP_ARCHIVE"
    }

    expiration {
      days = 365
    }
  }
}

Disaster Recovery Plans

RTO and RPO

Define your targets:

  • RTO (Recovery Time Objective): How long can you be down?
  • RPO (Recovery Point Objective): How much data can you lose?
| Tier      | RTO       | RPO        | Strategy                   |
|-----------|-----------|------------|----------------------------|
| Critical  | <1 hour   | <5 minutes | Multi-region active-active |
| Important | <4 hours  | <1 hour    | Warm standby               |
| Standard  | <24 hours | <24 hours  | Pilot light                |
| Low       | <72 hours | <24 hours  | Backup/restore             |
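Note that scheduled backups bound RPO from below: data written just after a backup starts isn't safe until the next run finishes, so worst-case RPO ≈ backup interval + backup duration. A quick sanity check (numbers illustrative):

```shell
#!/bin/bash
# Worst-case RPO for daily backups that take 15 minutes to run
INTERVAL_MIN=$((24 * 60))
RUNTIME_MIN=15
echo "worst-case RPO: $((INTERVAL_MIN + RUNTIME_MIN)) minutes"   # prints: worst-case RPO: 1455 minutes
```

This is why a daily dump can never satisfy the Critical or Important tiers; those need replication or WAL shipping, not just backups.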

Multi-Region Setup

# Primary region
provider "aws" {
  alias  = "primary"
  region = "us-east-1"
}

# DR region
provider "aws" {
  alias  = "dr"
  region = "us-west-2"
}

# Cross-region RDS replica
resource "aws_db_instance" "replica" {
  provider             = aws.dr
  replicate_source_db  = aws_db_instance.main.arn
  instance_class       = "db.t3.medium"
  skip_final_snapshot  = true
}

# S3 cross-region replication
resource "aws_s3_bucket_replication_configuration" "backups" {
  provider = aws.primary
  bucket   = aws_s3_bucket.backups.id
  role     = aws_iam_role.replication.arn

  rule {
    id     = "replicate-all"
    status = "Enabled"

    destination {
      bucket        = aws_s3_bucket.backups_dr.arn
      storage_class = "STANDARD_IA"
    }
  }
}

Testing Your Backups

The backup isn’t done until you’ve tested the restore.

Automated Restore Testing

#!/bin/bash
# test-restore.sh

set -e

BACKUP_FILE=$(aws s3 ls s3://myapp-backups/postgres/ | sort | tail -1 | awk '{print $4}')
TEST_DB="restore_test_$(date +%Y%m%d)"

echo "Testing restore of: $BACKUP_FILE"

# Download latest backup
aws s3 cp "s3://myapp-backups/postgres/${BACKUP_FILE}" /tmp/

# Decompress
gunzip -f "/tmp/${BACKUP_FILE}"

# Create test database
createdb "$TEST_DB"

# Restore
pg_restore -d "$TEST_DB" "/tmp/${BACKUP_FILE%.gz}"

# Run validation queries
psql -d "$TEST_DB" -c "SELECT COUNT(*) FROM users;" > /tmp/restore_validation.txt
psql -d "$TEST_DB" -c "SELECT MAX(created_at) FROM orders;" >> /tmp/restore_validation.txt

# Cleanup
dropdb "$TEST_DB"
rm -f "/tmp/${BACKUP_FILE%.gz}"

echo "Restore test completed successfully"
cat /tmp/restore_validation.txt

# Send notification
curl -X POST "$SLACK_WEBHOOK" \
    -H 'Content-type: application/json' \
    -d "{\"text\": \"✅ Backup restore test passed: ${BACKUP_FILE}\"}"

Schedule monthly tests:

# First Sunday of each month at 4 AM
# (cron treats day-of-month and day-of-week as OR, so "1-7 * 0" would fire every day
#  of the first week AND every Sunday; guard the weekday in the command instead)
0 4 1-7 * * [ "$(date +\%u)" -eq 7 ] && /opt/scripts/test-restore.sh

Runbook Template

Create a disaster recovery runbook:

# Disaster Recovery Runbook

## Scenario: Database Failure

### Detection
- CloudWatch alarm triggers
- Application health checks fail
- On-call engineer paged

### Response Steps

1. **Assess** (5 min)
   - Check RDS console for instance status
   - Review CloudWatch metrics
   - Determine if failover is needed

2. **Failover** (10 min)
   - If Multi-AZ: Automatic failover
   - If not: Promote read replica
   ```bash
   aws rds promote-read-replica --db-instance-identifier myapp-replica
   ```

3. **Restore from Backup** (if needed, 30 min)
   ```bash
   aws rds restore-db-instance-to-point-in-time \
       --source-db-instance-identifier myapp-db \
       --target-db-instance-identifier myapp-db-restored \
       --restore-time 2026-02-10T14:00:00Z
   ```

4. **Update Application**
   - Update connection string
   - Deploy config change
   - Verify connectivity

5. **Verify**
   - Run health checks
   - Check recent transactions
   - Monitor error rates

### Contacts

- Primary on-call: [paging rotation]
- DBA team: [escalation contact]
- Cloud provider support: [support plan / case URL]

Monitoring Backup Freshness

A backup job that silently stops running is the most common failure mode, so alert when the newest object in the backup bucket gets too old. A scheduled Lambda along these lines (bucket, prefix, and 26-hour threshold illustrative) does the check:

# check-backup-freshness.py
import boto3
from datetime import datetime, timezone

s3 = boto3.client('s3')

def lambda_handler(event, context):
    response = s3.list_objects_v2(Bucket='myapp-backups', Prefix='postgres/')
    latest = max(response['Contents'], key=lambda x: x['LastModified'])
    age_hours = (datetime.now(timezone.utc) - latest['LastModified']).total_seconds() / 3600
    if age_hours > 26:
        raise Exception(f"Backup is {age_hours:.0f} hours old!")
    return {"latest": latest['Key'], "age_hours": age_hours}

The Checklist

  • Automated daily backups
  • Offsite/cross-region replication
  • Encryption at rest and in transit
  • Monthly restore tests
  • Documented runbooks
  • Monitoring and alerts
  • Defined RTO/RPO
  • Team trained on procedures

The best time to test your backups was yesterday. The second best time is now.