Everyone has backups. Almost no one tests them.
Don’t be everyone.
The 3-2-1 Rule#
The foundation of backup strategy:
- 3 copies of your data
- 2 different storage types
- 1 offsite location
Automated Database Backups#
PostgreSQL#
```bash
#!/bin/bash
# backup-postgres.sh
set -euo pipefail

DB_NAME="myapp"
BACKUP_DIR="/var/backups/postgres"
S3_BUCKET="s3://myapp-backups/postgres"
RETENTION_DAYS=30
TIMESTAMP=$(date +%Y%m%d_%H%M%S)

# Create backup (custom format, restorable with pg_restore)
pg_dump -Fc "$DB_NAME" > "${BACKUP_DIR}/${DB_NAME}_${TIMESTAMP}.dump"

# Compress
gzip "${BACKUP_DIR}/${DB_NAME}_${TIMESTAMP}.dump"

# Upload to S3
aws s3 cp "${BACKUP_DIR}/${DB_NAME}_${TIMESTAMP}.dump.gz" \
  "${S3_BUCKET}/${DB_NAME}_${TIMESTAMP}.dump.gz" \
  --storage-class STANDARD_IA

# Clean old local backups
find "$BACKUP_DIR" -name "*.dump.gz" -mtime +"$RETENTION_DAYS" -delete

# Verify backup exists in S3
if aws s3 ls "${S3_BUCKET}/${DB_NAME}_${TIMESTAMP}.dump.gz"; then
  echo "Backup successful: ${DB_NAME}_${TIMESTAMP}.dump.gz"
else
  echo "Backup FAILED!" >&2
  exit 1
fi
```
Schedule with cron:
```bash
# Daily at 3 AM
0 3 * * * /opt/scripts/backup-postgres.sh >> /var/log/backup.log 2>&1
```
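The `%Y%m%d_%H%M%S` timestamp format is worth keeping: zero-padded, most-significant-first fields mean lexicographic order equals chronological order, so "find the newest backup" is a one-liner (the restore-test script later in this post relies on exactly that). A quick illustration with hypothetical filenames:

```shell
# Zero-padded %Y%m%d_%H%M%S names sort lexicographically in time order,
# so `sort | tail -1` reliably picks the newest backup.
printf '%s\n' \
  myapp_20260210_030000.dump.gz \
  myapp_20250101_030000.dump.gz \
  myapp_20261231_030000.dump.gz | sort | tail -1
# → myapp_20261231_030000.dump.gz
```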
MySQL/MariaDB#
```bash
#!/bin/bash
# backup-mysql.sh
set -euo pipefail

DB_NAME="myapp"
BACKUP_DIR="/var/backups/mysql"
S3_BUCKET="s3://myapp-backups/mysql"
TIMESTAMP=$(date +%Y%m%d_%H%M%S)

# Backup with compression; --single-transaction takes a consistent
# InnoDB snapshot without locking tables
mysqldump --single-transaction --routines --triggers \
  "$DB_NAME" | gzip > "${BACKUP_DIR}/${DB_NAME}_${TIMESTAMP}.sql.gz"

# Upload to S3 with server-side encryption
aws s3 cp "${BACKUP_DIR}/${DB_NAME}_${TIMESTAMP}.sql.gz" \
  "${S3_BUCKET}/${DB_NAME}_${TIMESTAMP}.sql.gz" \
  --sse AES256
```
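One gotcha with `mysqldump | gzip` pipelines: by default the shell reports only the exit status of the last command, so a failed dump piped into a successful gzip still looks like success and you upload a truncated file. `set -o pipefail` surfaces the failure:

```shell
# Without pipefail, a failing producer is masked by a succeeding consumer;
# `false` stands in here for a failing mysqldump.
rc1=0; ( false | gzip > /dev/null ) || rc1=$?
rc2=0; ( set -o pipefail; false | gzip > /dev/null ) || rc2=$?
echo "default=$rc1 pipefail=$rc2"
# → default=0 pipefail=1
```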
Point-in-Time Recovery#
PostgreSQL WAL Archiving#
```ini
# postgresql.conf
archive_mode = on
archive_command = 'aws s3 cp %p s3://myapp-backups/wal/%f'
wal_level = replica
```
Restore to a specific time. Since PostgreSQL 12, `recovery.conf` no longer exists: recovery settings go in `postgresql.conf`, and recovery mode is triggered by an empty `recovery.signal` file in the data directory:
```ini
# postgresql.conf (then: touch $PGDATA/recovery.signal)
restore_command = 'aws s3 cp s3://myapp-backups/wal/%f %p'
recovery_target_time = '2026-02-10 14:30:00'
recovery_target_action = 'promote'
```
AWS RDS Automated Backups#
```hcl
# Terraform
resource "aws_db_instance" "main" {
  identifier     = "myapp-db"
  engine         = "postgres"
  engine_version = "15"
  instance_class = "db.t3.medium"

  # Backup configuration (a retention period > 0 enables automated backups)
  backup_retention_period = 30
  backup_window           = "03:00-04:00"

  # Guard against accidental deletion
  deletion_protection = true

  # Carry instance tags onto snapshots
  copy_tags_to_snapshot = true

  # Multi-AZ for high availability
  multi_az = true
}
```
File System Backups#
Using restic#
```bash
# Initialize repository (one-time)
restic -r s3:s3.amazonaws.com/myapp-backups/files init

# Backup script
#!/bin/bash
export AWS_ACCESS_KEY_ID="..."
export AWS_SECRET_ACCESS_KEY="..."
export RESTIC_PASSWORD="..."
export RESTIC_REPOSITORY="s3:s3.amazonaws.com/myapp-backups/files"

# Backup with exclusions
restic backup /var/www/app \
  --exclude="*.log" \
  --exclude="node_modules" \
  --exclude=".git" \
  --tag "daily"

# Prune old backups
restic forget \
  --keep-daily 7 \
  --keep-weekly 4 \
  --keep-monthly 12 \
  --prune

# Verify integrity
restic check
```
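`restic check` verifies repository integrity, but not that a restore actually produces your files. A periodic spot-restore plus a recursive diff closes that gap. The `restic restore` invocation below is real; the directory contents are simulated so the comparison step itself is runnable as a sketch:

```shell
# In practice, first pull a real restore:
#   restic restore latest --target /tmp/restore-check
# Then diff the restored tree against the live tree.
# Simulated below with two toy directories.
mkdir -p /tmp/live-tree /tmp/restore-check
echo "config" > /tmp/live-tree/app.conf
cp /tmp/live-tree/app.conf /tmp/restore-check/app.conf

if diff -r /tmp/live-tree /tmp/restore-check > /dev/null; then
  echo "restore matches live data"
fi
```

On an active system some drift between the snapshot and the live tree is expected; the point is to confirm files come back intact and readable, not byte-identical.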
S3 Versioning and Lifecycle#
```hcl
resource "aws_s3_bucket" "backups" {
  bucket = "myapp-backups"
}

resource "aws_s3_bucket_versioning" "backups" {
  bucket = aws_s3_bucket.backups.id

  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_lifecycle_configuration" "backups" {
  bucket = aws_s3_bucket.backups.id

  rule {
    id     = "transition-to-glacier"
    status = "Enabled"

    transition {
      days          = 30
      storage_class = "GLACIER"
    }

    transition {
      days          = 90
      storage_class = "DEEP_ARCHIVE"
    }

    expiration {
      days = 365
    }
  }
}
```
Disaster Recovery Plans#
RTO and RPO#
Define your targets:
- RTO (Recovery Time Objective): How long can you be down?
- RPO (Recovery Point Objective): How much data can you lose?
| Tier | RTO | RPO | Strategy |
|------|-----|-----|----------|
| Critical | <1 hour | <5 minutes | Multi-region active-active |
| Important | <4 hours | <1 hour | Warm standby |
| Standard | <24 hours | <24 hours | Pilot light |
| Low | <72 hours | <24 hours | Backup/restore |
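RPO maps directly to backup cadence: your worst-case data loss is the gap between recovery points, so a single nightly dump can never satisfy the <5-minute tier; that takes continuous WAL shipping or synchronous replication. A rough sanity check (the `backups_per_day` cadence is a hypothetical example):

```shell
# Worst-case data loss equals the interval between recovery points.
backups_per_day=1   # one nightly dump
echo "worst-case RPO: $(( 24 * 60 / backups_per_day )) minutes"
# → worst-case RPO: 1440 minutes
```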
Multi-Region Setup#
```hcl
# Primary region
provider "aws" {
  alias  = "primary"
  region = "us-east-1"
}

# DR region
provider "aws" {
  alias  = "dr"
  region = "us-west-2"
}

# Cross-region RDS read replica
resource "aws_db_instance" "replica" {
  provider            = aws.dr
  replicate_source_db = aws_db_instance.main.arn
  instance_class      = "db.t3.medium"
  skip_final_snapshot = true
}

# S3 cross-region replication
resource "aws_s3_bucket_replication_configuration" "backups" {
  provider = aws.primary
  bucket   = aws_s3_bucket.backups.id
  role     = aws_iam_role.replication.arn

  rule {
    id     = "replicate-all"
    status = "Enabled"

    destination {
      bucket        = aws_s3_bucket.backups_dr.arn
      storage_class = "STANDARD_IA"
    }
  }
}
```
Testing Your Backups#
The backup isn’t done until you’ve tested the restore.
Automated Restore Testing#
```bash
#!/bin/bash
# test-restore.sh
set -e

BACKUP_FILE=$(aws s3 ls s3://myapp-backups/postgres/ | sort | tail -1 | awk '{print $4}')
TEST_DB="restore_test_$(date +%Y%m%d)"

echo "Testing restore of: $BACKUP_FILE"

# Download latest backup
aws s3 cp "s3://myapp-backups/postgres/${BACKUP_FILE}" /tmp/

# Decompress
gunzip -f "/tmp/${BACKUP_FILE}"

# Create test database
createdb "$TEST_DB"

# Restore
pg_restore -d "$TEST_DB" "/tmp/${BACKUP_FILE%.gz}"

# Run validation queries
psql -d "$TEST_DB" -c "SELECT COUNT(*) FROM users;" > /tmp/restore_validation.txt
psql -d "$TEST_DB" -c "SELECT MAX(created_at) FROM orders;" >> /tmp/restore_validation.txt

# Cleanup
dropdb "$TEST_DB"
rm -f "/tmp/${BACKUP_FILE%.gz}"

echo "Restore test completed successfully"
cat /tmp/restore_validation.txt

# Send notification
curl -X POST "$SLACK_WEBHOOK" \
  -H 'Content-type: application/json' \
  -d "{\"text\": \"✅ Backup restore test passed: ${BACKUP_FILE}\"}"
```
Schedule monthly tests:
```bash
# First Sunday of each month at 4 AM. Cron ORs day-of-month and
# day-of-week, so "1-7 * 0" alone would also fire every Sunday;
# guard on the weekday in the command instead.
0 4 1-7 * * [ "$(date +\%u)" -eq 7 ] && /opt/scripts/test-restore.sh
```
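The weekday guard is needed because classic cron treats restricted day-of-month and day-of-week fields as OR, not AND: `0 4 1-7 * 0` fires on the first seven days of the month *and* on every Sunday. `date +%u` gives the ISO weekday (1=Monday through 7=Sunday; the `%` must be escaped as `\%` inside a crontab):

```shell
# ISO weekday from date(1): 1=Monday … 7=Sunday.
dow=$(date -d "2026-02-01" +%u)   # 2026-02-01 falls on a Sunday
[ "$dow" -eq 7 ] && echo "first-Sunday job would run"
# → first-Sunday job would run
```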
Runbook Template#
Create a disaster recovery runbook:
````markdown
# Disaster Recovery Runbook

## Scenario: Database Failure

### Detection
- CloudWatch alarm triggers
- Application health checks fail
- On-call engineer paged

### Response Steps
1. **Assess** (5 min)
   - Check RDS console for instance status
   - Review CloudWatch metrics
   - Determine if failover is needed
2. **Failover** (10 min)
   - If Multi-AZ: automatic failover
   - If not: promote the read replica
     ```bash
     aws rds promote-read-replica --db-instance-identifier myapp-replica
     ```
3. **Restore from backup** (if needed, 30 min)
   ```bash
   aws rds restore-db-instance-to-point-in-time \
     --source-db-instance-identifier myapp-db \
     --target-db-instance-identifier myapp-db-restored \
     --restore-time 2026-02-10T14:00:00Z
   ```
4. **Update application**
   - Update connection string
   - Deploy config change
   - Verify connectivity
5. **Verify**
   - Run health checks
   - Check recent transactions
   - Monitor error rates
````
The Checklist#
The best time to test your backups was yesterday. The second best time is now.