Everyone has backups. Almost no one tests them.
Don’t be everyone.
The 3-2-1 Rule#
The foundation of backup strategy:
- 3 copies of your data
- 2 different storage types
- 1 offsite location
Automated Database Backups#
PostgreSQL#
```bash
#!/bin/bash
# backup-postgres.sh
set -euo pipefail

DB_NAME="myapp"
BACKUP_DIR="/var/backups/postgres"
S3_BUCKET="s3://myapp-backups/postgres"
RETENTION_DAYS=30
TIMESTAMP=$(date +%Y%m%d_%H%M%S)

# Create backup (custom format, restorable with pg_restore)
pg_dump -Fc "$DB_NAME" > "${BACKUP_DIR}/${DB_NAME}_${TIMESTAMP}.dump"

# Compress
gzip "${BACKUP_DIR}/${DB_NAME}_${TIMESTAMP}.dump"

# Upload to S3
aws s3 cp "${BACKUP_DIR}/${DB_NAME}_${TIMESTAMP}.dump.gz" \
  "${S3_BUCKET}/${DB_NAME}_${TIMESTAMP}.dump.gz" \
  --storage-class STANDARD_IA

# Clean old local backups
find "$BACKUP_DIR" -name "*.dump.gz" -mtime +"$RETENTION_DAYS" -delete

# Verify backup exists in S3
if aws s3 ls "${S3_BUCKET}/${DB_NAME}_${TIMESTAMP}.dump.gz"; then
  echo "Backup successful: ${DB_NAME}_${TIMESTAMP}.dump.gz"
else
  echo "Backup FAILED!" >&2
  exit 1
fi
```
Schedule with cron:
```bash
# Daily at 3 AM
0 3 * * * /opt/scripts/backup-postgres.sh >> /var/log/backup.log 2>&1
```
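The `%Y%m%d_%H%M%S` timestamp format is worth keeping: zero-padded, most-significant-first fields mean lexicographic order equals chronological order, so "find the newest backup" is a one-liner (the restore-test script later in this post relies on exactly that). A quick illustration with hypothetical filenames:

```shell
# Zero-padded %Y%m%d_%H%M%S names sort lexicographically in time order,
# so `sort | tail -1` reliably picks the newest backup.
printf '%s\n' \
  myapp_20260210_030000.dump.gz \
  myapp_20250101_030000.dump.gz \
  myapp_20261231_030000.dump.gz | sort | tail -1
# → myapp_20261231_030000.dump.gz
```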
MySQL/MariaDB#
```bash
#!/bin/bash
# backup-mysql.sh
set -euo pipefail

DB_NAME="myapp"
BACKUP_DIR="/var/backups/mysql"
S3_BUCKET="s3://myapp-backups/mysql"
TIMESTAMP=$(date +%Y%m%d_%H%M%S)

# Backup with compression; --single-transaction takes a consistent
# InnoDB snapshot without locking tables
mysqldump --single-transaction --routines --triggers \
  "$DB_NAME" | gzip > "${BACKUP_DIR}/${DB_NAME}_${TIMESTAMP}.sql.gz"

# Upload to S3 with server-side encryption
aws s3 cp "${BACKUP_DIR}/${DB_NAME}_${TIMESTAMP}.sql.gz" \
  "${S3_BUCKET}/${DB_NAME}_${TIMESTAMP}.sql.gz" \
  --sse AES256
```
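One gotcha with `mysqldump | gzip` pipelines: by default the shell reports only the exit status of the last command, so a failed dump piped into a successful gzip still looks like success and you upload a truncated file. `set -o pipefail` surfaces the failure:

```shell
# Without pipefail, a failing producer is masked by a succeeding consumer;
# `false` stands in here for a failing mysqldump.
rc1=0; ( false | gzip > /dev/null ) || rc1=$?
rc2=0; ( set -o pipefail; false | gzip > /dev/null ) || rc2=$?
echo "default=$rc1 pipefail=$rc2"
# → default=0 pipefail=1
```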
Point-in-Time Recovery#
PostgreSQL WAL Archiving#
```ini
# postgresql.conf
archive_mode = on
archive_command = 'aws s3 cp %p s3://myapp-backups/wal/%f'
wal_level = replica
```
Restore to a specific time. Since PostgreSQL 12, `recovery.conf` no longer exists: recovery settings go in `postgresql.conf`, and recovery mode is triggered by an empty `recovery.signal` file in the data directory:
```ini
# postgresql.conf (then: touch $PGDATA/recovery.signal)
restore_command = 'aws s3 cp s3://myapp-backups/wal/%f %p'
recovery_target_time = '2026-02-10 14:30:00'
recovery_target_action = 'promote'
```
AWS RDS Automated Backups#
```hcl
# Terraform
resource "aws_db_instance" "main" {
  identifier     = "myapp-db"
  engine         = "postgres"
  engine_version = "15"
  instance_class = "db.t3.medium"

  # Backup configuration (a retention period > 0 enables automated backups)
  backup_retention_period = 30
  backup_window           = "03:00-04:00"

  # Guard against accidental deletion
  deletion_protection = true

  # Carry instance tags onto snapshots
  copy_tags_to_snapshot = true

  # Multi-AZ for high availability
  multi_az = true
}
```
File System Backups#
Using restic#
```bash
# Initialize repository (one-time)
restic -r s3:s3.amazonaws.com/myapp-backups/files init

# Backup script
#!/bin/bash
export AWS_ACCESS_KEY_ID="..."
export AWS_SECRET_ACCESS_KEY="..."
export RESTIC_PASSWORD="..."
export RESTIC_REPOSITORY="s3:s3.amazonaws.com/myapp-backups/files"

# Backup with exclusions
restic backup /var/www/app \
  --exclude="*.log" \
  --exclude="node_modules" \
  --exclude=".git" \
  --tag "daily"

# Prune old backups
restic forget \
  --keep-daily 7 \
  --keep-weekly 4 \
  --keep-monthly 12 \
  --prune

# Verify integrity
restic check
```
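`restic check` verifies repository integrity, but not that a restore actually produces your files. A periodic spot-restore plus a recursive diff closes that gap. The `restic restore` invocation below is real; the directory contents are simulated so the comparison step itself is runnable as a sketch:

```shell
# In practice, first pull a real restore:
#   restic restore latest --target /tmp/restore-check
# Then diff the restored tree against the live tree.
# Simulated below with two toy directories.
mkdir -p /tmp/live-tree /tmp/restore-check
echo "config" > /tmp/live-tree/app.conf
cp /tmp/live-tree/app.conf /tmp/restore-check/app.conf

if diff -r /tmp/live-tree /tmp/restore-check > /dev/null; then
  echo "restore matches live data"
fi
```

On an active system some drift between the snapshot and the live tree is expected; the point is to confirm files come back intact and readable, not byte-identical.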
S3 Versioning and Lifecycle#
```hcl
resource "aws_s3_bucket" "backups" {
  bucket = "myapp-backups"
}

resource "aws_s3_bucket_versioning" "backups" {
  bucket = aws_s3_bucket.backups.id

  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_lifecycle_configuration" "backups" {
  bucket = aws_s3_bucket.backups.id

  rule {
    id     = "transition-to-glacier"
    status = "Enabled"

    transition {
      days          = 30
      storage_class = "GLACIER"
    }

    transition {
      days          = 90
      storage_class = "DEEP_ARCHIVE"
    }

    expiration {
      days = 365
    }
  }
}
```
Disaster Recovery Plans#
RTO and RPO#
Define your targets:
- RTO (Recovery Time Objective): How long can you be down?
- RPO (Recovery Point Objective): How much data can you lose?
| Tier | RTO | RPO | Strategy |
|------|-----|-----|----------|
| Critical | <1 hour | <5 minutes | Multi-region active-active |
| Important | <4 hours | <1 hour | Warm standby |
| Standard | <24 hours | <24 hours | Pilot light |
| Low | <72 hours | <24 hours | Backup/restore |
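RPO maps directly to backup cadence: your worst-case data loss is the gap between recovery points, so a single nightly dump can never satisfy the <5-minute tier; that takes continuous WAL shipping or synchronous replication. A rough sanity check (the `backups_per_day` cadence is a hypothetical example):

```shell
# Worst-case data loss equals the interval between recovery points.
backups_per_day=1   # one nightly dump
echo "worst-case RPO: $(( 24 * 60 / backups_per_day )) minutes"
# → worst-case RPO: 1440 minutes
```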
Multi-Region Setup#
```hcl
# Primary region
provider "aws" {
  alias  = "primary"
  region = "us-east-1"
}

# DR region
provider "aws" {
  alias  = "dr"
  region = "us-west-2"
}

# Cross-region RDS read replica
resource "aws_db_instance" "replica" {
  provider            = aws.dr
  replicate_source_db = aws_db_instance.main.arn
  instance_class      = "db.t3.medium"
  skip_final_snapshot = true
}

# S3 cross-region replication
resource "aws_s3_bucket_replication_configuration" "backups" {
  provider = aws.primary
  bucket   = aws_s3_bucket.backups.id
  role     = aws_iam_role.replication.arn

  rule {
    id     = "replicate-all"
    status = "Enabled"

    destination {
      bucket        = aws_s3_bucket.backups_dr.arn
      storage_class = "STANDARD_IA"
    }
  }
}
```
Testing Your Backups#
The backup isn’t done until you’ve tested the restore.
Automated Restore Testing#
```bash
#!/bin/bash
# test-restore.sh
set -e

BACKUP_FILE=$(aws s3 ls s3://myapp-backups/postgres/ | sort | tail -1 | awk '{print $4}')
TEST_DB="restore_test_$(date +%Y%m%d)"

echo "Testing restore of: $BACKUP_FILE"

# Download latest backup
aws s3 cp "s3://myapp-backups/postgres/${BACKUP_FILE}" /tmp/

# Decompress
gunzip -f "/tmp/${BACKUP_FILE}"

# Create test database
createdb "$TEST_DB"

# Restore
pg_restore -d "$TEST_DB" "/tmp/${BACKUP_FILE%.gz}"

# Run validation queries
psql -d "$TEST_DB" -c "SELECT COUNT(*) FROM users;" > /tmp/restore_validation.txt
psql -d "$TEST_DB" -c "SELECT MAX(created_at) FROM orders;" >> /tmp/restore_validation.txt

# Cleanup
dropdb "$TEST_DB"
rm -f "/tmp/${BACKUP_FILE%.gz}"

echo "Restore test completed successfully"
cat /tmp/restore_validation.txt

# Send notification
curl -X POST "$SLACK_WEBHOOK" \
  -H 'Content-type: application/json' \
  -d "{\"text\": \"✅ Backup restore test passed: ${BACKUP_FILE}\"}"
```
Schedule monthly tests:
```bash
# First Sunday of each month at 4 AM. Cron ORs day-of-month and
# day-of-week, so "1-7 * 0" alone would also fire every Sunday;
# guard on the weekday in the command instead.
0 4 1-7 * * [ "$(date +\%u)" -eq 7 ] && /opt/scripts/test-restore.sh
```
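The weekday guard is needed because classic cron treats restricted day-of-month and day-of-week fields as OR, not AND: `0 4 1-7 * 0` fires on the first seven days of the month *and* on every Sunday. `date +%u` gives the ISO weekday (1=Monday through 7=Sunday; the `%` must be escaped as `\%` inside a crontab):

```shell
# ISO weekday from date(1): 1=Monday … 7=Sunday.
dow=$(date -d "2026-02-01" +%u)   # 2026-02-01 falls on a Sunday
[ "$dow" -eq 7 ] && echo "first-Sunday job would run"
# → first-Sunday job would run
```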
Runbook Template#
Create a disaster recovery runbook:
````markdown
# Disaster Recovery Runbook

## Scenario: Database Failure

### Detection
- CloudWatch alarm triggers
- Application health checks fail
- On-call engineer paged

### Response Steps
1. **Assess** (5 min)
   - Check RDS console for instance status
   - Review CloudWatch metrics
   - Determine if failover is needed
2. **Failover** (10 min)
   - If Multi-AZ: automatic failover
   - If not: promote the read replica
     ```bash
     aws rds promote-read-replica --db-instance-identifier myapp-replica
     ```
3. **Restore from backup** (if needed, 30 min)
   ```bash
   aws rds restore-db-instance-to-point-in-time \
     --source-db-instance-identifier myapp-db \
     --target-db-instance-identifier myapp-db-restored \
     --restore-time 2026-02-10T14:00:00Z
   ```
4. **Update application**
   - Update connection string
   - Deploy config change
   - Verify connectivity
5. **Verify**
   - Run health checks
   - Check recent transactions
   - Monitor error rates
````
The Checklist#
The best time to test your backups was yesterday. The second best time is now.