The only backup that matters is the one you can restore. Everything else is wishful thinking with storage costs. Here’s how to build backup systems that work when disaster strikes.
## The 3-2-1 Rule (Still Valid)

The classic backup rule holds up:

- 3 copies of your data
- 2 different storage types
- 1 offsite location

Modern interpretation:

- Primary: Production database (SSD)
- Local: Daily snapshot (different volume)
- Offsite: S3/GCS/Azure Blob (different region)

This protects against:

- Hardware failure (local backup)
- Site disaster (offsite backup)
- Corruption spreading to replicas (time-delayed copies)

## What to Back Up

### Databases
```bash
# PostgreSQL - logical backup
pg_dump -Fc dbname > backup.dump

# PostgreSQL - base backup for PITR
pg_basebackup -D /backup/base -Fp -Xs -P

# MySQL
mysqldump --single-transaction --routines dbname > backup.sql

# MongoDB
mongodump --archive=backup.archive --gzip
```

**Critical**: Include schema, stored procedures, and permissions, not just data.
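A thin wrapper around these commands makes failures loud and filenames predictable. A sketch (the directory and database name are placeholders, and `echo` stands in for the real `pg_dump` so the script runs anywhere):

```bash
#!/bin/bash
# Sketch: timestamped dump names plus fail-loudly semantics.
# BACKUP_DIR and DBNAME are placeholders; swap the echo for pg_dump.
set -euo pipefail

BACKUP_DIR=$(mktemp -d)                 # stand-in for /backups/daily
DBNAME=appdb
OUT="$BACKUP_DIR/${DBNAME}_$(date +%F_%H-%M).dump"

echo "dump-data" > "$OUT"               # real job: pg_dump -Fc "$DBNAME" > "$OUT"

# An empty dump means the backup silently failed
[[ -s "$OUT" ]] || { echo "backup failed: $OUT is empty"; exit 1; }
echo "wrote $OUT"
```

The timestamp in the filename doubles as the naming convention you'll lean on when hunting for the right backup later.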
### Application State
```bash
# What to capture
/etc/                      # System config
/home/*/                   # User data
/var/lib/docker/volumes/   # Container data
/opt/app/uploads/          # User uploads
```
Don't forget:

- SSL certificates and keys
- Environment files (`.env`)
- Cron jobs (`crontab -l > crons.txt`)
- systemd units

### Infrastructure as Code
```bash
# Terraform state
terraform state pull > terraform.tfstate.backup
```

Or better, use remote state with versioning:

```hcl
terraform {
  backend "s3" {
    bucket         = "terraform-state"
    key            = "prod/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-locks"
  }
}
```
## Backup Frequency

Match backup frequency to acceptable data loss:

| Data Type    | RPO     | Backup Frequency |
|--------------|---------|------------------|
| Transactions | Minutes | Continuous/WAL   |
| User data    | Hours   | Hourly           |
| Config       | Daily   | Daily            |
| Archives     | Weekly  | Weekly           |

RPO = Recovery Point Objective: how much data loss is acceptable.
```bash
# Example cron schedule
0 * * * * /scripts/backup-hourly.sh   # User uploads
0 3 * * * /scripts/backup-daily.sh    # Full database
0 4 * * 0 /scripts/backup-weekly.sh   # Everything
```
## Backup Verification

Backups without verification are just hope.

### Integrity Checks
```bash
#!/bin/bash
# verify-backup.sh
BACKUP_FILE=$1

# Check file exists and has size
if [[ ! -s "$BACKUP_FILE" ]]; then
    echo "ERROR: Backup file missing or empty"
    exit 1
fi

# Verify archive integrity
if [[ "$BACKUP_FILE" == *.gz ]]; then
    gzip -t "$BACKUP_FILE" || exit 1
fi

# For pg_dump custom format
if [[ "$BACKUP_FILE" == *.dump ]]; then
    pg_restore --list "$BACKUP_FILE" > /dev/null || exit 1
fi

echo "Backup verified: $BACKUP_FILE"
```
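Integrity checks get cheaper if a checksum is recorded at backup time; `sha256sum -c` then confirms the file hasn't changed since. A minimal sketch on a scratch file (a real script would point at the actual dump path):

```bash
#!/bin/bash
# Sketch: store a checksum next to the backup, verify it before restoring.
set -euo pipefail

BACKUP=$(mktemp)                         # stand-in for /backups/daily/db.dump
echo "dump contents" > "$BACKUP"

sha256sum "$BACKUP" > "$BACKUP.sha256"   # run at backup time
sha256sum -c "$BACKUP.sha256"            # run before any restore
```

Ship the `.sha256` file alongside the backup so corruption in transit or at rest is caught before a restore attempt.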
### Restore Testing
```bash
#!/bin/bash
# test-restore.sh - Run weekly

# Spin up a throwaway test database (publish the port so psql can reach it)
docker run -d --name restore-test \
    -e POSTGRES_PASSWORD=test \
    -p 5432:5432 \
    postgres:15
sleep 10

# Restore latest backup
export PGPASSWORD=test
LATEST=$(ls -t /backups/*.dump | head -1)
pg_restore -h localhost -U postgres -d postgres "$LATEST"

# Run validation queries
psql -h localhost -U postgres -c "SELECT COUNT(*) FROM users;"
psql -h localhost -U postgres -c "SELECT COUNT(*) FROM orders;"

# Compare counts to production
# Alert if significantly different

# Cleanup
docker rm -f restore-test
```
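The "compare counts" step can be automated with a tolerance so that small drift (writes since the backup was taken) doesn't page anyone. A sketch with hard-coded counts standing in for the two `psql` results; the 1% threshold is an assumption to tune:

```bash
#!/bin/bash
# Sketch: alert only when restored row counts drift more than 1% from production.
# In the real script these two values come from the psql queries above.
PROD_COUNT=10000
RESTORED_COUNT=9950

DIFF=$(( PROD_COUNT - RESTORED_COUNT ))
DIFF=${DIFF#-}                           # absolute value

if (( DIFF * 100 > PROD_COUNT )); then   # more than 1% difference
    echo "ALERT: restored count differs by more than 1%"
else
    echo "OK: counts within tolerance"
fi
```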
**Schedule restore tests.** Monthly at minimum, quarterly for full disaster recovery drills.
## Retention Policies

Keep backups long enough, but not forever:
```bash
# retention.conf
hourly_keep=24    # Keep 24 hourly backups
daily_keep=7      # Keep 7 daily backups
weekly_keep=4     # Keep 4 weekly backups
monthly_keep=12   # Keep 12 monthly backups
```
Implementation:
```bash
#!/bin/bash
# cleanup-old-backups.sh
BACKUP_DIR=/backups

# Remove hourly backups older than 24 hours
find "$BACKUP_DIR/hourly" -type f -mtime +1 -delete

# Remove daily backups older than 7 days
find "$BACKUP_DIR/daily" -type f -mtime +7 -delete

# Remove weekly backups older than 28 days
find "$BACKUP_DIR/weekly" -type f -mtime +28 -delete

# Remove monthly backups older than 365 days
find "$BACKUP_DIR/monthly" -type f -mtime +365 -delete
```
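`find -delete` is unforgiving, so it's worth dry-running the age logic on a scratch directory first. A sketch (GNU `touch -d` backdates a file so the `-mtime +7` filter has something to catch; paths are scratch stand-ins):

```bash
#!/bin/bash
# Sketch: prove the -mtime cutoff deletes only what you expect.
set -euo pipefail

BACKUP_DIR=$(mktemp -d)               # scratch stand-in for /backups
mkdir -p "$BACKUP_DIR/daily"
touch -d '10 days ago' "$BACKUP_DIR/daily/old.dump"
touch "$BACKUP_DIR/daily/fresh.dump"

# Preview first: drop -delete to see what would go
find "$BACKUP_DIR/daily" -type f -mtime +7

# Then delete: old.dump goes, fresh.dump survives
find "$BACKUP_DIR/daily" -type f -mtime +7 -delete
ls "$BACKUP_DIR/daily"
```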
## Encryption

Backups contain your most sensitive data. Encrypt them:
```bash
# Encrypt with GPG
pg_dump dbname | gzip | gpg --encrypt -r backup@company.com > backup.sql.gz.gpg

# Encrypt with OpenSSL (symmetric)
pg_dump dbname | gzip | openssl enc -aes-256-cbc -salt -pbkdf2 \
    -pass file:/etc/backup-key > backup.sql.gz.enc

# Decrypt
openssl enc -d -aes-256-cbc -pbkdf2 \
    -pass file:/etc/backup-key < backup.sql.gz.enc | gunzip > backup.sql
```
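Before trusting an encrypted backup, round-trip it: encrypt, decrypt, and compare against the original. A self-contained sketch of the OpenSSL path above (scratch paths; the random key file stands in for `/etc/backup-key`):

```bash
#!/bin/bash
# Sketch: verify encrypt -> decrypt reproduces the original byte-for-byte.
set -euo pipefail

WORK=$(mktemp -d)
echo "sensitive rows" > "$WORK/backup.sql"
head -c 32 /dev/urandom > "$WORK/backup-key"     # stand-in for /etc/backup-key

gzip -c "$WORK/backup.sql" \
  | openssl enc -aes-256-cbc -salt -pbkdf2 -pass "file:$WORK/backup-key" \
  > "$WORK/backup.sql.gz.enc"

openssl enc -d -aes-256-cbc -pbkdf2 -pass "file:$WORK/backup-key" \
  < "$WORK/backup.sql.gz.enc" | gunzip > "$WORK/restored.sql"

cmp "$WORK/backup.sql" "$WORK/restored.sql" && echo "round-trip OK"
```

An encrypted backup you've never decrypted is the same gamble as a backup you've never restored.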
Key management:

- Store encryption keys separately from backups
- Use a secrets manager (Vault, AWS Secrets Manager)
- Have key recovery procedures documented and tested

## Cloud Backup Patterns

### S3 with Lifecycle Rules
```bash
# Upload with server-side encryption
aws s3 cp backup.tar.gz s3://backups/daily/ \
    --sse AES256 \
    --storage-class STANDARD_IA
```
```json
{
  "Rules": [
    {
      "ID": "TransitionToGlacier",
      "Status": "Enabled",
      "Transitions": [
        {
          "Days": 30,
          "StorageClass": "GLACIER"
        }
      ],
      "Expiration": {
        "Days": 365
      }
    }
  ]
}
```
### Cross-Region Replication
```hcl
# Terraform - replicate to another region
resource "aws_s3_bucket_replication_configuration" "backup" {
  bucket = aws_s3_bucket.backup.id
  role   = aws_iam_role.replication.arn

  rule {
    id     = "replicate-all"
    status = "Enabled"

    destination {
      bucket        = aws_s3_bucket.backup_replica.arn
      storage_class = "STANDARD_IA"
    }
  }
}
```
## Database-Specific Strategies

### PostgreSQL Point-in-Time Recovery
```bash
# Enable WAL archiving in postgresql.conf
archive_mode = on
archive_command = 'aws s3 cp %p s3://wal-archive/%f'

# Recovery: restore base backup + replay WAL
pg_basebackup -D /var/lib/postgresql/data -Fp -Xs -P

# recovery.signal triggers recovery mode
touch /var/lib/postgresql/data/recovery.signal

# postgresql.auto.conf
restore_command = 'aws s3 cp s3://wal-archive/%f %p'
recovery_target_time = '2026-03-05 10:30:00'
```
### MySQL Binary Log Backup
```bash
# Enable binary logging in my.cnf
[mysqld]
log_bin = /var/log/mysql/mysql-bin
binlog_expire_logs_seconds = 604800

# Backup binlogs
mysqlbinlog --read-from-remote-server \
    --host=db.example.com \
    --raw \
    --result-file=/backups/binlog/ \
    mysql-bin.000001
```
## Monitoring Backups

Backups fail silently. Monitor them:
```bash
#!/bin/bash
# check-backup-freshness.sh
MAX_AGE_HOURS=25
BACKUP_DIR=/backups/daily

LATEST=$(ls -t "$BACKUP_DIR"/*.dump 2>/dev/null | head -1)

if [[ -z "$LATEST" ]]; then
    echo "CRITICAL: No backups found"
    exit 2
fi

AGE_HOURS=$(( ( $(date +%s) - $(stat -c %Y "$LATEST") ) / 3600 ))

if [[ $AGE_HOURS -gt $MAX_AGE_HOURS ]]; then
    echo "WARNING: Latest backup is ${AGE_HOURS}h old"
    exit 1
fi

echo "OK: Latest backup is ${AGE_HOURS}h old"
exit 0
```
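The freshness check above catches missing backups; a size check catches truncated ones. A sketch with the comparison factored into a function (the 50% threshold and the example byte counts are assumptions; real sizes would come from `stat -c %s` on recent dumps):

```bash
#!/bin/bash
# Sketch: flag a latest backup less than half the recent average size.
# size_anomaly LATEST_BYTES AVERAGE_BYTES -> nonzero when suspicious

size_anomaly() {
    local latest=$1 avg=$2
    if (( latest * 2 < avg )); then
        return 1    # suspiciously small
    fi
    return 0
}

# Example: a 100 MB latest dump against a 2 GB recent average should alarm
if ! size_anomaly 100000000 2000000000; then
    echo "WARNING: latest backup is suspiciously small"
fi
```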
Alert on:

- Backup job failures
- Backup age exceeding threshold
- Backup size anomalies (suddenly much smaller = probably broken)
- Restore test failures

## Disaster Recovery Runbook

Document the restore process before you need it:
```markdown
# Database Restore Runbook

## Prerequisites
- Access to S3 backup bucket
- Database admin credentials
- Target server with PostgreSQL installed

## Steps

1. Identify latest good backup
   aws s3 ls s3://backups/daily/ --recursive | sort | tail -5

2. Download backup
   aws s3 cp s3://backups/daily/YYYY-MM-DD.dump /tmp/

3. Stop application servers
   systemctl stop myapp

4. Restore database
   pg_restore -h localhost -U postgres -d mydb --clean /tmp/YYYY-MM-DD.dump

5. Verify restore
   psql -c "SELECT COUNT(*) FROM users;"

6. Restart application
   systemctl start myapp

---

## Contacts
- DBA on-call:
- Cloud access:
```

**Test the runbook.** If you can't follow it during a drill, you can't follow it during a crisis.
## Common Failures

**Backup succeeds, restore fails**

- Test restores regularly
- Verify backup integrity after creation

**Backups exist but are corrupted**

- Checksums on every backup
- Periodic integrity verification

**Can't find the right backup**

- Clear naming conventions: `dbname_YYYY-MM-DD_HH-MM.dump`
- Metadata tags in cloud storage

**Restore takes too long**

- Know your RTO (Recovery Time Objective)
- Test restore duration
- Consider faster restore options (snapshots vs logical backups)

**Encryption key lost**

- Key escrow or recovery procedures
- Document key locations

## The Test That Matters

Once a quarter, answer this question:
“If our production database disappeared right now, how long until we’re back online with yesterday’s data?”
If you don’t know the answer, your backup strategy has a hole. Fill it before you need it.