Database migrations are where deployments go to die. One bad migration can corrupt data, lock tables for hours, or bring down production entirely. Here’s how to make them boring.

The Golden Rules

  1. Every migration must be reversible (or explicitly marked as not)
  2. Never run migrations during deploy (separate the concerns)
  3. Always test against production-scale data (10 rows works, 10 million doesn’t)
  4. Assume the migration will fail halfway (design for it)

The Expand-Contract Pattern

The safest way to change schemas: expand first, contract later.

Bad: Rename column in one migration

-- DANGER: This locks the table and breaks running code
ALTER TABLE users RENAME COLUMN email TO email_address;

Good: Expand, migrate, contract

-- Migration 1: Add new column
ALTER TABLE users ADD COLUMN email_address VARCHAR(255);

-- Migration 2: Backfill data (run separately, in batches)
UPDATE users SET email_address = email WHERE email_address IS NULL;

-- Migration 3: Update code to write to both columns
-- (deploy this, verify it works)

-- Migration 4: Update code to read from new column
-- (deploy this, verify it works)

-- Migration 5: Drop old column (weeks later)
ALTER TABLE users DROP COLUMN email;

Yes, it’s five steps instead of one. Yes, it’s worth it.
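The dual-write and dual-read steps are the easiest to get wrong in application code. A minimal sketch in Python, under assumptions not in the original (a dict-backed row and hypothetical `save_user` / `read_email` helpers):

```python
def save_user(row: dict, email: str) -> dict:
    """Dual-write phase (migration 3): write to both columns so old
    and new code paths stay correct until the contract step."""
    row["email"] = email            # old column: still read by old code
    row["email_address"] = email    # new column: will become canonical
    return row


def read_email(row: dict) -> str:
    """Dual-read phase (migration 4): prefer the new column, fall back
    to the old one for rows the backfill hasn't reached yet."""
    return row.get("email_address") or row["email"]
```

The fallback in `read_email` is what lets the backfill run at its own pace without a hard cutover.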

Batched Updates

Never update millions of rows in one transaction.

Bad:

UPDATE orders SET status = 'legacy' WHERE created_at < '2024-01-01';
-- Locks table, fills transaction log, probably times out

Good:

-- Run in a loop until no rows are affected
-- (COMMIT inside a DO block requires PostgreSQL 11+)
DO $$
DECLARE
  batch_size INT := 10000;
  rows_updated INT;
BEGIN
  LOOP
    UPDATE orders 
    SET status = 'legacy' 
    WHERE id IN (
      SELECT id FROM orders 
      WHERE created_at < '2024-01-01' 
        AND status != 'legacy'
      LIMIT batch_size
      FOR UPDATE SKIP LOCKED
    );
    
    GET DIAGNOSTICS rows_updated = ROW_COUNT;
    EXIT WHEN rows_updated = 0;
    
    COMMIT;
    PERFORM pg_sleep(0.1);  -- Brief pause to let other queries through
  END LOOP;
END $$;

Key elements:

  • LIMIT controls batch size
  • FOR UPDATE SKIP LOCKED avoids deadlocks
  • pg_sleep prevents overwhelming the database
  • Loop until done
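The same batching pattern looks like this when driven from application code. A sketch using Python's stdlib sqlite3 so it runs anywhere (SQLite has no SKIP LOCKED, so that clause is omitted; table and column names mirror the example above):

```python
import sqlite3


def batched_update(conn: sqlite3.Connection, batch_size: int = 10_000) -> int:
    """Mark old orders as 'legacy' in small committed batches so no
    single transaction holds locks for long. Returns total rows updated."""
    total = 0
    while True:
        cur = conn.execute(
            """
            UPDATE orders SET status = 'legacy'
            WHERE id IN (
                SELECT id FROM orders
                WHERE created_at < '2024-01-01' AND status != 'legacy'
                LIMIT ?
            )
            """,
            (batch_size,),
        )
        conn.commit()            # release locks between batches
        if cur.rowcount == 0:    # nothing left to touch: done
            return total
        total += cur.rowcount
        # in production, sleep briefly here (cf. pg_sleep above)
```

The WHERE clause excludes already-updated rows, so the loop is also safe to re-run after a crash — it simply picks up where it left off.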

Online Index Creation

Indexes on large tables lock writes. Use concurrent creation.

Bad:

CREATE INDEX idx_orders_customer ON orders(customer_id);
-- Table locked for writes until complete

Good:

CREATE INDEX CONCURRENTLY idx_orders_customer ON orders(customer_id);
-- Takes longer but doesn't block writes

Caveats:

  • Can’t run inside a transaction block
  • If it fails, it leaves behind an INVALID index (drop it and retry)
  • Find invalid indexes with: SELECT indexrelid::regclass FROM pg_index WHERE NOT indisvalid;

Adding NOT NULL Columns

Adding a NOT NULL column requires a default or backfill.

Bad:

ALTER TABLE users ADD COLUMN team_id INTEGER NOT NULL;
-- Fails: existing rows violate NOT NULL

Also bad:

ALTER TABLE users ADD COLUMN team_id INTEGER NOT NULL DEFAULT 1;
-- On older PostgreSQL: rewrites entire table

Good (PostgreSQL 11+):

-- Step 1: Add nullable column
ALTER TABLE users ADD COLUMN team_id INTEGER;

-- Step 2: Backfill in batches
-- (use batched update pattern from above)

-- Step 3: Add NOT NULL constraint
-- (scans the table to verify, but doesn't rewrite it)
ALTER TABLE users ALTER COLUMN team_id SET NOT NULL;

PostgreSQL 11+ can add NOT NULL with a default without rewriting the table, but backfilling separately gives you more control.

Foreign Key Considerations

Foreign keys validate on creation if data exists.

-- This scans the entire orders table
ALTER TABLE orders 
ADD CONSTRAINT fk_customer 
FOREIGN KEY (customer_id) REFERENCES customers(id);

For large tables, add the constraint as NOT VALID first:

-- Instant: doesn't validate existing data
ALTER TABLE orders 
ADD CONSTRAINT fk_customer 
FOREIGN KEY (customer_id) REFERENCES customers(id)
NOT VALID;

-- Then validate in background (can be interrupted)
ALTER TABLE orders VALIDATE CONSTRAINT fk_customer;

Migration Tools

Use a proper migration tool. Raw SQL files lead to chaos.

Flyway/Liquibase (Java ecosystem):

V1__create_users.sql
V2__add_email_column.sql
V3__create_orders.sql

Alembic (Python):

def upgrade():
    op.add_column('users', sa.Column('team_id', sa.Integer()))

def downgrade():
    op.drop_column('users', 'team_id')

golang-migrate:

000001_create_users.up.sql
000001_create_users.down.sql

Key features to require:

  • Version tracking (what’s been applied)
  • Reversibility (down migrations)
  • Checksum validation (detect tampering)
  • Dry-run mode (see what would run)
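To make the first three features concrete, here is a toy runner sketched with Python's stdlib sqlite3 — version tracking plus checksum validation, under hypothetical names (real tools like Flyway and golang-migrate add locking, down migrations, and dry-run on top of this core):

```python
import hashlib
import sqlite3


def apply_migrations(conn: sqlite3.Connection, migrations: list[tuple[str, str]]) -> list[str]:
    """Apply (version, sql) pairs exactly once, in order. Re-running is a
    no-op; a changed checksum for an applied version fails loudly."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_migrations "
        "(version TEXT PRIMARY KEY, checksum TEXT NOT NULL)"
    )
    applied = []
    for version, sql in migrations:
        checksum = hashlib.sha256(sql.encode()).hexdigest()
        row = conn.execute(
            "SELECT checksum FROM schema_migrations WHERE version = ?", (version,)
        ).fetchone()
        if row is not None:
            if row[0] != checksum:  # file edited after being applied
                raise RuntimeError(f"checksum mismatch for {version}")
            continue                # already applied: skip
        conn.executescript(sql)
        conn.execute(
            "INSERT INTO schema_migrations (version, checksum) VALUES (?, ?)",
            (version, checksum),
        )
        conn.commit()
        applied.append(version)
    return applied
```

The checksum check is the part people skip and later regret: it catches a migration file that was edited after it already ran somewhere.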

Testing Migrations

Against Production Data

Never test migrations against an empty database only.

# Dump production schema + a sample of data
# (pg_dump has no row-limit flag, so sample via \copy)
pg_dump --schema-only prod_db > schema.sql
psql prod_db -c "\copy (SELECT * FROM users LIMIT 10000) TO 'sample.csv' CSV"

# Restore to test database
psql test_db < schema.sql
psql test_db -c "\copy users FROM 'sample.csv' CSV"

# Run migration
flyway migrate -url=jdbc:postgresql://localhost/test_db

Better: restore a full production backup to a test environment nightly.

Timing Your Migrations

-- Wrap migration in timing
\timing on
BEGIN;
-- Your migration here
ALTER TABLE orders ADD COLUMN metadata JSONB;
ROLLBACK;  -- or COMMIT if you want to keep it

If it takes 5 seconds on your test data, multiply by the ratio of production rows to test rows for a first estimate — then pad it, because locking and index maintenance rarely scale linearly.
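That back-of-the-envelope math is worth writing down. A small helper (the linear-scaling assumption and the 2x safety factor are illustrative defaults, not measurements):

```python
def estimate_runtime(test_seconds: float, test_rows: int, prod_rows: int,
                     safety_factor: float = 2.0) -> float:
    """Linearly scale a measured test runtime to production row counts,
    padded by a safety factor for contention and index overhead."""
    return test_seconds * (prod_rows / test_rows) * safety_factor


# e.g. 5 s on 100k test rows, 10M production rows:
# estimate_runtime(5, 100_000, 10_000_000) -> 1000.0 s, ~17 min with 2x padding
```

If the padded estimate doesn't fit your maintenance window, that's the signal to switch to a batched or concurrent approach before running anything.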

Deployment Separation

Don’t run migrations as an automatic, unattended step of deploying application code.

Do run migrations as a separate, monitored step:

# Deployment pipeline
stages:
  - name: run-migrations
    script: flyway migrate
    manual: true  # Require explicit approval
    timeout: 30m
    
  - name: deploy-application
    needs: [run-migrations]
    script: kubectl apply -f deployment.yaml

Why separate?

  • Migrations can take unpredictable time
  • Failed migrations need different rollback than failed deploys
  • You might need to run migrations hours before deploy

Rollback Strategy

Not all migrations are reversible. Plan for it.

Reversible:

  • Add column
  • Add index
  • Add table
  • Add constraint (NOT VALID)

Not reversible (without data loss):

  • Drop column
  • Drop table
  • Data transformations

For irreversible migrations:

  1. Take a backup immediately before
  2. Document the rollback procedure (restore from backup)
  3. Have a tested restore process
  4. Consider: is the risk worth the change?

Lock Monitoring

Know when your migration is blocked or blocking.

-- Find blocking queries
SELECT 
  blocked.pid AS blocked_pid,
  blocked.query AS blocked_query,
  blocking.pid AS blocking_pid,
  blocking.query AS blocking_query
FROM pg_stat_activity blocked
JOIN pg_locks blocked_locks ON blocked.pid = blocked_locks.pid
JOIN pg_locks blocking_locks ON blocked_locks.locktype = blocking_locks.locktype
  AND blocked_locks.relation = blocking_locks.relation
  AND blocked_locks.pid != blocking_locks.pid
  AND blocking_locks.granted
JOIN pg_stat_activity blocking ON blocking_locks.pid = blocking.pid
WHERE NOT blocked_locks.granted;

Set statement timeouts for migrations:

SET statement_timeout = '5min';
-- Migration that might hang
ALTER TABLE orders ADD COLUMN...;

The Pre-Flight Checklist

Before running any migration:

  • Tested on production-scale data
  • Estimated runtime documented
  • Rollback procedure documented
  • Backup taken (or continuous backups verified)
  • Maintenance window scheduled (if needed)
  • Monitoring dashboards open
  • Team notified
  • Statement timeout set

Start Here

  1. Today: Add timing to your migration tests
  2. This week: Implement batched updates for large data changes
  3. This month: Set up production-data testing for migrations
  4. This quarter: Separate migration pipeline from deploy pipeline

The goal: migrations should be boring. Predictable. Uneventful. When they’re exciting, something has gone wrong.


The best migration is one nobody remembers running. It just worked, and everyone went home on time.