The scariest deploy isn’t code—it’s schema changes. One wrong migration can lock tables, corrupt data, or bring down production. Zero-downtime migrations require discipline, but they’re achievable.
## The Problem

Traditional migrations assume you can take the database offline:

```sql
-- Dangerous in production
ALTER TABLE users ADD COLUMN phone VARCHAR(20) NOT NULL;
```
This locks the table, blocks all reads and writes, and fails if any existing rows lack a value. In a busy system, that’s an outage.
## The Expand-Contract Pattern
Safe migrations happen in phases:
- Expand: Add new structure (backward compatible)
- Migrate: Backfill data, update application
- Contract: Remove old structure
Each step is independently deployable. If something breaks, you can stop without rollback.
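The phases above can be sketched as separate, resumable deploy steps. A minimal illustration (the phase names, SQL strings, and `runUpTo` helper are mine, not from any framework):

```javascript
// Each phase is its own deployable migration; run one, verify, then
// move on. Stopping after any phase leaves the system in a working state.
const phases = [
  { name: 'expand',   sql: "ALTER TABLE users ADD COLUMN phone VARCHAR(20)" },
  { name: 'migrate',  sql: "UPDATE users SET phone = 'unknown' WHERE phone IS NULL" },
  { name: 'contract', sql: "ALTER TABLE users ALTER COLUMN phone SET NOT NULL" },
];

// Run phases up to (and including) the named one, recording progress
// so a later deploy can resume where the previous one stopped.
function runUpTo(phaseName, execute, progress = []) {
  for (const phase of phases) {
    if (!progress.includes(phase.name)) {
      execute(phase.sql);
      progress.push(phase.name);
    }
    if (phase.name === phaseName) break;
  }
  return progress;
}

// Illustration with a stub executor that just records statements
const executed = [];
const progress = runUpTo('migrate', (sql) => executed.push(sql));
```

Because `progress` persists between deploys, a failure after "expand" means the next run skips straight to "migrate".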
## Example: Adding a Required Column

**Phase 1: Expand** (add nullable column)

```sql
ALTER TABLE users ADD COLUMN phone VARCHAR(20); -- Nullable, no lock
```
**Phase 2: Migrate** (backfill + update app)

```sql
-- Backfill in batches to avoid long locks
-- (PostgreSQL has no UPDATE ... LIMIT, so batch via a subquery)
UPDATE users SET phone = 'unknown'
WHERE id IN (SELECT id FROM users WHERE phone IS NULL LIMIT 1000);
```
```javascript
// Application writes to new column
await db.query(
  'INSERT INTO users (name, email, phone) VALUES ($1, $2, $3)',
  [name, email, phone || 'unknown']
);
```
**Phase 3: Contract** (add constraint)

```sql
-- Only after all rows have values
ALTER TABLE users ALTER COLUMN phone SET NOT NULL;
```
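Before the contract step, it's worth verifying the backfill actually finished; otherwise the `ALTER` fails mid-migration. A minimal guard, assuming a node-postgres-style `db.query` (a stub stands in for the real pool here):

```javascript
// Refuse to add NOT NULL while NULLs remain.
async function contractPhone(db) {
  const { rows } = await db.query(
    'SELECT COUNT(*) AS n FROM users WHERE phone IS NULL'
  );
  if (Number(rows[0].n) > 0) {
    throw new Error(`Backfill incomplete: ${rows[0].n} rows still NULL`);
  }
  await db.query('ALTER TABLE users ALTER COLUMN phone SET NOT NULL');
}

// Stub db for illustration: pretends two rows are still NULL,
// so the guard fires and the ALTER never runs.
const stubDb = {
  queries: [],
  async query(sql) {
    this.queries.push(sql);
    return { rows: [{ n: '2' }] };
  },
};

contractPhone(stubDb).catch((err) => console.error(err.message));
```

Side note: PostgreSQL 12 and later can skip the full-table scan `SET NOT NULL` performs if a validated `CHECK (phone IS NOT NULL)` constraint already exists, so on very large tables you can add that check as `NOT VALID`, validate it, then set `NOT NULL` and drop the check.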
## Non-Blocking DDL Operations

PostgreSQL and MySQL have different locking behaviors. Know your database.

### PostgreSQL: Safe Operations

```sql
-- Non-blocking (metadata only)
ALTER TABLE users ADD COLUMN status VARCHAR(20);
ALTER TABLE users ALTER COLUMN status SET DEFAULT 'active';
CREATE INDEX CONCURRENTLY idx_users_status ON users(status);

-- Blocking (avoid on large tables: both scan every row while holding a lock)
ALTER TABLE users ALTER COLUMN status SET NOT NULL;
ALTER TABLE users ADD CONSTRAINT check_status CHECK (status IN ('active', 'inactive'));
```
The CONCURRENTLY keyword is crucial for indexes: without it, CREATE INDEX blocks all writes to the table for the duration of the build (reads still work). Note that CONCURRENTLY cannot run inside a transaction block.
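One wrinkle: a failed `CREATE INDEX CONCURRENTLY` (deadlock, unique violation) leaves behind an INVALID index that must be dropped before retrying. A sketch of a retry wrapper; the `execute` callback stands in for a real database call:

```javascript
// A failed CREATE INDEX CONCURRENTLY leaves an INVALID index behind;
// drop it before retrying.
function createIndexWithRetry(execute, name, table, column, attempts = 3) {
  for (let i = 1; i <= attempts; i++) {
    try {
      execute(`CREATE INDEX CONCURRENTLY ${name} ON ${table}(${column})`);
      return i; // number of attempts used
    } catch (err) {
      execute(`DROP INDEX IF EXISTS ${name}`); // clean up the INVALID index
      if (i === attempts) throw err;
    }
  }
}

// Stub executor: the first CREATE fails, the retry succeeds
let creates = 0;
const statements = [];
const attemptsUsed = createIndexWithRetry((sql) => {
  statements.push(sql);
  if (sql.startsWith('CREATE') && ++creates === 1) {
    throw new Error('deadlock detected');
  }
}, 'idx_users_status', 'users', 'status');
```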
### MySQL: Online DDL

```sql
-- Online (InnoDB)
ALTER TABLE users ADD COLUMN status VARCHAR(20), ALGORITHM=INPLACE, LOCK=NONE;
```

Stating `ALGORITHM` and `LOCK` explicitly makes the migration fail fast: if InnoDB can't perform the operation in place without locking, the statement errors out immediately instead of silently falling back to a blocking table copy.
## Renaming Columns Safely

Never rename directly. Use expand-contract:

**Phase 1: Add new column**

```sql
ALTER TABLE users ADD COLUMN full_name VARCHAR(255);
```
**Phase 2: Dual-write**

```javascript
// Write to both columns
await db.query(
  'INSERT INTO users (name, full_name, email) VALUES ($1, $1, $2)',
  [fullName, email]
);
```
**Phase 3: Backfill**

```sql
UPDATE users SET full_name = name WHERE full_name IS NULL;
```
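While the backfill is still running, reads can tolerate half-migrated rows by coalescing the two columns, e.g. `SELECT COALESCE(full_name, name) AS name FROM users`. The same logic in application code, if you'd rather keep the SQL simple (the helper and sample rows are illustrative):

```javascript
// During the transition some rows have full_name populated and some
// don't; pick the first non-null value, like SQL's COALESCE.
const coalesceName = (row) => row.full_name ?? row.name;

// Half-migrated rows resolve correctly either way
const migrated = { name: 'Ada', full_name: 'Ada Lovelace' };
const notYet = { name: 'Grace', full_name: null };
```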
**Phase 4: Switch reads**

```javascript
// Read from new column
const user = await db.query('SELECT full_name AS name FROM users WHERE id = $1', [id]);
```
**Phase 5: Drop old column**

```sql
ALTER TABLE users DROP COLUMN name;
```
## Foreign Key Considerations

Adding foreign keys can lock both tables:

```sql
-- Dangerous: locks both tables while existing rows are checked
ALTER TABLE orders ADD CONSTRAINT fk_user
  FOREIGN KEY (user_id) REFERENCES users(id);
```
Safe approach:

```sql
-- 1. Add the constraint as NOT VALID (no scan of existing rows)
ALTER TABLE orders ADD CONSTRAINT fk_user
  FOREIGN KEY (user_id) REFERENCES users(id) NOT VALID;

-- 2. Validate separately: this scans the table, but takes only a
--    lighter lock, so normal reads and writes continue
ALTER TABLE orders VALIDATE CONSTRAINT fk_user;
```
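`VALIDATE CONSTRAINT` fails if any orphaned rows exist, so check for them first, e.g. `SELECT o.id FROM orders o LEFT JOIN users u ON o.user_id = u.id WHERE u.id IS NULL`. The same check sketched in application code with illustrative in-memory data:

```javascript
// Find orders whose user_id references no existing user; these rows
// would make VALIDATE CONSTRAINT fail.
function findOrphans(orders, users) {
  const userIds = new Set(users.map((u) => u.id));
  return orders.filter((o) => !userIds.has(o.user_id));
}

// Illustrative data: order 3 references a deleted user
const users = [{ id: 1 }, { id: 2 }];
const orders = [
  { id: 1, user_id: 1 },
  { id: 2, user_id: 2 },
  { id: 3, user_id: 99 },
];
const orphans = findOrphans(orders, users);
```

Fix or delete the orphans (a product decision, not a migration step), then validate.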
## Batch Processing for Backfills

Never update millions of rows in one transaction:

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function backfillInBatches(batchSize = 1000) {
  let updated = 0;
  while (true) {
    const result = await db.query(`
      UPDATE users
      SET status = 'active'
      WHERE id IN (
        SELECT id FROM users
        WHERE status IS NULL
        LIMIT $1
        FOR UPDATE SKIP LOCKED
      )
      RETURNING id
    `, [batchSize]);
    updated += result.rowCount;
    console.log(`Backfilled ${updated} rows`);
    if (result.rowCount < batchSize) break;
    // Pause to reduce load
    await sleep(100);
  }
  return updated;
}
```
Key techniques:

- `LIMIT` keeps transactions small
- `FOR UPDATE SKIP LOCKED` avoids blocking concurrent operations
- Sleep between batches reduces database load

## Migration Tooling

Use a migration framework that tracks state:
```javascript
// migrations/20260216_add_user_status.js
exports.up = async (db) => {
  await db.query('ALTER TABLE users ADD COLUMN status VARCHAR(20)');
};

exports.down = async (db) => {
  await db.query('ALTER TABLE users DROP COLUMN status');
};
Popular tools:
- Flyway — SQL-based, version controlled
- Liquibase — XML/YAML/SQL, rollback support
- Knex/Prisma — ORM-integrated migrations
- golang-migrate — Lightweight, database-agnostic
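What "tracks state" means in practice: the framework records applied migrations (usually in a `schema_migrations` table) and only runs the ones it hasn't seen. A toy version of that bookkeeping; real tools add locking, checksums, and transactions:

```javascript
// Run only migrations whose names aren't already recorded as applied.
// Real frameworks persist `applied` in a schema_migrations table and
// take a lock so two deploys can't migrate concurrently.
function runPending(migrations, applied, execute) {
  const ran = [];
  for (const m of migrations) {
    if (applied.has(m.name)) continue;
    execute(m.sql);
    applied.add(m.name); // record before moving on
    ran.push(m.name);
  }
  return ran;
}

const migrations = [
  { name: '20260216_add_user_status', sql: 'ALTER TABLE users ADD COLUMN status VARCHAR(20)' },
  { name: '20260217_add_status_index', sql: 'CREATE INDEX CONCURRENTLY idx_users_status ON users(status)' },
];

// A previous deploy already applied the first migration
const applied = new Set(['20260216_add_user_status']);
const executed = [];
const ran = runPending(migrations, applied, (sql) => executed.push(sql));
```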
## Testing Migrations

Test against production-like data:

```bash
# Dump production schema (no data)
pg_dump --schema-only production_db > schema.sql

# Create test database
psql test_db < schema.sql

# Insert realistic data volume
./scripts/generate_test_data.sh

# Run migration
./migrate up

# Verify
./scripts/verify_migration.sh
```
Test for:
- Lock duration on large tables
- Data integrity after backfill
- Application compatibility during transition
- Rollback functionality
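The "lock duration" check can be mechanical: time the migration against the production-sized test database and fail CI when it blows a budget. A minimal sketch (the budget and the `run` callback are illustrative, not from any tool):

```javascript
// Fail the test run if a migration takes longer than its budget.
// `run` stands in for invoking your real migration runner.
function assertWithinBudget(run, budgetMs) {
  const start = Date.now();
  run();
  const elapsed = Date.now() - start;
  if (elapsed > budgetMs) {
    throw new Error(`migration took ${elapsed}ms (budget ${budgetMs}ms)`);
  }
  return elapsed;
}

// Illustration with a no-op "migration" and a 500ms budget
const elapsed = assertWithinBudget(() => {}, 500);
```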
## Rollback Strategy

Every migration should have a rollback plan:

```javascript
// Migration with rollback
exports.up = async (db) => {
  // Expand: add column
  await db.query('ALTER TABLE users ADD COLUMN phone VARCHAR(20)');
};

exports.down = async (db) => {
  // Contract: remove column (safe if no app depends on it)
  await db.query('ALTER TABLE users DROP COLUMN phone');
};
```
But for complex migrations, rollback might not be a simple reverse:

```javascript
// Sometimes rollback is "keep both, revert app"
exports.down = async (db) => {
  // Don't drop the new column
  // Instead, document manual cleanup steps
  console.log('Rollback: revert application to use old column');
  console.log('Manual step: DROP COLUMN after verification');
};
```
## Common Mistakes

**Running migrations at deploy time:**

```yaml
# Dangerous: migration runs during every pod startup
# (exec-form commands don't run a shell, so "&&" isn't even interpreted)
command: ["./migrate", "up", "&&", "./app"]
```
Run migrations separately, before deploying new code:

```bash
./migrate up        # Run migration first
kubectl rollout ... # Then deploy new code
```
**Assuming instant migrations:**
Large table alterations take time. A 100M-row table might need hours for certain operations.

**No coordination with running code:**
Old application code must work with the new schema, and new code must work with the old schema (during rollout).

**Ignoring replication lag:**
Writes hit the primary immediately; reads from replicas see old data until replication catches up.
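A common mitigation is to pin a session's reads to the primary for a short window after it writes, so users read their own writes even while replicas lag. A sketch of that routing decision (the window length and session keys are illustrative):

```javascript
// After a write, route that session's reads to the primary until the
// replicas have plausibly caught up; otherwise replicas are fine.
const PIN_WINDOW_MS = 2000; // illustrative; tune to observed replication lag
const lastWriteAt = new Map(); // sessionId -> timestamp of last write

function recordWrite(sessionId, now = Date.now()) {
  lastWriteAt.set(sessionId, now);
}

function readTarget(sessionId, now = Date.now()) {
  const t = lastWriteAt.get(sessionId);
  return t !== undefined && now - t < PIN_WINDOW_MS ? 'primary' : 'replica';
}

// Fixed timestamps make the behavior easy to see
recordWrite('alice', 1000);
const during = readTarget('alice', 1500);  // inside the pin window
const after = readTarget('alice', 10000);  // window has passed
const other = readTarget('bob', 1500);     // never wrote anything
```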
## The Mental Model
Think of database migrations like renovating a house while people live in it:
- You can’t tear down walls (breaking changes) without temporary supports (dual-write)
- You work in phases (expand-contract)
- You keep one working bathroom at all times (backward compatibility)
- You clean up construction debris after the work is done (contract phase)
The goal: residents never notice the renovation. Your users should never notice the migration.
Schema changes are scary because they’re hard to undo. But with expand-contract, batched backfills, and proper testing, they become just another deploy.