The scariest deploy isn’t code; it’s a schema change. One wrong migration can lock tables, corrupt data, or bring down production. Zero-downtime migrations take discipline, but they’re achievable.

The Problem

Traditional migrations assume you can take the database offline:

-- Dangerous in production
ALTER TABLE users ADD COLUMN phone VARCHAR(20) NOT NULL;

This locks the table, blocks all reads and writes, and fails if any existing rows lack a value. In a busy system, that’s an outage.

The Expand-Contract Pattern

Safe migrations happen in phases:

  1. Expand: Add new structure (backward compatible)
  2. Migrate: Backfill data, update application
  3. Contract: Remove old structure

Each step is independently deployable. If something breaks, you can stop without rollback.
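
The phased rollout can be sketched as an ordered list of independently runnable steps — a minimal sketch, assuming a generic node-postgres-style `db.query`; the phase names and SQL are illustrative:

```javascript
// Each phase is a separate deploy: run one, verify, then move on.
const phases = [
  { name: 'expand',   run: (db) => db.query("ALTER TABLE users ADD COLUMN IF NOT EXISTS phone VARCHAR(20)") },
  { name: 'migrate',  run: (db) => db.query("UPDATE users SET phone = 'unknown' WHERE phone IS NULL") },
  { name: 'contract', run: (db) => db.query("ALTER TABLE users ALTER COLUMN phone SET NOT NULL") },
];

async function runPhase(db, name) {
  const phase = phases.find((p) => p.name === name);
  if (!phase) throw new Error(`unknown phase: ${name}`);
  await phase.run(db); // if this fails, earlier phases remain valid — just stop here
  return phase.name;
}
```

Because each phase is backward compatible on its own, a failure mid-sequence leaves the system in a working state.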

Example: Adding a Required Column

Phase 1: Expand (add nullable column)

ALTER TABLE users ADD COLUMN phone VARCHAR(20);  -- Nullable: metadata-only, no table rewrite

Phase 2: Migrate (backfill + update app)

-- Backfill in batches to avoid long locks
-- (PostgreSQL has no LIMIT on UPDATE, so select the batch in a subquery)
UPDATE users SET phone = 'unknown'
WHERE id IN (SELECT id FROM users WHERE phone IS NULL LIMIT 1000);
// Application writes to new column
await db.query(
  'INSERT INTO users (name, email, phone) VALUES ($1, $2, $3)',
  [name, email, phone || 'unknown']
);
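
Until the backfill finishes, reads must tolerate rows where `phone` is still NULL. A small sketch of the read-side fallback, using the same 'unknown' sentinel as the write path:

```javascript
// Normalize a row read during the migrate phase: older rows may
// still have phone = NULL until the backfill completes.
function normalizeUser(row) {
  return { ...row, phone: row.phone ?? 'unknown' };
}
```

Once the contract phase lands and the column is NOT NULL, this fallback can be deleted.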

Phase 3: Contract (add constraint)

-- Only after all rows have values
ALTER TABLE users ALTER COLUMN phone SET NOT NULL;
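
Before running the contract step, it’s worth verifying programmatically that no NULLs remain — a hedged sketch, assuming a node-postgres-style `db.query` that returns `{ rows }`:

```javascript
// Refuse to apply NOT NULL while any row would violate it.
async function assertBackfillComplete(db) {
  const { rows } = await db.query(
    'SELECT COUNT(*)::int AS remaining FROM users WHERE phone IS NULL'
  );
  if (rows[0].remaining > 0) {
    throw new Error(`${rows[0].remaining} rows still have NULL phone`);
  }
}
```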

Non-Blocking DDL Operations

PostgreSQL and MySQL have different locking behaviors. Know your database.

PostgreSQL: Safe Operations

-- Non-blocking (metadata only)
ALTER TABLE users ADD COLUMN status VARCHAR(20);
ALTER TABLE users ALTER COLUMN status SET DEFAULT 'active';
CREATE INDEX CONCURRENTLY idx_users_status ON users(status);

-- Blocking: full table scan while holding a lock (avoid on large tables;
-- CHECK constraints can be added as NOT VALID, then validated separately)
ALTER TABLE users ALTER COLUMN status SET NOT NULL;
ALTER TABLE users ADD CONSTRAINT check_status CHECK (status IN ('active', 'inactive'));

The CONCURRENTLY keyword is crucial for indexes—without it, the table is locked during index creation.
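
One caveat: if CREATE INDEX CONCURRENTLY fails partway (a deadlock, a unique violation), it leaves behind an INVALID index that must be dropped before retrying. A retry wrapper can handle this — a sketch assuming a node-postgres-style client, with illustrative names:

```javascript
// CREATE INDEX CONCURRENTLY can't run inside a transaction, and a
// failed attempt leaves an INVALID index behind: drop it and retry.
async function createIndexSafely(db, indexName, createSql, attempts = 3) {
  for (let i = 0; i < attempts; i++) {
    try {
      await db.query(createSql);
      return true;
    } catch (err) {
      await db.query(`DROP INDEX CONCURRENTLY IF EXISTS ${indexName}`);
      if (i === attempts - 1) throw err;
    }
  }
}
```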

MySQL: Online DDL

-- Online (InnoDB)
ALTER TABLE users ADD COLUMN status VARCHAR(20), ALGORITHM=INPLACE, LOCK=NONE;

-- Spelling out ALGORITHM and LOCK makes MySQL fail fast with an
-- error if the operation can't run online, instead of silently
-- falling back to a table-copying rebuild

Renaming Columns Safely

Never rename directly. Use expand-contract:

Phase 1: Add new column

ALTER TABLE users ADD COLUMN full_name VARCHAR(255);

Phase 2: Dual-write

// Write to both columns
await db.query(
  'INSERT INTO users (name, full_name, email) VALUES ($1, $1, $2)',
  [fullName, email]
);

Phase 3: Backfill

UPDATE users SET full_name = name WHERE full_name IS NULL;  -- batch this on large tables

Phase 4: Switch reads

// Read from new column
const user = await db.query('SELECT full_name as name FROM users WHERE id = $1', [id]);

Phase 5: Drop old column

ALTER TABLE users DROP COLUMN name;
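
The dual-write and read-switch phases are easy to get wrong when column names are scattered across the codebase. Centralizing them behind one accessor keeps each phase a one-line change — a minimal sketch with an in-memory stand-in for the table (names illustrative):

```javascript
// phase is flipped per deploy: 'dual-write' during phases 2-3,
// 'new-only' after the read switch in phase 4.
function makeUserStore(phase) {
  const rows = []; // in-memory stand-in for the users table
  return {
    insert({ fullName, email }) {
      const row = { email, full_name: fullName };
      if (phase === 'dual-write') row.name = fullName; // keep old column populated
      rows.push(row);
      return row;
    },
    displayName(row) {
      // Reads move from the old column to the new one in phase 4.
      return phase === 'dual-write' ? row.name : row.full_name;
    },
  };
}
```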

Foreign Key Considerations

Adding foreign keys can lock both tables:

-- Dangerous: locks both tables
ALTER TABLE orders ADD CONSTRAINT fk_user 
  FOREIGN KEY (user_id) REFERENCES users(id);

Safe approach:

-- 1. Add constraint as NOT VALID (no full table scan)
ALTER TABLE orders ADD CONSTRAINT fk_user 
  FOREIGN KEY (user_id) REFERENCES users(id) NOT VALID;

-- 2. Validate separately; this takes only a lightweight lock,
--    so normal reads and writes continue during the scan
ALTER TABLE orders VALIDATE CONSTRAINT fk_user;
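
Validation fails if orphaned rows already exist, so it pays to find them first — a hedged sketch assuming node-postgres and the `orders`/`users` tables above:

```javascript
// Rows that would make VALIDATE CONSTRAINT fail: orders pointing
// at users that don't exist.
async function findOrphanedOrders(db, limit = 100) {
  const { rows } = await db.query(
    `SELECT o.id FROM orders o
     LEFT JOIN users u ON u.id = o.user_id
     WHERE o.user_id IS NOT NULL AND u.id IS NULL
     LIMIT $1`,
    [limit]
  );
  return rows.map((r) => r.id);
}
```

Fix or delete the orphans (a product decision, not a migration one), then validate.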

Batch Processing for Backfills

Never update millions of rows in one transaction:

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function backfillInBatches(batchSize = 1000) {
  let updated = 0;
  
  while (true) {
    const result = await db.query(`
      UPDATE users 
      SET status = 'active' 
      WHERE id IN (
        SELECT id FROM users 
        WHERE status IS NULL 
        LIMIT $1
        FOR UPDATE SKIP LOCKED
      )
      RETURNING id
    `, [batchSize]);
    
    updated += result.rowCount;
    console.log(`Backfilled ${updated} rows`);
    
    if (result.rowCount < batchSize) break;
    
    // Pause to reduce load
    await sleep(100);
  }
  
  return updated;
}

Key techniques:

  • LIMIT keeps transactions small
  • FOR UPDATE SKIP LOCKED avoids blocking concurrent operations
  • Sleep between batches reduces database load

Migration Tooling

Use a migration framework that tracks state:

// migrations/20260216_add_user_status.js
exports.up = async (db) => {
  await db.query('ALTER TABLE users ADD COLUMN status VARCHAR(20)');
};

exports.down = async (db) => {
  await db.query('ALTER TABLE users DROP COLUMN status');
};

Popular tools:

  • Flyway — SQL-based, version controlled
  • Liquibase — XML/YAML/SQL, rollback support
  • Knex/Prisma — ORM-integrated migrations
  • golang-migrate — Lightweight, database-agnostic

Testing Migrations

Test against production-like data:

# Dump production schema (no data)
pg_dump --schema-only production_db > schema.sql

# Create test database
psql test_db < schema.sql

# Insert realistic data volume
./scripts/generate_test_data.sh

# Run migration
./migrate up

# Verify
./scripts/verify_migration.sh

Test for:

  • Lock duration on large tables
  • Data integrity after backfill
  • Application compatibility during transition
  • Rollback functionality
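
A verification script can be as simple as asserting invariants that must hold after the migration — a hedged sketch (the specific checks, and the `db.query` interface, are illustrative):

```javascript
// Post-migration invariants: every row backfilled, totals unchanged.
async function verifyMigration(db, expectedRowCount) {
  const checks = [
    { name: 'no NULL status rows',
      sql: 'SELECT COUNT(*)::int AS n FROM users WHERE status IS NULL', want: 0 },
    { name: 'row count unchanged',
      sql: 'SELECT COUNT(*)::int AS n FROM users', want: expectedRowCount },
  ];
  const failures = [];
  for (const check of checks) {
    const { rows } = await db.query(check.sql);
    if (rows[0].n !== check.want) failures.push(check.name);
  }
  return failures; // empty array means the migration verified clean
}
```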

Rollback Strategy

Every migration should have a rollback plan:

// Migration with rollback
exports.up = async (db) => {
  // Expand: add column
  await db.query('ALTER TABLE users ADD COLUMN phone VARCHAR(20)');
};

exports.down = async (db) => {
  // Contract: remove column (safe if no app depends on it)
  await db.query('ALTER TABLE users DROP COLUMN phone');
};

But for complex migrations, rollback might not be a simple reverse:

// Sometimes rollback is "keep both, revert app"
exports.down = async (db) => {
  // Don't drop the new column
  // Instead, document manual cleanup steps
  console.log('Rollback: revert application to use old column');
  console.log('Manual step: DROP COLUMN after verification');
};

Common Mistakes

Running migrations at deploy time:

# Dangerous: migration runs during every pod startup
command: ["sh", "-c", "./migrate up && ./app"]

Run migrations separately, before deploying new code:

./migrate up          # Run migration
kubectl rollout ...   # Deploy new code

Assuming instant migrations: Large table alterations take time. A 100M row table might need hours for certain operations.

Ignoring version overlap: During a rollout, old application code runs against the new schema, and new code may briefly run against the old one. Both combinations must work.

Ignoring replication lag: Writes hit primary immediately; reads from replicas see old data until replication catches up.
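
A common mitigation is read-your-writes routing: pin a user's reads to the primary for a short window after that user writes. A minimal sketch with illustrative names (the pin window is an assumption, tuned to your replication lag):

```javascript
// After a write, route this user's reads to the primary long enough
// for replication to catch up.
function makeReplicaRouter({ primary, replica, pinMs = 2000 }) {
  const pinnedUntil = new Map(); // userId -> timestamp
  return {
    write(userId) {
      pinnedUntil.set(userId, Date.now() + pinMs);
      return primary;
    },
    read(userId) {
      const until = pinnedUntil.get(userId) ?? 0;
      return Date.now() < until ? primary : replica;
    },
  };
}
```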

The Mental Model

Think of database migrations like renovating a house while people live in it:

  • You can’t tear down walls (breaking changes) without temporary supports (dual-write)
  • You work in phases (expand-contract)
  • You keep one working bathroom at all times (backward compatibility)
  • You clean up construction debris after the work is done (contract phase)

The goal: residents never notice the renovation. Your users should never notice the migration.

Schema changes are scary because they’re hard to undo. But with expand-contract, batched backfills, and proper testing, they become just another deploy.