Zero-Downtime PostgreSQL Migrations: Expand/Contract, Backfill and Rollback Strategies

Zero-downtime migrations are possible, but you have to treat them as an engineering project. Friday, 4 PM. I ran ALTER TABLE on a table with 47 million rows. 3 seconds later, Slack lit up red - production was down. I spent the next 4 hours on rollback and explaining why this would never happen again.

That was 6 years ago. Since then, I’ve done hundreds of migrations on production databases without a single outage. These rules are carved in stone for me.

Tested on: PostgreSQL 14-16, tables from 1M to 200M rows, AWS RDS and self-hosted. All examples in this article I’ve personally run on production systems.

What “Zero-Downtime” Means in Practice

Definition: The application remains fully functional throughout the entire migration. No error pages, no timeouts, no “try again later.”

Reality:

Latency may temporarily increase (acceptable: +20%)
Some operations may be slower (acceptable: +50%)
No failed requests for end users

Most Common Causes of Outages

Cause	Why It’s a Problem	Solution
Table lock	Blocks all queries	Online DDL
Full table rewrite	Long exclusive operation	Expand/contract
Long transactions	Blocking others	Timeouts, batching
Incompatible changes	App crash	Feature flags

Core Pattern: Expand / Contract

This is the most important pattern. Every migration has two phases:

Expand (expansion)

Add new (columns, tables, indexes)
Old code still works
New code can start using new structures

Contract (contraction)

Remove old (only when nothing uses it)
Clean up data
Finalize migration

Example: Renaming a Column

Bad way (downtime):

ALTER TABLE users RENAME COLUMN name TO full_name;
-- All queries fail immediately!

Correct way (zero-downtime):

-- Phase 1: Expand
ALTER TABLE users ADD COLUMN full_name VARCHAR(255);

-- Phase 2: Backfill (batched) - PostgreSQL syntax
UPDATE users SET full_name = name
WHERE id IN (
    SELECT id FROM users
    WHERE full_name IS NULL
    LIMIT 1000
);
-- (repeat until done)

-- Phase 3: App update
-- Deploy code that reads from both, writes to both

-- Phase 4: App update 2
-- Deploy code that reads only from full_name

-- Phase 5: Contract
ALTER TABLE users DROP COLUMN name;

Two-Phase Deploy with Feature Flags

The application must be able to work with both old AND new schema simultaneously.

# Phase: Expand
def get_user_name(user):
    if feature_flag('use_full_name'):
        return user.full_name or user.name  # fallback to old
    return user.name

def save_user(user, name):
    user.name = name
    if feature_flag('use_full_name'):
        user.full_name = name  # write to both
    user.save()

Types of Changes and Recommended Approach

Adding a Column (safe)

-- Safe - no lock on existing rows
ALTER TABLE users ADD COLUMN avatar_url VARCHAR(255);

-- BUT BEWARE: with DEFAULT it's a problem!
-- PostgreSQL 11+ is OK, older versions = table rewrite
ALTER TABLE users ADD COLUMN status VARCHAR(20) DEFAULT 'active';

Changing Column Type (dangerous)

-- NOT like this - table rewrite!
ALTER TABLE orders ALTER COLUMN amount TYPE NUMERIC(12,2);

-- YES like this - expand/contract
-- 1. Add new column
ALTER TABLE orders ADD COLUMN amount_new NUMERIC(12,2);
-- 2. Backfill
UPDATE orders SET amount_new = amount WHERE amount_new IS NULL;
-- 3. Switch app
-- 4. Drop old
ALTER TABLE orders DROP COLUMN amount;
ALTER TABLE orders RENAME COLUMN amount_new TO amount;

Adding NOT NULL (dangerous)

-- NOT like this - full table scan with lock
ALTER TABLE users ALTER COLUMN email SET NOT NULL;

-- YES like this
-- 1. Add constraint as NOT VALID
ALTER TABLE users ADD CONSTRAINT users_email_not_null
  CHECK (email IS NOT NULL) NOT VALID;
-- 2. Validate in background (no lock)
ALTER TABLE users VALIDATE CONSTRAINT users_email_not_null;
-- 3. Now you can add NOT NULL (fast, because constraint already exists)
ALTER TABLE users ALTER COLUMN email SET NOT NULL;
ALTER TABLE users DROP CONSTRAINT users_email_not_null;

Adding UNIQUE (dangerous)

-- NOT like this - long lock
ALTER TABLE users ADD CONSTRAINT users_email_unique UNIQUE (email);

-- YES like this
-- 1. Create index CONCURRENTLY (no lock)
CREATE UNIQUE INDEX CONCURRENTLY users_email_unique_idx ON users(email);
-- 2. Add constraint using index (fast)
ALTER TABLE users ADD CONSTRAINT users_email_unique
  UNIQUE USING INDEX users_email_unique_idx;

Online Indexes

-- ALWAYS use CONCURRENTLY for production
CREATE INDEX CONCURRENTLY idx_orders_user_id ON orders(user_id);

-- If it fails, you must drop the INVALID index
DROP INDEX CONCURRENTLY idx_orders_user_id;
-- And try again

⚠️ Important: CREATE INDEX CONCURRENTLY does not work inside a transaction! If you use a migration tool like Flyway or Liquibase, make sure the migration with CONCURRENTLY runs outside a transaction (e.g., in Flyway use -- flyway:disableTransaction).

Strategies for Large Tables

For tables with millions of rows:

Create index during off-peak hours (even CONCURRENTLY loads DB)
Set maintenance_work_mem higher (temporarily)
Monitor replication lag (replica may fall behind)

-- Temporarily increase memory for session
SET maintenance_work_mem = '1GB';
CREATE INDEX CONCURRENTLY ...;
RESET maintenance_work_mem;

Backfilling Data Without Killing the DB

Basic Batching

BATCH_SIZE = 1000
SLEEP_BETWEEN = 0.1  # 100ms

def backfill_full_name():
    while True:
        with transaction():
            # PostgreSQL doesn't support UPDATE...LIMIT, use subquery
            count = db.execute("""
                UPDATE users
                SET full_name = name
                WHERE id IN (
                    SELECT id FROM users
                    WHERE full_name IS NULL
                    LIMIT %s
                )
            """, [BATCH_SIZE])

        if count == 0:
            break

        time.sleep(SLEEP_BETWEEN)
        log(f"Migrated {count} rows")

Throttling Based on Load

def backfill_with_throttle():
    while True:
        # Check load before each batch
        lag = get_replication_lag()
        if lag > timedelta(seconds=5):
            log("Replication lag too high, pausing...")
            time.sleep(10)
            continue

        connections = get_active_connections()
        if connections > 100:
            log("Too many connections, pausing...")
            time.sleep(5)
            continue

        # Run batch
        migrate_batch()

Idempotency

Migration MUST be restartable:

def migrate_batch():
    # Idempotent update - doesn't matter if run multiple times
    # PostgreSQL doesn't support UPDATE...ORDER BY...LIMIT, use subquery
    db.execute("""
        UPDATE users
        SET full_name = name
        WHERE id IN (
            SELECT id FROM users
            WHERE full_name IS NULL
            AND id > %s
            ORDER BY id
            LIMIT 1000
        )
    """, [last_processed_id])

Rollback Strategy

Every migration must have a rollback plan BEFORE execution.

Forward-Compatible Code

# Code must work with both schemas
def get_user(id):
    user = db.query("SELECT * FROM users WHERE id = %s", [id])
    # If new column doesn't exist, use old
    return {
        'name': user.full_name if hasattr(user, 'full_name') else user.name
    }

Safe Rollback Checklist

Before deploying migration:

Does rollback script exist?
Is rollback tested on staging?
Do we have backup of data that will be changed?
Is rollback time-bounded? (max 15 min)
Who has authority to trigger rollback?

Rollback Script Template

-- rollback_001_add_full_name.sql
-- Description: Rollback adding full_name column
-- Prerequisites: App must be on version < X.Y.Z

BEGIN;

-- Check that app no longer uses new column
DO $$
BEGIN
    IF EXISTS (
        SELECT 1 FROM pg_stat_activity
        WHERE query LIKE '%full_name%' AND state = 'active'
    ) THEN
        RAISE EXCEPTION 'Active queries using full_name column!';
    END IF;
END
$$;

-- Drop column
ALTER TABLE users DROP COLUMN IF EXISTS full_name;

COMMIT;

Observability for Migrations

Metrics to Monitor

-- Lock wait time
SELECT pid, wait_event_type, wait_event, query
FROM pg_stat_activity
WHERE wait_event IS NOT NULL;

-- Replication lag
SELECT client_addr,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) as lag_bytes
FROM pg_stat_replication;

-- Long running queries
SELECT pid, now() - pg_stat_activity.query_start AS duration, query
FROM pg_stat_activity
WHERE state = 'active'
  AND now() - pg_stat_activity.query_start > interval '30 seconds';

Alerts During Migration

Set these alerts to lower thresholds during migration:

Metric	Normal Threshold	During Migration
Query latency p99	500ms	200ms
Error rate	1%	0.1%
Replication lag	30s	5s
Connection count	200	150

Runbook for Production Rollout

T-24h: Preparation

Migration tested on staging with production data
Rollback script prepared and tested
Team informed about migration window
Monitoring dashboards prepared

T-1h: Pre-flight

Backup started and verified
Replication lag < 1s
No long-running transactions
Error rate baseline recorded

T-0: Migration

Start migration
Monitor metrics every 30s
On any alert: STOP and evaluate

Stop Conditions (immediate rollback)

Error rate > 1%
Replication lag > 10s
Query latency p99 > 2x baseline
Any ERROR in app logs related to migration

Post-Migration

Validate data
Clear rollback alert
Document lessons learned

Tooling: Flyway/Liquibase Safely

Separate DDL and DML

migrations/
  V001__create_users_table.sql      # DDL
  V002__add_status_column.sql       # DDL
  R001__backfill_status.sql         # DML - repeatable

Review Rules

Every migration must pass review with checklist:

Timeouts for Migrations

-- In every migration
SET LOCAL lock_timeout = '5s';
SET LOCAL statement_timeout = '30s';

-- If timeout, migration fails and you can retry
ALTER TABLE ...;

Conclusion: Checklist Before Every Migration

## Pre-migration Checklist

### Planning
- [ ] Do I understand what the migration does?
- [ ] Is this expand or contract phase?
- [ ] What is the expected run time?
- [ ] How many rows/tables are affected?

### Safety
- [ ] Do I have a rollback script?
- [ ] Is the code forward-compatible?
- [ ] Do I have a backup?
- [ ] Are timeouts set?

### Execution
- [ ] Did staging test pass?
- [ ] Is monitoring ready?
- [ ] Is the team informed?
- [ ] Are stop conditions defined?

### Post-migration
- [ ] Is data valid?
- [ ] Are metrics normal?
- [ ] Is documentation updated?

Your next step: Take the last migration you did. Did it meet this checklist? If not, what would you do differently?

Frequently Asked Questions (FAQ)

How long does an expand/contract migration take?

Depends on table size and change complexity. For a table with millions of rows, expect days to weeks (including backfill). Never do it over the weekend - do it during working days when you can respond.

Can I use pg_repack instead of expand/contract?

pg_repack is useful for table bloat and some types of changes, but it’s not a universal solution. For column type changes or adding NOT NULL, you still need the expand/contract pattern.

What if backfill takes too long?

Increase batch size (if DB can handle it), add more parallel workers, or consider pg_dump/restore for extremely large tables. Always monitor replication lag.

Is it safe to run migrations during peak hours?

With the correct expand/contract pattern, yes. But I recommend starting with lower traffic until you build confidence in the process. Do your first 5 migrations during off-peak hours.

Architecture as Code: ADR, C4 Diagrams and CI Quality Gates - Document database decisions using ADR
CI/CD for Monorepo: Speed, Caching, Selective Tests and Supply-Chain Security - How to automate database migrations in your CI/CD pipeline