Back to blog

Zero-Downtime PostgreSQL Migrations: Expand/Contract, Backfill and Rollback Strategies

|
| postgresql, database, devops, migrations, zero-downtime

Zero-downtime migrations are possible, but you have to treat them as an engineering project. Friday, 4 PM. I ran ALTER TABLE on a table with 47 million rows. 3 seconds later, Slack lit up red - production was down. I spent the next 4 hours on rollback and explaining why this would never happen again.

That was 6 years ago. Since then, I’ve done hundreds of migrations on production databases without a single outage. These rules are carved in stone for me.

Tested on: PostgreSQL 14-16, tables from 1M to 200M rows, AWS RDS and self-hosted. All examples in this article I’ve personally run on production systems.

What “Zero-Downtime” Means in Practice

Definition: The application remains fully functional throughout the entire migration. No error pages, no timeouts, no “try again later.”

Reality:

  • Latency may temporarily increase (acceptable: +20%)
  • Some operations may be slower (acceptable: +50%)
  • No failed requests for end users

Most Common Causes of Outages

CauseWhy It’s a ProblemSolution
Table lockBlocks all queriesOnline DDL
Full table rewriteLong exclusive operationExpand/contract
Long transactionsBlocking othersTimeouts, batching
Incompatible changesApp crashFeature flags

Core Pattern: Expand / Contract

This is the most important pattern. Every migration has two phases:

Expand (expansion)

  • Add new (columns, tables, indexes)
  • Old code still works
  • New code can start using new structures

Contract (contraction)

  • Remove old (only when nothing uses it)
  • Clean up data
  • Finalize migration

Example: Renaming a Column

Bad way (downtime):

ALTER TABLE users RENAME COLUMN name TO full_name;
-- All queries fail immediately!

Correct way (zero-downtime):

-- Phase 1: Expand
ALTER TABLE users ADD COLUMN full_name VARCHAR(255);

-- Phase 2: Backfill (batched) - PostgreSQL syntax
UPDATE users SET full_name = name
WHERE id IN (
    SELECT id FROM users
    WHERE full_name IS NULL
    LIMIT 1000
);
-- (repeat until done)

-- Phase 3: App update
-- Deploy code that reads from both, writes to both

-- Phase 4: App update 2
-- Deploy code that reads only from full_name

-- Phase 5: Contract
ALTER TABLE users DROP COLUMN name;

Two-Phase Deploy with Feature Flags

The application must be able to work with both old AND new schema simultaneously.

# Phase: Expand
def get_user_name(user):
    if feature_flag('use_full_name'):
        return user.full_name or user.name  # fallback to old
    return user.name

def save_user(user, name):
    user.name = name
    if feature_flag('use_full_name'):
        user.full_name = name  # write to both
    user.save()

Adding a Column (safe)

-- Safe - no lock on existing rows
ALTER TABLE users ADD COLUMN avatar_url VARCHAR(255);

-- BUT BEWARE: with DEFAULT it's a problem!
-- PostgreSQL 11+ is OK, older versions = table rewrite
ALTER TABLE users ADD COLUMN status VARCHAR(20) DEFAULT 'active';

Changing Column Type (dangerous)

-- NOT like this - table rewrite!
ALTER TABLE orders ALTER COLUMN amount TYPE NUMERIC(12,2);

-- YES like this - expand/contract
-- 1. Add new column
ALTER TABLE orders ADD COLUMN amount_new NUMERIC(12,2);
-- 2. Backfill
UPDATE orders SET amount_new = amount WHERE amount_new IS NULL;
-- 3. Switch app
-- 4. Drop old
ALTER TABLE orders DROP COLUMN amount;
ALTER TABLE orders RENAME COLUMN amount_new TO amount;

Adding NOT NULL (dangerous)

-- NOT like this - full table scan with lock
ALTER TABLE users ALTER COLUMN email SET NOT NULL;

-- YES like this
-- 1. Add constraint as NOT VALID
ALTER TABLE users ADD CONSTRAINT users_email_not_null
  CHECK (email IS NOT NULL) NOT VALID;
-- 2. Validate in background (no lock)
ALTER TABLE users VALIDATE CONSTRAINT users_email_not_null;
-- 3. Now you can add NOT NULL (fast, because constraint already exists)
ALTER TABLE users ALTER COLUMN email SET NOT NULL;
ALTER TABLE users DROP CONSTRAINT users_email_not_null;

Adding UNIQUE (dangerous)

-- NOT like this - long lock
ALTER TABLE users ADD CONSTRAINT users_email_unique UNIQUE (email);

-- YES like this
-- 1. Create index CONCURRENTLY (no lock)
CREATE UNIQUE INDEX CONCURRENTLY users_email_unique_idx ON users(email);
-- 2. Add constraint using index (fast)
ALTER TABLE users ADD CONSTRAINT users_email_unique
  UNIQUE USING INDEX users_email_unique_idx;

Online Indexes

-- ALWAYS use CONCURRENTLY for production
CREATE INDEX CONCURRENTLY idx_orders_user_id ON orders(user_id);

-- If it fails, you must drop the INVALID index
DROP INDEX CONCURRENTLY idx_orders_user_id;
-- And try again

⚠️ Important: CREATE INDEX CONCURRENTLY does not work inside a transaction! If you use a migration tool like Flyway or Liquibase, make sure the migration with CONCURRENTLY runs outside a transaction (e.g., in Flyway use -- flyway:disableTransaction).

Strategies for Large Tables

For tables with millions of rows:

  1. Create index during off-peak hours (even CONCURRENTLY loads DB)
  2. Set maintenance_work_mem higher (temporarily)
  3. Monitor replication lag (replica may fall behind)
-- Temporarily increase memory for session
SET maintenance_work_mem = '1GB';
CREATE INDEX CONCURRENTLY ...;
RESET maintenance_work_mem;

Backfilling Data Without Killing the DB

Basic Batching

BATCH_SIZE = 1000
SLEEP_BETWEEN = 0.1  # 100ms

def backfill_full_name():
    while True:
        with transaction():
            # PostgreSQL doesn't support UPDATE...LIMIT, use subquery
            count = db.execute("""
                UPDATE users
                SET full_name = name
                WHERE id IN (
                    SELECT id FROM users
                    WHERE full_name IS NULL
                    LIMIT %s
                )
            """, [BATCH_SIZE])

        if count == 0:
            break

        time.sleep(SLEEP_BETWEEN)
        log(f"Migrated {count} rows")

Throttling Based on Load

def backfill_with_throttle():
    while True:
        # Check load before each batch
        lag = get_replication_lag()
        if lag > timedelta(seconds=5):
            log("Replication lag too high, pausing...")
            time.sleep(10)
            continue

        connections = get_active_connections()
        if connections > 100:
            log("Too many connections, pausing...")
            time.sleep(5)
            continue

        # Run batch
        migrate_batch()

Idempotency

Migration MUST be restartable:

def migrate_batch():
    # Idempotent update - doesn't matter if run multiple times
    # PostgreSQL doesn't support UPDATE...ORDER BY...LIMIT, use subquery
    db.execute("""
        UPDATE users
        SET full_name = name
        WHERE id IN (
            SELECT id FROM users
            WHERE full_name IS NULL
            AND id > %s
            ORDER BY id
            LIMIT 1000
        )
    """, [last_processed_id])

Rollback Strategy

Every migration must have a rollback plan BEFORE execution.

Forward-Compatible Code

# Code must work with both schemas
def get_user(id):
    user = db.query("SELECT * FROM users WHERE id = %s", [id])
    # If new column doesn't exist, use old
    return {
        'name': user.full_name if hasattr(user, 'full_name') else user.name
    }

Safe Rollback Checklist

Before deploying migration:

  • Does rollback script exist?
  • Is rollback tested on staging?
  • Do we have backup of data that will be changed?
  • Is rollback time-bounded? (max 15 min)
  • Who has authority to trigger rollback?

Rollback Script Template

-- rollback_001_add_full_name.sql
-- Description: Rollback adding full_name column
-- Prerequisites: App must be on version < X.Y.Z

BEGIN;

-- Check that app no longer uses new column
DO $$
BEGIN
    IF EXISTS (
        SELECT 1 FROM pg_stat_activity
        WHERE query LIKE '%full_name%' AND state = 'active'
    ) THEN
        RAISE EXCEPTION 'Active queries using full_name column!';
    END IF;
END
$$;

-- Drop column
ALTER TABLE users DROP COLUMN IF EXISTS full_name;

COMMIT;

Observability for Migrations

Metrics to Monitor

-- Lock wait time
SELECT pid, wait_event_type, wait_event, query
FROM pg_stat_activity
WHERE wait_event IS NOT NULL;

-- Replication lag
SELECT client_addr,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) as lag_bytes
FROM pg_stat_replication;

-- Long running queries
SELECT pid, now() - pg_stat_activity.query_start AS duration, query
FROM pg_stat_activity
WHERE state = 'active'
  AND now() - pg_stat_activity.query_start > interval '30 seconds';

Alerts During Migration

Set these alerts to lower thresholds during migration:

MetricNormal ThresholdDuring Migration
Query latency p99500ms200ms
Error rate1%0.1%
Replication lag30s5s
Connection count200150

Runbook for Production Rollout

T-24h: Preparation

  • Migration tested on staging with production data
  • Rollback script prepared and tested
  • Team informed about migration window
  • Monitoring dashboards prepared

T-1h: Pre-flight

  • Backup started and verified
  • Replication lag < 1s
  • No long-running transactions
  • Error rate baseline recorded

T-0: Migration

  • Start migration
  • Monitor metrics every 30s
  • On any alert: STOP and evaluate

Stop Conditions (immediate rollback)

  • Error rate > 1%
  • Replication lag > 10s
  • Query latency p99 > 2x baseline
  • Any ERROR in app logs related to migration

Post-Migration

  • Validate data
  • Clear rollback alert
  • Document lessons learned

Tooling: Flyway/Liquibase Safely

Separate DDL and DML

migrations/
  V001__create_users_table.sql      # DDL
  V002__add_status_column.sql       # DDL
  R001__backfill_status.sql         # DML - repeatable

Review Rules

Every migration must pass review with checklist:

  • No ALTER TABLE ... ALTER COLUMN TYPE
  • All indexes are CONCURRENTLY
  • NOT NULL has NOT VALID phase
  • Rollback exists
  • Estimated run time < 5 min

Timeouts for Migrations

-- In every migration
SET LOCAL lock_timeout = '5s';
SET LOCAL statement_timeout = '30s';

-- If timeout, migration fails and you can retry
ALTER TABLE ...;

Conclusion: Checklist Before Every Migration

## Pre-migration Checklist

### Planning
- [ ] Do I understand what the migration does?
- [ ] Is this expand or contract phase?
- [ ] What is the expected run time?
- [ ] How many rows/tables are affected?

### Safety
- [ ] Do I have a rollback script?
- [ ] Is the code forward-compatible?
- [ ] Do I have a backup?
- [ ] Are timeouts set?

### Execution
- [ ] Did staging test pass?
- [ ] Is monitoring ready?
- [ ] Is the team informed?
- [ ] Are stop conditions defined?

### Post-migration
- [ ] Is data valid?
- [ ] Are metrics normal?
- [ ] Is documentation updated?

Your next step: Take the last migration you did. Did it meet this checklist? If not, what would you do differently?

Frequently Asked Questions (FAQ)

How long does an expand/contract migration take?

Depends on table size and change complexity. For a table with millions of rows, expect days to weeks (including backfill). Never do it over the weekend - do it during working days when you can respond.

Can I use pg_repack instead of expand/contract?

pg_repack is useful for table bloat and some types of changes, but it’s not a universal solution. For column type changes or adding NOT NULL, you still need the expand/contract pattern.

What if backfill takes too long?

Increase batch size (if DB can handle it), add more parallel workers, or consider pg_dump/restore for extremely large tables. Always monitor replication lag.

Is it safe to run migrations during peak hours?

With the correct expand/contract pattern, yes. But I recommend starting with lower traffic until you build confidence in the process. Do your first 5 migrations during off-peak hours.


Related posts

Cite this article

If you reference this post, please link to the original URL and credit the author.

Michal Drozd. "Zero-Downtime PostgreSQL Migrations: Expand/Contract, Backfill and Rollback Strategies". https://www.michal-drozd.com/en/blog/zero-downtime-postgresql-migrations/ (Published July 29, 2025).