Zero-Downtime PostgreSQL Migrations: Expand/Contract, Backfill and Rollback Strategies
Zero-downtime migrations are possible, but you have to treat them as an engineering project. Friday, 4 PM. I ran ALTER TABLE on a table with 47 million rows. 3 seconds later, Slack lit up red - production was down. I spent the next 4 hours on rollback and explaining why this would never happen again.
That was 6 years ago. Since then, I’ve done hundreds of migrations on production databases without a single outage. These rules are carved in stone for me.
Tested on: PostgreSQL 14-16, tables from 1M to 200M rows, AWS RDS and self-hosted. All examples in this article I’ve personally run on production systems.
What “Zero-Downtime” Means in Practice
Definition: The application remains fully functional throughout the entire migration. No error pages, no timeouts, no “try again later.”
Reality:
- Latency may temporarily increase (acceptable: +20%)
- Some operations may be slower (acceptable: +50%)
- No failed requests for end users
Most Common Causes of Outages
| Cause | Why It’s a Problem | Solution |
|---|---|---|
| Table lock | Blocks all queries | Online DDL |
| Full table rewrite | Long exclusive operation | Expand/contract |
| Long transactions | Blocking others | Timeouts, batching |
| Incompatible changes | App crash | Feature flags |
Core Pattern: Expand / Contract
This is the most important pattern. Every migration has two phases:
Expand (expansion)
- Add new (columns, tables, indexes)
- Old code still works
- New code can start using new structures
Contract (contraction)
- Remove old (only when nothing uses it)
- Clean up data
- Finalize migration
Example: Renaming a Column
Bad way (downtime):
ALTER TABLE users RENAME COLUMN name TO full_name;
-- All queries fail immediately!
Correct way (zero-downtime):
-- Phase 1: Expand
ALTER TABLE users ADD COLUMN full_name VARCHAR(255);
-- Phase 2: Backfill (batched) - PostgreSQL syntax
UPDATE users SET full_name = name
WHERE id IN (
SELECT id FROM users
WHERE full_name IS NULL
LIMIT 1000
);
-- (repeat until done)
-- Phase 3: App update
-- Deploy code that reads from both, writes to both
-- Phase 4: App update 2
-- Deploy code that reads only from full_name
-- Phase 5: Contract
ALTER TABLE users DROP COLUMN name;
Two-Phase Deploy with Feature Flags
The application must be able to work with both old AND new schema simultaneously.
# Phase: Expand
def get_user_name(user):
if feature_flag('use_full_name'):
return user.full_name or user.name # fallback to old
return user.name
def save_user(user, name):
user.name = name
if feature_flag('use_full_name'):
user.full_name = name # write to both
user.save()
Types of Changes and Recommended Approach
Adding a Column (safe)
-- Safe - no lock on existing rows
ALTER TABLE users ADD COLUMN avatar_url VARCHAR(255);
-- BUT BEWARE: with DEFAULT it's a problem!
-- PostgreSQL 11+ is OK, older versions = table rewrite
ALTER TABLE users ADD COLUMN status VARCHAR(20) DEFAULT 'active';
Changing Column Type (dangerous)
-- NOT like this - table rewrite!
ALTER TABLE orders ALTER COLUMN amount TYPE NUMERIC(12,2);
-- YES like this - expand/contract
-- 1. Add new column
ALTER TABLE orders ADD COLUMN amount_new NUMERIC(12,2);
-- 2. Backfill
UPDATE orders SET amount_new = amount WHERE amount_new IS NULL;
-- 3. Switch app
-- 4. Drop old
ALTER TABLE orders DROP COLUMN amount;
ALTER TABLE orders RENAME COLUMN amount_new TO amount;
Adding NOT NULL (dangerous)
-- NOT like this - full table scan with lock
ALTER TABLE users ALTER COLUMN email SET NOT NULL;
-- YES like this
-- 1. Add constraint as NOT VALID
ALTER TABLE users ADD CONSTRAINT users_email_not_null
CHECK (email IS NOT NULL) NOT VALID;
-- 2. Validate in background (no lock)
ALTER TABLE users VALIDATE CONSTRAINT users_email_not_null;
-- 3. Now you can add NOT NULL (fast, because constraint already exists)
ALTER TABLE users ALTER COLUMN email SET NOT NULL;
ALTER TABLE users DROP CONSTRAINT users_email_not_null;
Adding UNIQUE (dangerous)
-- NOT like this - long lock
ALTER TABLE users ADD CONSTRAINT users_email_unique UNIQUE (email);
-- YES like this
-- 1. Create index CONCURRENTLY (no lock)
CREATE UNIQUE INDEX CONCURRENTLY users_email_unique_idx ON users(email);
-- 2. Add constraint using index (fast)
ALTER TABLE users ADD CONSTRAINT users_email_unique
UNIQUE USING INDEX users_email_unique_idx;
Online Indexes
-- ALWAYS use CONCURRENTLY for production
CREATE INDEX CONCURRENTLY idx_orders_user_id ON orders(user_id);
-- If it fails, you must drop the INVALID index
DROP INDEX CONCURRENTLY idx_orders_user_id;
-- And try again
⚠️ Important:
CREATE INDEX CONCURRENTLYdoes not work inside a transaction! If you use a migration tool like Flyway or Liquibase, make sure the migration with CONCURRENTLY runs outside a transaction (e.g., in Flyway use-- flyway:disableTransaction).
Strategies for Large Tables
For tables with millions of rows:
- Create index during off-peak hours (even CONCURRENTLY loads DB)
- Set maintenance_work_mem higher (temporarily)
- Monitor replication lag (replica may fall behind)
-- Temporarily increase memory for session
SET maintenance_work_mem = '1GB';
CREATE INDEX CONCURRENTLY ...;
RESET maintenance_work_mem;
Backfilling Data Without Killing the DB
Basic Batching
BATCH_SIZE = 1000
SLEEP_BETWEEN = 0.1 # 100ms
def backfill_full_name():
while True:
with transaction():
# PostgreSQL doesn't support UPDATE...LIMIT, use subquery
count = db.execute("""
UPDATE users
SET full_name = name
WHERE id IN (
SELECT id FROM users
WHERE full_name IS NULL
LIMIT %s
)
""", [BATCH_SIZE])
if count == 0:
break
time.sleep(SLEEP_BETWEEN)
log(f"Migrated {count} rows")
Throttling Based on Load
def backfill_with_throttle():
while True:
# Check load before each batch
lag = get_replication_lag()
if lag > timedelta(seconds=5):
log("Replication lag too high, pausing...")
time.sleep(10)
continue
connections = get_active_connections()
if connections > 100:
log("Too many connections, pausing...")
time.sleep(5)
continue
# Run batch
migrate_batch()
Idempotency
Migration MUST be restartable:
def migrate_batch():
# Idempotent update - doesn't matter if run multiple times
# PostgreSQL doesn't support UPDATE...ORDER BY...LIMIT, use subquery
db.execute("""
UPDATE users
SET full_name = name
WHERE id IN (
SELECT id FROM users
WHERE full_name IS NULL
AND id > %s
ORDER BY id
LIMIT 1000
)
""", [last_processed_id])
Rollback Strategy
Every migration must have a rollback plan BEFORE execution.
Forward-Compatible Code
# Code must work with both schemas
def get_user(id):
user = db.query("SELECT * FROM users WHERE id = %s", [id])
# If new column doesn't exist, use old
return {
'name': user.full_name if hasattr(user, 'full_name') else user.name
}
Safe Rollback Checklist
Before deploying migration:
- Does rollback script exist?
- Is rollback tested on staging?
- Do we have backup of data that will be changed?
- Is rollback time-bounded? (max 15 min)
- Who has authority to trigger rollback?
Rollback Script Template
-- rollback_001_add_full_name.sql
-- Description: Rollback adding full_name column
-- Prerequisites: App must be on version < X.Y.Z
BEGIN;
-- Check that app no longer uses new column
DO $$
BEGIN
IF EXISTS (
SELECT 1 FROM pg_stat_activity
WHERE query LIKE '%full_name%' AND state = 'active'
) THEN
RAISE EXCEPTION 'Active queries using full_name column!';
END IF;
END
$$;
-- Drop column
ALTER TABLE users DROP COLUMN IF EXISTS full_name;
COMMIT;
Observability for Migrations
Metrics to Monitor
-- Lock wait time
SELECT pid, wait_event_type, wait_event, query
FROM pg_stat_activity
WHERE wait_event IS NOT NULL;
-- Replication lag
SELECT client_addr,
pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) as lag_bytes
FROM pg_stat_replication;
-- Long running queries
SELECT pid, now() - pg_stat_activity.query_start AS duration, query
FROM pg_stat_activity
WHERE state = 'active'
AND now() - pg_stat_activity.query_start > interval '30 seconds';
Alerts During Migration
Set these alerts to lower thresholds during migration:
| Metric | Normal Threshold | During Migration |
|---|---|---|
| Query latency p99 | 500ms | 200ms |
| Error rate | 1% | 0.1% |
| Replication lag | 30s | 5s |
| Connection count | 200 | 150 |
Runbook for Production Rollout
T-24h: Preparation
- Migration tested on staging with production data
- Rollback script prepared and tested
- Team informed about migration window
- Monitoring dashboards prepared
T-1h: Pre-flight
- Backup started and verified
- Replication lag < 1s
- No long-running transactions
- Error rate baseline recorded
T-0: Migration
- Start migration
- Monitor metrics every 30s
- On any alert: STOP and evaluate
Stop Conditions (immediate rollback)
- Error rate > 1%
- Replication lag > 10s
- Query latency p99 > 2x baseline
- Any ERROR in app logs related to migration
Post-Migration
- Validate data
- Clear rollback alert
- Document lessons learned
Tooling: Flyway/Liquibase Safely
Separate DDL and DML
migrations/
V001__create_users_table.sql # DDL
V002__add_status_column.sql # DDL
R001__backfill_status.sql # DML - repeatable
Review Rules
Every migration must pass review with checklist:
- No
ALTER TABLE ... ALTER COLUMN TYPE - All indexes are CONCURRENTLY
- NOT NULL has NOT VALID phase
- Rollback exists
- Estimated run time < 5 min
Timeouts for Migrations
-- In every migration
SET LOCAL lock_timeout = '5s';
SET LOCAL statement_timeout = '30s';
-- If timeout, migration fails and you can retry
ALTER TABLE ...;
Conclusion: Checklist Before Every Migration
## Pre-migration Checklist
### Planning
- [ ] Do I understand what the migration does?
- [ ] Is this expand or contract phase?
- [ ] What is the expected run time?
- [ ] How many rows/tables are affected?
### Safety
- [ ] Do I have a rollback script?
- [ ] Is the code forward-compatible?
- [ ] Do I have a backup?
- [ ] Are timeouts set?
### Execution
- [ ] Did staging test pass?
- [ ] Is monitoring ready?
- [ ] Is the team informed?
- [ ] Are stop conditions defined?
### Post-migration
- [ ] Is data valid?
- [ ] Are metrics normal?
- [ ] Is documentation updated?
Your next step: Take the last migration you did. Did it meet this checklist? If not, what would you do differently?
Frequently Asked Questions (FAQ)
How long does an expand/contract migration take?
Depends on table size and change complexity. For a table with millions of rows, expect days to weeks (including backfill). Never do it over the weekend - do it during working days when you can respond.
Can I use pg_repack instead of expand/contract?
pg_repack is useful for table bloat and some types of changes, but it’s not a universal solution. For column type changes or adding NOT NULL, you still need the expand/contract pattern.
What if backfill takes too long?
Increase batch size (if DB can handle it), add more parallel workers, or consider pg_dump/restore for extremely large tables. Always monitor replication lag.
Is it safe to run migrations during peak hours?
With the correct expand/contract pattern, yes. But I recommend starting with lower traffic until you build confidence in the process. Do your first 5 migrations during off-peak hours.
Related Articles
- Architecture as Code: ADR, C4 Diagrams and CI Quality Gates - Document database decisions using ADR
- CI/CD for Monorepo: Speed, Caching, Selective Tests and Supply-Chain Security - How to automate database migrations in your CI/CD pipeline
Related posts
PostgreSQL 'cached plan must not change result type' During Zero-Downtime Migrations
Rolling deploy fails with cached plan errors after ALTER TABLE. The cause: server-side prepared statements cache query plans that break when schema changes.
PostgreSQL Autovacuum SLO Tuning: How to Configure Vacuum for 200M Rows and 5k UPSERT/s
Autovacuum is either ignored or cargo-cult tuned. I'll show how to turn it into an SLO-driven system with specific numbers, pg_stat metrics, and reproducible tests.
Kubernetes Rollout Without DB Outage: How to Stop PostgreSQL Connection Storm
Reproducible lab demonstrating connection storm during K8s rollouts. PgBouncer, preStop hooks and jitter - practical solutions with benchmarks.
The Soft Delete Trap: Why is_deleted Kills Your Database (And What To Do)
A practical analysis of why soft delete destroys database performance over time. Benchmarks, partitioning solution, and migration checklist.
Cite this article
If you reference this post, please link to the original URL and credit the author.