PostgreSQL XID Wraparound: Emergency Playbook for Vacuum Freeze Under Fire
You see an error that looks like “the database is not accepting commands to avoid wraparound data loss” (or you’re approaching that state). The instinct is to run something heavy and hope.
That’s how you turn a recoverable situation into a self-inflicted outage.
This playbook is for the 3AM moment when:
- XID age is high
- autovacuum is behind
- large tables exist
- and you need to recover without making IO and locking worse
Tested on: PostgreSQL 15–17 (self-managed + managed offerings with similar semantics), multi-GB to multi-TB datasets, mixed OLTP workloads.
Incident narrative (anonymized)
A SaaS cluster ran a multi-tenant Postgres with high write volume. Autovacuum was tuned for throughput, but the team had two blind spots:
- long-running “idle in transaction” sessions blocking vacuum progress
- large append-heavy tables that aged quickly
Monitoring caught rising age(datfrozenxid) late. By the time the incident started, “normal” vacuum couldn’t keep up and the team risked crossing the wraparound safety boundary.
Blast radius: production DB at risk of forced read-only / shutdown behavior.
Constraint: minimal downtime; avoid heavy locks; keep replication healthy.
Timeline
- T-0: Alert fires: database XID age approaching critical threshold.
- T+10m: Identify top databases by XID age; confirm one is far worse.
- T+20m: Identify top relations by
age(relfrozenxid); largest tables are the oldest. - T+30m: Find long-running transactions and idle-in-tx sessions blocking vacuum.
- T+40m: Kill/block offenders; start targeted
VACUUM (FREEZE)on worst tables. - T+2h: XID age starts dropping; autovacuum catches up; incident ends.
- T+1d: Add guardrails: alerts, timeouts, and per-table autovacuum freeze tuning.
Mechanism: what wraparound really is
Postgres assigns transaction IDs (XIDs). Over time, XIDs wrap around. If old row versions aren’t “frozen”, wraparound can make old and new XIDs ambiguous, risking data correctness.
To prevent that, Postgres requires regular vacuuming that freezes old tuples. If the system gets too close to wraparound, Postgres will force aggressive anti-wraparound behavior. The exact symptoms depend on version and state, but the core constraint is consistent:
If freezing can’t advance, Postgres will protect correctness over availability.
Why vacuum freeze gets blocked
Even if autovacuum is running, freezing can stall if:
- long-running transactions keep
xminold - idle-in-transaction sessions hold snapshots open
- aggressive write churn creates continuous dead tuples
- IO budget is too low (autovacuum can’t keep up)
Runbook: emergency response
What to check first (10 minutes)
1) How close are we?
SELECT
datname,
age(datfrozenxid) AS xid_age
FROM pg_database
ORDER BY xid_age DESC;
Also check your configured threshold:
SELECT
current_setting('autovacuum_freeze_max_age') AS freeze_max_age,
current_setting('autovacuum_freeze_table_age') AS freeze_table_age;
A good mental model is not “absolute numbers” but “distance to your configured max”.
2) Which tables are the worst offenders?
SELECT
n.nspname,
c.relname,
pg_size_pretty(pg_total_relation_size(c.oid)) AS total_size,
age(c.relfrozenxid) AS xid_age
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind IN ('r','m') -- tables + materialized views
ORDER BY age(c.relfrozenxid) DESC
LIMIT 20;
3) Are long transactions blocking us?
SELECT
pid,
usename,
state,
now() - xact_start AS xact_age,
wait_event_type,
wait_event,
left(query, 120) AS query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start ASC
LIMIT 20;
If you see “idle in transaction” with huge xact_age, treat it as an emergency blocker.
How to confirm the hypothesis
A. Check vacuum progress
SELECT
pid,
datname,
relid::regclass AS relation,
phase,
heap_blks_total,
heap_blks_scanned,
heap_blks_vacuumed,
index_vacuum_count
FROM pg_stat_progress_vacuum;
If there’s no progress and you have long transactions, you likely have a snapshot blocker.
Safe mitigations (preferred order)
- Stop the blockers
- terminate idle-in-tx sessions first
- then terminate truly runaway long transactions (with care)
Example:
SELECT pg_terminate_backend(<pid>);
- Targeted
VACUUM (FREEZE)on the worst tables Start with the oldest-by-age relations, but consider IO impact (large tables are expensive).
VACUUM (FREEZE, VERBOSE) public.big_table;
- Increase autovacuum capacity temporarily
- more workers
- higher maintenance memory (within reason)
- ensure IO capacity is available
(Do this carefully; it’s easy to overload disk.)
- Use
vacuumdbin parallel (off-peak) If you can run from a maintenance host:
vacuumdb --all --freeze --jobs=4
Risky mitigations (avoid unless you truly understand)
VACUUM FULL
- takes stronger locks and rewrites the table
- often makes availability worse
- Turning autovacuum off
- guarantees you will lose the race
What we changed (concrete)
1) We added hard timeouts to prevent snapshot blockers
At the DB level (example):
ALTER DATABASE appdb SET idle_in_transaction_session_timeout = '60s';
ALTER DATABASE appdb SET statement_timeout = '30s';
(Choose values appropriate for your workload; the important part is: block “idle in transaction forever”.)
2) We tuned autovacuum to keep up with freeze debt
Diff (illustrative postgresql.conf change):
-autovacuum_max_workers = 3
-autovacuum_naptime = 60s
+autovacuum_max_workers = 6
+autovacuum_naptime = 20s
For the worst append-heavy tables, we applied per-table settings:
ALTER TABLE public.big_table SET (
autovacuum_vacuum_scale_factor = 0.02,
autovacuum_analyze_scale_factor = 0.05
);
(Per-table tuning is often safer than cluster-wide aggressive settings.)
3) We added wraparound SLO alerts
We alert on:
max(age(relfrozenxid))for the top relationsage(datfrozenxid)per database- presence of idle-in-transaction sessions older than X
How to verify (measurable)
- XID age starts dropping:
SELECT datname, age(datfrozenxid) FROM pg_database ORDER BY 2 DESC;
- Worst tables improve:
SELECT relname, age(relfrozenxid) FROM pg_class WHERE relkind='r' ORDER BY 2 DESC LIMIT 10;
- Vacuum progress shows forward movement:
SELECT * FROM pg_stat_progress_vacuum;
- Application impact is controlled:
- p95 DB latency stays within acceptable bounds
- replication lag stays bounded (if applicable)
Prevention / guardrails
Operational contracts
- No “idle in transaction” longer than X seconds.
- Wraparound budget: alert when XID age consumes > Y% of your configured freeze max.
- Large tables must have explicitly reviewed autovacuum settings.
Alert checklist
age(datfrozenxid)increasing too fast- oldest
relfrozenxidabove threshold - idle-in-tx sessions > threshold count and age
- vacuum not progressing (stuck phases)
Related reading
- PostgreSQL Autovacuum SLO Tuning: How to Configure Vacuum for 200M Rows and 5k UPSERT/s
- PostgreSQL Idle in Transaction: Emergency Playbook for Stuck Connections
- PostgreSQL Checkpoint Spikes: Why p99 Explodes Every N Minutes
- PostgreSQL OOM by Design: work_mem × Parallel Workers × Plan Nodes
- Logical Replication Slot WAL Bloat: When Subscribers Go Offline
- PostgreSQL Replication Slot Bloat: How One Inactive Slot Filled 500GB Disk
- Zero-Downtime PostgreSQL Migrations: Expand/Contract, Backfill and Rollback Strategies
Related posts
hot_standby_feedback Bloat Trap: Fixing Replica Conflicts by Slowly Killing the Primary
hot_standby_feedback stops replica query cancellations but can bloat the primary over days. Detect xmin pinning, mitigate safely, and add guardrails.
PostgreSQL Logical Replication Lag: Big Transactions and Reorder Buffer Spills
One huge transaction can pin logical replication for hours. Runbook to detect the blocker, tune decoding safely, and enforce bounded transactions in prod.
Pods Stuck in Terminating: A Production Decision Tree for Finalizers, Volumes, and Dead Nodes
A conservative runbook to unstick Pods safely: finalizers, CSI/volume cleanup stalls, dead nodes, and when (and how) to force-delete.
Redis AOF fsync Latency Spikes: When Durability Becomes Your p99
Redis AOF can turn durability into p99 spikes: fsync pressure and rewrite fork CoW. Runbook to confirm, mitigate safely, and add guardrails.
Cite this article
If you reference this post, please link to the original URL and credit the author.