PostgreSQL XID Wraparound: Emergency Playbook for Vacuum Freeze Under Fire

You see an error that looks like “the database is not accepting commands to avoid wraparound data loss” (or you’re approaching that state). The instinct is to run something heavy and hope.

That’s how you turn a recoverable situation into a self-inflicted outage.

This playbook is for the 3AM moment when:

XID age is high
autovacuum is behind
large tables exist
and you need to recover without making IO and locking worse

Tested on: PostgreSQL 15–17 (self-managed + managed offerings with similar semantics), multi-GB to multi-TB datasets, mixed OLTP workloads.

Incident narrative (anonymized)

A SaaS cluster ran a multi-tenant Postgres with high write volume. Autovacuum was tuned for throughput, but the team had two blind spots:

long-running “idle in transaction” sessions blocking vacuum progress
large append-heavy tables that aged quickly

Monitoring caught rising age(datfrozenxid) late. By the time the incident started, “normal” vacuum couldn’t keep up and the team risked crossing the wraparound safety boundary.

Blast radius: production DB at risk of forced read-only / shutdown behavior.

Constraint: minimal downtime; avoid heavy locks; keep replication healthy.

Timeline

T-0: Alert fires: database XID age approaching critical threshold.
T+10m: Identify top databases by XID age; confirm one is far worse.
T+20m: Identify top relations by age(relfrozenxid); largest tables are the oldest.
T+30m: Find long-running transactions and idle-in-tx sessions blocking vacuum.
T+40m: Kill/block offenders; start targeted VACUUM (FREEZE) on worst tables.
T+2h: XID age starts dropping; autovacuum catches up; incident ends.
T+1d: Add guardrails: alerts, timeouts, and per-table autovacuum freeze tuning.

Mechanism: what wraparound really is

Postgres assigns transaction IDs (XIDs). Over time, XIDs wrap around. If old row versions aren’t “frozen”, wraparound can make old and new XIDs ambiguous, risking data correctness.

To prevent that, Postgres requires regular vacuuming that freezes old tuples. If the system gets too close to wraparound, Postgres will force aggressive anti-wraparound behavior. The exact symptoms depend on version and state, but the core constraint is consistent:

If freezing can’t advance, Postgres will protect correctness over availability.

Why vacuum freeze gets blocked

Even if autovacuum is running, freezing can stall if:

long-running transactions keep xmin old
idle-in-transaction sessions hold snapshots open
aggressive write churn creates continuous dead tuples
IO budget is too low (autovacuum can’t keep up)

Runbook: emergency response

What to check first (10 minutes)

1) How close are we?

SELECT
  datname,
  age(datfrozenxid) AS xid_age
FROM pg_database
ORDER BY xid_age DESC;

Also check your configured threshold:

SELECT
  current_setting('autovacuum_freeze_max_age') AS freeze_max_age,
  current_setting('autovacuum_freeze_table_age') AS freeze_table_age;

A good mental model is not “absolute numbers” but “distance to your configured max”.

2) Which tables are the worst offenders?

SELECT
  n.nspname,
  c.relname,
  pg_size_pretty(pg_total_relation_size(c.oid)) AS total_size,
  age(c.relfrozenxid) AS xid_age
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind IN ('r','m') -- tables + materialized views
ORDER BY age(c.relfrozenxid) DESC
LIMIT 20;

3) Are long transactions blocking us?

SELECT
  pid,
  usename,
  state,
  now() - xact_start AS xact_age,
  wait_event_type,
  wait_event,
  left(query, 120) AS query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start ASC
LIMIT 20;

If you see “idle in transaction” with huge xact_age, treat it as an emergency blocker.

How to confirm the hypothesis

A. Check vacuum progress

SELECT
  pid,
  datname,
  relid::regclass AS relation,
  phase,
  heap_blks_total,
  heap_blks_scanned,
  heap_blks_vacuumed,
  index_vacuum_count
FROM pg_stat_progress_vacuum;

If there’s no progress and you have long transactions, you likely have a snapshot blocker.

Safe mitigations (preferred order)

Stop the blockers

terminate idle-in-tx sessions first
then terminate truly runaway long transactions (with care)

Example:

SELECT pg_terminate_backend(<pid>);

Targeted VACUUM (FREEZE) on the worst tables Start with the oldest-by-age relations, but consider IO impact (large tables are expensive).

VACUUM (FREEZE, VERBOSE) public.big_table;

Increase autovacuum capacity temporarily

more workers
higher maintenance memory (within reason)
ensure IO capacity is available

(Do this carefully; it’s easy to overload disk.)

Use vacuumdb in parallel (off-peak) If you can run from a maintenance host:

vacuumdb --all --freeze --jobs=4

Risky mitigations (avoid unless you truly understand)

VACUUM FULL

takes stronger locks and rewrites the table
often makes availability worse

Turning autovacuum off

guarantees you will lose the race

What we changed (concrete)

1) We added hard timeouts to prevent snapshot blockers

At the DB level (example):

ALTER DATABASE appdb SET idle_in_transaction_session_timeout = '60s';
ALTER DATABASE appdb SET statement_timeout = '30s';

(Choose values appropriate for your workload; the important part is: block “idle in transaction forever”.)

2) We tuned autovacuum to keep up with freeze debt

Diff (illustrative postgresql.conf change):

-autovacuum_max_workers = 3
-autovacuum_naptime = 60s
+autovacuum_max_workers = 6
+autovacuum_naptime = 20s

For the worst append-heavy tables, we applied per-table settings:

ALTER TABLE public.big_table SET (
  autovacuum_vacuum_scale_factor = 0.02,
  autovacuum_analyze_scale_factor = 0.05
);

(Per-table tuning is often safer than cluster-wide aggressive settings.)

3) We added wraparound SLO alerts

We alert on:

max(age(relfrozenxid)) for the top relations
age(datfrozenxid) per database
presence of idle-in-transaction sessions older than X

How to verify (measurable)

XID age starts dropping:

SELECT datname, age(datfrozenxid) FROM pg_database ORDER BY 2 DESC;

Worst tables improve:

SELECT relname, age(relfrozenxid) FROM pg_class WHERE relkind='r' ORDER BY 2 DESC LIMIT 10;

Vacuum progress shows forward movement:

SELECT * FROM pg_stat_progress_vacuum;

Application impact is controlled:

p95 DB latency stays within acceptable bounds
replication lag stays bounded (if applicable)

Prevention / guardrails

Operational contracts

No “idle in transaction” longer than X seconds.
Wraparound budget: alert when XID age consumes > Y% of your configured freeze max.
Large tables must have explicitly reviewed autovacuum settings.

Alert checklist

age(datfrozenxid) increasing too fast
oldest relfrozenxid above threshold
idle-in-tx sessions > threshold count and age
vacuum not progressing (stuck phases)

PostgreSQL XID Wraparound: Emergency Playbook for Vacuum Freeze Under Fire

Incident narrative (anonymized)

Timeline

Mechanism: what wraparound really is

Why vacuum freeze gets blocked

Runbook: emergency response

What to check first (10 minutes)

How to confirm the hypothesis

Safe mitigations (preferred order)

Risky mitigations (avoid unless you truly understand)

What we changed (concrete)

1) We added hard timeouts to prevent snapshot blockers

2) We tuned autovacuum to keep up with freeze debt

3) We added wraparound SLO alerts

How to verify (measurable)

Prevention / guardrails

Operational contracts

Alert checklist

Related posts

Cite this article

Incident narrative (anonymized)

Timeline

Mechanism: what wraparound really is

Why vacuum freeze gets blocked

Runbook: emergency response

What to check first (10 minutes)

How to confirm the hypothesis

Safe mitigations (preferred order)

Risky mitigations (avoid unless you truly understand)

What we changed (concrete)

1) We added hard timeouts to prevent snapshot blockers

2) We tuned autovacuum to keep up with freeze debt

3) We added wraparound SLO alerts

How to verify (measurable)

Prevention / guardrails

Operational contracts

Alert checklist

Related reading

Related posts

Cite this article