Back to blog

PostgreSQL XID Wraparound: Emergency Playbook for Vacuum Freeze Under Fire

You see an error that looks like “the database is not accepting commands to avoid wraparound data loss” (or you’re approaching that state). The instinct is to run something heavy and hope.

That’s how you turn a recoverable situation into a self-inflicted outage.

This playbook is for the 3AM moment when:

  • XID age is high
  • autovacuum is behind
  • large tables exist
  • and you need to recover without making IO and locking worse

Tested on: PostgreSQL 15–17 (self-managed + managed offerings with similar semantics), multi-GB to multi-TB datasets, mixed OLTP workloads.

Incident narrative (anonymized)

A SaaS cluster ran a multi-tenant Postgres with high write volume. Autovacuum was tuned for throughput, but the team had two blind spots:

  1. long-running “idle in transaction” sessions blocking vacuum progress
  2. large append-heavy tables that aged quickly

Monitoring caught rising age(datfrozenxid) late. By the time the incident started, “normal” vacuum couldn’t keep up and the team risked crossing the wraparound safety boundary.

Blast radius: production DB at risk of forced read-only / shutdown behavior.

Constraint: minimal downtime; avoid heavy locks; keep replication healthy.

Timeline

  • T-0: Alert fires: database XID age approaching critical threshold.
  • T+10m: Identify top databases by XID age; confirm one is far worse.
  • T+20m: Identify top relations by age(relfrozenxid); largest tables are the oldest.
  • T+30m: Find long-running transactions and idle-in-tx sessions blocking vacuum.
  • T+40m: Kill/block offenders; start targeted VACUUM (FREEZE) on worst tables.
  • T+2h: XID age starts dropping; autovacuum catches up; incident ends.
  • T+1d: Add guardrails: alerts, timeouts, and per-table autovacuum freeze tuning.

Mechanism: what wraparound really is

Postgres assigns transaction IDs (XIDs). Over time, XIDs wrap around. If old row versions aren’t “frozen”, wraparound can make old and new XIDs ambiguous, risking data correctness.

To prevent that, Postgres requires regular vacuuming that freezes old tuples. If the system gets too close to wraparound, Postgres will force aggressive anti-wraparound behavior. The exact symptoms depend on version and state, but the core constraint is consistent:

If freezing can’t advance, Postgres will protect correctness over availability.

Why vacuum freeze gets blocked

Even if autovacuum is running, freezing can stall if:

  • long-running transactions keep xmin old
  • idle-in-transaction sessions hold snapshots open
  • aggressive write churn creates continuous dead tuples
  • IO budget is too low (autovacuum can’t keep up)

Runbook: emergency response

What to check first (10 minutes)

1) How close are we?

SELECT
  datname,
  age(datfrozenxid) AS xid_age
FROM pg_database
ORDER BY xid_age DESC;

Also check your configured threshold:

SELECT
  current_setting('autovacuum_freeze_max_age') AS freeze_max_age,
  current_setting('autovacuum_freeze_table_age') AS freeze_table_age;

A good mental model is not “absolute numbers” but “distance to your configured max”.

2) Which tables are the worst offenders?

SELECT
  n.nspname,
  c.relname,
  pg_size_pretty(pg_total_relation_size(c.oid)) AS total_size,
  age(c.relfrozenxid) AS xid_age
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind IN ('r','m') -- tables + materialized views
ORDER BY age(c.relfrozenxid) DESC
LIMIT 20;

3) Are long transactions blocking us?

SELECT
  pid,
  usename,
  state,
  now() - xact_start AS xact_age,
  wait_event_type,
  wait_event,
  left(query, 120) AS query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start ASC
LIMIT 20;

If you see “idle in transaction” with huge xact_age, treat it as an emergency blocker.

How to confirm the hypothesis

A. Check vacuum progress

SELECT
  pid,
  datname,
  relid::regclass AS relation,
  phase,
  heap_blks_total,
  heap_blks_scanned,
  heap_blks_vacuumed,
  index_vacuum_count
FROM pg_stat_progress_vacuum;

If there’s no progress and you have long transactions, you likely have a snapshot blocker.

Safe mitigations (preferred order)

  1. Stop the blockers
  • terminate idle-in-tx sessions first
  • then terminate truly runaway long transactions (with care)

Example:

SELECT pg_terminate_backend(<pid>);
  1. Targeted VACUUM (FREEZE) on the worst tables Start with the oldest-by-age relations, but consider IO impact (large tables are expensive).
VACUUM (FREEZE, VERBOSE) public.big_table;
  1. Increase autovacuum capacity temporarily
  • more workers
  • higher maintenance memory (within reason)
  • ensure IO capacity is available

(Do this carefully; it’s easy to overload disk.)

  1. Use vacuumdb in parallel (off-peak) If you can run from a maintenance host:
vacuumdb --all --freeze --jobs=4

Risky mitigations (avoid unless you truly understand)

  1. VACUUM FULL
  • takes stronger locks and rewrites the table
  • often makes availability worse
  1. Turning autovacuum off
  • guarantees you will lose the race

What we changed (concrete)

1) We added hard timeouts to prevent snapshot blockers

At the DB level (example):

ALTER DATABASE appdb SET idle_in_transaction_session_timeout = '60s';
ALTER DATABASE appdb SET statement_timeout = '30s';

(Choose values appropriate for your workload; the important part is: block “idle in transaction forever”.)

2) We tuned autovacuum to keep up with freeze debt

Diff (illustrative postgresql.conf change):

-autovacuum_max_workers = 3
-autovacuum_naptime = 60s
+autovacuum_max_workers = 6
+autovacuum_naptime = 20s

For the worst append-heavy tables, we applied per-table settings:

ALTER TABLE public.big_table SET (
  autovacuum_vacuum_scale_factor = 0.02,
  autovacuum_analyze_scale_factor = 0.05
);

(Per-table tuning is often safer than cluster-wide aggressive settings.)

3) We added wraparound SLO alerts

We alert on:

  • max(age(relfrozenxid)) for the top relations
  • age(datfrozenxid) per database
  • presence of idle-in-transaction sessions older than X

How to verify (measurable)

  1. XID age starts dropping:
SELECT datname, age(datfrozenxid) FROM pg_database ORDER BY 2 DESC;
  1. Worst tables improve:
SELECT relname, age(relfrozenxid) FROM pg_class WHERE relkind='r' ORDER BY 2 DESC LIMIT 10;
  1. Vacuum progress shows forward movement:
SELECT * FROM pg_stat_progress_vacuum;
  1. Application impact is controlled:
  • p95 DB latency stays within acceptable bounds
  • replication lag stays bounded (if applicable)

Prevention / guardrails

Operational contracts

  • No “idle in transaction” longer than X seconds.
  • Wraparound budget: alert when XID age consumes > Y% of your configured freeze max.
  • Large tables must have explicitly reviewed autovacuum settings.

Alert checklist

  • age(datfrozenxid) increasing too fast
  • oldest relfrozenxid above threshold
  • idle-in-tx sessions > threshold count and age
  • vacuum not progressing (stuck phases)

Related posts

Cite this article

If you reference this post, please link to the original URL and credit the author.

Michal Drozd. "PostgreSQL XID Wraparound: Emergency Playbook for Vacuum Freeze Under Fire". https://www.michal-drozd.com/en/blog/postgresql-xid-wraparound-emergency-playbook/ (Published December 16, 2025).