PostgreSQL Logical Replication Lag: Big Transactions and Reorder Buffer Spills

The symptom pattern is deceptively calm:

the subscriber is up, CPU is moderate
the subscription status looks “connected”
yet replication lag grows from minutes to hours
apply appears stuck on one point in time
downstream systems time out because they depend on fresh replicated data

In real incidents, people fixate on “network” or “subscriber performance”. The actual root cause is often simpler:

One huge transaction on the publisher pins logical replication until it commits.

Tested on: PostgreSQL 15–17, logical replication using pgoutput, subscribers in Kubernetes and VMs, backfills and migration workloads.

Incident narrative (anonymized)

We ran a backfill that updated tens of millions of rows. It was written as “one transaction for safety”.

In production:

the backfill ran for about an hour
subscriber lag climbed steadily
apply worker looked alive but didn’t catch up
the logical slot started retaining more WAL than expected

The incident wasn’t “logical replication is slow”. It was “logical replication is transactional”: changes become visible only at commit, and big transactions create a wall-clock lag even when everything is healthy.

Constraint: multiple services depended on the replica for reads. Waiting an hour for commit wasn’t acceptable. We needed an explicit runbook and a prevention policy that makes giant transactions basically impossible.

Timeline

T-0: lag alert fires; downstream reads become stale.
T+10m: subscription looks connected, but latest_end_time falls behind.
T+20m: we find a long transaction on the publisher.
T+30m: mitigation: abort the transaction; replace with chunked batches.
T+60m: lag starts shrinking; apply catches up.
T+1d: guardrails: bounded backfills, decoding budgets, alerts on apply delay and slot lag.

Mechanism: why big transactions freeze logical replication progress

Logical replication applies changes in transaction order. For a large transaction:

the publisher decodes changes from WAL
the subscriber receives them
the subscriber can’t make them visible until the commit arrives

Meanwhile your “lag” becomes dominated by how long that transaction runs, not by bandwidth.

Reorder buffer spills turn a CPU problem into an IO problem

Decoding needs memory. When changes in a transaction exceed decoding budgets, PostgreSQL uses a reorder buffer and can spill to disk. Once you spill, apply can look like “stuck” even though it’s doing work.

This is why the same system can replicate fine for months and then “suddenly” lag: one migration changes the transaction shape.

Runbook: detect the blocker and recover safely

1) Measure apply delay on the subscriber

SELECT
  subname,
  status,
  received_lsn,
  latest_end_lsn,
  latest_end_time,
  last_msg_receipt_time
FROM pg_stat_subscription
ORDER BY latest_end_time NULLS LAST;

If latest_end_time is far behind now, you have apply delay.

2) Measure slot lag on the publisher

SELECT
  slot_name,
  active,
  restart_lsn,
  confirmed_flush_lsn,
  pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) AS confirmed_lag
FROM pg_replication_slots
WHERE slot_type = 'logical'
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) DESC;

This tells you how bad it is and whether the slot is moving.

3) Find the long transaction on the publisher

SELECT
  pid,
  usename,
  now() - xact_start AS xact_age,
  state,
  left(query, 160) AS query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_age DESC
LIMIT 20;

In most incidents, the answer is obvious: a backfill, migration, or bulk update.

4) Confirm it’s the pin

I treat it as confirmed if:

there is a long-running transaction
lag grows while it runs
lag shrinks soon after it commits or is aborted

The fastest confirmation is operational: stop the transaction and watch lag start shrinking.

Safe mitigations

1) Abort the big transaction (stop the pin)

If the transaction hasn’t committed, aborting it is the fastest way to restore freshness.

This is a business decision, but the trade-off is real:

keep it running: replica stays stale and slot retains WAL
abort it: you re-run the job later, but downstream reads recover now

2) Replace “one big transaction” with bounded batches

Backfills should be:

chunked into small batches
committed per batch
resumable
progress-tracked

3) Tune decoding budgets intentionally

If you frequently replicate large changes, tune decoding memory budgets (for example logical_decoding_work_mem) deliberately:

too low: reorder buffer spills to disk and apply becomes IO-bound
too high: memory pressure under concurrency

The key is: treat it as a budget knob and test it under realistic load.

Risky mitigations

disabling the subscription to “stop lag” (you stop apply entirely, and WAL retention can get worse)
dropping and recreating the subscription/slot (can require resync and can lose changes if you guess wrong)
blindly increasing memory and timeouts without bounding transaction size (you’ll hit the same outage class again)

What we changed (concrete)

1) Enforce a “max transaction budget” for backfills

We made it enforceable:

batch size cap (rows per transaction)
time budget (statement timeout for the backfill role)
progress tracking and resume logic

Example:

ALTER ROLE backfill SET statement_timeout = '5min';

2) Add alerts that fire before WAL retention becomes a second incident

We alert on:

apply delay above a threshold
slot confirmed lag growing faster than expected
long-running transactions (especially for backfill roles)

How to verify

Subscriber should show latest_end_time moving toward now:

SELECT subname, latest_end_time, last_msg_receipt_time
FROM pg_stat_subscription;

Publisher should show slot lag stabilizing:

SELECT
  slot_name,
  pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) AS confirmed_lag
FROM pg_replication_slots
WHERE slot_type='logical';

PostgreSQL Logical Replication Lag: Big Transactions and Reorder Buffer Spills

Incident narrative (anonymized)

Timeline

Mechanism: why big transactions freeze logical replication progress

Reorder buffer spills turn a CPU problem into an IO problem

Runbook: detect the blocker and recover safely

1) Measure apply delay on the subscriber

2) Measure slot lag on the publisher

3) Find the long transaction on the publisher

4) Confirm it’s the pin

Safe mitigations

1) Abort the big transaction (stop the pin)

2) Replace “one big transaction” with bounded batches

3) Tune decoding budgets intentionally

Risky mitigations

What we changed (concrete)

1) Enforce a “max transaction budget” for backfills

2) Add alerts that fire before WAL retention becomes a second incident

How to verify

Related posts

Cite this article

Incident narrative (anonymized)

Timeline

Mechanism: why big transactions freeze logical replication progress

Reorder buffer spills turn a CPU problem into an IO problem

Runbook: detect the blocker and recover safely

1) Measure apply delay on the subscriber

2) Measure slot lag on the publisher

3) Find the long transaction on the publisher

4) Confirm it’s the pin

Safe mitigations

1) Abort the big transaction (stop the pin)

2) Replace “one big transaction” with bounded batches

3) Tune decoding budgets intentionally

Risky mitigations

What we changed (concrete)

1) Enforce a “max transaction budget” for backfills

2) Add alerts that fire before WAL retention becomes a second incident

How to verify

Related reading

Related posts

Cite this article