Logical Replication Slot WAL Bloat: When Subscribers Go Offline

Replication slots are easy to forget and expensive to ignore. “pg_wal directory is 500GB and growing.” The on-call engineer was panicking. Their production database had been stable for months, and suddenly disk space was disappearing at 10GB per hour. The database itself was fine—queries were fast, connections were normal. But something was eating disk space at an alarming rate.

The cause: a Debezium CDC connector had crashed three days ago, and nobody noticed. The logical replication slot it was using remained active, patiently waiting for the connector to reconnect. PostgreSQL, honoring its contract with the slot, kept every WAL file generated since the crash. Three days of production database changes added up to 500GB of WAL—and growing.

This is one of the most common PostgreSQL emergencies I encounter, and it’s almost always the same story. A subscriber goes offline—crashed, misconfigured, or simply forgotten during decommissioning—and PostgreSQL does exactly what it’s supposed to do: keep the data the subscriber needs. The problem is that “keep the data” means “fill up the disk.”

Environment: PostgreSQL 14+ with logical replication, CDC pipelines (Debezium, etc.), multiple subscribers

The Problem

Disk Space Disappearing

The first sign is usually a disk space alert. The database has been running fine, there’s no unusual growth in table data, but the PostgreSQL data directory is ballooning:

# Disk alert fires
df -h /var/lib/postgresql
# /dev/sda1  1000G  950G  50G  95% /var/lib/postgresql

# Where is the space?
du -sh /var/lib/postgresql/14/main/*
# 450G pg_wal  <- WAL files eating disk!

# How many WAL files?
ls -la /var/lib/postgresql/14/main/pg_wal/*.gz | wc -l
# 28,521 files

# Check oldest WAL file
ls -lt /var/lib/postgresql/14/main/pg_wal/ | tail -5
# -rw------- 1 postgres postgres 16777216 Dec 26 10:00 0000000100000ABC00000001
# That's 3 days old - why is it still here?

In normal operation, WAL files are recycled continuously. PostgreSQL generates them, ships them to replicas, and recycles them once all consumers have acknowledged receipt. A healthy database might have a few hours of WAL at most. Days of WAL means something is preventing cleanup.

Finding the Culprit

The pg_replication_slots view tells you exactly what’s happening. Look for inactive slots with large retained WAL:

-- Check replication slots
SELECT
    slot_name,
    slot_type,
    active,
    restart_lsn,
    confirmed_flush_lsn,
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal,
    age(now(), COALESCE(
        (SELECT backend_start FROM pg_stat_activity
         WHERE pid = active_pid), '2000-01-01'
    )) AS last_active
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC;

-- Output:
-- slot_name           | active | retained_wal | last_active
-- debezium_cdc_slot   | f      | 450 GB       | 3 days 02:15:33
-- analytics_sub       | t      | 15 MB        | 00:00:05
-- audit_log_slot      | t      | 8 MB         | 00:00:12

-- debezium_cdc_slot is inactive and holding 450GB of WAL!

The active = false combined with large retained_wal is the smoking gun. This slot’s consumer disconnected, and PostgreSQL has been holding WAL ever since, waiting for it to return.

Root Cause

How Logical Replication Slots Work

To understand why this happens, you need to understand what a replication slot promises. When you create a logical replication slot, you’re telling PostgreSQL: “I’m going to read changes from this database, and I need you to keep those changes available until I’ve read them.”

PostgreSQL takes this promise seriously. A slot tracks how far its consumer has read (the confirmed_flush_lsn) and keeps all WAL from that point forward. This is the contract that makes logical replication work—if a consumer disconnects and reconnects later, it can resume exactly where it left off.

Normal operation:
┌─────────────────────────────────────────────────────────────┐
│ PostgreSQL Primary                                          │
│                                                             │
│ WAL Generation: [001][002][003][004][005][006]...          │
│                                                             │
│ Slot A (active):   confirmed_lsn = 005                     │
│ Slot B (active):   confirmed_lsn = 004                     │
│                                                             │
│ Can delete: [001][002][003] (before all slots)             │
│ Must keep:  [004][005][006]... (needed by some slot)       │
└─────────────────────────────────────────────────────────────┘

When subscriber goes offline:
┌─────────────────────────────────────────────────────────────┐
│ PostgreSQL Primary                                          │
│                                                             │
│ WAL Generation: [001][002]...[1000][1001][1002]...         │
│                                                             │
│ Slot A (OFFLINE): confirmed_lsn = 002 (3 days ago!)        │
│ Slot B (active):  confirmed_lsn = 1002                     │
│                                                             │
│ Can delete: [001] (before ALL slots including offline)     │
│ Must keep:  [002]...[1002] (Slot A might reconnect)        │
│                                                             │
│ 1000 × 16MB = 16GB minimum (often much more)               │
└─────────────────────────────────────────────────────────────┘

The key insight is that PostgreSQL retains WAL based on the oldest consumer, not the average. If you have 10 active slots consuming in real-time and 1 slot that disconnected a week ago, PostgreSQL keeps a week of WAL. The one lagging slot determines retention for everyone.

Why PostgreSQL Keeps the WAL

This behavior is intentional and correct from PostgreSQL’s perspective. The slot is a promise, and PostgreSQL keeps its promises. When you create a logical replication slot, you’re accepting responsibility for managing it—including cleaning it up when the consumer is decommissioned.

-- Logical slots are PERSISTENT
-- They survive restarts and wait indefinitely for consumer

-- The slot contract:
-- "I promise to have all changes since restart_lsn available
--  so subscriber can resume from where it left off"

-- This is correct behavior, but unbounded:
-- - No automatic cleanup of inactive slots
-- - No timeout for offline consumers
-- - No size limit on retained WAL

The problem is that this promise has no expiration. A slot created for a CDC pipeline that was later abandoned will wait forever. There’s no automatic “I give up after 7 days” logic—that would violate the slot contract. PostgreSQL assumes you meant to create the slot and will eventually use it.

This design makes sense for its intended use cases. If a subscriber has a planned maintenance window, you don’t want PostgreSQL to give up on it. But it means that forgotten slots become ticking time bombs, waiting to fill your disk the next time the slot goes inactive.

Diagnosis

Monitor Slot Lag

The best defense is monitoring slot health proactively. Create a view that surfaces problematic slots before they fill your disk:

-- Create monitoring query
CREATE OR REPLACE VIEW replication_slot_lag AS
SELECT
    slot_name,
    slot_type,
    active,
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS wal_retained,
    pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) / 1024 / 1024 AS mb_retained,
    CASE
        WHEN active THEN
            (SELECT now() - backend_start FROM pg_stat_activity
             WHERE pid = active_pid)
        ELSE
            'inactive'::interval
    END AS connection_age
FROM pg_replication_slots;

-- Check regularly
SELECT * FROM replication_slot_lag WHERE mb_retained > 1000;

Any slot retaining more than a gigabyte of WAL deserves investigation. If it’s active and consuming normally, that’s fine—it just has some lag. If it’s inactive, you need to find out why and decide what to do about it.

Check WAL Retention Settings

PostgreSQL 13 introduced max_slot_wal_keep_size, which limits how much WAL a slot can retain. Before this setting, there was no protection—a single forgotten slot could fill any disk given enough time.

-- These settings affect WAL retention but DON'T limit slot retention
SHOW wal_keep_size;        -- Minimum WAL to keep (not a max!)
SHOW max_slot_wal_keep_size; -- PostgreSQL 13+: Max WAL per slot!

-- max_slot_wal_keep_size is the key setting
-- Default: -1 (unlimited - dangerous!)
-- Set to limit: '100GB'

The wal_keep_size setting is often misunderstood. It’s a minimum, not a maximum. It tells PostgreSQL to keep at least this much WAL for streaming replication, regardless of what slots need. It doesn’t limit slots at all.

max_slot_wal_keep_size is the limit. When a slot tries to retain more than this amount, PostgreSQL marks it as “invalid” and stops retaining WAL for it. The subscriber will need to do a full resync when it reconnects, but your disk is protected.

Identify Slot Owner

Before dropping a slot, you need to know what created it. The plugin and naming conventions usually reveal the source:

-- Who created this slot? (check pg_stat_replication)
SELECT
    s.slot_name,
    s.plugin,
    r.application_name,
    r.client_addr,
    r.state
FROM pg_replication_slots s
LEFT JOIN pg_stat_replication r ON s.active_pid = r.pid;

-- Check slot metadata for hints
SELECT slot_name, plugin FROM pg_replication_slots;
-- plugin = 'pgoutput' → native logical replication
-- plugin = 'decoderbufs' → Debezium
-- plugin = 'wal2json' → wal2json CDC

The plugin tells you what ecosystem the slot belongs to. pgoutput is native PostgreSQL logical replication. decoderbufs is almost always Debezium. wal2json suggests a generic CDC pipeline. The naming convention often includes application or environment identifiers.

Once you know the owner, you can make an informed decision. Is this a production CDC pipeline that someone needs to restart? A test slot that was never cleaned up? A subscriber that was decommissioned without removing its slot?

The Fix

Option 1: Drop Inactive Slot (If Consumer is Gone)

If you’ve confirmed the consumer is truly gone—decommissioned, replaced, or abandoned—dropping the slot is the right fix:

-- DANGER: This loses all changes since slot went inactive!
-- Only do this if consumer is truly gone or will full-resync

-- Check what you're dropping
SELECT slot_name, pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn))
FROM pg_replication_slots WHERE slot_name = 'debezium_cdc_slot';

-- Drop the slot
SELECT pg_drop_replication_slot('debezium_cdc_slot');

-- WAL files will be cleaned up by next checkpoint
CHECKPOINT;

The CHECKPOINT forces immediate WAL cleanup. Without it, PostgreSQL will clean up WAL at the next regular checkpoint, which might be minutes away. When you’re at 95% disk usage, those minutes matter.

Be absolutely certain the consumer is gone. Dropping a slot for an active consumer means it loses its position and must do a full resync—potentially hours of work for large databases.

Option 2: Set max_slot_wal_keep_size (Prevention)

This is the setting that should be configured on every PostgreSQL 13+ database with logical replication:

-- PostgreSQL 13+: Limit WAL retained per slot
ALTER SYSTEM SET max_slot_wal_keep_size = '100GB';
SELECT pg_reload_conf();

-- If slot exceeds this, it becomes "invalid"
-- Subscriber must do full resync, but disk is protected

-- Check for invalidated slots
SELECT slot_name, wal_status FROM pg_replication_slots;
-- wal_status = 'lost' means slot is invalid (needs resync)

Choose the limit based on your environment. How much WAL can your disk afford to hold? How long should a subscriber be allowed to be offline before requiring a resync? For most environments, 24-48 hours of WAL (which you can estimate from your typical WAL generation rate) is a reasonable limit.

When a slot hits this limit, PostgreSQL marks it with wal_status = 'lost'. The slot still exists, but PostgreSQL has given up on it. When the subscriber reconnects, it will find that its position is no longer available and must start fresh.

Option 3: Proactive Slot Management

For environments with many slots (common with CDC pipelines feeding multiple downstream systems), automate slot hygiene:

-- Create function to drop stale slots
CREATE OR REPLACE FUNCTION cleanup_stale_slots(max_age interval, max_wal_gb int)
RETURNS TABLE (slot_name text, action text) AS $$
DECLARE
    slot RECORD;
BEGIN
    FOR slot IN
        SELECT
            s.slot_name,
            s.active,
            pg_wal_lsn_diff(pg_current_wal_lsn(), s.restart_lsn) / 1024 / 1024 / 1024 AS gb_retained
        FROM pg_replication_slots s
        WHERE s.slot_type = 'logical'
        AND s.active = false
        AND pg_wal_lsn_diff(pg_current_wal_lsn(), s.restart_lsn) > max_wal_gb * 1024 * 1024 * 1024
    LOOP
        PERFORM pg_drop_replication_slot(slot.slot_name);
        RETURN QUERY SELECT slot.slot_name, 'dropped'::text;
    END LOOP;
END;
$$ LANGUAGE plpgsql;

-- Run periodically (e.g., via pg_cron)
SELECT * FROM cleanup_stale_slots('1 day', 50);

This function identifies inactive slots that exceed your threshold and drops them. Schedule it to run daily via pg_cron or your favorite job scheduler. The function returns what it dropped, so you have a record for investigation.

Be cautious with aggressive thresholds. A subscriber might legitimately be offline for maintenance. The combination of inactive AND exceeds limit catches forgotten slots while giving active maintenance windows a chance.

Option 4: Alerting Before Disk Full

Prevention is better than cure. Set up alerts that fire well before disk fills:

groups:
  - name: postgresql-slots
    rules:
      - alert: ReplicationSlotWALRetention
        expr: |
          pg_replication_slot_wal_retained_bytes > 50 * 1024 * 1024 * 1024
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Slot {{ $labels.slot_name }} retaining {{ $value | humanize }} WAL"

      - alert: ReplicationSlotInactive
        expr: |
          pg_replication_slot_active == 0
        for: 6h
        labels:
          severity: warning
        annotations:
          summary: "Slot {{ $labels.slot_name }} inactive for > 6 hours"

The retention alert catches any slot holding too much WAL, regardless of why. The inactive alert catches slots that have been disconnected for too long. Together, they give you early warning before disk pressure becomes critical.

Checklist

## Logical Replication Slot WAL Bloat

### Symptoms
- [ ] pg_wal directory growing unbounded
- [ ] Disk space alerts on PostgreSQL server
- [ ] WAL files older than expected retention
- [ ] inactive slots in pg_replication_slots

### Diagnosis
- [ ] Check pg_replication_slots for inactive slots
- [ ] Identify retained WAL size per slot
- [ ] Find slot owner (application/subscriber)
- [ ] Check max_slot_wal_keep_size setting

### Fixes
- [ ] Drop truly abandoned slots
- [ ] Set max_slot_wal_keep_size for protection
- [ ] Reconnect/restart offline subscriber
- [ ] Implement proactive slot cleanup

### Prevention
- [ ] Monitor slot lag metrics
- [ ] Alert on inactive slots
- [ ] Set max_slot_wal_keep_size
- [ ] Document slot owners and recovery procedures

Conclusion

Logical replication slots are promises that PostgreSQL takes seriously. When you create a slot, PostgreSQL commits to keeping all changes available until your subscriber has read them. A forgotten slot will honor that promise forever—or until it fills your disk.

The key insight is that this is correct behavior from PostgreSQL’s perspective. The fix isn’t to change PostgreSQL; it’s to manage your slots properly:

Inactive slots retain WAL indefinitely by design—that’s what makes them useful
max_slot_wal_keep_size is your safety net (PostgreSQL 13+)—configure it on every database
Monitor and alert on slot lag before disk fills—50GB warnings give you time to react
Document slot ownership so you know what can be dropped when things go wrong

The next time you set up logical replication, remember: you’re not just creating a slot, you’re making a commitment to manage it throughout its lifetime.

pg_waldump WAL Forensics - Analyzing WAL contents
PostgreSQL HOT Updates - Understanding PostgreSQL internals

Logical Replication Slot WAL Bloat: When Subscribers Go Offline

The Problem

Disk Space Disappearing

Finding the Culprit

Root Cause

How Logical Replication Slots Work

Why PostgreSQL Keeps the WAL

Diagnosis

Monitor Slot Lag

Check WAL Retention Settings

Identify Slot Owner

The Fix

Option 1: Drop Inactive Slot (If Consumer is Gone)

Option 2: Set max_slot_wal_keep_size (Prevention)

Option 3: Proactive Slot Management

Option 4: Alerting Before Disk Full

Checklist

Conclusion

Related posts

Cite this article

The Problem

Disk Space Disappearing

Finding the Culprit

Root Cause

How Logical Replication Slots Work

Why PostgreSQL Keeps the WAL

Diagnosis

Monitor Slot Lag

Check WAL Retention Settings

Identify Slot Owner

The Fix

Option 1: Drop Inactive Slot (If Consumer is Gone)

Option 2: Set max_slot_wal_keep_size (Prevention)

Option 3: Proactive Slot Management

Option 4: Alerting Before Disk Full

Checklist

Conclusion

Related Articles

Related posts

Cite this article