Back to blog

PostgreSQL Replication Slot Bloat: How One Inactive Slot Filled 500GB Disk

|
| postgresql, replication, wal, disk-bloat, logical-replication, monitoring

We saw disk usage climb and the culprit was a single forgotten slot. “Disk at 95%, database read-only in 30 minutes.” The alert woke me at 3 AM. I checked the disk usage breakdown: pg_wal had 400GB of WAL files. That’s when I noticed the replication slot called debezium_prod hadn’t advanced in 5 days—since the Debezium pod crashed after a Kubernetes node failure and nobody noticed it wasn’t consuming.

The terrifying thing about replication slots is how silently they fail. Debezium goes down, the slot stops advancing, and PostgreSQL dutifully keeps every WAL segment from that point forward—because the slot promises it will come back and need that data. Days pass. WAL accumulates. Nobody notices because there are no errors, no warnings, just steadily increasing disk usage. By the time the disk fills up, you have minutes to fix it before PostgreSQL goes read-only.

Replication slots solve a real problem: ensuring that a replica or logical consumer doesn’t miss data. Without slots, if a consumer goes offline for an hour, PostgreSQL might have already recycled the WAL segments it needed. With slots, PostgreSQL guarantees data availability. But that guarantee has a cost: the slot must be actively managed, monitored, and cleaned up when no longer needed.

This is a classic “set it and forget it” failure mode. You configure Debezium, it creates a replication slot, everything works great. Then Debezium crashes, and you discover that nobody ever set up monitoring for slot health. The safety mechanism designed to prevent data loss becomes the mechanism that causes an outage.

Tested on: PostgreSQL 16.1, logical replication to Debezium/Kafka

How Replication Slots Work

WAL Retention

Normal WAL lifecycle:
1. Transaction → WAL written
2. Checkpointer marks WAL as complete
3. Old WAL files deleted (based on wal_keep_size)

With replication slot:
1. Transaction → WAL written
2. Slot tracks "I need WAL from position X"
3. PostgreSQL keeps ALL WAL from position X
4. Even if slot is inactive for days!

Slot Types

-- Physical replication slot (streaming replication)
SELECT pg_create_physical_replication_slot('replica1');

-- Logical replication slot (Debezium, pg_logical, etc.)
SELECT pg_create_logical_replication_slot('debezium', 'pgoutput');

-- Both prevent WAL cleanup!

The Problem

Scenario: Debezium Goes Down

Day 1, 00:00: Debezium consuming changes normally
Day 1, 14:00: Debezium pod crashes, not noticed
Day 1, 14:00 → Day 5: Replication slot stops advancing
Day 5, 03:00: Disk at 95%, alerts fire

WAL accumulated: 400GB
Time to disaster: 30 minutes

Checking Slot Status

-- View all replication slots
SELECT
    slot_name,
    slot_type,
    database,
    active,
    restart_lsn,
    confirmed_flush_lsn,
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS lag_size,
    age(now(), pg_last_xact_replay_timestamp()) AS lag_time
FROM pg_replication_slots;

-- Example output (problem slot):
-- slot_name  | active | lag_size
-- -----------+--------+---------
-- debezium   | f      | 412 GB   ← Inactive, massive lag!

Prevention

1. max_slot_wal_keep_size (PostgreSQL 13+)

-- postgresql.conf
-- Maximum WAL retained by slots (in MB)
max_slot_wal_keep_size = '50GB'

-- When exceeded:
-- - Slot is invalidated
-- - WAL is freed
-- - Consumer must re-sync (full snapshot)

2. Monitoring Slot Lag

-- Prometheus query for slot lag
SELECT
    slot_name,
    pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS lag_bytes
FROM pg_replication_slots;
# Alert rule
groups:
- name: replication
  rules:
  - alert: ReplicationSlotLagHigh
    expr: pg_replication_slot_lag_bytes > 10737418240  # 10GB
    for: 30m
    labels:
      severity: warning
    annotations:
      summary: "Replication slot {{ $labels.slot_name }} lag > 10GB"

  - alert: ReplicationSlotInactive
    expr: pg_replication_slot_active == 0
    for: 1h
    labels:
      severity: critical
    annotations:
      summary: "Replication slot {{ $labels.slot_name }} inactive for 1h"

3. Slot Timeout (Custom Solution)

-- No built-in slot timeout, but can implement via cron

-- Create helper function
CREATE OR REPLACE FUNCTION drop_inactive_slots(max_inactive_hours int)
RETURNS TABLE(dropped_slot text) AS $$
DECLARE
    r RECORD;
BEGIN
    FOR r IN
        SELECT slot_name
        FROM pg_replication_slots
        WHERE NOT active
        AND age(now(), pg_last_xact_replay_timestamp()) > make_interval(hours => max_inactive_hours)
    LOOP
        PERFORM pg_drop_replication_slot(r.slot_name);
        dropped_slot := r.slot_name;
        RETURN NEXT;
    END LOOP;
END;
$$ LANGUAGE plpgsql;

-- Run via pg_cron
SELECT cron.schedule('drop-stale-slots', '0 * * * *',
    $$SELECT drop_inactive_slots(24)$$);

Recovery Playbook

Immediate: Free Disk Space

-- 1. Identify problematic slot
SELECT slot_name, pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS lag
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC;

-- 2. Drop the slot (consumer will need full resync!)
SELECT pg_drop_replication_slot('debezium');

-- 3. Trigger checkpoint to speed up WAL cleanup
CHECKPOINT;

-- 4. Verify WAL is being cleaned
-- (wait for next checkpoint cycle)
SELECT count(*), pg_size_pretty(sum(size))
FROM pg_ls_waldir();

Consumer Recovery

After dropping slot:

1. Debezium/Consumer needs FULL snapshot
2. This can take hours for large databases
3. Plan for:
   - Increased database load during snapshot
   - Temporary data lag in downstream systems
   - Possible duplicate processing (idempotency!)

Debezium Specific

# Debezium config for new snapshot
{
  "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
  "snapshot.mode": "initial",  # Full snapshot on start
  "slot.drop.on.stop": "false", # Keep slot when stopping (careful!)

  # Alternative: Clean restart
  "snapshot.mode": "initial_only"  # Snapshot, then stop
}

Architecture Improvements

1. Separate Replication User

-- Dedicated user for replication
CREATE USER replication_user REPLICATION LOGIN;
GRANT USAGE ON SCHEMA public TO replication_user;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO replication_user;

-- Monitor this user's connections
SELECT * FROM pg_stat_replication WHERE usename = 'replication_user';

2. Health Check Endpoint

// health_check.go
func checkReplicationSlots(db *sql.DB) error {
    var slotName string
    var lagBytes int64

    rows, err := db.Query(`
        SELECT slot_name, pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
        FROM pg_replication_slots
        WHERE NOT active
    `)
    if err != nil {
        return err
    }
    defer rows.Close()

    for rows.Next() {
        rows.Scan(&slotName, &lagBytes)
        if lagBytes > 10*1024*1024*1024 { // 10GB
            return fmt.Errorf("slot %s has %d bytes lag", slotName, lagBytes)
        }
    }
    return nil
}

3. Kubernetes Liveness Probe

# For Debezium deployment
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      containers:
      - name: debezium
        livenessProbe:
          httpGet:
            path: /connectors/postgres-connector/status
            port: 8083
          initialDelaySeconds: 60
          periodSeconds: 30
          failureThreshold: 3

Monitoring Dashboard

Key Metrics

# WAL directory size
pg_wal_directory_size_bytes

# Slot lag (per slot)
pg_replication_slot_wal_lsn_diff{slot_name=~".*"}

# Inactive slots count
count(pg_replication_slot_active == 0)

# WAL generation rate
rate(pg_wal_lsn_diff(pg_current_wal_lsn())[5m])

Grafana Panel

{
  "panels": [
    {
      "title": "Replication Slot Lag",
      "type": "timeseries",
      "targets": [
        {
          "expr": "pg_replication_slot_wal_lsn_diff",
          "legendFormat": "{{ slot_name }}"
        }
      ]
    },
    {
      "title": "WAL Directory Size",
      "type": "stat",
      "targets": [
        {
          "expr": "pg_wal_directory_size_bytes"
        }
      ]
    }
  ]
}

Checklist

## Replication Slot Management

### Prevention
- [ ] Set max_slot_wal_keep_size (PostgreSQL 13+)
- [ ] Alert on slot lag > 10GB
- [ ] Alert on inactive slots > 1 hour
- [ ] Implement automatic stale slot cleanup

### Monitoring
- [ ] Dashboard with slot lag per slot
- [ ] WAL directory size tracking
- [ ] Consumer health checks (Debezium status)

### Recovery Plan
- [ ] Document slot drop procedure
- [ ] Know consumer resync time for full snapshot
- [ ] Test recovery procedure in staging

### Consumer Setup
- [ ] Liveness probes for Debezium/consumers
- [ ] Auto-restart on failure
- [ ] Alerting on consumer failures

Conclusion

Replication slots represent a fascinating trade-off in database design: you gain guaranteed data availability for replicas and consumers, but you pay with the requirement for active monitoring and lifecycle management. A slot that isn’t being consumed isn’t just wasteful—it’s a time bomb slowly filling your disk.

What makes this particularly dangerous is the disconnect between symptom and cause. You get a disk space alert, you check for large tables, you look for bloat—but the problem isn’t in your tables, it’s in pg_wal. And even when you find the WAL accumulation, you might not immediately connect it to the replication slot that’s been inactive for days. The slot doesn’t announce its problem; it just quietly prevents cleanup.

The PostgreSQL 13+ feature max_slot_wal_keep_size is the safety net everyone should enable. It caps how much WAL a slot can retain, automatically invalidating slots that fall too far behind. Yes, this means the consumer will need a full snapshot on recovery, but that’s better than a database outage. It converts an unbounded retention into a bounded one.

Key principles:

  1. One inactive slot can fill your disk—there’s no warning until it’s nearly too late
  2. Set max_slot_wal_keep_size as a safety limit in PostgreSQL 13+
  3. Alert on inactive slots within 1 hour—this is early enough to investigate before it becomes critical
  4. Monitor slot lag in bytes, not just activity—an active slot that can’t keep up is just as dangerous
  5. Have a recovery plan—know how long a full snapshot takes and plan accordingly

The slot you forgot about is the one that takes down production. Monitor them all.


Related posts

Cite this article

If you reference this post, please link to the original URL and credit the author.

Michal Drozd. "PostgreSQL Replication Slot Bloat: How One Inactive Slot Filled 500GB Disk". https://www.michal-drozd.com/en/blog/postgresql-replication-slot-bloat/ (Published June 8, 2025).