PostgreSQL Replication Slot Bloat: How One Inactive Slot Filled 500GB Disk
We saw disk usage climb and the culprit was a single forgotten slot. “Disk at 95%, database read-only in 30 minutes.” The alert woke me at 3 AM. I checked the disk usage breakdown: pg_wal had 400GB of WAL files. That’s when I noticed the replication slot called debezium_prod hadn’t advanced in 5 days—since the Debezium pod crashed after a Kubernetes node failure and nobody noticed it wasn’t consuming.
The terrifying thing about replication slots is how silently they fail. Debezium goes down, the slot stops advancing, and PostgreSQL dutifully keeps every WAL segment from that point forward—because the slot promises it will come back and need that data. Days pass. WAL accumulates. Nobody notices because there are no errors, no warnings, just steadily increasing disk usage. By the time the disk fills up, you have minutes to fix it before PostgreSQL goes read-only.
Replication slots solve a real problem: ensuring that a replica or logical consumer doesn’t miss data. Without slots, if a consumer goes offline for an hour, PostgreSQL might have already recycled the WAL segments it needed. With slots, PostgreSQL guarantees data availability. But that guarantee has a cost: the slot must be actively managed, monitored, and cleaned up when no longer needed.
This is a classic “set it and forget it” failure mode. You configure Debezium, it creates a replication slot, everything works great. Then Debezium crashes, and you discover that nobody ever set up monitoring for slot health. The safety mechanism designed to prevent data loss becomes the mechanism that causes an outage.
Tested on: PostgreSQL 16.1, logical replication to Debezium/Kafka
How Replication Slots Work
WAL Retention
Normal WAL lifecycle:
1. Transaction → WAL written
2. Checkpointer marks WAL as complete
3. Old WAL files deleted (based on wal_keep_size)
With replication slot:
1. Transaction → WAL written
2. Slot tracks "I need WAL from position X"
3. PostgreSQL keeps ALL WAL from position X
4. Even if slot is inactive for days!
Slot Types
-- Physical replication slot (streaming replication)
SELECT pg_create_physical_replication_slot('replica1');
-- Logical replication slot (Debezium, pg_logical, etc.)
SELECT pg_create_logical_replication_slot('debezium', 'pgoutput');
-- Both prevent WAL cleanup!
The Problem
Scenario: Debezium Goes Down
Day 1, 00:00: Debezium consuming changes normally
Day 1, 14:00: Debezium pod crashes, not noticed
Day 1, 14:00 → Day 5: Replication slot stops advancing
Day 5, 03:00: Disk at 95%, alerts fire
WAL accumulated: 400GB
Time to disaster: 30 minutes
Checking Slot Status
-- View all replication slots
SELECT
slot_name,
slot_type,
database,
active,
restart_lsn,
confirmed_flush_lsn,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS lag_size,
age(now(), pg_last_xact_replay_timestamp()) AS lag_time
FROM pg_replication_slots;
-- Example output (problem slot):
-- slot_name | active | lag_size
-- -----------+--------+---------
-- debezium | f | 412 GB ← Inactive, massive lag!
Prevention
1. max_slot_wal_keep_size (PostgreSQL 13+)
-- postgresql.conf
-- Maximum WAL retained by slots (in MB)
max_slot_wal_keep_size = '50GB'
-- When exceeded:
-- - Slot is invalidated
-- - WAL is freed
-- - Consumer must re-sync (full snapshot)
2. Monitoring Slot Lag
-- Prometheus query for slot lag
SELECT
slot_name,
pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS lag_bytes
FROM pg_replication_slots;
# Alert rule
groups:
- name: replication
rules:
- alert: ReplicationSlotLagHigh
expr: pg_replication_slot_lag_bytes > 10737418240 # 10GB
for: 30m
labels:
severity: warning
annotations:
summary: "Replication slot {{ $labels.slot_name }} lag > 10GB"
- alert: ReplicationSlotInactive
expr: pg_replication_slot_active == 0
for: 1h
labels:
severity: critical
annotations:
summary: "Replication slot {{ $labels.slot_name }} inactive for 1h"
3. Slot Timeout (Custom Solution)
-- No built-in slot timeout, but can implement via cron
-- Create helper function
CREATE OR REPLACE FUNCTION drop_inactive_slots(max_inactive_hours int)
RETURNS TABLE(dropped_slot text) AS $$
DECLARE
r RECORD;
BEGIN
FOR r IN
SELECT slot_name
FROM pg_replication_slots
WHERE NOT active
AND age(now(), pg_last_xact_replay_timestamp()) > make_interval(hours => max_inactive_hours)
LOOP
PERFORM pg_drop_replication_slot(r.slot_name);
dropped_slot := r.slot_name;
RETURN NEXT;
END LOOP;
END;
$$ LANGUAGE plpgsql;
-- Run via pg_cron
SELECT cron.schedule('drop-stale-slots', '0 * * * *',
$$SELECT drop_inactive_slots(24)$$);
Recovery Playbook
Immediate: Free Disk Space
-- 1. Identify problematic slot
SELECT slot_name, pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS lag
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC;
-- 2. Drop the slot (consumer will need full resync!)
SELECT pg_drop_replication_slot('debezium');
-- 3. Trigger checkpoint to speed up WAL cleanup
CHECKPOINT;
-- 4. Verify WAL is being cleaned
-- (wait for next checkpoint cycle)
SELECT count(*), pg_size_pretty(sum(size))
FROM pg_ls_waldir();
Consumer Recovery
After dropping slot:
1. Debezium/Consumer needs FULL snapshot
2. This can take hours for large databases
3. Plan for:
- Increased database load during snapshot
- Temporary data lag in downstream systems
- Possible duplicate processing (idempotency!)
Debezium Specific
# Debezium config for new snapshot
{
"connector.class": "io.debezium.connector.postgresql.PostgresConnector",
"snapshot.mode": "initial", # Full snapshot on start
"slot.drop.on.stop": "false", # Keep slot when stopping (careful!)
# Alternative: Clean restart
"snapshot.mode": "initial_only" # Snapshot, then stop
}
Architecture Improvements
1. Separate Replication User
-- Dedicated user for replication
CREATE USER replication_user REPLICATION LOGIN;
GRANT USAGE ON SCHEMA public TO replication_user;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO replication_user;
-- Monitor this user's connections
SELECT * FROM pg_stat_replication WHERE usename = 'replication_user';
2. Health Check Endpoint
// health_check.go
func checkReplicationSlots(db *sql.DB) error {
var slotName string
var lagBytes int64
rows, err := db.Query(`
SELECT slot_name, pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
FROM pg_replication_slots
WHERE NOT active
`)
if err != nil {
return err
}
defer rows.Close()
for rows.Next() {
rows.Scan(&slotName, &lagBytes)
if lagBytes > 10*1024*1024*1024 { // 10GB
return fmt.Errorf("slot %s has %d bytes lag", slotName, lagBytes)
}
}
return nil
}
3. Kubernetes Liveness Probe
# For Debezium deployment
apiVersion: apps/v1
kind: Deployment
spec:
template:
spec:
containers:
- name: debezium
livenessProbe:
httpGet:
path: /connectors/postgres-connector/status
port: 8083
initialDelaySeconds: 60
periodSeconds: 30
failureThreshold: 3
Monitoring Dashboard
Key Metrics
# WAL directory size
pg_wal_directory_size_bytes
# Slot lag (per slot)
pg_replication_slot_wal_lsn_diff{slot_name=~".*"}
# Inactive slots count
count(pg_replication_slot_active == 0)
# WAL generation rate
rate(pg_wal_lsn_diff(pg_current_wal_lsn())[5m])
Grafana Panel
{
"panels": [
{
"title": "Replication Slot Lag",
"type": "timeseries",
"targets": [
{
"expr": "pg_replication_slot_wal_lsn_diff",
"legendFormat": "{{ slot_name }}"
}
]
},
{
"title": "WAL Directory Size",
"type": "stat",
"targets": [
{
"expr": "pg_wal_directory_size_bytes"
}
]
}
]
}
Checklist
## Replication Slot Management
### Prevention
- [ ] Set max_slot_wal_keep_size (PostgreSQL 13+)
- [ ] Alert on slot lag > 10GB
- [ ] Alert on inactive slots > 1 hour
- [ ] Implement automatic stale slot cleanup
### Monitoring
- [ ] Dashboard with slot lag per slot
- [ ] WAL directory size tracking
- [ ] Consumer health checks (Debezium status)
### Recovery Plan
- [ ] Document slot drop procedure
- [ ] Know consumer resync time for full snapshot
- [ ] Test recovery procedure in staging
### Consumer Setup
- [ ] Liveness probes for Debezium/consumers
- [ ] Auto-restart on failure
- [ ] Alerting on consumer failures
Conclusion
Replication slots represent a fascinating trade-off in database design: you gain guaranteed data availability for replicas and consumers, but you pay with the requirement for active monitoring and lifecycle management. A slot that isn’t being consumed isn’t just wasteful—it’s a time bomb slowly filling your disk.
What makes this particularly dangerous is the disconnect between symptom and cause. You get a disk space alert, you check for large tables, you look for bloat—but the problem isn’t in your tables, it’s in pg_wal. And even when you find the WAL accumulation, you might not immediately connect it to the replication slot that’s been inactive for days. The slot doesn’t announce its problem; it just quietly prevents cleanup.
The PostgreSQL 13+ feature max_slot_wal_keep_size is the safety net everyone should enable. It caps how much WAL a slot can retain, automatically invalidating slots that fall too far behind. Yes, this means the consumer will need a full snapshot on recovery, but that’s better than a database outage. It converts an unbounded retention into a bounded one.
Key principles:
- One inactive slot can fill your disk—there’s no warning until it’s nearly too late
- Set
max_slot_wal_keep_sizeas a safety limit in PostgreSQL 13+ - Alert on inactive slots within 1 hour—this is early enough to investigate before it becomes critical
- Monitor slot lag in bytes, not just activity—an active slot that can’t keep up is just as dangerous
- Have a recovery plan—know how long a full snapshot takes and plan accordingly
The slot you forgot about is the one that takes down production. Monitor them all.
Related Articles
- PostgreSQL Autovacuum SLO - Database maintenance
- K8s PostgreSQL Connection Storm - Database in Kubernetes
Related posts
Logical Replication Slot WAL Bloat: When Subscribers Go Offline
Disk filling up with WAL files. The cause: a logical replication slot consumer went offline, and PostgreSQL retains all WAL since then because it might be needed.
PostgreSQL Idle in Transaction: Emergency Playbook for Stuck Connections
Autovacuum can't run, table bloat growing, all because of one 'idle in transaction' connection. Here's the detection and kill playbook.
PostgreSQL Read Replica Conflicts: Why Your Queries Get Canceled
Queries on read replicas fail with 'canceling statement due to conflict with recovery'. The fix depends on which of the 5 conflict types you have - here's how to diagnose and solve each one.
pg_waldump WAL Forensics: Reconstructing What Happened to Your Data
Something deleted rows from production but nobody admits to running DELETE. Use pg_waldump to analyze WAL files and reconstruct exactly what happened and when.
Cite this article
If you reference this post, please link to the original URL and credit the author.