PostgreSQL Logical Replication Lag: Big Transactions and Reorder Buffer Spills
One huge transaction can pin logical replication for hours. Runbook to detect the blocker, tune decoding safely, and enforce bounded transactions in prod.
32 posts
One huge transaction can pin logical replication for hours. Runbook to detect the blocker, tune decoding safely, and enforce bounded transactions in prod.
Your EXPLAIN looks perfect but production melts. The culprit: PostgreSQL silently switched from a custom plan to a generic plan after enough executions, and the generic plan is catastrophically wrong.
PostgreSQL LISTEN/NOTIFY works perfectly in local testing but notifications randomly stop arriving in production. The culprit: transaction pooling quietly reassigning your connection to someone else.
PostgreSQL can go read-only near XID wraparound. Use this emergency playbook to find the oldest tables, unblock vacuum freeze, and prevent repeat incidents.
hot_standby_feedback stops replica query cancellations but can bloat the primary over days. Detect xmin pinning, mitigate safely, and add guardrails.
A reproducible approach to diagnose and eliminate checkpoint-induced latency spikes using pgbench, pg_stat_bgwriter, and WAL/IO budgeting.
App hangs but the database looks healthy. Your pool is exhausted. I show how to detect it, size pools sanely, and prevent connection leaks.
Something deleted rows from production but nobody admits to running DELETE. Use pg_waldump to analyze WAL files and reconstruct exactly what happened and when.
Pool size 50 because that's how it's always been? I'll show how to use Little's Law to calculate optimal pool size and prove it with load tests.
Random UUIDs as Primary Keys cause index bloat and random I/O. Benchmark with specific numbers - index size, cache hit ratio, and WAL volume after 100M inserts.
Vacuum runs successfully but disk keeps growing and cache hit ratio drops. I'll show how to quantify HOT-update eligibility using pgstattuple and optimize fillfactor.
Same query, same params, but prod is slow and staging works fine. I'll show how to reproduce the generic plan problem with pgBouncer, Java/Go and how to fix it.
Disk filling up with WAL files. The cause: a logical replication slot consumer went offline, and PostgreSQL retains all WAL since then because it might be needed.
Autovacuum is either ignored or cargo-cult tuned. I'll show how to turn it into an SLO-driven system with specific numbers, pg_stat metrics, and reproducible tests.
A practical playbook for safe database migrations in production. From expand/contract pattern through online indexes to monitoring and rollback.
Adding Redis just for distributed locks? PostgreSQL advisory locks might be enough. I compare both with failure scenarios and performance benchmarks.
SELECT * on a table with JSON is 10x slower than expected. I'll show how TOAST storage works and when to change strategies for large columns.
Disk is 95% full, WAL directory has 400GB. I'll show how replication slots prevent WAL cleanup and a playbook for prevention and recovery.
Autovacuum can't run, table bloat growing, all because of one 'idle in transaction' connection. Here's the detection and kill playbook.
Why mocks lie and how Testcontainers will change your testing approach. Practical examples, CI setup, and data isolation strategies.
Full-text search was fast, now it's slow. The cause: GIN index pending list grew huge during bulk inserts, and every search must now scan the unsorted pending entries.
Reproducible lab demonstrating connection storm during K8s rollouts. PgBouncer, preStop hooks and jitter - practical solutions with benchmarks.
Practical Outbox pattern implementation in Node.js/TypeScript with PostgreSQL LISTEN/NOTIFY. Race-condition case study and production-ready solution.
A practical analysis of why soft delete destroys database performance over time. Benchmarks, partitioning solution, and migration checklist.
Query returns wrong results after OS upgrade. The cause: ICU library version changed, collation rules shifted, and indexes are now sorted inconsistently with the new sort order.
Query scans full table despite perfect partial index. The cause: query's WHERE clause doesn't match the index predicate exactly, or statistics mislead the planner.
Perfect idempotency logic, but customers still get charged twice. The cause: checking idempotency keys against a read replica that's seconds behind the primary during traffic spikes.
Queries on read replicas fail with 'canceling statement due to conflict with recovery'. The fix depends on which of the 5 conflict types you have - here's how to diagnose and solve each one.
Getting 'could not serialize access due to concurrent update'? The fix isn't just retry logic - it's understanding when to use which isolation level and how to reduce conflict frequency.
work_mem looks small at 256MB, but a parallel hash join with 4 workers across 3 plan nodes uses 3GB. Here's how to prevent PostgreSQL from legitimately OOMing your container.
Adding an index for performance made writes 10x slower. The counter-intuitive cause: the new index broke HOT updates, turning cheap in-place updates into full-row rewrites with massive bloat.
Rolling deploy fails with cached plan errors after ALTER TABLE. The cause: server-side prepared statements cache query plans that break when schema changes.