PostgreSQL Read Replica Conflicts: Why Your Queries Get Canceled

“ERROR: canceling statement due to conflict with recovery” - if you’re running read replicas in PostgreSQL, you’ve seen this. The query was fine. The replica is healthy. But PostgreSQL killed your query anyway because WAL replay needed something your query was using.

The frustrating part: the error message doesn’t tell you which conflict type caused it. There are five different conflict types, each with different causes and different solutions. Blindly setting max_standby_streaming_delay = -1 (infinite) might fix your immediate problem while creating a much worse one.

Environment: PostgreSQL 12+, streaming replication, read replicas handling production queries

The Five Conflict Types

PostgreSQL tracks conflicts in pg_stat_database_conflicts. This is your diagnostic starting point:

SELECT datname,
       confl_tablespace,
       confl_lock,
       confl_snapshot,
       confl_bufferpin,
       confl_deadlock
FROM pg_stat_database_conflicts
WHERE datname = current_database();

Each column represents a different conflict type:

┌─────────────────────┬─────────────────────────────────────────────────────┐
│ Conflict Type       │ What's Happening                                    │
├─────────────────────┼─────────────────────────────────────────────────────┤
│ confl_snapshot      │ VACUUM removed rows your query still needs          │
│ confl_lock          │ Exclusive lock needed for WAL replay                │
│ confl_bufferpin     │ WAL replay needs a buffer your query is using       │
│ confl_tablespace    │ Tablespace dropped while query accessing it         │
│ confl_deadlock      │ Deadlock between query and recovery process         │
└─────────────────────┴─────────────────────────────────────────────────────┘

90% of the time, it’s confl_snapshot. But check your numbers - the fix for each is different.

Conflict Type 1: Snapshot Conflicts (confl_snapshot)

This is the most common. Here’s what happens:

Timeline:

Primary:
T+0:00   Transaction inserts row (id=100)
T+0:05   Transaction commits
T+5:00   VACUUM runs, removes dead row versions
T+5:01   WAL record shipped: "remove rows from page X"

Replica:
T+0:00   Query starts: SELECT * FROM big_table (takes 10 minutes)
         Query has snapshot from T+0:00
T+5:02   WAL replay wants to apply VACUUM changes
         But query still needs those row versions!
         → confl_snapshot, query canceled

The replica can’t apply the VACUUM WAL record because your long-running query holds a snapshot that references rows VACUUM wants to remove.

Solution: hot_standby_feedback

-- On replica, in postgresql.conf
hot_standby_feedback = on

This tells the primary “don’t VACUUM rows that my queries still need.” The primary’s VACUUM will skip rows that any replica is still using.

The tradeoff: Table bloat on the primary. If a replica runs a 6-hour query, the primary can’t clean up any dead rows for 6 hours. Monitor pg_stat_user_tables.n_dead_tup.

-- Monitor bloat on primary
SELECT schemaname, relname, n_live_tup, n_dead_tup,
       round(n_dead_tup::numeric / nullif(n_live_tup, 0) * 100, 2) as dead_pct
FROM pg_stat_user_tables
WHERE n_dead_tup > 10000
ORDER BY n_dead_tup DESC;

Alternative: max_standby_streaming_delay

-- On replica
max_standby_streaming_delay = 300s  -- Wait up to 5 minutes before canceling

This delays WAL replay rather than canceling queries. Setting it to -1 means “wait forever” - but then your replica falls behind indefinitely during long queries.

Decision Tree for Snapshot Conflicts

Is query duration predictable?
├─ Yes, queries < 5 minutes
│   └─ Set max_standby_streaming_delay = 300s
│      (Acceptable replication lag)
│
├─ Yes, but some queries run hours
│   └─ hot_standby_feedback = on
│      + Monitor primary bloat
│      + Consider query optimization first
│
└─ No, unpredictable ad-hoc queries
    └─ hot_standby_feedback = on
       + Set statement_timeout on replica
       + Accept some primary bloat

Conflict Type 2: Lock Conflicts (confl_lock)

Less common, but trickier. This happens when WAL replay needs an exclusive lock that your query is blocking.

Primary:
T+0:00   ALTER TABLE users ADD COLUMN email_verified boolean;
         (Acquires AccessExclusiveLock, ships in WAL)

Replica:
T+0:00   Query: SELECT * FROM users WHERE ... (holds AccessShareLock)
T+0:01   WAL replay wants to apply ALTER TABLE
         Needs AccessExclusiveLock - blocked by query
         After max_standby_streaming_delay → query canceled

Solution: Avoid DDL During Peak Hours

Lock conflicts come from DDL operations (ALTER TABLE, CREATE INDEX CONCURRENTLY completion, etc.). The replica has no choice but to wait or cancel.

-- Check what locks recovery is waiting for
SELECT relation::regclass, mode, granted, pid
FROM pg_locks
WHERE NOT granted
  AND pid = (SELECT pid FROM pg_stat_activity WHERE backend_type = 'startup');

Best practice: Run DDL during low-traffic periods. There’s no configuration that magically solves this without either canceling queries or delaying replication.

Conflict Type 3: Buffer Pin Conflicts (confl_bufferpin)

This happens when WAL replay needs to modify a buffer page that a query is actively reading:

Replica:
Query is scanning page 12345 of table X (buffer pinned)
WAL replay: "I need to modify page 12345"
→ Can't proceed while buffer is pinned
→ After timeout, cancel the query

Solution: max_standby_streaming_delay

Buffer pin conflicts resolve quickly - usually within milliseconds. A modest delay handles them:

max_standby_streaming_delay = 30s

If you’re seeing high confl_bufferpin, the root cause is usually extremely write-heavy workload on primary + long sequential scans on replica.

Conflict Type 4: Tablespace Conflicts (confl_tablespace)

Rare. Only happens when you DROP TABLESPACE while a query is accessing objects in it.

-- Don't do this during production hours
DROP TABLESPACE old_tablespace;

Solution: Don’t drop tablespaces while queries are running. Coordinate maintenance windows.

Conflict Type 5: Deadlock Conflicts (confl_deadlock)

Extremely rare. Happens when recovery and a query form a deadlock cycle. PostgreSQL resolves it by canceling the query.

If you’re seeing these, something unusual is happening. Check for:

Extremely long-running queries holding many buffers
Very high write rate on primary
Misconfigured max_standby_streaming_delay

The hot_standby_feedback Trap

hot_standby_feedback = on seems like the perfect solution. But it has a nasty side effect that the docs underemphasize:

Scenario: Replica has runaway query

T+0:00   Replica query starts (maybe forgotten psql session)
T+1:00   Query still running... primary can't VACUUM
T+2:00   Query still running... dead tuples accumulating
...
T+24:00  Someone notices primary has 500GB of bloat

-- The replica's query prevented ALL cleanup on primary
-- Even for tables the query wasn't touching!

Mitigation:

-- On replica, set aggressive statement timeout
SET statement_timeout = '30min';

-- Or per-user
ALTER ROLE reporting_user SET statement_timeout = '30min';

Always pair hot_standby_feedback = on with statement_timeout.

Managed PostgreSQL Specifics

AWS RDS

hot_standby_feedback: Not configurable on read replicas (off by default)
max_standby_streaming_delay: Configurable via parameter group

-- RDS workaround: Increase delay
max_standby_streaming_delay = 300

RDS doesn’t expose hot_standby_feedback because Amazon doesn’t want customer replicas bloating the primary. You’re limited to tuning max_standby_streaming_delay.

Google Cloud SQL

hot_standby_feedback: Configurable via flags
max_standby_streaming_delay: Configurable via flags

-- Cloud SQL allows both, but warns about bloat

Azure Database for PostgreSQL

hot_standby_feedback: Configurable
max_standby_streaming_delay: Configurable

-- Flexible Server gives full control

Monitoring and Alerting

-- Create view for conflict monitoring
CREATE VIEW replica_conflicts AS
SELECT datname,
       confl_snapshot + confl_lock + confl_bufferpin +
       confl_tablespace + confl_deadlock as total_conflicts,
       confl_snapshot,
       confl_lock,
       confl_bufferpin,
       confl_tablespace,
       confl_deadlock
FROM pg_stat_database_conflicts
WHERE datname = current_database();

# Prometheus alert
groups:
  - name: postgresql-replica
    rules:
      - alert: HighReplicaConflicts
        expr: |
          rate(pg_stat_database_conflicts_confl_snapshot[5m]) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High snapshot conflict rate on replica"

      - alert: ReplicaLagDuringConflict
        expr: |
          pg_replication_lag_seconds > 300
          and rate(pg_stat_database_conflicts_confl_snapshot[5m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Replica lagging due to query conflicts"

The Real Fix: Query Optimization

Before tuning replication parameters, ask: why are queries running so long?

-- Find long-running queries on replica
SELECT pid, now() - query_start as duration, query
FROM pg_stat_activity
WHERE state = 'active'
  AND query_start < now() - interval '1 minute'
ORDER BY query_start;

A query that runs for 30 minutes might run in 30 seconds with proper indexing. Fix the query and the conflict problem disappears.

-- Identify missing indexes
SELECT schemaname, relname, seq_scan, seq_tup_read,
       idx_scan, seq_tup_read / nullif(seq_scan, 0) as avg_seq_tup
FROM pg_stat_user_tables
WHERE seq_scan > 100
  AND seq_tup_read / nullif(seq_scan, 0) > 10000
ORDER BY seq_tup_read DESC;

Checklist

## PostgreSQL Replica Conflict Resolution

### Diagnosis
- [ ] Check pg_stat_database_conflicts for conflict type distribution
- [ ] Identify long-running queries on replica
- [ ] Monitor replication lag during conflicts
- [ ] Check primary table bloat (n_dead_tup)

### Quick Fixes
- [ ] Increase max_standby_streaming_delay (if acceptable lag)
- [ ] Enable hot_standby_feedback (if acceptable bloat)
- [ ] Set statement_timeout on replica (prevents runaway queries)

### Root Cause
- [ ] Optimize slow queries (indexes, query rewrite)
- [ ] Schedule heavy analytics during off-peak
- [ ] Consider dedicated analytics replica with different settings

### Monitoring
- [ ] Alert on conflict rate increase
- [ ] Alert on replication lag
- [ ] Alert on primary bloat (dead tuples)

Conclusion

“Canceling statement due to conflict with recovery” means PostgreSQL is doing its job - protecting replication consistency. The five conflict types each have different causes:

Snapshot conflicts (most common): VACUUM vs long queries → hot_standby_feedback or longer delay
Lock conflicts: DDL on primary → schedule DDL carefully
Buffer pin conflicts: page access collision → modest delay usually sufficient
Tablespace conflicts: dropped tablespace → don’t drop during queries
Deadlock conflicts: rare edge case → investigate if occurring

The key insight: there’s no free lunch. hot_standby_feedback trades replica stability for primary bloat. max_standby_streaming_delay trades query stability for replication lag. The best solution is often fixing the slow queries that trigger conflicts in the first place.

Check pg_stat_database_conflicts first. Know which conflict type you’re fighting before choosing your weapon.