Double Charges From Idempotency Keys: The Replica Lag Trap

We trusted idempotency keys until replica lag turned them into a lie. “Our idempotency implementation is bulletproof, but we still processed the same payment twice.” The finance team called with the report that nobody wants to hear: a customer was charged twice for the same order. We reviewed the code. The idempotency implementation was textbook-correct: check if the key exists, if not process and store the result, if yes return the stored result. The code was right. How did we charge someone twice?

The bug was in our database connection pool, not our application logic. To improve read performance, we’d configured the pool to route SELECT queries to read replicas and write queries to the primary. This works great for most use cases. But idempotency checks have a special requirement: the SELECT must see the most recent writes. When the primary is under load, replicas fall behind—sometimes by seconds. A retry that arrives during that window won’t find the original idempotency key because the replica hasn’t received it yet.

The sequence was maddening to reconstruct. Request 1 arrives, checks the replica (no key), processes payment, writes to primary. Request 2 arrives 500ms later (a retry from client timeout), checks the same replica (still no key because replica is 2 seconds behind), starts processing a second payment. By the time Request 2 tries to write the idempotency key, the primary shows a conflict—but the payment API call already happened. One database record, two credit card charges.

This is why idempotency is fundamentally a distributed systems problem. You can’t implement it correctly by looking at code in isolation. You have to understand the entire data path, including which database handles which queries, and what happens when those databases temporarily disagree about state.

Environment: PostgreSQL primary-replica setup, microservices with connection pooling, payment processing system

The Problem

The Double Charge Incident

Timeline of a duplicate payment:

T+0.000s  Client sends payment request (idempotency_key: "pay_abc123")
T+0.001s  Service checks replica: "Does pay_abc123 exist?" → NO
T+0.002s  Service starts payment processing
T+0.003s  Service writes to PRIMARY: INSERT pay_abc123

T+0.500s  Network timeout, client retries (same key: "pay_abc123")
T+0.501s  Service checks replica: "Does pay_abc123 exist?" → NO
          (Replica is 2 seconds behind during traffic spike!)
T+0.502s  Service starts SECOND payment processing
T+0.503s  Service writes to PRIMARY: INSERT pay_abc123 → CONFLICT!
          But payment already submitted to payment processor...

Result: Customer charged twice, one DB record

The “Correct” Code That Fails

class PaymentService:
    def __init__(self, db_pool):
        # Connection pool routes reads to replicas for "performance"
        self.db = db_pool

    def process_payment(self, idempotency_key: str, amount: Decimal):
        # Check if we've seen this request before
        # BUG: This query goes to a READ REPLICA!
        existing = self.db.query(
            "SELECT * FROM payments WHERE idempotency_key = %s",
            idempotency_key
        )

        if existing:
            return existing.result  # Return cached result

        # Process payment with external provider
        result = self.payment_provider.charge(amount)  # SIDE EFFECT!

        # Store result for future idempotent requests
        # This goes to PRIMARY
        self.db.execute(
            "INSERT INTO payments (idempotency_key, result) VALUES (%s, %s)",
            idempotency_key, result
        )

        return result

Root Cause

Replica Lag Is Variable

Normal operation (replica lag ~10ms):
┌─────────────────────────────────────────────────────┐
│ PRIMARY: INSERT pay_abc123 at T+0                   │
│ REPLICA: Receives INSERT at T+10ms                  │
│ Retry at T+500ms: Replica has the record ✓          │
└─────────────────────────────────────────────────────┘

During load spike (replica lag ~3 seconds):
┌─────────────────────────────────────────────────────┐
│ PRIMARY: INSERT pay_abc123 at T+0                   │
│ REPLICA: Still processing T-3s transactions...      │
│ Retry at T+500ms: Replica shows NO record! ✗        │
│                                                     │
│ Window of vulnerability: 0 to ~3 seconds            │
└─────────────────────────────────────────────────────┘

When does replica lag spike?
- Heavy write load on primary
- Large transactions (batch operations)
- Replica under CPU pressure
- Network congestion between primary/replica
- Vacuum operations on replica

Why Connection Pools Make This Worse

# Many connection pools auto-route based on query type

# PgBouncer with read/write splitting:
# - SELECT → replica
# - INSERT/UPDATE → primary

# SQLAlchemy with routing:
class RoutingSession(Session):
    def get_bind(self, mapper=None, clause=None):
        if self._flushing:
            return engines['primary']
        return engines['replica']  # All reads go to replica!

# Even "smart" routing misses the semantic requirement:
# "This SELECT MUST see the result of a recent INSERT"

Diagnosis

Check Replica Lag

-- On PostgreSQL replica:
SELECT
    now() - pg_last_xact_replay_timestamp() AS replica_lag,
    pg_is_in_recovery() AS is_replica;

-- On primary, check all replicas:
SELECT
    client_addr,
    state,
    sent_lsn,
    write_lsn,
    flush_lsn,
    replay_lsn,
    pg_wal_lsn_diff(sent_lsn, replay_lsn) AS bytes_behind
FROM pg_stat_replication;

Find Double-Processed Requests

-- Look for idempotency key conflicts
SELECT
    idempotency_key,
    COUNT(*) as attempts,
    MIN(created_at) as first_attempt,
    MAX(created_at) as last_attempt,
    MAX(created_at) - MIN(created_at) as window
FROM payment_attempts
GROUP BY idempotency_key
HAVING COUNT(*) > 1
ORDER BY last_attempt DESC;

-- Correlate with replica lag metrics
SELECT
    date_trunc('minute', timestamp) as minute,
    MAX(replica_lag_seconds) as max_lag,
    COUNT(CASE WHEN duplicate_detected THEN 1 END) as duplicates
FROM payment_metrics
GROUP BY 1
ORDER BY 1 DESC;

Trace Request Flow

# Add logging to identify which database handled the query
import logging

def process_payment(idempotency_key: str):
    # Log which server we're checking
    conn = db_pool.get_connection()
    server = conn.execute("SELECT inet_server_addr()").scalar()
    logging.info(f"Idempotency check on server: {server}")

    # Check for existing
    existing = conn.execute(
        "SELECT * FROM payments WHERE idempotency_key = %s",
        idempotency_key
    ).first()

    logging.info(f"Key {idempotency_key} exists: {existing is not None}")

The Fix

Option 1: Force Primary for Idempotency Checks

class PaymentService:
    def __init__(self, primary_db, replica_db):
        self.primary = primary_db
        self.replica = replica_db

    def process_payment(self, idempotency_key: str, amount: Decimal):
        # CRITICAL: Check PRIMARY for idempotency keys
        existing = self.primary.query(
            "SELECT * FROM payments WHERE idempotency_key = %s",
            idempotency_key
        )

        if existing:
            return existing.result

        # ... rest of processing

Option 2: Use INSERT … ON CONFLICT for Atomic Check

def process_payment(idempotency_key: str, amount: Decimal):
    # First, try to claim the idempotency key atomically
    # This MUST go to primary
    result = self.primary.execute("""
        INSERT INTO idempotency_locks (key, status, created_at)
        VALUES (%s, 'processing', now())
        ON CONFLICT (key) DO UPDATE
        SET key = idempotency_locks.key  -- no-op update
        RETURNING status, result, (xmax = 0) AS inserted
    """, idempotency_key)

    if not result.inserted:
        # Key already exists
        if result.status == 'completed':
            return result.result
        elif result.status == 'processing':
            raise ConcurrentProcessingError("Request in progress")

    try:
        # We won the lock, process payment
        payment_result = self.payment_provider.charge(amount)

        self.primary.execute("""
            UPDATE idempotency_locks
            SET status = 'completed', result = %s
            WHERE key = %s
        """, payment_result, idempotency_key)

        return payment_result
    except Exception as e:
        self.primary.execute("""
            UPDATE idempotency_locks
            SET status = 'failed', error = %s
            WHERE key = %s
        """, str(e), idempotency_key)
        raise

Option 3: Synchronous Replica for Critical Reads

-- Configure synchronous replication for critical tables
-- postgresql.conf on primary:
synchronous_commit = on
synchronous_standby_names = 'critical_replica'

-- Or use synchronous_commit per-transaction:
BEGIN;
SET LOCAL synchronous_commit = on;
-- Your critical idempotency operations here
COMMIT;

Option 4: Add Client-Side Deduplication

# Don't rely solely on server-side idempotency

class PaymentClient:
    def __init__(self):
        self.pending_requests = {}  # In Redis or similar

    def charge(self, idempotency_key: str, amount: Decimal):
        # Check if we already have a pending request
        if idempotency_key in self.pending_requests:
            return self.wait_for_result(idempotency_key)

        # Mark as pending BEFORE sending
        self.pending_requests[idempotency_key] = 'pending'

        try:
            result = self.api.process_payment(idempotency_key, amount)
            self.pending_requests[idempotency_key] = result
            return result
        except NetworkError:
            # Don't clear pending - let next retry check server
            raise

Monitoring

groups:
  - name: idempotency-safety
    rules:
      - alert: HighReplicaLag
        expr: |
          pg_replication_lag_seconds > 1
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Replica lag {{ $value }}s may cause idempotency failures"

      - alert: DuplicatePaymentsDetected
        expr: |
          rate(payment_duplicates_total[5m]) > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Duplicate payments detected - check idempotency"

      - alert: IdempotencyKeyConflicts
        expr: |
          rate(idempotency_conflicts_total[5m]) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High rate of idempotency conflicts"

Checklist

## Idempotency Replica Lag

### Symptoms
- [ ] Duplicate transactions despite idempotency keys
- [ ] Duplicates correlate with high traffic periods
- [ ] Duplicates happen during retry storms
- [ ] Single DB record but double external effects

### Diagnosis
- [ ] Check which DB handles idempotency lookups
- [ ] Monitor replica lag during incidents
- [ ] Search logs for duplicate idempotency keys
- [ ] Verify connection pool routing logic

### Fixes
- [ ] Force idempotency checks to primary
- [ ] Use INSERT ON CONFLICT for atomic claims
- [ ] Consider synchronous replication
- [ ] Add client-side deduplication layer
- [ ] Alert on replica lag > 1 second

Conclusion

This bug highlights a fundamental tension in database architecture. Read replicas are a performance optimization—they let you scale reads without overloading the primary. But they introduce eventual consistency, which breaks any pattern that requires “read your own writes” semantics. Idempotency is exactly such a pattern.

The fix seems obvious in retrospect: force idempotency checks to the primary. But this has performance implications. If your service handles 10,000 RPS and every request does an idempotency check, that’s 10,000 additional primary queries per second. You’ve traded read replica efficiency for correctness. For payment systems, that’s the right trade. For other use cases, you might need different strategies.

The deeper lesson is that idempotency requires end-to-end thinking. The application code might be correct. The database schema might be correct. But if the infrastructure layer (connection pools, load balancers, routing rules) can route your queries to a stale view of the database, your idempotency guarantees break. You have to trace the entire path from client retry to database write and verify that every step preserves the invariants you need.

Key principles:

Idempotency checks MUST hit primary - never trust replicas for operations that require “read your own writes”
Atomic claim-then-process - use INSERT ON CONFLICT to claim the key before starting the side effect
Monitor replica lag - alert when it exceeds your retry window (usually 1 second is the threshold)
Defense in depth - combine server-side and client-side deduplication for critical operations
Audit your connection pools - know which queries go where, especially for critical paths

Clock Step Backwards Split-Brain - Another timing trap
PostgreSQL WAL Forensics - Debugging replication issues

Double Charges From Idempotency Keys: The Replica Lag Trap

The Problem

The Double Charge Incident

The “Correct” Code That Fails

Root Cause

Replica Lag Is Variable

Why Connection Pools Make This Worse

Diagnosis

Check Replica Lag

Find Double-Processed Requests

Trace Request Flow

The Fix

Option 1: Force Primary for Idempotency Checks

Option 2: Use INSERT … ON CONFLICT for Atomic Check

Option 3: Synchronous Replica for Critical Reads

Option 4: Add Client-Side Deduplication

Monitoring

Checklist

Conclusion

Related posts

Cite this article

The Problem

The Double Charge Incident

The “Correct” Code That Fails

Root Cause

Replica Lag Is Variable

Why Connection Pools Make This Worse

Diagnosis

Check Replica Lag

Find Double-Processed Requests

Trace Request Flow

The Fix

Option 1: Force Primary for Idempotency Checks

Option 2: Use INSERT … ON CONFLICT for Atomic Check

Option 3: Synchronous Replica for Critical Reads

Option 4: Add Client-Side Deduplication

Monitoring

Checklist

Conclusion

Related Articles

Related posts

Cite this article