PostgreSQL Serialization Failures: Beyond 'Just Retry'
“ERROR: could not serialize access due to concurrent update” - every PostgreSQL developer hits this eventually. The standard advice is “just retry the transaction.” But that’s like saying “just restart it” for any error. The real questions are: Why did it happen? Should you even be using that isolation level? And how do you reduce the retry frequency from 30% to 0.1%?
Environment: PostgreSQL 12+, applications using
REPEATABLE READorSERIALIZABLEisolation levels, high-concurrency workloads
Understanding the Error
The Three Isolation Levels
PostgreSQL offers three isolation levels, each with different trade-offs:
┌────────────────────┬─────────────────┬──────────────────┬─────────────────────┐
│ Isolation Level │ Dirty Reads │ Non-Repeatable │ Serialization │
│ │ │ Reads │ Failures │
├────────────────────┼─────────────────┼──────────────────┼─────────────────────┤
│ READ COMMITTED │ No │ Yes │ Never │
│ REPEATABLE READ │ No │ No │ Yes (concurrent │
│ │ │ │ updates only) │
│ SERIALIZABLE │ No │ No │ Yes (any conflict) │
└────────────────────┴─────────────────┴──────────────────┴─────────────────────┘
Key insight: READ COMMITTED (the default) never throws serialization failures. If you’re getting this error, you explicitly chose a stricter isolation level.
When Each Error Occurs
-- REPEATABLE READ: Fails on concurrent UPDATE/DELETE to same row
-- Transaction A -- Transaction B
BEGIN ISOLATION LEVEL REPEATABLE READ;
SELECT * FROM accounts WHERE id = 1;
BEGIN;
UPDATE accounts SET balance = 50 WHERE id = 1;
COMMIT;
UPDATE accounts SET balance = 100 WHERE id = 1;
-- ERROR: could not serialize access due to concurrent update
-- SERIALIZABLE: Fails on ANY read-write dependency
-- Transaction A -- Transaction B
BEGIN ISOLATION LEVEL SERIALIZABLE;
SELECT sum(balance) FROM accounts;
BEGIN ISOLATION LEVEL SERIALIZABLE;
INSERT INTO accounts (balance) VALUES (100);
COMMIT;
INSERT INTO accounts (balance) VALUES (200);
COMMIT;
-- ERROR: could not serialize access due to read/write dependencies
The Decision Tree: Which Isolation Level Do You Need?
Do you need to prevent non-repeatable reads?
│
├─ NO → Use READ COMMITTED (default)
│ No serialization failures, best performance
│ Handle concurrent updates with SELECT FOR UPDATE if needed
│
└─ YES → Do you need full serializability (as if transactions ran one-by-one)?
│
├─ NO → Use REPEATABLE READ
│ Only fails on direct row conflicts
│ Good for: Reports, read-heavy analytics
│
└─ YES → Use SERIALIZABLE
Fails on any read-write dependency
Good for: Financial calculations, inventory
Requires robust retry logic
When READ COMMITTED Is Actually Enough
90% of applications using REPEATABLE READ don’t actually need it:
-- Problem: "I need to read a value and update based on it"
-- Bad: Using REPEATABLE READ
BEGIN ISOLATION LEVEL REPEATABLE READ;
SELECT balance FROM accounts WHERE id = 1; -- Returns 100
-- Application calculates: new_balance = 100 + 50
UPDATE accounts SET balance = 150 WHERE id = 1;
COMMIT;
-- Can fail with serialization error!
-- Good: Using READ COMMITTED with atomic update
BEGIN; -- Default READ COMMITTED
UPDATE accounts SET balance = balance + 50 WHERE id = 1;
COMMIT;
-- Never fails with serialization error!
-- The UPDATE sees the latest committed value
-- Good: Using SELECT FOR UPDATE when you need the value
BEGIN;
SELECT balance FROM accounts WHERE id = 1 FOR UPDATE;
-- Now you have a lock, no concurrent updates possible
UPDATE accounts SET balance = 150 WHERE id = 1;
COMMIT;
Implementing Proper Retry Logic
If you genuinely need REPEATABLE READ or SERIALIZABLE, implement proper retry:
Basic Retry Pattern
import psycopg2
from psycopg2 import errors
import time
import random
def execute_with_retry(conn_params, transaction_func, max_retries=5):
"""Execute a transaction with exponential backoff retry."""
for attempt in range(max_retries):
conn = psycopg2.connect(**conn_params)
try:
conn.set_isolation_level(
psycopg2.extensions.ISOLATION_LEVEL_REPEATABLE_READ
)
with conn.cursor() as cur:
result = transaction_func(cur)
conn.commit()
return result
except errors.SerializationFailure as e:
conn.rollback()
if attempt == max_retries - 1:
raise # Final attempt failed
# Exponential backoff with jitter
wait_time = (2 ** attempt) * 0.1 + random.uniform(0, 0.1)
time.sleep(wait_time)
finally:
conn.close()
# Usage
def transfer_funds(cur):
cur.execute("SELECT balance FROM accounts WHERE id = 1")
balance = cur.fetchone()[0]
cur.execute("UPDATE accounts SET balance = %s WHERE id = 1", (balance - 100,))
cur.execute("UPDATE accounts SET balance = balance + 100 WHERE id = 2")
return True
execute_with_retry(conn_params, transfer_funds)
Java/Spring Pattern
@Service
public class AccountService {
@Autowired
private AccountRepository accountRepository;
@Retryable(
value = {SerializationFailureException.class},
maxAttempts = 5,
backoff = @Backoff(delay = 100, multiplier = 2, random = true)
)
@Transactional(isolation = Isolation.REPEATABLE_READ)
public void transfer(Long fromId, Long toId, BigDecimal amount) {
Account from = accountRepository.findById(fromId).orElseThrow();
Account to = accountRepository.findById(toId).orElseThrow();
from.setBalance(from.getBalance().subtract(amount));
to.setBalance(to.getBalance().add(amount));
accountRepository.save(from);
accountRepository.save(to);
}
@Recover
public void transferFailed(SerializationFailureException e,
Long fromId, Long toId, BigDecimal amount) {
log.error("Transfer failed after max retries: {} -> {} amount {}",
fromId, toId, amount);
throw new TransferFailedException("Max retries exceeded", e);
}
}
Go Pattern
func ExecuteSerializable(db *sql.DB, fn func(*sql.Tx) error) error {
maxRetries := 5
for attempt := 0; attempt < maxRetries; attempt++ {
tx, err := db.BeginTx(context.Background(), &sql.TxOptions{
Isolation: sql.LevelSerializable,
})
if err != nil {
return err
}
err = fn(tx)
if err != nil {
tx.Rollback()
// Check if it's a serialization failure (SQLSTATE 40001)
if pqErr, ok := err.(*pq.Error); ok && pqErr.Code == "40001" {
// Exponential backoff with jitter
waitTime := time.Duration(math.Pow(2, float64(attempt))) * 100 * time.Millisecond
jitter := time.Duration(rand.Int63n(100)) * time.Millisecond
time.Sleep(waitTime + jitter)
continue
}
return err
}
err = tx.Commit()
if err != nil {
// Commit can also fail with serialization error
if pqErr, ok := err.(*pq.Error); ok && pqErr.Code == "40001" {
continue
}
return err
}
return nil // Success
}
return fmt.Errorf("max retries exceeded for serializable transaction")
}
Reducing Conflict Frequency
Retry logic is necessary, but reducing the need for retries is better:
1. Keep Transactions Short
-- Bad: Long transaction with idle time
BEGIN ISOLATION LEVEL SERIALIZABLE;
SELECT * FROM products WHERE category = 'electronics';
-- Application processes data for 5 seconds...
UPDATE products SET stock = stock - 1 WHERE id = 123;
COMMIT;
-- High conflict probability!
-- Good: Minimal transaction time
-- Do processing OUTSIDE the transaction
products = fetch_products('electronics') -- Separate query
process_products(products) -- Application logic
BEGIN ISOLATION LEVEL SERIALIZABLE;
UPDATE products SET stock = stock - 1 WHERE id = 123;
COMMIT;
-- Low conflict probability
2. Use Indexes to Narrow Scan Scope
-- Without index: Scans all rows, creates wide read dependencies
BEGIN ISOLATION LEVEL SERIALIZABLE;
SELECT * FROM orders WHERE customer_id = 123 AND status = 'pending';
-- Sequential scan = read dependency on ALL rows
-- With index: Only reads specific rows
CREATE INDEX idx_orders_customer_status ON orders(customer_id, status);
-- Now only creates dependencies on rows that match
3. Order Your Operations Consistently
-- Deadlock-prone: Different order in different transactions
-- Transaction A: UPDATE accounts SET ... WHERE id = 1; then id = 2
-- Transaction B: UPDATE accounts SET ... WHERE id = 2; then id = 1
-- Safe: Always same order
-- Sort by ID before updating
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
-- Both transactions use same order = no deadlocks
4. Consider Optimistic Locking Instead
-- Add version column
ALTER TABLE products ADD COLUMN version INTEGER DEFAULT 1;
-- Read with version
SELECT id, name, stock, version FROM products WHERE id = 123;
-- Returns: (123, 'Widget', 10, 5)
-- Update with version check (READ COMMITTED is fine!)
UPDATE products
SET stock = 9, version = version + 1
WHERE id = 123 AND version = 5;
-- If affected rows = 0, someone else updated first → retry in application
Monitoring Serialization Failures
-- Check serialization failure rate
SELECT datname,
xact_commit,
xact_rollback,
conflicts,
round(100.0 * xact_rollback / nullif(xact_commit + xact_rollback, 0), 2)
as rollback_pct
FROM pg_stat_database
WHERE datname = current_database();
-- Find queries causing conflicts (requires pg_stat_statements)
SELECT query,
calls,
total_time / calls as avg_time_ms
FROM pg_stat_statements
WHERE query ILIKE '%ISOLATION LEVEL%'
ORDER BY calls DESC
LIMIT 20;
Prometheus Alerting
groups:
- name: postgresql-serialization
rules:
- alert: HighSerializationFailureRate
expr: |
rate(pg_stat_database_xact_rollback[5m]) /
(rate(pg_stat_database_xact_commit[5m]) +
rate(pg_stat_database_xact_rollback[5m])) > 0.05
for: 5m
labels:
severity: warning
annotations:
summary: "High transaction rollback rate ({{ $value | humanizePercentage }})"
- alert: SerializationRetriesExhausted
expr: |
rate(app_serialization_retries_exhausted_total[5m]) > 0
for: 1m
labels:
severity: critical
annotations:
summary: "Transactions failing after max retries"
Managed PostgreSQL Specifics
AWS RDS
# RDS doesn't have special settings for serialization
# But connection limits affect retry storms
max_connections = 100 # Default varies by instance size
# Monitor via Performance Insights
# Look for wait events: Lock:transactionid
Google Cloud SQL
# Same PostgreSQL behavior
# Monitor via Cloud SQL Insights
# Filter by: Serialization failures
Azure Database for PostgreSQL
# Standard PostgreSQL isolation levels
# Use Query Performance Insight for monitoring
Checklist
## Serialization Failure Resolution
### Diagnosis
- [ ] Confirm you actually need REPEATABLE READ/SERIALIZABLE
- [ ] Check if READ COMMITTED with SELECT FOR UPDATE would work
- [ ] Measure current failure rate from pg_stat_database
### If You Need Strict Isolation
- [ ] Implement retry with exponential backoff + jitter
- [ ] Set max retry limit (5 is usually enough)
- [ ] Log retries for monitoring
### Reduce Failure Frequency
- [ ] Keep transactions as short as possible
- [ ] Add indexes to narrow read dependencies
- [ ] Use consistent ordering for multi-row updates
- [ ] Consider optimistic locking instead
### Monitoring
- [ ] Alert on rollback rate > 5%
- [ ] Alert on retry exhaustion
- [ ] Track avg retries per transaction
Conclusion
“Could not serialize access” isn’t really an error - it’s PostgreSQL telling you that concurrent transactions couldn’t be ordered consistently. The question is whether you actually need that guarantee.
Key takeaways:
- 90% of applications don’t need REPEATABLE READ - atomic UPDATE or SELECT FOR UPDATE with READ COMMITTED solves most use cases
- If you need it, implement proper retry - exponential backoff with jitter, not immediate retry
- Reduce conflict probability - short transactions, proper indexes, consistent ordering
- Monitor your rollback rate - if it’s above 5%, something is wrong with your access patterns
The best serialization failure is the one that never happens because you chose the right isolation level for your use case.
Related Articles
- PostgreSQL Read Replica Conflicts - Recovery conflicts on read replicas
- PostgreSQL Idle Transaction Playbook - Connection exhaustion from idle transactions
Related posts
PostgreSQL Read Replica Conflicts: Why Your Queries Get Canceled
Queries on read replicas fail with 'canceling statement due to conflict with recovery'. The fix depends on which of the 5 conflict types you have - here's how to diagnose and solve each one.
Double Charges From Idempotency Keys: The Replica Lag Trap
Perfect idempotency logic, but customers still get charged twice. The cause: checking idempotency keys against a read replica that's seconds behind the primary during traffic spikes.
PostgreSQL OOM by Design: work_mem × Parallel Workers × Plan Nodes
work_mem looks small at 256MB, but a parallel hash join with 4 workers across 3 plan nodes uses 3GB. Here's how to prevent PostgreSQL from legitimately OOMing your container.
The Index That Killed Write Performance: Losing PostgreSQL HOT Updates
Adding an index for performance made writes 10x slower. The counter-intuitive cause: the new index broke HOT updates, turning cheap in-place updates into full-row rewrites with massive bloat.
Cite this article
If you reference this post, please link to the original URL and credit the author.