PostgreSQL OOM by Design: work_mem × Parallel Workers × Plan Nodes
Parallel queries are fun until they OOM the box. “PostgreSQL OOMKilled but work_mem is only 256MB!” The incident started with a support ticket: “the reporting dashboard is down.” PostgreSQL pods were restarting. Kubernetes showed OOMKilled as the reason. But our configuration looked conservative—work_mem was 256MB, shared_buffers was 1GB, the container had 4GB total. Where was the memory going?
The answer took hours to find because we were thinking about work_mem wrong. We assumed it was a global limit—the total memory PostgreSQL would use for sorting and hashing. It’s not. It’s a per-operation, per-worker, per-plan-node allocation. A single query with hash joins, aggregations, and sorts, running with 4 parallel workers, can legitimately consume 256MB × 4 workers × 3 operations = 3GB. Add shared_buffers and you exceed the container’s memory limit. PostgreSQL is doing exactly what it’s configured to do.
This is one of those cases where the documentation is technically correct but practically misleading. The docs say work_mem is the memory used for “internal sort operations and hash tables.” What they don’t emphasize is that each operation in a query plan gets its own work_mem allocation, and parallel query multiplies that by the number of workers. It’s memory amplification that only manifests with complex analytical queries.
Environment: PostgreSQL 15, Kubernetes with 4GB memory limit, analytical queries on 100M+ row tables
The Problem
The Deceptive Configuration
-- Looks reasonable, right?
SHOW work_mem;
-- 256MB
-- Container has 4GB memory
-- PostgreSQL shared_buffers = 1GB
-- Remaining ~3GB should be plenty for work_mem... right?
-- Then this query runs:
EXPLAIN ANALYZE
SELECT customer_id, SUM(amount)
FROM orders
WHERE order_date > '2024-01-01'
GROUP BY customer_id
ORDER BY SUM(amount) DESC;
-- Container: OOMKilled
-- PostgreSQL: terminated
-- Application: 500 errors everywhere
Why 256MB Became 3GB
Query plan breakdown:
┌─────────────────────────────────────────────────┐
│ Gather Merge (parallel coordinator) │
│ Workers Planned: 4 │
│ ├─ Sort (work_mem used here) │
│ │ └─ Parallel Hash Aggregate │
│ │ (work_mem used here) │
│ │ └─ Parallel Seq Scan │
│ │ (work_mem for bitmap here) │
└─────────────────────────────────────────────────┘
Memory calculation:
work_mem per operation: 256 MB
Plan nodes using work_mem: × 3 (sort + hash agg + scan bitmap)
Parallel workers: × 4
Plus coordinator process: × 1.25
Total possible memory:
256 MB × 3 × 4 × 1.25 = 3.84 GB
Container limit: 4 GB
Shared buffers: 1 GB
Available: 3 GB
Required: 3.84 GB
Result: OOM!
Root Cause
How work_mem Actually Works
-- Common misconception:
-- "work_mem is the total memory PostgreSQL uses for sorting/hashing"
-- Reality:
-- work_mem is per-operation, per-worker, per-plan-node
-- A single query can use work_mem MANY times:
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT ...
FROM table1
JOIN table2 ON ... -- Hash join: work_mem
JOIN table3 ON ... -- Hash join: work_mem
WHERE complex_condition -- Bitmap scan: work_mem
GROUP BY ... -- HashAggregate: work_mem
ORDER BY ... -- Sort: work_mem
LIMIT 100;
-- Each of these operations gets its own work_mem allocation
-- And with parallel query, multiply by worker count!
The Parallel Query Multiplier
Without parallelism:
┌────────────────────────────────┐
│ Sort (256MB) │
│ └─ HashAggregate (256MB) │
│ └─ Seq Scan │
└────────────────────────────────┘
Total: ~512MB worst case
With parallel_workers = 4:
┌────────────────────────────────────────────────────┐
│ Gather Merge │
│ ├─ Worker 0: Sort (256MB) + HashAgg (256MB) │
│ ├─ Worker 1: Sort (256MB) + HashAgg (256MB) │
│ ├─ Worker 2: Sort (256MB) + HashAgg (256MB) │
│ └─ Worker 3: Sort (256MB) + HashAgg (256MB) │
│ └─ Leader: Sort (256MB) + HashAgg (256MB) │
└────────────────────────────────────────────────────┘
Total: ~2.5GB worst case (and that's just 2 operations!)
Diagnosis
Step 1: Check Actual Memory Usage
-- See memory usage in query plans
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT ...;
-- Look for:
-- "Sort Method: external merge Disk: 1234kB" -- Spilled to disk
-- "Sort Method: quicksort Memory: 25kB" -- Fits in memory
-- "HashAggregate Batches: 1 Memory Usage: 2457kB"
-- Batches > 1 means it spilled to disk (good for memory, bad for speed)
Step 2: Monitor Per-Query Memory
-- Enable memory context tracking (PostgreSQL 14+)
SET log_statement_stats = on;
SET log_parser_stats = on;
SET log_planner_stats = on;
SET log_executor_stats = on;
-- Check pg_stat_activity for memory info
SELECT
pid,
query,
pg_size_pretty(pg_backend_memory_contexts.total_bytes)
FROM pg_stat_activity
JOIN pg_backend_memory_contexts ON pg_stat_activity.pid = pg_backend_memory_contexts.pid
WHERE state = 'active';
Step 3: Check Parallel Query Settings
-- Current parallel settings
SHOW max_parallel_workers_per_gather; -- e.g., 4
SHOW max_parallel_workers; -- e.g., 8
SHOW max_worker_processes; -- e.g., 8
SHOW parallel_tuple_cost;
SHOW parallel_setup_cost;
SHOW min_parallel_table_scan_size;
-- Calculate worst case:
-- work_mem × max_parallel_workers_per_gather × (estimated plan nodes)
The Fix
Option 1: Limit Parallel Workers
-- Globally reduce parallel workers
ALTER SYSTEM SET max_parallel_workers_per_gather = 2;
SELECT pg_reload_conf();
-- Or per-query for problematic queries
SET max_parallel_workers_per_gather = 0; -- Disable parallelism
SELECT ...;
RESET max_parallel_workers_per_gather;
Option 2: Reduce work_mem
-- Calculate safe work_mem:
-- (Available memory) / (max workers × estimated plan nodes)
-- (3GB) / (4 × 4) = ~192MB
ALTER SYSTEM SET work_mem = '128MB';
SELECT pg_reload_conf();
-- Trade-off: more disk spilling, slower queries
-- But: won't OOM!
Option 3: Per-Query work_mem Control
-- For specific expensive queries, reduce work_mem
BEGIN;
SET LOCAL work_mem = '64MB';
SET LOCAL max_parallel_workers_per_gather = 1;
-- Run the expensive query
SELECT ... ;
COMMIT;
-- Settings automatically reset
Option 4: Use hash_mem_multiplier
-- PostgreSQL 13+: control hash operation memory separately
SET hash_mem_multiplier = 1.0; -- Default is 2.0
-- This limits HashAggregate and Hash Join specifically
-- work_mem × hash_mem_multiplier = max for hash operations
Option 5: Application-Level Control
// In your application, wrap analytical queries
func runAnalyticalQuery(ctx context.Context, db *sql.DB, query string) error {
tx, err := db.BeginTx(ctx, nil)
if err != nil {
return err
}
defer tx.Rollback()
// Reduce memory for this session
_, err = tx.ExecContext(ctx, "SET LOCAL work_mem = '64MB'")
if err != nil {
return err
}
_, err = tx.ExecContext(ctx, "SET LOCAL max_parallel_workers_per_gather = 2")
if err != nil {
return err
}
// Now run the query safely
rows, err := tx.QueryContext(ctx, query)
// ...
return tx.Commit()
}
Monitoring
Prometheus Alerts
groups:
- name: postgresql-memory
rules:
- alert: PostgreSQLHighMemoryQuery
expr: |
pg_stat_activity_backend_memory_bytes > 500000000
for: 5m
labels:
severity: warning
annotations:
summary: "PostgreSQL query using >500MB on {{ $labels.instance }}"
- alert: ContainerNearOOM
expr: |
(container_memory_working_set_bytes / container_spec_memory_limit_bytes) > 0.9
for: 2m
labels:
severity: critical
annotations:
summary: "PostgreSQL container near OOM"
Tracking Query Memory
-- Create view to monitor memory-heavy queries
CREATE VIEW high_memory_queries AS
SELECT
pid,
usename,
application_name,
state,
query_start,
now() - query_start AS duration,
wait_event_type,
wait_event,
LEFT(query, 100) AS query_preview
FROM pg_stat_activity
WHERE state = 'active'
AND backend_type = 'client backend'
ORDER BY query_start;
Checklist
## PostgreSQL work_mem OOM
### Symptoms
- [ ] Container OOMKilled during queries
- [ ] work_mem looks small but memory explodes
- [ ] Happens with analytical/reporting queries
- [ ] Parallel queries involved
### Diagnosis
- [ ] Check work_mem setting
- [ ] Check max_parallel_workers_per_gather
- [ ] EXPLAIN ANALYZE problematic queries
- [ ] Calculate: work_mem × workers × plan_nodes
### Fixes
- [ ] Reduce work_mem globally
- [ ] Reduce max_parallel_workers_per_gather
- [ ] Use SET LOCAL for specific queries
- [ ] Implement application-level controls
- [ ] Set hash_mem_multiplier = 1.0
### Prevention
- [ ] Size work_mem conservatively
- [ ] Monitor container memory usage
- [ ] Test analytical queries in staging
- [ ] Document expected memory per query type
Conclusion
This problem challenges common intuition about PostgreSQL configuration. Most settings in PostgreSQL are straightforward—shared_buffers is a fixed allocation, max_connections limits concurrent connections. But work_mem is multiplicative in ways that aren’t obvious until you run an analytical query on a large dataset.
The issue is particularly acute in containerized environments where memory limits are hard. PostgreSQL has no internal governor that prevents it from exceeding your container’s memory. It trusts you to set work_mem appropriately. And “appropriate” depends on the most complex query you might run, times the maximum parallel workers, times the maximum plan nodes in that query. That’s a calculation that most DBAs don’t do.
The fix is to think about work_mem differently. Instead of asking “how much memory should I give sorts and hashes?”, ask “what’s the maximum total memory I can allow for query operations, and how do I divide that across worst-case parallel execution?” For a 4GB container with 1GB shared_buffers and 4 parallel workers, a work_mem of 64-128MB might be more appropriate than 256MB.
Key takeaways:
work_memis per-operation, per-worker, per-plan-node - not a global limit- Parallel queries multiply memory consumption by worker count
- Set conservative defaults and use SET LOCAL for specific analytical queries
- Monitor container memory usage, not just PostgreSQL internal metrics
- The fix isn’t more memory—it’s understanding the multiplication factor
Related Articles
- PostgreSQL HOT Updates Index Trap - Another PostgreSQL gotcha
- Container Memory Limits - Right-sizing containers
Related posts
Redis Memory Fragmentation: When maxmemory Isn't Enough
Your Redis has 4GB maxmemory but RSS shows 6GB. OOM killer strikes. I explain jemalloc fragmentation with reproduction steps and activedefrag tuning.
The Index That Killed Write Performance: Losing PostgreSQL HOT Updates
Adding an index for performance made writes 10x slower. The counter-intuitive cause: the new index broke HOT updates, turning cheap in-place updates into full-row rewrites with massive bloat.
Java OOMKilled With Stable Heap: Native Memory, Direct Buffers, and glibc Arenas
Heap metrics look fine, GC is happy, but the container keeps dying. The culprit: native memory from direct buffers, JNI, and glibc memory allocator fragmentation.
PostgreSQL Read Replica Conflicts: Why Your Queries Get Canceled
Queries on read replicas fail with 'canceling statement due to conflict with recovery'. The fix depends on which of the 5 conflict types you have - here's how to diagnose and solve each one.
Cite this article
If you reference this post, please link to the original URL and credit the author.