Back to blog

PostgreSQL OOM by Design: work_mem × Parallel Workers × Plan Nodes

Parallel queries are fun until they OOM the box. “PostgreSQL OOMKilled but work_mem is only 256MB!” The incident started with a support ticket: “the reporting dashboard is down.” PostgreSQL pods were restarting. Kubernetes showed OOMKilled as the reason. But our configuration looked conservative—work_mem was 256MB, shared_buffers was 1GB, the container had 4GB total. Where was the memory going?

The answer took hours to find because we were thinking about work_mem wrong. We assumed it was a global limit—the total memory PostgreSQL would use for sorting and hashing. It’s not. It’s a per-operation, per-worker, per-plan-node allocation. A single query with hash joins, aggregations, and sorts, running with 4 parallel workers, can legitimately consume 256MB × 4 workers × 3 operations = 3GB. Add shared_buffers and you exceed the container’s memory limit. PostgreSQL is doing exactly what it’s configured to do.

This is one of those cases where the documentation is technically correct but practically misleading. The docs say work_mem is the memory used for “internal sort operations and hash tables.” What they don’t emphasize is that each operation in a query plan gets its own work_mem allocation, and parallel query multiplies that by the number of workers. It’s memory amplification that only manifests with complex analytical queries.

Environment: PostgreSQL 15, Kubernetes with 4GB memory limit, analytical queries on 100M+ row tables

The Problem

The Deceptive Configuration

-- Looks reasonable, right?
SHOW work_mem;
-- 256MB

-- Container has 4GB memory
-- PostgreSQL shared_buffers = 1GB
-- Remaining ~3GB should be plenty for work_mem... right?

-- Then this query runs:
EXPLAIN ANALYZE
SELECT customer_id, SUM(amount)
FROM orders
WHERE order_date > '2024-01-01'
GROUP BY customer_id
ORDER BY SUM(amount) DESC;

-- Container: OOMKilled
-- PostgreSQL: terminated
-- Application: 500 errors everywhere

Why 256MB Became 3GB

Query plan breakdown:

┌─────────────────────────────────────────────────┐
│ Gather Merge (parallel coordinator)             │
│   Workers Planned: 4                            │
│   ├─ Sort (work_mem used here)                  │
│   │    └─ Parallel Hash Aggregate               │
│   │         (work_mem used here)                │
│   │         └─ Parallel Seq Scan                │
│   │              (work_mem for bitmap here)     │
└─────────────────────────────────────────────────┘

Memory calculation:

work_mem per operation:           256 MB
Plan nodes using work_mem:        × 3 (sort + hash agg + scan bitmap)
Parallel workers:                 × 4
Plus coordinator process:         × 1.25

Total possible memory:
256 MB × 3 × 4 × 1.25 = 3.84 GB

Container limit: 4 GB
Shared buffers: 1 GB
Available: 3 GB
Required: 3.84 GB

Result: OOM!

Root Cause

How work_mem Actually Works

-- Common misconception:
-- "work_mem is the total memory PostgreSQL uses for sorting/hashing"

-- Reality:
-- work_mem is per-operation, per-worker, per-plan-node

-- A single query can use work_mem MANY times:

EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT ...
FROM table1
JOIN table2 ON ...      -- Hash join: work_mem
JOIN table3 ON ...      -- Hash join: work_mem
WHERE complex_condition  -- Bitmap scan: work_mem
GROUP BY ...            -- HashAggregate: work_mem
ORDER BY ...            -- Sort: work_mem
LIMIT 100;

-- Each of these operations gets its own work_mem allocation
-- And with parallel query, multiply by worker count!

The Parallel Query Multiplier

Without parallelism:
┌────────────────────────────────┐
│ Sort (256MB)                   │
│  └─ HashAggregate (256MB)      │
│       └─ Seq Scan              │
└────────────────────────────────┘
Total: ~512MB worst case

With parallel_workers = 4:
┌────────────────────────────────────────────────────┐
│ Gather Merge                                       │
│  ├─ Worker 0: Sort (256MB) + HashAgg (256MB)      │
│  ├─ Worker 1: Sort (256MB) + HashAgg (256MB)      │
│  ├─ Worker 2: Sort (256MB) + HashAgg (256MB)      │
│  └─ Worker 3: Sort (256MB) + HashAgg (256MB)      │
│  └─ Leader:   Sort (256MB) + HashAgg (256MB)      │
└────────────────────────────────────────────────────┘
Total: ~2.5GB worst case (and that's just 2 operations!)

Diagnosis

Step 1: Check Actual Memory Usage

-- See memory usage in query plans
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT ...;

-- Look for:
-- "Sort Method: external merge  Disk: 1234kB"  -- Spilled to disk
-- "Sort Method: quicksort  Memory: 25kB"       -- Fits in memory
-- "HashAggregate  Batches: 1  Memory Usage: 2457kB"

-- Batches > 1 means it spilled to disk (good for memory, bad for speed)

Step 2: Monitor Per-Query Memory

-- Enable memory context tracking (PostgreSQL 14+)
SET log_statement_stats = on;
SET log_parser_stats = on;
SET log_planner_stats = on;
SET log_executor_stats = on;

-- Check pg_stat_activity for memory info
SELECT
    pid,
    query,
    pg_size_pretty(pg_backend_memory_contexts.total_bytes)
FROM pg_stat_activity
JOIN pg_backend_memory_contexts ON pg_stat_activity.pid = pg_backend_memory_contexts.pid
WHERE state = 'active';

Step 3: Check Parallel Query Settings

-- Current parallel settings
SHOW max_parallel_workers_per_gather;  -- e.g., 4
SHOW max_parallel_workers;             -- e.g., 8
SHOW max_worker_processes;             -- e.g., 8
SHOW parallel_tuple_cost;
SHOW parallel_setup_cost;
SHOW min_parallel_table_scan_size;

-- Calculate worst case:
-- work_mem × max_parallel_workers_per_gather × (estimated plan nodes)

The Fix

Option 1: Limit Parallel Workers

-- Globally reduce parallel workers
ALTER SYSTEM SET max_parallel_workers_per_gather = 2;
SELECT pg_reload_conf();

-- Or per-query for problematic queries
SET max_parallel_workers_per_gather = 0;  -- Disable parallelism
SELECT ...;
RESET max_parallel_workers_per_gather;

Option 2: Reduce work_mem

-- Calculate safe work_mem:
-- (Available memory) / (max workers × estimated plan nodes)
-- (3GB) / (4 × 4) = ~192MB

ALTER SYSTEM SET work_mem = '128MB';
SELECT pg_reload_conf();

-- Trade-off: more disk spilling, slower queries
-- But: won't OOM!

Option 3: Per-Query work_mem Control

-- For specific expensive queries, reduce work_mem
BEGIN;
SET LOCAL work_mem = '64MB';
SET LOCAL max_parallel_workers_per_gather = 1;

-- Run the expensive query
SELECT ... ;

COMMIT;
-- Settings automatically reset

Option 4: Use hash_mem_multiplier

-- PostgreSQL 13+: control hash operation memory separately
SET hash_mem_multiplier = 1.0;  -- Default is 2.0

-- This limits HashAggregate and Hash Join specifically
-- work_mem × hash_mem_multiplier = max for hash operations

Option 5: Application-Level Control

// In your application, wrap analytical queries
func runAnalyticalQuery(ctx context.Context, db *sql.DB, query string) error {
    tx, err := db.BeginTx(ctx, nil)
    if err != nil {
        return err
    }
    defer tx.Rollback()

    // Reduce memory for this session
    _, err = tx.ExecContext(ctx, "SET LOCAL work_mem = '64MB'")
    if err != nil {
        return err
    }

    _, err = tx.ExecContext(ctx, "SET LOCAL max_parallel_workers_per_gather = 2")
    if err != nil {
        return err
    }

    // Now run the query safely
    rows, err := tx.QueryContext(ctx, query)
    // ...

    return tx.Commit()
}

Monitoring

Prometheus Alerts

groups:
  - name: postgresql-memory
    rules:
      - alert: PostgreSQLHighMemoryQuery
        expr: |
          pg_stat_activity_backend_memory_bytes > 500000000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "PostgreSQL query using >500MB on {{ $labels.instance }}"

      - alert: ContainerNearOOM
        expr: |
          (container_memory_working_set_bytes / container_spec_memory_limit_bytes) > 0.9
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "PostgreSQL container near OOM"

Tracking Query Memory

-- Create view to monitor memory-heavy queries
CREATE VIEW high_memory_queries AS
SELECT
    pid,
    usename,
    application_name,
    state,
    query_start,
    now() - query_start AS duration,
    wait_event_type,
    wait_event,
    LEFT(query, 100) AS query_preview
FROM pg_stat_activity
WHERE state = 'active'
  AND backend_type = 'client backend'
ORDER BY query_start;

Checklist

## PostgreSQL work_mem OOM

### Symptoms
- [ ] Container OOMKilled during queries
- [ ] work_mem looks small but memory explodes
- [ ] Happens with analytical/reporting queries
- [ ] Parallel queries involved

### Diagnosis
- [ ] Check work_mem setting
- [ ] Check max_parallel_workers_per_gather
- [ ] EXPLAIN ANALYZE problematic queries
- [ ] Calculate: work_mem × workers × plan_nodes

### Fixes
- [ ] Reduce work_mem globally
- [ ] Reduce max_parallel_workers_per_gather
- [ ] Use SET LOCAL for specific queries
- [ ] Implement application-level controls
- [ ] Set hash_mem_multiplier = 1.0

### Prevention
- [ ] Size work_mem conservatively
- [ ] Monitor container memory usage
- [ ] Test analytical queries in staging
- [ ] Document expected memory per query type

Conclusion

This problem challenges common intuition about PostgreSQL configuration. Most settings in PostgreSQL are straightforward—shared_buffers is a fixed allocation, max_connections limits concurrent connections. But work_mem is multiplicative in ways that aren’t obvious until you run an analytical query on a large dataset.

The issue is particularly acute in containerized environments where memory limits are hard. PostgreSQL has no internal governor that prevents it from exceeding your container’s memory. It trusts you to set work_mem appropriately. And “appropriate” depends on the most complex query you might run, times the maximum parallel workers, times the maximum plan nodes in that query. That’s a calculation that most DBAs don’t do.

The fix is to think about work_mem differently. Instead of asking “how much memory should I give sorts and hashes?”, ask “what’s the maximum total memory I can allow for query operations, and how do I divide that across worst-case parallel execution?” For a 4GB container with 1GB shared_buffers and 4 parallel workers, a work_mem of 64-128MB might be more appropriate than 256MB.

Key takeaways:

  1. work_mem is per-operation, per-worker, per-plan-node - not a global limit
  2. Parallel queries multiply memory consumption by worker count
  3. Set conservative defaults and use SET LOCAL for specific analytical queries
  4. Monitor container memory usage, not just PostgreSQL internal metrics
  5. The fix isn’t more memory—it’s understanding the multiplication factor

Related posts

Cite this article

If you reference this post, please link to the original URL and credit the author.

Michal Drozd. "PostgreSQL OOM by Design: work_mem × Parallel Workers × Plan Nodes". https://www.michal-drozd.com/en/blog/postgresql-work-mem-parallel-oom/ (Published December 28, 2024).