PostgreSQL Checkpoint Spikes: Why p99 Explodes Every N Minutes
This is a classic “mystery graph”:
- CPU looks stable
- throughput looks steady
- but tail latency (especially p99) has periodic spikes — every 5 or 15 minutes
- storage latency rises during the spikes
Very often the root cause is: checkpoints.
The point is not that “checkpoints are bad”. It is that a checkpoint can turn dirty-page flushing into a short IO burst when your configuration and storage throughput don’t match your write rate.
Goal of this post: a methodology to reproduce, measure, and tune checkpoints until they stop being a p99 killer.
Tested on: PostgreSQL 13–16, both local NVMe and network-attached disks (cloud). Examples use Linux tools.
What a checkpoint does (only what matters for performance)
Operationally, a checkpoint means Postgres must ensure that “a certain point in time is safely on disk”.
In practice that is:
- writing many dirty buffers,
- and syncing them (fsync),
- which can become a burst that competes with normal query IO.
If your storage queue fills up, even read-only queries can slow down because they wait behind checkpoint writes.
Method: measure first, tune second
Minimum signals you want
- PostgreSQL checkpoint/bgwriter stats
- WAL rate (how fast you generate it)
- OS disk latency/queue (e.g. iostat)
- Workload latency (pgbench or your service SLI)
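One way to get the WAL rate is to sample pg_current_wal_lsn() twice and divide the byte difference by the elapsed time. The sketch below assumes you have already collected the two LSN strings yourself (for example with psql). The function names are illustrative and not part of any PostgreSQL client library.

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as '16/B374D848' to an absolute byte position."""
    high, low = lsn.split("/")
    # An LSN is a 64-bit position printed as two hex halves: high 32 bits / low 32 bits.
    return (int(high, 16) << 32) | int(low, 16)


def wal_rate_bytes_per_sec(lsn_start: str, lsn_end: str, elapsed_sec: float) -> float:
    """Average WAL generation rate between two LSN samples."""
    if elapsed_sec <= 0:
        raise ValueError("elapsed_sec must be positive")
    return (lsn_to_bytes(lsn_end) - lsn_to_bytes(lsn_start)) / elapsed_sec


if __name__ == "__main__":
    # Hypothetical samples taken 60 seconds apart.
    rate = wal_rate_bytes_per_sec("0/3000000", "0/7000000", 60.0)
    print(f"{rate / (1024 * 1024):.2f} MiB/s")  # prints 1.07 MiB/s
```

Sample over a window that covers at least one full checkpoint cycle. WAL volume is usually higher right after a checkpoint because of full-page writes, so very short windows can mislead.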
SQL: checkpoint and bgwriter statistics
Start with pg_stat_bgwriter:
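SELECT
    checkpoints_timed,       -- checkpoints triggered by checkpoint_timeout
    checkpoints_req,         -- requested checkpoints (WAL limit reached, manual CHECKPOINT, ...)
    checkpoint_write_time,   -- cumulative ms spent writing checkpoint buffers
    checkpoint_sync_time,    -- cumulative ms spent syncing files to disk
    buffers_checkpoint,      -- buffers written during checkpoints
    buffers_clean,           -- buffers written by the background writer
    maxwritten_clean,        -- times the bgwriter stopped a scan after writing too many buffers
    buffers_backend,         -- buffers written directly by backends
    buffers_backend_fsync    -- times a backend had to run its own fsync
FROM pg_stat_bgwriter;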
Practical interpretation:
- checkpoints_timed vs checkpoints_req: if checkpoints_req grows fast, you’re often doing forced checkpoints (WAL fills up before the timeout).
- checkpoint_write_time and checkpoint_sync_time: spikes here often correlate with p99 spikes.
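These columns are cumulative counters, so the useful number is the difference between two samples. Here is a minimal sketch that assumes each sample is a plain dict keyed by the pg_stat_bgwriter column names. How you fetch the rows is up to you, and the function name is made up for illustration.

```python
def checkpoint_delta(before: dict, after: dict) -> dict:
    """Summarize checkpoint activity between two pg_stat_bgwriter snapshots."""
    timed = after["checkpoints_timed"] - before["checkpoints_timed"]
    req = after["checkpoints_req"] - before["checkpoints_req"]
    total = timed + req
    write_ms = after["checkpoint_write_time"] - before["checkpoint_write_time"]
    sync_ms = after["checkpoint_sync_time"] - before["checkpoint_sync_time"]
    return {
        "checkpoints": total,
        # Share of checkpoints that were requested rather than timed.
        "forced_ratio": req / total if total else 0.0,
        # Average time per checkpoint in the sampled window.
        "avg_write_ms": write_ms / total if total else 0.0,
        "avg_sync_ms": sync_ms / total if total else 0.0,
    }


if __name__ == "__main__":
    # Hypothetical snapshots taken a few minutes apart.
    before = {"checkpoints_timed": 100, "checkpoints_req": 20,
              "checkpoint_write_time": 500000, "checkpoint_sync_time": 8000}
    after = {"checkpoints_timed": 101, "checkpoints_req": 23,
             "checkpoint_write_time": 620000, "checkpoint_sync_time": 12000}
    print(checkpoint_delta(before, after))
```

A forced_ratio close to 1.0 over a busy window suggests WAL volume, not the timeout, is driving checkpoints.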
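If you’re on PostgreSQL 16+, pg_stat_io can add more detail, but you can do a solid diagnosis without it. Note that from PostgreSQL 17 on, the checkpoint counters live in pg_stat_checkpointer instead of pg_stat_bgwriter, so the query above applies to the tested range (13–16).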
OS: disk queue and latency
On the DB node:
iostat -xz 1
Watch for:
- sustained %util near 100%
- await spikes during checkpoint windows
- queue indicators (platform-dependent)
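If you capture iostat output to a file, a small parser makes it easier to flag the bad seconds. This sketch assumes whitespace-separated columns with a header row, and that the device row has the same number of fields as the header. Column names differ between sysstat versions (w_await vs await), and the thresholds are placeholders to replace with your own.

```python
def parse_iostat_row(header: str, row: str) -> dict:
    """Map one `iostat -x` device row onto its header columns."""
    names = header.split()
    values = row.split()
    if len(names) != len(values):
        raise ValueError("header and row have different column counts")
    # The first column is the device name; the rest are numeric.
    stats = {name: float(value) for name, value in zip(names[1:], values[1:])}
    stats["device"] = values[0]
    return stats


def looks_saturated(stats: dict, util_pct: float = 90.0, await_ms: float = 20.0) -> bool:
    """Flag a sample where both utilization and write latency are high."""
    # Newer sysstat reports w_await; older versions only have a combined await.
    write_await = stats.get("w_await", stats.get("await", 0.0))
    return stats.get("%util", 0.0) >= util_pct and write_await >= await_ms


if __name__ == "__main__":
    # Trimmed, made-up sample; real output has more columns.
    header = "Device r/s w/s r_await w_await aqu-sz %util"
    row = "nvme0n1 120.0 950.0 0.45 38.20 36.1 99.6"
    print(looks_saturated(parse_iostat_row(header, row)))
```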
A reproducible lab (on purpose)
Do this on a test DB, not production.
1) Generate a steady workload
pgbench -i -s 50 mydb
pgbench -c 32 -j 32 -T 300 -P 1 mydb
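-P 1 prints a progress report every second (tps and average latency) so you can align it with checkpoint stats. For real percentiles such as p99, add -l to log per-transaction latencies.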
2) Make checkpoints painful (lab-only)
The idea is to create conditions that force frequent checkpoints (for example: low max_wal_size or short checkpoint_timeout) and observe:
- p99 spikes line up with checkpoint write/sync time
- storage latency spikes at the same time
Avoid blindly copying “recommended values”. The point is to learn the shape of the problem with your storage.
3) Correlate p99 with checkpoint signals
During the test:
- log pgbench latency
- sample pg_stat_bgwriter
- watch iostat
If spikes line up with checkpoint_write_time/checkpoint_sync_time and disk latency, you’ve found the culprit.
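The correlation itself can be checked mechanically. The sketch below parses pgbench progress lines and computes how many latency spikes fall inside checkpoint windows. It assumes progress lines of the form "progress: 1.0 s, 2100.5 tps, lat 15.200 ms ...", and that you have derived the checkpoint windows yourself (for example from the server log with log_checkpoints enabled) as seconds since the pgbench start. Keep in mind that -P reports per-interval averages, not p99.

```python
import re

# Matches the elapsed time, tps, and average latency in a pgbench -P line.
PROGRESS_RE = re.compile(r"progress: ([\d.]+) s, ([\d.]+) tps, lat ([\d.]+) ms")


def parse_progress(lines):
    """Yield (elapsed_sec, avg_latency_ms) from pgbench progress lines."""
    for line in lines:
        match = PROGRESS_RE.search(line)
        if match:
            yield float(match.group(1)), float(match.group(3))


def spike_overlap(samples, windows, threshold_ms):
    """Fraction of latency spikes that fall inside any checkpoint window.

    samples: iterable of (elapsed_sec, latency_ms)
    windows: list of (start_sec, end_sec) on the same clock as samples
    """
    spikes = [t for t, lat in samples if lat >= threshold_ms]
    if not spikes:
        return 0.0
    inside = sum(
        1 for t in spikes if any(start <= t <= end for start, end in windows)
    )
    return inside / len(spikes)


if __name__ == "__main__":
    # Made-up progress lines and one hypothetical checkpoint window.
    log = [
        "progress: 1.0 s, 2100.5 tps, lat 15.200 ms stddev 4.100",
        "progress: 2.0 s, 900.0 tps, lat 85.000 ms stddev 40.000",
        "progress: 3.0 s, 2000.0 tps, lat 16.000 ms stddev 5.000",
    ]
    print(spike_overlap(parse_progress(log), [(1.5, 2.5)], threshold_ms=50.0))
```

A value near 1.0 means most spikes coincide with checkpoints. A low value points to another periodic cause.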
The checkpoint budget: a reality check
To stop checkpoint bursts, you must align:
- your write rate / WAL rate
- with your storage throughput
If WAL is generated quickly and max_wal_size is small, checkpoints will be frequent and often forced.
The tuning goal is not “the fewest checkpoints”. It’s:
- predictable, spread out checkpoint work
- and storage latency that stays within your p99 budget
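A quick back-of-the-envelope check shows whether the WAL limit or the timeout will trigger checkpoints first. This is a rough sketch with illustrative function names. It treats max_wal_size as if the whole amount were available between checkpoints, so the result is an upper bound: Postgres treats max_wal_size as a soft limit and starts the checkpoint before all of it is consumed.

```python
def seconds_until_wal_limit(max_wal_size_bytes: float, wal_rate_bytes_per_sec: float) -> float:
    """Upper-bound estimate of the time for WAL to reach max_wal_size."""
    if wal_rate_bytes_per_sec <= 0:
        return float("inf")
    return max_wal_size_bytes / wal_rate_bytes_per_sec


def likely_forced(max_wal_size_bytes: float, wal_rate_bytes_per_sec: float,
                  checkpoint_timeout_sec: float) -> bool:
    """True if WAL would hit the size limit before the timeout fires."""
    fill_time = seconds_until_wal_limit(max_wal_size_bytes, wal_rate_bytes_per_sec)
    return fill_time < checkpoint_timeout_sec


if __name__ == "__main__":
    GIB = 1024 ** 3
    MIB = 1024 ** 2
    # Defaults (max_wal_size = 1GB, checkpoint_timeout = 5min) with a
    # hypothetical WAL rate of 8 MiB/s.
    print(seconds_until_wal_limit(1 * GIB, 8 * MIB))  # prints 128.0
    print(likely_forced(1 * GIB, 8 * MIB, 300))       # prints True
```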
Tuning: what to try (and how to verify)
1) Reduce forced checkpoints via WAL sizing
If checkpoints_req dominates, you’re likely hitting the WAL size limit before the timeout.
Direction:
- increase max_wal_size (within disk constraints)
Verify:
- checkpoints_req slows down relative to checkpoints_timed
- disk latency spikes become smaller or less frequent
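For a starting value, you can size max_wal_size from the measured WAL rate so that the timeout normally fires first. The sketch below is a heuristic, not an official formula. The headroom factor of 2.0 is an assumption meant to cover WAL bursts and the fact that Postgres starts a checkpoint before the full limit is reached. Verify the result with the counters above.

```python
import math


def suggested_max_wal_size_gib(wal_rate_bytes_per_sec: float,
                               checkpoint_timeout_sec: float,
                               headroom: float = 2.0) -> int:
    """Heuristic max_wal_size (whole GiB) so timed checkpoints come first."""
    # WAL generated during one timeout interval, scaled by the safety factor.
    needed_bytes = wal_rate_bytes_per_sec * checkpoint_timeout_sec * headroom
    return max(1, math.ceil(needed_bytes / 1024 ** 3))


if __name__ == "__main__":
    # Hypothetical: 8 MiB/s of WAL with a 5-minute checkpoint_timeout.
    print(suggested_max_wal_size_gib(8 * 1024 ** 2, 300))  # prints 5
```

Make sure the WAL volume has room for the larger limit, and remember that a bigger max_wal_size generally means longer crash recovery.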
2) Spread checkpoint IO over time
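checkpoint_completion_target sets the fraction of the checkpoint interval over which Postgres spreads the checkpoint writes. The default is 0.9 since PostgreSQL 14 (0.5 before), so on 13 it is worth checking that it has been raised.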
Verify:
- fewer short IO bursts
- smoother await in iostat
- reduced p99 spikes
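To see what spreading buys you, estimate the average write rate a checkpoint needs when its buffers are paced over interval × checkpoint_completion_target. This is a simplified sketch: it assumes the default 8 kB block size and even pacing, and takes the buffer count from the per-checkpoint delta of buffers_checkpoint. The function name is illustrative.

```python
BLOCK_SIZE = 8192  # default PostgreSQL block size in bytes (assumed here)


def paced_write_rate_mib_per_sec(dirty_buffers: int,
                                 checkpoint_interval_sec: float,
                                 completion_target: float) -> float:
    """Average write rate if checkpoint writes are spread evenly."""
    window_sec = checkpoint_interval_sec * completion_target
    if window_sec <= 0:
        raise ValueError("interval and completion target must be positive")
    return dirty_buffers * BLOCK_SIZE / window_sec / (1024 * 1024)


if __name__ == "__main__":
    # Hypothetical: 262144 buffers (2 GiB) per checkpoint, 5-minute interval.
    for target in (0.5, 0.9):
        rate = paced_write_rate_mib_per_sec(262144, 300, target)
        print(f"target={target}: {rate:.1f} MiB/s")
```

Compare the result with what your storage can sustain on top of normal query IO. If the paced rate alone is close to the disk limit, spreading will not remove the spikes.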
3) Storage is sometimes the real limit
Cloud disks often have burst behavior and then throttling.
If your spikes align with storage throttling, DB tuning can only do so much — you may need:
- a higher disk tier
- different disk layout
- or architectural changes (write shaping, batching, buffering)
Common traps
“CPU is fine, so it’s not Postgres”
Checkpoint spikes are primarily IO-driven. CPU can look perfect while latency collapses.
“Just increase checkpoint_timeout”
It can help, but if you’re constrained by max_wal_size, checkpoints will still be forced.
“We tuned the queries, but the spikes remain”
If the disk queue is saturated, query tuning alone doesn’t help. You must fix the IO contention.
What I’d do in production
- Build correlations: p99 spikes ↔ checkpoint stats ↔ disk latency
- Check checkpoints_timed vs checkpoints_req
- If forced checkpoints dominate, address WAL sizing and storage limits
- Define a checkpoint budget (IO stability, predictable checkpoints, alerts)
- Change one thing at a time and verify with metrics
FAQ
How do I know if checkpoints are forced?
If checkpoints_req grows quickly compared to checkpoints_timed, you’re often forcing checkpoints.
Why does a checkpoint slow down read-only queries?
Because reads also wait on the disk. If checkpoint writes saturate storage, reads queue behind them.
Is increasing max_wal_size always the answer?
Not always. It reduces checkpoint frequency, but if storage can’t sustain the spread-out flushing either, you still need better IO capacity.
Can archiving or replication change the behavior?
Yes. If your WAL pipeline is constrained, the system dynamics change. Measure WAL rate and replication/archiving lag too.
Related reading
- /en/blog/postgresql-wal-forensics/ (WAL tooling and what it reveals)
- /en/blog/logical-replication-slot-wal-retention/ (WAL retention pressure)
- /en/blog/postgresql-autovacuum-slo/ (another periodic performance killer)