Prometheus WAL Replay Hell: Slow Restarts and Missing Alerts
The symptom pattern is a monitoring nightmare:
- Prometheus pod is Running but not Ready for a long time.
- Alerts go quiet (not because everything is fine, but because evaluation isn’t happening).
- A restart (node drain, upgrade, eviction) turns into a 30–90 minute blind spot.
- Logs show something like “replaying WAL”.
When this happens, it’s usually not CPU, not networking, not “Prometheus is broken”.
It’s TSDB startup work: replaying the WAL to rebuild the in-memory head.
Tested on: Prometheus 2.48–2.53, Kubernetes 1.29–1.31, PV-backed storage (fast SSD and slower network disks), high-churn clusters with remote_write.
Incident narrative (anonymized)
We had a planned node rotation. Prometheus got evicted and rescheduled. It started quickly… and then stayed NotReady.
The logs showed WAL replay progressing slowly. Disk IO was pegged. After 20 minutes the container was OOMKilled and restarted, repeating the loop.
Root cause wasn’t a Prometheus bug. It was a combination of:
- a larger-than-normal WAL (ingestion spike + series churn)
- a slow PV (high IO latency)
- too-tight memory limits for head rebuild during replay
Constraint: we needed monitoring back fast, but we didn’t want to “fix” it by deleting data blindly and losing the only evidence we had.
Timeline
- T+0: Prometheus restarts during node rotation.
- T+5m: pod Running but NotReady; alert evaluations missing.
- T+10m: logs clearly show WAL replay dominates startup.
- T+20m: WAL directory is huge; disk is saturated; memory climbs.
- T+30m: mitigation: temporarily increase memory and reduce ingestion pressure.
- T+60m: Prometheus becomes Ready; alerting recovers.
- T+1d: prevention: bound churn, alert on WAL size and restart duration, and treat “time-to-ready” as an SLO.
Mechanism: why WAL replay dominates startup
The Prometheus TSDB appends incoming samples to the write-ahead log (WAL) so the in-memory head can be rebuilt after a restart. On startup, it replays WAL segments to rebuild:
- active series metadata
- in-memory chunks (“head”)
This process gets slow or unstable when:
- WAL is huge (many segments)
- disk is slow or saturated
- churn is high (many new series)
- memory limits are too low (replay allocates aggressively)
The dangerous loop is:
- restart
- replay starts, memory climbs
- OOMKilled mid-replay
- restart again
- WAL is still huge, repeat
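The loop above can be confirmed from pod status instead of by watching restarts scroll past. A minimal sketch, assuming you feed it the parsed JSON from `kubectl get pod <prometheus-pod> -o json`. The function name, the default container name `prometheus`, and the two-restart threshold are illustrative choices, not anything Kubernetes or Prometheus defines.

```python
from typing import Any


def oom_loop_status(pod: dict, container: str = "prometheus") -> dict:
    """Summarise restart state for one container from `kubectl get pod -o json` output."""
    for cs in pod.get("status", {}).get("containerStatuses", []):
        if cs.get("name") != container:
            continue
        # lastState.terminated is only present after at least one restart.
        last: Any = cs.get("lastState", {}).get("terminated") or {}
        restarts = cs.get("restartCount", 0)
        ready = cs.get("ready", False)
        oom = last.get("reason") == "OOMKilled"
        return {
            "restarts": restarts,
            "last_reason": last.get("reason"),
            "ready": ready,
            # Assumption: 2+ restarts, last one OOMKilled, still not Ready = loop.
            "oom_loop": oom and restarts >= 2 and not ready,
        }
    raise KeyError(f"container {container!r} not found in pod status")
```

If `oom_loop` is true, more waiting will not help: the next replay starts from the same WAL with the same memory limit.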
Runbook: diagnose WAL replay and recover safely
1) Confirm it’s WAL replay (don’t guess)
# If the pod is crash-looping, add --previous to read the container that just died.
kubectl -n monitoring logs pod/<prometheus-pod> --since=60m | \
  rg -n "WAL|replay|TSDB|corrupt|repair|head" | tail -n 200
You’re looking for:
- “replaying WAL”
- progress that barely moves
- OOM / panic / corruption messages
2) Measure how big the WAL really is
kubectl -n monitoring exec -it <prometheus-pod> -- sh -lc \
"du -sh /prometheus/wal /prometheus 2>/dev/null || true; ls -1 /prometheus/wal 2>/dev/null | wc -l || true"
Two signals matter operationally:
- WAL size (GiB)
- segment count (many files means slow scans and more work)
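If you want both signals as numbers you can trend or compare between incidents, here is a small sketch that computes them from a WAL directory. It assumes the usual on-disk layout, where segments are top-level files with all-digit names and checkpoints live in `checkpoint.*` subdirectories. The function name `wal_stats` is ours.

```python
import os


def wal_stats(wal_dir: str) -> dict:
    """Return total bytes and segment count for a Prometheus WAL directory."""
    total_bytes = 0
    segments = 0
    for root, _dirs, files in os.walk(wal_dir):
        for name in files:
            total_bytes += os.path.getsize(os.path.join(root, name))
            # Only top-level, all-digit file names are counted as segments;
            # files under checkpoint.* contribute to size but not to the count.
            if root == wal_dir and name.isdigit():
                segments += 1
    return {"bytes": total_bytes, "gib": total_bytes / 2**30, "segments": segments}
```

Run it against a copy or a mounted PV, for example `wal_stats("/prometheus/wal")`. Unlike `ls | wc -l`, it does not count the checkpoint directory as a segment.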
3) Check if you’re IO-bound or memory-bound
Practical heuristics:
- IO-bound: replay is slow, CPU is not high, node/PV IO latency is high
- memory-bound: replay starts, RSS climbs, then OOMKilled
If you can access the node:
iostat -x 1 10
If you can’t, infer from the PV type and from repeated OOM cycles.
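The heuristics above fit in a few lines, which helps when the person on call is not the person who wrote this runbook. This is a sketch only: the thresholds (70% CPU, 20 ms await) are assumed starting points, not measured limits. `io_await_ms` is meant to come from the `r_await`/`w_await` columns of `iostat -x`.

```python
def classify_replay_bottleneck(cpu_util: float, io_await_ms: float, oom_killed: bool,
                               cpu_busy: float = 0.7, await_slow_ms: float = 20.0) -> str:
    """Rough triage of a slow WAL replay.

    cpu_util    -- container CPU utilisation as a fraction (0.0 to 1.0)
    io_await_ms -- average IO wait per request on the TSDB volume
    oom_killed  -- whether the last container termination was OOMKilled
    """
    # An OOM kill wins: replay cannot finish regardless of disk speed.
    if oom_killed:
        return "memory-bound"
    # Slow disk with idle CPU: replay is waiting on reads.
    if io_await_ms >= await_slow_ms and cpu_util < cpu_busy:
        return "io-bound"
    return "inconclusive"
```

For example, `classify_replay_bottleneck(0.2, 80.0, False)` returns `"io-bound"`. Treat `"inconclusive"` as a prompt to gather more data, not as an all-clear.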
4) Find why WAL grew (the real root cause)
Most “huge WAL” cases are driven by one of:
- high churn metrics (new series per minute)
- cardinality explosions (label blowups)
- remote_write backpressure and ingestion amplification
- slow storage that prevents truncation/compaction from keeping up
This is where label and cardinality budgets pay off: the WAL is the tax bill for every series you create.
Safe mitigations (pick the least invasive first)
1) Temporarily increase memory limits
WAL replay needs headroom above steady-state usage. If you size only for steady state, replays fail.
Representative bump:
resources:
  requests:
    memory: "4Gi"
  limits:
    memory: "6Gi"
2) Temporarily reduce ingestion volume
If the WAL is huge and replay is slow, you may need to reduce load so it can catch up:
- increase scrape interval for the noisiest jobs
- temporarily disable the worst offenders
- drop low-value metrics via relabeling (especially high-churn labels)
3) Move TSDB to faster storage (planned fix)
If replay is IO-bound, this is a storage problem. Faster disk is the difference between minutes and hours.
4) Only as a last resort: delete WAL
Deleting the WAL can make Prometheus start quickly, but it is data loss by design: every sample not yet persisted into a block is gone.
If you must do it:
- take a snapshot of the TSDB directory first (if feasible)
- document what you deleted and why (for incident review)
What we changed (concrete)
1) We turned “time-to-ready” into an SLO
We wrote down a contract:
- after restart, Prometheus must become Ready within N minutes
If it violates the budget, it’s a production incident (monitoring is a production dependency).
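To hold yourself to that contract you have to measure it. Below is a minimal sketch that polls the readiness endpoint (`/-/ready`) after a controlled restart and reports how long it took. The URL, the polling interval, and the function names are assumptions for illustration. The clock and sleep functions are injectable so the logic can be tested without waiting.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, Optional


def http_ready(url: str = "http://localhost:9090/-/ready", timeout: float = 2.0) -> bool:
    """True when the readiness endpoint answers 200; any error counts as not ready."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def time_to_ready(probe: Callable[[], bool], budget_s: float, interval_s: float = 5.0,
                  clock: Callable[[], float] = time.monotonic,
                  sleep: Callable[[float], None] = time.sleep) -> Optional[float]:
    """Seconds until probe() first returns True, or None if the budget is exceeded."""
    start = clock()
    while True:
        if probe():
            return clock() - start
        if clock() - start >= budget_s:
            return None
        sleep(interval_s)
```

Usage after a controlled restart: `time_to_ready(http_ready, budget_s=N * 60)`, with N taken from your own contract. A `None` result is the SLO violation.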
2) We reduced churn at the source
Label budgets and churn control reduced:
- WAL growth rate
- replay memory footprint
- startup duration variance
3) We sized headroom for replay peaks
We increased memory headroom and avoided “tight limits” on the single instance that must recover fast.
4) We added alerts before we become blind
Guardrails we actually use:
- “Prometheus NotReady for more than X minutes”
- WAL size above a threshold
- new series per minute above a threshold
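The three guardrails above reduce to threshold checks. The sketch below shows that logic only. The thresholds are placeholders, and the metric names in the comments are from memory of Prometheus 2.x and should be checked against your own `/metrics` output. The NotReady signal has to come from outside the instance being watched, because a Prometheus that is still replaying is not evaluating its own rules.

```python
from dataclasses import dataclass


@dataclass
class Guardrails:
    max_not_ready_minutes: float
    max_wal_gib: float
    max_new_series_per_minute: float


def breached(g: Guardrails, not_ready_minutes: float, wal_gib: float,
             new_series_per_minute: float) -> list:
    """Return the names of the guardrails that are currently violated."""
    out = []
    # Source (assumed): pod readiness as seen by Kubernetes or a second Prometheus.
    if not_ready_minutes > g.max_not_ready_minutes:
        out.append("not_ready")
    # Source (assumed): prometheus_tsdb_wal_storage_size_bytes / 2**30
    if wal_gib > g.max_wal_gib:
        out.append("wal_size")
    # Source (assumed): rate(prometheus_tsdb_head_series_created_total[5m]) * 60
    if new_series_per_minute > g.max_new_series_per_minute:
        out.append("series_churn")
    return out
```

For example, with `Guardrails(10, 8, 5000)`, a 9 GiB WAL and 6000 new series per minute returns `["wal_size", "series_churn"]`.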
How to verify
- Prometheus becomes Ready within the budget after a controlled restart.
- WAL size stays bounded under normal load.
- During node drains, alerting does not go blind for long windows.
Related reading
- Prometheus remote_write backpressure: when monitoring fills the disk (and still loses data)
- Prometheus Cardinality Explosion: Detection, Prevention, and Recovery
- Cardinality Contracts: Prometheus Labels as an API with Budgets
- Dash Contracts in Go: CI Compiler for Grafana Dashboards and Prometheus Alerts
- Linux Page Cache Thrashing in Containers: When Free Memory Isn’t Free
- Structured Logging Performance: When Your Logger Becomes the Bottleneck