Prometheus WAL Replay Hell: Slow Restarts and Missing Alerts

The symptom pattern is a monitoring nightmare:

Prometheus pod is Running but not Ready for a long time.
Alerts go quiet (not because everything is fine, but because evaluation isn’t happening).
A restart (node drain, upgrade, eviction) turns into a 30–90 minute blind spot.
Logs show something like “replaying WAL”.

When this happens, it’s usually not CPU, not networking, not “Prometheus is broken”.

It’s TSDB startup work: replaying the WAL to rebuild the in-memory head.

Tested on: Prometheus 2.48–2.53, Kubernetes 1.29–1.31, PV-backed storage (fast SSD and slower network disks), high-churn clusters with remote_write.

Incident narrative (anonymized)

We had a planned node rotation. Prometheus got evicted and rescheduled. It started quickly… and then stayed NotReady.

The logs showed WAL replay progressing slowly, and disk IO was pegged. After 20 minutes the container was OOMKilled and restarted, and the loop repeated.

Root cause wasn’t a Prometheus bug. It was a combination of:

a larger-than-normal WAL (ingestion spike + series churn)
a slow PV (high IO latency)
too-tight memory limits for head rebuild during replay

Constraint: we needed monitoring back fast, but we didn’t want to “fix” it by deleting data blindly and losing the only evidence we had.

Timeline

T-0: Prometheus restarts during node rotation.
T+5m: pod Running but NotReady; alert evaluations missing.
T+10m: logs clearly show WAL replay dominates startup.
T+20m: WAL directory is huge; disk is saturated; memory climbs.
T+30m: mitigation: temporarily increase memory and reduce ingestion pressure.
T+60m: Prometheus becomes Ready; alerting recovers.
T+1d: prevention: bound churn, alert on WAL size and restart duration, and treat “time-to-ready” as an SLO.

Mechanism: why WAL replay dominates startup

The Prometheus TSDB appends every ingested sample and every newly created series to the WAL. On startup, it replays the WAL segments to rebuild:

active series metadata
in-memory chunks (“head”)

This process gets slow or unstable when:

WAL is huge (many segments)
disk is slow or saturated
churn is high (many new series)
memory limits are too low (replay allocates aggressively)

The dangerous loop is:

restart
replay starts, memory climbs
OOMKilled mid-replay
restart again
WAL is still huge, repeat

Runbook: diagnose WAL replay and recover safely

1) Confirm it’s WAL replay (don’t guess)

kubectl -n monitoring logs pod/<prometheus-pod> --since=60m | \
  grep -nEi "WAL|replay|TSDB|corrupt|repair|head" | tail -n 200

You’re looking for:

messages such as “Replaying WAL, this may take a while” and “WAL segment loaded”
progress that barely moves
OOM / panic / corruption messages
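
To turn "progress that barely moves" into a number, the sketch below reads logs on stdin and reports the position of the most recent "WAL segment loaded" line. It assumes the default logfmt log format with `segment=` and `maxSegment=` fields; `replay_progress` is a hypothetical helper name, and the percentage is only approximate because the first segment after a checkpoint is usually not segment 0.

```shell
# replay_progress: read Prometheus logs on stdin and print the position of the
# last "WAL segment loaded" line as "segment/maxSegment (NN%)".
# Assumes logfmt output with segment= and maxSegment= fields.
replay_progress() {
  grep 'WAL segment loaded' | tail -n 1 | awk '
    {
      found = 0
      for (i = 1; i <= NF; i++) {
        if ($i ~ /^segment=/)    { seg  = substr($i, 9)  + 0; found++ }
        if ($i ~ /^maxSegment=/) { last = substr($i, 12) + 0; found++ }
      }
      if (found < 2) exit
      printf "%d/%d (%d%%)\n", seg, last, (last > 0 ? seg * 100 / last : 100)
    }'
}
```

Typical use is `kubectl -n monitoring logs pod/<prometheus-pod> --since=60m | replay_progress`, run twice a few minutes apart to see how fast the segment number advances.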

2) Measure how big the WAL really is

kubectl -n monitoring exec -it <prometheus-pod> -- sh -lc \
  "du -sh /prometheus/wal /prometheus 2>/dev/null || true; ls -1 /prometheus/wal 2>/dev/null | wc -l || true"

Two signals matter operationally:

WAL size (GiB)
segment count (many files means slow scans and more work)
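
A plain `ls | wc -l` also counts the checkpoint directory that lives next to the segments. As a rough sketch, the helper below counts only segment files, assuming they are named with eight digits (for example `00000042`); `count_wal_segments` is a hypothetical name.

```shell
# count_wal_segments DIR: count WAL segment files in DIR.
# Assumes segments are named with exactly eight digits; checkpoint.* entries
# and anything else are ignored.
count_wal_segments() {
  ls -1 "$1" 2>/dev/null | grep -cE '^[0-9]{8}$'
}
```

With the default segment size of up to 128 MiB, segment count multiplied by 128 MiB gives a quick upper bound to sanity-check the `du` output.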

3) Check if you’re IO-bound or memory-bound

Practical heuristics:

IO-bound: replay is slow, CPU is not high, node/PV IO latency is high
memory-bound: replay starts, RSS climbs, then OOMKilled

If you can access the node:

iostat -x 1 10

If you can’t, infer from the PV type and from repeated OOM cycles.
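
Repeated OOM cycles can be confirmed from the pod status instead of inferred. This is a minimal sketch that assumes the container is named `prometheus` (the usual name under the Prometheus Operator, but check your pod spec); both function names are hypothetical.

```shell
# last_termination_reason NAMESPACE POD: print why the previous container
# instance exited (e.g. "OOMKilled", "Error"), or nothing if it never restarted.
# Assumption: the Prometheus container is named "prometheus".
last_termination_reason() {
  kubectl -n "$1" get pod "$2" \
    -o jsonpath='{.status.containerStatuses[?(@.name=="prometheus")].lastState.terminated.reason}'
}

# classify_restart REASON: map the reason to the next diagnostic step.
classify_restart() {
  case "$1" in
    *OOMKilled*) echo "memory-bound: last container exit was OOMKilled" ;;
    "")          echo "no previous termination recorded: check IO latency and replay progress" ;;
    *)           echo "last exit reason: $1 (check logs with --previous)" ;;
  esac
}
```

For example: `classify_restart "$(last_termination_reason monitoring <prometheus-pod>)"`.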

4) Find why WAL grew (the real root cause)

Most “huge WAL” cases are driven by one of:

high churn metrics (new series per minute)
cardinality explosions (label blowups)
remote_write backpressure and ingestion amplification
slow storage that prevents truncation/compaction from keeping up
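
Once Prometheus is Ready again, its own metrics can point at the source of the growth. The sketch below builds two PromQL queries (series created per minute, and the top contributors by a label) and sends them to the query API. `PROM_URL` and the function names are assumptions; the queries only return data after replay has finished.

```shell
# Assumption: Prometheus is reachable here, e.g. through kubectl port-forward.
PROM_URL="${PROM_URL:-http://localhost:9090}"

# churn_query WINDOW: new series created per minute over WINDOW (e.g. 5m).
churn_query() {
  printf 'rate(prometheus_tsdb_head_series_created_total[%s]) * 60\n' "$1"
}

# top_series_query N LABEL: the N largest series counts grouped by LABEL
# (use __name__ for per-metric counts, job for per-job counts).
top_series_query() {
  printf 'topk(%s, count by (%s) ({__name__=~".+"}))\n' "$1" "$2"
}

# promql EXPR: run an instant query and print the raw JSON response.
promql() {
  curl -fsS -G "$PROM_URL/api/v1/query" --data-urlencode "query=$1"
}
```

For example: `promql "$(churn_query 5m)"` followed by `promql "$(top_series_query 10 __name__)"`. The `topk` query touches every series, so it is expensive on a large instance and best run sparingly.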

This is where label and cardinality budgets pay off: the WAL is the bill for every series you create.

Safe mitigations (pick the least invasive first)

1) Temporarily increase memory limits

WAL replay needs headroom above steady-state usage. If you size only for steady state, replays fail.

Representative bump:

resources:
  requests:
    memory: "4Gi"
  limits:
    memory: "6Gi"

2) Temporarily reduce ingestion volume

If the WAL is huge and replay is slow, you may need to reduce load so it can catch up:

increase scrape interval for the noisiest jobs
temporarily disable the worst offenders
drop low-value metrics via relabeling (especially high-churn labels)

3) Move TSDB to faster storage (planned fix)

If replay is IO-bound, this is a storage problem. Faster disk is the difference between minutes and hours.

4) Only as a last resort: delete WAL

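Deleting the WAL can make Prometheus start quickly, but it is data loss by design: every sample that has not yet been compacted into a block on disk is gone, which with default settings is typically the most recent two to three hours.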

If you must do it:

take a snapshot of the TSDB directory first (if feasible)
document what you deleted and why (for incident review)
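
One way to cover both points is to archive the WAL and record the reason before anything is removed. This is a sketch under stated assumptions: Prometheus is stopped while it runs, the backup location has room for the archive, and a `tar` with gzip support is available (the tool set inside the Prometheus image may be too limited, so a debug container or the node may be needed). `archive_wal` and `wal-removal.log` are hypothetical names, and the deletion itself stays a manual step.

```shell
# archive_wal DATA_DIR BACKUP_DIR REASON
# Archive DATA_DIR/wal into BACKUP_DIR and append a line to a removal log.
# Prints the archive path on success. Does NOT delete anything.
archive_wal() {
  data_dir=$1; backup_dir=$2; reason=$3
  stamp=$(date -u +%Y%m%dT%H%M%SZ)
  archive="$backup_dir/wal-$stamp.tar.gz"

  if [ ! -d "$data_dir/wal" ]; then
    echo "no wal directory under $data_dir" >&2
    return 1
  fi
  mkdir -p "$backup_dir" || return 1

  # Archive first; record the action only if the archive was written.
  tar -C "$data_dir" -czf "$archive" wal || return 1
  printf '%s archived %s/wal reason: %s\n' "$stamp" "$data_dir" "$reason" \
    >> "$backup_dir/wal-removal.log"
  echo "$archive"
}
```

Only after the archive path is printed, and verified with `tar -tzf`, would the WAL directory be removed by hand.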

What we changed (concrete)

1) We turned “time-to-ready” into an SLO

We wrote down a contract:

after restart, Prometheus must become Ready within N minutes

If it violates the budget, it’s a production incident (monitoring is a production dependency).
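
A time-to-ready budget is easy to measure during a controlled restart. The sketch below polls an arbitrary probe command until it succeeds or the budget runs out; `wait_until_ready` is a hypothetical helper, and using `/-/ready` on a port-forwarded `localhost:9090` as the probe is an assumption about how you reach Prometheus.

```shell
# wait_until_ready BUDGET_SECONDS INTERVAL_SECONDS PROBE [ARGS...]
# Runs PROBE repeatedly. On success, prints the elapsed seconds and returns 0.
# Returns 1 if the probe is still failing once the budget has been used up.
wait_until_ready() {
  budget=$1; interval=$2; shift 2
  start=$(date +%s)
  while :; do
    if "$@" >/dev/null 2>&1; then
      echo $(( $(date +%s) - start ))
      return 0
    fi
    if [ $(( $(date +%s) - start )) -ge "$budget" ]; then
      return 1
    fi
    sleep "$interval"
  done
}
```

For example, `wait_until_ready 600 5 curl -fsS http://localhost:9090/-/ready` prints the seconds until the readiness endpoint returned success, or fails after ten minutes.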

2) We reduced churn at the source

Label budgets and churn control reduced:

WAL growth rate
replay memory footprint
startup duration variance

3) We sized headroom for replay peaks

We increased memory headroom and avoided “tight limits” on the single instance that must recover fast.

4) We added alerts before we become blind

Guardrails we actually use:

“Prometheus NotReady for more than X minutes”
WAL size above a threshold
new series per minute above a threshold
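
As a rough sketch, these guardrails map onto PromQL expressions like the ones printed below. The thresholds are arguments, so nothing here is a recommended value. The function name and the pod regex are assumptions, and `kube_pod_status_ready` requires kube-state-metrics. The NotReady expression has to be evaluated by something other than the Prometheus it watches, with the "more than X minutes" part expressed as the alert rule's `for:` duration.

```shell
# guardrail_exprs WAL_MAX_BYTES SERIES_PER_MIN_MAX POD_REGEX
# Print one PromQL alert expression per guardrail.
guardrail_exprs() {
  wal_max_bytes=$1; series_per_min_max=$2; pod_regex=$3
  # WAL size above a threshold.
  printf 'prometheus_tsdb_wal_storage_size_bytes > %s\n' "$wal_max_bytes"
  # New series per minute above a threshold.
  printf 'rate(prometheus_tsdb_head_series_created_total[5m]) * 60 > %s\n' "$series_per_min_max"
  # Prometheus pod not Ready (metric comes from kube-state-metrics).
  printf 'kube_pod_status_ready{condition="true", pod=~"%s"} == 0\n' "$pod_regex"
}
```

For example, `guardrail_exprs 21474836480 50000 'prometheus-.*'` prints three expressions; the numbers are placeholders to replace with thresholds derived from your own baseline.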

How to verify

Prometheus becomes Ready within the budget after a controlled restart.
WAL size stays bounded under normal load.
During node drains, alerting does not go blind for long windows.
