Prometheus WAL Replay Hell: Slow Restarts and Missing Alerts
When Prometheus takes minutes or hours to restart, WAL replay is often the culprit. Prove it from logs and disk, recover safely, and prevent it.
6 posts
When Prometheus takes minutes or hours to restart, WAL replay is often the culprit. Prove it from logs and disk, recover safely, and prevent it.
A practical runbook for remote_write outages: measure lag, estimate time-to-disk-full, tune queue_config safely, and choose explicit survival trade-offs.
Define label budgets, enforce them in CI, and add a runtime firewall to stop cardinality explosions before production.
Prometheus native histograms can blow up memory, WAL, and remote_write. This guide shows a staged rollout, budgets, and concrete queries to verify safety.
Extract PromQL from dashboards and rules, verify selectors against /metrics, and fail CI before dashboards go dark.
One developer added user_id label. Prometheus OOM'd. I show how to detect high-cardinality metrics before they kill your monitoring, with relabel configs to drop them.