Prometheus WAL Replay Hell: Slow Restarts and Missing Alerts
When Prometheus takes minutes or hours to restart, WAL replay is often the culprit. Prove it from logs and disk, recover safely, and prevent it.
13 posts
Detect API breaking changes by hashing response shapes from OTel spans and fail CI without storing payloads.
A practical runbook for remote_write outages: measure lag, estimate time-to-disk-full, tune queue_config safely, and choose explicit survival trade-offs.
Define label budgets, enforce them in CI, and add a runtime firewall to stop cardinality explosions before production.
Prometheus native histograms can blow up memory, WAL, and remote_write. This guide shows a staged rollout, budgets, and concrete queries to verify safety.
Extract PromQL from dashboards and rules, verify selectors against /metrics, and fail CI before dashboards go dark.
A reproducible approach to diagnose and eliminate checkpoint-induced latency spikes using pgbench, pg_stat_bgwriter, and WAL/IO budgeting.
OpenTelemetry Collector drops spans under load when exporters backpressure. Fix with memory_limiter, queues, and batch tuning, with commands to verify.
Use cgroup RSS budgets, CI sampling, and runtime headroom to catch JVM memory regressions before they hit production.
At 50k logs/sec, JSON serialization eats 30% CPU. The standard library's encoding/json is slow. I benchmark zap vs zerolog vs slog with real numbers.
CPU is at 20% but latency is 500ms. Standard profilers show nothing. The app is waiting, not computing. I show how to use eBPF to find what it's waiting for.
One developer added a user_id label. Prometheus OOM'd. I show how to detect high-cardinality metrics before they kill your monitoring, with relabel configs to drop them.
Practical sizing guide for tail sampling in OpenTelemetry Collector. From decision_wait through memory limits to cost-benefit analysis.