#prometheus

6 posts

Prometheus WAL Replay Hell: Slow Restarts and Missing Alerts

When Prometheus takes minutes or hours to restart, WAL replay is often the culprit. Confirm it from logs and disk, recover safely, and keep it from happening again.

January 5, 2026

Prometheus remote_write backpressure: when monitoring fills the disk (and still loses data)

A practical runbook for remote_write outages: measure lag, estimate time-to-disk-full, tune queue_config safely, and choose explicit survival trade-offs.

December 24, 2025

Cardinality Contracts: Prometheus Labels as an API with Budgets

Define label budgets, enforce them in CI, and add a runtime firewall to stop cardinality explosions before production.

December 21, 2025

Prometheus Native Histograms in Production: Rollout Plan, Budgets, and Failure Modes

Prometheus native histograms can blow up memory, WAL, and remote_write. This guide shows a staged rollout, budgets, and concrete queries to verify safety.

December 20, 2025

Dash Contracts in Go: CI Compiler for Grafana Dashboards and Prometheus Alerts

Extract PromQL from dashboards and rules, verify selectors against /metrics, and fail CI before dashboards go dark.

December 15, 2025

Prometheus Cardinality Explosion: Detection, Prevention, and Recovery

One developer added a user_id label. Prometheus OOM'd. I show how to detect high-cardinality metrics before they kill your monitoring, with relabel configs to drop them.

July 23, 2025