Prometheus remote_write backpressure: when monitoring fills the disk (and still loses data)
A practical runbook for remote_write outages: measure lag, estimate time-to-disk-full, tune queue_config safely, and choose explicit survival trade-offs.
8 posts
A practical runbook for remote_write outages: measure lag, estimate time-to-disk-full, tune queue_config safely, and choose explicit survival trade-offs.
Define label budgets, enforce them in CI, and add a runtime firewall to stop cardinality explosions before production.
Extract PromQL from dashboards and rules, verify selectors against /metrics, and fail CI before dashboards go dark.
Certificate expired at 3AM, service down. cert-manager renewal failed silently. I show monitoring, testing rotation, and preventing cert-related outages.
One developer added user_id label. Prometheus OOM'd. I show how to detect high-cardinality metrics before they kill your monitoring, with relabel configs to drop them.
Practical sizing guide for tail sampling in OpenTelemetry Collector. From decision_wait through memory limits to cost-benefit analysis.
Disk is 95% full, WAL directory has 400GB. I'll show how replication slots prevent WAL cleanup and a playbook for prevention and recovery.
Autovacuum can't run, table bloat growing, all because of one 'idle in transaction' connection. Here's the detection and kill playbook.