#monitoring

8 posts

Prometheus remote_write backpressure: when monitoring fills the disk (and still loses data)

A practical runbook for remote_write outages: measure lag, estimate time-to-disk-full, tune queue_config safely, and choose explicit survival trade-offs.

December 24, 2025

Cardinality Contracts: Prometheus Labels as an API with Budgets

Define label budgets, enforce them in CI, and add a runtime firewall to stop cardinality explosions before production.

December 21, 2025

Dash Contracts in Go: CI Compiler for Grafana Dashboards and Prometheus Alerts

Extract PromQL from dashboards and rules, verify selectors against /metrics, and fail CI before dashboards go dark.

December 15, 2025

Kubernetes TLS Certificate Rotation: The 3AM Outage

A certificate expired at 3AM and took the service down; cert-manager's renewal had failed silently. I show how to monitor expiry, test rotation, and prevent cert-related outages.

December 9, 2025

Prometheus Cardinality Explosion: Detection, Prevention, and Recovery

One developer added a user_id label. Prometheus OOM'd. I show how to detect high-cardinality metrics before they kill your monitoring, with relabel configs to drop them.

July 23, 2025

Tail-Based Sampling in OpenTelemetry: Sizing, Memory Crashes and Cost Model

A practical sizing guide for tail sampling in the OpenTelemetry Collector, from decision_wait through memory limits to cost-benefit analysis.

June 21, 2025

PostgreSQL Replication Slot Bloat: How One Inactive Slot Filled 500GB Disk

The disk is 95% full and the WAL directory holds 400GB. I'll show how replication slots prevent WAL cleanup, plus a playbook for prevention and recovery.

June 8, 2025

PostgreSQL Idle in Transaction: Emergency Playbook for Stuck Connections

Autovacuum can't run and table bloat keeps growing, all because of one 'idle in transaction' connection. Here's the detection and kill playbook.

May 20, 2025