#observability

13 posts

Prometheus WAL Replay Hell: Slow Restarts and Missing Alerts

When Prometheus takes minutes or hours to restart, WAL replay is often the culprit. Prove it from logs and disk, recover safely, and prevent it.

January 5, 2026

Span Contracts: Trace-Driven API Contract Testing with OpenTelemetry

Detect API breaking changes by hashing response shapes from OTel spans and fail CI without storing payloads.

December 31, 2025

Prometheus remote_write backpressure: when monitoring fills the disk (and still loses data)

A practical runbook for remote_write outages: measure lag, estimate time-to-disk-full, tune queue_config safely, and choose explicit survival trade-offs.

December 24, 2025

Cardinality Contracts: Prometheus Labels as an API with Budgets

Define label budgets, enforce them in CI, and add a runtime firewall to stop cardinality explosions before production.

December 21, 2025

Prometheus Native Histograms in Production: Rollout Plan, Budgets, and Failure Modes

Prometheus native histograms can blow up memory, WAL, and remote_write. This guide shows a staged rollout, budgets, and concrete queries to verify safety.

December 20, 2025

Dash Contracts in Go: CI Compiler for Grafana Dashboards and Prometheus Alerts

Extract PromQL from dashboards and rules, verify selectors against /metrics, and fail CI before dashboards go dark.

December 15, 2025

PostgreSQL Checkpoint Spikes: Why p99 Explodes Every N Minutes

A reproducible approach to diagnose and eliminate checkpoint-induced latency spikes using pgbench, pg_stat_bgwriter, and WAL/IO budgeting.

December 8, 2025

OpenTelemetry Collector Backpressure: Fixing Drops with memory_limiter and Queues

The OpenTelemetry Collector drops spans under load when exporters apply backpressure. Fix it with memory_limiter, queues, and batch tuning, with commands to verify.

December 4, 2025

RSS Contracts: Stop OOMKilled Java Pods in Kubernetes by Testing RSS as an API

Use cgroup RSS budgets, CI sampling, and runtime headroom to catch JVM memory regressions before they hit production.

November 27, 2025

Structured Logging Performance: When Your Logger Becomes the Bottleneck

At 50k logs/sec, JSON serialization eats 30% CPU. Standard library encoding/json is slow. I benchmark zap vs zerolog vs slog with real numbers.

September 28, 2025

eBPF Off-CPU Analysis: Finding Latency That Metrics Miss

CPU is at 20% but latency is 500ms. Standard profilers show nothing. The app is waiting, not computing. I show how to use eBPF to find what it's waiting for.

September 7, 2025

Prometheus Cardinality Explosion: Detection, Prevention, and Recovery

One developer added a user_id label. Prometheus OOM'd. I show how to detect high-cardinality metrics before they kill your monitoring, with relabel configs to drop them.

July 23, 2025

Tail-Based Sampling in OpenTelemetry: Sizing, Memory Crashes and Cost Model

A practical sizing guide for tail sampling in the OpenTelemetry Collector, from decision_wait through memory limits to cost-benefit analysis.

June 21, 2025