#sre

4 posts

Prometheus remote_write backpressure: when monitoring fills the disk (and still loses data)

A practical runbook for remote_write outages: measure lag, estimate time-to-disk-full, tune queue_config safely, and choose explicit survival trade-offs.

December 24, 2025

PostgreSQL Checkpoint Spikes: Why p99 Explodes Every N Minutes

A reproducible approach to diagnosing and eliminating checkpoint-induced latency spikes using pgbench, pg_stat_bgwriter, and WAL/IO budgeting.

December 8, 2025

Pods Stuck in Terminating: A Production Decision Tree for Finalizers, Volumes, and Dead Nodes

A conservative runbook to unstick Pods safely: finalizers, CSI/volume cleanup stalls, dead nodes, and when (and how) to force-delete.

November 26, 2025

Kubernetes Graceful Shutdown as a Contract: Zero 502s During Rollouts (HTTP + gRPC)

A reproducible way to eliminate 502/ECONNRESET errors during rollouts: readiness-driven draining, preStop, SIGTERM handling, and a measurable drain budget.

November 22, 2025