#operations

9 posts

Redis AOF fsync Latency Spikes: When Durability Becomes Your p99

Redis AOF can turn durability into p99 spikes through fsync pressure and copy-on-write during rewrite forks. Runbook to confirm, mitigate safely, and add guardrails.

January 9, 2026

Prometheus WAL Replay Hell: Slow Restarts and Missing Alerts

When Prometheus takes minutes or hours to restart, WAL replay is often the culprit. Prove it from logs and disk, recover safely, and prevent it.

January 5, 2026

PostgreSQL Logical Replication Lag: Big Transactions and Reorder Buffer Spills

One huge transaction can pin logical replication for hours. Runbook to detect the blocker, tune decoding safely, and enforce bounded transactions in prod.

January 1, 2026

PostgreSQL XID Wraparound: Emergency Playbook for Vacuum Freeze Under Fire

PostgreSQL can go read-only near XID wraparound. Use this emergency playbook to find the oldest tables, unblock vacuum freeze, and prevent repeat incidents.

December 16, 2025

hot_standby_feedback Bloat Trap: Fixing Replica Conflicts by Slowly Killing the Primary

hot_standby_feedback stops replica query cancellations but can bloat the primary over days. Detect xmin pinning, mitigate safely, and add guardrails.

December 12, 2025

CSI VolumeAttachment Stuck: Pods in ContainerCreating and Drains That Never Finish

Pods stuck in ContainerCreating often hide a stuck CSI VolumeAttachment. Runbook to find the blocker, detach safely, prevent data loss, and add alerts.

November 30, 2025

Pods Stuck in Terminating: A Production Decision Tree for Finalizers, Volumes, and Dead Nodes

A conservative runbook to unstick Pods safely: finalizers, CSI/volume cleanup stalls, dead nodes, and when (and how) to force-delete.

November 26, 2025

Ephemeral-Storage Evictions in Kubernetes: The Log Storm That Took Down Healthy Pods

Pods get evicted for ephemeral-storage while disk looks free. Debug nodefs/imagefs, container logs, kubelet GC, then enforce budgets and log rotation.

November 18, 2025

Kafka Consumer Rebalance Storms: Why Scaling Consumers Can Increase Lag

Kafka consumer rebalances can make lag worse when you scale out. Diagnose max.poll.interval.ms, heartbeats, and assignment strategy; apply safe config diffs.

November 10, 2025