Redis AOF fsync Latency Spikes: When Durability Becomes Your p99
Redis AOF can turn durability into p99 spikes through fsync pressure and copy-on-write (CoW) during the rewrite fork. Runbook to confirm, mitigate safely, and add guardrails.
9 posts
When Prometheus takes minutes or hours to restart, WAL replay is often the culprit. Prove it from logs and disk, recover safely, and prevent it.
One huge transaction can pin logical replication for hours. Runbook to detect the blocker, tune decoding safely, and enforce bounded transactions in prod.
PostgreSQL can go read-only near XID wraparound. Use this emergency playbook to find the oldest tables, unblock vacuum freeze, and prevent repeat incidents.
hot_standby_feedback stops replica query cancellations but can bloat the primary over days. Detect xmin pinning, mitigate safely, and add guardrails.
Pods stuck in ContainerCreating often hide a stuck CSI VolumeAttachment. Runbook to find the blocker, detach safely, prevent data loss, and add alerts.
A conservative runbook to unstick Pods safely: finalizers, CSI/volume cleanup stalls, dead nodes, and when (and how) to force-delete.
Pods get evicted for ephemeral-storage while disk looks free. Debug nodefs/imagefs, container logs, kubelet GC, then enforce budgets and log rotation.
Kafka consumer rebalances can make lag worse when you scale out. Diagnose max.poll.interval.ms, heartbeats, and assignment strategy, then apply safe config diffs.