#debugging

52 posts

Prometheus WAL Replay Hell: Slow Restarts and Missing Alerts

When Prometheus takes minutes or hours to restart, WAL replay is often the culprit. Prove it from logs and disk, recover safely, and prevent it.

January 5, 2026

tcpdump Sees SYNs, but the Service Times Out: The Listen Backlog Trap

Clients time out, tcpdump shows SYNs (sometimes even SYN-ACK), yet your app logs nothing. The culprit is often the Linux listen/accept queues overflowing under load or CPU starvation.

January 3, 2026

PostgreSQL Logical Replication Lag: Big Transactions and Reorder Buffer Spills

One huge transaction can pin logical replication for hours. Runbook to detect the blocker, tune decoding safely, and enforce bounded transactions in prod.

January 1, 2026

ingress-nginx Reload Storms: Why 502 Spikes Track Ingress Churn

NGINX Ingress reload storms can drop keep-alives and cause 502 spikes. Runbook to prove reload impact, reduce churn, and harden graceful reload.

December 28, 2025

The Cert Isn't Expired, Your Node Is: Time Drift Breaking TLS and JWT in Kubernetes

Sporadic TLS handshake failures and JWT rejections across services. Everything passes when you check. The culprit: a node's clock drifted or jumped, and NTP fixed it before you could catch it.

December 26, 2025

EXPLAIN Lied to You: The PostgreSQL Prepared Statement Plan Cliff

Your EXPLAIN looks perfect but production melts. The culprit: PostgreSQL silently switched from a custom plan to a generic plan after enough executions, and the generic plan is catastrophically wrong.

December 24, 2025

Works in psql, Flaky in Prod: PgBouncer's Silent Murder of LISTEN/NOTIFY

PostgreSQL LISTEN/NOTIFY works perfectly in local testing but notifications randomly stop arriving in production. The culprit: transaction pooling quietly reassigning your connection to someone else.

December 18, 2025

Packets Arrive but the App Times Out: The rp_filter Trap in Kubernetes

tcpdump shows packets arriving, but your application sees nothing. The culprit: Linux reverse path filtering silently dropping packets before they reach iptables, triggered by asymmetric routing in multi-homed setups.

December 12, 2025

'No Space Left on Device' with 40% Disk Free: The Inode and OverlayFS Death Spiral

df -h shows 40% free. But your container keeps crashing with ENOSPC. The culprit: inode exhaustion on overlayfs layers, invisible to standard monitoring.

December 7, 2025

Database Connection Pool Exhaustion: The Silent Outage Trigger

App hangs but the database looks healthy. Your pool is exhausted. I show how to detect it, size pools sanely, and prevent connection leaks.

November 30, 2025

pg_waldump WAL Forensics: Reconstructing What Happened to Your Data

Something deleted rows from production but nobody admits to running DELETE. Use pg_waldump to analyze WAL files and reconstruct exactly what happened and when.

November 24, 2025

5000 Unacked Messages and Climbing: Stop RabbitMQ Consumer Meltdowns in CI

Queue looks healthy until deployment, then messages_unacknowledged explodes, memory spikes, and redelivery storms start. The culprit: your prefetch is too high and nobody tested actual ack behavior.

November 22, 2025

Kubernetes OOM Killer: Why Your Container Dies at 50% Memory

Container memory limit is 4GB but OOM kills at 2GB used. Kernel buffers, page cache, and cgroup accounting tricks cause early OOMKills. Here's the full picture.

November 16, 2025

One Partition at 99% CPU: Stop Kafka Hotspots Before They Reach Production

All partitions look balanced in testing, then production traffic arrives and one partition melts. The culprit: your partition key has terrible cardinality and nobody noticed until now.

November 15, 2025

Kubernetes APF Starvation: When One Controller Makes kubectl Hang

APF can starve your Kubernetes API: kubectl hangs, controllers time out, and 429s spike. Runbook to isolate the noisy client, fix FlowSchemas, and prove it.

November 14, 2025

Envoy/Istio 503 UF/UO/UT: When the Mesh, Not the App, Is Your Outage

Envoy/Istio can return 503 UF/UO/UT when connection pools overflow. Decode flags, inspect proxy stats, patch DestinationRules, and verify fast.

November 2, 2025

Elasticsearch Hot Shard Problem: When One Node Does All the Work

5 data nodes but one is at 100% CPU. Uneven routing keys create hot shards. I show how to detect skew and fix it with routing strategies.

October 16, 2025

Logical Replication Slot WAL Bloat: When Subscribers Go Offline

Disk filling up with WAL files. The cause: a logical replication slot consumer went offline, and PostgreSQL retains all WAL generated since then because the subscriber may still need it.

September 9, 2025

eBPF Off-CPU Analysis: Finding Latency That Metrics Miss

CPU is at 20% but latency is 500ms. Standard profilers show nothing. The app is waiting, not computing. I show how to use eBPF to find what it's waiting for.

September 7, 2025

Kubernetes conntrack Table Exhaustion: The Silent Packet Killer

Random DNS timeouts, dropped connections, services timing out. Your nf_conntrack table is full. I show how to diagnose, monitor, and fix this Kubernetes networking issue.

June 3, 2025

Redis Memory Fragmentation: When maxmemory Isn't Enough

Your Redis has 4GB maxmemory but RSS shows 6GB. OOM killer strikes. I explain jemalloc fragmentation with reproduction steps and activedefrag tuning.

May 22, 2025

GIN Index Pending List Overflow: Fast Writes, Slow Searches

Full-text search was fast, now it's slow. The cause: GIN index pending list grew huge during bulk inserts, and every search must now scan the unsorted pending entries.

April 17, 2025

ICU Collation Version Drift: When Database Upgrades Break Your Indexes

Query returns wrong results after OS upgrade. The cause: ICU library version changed, collation rules shifted, and existing indexes are still ordered by the old rules.

March 15, 2025

Java Profiling in Hardened Kubernetes: When Security Blocks Your Debugger

Can't attach profiler to production JVM. seccomp blocks perf_event_open, container drops CAP_SYS_PTRACE, and PodSecurityPolicy prevents privileged mode. Here's how to profile anyway.

March 7, 2025

PostgreSQL Partial Index: Planner Ignores Your Index

Query scans full table despite perfect partial index. The cause: query's WHERE clause doesn't match the index predicate exactly, or statistics mislead the planner.

March 4, 2025

Go cgo DNS Resolution Thread Explosion: When net.LookupHost Spawns Thousands of Threads

Go application suddenly has 10,000 threads consuming all memory. The cause: cgo-based DNS resolution blocking in slow DNS environments, bypassing Go's goroutine scheduler.

February 25, 2025

eBPF Run-Queue Latency: Finding the Off-CPU Bottleneck

CPU utilization is low but requests are slow. The hidden culprit: time spent waiting in the scheduler run-queue, invisible to traditional profilers but visible with eBPF off-CPU analysis.

February 17, 2025

Linux ARP Cache Stale Entries: Failover Traffic Blackhole

Traffic goes to old server after failover. The cause: Linux ARP cache retains MAC address of failed node, sending packets to unreachable destination for minutes.

February 14, 2025

Gossip Protocol Ghost Nodes: IP Reuse Haunting Your Cluster

New node joins cluster but gets shunned. Old node's IP is still in gossip protocol's failure detection blacklist. The zombie membership record lives on.

February 10, 2025

Kubernetes Ghost Connections: Stale Conntrack DNAT Entries

Service returns wrong pod IPs after scaling. The cause: Linux conntrack keeps DNAT entries alive longer than pods exist, routing traffic to deleted endpoints.

February 5, 2025

Double Charges From Idempotency Keys: The Replica Lag Trap

Perfect idempotency logic, but customers still get charged twice. The cause: checking idempotency keys against a read replica that's seconds behind the primary during traffic spikes.

January 29, 2025

PostgreSQL Read Replica Conflicts: Why Your Queries Get Canceled

Queries on read replicas fail with 'canceling statement due to conflict with recovery'. The fix depends on which of the 5 conflict types you have. Here's how to diagnose and solve each one.

January 28, 2025

Redis Cluster Slot Migration: Temporary Memory Explosion

Redis nodes OOMKilled during cluster rebalancing. The cause: slot migration copies keys to destination before deleting from source, temporarily doubling memory usage.

January 27, 2025

Split-Brain From a Clock Step Backwards: Wall Time in Lease-Based Systems

Two nodes both believe they hold the leader lease. The cause: a small NTP time step backwards combined with code that mixes wall-clock time with duration-based timeouts.

January 22, 2025

Java OOMKilled With Stable Heap: Native Memory, Direct Buffers, and glibc Arenas

Heap metrics look fine, GC is happy, but the container keeps dying. The culprit: native memory from direct buffers, JNI, and glibc memory allocator fragmentation.

January 20, 2025

Go p99 Latency Cliffs: Nested context.WithTimeout Timer Storms

Periodic latency spikes that look like network jitter. The real cause: nested timeouts creating thousands of timers that pressure the Go runtime timer heap and trigger GC scanning.

January 15, 2025

PostgreSQL Serialization Failures: Beyond 'Just Retry'

Getting 'could not serialize access due to concurrent update'? The fix isn't just retry logic. It's understanding when to use which isolation level and how to reduce conflict frequency.

January 15, 2025

gRPC Keepalive Mismatch: Transport Closing After Idle

gRPC connections randomly close with 'transport is closing'. The cause: client and server keepalive settings don't match, causing the server to terminate idle connections.

January 13, 2025

The Ghost Pod: Why Your Service Still Sends Traffic to Dead Endpoints

Random ECONNRESET on some nodes but not others. Endpoints look fine. The culprit: conntrack NAT entries keeping long-lived connections pinned to pods that no longer exist.

January 5, 2025

PostgreSQL OOM by Design: work_mem × Parallel Workers × Plan Nodes

work_mem looks small at 256MB, but a parallel hash join with 4 workers across 3 plan nodes uses 3GB. Here's how to prevent PostgreSQL from legitimately OOMing your container.

December 28, 2024

JVM Metaspace OOM in Kubernetes: Why MaxMetaspaceSize Alone Won't Save You

Pod OOMKilled despite MaxMetaspaceSize set. The cause: Metaspace grows outside the heap, the container memory limit doesn't account for it, and class unloading isn't happening.

December 23, 2024

The Index That Killed Write Performance: Losing PostgreSQL HOT Updates

Adding an index for performance made writes 10x slower. The counter-intuitive cause: the new index broke HOT updates, turning cheap in-place updates into full-row rewrites with massive bloat.

December 19, 2024

PostgreSQL 'cached plan must not change result type' During Zero-Downtime Migrations

Rolling deploy fails with cached plan errors after ALTER TABLE. The cause: server-side prepared statements cache query plans that break when schema changes.

December 11, 2024

etcd Watch Replay Storms: When Giant ConfigMaps Kill the Control Plane

The apiserver becomes 'randomly slow'. Root cause: large, frequently updated ConfigMaps trigger watch compaction, causing thousands of controllers to relist simultaneously.

December 5, 2024

etcd Quota Alarm: When Your Kubernetes Cluster Goes Read-Only

Cluster stops accepting writes, pods can't schedule. The cause: etcd hit its storage quota because compaction wasn't running and history accumulated beyond the limit.

November 27, 2024

Kubernetes Headless Service DNS: Stale Records After Pod Deletion

Requests go to non-existent pods. The cause: headless service DNS records persist in the client's DNS cache after pods are deleted, before the endpoint update propagates.

November 22, 2024

Traffic Hitting Dead Pods: Conntrack's Stale NAT Mapping

Deploy causes 503s for exactly 2 minutes. The issue: conntrack keeps NAT mappings to old pod IPs even after Kubernetes removes endpoints, sending traffic to dead pods.

November 14, 2024

Ephemeral Port Exhaustion: The Node That 'Goes Bad'

A single Kubernetes node starts failing connections to external services while pods look healthy. The hidden cause: sidecar proxies exhausting ephemeral ports with short-lived connections.

November 11, 2024

PMTU Blackholes: When Only Large Responses Hang

Small API responses work, large ones hang forever. The cause: ICMP 'Fragmentation Needed' messages filtered by firewalls, breaking Path MTU Discovery in overlay networks.

November 7, 2024

kube-proxy Micro-Outages: The xtables Lock Contention Problem

Random 1-3 second connection drops during deployments. CPU looks fine, memory is stable. The hidden cause: iptables-restore grabbing the xtables lock while endpoints churn.

November 4, 2024

TCP TIME_WAIT Port Exhaustion: When Connection Pooling Isn't Enough

Service can't connect to the database: 'cannot assign requested address'. The cause: ephemeral ports exhausted by thousands of sockets stuck in TIME_WAIT state.

October 28, 2024

VXLAN Random Packet Drops: The Checksum Offload Trap

Cross-node gRPC calls randomly fail but local traffic works fine. The culprit: TX checksum offload corrupting VXLAN headers on specific NIC drivers. Here's how to diagnose and fix.

October 21, 2024