#linux

14 posts

Redis AOF fsync Latency Spikes: When Durability Becomes Your p99

Redis AOF can turn durability into p99 latency spikes through fsync pressure and copy-on-write overhead during rewrite forks. A runbook to confirm the cause, mitigate safely, and add guardrails.

January 9, 2026

tcpdump Sees SYNs, but the Service Times Out: The Listen Backlog Trap

Clients time out, tcpdump shows SYNs (sometimes even SYN-ACK), yet your app logs nothing. The culprit is often the Linux listen/accept queues overflowing under load or CPU starvation.

January 3, 2026

Packets Arrive but the App Times Out: The rp_filter Trap in Kubernetes

tcpdump shows packets arriving, but your application sees nothing. The culprit: Linux reverse path filtering silently dropping packets before they reach the iptables filter chains, triggered by asymmetric routing in multi-homed setups.

December 12, 2025

'No Space Left on Device' with 40% Disk Free: The Inode and OverlayFS Death Spiral

df -h shows 40% free, but your container keeps crashing with ENOSPC. The culprit: inode exhaustion on overlayfs layers, invisible to standard monitoring.

December 7, 2025

Kubernetes OOM Killer: Why Your Container Dies at 50% Memory

Container memory limit is 4GB but OOM kills at 2GB used. Kernel buffers, page cache, and cgroup accounting tricks cause early OOMKills. Here's the full picture.

November 16, 2025

Kubernetes p99 Spikes Without OOM: Diagnosing cgroup v2 memory.high with PSI

Use PSI and cgroup v2 memory.high to explain p99 spikes without OOMKills. Kubernetes runbook with commands, safe mitigations, diffs, and alerts.

October 25, 2025

eBPF Off-CPU Analysis: Finding Latency That Metrics Miss

CPU is at 20% but latency is 500ms. Standard profilers show nothing. The app is waiting, not computing. I show how to use eBPF to find what it's waiting for.

September 7, 2025

Linux Page Cache Thrashing in Containers: When Free Memory Isn't Free

Your container has 2GB free but runs slow. Page cache counts against the memory limit, and file I/O forces code pages out. I explain with benchmarks and solutions.

August 6, 2025

Kubernetes conntrack Table Exhaustion: The Silent Packet Killer

Random DNS timeouts, dropped connections, services timing out. Your nf_conntrack table is full. I show how to diagnose, monitor, and fix this Kubernetes networking issue.

June 3, 2025

eBPF Run-Queue Latency: Finding the Off-CPU Bottleneck

CPU utilization is low but requests are slow. The hidden culprit: time spent waiting in the scheduler run-queue, invisible to traditional profilers but visible with eBPF off-CPU analysis.

February 17, 2025

Linux ARP Cache Stale Entries: Failover Traffic Blackhole

Traffic goes to the old server after failover. The cause: the Linux ARP cache retains the MAC address of the failed node, sending packets to an unreachable destination for minutes.

February 14, 2025

Kubernetes Ghost Connections: Stale Conntrack DNAT Entries

Service returns wrong pod IPs after scaling. The cause: Linux conntrack keeps DNAT entries alive longer than pods exist, routing traffic to deleted endpoints.

February 5, 2025

Ephemeral Port Exhaustion: The Node That 'Goes Bad'

A single Kubernetes node starts failing connections to external services while pods look healthy. The hidden cause: sidecar proxies exhausting ephemeral ports with short-lived connections.

November 11, 2024

TCP TIME_WAIT Port Exhaustion: When Connection Pooling Isn't Enough

The service can't connect to the database: 'cannot assign requested address'. The cause: ephemeral ports exhausted by thousands of sockets stuck in the TIME_WAIT state.

October 28, 2024