#kubernetes

50 posts

Redis AOF fsync Latency Spikes: When Durability Becomes Your p99

Redis AOF can turn durability into p99 spikes: fsync pressure and rewrite fork CoW. Runbook to confirm, mitigate safely, and add guardrails.

January 9, 2026

tcpdump Sees SYNs, but the Service Times Out: The Listen Backlog Trap

Clients time out, tcpdump shows SYNs (sometimes even SYN-ACK), yet your app logs nothing. The culprit is often the Linux listen/accept queues overflowing under load or CPU starvation.

January 3, 2026

ingress-nginx Reload Storms: Why 502 Spikes Track Ingress Churn

NGINX Ingress reload storms can drop keep-alives and cause 502 spikes. Runbook to prove reload impact, reduce churn, and harden graceful reload.

December 28, 2025

The Cert Isn't Expired, Your Node Is: Time Drift Breaking TLS and JWT in Kubernetes

Sporadic TLS handshake failures and JWT rejections across services. Everything passes when you check. The culprit: a node's clock drifted or jumped, and NTP fixed it before you could catch it.

December 26, 2025

Prometheus remote_write backpressure: when monitoring fills the disk (and still loses data)

A practical runbook for remote_write outages: measure lag, estimate time-to-disk-full, tune queue_config safely, and choose explicit survival trade-offs.

December 24, 2025

Packets Arrive but the App Times Out: The rp_filter Trap in Kubernetes

tcpdump shows packets arriving, but your application sees nothing. The culprit: Linux reverse path filtering silently dropping packets before they reach iptables, triggered by asymmetric routing in multi-homed setups.

December 12, 2025

Kubernetes TLS Certificate Rotation: The 3AM Outage

Certificate expired at 3AM, service down. cert-manager renewal failed silently. I show how to monitor certificates, test rotation, and prevent cert-related outages.

December 9, 2025

'No Space Left on Device' with 40% Disk Free: The Inode and OverlayFS Death Spiral

df -h shows 40% free. But your container keeps crashing with ENOSPC. The culprit: inode exhaustion on overlayfs layers, invisible to standard monitoring.

December 7, 2025

OpenTelemetry Collector Backpressure: Fixing Drops with memory_limiter and Queues

OpenTelemetry Collector drops spans under load when exporters apply backpressure. Fix with memory_limiter, queues, and batch tuning, with commands to verify.

December 4, 2025

CSI VolumeAttachment Stuck: Pods in ContainerCreating and Drains That Never Finish

Pods stuck in ContainerCreating often hide a stuck CSI VolumeAttachment. Runbook to find the blocker, detach safely, prevent data loss, and add alerts.

November 30, 2025

RSS Contracts: Stop OOMKilled Java Pods in Kubernetes by Testing RSS as an API

Use cgroup RSS budgets, CI sampling, and runtime headroom to catch JVM memory regressions before they hit production.

November 27, 2025

Pods Stuck in Terminating: A Production Decision Tree for Finalizers, Volumes, and Dead Nodes

A conservative runbook to unstick Pods safely: finalizers, CSI/volume cleanup stalls, dead nodes, and when (and how) to force-delete.

November 26, 2025

Kubernetes Graceful Shutdown as a Contract: Zero 502s During Rollouts (HTTP + gRPC)

A reproducible way to eliminate rollout 502/ECONNRESET: readiness-driven draining, preStop, SIGTERM handling, and a measurable drain budget.

November 22, 2025

Ephemeral-Storage Evictions in Kubernetes: The Log Storm That Took Down Healthy Pods

Pods get evicted for ephemeral-storage while disk looks free. Debug nodefs/imagefs, container logs, kubelet GC, then enforce budgets and log rotation.

November 18, 2025

Kubernetes OOM Killer: Why Your Container Dies at 50% Memory

Container memory limit is 4GB but OOM kills at 2GB used. Kernel buffers, page cache, and cgroup accounting tricks cause early OOMKills. Here's the full picture.

November 16, 2025

Kubernetes APF Starvation: When One Controller Makes kubectl Hang

APF can starve your Kubernetes API: kubectl hangs, controllers time out, and 429s spike. Runbook to isolate the noisy client, fix FlowSchemas, and prove it.

November 14, 2025

Kubernetes DNS: The ndots:5 Latency Tax

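With the default ndots:5, a name with fewer than five dots is tried against every search domain before it is resolved as written, so a single external lookup can cost several failed queries and 100ms+ of latency. Here's how to fix it properly.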

November 10, 2025

Envoy Outlier Detection Brownouts: When the Mesh Ejects Healthy Pods

Debug Istio/Envoy outlier detection brownouts: why healthy pods get ejected and 503s spike in production. Includes xDS checks, safe fixes, and alerting.

November 6, 2025

Go GOMAXPROCS in Containers: The CPU Detection Problem

Go sees 64 host CPUs, but your container has a 2-CPU limit. GOMAXPROCS=64 causes excessive context switching and throttling. Here's the fix.

November 5, 2025

Envoy/Istio 503 UF/UO/UT: When the Mesh, Not the App, Is Your Outage

Envoy/Istio can return 503 UF/UO/UT when connection pools overflow. Decode flags, inspect proxy stats, patch DestinationRules, and verify fast.

November 2, 2025

Cilium BPF Conntrack Map Exhaustion: Random Resets While conntrack Looks Fine

Random resets with Cilium? Learn how eBPF conntrack (CT) maps fill up, why netfilter conntrack looks fine, and how to size + verify fixes in Kubernetes.

October 29, 2025

Python GIL and Kubernetes CPU Limits: The Threading Trap

Your Python app has 4 threads, but K8s gives it 1 CPU. GIL + CFS quota = severe throttling. I show why and how to configure workers correctly.

October 27, 2025

Kubernetes p99 Spikes Without OOM: Diagnosing cgroup v2 memory.high with PSI

Use PSI and cgroup v2 memory.high to explain p99 spikes without OOMKills. Kubernetes runbook with commands, safe mitigations, diffs, and alerts.

October 25, 2025

Kubernetes CPU Throttling Autopsy: Why p99 Latency Explodes at 40% CPU Usage

CPU looks OK but tail latency is catastrophic. I'll show how to correlate CFS throttling with latency spikes and why removing CPU limits can paradoxically help.

October 19, 2025

CI/CD for Monorepo: Speed, Caching, Selective Tests and Supply-Chain Security

A complete blueprint for efficient CI/CD pipelines in a monorepo, from path filters through remote caching to SBOM and SLSA. Practical solutions, not theory.

October 4, 2025

JVM Native Memory in Kubernetes: Why Your Pod Gets OOMKilled with 50% Heap

Heap is 50% full but pod gets OOMKilled. I'll show how to track native memory (Metaspace, threads, NIO) and prevent container memory issues.

August 16, 2025

gRPC in Kubernetes: Why Service Round-Robin Lies

Why one pod gets 90% of the traffic with gRPC. A reproducible lab, solutions from client-side LB to service mesh, and a production checklist.

August 11, 2025

Linux Page Cache Thrashing in Containers: When Free Memory Isn't Free

Your container has 2GB free but runs slow. Page cache counts against memory limit. File I/O forces code pages out. I explain with benchmarks and solutions.

August 6, 2025

HTTP Keep-Alive Connection Reset: Why Your Requests Fail with 'Connection Reset by Peer'

Sporadic 'connection reset by peer' errors in production. I'll show how keep-alive timeout mismatches between client and server cause this and how to fix it.

July 16, 2025

Tail-Based Sampling in OpenTelemetry: Sizing, Memory Crashes and Cost Model

Practical sizing guide for tail sampling in OpenTelemetry Collector. From decision_wait through memory limits to cost-benefit analysis.

June 21, 2025

Kubernetes conntrack Table Exhaustion: The Silent Packet Killer

Random DNS timeouts, dropped connections, services timing out. Your nf_conntrack table is full. I show how to diagnose, monitor, and fix this Kubernetes networking issue.

June 3, 2025

CoreDNS vs NodeLocal DNS Cache: Cutting Kubernetes DNS Latency by 10x

Your pods make 100 DNS queries per request. CoreDNS is a bottleneck. I benchmark NodeLocal DNS cache and show configuration for production.

May 8, 2025

Kubernetes Cross-Zone Traffic: The Hidden Cost Eating Your Cloud Bill

Your AWS bill has $5000/month in data transfer. Half is cross-zone traffic within your cluster. I show how to measure and reduce it.

April 8, 2025

Kubernetes Rollout Without DB Outage: How to Stop PostgreSQL Connection Storm

A reproducible lab demonstrating the connection storm during K8s rollouts. PgBouncer, preStop hooks, and jitter: practical solutions with benchmarks.

April 1, 2025

Java Profiling in Hardened Kubernetes: When Security Blocks Your Debugger

Can't attach a profiler to a production JVM: seccomp blocks perf_event_open, the container drops CAP_SYS_PTRACE, and PodSecurityPolicy prevents privileged mode. Here's how to profile anyway.

March 7, 2025

Go cgo DNS Resolution Thread Explosion: When net.LookupHost Spawns Thousands of Threads

Go application suddenly has 10,000 threads consuming all memory. The cause: cgo-based DNS resolution blocking in slow DNS environments, bypassing Go's goroutine scheduler.

February 25, 2025

Gossip Protocol Ghost Nodes: IP Reuse Haunting Your Cluster

New node joins cluster but gets shunned. Old node's IP is still in gossip protocol's failure detection blacklist. The zombie membership record lives on.

February 10, 2025

Kubernetes Ghost Connections: Stale Conntrack DNAT Entries

Service returns wrong pod IPs after scaling. The cause: Linux conntrack keeps DNAT entries alive longer than pods exist, routing traffic to deleted endpoints.

February 5, 2025

Redis Cluster Slot Migration: Temporary Memory Explosion

Redis nodes OOMKilled during cluster rebalancing. The cause: slot migration copies keys to destination before deleting from source, temporarily doubling memory usage.

January 27, 2025

Java OOMKilled With Stable Heap: Native Memory, Direct Buffers, and glibc Arenas

Heap metrics look fine, GC is happy, but the container keeps dying. The culprit: native memory from direct buffers, JNI, and glibc memory allocator fragmentation.

January 20, 2025

The Ghost Pod: Why Your Service Still Sends Traffic to Dead Endpoints

Random ECONNRESET on some nodes but not others. Endpoints look fine. The culprit: conntrack NAT entries keeping long-lived connections pinned to pods that no longer exist.

January 5, 2025

JVM Metaspace OOM in Kubernetes: Why MaxMetaspaceSize Alone Won't Save You

Pod OOMKilled despite MaxMetaspaceSize being set. The cause: Metaspace grows outside the heap, the container memory limit doesn't account for it, and class unloading isn't happening.

December 23, 2024

etcd Watch Replay Storms: When Giant ConfigMaps Kill the Control Plane

The apiserver becomes 'randomly slow'. Root cause: large, frequently updated ConfigMaps trigger watch compaction, causing thousands of controllers to relist simultaneously.

December 5, 2024

etcd Quota Alarm: When Your Kubernetes Cluster Goes Read-Only

Cluster stops accepting writes and pods can't schedule. The cause: compaction wasn't running, so history accumulated until etcd hit its storage quota.

November 27, 2024

Kubernetes Headless Service DNS: Stale Records After Pod Deletion

Requests go to non-existent pods. The cause: headless Service DNS records persist in client DNS caches after pods are deleted, before the endpoint update propagates.

November 22, 2024

Traffic Hitting Dead Pods: Conntrack's Stale NAT Mapping

Deploy causes 503s for exactly 2 minutes. The issue: conntrack keeps NAT mappings to old pod IPs even after Kubernetes removes endpoints, sending traffic to dead pods.

November 14, 2024

Ephemeral Port Exhaustion: The Node That 'Goes Bad'

A single Kubernetes node starts failing connections to external services while pods look healthy. The hidden cause: sidecar proxies exhausting ephemeral ports with short-lived connections.

November 11, 2024

PMTU Blackholes: When Only Large Responses Hang

Small API responses work, large ones hang forever. The cause: ICMP 'Fragmentation Needed' messages filtered by firewalls, breaking Path MTU Discovery in overlay networks.

November 7, 2024

kube-proxy Micro-Outages: The xtables Lock Contention Problem

Random 1-3 second connection drops during deployments. CPU looks fine, memory is stable. The hidden cause: iptables-restore grabbing the xtables lock while endpoints churn.

November 4, 2024

VXLAN Random Packet Drops: The Checksum Offload Trap

Cross-node gRPC calls randomly fail but local traffic works fine. The culprit: TX checksum offload corrupting VXLAN headers on specific NIC drivers. Here's how to diagnose and fix.

October 21, 2024