#performance

49 posts

Redis AOF fsync Latency Spikes: When Durability Becomes Your p99

Redis AOF can turn durability into p99 spikes: fsync pressure and rewrite fork CoW. Runbook to confirm, mitigate safely, and add guardrails.

January 9, 2026

Prometheus WAL Replay Hell: Slow Restarts and Missing Alerts

When Prometheus takes minutes or hours to restart, WAL replay is often the culprit. Prove it from logs and disk, recover safely, and prevent it.

January 5, 2026

tcpdump Sees SYNs, but the Service Times Out: The Listen Backlog Trap

Clients time out, tcpdump shows SYNs (sometimes even SYN-ACK), yet your app logs nothing. The culprit is often the Linux listen/accept queues overflowing under load or CPU starvation.

January 3, 2026

ingress-nginx Reload Storms: Why 502 Spikes Track Ingress Churn

NGINX Ingress reload storms can drop keep-alives and cause 502 spikes. Runbook to prove reload impact, reduce churn, and harden graceful reload.

December 28, 2025

EXPLAIN Lied to You: The PostgreSQL Prepared Statement Plan Cliff

Your EXPLAIN looks perfect but production melts. The culprit: PostgreSQL silently switched from a custom plan to a generic plan after enough executions, and the generic plan is catastrophically wrong.

December 24, 2025

Prometheus Native Histograms in Production: Rollout Plan, Budgets, and Failure Modes

Prometheus native histograms can blow up memory, WAL, and remote_write. This guide shows a staged rollout, budgets, and concrete queries to verify safety.

December 20, 2025

PostgreSQL Checkpoint Spikes: Why p99 Explodes Every N Minutes

A reproducible approach to diagnose and eliminate checkpoint-induced latency spikes using pgbench, pg_stat_bgwriter, and WAL/IO budgeting.

December 8, 2025

Database Connection Pool Exhaustion: The Silent Outage Trigger

App hangs but the database looks healthy. Your pool is exhausted. I show how to detect it, size pools sanely, and prevent connection leaks.

November 30, 2025

ClickHouse ReplacingMergeTree: The Deduplication Illusion

ReplacingMergeTree doesn't dedupe on SELECT. It merges eventually. Your queries return duplicates until a background merge runs. Here's how to handle it.

November 13, 2025

Kafka Consumer Rebalance Storms: Why Scaling Consumers Can Increase Lag

Kafka consumer rebalances can make lag worse when you scale out. Diagnose max.poll.interval.ms, heartbeats, and assignment strategy; apply safe config diffs.

November 10, 2025

Kubernetes DNS: The ndots:5 Latency Tax

With the default ndots:5, K8s tries every search domain before the real name, so an external lookup can burn several failed queries before succeeding and add 100ms+ of latency. Here's how to fix it properly.

November 10, 2025

Go GOMAXPROCS in Containers: The CPU Detection Problem

Go sees 64 host CPUs but your container has a 2-CPU limit. GOMAXPROCS=64 causes excessive context switching and throttling. Here's the fix.

November 5, 2025

Python GIL and Kubernetes CPU Limits: The Threading Trap

Your Python app has 4 threads but K8s gives it 1 CPU. GIL + CFS quota = severe throttling. I show why and how to configure workers correctly.

October 27, 2025

Kubernetes p99 Spikes Without OOM: Diagnosing cgroup v2 memory.high with PSI

Use PSI and cgroup v2 memory.high to explain p99 spikes without OOMKills. Kubernetes runbook with commands, safe mitigations, diffs, and alerts.

October 25, 2025

Connection Pool Sizing with Little's Law: Mathematical Approach to HikariCP and PgBouncer

Pool size 50 because that's how it's always been? I'll show how to use Little's Law to calculate optimal pool size and prove it with load tests.

October 22, 2025

Kubernetes CPU Throttling Autopsy: Why p99 Latency Explodes at 40% CPU Usage

CPU looks OK but tail latency is catastrophic. I'll show how to correlate CFS throttling with latency spikes and why removing CPU limits can paradoxically help.

October 19, 2025

Elasticsearch Hot Shard Problem: When One Node Does All the Work

5 data nodes but one is at 100% CPU. Uneven routing keys create hot shards. I show how to detect skew and fix it with routing strategies.

October 16, 2025

UUIDv4 vs ULID vs TSID: Impact on PostgreSQL B-Tree Indexes After 100M Records

Random UUIDs as primary keys cause index bloat and random I/O. Benchmark with specific numbers: index size, cache hit ratio, and WAL volume after 100M inserts.

October 14, 2025

JWT Revocation Strategies: When Stateless Tokens Need State

A user is compromised and you need to revoke their JWT immediately. But JWTs are immutable. I compare allowlist, denylist, and short expiration with performance benchmarks.

October 12, 2025

Structured Logging Performance: When Your Logger Becomes the Bottleneck

At 50k logs/sec, JSON serialization eats 30% CPU. Standard library encoding/json is slow. I benchmark zap vs zerolog vs slog with real numbers.

September 28, 2025

PostgreSQL HOT Updates + FILLFACTOR: How to Reduce Index Bloat by 60%

Vacuum runs successfully but disk keeps growing and cache hit ratio drops. I'll show how to quantify HOT-update eligibility using pgstattuple and optimize fillfactor.

September 23, 2025

When Prepared Statements Make PostgreSQL 10× Slower: Generic Plan Trap

Same query, same params, but prod is slow and staging works fine. I'll show how to reproduce the generic plan problem with PgBouncer and Java/Go, and how to fix it.

September 15, 2025

eBPF Off-CPU Analysis: Finding Latency That Metrics Miss

CPU is at 20% but latency is 500ms. Standard profilers show nothing. The app is waiting, not computing. I show how to use eBPF to find what it's waiting for.

September 7, 2025

PostgreSQL Autovacuum SLO Tuning: How to Configure Vacuum for 200M Rows and 5k UPSERT/s

Autovacuum is either ignored or cargo-cult tuned. I'll show how to turn it into an SLO-driven system with specific numbers, pg_stat metrics, and reproducible tests.

September 4, 2025

Java Virtual Threads vs Reactive: When to Drop WebFlux for Project Loom

Virtual Threads in Java 21 promise simpler code than Reactive. I benchmark both under 10k concurrent connections and show where each wins.

August 27, 2025

gRPC Deadline Propagation: Preventing Cascading Failures

Frontend gives up after 5s but backend keeps working for 30s. Without deadline propagation, you waste resources on doomed requests. I show how to implement it in Go.

August 23, 2025

JVM Native Memory in Kubernetes: Why Your Pod Gets OOMKilled with 50% Heap

Heap is 50% full but pod gets OOMKilled. I'll show how to track native memory (Metaspace, threads, NIO) and prevent container memory issues.

August 16, 2025

gRPC in Kubernetes: Why Service Round-Robin Lies

Why one pod gets 90% of the traffic with gRPC. Reproducible lab, solutions from client-side LB to service mesh, and production checklist.

August 11, 2025

Linux Page Cache Thrashing in Containers: When Free Memory Isn't Free

Your container has 2GB free but runs slow. Page cache counts against the memory limit. File I/O forces code pages out. I explain with benchmarks and solutions.

August 6, 2025

Prometheus Cardinality Explosion: Detection, Prevention, and Recovery

One developer added a user_id label. Prometheus OOM'd. I show how to detect high-cardinality metrics before they kill your monitoring, with relabel configs to drop them.

July 23, 2025

PostgreSQL TOAST Strategy: Why Your JSON Column Kills Query Performance

SELECT * on a table with JSON is 10x slower than expected. I'll show how TOAST storage works and when to change strategies for large columns.

June 24, 2025

Tail-Based Sampling in OpenTelemetry: Sizing, Memory Crashes and Cost Model

Practical sizing guide for tail sampling in OpenTelemetry Collector. From decision_wait through memory limits to cost-benefit analysis.

June 21, 2025

Cache Stampede Prevention: Probabilistic Early Expiration (X-Fetch)

100 requests hit expired cache simultaneously. All 100 query the database. I implement the X-Fetch algorithm that refreshes cache before expiration without locks.

June 14, 2025

Redis Memory Fragmentation: When maxmemory Isn't Enough

Your Redis has 4GB maxmemory but RSS shows 6GB. OOM killer strikes. I explain jemalloc fragmentation with reproduction steps and activedefrag tuning.

May 22, 2025

CoreDNS vs NodeLocal DNS Cache: Cutting Kubernetes DNS Latency by 10x

Your pods make 100 DNS queries per request. CoreDNS is a bottleneck. I benchmark NodeLocal DNS cache and show configuration for production.

May 8, 2025

GIN Index Pending List Overflow: Fast Writes, Slow Searches

Full-text search was fast, now it's slow. The cause: GIN index pending list grew huge during bulk inserts, and every search must now scan the unsorted pending entries.

April 17, 2025

Adaptive Concurrency Limits: Stop Guessing Thread Pool Sizes

Thread pool 200 because that's what Stack Overflow says? Netflix's algorithm adjusts concurrency automatically based on latency. I show how it works with benchmarks.

April 11, 2025

The Soft Delete Trap: Why is_deleted Kills Your Database (And What To Do)

A practical analysis of why soft delete destroys database performance over time. Benchmarks, partitioning solution, and migration checklist.

March 23, 2025

Java Profiling in Hardened Kubernetes: When Security Blocks Your Debugger

Can't attach a profiler to a production JVM. seccomp blocks perf_event_open, the container drops CAP_SYS_PTRACE, and PodSecurityPolicy prevents privileged mode. Here's how to profile anyway.

March 7, 2025

PostgreSQL Partial Index: Planner Ignores Your Index

Query scans the full table despite a perfect partial index. The cause: the query's WHERE clause doesn't match the index predicate exactly, or statistics mislead the planner.

March 4, 2025

Go cgo DNS Resolution Thread Explosion: When net.LookupHost Spawns Thousands of Threads

Go application suddenly has 10,000 threads consuming all memory. The cause: cgo-based DNS resolution blocking in slow DNS environments, bypassing Go's goroutine scheduler.

February 25, 2025

eBPF Run-Queue Latency: Finding the Off-CPU Bottleneck

CPU utilization is low but requests are slow. The hidden culprit: time spent waiting in the scheduler run-queue, invisible to traditional profilers but visible with eBPF off-CPU analysis.

February 17, 2025

PostgreSQL Read Replica Conflicts: Why Your Queries Get Canceled

Queries on read replicas fail with 'canceling statement due to conflict with recovery'. The fix depends on which of the 5 conflict types you have. Here's how to diagnose and solve each one.

January 28, 2025

Go p99 Latency Cliffs: Nested context.WithTimeout Timer Storms

Periodic latency spikes that look like network jitter. The real cause: nested timeouts creating thousands of timers that pressure the Go runtime timer heap and trigger GC scanning.

January 15, 2025

PostgreSQL OOM by Design: work_mem × Parallel Workers × Plan Nodes

work_mem looks small at 256MB, but a parallel hash join with 4 workers across 3 plan nodes uses 3GB. Here's how to prevent PostgreSQL from legitimately OOMing your container.

December 28, 2024

The Index That Killed Write Performance: Losing PostgreSQL HOT Updates

Adding an index for performance made writes 10x slower. The counter-intuitive cause: the new index broke HOT updates, turning cheap heap-only updates into updates that must also write to every index, with massive bloat.

December 19, 2024

etcd Watch Replay Storms: When Giant ConfigMaps Kill the Control Plane

The apiserver becomes 'randomly slow'. Root cause: large, frequently updated ConfigMaps trigger watch compaction, causing thousands of controllers to relist simultaneously.

December 5, 2024

kube-proxy Micro-Outages: The xtables Lock Contention Problem

Random 1-3 second connection drops during deployments. CPU looks fine, memory is stable. The hidden cause: iptables-restore grabbing the xtables lock while endpoints churn.

November 4, 2024

TCP TIME_WAIT Port Exhaustion: When Connection Pooling Isn't Enough

Service can't connect to the database: 'cannot assign requested address'. The cause: ephemeral ports exhausted by thousands of sockets stuck in TIME_WAIT state.

October 28, 2024