Michal Drozd

Michal Drozd - Blog
Articles about software development, architecture and technologies.
https://www.michal-drozd.com/en

Build a Solana Escrow Program for Service Marketplaces (Anchor Blueprint)
https://www.michal-drozd.com/en/blog/solana-escrow-program-marketplace/
A practical Solana escrow architecture for marketplaces: account model, instruction set, security invariants, and production rollout plan.
Tue, 24 Feb 2026 00:00:00 GMT

Solana in 2026: Use Cases That Actually Ship
https://www.michal-drozd.com/en/blog/solana-use-cases-2026/
A practical map of real Solana use cases in 2026: stablecoin payments, embedded Actions, and operations patterns teams can implement this quarter.
Fri, 20 Feb 2026 00:00:00 GMT

Redis AOF fsync Latency Spikes: When Durability Becomes Your p99
https://www.michal-drozd.com/en/blog/redis-aof-fsync-latency-spikes/
Redis AOF can turn durability into p99 spikes: fsync pressure and rewrite fork CoW. Runbook to confirm, mitigate safely, and add guardrails.
Fri, 09 Jan 2026 00:00:00 GMT

Prometheus WAL Replay Hell: Slow Restarts and Missing Alerts
https://www.michal-drozd.com/en/blog/prometheus-wal-replay-slow-startup/
When Prometheus takes minutes or hours to restart, WAL replay is often the culprit. Prove it from logs and disk, recover safely, and prevent it.
Mon, 05 Jan 2026 00:00:00 GMT

tcpdump Sees SYNs, but the Service Times Out: The Listen Backlog Trap
https://www.michal-drozd.com/en/blog/tcpdump-syn-no-accept-backlog-trap/
Clients time out, tcpdump shows SYNs (sometimes even SYN-ACK), yet your app logs nothing.
The culprit is often the Linux listen/accept queues overflowing under load or CPU starvation.
Sat, 03 Jan 2026 00:00:00 GMT

PostgreSQL Logical Replication Lag: Big Transactions and Reorder Buffer Spills
https://www.michal-drozd.com/en/blog/postgresql-logical-replication-lag-big-transactions/
One huge transaction can pin logical replication for hours. Runbook to detect the blocker, tune decoding safely, and enforce bounded transactions in prod.
Thu, 01 Jan 2026 00:00:00 GMT

Span Contracts: Trace-Driven API Contract Testing with OpenTelemetry
https://www.michal-drozd.com/en/blog/span-contracts-otel-contract-testing/
Detect API breaking changes by hashing response shapes from OTel spans and fail CI without storing payloads.
Wed, 31 Dec 2025 00:00:00 GMT

Circuit Breaker Anti-Patterns: When Protection Causes Outages
https://www.michal-drozd.com/en/blog/circuit-breaker-anti-patterns/
Circuit breakers prevent cascading failures but wrong config makes them worse. I show 5 anti-patterns: shared breakers, wrong thresholds, instant open, no fallback, and testing gaps.
Mon, 29 Dec 2025 00:00:00 GMT

ingress-nginx Reload Storms: Why 502 Spikes Track Ingress Churn
https://www.michal-drozd.com/en/blog/ingress-nginx-reload-storms/
NGINX Ingress reload storms can drop keep-alives and cause 502 spikes. Runbook to prove reload impact, reduce churn, and harden graceful reload.
Sun, 28 Dec 2025 00:00:00 GMT

The Cert Isn't Expired, Your Node Is: Time Drift Breaking TLS and JWT in Kubernetes
https://www.michal-drozd.com/en/blog/time-drift-tls-jwt-kubernetes/
Sporadic TLS handshake failures and JWT rejections across services.
Everything passes when you check. The culprit: a node's clock drifted or jumped, and NTP fixed it before you could catch it.
Fri, 26 Dec 2025 00:00:00 GMT

EXPLAIN Lied to You: The PostgreSQL Prepared Statement Plan Cliff
https://www.michal-drozd.com/en/blog/postgresql-prepared-statement-plan-cliff/
Your EXPLAIN looks perfect but production melts. The culprit: PostgreSQL silently switched from a custom plan to a generic plan after enough executions, and the generic plan is catastrophically wrong.
Wed, 24 Dec 2025 00:00:00 GMT

Prometheus remote_write backpressure: when monitoring fills the disk (and still loses data)
https://www.michal-drozd.com/en/blog/prometheus-remote-write-backpressure/
A practical runbook for remote_write outages: measure lag, estimate time-to-disk-full, tune queue_config safely, and choose explicit survival trade-offs.
Wed, 24 Dec 2025 00:00:00 GMT

Cardinality Contracts: Prometheus Labels as an API with Budgets
https://www.michal-drozd.com/en/blog/cardinality-contracts-prometheus-label-budgets/
Define label budgets, enforce them in CI, and add a runtime firewall to stop cardinality explosions before production.
Sun, 21 Dec 2025 00:00:00 GMT

Prometheus Native Histograms in Production: Rollout Plan, Budgets, and Failure Modes
https://www.michal-drozd.com/en/blog/prometheus-native-histograms-production/
Prometheus native histograms can blow up memory, WAL, and remote_write.
This guide shows a staged rollout, budgets, and concrete queries to verify safety.
Sat, 20 Dec 2025 00:00:00 GMT

Works in psql, Flaky in Prod: PgBouncer's Silent Murder of LISTEN/NOTIFY
https://www.michal-drozd.com/en/blog/pgbouncer-listen-notify-transaction-pooling/
PostgreSQL LISTEN/NOTIFY works perfectly in local testing but notifications randomly stop arriving in production. The culprit: transaction pooling quietly reassigning your connection to someone else.
Thu, 18 Dec 2025 00:00:00 GMT

PostgreSQL XID Wraparound: Emergency Playbook for Vacuum Freeze Under Fire
https://www.michal-drozd.com/en/blog/postgresql-xid-wraparound-emergency-playbook/
PostgreSQL can go read-only near XID wraparound. Use this emergency playbook to find the oldest tables, unblock vacuum freeze, and prevent repeat incidents.
Tue, 16 Dec 2025 00:00:00 GMT

Dash Contracts in Go: CI Compiler for Grafana Dashboards and Prometheus Alerts
https://www.michal-drozd.com/en/blog/dash-contracts-grafana-alerts-ci/
Extract PromQL from dashboards and rules, verify selectors against /metrics, and fail CI before dashboards go dark.
Mon, 15 Dec 2025 00:00:00 GMT

Packets Arrive but the App Times Out: The rp_filter Trap in Kubernetes
https://www.michal-drozd.com/en/blog/linux-rp-filter-asymmetric-routing/
tcpdump shows packets arriving, but your application sees nothing.
The culprit: Linux reverse path filtering silently dropping packets before they reach iptables, triggered by asymmetric routing in multi-homed setups.
Fri, 12 Dec 2025 00:00:00 GMT

hot_standby_feedback Bloat Trap: Fixing Replica Conflicts by Slowly Killing the Primary
https://www.michal-drozd.com/en/blog/postgresql-hot-standby-feedback-bloat/
hot_standby_feedback stops replica query cancellations but can bloat the primary over days. Detect xmin pinning, mitigate safely, and add guardrails.
Fri, 12 Dec 2025 00:00:00 GMT

Kubernetes TLS Certificate Rotation: The 3AM Outage
https://www.michal-drozd.com/en/blog/kubernetes-tls-certificate-rotation/
Certificate expired at 3AM, service down. cert-manager renewal failed silently. I show monitoring, testing rotation, and preventing cert-related outages.
Tue, 09 Dec 2025 00:00:00 GMT

PostgreSQL Checkpoint Spikes: Why p99 Explodes Every N Minutes
https://www.michal-drozd.com/en/blog/postgresql-checkpoint-spikes/
A reproducible approach to diagnose and eliminate checkpoint-induced latency spikes using pgbench, pg_stat_bgwriter, and WAL/IO budgeting.
Mon, 08 Dec 2025 00:00:00 GMT

'No Space Left on Device' with 40% Disk Free: The Inode and OverlayFS Death Spiral
https://www.michal-drozd.com/en/blog/kubernetes-inode-exhaustion-overlayfs/
df -h shows 40% free. But your container keeps crashing with ENOSPC.
The culprit: inode exhaustion on overlayfs layers, invisible to standard monitoring.
Sun, 07 Dec 2025 00:00:00 GMT

OpenTelemetry Collector Backpressure: Fixing Drops with memory_limiter and Queues
https://www.michal-drozd.com/en/blog/otel-collector-backpressure-memory-limiter/
OpenTelemetry Collector drops spans under load when exporters backpressure. Fix with memory_limiter, queues, and batch tuning, with commands to verify.
Thu, 04 Dec 2025 00:00:00 GMT

Database Connection Pool Exhaustion: The Silent Outage Trigger
https://www.michal-drozd.com/en/blog/database-connection-pool-exhaustion/
App hangs but the database looks healthy. Your pool is exhausted. I show how to detect it, size pools sanely, and prevent connection leaks.
Sun, 30 Nov 2025 00:00:00 GMT

CSI VolumeAttachment Stuck: Pods in ContainerCreating and Drains That Never Finish
https://www.michal-drozd.com/en/blog/kubernetes-volumeattachment-stuck-csi/
Pods stuck in ContainerCreating often hide a stuck CSI VolumeAttachment.
Runbook to find the blocker, detach safely, prevent data loss, and add alerts.
Sun, 30 Nov 2025 00:00:00 GMT

RSS Contracts: Stop OOMKilled Java Pods in Kubernetes by Testing RSS as an API
https://www.michal-drozd.com/en/blog/rss-contracts-jvm-oomkilled-kubernetes/
Use cgroup RSS budgets, CI sampling, and runtime headroom to catch JVM memory regressions before they hit production.
Thu, 27 Nov 2025 00:00:00 GMT

Pods Stuck in Terminating: A Production Decision Tree for Finalizers, Volumes, and Dead Nodes
https://www.michal-drozd.com/en/blog/kubernetes-pod-stuck-terminating-playbook/
A conservative runbook to unstick Pods safely: finalizers, CSI/volume cleanup stalls, dead nodes, and when (and how) to force-delete.
Wed, 26 Nov 2025 00:00:00 GMT

pg_waldump WAL Forensics: Reconstructing What Happened to Your Data
https://www.michal-drozd.com/en/blog/postgresql-wal-forensics/
Something deleted rows from production but nobody admits to running DELETE. Use pg_waldump to analyze WAL files and reconstruct exactly what happened and when.
Mon, 24 Nov 2025 00:00:00 GMT

Kubernetes Graceful Shutdown as a Contract: Zero 502s During Rollouts (HTTP + gRPC)
https://www.michal-drozd.com/en/blog/kubernetes-graceful-shutdown-rollouts/
A reproducible way to eliminate rollout 502/ECONNRESET: readiness-driven draining, preStop, SIGTERM handling, and a measurable drain budget.
Sat, 22 Nov 2025 00:00:00 GMT

5000 Unacked Messages and Climbing: Stop RabbitMQ Consumer Meltdowns in CI
https://www.michal-drozd.com/en/blog/rabbitmq-ack-contracts/
Queue looks healthy until deployment, then messages_unacknowledged explodes, memory spikes, and redelivery storms start.
The culprit: your prefetch is too high and nobody tested actual ack behavior.
Sat, 22 Nov 2025 00:00:00 GMT

Ephemeral-Storage Evictions in Kubernetes: The Log Storm That Took Down Healthy Pods
https://www.michal-drozd.com/en/blog/kubernetes-ephemeral-storage-eviction-log-storm/
Pods get evicted for ephemeral-storage while disk looks free. Debug nodefs/imagefs, container logs, kubelet GC, then enforce budgets and log rotation.
Tue, 18 Nov 2025 00:00:00 GMT

Kubernetes OOM Killer: Why Your Container Dies at 50% Memory
https://www.michal-drozd.com/en/blog/kubernetes-oom-killer-memory-limits/
Container memory limit is 4GB but OOM kills at 2GB used. Kernel buffers, page cache, and cgroup accounting tricks cause early OOMKills. Here's the full picture.
Sun, 16 Nov 2025 00:00:00 GMT

One Partition at 99% CPU: Stop Kafka Hotspots Before They Reach Production
https://www.michal-drozd.com/en/blog/kafka-partition-skew-contracts/
All partitions look balanced in testing, then production traffic arrives and one partition melts. The culprit: your partition key has terrible cardinality and nobody noticed until now.
Sat, 15 Nov 2025 00:00:00 GMT

Kubernetes APF Starvation: When One Controller Makes kubectl Hang
https://www.michal-drozd.com/en/blog/kubernetes-api-priority-fairness-starvation/
APF can starve your Kubernetes API: kubectl hangs, controllers timeout, and 429s spike.
Runbook to isolate the noisy client, fix FlowSchemas, and prove it.
Fri, 14 Nov 2025 00:00:00 GMT

ClickHouse ReplacingMergeTree: The Deduplication Illusion
https://www.michal-drozd.com/en/blog/clickhouse-replacingmergetree-deduplication/
ReplacingMergeTree doesn't dedupe on SELECT. It merges eventually. Your queries return duplicates until background merge runs. Here's how to handle it.
Thu, 13 Nov 2025 00:00:00 GMT

Kafka Consumer Rebalance Storms: Why Scaling Consumers Can Increase Lag
https://www.michal-drozd.com/en/blog/kafka-consumer-rebalance-storm/
Kafka consumer rebalances can make lag worse when you scale out. Diagnose max.poll interval, heartbeats, and assignment strategy; apply safe config diffs.
Mon, 10 Nov 2025 00:00:00 GMT

Kubernetes DNS: The ndots:5 Latency Tax
https://www.michal-drozd.com/en/blog/kubernetes-dns-caching-ndots/
Every DNS query in K8s makes 5 failed lookups before succeeding. ndots:5 default causes 100ms+ latency. Here's how to fix it properly.
Mon, 10 Nov 2025 00:00:00 GMT

Envoy Outlier Detection Brownouts: When the Mesh Ejects Healthy Pods
https://www.michal-drozd.com/en/blog/envoy-outlier-detection-brownouts/
Debug Istio/Envoy outlier detection brownouts: why healthy pods get ejected and 503s spike in production. Includes xDS checks, safe fixes, and alerting.
Thu, 06 Nov 2025 00:00:00 GMT

Go GOMAXPROCS in Containers: The CPU Detection Problem
https://www.michal-drozd.com/en/blog/go-gomaxprocs-containers/
Go sees 64 host CPUs but your container has 2 CPU limit. GOMAXPROCS=64 causes excessive context switching and throttling.
Here's the fix.
Wed, 05 Nov 2025 00:00:00 GMT

Envoy/Istio 503 UF/UO/UT: When the Mesh, Not the App, Is Your Outage
https://www.michal-drozd.com/en/blog/envoy-istio-503-uf-uo-ut/
Envoy/Istio can return 503 UF/UO/UT when connection pools overflow. Decode flags, inspect proxy stats, patch DestinationRules, and verify fast.
Sun, 02 Nov 2025 00:00:00 GMT

Architecture as Code: ADR, C4 Diagrams and CI Quality Gates
https://www.michal-drozd.com/en/blog/architecture-as-code/
A complete guide to implementing living documentation using Architecture Decision Records, C4 model, and CI/CD pipeline automation.
Fri, 31 Oct 2025 00:00:00 GMT

Cilium BPF Conntrack Map Exhaustion: Random Resets While conntrack Looks Fine
https://www.michal-drozd.com/en/blog/cilium-bpf-conntrack-map-exhaustion/
Random resets with Cilium? Learn how eBPF conntrack (CT) maps fill up, why netfilter conntrack looks fine, and how to size + verify fixes in Kubernetes.
Wed, 29 Oct 2025 00:00:00 GMT

Python GIL and Kubernetes CPU Limits: The Threading Trap
https://www.michal-drozd.com/en/blog/python-gil-kubernetes-cpu-limits/
Your Python app has 4 threads but K8s gives 1 CPU. GIL + CFS quota = severe throttling. I show why and how to configure workers correctly.
Mon, 27 Oct 2025 00:00:00 GMT

Kubernetes p99 Spikes Without OOM: Diagnosing cgroup v2 memory.high with PSI
https://www.michal-drozd.com/en/blog/cgroup-v2-memory-high-psi-kubernetes/
Use PSI and cgroup v2 memory.high to explain p99 spikes without OOMKills.
Kubernetes runbook with commands, safe mitigations, diffs, and alerts.
Sat, 25 Oct 2025 00:00:00 GMT

S3 Intelligent-Tiering: The Small Object Cost Trap
https://www.michal-drozd.com/en/blog/s3-intelligent-tiering-trap/
S3 Intelligent-Tiering saves money for large files but charges minimum 128KB overhead. For millions of small objects, it INCREASES costs. I show the math.
Sat, 25 Oct 2025 00:00:00 GMT

Connection Pool Sizing with Little's Law: Mathematical Approach to HikariCP and PgBouncer
https://www.michal-drozd.com/en/blog/connection-pool-littles-law/
Pool size 50 because that's how it's always been? I'll show how to use Little's Law to calculate optimal pool size and prove it with load tests.
Wed, 22 Oct 2025 00:00:00 GMT

Kubernetes CPU Throttling Autopsy: Why p99 Latency Explodes at 40% CPU Usage
https://www.michal-drozd.com/en/blog/k8s-cpu-throttling/
CPU looks OK but tail latency is catastrophic. I'll show how to correlate CFS throttling with latency spikes and why removing CPU limits can paradoxically help.
Sun, 19 Oct 2025 00:00:00 GMT

Elasticsearch Hot Shard Problem: When One Node Does All the Work
https://www.michal-drozd.com/en/blog/elasticsearch-hot-shard-problem/
5 data nodes but one is at 100% CPU. Uneven routing keys create hot shards. I show how to detect skew and fix it with routing strategies.
Thu, 16 Oct 2025 00:00:00 GMT

UUIDv4 vs ULID vs TSID: Impact on PostgreSQL B-Tree Indexes After 100M Records
https://www.michal-drozd.com/en/blog/uuid-ulid-tsid-postgresql/
Random UUIDs as Primary Keys cause index bloat and random I/O.
Benchmark with specific numbers - index size, cache hit ratio, and WAL volume after 100M inserts.
Tue, 14 Oct 2025 00:00:00 GMT

JWT Revocation Strategies: When Stateless Tokens Need State
https://www.michal-drozd.com/en/blog/jwt-revocation-strategies/
User compromised, need to revoke JWT immediately. But JWTs are immutable. I compare allowlist, denylist, and short expiration with performance benchmarks.
Sun, 12 Oct 2025 00:00:00 GMT

Fields Disappeared But Nothing Crashed: Catch Schema Evolution Bugs Before Production
https://www.michal-drozd.com/en/blog/schema-evolution-contracts/
Producer upgraded Protobuf, consumer still on old version. No errors, no warnings, just silent data loss in production. Your schema evolution broke backward compatibility and CI didn't notice.
Wed, 08 Oct 2025 00:00:00 GMT

CI/CD for Monorepo: Speed, Caching, Selective Tests and Supply-Chain Security
https://www.michal-drozd.com/en/blog/cicd-monorepo/
A complete blueprint for efficient CI/CD pipelines in monorepo - from path filters through remote cache to SBOM and SLSA. Practical solutions, not theory.
Sat, 04 Oct 2025 00:00:00 GMT

Structured Logging Performance: When Your Logger Becomes the Bottleneck
https://www.michal-drozd.com/en/blog/structured-logging-performance/
At 50k logs/sec, JSON serialization eats 30% CPU. Standard library encoding/json is slow. I benchmark zap vs zerolog vs slog with real numbers.
Sun, 28 Sep 2025 00:00:00 GMT

PostgreSQL HOT Updates + FILLFACTOR: How to Reduce Index Bloat by 60%
https://www.michal-drozd.com/en/blog/postgresql-hot-updates-fillfactor/
Vacuum runs successfully but disk keeps growing and cache hit ratio drops.
I'll show how to quantify HOT-update eligibility using pgstattuple and optimize fillfactor.
Tue, 23 Sep 2025 00:00:00 GMT

Circuit Breaker vs Rate Limiter vs Bulkhead: When to Use Which Pattern
https://www.michal-drozd.com/en/blog/circuit-breaker-rate-limiter-bulkhead/
Three resilience patterns that are often confused. I'll show exactly when each prevents cascading failures and when it makes things worse with real metrics.
Fri, 19 Sep 2025 00:00:00 GMT

When Prepared Statements Make PostgreSQL 10× Slower: Generic Plan Trap
https://www.michal-drozd.com/en/blog/postgresql-prepared-statements-trap/
Same query, same params, but prod is slow and staging works fine. I'll show how to reproduce the generic plan problem with pgBouncer, Java/Go and how to fix it.
Mon, 15 Sep 2025 00:00:00 GMT

Logical Replication Slot WAL Bloat: When Subscribers Go Offline
https://www.michal-drozd.com/en/blog/logical-replication-slot-wal-retention/
Disk filling up with WAL files. The cause: a logical replication slot consumer went offline, and PostgreSQL retains all WAL since then because it might be needed.
Tue, 09 Sep 2025 00:00:00 GMT

eBPF Off-CPU Analysis: Finding Latency That Metrics Miss
https://www.michal-drozd.com/en/blog/ebpf-off-cpu-debugging/
CPU is at 20% but latency is 500ms. Standard profilers show nothing. The app is waiting, not computing. I show how to use eBPF to find what it's waiting for.
Sun, 07 Sep 2025 00:00:00 GMT

PostgreSQL Autovacuum SLO Tuning: How to Configure Vacuum for 200M Rows and 5k UPSERT/s
https://www.michal-drozd.com/en/blog/postgresql-autovacuum-slo/
Autovacuum is either ignored or cargo-cult tuned.
I'll show how to turn it into an SLO-driven system with specific numbers, pg_stat metrics, and reproducible tests.
Thu, 04 Sep 2025 00:00:00 GMT

Java Virtual Threads vs Reactive: When to Drop WebFlux for Project Loom
https://www.michal-drozd.com/en/blog/java-virtual-threads-vs-reactive/
Virtual Threads in Java 21 promise simpler code than Reactive. I benchmark both under 10k concurrent connections and show where each wins.
Wed, 27 Aug 2025 00:00:00 GMT

gRPC Deadline Propagation: Preventing Cascading Failures
https://www.michal-drozd.com/en/blog/grpc-deadline-propagation/
Frontend gives up after 5s but backend keeps working for 30s. Without deadline propagation, you waste resources on doomed requests. I show how to implement it in Go.
Sat, 23 Aug 2025 00:00:00 GMT

JVM Native Memory in Kubernetes: Why Your Pod Gets OOMKilled with 50% Heap
https://www.michal-drozd.com/en/blog/jvm-native-memory-kubernetes/
Heap is 50% full but pod gets OOMKilled. I'll show how to track native memory (Metaspace, threads, NIO) and prevent container memory issues.
Sat, 16 Aug 2025 00:00:00 GMT

gRPC in Kubernetes: Why Service Round-Robin Lies
https://www.michal-drozd.com/en/blog/grpc-load-balancing-k8s/
Why one pod has 90% of traffic with gRPC. Reproducible lab, solutions from client-side LB to service mesh, and production checklist.
Mon, 11 Aug 2025 00:00:00 GMT

Linux Page Cache Thrashing in Containers: When Free Memory Isn't Free
https://www.michal-drozd.com/en/blog/container-page-cache-thrashing/
Your container has 2GB free but runs slow. Page cache counts against memory limit. File I/O forces code pages out.
I explain with benchmarks and solutions.
Wed, 06 Aug 2025 00:00:00 GMT

Zero-Downtime PostgreSQL Migrations: Expand/Contract, Backfill and Rollback Strategies
https://www.michal-drozd.com/en/blog/zero-downtime-postgresql-migrations/
A practical playbook for safe database migrations in production. From expand/contract pattern through online indexes to monitoring and rollback.
Tue, 29 Jul 2025 00:00:00 GMT

Prometheus Cardinality Explosion: Detection, Prevention, and Recovery
https://www.michal-drozd.com/en/blog/prometheus-cardinality-explosion/
One developer added user_id label. Prometheus OOM'd. I show how to detect high-cardinality metrics before they kill your monitoring, with relabel configs to drop them.
Wed, 23 Jul 2025 00:00:00 GMT

HTTP Keep-Alive Connection Reset: Why Your Requests Fail with 'Connection Reset by Peer'
https://www.michal-drozd.com/en/blog/http-keepalive-connection-reset/
Sporadic 'connection reset by peer' errors in production. I'll show how keep-alive timeout mismatches between client and server cause this and how to fix it.
Wed, 16 Jul 2025 00:00:00 GMT

Redlock vs PostgreSQL Advisory Locks: When You Don't Need Redis for Distributed Locking
https://www.michal-drozd.com/en/blog/redlock-vs-postgres-advisory-locks/
Adding Redis just for distributed locks? PostgreSQL advisory locks might be enough. I compare both with failure scenarios and performance benchmarks.
Sun, 13 Jul 2025 00:00:00 GMT

Protobuf Event Evolution: Why buf breaking Isn't Enough
https://www.michal-drozd.com/en/blog/protobuf-event-evolution/
How to safely evolve Protobuf schemas in event-driven systems.
Rules for .proto files, upcaster pattern and backward compatibility.
Sun, 06 Jul 2025 00:00:00 GMT

The $10k/Month AWS Mistake: NAT Gateway vs VPC Endpoints
https://www.michal-drozd.com/en/blog/aws-nat-gateway-vs-vpc-endpoints/
Your private subnets use NAT Gateway for S3 and DynamoDB. You're paying $0.045/GB for free traffic. I show how VPC Endpoints save thousands monthly.
Tue, 01 Jul 2025 00:00:00 GMT

PostgreSQL TOAST Strategy: Why Your JSON Column Kills Query Performance
https://www.michal-drozd.com/en/blog/postgresql-toast-optimization/
SELECT * on a table with JSON is 10x slower than expected. I'll show how TOAST storage works and when to change strategies for large columns.
Tue, 24 Jun 2025 00:00:00 GMT

Tail-Based Sampling in OpenTelemetry: Sizing, Memory Crashes and Cost Model
https://www.michal-drozd.com/en/blog/otel-tail-sampling/
Practical sizing guide for tail sampling in OpenTelemetry Collector. From decision_wait through memory limits to cost-benefit analysis.
Sat, 21 Jun 2025 00:00:00 GMT

Cache Stampede Prevention: Probabilistic Early Expiration (X-Fetch)
https://www.michal-drozd.com/en/blog/cache-stampede-xfetch/
100 requests hit expired cache simultaneously. All 100 query the database. I implement the X-Fetch algorithm that refreshes cache before expiration without locks.
Sat, 14 Jun 2025 00:00:00 GMT

PostgreSQL Replication Slot Bloat: How One Inactive Slot Filled 500GB Disk
https://www.michal-drozd.com/en/blog/postgresql-replication-slot-bloat/
Disk is 95% full, WAL directory has 400GB.
I'll show how replication slots prevent WAL cleanup and a playbook for prevention and recovery.
Sun, 08 Jun 2025 00:00:00 GMT

Kubernetes conntrack Table Exhaustion: The Silent Packet Killer
https://www.michal-drozd.com/en/blog/kubernetes-conntrack-exhaustion/
Random DNS timeouts, dropped connections, services timing out. Your nf_conntrack table is full. I show how to diagnose, monitor, and fix this Kubernetes networking issue.
Tue, 03 Jun 2025 00:00:00 GMT

Architectural Linting: Automated Protection Against Spaghetti Code
https://www.michal-drozd.com/en/blog/architectural-linting/
How to enforce architectural rules in CI/CD. Dependency Cruiser for JS/TS, ArchUnit for Java, and practical configuration examples.
Wed, 28 May 2025 00:00:00 GMT

Redis Memory Fragmentation: When maxmemory Isn't Enough
https://www.michal-drozd.com/en/blog/redis-memory-fragmentation/
Your Redis has 4GB maxmemory but RSS shows 6GB. OOM killer strikes. I explain jemalloc fragmentation with reproduction steps and activedefrag tuning.
Thu, 22 May 2025 00:00:00 GMT

PostgreSQL Idle in Transaction: Emergency Playbook for Stuck Connections
https://www.michal-drozd.com/en/blog/postgresql-idle-transaction-playbook/
Autovacuum can't run, table bloat growing, all because of one 'idle in transaction' connection. Here's the detection and kill playbook.
Tue, 20 May 2025 00:00:00 GMT

API Idempotency: Designing Endpoints Resistant to Retries
https://www.michal-drozd.com/en/blog/api-idempotency/
Complete guide to implementing idempotent APIs.
From Idempotency-Key through Redis locking to request processing state diagram.
Mon, 12 May 2025 00:00:00 GMT

CoreDNS vs NodeLocal DNS Cache: Cutting Kubernetes DNS Latency by 10x
https://www.michal-drozd.com/en/blog/coredns-nodelocal-benchmark/
Your pods make 100 DNS queries per request. CoreDNS is a bottleneck. I benchmark NodeLocal DNS cache and show configuration for production.
Thu, 08 May 2025 00:00:00 GMT

Clean Code: Principles Every Developer Should Know
https://www.michal-drozd.com/en/blog/clean-code-principles/
An overview of key clean code principles and why they're important for long-term software project maintainability.
Fri, 02 May 2025 00:00:00 GMT

Stop Mocking Your Database: Integration Tests in the Testcontainers Era
https://www.michal-drozd.com/en/blog/testcontainers-vs-mocking/
Why mocks lie and how Testcontainers will change your testing approach. Practical examples, CI setup, and data isolation strategies.
Thu, 24 Apr 2025 00:00:00 GMT

GIN Index Pending List Overflow: Fast Writes, Slow Searches
https://www.michal-drozd.com/en/blog/gin-index-pending-list-overflow/
Full-text search was fast, now it's slow. The cause: GIN index pending list grew huge during bulk inserts, and every search must now scan the unsorted pending entries.
Thu, 17 Apr 2025 00:00:00 GMT

Adaptive Concurrency Limits: Stop Guessing Thread Pool Sizes
https://www.michal-drozd.com/en/blog/adaptive-concurrency-limits/
Thread pool 200 because that's what Stack Overflow says? Netflix's algorithm adjusts concurrency automatically based on latency.
I show how it works with benchmarks.
Fri, 11 Apr 2025 00:00:00 GMT

Kubernetes Cross-Zone Traffic: The Hidden Cost Eating Your Cloud Bill
https://www.michal-drozd.com/en/blog/k8s-cross-zone-traffic/
Your AWS bill has $5000/month in data transfer. Half is cross-zone traffic within your cluster. I show how to measure and reduce it.
Tue, 08 Apr 2025 00:00:00 GMT

Feature Flags Without Tech Debt: Automatic Stale Flag Detection
https://www.michal-drozd.com/en/blog/feature-flags-stale-detection/
End-to-end solution for feature flag lifecycle management. From runtime metrics through static analysis to automatic removal PRs.
Fri, 04 Apr 2025 00:00:00 GMT

Kubernetes Rollout Without DB Outage: How to Stop PostgreSQL Connection Storm
https://www.michal-drozd.com/en/blog/k8s-postgresql-connection-storm/
Reproducible lab demonstrating connection storm during K8s rollouts. PgBouncer, preStop hooks and jitter - practical solutions with benchmarks.
Tue, 01 Apr 2025 00:00:00 GMT

Transactional Outbox: Solving the Dual Write Problem Without 2PC
https://www.michal-drozd.com/en/blog/transactional-outbox/
Practical Outbox pattern implementation in Node.js/TypeScript with PostgreSQL LISTEN/NOTIFY. Race-condition case study and production-ready solution.
Thu, 27 Mar 2025 00:00:00 GMT

The Soft Delete Trap: Why is_deleted Kills Your Database (And What To Do)
https://www.michal-drozd.com/en/blog/soft-delete-trap/
A practical analysis of why soft delete destroys database performance over time.
Benchmarks, partitioning solution, and migration checklist.
Sun, 23 Mar 2025 00:00:00 GMT

ICU Collation Version Drift: When Database Upgrades Break Your Indexes
https://www.michal-drozd.com/en/blog/icu-collation-version-drift/
Query returns wrong results after OS upgrade. The cause: ICU library version changed, collation rules shifted, and indexes are now sorted inconsistently with the new sort order.
Sat, 15 Mar 2025 00:00:00 GMT

Java Profiling in Hardened Kubernetes: When Security Blocks Your Debugger
https://www.michal-drozd.com/en/blog/java-profiling-hardened-kubernetes/
Can't attach profiler to production JVM. seccomp blocks perf_event_open, container drops CAP_SYS_PTRACE, and PodSecurityPolicy prevents privileged mode. Here's how to profile anyway.
Fri, 07 Mar 2025 00:00:00 GMT

PostgreSQL Partial Index: Planner Ignores Your Index
https://www.michal-drozd.com/en/blog/postgresql-partial-index-planner-miss/
Query scans full table despite perfect partial index. The cause: query's WHERE clause doesn't match the index predicate exactly, or statistics mislead the planner.
Tue, 04 Mar 2025 00:00:00 GMT

Go cgo DNS Resolution Thread Explosion: When net.LookupHost Spawns Thousands of Threads
https://www.michal-drozd.com/en/blog/go-cgo-dns-thread-explosion/
Go application suddenly has 10,000 threads consuming all memory. The cause: cgo-based DNS resolution blocking in slow DNS environments, bypassing Go's goroutine scheduler.
Tue, 25 Feb 2025 00:00:00 GMT

eBPF Run-Queue Latency: Finding the Off-CPU Bottleneck
https://www.michal-drozd.com/en/blog/ebpf-runqueue-latency-offcpu/
CPU utilization is low but requests are slow.
The hidden culprit: time spent waiting in the scheduler run-queue, invisible to traditional profilers but visible with eBPF off-CPU analysis.
Mon, 17 Feb 2025 00:00:00 GMT

Linux ARP Cache Stale Entries: Failover Traffic Blackhole
https://www.michal-drozd.com/en/blog/linux-arp-cache-failover-stale/
Traffic goes to old server after failover. The cause: Linux ARP cache retains MAC address of failed node, sending packets to unreachable destination for minutes.
Fri, 14 Feb 2025 00:00:00 GMT

Gossip Protocol Ghost Nodes: IP Reuse Haunting Your Cluster
https://www.michal-drozd.com/en/blog/gossip-ghost-nodes-ip-reuse/
New node joins cluster but gets shunned. Old node's IP is still in gossip protocol's failure detection blacklist. The zombie membership record lives on.
Mon, 10 Feb 2025 00:00:00 GMT

Kubernetes Ghost Connections: Stale Conntrack DNAT Entries
https://www.michal-drozd.com/en/blog/kubernetes-conntrack-stale-dnat/
Service returns wrong pod IPs after scaling. The cause: Linux conntrack keeps DNAT entries alive longer than pods exist, routing traffic to deleted endpoints.
Wed, 05 Feb 2025 00:00:00 GMT

Double Charges From Idempotency Keys: The Replica Lag Trap
https://www.michal-drozd.com/en/blog/idempotency-keys-replica-lag/
Perfect idempotency logic, but customers still get charged twice.
The cause: checking idempotency keys against a read replica that's seconds behind the primary during traffic spikes.
Wed, 29 Jan 2025 00:00:00 GMT

PostgreSQL Read Replica Conflicts: Why Your Queries Get Canceled
https://www.michal-drozd.com/en/blog/postgresql-read-replica-conflicts/
Queries on read replicas fail with 'canceling statement due to conflict with recovery'. The fix depends on which of the 5 conflict types you have - here's how to diagnose and solve each one.
Tue, 28 Jan 2025 00:00:00 GMT

Redis Cluster Slot Migration: Temporary Memory Explosion
https://www.michal-drozd.com/en/blog/redis-cluster-slot-migration-memory/
Redis nodes OOMKilled during cluster rebalancing. The cause: slot migration copies keys to destination before deleting from source, temporarily doubling memory usage.
Mon, 27 Jan 2025 00:00:00 GMT

Split-Brain From a Clock Step Backwards: Wall Time in Lease-Based Systems
https://www.michal-drozd.com/en/blog/clock-step-backwards-split-brain/
Two nodes both believe they hold the leader lease. The cause: a small NTP time step backwards combined with code that mixes wall-clock time with duration-based timeouts.
Wed, 22 Jan 2025 00:00:00 GMT

Java OOMKilled With Stable Heap: Native Memory, Direct Buffers, and glibc Arenas
https://www.michal-drozd.com/en/blog/java-native-memory-oomkilled/
Heap metrics look fine, GC is happy, but the container keeps dying.
The culprit: native memory from direct buffers, JNI, and glibc memory allocator fragmentation.
Mon, 20 Jan 2025 00:00:00 GMT

Go p99 Latency Cliffs: Nested context.WithTimeout Timer Storms
https://www.michal-drozd.com/en/blog/go-timer-heap-pressure/
Periodic latency spikes that look like network jitter. The real cause: nested timeouts creating thousands of timers that pressure the Go runtime timer heap and trigger GC scanning.
Wed, 15 Jan 2025 00:00:00 GMT

PostgreSQL Serialization Failures: Beyond 'Just Retry'
https://www.michal-drozd.com/en/blog/postgresql-serialization-failure-retry/
Getting 'could not serialize access due to concurrent update'? The fix isn't just retry logic - it's understanding when to use which isolation level and how to reduce conflict frequency.
Wed, 15 Jan 2025 00:00:00 GMT

gRPC Keepalive Mismatch: Transport Closing After Idle
https://www.michal-drozd.com/en/blog/grpc-keepalive-transport-closing/
gRPC connections randomly close with 'transport is closing'. The cause: client and server keepalive settings don't match, causing the server to terminate idle connections.
Mon, 13 Jan 2025 00:00:00 GMT

The Ghost Pod: Why Your Service Still Sends Traffic to Dead Endpoints
https://www.michal-drozd.com/en/blog/kubernetes-ghost-pod-conntrack/
Random ECONNRESET on some nodes but not others. Endpoints look fine.
The culprit: conntrack NAT entries keeping long-lived connections pinned to pods that no longer exist.
Sun, 05 Jan 2025 00:00:00 GMT

PostgreSQL OOM by Design: work_mem × Parallel Workers × Plan Nodes
https://www.michal-drozd.com/en/blog/postgresql-work-mem-parallel-oom/
work_mem looks small at 256MB, but a parallel hash join with 4 workers across 3 plan nodes uses 3GB. Here's how to prevent PostgreSQL from legitimately OOMing your container.
Sat, 28 Dec 2024 00:00:00 GMT

JVM Metaspace OOM in Kubernetes: Why MaxMetaspaceSize Alone Won't Save You
https://www.michal-drozd.com/en/blog/jvm-metaspace-oom-kubernetes/
Pod OOMKilled despite MaxMetaspaceSize set. The cause: Metaspace grows outside heap, container memory limit doesn't account for it, and class unloading isn't happening.
Mon, 23 Dec 2024 00:00:00 GMT

The Index That Killed Write Performance: Losing PostgreSQL HOT Updates
https://www.michal-drozd.com/en/blog/postgresql-hot-updates-index-trap/
Adding an index for performance made writes 10x slower. The counter-intuitive cause: the new index broke HOT updates, turning cheap in-place updates into full-row rewrites with massive bloat.
Thu, 19 Dec 2024 00:00:00 GMT

PostgreSQL 'cached plan must not change result type' During Zero-Downtime Migrations
https://www.michal-drozd.com/en/blog/postgresql-cached-plan-schema-change/
Rolling deploy fails with cached plan errors after ALTER TABLE.
The cause: server-side prepared statements cache query plans that break when schema changes.
Wed, 11 Dec 2024 00:00:00 GMT

etcd Watch Replay Storms: When Giant ConfigMaps Kill the Control Plane
https://www.michal-drozd.com/en/blog/etcd-watch-replay-storms/
The apiserver becomes 'randomly slow'. Root cause: large, frequently updated ConfigMaps trigger watch compaction, causing thousands of controllers to relist simultaneously.
Thu, 05 Dec 2024 00:00:00 GMT

etcd Quota Alarm: When Your Kubernetes Cluster Goes Read-Only
https://www.michal-drozd.com/en/blog/etcd-compaction-quota-alarm/
Cluster stops accepting writes, pods can't schedule. The cause: etcd hit its storage quota because compaction wasn't running, history accumulated beyond limits.
Wed, 27 Nov 2024 00:00:00 GMT

Kubernetes Headless Service DNS: Stale Records After Pod Deletion
https://www.michal-drozd.com/en/blog/kubernetes-headless-service-stale-dns/
Requests go to non-existent pods. The cause: headless service DNS records persist in client DNS cache after pods are deleted, before endpoints update propagates.
Fri, 22 Nov 2024 00:00:00 GMT

Traffic Hitting Dead Pods: Conntrack's Stale NAT Mapping
https://www.michal-drozd.com/en/blog/conntrack-stale-nat-mapping/
Deploy causes 503s for exactly 2 minutes. The issue: conntrack keeps NAT mappings to old pod IPs even after Kubernetes removes endpoints, sending traffic to dead pods.
Thu, 14 Nov 2024 00:00:00 GMT

Ephemeral Port Exhaustion: The Node That 'Goes Bad'
https://www.michal-drozd.com/en/blog/ephemeral-port-exhaustion-kubernetes/
A single Kubernetes node starts failing connections to external services while pods look healthy.
The hidden cause: sidecar proxies exhausting ephemeral ports with short-lived connections.
Mon, 11 Nov 2024 00:00:00 GMT

PMTU Blackholes: When Only Large Responses Hang
https://www.michal-drozd.com/en/blog/pmtu-blackhole-large-responses/
Small API responses work, large ones hang forever. The cause: ICMP 'Fragmentation Needed' messages filtered by firewalls, breaking Path MTU Discovery in overlay networks.
Thu, 07 Nov 2024 00:00:00 GMT

kube-proxy Micro-Outages: The xtables Lock Contention Problem
https://www.michal-drozd.com/en/blog/kube-proxy-xtables-lock-contention/
Random 1-3 second connection drops during deployments. CPU looks fine, memory is stable. The hidden cause: iptables-restore grabbing the xtables lock while endpoints churn.
Mon, 04 Nov 2024 00:00:00 GMT

TCP TIME_WAIT Port Exhaustion: When Connection Pooling Isn't Enough
https://www.michal-drozd.com/en/blog/tcp-time-wait-port-exhaustion/
Service can't connect to database - 'cannot assign requested address'. The cause: ephemeral ports exhausted by thousands of sockets stuck in TIME_WAIT state.
Mon, 28 Oct 2024 00:00:00 GMT

VXLAN Random Packet Drops: The Checksum Offload Trap
https://www.michal-drozd.com/en/blog/vxlan-checksum-offload-packet-drops/
Cross-node gRPC calls randomly fail but local traffic works fine. The culprit: TX checksum offload corrupting VXLAN headers on specific NIC drivers. Here's how to diagnose and fix.
Mon, 21 Oct 2024 00:00:00 GMT