<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"><channel><title>Michal Drozd - Blog</title><description>Articles about software development, architecture and technologies.</description><link>https://www.michal-drozd.com/</link><language>en</language><item><title>Build a Solana Escrow Program for Service Marketplaces (Anchor Blueprint)</title><link>https://www.michal-drozd.com/en/blog/solana-escrow-program-marketplace/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/solana-escrow-program-marketplace/</guid><description>A practical Solana escrow architecture for marketplaces: account model, instruction set, security invariants, and production rollout plan.</description><pubDate>Tue, 24 Feb 2026 00:00:00 GMT</pubDate></item><item><title>Solana in 2026: Use Cases That Actually Ship</title><link>https://www.michal-drozd.com/en/blog/solana-use-cases-2026/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/solana-use-cases-2026/</guid><description>A practical map of real Solana use cases in 2026: stablecoin payments, embedded Actions, and operations patterns teams can implement this quarter.</description><pubDate>Fri, 20 Feb 2026 00:00:00 GMT</pubDate></item><item><title>Redis AOF fsync Latency Spikes: When Durability Becomes Your p99</title><link>https://www.michal-drozd.com/en/blog/redis-aof-fsync-latency-spikes/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/redis-aof-fsync-latency-spikes/</guid><description>Redis AOF can turn durability into p99 spikes: fsync pressure and rewrite fork CoW. 
Runbook to confirm, mitigate safely, and add guardrails.</description><pubDate>Fri, 09 Jan 2026 00:00:00 GMT</pubDate></item><item><title>Prometheus WAL Replay Hell: Slow Restarts and Missing Alerts</title><link>https://www.michal-drozd.com/en/blog/prometheus-wal-replay-slow-startup/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/prometheus-wal-replay-slow-startup/</guid><description>When Prometheus takes minutes or hours to restart, WAL replay is often the culprit. Prove it from logs and disk, recover safely, and prevent it.</description><pubDate>Mon, 05 Jan 2026 00:00:00 GMT</pubDate></item><item><title>tcpdump Sees SYNs, but the Service Times Out: The Listen Backlog Trap</title><link>https://www.michal-drozd.com/en/blog/tcpdump-syn-no-accept-backlog-trap/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/tcpdump-syn-no-accept-backlog-trap/</guid><description>Clients time out, tcpdump shows SYNs (sometimes even SYN-ACK), yet your app logs nothing. The culprit is often the Linux listen/accept queues overflowing under load or CPU starvation.</description><pubDate>Sat, 03 Jan 2026 00:00:00 GMT</pubDate></item><item><title>PostgreSQL Logical Replication Lag: Big Transactions and Reorder Buffer Spills</title><link>https://www.michal-drozd.com/en/blog/postgresql-logical-replication-lag-big-transactions/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/postgresql-logical-replication-lag-big-transactions/</guid><description>One huge transaction can pin logical replication for hours. 
Runbook to detect the blocker, tune decoding safely, and enforce bounded transactions in prod.</description><pubDate>Thu, 01 Jan 2026 00:00:00 GMT</pubDate></item><item><title>Span Contracts: Trace-Driven API Contract Testing with OpenTelemetry</title><link>https://www.michal-drozd.com/en/blog/span-contracts-otel-contract-testing/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/span-contracts-otel-contract-testing/</guid><description>Detect API breaking changes by hashing response shapes from OTel spans and fail CI without storing payloads.</description><pubDate>Wed, 31 Dec 2025 00:00:00 GMT</pubDate></item><item><title>Circuit Breaker Anti-Patterns: When Protection Causes Outages</title><link>https://www.michal-drozd.com/en/blog/circuit-breaker-anti-patterns/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/circuit-breaker-anti-patterns/</guid><description>Circuit breakers prevent cascading failures but wrong config makes them worse. I show 5 anti-patterns: shared breakers, wrong thresholds, instant open, no fallback, and testing gaps.</description><pubDate>Mon, 29 Dec 2025 00:00:00 GMT</pubDate></item><item><title>ingress-nginx Reload Storms: Why 502 Spikes Track Ingress Churn</title><link>https://www.michal-drozd.com/en/blog/ingress-nginx-reload-storms/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/ingress-nginx-reload-storms/</guid><description>NGINX Ingress reload storms can drop keep-alives and cause 502 spikes. 
Runbook to prove reload impact, reduce churn, and harden graceful reload.</description><pubDate>Sun, 28 Dec 2025 00:00:00 GMT</pubDate></item><item><title>The Cert Isn&apos;t Expired, Your Node Is: Time Drift Breaking TLS and JWT in Kubernetes</title><link>https://www.michal-drozd.com/en/blog/time-drift-tls-jwt-kubernetes/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/time-drift-tls-jwt-kubernetes/</guid><description>Sporadic TLS handshake failures and JWT rejections across services. Everything passes when you check. The culprit: a node&apos;s clock drifted or jumped, and NTP fixed it before you could catch it.</description><pubDate>Fri, 26 Dec 2025 00:00:00 GMT</pubDate></item><item><title>EXPLAIN Lied to You: The PostgreSQL Prepared Statement Plan Cliff</title><link>https://www.michal-drozd.com/en/blog/postgresql-prepared-statement-plan-cliff/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/postgresql-prepared-statement-plan-cliff/</guid><description>Your EXPLAIN looks perfect but production melts. 
The culprit: PostgreSQL silently switched from a custom plan to a generic plan after enough executions, and the generic plan is catastrophically wrong.</description><pubDate>Wed, 24 Dec 2025 00:00:00 GMT</pubDate></item><item><title>Prometheus remote_write backpressure: when monitoring fills the disk (and still loses data)</title><link>https://www.michal-drozd.com/en/blog/prometheus-remote-write-backpressure/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/prometheus-remote-write-backpressure/</guid><description>A practical runbook for remote_write outages: measure lag, estimate time-to-disk-full, tune queue_config safely, and choose explicit survival trade-offs.</description><pubDate>Wed, 24 Dec 2025 00:00:00 GMT</pubDate></item><item><title>Cardinality Contracts: Prometheus Labels as an API with Budgets</title><link>https://www.michal-drozd.com/en/blog/cardinality-contracts-prometheus-label-budgets/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/cardinality-contracts-prometheus-label-budgets/</guid><description>Define label budgets, enforce them in CI, and add a runtime firewall to stop cardinality explosions before production.</description><pubDate>Sun, 21 Dec 2025 00:00:00 GMT</pubDate></item><item><title>Prometheus Native Histograms in Production: Rollout Plan, Budgets, and Failure Modes</title><link>https://www.michal-drozd.com/en/blog/prometheus-native-histograms-production/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/prometheus-native-histograms-production/</guid><description>Prometheus native histograms can blow up memory, WAL, and remote_write. 
This guide shows a staged rollout, budgets, and concrete queries to verify safety.</description><pubDate>Sat, 20 Dec 2025 00:00:00 GMT</pubDate></item><item><title>Works in psql, Flaky in Prod: PgBouncer&apos;s Silent Murder of LISTEN/NOTIFY</title><link>https://www.michal-drozd.com/en/blog/pgbouncer-listen-notify-transaction-pooling/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/pgbouncer-listen-notify-transaction-pooling/</guid><description>PostgreSQL LISTEN/NOTIFY works perfectly in local testing but notifications randomly stop arriving in production. The culprit: transaction pooling quietly reassigning your connection to someone else.</description><pubDate>Thu, 18 Dec 2025 00:00:00 GMT</pubDate></item><item><title>PostgreSQL XID Wraparound: Emergency Playbook for Vacuum Freeze Under Fire</title><link>https://www.michal-drozd.com/en/blog/postgresql-xid-wraparound-emergency-playbook/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/postgresql-xid-wraparound-emergency-playbook/</guid><description>PostgreSQL can go read-only near XID wraparound. 
Use this emergency playbook to find the oldest tables, unblock vacuum freeze, and prevent repeat incidents.</description><pubDate>Tue, 16 Dec 2025 00:00:00 GMT</pubDate></item><item><title>Dash Contracts in Go: CI Compiler for Grafana Dashboards and Prometheus Alerts</title><link>https://www.michal-drozd.com/en/blog/dash-contracts-grafana-alerts-ci/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/dash-contracts-grafana-alerts-ci/</guid><description>Extract PromQL from dashboards and rules, verify selectors against /metrics, and fail CI before dashboards go dark.</description><pubDate>Mon, 15 Dec 2025 00:00:00 GMT</pubDate></item><item><title>Packets Arrive but the App Times Out: The rp_filter Trap in Kubernetes</title><link>https://www.michal-drozd.com/en/blog/linux-rp-filter-asymmetric-routing/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/linux-rp-filter-asymmetric-routing/</guid><description>tcpdump shows packets arriving, but your application sees nothing. The culprit: Linux reverse path filtering silently dropping packets before they reach iptables, triggered by asymmetric routing in multi-homed setups.</description><pubDate>Fri, 12 Dec 2025 00:00:00 GMT</pubDate></item><item><title>hot_standby_feedback Bloat Trap: Fixing Replica Conflicts by Slowly Killing the Primary</title><link>https://www.michal-drozd.com/en/blog/postgresql-hot-standby-feedback-bloat/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/postgresql-hot-standby-feedback-bloat/</guid><description>hot_standby_feedback stops replica query cancellations but can bloat the primary over days. 
Detect xmin pinning, mitigate safely, and add guardrails.</description><pubDate>Fri, 12 Dec 2025 00:00:00 GMT</pubDate></item><item><title>Kubernetes TLS Certificate Rotation: The 3AM Outage</title><link>https://www.michal-drozd.com/en/blog/kubernetes-tls-certificate-rotation/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/kubernetes-tls-certificate-rotation/</guid><description>Certificate expired at 3AM, service down. cert-manager renewal failed silently. I show monitoring, testing rotation, and preventing cert-related outages.</description><pubDate>Tue, 09 Dec 2025 00:00:00 GMT</pubDate></item><item><title>PostgreSQL Checkpoint Spikes: Why p99 Explodes Every N Minutes</title><link>https://www.michal-drozd.com/en/blog/postgresql-checkpoint-spikes/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/postgresql-checkpoint-spikes/</guid><description>A reproducible approach to diagnose and eliminate checkpoint-induced latency spikes using pgbench, pg_stat_bgwriter, and WAL/IO budgeting.</description><pubDate>Mon, 08 Dec 2025 00:00:00 GMT</pubDate></item><item><title>&apos;No Space Left on Device&apos; with 40% Disk Free: The Inode and OverlayFS Death Spiral</title><link>https://www.michal-drozd.com/en/blog/kubernetes-inode-exhaustion-overlayfs/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/kubernetes-inode-exhaustion-overlayfs/</guid><description>df -h shows 40% free. But your container keeps crashing with ENOSPC. 
The culprit: inode exhaustion on overlayfs layers, invisible to standard monitoring.</description><pubDate>Sun, 07 Dec 2025 00:00:00 GMT</pubDate></item><item><title>OpenTelemetry Collector Backpressure: Fixing Drops with memory_limiter and Queues</title><link>https://www.michal-drozd.com/en/blog/otel-collector-backpressure-memory-limiter/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/otel-collector-backpressure-memory-limiter/</guid><description>OpenTelemetry Collector drops spans under load when exporters backpressure. Fix with memory_limiter, queues, and batch tuning, with commands to verify.</description><pubDate>Thu, 04 Dec 2025 00:00:00 GMT</pubDate></item><item><title>Database Connection Pool Exhaustion: The Silent Outage Trigger</title><link>https://www.michal-drozd.com/en/blog/database-connection-pool-exhaustion/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/database-connection-pool-exhaustion/</guid><description>App hangs but the database looks healthy. Your pool is exhausted. I show how to detect it, size pools sanely, and prevent connection leaks.</description><pubDate>Sun, 30 Nov 2025 00:00:00 GMT</pubDate></item><item><title>CSI VolumeAttachment Stuck: Pods in ContainerCreating and Drains That Never Finish</title><link>https://www.michal-drozd.com/en/blog/kubernetes-volumeattachment-stuck-csi/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/kubernetes-volumeattachment-stuck-csi/</guid><description>Pods stuck in ContainerCreating often hide a stuck CSI VolumeAttachment. 
Runbook to find the blocker, detach safely, prevent data loss, and add alerts.</description><pubDate>Sun, 30 Nov 2025 00:00:00 GMT</pubDate></item><item><title>RSS Contracts: Stop OOMKilled Java Pods in Kubernetes by Testing RSS as an API</title><link>https://www.michal-drozd.com/en/blog/rss-contracts-jvm-oomkilled-kubernetes/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/rss-contracts-jvm-oomkilled-kubernetes/</guid><description>Use cgroup RSS budgets, CI sampling, and runtime headroom to catch JVM memory regressions before they hit production.</description><pubDate>Thu, 27 Nov 2025 00:00:00 GMT</pubDate></item><item><title>Pods Stuck in Terminating: A Production Decision Tree for Finalizers, Volumes, and Dead Nodes</title><link>https://www.michal-drozd.com/en/blog/kubernetes-pod-stuck-terminating-playbook/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/kubernetes-pod-stuck-terminating-playbook/</guid><description>A conservative runbook to unstick Pods safely: finalizers, CSI/volume cleanup stalls, dead nodes, and when (and how) to force-delete.</description><pubDate>Wed, 26 Nov 2025 00:00:00 GMT</pubDate></item><item><title>pg_waldump WAL Forensics: Reconstructing What Happened to Your Data</title><link>https://www.michal-drozd.com/en/blog/postgresql-wal-forensics/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/postgresql-wal-forensics/</guid><description>Something deleted rows from production but nobody admits to running DELETE. 
Use pg_waldump to analyze WAL files and reconstruct exactly what happened and when.</description><pubDate>Mon, 24 Nov 2025 00:00:00 GMT</pubDate></item><item><title>Kubernetes Graceful Shutdown as a Contract: Zero 502s During Rollouts (HTTP + gRPC)</title><link>https://www.michal-drozd.com/en/blog/kubernetes-graceful-shutdown-rollouts/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/kubernetes-graceful-shutdown-rollouts/</guid><description>A reproducible way to eliminate rollout 502/ECONNRESET: readiness-driven draining, preStop, SIGTERM handling, and a measurable drain budget.</description><pubDate>Sat, 22 Nov 2025 00:00:00 GMT</pubDate></item><item><title>5000 Unacked Messages and Climbing: Stop RabbitMQ Consumer Meltdowns in CI</title><link>https://www.michal-drozd.com/en/blog/rabbitmq-ack-contracts/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/rabbitmq-ack-contracts/</guid><description>Queue looks healthy until deployment, then messages_unacknowledged explodes, memory spikes, and redelivery storms start. The culprit: your prefetch is too high and nobody tested actual ack behavior.</description><pubDate>Sat, 22 Nov 2025 00:00:00 GMT</pubDate></item><item><title>Ephemeral-Storage Evictions in Kubernetes: The Log Storm That Took Down Healthy Pods</title><link>https://www.michal-drozd.com/en/blog/kubernetes-ephemeral-storage-eviction-log-storm/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/kubernetes-ephemeral-storage-eviction-log-storm/</guid><description>Pods get evicted for ephemeral-storage while disk looks free. 
Debug nodefs/imagefs, container logs, kubelet GC, then enforce budgets and log rotation.</description><pubDate>Tue, 18 Nov 2025 00:00:00 GMT</pubDate></item><item><title>Kubernetes OOM Killer: Why Your Container Dies at 50% Memory</title><link>https://www.michal-drozd.com/en/blog/kubernetes-oom-killer-memory-limits/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/kubernetes-oom-killer-memory-limits/</guid><description>Container memory limit is 4GB but OOM kills at 2GB used. Kernel buffers, page cache, and cgroup accounting tricks cause early OOMKills. Here&apos;s the full picture.</description><pubDate>Sun, 16 Nov 2025 00:00:00 GMT</pubDate></item><item><title>One Partition at 99% CPU: Stop Kafka Hotspots Before They Reach Production</title><link>https://www.michal-drozd.com/en/blog/kafka-partition-skew-contracts/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/kafka-partition-skew-contracts/</guid><description>All partitions look balanced in testing, then production traffic arrives and one partition melts. The culprit: your partition key has terrible cardinality and nobody noticed until now.</description><pubDate>Sat, 15 Nov 2025 00:00:00 GMT</pubDate></item><item><title>Kubernetes APF Starvation: When One Controller Makes kubectl Hang</title><link>https://www.michal-drozd.com/en/blog/kubernetes-api-priority-fairness-starvation/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/kubernetes-api-priority-fairness-starvation/</guid><description>APF can starve your Kubernetes API: kubectl hangs, controllers time out, and 429s spike. 
Runbook to isolate the noisy client, fix FlowSchemas, and prove it.</description><pubDate>Fri, 14 Nov 2025 00:00:00 GMT</pubDate></item><item><title>ClickHouse ReplacingMergeTree: The Deduplication Illusion</title><link>https://www.michal-drozd.com/en/blog/clickhouse-replacingmergetree-deduplication/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/clickhouse-replacingmergetree-deduplication/</guid><description>ReplacingMergeTree doesn&apos;t dedupe on SELECT. It merges eventually. Your queries return duplicates until background merge runs. Here&apos;s how to handle it.</description><pubDate>Thu, 13 Nov 2025 00:00:00 GMT</pubDate></item><item><title>Kafka Consumer Rebalance Storms: Why Scaling Consumers Can Increase Lag</title><link>https://www.michal-drozd.com/en/blog/kafka-consumer-rebalance-storm/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/kafka-consumer-rebalance-storm/</guid><description>Kafka consumer rebalances can make lag worse when you scale out. Diagnose max.poll interval, heartbeats, and assignment strategy; apply safe config diffs.</description><pubDate>Mon, 10 Nov 2025 00:00:00 GMT</pubDate></item><item><title>Kubernetes DNS: The ndots:5 Latency Tax</title><link>https://www.michal-drozd.com/en/blog/kubernetes-dns-caching-ndots/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/kubernetes-dns-caching-ndots/</guid><description>Every DNS query in K8s makes 5 failed lookups before succeeding. ndots:5 default causes 100ms+ latency. 
Here&apos;s how to fix it properly.</description><pubDate>Mon, 10 Nov 2025 00:00:00 GMT</pubDate></item><item><title>Envoy Outlier Detection Brownouts: When the Mesh Ejects Healthy Pods</title><link>https://www.michal-drozd.com/en/blog/envoy-outlier-detection-brownouts/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/envoy-outlier-detection-brownouts/</guid><description>Debug Istio/Envoy outlier detection brownouts: why healthy pods get ejected and 503s spike in production. Includes xDS checks, safe fixes, and alerting.</description><pubDate>Thu, 06 Nov 2025 00:00:00 GMT</pubDate></item><item><title>Go GOMAXPROCS in Containers: The CPU Detection Problem</title><link>https://www.michal-drozd.com/en/blog/go-gomaxprocs-containers/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/go-gomaxprocs-containers/</guid><description>Go sees 64 host CPUs but your container has 2 CPU limit. GOMAXPROCS=64 causes excessive context switching and throttling. Here&apos;s the fix.</description><pubDate>Wed, 05 Nov 2025 00:00:00 GMT</pubDate></item><item><title>Envoy/Istio 503 UF/UO/UT: When the Mesh, Not the App, Is Your Outage</title><link>https://www.michal-drozd.com/en/blog/envoy-istio-503-uf-uo-ut/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/envoy-istio-503-uf-uo-ut/</guid><description>Envoy/Istio can return 503 UF/UO/UT when connection pools overflow. 
Decode flags, inspect proxy stats, patch DestinationRules, and verify fast.</description><pubDate>Sun, 02 Nov 2025 00:00:00 GMT</pubDate></item><item><title>Architecture as Code: ADR, C4 Diagrams and CI Quality Gates</title><link>https://www.michal-drozd.com/en/blog/architecture-as-code/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/architecture-as-code/</guid><description>A complete guide to implementing living documentation using Architecture Decision Records, C4 model, and CI/CD pipeline automation.</description><pubDate>Fri, 31 Oct 2025 00:00:00 GMT</pubDate></item><item><title>Cilium BPF Conntrack Map Exhaustion: Random Resets While conntrack Looks Fine</title><link>https://www.michal-drozd.com/en/blog/cilium-bpf-conntrack-map-exhaustion/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/cilium-bpf-conntrack-map-exhaustion/</guid><description>Random resets with Cilium? Learn how eBPF conntrack (CT) maps fill up, why netfilter conntrack looks fine, and how to size + verify fixes in Kubernetes.</description><pubDate>Wed, 29 Oct 2025 00:00:00 GMT</pubDate></item><item><title>Python GIL and Kubernetes CPU Limits: The Threading Trap</title><link>https://www.michal-drozd.com/en/blog/python-gil-kubernetes-cpu-limits/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/python-gil-kubernetes-cpu-limits/</guid><description>Your Python app has 4 threads but K8s gives 1 CPU. GIL + CFS quota = severe throttling. I show why and how to configure workers correctly.</description><pubDate>Mon, 27 Oct 2025 00:00:00 GMT</pubDate></item><item><title>Kubernetes p99 Spikes Without OOM: Diagnosing cgroup v2 memory.high with PSI</title><link>https://www.michal-drozd.com/en/blog/cgroup-v2-memory-high-psi-kubernetes/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/cgroup-v2-memory-high-psi-kubernetes/</guid><description>Use PSI and cgroup v2 memory.high to explain p99 spikes without OOMKills. 
Kubernetes runbook with commands, safe mitigations, diffs, and alerts.</description><pubDate>Sat, 25 Oct 2025 00:00:00 GMT</pubDate></item><item><title>S3 Intelligent-Tiering: The Small Object Cost Trap</title><link>https://www.michal-drozd.com/en/blog/s3-intelligent-tiering-trap/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/s3-intelligent-tiering-trap/</guid><description>S3 Intelligent-Tiering saves money for large files but charges minimum 128KB overhead. For millions of small objects, it INCREASES costs. I show the math.</description><pubDate>Sat, 25 Oct 2025 00:00:00 GMT</pubDate></item><item><title>Connection Pool Sizing with Little&apos;s Law: Mathematical Approach to HikariCP and PgBouncer</title><link>https://www.michal-drozd.com/en/blog/connection-pool-littles-law/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/connection-pool-littles-law/</guid><description>Pool size 50 because that&apos;s how it&apos;s always been? I&apos;ll show how to use Little&apos;s Law to calculate optimal pool size and prove it with load tests.</description><pubDate>Wed, 22 Oct 2025 00:00:00 GMT</pubDate></item><item><title>Kubernetes CPU Throttling Autopsy: Why p99 Latency Explodes at 40% CPU Usage</title><link>https://www.michal-drozd.com/en/blog/k8s-cpu-throttling/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/k8s-cpu-throttling/</guid><description>CPU looks OK but tail latency is catastrophic. I&apos;ll show how to correlate CFS throttling with latency spikes and why removing CPU limits can paradoxically help.</description><pubDate>Sun, 19 Oct 2025 00:00:00 GMT</pubDate></item><item><title>Elasticsearch Hot Shard Problem: When One Node Does All the Work</title><link>https://www.michal-drozd.com/en/blog/elasticsearch-hot-shard-problem/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/elasticsearch-hot-shard-problem/</guid><description>5 data nodes but one is at 100% CPU. 
Uneven routing keys create hot shards. I show how to detect skew and fix it with routing strategies.</description><pubDate>Thu, 16 Oct 2025 00:00:00 GMT</pubDate></item><item><title>UUIDv4 vs ULID vs TSID: Impact on PostgreSQL B-Tree Indexes After 100M Records</title><link>https://www.michal-drozd.com/en/blog/uuid-ulid-tsid-postgresql/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/uuid-ulid-tsid-postgresql/</guid><description>Random UUIDs as Primary Keys cause index bloat and random I/O. Benchmark with specific numbers - index size, cache hit ratio, and WAL volume after 100M inserts.</description><pubDate>Tue, 14 Oct 2025 00:00:00 GMT</pubDate></item><item><title>JWT Revocation Strategies: When Stateless Tokens Need State</title><link>https://www.michal-drozd.com/en/blog/jwt-revocation-strategies/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/jwt-revocation-strategies/</guid><description>User compromised, need to revoke JWT immediately. But JWTs are immutable. I compare allowlist, denylist, and short expiration with performance benchmarks.</description><pubDate>Sun, 12 Oct 2025 00:00:00 GMT</pubDate></item><item><title>Fields Disappeared But Nothing Crashed: Catch Schema Evolution Bugs Before Production</title><link>https://www.michal-drozd.com/en/blog/schema-evolution-contracts/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/schema-evolution-contracts/</guid><description>Producer upgraded Protobuf, consumer still on old version. No errors, no warnings—just silent data loss in production. 
Your schema evolution broke backward compatibility and CI didn&apos;t notice.</description><pubDate>Wed, 08 Oct 2025 00:00:00 GMT</pubDate></item><item><title>CI/CD for Monorepo: Speed, Caching, Selective Tests and Supply-Chain Security</title><link>https://www.michal-drozd.com/en/blog/cicd-monorepo/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/cicd-monorepo/</guid><description>A complete blueprint for efficient CI/CD pipelines in monorepo - from path filters through remote cache to SBOM and SLSA. Practical solutions, not theory.</description><pubDate>Sat, 04 Oct 2025 00:00:00 GMT</pubDate></item><item><title>Structured Logging Performance: When Your Logger Becomes the Bottleneck</title><link>https://www.michal-drozd.com/en/blog/structured-logging-performance/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/structured-logging-performance/</guid><description>At 50k logs/sec, JSON serialization eats 30% CPU. Standard library encoding/json is slow. I benchmark zap vs zerolog vs slog with real numbers.</description><pubDate>Sun, 28 Sep 2025 00:00:00 GMT</pubDate></item><item><title>PostgreSQL HOT Updates + FILLFACTOR: How to Reduce Index Bloat by 60%</title><link>https://www.michal-drozd.com/en/blog/postgresql-hot-updates-fillfactor/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/postgresql-hot-updates-fillfactor/</guid><description>Vacuum runs successfully but disk keeps growing and cache hit ratio drops. 
I&apos;ll show how to quantify HOT-update eligibility using pgstattuple and optimize fillfactor.</description><pubDate>Tue, 23 Sep 2025 00:00:00 GMT</pubDate></item><item><title>Circuit Breaker vs Rate Limiter vs Bulkhead: When to Use Which Pattern</title><link>https://www.michal-drozd.com/en/blog/circuit-breaker-rate-limiter-bulkhead/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/circuit-breaker-rate-limiter-bulkhead/</guid><description>Three resilience patterns that are often confused. I&apos;ll show exactly when each prevents cascading failures and when it makes things worse with real metrics.</description><pubDate>Fri, 19 Sep 2025 00:00:00 GMT</pubDate></item><item><title>When Prepared Statements Make PostgreSQL 10× Slower: Generic Plan Trap</title><link>https://www.michal-drozd.com/en/blog/postgresql-prepared-statements-trap/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/postgresql-prepared-statements-trap/</guid><description>Same query, same params, but prod is slow and staging works fine. I&apos;ll show how to reproduce the generic plan problem with PgBouncer, Java/Go and how to fix it.</description><pubDate>Mon, 15 Sep 2025 00:00:00 GMT</pubDate></item><item><title>Logical Replication Slot WAL Bloat: When Subscribers Go Offline</title><link>https://www.michal-drozd.com/en/blog/logical-replication-slot-wal-retention/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/logical-replication-slot-wal-retention/</guid><description>Disk filling up with WAL files. 
The cause: a logical replication slot consumer went offline, and PostgreSQL retains all WAL since then because it might be needed.</description><pubDate>Tue, 09 Sep 2025 00:00:00 GMT</pubDate></item><item><title>eBPF Off-CPU Analysis: Finding Latency That Metrics Miss</title><link>https://www.michal-drozd.com/en/blog/ebpf-off-cpu-debugging/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/ebpf-off-cpu-debugging/</guid><description>CPU is at 20% but latency is 500ms. Standard profilers show nothing. The app is waiting, not computing. I show how to use eBPF to find what it&apos;s waiting for.</description><pubDate>Sun, 07 Sep 2025 00:00:00 GMT</pubDate></item><item><title>PostgreSQL Autovacuum SLO Tuning: How to Configure Vacuum for 200M Rows and 5k UPSERT/s</title><link>https://www.michal-drozd.com/en/blog/postgresql-autovacuum-slo/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/postgresql-autovacuum-slo/</guid><description>Autovacuum is either ignored or cargo-cult tuned. I&apos;ll show how to turn it into an SLO-driven system with specific numbers, pg_stat metrics, and reproducible tests.</description><pubDate>Thu, 04 Sep 2025 00:00:00 GMT</pubDate></item><item><title>Java Virtual Threads vs Reactive: When to Drop WebFlux for Project Loom</title><link>https://www.michal-drozd.com/en/blog/java-virtual-threads-vs-reactive/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/java-virtual-threads-vs-reactive/</guid><description>Virtual Threads in Java 21 promise simpler code than Reactive. 
I benchmark both under 10k concurrent connections and show where each wins.</description><pubDate>Wed, 27 Aug 2025 00:00:00 GMT</pubDate></item><item><title>gRPC Deadline Propagation: Preventing Cascading Failures</title><link>https://www.michal-drozd.com/en/blog/grpc-deadline-propagation/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/grpc-deadline-propagation/</guid><description>Frontend gives up after 5s but backend keeps working for 30s. Without deadline propagation, you waste resources on doomed requests. I show how to implement it in Go.</description><pubDate>Sat, 23 Aug 2025 00:00:00 GMT</pubDate></item><item><title>JVM Native Memory in Kubernetes: Why Your Pod Gets OOMKilled with 50% Heap</title><link>https://www.michal-drozd.com/en/blog/jvm-native-memory-kubernetes/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/jvm-native-memory-kubernetes/</guid><description>Heap is 50% full but pod gets OOMKilled. I&apos;ll show how to track native memory (Metaspace, threads, NIO) and prevent container memory issues.</description><pubDate>Sat, 16 Aug 2025 00:00:00 GMT</pubDate></item><item><title>gRPC in Kubernetes: Why Service Round-Robin Lies</title><link>https://www.michal-drozd.com/en/blog/grpc-load-balancing-k8s/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/grpc-load-balancing-k8s/</guid><description>Why one pod has 90% of traffic with gRPC. Reproducible lab, solutions from client-side LB to service mesh, and production checklist.</description><pubDate>Mon, 11 Aug 2025 00:00:00 GMT</pubDate></item><item><title>Linux Page Cache Thrashing in Containers: When Free Memory Isn&apos;t Free</title><link>https://www.michal-drozd.com/en/blog/container-page-cache-thrashing/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/container-page-cache-thrashing/</guid><description>Your container has 2GB free but runs slow. Page cache counts against memory limit. File I/O forces code pages out. 
I explain with benchmarks and solutions.</description><pubDate>Wed, 06 Aug 2025 00:00:00 GMT</pubDate></item><item><title>Zero-Downtime PostgreSQL Migrations: Expand/Contract, Backfill and Rollback Strategies</title><link>https://www.michal-drozd.com/en/blog/zero-downtime-postgresql-migrations/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/zero-downtime-postgresql-migrations/</guid><description>A practical playbook for safe database migrations in production. From expand/contract pattern through online indexes to monitoring and rollback.</description><pubDate>Tue, 29 Jul 2025 00:00:00 GMT</pubDate></item><item><title>Prometheus Cardinality Explosion: Detection, Prevention, and Recovery</title><link>https://www.michal-drozd.com/en/blog/prometheus-cardinality-explosion/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/prometheus-cardinality-explosion/</guid><description>One developer added user_id label. Prometheus OOM&apos;d. I show how to detect high-cardinality metrics before they kill your monitoring, with relabel configs to drop them.</description><pubDate>Wed, 23 Jul 2025 00:00:00 GMT</pubDate></item><item><title>HTTP Keep-Alive Connection Reset: Why Your Requests Fail with &apos;Connection Reset by Peer&apos;</title><link>https://www.michal-drozd.com/en/blog/http-keepalive-connection-reset/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/http-keepalive-connection-reset/</guid><description>Sporadic &apos;connection reset by peer&apos; errors in production. 
I&apos;ll show how keep-alive timeout mismatches between client and server cause this and how to fix it.</description><pubDate>Wed, 16 Jul 2025 00:00:00 GMT</pubDate></item><item><title>Redlock vs PostgreSQL Advisory Locks: When You Don&apos;t Need Redis for Distributed Locking</title><link>https://www.michal-drozd.com/en/blog/redlock-vs-postgres-advisory-locks/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/redlock-vs-postgres-advisory-locks/</guid><description>Adding Redis just for distributed locks? PostgreSQL advisory locks might be enough. I compare both with failure scenarios and performance benchmarks.</description><pubDate>Sun, 13 Jul 2025 00:00:00 GMT</pubDate></item><item><title>Protobuf Event Evolution: Why buf breaking Isn&apos;t Enough</title><link>https://www.michal-drozd.com/en/blog/protobuf-event-evolution/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/protobuf-event-evolution/</guid><description>How to safely evolve Protobuf schemas in event-driven systems. Rules for .proto files, upcaster pattern and backward compatibility.</description><pubDate>Sun, 06 Jul 2025 00:00:00 GMT</pubDate></item><item><title>The $10k/Month AWS Mistake: NAT Gateway vs VPC Endpoints</title><link>https://www.michal-drozd.com/en/blog/aws-nat-gateway-vs-vpc-endpoints/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/aws-nat-gateway-vs-vpc-endpoints/</guid><description>Your private subnets use NAT Gateway for S3 and DynamoDB. You&apos;re paying $0.045/GB for free traffic. 
I show how VPC Endpoints save thousands monthly.</description><pubDate>Tue, 01 Jul 2025 00:00:00 GMT</pubDate></item><item><title>PostgreSQL TOAST Strategy: Why Your JSON Column Kills Query Performance</title><link>https://www.michal-drozd.com/en/blog/postgresql-toast-optimization/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/postgresql-toast-optimization/</guid><description>SELECT * on a table with JSON is 10x slower than expected. I&apos;ll show how TOAST storage works and when to change strategies for large columns.</description><pubDate>Tue, 24 Jun 2025 00:00:00 GMT</pubDate></item><item><title>Tail-Based Sampling in OpenTelemetry: Sizing, Memory Crashes and Cost Model</title><link>https://www.michal-drozd.com/en/blog/otel-tail-sampling/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/otel-tail-sampling/</guid><description>Practical sizing guide for tail sampling in OpenTelemetry Collector. From decision_wait through memory limits to cost-benefit analysis.</description><pubDate>Sat, 21 Jun 2025 00:00:00 GMT</pubDate></item><item><title>Cache Stampede Prevention: Probabilistic Early Expiration (X-Fetch)</title><link>https://www.michal-drozd.com/en/blog/cache-stampede-xfetch/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/cache-stampede-xfetch/</guid><description>100 requests hit expired cache simultaneously. All 100 query the database. I implement the X-Fetch algorithm that refreshes cache before expiration without locks.</description><pubDate>Sat, 14 Jun 2025 00:00:00 GMT</pubDate></item><item><title>PostgreSQL Replication Slot Bloat: How One Inactive Slot Filled 500GB Disk</title><link>https://www.michal-drozd.com/en/blog/postgresql-replication-slot-bloat/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/postgresql-replication-slot-bloat/</guid><description>Disk is 95% full, WAL directory has 400GB. 
I&apos;ll show how replication slots prevent WAL cleanup and a playbook for prevention and recovery.</description><pubDate>Sun, 08 Jun 2025 00:00:00 GMT</pubDate></item><item><title>Kubernetes conntrack Table Exhaustion: The Silent Packet Killer</title><link>https://www.michal-drozd.com/en/blog/kubernetes-conntrack-exhaustion/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/kubernetes-conntrack-exhaustion/</guid><description>Random DNS timeouts, dropped connections, services timing out. Your nf_conntrack table is full. I show how to diagnose, monitor, and fix this Kubernetes networking issue.</description><pubDate>Tue, 03 Jun 2025 00:00:00 GMT</pubDate></item><item><title>Architectural Linting: Automated Protection Against Spaghetti Code</title><link>https://www.michal-drozd.com/en/blog/architectural-linting/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/architectural-linting/</guid><description>How to enforce architectural rules in CI/CD. Dependency Cruiser for JS/TS, ArchUnit for Java, and practical configuration examples.</description><pubDate>Wed, 28 May 2025 00:00:00 GMT</pubDate></item><item><title>Redis Memory Fragmentation: When maxmemory Isn&apos;t Enough</title><link>https://www.michal-drozd.com/en/blog/redis-memory-fragmentation/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/redis-memory-fragmentation/</guid><description>Your Redis has 4GB maxmemory but RSS shows 6GB. OOM killer strikes. 
I explain jemalloc fragmentation with reproduction steps and activedefrag tuning.</description><pubDate>Thu, 22 May 2025 00:00:00 GMT</pubDate></item><item><title>PostgreSQL Idle in Transaction: Emergency Playbook for Stuck Connections</title><link>https://www.michal-drozd.com/en/blog/postgresql-idle-transaction-playbook/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/postgresql-idle-transaction-playbook/</guid><description>Autovacuum can&apos;t run, table bloat growing, all because of one &apos;idle in transaction&apos; connection. Here&apos;s the detection and kill playbook.</description><pubDate>Tue, 20 May 2025 00:00:00 GMT</pubDate></item><item><title>API Idempotency: Designing Endpoints Resistant to Retries</title><link>https://www.michal-drozd.com/en/blog/api-idempotency/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/api-idempotency/</guid><description>Complete guide to implementing idempotent APIs. From Idempotency-Key through Redis locking to request processing state diagram.</description><pubDate>Mon, 12 May 2025 00:00:00 GMT</pubDate></item><item><title>CoreDNS vs NodeLocal DNS Cache: Cutting Kubernetes DNS Latency by 10x</title><link>https://www.michal-drozd.com/en/blog/coredns-nodelocal-benchmark/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/coredns-nodelocal-benchmark/</guid><description>Your pods make 100 DNS queries per request. CoreDNS is a bottleneck. 
I benchmark NodeLocal DNS cache and show configuration for production.</description><pubDate>Thu, 08 May 2025 00:00:00 GMT</pubDate></item><item><title>Clean Code: Principles Every Developer Should Know</title><link>https://www.michal-drozd.com/en/blog/clean-code-principles/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/clean-code-principles/</guid><description>An overview of key clean code principles and why they&apos;re important for long-term software project maintainability.</description><pubDate>Fri, 02 May 2025 00:00:00 GMT</pubDate></item><item><title>Stop Mocking Your Database: Integration Tests in the Testcontainers Era</title><link>https://www.michal-drozd.com/en/blog/testcontainers-vs-mocking/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/testcontainers-vs-mocking/</guid><description>Why mocks lie and how Testcontainers will change your testing approach. Practical examples, CI setup, and data isolation strategies.</description><pubDate>Thu, 24 Apr 2025 00:00:00 GMT</pubDate></item><item><title>GIN Index Pending List Overflow: Fast Writes, Slow Searches</title><link>https://www.michal-drozd.com/en/blog/gin-index-pending-list-overflow/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/gin-index-pending-list-overflow/</guid><description>Full-text search was fast, now it&apos;s slow. The cause: GIN index pending list grew huge during bulk inserts, and every search must now scan the unsorted pending entries.</description><pubDate>Thu, 17 Apr 2025 00:00:00 GMT</pubDate></item><item><title>Adaptive Concurrency Limits: Stop Guessing Thread Pool Sizes</title><link>https://www.michal-drozd.com/en/blog/adaptive-concurrency-limits/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/adaptive-concurrency-limits/</guid><description>Thread pool 200 because that&apos;s what Stack Overflow says? Netflix&apos;s algorithm adjusts concurrency automatically based on latency. 
I show how it works with benchmarks.</description><pubDate>Fri, 11 Apr 2025 00:00:00 GMT</pubDate></item><item><title>Kubernetes Cross-Zone Traffic: The Hidden Cost Eating Your Cloud Bill</title><link>https://www.michal-drozd.com/en/blog/k8s-cross-zone-traffic/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/k8s-cross-zone-traffic/</guid><description>Your AWS bill has $5000/month in data transfer. Half is cross-zone traffic within your cluster. I show how to measure and reduce it.</description><pubDate>Tue, 08 Apr 2025 00:00:00 GMT</pubDate></item><item><title>Feature Flags Without Tech Debt: Automatic Stale Flag Detection</title><link>https://www.michal-drozd.com/en/blog/feature-flags-stale-detection/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/feature-flags-stale-detection/</guid><description>End-to-end solution for feature flag lifecycle management. From runtime metrics through static analysis to automatic removal PRs.</description><pubDate>Fri, 04 Apr 2025 00:00:00 GMT</pubDate></item><item><title>Kubernetes Rollout Without DB Outage: How to Stop PostgreSQL Connection Storm</title><link>https://www.michal-drozd.com/en/blog/k8s-postgresql-connection-storm/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/k8s-postgresql-connection-storm/</guid><description>Reproducible lab demonstrating connection storm during K8s rollouts. PgBouncer, preStop hooks and jitter - practical solutions with benchmarks.</description><pubDate>Tue, 01 Apr 2025 00:00:00 GMT</pubDate></item><item><title>Transactional Outbox: Solving the Dual Write Problem Without 2PC</title><link>https://www.michal-drozd.com/en/blog/transactional-outbox/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/transactional-outbox/</guid><description>Practical Outbox pattern implementation in Node.js/TypeScript with PostgreSQL LISTEN/NOTIFY. 
Race-condition case study and production-ready solution.</description><pubDate>Thu, 27 Mar 2025 00:00:00 GMT</pubDate></item><item><title>The Soft Delete Trap: Why is_deleted Kills Your Database (And What To Do)</title><link>https://www.michal-drozd.com/en/blog/soft-delete-trap/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/soft-delete-trap/</guid><description>A practical analysis of why soft delete destroys database performance over time. Benchmarks, partitioning solution, and migration checklist.</description><pubDate>Sun, 23 Mar 2025 00:00:00 GMT</pubDate></item><item><title>ICU Collation Version Drift: When Database Upgrades Break Your Indexes</title><link>https://www.michal-drozd.com/en/blog/icu-collation-version-drift/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/icu-collation-version-drift/</guid><description>Query returns wrong results after OS upgrade. The cause: ICU library version changed, collation rules shifted, and indexes are now sorted inconsistently with the new sort order.</description><pubDate>Sat, 15 Mar 2025 00:00:00 GMT</pubDate></item><item><title>Java Profiling in Hardened Kubernetes: When Security Blocks Your Debugger</title><link>https://www.michal-drozd.com/en/blog/java-profiling-hardened-kubernetes/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/java-profiling-hardened-kubernetes/</guid><description>Can&apos;t attach profiler to production JVM. seccomp blocks perf_event_open, container drops CAP_SYS_PTRACE, and PodSecurityPolicy prevents privileged mode. 
Here&apos;s how to profile anyway.</description><pubDate>Fri, 07 Mar 2025 00:00:00 GMT</pubDate></item><item><title>PostgreSQL Partial Index: Planner Ignores Your Index</title><link>https://www.michal-drozd.com/en/blog/postgresql-partial-index-planner-miss/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/postgresql-partial-index-planner-miss/</guid><description>Query scans full table despite perfect partial index. The cause: query&apos;s WHERE clause doesn&apos;t match the index predicate exactly, or statistics mislead the planner.</description><pubDate>Tue, 04 Mar 2025 00:00:00 GMT</pubDate></item><item><title>Go cgo DNS Resolution Thread Explosion: When net.LookupHost Spawns Thousands of Threads</title><link>https://www.michal-drozd.com/en/blog/go-cgo-dns-thread-explosion/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/go-cgo-dns-thread-explosion/</guid><description>Go application suddenly has 10,000 threads consuming all memory. The cause: cgo-based DNS resolution blocking in slow DNS environments, bypassing Go&apos;s goroutine scheduler.</description><pubDate>Tue, 25 Feb 2025 00:00:00 GMT</pubDate></item><item><title>eBPF Run-Queue Latency: Finding the Off-CPU Bottleneck</title><link>https://www.michal-drozd.com/en/blog/ebpf-runqueue-latency-offcpu/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/ebpf-runqueue-latency-offcpu/</guid><description>CPU utilization is low but requests are slow. 
The hidden culprit: time spent waiting in the scheduler run-queue, invisible to traditional profilers but visible with eBPF off-CPU analysis.</description><pubDate>Mon, 17 Feb 2025 00:00:00 GMT</pubDate></item><item><title>Linux ARP Cache Stale Entries: Failover Traffic Blackhole</title><link>https://www.michal-drozd.com/en/blog/linux-arp-cache-failover-stale/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/linux-arp-cache-failover-stale/</guid><description>Traffic goes to old server after failover. The cause: Linux ARP cache retains MAC address of failed node, sending packets to unreachable destination for minutes.</description><pubDate>Fri, 14 Feb 2025 00:00:00 GMT</pubDate></item><item><title>Gossip Protocol Ghost Nodes: IP Reuse Haunting Your Cluster</title><link>https://www.michal-drozd.com/en/blog/gossip-ghost-nodes-ip-reuse/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/gossip-ghost-nodes-ip-reuse/</guid><description>New node joins cluster but gets shunned. Old node&apos;s IP is still in gossip protocol&apos;s failure detection blacklist. The zombie membership record lives on.</description><pubDate>Mon, 10 Feb 2025 00:00:00 GMT</pubDate></item><item><title>Kubernetes Ghost Connections: Stale Conntrack DNAT Entries</title><link>https://www.michal-drozd.com/en/blog/kubernetes-conntrack-stale-dnat/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/kubernetes-conntrack-stale-dnat/</guid><description>Service returns wrong pod IPs after scaling. 
The cause: Linux conntrack keeps DNAT entries alive longer than pods exist, routing traffic to deleted endpoints.</description><pubDate>Wed, 05 Feb 2025 00:00:00 GMT</pubDate></item><item><title>Double Charges From Idempotency Keys: The Replica Lag Trap</title><link>https://www.michal-drozd.com/en/blog/idempotency-keys-replica-lag/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/idempotency-keys-replica-lag/</guid><description>Perfect idempotency logic, but customers still get charged twice. The cause: checking idempotency keys against a read replica that&apos;s seconds behind the primary during traffic spikes.</description><pubDate>Wed, 29 Jan 2025 00:00:00 GMT</pubDate></item><item><title>PostgreSQL Read Replica Conflicts: Why Your Queries Get Canceled</title><link>https://www.michal-drozd.com/en/blog/postgresql-read-replica-conflicts/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/postgresql-read-replica-conflicts/</guid><description>Queries on read replicas fail with &apos;canceling statement due to conflict with recovery&apos;. The fix depends on which of the 5 conflict types you have - here&apos;s how to diagnose and solve each one.</description><pubDate>Tue, 28 Jan 2025 00:00:00 GMT</pubDate></item><item><title>Redis Cluster Slot Migration: Temporary Memory Explosion</title><link>https://www.michal-drozd.com/en/blog/redis-cluster-slot-migration-memory/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/redis-cluster-slot-migration-memory/</guid><description>Redis nodes OOMKilled during cluster rebalancing. 
The cause: slot migration copies keys to destination before deleting from source, temporarily doubling memory usage.</description><pubDate>Mon, 27 Jan 2025 00:00:00 GMT</pubDate></item><item><title>Split-Brain From a Clock Step Backwards: Wall Time in Lease-Based Systems</title><link>https://www.michal-drozd.com/en/blog/clock-step-backwards-split-brain/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/clock-step-backwards-split-brain/</guid><description>Two nodes both believe they hold the leader lease. The cause: a small NTP time step backwards combined with code that mixes wall-clock time with duration-based timeouts.</description><pubDate>Wed, 22 Jan 2025 00:00:00 GMT</pubDate></item><item><title>Java OOMKilled With Stable Heap: Native Memory, Direct Buffers, and glibc Arenas</title><link>https://www.michal-drozd.com/en/blog/java-native-memory-oomkilled/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/java-native-memory-oomkilled/</guid><description>Heap metrics look fine, GC is happy, but the container keeps dying. The culprit: native memory from direct buffers, JNI, and glibc memory allocator fragmentation.</description><pubDate>Mon, 20 Jan 2025 00:00:00 GMT</pubDate></item><item><title>Go p99 Latency Cliffs: Nested context.WithTimeout Timer Storms</title><link>https://www.michal-drozd.com/en/blog/go-timer-heap-pressure/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/go-timer-heap-pressure/</guid><description>Periodic latency spikes that look like network jitter. 
The real cause: nested timeouts creating thousands of timers that pressure the Go runtime timer heap and trigger GC scanning.</description><pubDate>Wed, 15 Jan 2025 00:00:00 GMT</pubDate></item><item><title>PostgreSQL Serialization Failures: Beyond &apos;Just Retry&apos;</title><link>https://www.michal-drozd.com/en/blog/postgresql-serialization-failure-retry/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/postgresql-serialization-failure-retry/</guid><description>Getting &apos;could not serialize access due to concurrent update&apos;? The fix isn&apos;t just retry logic - it&apos;s understanding when to use which isolation level and how to reduce conflict frequency.</description><pubDate>Wed, 15 Jan 2025 00:00:00 GMT</pubDate></item><item><title>gRPC Keepalive Mismatch: Transport Closing After Idle</title><link>https://www.michal-drozd.com/en/blog/grpc-keepalive-transport-closing/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/grpc-keepalive-transport-closing/</guid><description>gRPC connections randomly close with &apos;transport is closing&apos;. The cause: client and server keepalive settings don&apos;t match, causing the server to terminate idle connections.</description><pubDate>Mon, 13 Jan 2025 00:00:00 GMT</pubDate></item><item><title>The Ghost Pod: Why Your Service Still Sends Traffic to Dead Endpoints</title><link>https://www.michal-drozd.com/en/blog/kubernetes-ghost-pod-conntrack/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/kubernetes-ghost-pod-conntrack/</guid><description>Random ECONNRESET on some nodes but not others. Endpoints look fine. 
The culprit: conntrack NAT entries keeping long-lived connections pinned to pods that no longer exist.</description><pubDate>Sun, 05 Jan 2025 00:00:00 GMT</pubDate></item><item><title>PostgreSQL OOM by Design: work_mem × Parallel Workers × Plan Nodes</title><link>https://www.michal-drozd.com/en/blog/postgresql-work-mem-parallel-oom/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/postgresql-work-mem-parallel-oom/</guid><description>work_mem looks small at 256MB, but a parallel hash join with 4 workers across 3 plan nodes uses 3GB. Here&apos;s how to prevent PostgreSQL from legitimately OOMing your container.</description><pubDate>Sat, 28 Dec 2024 00:00:00 GMT</pubDate></item><item><title>JVM Metaspace OOM in Kubernetes: Why MaxMetaspaceSize Alone Won&apos;t Save You</title><link>https://www.michal-drozd.com/en/blog/jvm-metaspace-oom-kubernetes/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/jvm-metaspace-oom-kubernetes/</guid><description>Pod OOMKilled despite MaxMetaspaceSize set. The cause: Metaspace grows outside heap, container memory limit doesn&apos;t account for it, and class unloading isn&apos;t happening.</description><pubDate>Mon, 23 Dec 2024 00:00:00 GMT</pubDate></item><item><title>The Index That Killed Write Performance: Losing PostgreSQL HOT Updates</title><link>https://www.michal-drozd.com/en/blog/postgresql-hot-updates-index-trap/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/postgresql-hot-updates-index-trap/</guid><description>Adding an index for performance made writes 10x slower. 
The counter-intuitive cause: the new index broke HOT updates, turning cheap in-place updates into full-row rewrites with massive bloat.</description><pubDate>Thu, 19 Dec 2024 00:00:00 GMT</pubDate></item><item><title>PostgreSQL &apos;cached plan must not change result type&apos; During Zero-Downtime Migrations</title><link>https://www.michal-drozd.com/en/blog/postgresql-cached-plan-schema-change/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/postgresql-cached-plan-schema-change/</guid><description>Rolling deploy fails with cached plan errors after ALTER TABLE. The cause: server-side prepared statements cache query plans that break when schema changes.</description><pubDate>Wed, 11 Dec 2024 00:00:00 GMT</pubDate></item><item><title>etcd Watch Replay Storms: When Giant ConfigMaps Kill the Control Plane</title><link>https://www.michal-drozd.com/en/blog/etcd-watch-replay-storms/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/etcd-watch-replay-storms/</guid><description>The apiserver becomes &apos;randomly slow&apos;. Root cause: large, frequently updated ConfigMaps trigger watch compaction, causing thousands of controllers to relist simultaneously.</description><pubDate>Thu, 05 Dec 2024 00:00:00 GMT</pubDate></item><item><title>etcd Quota Alarm: When Your Kubernetes Cluster Goes Read-Only</title><link>https://www.michal-drozd.com/en/blog/etcd-compaction-quota-alarm/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/etcd-compaction-quota-alarm/</guid><description>Cluster stops accepting writes, pods can&apos;t schedule. 
The cause: etcd hit its storage quota because compaction wasn&apos;t running, so history accumulated beyond limits.</description><pubDate>Wed, 27 Nov 2024 00:00:00 GMT</pubDate></item><item><title>Kubernetes Headless Service DNS: Stale Records After Pod Deletion</title><link>https://www.michal-drozd.com/en/blog/kubernetes-headless-service-stale-dns/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/kubernetes-headless-service-stale-dns/</guid><description>Requests go to non-existent pods. The cause: headless service DNS records persist in the client DNS cache after pods are deleted, before the endpoint update propagates.</description><pubDate>Fri, 22 Nov 2024 00:00:00 GMT</pubDate></item><item><title>Traffic Hitting Dead Pods: Conntrack&apos;s Stale NAT Mapping</title><link>https://www.michal-drozd.com/en/blog/conntrack-stale-nat-mapping/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/conntrack-stale-nat-mapping/</guid><description>Deploy causes 503s for exactly 2 minutes. The issue: conntrack keeps NAT mappings to old pod IPs even after Kubernetes removes endpoints, sending traffic to dead pods.</description><pubDate>Thu, 14 Nov 2024 00:00:00 GMT</pubDate></item><item><title>Ephemeral Port Exhaustion: The Node That &apos;Goes Bad&apos;</title><link>https://www.michal-drozd.com/en/blog/ephemeral-port-exhaustion-kubernetes/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/ephemeral-port-exhaustion-kubernetes/</guid><description>A single Kubernetes node starts failing connections to external services while pods look healthy. 
The hidden cause: sidecar proxies exhausting ephemeral ports with short-lived connections.</description><pubDate>Mon, 11 Nov 2024 00:00:00 GMT</pubDate></item><item><title>PMTU Blackholes: When Only Large Responses Hang</title><link>https://www.michal-drozd.com/en/blog/pmtu-blackhole-large-responses/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/pmtu-blackhole-large-responses/</guid><description>Small API responses work, large ones hang forever. The cause: ICMP &apos;Fragmentation Needed&apos; messages filtered by firewalls, breaking Path MTU Discovery in overlay networks.</description><pubDate>Thu, 07 Nov 2024 00:00:00 GMT</pubDate></item><item><title>kube-proxy Micro-Outages: The xtables Lock Contention Problem</title><link>https://www.michal-drozd.com/en/blog/kube-proxy-xtables-lock-contention/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/kube-proxy-xtables-lock-contention/</guid><description>Random 1-3 second connection drops during deployments. CPU looks fine, memory is stable. The hidden cause: iptables-restore grabbing the xtables lock while endpoints churn.</description><pubDate>Mon, 04 Nov 2024 00:00:00 GMT</pubDate></item><item><title>TCP TIME_WAIT Port Exhaustion: When Connection Pooling Isn&apos;t Enough</title><link>https://www.michal-drozd.com/en/blog/tcp-time-wait-port-exhaustion/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/tcp-time-wait-port-exhaustion/</guid><description>Service can&apos;t connect to database - &apos;cannot assign requested address&apos;. 
The cause: ephemeral ports exhausted by thousands of sockets stuck in TIME_WAIT state.</description><pubDate>Mon, 28 Oct 2024 00:00:00 GMT</pubDate></item><item><title>VXLAN Random Packet Drops: The Checksum Offload Trap</title><link>https://www.michal-drozd.com/en/blog/vxlan-checksum-offload-packet-drops/</link><guid isPermaLink="true">https://www.michal-drozd.com/en/blog/vxlan-checksum-offload-packet-drops/</guid><description>Cross-node gRPC calls randomly fail but local traffic works fine. The culprit: TX checksum offload corrupting VXLAN headers on specific NIC drivers. Here&apos;s how to diagnose and fix.</description><pubDate>Mon, 21 Oct 2024 00:00:00 GMT</pubDate></item></channel></rss>