RSS Contracts: Stop OOMKilled Java Pods in Kubernetes by Testing RSS as an API
Use cgroup RSS budgets, CI sampling, and runtime headroom to catch JVM memory regressions before they hit production.
13 posts
Use cgroup RSS budgets, CI sampling, and runtime headroom to catch JVM memory regressions before they hit production.
CPU looks OK but tail latency is catastrophic. I'll show how to correlate CFS throttling with latency spikes and why removing CPU limits can paradoxically help.
Three resilience patterns that are often confused. I'll show, with real metrics, exactly when each one prevents cascading failures and when it makes things worse.
Same query, same params, but prod is slow and staging works fine. I'll show how to reproduce the generic plan problem with PgBouncer and Java/Go, and how to fix it.
Virtual threads in Java 21 promise simpler code than reactive programming. I benchmark both under 10k concurrent connections and show where each wins.
Heap is 50% full but the pod gets OOMKilled. I'll show how to track native memory (Metaspace, threads, NIO) and prevent container memory issues.
Sporadic 'connection reset by peer' errors in production. I'll show how keep-alive timeout mismatches between client and server cause this and how to fix it.
Adding Redis just for distributed locks? PostgreSQL advisory locks might be enough. I compare both with failure scenarios and performance benchmarks.
How to enforce architectural rules in CI/CD. Dependency Cruiser for JS/TS, ArchUnit for Java, and practical configuration examples.
A thread pool of 200 because that's what Stack Overflow says? Netflix's algorithm adjusts concurrency automatically based on latency. I show how it works, with benchmarks.
Can't attach a profiler to a production JVM? seccomp blocks perf_event_open, the container drops CAP_SYS_PTRACE, and PodSecurityPolicy prevents privileged mode. Here's how to profile anyway.
Heap metrics look fine, GC is happy, but the container keeps dying. The culprit: native memory from direct buffers, JNI, and fragmentation in the glibc memory allocator.
Pod OOMKilled despite MaxMetaspaceSize being set. The cause: Metaspace grows outside the heap, the container memory limit doesn't account for it, and class unloading isn't happening.