RSS Contracts: Stop OOMKilled Java Pods in Kubernetes by Testing RSS as an API
Use cgroup RSS budgets, CI sampling, and runtime headroom to catch JVM memory regressions before they hit production.
9 posts
Use cgroup RSS budgets, CI sampling, and runtime headroom to catch JVM memory regressions before they hit production.
Container memory limit is 4GB but OOM kills at 2GB used. Kernel buffers, page cache, and cgroup accounting tricks cause early OOMKills. Here's the full picture.
Heap is 50% full but pod gets OOMKilled. I'll show how to track native memory (Metaspace, threads, NIO) and prevent container memory issues.
Your container has 2GB free but runs slow. Page cache counts against memory limit. File I/O forces code pages out. I explain with benchmarks and solutions.
Your Redis has 4GB maxmemory but RSS shows 6GB. OOM killer strikes. I explain jemalloc fragmentation with reproduction steps and activedefrag tuning.
Redis nodes OOMKilled during cluster rebalancing. The cause: slot migration copies keys to destination before deleting from source, temporarily doubling memory usage.
Heap metrics look fine, GC is happy, but the container keeps dying. The culprit: native memory from direct buffers, JNI, and glibc memory allocator fragmentation.
work_mem looks small at 256MB, but a parallel hash join with 4 workers across 3 plan nodes uses 3GB. Here's how to prevent PostgreSQL from legitimately OOMing your container.
Pod OOMKilled despite MaxMetaspaceSize set. The cause: Metaspace grows outside heap, container memory limit doesn't account for it, and class unloading isn't happening.