RSS Contracts: Stop OOMKilled Java Pods in Kubernetes by Testing RSS as an API
I got tired of seeing JVM pods die with OOMKilled while the heap looked fine. This is one of the most common WTF incidents in Kubernetes:
- container limit: 2GiB
- heap set to -Xmx1400m
- heap in graphs looks OK (50-70%)
- and yet: OOMKilled
It often happens hours or days later (cache warm-up, traffic patterns, that one rare endpoint).
This post is not another “set Xmx to 70%” recipe. I use this as a guardrail:
RSS Contracts = memory budgets expressed as RSS / cgroup usage + automated verification in CI + a runtime headroom guard.
We already test observability as contracts (Span Contracts, Dash Contracts). So why not memory?
- Span Contracts - trace-derived API contracts in CI
- Dash Contracts - dashboards and alerts as a contract
What OOMKilled actually means (and why heap is not an argument)
Java OOM (java.lang.OutOfMemoryError) = JVM cannot allocate memory inside its own limits.
Kubernetes OOMKilled = the kernel killed the process because cgroup memory usage exceeded the container limit.
The kernel does not care about heap. It cares about resident memory and cgroup usage.
So you can have:
- heap OK
- but native memory + direct buffers + thread stacks + metaspace + allocators + page cache = boom
Why the usual advice fails
“Set -Xmx to 70% of the limit”
- Sometimes it works.
- Sometimes it does not.
- Mostly it is guessing without guardrails.
“Use MaxRAMPercentage”
Modern JDKs are container-aware, but that only sizes the heap. You still have:
- metaspace
- code cache
- thread stacks
- direct memory (Netty / NIO)
- native overhead (malloc arenas, JNI, TLS, crypto)
- fragmentation and spikes
“We have graphs”
Graphs are great… after the deploy. RSS Contracts move the risk to PR/CI.
The memory bill (RSS budget) in a JVM container
Think of memory as an invoice. The container limit is your total budget.
| Line item | Where it comes from | Typical growth drivers | How it shows up |
|---|---|---|---|
| Java heap | JVM heap | object allocation, caches, large payloads | GC pressure, latency, RSS growth |
| Metaspace / Class metadata | native | dynamic classloading, proxies, libraries | growth without heap |
| Code cache / JIT | native | warm-up, lots of methods | slow growth |
| Thread stacks | native | high concurrency, thread-per-request | sudden RSS jumps |
| Direct buffers (NIO/Netty) | native | allocateDirect, Netty pooling | RSS growth outside heap |
| Malloc arenas / libc overhead | native | many threads, fragmentation | surprisingly large numbers |
| Page cache / file cache | kernel/cgroup | heavy IO, mmap, logs, jars | cgroup usage climbs, OOM w/o heap |
| Safety margin | reality | spikes, fragmentation, unknown unknowns | what saves you |
The point: OOMKilled is almost always “invoice > limit”. Heap is just one line item.
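To make the invoice idea concrete, here is a minimal sketch that sums a few line items and compares the total with the container limit. The type, the function names, and every number are illustrative assumptions, not measurements from a real service.

```go
package main

import "fmt"

const MiB = 1024 * 1024

// LineItem is one entry on the memory "invoice".
// Names and sizes below are made-up examples.
type LineItem struct {
	Name  string
	Bytes uint64
}

// Total sums all line items on the invoice.
func Total(items []LineItem) uint64 {
	var sum uint64
	for _, it := range items {
		sum += it.Bytes
	}
	return sum
}

// OverBudget reports whether the invoice exceeds the container limit.
func OverBudget(items []LineItem, limit uint64) bool {
	return Total(items) > limit
}

func main() {
	limit := uint64(2048 * MiB) // 2GiB container limit
	items := []LineItem{
		{"heap", 1400 * MiB},
		{"metaspace", 150 * MiB},
		{"code cache", 80 * MiB},
		{"thread stacks", 200 * MiB},
		{"direct buffers", 256 * MiB},
		{"malloc overhead", 100 * MiB},
	}
	fmt.Printf("total=%d MiB limit=%d MiB over=%v\n",
		Total(items)/MiB, limit/MiB, OverBudget(items, limit))
}
```

With these made-up numbers the heap is under 70% of the limit, yet the total comes to 2186 MiB against a 2048 MiB budget.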
RSS Contracts: definition and rules
An RSS Contract is a policy file in your repo:
- maximum RSS as % of the limit
- warning threshold
- maximum RSS spike during the test
- optional guidance for heap/native/safety split
Example rss_contract.yml:
version: 1
limits:
max_rss_pct: 0.90
warn_rss_pct: 0.80
spikes:
max_rss_delta_pct: 0.08
sampling:
duration: 60s
interval: 250ms
guidance:
heap_pct_of_limit: 0.60
native_pct_of_limit: 0.25
safety_pct_of_limit: 0.10
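A contract file is only useful if its numbers are sane, so it may be worth validating them before sampling starts. The sketch below is one possible set of checks; the Thresholds type, the Validate function, and the rules themselves are my assumptions and are not part of the rsscontract code shown later.

```go
package main

import (
	"errors"
	"fmt"
)

// Thresholds mirrors the numeric fields of rss_contract.yml.
// The type and its name are hypothetical.
type Thresholds struct {
	MaxRssPct      float64
	WarnRssPct     float64
	MaxRssDeltaPct float64
	HeapPct        float64
	NativePct      float64
	SafetyPct      float64
}

// Validate applies a few assumed sanity rules: fractions must lie in (0, 1],
// the warning must trigger before the failure, and the guidance split must
// not add up to more than the whole limit.
func Validate(t Thresholds) error {
	if t.MaxRssPct <= 0 || t.MaxRssPct > 1 {
		return errors.New("max_rss_pct must be in (0, 1]")
	}
	if t.WarnRssPct <= 0 || t.WarnRssPct >= t.MaxRssPct {
		return errors.New("warn_rss_pct must be > 0 and below max_rss_pct")
	}
	if t.MaxRssDeltaPct <= 0 || t.MaxRssDeltaPct > 1 {
		return errors.New("max_rss_delta_pct must be in (0, 1]")
	}
	if sum := t.HeapPct + t.NativePct + t.SafetyPct; sum > 1.0+1e-9 {
		return fmt.Errorf("guidance split sums to %.2f of the limit", sum)
	}
	return nil
}

func main() {
	// Values from the example contract above.
	t := Thresholds{
		MaxRssPct: 0.90, WarnRssPct: 0.80, MaxRssDeltaPct: 0.08,
		HeapPct: 0.60, NativePct: 0.25, SafetyPct: 0.10,
	}
	if err := Validate(t); err != nil {
		fmt.Println("invalid contract:", err)
		return
	}
	fmt.Println("contract looks sane")
}
```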
How it works in practice
Workflow (CI and local)
- Start the app in a container with the same memory limit as prod.
- Run a short, realistic workload (smoke + a few hot paths).
- Run rsscontract verify, which:
  - reads cgroup memory limit
  - samples process RSS
  - computes max RSS and spikes
  - produces a report
- If the contract fails, the PR fails before deploy.
Measuring memory: kernel view, not JVM wishful thinking
1) Cgroup limit (v2 and v1)
Cgroup v2:
- limit: /sys/fs/cgroup/memory.max
- usage: /sys/fs/cgroup/memory.current
Cgroup v1:
- limit: /sys/fs/cgroup/memory/memory.limit_in_bytes
- usage: /sys/fs/cgroup/memory/memory.usage_in_bytes
2) Process RSS (best signal)
Most stable:
/proc/<pid>/smaps_rollup -> Rss: line (kB)
Fallback:
/proc/<pid>/status -> VmRSS: line (kB)
Go implementation: rsscontract
One binary, no external services. You can bake it into your image and run it in CI.
internal/memprobe/memprobe.go
package memprobe
import (
"bufio"
"errors"
"fmt"
"os"
"strconv"
"strings"
)
func readFirstLine(path string) (string, error) {
b, err := os.ReadFile(path)
if err != nil {
return "", err
}
s := strings.TrimSpace(string(b))
if s == "" {
return "", errors.New("empty")
}
if i := strings.IndexByte(s, '\n'); i >= 0 {
s = s[:i]
}
return s, nil
}
// CgroupMemoryLimitBytes tries v2 first, then v1.
func CgroupMemoryLimitBytes() (uint64, error) {
// cgroup v2
if s, err := readFirstLine("/sys/fs/cgroup/memory.max"); err == nil {
if s == "max" {
return 0, errors.New("memory.max is unlimited")
}
v, err := strconv.ParseUint(s, 10, 64)
if err == nil && v > 0 {
if v > (1 << 60) {
return 0, errors.New("memory.max looks unlimited")
}
return v, nil
}
}
// cgroup v1
if s, err := readFirstLine("/sys/fs/cgroup/memory/memory.limit_in_bytes"); err == nil {
v, err := strconv.ParseUint(s, 10, 64)
if err == nil && v > 0 {
if v > (1 << 60) {
return 0, errors.New("memory.limit_in_bytes looks unlimited")
}
return v, nil
}
}
return 0, errors.New("cgroup memory limit not found")
}
// RssBytes reads RSS for a given pid using smaps_rollup, fallback to status.
func RssBytes(pid int) (uint64, error) {
if v, err := rssFromSmapsRollup(pid); err == nil {
return v, nil
}
return rssFromStatus(pid)
}
func rssFromSmapsRollup(pid int) (uint64, error) {
path := fmt.Sprintf("/proc/%d/smaps_rollup", pid)
f, err := os.Open(path)
if err != nil {
return 0, err
}
defer f.Close()
sc := bufio.NewScanner(f)
for sc.Scan() {
line := sc.Text()
if strings.HasPrefix(line, "Rss:") {
fields := strings.Fields(line)
if len(fields) < 2 {
return 0, fmt.Errorf("unexpected rss line: %q", line)
}
kb, err := strconv.ParseUint(fields[1], 10, 64)
if err != nil {
return 0, err
}
return kb * 1024, nil
}
}
if err := sc.Err(); err != nil {
return 0, err
}
return 0, errors.New("Rss not found in smaps_rollup")
}
func rssFromStatus(pid int) (uint64, error) {
path := fmt.Sprintf("/proc/%d/status", pid)
f, err := os.Open(path)
if err != nil {
return 0, err
}
defer f.Close()
sc := bufio.NewScanner(f)
for sc.Scan() {
line := sc.Text()
if strings.HasPrefix(line, "VmRSS:") {
fields := strings.Fields(line)
if len(fields) < 2 {
return 0, fmt.Errorf("unexpected VmRSS line: %q", line)
}
kb, err := strconv.ParseUint(fields[1], 10, 64)
if err != nil {
return 0, err
}
return kb * 1024, nil
}
}
if err := sc.Err(); err != nil {
return 0, err
}
return 0, errors.New("VmRSS not found in status")
}
cmd/rsscontract/main.go
package main
import (
"encoding/json"
"errors"
"flag"
"fmt"
"math"
"os"
"time"
"gopkg.in/yaml.v3"
"example.com/rsscontract/internal/memprobe"
)
type Contract struct {
Version int `yaml:"version"`
Limits struct {
MaxRssPct float64 `yaml:"max_rss_pct"`
WarnRssPct float64 `yaml:"warn_rss_pct"`
} `yaml:"limits"`
Spikes struct {
MaxRssDeltaPct float64 `yaml:"max_rss_delta_pct"`
} `yaml:"spikes"`
Sampling struct {
Duration string `yaml:"duration"`
Interval string `yaml:"interval"`
} `yaml:"sampling"`
Guidance struct {
HeapPctOfLimit float64 `yaml:"heap_pct_of_limit"`
NativePctOfLimit float64 `yaml:"native_pct_of_limit"`
SafetyPctOfLimit float64 `yaml:"safety_pct_of_limit"`
} `yaml:"guidance"`
}
type Report struct {
PID int `json:"pid"`
Timestamp time.Time `json:"timestamp"`
LimitBytes uint64 `json:"limit_bytes"`
MaxRssBytes uint64 `json:"max_rss_bytes"`
MinRssBytes uint64 `json:"min_rss_bytes"`
DeltaRssBytes int64 `json:"delta_rss_bytes"`
MaxRssPct float64 `json:"max_rss_pct"`
DeltaRssPct float64 `json:"delta_rss_pct"`
WarnThreshold float64 `json:"warn_threshold_pct"`
FailThreshold float64 `json:"fail_threshold_pct"`
SpikeFailPct float64 `json:"spike_fail_pct"`
Warnings []string `json:"warnings,omitempty"`
Violations []string `json:"violations,omitempty"`
ContractVersion int `json:"contract_version"`
}
func main() {
var (
contractPath = flag.String("contract", "rss_contract.yml", "Path to RSS contract YAML")
pid = flag.Int("pid", 1, "Target process PID (in container usually 1)")
outPath = flag.String("report", "rss_report.json", "Where to write JSON report")
)
flag.Parse()
c, err := loadContract(*contractPath)
if err != nil {
fatal(err)
}
dur, err := time.ParseDuration(c.Sampling.Duration)
if err != nil {
fatal(fmt.Errorf("bad sampling.duration: %w", err))
}
interval, err := time.ParseDuration(c.Sampling.Interval)
if err != nil {
fatal(fmt.Errorf("bad sampling.interval: %w", err))
}
if dur <= 0 || interval <= 0 {
fatal(errors.New("sampling.duration and sampling.interval must be > 0"))
}
limit, err := memprobe.CgroupMemoryLimitBytes()
if err != nil {
fatal(fmt.Errorf("cannot determine cgroup limit: %w", err))
}
rep, exitCode := verify(c, *pid, limit, dur, interval)
if err := writeJSON(*outPath, rep); err != nil {
fmt.Fprintf(os.Stderr, "report write failed: %v\n", err)
}
printHuman(rep)
os.Exit(exitCode)
}
func verify(c Contract, pid int, limit uint64, dur, interval time.Duration) (Report, int) {
rep := Report{
PID: pid,
Timestamp: time.Now().UTC(),
LimitBytes: limit,
WarnThreshold: c.Limits.WarnRssPct,
FailThreshold: c.Limits.MaxRssPct,
SpikeFailPct: c.Spikes.MaxRssDeltaPct,
ContractVersion: c.Version,
}
start := time.Now()
deadline := start.Add(dur)
var min uint64 = math.MaxUint64
var max uint64 = 0
var first uint64 = 0
var last uint64 = 0
for now := time.Now(); now.Before(deadline); now = time.Now() {
rss, err := memprobe.RssBytes(pid)
if err != nil {
rep.Violations = append(rep.Violations, fmt.Sprintf("rss_read_error: %v", err))
return rep, 2
}
if first == 0 {
first = rss
}
last = rss
if rss < min {
min = rss
}
if rss > max {
max = rss
}
time.Sleep(interval)
}
rep.MinRssBytes = min
rep.MaxRssBytes = max
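// Net growth over the sampling window (last - first). A transient spike that
// recedes before the last sample shows up in max RSS, not in this delta.
rep.DeltaRssBytes = int64(last) - int64(first)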
rep.MaxRssPct = float64(max) / float64(limit)
rep.DeltaRssPct = math.Abs(float64(rep.DeltaRssBytes)) / float64(limit)
if c.Limits.WarnRssPct > 0 && rep.MaxRssPct >= c.Limits.WarnRssPct {
rep.Warnings = append(rep.Warnings,
fmt.Sprintf("RSS warning threshold exceeded: max_rss_pct=%.3f >= warn_rss_pct=%.3f", rep.MaxRssPct, c.Limits.WarnRssPct))
}
exitCode := 0
if c.Limits.MaxRssPct > 0 && rep.MaxRssPct >= c.Limits.MaxRssPct {
rep.Violations = append(rep.Violations,
fmt.Sprintf("RSS contract failed: observed max_rss_pct=%.3f >= contract max_rss_pct=%.3f", rep.MaxRssPct, c.Limits.MaxRssPct))
exitCode = 1
}
if c.Spikes.MaxRssDeltaPct > 0 && rep.DeltaRssPct >= c.Spikes.MaxRssDeltaPct {
rep.Violations = append(rep.Violations,
fmt.Sprintf("RSS spike contract failed: delta_rss_pct=%.3f >= max_rss_delta_pct=%.3f", rep.DeltaRssPct, c.Spikes.MaxRssDeltaPct))
exitCode = 1
}
return rep, exitCode
}
func loadContract(path string) (Contract, error) {
b, err := os.ReadFile(path)
if err != nil {
return Contract{}, err
}
var c Contract
if err := yaml.Unmarshal(b, &c); err != nil {
return Contract{}, err
}
if c.Version == 0 {
c.Version = 1
}
return c, nil
}
func writeJSON(path string, rep Report) error {
b, err := json.MarshalIndent(rep, "", " ")
if err != nil {
return err
}
return os.WriteFile(path, b, 0o644)
}
func printHuman(rep Report) {
fmt.Printf("RSS Contract report (pid=%d)\n", rep.PID)
fmt.Printf("- limit: %.2f MiB\n", float64(rep.LimitBytes)/(1024*1024))
fmt.Printf("- max RSS: %.2f MiB (%.1f%%)\n", float64(rep.MaxRssBytes)/(1024*1024), rep.MaxRssPct*100)
fmt.Printf("- min RSS: %.2f MiB\n", float64(rep.MinRssBytes)/(1024*1024))
fmt.Printf("- delta: %.2f MiB (%.1f%%)\n", float64(rep.DeltaRssBytes)/(1024*1024), rep.DeltaRssPct*100)
for _, w := range rep.Warnings {
fmt.Printf("WARN: %s\n", w)
}
for _, v := range rep.Violations {
fmt.Printf("FAIL: %s\n", v)
}
}
func fatal(err error) {
fmt.Fprintln(os.Stderr, "error:", err)
os.Exit(2)
}
go.mod
module example.com/rsscontract
go 1.22
require gopkg.in/yaml.v3 v3.0.1
How to bake it into a Java image
FROM golang:1.22 AS rssbuild
WORKDIR /src
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 go build -o /out/rsscontract ./cmd/rsscontract
FROM eclipse-temurin:21-jre
WORKDIR /app
COPY build/libs/app.jar /app/app.jar
COPY --from=rssbuild /out/rsscontract /usr/local/bin/rsscontract
COPY rss_contract.yml /app/rss_contract.yml
ENTRYPOINT ["java","-jar","/app/app.jar"]
CI example (GitHub Actions)
name: rss-contract
on: [pull_request]
jobs:
rss_contract:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Build image
run: |
docker build -t myapp:test .
- name: Run app with memory limit
run: |
docker run -d --name app \
--memory=2g --memory-swap=2g \
-p 8080:8080 \
myapp:test
- name: Warm-up / smoke
run: |
for i in $(seq 1 50); do
curl -fsS http://localhost:8080/health || true
curl -fsS http://localhost:8080/api/some-hot-path || true
done
- name: Run RSS contract inside container
run: |
docker exec app rsscontract \
-contract /app/rss_contract.yml \
-pid 1 \
-report /tmp/rss_report.json
- name: Copy report
if: always()
run: |
docker cp app:/tmp/rss_report.json rss_report.json || true
- name: Upload report
if: always()
uses: actions/upload-artifact@v4
with:
name: rss_report
path: rss_report.json
Why --memory-swap=2g?
Setting --memory-swap to the same value as --memory means the container cannot use swap, which keeps it close to Kubernetes behavior (no swap, tighter limit). It is not a perfect emulation, but it is a reproducible guardrail.
Turning the contract into JVM settings
The contract tells you how much RAM you can spend. Now split the budget.
A reasonable starting point (not dogma):
- Heap: 55-65% of limit
- Native: 20-30%
- Safety: 10-15%
For a 2GiB limit:
- heap: ~1.2GiB
- native: ~0.5GiB
- safety: ~0.3GiB
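The arithmetic behind that split can be sketched in a few lines. SplitBudget and the Split type are hypothetical helpers; the 60% and 25% shares are just the starting point from above, and safety takes whatever remains.

```go
package main

import (
	"errors"
	"fmt"
)

const MiB = 1024 * 1024

// Split holds the three budget lines in MiB.
type Split struct {
	HeapMiB, NativeMiB, SafetyMiB uint64
}

// SplitBudget divides the container limit into heap, native and safety.
// Heap and native are rounded down; safety receives the remainder so the
// three parts always add up to the limit.
func SplitBudget(limitBytes uint64, heapPct, nativePct float64) (Split, error) {
	if heapPct < 0 || nativePct < 0 || heapPct+nativePct > 1 {
		return Split{}, errors.New("heap and native shares must be non-negative and sum to at most 1")
	}
	limitMiB := limitBytes / MiB
	heap := uint64(float64(limitMiB) * heapPct)
	native := uint64(float64(limitMiB) * nativePct)
	return Split{HeapMiB: heap, NativeMiB: native, SafetyMiB: limitMiB - heap - native}, nil
}

func main() {
	s, err := SplitBudget(2048*MiB, 0.60, 0.25) // 2GiB limit
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("heap=%d MiB native=%d MiB safety=%d MiB\n", s.HeapMiB, s.NativeMiB, s.SafetyMiB)
	fmt.Printf("suggested flag: -Xmx%dm\n", s.HeapMiB)
}
```

For a 2GiB limit this yields 1228 MiB heap, 512 MiB native and 308 MiB safety, which matches the rounded figures in the list.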
Heap sizing
Explicit:
-Xms512m -Xmx1200m
Or percent-based:
-XX:MaxRAMPercentage=60
-XX:InitialRAMPercentage=25
Percentages only control heap, not total RSS.
Direct memory (Netty / NIO)
-XX:MaxDirectMemorySize=256m
Thread stacks
-Xss256k
Lower stacks can help if you run many threads. Use with care.
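To see why stack size matters at high thread counts, here is a back-of-the-envelope sketch. StackReserveBytes is a hypothetical helper and the thread count is invented. The result is an upper bound: stacks are reserved up front, but RSS only counts the pages a thread has actually touched. I also assume a 1 MiB default stack, which is typical on 64-bit Linux.

```go
package main

import "fmt"

const KiB = 1024

// StackReserveBytes returns threads * stack size, i.e. the address space
// reserved for thread stacks. Real RSS is usually lower because untouched
// stack pages are not resident.
func StackReserveBytes(threads, stackKiB uint64) uint64 {
	return threads * stackKiB * KiB
}

func main() {
	const threads = 500 // hypothetical thread count
	for _, xss := range []uint64{1024, 256} {
		mib := StackReserveBytes(threads, xss) / (1024 * KiB)
		fmt.Printf("-Xss%dk x %d threads -> up to %d MiB\n", xss, threads, mib)
	}
}
```

At 500 threads, going from 1024k to 256k stacks lowers the worst case from 500 MiB to 125 MiB.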
Metaspace
Capping metaspace can backfire. I usually:
- measure with NMT
- fix the cause
- only then consider a cap
When the contract fails: pre-OOM debugging checklist
1) Heap vs RSS
If heap is flat but RSS grows, it is almost always native/direct/threads/page cache.
2) Thread count
ps -o pid,comm,nlwp -p 1
If nlwp jumps, RSS will follow (stacks + TLS + arenas).
3) Direct buffers / Netty
If throughput grows and RSS grows but heap does not:
- check MaxDirectMemorySize
- check pooling and leak detection
4) Native Memory Tracking (NMT)
Start JVM with:
-XX:NativeMemoryTracking=summary
Then:
jcmd 1 VM.native_memory summary scale=MB
This is the fastest way to stop guessing. It has overhead - do not enable everywhere in prod.
5) Page cache / file memory
In cgroup v2, check memory.stat:
grep -E 'anon|file|slab' /sys/fs/cgroup/memory.stat
If file grows, page cache may be the culprit.
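If you would rather track these counters from code than from a shell, a small parser is enough. This is a sketch assuming the cgroup v2 layout of memory.stat (one "key value" pair per line); ParseMemoryStat is a hypothetical helper, not part of rsscontract.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// ParseMemoryStat turns "key value" lines into a map.
// Lines that do not match that shape are skipped.
func ParseMemoryStat(text string) map[string]uint64 {
	out := make(map[string]uint64)
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) != 2 {
			continue
		}
		v, err := strconv.ParseUint(fields[1], 10, 64)
		if err != nil {
			continue
		}
		out[fields[0]] = v
	}
	return out
}

func main() {
	b, err := os.ReadFile("/sys/fs/cgroup/memory.stat")
	if err != nil {
		fmt.Println("memory.stat not readable (not cgroup v2?):", err)
		return
	}
	st := ParseMemoryStat(string(b))
	// anon is process memory, file is page cache charged to the cgroup.
	fmt.Printf("anon=%d file=%d slab=%d\n", st["anon"], st["file"], st["slab"])
}
```

Sampling anon and file side by side over time shows whether growth comes from the process itself or from page cache.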
Runtime guard: headroom metrics and alerts
CI catches a lot, but not everything. Add runtime headroom:
- rss_bytes
- cgroup_limit_bytes
- rss_headroom_bytes = limit - rss
- rss_headroom_ratio = headroom / limit
Alert examples:
- rss_headroom_ratio < 0.15 for 5m -> notify
- rss_headroom_ratio < 0.08 -> panic (shed load / disable features)
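The headroom math and the two thresholds can be sketched as follows. HeadroomRatio, Classify and the Level names are hypothetical. The sketch classifies a single sample only; the "for 5m" part would normally live in your alerting system, not in this code.

```go
package main

import "fmt"

// Level is the alert level derived from the headroom ratio.
type Level int

const (
	LevelOK Level = iota
	LevelNotify
	LevelPanic
)

// HeadroomRatio returns (limit - rss) / limit, clamped to 0 when RSS is at
// or above the limit or when the limit is unknown (0).
func HeadroomRatio(limit, rss uint64) float64 {
	if limit == 0 || rss >= limit {
		return 0
	}
	return float64(limit-rss) / float64(limit)
}

// Classify maps one headroom sample to an alert level using the
// thresholds from the alert examples (0.15 and 0.08).
func Classify(ratio float64) Level {
	switch {
	case ratio < 0.08:
		return LevelPanic
	case ratio < 0.15:
		return LevelNotify
	default:
		return LevelOK
	}
}

func main() {
	const MiB = 1024 * 1024
	limit := uint64(2048 * MiB)
	rss := uint64(1800 * MiB) // hypothetical sample
	r := HeadroomRatio(limit, rss)
	fmt.Printf("headroom_ratio=%.3f level=%d\n", r, Classify(r))
}
```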
FAQ
Why OOMKilled if heap is only 60%?
Because the kernel kills by cgroup usage/RSS. Heap is only a part of it.
If I lower -Xmx, is it fixed?
Maybe. Without a contract you do not know if you just moved the problem into GC/latency.
Why RSS and not only cgroup usage?
For a single-process container they are close. RSS gives process truth. For multi-process containers, cgroup usage matters more.
Will this be flaky in CI?
If you keep a short, deterministic warm-up and sane thresholds (fail at ~90%), it is surprisingly stable. Flakiness often signals real spikes.
Conclusion
RSS Contracts are simple, but they change the game:
- you stop guessing -Xmx
- you get a reproducible guardrail
- memory becomes an API with a budget, not a mystery
Bonus: next steps
- rsscontract diff vs baseline with PR comment
- NMT on fail only - auto attach jcmd VM.native_memory summary
- Kubernetes sidecar with shareProcessNamespace: true (advanced but powerful)