RSS Contracts: Stop OOMKilled Java Pods in Kubernetes by Testing RSS as an API

I got tired of seeing JVM pods die with OOMKilled while the heap looked fine. This is one of the most common WTF incidents in Kubernetes:

  • container limit: 2GiB
  • heap set to -Xmx1400m
  • heap in graphs looks OK (50-70%)
  • and yet: OOMKilled

It often happens hours or days after the deploy (cache warm-up, shifting traffic patterns, that one rare endpoint).

This post is not another “set Xmx to 70%” recipe. Instead, I use the following as a guardrail:

RSS Contracts = memory budgets expressed as RSS / cgroup usage + automated verification in CI + a runtime headroom guard.

We already test observability as contracts (Span Contracts, Dash Contracts). So why not memory?

What OOMKilled actually means (and why heap is not an argument)

Java OOM (java.lang.OutOfMemoryError) = JVM cannot allocate memory inside its own limits.

Kubernetes OOMKilled = the kernel killed the process because cgroup memory usage exceeded the container limit.

The kernel does not care about heap. It cares about resident memory and cgroup usage.

So you can have:

  • heap OK
  • but native memory + direct buffers + thread stacks + metaspace + allocators + page cache = boom

Why the usual advice fails

“Set -Xmx to 70% of the limit”

  • Sometimes it works.
  • Sometimes it does not.
  • Mostly it is guessing without guardrails.

“Use MaxRAMPercentage”

Modern JDKs are container-aware, but that only sizes the heap. You still have:

  • metaspace
  • code cache
  • thread stacks
  • direct memory (Netty / NIO)
  • native overhead (malloc arenas, JNI, TLS, crypto)
  • fragmentation and spikes

“We have graphs”

Graphs are great… after the deploy. RSS Contracts move the check to PR/CI, before anything ships.

The memory bill (RSS budget) in a JVM container

Think of memory as an invoice. The container limit is your total budget.

| Line item | Where it comes from | Typical growth drivers | How it shows up |
|---|---|---|---|
| Java heap | JVM heap | object allocation, caches, large payloads | GC pressure, latency, RSS growth |
| Metaspace / Class metadata | native | dynamic classloading, proxies, libraries | growth without heap |
| Code cache / JIT | native | warm-up, lots of methods | slow growth |
| Thread stacks | native | high concurrency, thread-per-request | sudden RSS jumps |
| Direct buffers (NIO/Netty) | native | allocateDirect, Netty pooling | RSS growth outside heap |
| Malloc arenas / libc overhead | native | many threads, fragmentation | surprisingly large numbers |
| Page cache / file cache | kernel/cgroup | heavy IO, mmap, logs, jars | cgroup usage climbs, OOM w/o heap |
| Safety margin | reality | spikes, fragmentation, unknown unknowns | what saves you |

The point: OOMKilled is almost always “invoice > limit”. Heap is just one line item.
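
As a rough illustration of the invoice idea, the sketch below adds up a few line items and compares the total with the container limit. The `lineItem` type and every number in it are made up for the example; real values would come from NMT and cgroup stats, not from this table.

```go
package main

import "fmt"

const miB = 1 << 20

// lineItem is a hypothetical entry on the memory invoice.
type lineItem struct {
	name  string
	bytes uint64
}

// invoiceTotal sums all line items in bytes.
func invoiceTotal(items []lineItem) uint64 {
	var total uint64
	for _, it := range items {
		total += it.bytes
	}
	return total
}

// overLimit reports whether the invoice exceeds the container limit.
func overLimit(items []lineItem, limit uint64) bool {
	return invoiceTotal(items) > limit
}

func main() {
	limit := uint64(2048 * miB) // 2GiB container limit

	// Illustrative numbers only: a fully committed 1400m heap plus
	// plausible-looking native line items.
	items := []lineItem{
		{"heap (-Xmx1400m, fully committed)", 1400 * miB},
		{"metaspace", 150 * miB},
		{"code cache", 100 * miB},
		{"thread stacks", 200 * miB},
		{"direct buffers", 256 * miB},
		{"malloc arenas", 100 * miB},
	}

	fmt.Printf("total=%d MiB limit=%d MiB over=%v\n",
		invoiceTotal(items)/miB, limit/miB, overLimit(items, limit))
}
```

With these numbers the heap alone is under 70% of the limit, yet the invoice as a whole is over budget.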

RSS Contracts: definition and rules

An RSS Contract is a policy file in your repo:

  • maximum RSS as % of the limit
  • warning threshold
  • maximum RSS spike during the test
  • optional guidance for heap/native/safety split

Example rss_contract.yml:

version: 1

limits:
  max_rss_pct: 0.90
  warn_rss_pct: 0.80

spikes:
  max_rss_delta_pct: 0.08

sampling:
  duration: 60s
  interval: 250ms

guidance:
  heap_pct_of_limit: 0.60
  native_pct_of_limit: 0.25
  safety_pct_of_limit: 0.10
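
A contract like this is easy to get subtly wrong (warn above fail, a guidance split that adds up to more than 100%). A minimal sanity check might look like the sketch below. The `thresholds` type and the `validate` function are hypothetical helpers for this example and are not part of the tool shown later.

```go
package main

import (
	"errors"
	"fmt"
)

// thresholds mirrors the numeric fields of rss_contract.yml
// (hypothetical helper type for this sketch).
type thresholds struct {
	MaxRssPct      float64
	WarnRssPct     float64
	MaxRssDeltaPct float64
	HeapPct        float64
	NativePct      float64
	SafetyPct      float64
}

// validate returns an error describing the first inconsistency found.
func validate(t thresholds) error {
	switch {
	case t.MaxRssPct <= 0 || t.MaxRssPct > 1:
		return errors.New("max_rss_pct must be in (0, 1]")
	case t.WarnRssPct <= 0 || t.WarnRssPct >= t.MaxRssPct:
		return errors.New("warn_rss_pct must be in (0, max_rss_pct)")
	case t.MaxRssDeltaPct < 0 || t.MaxRssDeltaPct > 1:
		return errors.New("max_rss_delta_pct must be in [0, 1]")
	}
	if sum := t.HeapPct + t.NativePct + t.SafetyPct; sum > 1.0 {
		return fmt.Errorf("guidance split sums to %.2f, which exceeds the limit", sum)
	}
	return nil
}

func main() {
	// Values copied from the example contract above.
	c := thresholds{
		MaxRssPct: 0.90, WarnRssPct: 0.80, MaxRssDeltaPct: 0.08,
		HeapPct: 0.60, NativePct: 0.25, SafetyPct: 0.10,
	}
	if err := validate(c); err != nil {
		fmt.Println("invalid contract:", err)
		return
	}
	fmt.Println("contract looks consistent")
}
```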

How it works in practice

Workflow (CI and local)

  1. Start the app in a container with the same memory limit as prod.
  2. Run a short, realistic workload (smoke + a few hot paths).
  3. Run rsscontract verify, which:
    • reads cgroup memory limit
    • samples process RSS
    • computes max RSS and spikes
    • produces a report
  4. If the contract fails, the PR fails before deploy.

Measuring memory: kernel view, not JVM wishful thinking

1) Cgroup limit (v2 and v1)

Cgroup v2:

  • limit: /sys/fs/cgroup/memory.max
  • usage: /sys/fs/cgroup/memory.current

Cgroup v1:

  • limit: /sys/fs/cgroup/memory/memory.limit_in_bytes
  • usage: /sys/fs/cgroup/memory/memory.usage_in_bytes

2) Process RSS (best signal)

Most stable:

  • /proc/<pid>/smaps_rollup -> Rss: line (kB)

Fallback:

  • /proc/<pid>/status -> VmRSS:

Go implementation: rsscontract

One binary, no external services. You can bake it into your image and run it in CI.

internal/memprobe/memprobe.go

package memprobe

import (
  "bufio"
  "errors"
  "fmt"
  "os"
  "strconv"
  "strings"
)

func readFirstLine(path string) (string, error) {
  b, err := os.ReadFile(path)
  if err != nil {
    return "", err
  }
  s := strings.TrimSpace(string(b))
  if s == "" {
    return "", errors.New("empty")
  }
  if i := strings.IndexByte(s, '\n'); i >= 0 {
    s = s[:i]
  }
  return s, nil
}

// CgroupMemoryLimitBytes tries v2 first, then v1.
func CgroupMemoryLimitBytes() (uint64, error) {
  // cgroup v2
  if s, err := readFirstLine("/sys/fs/cgroup/memory.max"); err == nil {
    if s == "max" {
      return 0, errors.New("memory.max is unlimited")
    }
    v, err := strconv.ParseUint(s, 10, 64)
    if err == nil && v > 0 {
      if v > (1 << 60) {
        return 0, errors.New("memory.max looks unlimited")
      }
      return v, nil
    }
  }

  // cgroup v1
  if s, err := readFirstLine("/sys/fs/cgroup/memory/memory.limit_in_bytes"); err == nil {
    v, err := strconv.ParseUint(s, 10, 64)
    if err == nil && v > 0 {
      if v > (1 << 60) {
        return 0, errors.New("memory.limit_in_bytes looks unlimited")
      }
      return v, nil
    }
  }

  return 0, errors.New("cgroup memory limit not found")
}

// RssBytes reads RSS for a given pid using smaps_rollup, fallback to status.
func RssBytes(pid int) (uint64, error) {
  if v, err := rssFromSmapsRollup(pid); err == nil {
    return v, nil
  }
  return rssFromStatus(pid)
}

func rssFromSmapsRollup(pid int) (uint64, error) {
  path := fmt.Sprintf("/proc/%d/smaps_rollup", pid)
  f, err := os.Open(path)
  if err != nil {
    return 0, err
  }
  defer f.Close()

  sc := bufio.NewScanner(f)
  for sc.Scan() {
    line := sc.Text()
    if strings.HasPrefix(line, "Rss:") {
      fields := strings.Fields(line)
      if len(fields) < 2 {
        return 0, fmt.Errorf("unexpected rss line: %q", line)
      }
      kb, err := strconv.ParseUint(fields[1], 10, 64)
      if err != nil {
        return 0, err
      }
      return kb * 1024, nil
    }
  }
  if err := sc.Err(); err != nil {
    return 0, err
  }
  return 0, errors.New("Rss not found in smaps_rollup")
}

func rssFromStatus(pid int) (uint64, error) {
  path := fmt.Sprintf("/proc/%d/status", pid)
  f, err := os.Open(path)
  if err != nil {
    return 0, err
  }
  defer f.Close()

  sc := bufio.NewScanner(f)
  for sc.Scan() {
    line := sc.Text()
    if strings.HasPrefix(line, "VmRSS:") {
      fields := strings.Fields(line)
      if len(fields) < 2 {
        return 0, fmt.Errorf("unexpected VmRSS line: %q", line)
      }
      kb, err := strconv.ParseUint(fields[1], 10, 64)
      if err != nil {
        return 0, err
      }
      return kb * 1024, nil
    }
  }
  if err := sc.Err(); err != nil {
    return 0, err
  }
  return 0, errors.New("VmRSS not found in status")
}

cmd/rsscontract/main.go

package main

import (
  "encoding/json"
  "errors"
  "flag"
  "fmt"
  "math"
  "os"
  "time"

  "gopkg.in/yaml.v3"

  "example.com/rsscontract/internal/memprobe"
)

type Contract struct {
  Version int `yaml:"version"`
  Limits  struct {
    MaxRssPct  float64 `yaml:"max_rss_pct"`
    WarnRssPct float64 `yaml:"warn_rss_pct"`
  } `yaml:"limits"`
  Spikes struct {
    MaxRssDeltaPct float64 `yaml:"max_rss_delta_pct"`
  } `yaml:"spikes"`
  Sampling struct {
    Duration string `yaml:"duration"`
    Interval string `yaml:"interval"`
  } `yaml:"sampling"`
  Guidance struct {
    HeapPctOfLimit   float64 `yaml:"heap_pct_of_limit"`
    NativePctOfLimit float64 `yaml:"native_pct_of_limit"`
    SafetyPctOfLimit float64 `yaml:"safety_pct_of_limit"`
  } `yaml:"guidance"`
}

type Report struct {
  PID             int       `json:"pid"`
  Timestamp       time.Time `json:"timestamp"`
  LimitBytes      uint64    `json:"limit_bytes"`
  MaxRssBytes     uint64    `json:"max_rss_bytes"`
  MinRssBytes     uint64    `json:"min_rss_bytes"`
  DeltaRssBytes   int64     `json:"delta_rss_bytes"`
  MaxRssPct       float64   `json:"max_rss_pct"`
  DeltaRssPct     float64   `json:"delta_rss_pct"`
  WarnThreshold   float64   `json:"warn_threshold_pct"`
  FailThreshold   float64   `json:"fail_threshold_pct"`
  SpikeFailPct    float64   `json:"spike_fail_pct"`
  Warnings        []string  `json:"warnings,omitempty"`
  Violations      []string  `json:"violations,omitempty"`
  ContractVersion int       `json:"contract_version"`
}

func main() {
  var (
    contractPath = flag.String("contract", "rss_contract.yml", "Path to RSS contract YAML")
    pid          = flag.Int("pid", 1, "Target process PID (in container usually 1)")
    outPath      = flag.String("report", "rss_report.json", "Where to write JSON report")
  )
  flag.Parse()

  c, err := loadContract(*contractPath)
  if err != nil {
    fatal(err)
  }

  dur, err := time.ParseDuration(c.Sampling.Duration)
  if err != nil {
    fatal(fmt.Errorf("bad sampling.duration: %w", err))
  }
  interval, err := time.ParseDuration(c.Sampling.Interval)
  if err != nil {
    fatal(fmt.Errorf("bad sampling.interval: %w", err))
  }
  if dur <= 0 || interval <= 0 {
    fatal(errors.New("sampling.duration and sampling.interval must be > 0"))
  }

  limit, err := memprobe.CgroupMemoryLimitBytes()
  if err != nil {
    fatal(fmt.Errorf("cannot determine cgroup limit: %w", err))
  }

  rep, exitCode := verify(c, *pid, limit, dur, interval)
  if err := writeJSON(*outPath, rep); err != nil {
    fmt.Fprintf(os.Stderr, "report write failed: %v\n", err)
  }

  printHuman(rep)
  os.Exit(exitCode)
}

func verify(c Contract, pid int, limit uint64, dur, interval time.Duration) (Report, int) {
  rep := Report{
    PID:             pid,
    Timestamp:       time.Now().UTC(),
    LimitBytes:      limit,
    WarnThreshold:   c.Limits.WarnRssPct,
    FailThreshold:   c.Limits.MaxRssPct,
    SpikeFailPct:    c.Spikes.MaxRssDeltaPct,
    ContractVersion: c.Version,
  }

  start := time.Now()
  deadline := start.Add(dur)

  // minRSS/maxRSS avoid shadowing the min and max builtins (Go 1.21+).
  var (
    minRSS uint64 = math.MaxUint64
    maxRSS uint64
    first  uint64 // first sample; 0 means "not sampled yet"
    last   uint64 // most recent sample
  )

  for time.Now().Before(deadline) {
    rss, err := memprobe.RssBytes(pid)
    if err != nil {
      rep.Violations = append(rep.Violations, fmt.Sprintf("rss_read_error: %v", err))
      return rep, 2
    }

    if first == 0 {
      first = rss
    }
    last = rss

    if rss < minRSS {
      minRSS = rss
    }
    if rss > maxRSS {
      maxRSS = rss
    }

    time.Sleep(interval)
  }

  rep.MinRssBytes = minRSS
  rep.MaxRssBytes = maxRSS
  // Delta is the net change between the first and last sample. A spike that
  // rises and falls back within the window shows up in max/min RSS instead.
  rep.DeltaRssBytes = int64(last) - int64(first)

  rep.MaxRssPct = float64(maxRSS) / float64(limit)
  rep.DeltaRssPct = math.Abs(float64(rep.DeltaRssBytes)) / float64(limit)

  if c.Limits.WarnRssPct > 0 && rep.MaxRssPct >= c.Limits.WarnRssPct {
    rep.Warnings = append(rep.Warnings,
      fmt.Sprintf("RSS warning threshold exceeded: max_rss_pct=%.3f >= warn_rss_pct=%.3f", rep.MaxRssPct, c.Limits.WarnRssPct))
  }

  exitCode := 0
  if c.Limits.MaxRssPct > 0 && rep.MaxRssPct >= c.Limits.MaxRssPct {
    rep.Violations = append(rep.Violations,
      fmt.Sprintf("RSS contract failed: observed max_rss_pct=%.3f >= contract max_rss_pct=%.3f", rep.MaxRssPct, c.Limits.MaxRssPct))
    exitCode = 1
  }
  if c.Spikes.MaxRssDeltaPct > 0 && rep.DeltaRssPct >= c.Spikes.MaxRssDeltaPct {
    rep.Violations = append(rep.Violations,
      fmt.Sprintf("RSS spike contract failed: delta_rss_pct=%.3f >= max_rss_delta_pct=%.3f", rep.DeltaRssPct, c.Spikes.MaxRssDeltaPct))
    exitCode = 1
  }

  return rep, exitCode
}

func loadContract(path string) (Contract, error) {
  b, err := os.ReadFile(path)
  if err != nil {
    return Contract{}, err
  }
  var c Contract
  if err := yaml.Unmarshal(b, &c); err != nil {
    return Contract{}, err
  }
  if c.Version == 0 {
    c.Version = 1
  }
  return c, nil
}

func writeJSON(path string, rep Report) error {
  b, err := json.MarshalIndent(rep, "", "  ")
  if err != nil {
    return err
  }
  return os.WriteFile(path, b, 0o644)
}

func printHuman(rep Report) {
  fmt.Printf("RSS Contract report (pid=%d)\n", rep.PID)
  fmt.Printf("- limit:    %.2f MiB\n", float64(rep.LimitBytes)/(1024*1024))
  fmt.Printf("- max RSS:  %.2f MiB (%.1f%%)\n", float64(rep.MaxRssBytes)/(1024*1024), rep.MaxRssPct*100)
  fmt.Printf("- min RSS:  %.2f MiB\n", float64(rep.MinRssBytes)/(1024*1024))
  fmt.Printf("- delta:    %.2f MiB (%.1f%%)\n", float64(rep.DeltaRssBytes)/(1024*1024), rep.DeltaRssPct*100)

  for _, w := range rep.Warnings {
    fmt.Printf("WARN: %s\n", w)
  }
  for _, v := range rep.Violations {
    fmt.Printf("FAIL: %s\n", v)
  }
}

func fatal(err error) {
  fmt.Fprintln(os.Stderr, "error:", err)
  os.Exit(2)
}

go.mod

module example.com/rsscontract

go 1.22

require gopkg.in/yaml.v3 v3.0.1

How to bake it into a Java image

FROM golang:1.22 AS rssbuild
WORKDIR /src
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 go build -o /out/rsscontract ./cmd/rsscontract

FROM eclipse-temurin:21-jre
WORKDIR /app

COPY build/libs/app.jar /app/app.jar
COPY --from=rssbuild /out/rsscontract /usr/local/bin/rsscontract
COPY rss_contract.yml /app/rss_contract.yml

ENTRYPOINT ["java","-jar","/app/app.jar"]

CI example (GitHub Actions)

name: rss-contract
on: [pull_request]

jobs:
  rss_contract:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Build image
        run: |
          docker build -t myapp:test .

      - name: Run app with memory limit
        run: |
          docker run -d --name app \
            --memory=2g --memory-swap=2g \
            -p 8080:8080 \
            myapp:test

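      - name: Warm-up / smoke
        run: |
          # Wait up to ~60s for the app to answer before generating load.
          for i in $(seq 1 30); do
            curl -fsS http://localhost:8080/health && break
            sleep 2
          done
          for i in $(seq 1 50); do
            curl -fsS http://localhost:8080/health || true
            curl -fsS http://localhost:8080/api/some-hot-path || true
          done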

      - name: Run RSS contract inside container
        run: |
          docker exec app rsscontract \
            -contract /app/rss_contract.yml \
            -pid 1 \
            -report /tmp/rss_report.json

      - name: Copy report
        if: always()
        run: |
          docker cp app:/tmp/rss_report.json rss_report.json || true

      - name: Upload report
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: rss_report
          path: rss_report.json

Why --memory-swap=2g?

Setting --memory-swap to the same value as --memory disables swap for the container, which keeps it close to Kubernetes behavior (no swap, hard limit). It is not a perfect emulation, but it is a reproducible guardrail.

Turning the contract into JVM settings

The contract tells you how much RAM you can spend. Now split the budget.

A reasonable starting point (not dogma):

  • Heap: 55-65% of limit
  • Native: 20-30%
  • Safety: 10-15%

For a 2GiB limit:

  • heap: ~1.2GiB
  • native: ~0.5GiB
  • safety: ~0.3GiB
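
The arithmetic behind that split can be sketched in a few lines. The `budget` type and the `splitBudget` and `xmxFlag` helpers are hypothetical names for this example; the percentages are the starting-point values from above, not a recommendation for every service.

```go
package main

import "fmt"

const miB = 1 << 20

// budget holds a hypothetical split of the container limit, in bytes.
type budget struct {
	Heap, Native, Safety uint64
}

// splitBudget divides limit according to heapPct and nativePct.
// Whatever is left over becomes the safety margin (0 if nothing is left).
func splitBudget(limit uint64, heapPct, nativePct float64) budget {
	heap := uint64(float64(limit) * heapPct)
	native := uint64(float64(limit) * nativePct)
	var safety uint64
	if heap+native < limit {
		safety = limit - heap - native
	}
	return budget{Heap: heap, Native: native, Safety: safety}
}

// xmxFlag renders the heap budget as an -Xmx flag, rounded down to whole MiB.
func xmxFlag(b budget) string {
	return fmt.Sprintf("-Xmx%dm", b.Heap/miB)
}

func main() {
	b := splitBudget(2048*miB, 0.60, 0.25)
	fmt.Printf("heap=%d MiB native=%d MiB safety=%d MiB\n",
		b.Heap/miB, b.Native/miB, b.Safety/miB)
	fmt.Println(xmxFlag(b))
}
```

For a 2GiB limit with a 60/25 split this lands at roughly 1.2GiB heap, 0.5GiB native, and 0.3GiB safety, in line with the figures above.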

Heap sizing

Explicit:

-Xms512m -Xmx1200m

Or percent-based:

-XX:MaxRAMPercentage=60
-XX:InitialRAMPercentage=25

Percentages only control heap, not total RSS.

Direct memory (Netty / NIO)

-XX:MaxDirectMemorySize=256m

Thread stacks

-Xss256k

Lower stacks can help if you run many threads. Use with care.
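
To see why stack size matters with many threads, here is a back-of-the-envelope sketch. It computes threads × stack size, which is the address space reserved for stacks, an upper bound rather than actual RSS, since only touched stack pages become resident. The thread count of 800 is a made-up example, and the 1024k entry assumes the 1MiB default that is common on 64-bit Linux.

```go
package main

import "fmt"

const kiB = 1 << 10

// stackReservationBytes returns threads * stackSize, i.e. the address space
// reserved for thread stacks. Actual RSS is usually lower because only
// touched pages are resident.
func stackReservationBytes(threads, stackSizeKiB int) uint64 {
	return uint64(threads) * uint64(stackSizeKiB) * kiB
}

func main() {
	const threads = 800 // hypothetical thread count
	for _, xss := range []int{1024, 512, 256} {
		mib := stackReservationBytes(threads, xss) / (1 << 20)
		fmt.Printf("-Xss%dk x %d threads -> up to %d MiB of stacks\n", xss, threads, mib)
	}
}
```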

Metaspace

Capping metaspace can backfire. I usually:

  • measure with NMT
  • fix the cause
  • only then consider a cap

When the contract fails: pre-OOM debugging checklist

1) Heap vs RSS

If heap is flat but RSS grows, it is almost always native/direct/threads/page cache.

2) Thread count

ps -o pid,comm,nlwp -p 1

If nlwp jumps, RSS will follow (stacks + TLS + arenas).
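
If `ps` is not available in the image, the same number can be read from the `Threads:` line of /proc/<pid>/status. A minimal sketch, where `threadsFromStatus` is a hypothetical helper and the example reads its own process rather than pid 1:

```go
package main

import (
	"bufio"
	"errors"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// threadsFromStatus extracts the "Threads:" value from the text of
// /proc/<pid>/status.
func threadsFromStatus(status string) (int, error) {
	sc := bufio.NewScanner(strings.NewReader(status))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) >= 2 && fields[0] == "Threads:" {
			return strconv.Atoi(fields[1])
		}
	}
	if err := sc.Err(); err != nil {
		return 0, err
	}
	return 0, errors.New("Threads not found in status")
}

func main() {
	// /proc/self is used here so the sketch runs anywhere on Linux;
	// inside the container you would read /proc/1/status instead.
	b, err := os.ReadFile("/proc/self/status")
	if err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(1)
	}
	n, err := threadsFromStatus(string(b))
	if err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(1)
	}
	fmt.Println("threads:", n)
}
```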

3) Direct buffers / Netty

If throughput grows and RSS grows but heap does not:

  • check MaxDirectMemorySize
  • check pooling and leak detection

4) Native Memory Tracking (NMT)

Start JVM with:

-XX:NativeMemoryTracking=summary

Then:

jcmd 1 VM.native_memory summary scale=MB

This is the fastest way to stop guessing. It has overhead - do not enable everywhere in prod.

5) Page cache / file memory

In cgroup v2, check memory.stat:

grep -E 'anon|file|slab' /sys/fs/cgroup/memory.stat

If file grows, page cache may be the culprit.
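
The same check can be automated. The sketch below parses memory.stat into a map and reports what share of anon + file is file-backed. `parseMemoryStat` and `fileShare` are hypothetical helpers, and what counts as a "high" share is something you would have to calibrate for your own service.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseMemoryStat turns the "key value" lines of a cgroup v2 memory.stat
// file into a map. Lines that do not match that shape are skipped.
func parseMemoryStat(content string) map[string]uint64 {
	stats := make(map[string]uint64)
	sc := bufio.NewScanner(strings.NewReader(content))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) != 2 {
			continue
		}
		if v, err := strconv.ParseUint(fields[1], 10, 64); err == nil {
			stats[fields[0]] = v
		}
	}
	return stats
}

// fileShare returns file / (anon + file), or 0 when both are zero.
func fileShare(stats map[string]uint64) float64 {
	total := stats["anon"] + stats["file"]
	if total == 0 {
		return 0
	}
	return float64(stats["file"]) / float64(total)
}

func main() {
	b, err := os.ReadFile("/sys/fs/cgroup/memory.stat")
	if err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(1)
	}
	stats := parseMemoryStat(string(b))
	fmt.Printf("anon=%d file=%d slab=%d file_share=%.2f\n",
		stats["anon"], stats["file"], stats["slab"], fileShare(stats))
}
```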

Runtime guard: headroom metrics and alerts

CI catches a lot, but not everything. Add runtime headroom:

  • rss_bytes
  • cgroup_limit_bytes
  • rss_headroom_bytes = limit - rss
  • rss_headroom_ratio = headroom / limit

Alert examples:

  • rss_headroom_ratio < 0.15 for 5m -> notify
  • rss_headroom_ratio < 0.08 -> panic (shed load / disable features)
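
The headroom formulas and the two thresholds translate directly into code. In this sketch `headroomRatio` and `alertLevel` are hypothetical helper names; the "for 5m" condition is deliberately left out because it belongs in the alerting system, not in the sampler.

```go
package main

import "fmt"

// headroomRatio returns (limit - rss) / limit, clamped to 0 when RSS is at
// or above the limit or the limit is unknown.
func headroomRatio(rss, limit uint64) float64 {
	if limit == 0 || rss >= limit {
		return 0
	}
	return float64(limit-rss) / float64(limit)
}

// alertLevel maps a headroom ratio to the thresholds from the alert
// examples above (0.15 -> notify, 0.08 -> panic).
func alertLevel(ratio float64) string {
	switch {
	case ratio < 0.08:
		return "panic"
	case ratio < 0.15:
		return "notify"
	default:
		return "ok"
	}
}

func main() {
	const miB = 1 << 20
	limit := uint64(2048 * miB)
	// Illustrative RSS samples against a 2GiB limit.
	for _, rss := range []uint64{1500 * miB, 1800 * miB, 1950 * miB} {
		r := headroomRatio(rss, limit)
		fmt.Printf("rss=%d MiB headroom=%.3f -> %s\n", rss/miB, r, alertLevel(r))
	}
}
```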

FAQ

Why OOMKilled if heap is only 60%?

Because the kernel kills based on cgroup memory usage, not heap occupancy. Heap is only one part of it.

If I lower -Xmx, is it fixed?

Maybe. Without a contract you do not know if you just moved the problem into GC/latency.

Why RSS and not only cgroup usage?

For a single-process container they are usually close, although cgroup usage also counts page cache and kernel memory charged to the container. RSS gives process truth. For multi-process containers, cgroup usage matters more.

Will this be flaky in CI?

If you keep a short, deterministic warm-up and sane thresholds (fail at ~90%), it is surprisingly stable. Flakiness often signals real spikes.

Conclusion

RSS Contracts are simple, but they change the game:

  • you stop guessing -Xmx
  • you get a reproducible guardrail
  • memory becomes an API with a budget, not a mystery

Bonus: next steps

  1. rsscontract diff vs baseline with PR comment
  2. NMT on fail only - auto attach jcmd VM.native_memory summary
  3. Kubernetes sidecar with shareProcessNamespace: true (advanced but powerful)
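
For the first item, a baseline diff could start as small as the sketch below. It is only an idea of the shape, not an existing rsscontract subcommand: `diffMaxRSS` is a hypothetical helper that compares the `max_rss_bytes` field of two JSON reports, and the sample values are made up.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// report holds the one field of rss_report.json needed for a diff.
type report struct {
	MaxRssBytes uint64 `json:"max_rss_bytes"`
}

// diffMaxRSS returns the change in max RSS from baseline to current, in
// bytes. A positive value means the current run used more memory.
func diffMaxRSS(baselineJSON, currentJSON []byte) (int64, error) {
	var base, cur report
	if err := json.Unmarshal(baselineJSON, &base); err != nil {
		return 0, fmt.Errorf("baseline: %w", err)
	}
	if err := json.Unmarshal(currentJSON, &cur); err != nil {
		return 0, fmt.Errorf("current: %w", err)
	}
	return int64(cur.MaxRssBytes) - int64(base.MaxRssBytes), nil
}

func main() {
	// Made-up reports: 1500 MiB on the baseline, 1600 MiB on the PR.
	base := []byte(`{"max_rss_bytes": 1572864000}`)
	cur := []byte(`{"max_rss_bytes": 1677721600}`)

	d, err := diffMaxRSS(base, cur)
	if err != nil {
		panic(err)
	}
	fmt.Printf("max RSS changed by %+.1f MiB vs baseline\n", float64(d)/(1<<20))
}
```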

Cite this article

If you reference this post, please link to the original URL and credit the author.

Michal Drozd. "RSS Contracts: Stop OOMKilled Java Pods in Kubernetes by Testing RSS as an API". https://www.michal-drozd.com/en/blog/rss-contracts-jvm-oomkilled-kubernetes/ (Published November 27, 2025).