
RSS Contracts: How to stop killing Java pods in Kubernetes (OOMKilled) by testing RSS like an API

I got tired of JVM pods dying with OOMKilled while the heap looked fine. This is one of the most common WTF incidents in Kubernetes:

  • container limit: 2GiB
  • heap set to -Xmx1400m
  • heap looks OK in the graphs (50-70%)
  • and still: OOMKilled

It often happens only after hours or days (cache warm-up, traffic patterns, one rare endpoint).

This is not another “set Xmx to 70%” recipe. I treat it as a guardrail:

RSS Contracts = memory budgets expressed as RSS / cgroup usage + automated verification in CI + a runtime headroom guard.

We already test observability as contracts (Span Contracts, Dash Contracts). So why not memory?

What exactly OOMKilled means (and why heap is not an argument)

Java OOM (java.lang.OutOfMemoryError) = the JVM cannot allocate memory within its own limits.

Kubernetes OOMKilled = the kernel killed the process because cgroup memory usage exceeded the container limit.

The kernel does not care about heap. It cares about resident memory and cgroup usage.

That is why you can have:

  • heap OK
  • but native memory + direct buffers + thread stacks + metaspace + allocators + page cache = boom

Why the classic advice fails

“Set -Xmx to 70% of the limit”

  • Sometimes it works.
  • Sometimes it doesn't.
  • Mostly it is guessing without guardrails.

“Use MaxRAMPercentage”

Modern JDKs are container-aware, but that only solves heap sizing. You are still left with:

  • metaspace
  • code cache
  • thread stacks
  • direct memory (Netty / NIO)
  • native overhead (malloc arenas, JNI, TLS, crypto)
  • fragmentation and spikes

“But we have graphs”

Graphs are great… after the deploy. RSS Contracts move the problem into PR/CI.

Memory bill (RSS budget) in a JVM container

Think of memory as an invoice. The container limit is the total budget.

Item | Where it lives | Typical causes of growth | How it shows up
Java heap | JVM heap | object allocations, caches, payloads | GC pressure, latency, RSS growth
Metaspace / class metadata | native | dynamic classloading, proxies, libraries | growth without heap growth
Code cache / JIT | native | warm-up, many methods | slow growth
Thread stacks | native | high concurrency, thread-per-request | stepwise RSS growth
Direct buffers (NIO/Netty) | native | allocateDirect, Netty pooling | RSS growth outside the heap
Malloc arenas / libc overhead | native | many threads, fragmentation | surprisingly large numbers
Page cache / file cache | kernel/cgroup | IO, mmap, logs, jars | cgroup usage grows, OOM without heap growth
Safety margin | reality | spikes, fragmentation, unknown unknowns | the thing that saves you

The point: OOMKilled is almost always “invoice > limit”. Heap is just one line item.
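To make the invoice idea concrete, here is a minimal Go sketch that adds up a bill and compares it with the limit. The line-item sizes are made-up illustrative numbers, not measurements, and LineItem, BillTotal and OverLimit are hypothetical names that are not part of rsscontract.

```go
package main

import "fmt"

const MiB = 1 << 20

// LineItem is one entry on the memory bill.
type LineItem struct {
	Name  string
	Bytes uint64
}

// BillTotal sums all line items.
func BillTotal(items []LineItem) uint64 {
	var total uint64
	for _, it := range items {
		total += it.Bytes
	}
	return total
}

// OverLimit reports whether the bill is strictly larger than the limit.
func OverLimit(items []LineItem, limit uint64) bool {
	return BillTotal(items) > limit
}

func main() {
	limit := uint64(2048 * MiB)
	// Illustrative numbers only: a fully touched 1400m heap plus
	// assumed native overheads.
	bill := []LineItem{
		{"java heap (-Xmx1400m, fully touched)", 1400 * MiB},
		{"metaspace", 150 * MiB},
		{"code cache", 100 * MiB},
		{"thread stacks", 100 * MiB},
		{"direct buffers", 256 * MiB},
		{"malloc arenas / libc", 100 * MiB},
	}
	fmt.Printf("bill: %d MiB, limit: %d MiB, over limit: %v\n",
		BillTotal(bill)/MiB, limit/MiB, OverLimit(bill, limit))
}
```

With these assumed numbers the heap sits at about 68% of the limit, yet the total bill is already above 2GiB.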

RSS Contracts: definition and rules

An RSS Contract is a policy file in the repository:

  • max RSS as % of the limit
  • warning threshold
  • max RSS spike during the test
  • optional guidance on the heap/native/safety split

Example rss_contract.yml:

version: 1

limits:
  max_rss_pct: 0.90
  warn_rss_pct: 0.80

spikes:
  max_rss_delta_pct: 0.08

sampling:
  duration: 60s
  interval: 250ms

guidance:
  heap_pct_of_limit: 0.60
  native_pct_of_limit: 0.25
  safety_pct_of_limit: 0.10
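A contract file like this is easy to get subtly wrong (warning above the fail threshold, a guidance split that adds up to more than the limit). A minimal sketch of a sanity check, assuming the fractions are already parsed into a struct; ContractValues and Validate are hypothetical names, and treating 0 as "check disabled" for the warning and spike thresholds is my assumption here.

```go
package main

import (
	"errors"
	"fmt"
)

// ContractValues mirrors the numeric fields of rss_contract.yml.
type ContractValues struct {
	MaxRssPct, WarnRssPct, MaxRssDeltaPct float64
	HeapPct, NativePct, SafetyPct         float64
}

// Validate rejects values that cannot describe a sane budget.
func Validate(c ContractValues) error {
	if c.MaxRssPct <= 0 || c.MaxRssPct > 1 {
		return errors.New("limits.max_rss_pct must be in (0, 1]")
	}
	if c.WarnRssPct < 0 || c.WarnRssPct >= c.MaxRssPct {
		return errors.New("limits.warn_rss_pct must be in [0, max_rss_pct)")
	}
	if c.MaxRssDeltaPct < 0 || c.MaxRssDeltaPct > 1 {
		return errors.New("spikes.max_rss_delta_pct must be in [0, 1]")
	}
	// Small epsilon so that a split summing to exactly 1.0 is accepted.
	if sum := c.HeapPct + c.NativePct + c.SafetyPct; sum > 1+1e-9 {
		return fmt.Errorf("guidance split sums to %.2f of the limit, must be <= 1", sum)
	}
	return nil
}

func main() {
	// Values from the example contract above.
	example := ContractValues{
		MaxRssPct: 0.90, WarnRssPct: 0.80, MaxRssDeltaPct: 0.08,
		HeapPct: 0.60, NativePct: 0.25, SafetyPct: 0.10,
	}
	fmt.Println("example contract error:", Validate(example))
}
```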

How it works in practice

Workflow (CI and local)

  1. Run the application in a container with the same memory limit as in production.
  2. Drive a short but realistic workload (smoke + a few hot paths).
  3. Run rsscontract, which:
    • reads the cgroup memory limit
    • samples process RSS
    • computes max RSS and spikes
    • generates a report
  4. If the contract fails, the PR fails before deploy.

Measuring memory: the kernel view, not JVM wishful thinking

1) Cgroup limit (v2 and v1)

Cgroup v2:

  • limit: /sys/fs/cgroup/memory.max
  • usage: /sys/fs/cgroup/memory.current

Cgroup v1:

  • limit: /sys/fs/cgroup/memory/memory.limit_in_bytes
  • usage: /sys/fs/cgroup/memory/memory.usage_in_bytes
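For the usage side of these files, a minimal Go sketch could look like the following. It assumes the default mount points listed above; parseBytes and CgroupMemoryUsageBytes are hypothetical helper names.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseBytes parses the single decimal number a cgroup usage file contains.
func parseBytes(raw string) (uint64, error) {
	return strconv.ParseUint(strings.TrimSpace(raw), 10, 64)
}

// CgroupMemoryUsageBytes tries the cgroup v2 file first, then the v1 file.
func CgroupMemoryUsageBytes() (uint64, error) {
	paths := []string{
		"/sys/fs/cgroup/memory.current",               // cgroup v2
		"/sys/fs/cgroup/memory/memory.usage_in_bytes", // cgroup v1
	}
	for _, p := range paths {
		b, err := os.ReadFile(p)
		if err != nil {
			continue // file missing: try the next layout
		}
		if v, err := parseBytes(string(b)); err == nil {
			return v, nil
		}
	}
	return 0, errors.New("cgroup memory usage not found")
}

func main() {
	usage, err := CgroupMemoryUsageBytes()
	if err != nil {
		fmt.Println("usage unavailable:", err)
		return
	}
	fmt.Printf("cgroup usage: %.1f MiB\n", float64(usage)/(1<<20))
}
```

Outside a container (for example in the root cgroup) neither file may exist, so the sketch reports that instead of guessing.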

2) Process RSS (the best signal)

Most stable:

  • /proc/<pid>/smaps_rollup -> Rss: (kB)

Fallback:

  • /proc/<pid>/status -> VmRSS:

Go implementation: rsscontract

One binary, no external services. You can bundle it into the image and run it in CI.

internal/memprobe/memprobe.go

package memprobe

import (
  "bufio"
  "errors"
  "fmt"
  "os"
  "strconv"
  "strings"
)

func readFirstLine(path string) (string, error) {
  b, err := os.ReadFile(path)
  if err != nil {
    return "", err
  }
  s := strings.TrimSpace(string(b))
  if s == "" {
    return "", errors.New("empty")
  }
  if i := strings.IndexByte(s, '\n'); i >= 0 {
    s = s[:i]
  }
  return s, nil
}

// CgroupMemoryLimitBytes tries v2 first, then v1.
func CgroupMemoryLimitBytes() (uint64, error) {
  // cgroup v2
  if s, err := readFirstLine("/sys/fs/cgroup/memory.max"); err == nil {
    if s == "max" {
      return 0, errors.New("memory.max is unlimited")
    }
    v, err := strconv.ParseUint(s, 10, 64)
    if err == nil && v > 0 {
      if v > (1 << 60) {
        return 0, errors.New("memory.max looks unlimited")
      }
      return v, nil
    }
  }

  // cgroup v1
  if s, err := readFirstLine("/sys/fs/cgroup/memory/memory.limit_in_bytes"); err == nil {
    v, err := strconv.ParseUint(s, 10, 64)
    if err == nil && v > 0 {
      if v > (1 << 60) {
        return 0, errors.New("memory.limit_in_bytes looks unlimited")
      }
      return v, nil
    }
  }

  return 0, errors.New("cgroup memory limit not found")
}

// RssBytes reads RSS for a given pid using smaps_rollup, fallback to status.
func RssBytes(pid int) (uint64, error) {
  if v, err := rssFromSmapsRollup(pid); err == nil {
    return v, nil
  }
  return rssFromStatus(pid)
}

func rssFromSmapsRollup(pid int) (uint64, error) {
  path := fmt.Sprintf("/proc/%d/smaps_rollup", pid)
  f, err := os.Open(path)
  if err != nil {
    return 0, err
  }
  defer f.Close()

  sc := bufio.NewScanner(f)
  for sc.Scan() {
    line := sc.Text()
    if strings.HasPrefix(line, "Rss:") {
      fields := strings.Fields(line)
      if len(fields) < 2 {
        return 0, fmt.Errorf("unexpected rss line: %q", line)
      }
      kb, err := strconv.ParseUint(fields[1], 10, 64)
      if err != nil {
        return 0, err
      }
      return kb * 1024, nil
    }
  }
  if err := sc.Err(); err != nil {
    return 0, err
  }
  return 0, errors.New("Rss not found in smaps_rollup")
}

func rssFromStatus(pid int) (uint64, error) {
  path := fmt.Sprintf("/proc/%d/status", pid)
  f, err := os.Open(path)
  if err != nil {
    return 0, err
  }
  defer f.Close()

  sc := bufio.NewScanner(f)
  for sc.Scan() {
    line := sc.Text()
    if strings.HasPrefix(line, "VmRSS:") {
      fields := strings.Fields(line)
      if len(fields) < 2 {
        return 0, fmt.Errorf("unexpected VmRSS line: %q", line)
      }
      kb, err := strconv.ParseUint(fields[1], 10, 64)
      if err != nil {
        return 0, err
      }
      return kb * 1024, nil
    }
  }
  if err := sc.Err(); err != nil {
    return 0, err
  }
  return 0, errors.New("VmRSS not found in status")
}

cmd/rsscontract/main.go

package main

import (
  "encoding/json"
  "errors"
  "flag"
  "fmt"
  "math"
  "os"
  "time"

  "gopkg.in/yaml.v3"

  "example.com/rsscontract/internal/memprobe"
)

type Contract struct {
  Version int `yaml:"version"`
  Limits  struct {
    MaxRssPct  float64 `yaml:"max_rss_pct"`
    WarnRssPct float64 `yaml:"warn_rss_pct"`
  } `yaml:"limits"`
  Spikes struct {
    MaxRssDeltaPct float64 `yaml:"max_rss_delta_pct"`
  } `yaml:"spikes"`
  Sampling struct {
    Duration string `yaml:"duration"`
    Interval string `yaml:"interval"`
  } `yaml:"sampling"`
  Guidance struct {
    HeapPctOfLimit   float64 `yaml:"heap_pct_of_limit"`
    NativePctOfLimit float64 `yaml:"native_pct_of_limit"`
    SafetyPctOfLimit float64 `yaml:"safety_pct_of_limit"`
  } `yaml:"guidance"`
}

type Report struct {
  PID             int       `json:"pid"`
  Timestamp       time.Time `json:"timestamp"`
  LimitBytes      uint64    `json:"limit_bytes"`
  MaxRssBytes     uint64    `json:"max_rss_bytes"`
  MinRssBytes     uint64    `json:"min_rss_bytes"`
  DeltaRssBytes   int64     `json:"delta_rss_bytes"`
  MaxRssPct       float64   `json:"max_rss_pct"`
  DeltaRssPct     float64   `json:"delta_rss_pct"`
  WarnThreshold   float64   `json:"warn_threshold_pct"`
  FailThreshold   float64   `json:"fail_threshold_pct"`
  SpikeFailPct    float64   `json:"spike_fail_pct"`
  Warnings        []string  `json:"warnings,omitempty"`
  Violations      []string  `json:"violations,omitempty"`
  ContractVersion int       `json:"contract_version"`
}

func main() {
  var (
    contractPath = flag.String("contract", "rss_contract.yml", "Path to RSS contract YAML")
    pid          = flag.Int("pid", 1, "Target process PID (in container usually 1)")
    outPath      = flag.String("report", "rss_report.json", "Where to write JSON report")
  )
  flag.Parse()

  c, err := loadContract(*contractPath)
  if err != nil {
    fatal(err)
  }

  dur, err := time.ParseDuration(c.Sampling.Duration)
  if err != nil {
    fatal(fmt.Errorf("bad sampling.duration: %w", err))
  }
  interval, err := time.ParseDuration(c.Sampling.Interval)
  if err != nil {
    fatal(fmt.Errorf("bad sampling.interval: %w", err))
  }
  if dur <= 0 || interval <= 0 {
    fatal(errors.New("sampling.duration and sampling.interval must be > 0"))
  }

  limit, err := memprobe.CgroupMemoryLimitBytes()
  if err != nil {
    fatal(fmt.Errorf("cannot determine cgroup limit: %w", err))
  }

  rep, exitCode := verify(c, *pid, limit, dur, interval)
  if err := writeJSON(*outPath, rep); err != nil {
    fmt.Fprintf(os.Stderr, "report write failed: %v\n", err)
  }

  printHuman(rep)
  os.Exit(exitCode)
}

func verify(c Contract, pid int, limit uint64, dur, interval time.Duration) (Report, int) {
  rep := Report{
    PID:             pid,
    Timestamp:       time.Now().UTC(),
    LimitBytes:      limit,
    WarnThreshold:   c.Limits.WarnRssPct,
    FailThreshold:   c.Limits.MaxRssPct,
    SpikeFailPct:    c.Spikes.MaxRssDeltaPct,
    ContractVersion: c.Version,
  }

  start := time.Now()
  deadline := start.Add(dur)

  var min uint64 = math.MaxUint64
  var max uint64 = 0

  var first uint64 = 0
  var last uint64 = 0

  for now := time.Now(); now.Before(deadline); now = time.Now() {
    rss, err := memprobe.RssBytes(pid)
    if err != nil {
      rep.Violations = append(rep.Violations, fmt.Sprintf("rss_read_error: %v", err))
      return rep, 2
    }

    if first == 0 {
      first = rss
    }
    last = rss

    if rss < min {
      min = rss
    }
    if rss > max {
      max = rss
    }

    time.Sleep(interval)
  }

  rep.MinRssBytes = min
  rep.MaxRssBytes = max
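  // Net growth over the sampling window (last - first). A spike that rises
  // and falls back within the window does not show up in this delta.
  rep.DeltaRssBytes = int64(last) - int64(first)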

  rep.MaxRssPct = float64(max) / float64(limit)
  rep.DeltaRssPct = math.Abs(float64(rep.DeltaRssBytes)) / float64(limit)

  if c.Limits.WarnRssPct > 0 && rep.MaxRssPct >= c.Limits.WarnRssPct {
    rep.Warnings = append(rep.Warnings,
      fmt.Sprintf("RSS warning threshold exceeded: max_rss_pct=%.3f >= warn_rss_pct=%.3f", rep.MaxRssPct, c.Limits.WarnRssPct))
  }

  exitCode := 0
  if c.Limits.MaxRssPct > 0 && rep.MaxRssPct >= c.Limits.MaxRssPct {
    rep.Violations = append(rep.Violations,
      fmt.Sprintf("RSS contract failed: observed max_rss_pct=%.3f >= contract max_rss_pct=%.3f", rep.MaxRssPct, c.Limits.MaxRssPct))
    exitCode = 1
  }
  if c.Spikes.MaxRssDeltaPct > 0 && rep.DeltaRssPct >= c.Spikes.MaxRssDeltaPct {
    rep.Violations = append(rep.Violations,
      fmt.Sprintf("RSS spike contract failed: delta_rss_pct=%.3f >= max_rss_delta_pct=%.3f", rep.DeltaRssPct, c.Spikes.MaxRssDeltaPct))
    exitCode = 1
  }

  return rep, exitCode
}

func loadContract(path string) (Contract, error) {
  b, err := os.ReadFile(path)
  if err != nil {
    return Contract{}, err
  }
  var c Contract
  if err := yaml.Unmarshal(b, &c); err != nil {
    return Contract{}, err
  }
  if c.Version == 0 {
    c.Version = 1
  }
  return c, nil
}

func writeJSON(path string, rep Report) error {
  b, err := json.MarshalIndent(rep, "", "  ")
  if err != nil {
    return err
  }
  return os.WriteFile(path, b, 0o644)
}

func printHuman(rep Report) {
  fmt.Printf("RSS Contract report (pid=%d)\n", rep.PID)
  fmt.Printf("- limit:    %.2f MiB\n", float64(rep.LimitBytes)/(1024*1024))
  fmt.Printf("- max RSS:  %.2f MiB (%.1f%%)\n", float64(rep.MaxRssBytes)/(1024*1024), rep.MaxRssPct*100)
  fmt.Printf("- min RSS:  %.2f MiB\n", float64(rep.MinRssBytes)/(1024*1024))
  fmt.Printf("- delta:    %.2f MiB (%.1f%%)\n", float64(rep.DeltaRssBytes)/(1024*1024), rep.DeltaRssPct*100)

  for _, w := range rep.Warnings {
    fmt.Printf("WARN: %s\n", w)
  }
  for _, v := range rep.Violations {
    fmt.Printf("FAIL: %s\n", v)
  }
}

func fatal(err error) {
  fmt.Fprintln(os.Stderr, "error:", err)
  os.Exit(2)
}

go.mod

module example.com/rsscontract

go 1.22

require gopkg.in/yaml.v3 v3.0.1

How to bundle it into a Java image

FROM golang:1.22 AS rssbuild
WORKDIR /src
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 go build -o /out/rsscontract ./cmd/rsscontract

FROM eclipse-temurin:21-jre
WORKDIR /app

COPY build/libs/app.jar /app/app.jar
COPY --from=rssbuild /out/rsscontract /usr/local/bin/rsscontract
COPY rss_contract.yml /app/rss_contract.yml

ENTRYPOINT ["java","-jar","/app/app.jar"]

CI example (GitHub Actions)

name: rss-contract
on: [pull_request]

jobs:
  rss_contract:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Build image
        run: |
          docker build -t myapp:test .

      - name: Run app with memory limit
        run: |
          docker run -d --name app \
            --memory=2g --memory-swap=2g \
            -p 8080:8080 \
            myapp:test

      - name: Warm-up / smoke
        run: |
          for i in $(seq 1 50); do
            curl -fsS http://localhost:8080/health || true
            curl -fsS http://localhost:8080/api/some-hot-path || true
          done

      - name: Run RSS contract inside container
        run: |
          docker exec app rsscontract \
            -contract /app/rss_contract.yml \
            -pid 1 \
            -report /tmp/rss_report.json

      - name: Copy report
        if: always()
        run: |
          docker cp app:/tmp/rss_report.json rss_report.json || true

      - name: Upload report
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: rss_report
          path: rss_report.json

Why --memory-swap=2g?

Setting --memory-swap to the same value as --memory leaves the container with no swap, which keeps it close to Kubernetes behavior (no swap, strict limits). It is not a perfect emulation, it is a reproducible guardrail.

How to turn the contract into real JVM settings

The contract tells you how much RAM you can spend. Now split the budget.

A reasonable start (not dogma):

  • Heap: 55-65% of the limit
  • Native: 20-30%
  • Safety: 10-15%

For a 2GiB limit:

  • heap: ~1.2GiB
  • native: ~0.5GiB
  • safety: ~0.3GiB
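The same split as arithmetic, in a minimal Go sketch. SplitBudget and XmxFlag are hypothetical helpers, and treating safety as "whatever is left after heap and native" is my simplification.

```go
package main

import (
	"errors"
	"fmt"
)

const MiB = 1 << 20

// Split is a heap/native/safety budget in bytes.
type Split struct{ Heap, Native, Safety uint64 }

// SplitBudget carves heap and native out of the limit by fraction;
// safety is the remainder.
func SplitBudget(limit uint64, heapPct, nativePct float64) (Split, error) {
	if heapPct <= 0 || nativePct <= 0 || heapPct+nativePct >= 1 {
		return Split{}, errors.New("fractions must be positive and sum to less than 1")
	}
	heap := uint64(float64(limit) * heapPct)
	native := uint64(float64(limit) * nativePct)
	return Split{Heap: heap, Native: native, Safety: limit - heap - native}, nil
}

// XmxFlag renders the heap budget as an -Xmx flag in whole MiB (rounded down).
func XmxFlag(s Split) string {
	return fmt.Sprintf("-Xmx%dm", s.Heap/MiB)
}

func main() {
	s, err := SplitBudget(2048*MiB, 0.60, 0.25)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%s  native=%d MiB  safety=%d MiB\n",
		XmxFlag(s), s.Native/MiB, s.Safety/MiB)
}
```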

Heap sizing

Explicitly:

-Xms512m -Xmx1200m

Or as percentages:

-XX:MaxRAMPercentage=60
-XX:InitialRAMPercentage=25

The percentages size the heap, not total RSS.

Direct memory (Netty / NIO)

-XX:MaxDirectMemorySize=256m

Thread stacks

-Xss256k

A smaller stack can help if you have many threads. Be careful: too small a stack turns deep call chains into StackOverflowError.
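A rough back-of-the-envelope for why the stack size matters, as a Go sketch. It computes threads × stack size, which is an upper bound on stack memory rather than actual RSS, because stack pages only become resident once they are touched. The 500-thread count is an illustrative assumption, and the 1 MiB figure is the default commonly seen on 64-bit Linux; check your own JVM.

```go
package main

import "fmt"

const (
	KiB = 1 << 10
	MiB = 1 << 20
)

// StackReserveBytes returns threads * stackSize: the most memory thread
// stacks could occupy if every stack were fully touched.
func StackReserveBytes(threads int, stackSize uint64) uint64 {
	if threads < 0 {
		return 0
	}
	return uint64(threads) * stackSize
}

func main() {
	const threads = 500 // illustrative thread count
	for _, xss := range []uint64{1024 * KiB, 256 * KiB} {
		fmt.Printf("-Xss%dk x %d threads -> up to %d MiB of stack\n",
			xss/KiB, threads, StackReserveBytes(threads, xss)/MiB)
	}
}
```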

Metaspace

Capping metaspace can backfire. I usually:

  • measure with NMT
  • fix the cause
  • only then consider a cap

When the contract fails: pre-OOM debugging checklist

1) Heap vs RSS

If heap is flat and RSS keeps growing, it is almost always native/direct/threads/page cache.

2) Thread count

ps -o pid,comm,nlwp -p 1

If nlwp shoots up, RSS will go up too (stacks + TLS + arenas).
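If ps is not available in the image, the same number can be read from the Threads: field of /proc/<pid>/status. A minimal Go sketch; parseThreads and ThreadCount are hypothetical helpers, not part of rsscontract.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseThreads extracts the "Threads:" value from /proc/<pid>/status content.
func parseThreads(status string) (int, error) {
	for _, line := range strings.Split(status, "\n") {
		if strings.HasPrefix(line, "Threads:") {
			fields := strings.Fields(line)
			if len(fields) < 2 {
				return 0, fmt.Errorf("unexpected Threads line: %q", line)
			}
			return strconv.Atoi(fields[1])
		}
	}
	return 0, errors.New("Threads not found in status")
}

// ThreadCount reads the thread count of the given pid from procfs.
func ThreadCount(pid int) (int, error) {
	b, err := os.ReadFile(fmt.Sprintf("/proc/%d/status", pid))
	if err != nil {
		return 0, err
	}
	return parseThreads(string(b))
}

func main() {
	// The sketch inspects itself; in a container you would pass the JVM pid.
	n, err := ThreadCount(os.Getpid())
	if err != nil {
		fmt.Println("thread count unavailable:", err)
		return
	}
	fmt.Println("threads:", n)
}
```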

3) Direct buffers / Netty

If throughput grows and RSS grows, but heap does not:

  • check MaxDirectMemorySize
  • check pooling and leak detection

4) Native Memory Tracking (NMT)

Start the JVM with:

-XX:NativeMemoryTracking=summary

Then:

jcmd 1 VM.native_memory summary scale=MB

This is the fastest way to stop guessing. It has overhead, so don't enable it everywhere in prod.

5) Page cache / file memory

In cgroup v2, look at memory.stat:

grep -E 'anon|file|slab' /sys/fs/cgroup/memory.stat

If file grows, page cache may be the culprit.
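The same check can be scripted. This is a minimal Go sketch that parses memory.stat into a map and prints only the exact anon, file and slab keys (the grep above also matches related counters). The cgroup v2 path is the default mount point, and parseMemoryStat is a hypothetical helper.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseMemoryStat turns "key value" lines into a map; malformed lines are skipped.
func parseMemoryStat(raw string) map[string]uint64 {
	out := map[string]uint64{}
	for _, line := range strings.Split(raw, "\n") {
		fields := strings.Fields(line)
		if len(fields) != 2 {
			continue
		}
		v, err := strconv.ParseUint(fields[1], 10, 64)
		if err != nil {
			continue
		}
		out[fields[0]] = v
	}
	return out
}

func main() {
	b, err := os.ReadFile("/sys/fs/cgroup/memory.stat")
	if err != nil {
		fmt.Println("memory.stat unavailable:", err)
		return
	}
	stat := parseMemoryStat(string(b))
	for _, key := range []string{"anon", "file", "slab"} {
		fmt.Printf("%-5s %8.1f MiB\n", key, float64(stat[key])/(1<<20))
	}
}
```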

Runtime guard: headroom metrics and alerts

CI catches a lot, but not everything. Add runtime headroom:

  • rss_bytes
  • cgroup_limit_bytes
  • rss_headroom_bytes = limit - rss
  • rss_headroom_ratio = headroom / limit

Alerts:

  • rss_headroom_ratio < 0.15 for 5m -> notify
  • rss_headroom_ratio < 0.08 -> panic (shed load / disable features)
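A minimal Go sketch of the headroom math and the two thresholds. It only classifies a single sample; the "for 5m" part belongs in your alerting system and is not modeled here. HeadroomRatio, Classify and the Level names are hypothetical.

```go
package main

import "fmt"

// Level is the alert level derived from a single headroom sample.
type Level int

const (
	LevelOK Level = iota
	LevelNotify
	LevelPanic
)

// HeadroomRatio returns (limit - rss) / limit, clamped to 0 when rss is at
// or above the limit or when the limit is unknown (0).
func HeadroomRatio(limit, rss uint64) float64 {
	if limit == 0 || rss >= limit {
		return 0
	}
	return float64(limit-rss) / float64(limit)
}

// Classify applies the thresholds from the alert list above.
func Classify(ratio float64) Level {
	switch {
	case ratio < 0.08:
		return LevelPanic
	case ratio < 0.15:
		return LevelNotify
	default:
		return LevelOK
	}
}

func main() {
	limit := uint64(2048 << 20)
	for _, rssMiB := range []uint64{1400, 1800, 1950} {
		r := HeadroomRatio(limit, rssMiB<<20)
		fmt.Printf("rss=%d MiB headroom=%.3f level=%d\n", rssMiB, r, Classify(r))
	}
}
```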

FAQ

Why OOMKilled when heap is only at 60%?

The kernel kills based on cgroup usage/RSS. Heap is only part of it.

If I lower -Xmx, will that fix it?

Maybe. Without a contract you don't know whether you just moved the problem into GC/latency.

Why RSS and not just cgroup usage?

In a single-process container the two are close. RSS is the process truth. With multiple processes, cgroup usage matters more.

Won't it be flaky in CI?

If you have a short, deterministic warm-up and sane thresholds (fail at around 90%), it is stable. Flakiness often reveals real spikes.

Conclusion

RSS Contracts are simple, but they change the game:

  • you stop guessing -Xmx
  • you have a reproducible guardrail
  • memory becomes an API with a budget, not a mystery

Bonus: next steps

  1. rsscontract diff vs baseline with a PR comment
  2. NMT only on failure: auto-attach jcmd VM.native_memory summary
  3. Kubernetes sidecar with shareProcessNamespace: true (advanced, but powerful)
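The first item does not exist yet. As a rough idea of what a baseline diff could compute, here is a Go sketch that compares max_rss_bytes from two JSON reports. It assumes both files use the report field name shown earlier in this post; MaxRssGrowthPct is a hypothetical function, and posting the PR comment is out of scope.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// reportSummary picks only the field needed for the comparison.
type reportSummary struct {
	MaxRssBytes uint64 `json:"max_rss_bytes"`
}

// MaxRssGrowthPct returns how much max RSS changed versus the baseline,
// in percent (positive means the current run used more memory).
func MaxRssGrowthPct(baselineJSON, currentJSON []byte) (float64, error) {
	var base, cur reportSummary
	if err := json.Unmarshal(baselineJSON, &base); err != nil {
		return 0, fmt.Errorf("baseline: %w", err)
	}
	if err := json.Unmarshal(currentJSON, &cur); err != nil {
		return 0, fmt.Errorf("current: %w", err)
	}
	if base.MaxRssBytes == 0 {
		return 0, errors.New("baseline max_rss_bytes is zero")
	}
	diff := float64(cur.MaxRssBytes) - float64(base.MaxRssBytes)
	return diff * 100 / float64(base.MaxRssBytes), nil
}

func main() {
	// Inline sample reports with made-up numbers.
	baseline := []byte(`{"max_rss_bytes": 1500000000}`)
	current := []byte(`{"max_rss_bytes": 1650000000}`)
	pct, err := MaxRssGrowthPct(baseline, current)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("max RSS changed by %+.1f%% vs baseline\n", pct)
}
```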


Cite this article

If you link to this article, include the original URL and credit the author.

Michal Drozd. "RSS Contracts: How to stop killing Java pods in Kubernetes (OOMKilled) by testing RSS like an API". https://www.michal-drozd.com/sk/blog/rss-contracts-jvm-oomkilled-kubernetes/ (Published November 27, 2025).