RSS Contracts: How to stop killing Java pods in Kubernetes (OOMKilled) by testing RSS like an API
I got tired of JVM pods dying with OOMKilled while the heap looked fine. It is one of the most common WTF incidents in Kubernetes:
- container limit: 2GiB
- heap set to -Xmx1400m
- heap looks OK in the graphs (50-70%)
- and still: OOMKilled
It often happens only after hours or days (cache warm-up, traffic patterns, one rare endpoint).
This is not another "set Xmx to 70%" recipe. I treat it as a guardrail:
RSS Contracts = memory budgets expressed as RSS / cgroup usage + automated verification in CI + a runtime headroom guard.
We already test observability as contracts (Span Contracts, Dash Contracts). So why not memory?
- Span Contracts: trace-derived API contracts in CI
- Dash Contracts: dashboards and alerts as a contract
What OOMKilled actually means (and why the heap is not an argument)
Java OOM (java.lang.OutOfMemoryError) = the JVM cannot allocate memory within its own limits.
Kubernetes OOMKilled = the kernel killed the process because cgroup memory usage exceeded the container limit.
The kernel does not care about the heap. It cares about resident memory and cgroup usage.
That is why you can have:
- heap OK
- but native memory + direct buffers + thread stacks + metaspace + allocators + page cache = boom
Why the classic advice fails
"Set -Xmx to 70% of the limit"
- Sometimes it works.
- Sometimes it doesn't.
- Mostly it's guessing without guardrails.
"Use MaxRAMPercentage"
Modern JDKs are container-aware, but that only solves heap sizing. You are still left with:
- metaspace
- code cache
- thread stacks
- direct memory (Netty / NIO)
- native overhead (malloc arenas, JNI, TLS, crypto)
- fragmentation and spikes
"But we have graphs"
Graphs are great... after the deploy. RSS Contracts move the problem into PR/CI.
Memory bill (RSS budget) in a JVM container
Think of memory as an invoice. The container limit is the total budget.
| Item | Where it lives | Typical causes of growth | How it shows up |
|---|---|---|---|
| Java heap | JVM heap | object allocations, caches, payloads | GC pressure, latency, RSS growth |
| Metaspace / class metadata | native | dynamic classloading, proxies, libraries | growth without heap growth |
| Code cache / JIT | native | warm-up, many methods | slow growth |
| Thread stacks | native | high concurrency, thread-per-request | step-like RSS growth |
| Direct buffers (NIO/Netty) | native | allocateDirect, Netty pooling | RSS growth outside the heap |
| Malloc arenas / libc overhead | native | many threads, fragmentation | surprisingly large numbers |
| Page cache / file cache | kernel/cgroup | IO, mmap, logs, jars | cgroup usage grows, OOM without heap growth |
| Safety margin | reality | spikes, fragmentation, unknown unknowns | the thing that saves you |
The point: OOMKilled almost always means "invoice > limit". The heap is just one line item.
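To make the invoice idea concrete, here is a minimal Go sketch that adds up a memory bill and compares it with the limit. Every number in it is hypothetical and chosen only to illustrate the 2GiB / -Xmx1400m scenario from the intro; the type and function names are my own, not part of rsscontract.

```go
package main

import "fmt"

// lineItem is one entry on the memory bill. All values are hypothetical.
type lineItem struct {
	name string
	mib  uint64
}

// billTotal sums every line item, in MiB.
func billTotal(items []lineItem) uint64 {
	var total uint64
	for _, it := range items {
		total += it.mib
	}
	return total
}

// overLimit reports whether the bill exceeds the container limit.
func overLimit(items []lineItem, limitMiB uint64) bool {
	return billTotal(items) > limitMiB
}

func main() {
	const limitMiB = 2048 // 2GiB container limit
	bill := []lineItem{
		{"heap (fully touched -Xmx1400m)", 1400},
		{"metaspace", 150},
		{"code cache", 100},
		{"thread stacks", 120},
		{"direct buffers", 256},
		{"malloc arenas", 80},
	}
	for _, it := range bill {
		fmt.Printf("%-32s %5d MiB\n", it.name, it.mib)
	}
	fmt.Printf("total %d MiB, limit %d MiB, over limit: %v\n",
		billTotal(bill), limitMiB, overLimit(bill, limitMiB))
}
```

With these made-up numbers the heap alone is under 70% of the limit, yet the total is over it.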
RSS Contracts: definition and rules
An RSS Contract is a policy file in the repository:
- max RSS as a fraction of the limit (0.90 = 90%)
- warning threshold
- max RSS spike during the test
- optional guidance for the heap/native/safety split
Example rss_contract.yml:
version: 1
limits:
  max_rss_pct: 0.90
  warn_rss_pct: 0.80
spikes:
  max_rss_delta_pct: 0.08
sampling:
  duration: 60s
  interval: 250ms
guidance:
  heap_pct_of_limit: 0.60
  native_pct_of_limit: 0.25
  safety_pct_of_limit: 0.10
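A contract file is only useful if its numbers are consistent with each other. The following is a rough sketch of a sanity check one could run before sampling; the rules (warn below max, guidance summing to at most 1) are my assumptions about what a sensible contract looks like, and contractValues / validate are hypothetical names that do not exist in the tool.

```go
package main

import (
	"errors"
	"fmt"
)

// contractValues mirrors the numeric fields of rss_contract.yml (hypothetical type).
type contractValues struct {
	maxRssPct, warnRssPct, maxRssDeltaPct float64
	heapPct, nativePct, safetyPct         float64
}

// validate applies a few assumed consistency rules to the contract.
func validate(c contractValues) error {
	if c.maxRssPct <= 0 || c.maxRssPct > 1 {
		return errors.New("max_rss_pct must be in (0, 1]")
	}
	if c.warnRssPct <= 0 || c.warnRssPct >= c.maxRssPct {
		return errors.New("warn_rss_pct must be in (0, max_rss_pct)")
	}
	if c.maxRssDeltaPct < 0 || c.maxRssDeltaPct > 1 {
		return errors.New("max_rss_delta_pct must be in [0, 1]")
	}
	// Small epsilon so that e.g. 0.60+0.25+0.15 is not rejected by float rounding.
	if c.heapPct+c.nativePct+c.safetyPct > 1.0+1e-9 {
		return errors.New("guidance percentages add up to more than the limit")
	}
	return nil
}

func main() {
	// Values from the example contract above.
	c := contractValues{0.90, 0.80, 0.08, 0.60, 0.25, 0.10}
	fmt.Println("contract valid:", validate(c) == nil)
}
```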
How it works in practice
Workflow (CI and local)
- Run the application in a container with the same memory limit as in prod.
- Drive a short but realistic workload (smoke + a few hot paths).
- Run rsscontract verify, which:
  - reads the cgroup memory limit
  - samples the process RSS
  - computes max RSS and spikes
  - generates a report
- If the contract fails, the PR fails before the deploy.
Measuring memory: the kernel view, not JVM wishful thinking
1) Cgroup limit (v2 and v1)
Cgroup v2:
- limit: /sys/fs/cgroup/memory.max
- usage: /sys/fs/cgroup/memory.current
Cgroup v1:
- limit: /sys/fs/cgroup/memory/memory.limit_in_bytes
- usage: /sys/fs/cgroup/memory/memory.usage_in_bytes
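The tool in this article only reads the limit files. As a complement, here is a small sketch of reading the usage files from the same list, trying v2 first and then v1. The helper names parseCgroupValue and cgroupUsageBytes are hypothetical; the paths are the ones listed above, and the code assumes the cgroup filesystem is mounted at /sys/fs/cgroup.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseCgroupValue parses the content of a cgroup memory file.
// It returns the value in bytes, or unlimited=true for the v2 literal "max".
func parseCgroupValue(raw string) (value uint64, unlimited bool, err error) {
	s := strings.TrimSpace(raw)
	if s == "" {
		return 0, false, errors.New("empty cgroup value")
	}
	if s == "max" {
		return 0, true, nil
	}
	v, err := strconv.ParseUint(s, 10, 64)
	if err != nil {
		return 0, false, err
	}
	return v, false, nil
}

// cgroupUsageBytes tries the v2 usage file first, then the v1 one.
func cgroupUsageBytes() (uint64, error) {
	paths := []string{
		"/sys/fs/cgroup/memory.current",
		"/sys/fs/cgroup/memory/memory.usage_in_bytes",
	}
	for _, p := range paths {
		b, err := os.ReadFile(p)
		if err != nil {
			continue // file missing: try the next cgroup version
		}
		if v, _, err := parseCgroupValue(string(b)); err == nil {
			return v, nil
		}
	}
	return 0, errors.New("cgroup memory usage not found")
}

func main() {
	usage, err := cgroupUsageBytes()
	if err != nil {
		fmt.Println("usage unavailable:", err)
		return
	}
	fmt.Printf("cgroup usage: %d bytes\n", usage)
}
```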
2) Process RSS (the best signal)
Most stable:
/proc/<pid>/smaps_rollup -> Rss: (kB)
Fallback:
/proc/<pid>/status -> VmRSS:
Go implementation: rsscontract
A single binary, no external services. You can bundle it into the image and run it in CI.
internal/memprobe/memprobe.go
package memprobe

import (
	"bufio"
	"errors"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// readFirstLine returns the first line of the file at path, with surrounding
// whitespace trimmed. An empty file is reported as an error.
func readFirstLine(path string) (string, error) {
	b, err := os.ReadFile(path)
	if err != nil {
		return "", err
	}
	s := strings.TrimSpace(string(b))
	if s == "" {
		return "", errors.New("empty")
	}
	if i := strings.IndexByte(s, '\n'); i >= 0 {
		s = s[:i]
	}
	return s, nil
}

// CgroupMemoryLimitBytes tries v2 first, then v1.
func CgroupMemoryLimitBytes() (uint64, error) {
	// cgroup v2
	if s, err := readFirstLine("/sys/fs/cgroup/memory.max"); err == nil {
		if s == "max" {
			return 0, errors.New("memory.max is unlimited")
		}
		v, err := strconv.ParseUint(s, 10, 64)
		if err == nil && v > 0 {
			// Implausibly large values are treated as "no limit set".
			if v > (1 << 60) {
				return 0, errors.New("memory.max looks unlimited")
			}
			return v, nil
		}
	}
	// cgroup v1
	if s, err := readFirstLine("/sys/fs/cgroup/memory/memory.limit_in_bytes"); err == nil {
		v, err := strconv.ParseUint(s, 10, 64)
		if err == nil && v > 0 {
			if v > (1 << 60) {
				return 0, errors.New("memory.limit_in_bytes looks unlimited")
			}
			return v, nil
		}
	}
	return 0, errors.New("cgroup memory limit not found")
}

// RssBytes reads RSS for a given pid using smaps_rollup, fallback to status.
func RssBytes(pid int) (uint64, error) {
	if v, err := rssFromSmapsRollup(pid); err == nil {
		return v, nil
	}
	return rssFromStatus(pid)
}

// rssFromSmapsRollup parses the "Rss:" line (reported in kB) of /proc/<pid>/smaps_rollup.
func rssFromSmapsRollup(pid int) (uint64, error) {
	path := fmt.Sprintf("/proc/%d/smaps_rollup", pid)
	f, err := os.Open(path)
	if err != nil {
		return 0, err
	}
	defer f.Close()
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		line := sc.Text()
		if strings.HasPrefix(line, "Rss:") {
			fields := strings.Fields(line)
			if len(fields) < 2 {
				return 0, fmt.Errorf("unexpected rss line: %q", line)
			}
			kb, err := strconv.ParseUint(fields[1], 10, 64)
			if err != nil {
				return 0, err
			}
			return kb * 1024, nil
		}
	}
	if err := sc.Err(); err != nil {
		return 0, err
	}
	return 0, errors.New("Rss not found in smaps_rollup")
}

// rssFromStatus parses the "VmRSS:" line (reported in kB) of /proc/<pid>/status.
func rssFromStatus(pid int) (uint64, error) {
	path := fmt.Sprintf("/proc/%d/status", pid)
	f, err := os.Open(path)
	if err != nil {
		return 0, err
	}
	defer f.Close()
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		line := sc.Text()
		if strings.HasPrefix(line, "VmRSS:") {
			fields := strings.Fields(line)
			if len(fields) < 2 {
				return 0, fmt.Errorf("unexpected VmRSS line: %q", line)
			}
			kb, err := strconv.ParseUint(fields[1], 10, 64)
			if err != nil {
				return 0, err
			}
			return kb * 1024, nil
		}
	}
	if err := sc.Err(); err != nil {
		return 0, err
	}
	return 0, errors.New("VmRSS not found in status")
}
cmd/rsscontract/main.go
package main

import (
	"encoding/json"
	"errors"
	"flag"
	"fmt"
	"math"
	"os"
	"time"

	"gopkg.in/yaml.v3"

	"example.com/rsscontract/internal/memprobe"
)

// Contract mirrors the structure of rss_contract.yml.
type Contract struct {
	Version int `yaml:"version"`
	Limits  struct {
		MaxRssPct  float64 `yaml:"max_rss_pct"`
		WarnRssPct float64 `yaml:"warn_rss_pct"`
	} `yaml:"limits"`
	Spikes struct {
		MaxRssDeltaPct float64 `yaml:"max_rss_delta_pct"`
	} `yaml:"spikes"`
	Sampling struct {
		Duration string `yaml:"duration"`
		Interval string `yaml:"interval"`
	} `yaml:"sampling"`
	Guidance struct {
		HeapPctOfLimit   float64 `yaml:"heap_pct_of_limit"`
		NativePctOfLimit float64 `yaml:"native_pct_of_limit"`
		SafetyPctOfLimit float64 `yaml:"safety_pct_of_limit"`
	} `yaml:"guidance"`
}

// Report is the JSON document written after a verification run.
type Report struct {
	PID             int       `json:"pid"`
	Timestamp       time.Time `json:"timestamp"`
	LimitBytes      uint64    `json:"limit_bytes"`
	MaxRssBytes     uint64    `json:"max_rss_bytes"`
	MinRssBytes     uint64    `json:"min_rss_bytes"`
	DeltaRssBytes   int64     `json:"delta_rss_bytes"`
	MaxRssPct       float64   `json:"max_rss_pct"`
	DeltaRssPct     float64   `json:"delta_rss_pct"`
	WarnThreshold   float64   `json:"warn_threshold_pct"`
	FailThreshold   float64   `json:"fail_threshold_pct"`
	SpikeFailPct    float64   `json:"spike_fail_pct"`
	Warnings        []string  `json:"warnings,omitempty"`
	Violations      []string  `json:"violations,omitempty"`
	ContractVersion int       `json:"contract_version"`
}

func main() {
	var (
		contractPath = flag.String("contract", "rss_contract.yml", "Path to RSS contract YAML")
		pid          = flag.Int("pid", 1, "Target process PID (in container usually 1)")
		outPath      = flag.String("report", "rss_report.json", "Where to write JSON report")
	)
	flag.Parse()

	c, err := loadContract(*contractPath)
	if err != nil {
		fatal(err)
	}
	dur, err := time.ParseDuration(c.Sampling.Duration)
	if err != nil {
		fatal(fmt.Errorf("bad sampling.duration: %w", err))
	}
	interval, err := time.ParseDuration(c.Sampling.Interval)
	if err != nil {
		fatal(fmt.Errorf("bad sampling.interval: %w", err))
	}
	if dur <= 0 || interval <= 0 {
		fatal(errors.New("sampling.duration and sampling.interval must be > 0"))
	}
	limit, err := memprobe.CgroupMemoryLimitBytes()
	if err != nil {
		fatal(fmt.Errorf("cannot determine cgroup limit: %w", err))
	}

	rep, exitCode := verify(c, *pid, limit, dur, interval)
	// A failed report write is logged but does not change the exit code.
	if err := writeJSON(*outPath, rep); err != nil {
		fmt.Fprintf(os.Stderr, "report write failed: %v\n", err)
	}
	printHuman(rep)
	os.Exit(exitCode)
}

// verify samples RSS for dur at the given interval and checks it against the
// contract. Exit codes: 0 = pass, 1 = contract violated, 2 = RSS read error.
func verify(c Contract, pid int, limit uint64, dur, interval time.Duration) (Report, int) {
	rep := Report{
		PID:             pid,
		Timestamp:       time.Now().UTC(),
		LimitBytes:      limit,
		WarnThreshold:   c.Limits.WarnRssPct,
		FailThreshold:   c.Limits.MaxRssPct,
		SpikeFailPct:    c.Spikes.MaxRssDeltaPct,
		ContractVersion: c.Version,
	}
	start := time.Now()
	deadline := start.Add(dur)

	// Named minRSS/maxRSS so they do not shadow the min/max builtins (Go 1.21+).
	var (
		minRSS uint64 = math.MaxUint64
		maxRSS uint64
		first  uint64
		last   uint64
	)
	for now := time.Now(); now.Before(deadline); now = time.Now() {
		rss, err := memprobe.RssBytes(pid)
		if err != nil {
			rep.Violations = append(rep.Violations, fmt.Sprintf("rss_read_error: %v", err))
			return rep, 2
		}
		if first == 0 {
			first = rss
		}
		last = rss
		if rss < minRSS {
			minRSS = rss
		}
		if rss > maxRSS {
			maxRSS = rss
		}
		time.Sleep(interval)
	}

	rep.MinRssBytes = minRSS
	rep.MaxRssBytes = maxRSS
	// Delta is last sample minus first sample over the sampling window.
	rep.DeltaRssBytes = int64(last) - int64(first)
	rep.MaxRssPct = float64(maxRSS) / float64(limit)
	rep.DeltaRssPct = math.Abs(float64(rep.DeltaRssBytes)) / float64(limit)

	if c.Limits.WarnRssPct > 0 && rep.MaxRssPct >= c.Limits.WarnRssPct {
		rep.Warnings = append(rep.Warnings,
			fmt.Sprintf("RSS warning threshold exceeded: max_rss_pct=%.3f >= warn_rss_pct=%.3f", rep.MaxRssPct, c.Limits.WarnRssPct))
	}
	exitCode := 0
	if c.Limits.MaxRssPct > 0 && rep.MaxRssPct >= c.Limits.MaxRssPct {
		rep.Violations = append(rep.Violations,
			fmt.Sprintf("RSS contract failed: observed max_rss_pct=%.3f >= contract max_rss_pct=%.3f", rep.MaxRssPct, c.Limits.MaxRssPct))
		exitCode = 1
	}
	if c.Spikes.MaxRssDeltaPct > 0 && rep.DeltaRssPct >= c.Spikes.MaxRssDeltaPct {
		rep.Violations = append(rep.Violations,
			fmt.Sprintf("RSS spike contract failed: delta_rss_pct=%.3f >= max_rss_delta_pct=%.3f", rep.DeltaRssPct, c.Spikes.MaxRssDeltaPct))
		exitCode = 1
	}
	return rep, exitCode
}

// loadContract reads and parses the YAML contract; a missing version defaults to 1.
func loadContract(path string) (Contract, error) {
	b, err := os.ReadFile(path)
	if err != nil {
		return Contract{}, err
	}
	var c Contract
	if err := yaml.Unmarshal(b, &c); err != nil {
		return Contract{}, err
	}
	if c.Version == 0 {
		c.Version = 1
	}
	return c, nil
}

// writeJSON writes the report as indented JSON.
func writeJSON(path string, rep Report) error {
	b, err := json.MarshalIndent(rep, "", " ")
	if err != nil {
		return err
	}
	return os.WriteFile(path, b, 0o644)
}

// printHuman prints a short summary for CI logs.
func printHuman(rep Report) {
	fmt.Printf("RSS Contract report (pid=%d)\n", rep.PID)
	fmt.Printf("- limit: %.2f MiB\n", float64(rep.LimitBytes)/(1024*1024))
	fmt.Printf("- max RSS: %.2f MiB (%.1f%%)\n", float64(rep.MaxRssBytes)/(1024*1024), rep.MaxRssPct*100)
	fmt.Printf("- min RSS: %.2f MiB\n", float64(rep.MinRssBytes)/(1024*1024))
	fmt.Printf("- delta: %.2f MiB (%.1f%%)\n", float64(rep.DeltaRssBytes)/(1024*1024), rep.DeltaRssPct*100)
	for _, w := range rep.Warnings {
		fmt.Printf("WARN: %s\n", w)
	}
	for _, v := range rep.Violations {
		fmt.Printf("FAIL: %s\n", v)
	}
}

// fatal prints the error and exits with code 2 (setup/usage failure).
func fatal(err error) {
	fmt.Fprintln(os.Stderr, "error:", err)
	os.Exit(2)
}
go.mod
module example.com/rsscontract
go 1.22
require gopkg.in/yaml.v3 v3.0.1
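One thing worth knowing about verify above: it computes the delta as the last sample minus the first, so a spike that rises and falls back inside the sampling window nets out to almost zero. If you want the spike check to catch transient peaks, two alternative measures are sketched below. This is my own variation, not what the tool currently does, and the function names and sample values are hypothetical.

```go
package main

import "fmt"

// peakToTrough returns max - min over all samples (0 for an empty slice).
func peakToTrough(samples []uint64) uint64 {
	if len(samples) == 0 {
		return 0
	}
	lo, hi := samples[0], samples[0]
	for _, s := range samples[1:] {
		if s < lo {
			lo = s
		}
		if s > hi {
			hi = s
		}
	}
	return hi - lo
}

// maxStepUp returns the largest increase between two consecutive samples.
func maxStepUp(samples []uint64) uint64 {
	var best uint64
	for i := 1; i < len(samples); i++ {
		if samples[i] > samples[i-1] {
			if d := samples[i] - samples[i-1]; d > best {
				best = d
			}
		}
	}
	return best
}

func main() {
	// Made-up RSS samples in MiB with one transient spike in the middle.
	samples := []uint64{900, 910, 1300, 905, 915}
	lastMinusFirst := int64(samples[len(samples)-1]) - int64(samples[0])
	fmt.Println("last - first:  ", lastMinusFirst)
	fmt.Println("peak to trough:", peakToTrough(samples))
	fmt.Println("max step up:   ", maxStepUp(samples))
}
```

Peak-to-trough is the simpler of the two; the step-based measure is more sensitive to the sampling interval.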
How to bundle it into a Java image
FROM golang:1.22 AS rssbuild
WORKDIR /src
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 go build -o /out/rsscontract ./cmd/rsscontract
FROM eclipse-temurin:21-jre
WORKDIR /app
COPY build/libs/app.jar /app/app.jar
COPY --from=rssbuild /out/rsscontract /usr/local/bin/rsscontract
COPY rss_contract.yml /app/rss_contract.yml
ENTRYPOINT ["java","-jar","/app/app.jar"]
CI example (GitHub Actions)
name: rss-contract
on: [pull_request]
jobs:
  rss_contract:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build image
        run: |
          docker build -t myapp:test .
      - name: Run app with memory limit
        run: |
          docker run -d --name app \
            --memory=2g --memory-swap=2g \
            -p 8080:8080 \
            myapp:test
      - name: Warm-up / smoke
        run: |
          for i in $(seq 1 50); do
            curl -fsS http://localhost:8080/health || true
            curl -fsS http://localhost:8080/api/some-hot-path || true
          done
      - name: Run RSS contract inside container
        run: |
          docker exec app rsscontract \
            -contract /app/rss_contract.yml \
            -pid 1 \
            -report /tmp/rss_report.json
      - name: Copy report
        if: always()
        run: |
          docker cp app:/tmp/rss_report.json rss_report.json || true
      - name: Upload report
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: rss_report
          path: rss_report.json
Why --memory-swap=2g?
Setting it equal to --memory leaves the container with no swap, which keeps it close to Kubernetes behavior (no swap, hard limits). It is not a perfect emulation; it is a reproducible guardrail.
How to turn the contract into real JVM settings
The contract tells you how much RAM you can spend. Now split the budget.
A reasonable starting point (not dogma):
- Heap: 55-65% of the limit
- Native: 20-30%
- Safety: 10-15%
For a 2GiB limit:
- heap: ~1.2GiB
- native: ~0.5GiB
- safety: ~0.3GiB
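The split is plain arithmetic, so it is easy to script. Below is a minimal sketch that derives the three buckets from a limit and two fractions, with safety taking whatever is left. The budget type and split function are hypothetical helpers, and 0.60 / 0.25 are just the midpoints of the ranges above.

```go
package main

import (
	"errors"
	"fmt"
)

// budget holds the three buckets of the memory split, in MiB.
type budget struct {
	heapMiB, nativeMiB, safetyMiB uint64
}

// split divides limitMiB into heap and native shares; safety is the remainder.
func split(limitMiB uint64, heapPct, nativePct float64) (budget, error) {
	if heapPct < 0 || nativePct < 0 || heapPct+nativePct > 1 {
		return budget{}, errors.New("heapPct + nativePct must stay within [0, 1]")
	}
	heap := uint64(float64(limitMiB) * heapPct)     // truncated to whole MiB
	native := uint64(float64(limitMiB) * nativePct) // truncated to whole MiB
	return budget{heap, native, limitMiB - heap - native}, nil
}

func main() {
	b, err := split(2048, 0.60, 0.25)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("heap:   %d MiB (e.g. -Xmx%dm)\n", b.heapMiB, b.heapMiB)
	fmt.Printf("native: %d MiB\n", b.nativeMiB)
	fmt.Printf("safety: %d MiB\n", b.safetyMiB)
}
```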
Heap sizing
Explicitly:
-Xms512m -Xmx1200m
Or with percentages:
-XX:MaxRAMPercentage=60
-XX:InitialRAMPercentage=25
The percentages size the heap, not total RSS.
Direct memory (Netty / NIO)
-XX:MaxDirectMemorySize=256m
Thread stacks
-Xss256k
A smaller stack can help if you have a lot of threads. Be careful: a stack that is too small leads to StackOverflowError in deep call chains.
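To get a feel for why the stack size matters, here is a back-of-the-envelope sketch that multiplies thread count by stack size. Treat the result as an upper bound: stack memory is reserved up front but only counts toward RSS once pages are actually touched. The 1024 KiB figure is an assumption (a common default on 64-bit Linux; check your JVM), and stackReserveMiB is a hypothetical helper.

```go
package main

import "fmt"

// stackReserveMiB returns the worst-case stack reservation in MiB
// for the given number of threads and per-thread stack size in KiB.
func stackReserveMiB(threads, xssKiB int) float64 {
	return float64(threads) * float64(xssKiB) / 1024
}

func main() {
	const threads = 500 // hypothetical thread count
	fmt.Printf("%d threads at 1024k: up to %.0f MiB\n", threads, stackReserveMiB(threads, 1024))
	fmt.Printf("%d threads at  256k: up to %.0f MiB\n", threads, stackReserveMiB(threads, 256))
}
```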
Metaspace
Capping metaspace can backfire. I usually:
- measure with NMT
- fix the root cause
- only then consider a cap
When the contract fails: pre-OOM debugging checklist
1) Heap vs RSS
If the heap is flat and RSS keeps growing, it is almost always native/direct/threads/page cache.
2) Thread count
ps -o pid,comm,nlwp -p 1
If nlwp shoots up, RSS will follow (stacks + TLS + arenas).
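If ps is not available in a slim image, the same number can be read from the Threads: line of /proc/<pid>/status. A small sketch, assuming a Linux procfs; parseThreads is a hypothetical helper and the example reads the current process rather than PID 1.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseThreads extracts the thread count from the content of /proc/<pid>/status.
func parseThreads(status string) (int, error) {
	for _, line := range strings.Split(status, "\n") {
		if strings.HasPrefix(line, "Threads:") {
			fields := strings.Fields(line)
			if len(fields) < 2 {
				return 0, fmt.Errorf("unexpected Threads line: %q", line)
			}
			return strconv.Atoi(fields[1])
		}
	}
	return 0, errors.New("Threads not found in status")
}

func main() {
	// Reads this process's own status; swap in /proc/1/status inside a container.
	b, err := os.ReadFile("/proc/self/status")
	if err != nil {
		fmt.Println("status unavailable:", err)
		return
	}
	n, err := parseThreads(string(b))
	if err != nil {
		fmt.Println("parse failed:", err)
		return
	}
	fmt.Println("threads:", n)
}
```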
3) Direct buffers / Netty
If throughput grows and RSS grows, but the heap doesn't:
- check MaxDirectMemorySize
- check pooling and leak detection
4) Native Memory Tracking (NMT)
Start the JVM with:
-XX:NativeMemoryTracking=summary
Then:
jcmd 1 VM.native_memory summary scale=MB
This is the fastest way to stop guessing. It has overhead, so don't push it everywhere in prod. Note that jcmd ships with the JDK, so a JRE-only image may not include it.
5) Page cache / file memory
In cgroup v2, look at memory.stat:
grep -E 'anon|file|slab' /sys/fs/cgroup/memory.stat
If file keeps growing, the page cache may be the culprit.
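The same check can be done programmatically, for example to attach the numbers to a failing report. The sketch below pulls selected keys out of memory.stat; it assumes the cgroup v2 format of one "key value" pair per line, and parseMemoryStat is a hypothetical helper, not part of rsscontract.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseMemoryStat returns the requested keys from cgroup v2 memory.stat content.
// Keys are matched exactly, so "file" does not match "file_mapped".
func parseMemoryStat(raw string, keys ...string) map[string]uint64 {
	want := make(map[string]bool, len(keys))
	for _, k := range keys {
		want[k] = true
	}
	out := make(map[string]uint64)
	for _, line := range strings.Split(raw, "\n") {
		fields := strings.Fields(line)
		if len(fields) != 2 || !want[fields[0]] {
			continue
		}
		if v, err := strconv.ParseUint(fields[1], 10, 64); err == nil {
			out[fields[0]] = v
		}
	}
	return out
}

func main() {
	b, err := os.ReadFile("/sys/fs/cgroup/memory.stat")
	if err != nil {
		fmt.Println("memory.stat unavailable:", err)
		return
	}
	stats := parseMemoryStat(string(b), "anon", "file", "slab")
	for _, k := range []string{"anon", "file", "slab"} {
		fmt.Printf("%-5s %d bytes\n", k, stats[k])
	}
}
```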
Runtime guard: headroom metrics and alerts
CI catches a lot, but not everything. Add runtime headroom metrics:
- rss_bytes
- cgroup_limit_bytes
- rss_headroom_bytes = limit - rss
- rss_headroom_ratio = headroom / limit
Alerts:
- rss_headroom_ratio < 0.15 for 5m -> notify
- rss_headroom_ratio < 0.08 -> panic (shed load / disable features)
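The headroom formulas and thresholds translate directly into code. Here is a minimal sketch of the calculation and the threshold check; the "for 5m" duration condition is left to the alerting system and is not modelled here, and the function names are hypothetical.

```go
package main

import "fmt"

// headroomRatio returns (limit - rss) / limit, clamped to 0 when rss >= limit.
func headroomRatio(limitBytes, rssBytes uint64) float64 {
	if limitBytes == 0 || rssBytes >= limitBytes {
		return 0
	}
	return float64(limitBytes-rssBytes) / float64(limitBytes)
}

// classify maps a headroom ratio to the alert levels from the list above.
func classify(ratio float64) string {
	switch {
	case ratio < 0.08:
		return "panic"
	case ratio < 0.15:
		return "notify"
	default:
		return "ok"
	}
}

func main() {
	const limit = 2 << 30 // 2GiB
	// Made-up RSS readings in bytes.
	for _, rss := range []uint64{1 << 30, 1800 << 20, 1950 << 20} {
		r := headroomRatio(limit, rss)
		fmt.Printf("rss=%d MiB headroom=%.3f -> %s\n", rss>>20, r, classify(r))
	}
}
```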
FAQ
Why OOMKilled when the heap is only at 60%?
The kernel kills based on cgroup usage/RSS. The heap is only part of it.
If I lower -Xmx, will that fix it?
Maybe. Without a contract you don't know whether you just moved the problem into GC/latency.
Why RSS and not just cgroup usage?
In a single-process container the two are close. RSS is the process-level truth. In a multi-process container, cgroup usage matters more.
Won't it be flaky in CI?
With a short, deterministic warm-up and sane thresholds (fail at around 90%), it is stable. Flakiness often reveals real spikes.
Conclusion
RSS Contracts are simple, but they change the game:
- you stop guessing -Xmx
- you get a reproducible guardrail
- memory becomes an API with a budget, not a mystery
Bonus: next steps
- rsscontract diff vs a baseline, with a PR comment
- NMT only on failure: automatically attach jcmd VM.native_memory summary
- a Kubernetes sidecar with shareProcessNamespace: true (advanced, but powerful)