Kubernetes p99 Spikes Without OOM: Diagnosing cgroup v2 memory.high with PSI
Your p99 jumps from 80ms to 3s. CPU is fine. RSS stays below the pod limit. Nobody is getting OOMKilled. Yet the service is clearly “stuck” for bursts of time.
This is a classic cgroup v2 failure mode: you’re being throttled by memory pressure, not killed by memory limits. The culprit is usually a non-obvious combination of:
- memory.high (backpressure threshold) being hit
- reclaim stalls (direct reclaim / refault storms)
- PSI (Pressure Stall Information) screaming in a file nobody graphs
If you only look at RSS and OOMKills, you’ll miss it.
Tested on: Kubernetes 1.29–1.31, containerd 1.7, Linux 6.1–6.6 (cgroup v2), systemd 252+.
Why this matters in 2026
cgroup v2 is the default on most modern distros. Platform teams also increasingly enable “early backpressure” (via systemd slices or kubelet/runtime QoS knobs) to protect nodes from hard OOM events. The result: more incidents where latency collapses without a single OOMKill.
If your SLO is latency-based, memory.high is as real a limiter as CPU throttling.
Incident narrative (anonymized)
We run a latency-sensitive API (Go) on a multi-tenant Kubernetes cluster. After a node pool refresh, alerts started firing:
- p99 latency > 2s for 30–90 seconds at a time
- 5xx rate low, but client timeouts rose
- node CPU < 50%, pod RSS ~70% of limit
- no restarts, no OOMKills, no obvious GC spike
Blast radius: ~20% of traffic (requests landing on pods scheduled to a subset of nodes) saw timeouts.
Constraint: we couldn’t simply “add more memory” to the nodes; capacity was tight and we needed a surgical fix.
Timeline
- T-0: p99 latency alert fires; dashboards show CPU normal, memory “okay”.
- T+5m: App metrics show request handler time flat; queue time grows (threads stuck).
- T+15m: From inside the pod, memory.events shows rapid increments of high.
- T+25m: memory.pressure shows sustained memory stall time during the spikes.
- T+35m: On affected nodes, /proc/pressure/memory spikes align with the latency.
- T+50m: Mitigation: move workload to Guaranteed QoS (request==limit) + slightly raise limit.
- T+90m: p99 stabilizes; PSI drops; high events stop incrementing during steady traffic.
Mechanism: what actually happened
memory.max kills you. memory.high slows you down.
In cgroup v2, the memory controller exposes two separate “lines”:
- memory.max: hard ceiling; if usage reaches it and reclaim cannot bring it back down, the kernel invokes the OOM killer inside the cgroup.
- memory.high: soft ceiling; exceeding it triggers reclaim and throttling as backpressure.
When a cgroup exceeds memory.high, the kernel starts reclaiming memory for that cgroup and can throttle allocations. The process doesn’t die; it just spends time stalled in reclaim paths.
Why latency explodes while CPU looks fine
A stalled thread often shows up as “not much CPU” because it’s blocked on:
- reclaim work
- IO triggered by reclaim/refault (depending on your storage and cache behavior)
- allocator stalls
That time is real latency but not necessarily “busy CPU”.
PSI is your truth serum
PSI measures “how long tasks were stalled due to resource pressure” (memory, CPU, IO). For this incident, PSI made the root cause obvious: when latency spiked, memory PSI spiked.
Runbook: confirming memory.high reclaim stalls
What to check first
- Is cgroup v2 in use?
Inside the pod:
stat -fc %T /sys/fs/cgroup
# cgroup2fs == v2
- Is memory.high set to something meaningful (not “max”)?
cat /sys/fs/cgroup/memory.max
cat /sys/fs/cgroup/memory.high
cat /sys/fs/cgroup/memory.current
If memory.high is below memory.max, the pod can be throttled before it is OOMKilled.
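The comparison above can be scripted. What follows is a minimal sketch, not an existing tool: mem_zone is a hypothetical helper name, and the 90% "near" cutoff is an arbitrary illustration rather than a kernel constant. It takes the raw contents of the three files and accounts for the literal string max that the kernel writes when a threshold is unset.

```shell
# mem_zone CURRENT HIGH MAX
# Hypothetical helper: classify a cgroup against memory.high.
# Arguments are the raw contents of memory.current, memory.high and memory.max.
mem_zone() {
    awk -v cur="$1" -v high="$2" -v lim="$3" 'BEGIN {
        # The kernel reports an unset threshold as the literal string "max".
        if (high == "max") { print "high-unset"; exit }
        # memory.high only gives early backpressure if it sits below memory.max.
        if (lim != "max" && high + 0 >= lim + 0) { print "high-not-below-max"; exit }
        if (cur + 0 >= high + 0) { print "over-high"; exit }
        # Assumption: treat the top 10% below memory.high as "near".
        if (cur + 0 >= 0.9 * high) { print "near-high"; exit }
        print "below-high"
    }'
}
```

Inside a pod it could be called as mem_zone "$(cat /sys/fs/cgroup/memory.current)" "$(cat /sys/fs/cgroup/memory.high)" "$(cat /sys/fs/cgroup/memory.max)".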
- Did the kernel record throttling events?
cat /sys/fs/cgroup/memory.events
Look for high increasing during the incident window.
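The counters in memory.events are cumulative, so a single reading says little; the change over a window is what matters. A minimal sketch of measuring that change, where events_field and high_delta are hypothetical helper names and the file path is passed in rather than assumed:

```shell
# events_field NAME
# Print the counter NAME from memory.events text supplied on stdin.
# Exits non-zero if the counter is not present.
events_field() {
    awk -v key="$1" '$1 == key { print $2; found = 1 } END { if (!found) exit 1 }'
}

# high_delta FILE SECONDS
# Print how many "high" events were recorded in FILE over SECONDS.
high_delta() {
    before=$(events_field high < "$1") || return 1
    sleep "$2"
    after=$(events_field high < "$1") || return 1
    echo $((after - before))
}
```

For example, high_delta /sys/fs/cgroup/memory.events 10 prints the number of high events in a 10-second window; a value that stays at 0 under load suggests memory.high is not the limiter.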
How to confirm the hypothesis
A. Confirm stalls via PSI (pod-level):
cat /sys/fs/cgroup/memory.pressure
You’ll see something like:
some avg10=0.25 avg60=0.12 avg300=0.05 total=123456789
full avg10=0.03 avg60=0.01 avg300=0.00 total=987654
Interpretation:
- some = % of time at least one task was stalled due to memory pressure
- full = % of time all non-idle tasks were stalled (this is “everything is stuck”)
For latency-sensitive services, sustained some avg10 above a few percent is already a problem. Sustained full is usually catastrophic.
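To compare these values against a threshold or feed them into a metrics pipeline, individual fields have to be pulled out of the PSI text. A minimal sketch, assuming the two-line some/full format shown above; psi_value is a hypothetical helper name:

```shell
# psi_value KIND FIELD
# Read PSI text on stdin and print one field.
# KIND is "some" or "full"; FIELD is avg10, avg60, avg300 or total.
psi_value() {
    awk -v kind="$1" -v field="$2" '
        $1 == kind {
            for (i = 2; i <= NF; i++) {
                split($i, kv, "=")
                if (kv[1] == field) { print kv[2]; found = 1 }
            }
        }
        END { if (!found) exit 1 }'
}
```

Usage would look like psi_value some avg10 < /sys/fs/cgroup/memory.pressure. The avg fields are percentages; total is cumulative stall time in microseconds.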
B. Correlate with latency spikes:
Run this during an incident (or in a loop):
while true; do
  date
  cat /sys/fs/cgroup/memory.current
  cat /sys/fs/cgroup/memory.events
  cat /sys/fs/cgroup/memory.pressure
  echo
  sleep 5
done
You’re looking for:
- memory.current near/over memory.high
- high counter climbing
- PSI some/full climbing
- p99 latency climbing at the same time
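Lining these signals up against a latency graph is easier when each sample is a single timestamped line. The following is one possible sketch, not the tooling used in the incident: cgroup_sample is a hypothetical helper, and the CSV column order is an arbitrary choice.

```shell
# cgroup_sample [DIR]
# Print one CSV line: epoch,current_bytes,high_events,some_avg10,full_avg10
# DIR defaults to /sys/fs/cgroup (the pod's own cgroup when run inside it).
cgroup_sample() {
    dir=${1:-/sys/fs/cgroup}
    current=$(cat "$dir/memory.current") || return 1
    # Cumulative count of memory.high breaches.
    high=$(awk '$1 == "high" { print $2 }' "$dir/memory.events") || return 1
    # Field 2 of each PSI line is "avg10=<pct>"; strip the key, keep the value.
    psi=$(awk '{ sub(/^avg10=/, "", $2); v[$1] = $2 }
               END { print v["some"] "," v["full"] }' "$dir/memory.pressure") || return 1
    echo "$(date +%s),$current,$high,$psi"
}
```

Running while true; do cgroup_sample; sleep 5; done >> /tmp/memhigh.csv during an incident gives a file that can be joined with latency data on the epoch column.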
C. Validate at node-level (to catch global pressure):
On the node (via SSH or kubectl debug node/...):
cat /proc/pressure/memory
cat /proc/pressure/io
If node-level PSI spikes too, you may have a node headroom issue (kube-reserved/system-reserved too small) in addition to the pod’s memory.high.
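The pod-versus-node distinction can be reduced to a small decision table. This sketch is an illustration only: pressure_scope is a hypothetical helper, the labels are made up for this example, and the threshold is whatever percentage you consider significant for your service.

```shell
# pressure_scope POD_SOME_AVG10 NODE_SOME_AVG10 THRESHOLD_PCT
# Classify where memory pressure is visible, given two PSI percentages.
pressure_scope() {
    awk -v pod="$1" -v node="$2" -v thr="$3" 'BEGIN {
        p = (pod + 0 >= thr + 0)    # pod cgroup is stalling
        n = (node + 0 >= thr + 0)   # node as a whole is stalling
        if (p && n)  print "pod-and-node"   # suspect node headroom as well
        else if (p)  print "pod-local"      # suspect the pod threshold alone
        else if (n)  print "node-only"
        else         print "quiet"
    }'
}
```

The first argument would come from the pod's memory.pressure and the second from /proc/pressure/memory on the node, both taken at the same moment.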
Safe mitigations
Pick the least invasive first:
- Move the workload to Guaranteed QoS (requests == limits for both CPU and memory, on every container) for critical services.
  - This reduces surprises from runtime/QoS knobs that treat Burstable workloads more aggressively.
- Increase the memory limit slightly if the workload’s steady-state + normal spikes are too close to the backpressure threshold.
- Reduce allocation rate / fan-out (temporary):
  - lower concurrency / thread pool
  - cap in-flight requests
  - reduce per-request buffering (especially large JSON/protobuf payloads)
- Drain a small number of “hot” nodes (if only some nodes show PSI spikes).
Risky mitigations (can cause collateral damage)
- Disabling memory.high globally (e.g., on kubepods.slice) without understanding why it was enabled.
  - This can turn a “latency problem” into “node OOM and mass eviction”.
- Aggressive cache drops (echo 3 > /proc/sys/vm/drop_caches)
  - Often makes things worse by increasing refaults and IO.
- Restarting pods as a “fix”
  - It may reset memory state temporarily, but it also hides the mechanism and can trigger connection storms.
What we changed (concrete)
1) Make the service Guaranteed + add real headroom
Before (Burstable, too tight):
resources:
  requests:
    memory: "512Mi"
  limits:
    memory: "768Mi"
After (Guaranteed + headroom):
resources:
  requests:
    memory: "1Gi"
  limits:
    memory: "1Gi"
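Kubernetes only assigns the Guaranteed class when CPU requests and limits match as well, so it is worth reading the class back from the API after the rollout rather than trusting the manifest. A minimal sketch, assuming kubectl access to the cluster; require_guaranteed is a hypothetical helper, while .status.qosClass is the field the API server populates on every pod:

```shell
# require_guaranteed NAMESPACE POD
# Exit 0 only if the API server reports the pod's QoS class as Guaranteed.
# Returns 2 if kubectl itself fails, 1 if the class is anything else.
require_guaranteed() {
    qos=$(kubectl get pod "$2" -n "$1" -o jsonpath='{.status.qosClass}') || return 2
    [ "$qos" = "Guaranteed" ] || {
        echo "pod $2 is $qos, not Guaranteed" >&2
        return 1
    }
}
```

A Burstable result after this change usually means some container in the pod still has a request that differs from its limit.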
2) Remove an accidental node-level MemoryHigh clamp on kubepods slice
We found a systemd drop-in that applied MemoryHigh to kubepods.slice too aggressively on new nodes.
Diff (illustrative):
# /etc/systemd/system/kubepods.slice.d/10-memoryhigh.conf
[Slice]
-MemoryHigh=85%
+MemoryHigh=infinity
We kept Kubernetes eviction thresholds as the primary node protection mechanism.
3) Add an explicit “latency under memory pressure” alert
We added alerts for:
- memory.events{high} rate (if you scrape it)
- PSI memory some/full (node-level is a great starting point)
Even without full automation, we documented these commands in the on-call runbook.
How to verify (measurable checks)
- During load, memory.events should stop incrementing high rapidly:
watch -n 2 'cat /sys/fs/cgroup/memory.events'
- PSI should stay low during steady traffic:
watch -n 2 'cat /sys/fs/cgroup/memory.pressure'
- Latency recovers without restarts:
  - p99 back to baseline
  - no correlated spikes in memory PSI
- Node-level headroom holds:
  - node /proc/pressure/memory doesn’t spike across the fleet
  - fewer eviction events in kubelet logs
Prevention / guardrails
Contracts we enforce
- Latency-critical services must be Guaranteed (or explicitly reviewed if Burstable).
- Memory headroom budget:
  - keep memory.current < ~70–80% of memory.max under peak expected load (service-specific)
- PSI budget:
  - sustained memory PSI full should be ~0
  - sustained some should stay below a low single-digit % during normal operation
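The PSI budget can be turned into a pass/fail check for a canary or load-test gate. This is a sketch under stated assumptions: psi_budget_ok is a hypothetical helper, avg60 is used here as a stand-in for "sustained", and both limits are supplied by the caller rather than taken from any standard.

```shell
# psi_budget_ok FILE SOME_MAX_PCT FULL_MAX_PCT
# Exit 0 if the avg60 values in a PSI file are within budget, 1 otherwise.
psi_budget_ok() {
    awk -v some_max="$2" -v full_max="$3" '
        {
            # Collect avg60 for each line, keyed by "some" or "full".
            for (i = 2; i <= NF; i++) {
                split($i, kv, "=")
                if (kv[1] == "avg60") v[$1] = kv[2] + 0
            }
        }
        END {
            ok = (v["some"] <= some_max + 0 && v["full"] <= full_max + 0)
            exit (ok ? 0 : 1)
        }' "$1"
}
```

For example, psi_budget_ok /sys/fs/cgroup/memory.pressure 3 0.05 passes only while some stays at or below 3% and full at or below 0.05% over the last minute.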
Alerts worth having
- Node memory PSI some avg10 above threshold for N minutes
- Rate of high events > 0 for latency-critical pods
- Increase in major faults/refault rate (where available)
Related reading
- Kubernetes OOM Killer: Why Your Container Dies at 50% Memory
- Linux Page Cache Thrashing in Containers: When Free Memory Isn’t Free
- Kubernetes CPU Throttling Autopsy: Why p99 Latency Explodes at 40% CPU Usage
- RSS Contracts: Stop OOMKilled Java Pods in Kubernetes by Testing RSS as an API
- eBPF Off-CPU Analysis: Finding Latency That Metrics Miss
- Java OOMKilled With Stable Heap: Native Memory, Direct Buffers, and glibc Arenas
- Adaptive Concurrency Limits: Stop Guessing Thread Pool Sizes