Kubernetes p99 Spikes Without OOM: Diagnosing cgroup v2 memory.high with PSI

Your p99 jumps from 80ms to 3s. CPU is fine. RSS stays below the pod limit. Nobody is getting OOMKilled. Yet the service is clearly “stuck” for bursts of time.

This is a classic cgroup v2 failure mode: you’re being throttled by memory pressure, not killed by memory limits. The culprit is usually a non-obvious combination of:

memory.high (backpressure threshold) being hit
reclaim stalls (direct reclaim / refault storms)
PSI (Pressure Stall Information) screaming in a file nobody graphs

If you only look at RSS and OOMKills, you’ll miss it.

Tested on: Kubernetes 1.29–1.31, containerd 1.7, Linux 6.1–6.6 (cgroup v2), systemd 252+.

Why this matters in 2026

cgroup v2 is the default on most modern distros. Platform teams also increasingly enable “early backpressure” to protect nodes from hard OOM events, for example through MemoryHigh= on systemd slices or the kubelet's alpha MemoryQoS feature gate, which can set memory.high on containers. The result: more incidents where latency collapses without a single OOMKill.

If your SLO is latency-based, memory.high is as real a limiter as CPU throttling.

Incident narrative (anonymized)

We run a latency-sensitive API (Go) on a multi-tenant Kubernetes cluster. After a node pool refresh, alerts started firing:

p99 latency > 2s for 30–90 seconds at a time
5xx rate low, but client timeouts rose
node CPU < 50%, pod RSS ~70% of limit
no restarts, no OOMKills, no obvious GC spike

Blast radius: ~20% of traffic (requests landing on pods scheduled to a subset of nodes) saw timeouts.

Constraint: we couldn’t simply “add more memory” to the nodes; capacity was tight and we needed a surgical fix.

Timeline

T-0: p99 latency alert fires; dashboards show CPU normal, memory “okay”.
T+5m: App metrics show request handler time flat; queue time grows (threads stuck).
T+15m: From inside the pod, memory.events shows rapid increments of high.
T+25m: memory.pressure shows sustained memory stall time during the spikes.
T+35m: On affected nodes, /proc/pressure/memory spikes align with the latency.
T+50m: Mitigation: move workload to Guaranteed QoS (requests == limits for both CPU and memory) + slightly raise limit.
T+90m: p99 stabilizes; PSI drops; high events stop incrementing during steady traffic.

Mechanism: what actually happened

`memory.max` kills you. `memory.high` slows you down.

In cgroup v2, the memory controller exposes two separate “lines”:

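memory.max: hard limit; when usage reaches it and reclaim cannot bring it back down, the kernel invokes the OOM killer inside that cgroup.
memory.high: throttle limit; exceeding it never invokes the OOM killer, but puts the cgroup's tasks under heavy reclaim pressure and throttles them as backpressure.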

When a cgroup exceeds memory.high, the kernel pushes the allocating tasks into direct reclaim for that cgroup and, while usage stays above the threshold, throttles further allocations. The process doesn’t die; it just spends time stalled in reclaim paths.

Why latency explodes while CPU looks fine

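A stalled thread often shows up as “not much CPU” because it is mostly waiting rather than running:

direct reclaim, which runs in the allocating task's own context, so the request cannot progress until it finishes
IO triggered by reclaim/refault (depending on your storage and cache behavior)
allocation throttling, where the kernel delays tasks while the cgroup stays above memory.high

That time is real latency but not necessarily “busy CPU”. Direct reclaim does consume some CPU, but it is kernel (system) time in short bursts that averaged CPU graphs rarely surface.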

PSI is your truth serum

PSI measures “how long tasks were stalled due to resource pressure” (memory, CPU, IO). For this incident, PSI made the root cause obvious: when latency spiked, memory PSI spiked.

Runbook: confirming `memory.high` reclaim stalls

What to check first

Is cgroup v2 in use?

Inside the pod:

stat -fc %T /sys/fs/cgroup
# cgroup2fs == v2

Is memory.high set to something meaningful (not “max”)?

cat /sys/fs/cgroup/memory.max
cat /sys/fs/cgroup/memory.high
cat /sys/fs/cgroup/memory.current

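If memory.high is below memory.max, the pod can be throttled before it is OOMKilled. These files only show the pod's own values: a memory.high set on an ancestor cgroup (such as kubepods.slice) throttles the pod too, but is not visible from inside the container and has to be checked on the node.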
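To find which level of the hierarchy carries the clamp, you can walk from the pod's cgroup up to the root on the node. This is a minimal sketch, assuming cgroup v2 is mounted at /sys/fs/cgroup; walk_memory_high is a hypothetical helper name, and the pod's cgroup directory is something you have to look up yourself (for instance from /proc/PID/cgroup of a process in the pod).

```shell
# Print memory.high for a cgroup directory and each of its ancestors,
# stopping before the cgroup root (the root has no memory.high file).
# walk_memory_high is a hypothetical helper; args: cgroup dir, cgroup root.
walk_memory_high() {
  local dir root
  dir="${1%/}"
  root="${2:-/sys/fs/cgroup}"
  root="${root%/}"
  while [ -n "$dir" ] && [ "$dir" != "$root" ] && [ "$dir" != "/" ] && [ "$dir" != "." ]; do
    if [ -r "$dir/memory.high" ]; then
      # One line per level: <directory><TAB><memory.high value>
      printf '%s\t%s\n' "$dir" "$(cat "$dir/memory.high")"
    fi
    dir="$(dirname "$dir")"
  done
}
```

Any level that prints a number instead of max is a throttle point for everything beneath it.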

Did the kernel record throttling events?

cat /sys/fs/cgroup/memory.events

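Look for high increasing during the incident window. The counter is cumulative (it records how many times tasks in the cgroup were throttled and sent into direct reclaim for exceeding memory.high), so what matters is how fast it grows, not its absolute value.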
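Because the counter only ever grows, it helps to sample it twice and look at the difference. The sketch below assumes the default in-pod path; high_count and high_delta are hypothetical helper names. If your kernel exposes memory.events.local, pointing the helper at that file restricts the count to this cgroup rather than its whole subtree.

```shell
# Read the cumulative "high" counter from a memory.events file.
high_count() {
  awk '$1 == "high" { print $2 }' "${1:-/sys/fs/cgroup/memory.events}"
}

# Report how many memory.high events occurred over an interval.
# Args: interval in seconds (default 10), events file (default: this cgroup).
high_delta() {
  local secs file before after
  secs="${1:-10}"
  file="${2:-/sys/fs/cgroup/memory.events}"
  before="$(high_count "$file")"
  sleep "$secs"
  after="$(high_count "$file")"
  echo "high events in ${secs}s: $((after - before))"
}
```

Running high_delta 10 during a latency spike and again during quiet traffic gives a quick before/after comparison.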

How to confirm the hypothesis

A. Confirm stalls via PSI (pod-level):

cat /sys/fs/cgroup/memory.pressure

You’ll see something like:

some avg10=0.25 avg60=0.12 avg300=0.05 total=123456789
full avg10=0.03 avg60=0.01 avg300=0.00 total=987654

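Interpretation:

some = % of time at least one task was stalled due to memory pressure
full = % of time all non-idle tasks were stalled (this is “everything is stuck”)
avg10 / avg60 / avg300 = that percentage averaged over the last 10, 60, and 300 seconds (so avg10=0.25 means 0.25%)
total = cumulative stall time in microseconds since the counter started

For latency-sensitive services, sustained some avg10 above a few percent is already a problem. Sustained full is usually catastrophic.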
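A small parser makes it easier to compare a reading against a budget. This is a sketch under stated assumptions: psi_field and psi_check are hypothetical helper names, and the 2.0 default threshold is only a placeholder for “a few percent”, not a recommended value.

```shell
# Extract one averaged field (avg10, avg60 or avg300) from a PSI file.
# Args: kind (some|full), field name, PSI file (default: this cgroup).
psi_field() {
  awk -v kind="$1" -v field="$2" '
    $1 == kind {
      for (i = 2; i <= NF; i++) {
        split($i, kv, "=")
        if (kv[1] == field) print kv[2]
      }
    }' "${3:-/sys/fs/cgroup/memory.pressure}"
}

# Print "over" when "some avg10" exceeds the threshold (in percent), else "ok".
# Args: threshold (default 2.0, an example value), PSI file.
psi_check() {
  awk -v v="$(psi_field some avg10 "${2:-/sys/fs/cgroup/memory.pressure}")" \
      -v t="${1:-2.0}" \
      'BEGIN { s = (v + 0 > t + 0) ? "over" : "ok"; print s }'
}
```

The same helpers work on /proc/pressure/memory on the node, since both files share the format.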

B. Correlate with latency spikes:

Run this during an incident (or in a loop):

while true; do
  date
  cat /sys/fs/cgroup/memory.current
  cat /sys/fs/cgroup/memory.events
  cat /sys/fs/cgroup/memory.pressure
  echo
  sleep 5
done

You’re looking for:

memory.current near/over memory.high
high counter climbing
PSI some/full climbing
p99 latency climbing at the same time

C. Validate at node-level (to catch global pressure):

On the node (via SSH or kubectl debug node/...):

cat /proc/pressure/memory
cat /proc/pressure/io

If node-level PSI spikes too, you may have a node headroom issue (kube-reserved/system-reserved too small) in addition to the pod’s memory.high.
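The avg fields are smoothed, so short spikes can be easier to see from the total field, which accumulates stall time in microseconds. Over an interval of Δt seconds, the stalled share is Δtotal / (Δt × 1,000,000) × 100. A minimal sketch of that arithmetic follows; psi_total and stall_pct are hypothetical helper names.

```shell
# Read the cumulative stall time in microseconds ("total=") for some|full.
# Args: kind (some|full), PSI file (default: node-level memory PSI).
psi_total() {
  awk -v kind="$1" '$1 == kind { sub(/^total=/, "", $NF); print $NF }' \
    "${2:-/proc/pressure/memory}"
}

# Percent of wall-clock time stalled between two readings of total.
# Args: earlier total (us), later total (us), elapsed seconds.
stall_pct() {
  awk -v a="$1" -v b="$2" -v s="$3" \
    'BEGIN { printf "%.2f\n", (b - a) / (s * 1000000) * 100 }'
}

# Usage on a node (not executed here):
#   t0=$(psi_total some); sleep 10; t1=$(psi_total some)
#   stall_pct "$t0" "$t1" 10
```

For example, 500,000 µs of additional stall time over 5 seconds works out to 10% of wall-clock time.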

Safe mitigations

Pick the least invasive first:

Move the workload to Guaranteed QoS for critical services (CPU and memory requests equal to limits on every container).
- This reduces surprises from runtime/QoS knobs that treat Burstable workloads more aggressively.
Increase the memory limit slightly if the workload’s steady-state + normal spikes are too close to the backpressure threshold.
Reduce allocation rate / fan-out (temporary):
- lower concurrency / thread pool
- cap in-flight requests
- reduce per-request buffering (especially large JSON/protobuf payloads)
Drain a small number of “hot” nodes (if only some nodes show PSI spikes).

Risky mitigations (can cause collateral damage)

Disabling memory.high globally (e.g., on kubepods.slice) without understanding why it was enabled.
- This can turn a “latency problem” into “node OOM and mass eviction”.
Aggressive cache drops (echo 3 > /proc/sys/vm/drop_caches)
- Often makes things worse by increasing refaults and IO.
Restarting pods as a “fix”
- It may reset memory state temporarily, but it also hides the mechanism and can trigger connection storms.

What we changed (concrete)

1) Make the service Guaranteed + add real headroom

Before (Burstable, too tight):

resources:
  requests:
    memory: "512Mi"
  limits:
    memory: "768Mi"

After (Guaranteed + headroom; CPU requests and limits, not shown, must match as well for the pod to be classed Guaranteed):

resources:
  requests:
    memory: "1Gi"
  limits:
    memory: "1Gi"
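It is worth confirming the class Kubernetes actually assigned, because a single container with mismatched CPU or memory values leaves the pod Burstable. The sketch below reads the pod's status.qosClass field with kubectl; pod_qos and require_guaranteed are hypothetical helper names, and the pod name and namespace are whatever your deployment uses.

```shell
# Print the QoS class assigned to a pod (Guaranteed, Burstable or BestEffort).
# Args: pod name, namespace (defaults to "default").
pod_qos() {
  kubectl get pod "$1" -n "${2:-default}" -o jsonpath='{.status.qosClass}'
}

# Succeed only if the pod is Guaranteed; otherwise print a hint and fail.
require_guaranteed() {
  local qos
  qos="$(pod_qos "$1" "${2:-default}")"
  if [ "$qos" = "Guaranteed" ]; then
    echo "$1: Guaranteed"
  else
    echo "$1: $qos (check that CPU and memory requests equal limits on every container)"
    return 1
  fi
}
```

A check like this fits naturally into a post-deploy step for latency-critical services.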

2) Remove an accidental node-level `MemoryHigh` clamp on kubepods slice

We found a systemd drop-in that applied MemoryHigh to kubepods.slice too aggressively on new nodes.

Diff (illustrative):

# /etc/systemd/system/kubepods.slice.d/10-memoryhigh.conf

[Slice]
-MemoryHigh=85%
+MemoryHigh=infinity

We kept Kubernetes eviction thresholds as the primary node protection mechanism.

3) Add an explicit “latency under memory pressure” alert

We added alerts for:

memory.events{high} rate (if you scrape it)
PSI memory some/full (node-level is a great starting point)

Even without full automation, we documented these commands in the on-call runbook.

How to verify (measurable checks)

During load, the high counter in memory.events should stop climbing rapidly:

watch -n 2 'cat /sys/fs/cgroup/memory.events'

PSI should stay low during steady traffic:

watch -n 2 'cat /sys/fs/cgroup/memory.pressure'

Latency recovers without restarts:

p99 back to baseline
no correlated spikes in memory PSI

Node-level headroom holds:

node /proc/pressure/memory doesn’t spike across the fleet
fewer eviction events in kubelet logs

Prevention / guardrails

Contracts we enforce

Latency-critical services must be Guaranteed (or explicitly reviewed if Burstable).
Memory headroom budget:
- keep memory.current < ~70–80% of memory.max under peak expected load (service-specific)
PSI budget:
- sustained memory PSI full should be ~0
- sustained some should stay below a low single-digit % during normal operation
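The headroom budget can be checked with simple arithmetic on the two cgroup files. This is a sketch, with mem_used_pct as a hypothetical helper name. Keep in mind that memory.current includes page cache as well as anonymous memory, so it usually reads higher than RSS.

```shell
# Print memory.current as a percentage of memory.max for a cgroup directory.
# Prints "unlimited" when memory.max is the literal "max" (no hard limit set).
# Args: cgroup directory (default: this cgroup as seen from inside the pod).
mem_used_pct() {
  local dir cur max
  dir="${1:-/sys/fs/cgroup}"
  cur="$(cat "$dir/memory.current")"
  max="$(cat "$dir/memory.max")"
  if [ "$max" = "max" ]; then
    echo "unlimited"
  else
    awk -v c="$cur" -v m="$max" 'BEGIN { printf "%.1f\n", c / m * 100 }'
  fi
}
```

For instance, 768Mi in use against a 1Gi limit prints 75.0, which sits inside the 70–80% budget above.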

Alerts worth having

Node memory PSI some avg10 above threshold for N minutes
Rate of high events > 0 for latency-critical pods
Increase in major faults/refault rate (where available)
