Kubernetes OOM Killer: Why Your Container Dies at 50% Memory
OOMKilled is one of those messages that feels personal after midnight. “Our container has a 4GB memory limit but gets OOMKilled at 2GB RSS.” I’ve heard this complaint from countless teams, and the confusion is completely understandable. You look at top or your monitoring dashboard, see 2GB of memory usage, and then, boom, OOMKilled. Where did the other 2GB go? Are the metrics lying?
The metrics aren’t lying. They’re just not telling you the whole story. Container memory accounting in Linux includes far more than what your application allocates. The kernel tracks page cache, socket buffers, kernel data structures, and various other overhead—all counting against your container’s memory limit. Understanding this hidden memory is the key to setting limits that don’t cause unexpected OOMKills.
I first ran into this problem with a data processing service that read large files from S3. The application itself used about 1.5GB of heap, but it kept getting killed with a 4GB limit. The culprit was Linux’s helpful habit of caching file data in memory. Every file we read stayed in the page cache, silently consuming memory until we hit the cgroup limit.
Tested on: Kubernetes 1.28, cgroup v2, Java and Go applications
Understanding Container Memory Accounting
Let’s break down exactly what counts against your container’s memory limit. This knowledge is essential for setting appropriate limits and debugging OOMKill issues.
What Counts Against Memory Limit
When you set a memory limit on a container, Linux’s cgroup controller tracks multiple categories of memory usage:
Container memory cgroup includes:
1. RSS (Resident Set Size)
   - Your application's heap
   - Stack memory
   - Mapped files (actually in RAM)
2. Page Cache
   - File reads cached by kernel
   - Can be reclaimed under pressure
3. Kernel Memory (kmem)
   - Socket buffers
   - Dentry cache
   - inode cache
4. Swap (if enabled; cgroup v2 tracks it separately in memory.swap.current)

Total cgroup memory = RSS + Cache + Kmem
┌─────────────────────────────────────────────────────────────┐
│ Container limit:     4GB                                    │
│                                                             │
│ Application RSS:     2.0GB  ← What you see in top           │
│ Page Cache:          1.5GB  ← File reads cached             │
│ Kernel Buffers:      0.6GB  ← Socket buffers, etc           │
│ ─────────────────────────────                               │
│ Total cgroup usage:  4.1GB  ← EXCEEDS LIMIT = OOMKill!      │
└─────────────────────────────────────────────────────────────┘
The critical insight is that your application’s heap (what most monitoring shows as “memory usage”) is often only a fraction of total cgroup memory consumption. The rest is overhead that’s invisible unless you know where to look.
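To make the arithmetic in the box concrete, here is a small sketch that adds up the three components and shows what share of the total a tool reporting only RSS would see. The function name and dictionary layout are my own for illustration; the numbers are the ones from the box.

```python
def usage_summary(limit_gb, components_gb):
    """Sum cgroup memory components and compare them with the limit."""
    total = sum(components_gb.values())
    return {
        "total_gb": round(total, 2),
        "over_limit": total > limit_gb,
        # Share of the total that an RSS-only dashboard would report.
        "rss_share": round(components_gb["rss"] / total, 2),
    }


# Figures taken from the box: RSS, page cache, kernel buffers (in GB).
box = {"rss": 2.0, "page_cache": 1.5, "kernel_buffers": 0.6}
summary = usage_summary(4.0, box)
print(summary)
```

With these inputs the RSS share comes out just under half, which is where the "dies at 50% memory" impression comes from.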
The Page Cache Problem
Linux aggressively caches file data in memory. When you read a file, the kernel keeps that data in RAM in case you need it again. This is usually great for performance—repeated file access becomes much faster.
The problem in containers is that this cached data counts against your memory limit. If your application reads 2GB of files during processing, those 2GB sit in the page cache even after your application is done with them. The page cache is “reclaimable”—the kernel can evict it under memory pressure—but by the time pressure builds, you might already be at the limit.
Common Culprits
Several patterns commonly trigger unexpected OOMKills:
# 1. High file I/O (page cache explosion)
# Reading lots of files fills page cache
cat /sys/fs/cgroup/memory.stat | grep -E "^(file|anon)"
# 2. Many network connections (socket buffers)
ss -s # Count sockets
cat /proc/net/sockstat # Socket memory
# 3. JVM off-heap memory
# DirectByteBuffers, memory-mapped files
# Don't count toward -Xmx
# 4. Native memory in language runtimes
# Go GC overhead, Python object overhead
Let me elaborate on each:
File I/O patterns: ETL jobs, log processors, and data pipelines that read many files are prime candidates. Each file read adds to the page cache. Even if you close the file handle, the data stays cached.
Network connections: Each TCP socket has send and receive buffers. The kernel sizes them dynamically, so idle connections cost little, but a busy socket can easily hold around 200KB. A service with 1000 active connections might then have 200MB in socket buffers alone, memory that doesn’t show up in your application’s heap.
JVM off-heap: Java’s -Xmx only controls heap size. DirectByteBuffer allocations (used by NIO, Netty, etc.), Metaspace, code cache, and thread stacks all consume additional memory outside the heap.
Native memory: Every language runtime has overhead. Go’s garbage collector needs headroom to work efficiently. Python’s object system has significant per-object overhead. C extensions in Python or Ruby can allocate memory that the interpreter doesn’t track.
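For the socket-buffer case, a rough way to put a number on it is to read the mem field on the TCP line of /proc/net/sockstat, which is counted in pages rather than bytes. The sketch below parses that text; the function name and the sample contents are hypothetical. One caveat: as far as I know that figure is accounted kernel-wide rather than per container, so treat it as an upper bound and use the sock field in memory.stat for the per-cgroup number.

```python
def tcp_buffer_bytes(sockstat_text, page_size=4096):
    """Return bytes held by TCP buffers according to sockstat-style text.

    The "mem" value on the TCP line is a page count, so it is multiplied
    by the page size (4096 is assumed here; check os.sysconf on your host).
    """
    for line in sockstat_text.splitlines():
        if line.startswith("TCP:"):
            fields = line.split()[1:]  # ["inuse", "1000", "orphan", "0", ...]
            pairs = dict(zip(fields[0::2], fields[1::2]))
            return int(pairs["mem"]) * page_size
    return 0


# Hypothetical sample; on a real system read /proc/net/sockstat instead.
sample = """sockets: used 1285
TCP: inuse 1000 orphan 0 tw 12 alloc 1004 mem 51200
UDP: inuse 4 mem 2
"""
print(tcp_buffer_bytes(sample) / 1024**2, "MiB")
```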
Diagnosing OOMKills
When a container gets OOMKilled, you need to understand what actually consumed the memory. Here’s how to investigate:
Check What Really Killed It
# Find OOMKill events
kubectl describe pod <pod-name> | grep -A5 "Last State"
# Get detailed cgroup stats while the container is still running
# (kubectl exec cannot reach a container that has already been killed)
kubectl exec <pod> -- cat /sys/fs/cgroup/memory.stat
# Key metrics:
# anon: Anonymous memory (heap, stack)
# file: Page cache
# kernel: Kernel data structures
# sock: Socket buffers
The memory.stat file is your best friend for understanding memory breakdown. Unlike high-level metrics that often just show RSS, this file breaks down exactly where memory is going.
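Alongside memory.stat, cgroup v2 also exposes memory.events, whose oom_kill counter records kills inside the cgroup and whose max counter records how often usage hit the hard limit. A minimal sketch of reading it follows; the helper names and sample values are made up for illustration, and because the counters live with the cgroup, a freshly restarted container may start again from zero.

```python
def parse_memory_events(text):
    """Parse the "<name> <count>" lines of a cgroup v2 memory.events file."""
    events = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            events[parts[0]] = int(parts[1])
    return events


def was_oom_killed(events):
    """True if the kernel killed at least one process in this cgroup."""
    return events.get("oom_kill", 0) > 0


# Hypothetical sample; on a real pod read /sys/fs/cgroup/memory.events.
sample = "low 0\nhigh 0\nmax 37\noom 1\noom_kill 1\n"
events = parse_memory_events(sample)
print(events["max"], was_oom_killed(events))
```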
Memory Stat Breakdown
Let me walk through how to read this file:
# Inside container
cat /sys/fs/cgroup/memory.stat
# Output interpretation:
anon 1073741824 # 1GB - actual app memory
file 1610612736 # 1.5GB - page cache (reclaimable)
kernel 104857600 # 100MB - kernel structures
sock 52428800 # 50MB - socket buffers
shmem 0 # 0 - shared memory
# ... more fields
# memory.current shows total:
cat /sys/fs/cgroup/memory.current
# 2841640960 (2.6GB total)
In this example, the application is using 1GB of actual memory (anon), but total cgroup usage is 2.6GB due to page cache and kernel overhead. If this container had a 2.5GB limit, it would already be over budget and at risk of an OOMKill unless the kernel could reclaim cache fast enough, despite the application only “using” 1GB.
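The same example can be checked with a few lines of arithmetic. This is only a sketch using the sample values from the listing above; the function name and the choice to express the 2.5GB limit in binary units are my assumptions.

```python
GIB = 1024**3

# Values copied from the sample memory.stat output above.
sample_stats = {
    "anon": 1073741824,
    "file": 1610612736,
    "kernel": 104857600,
    "sock": 52428800,
}


def breakdown(stats, limit_bytes):
    """Add up the charged components and compare the sum with a limit."""
    charged = stats["anon"] + stats["file"] + stats["kernel"] + stats["sock"]
    return {
        "charged_bytes": charged,
        "app_share": round(stats["anon"] / charged, 2),
        "cache_share": round(stats["file"] / charged, 2),
        "over_limit": charged > limit_bytes,
    }


result = breakdown(sample_stats, int(2.5 * GIB))
print(result)
```

The sum matches the memory.current value shown in the listing, and the page cache share is larger than the application's own.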
Why Didn’t Page Cache Get Reclaimed?
You might wonder: if page cache is reclaimable, why does it trigger OOMKill? The answer involves timing and the kernel’s reclaim behavior.
The kernel starts reclaiming page cache when memory pressure builds, and when an allocation would push the cgroup past its limit it first tries to reclaim within that cgroup. But reclaim takes time and can fail to free enough. If your application suddenly allocates a large chunk of memory (like processing a big request) while most of the cache is dirty, mapped, or actively in use, reclaim falls short and the OOM killer fires.
Additionally, some cached pages might be “dirty” (modified but not yet written to disk). Dirty pages can’t be immediately reclaimed—they need to be written first. A write-heavy workload can have significant dirty page cache that isn’t quickly reclaimable.
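To get a feel for how much of the cache can actually be dropped on short notice, you can subtract the dirty and writeback counters from the file total. The sketch below does that with cgroup v2 field names (file_dirty, file_writeback, shmem); the function name and sample numbers are hypothetical, and the result is a rough estimate rather than what the kernel will really reclaim.

```python
MIB = 1024**2


def quickly_reclaimable(stats):
    """Rough estimate of page cache the kernel can drop without doing I/O.

    Dirty and writeback pages have to reach disk first, and shmem (tmpfs)
    pages are counted under "file" but have no backing file to fall back on.
    """
    blocked = (
        stats.get("file_dirty", 0)
        + stats.get("file_writeback", 0)
        + stats.get("shmem", 0)
    )
    return max(stats.get("file", 0) - blocked, 0)


# Hypothetical write-heavy workload.
stats = {
    "file": 1536 * MIB,
    "file_dirty": 400 * MIB,
    "file_writeback": 100 * MIB,
    "shmem": 36 * MIB,
}
print(quickly_reclaimable(stats) // MIB)  # 1000
```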
Solutions
Now that we understand the problem, let’s fix it.
1. Set Realistic Limits (Account for Overhead)
The most common fix is simply setting higher limits that account for the reality of container memory accounting:
# deployment.yaml
resources:
  requests:
    memory: "2Gi"   # What app typically uses
  limits:
    memory: "4Gi"   # App + cache + buffers overhead

# Rule of thumb:
# limit = expected_rss × 1.5 to 2.0
# Adjust based on I/O patterns
For I/O-heavy workloads (ETL, data processing, file serving), use a 2x multiplier or higher. For memory-bound workloads with little I/O, 1.3-1.5x might be sufficient.
The request should reflect typical memory usage. The limit should account for worst-case scenarios including cache and kernel overhead.
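The rule of thumb is easy to turn into a small helper. This sketch multiplies the expected RSS by the chosen factor and rounds up to a tidy step; the function name, the 64Mi step, and the example inputs are my own assumptions, not a recommendation for any particular workload.

```python
import math


def suggest_limit_mib(expected_rss_mib, multiplier=1.5, step_mib=64):
    """Apply limit = expected_rss x multiplier, rounded up to step_mib."""
    if multiplier < 1.0:
        raise ValueError("a multiplier below 1.0 puts the limit under RSS")
    raw = expected_rss_mib * multiplier
    return math.ceil(raw / step_mib) * step_mib


# I/O-heavy service with about 2Gi of RSS, using the 2x multiplier.
print(f"{suggest_limit_mib(2048, 2.0)}Mi")
# Memory-bound service with little I/O, using 1.3x.
print(f"{suggest_limit_mib(1500, 1.3)}Mi")
```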
2. Tune JVM Memory Correctly
JVM applications are particularly tricky because Java has multiple memory regions, and -Xmx only controls one of them:
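# For JVM applications
# Note: the JVM does not read JAVA_OPTS itself; the image's entrypoint must
# pass it to java (JAVA_TOOL_OPTIONS is picked up by the JVM directly)
env:
  - name: JAVA_OPTS
    value: >-
      -XX:+UseContainerSupport
      -XX:MaxRAMPercentage=75.0
      -XX:+ExitOnOutOfMemoryError
      -XX:NativeMemoryTracking=summary

# MaxRAMPercentage=75% leaves room for:
# - Metaspace
# - Code cache
# - Direct buffers
# - Thread stacks
# - Kernel overhead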
The 75% rule is important. With a 4GB container limit, setting -Xmx4g is a recipe for OOMKills. The heap might fit, but Metaspace, code cache, direct buffers, thread stacks, and kernel overhead push you over. Use 75% (3GB heap for 4GB limit) to leave room for everything else.
UseContainerSupport (default since JDK 10) tells the JVM to read cgroup limits rather than host memory. Always verify this is working—some container configurations (like using the host PID namespace) can confuse the JVM.
NativeMemoryTracking lets you see where JVM memory is going beyond the heap. Use jcmd <pid> VM.native_memory summary to get a breakdown.
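Here is a back-of-the-envelope check of the 75% rule. It approximates the heap the JVM would choose for a given limit and tests whether a set of non-heap estimates still fits. The function names are mine, and the non-heap figures are placeholders for illustration only; take real values from the NativeMemoryTracking summary.

```python
GIB = 1024**3
MIB = 1024**2


def heap_budget(limit_bytes, max_ram_percentage=75.0):
    """Approximate the max heap for a container limit, plus what is left over."""
    heap = int(limit_bytes * max_ram_percentage / 100)
    return heap, limit_bytes - heap


def fits_in_limit(limit_bytes, heap_bytes, non_heap_bytes):
    """Check heap plus estimated non-heap regions against the limit."""
    return heap_bytes + sum(non_heap_bytes.values()) <= limit_bytes


# Placeholder estimates, not measurements.
non_heap = {
    "metaspace": 256 * MIB,
    "code_cache": 240 * MIB,
    "thread_stacks": 200 * 1 * MIB,  # 200 threads at 1 MiB each
    "direct_buffers": 256 * MIB,
}

heap, leftover = heap_budget(4 * GIB)
print(heap // MIB, leftover // MIB)               # 3072 1024
print(fits_in_limit(4 * GIB, heap, non_heap))     # True
print(fits_in_limit(4 * GIB, 4 * GIB, non_heap))  # False
```

The last line is the -Xmx4g case: the heap alone already consumes the whole limit, so any non-heap memory pushes the container over.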
3. Limit Page Cache Growth
For workloads that read lots of files, you can take steps to reduce page cache accumulation:
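# For I/O heavy workloads, memory.high triggers reclaim before the hard limit.
# A plain limit like the one below only sets memory.max; Kubernetes sets
# memory.high only with the alpha MemoryQoS feature gate enabled.
apiVersion: v1
kind: Pod
spec:
  containers:
    - name: app
      resources:
        limits:
          memory: "4Gi"

# Or in application - use O_DIRECT for large files
# Bypasses page cache (dd shown here writing a test file with direct I/O)
dd if=/dev/zero of=test bs=1M count=1000 oflag=direct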
O_DIRECT bypasses the page cache entirely for I/O operations. This is useful for applications that read large files sequentially and won’t benefit from caching. Database engines often use O_DIRECT for data files because they manage their own caching.
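Another approach is to run sync; echo 3 > /proc/sys/vm/drop_caches periodically to force cache drops. This is crude but effective for batch jobs that process files in phases. Note that it needs root on the node (or a privileged container, since /proc/sys is read-only otherwise) and that it drops caches for every container on the node, so use it carefully.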
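A gentler, per-file option that this article does not otherwise cover is posix_fadvise with POSIX_FADV_DONTNEED, which asks the kernel to drop the cached pages of one file after you have finished reading it. I am swapping it in here because O_DIRECT needs aligned buffers and is awkward from Python. The sketch below hashes a file and then issues the hint; the function name is hypothetical, the call is advisory only, and it generally affects clean pages, so do not expect it to help with dirty data.

```python
import hashlib
import os


def sha256_without_caching(path, chunk_size=1024 * 1024):
    """Hash a file sequentially, then ask the kernel to drop its cached pages."""
    digest = hashlib.sha256()
    fd = os.open(path, os.O_RDONLY)
    try:
        while True:
            chunk = os.read(fd, chunk_size)
            if not chunk:
                break
            digest.update(chunk)
        if hasattr(os, "posix_fadvise"):  # not available on every platform
            try:
                # Offset 0 with length 0 means "the whole file".
                os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
            except OSError:
                pass  # the hint is optional; some filesystems reject it
    finally:
        os.close(fd)
    return digest.hexdigest()
```

Comparing the file counter in memory.stat before and after a large read is a simple way to see whether the hint has any effect on your filesystem.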
4. Monitor Memory Components
Proactive monitoring prevents surprises. Track not just total memory but its breakdown:
# Python - track memory breakdown
def get_cgroup_memory(path='/sys/fs/cgroup/memory.stat'):
    """Read cgroup v2 memory.stat into a dict of counter name -> value."""
    stats = {}
    with open(path) as f:
        for line in f:
            # Each line is "<name> <value>"
            key, value = line.split()
            stats[key] = int(value)
    return stats


def get_memory_breakdown():
    """Return the main memory components in GiB."""
    stats = get_cgroup_memory()
    return {
        'anon_gb': stats.get('anon', 0) / (1024**3),
        'file_gb': stats.get('file', 0) / (1024**3),
        'kernel_gb': stats.get('kernel', 0) / (1024**3),
        'sock_gb': stats.get('sock', 0) / (1024**3),
    }
Export these as Prometheus metrics so you can see trends over time. A gradually growing page cache might indicate a leak or accumulating state that will eventually cause problems.
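If you don't want to pull in a client library, the Prometheus text exposition format is simple enough to render by hand. The sketch below turns a breakdown into gauge lines; the metric name cgroup_memory_component_bytes and the label names are invented for this example, so rename them to match your own conventions.

```python
def to_prometheus_text(breakdown_bytes, container="app"):
    """Render memory components as Prometheus text-format gauge samples."""
    metric = "cgroup_memory_component_bytes"  # hypothetical metric name
    lines = [
        f"# HELP {metric} Memory charged to the cgroup, by component.",
        f"# TYPE {metric} gauge",
    ]
    for component, value in sorted(breakdown_bytes.items()):
        lines.append(
            f'{metric}{{container="{container}",component="{component}"}} {value}'
        )
    return "\n".join(lines) + "\n"


# Sample values in bytes; in practice feed in the parsed memory.stat counters.
print(to_prometheus_text({"anon": 1073741824, "file": 1610612736}))
```

Exporting bytes rather than gigabytes keeps the metric in base units, which is the usual Prometheus convention.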
5. Use memory.high for Soft Limits
cgroup v2 introduced memory.high, a soft limit that triggers aggressive reclaim without killing the container:
# In cgroup v2, memory.high triggers reclaim
# Kubernetes derives it from the request/limit gap only when the
# alpha MemoryQoS feature gate is enabled
# Or set it directly (needs a writable cgroup filesystem, typically on the node):
echo $((3 * 1024 * 1024 * 1024)) > /sys/fs/cgroup/memory.high
# memory.high = soft limit (triggers reclaim)
# memory.max = hard limit (triggers OOMKill)
When usage exceeds memory.high, the kernel aggressively reclaims memory, slowing the application but not killing it. This gives page cache and other reclaimable memory time to be freed before hitting the hard limit.
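In Kubernetes, the gap between requests.memory and limits.memory influences reclaim only when the alpha MemoryQoS feature gate is enabled. The kubelet then derives memory.high from the two values, and a larger gap means more headroom before aggressive reclaim kicks in. Without that feature gate, memory.high is left unset and only the hard limit applies.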
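As a worked example of how a request/limit pair could translate into memory.high, the sketch below follows the formula I understand the alpha MemoryQoS feature to use in recent Kubernetes releases: the request plus a throttling factor times the gap, rounded down to a page boundary. The 0.9 default factor and the 4096-byte page size are assumptions on my part, so check the documentation for your version before relying on the numbers.

```python
import math

GIB = 1024**3


def memory_high_bytes(request_bytes, limit_bytes,
                      throttling_factor=0.9, page_size=4096):
    """Estimate memory.high from a request/limit pair (assumed formula)."""
    if limit_bytes < request_bytes:
        raise ValueError("limit must not be smaller than the request")
    raw = request_bytes + throttling_factor * (limit_bytes - request_bytes)
    # Round down to a whole number of pages.
    return math.floor(raw / page_size) * page_size


high = memory_high_bytes(2 * GIB, 4 * GIB)
print(high, round(high / GIB, 2))
```

With a 2Gi request and a 4Gi limit this lands at roughly 3.8Gi, leaving a small band below the hard limit where the kernel throttles and reclaims instead of killing.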
Monitoring
Set up alerts to catch memory issues before they cause OOMKills:
# Prometheus alert - approaching memory limit
# (the "> 0" filter skips containers that have no limit set)
- alert: ContainerMemoryNearLimit
  expr: |
    container_memory_working_set_bytes /
    (container_spec_memory_limit_bytes > 0) > 0.85
  for: 5m
  annotations:
    summary: "Container {{ $labels.container }} at 85%+ memory"

# Alert on page cache growth
- alert: HighPageCacheUsage
  expr: |
    (container_memory_cache / container_memory_working_set_bytes) > 0.5
  for: 15m
  annotations:
    summary: "Page cache > 50% of working set"

# Alert on OOMKills
- alert: ContainerOOMKilled
  expr: |
    increase(kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}[5m]) > 0
  annotations:
    summary: "Container {{ $labels.container }} was OOMKilled"
The 85% threshold gives you time to investigate before hitting 100%. The page cache alert catches containers where cached data is dominating memory usage—often a sign that limits need adjusting or the application’s I/O patterns need review.
Checklist
## Container Memory Configuration
### Sizing
- [ ] Account for page cache (1.5-2x RSS)
- [ ] Add buffer for kernel memory (10-20%)
- [ ] Test under realistic I/O load
### JVM Specific
- [ ] Use -XX:+UseContainerSupport
- [ ] Set MaxRAMPercentage to 75% max
- [ ] Enable NativeMemoryTracking for debugging
### Monitoring
- [ ] Alert on memory > 85% of limit
- [ ] Track anon vs file memory ratio
- [ ] Monitor OOMKill events
### Debugging
- [ ] Check /sys/fs/cgroup/memory.stat on OOM
- [ ] Verify what's in page cache
- [ ] Profile native memory usage
Conclusion
Container memory is much more than what your application allocates. The kernel’s memory accounting includes caches, buffers, and overhead that most monitoring tools don’t prominently display. Understanding this gap between “application memory” and “cgroup memory” is essential for reliable container operation.
Key takeaways:
- Page cache and kernel buffers count against your limit—they’re not “free” memory
- Set limits 1.5-2x expected RSS to account for overhead, especially for I/O workloads
- JVM applications need 75% MaxRAMPercentage maximum—the other 25% goes to non-heap memory
- Monitor all memory components—RSS alone doesn’t tell the full story
- Check memory.stat when debugging OOMKills—it reveals where memory actually went
Before setting memory limits, run your application under realistic load and check /sys/fs/cgroup/memory.stat. The numbers there tell you what limits you actually need.
Related Articles
- Container Page Cache Thrashing - Memory pressure details
- Go GOMAXPROCS in Containers - Container resource detection
- Java Native Memory OOMKilled - JVM memory outside the heap