Kubernetes OOM Killer: Why Your Container Dies at 50% Memory

OOMKilled is one of those messages that feels personal after midnight. “Our container has a 4GB memory limit but gets OOMKilled at 2GB RSS.” I’ve heard this complaint from countless teams, and the confusion is completely understandable. You look at top or your monitoring dashboard, see 2GB of memory usage, and then, boom, OOMKilled. Where did the other 2GB go? Are the metrics lying?

The metrics aren’t lying. They’re just not telling you the whole story. Container memory accounting in Linux includes far more than what your application allocates. The kernel tracks page cache, socket buffers, kernel data structures, and various other overhead—all counting against your container’s memory limit. Understanding this hidden memory is the key to setting limits that don’t cause unexpected OOMKills.

I first ran into this problem with a data processing service that read large files from S3. The application itself used about 1.5GB of heap, but it kept getting killed with a 4GB limit. The culprit was Linux’s helpful habit of caching file data in memory. Every file we read stayed in the page cache, silently consuming memory until we hit the cgroup limit.

Tested on: Kubernetes 1.28, cgroup v2, Java and Go applications

Understanding Container Memory Accounting

Let’s break down exactly what counts against your container’s memory limit. This knowledge is essential for setting appropriate limits and debugging OOMKill issues.

What Counts Against Memory Limit

When you set a memory limit on a container, Linux’s cgroup controller tracks multiple categories of memory usage:

Container memory cgroup includes:

1. RSS (Resident Set Size)
   - Your application's heap
   - Stack memory
   - Mapped files (actually in RAM)

2. Page Cache
   - File reads cached by kernel
   - Can be reclaimed under pressure

3. Kernel Memory (kmem)
   - Socket buffers
   - Dentry cache
   - inode cache

4. Swap (if enabled; cgroup v2 accounts and limits it separately via memory.swap.max)

Total cgroup memory = RSS + Cache + Kmem

┌─────────────────────────────────────────────────────────────┐
│ Container limit: 4GB                                         │
│                                                              │
│ Application RSS:     2.0GB  ← What you see in top          │
│ Page Cache:          1.5GB  ← File reads cached            │
│ Kernel Buffers:      0.6GB  ← Socket buffers, etc          │
│ ─────────────────────────────                               │
│ Total cgroup usage:  4.1GB  ← EXCEEDS LIMIT = OOMKill!     │
└─────────────────────────────────────────────────────────────┘

The critical insight is that your application’s heap (what most monitoring shows as “memory usage”) is often only a fraction of total cgroup memory consumption. The rest is overhead that’s invisible unless you know where to look.
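The arithmetic in the diagram can be sketched in a few lines. This is only an illustration: the CgroupUsage class and its field names are hypothetical, and the figures are the ones from the diagram, not measurements. In practice the kernel will not let usage actually exceed the limit; the total is what the workload is trying to hold.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass
class CgroupUsage:
    """Hypothetical holder for the three categories shown in the diagram."""
    rss_bytes: int
    page_cache_bytes: int
    kernel_bytes: int

    @property
    def total_bytes(self) -> int:
        # Everything here is charged to the same cgroup limit.
        return self.rss_bytes + self.page_cache_bytes + self.kernel_bytes

    def rss_share(self) -> float:
        # Fraction of the charged memory that tools like top would show.
        return self.rss_bytes / self.total_bytes


usage = CgroupUsage(
    rss_bytes=int(2.0 * GIB),
    page_cache_bytes=int(1.5 * GIB),
    kernel_bytes=int(0.6 * GIB),
)
limit_bytes = 4 * GIB

print(f"total: {usage.total_bytes / GIB:.1f} GiB")          # total: 4.1 GiB
print(f"RSS share of total: {usage.rss_share():.0%}")       # RSS share of total: 49%
print(f"over limit: {usage.total_bytes > limit_bytes}")     # over limit: True
```

The RSS share of roughly half is where the “dies at 50% memory” impression comes from.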

The Page Cache Problem

Linux aggressively caches file data in memory. When you read a file, the kernel keeps that data in RAM in case you need it again. This is usually great for performance—repeated file access becomes much faster.

The problem in containers is that this cached data counts against your memory limit. If your application reads 2GB of files during processing, those 2GB sit in the page cache even after your application is done with them. The page cache is “reclaimable”—the kernel can evict it under memory pressure—but by the time pressure builds, you might already be at the limit.

Common Culprits

Several patterns commonly trigger unexpected OOMKills:

# 1. High file I/O (page cache explosion)
# Reading lots of files fills page cache
cat /sys/fs/cgroup/memory.stat | grep -E "^(file|anon)"

# 2. Many network connections (socket buffers)
ss -s  # Count sockets
cat /proc/net/sockstat  # Socket memory

# 3. JVM off-heap memory
# DirectByteBuffers, memory-mapped files
# Don't count toward -Xmx

# 4. Native memory in language runtimes
# Go GC overhead, Python object overhead

Let me elaborate on each:

File I/O patterns: ETL jobs, log processors, and data pipelines that read many files are prime candidates. Each file read adds to the page cache. Even if you close the file handle, the data stays cached.

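Network connections: Each TCP socket has send and receive buffers that the kernel sizes dynamically. Under load, a busy socket can hold on the order of 200KB; the exact figure depends on the net.ipv4.tcp_rmem and net.ipv4.tcp_wmem settings and on traffic. A service with 1000 such connections might have 200MB in socket buffers alone, memory that doesn’t show up in your application’s heap.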

JVM off-heap: Java’s -Xmx only controls heap size. DirectByteBuffer allocations (used by NIO, Netty, etc.), Metaspace, code cache, and thread stacks all consume additional memory outside the heap.

Native memory: Every language runtime has overhead. Go’s garbage collector needs headroom to work efficiently. Python’s object system has significant per-object overhead. C extensions in Python or Ruby can allocate memory that the interpreter doesn’t track.
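To put a number on the socket buffer case, here is a minimal sketch that parses the text of /proc/net/sockstat. The function names are my own, and the sample text is made up to match the 1000-connection example above. It assumes the TCP mem field is reported in pages, which is how the kernel documents it; note that the file covers the whole network namespace, not a single process.

```python
import os


def parse_sockstat(text: str) -> dict:
    """Parse /proc/net/sockstat text into {protocol: {field: value}}."""
    result = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        proto, rest = line.split(":", 1)
        tokens = rest.split()
        # After the protocol name, fields come as "name value" pairs.
        result[proto.strip()] = {
            name: int(value) for name, value in zip(tokens[::2], tokens[1::2])
        }
    return result


def tcp_buffer_bytes(stats: dict, page_size: int = 0) -> int:
    """Convert the TCP 'mem' page count into bytes."""
    if page_size <= 0:
        page_size = os.sysconf("SC_PAGE_SIZE")
    return stats.get("TCP", {}).get("mem", 0) * page_size


# Hypothetical sample, not real output from any system.
SAMPLE_SOCKSTAT = """\
sockets: used 1100
TCP: inuse 1000 orphan 0 tw 12 alloc 1005 mem 51200
UDP: inuse 2 mem 1
"""

stats = parse_sockstat(SAMPLE_SOCKSTAT)
total = tcp_buffer_bytes(stats, page_size=4096)
print(total // (1024 * 1024), "MiB in TCP buffers")
print(total // stats["TCP"]["inuse"], "bytes per socket on average")
```

On a live container you would pass the contents of /proc/net/sockstat instead of the sample string.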

Diagnosing OOMKills

When a container gets OOMKilled, you need to understand what actually consumed the memory. Here’s how to investigate:

Check What Really Killed It

# Find OOMKill events
kubectl describe pod <pod-name> | grep -A5 "Last State"

# Get detailed cgroup stats from the running container
# (counters reset on restart, so capture them before the next OOMKill)
kubectl exec <pod-name> -- cat /sys/fs/cgroup/memory.stat

# Key metrics:
# anon:     Anonymous memory (heap, stack)
# file:     Page cache
# kernel:   Kernel data structures
# sock:     Socket buffers

The memory.stat file is your best friend for understanding memory breakdown. Unlike high-level metrics that often just show RSS, this file breaks down exactly where memory is going.
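If you want to script the “find OOMKill events” step rather than grep through describe output, one option is to read the pod’s JSON. The sketch below assumes the standard containerStatuses layout returned by kubectl get pod <pod-name> -o json; the function name and the sample document are hypothetical.

```python
import json


def oom_killed_containers(pod: dict) -> list:
    """Return names of containers whose last termination reason was OOMKilled."""
    names = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        # lastState is an empty object when the container has never restarted.
        terminated = status.get("lastState", {}).get("terminated") or {}
        if terminated.get("reason") == "OOMKilled":
            names.append(status["name"])
    return names


# Trimmed, hypothetical sample of the pod JSON.
SAMPLE_POD_JSON = """
{
  "status": {
    "containerStatuses": [
      {"name": "app",
       "restartCount": 3,
       "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}}},
      {"name": "sidecar", "restartCount": 0, "lastState": {}}
    ]
  }
}
"""

print(oom_killed_containers(json.loads(SAMPLE_POD_JSON)))  # ['app']
```

Exit code 137 (128 + SIGKILL) alongside the OOMKilled reason is the usual signature.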

Memory Stat Breakdown

Let me walk through how to read this file:

# Inside container
cat /sys/fs/cgroup/memory.stat

# Output interpretation:
anon 1073741824      # 1GB - actual app memory
file 1610612736      # 1.5GB - page cache (reclaimable)
kernel 104857600     # 100MB - kernel structures
sock 52428800        # 50MB - socket buffers
shmem 0              # 0 - shared memory
# ... more fields

# memory.current shows total:
cat /sys/fs/cgroup/memory.current
# 2841640960 (2.6GB total)

In this example, the application is using 1GB of actual memory (anon), but total cgroup usage is 2.6GB due to page cache and kernel overhead. If this container had a 2.5GB limit, it would already be under constant reclaim pressure, and at risk of an OOMKill whenever the kernel could not free cache fast enough, despite the application only “using” 1GB.

Why Didn’t Page Cache Get Reclaimed?

You might wonder: if page cache is reclaimable, why does it trigger OOMKill? The answer involves timing and the kernel’s reclaim behavior.

The kernel starts reclaiming page cache when memory pressure builds. But this reclaim process takes time. If your application suddenly allocates a large chunk of memory (like processing a big request), it can push cgroup usage up to the limit before reclaim has freed enough cache. At that point the kernel tries to reclaim within the cgroup on the spot, and if that does not free enough memory, the OOM killer is invoked.

Additionally, some cached pages might be “dirty” (modified but not yet written to disk). Dirty pages can’t be immediately reclaimed—they need to be written first. A write-heavy workload can have significant dirty page cache that isn’t quickly reclaimable.
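A rough way to see how much of the page cache is actually easy to reclaim is to subtract the parts that are not. The sketch below assumes the cgroup v2 memory.stat field names file, file_dirty, file_writeback, and shmem (shmem is included in file but cannot be dropped without swap). The function names and sample values are hypothetical, and the result is only an estimate: mapped or actively used pages can also resist reclaim.

```python
def parse_memory_stat(text: str) -> dict:
    """Parse memory.stat text ("<key> <value>" per line) into a dict."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def cache_breakdown(stats: dict) -> dict:
    """Split the 'file' counter into hard-to-reclaim and clean estimates."""
    file_bytes = stats.get("file", 0)
    held_back = (
        stats.get("file_dirty", 0)        # must be written to disk first
        + stats.get("file_writeback", 0)  # write already in flight
        + stats.get("shmem", 0)           # tmpfs/shared memory, needs swap
    )
    return {
        "file": file_bytes,
        "held_back": held_back,
        "clean_estimate": max(file_bytes - held_back, 0),
    }


# Hypothetical sample values, not taken from a real container.
SAMPLE_MEMORY_STAT = """\
anon 1073741824
file 1610612736
file_dirty 268435456
file_writeback 0
shmem 134217728
"""

breakdown = cache_breakdown(parse_memory_stat(SAMPLE_MEMORY_STAT))
for name, value in breakdown.items():
    print(f"{name}: {value / 1024**2:.0f} MiB")
```

If clean_estimate is small compared with what your next allocation burst needs, reclaim alone is unlikely to save the container.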

Solutions

Now that we understand the problem, let’s fix it.

1. Set Realistic Limits (Account for Overhead)

The most common fix is simply setting higher limits that account for the reality of container memory accounting:

# deployment.yaml
resources:
  requests:
    memory: "2Gi"    # What app typically uses
  limits:
    memory: "4Gi"    # App + cache + buffers overhead

# Rule of thumb:
# limit = expected_rss × 1.5 to 2.0
# Adjust based on I/O patterns

For I/O-heavy workloads (ETL, data processing, file serving), use a 2x multiplier or higher. For memory-bound workloads with little I/O, 1.3-1.5x might be sufficient.

The request should reflect typical memory usage. The limit should account for worst-case scenarios including cache and kernel overhead.
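As a sketch of the rule of thumb, the helper below turns an expected RSS into a limit string. The function name and the default multipliers (2.0 for I/O-heavy, 1.5 otherwise) are just my encoding of the guidance above; treat the result as a starting point to validate under load, not a guarantee.

```python
import math
from typing import Optional


def suggest_memory_limit(expected_rss_mib: float, io_heavy: bool,
                         multiplier: Optional[float] = None) -> str:
    """Return a Kubernetes memory quantity (Mi) from expected RSS."""
    if multiplier is None:
        # Assumed defaults taken from the rule of thumb in this section.
        multiplier = 2.0 if io_heavy else 1.5
    return f"{math.ceil(expected_rss_mib * multiplier)}Mi"


print(suggest_memory_limit(2048, io_heavy=True))    # 4096Mi
print(suggest_memory_limit(2048, io_heavy=False))   # 3072Mi
print(suggest_memory_limit(2048, io_heavy=False, multiplier=1.3))
```

The first call reproduces the 2Gi request and 4Gi limit pairing from the deployment snippet.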

2. Tune JVM Memory Correctly

JVM applications are particularly tricky because Java has multiple memory regions, and -Xmx only controls one of them:

# For JVM applications
env:
  - name: JAVA_OPTS
    value: >-
      -XX:+UseContainerSupport
      -XX:MaxRAMPercentage=75.0
      -XX:+ExitOnOutOfMemoryError
      -XX:NativeMemoryTracking=summary

# MaxRAMPercentage=75% leaves room for:
# - Metaspace
# - Code cache
# - Direct buffers
# - Thread stacks
# - Kernel overhead

The 75% rule is important. With a 4GB container limit, setting -Xmx4g is a recipe for OOMKills. The heap might fit, but Metaspace, code cache, direct buffers, thread stacks, and kernel overhead push you over. Use 75% (3GB heap for 4GB limit) to leave room for everything else.

UseContainerSupport (default since JDK 10) tells the JVM to read cgroup limits rather than host memory. Always verify this is working—some container configurations (like using the host PID namespace) can confuse the JVM.

NativeMemoryTracking lets you see where JVM memory is going beyond the heap. Use jcmd <pid> VM.native_memory summary to get a breakdown.
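To sanity-check whether the non-heap regions fit in what MaxRAMPercentage leaves over, a small budget calculation helps. Everything below is an assumption for illustration: the function names are mine, and the thread count, stack size, Metaspace, code cache, and direct buffer figures are placeholders you would replace with numbers from your own NativeMemoryTracking output.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def jvm_budget(limit_bytes: int, max_ram_percentage: float = 75.0):
    """Return (max_heap_bytes, bytes_left_for_everything_else)."""
    heap = int(limit_bytes * max_ram_percentage / 100)
    return heap, limit_bytes - heap


def non_heap_estimate(threads: int, stack_bytes: int, metaspace_bytes: int,
                      code_cache_bytes: int, direct_bytes: int) -> int:
    """Sum the main JVM regions that live outside the heap."""
    return (threads * stack_bytes + metaspace_bytes
            + code_cache_bytes + direct_bytes)


heap, headroom = jvm_budget(4 * GIB, max_ram_percentage=75.0)

# Placeholder figures; substitute values measured with jcmd.
estimate = non_heap_estimate(
    threads=200,
    stack_bytes=1 * MIB,
    metaspace_bytes=256 * MIB,
    code_cache_bytes=240 * MIB,
    direct_bytes=256 * MIB,
)

print(f"heap: {heap / GIB:.1f} GiB, headroom: {headroom / MIB:.0f} MiB")
print(f"estimated non-heap: {estimate / MIB:.0f} MiB")
print(f"left for page cache and kernel: {(headroom - estimate) / MIB:.0f} MiB")
```

With these placeholder numbers only a few dozen MiB remain for page cache and kernel memory, which is why I/O-heavy JVM services often need a lower percentage or a higher limit.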

3. Limit Page Cache Growth

For workloads that read lots of files, you can take steps to reduce page cache accumulation:

# For I/O heavy workloads the hard limit is still the backstop.
# The pod spec has no memory.high field; see section 5 for that.
apiVersion: v1
kind: Pod
spec:
  containers:
  - name: app
    resources:
      limits:
        memory: "4Gi"

# Or in application - use O_DIRECT for large files
# Bypasses page cache (oflag=direct for writes, iflag=direct for reads)
dd if=/dev/zero of=test bs=1M count=1000 oflag=direct

O_DIRECT bypasses the page cache entirely for I/O operations. This is useful for applications that read large files sequentially and won’t benefit from caching. Database engines often use O_DIRECT for data files because they manage their own caching.

Another approach is to periodically run sync; echo 3 > /proc/sys/vm/drop_caches to force cache drops. This is crude but effective for batch jobs that process files in phases. Note that it needs root on the node (or a privileged container) and drops caches for every container on the node, so use it carefully.
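A gentler, per-file alternative that is not covered above is posix_fadvise with POSIX_FADV_DONTNEED, which hints to the kernel that a file’s cached pages are no longer needed. I’m naming it plainly as a different technique from O_DIRECT and drop_caches. The sketch below uses Python’s os.posix_fadvise, available on Linux; the function name is hypothetical, and the call is advisory, so dirty pages in particular may stay cached.

```python
import os
import tempfile


def read_and_drop_cache(path: str, chunk_size: int = 1024 * 1024) -> int:
    """Read a file sequentially, then advise the kernel to drop its cache."""
    total = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        while True:
            chunk = os.read(fd, chunk_size)
            if not chunk:
                break
            total += len(chunk)  # stand-in for real processing
        if hasattr(os, "posix_fadvise"):
            # offset=0, length=0 means "the whole file".
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)
    return total


# Small self-contained demonstration on a temporary file.
with tempfile.NamedTemporaryFile() as tmp:
    tmp.write(b"x" * (3 * 1024 * 1024))
    tmp.flush()
    print(read_and_drop_cache(tmp.name))  # 3145728
```

Unlike drop_caches, this only touches the files you name and needs no special privileges.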

4. Monitor Memory Components

Proactive monitoring prevents surprises. Track not just total memory but its breakdown:

# Python - track memory breakdown (cgroup v2 path)
CGROUP_MEMORY_STAT = '/sys/fs/cgroup/memory.stat'
GIB = 1024 ** 3

def get_cgroup_memory(path=CGROUP_MEMORY_STAT):
    """Parse memory.stat into a dict of {field: value}."""
    stats = {}
    with open(path) as f:
        for line in f:
            # Each line is "<key> <value>"; skip anything malformed.
            parts = line.split()
            if len(parts) == 2:
                stats[parts[0]] = int(parts[1])
    return stats

def get_memory_breakdown():
    """Return the main memory categories in GiB."""
    stats = get_cgroup_memory()
    return {
        'anon_gb': stats.get('anon', 0) / GIB,
        'file_gb': stats.get('file', 0) / GIB,
        'kernel_gb': stats.get('kernel', 0) / GIB,
        'sock_gb': stats.get('sock', 0) / GIB,
    }

Export these as Prometheus metrics so you can see trends over time. A gradually growing page cache might indicate a leak or accumulating state that will eventually cause problems.
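If you don’t want to pull in a client library, the Prometheus text exposition format is simple enough to emit by hand. The sketch below is one way to do it; the metric name app_cgroup_memory_bytes and the type label are names I made up, and the stats dict is assumed to hold raw memory.stat values in bytes.

```python
def to_prometheus_text(stats: dict,
                       metric: str = "app_cgroup_memory_bytes") -> str:
    """Render selected memory.stat fields in Prometheus text format."""
    fields = ("anon", "file", "kernel", "sock")
    lines = [
        f"# HELP {metric} cgroup v2 memory.stat value in bytes.",
        f"# TYPE {metric} gauge",
    ]
    for field in fields:
        # Doubled braces produce literal { } around the label set.
        lines.append(f'{metric}{{type="{field}"}} {stats.get(field, 0)}')
    return "\n".join(lines) + "\n"


# Hypothetical values matching the earlier memory.stat example.
sample_stats = {
    "anon": 1073741824,
    "file": 1610612736,
    "kernel": 104857600,
    "sock": 52428800,
}
print(to_prometheus_text(sample_stats), end="")
```

Serve that string from a /metrics endpoint and the anon versus file trend becomes a one-line query.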

5. Use memory.high for Soft Limits

cgroup v2 introduced memory.high, a soft limit that triggers aggressive reclaim without killing the container:

# In cgroup v2, memory.high triggers reclaim
# Kubernetes sets it only when the alpha MemoryQoS feature gate is enabled
# Or directly:
echo $((3 * 1024 * 1024 * 1024)) > /sys/fs/cgroup/memory.high

# memory.high = soft limit (triggers reclaim)
# memory.max = hard limit (triggers OOMKill)

When usage exceeds memory.high, the kernel aggressively reclaims memory, slowing the application but not killing it. This gives page cache and other reclaimable memory time to be freed before hitting the hard limit.

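In Kubernetes, the kubelet sets memory.high only when the alpha MemoryQoS feature gate is enabled on a cgroup v2 node. With it enabled, the value is derived from requests.memory and limits.memory, so a larger gap means more headroom before aggressive reclaim kicks in. Without it, memory.high stays unset and the gap has no effect on reclaim inside the container’s cgroup.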
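To see whether memory.high or memory.max is actually being hit, cgroup v2 exposes counters in memory.events, next to memory.stat. The sketch below parses that file’s text; the high, max, oom, and oom_kill field names come from the kernel’s cgroup v2 interface, while the function names, the labels returned, and the sample values are my own.

```python
def parse_memory_events(text: str) -> dict:
    """Parse memory.events text ("<key> <count>" per line) into a dict."""
    events = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            events[parts[0]] = int(parts[1])
    return events


def describe_pressure(events: dict) -> str:
    """Return the most severe condition recorded so far."""
    if events.get("oom_kill", 0) > 0:
        return "oom_kill"           # a process in the cgroup was killed
    if events.get("max", 0) > 0:
        return "hit_max"            # usage reached the hard limit
    if events.get("high", 0) > 0:
        return "throttled_at_high"  # reclaim was forced at memory.high
    return "no_pressure"


# Hypothetical sample of /sys/fs/cgroup/memory.events.
SAMPLE_EVENTS = """\
low 0
high 42
max 3
oom 0
oom_kill 0
"""

print(describe_pressure(parse_memory_events(SAMPLE_EVENTS)))  # hit_max
```

A rising high counter with max and oom_kill staying at zero is the pattern you want: the soft limit is absorbing the pressure.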

Monitoring

Set up alerts to catch memory issues before they cause OOMKills:

# Prometheus alert - approaching memory limit
- alert: ContainerMemoryNearLimit
  expr: |
    container_memory_working_set_bytes /
    (container_spec_memory_limit_bytes > 0) > 0.85
  for: 5m
  annotations:
    summary: "Container {{ $labels.container }} at 85%+ memory"

# Alert on page cache growth
- alert: HighPageCacheUsage
  expr: |
    (container_memory_cache / container_memory_working_set_bytes) > 0.5
  for: 15m
  annotations:
    summary: "Page cache > 50% of working set"

# Alert on OOMKills
- alert: ContainerOOMKilled
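  expr: |
    (kube_pod_container_status_restarts_total
      - kube_pod_container_status_restarts_total offset 5m >= 1)
    and ignoring(reason)
    min_over_time(kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}[5m]) == 1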
  annotations:
    summary: "Container {{ $labels.container }} was OOMKilled"

The 85% threshold gives you time to investigate before hitting 100%. The page cache alert catches containers where cached data is dominating memory usage—often a sign that limits need adjusting or the application’s I/O patterns need review.
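It helps to know what the first alert is measuring. As far as I know, cAdvisor derives container_memory_working_set_bytes as total usage minus inactive file cache, floored at zero, so easily reclaimable cache is already excluded. The sketch below reproduces that calculation and the 85% check; the function names and sample numbers are hypothetical.

```python
GIB = 1024 ** 3


def working_set_bytes(memory_current: int, inactive_file: int) -> int:
    """Approximate working set: usage minus inactive file cache."""
    return max(memory_current - inactive_file, 0)


def near_limit(working_set: int, limit_bytes: int,
               threshold: float = 0.85) -> bool:
    """Mirror the alert expression; containers without a limit never fire."""
    if limit_bytes <= 0:
        return False
    return working_set / limit_bytes > threshold


# Hypothetical readings: memory.current and inactive_file from memory.stat.
memory_current = 2841640960
inactive_file = 1 * GIB
limit = 2 * GIB

ws = working_set_bytes(memory_current, inactive_file)
print(f"working set: {ws / GIB:.2f} GiB")
print(f"ratio: {ws / limit:.2f}, alert fires: {near_limit(ws, limit)}")
```

Because inactive cache is subtracted, a container can sit below the 85% alert while memory.current is already at the limit, which is one more reason to track the breakdown as well.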

Checklist

## Container Memory Configuration

### Sizing
- [ ] Account for page cache (1.5-2x RSS)
- [ ] Add buffer for kernel memory (10-20%)
- [ ] Test under realistic I/O load

### JVM Specific
- [ ] Use -XX:+UseContainerSupport
- [ ] Set MaxRAMPercentage to 75% max
- [ ] Enable NativeMemoryTracking for debugging

### Monitoring
- [ ] Alert on memory > 85% of limit
- [ ] Track anon vs file memory ratio
- [ ] Monitor OOMKill events

### Debugging
- [ ] Check /sys/fs/cgroup/memory.stat on OOM
- [ ] Verify what's in page cache
- [ ] Profile native memory usage

Conclusion

Container memory is much more than what your application allocates. The kernel’s memory accounting includes caches, buffers, and overhead that most monitoring tools don’t prominently display. Understanding this gap between “application memory” and “cgroup memory” is essential for reliable container operation.

Key takeaways:

- Page cache and kernel buffers count against your limit; they’re not “free” memory
- Set limits at 1.5-2x expected RSS to account for overhead, especially for I/O workloads
- Keep MaxRAMPercentage at 75% or lower for JVM applications; the other 25% goes to non-heap memory
- Monitor all memory components; RSS alone doesn’t tell the full story
- Check memory.stat when debugging OOMKills; it reveals where memory actually went

Before setting memory limits, run your application under realistic load and check /sys/fs/cgroup/memory.stat. The numbers there tell you what limits you actually need.