Linux Page Cache Thrashing in Containers: When Free Memory Isn't Free
We had plenty of memory on paper and still thrashed the page cache. The usual question: “The container has a 2GB memory limit and the application only uses 500MB. Why is it slow?” Because the other 1.5GB fills up with page cache, and your file I/O ends up evicting your application’s code from memory.
Tested on: Linux 5.15, cgroups v2, Kubernetes 1.28, Go application with file processing
Understanding Page Cache in Containers
What Page Cache Is
Linux Memory Model:
┌─────────────────────────────────────────────────────────────┐
│ Container Memory Limit │
├─────────────────────────────────────────────────────────────┤
│ Application Memory (anonymous pages) │
│ - Heap allocations │
│ - Stack │
│ - Anonymous mmap'd memory │
├─────────────────────────────────────────────────────────────┤
│ Page Cache (file-backed pages) │
│ - Cached file reads │
│ - Memory-mapped files │
│ - Executable code pages (from binaries) │
└─────────────────────────────────────────────────────────────┘
Both count against the cgroup memory limit!
The Problem
Scenario: Container with 2GB limit processing files
1. Application starts: 500MB anonymous memory
2. Read 100MB file: 500MB app + 100MB page cache = 600MB
3. Read 1GB of files: 500MB app + 1000MB cache = 1500MB
4. Read more files: Cache fills to 1500MB
Now at limit: 500MB app + 1500MB cache = 2000MB
5. Read another file:
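- Kernel needs to evict something
- Application code pages are file-backed, so they share LRU lists with cached file data; rarely executed ones look "inactive"
- Those code pages get evicted to make room for new file cache!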
6. Application runs code from evicted page:
- Major page fault
- Disk read to reload code
- Performance tanks
File I/O pushed out your own executable!
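The arithmetic in the scenario can be replayed with a toy accounting model. This is only a sketch: `firstEvictingRead` is a hypothetical helper, all sizes are in MB, and it assumes anonymous memory stays constant and the kernel reclaims nothing before the limit is reached, which real kernels do not guarantee.

```go
package main

import "fmt"

// firstEvictingRead returns the index of the first read that cannot be
// cached without evicting something, or -1 if every read fits.
// Toy model: anonymous memory is constant and nothing is reclaimed early.
func firstEvictingRead(limitMB, anonMB int, readsMB []int) int {
	cacheMB := 0
	for i, size := range readsMB {
		if anonMB+cacheMB+size > limitMB {
			return i
		}
		cacheMB += size
	}
	return -1
}

func main() {
	// Numbers from the scenario: 2000MB limit, 500MB of application memory.
	reads := []int{100, 900, 500, 100}
	fmt.Println(firstEvictingRead(2000, 500, reads)) // prints 3
}
```

The fourth read (index 3) is the first one that forces eviction, which corresponds to step 5 of the scenario.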
Demonstration
Test Program
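// page_cache_demo.go
package main

import (
	"fmt"
	"io"
	"log"
	"os"
	"runtime"
	"strings"
	"syscall"
)

func main() {
	// Baseline before any file I/O
	printMemStats("Start")
	measureCodePageFaults()

	// Simulate file processing
	for i := 0; i < 100; i++ {
		path := fmt.Sprintf("/data/file_%d.bin", i)
		if err := processLargeFile(path); err != nil {
			log.Printf("skipping %s: %v", path, err)
			continue
		}
		// Report after every 10th file
		if (i+1)%10 == 0 {
			printMemStats(fmt.Sprintf("After %d files", i+1))
			measureCodePageFaults()
		}
	}
}

func processLargeFile(path string) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = io.Copy(io.Discard, f) // Read entire file through the page cache
	return err
}

func printMemStats(label string) {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	// Read cgroup v2 memory stats (paths as seen from inside the container)
	fmt.Printf("[%s] Heap: %dMB, Cgroup: %s/%s\n",
		label, m.HeapAlloc/1024/1024,
		readCgroupValue("/sys/fs/cgroup/memory.current"),
		readCgroupValue("/sys/fs/cgroup/memory.max"))
}

// readCgroupValue returns the file contents without the trailing newline,
// or "?" if the file cannot be read (for example on cgroups v1).
func readCgroupValue(path string) string {
	b, err := os.ReadFile(path)
	if err != nil {
		return "?"
	}
	return strings.TrimSpace(string(b))
}

// measureCodePageFaults prints the process-wide cumulative fault counters.
// read() does not fault on file data, so major faults here mostly come from
// code or other mapped pages being reloaded from disk.
func measureCodePageFaults() {
	var rusage syscall.Rusage
	if err := syscall.Getrusage(syscall.RUSAGE_SELF, &rusage); err != nil {
		log.Printf("getrusage: %v", err)
		return
	}
	fmt.Printf(" Major faults: %d, Minor faults: %d\n",
		rusage.Majflt, rusage.Minflt)
}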
Results
# Run in container with 512MB limit
docker run --memory=512m -v $(pwd)/data:/data page-cache-demo
[Start] Heap: 2MB, Cgroup: 15728640/536870912
Major faults: 0, Minor faults: 1542
[After 10 files] Heap: 2MB, Cgroup: 524288000/536870912 # Near limit!
Major faults: 0, Minor faults: 2341
[After 20 files] Heap: 2MB, Cgroup: 536870912/536870912 # At limit!
Major faults: 145, Minor faults: 8923 # Thrashing starts
[After 30 files] Heap: 2MB, Cgroup: 536870912/536870912
Major faults: 892, Minor faults: 15234 # Severe thrashing
# App memory is only 2MB but container is at 512MB limit
# Major faults = code pages being reloaded from disk
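The fault counters in this output are cumulative since process start, so the difference between two readings is what shows when thrashing begins. A minimal sketch of measuring that difference around one phase of work follows; `majorFaultsDuring` is a hypothetical helper name, and the workload in `main` is only a placeholder for a batch of file reads.

```go
package main

import (
	"fmt"
	"syscall"
)

// majorFaultsDuring runs fn and returns how many major page faults the
// process took while it ran. Getrusage counters are cumulative, so the
// difference between two readings isolates one phase of work.
func majorFaultsDuring(fn func()) (int64, error) {
	var before, after syscall.Rusage
	if err := syscall.Getrusage(syscall.RUSAGE_SELF, &before); err != nil {
		return 0, err
	}
	fn()
	if err := syscall.Getrusage(syscall.RUSAGE_SELF, &after); err != nil {
		return 0, err
	}
	return int64(after.Majflt) - int64(before.Majflt), nil
}

func main() {
	faults, err := majorFaultsDuring(func() {
		// Placeholder workload; substitute a batch of file reads.
		sum := 0
		for i := 0; i < 1_000_000; i++ {
			sum += i
		}
		_ = sum
	})
	if err != nil {
		fmt.Println("getrusage failed:", err)
		return
	}
	fmt.Println("major faults during workload:", faults)
}
```

Because the counter is process-wide, faults taken by other goroutines during the measured window are included in the result.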
Solutions
1. Direct I/O (Bypass Page Cache)
// Use O_DIRECT for large file reads (Linux only; some filesystems reject it with EINVAL)
import (
	"unsafe"

	"golang.org/x/sys/unix"
)

func processWithDirectIO(path string) error {
	fd, err := unix.Open(path, unix.O_RDONLY|unix.O_DIRECT, 0)
	if err != nil {
		return err
	}
	defer unix.Close(fd)

	// Buffer address and read size must be block-aligned for O_DIRECT
	const align = 4096
	bufSize := align * 256 // 1MB, a multiple of the alignment
	buf := make([]byte, bufSize+align)
	offset := (align - int(uintptr(unsafe.Pointer(&buf[0]))%align)) % align
	alignedBuf := buf[offset : offset+bufSize]

	for {
		n, err := unix.Read(fd, alignedBuf)
		if err != nil {
			return err // Report read errors instead of silently stopping
		}
		if n == 0 {
			return nil // EOF
		}
		// Process alignedBuf[:n]...
	}
}
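The alignment step is the part of O_DIRECT that is easiest to get wrong, so it helps to isolate and check it. The sketch below assumes a 4096-byte alignment requirement (the real value depends on the device and filesystem) and uses hypothetical helper names `alignedBuffer` and `isAligned`. It also relies on the current Go runtime not moving heap allocations.

```go
package main

import (
	"fmt"
	"unsafe"
)

// alignedBuffer returns a slice of exactly size bytes whose first byte sits
// on an align-byte boundary. It over-allocates by align bytes and slices
// into the allocation at the first aligned address.
func alignedBuffer(size, align int) []byte {
	raw := make([]byte, size+align)
	addr := uintptr(unsafe.Pointer(&raw[0]))
	offset := int((uintptr(align) - addr%uintptr(align)) % uintptr(align))
	return raw[offset : offset+size : offset+size]
}

// isAligned reports whether buf starts on an align-byte boundary.
func isAligned(buf []byte, align int) bool {
	return uintptr(unsafe.Pointer(&buf[0]))%uintptr(align) == 0
}

func main() {
	const align = 4096 // assumed requirement; check your device
	buf := alignedBuffer(1<<20, align)
	fmt.Println(len(buf), isAligned(buf, align))
}
```

The final modulo keeps the offset at zero when the allocation already happens to be aligned, so no space is wasted in that case.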
2. Advise Kernel to Drop Cache
import (
	"io"
	"os"

	"golang.org/x/sys/unix"
)

func processAndDropCache(path string) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()

	// Process file
	if _, err := io.Copy(io.Discard, f); err != nil {
		return err
	}

	// Advise kernel we won't need this again.
	// A length of 0 means "through the end of the file".
	// This is a hint, and only clean (already written back) pages are dropped.
	return unix.Fadvise(int(f.Fd()), 0, 0, unix.FADV_DONTNEED)
}
3. Streaming with Small Buffers
// Don't read entire file at once.
// processChunk and shouldDropCache are application-specific helpers defined elsewhere.
func processStreaming(path string, chunkSize int) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()

	fd := int(f.Fd())
	buf := make([]byte, chunkSize) // e.g. 64KB chunks
	var pos int64                  // bytes consumed so far
	for {
		n, err := f.Read(buf)
		if n > 0 {
			processChunk(buf[:n])
			pos += int64(n)
			// Periodically advise kernel to drop what we've already consumed
			if shouldDropCache() {
				unix.Fadvise(fd, 0, pos, unix.FADV_DONTNEED)
			}
		}
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}
	}
}
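One simple policy for the "periodically" part is to drop cache every fixed number of bytes. The sketch below shows that policy with the drop action injected as a callback, so it runs without a real file. `streamWithPeriodicDrop` and `fixedReader` are hypothetical names; in a real program the callback would call `unix.Fadvise(fd, 0, upTo, unix.FADV_DONTNEED)`.

```go
package main

import (
	"fmt"
	"io"
)

// streamWithPeriodicDrop reads r in chunkSize pieces (chunkSize must be > 0),
// hands each chunk to process, and calls drop with the number of bytes
// consumed so far once at least dropEvery bytes have been read since the
// previous drop.
func streamWithPeriodicDrop(r io.Reader, chunkSize int, dropEvery int64,
	process func([]byte), drop func(upTo int64)) (int64, error) {
	buf := make([]byte, chunkSize)
	var total, sinceDrop int64
	for {
		n, err := r.Read(buf)
		if n > 0 {
			process(buf[:n])
			total += int64(n)
			sinceDrop += int64(n)
			if sinceDrop >= dropEvery {
				drop(total)
				sinceDrop = 0
			}
		}
		if err == io.EOF {
			return total, nil
		}
		if err != nil {
			return total, err
		}
	}
}

// fixedReader yields a fixed number of zero bytes; it stands in for a file.
type fixedReader struct{ remaining int }

func (f *fixedReader) Read(p []byte) (int, error) {
	if f.remaining == 0 {
		return 0, io.EOF
	}
	n := len(p)
	if n > f.remaining {
		n = f.remaining
	}
	for i := range p[:n] {
		p[i] = 0
	}
	f.remaining -= n
	return n, nil
}

func main() {
	drops := 0
	// 1MiB of input, 64KiB chunks, drop every 256KiB.
	total, err := streamWithPeriodicDrop(&fixedReader{remaining: 1 << 20}, 64<<10, 256<<10,
		func([]byte) {},
		func(upTo int64) { drops++ })
	fmt.Println(total, drops, err) // prints 1048576 4 <nil>
}
```

Dropping by byte count keeps the cached footprint of a single stream roughly bounded by `dropEvery`, at the cost of one extra syscall per interval.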
4. Kubernetes Memory Configuration
# Give headroom for page cache
apiVersion: v1
kind: Pod
spec:
  containers:
  - name: file-processor
    resources:
      requests:
        memory: "512Mi"
      limits:
        memory: "1Gi" # 2x request for page cache headroom
    env:
    # Tell app its real memory budget (read by the app itself, not by Kubernetes)
    - name: APP_MEMORY_LIMIT
      value: "512Mi" # App should only use half
---
# Or use memory.high for soft limiting (cgroups v2)
# Kubernetes has no pod field or annotation for memory.high.
# The kubelet sets it from the container's requests and limits when the
# MemoryQoS feature gate (alpha in Kubernetes 1.28) is enabled on the node:
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  MemoryQoS: true
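The `APP_MEMORY_LIMIT` variable from the first manifest only helps if the application actually reads it. For a Go service, one option is to feed it to the runtime's soft memory limit. The sketch below assumes Go 1.19 or newer (for `debug.SetMemoryLimit`) and uses a hypothetical `parseBinaryQuantity` helper that handles only plain integers and the `Ki`/`Mi`/`Gi` suffixes, not the full Kubernetes quantity grammar.

```go
package main

import (
	"fmt"
	"os"
	"runtime/debug"
	"strconv"
	"strings"
)

// parseBinaryQuantity converts strings such as "512Mi" or "1Gi" to bytes.
// Only plain integers and the Ki/Mi/Gi suffixes are supported.
func parseBinaryQuantity(s string) (int64, error) {
	units := []struct {
		suffix string
		factor int64
	}{
		{"Ki", 1 << 10},
		{"Mi", 1 << 20},
		{"Gi", 1 << 30},
	}
	s = strings.TrimSpace(s)
	for _, u := range units {
		if strings.HasSuffix(s, u.suffix) {
			n, err := strconv.ParseInt(strings.TrimSuffix(s, u.suffix), 10, 64)
			if err != nil {
				return 0, err
			}
			return n * u.factor, nil
		}
	}
	return strconv.ParseInt(s, 10, 64)
}

func main() {
	raw := os.Getenv("APP_MEMORY_LIMIT")
	if raw == "" {
		fmt.Println("APP_MEMORY_LIMIT not set; leaving the Go memory limit unchanged")
		return
	}
	budget, err := parseBinaryQuantity(raw)
	if err != nil {
		fmt.Fprintln(os.Stderr, "invalid APP_MEMORY_LIMIT:", err)
		os.Exit(1)
	}
	// Soft limit for memory managed by the Go runtime; page cache is not included.
	debug.SetMemoryLimit(budget)
	fmt.Printf("Go memory limit set to %d bytes\n", budget)
}
```

This caps only the anonymous side of the budget. The remaining headroom up to the container limit is what the page cache gets to use.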
5. Memory.high for Soft Limits
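# cgroups v2: memory.high triggers reclaim before hitting the hard limit
# Check if cgroups v2
grep cgroup2 /proc/mounts
# Set soft limit (paths as seen from inside the container's cgroup;
# in an unprivileged container these files are usually read-only,
# so the write has to happen from the host)
echo 512M > /sys/fs/cgroup/memory.high
echo 1G > /sys/fs/cgroup/memory.max
# Behavior:
# - Below memory.high: Normal operation
# - Above memory.high: Processes are throttled and the kernel reclaims aggressively
# - At memory.max: OOM killer if reclaim cannot free enough
# Reclaim starts earlier and more gradually, so cold file cache tends to go first.
# Code pages are still file-backed and can be evicted if they look cold.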
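To see whether a cgroup is actually hitting its soft limit, cgroups v2 keeps a `high` counter in `memory.events`. A minimal sketch of reading it follows; `eventCount` is a hypothetical helper, and the path assumes the process runs inside the container's own cgroup namespace.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// eventCount returns the counter named key from the contents of a cgroup v2
// memory.events file, and false if the key is missing or malformed.
func eventCount(events, key string) (uint64, bool) {
	for _, line := range strings.Split(events, "\n") {
		fields := strings.Fields(line)
		if len(fields) == 2 && fields[0] == key {
			n, err := strconv.ParseUint(fields[1], 10, 64)
			return n, err == nil
		}
	}
	return 0, false
}

func main() {
	data, err := os.ReadFile("/sys/fs/cgroup/memory.events")
	if err != nil {
		fmt.Println("memory.events not readable (not cgroups v2?):", err)
		return
	}
	if n, ok := eventCount(string(data), "high"); ok {
		fmt.Println("times throttled at memory.high:", n)
	}
}
```

A `high` count that keeps climbing while `oom_kill` stays at zero suggests the soft limit is absorbing the pressure rather than the hard limit.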
Monitoring
Detecting Page Cache Thrashing
# Inside container
cat /sys/fs/cgroup/memory.stat
# Key metrics:
# file: bytes of page cache
# anon: bytes of anonymous memory
# pgmajfault: major page faults (disk reads for evicted pages)
# pgscan: pages scanned for reclaim
# pgsteal: pages reclaimed
# High pgmajfault + high pgscan = thrashing
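The counters in `memory.stat` are cumulative, so the rule of thumb above is about how fast they grow between two samples. The sketch below parses the file and compares two snapshots. The function names and the thresholds in `main` are illustrative assumptions, not tuned values.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
	"time"
)

// parseMemoryStat turns the "key value" lines of memory.stat into a map.
func parseMemoryStat(content string) map[string]uint64 {
	stats := make(map[string]uint64)
	sc := bufio.NewScanner(strings.NewReader(content))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) != 2 {
			continue
		}
		if v, err := strconv.ParseUint(fields[1], 10, 64); err == nil {
			stats[fields[0]] = v
		}
	}
	return stats
}

// delta returns how much a counter grew between samples (0 if it went backwards).
func delta(prev, cur map[string]uint64, key string) uint64 {
	if cur[key] < prev[key] {
		return 0
	}
	return cur[key] - prev[key]
}

// looksLikeThrashing is true when both major faults and reclaim scanning
// grew by at least the given amounts between the two samples.
func looksLikeThrashing(prev, cur map[string]uint64, minFaults, minScanned uint64) bool {
	return delta(prev, cur, "pgmajfault") >= minFaults &&
		delta(prev, cur, "pgscan") >= minScanned
}

func main() {
	const path = "/sys/fs/cgroup/memory.stat"
	first, err := os.ReadFile(path)
	if err != nil {
		fmt.Println("memory.stat not readable:", err)
		return
	}
	time.Sleep(time.Second)
	second, err := os.ReadFile(path)
	if err != nil {
		fmt.Println("memory.stat not readable:", err)
		return
	}
	prev, cur := parseMemoryStat(string(first)), parseMemoryStat(string(second))
	fmt.Printf("file=%d anon=%d thrashing=%v\n",
		cur["file"], cur["anon"], looksLikeThrashing(prev, cur, 100, 10000))
}
```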
Prometheus Metrics
# cadvisor exposes these
container_memory_rss # Anonymous memory
container_memory_cache # Page cache
container_memory_mapped_file # mmap'd files
container_memory_working_set_bytes # "Real" usage
# Alert on cache pressure
- alert: ContainerPageCacheThrashing
  expr: |
    rate(container_memory_failures_total{failure_type="pgmajfault", scope="container"}[5m]) > 100
  for: 5m
  annotations:
    summary: "Container {{ $labels.container }} thrashing"
Understanding Memory Metrics
# What you usually see
container_memory_usage_bytes # Includes page cache
# What matters for eviction and OOM risk
container_memory_working_set_bytes # Usage minus inactive file cache
# Page cache specifically
container_memory_cache
# Your actual application memory (anonymous pages; mapped files are counted separately)
container_memory_rss
# Thrashing indicator
rate(container_memory_failures_total{failure_type="pgmajfault", scope="container"}[5m])
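The working set figure is derived rather than measured: as far as I understand cAdvisor's accounting, it is total usage minus the inactive file cache, floored at zero, which comes out to roughly anonymous memory plus active cache. A small sketch of that arithmetic, with a hypothetical `workingSet` helper and made-up input numbers:

```go
package main

import "fmt"

// workingSet subtracts inactive file cache from total usage, never going
// below zero. Both arguments are in bytes.
func workingSet(usage, inactiveFile uint64) uint64 {
	if inactiveFile >= usage {
		return 0
	}
	return usage - inactiveFile
}

func main() {
	const MiB = 1 << 20
	// Illustrative numbers, not measurements: 512MiB used, 400MiB of it cold cache.
	fmt.Println(workingSet(512*MiB, 400*MiB) / MiB) // prints 112
}
```

This is why a container can sit at its limit in `container_memory_usage_bytes` while the working set stays low: most of the usage is cache the kernel considers cold.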
Common Patterns
Log Rotation Issue
Problem: Logging to files fills page cache
┌─────────────────────────────────────────────────────────────┐
│ App writes logs → Page cache grows → Evicts app code │
│ │
│ Solution: Log to stdout, let container runtime handle it │
│ Or use O_DIRECT for log files │
└─────────────────────────────────────────────────────────────┘
Temp File Processing
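// Bad: Creates huge temp file, fills cache
func processBad(data []byte) {
	tmp, _ := os.CreateTemp("", "process-*")
	tmp.Write(data)
	tmp.Seek(0, 0)
	// Process from file...
	// Page cache now holds entire temp file
}

// Good: Use memory directly or drop cache
func processGood(data []byte) error {
	tmp, err := os.CreateTemp("", "process-*")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name())
	defer tmp.Close()

	if _, err := tmp.Write(data); err != nil {
		return err
	}
	// Flush first: FADV_DONTNEED only drops pages that are already clean
	if err := tmp.Sync(); err != nil {
		return err
	}
	// Or just process in memory if it fits
	return unix.Fadvise(int(tmp.Fd()), 0, int64(len(data)), unix.FADV_DONTNEED)
}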
Database in Container
# PostgreSQL page cache management
apiVersion: v1
kind: Pod
spec:
  containers:
  - name: postgres
    resources:
      limits:
        memory: "4Gi" # Total limit
    # PostgreSQL shared_buffers (managed cache), passed as a server flag;
    # the official postgres image has no environment variable for it
    args: ["-c", "shared_buffers=1GB"] # Only 25% of limit
    # Leaves 3GB for OS page cache + connections
Checklist
## Page Cache Management
### Detection
- [ ] Monitor container_memory_cache
- [ ] Alert on pgmajfault rate
- [ ] Check memory.stat for cache vs anon ratio
### Prevention
- [ ] Use FADV_DONTNEED for large file reads
- [ ] Consider O_DIRECT for sequential I/O
- [ ] Stream files instead of reading entirely
- [ ] Give 2x memory headroom for file-heavy workloads
### Configuration
- [ ] Set memory.high (cgroups v2) for soft limiting
- [ ] Configure app memory budget separately from limit
- [ ] Log to stdout instead of files
### Architecture
- [ ] Offload file processing to dedicated pods
- [ ] Use external object storage instead of local files
- [ ] Consider tmpfs for temp files (charged to memory anyway)
Conclusion
Page cache is not “free” memory in containers:
- File I/O fills page cache charged against your limit
- App code pages can be evicted causing major faults
- Use FADV_DONTNEED to release cache after reading
- Give 2x memory headroom for file-heavy workloads
Monitor pgmajfault - when it spikes, you’re thrashing.