Linux Page Cache Thrashing in Containers: When Free Memory Isn't Free

We had plenty of memory on paper and still thrashed the page cache. “The container has a 2GB memory limit and only uses 500MB. Why is it slow?” Because the other 1.5GB fills up with page cache, which is charged to the same limit, and your file I/O ends up evicting your application’s code from memory.

Tested on: Linux 5.15, cgroups v2, Kubernetes 1.28, Go application with file processing

Understanding Page Cache in Containers

What Page Cache Is

Linux Memory Model:

┌─────────────────────────────────────────────────────────────┐
│                    Container Memory Limit                    │
├─────────────────────────────────────────────────────────────┤
│  Application Memory (anonymous pages)                        │
│  - Heap allocations                                          │
│  - Stack                                                     │
│  - Anonymous mmap'd memory                                  │
├─────────────────────────────────────────────────────────────┤
│  Page Cache (file-backed pages)                              │
│  - Cached file reads                                         │
│  - Memory-mapped files                                       │
│  - Executable code pages (from binaries)                     │
└─────────────────────────────────────────────────────────────┘

Both count against the cgroup memory limit!

The Problem

Scenario: Container with 2GB limit processing files

1. Application starts: 500MB anonymous memory
2. Read 100MB file: 500MB app + 100MB page cache = 600MB
3. Read 1GB of files: 500MB app + 1000MB cache = 1500MB
4. Read more files: Cache fills to 1500MB

Now at limit: 500MB app + 1500MB cache = 2000MB

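5. Read another file:
   - Kernel must reclaim pages to stay under the limit
   - Executable code pages are file-backed and clean, so they are cheap to drop
   - Code pages that have not run recently look "inactive" and get evicted to make room for file cache!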

6. Application runs code from evicted page:
   - Major page fault
   - Disk read to reload code
   - Performance tanks

File I/O pushed out your own executable!
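
The arithmetic in the scenario above can be written down as a small helper. This is a minimal sketch: `cacheHeadroom` and `filesUntilReclaim` are illustrative names, and it assumes every byte read stays cached until the limit is reached, which is a simplification of real reclaim behavior.

```go
package main

import "fmt"

const mib = 1 << 20

// cacheHeadroom returns how many bytes of page cache fit under the cgroup
// limit before the kernel has to start reclaiming.
func cacheHeadroom(limit, anon uint64) uint64 {
    if anon >= limit {
        return 0
    }
    return limit - anon
}

// filesUntilReclaim returns how many files of fileSize bytes can be read
// (and fully cached) before the cgroup reaches its limit.
func filesUntilReclaim(limit, anon, fileSize uint64) uint64 {
    if fileSize == 0 {
        return 0
    }
    return cacheHeadroom(limit, anon) / fileSize
}

func main() {
    limit := uint64(2000 * mib) // the scenario's limit, rounded as in the text
    anon := uint64(500 * mib)   // application memory
    fmt.Printf("cache headroom: %d MiB\n", cacheHeadroom(limit, anon)/mib)
    fmt.Printf("100 MiB files before reclaim: %d\n",
        filesUntilReclaim(limit, anon, 100*mib))
}
```

Once that headroom is used up, every further read forces the kernel to evict something, and nothing guarantees it picks the file data you just finished with.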

Demonstration

Test Program

// page_cache_demo.go
package main

import (
    "fmt"
    "io"
    "log"
    "os"
    "runtime"
    "strings"
    "syscall"
)

func main() {
    printMemStats("Start")
    measureCodePageFaults()

    // Simulate file processing
    for i := 0; i < 100; i++ {
        if err := processLargeFile(fmt.Sprintf("/data/file_%d.bin", i)); err != nil {
            log.Fatal(err)
        }
        // Report after every 10th file
        if done := i + 1; done%10 == 0 {
            printMemStats(fmt.Sprintf("After %d files", done))
            measureCodePageFaults()
        }
    }
}

func processLargeFile(path string) error {
    f, err := os.Open(path)
    if err != nil {
        return err
    }
    defer f.Close()
    _, err = io.Copy(io.Discard, f) // Read entire file through the page cache
    return err
}

func printMemStats(label string) {
    var m runtime.MemStats
    runtime.ReadMemStats(&m)

    // Read cgroup v2 memory stats
    fmt.Printf("[%s] Heap: %dMB, Cgroup: %s/%s\n",
        label, m.HeapAlloc/1024/1024,
        readCgroupValue("/sys/fs/cgroup/memory.current"),
        readCgroupValue("/sys/fs/cgroup/memory.max"))
}

// readCgroupValue returns the trimmed file content, or "unknown" if unreadable
func readCgroupValue(path string) string {
    b, err := os.ReadFile(path)
    if err != nil {
        return "unknown"
    }
    return strings.TrimSpace(string(b))
}

func measureCodePageFaults() {
    var rusage syscall.Rusage
    if err := syscall.Getrusage(syscall.RUSAGE_SELF, &rusage); err != nil {
        log.Printf("getrusage: %v", err)
        return
    }
    fmt.Printf("  Major faults: %d, Minor faults: %d\n",
        rusage.Majflt, rusage.Minflt)
}

Results

# Run in container with 512MB limit
docker run --memory=512m -v $(pwd)/data:/data page-cache-demo

[Start] Heap: 2MB, Cgroup: 15728640/536870912
  Major faults: 0, Minor faults: 1542
[After 10 files] Heap: 2MB, Cgroup: 524288000/536870912  # Near limit!
  Major faults: 0, Minor faults: 2341
[After 20 files] Heap: 2MB, Cgroup: 536870912/536870912  # At limit!
  Major faults: 145, Minor faults: 8923                    # Thrashing starts
[After 30 files] Heap: 2MB, Cgroup: 536870912/536870912
  Major faults: 892, Minor faults: 15234                   # Severe thrashing

# Go heap is only 2MB, but the container sits at its 512MB limit
# Major faults here are mostly evicted code pages being reloaded from disk
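
The counters printed above are cumulative, so the number that matters is the difference between two samples. A minimal sketch of that bookkeeping follows; `faultSample`, `readFaults`, and `delta` are hypothetical names, and it assumes a Unix-like system where `syscall.Getrusage` is available.

```go
package main

import (
    "fmt"
    "syscall"
)

// faultSample holds the cumulative page fault counters of this process.
type faultSample struct {
    Major, Minor int64
}

// readFaults reads the counters via getrusage(2).
func readFaults() (faultSample, error) {
    var ru syscall.Rusage
    if err := syscall.Getrusage(syscall.RUSAGE_SELF, &ru); err != nil {
        return faultSample{}, err
    }
    return faultSample{Major: int64(ru.Majflt), Minor: int64(ru.Minflt)}, nil
}

// delta returns the faults that occurred between two samples.
func delta(prev, cur faultSample) faultSample {
    return faultSample{Major: cur.Major - prev.Major, Minor: cur.Minor - prev.Minor}
}

func main() {
    before, err := readFaults()
    if err != nil {
        fmt.Println("getrusage failed:", err)
        return
    }

    // ... process a batch of files here ...

    after, err := readFaults()
    if err != nil {
        fmt.Println("getrusage failed:", err)
        return
    }
    d := delta(before, after)
    fmt.Printf("batch caused %d major and %d minor faults\n", d.Major, d.Minor)
}
```

Applied to the run above, the step from the start to 20 files accounts for 145 major faults, and the step from 20 to 30 files for another 747.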

Solutions

1. Direct I/O (Bypass Page Cache)

// Use O_DIRECT for large file reads
import (
    "unsafe"

    "golang.org/x/sys/unix"
)

func processWithDirectIO(path string) error {
    fd, err := unix.Open(path, unix.O_RDONLY|unix.O_DIRECT, 0)
    if err != nil {
        return err // some filesystems reject O_DIRECT
    }
    defer unix.Close(fd)

    // O_DIRECT needs an aligned buffer address and length
    // (typically the logical block size; 4096 is a common safe choice)
    const align = 4096
    bufSize := align * 256 // 1MB, a multiple of the alignment
    buf := make([]byte, bufSize+align)
    offset := (align - int(uintptr(unsafe.Pointer(&buf[0]))%align)) % align
    alignedBuf := buf[offset : offset+bufSize]

    for {
        n, err := unix.Read(fd, alignedBuf)
        if err != nil {
            return err
        }
        if n == 0 {
            return nil // EOF
        }
        // Process alignedBuf[:n]...
    }
}

2. Advise Kernel to Drop Cache

import (
    "io"
    "os"

    "golang.org/x/sys/unix"
)

func processAndDropCache(path string) error {
    f, err := os.Open(path)
    if err != nil {
        return err
    }
    defer f.Close()

    // Process file
    if _, err := io.Copy(io.Discard, f); err != nil {
        return err
    }

    // Advise kernel we won't need these pages again.
    // Offset 0 with length 0 covers the whole file; only clean pages are dropped.
    return unix.Fadvise(int(f.Fd()), 0, 0, unix.FADV_DONTNEED)
}
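
When you drop only part of a file, the range matters: the posix_fadvise(2) man page notes that requests to discard partial pages are ignored. A minimal sketch of trimming a byte range to whole pages before calling `unix.Fadvise` is shown here; `wholePages` is a hypothetical helper, not part of any library.

```go
package main

import (
    "fmt"
    "os"
)

// wholePages shrinks [offset, offset+length) to the largest page-aligned
// range it fully contains. The returned length is 0 if no whole page fits.
func wholePages(offset, length, pageSize int64) (alignedOff, alignedLen int64) {
    if length <= 0 || pageSize <= 0 {
        return offset, 0
    }
    start := (offset + pageSize - 1) / pageSize * pageSize // round up
    end := (offset + length) / pageSize * pageSize         // round down
    if end <= start {
        return start, 0
    }
    return start, end - start
}

func main() {
    page := int64(os.Getpagesize())
    // A range that starts 100 bytes into the first page and spans 3 pages
    // of data only fully covers the second and third page.
    off, n := wholePages(100, 3*page, page)
    fmt.Println(off == page, n == 2*page)
}
```

The resulting offset and length can be passed straight to `unix.Fadvise` with `unix.FADV_DONTNEED`.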

3. Streaming with Small Buffers

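// Don't read entire file at once.
// Small buffers bound heap use; the page cache still fills unless you drop it.
// processChunk and shouldDropCache are application-specific hooks (not shown).
func processStreaming(path string, chunkSize int) error {
    f, err := os.Open(path)
    if err != nil {
        return err
    }
    defer f.Close()

    fd := int(f.Fd())
    buf := make([]byte, chunkSize) // e.g. 64KB chunks
    var pos int64                  // bytes consumed so far
    for {
        n, err := f.Read(buf)
        if n > 0 {
            processChunk(buf[:n])
            pos += int64(n)
        }
        if err == io.EOF {
            return nil
        }
        if err != nil {
            return err
        }

        // Periodically advise kernel that everything read so far can go
        if shouldDropCache() {
            if err := unix.Fadvise(fd, 0, pos, unix.FADV_DONTNEED); err != nil {
                return err
            }
        }
    }
}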
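
The streaming example leaves `shouldDropCache` undefined. One simple policy is to drop after a fixed number of bytes, so the fadvise call is not issued on every chunk. The sketch below is only one possible shape for it: `dropPolicy` is a hypothetical type, and the 32 MiB interval is an arbitrary value to tune against your own limit.

```go
package main

import "fmt"

// dropPolicy decides when enough bytes have been read since the last drop
// to justify another FADV_DONTNEED call.
type dropPolicy struct {
    interval int64 // bytes to read between drops
    pending  int64 // bytes read since the last drop
}

// Add records n freshly read bytes and reports whether the caller should
// drop the cache now. The pending counter resets whenever it returns true.
func (p *dropPolicy) Add(n int) bool {
    p.pending += int64(n)
    if p.pending < p.interval {
        return false
    }
    p.pending = 0
    return true
}

func main() {
    p := &dropPolicy{interval: 32 << 20} // every 32 MiB (arbitrary)
    drops := 0
    for i := 0; i < 2048; i++ { // 2048 chunks of 64 KiB = 128 MiB
        if p.Add(64 << 10) {
            drops++
        }
    }
    fmt.Println("drops:", drops)
}
```

In the read loop this would replace the bare `shouldDropCache()` call with `p.Add(n)`.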

4. Kubernetes Memory Configuration

# Give headroom for page cache
apiVersion: v1
kind: Pod
spec:
  containers:
    - name: file-processor
      resources:
        requests:
          memory: "512Mi"
        limits:
          memory: "1Gi"  # 2x request for page cache headroom
      env:
        # Tell app its real memory budget
        - name: APP_MEMORY_LIMIT
          value: "512Mi"  # App should only use half

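---
# memory.high for soft limiting (cgroups v2)
# Note: Kubernetes has no Pod annotation or field that sets memory.high.
# On cgroups v2 nodes, the kubelet's MemoryQoS feature gate (alpha since 1.22)
# can derive memory.high from the container's requests and limits.
# That is a node-level kubelet setting, not something the Pod spec controls.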
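
The `APP_MEMORY_LIMIT` variable in the first manifest only helps if the application reads it. For a Go service, one option is to feed it into the runtime's soft memory limit. The following is a sketch under assumptions: `parseBinaryQuantity` is a hypothetical helper that understands only plain byte counts and the `Ki`, `Mi`, and `Gi` suffixes, not the full Kubernetes quantity syntax, and `debug.SetMemoryLimit` requires Go 1.19 or newer.

```go
package main

import (
    "fmt"
    "os"
    "runtime/debug"
    "strconv"
    "strings"
)

// parseBinaryQuantity parses values such as "512Mi", "1Gi", or "1048576".
// It does not guard against overflow for very large inputs.
func parseBinaryQuantity(s string) (int64, error) {
    s = strings.TrimSpace(s)
    units := []struct {
        suffix string
        shift  uint
    }{{"Ki", 10}, {"Mi", 20}, {"Gi", 30}}

    shift := uint(0)
    for _, u := range units {
        if strings.HasSuffix(s, u.suffix) {
            s = strings.TrimSuffix(s, u.suffix)
            shift = u.shift
            break
        }
    }
    n, err := strconv.ParseInt(s, 10, 64)
    if err != nil {
        return 0, err
    }
    if n < 0 {
        return 0, fmt.Errorf("negative quantity %d", n)
    }
    return n << shift, nil
}

func main() {
    raw := os.Getenv("APP_MEMORY_LIMIT")
    if raw == "" {
        fmt.Println("APP_MEMORY_LIMIT not set; leaving the Go memory limit unchanged")
        return
    }
    limit, err := parseBinaryQuantity(raw)
    if err != nil {
        fmt.Fprintln(os.Stderr, "bad APP_MEMORY_LIMIT:", err)
        os.Exit(1)
    }
    debug.SetMemoryLimit(limit) // soft limit for the Go runtime's own memory
    fmt.Printf("Go soft memory limit set to %d bytes\n", limit)
}
```

This limit covers memory managed by the Go runtime only. Page cache is outside its view, which is exactly why the container limit needs the extra headroom.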

5. Memory.high for Soft Limits

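# cgroups v2: memory.high triggers reclaim before hitting limit

# Check if cgroups v2
grep cgroup2 /proc/mounts

# Set soft limit (needs write access to the container's cgroup directory;
# inside a container /sys/fs/cgroup is usually read-only)
echo 512M > /sys/fs/cgroup/memory.high
echo 1G > /sys/fs/cgroup/memory.max

# Behavior:
# - Below memory.high: Normal operation
# - Above memory.high: Processes are throttled and the kernel reclaims aggressively
# - At memory.max: Reclaim first, then OOM killer if usage cannot be reduced

# memory.high does not single out code pages: executable pages are file-backed too.
# It makes reclaim start earlier, so cold file cache tends to be dropped
# before the hot working set comes under pressure at memory.max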
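
An application can check at startup whether a soft limit is actually in place by reading the same two files. A minimal sketch follows, assuming the cgroup v2 file format where each file holds either a byte count or the literal `max`; `parseCgroupLimit` and `reclaimBand` are hypothetical names.

```go
package main

import (
    "fmt"
    "strconv"
    "strings"
)

// parseCgroupLimit parses the content of a cgroup v2 limit file such as
// memory.high or memory.max. The literal "max" means "no limit".
func parseCgroupLimit(content string) (limit uint64, unlimited bool, err error) {
    v := strings.TrimSpace(content)
    if v == "max" {
        return 0, true, nil
    }
    limit, err = strconv.ParseUint(v, 10, 64)
    return limit, false, err
}

// reclaimBand returns the gap between memory.high and memory.max in bytes.
// ok is false when either value is unlimited or high is not below max.
func reclaimBand(highContent, maxContent string) (band uint64, ok bool, err error) {
    high, highUnlimited, err := parseCgroupLimit(highContent)
    if err != nil {
        return 0, false, err
    }
    hardMax, maxUnlimited, err := parseCgroupLimit(maxContent)
    if err != nil {
        return 0, false, err
    }
    if highUnlimited || maxUnlimited || high >= hardMax {
        return 0, false, nil
    }
    return hardMax - high, true, nil
}

func main() {
    // Sample file contents; in a container these would be read from
    // /sys/fs/cgroup/memory.high and /sys/fs/cgroup/memory.max.
    band, ok, err := reclaimBand("536870912\n", "1073741824\n")
    if err != nil {
        fmt.Println("parse error:", err)
        return
    }
    fmt.Println(band, ok)
}
```

A result with `ok == false` means there is no band between the soft and hard limit, so reclaim only starts when the container is already at `memory.max`.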

Monitoring

Detecting Page Cache Thrashing

# Inside container
cat /sys/fs/cgroup/memory.stat

# Key metrics:
# file: bytes of page cache
# anon: bytes of anonymous memory
# pgmajfault: major page faults (disk reads for evicted pages)
# pgscan: pages scanned for reclaim
# pgsteal: pages reclaimed

# High pgmajfault + high pgscan = thrashing

Prometheus Metrics

# cadvisor exposes these
container_memory_rss                    # Anonymous memory
container_memory_cache                  # Page cache
container_memory_mapped_file            # mmap'd files
container_memory_working_set_bytes      # "Real" usage
container_memory_failures_total         # Page faults, labeled by failure_type and scope

# Alert on cache pressure
- alert: ContainerPageCacheThrashing
  expr: |
    rate(container_memory_failures_total{failure_type="pgmajfault", scope="container"}[5m]) > 100
  for: 5m
  annotations:
    summary: "Container {{ $labels.container }} thrashing"

Understanding Memory Metrics

# What you usually see
container_memory_usage_bytes  # Includes page cache

# What matters for OOM
container_memory_working_set_bytes  # Usage minus inactive file cache

# Page cache specifically
container_memory_cache

# Your actual application memory (anonymous pages; mapped files are counted separately)
container_memory_rss

# Thrashing indicator
rate(container_memory_failures_total{failure_type="pgmajfault", scope="container"}[5m])
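
The relationship between usage and working set is simple enough to compute yourself from cgroup files. The sketch below follows what cAdvisor does as far as I understand it: subtract the inactive file pages from total usage and clamp at zero. `workingSet` is a hypothetical name and the sample numbers are invented.

```go
package main

import "fmt"

// workingSet approximates container_memory_working_set_bytes:
// total usage minus inactive file-backed pages, never below zero.
func workingSet(usage, inactiveFile uint64) uint64 {
    if inactiveFile >= usage {
        return 0
    }
    return usage - inactiveFile
}

func main() {
    // Invented sample: a container at its 512 MiB limit where most of the
    // usage is cold page cache from files that were read once.
    usage := uint64(512 << 20)        // memory.current
    inactiveFile := uint64(480 << 20) // inactive_file from memory.stat
    ws := workingSet(usage, inactiveFile)
    fmt.Printf("usage: %d MiB, working set: %d MiB\n", usage>>20, ws>>20)
}
```

A large gap between the two numbers means the container looks full but is mostly holding reclaimable cache; a small gap at the limit means real memory pressure.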

Common Patterns

Log Rotation Issue

Problem: Logging to files fills page cache

┌─────────────────────────────────────────────────────────────┐
│ App writes logs → Page cache grows → Evicts app code        │
│                                                              │
│ Solution: Log to stdout, let container runtime handle it    │
│           Or use O_DIRECT for log files                     │
└─────────────────────────────────────────────────────────────┘

Temp File Processing

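// Bad: Creates huge temp file, fills cache
func processBad(data []byte) error {
    tmp, err := os.CreateTemp("", "process-*")
    if err != nil {
        return err
    }
    defer os.Remove(tmp.Name())
    defer tmp.Close()
    if _, err := tmp.Write(data); err != nil {
        return err
    }
    // Seek back and process from file...
    // Page cache now holds entire temp file
    _, err = tmp.Seek(0, io.SeekStart)
    return err
}

// Good: Use memory directly or drop cache
func processGood(data []byte) error {
    tmp, err := os.CreateTemp("", "process-*")
    if err != nil {
        return err
    }
    defer os.Remove(tmp.Name())
    defer tmp.Close()
    if _, err := tmp.Write(data); err != nil {
        return err
    }
    // Flush first: FADV_DONTNEED only drops clean pages
    if err := tmp.Sync(); err != nil {
        return err
    }
    // Or just process in memory if it fits
    return unix.Fadvise(int(tmp.Fd()), 0, int64(len(data)), unix.FADV_DONTNEED)
}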

Database in Container

# PostgreSQL page cache management
apiVersion: v1
kind: Pod
spec:
  containers:
    - name: postgres
      resources:
        limits:
          memory: "4Gi"  # Total limit
      # PostgreSQL shared_buffers (managed cache), passed as a server flag.
      # Assumes the official postgres image, which forwards "-c" options to the
      # server and has no environment variable for shared_buffers.
      args: ["-c", "shared_buffers=1GB"]  # Only 25% of limit
      # Leaves 3GB for OS page cache + connections

Checklist

## Page Cache Management

### Detection
- [ ] Monitor container_memory_cache
- [ ] Alert on pgmajfault rate
- [ ] Check memory.stat for cache vs anon ratio

### Prevention
- [ ] Use FADV_DONTNEED for large file reads
- [ ] Consider O_DIRECT for sequential I/O
- [ ] Stream files instead of reading entirely
- [ ] Give 2x memory headroom for file-heavy workloads

### Configuration
- [ ] Set memory.high (cgroups v2) for soft limiting
- [ ] Configure app memory budget separately from limit
- [ ] Log to stdout instead of files

### Architecture
- [ ] Offload file processing to dedicated pods
- [ ] Use external object storage instead of local files
- [ ] Consider tmpfs only for small temp files (its pages are charged to the limit and cannot be dropped like page cache)

Conclusion

Page cache is not “free” memory in containers:

  1. File I/O fills page cache, which is charged against your limit
  2. App code pages can be evicted, causing major faults
  3. Use FADV_DONTNEED to release cache after reading
  4. Give 2x memory headroom for file-heavy workloads

Monitor pgmajfault - when it spikes, you’re thrashing.


Cite this article

If you reference this post, please link to the original URL and credit the author.

Michal Drozd. "Linux Page Cache Thrashing in Containers: When Free Memory Isn't Free". https://www.michal-drozd.com/en/blog/container-page-cache-thrashing/ (Published August 6, 2025).