Ephemeral-Storage Evictions in Kubernetes: The Log Storm That Took Down Healthy Pods

If you’ve never been paged for this, it sounds fake:

  • Pods restart with reason Evicted
  • The message says node was low on ephemeral-storage
  • You SSH to the node, run df -h, and… there’s free disk

The first time I saw it, I assumed kubelet had a bug. It didn’t. We did.

This article is the runbook I wish I had: nodefs vs imagefs, log growth, kubelet garbage collection, and ephemeral-storage as an explicit contract.

Tested on: Kubernetes 1.29–1.31, containerd 1.7, Linux 6.1–6.6, mixed nodefs/imagefs setups.

Incident narrative (anonymized)

During a production incident, I enabled very verbose logging in one service to capture a rare edge case. The incident ended. The debug logging didn’t.

Over the next few hours:

  • container logs ballooned into tens of GB on a subset of nodes
  • kubelet set DiskPressure
  • pods started getting evicted — including unrelated services
  • rollouts got noisy because evictions looked like “random restarts”

Blast radius: multiple services had elevated error rates due to restarts and cold caches.

Constraint: we couldn’t take down the node pool. We needed to stop the bleeding and then add guardrails so “debug logging” could never become a cluster incident again.

Timeline

  • T-0: Alerts: increased pod restarts across several namespaces.
  • T+10m: kubectl describe pod shows Evicted with “low on resource: ephemeral-storage”.
  • T+20m: kubectl describe node shows DiskPressure=True on the same nodes.
  • T+30m: On-node du reveals container logs (/var/log/pods, which the /var/log/containers symlinks point into) dominate disk growth.
  • T+40m: Mitigation: reduce log verbosity + restart the noisy pods + drain the worst nodes.
  • T+2h: Disk pressure clears; evictions stop.
  • T+1d: We enforce kubelet log rotation + per-pod ephemeral-storage limits + alerts.

Mechanism: why “disk looks free” but kubelet evicts pods

Kubelet evicts based on thresholds, not 100% full

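Evictions trigger when kubelet observes an eviction signal (nodefs.available, nodefs.inodesFree, imagefs.available, imagefs.inodesFree) crossing a configured threshold (evictionHard, evictionSoft). The signals are tracked per filesystem:

  • nodefs (the node's main filesystem, which holds /var/lib/kubelet and usually logs and emptyDir volumes)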
  • imagefs (where container images/snapshots live; may be a separate partition)

So you can have:

  • plenty of space on / but imagefs is full
  • or the opposite
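  • or free space overall but below kubelet’s configured “hard” threshold (the upstream default is nodefs.available<10%, so a 500 GiB disk with 45 GiB free is already under pressure)
  • or plenty of free bytes but too few free inodes (nodefs.inodesFree), which df -h does not show; check df -i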
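To make the "free but below threshold" case concrete, here is a minimal sketch that compares a path's free space and free inodes against a percentage threshold. The function names are made up for illustration. It assumes thresholds expressed as percentages, and df only approximates the numbers kubelet collects through its own stats provider, so treat the result as a hint rather than the kubelet's verdict.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of kubelet or kubectl.

# is_below_threshold AVAIL TOTAL MIN_PCT
# Exit 0 when AVAIL/TOTAL is strictly below MIN_PCT percent.
is_below_threshold() {
  [ $(( $1 * 100 )) -lt $(( $2 * $3 )) ]
}

# check_fs PATH MIN_SPACE_PCT MIN_INODE_PCT
# Reads "free total" for KiB and for inodes from df (POSIX output format)
# and reports whichever of the two is under its threshold.
check_fs() {
  space=$(df -Pk "$1" | awk 'NR==2 {print $4, $2}')
  inodes=$(df -Pi "$1" | awk 'NR==2 {print $4, $2}')
  # Word splitting of $space and $inodes into two arguments is intended.
  if is_below_threshold $space "$2"; then
    echo "$1: free space below ${2}%"
  fi
  if is_below_threshold $inodes "$3"; then
    echo "$1: free inodes below ${3}%"
  fi
}

# Example calls, assuming upstream default thresholds and default paths:
#   check_fs /var/lib/kubelet 10 5      # nodefs
#   check_fs /var/lib/containerd 15 5   # imagefs, if it is a separate mount
```

A filesystem that prints nothing here can still be under a soft threshold if your cluster configures evictionSoft higher than the values you passed in.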

Container logs are ephemeral-storage

By default, container logs live on the node under /var/log/pods, with per-container symlinks in /var/log/containers, and they count toward ephemeral-storage. If kubelet-level rotation is disabled or set too generously, one chatty container can consume “ephemeral-storage” until kubelet protects the node by evicting pods.

Evictions are “not graceful” unless you engineered them

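Eviction terminates the pod, and how much warning your app gets depends on which threshold fired. Under a soft threshold, your app gets SIGTERM and a grace period capped by evictionMaxPodGracePeriod. Under a hard threshold, kubelet kills pods with effectively no grace period. Either way, if you don’t handle shutdown well, you get: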

  • user-visible errors
  • connection storms on restart
  • rollout instability

Runbook: diagnose and stop ephemeral-storage evictions

What to check first

  1. Confirm eviction reason
kubectl -n <ns> describe pod <pod>

Look for something like:

  • Reason: Evicted
  • The node was low on resource: ephemeral-storage
  • sometimes it will mention the container usage
  2. Find the node and check node conditions
kubectl get pod -n <ns> <pod> -o wide
kubectl describe node <node> | grep DiskPressure
  3. Figure out whether it’s nodefs or imagefs
On the node (or via kubectl debug node/... if you use that pattern):
df -h
df -h /var/log
df -h /var/lib/containerd

I’m looking for:

  • a partition close to full
  • or a partition that is below an eviction threshold even if not “full”
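Before SSHing anywhere, it helps to know which nodes are actually shedding pods. The sketch below lists pods whose status.reason is Evicted and counts them per node. The function names are hypothetical, and the list can be incomplete because evicted pod objects are eventually garbage-collected.

```shell
#!/bin/sh
# Hypothetical helpers for illustration.

# list_evicted: print "namespace<TAB>pod<TAB>node" for every pod whose
# status.reason is Evicted. Needs kubectl access to the cluster.
list_evicted() {
  kubectl get pods -A -o jsonpath='{range .items[?(@.status.reason=="Evicted")]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.spec.nodeName}{"\n"}{end}'
}

# count_by_node: read list_evicted output on stdin and print
# "count node" lines, busiest node first.
count_by_node() {
  awk -F '\t' 'NF >= 3 {n[$3]++} END {for (k in n) print n[k], k}' | sort -rn
}

# Usage:
#   list_evicted | count_by_node
```

If two or three nodes account for most of the count, start the on-node checks there.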

How to confirm the hypothesis (fast disk forensics)

A. Identify what is consuming disk on nodefs (logs, emptyDir, app temp)

sudo du -xh /var/log | sort -h | tail -n 20
# /var/log/containers holds symlinks into /var/log/pods, so dereference them with -L
sudo du -ahL /var/log/containers | sort -h | tail -n 20
sudo du -xh /var/log/pods | sort -h | tail -n 20

B. Identify what is consuming imagefs (images/snapshots)

sudo du -xh /var/lib/containerd | sort -h | tail -n 20

C. Correlate “big log files” to pods

Once you find a huge log file, map it back to its pod: names under /var/log/containers follow the pattern <pod>_<namespace>_<container>-<container-id>.log.
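A small sketch of that correlation step, assuming the usual CRI layout where each file in /var/log/containers is a symlink named after the pod, namespace, and container. Both function names are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helpers for illustration.

# parse_container_log_name FILE
# Split "<pod>_<namespace>_<container>-<container-id>.log" and print
# "namespace pod container". Pod and namespace names cannot contain "_",
# and the container ID has no "-", so plain suffix/prefix stripping works.
parse_container_log_name() {
  base=$(basename "$1" .log)
  pod=${base%%_*}          # everything before the first "_"
  rest=${base#*_}          # drop "<pod>_"
  ns=${rest%%_*}           # namespace
  rest=${rest#*_}          # "<container>-<container-id>"
  container=${rest%-*}     # strip the trailing "-<container-id>"
  printf '%s %s %s\n' "$ns" "$pod" "$container"
}

# biggest_logs DIR N
# The N largest log files reachable from DIR, in KiB, following symlinks.
biggest_logs() {
  find -L "$1" -type f -name '*.log' -exec du -kL {} + | sort -rn | head -n "$2"
}

# Usage on a node:
#   sudo sh -c '. ./logs.sh; biggest_logs /var/log/containers 10'
```

From the namespace and pod name you can go straight to kubectl -n <ns> describe pod <pod> to find the owner.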

Safe mitigations (do these first)

  1. Turn off the log firehose
  • revert debug logging
  • cap per-request logging
  • reduce log volume at the source
  2. Drain the worst nodes
Draining is often safer than “surgical deletion” in containerd directories.
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data
  3. Enable kubelet log rotation
Check the effective containerLogMaxSize and containerLogMaxFiles on your nodes (the upstream defaults are 10Mi and 5) and tighten them if they have been overridden. This is a real “fix”, not a bandaid.

  4. Add ephemeral-storage requests/limits
So one pod can’t silently claim the node’s disk.

Risky mitigations (I avoid unless I’m desperate)

  • Manually deleting things under /var/lib/containerd without understanding containerd GC
  • “rm -rf” in overlay/snapshot directories
  • Randomly restarting containerd on a hot node (can cascade into more disruption)

What we changed (concrete)

1) We enforced kubelet container log rotation

We moved kubelet to a config file (if not already) and set rotation.

Example KubeletConfiguration snippet:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
containerLogMaxSize: "50Mi"
containerLogMaxFiles: 5
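Rotation caps each container, not the node, so it is worth doing the arithmetic once. This is a rough ceiling only: I'm assuming every rotated file is kept uncompressed at full size, while in practice older rotated files may be compressed and the live file can overshoot between kubelet's rotation checks. The function name is hypothetical.

```shell
#!/bin/sh
# max_log_footprint_mib MAX_SIZE_MIB MAX_FILES CONTAINERS
# Approximate upper bound, in MiB, on disk used by container logs:
# per-file cap x files kept per container x number of containers.
max_log_footprint_mib() {
  echo $(( $1 * $2 * $3 ))
}

# One container with the settings above (50Mi x 5 files):
#   max_log_footprint_mib 50 5 1      # prints 250
# A node running 110 pods with 2 containers each:
#   max_log_footprint_mib 50 5 220    # prints 55000 (about 54 GiB)
```

If the node-wide number is larger than the disk headroom above your eviction threshold, rotation alone will not prevent DiskPressure when many containers get noisy at once.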

2) We set ephemeral-storage as a contract for the noisy service

Before (no budget):

resources:
  requests:
    cpu: "200m"
    memory: "256Mi"
  limits:
    cpu: "1"
    memory: "512Mi"

After (explicit ephemeral-storage budget):

resources:
  requests:
    cpu: "200m"
    memory: "256Mi"
    ephemeral-storage: "512Mi"
  limits:
    cpu: "1"
    memory: "512Mi"
    ephemeral-storage: "2Gi"

3) We prevented “debug logging stays on” accidents

We added:

  • a runtime flag that auto-expires debug logging (time-based)
  • a CI check that fails builds if debug logging is the default in production configs
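Neither guardrail needs much code. Here is a minimal sketch of both, not our actual implementation: the log_level key, the epoch-based expiry, and the function names are all assumptions made for illustration, so adapt the pattern to your own config format.

```shell
#!/bin/sh
# Hypothetical helpers for illustration.

# effective_log_level NOW_EPOCH DEBUG_UNTIL_EPOCH
# Print "debug" only while the expiry is still in the future, else "info".
effective_log_level() {
  if [ "$1" -lt "$2" ]; then
    echo debug
  else
    echo info
  fi
}

# ci_check_no_debug_default FILE...
# Return 1 (and print the offending lines) when any of the given
# production config files sets "log_level: debug" as the default.
ci_check_no_debug_default() {
  if grep -EHn '^[[:space:]]*log_level:[[:space:]]*"?debug"?[[:space:]]*$' "$@"; then
    echo "debug logging must not be the production default" >&2
    return 1
  fi
}

# Usage:
#   effective_log_level "$(date +%s)" "$DEBUG_UNTIL"
#   ci_check_no_debug_default config/production/*.yaml
```

The important design choice is that debug mode carries an expiry timestamp rather than a boolean, so forgetting to turn it off is no longer possible.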

How to verify (measurable)

  1. DiskPressure clears
kubectl describe node <node> | grep DiskPressure

Expected: DiskPressure=False.

  2. No new evictions
kubectl get events -A --sort-by=.metadata.creationTimestamp | tail -n 50

Expected: eviction events stop appearing.

  3. Log directories stop growing unbounded
On-node (/var/log/containers only holds symlinks, so measure their target):
sudo du -sh /var/log/pods

Expected: stabilizes and rotates.

  4. Workloads recover without churn
  • restart rates return to baseline
  • SLOs stabilize (no periodic restart spikes)

Prevention / guardrails

  • Ephemeral-storage budgets
    • define per critical workload
    • enforce via policy (at minimum: code review + templates)
  • Node disk alerts
    • nodefs and imagefs utilization
    • growth rate alerts are more actionable than absolute thresholds
  • Logging budgets
    • max logs per request
    • sampling defaults in production
  • Graceful shutdown
    • if eviction happens, it should not create a user-visible outage
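To illustrate why growth rate is the more actionable signal, here is a rough sketch that turns two disk-usage samples into a "minutes until the eviction threshold" estimate. It assumes linear growth and a percentage threshold, uses integer arithmetic, and the function names are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helpers for illustration.

# growth_kib_per_min BEFORE_KIB AFTER_KIB SECONDS
# Average growth between two samples (for example two runs of
# "du -sk /var/log/pods" taken SECONDS apart).
growth_kib_per_min() {
  echo $(( ($2 - $1) * 60 / $3 ))
}

# minutes_until_threshold AVAIL_KIB TOTAL_KIB MIN_PCT GROWTH_KIB_PER_MIN
# Minutes until free space falls to MIN_PCT percent of the filesystem
# at the given growth rate. Prints "never" when usage is not growing
# and 0 when the filesystem is already at or below the threshold.
minutes_until_threshold() {
  if [ "$4" -le 0 ]; then
    echo never
    return
  fi
  headroom=$(( $1 - $2 * $3 / 100 ))
  if [ "$headroom" -le 0 ]; then
    echo 0
    return
  fi
  echo $(( headroom / $4 ))
}

# Usage:
#   rate=$(growth_kib_per_min 1000000 1600000 300)
#   minutes_until_threshold 50000000 100000000 10 "$rate"
```

A node at 50% free looks healthy on an absolute-threshold alert, but at that growth rate it is a few hours from evicting pods.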

Cite this article

If you reference this post, please link to the original URL and credit the author.

Michal Drozd. "Ephemeral-Storage Evictions in Kubernetes: The Log Storm That Took Down Healthy Pods". https://www.michal-drozd.com/en/blog/kubernetes-ephemeral-storage-eviction-log-storm/ (Published November 18, 2025).