Ephemeral-Storage Evictions in Kubernetes: The Log Storm That Took Down Healthy Pods

If you’ve never been paged for this, it sounds fake:

Pods restart with reason Evicted
The message says node was low on ephemeral-storage
You SSH to the node, run df -h, and… there’s free disk

The first time I saw it, I assumed kubelet had a bug. It didn’t. We did.

This article is the runbook I wish I had: nodefs vs imagefs, log growth, kubelet garbage collection, and ephemeral-storage as an explicit contract.

Tested on: Kubernetes 1.29–1.31, containerd 1.7, Linux 6.1–6.6, mixed nodefs/imagefs setups.

Incident narrative (anonymized)

During a production incident, I enabled very verbose logging in one service to capture a rare edge case. The incident ended. The debug logging didn’t.

Over the next few hours:

container logs ballooned into tens of GB on a subset of nodes
kubelet set DiskPressure
pods started getting evicted — including unrelated services
rollouts got noisy because evictions looked like “random restarts”

Blast radius: multiple services had elevated error rates due to restarts and cold caches.

Constraint: we couldn’t take down the node pool. We needed to stop the bleeding and then add guardrails so “debug logging” could never become a cluster incident again.

Timeline

T-0: Alerts: increased pod restarts across several namespaces.
T+10m: kubectl describe pod shows Evicted with “low on resource: ephemeral-storage”.
T+20m: kubectl describe node shows DiskPressure=True on the same nodes.
T+30m: On-node du reveals that container logs (the files behind /var/log/containers) dominate disk growth.
T+40m: Mitigation: reduce log verbosity + restart the noisy pods + drain the worst nodes.
T+2h: Disk pressure clears; evictions stop.
T+1d: We enforce kubelet log rotation + per-pod ephemeral-storage limits + alerts.

Mechanism: why “disk looks free” but kubelet evicts pods

Kubelet evicts based on thresholds, not 100% full

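Kubelet does not wait for a disk to fill. It evicts when an observed eviction signal (for example nodefs.available or imagefs.available) crosses a configured threshold (evictionHard, or evictionSoft once its grace period has elapsed). The signals are tracked per filesystem:

nodefs (the node's main filesystem, holding kubelet's directory and /var/log; this is where logs and emptyDir volumes usually live)
imagefs (where the container runtime stores images and writable layers; may be a separate partition)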

So you can have:

plenty of space on / but imagefs is full
or the opposite
or free space overall but below kubelet’s configured “hard” threshold
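To make the third case concrete, here is a minimal Python sketch that compares available space with a percentage threshold. The 10% and 15% figures are the upstream Linux defaults for nodefs.available and imagefs.available in evictionHard; your cluster may override them. The function names are my own illustration, not anything kubelet exposes.

```python
import shutil

# Upstream Linux defaults for evictionHard, as "percent of capacity that must
# stay available". Assumption: your kubelet config may set different values.
DEFAULT_HARD_THRESHOLDS_PCT = {"nodefs": 10.0, "imagefs": 15.0}


def available_pct(total_bytes: int, available_bytes: int) -> float:
    """Return available space as a percentage of capacity."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * available_bytes / total_bytes


def breaches_threshold(total_bytes: int, available_bytes: int, threshold_pct: float) -> bool:
    """True when available space has dropped below the threshold."""
    return available_pct(total_bytes, available_bytes) < threshold_pct


def check_mount(path: str, threshold_pct: float) -> bool:
    """Check a mounted filesystem on this host against a threshold."""
    usage = shutil.disk_usage(path)  # usage.free is what unprivileged users can still write
    return breaches_threshold(usage.total, usage.free, threshold_pct)
```

A 200 GiB disk with 18 GiB available is 9% available. df -h shows "18G free", which looks fine to a human, yet it is already under a 10% nodefs threshold.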

Container logs are ephemeral-storage

By default, container logs live on the node: the files sit under /var/log/pods, and /var/log/containers holds symlinks to them. They count toward the pod's ephemeral-storage usage. If rotation isn’t enforced at the kubelet level, one chatty container can keep consuming node disk until kubelet protects the node by evicting pods.

Evictions are “not graceful” unless you engineered them

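Eviction terminates the pod. On a soft threshold, kubelet sends SIGTERM and allows a grace period (capped by evictionMaxPodGracePeriod if you set it). On a hard threshold, it kills pods immediately with no grace period. If your app doesn’t shut down cleanly, or never gets the chance, you get: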

user-visible errors
connection storms on restart
rollout instability

Runbook: diagnose and stop ephemeral-storage evictions

What to check first

Confirm eviction reason

kubectl -n <ns> describe pod <pod>

Look for something like:

Reason: Evicted
The node was low on resource: ephemeral-storage
The message sometimes also names the container and how much storage it was using.

Find the node and check node conditions

kubectl get pod -n <ns> <pod> -o wide
kubectl describe node <node> | grep -n "DiskPressure"

Figure out whether it’s nodefs or imagefs

On the node (or via kubectl debug node/... if you use that pattern):

df -h
df -h /var/log
df -h /var/lib/containerd

I’m looking for:

a partition close to full
or a partition that is below an eviction threshold even if not “full”

How to confirm the hypothesis (fast disk forensics)

A. Identify what is consuming disk on nodefs (logs, emptyDir, app temp)

sudo du -xh /var/log | sort -h | tail -n 20
# /var/log/containers holds symlinks into /var/log/pods; -L follows them, -a lists each file
sudo du -xahL /var/log/containers | sort -h | tail -n 20
sudo du -xh /var/log/pods | sort -h | tail -n 20
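If you would rather get one number per pod than scroll through du output, the same question can be answered with a short script. This is a sketch under two assumptions: pod directories under /var/log/pods are named namespace_pod_uid (the layout kubelet uses with containerd), and apparent file size is a good enough proxy for disk usage. The function names are illustrative.

```python
import os
from typing import Dict, List, Tuple


def pod_log_usage(pods_dir: str = "/var/log/pods") -> Dict[str, int]:
    """Sum log bytes per pod directory (named <namespace>_<pod>_<uid>)."""
    usage: Dict[str, int] = {}
    for entry in os.scandir(pods_dir):
        if not entry.is_dir(follow_symlinks=False):
            continue
        total = 0
        for root, _dirs, files in os.walk(entry.path):
            for name in files:
                try:
                    # st_size is apparent size; du reports allocated blocks
                    total += os.lstat(os.path.join(root, name)).st_size
                except OSError:
                    pass  # file was rotated away while we were walking
        usage[entry.name] = total
    return usage


def top_talkers(usage: Dict[str, int], n: int = 10) -> List[Tuple[str, int]]:
    """Return the n pod directories with the most log bytes, largest first."""
    return sorted(usage.items(), key=lambda kv: kv[1], reverse=True)[:n]
```

Running top_talkers(pod_log_usage()) as root on the node gives a ranked list whose keys already contain the namespace and pod name.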

B. Identify what is consuming imagefs (images/snapshots)

sudo du -xh /var/lib/containerd | sort -h | tail -n 20

C. Correlate “big log files” to pods

Once you find a huge log file, map it back to its owner: names under /var/log/containers follow the pattern <pod>_<namespace>_<container>-<container-id>.log.
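Mapping a filename back to its owner is easy to script. The sketch below assumes the pod_namespace_container-containerid.log naming that kubelet uses for entries in /var/log/containers, with a 64-character hex container ID as containerd produces. LogOwner and parse_container_log_name are names I made up for the example.

```python
import re
from typing import NamedTuple, Optional


class LogOwner(NamedTuple):
    pod: str
    namespace: str
    container: str
    container_id: str


# Pod, namespace and container names cannot contain "_", and the container ID
# is assumed to be 64 hex characters, so the name splits unambiguously.
_LOG_NAME = re.compile(
    r"^(?P<pod>[^_]+)_(?P<namespace>[^_]+)_(?P<container>.+)"
    r"-(?P<container_id>[0-9a-f]{64})\.log$"
)


def parse_container_log_name(filename: str) -> Optional[LogOwner]:
    """Split a /var/log/containers filename into its parts, or None if it doesn't match."""
    match = _LOG_NAME.match(filename)
    if match is None:
        return None
    return LogOwner(**match.groupdict())
```

With the namespace and pod in hand, kubectl -n <ns> describe pod <pod> tells you which workload owns the firehose.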

Safe mitigations (do these first)

Turn off the log firehose

revert debug logging
cap per-request logging
reduce log volume at the source

Drain the worst nodes

Draining is often safer than “surgical deletion” in containerd directories.

kubectl drain <node> --ignore-daemonsets --delete-emptydir-data

Enable kubelet log rotation

If it’s not already enabled. This is a real “fix”, not a bandaid.

Add ephemeral-storage requests/limits

So one pod can’t silently claim the node’s disk.

Risky mitigations (I avoid unless I’m desperate)

Manually deleting things under /var/lib/containerd without understanding containerd GC
“rm -rf” in overlay/snapshot directories
Randomly restarting containerd on a hot node (can cascade into more disruption)

What we changed (concrete)

1) We enforced kubelet container log rotation

We moved kubelet settings into a config file (where they weren’t already) and set log rotation explicitly, rather than relying on whatever the node image shipped with.

Example KubeletConfiguration snippet:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
containerLogMaxSize: "50Mi"
containerLogMaxFiles: 5

2) We set ephemeral-storage as a contract for the noisy service

Before (no budget):

resources:
  requests:
    cpu: "200m"
    memory: "256Mi"
  limits:
    cpu: "1"
    memory: "512Mi"

After (explicit ephemeral-storage budget):

resources:
  requests:
    cpu: "200m"
    memory: "256Mi"
    ephemeral-storage: "512Mi"
  limits:
    cpu: "1"
    memory: "512Mi"
    ephemeral-storage: "2Gi"
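It is worth checking that the two changes agree with each other: rotated logs count against the pod's ephemeral-storage, so the rotation settings from step 1 put a rough ceiling on how much of the budget logs can use. Here is a back-of-the-envelope check in Python. The helper names are hypothetical and the parser only handles the binary suffixes used in this article. The bound is approximate, because kubelet rotates on a periodic check and a fast writer can overshoot between checks.

```python
# Binary suffixes used by Kubernetes resource quantities (subset, for illustration).
_BINARY_SUFFIXES = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4}


def parse_quantity(value: str) -> int:
    """Convert a quantity such as "50Mi" or "2Gi" to bytes.

    Only plain integers and binary suffixes are handled; decimal suffixes
    (k, M, G) and fractional values are deliberately out of scope.
    """
    for suffix, factor in _BINARY_SUFFIXES.items():
        if value.endswith(suffix):
            return int(value[: -len(suffix)]) * factor
    return int(value)


def max_rotated_log_bytes(max_size: str, max_files: int, containers: int = 1) -> int:
    """Approximate ceiling on log bytes kept on disk for a pod."""
    return parse_quantity(max_size) * max_files * containers


def headroom_bytes(limit: str, max_size: str, max_files: int, containers: int = 1) -> int:
    """Bytes left under the ephemeral-storage limit after worst-case logs."""
    return parse_quantity(limit) - max_rotated_log_bytes(max_size, max_files, containers)
```

With containerLogMaxSize of 50Mi and containerLogMaxFiles of 5, a single-container pod can hold about 250Mi of logs. That fits inside the 512Mi request and leaves roughly 1798Mi under the 2Gi limit for emptyDir and the writable layer.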

3) We prevented “debug logging stays on” accidents

We added:

a runtime flag that auto-expires debug logging (time-based)
a CI check that fails builds if debug logging is the default in production configs
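The auto-expiring flag is the more interesting of the two. A minimal sketch of the idea in Python, assuming the standard logging module; ExpiringDebug and its tick method are illustrative names, not our actual implementation:

```python
import logging
import time
from typing import Callable, Optional


class ExpiringDebug:
    """Lets a logger run at DEBUG until a deadline, then falls back to a base level."""

    def __init__(
        self,
        logger: logging.Logger,
        base_level: int = logging.INFO,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._logger = logger
        self._base_level = base_level
        self._clock = clock  # injectable so the expiry can be tested without sleeping
        self._expires_at: Optional[float] = None
        logger.setLevel(base_level)

    def enable(self, ttl_seconds: float) -> None:
        """Turn on DEBUG for at most ttl_seconds."""
        self._expires_at = self._clock() + ttl_seconds
        self._logger.setLevel(logging.DEBUG)

    def tick(self) -> None:
        """Call periodically (per request or from a timer) to enforce expiry."""
        if self._expires_at is not None and self._clock() >= self._expires_at:
            self._expires_at = None
            self._logger.setLevel(self._base_level)
```

The point of the design is that turning debug on requires a duration, so "forgot to turn it off" stops being a possible state.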

How to verify (measurable)

DiskPressure clears

kubectl describe node <node> | grep -n "DiskPressure"

Expected: DiskPressure=False.

No new evictions

kubectl get events -A --sort-by=.metadata.creationTimestamp | tail -n 50

Expected: eviction events stop appearing.

Log directories stop growing unbounded

On-node (measure /var/log/pods, since /var/log/containers only holds symlinks):

sudo du -sh /var/log/pods

Expected: the total plateaus as files rotate instead of climbing.

Workloads recover without churn

restart rates return to baseline
SLOs stabilize (no periodic restart spikes)

Prevention / guardrails

Ephemeral-storage budgets
- define per critical workload
- enforce via policy (at minimum: code review + templates)
Node disk alerts
- nodefs and imagefs utilization
- growth rate alerts are more actionable than absolute thresholds
Logging budgets
- max logs per request
- sampling defaults in production
Graceful shutdown
- if eviction happens, it should not create a user-visible outage
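On growth-rate alerts: the useful number is "how long until this node crosses the eviction threshold at the current fill rate". A rough linear estimate from two samples is enough to page on. The sketch below is Python with made-up names; in practice you would express the same idea in your monitoring system's query language.

```python
from typing import Optional


def seconds_until_threshold(
    total_bytes: int,
    available_then: int,
    available_now: int,
    interval_seconds: float,
    threshold_pct: float,
) -> Optional[float]:
    """Linear estimate of time until available space falls below threshold_pct.

    Returns None when the disk is not filling, and 0.0 when it is already below.
    """
    floor_bytes = total_bytes * threshold_pct / 100.0
    if available_now < floor_bytes:
        return 0.0
    consumed = available_then - available_now
    if consumed <= 0 or interval_seconds <= 0:
        return None
    rate = consumed / interval_seconds  # bytes per second
    return (available_now - floor_bytes) / rate
```

For example, a 100 GiB disk that went from 50 GiB to 40 GiB available in ten minutes has about 30 minutes before it drops under a 10% threshold. An absolute alert at 40% available would still be silent.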
