etcd Quota Alarm: When Your Kubernetes Cluster Goes Read-Only
The cluster looked fine until kubectl apply started hanging and new pods sat in Pending. The cause: etcd exceeded its storage quota and raised a NOSPACE alarm, which makes the cluster effectively read-only until you compact, defragment, and disarm the alarm.
Environment: Kubernetes with etcd (self-managed or kubeadm), clusters with high churn (frequent deployments, many events), default etcd configuration
The Problem
The Sudden Lockdown
Timeline of cluster freeze:
T+0:00 Normal cluster operation
etcd DB size: ~2GB (just under the 2GB quota)
Compaction: running every 5 min
T+1:00 Compaction job fails silently
DB size starts growing with history
T+24:00 DB size: 2.05GB
etcd triggers ALARM: NOSPACE
All write operations rejected!
T+24:01 kubectl apply deployment.yaml
Error: etcdserver: mvcc: database space exceeded
T+24:02 Pod crashes, can't reschedule
scheduler: can't create binding: space exceeded
T+24:03 ConfigMap update fails
Everything is frozen
What the Errors Look Like
# API server logs
E0115 03:42:17.123456 etcdserver: mvcc: database space exceeded
# kubectl errors
$ kubectl apply -f deployment.yaml
Error from server: etcdserver: mvcc: database space exceeded
$ kubectl create namespace test
error: etcdserver: mvcc: database space exceeded
# Even deletions fail!
$ kubectl delete pod stuck-pod
error: etcdserver: mvcc: database space exceeded
Root Cause
How etcd Storage Works
etcd MVCC (Multi-Version Concurrency Control):
┌─────────────────────────────────────────────────────────────┐
│ Every write creates a NEW revision, old versions kept │
│ │
│ Key: /registry/pods/default/nginx │
│ │
│ Rev 1000: {replicas: 1} ← kept for history │
│ Rev 1001: {replicas: 2} ← kept for history │
│ Rev 1002: {replicas: 3} ← kept for history │
│ Rev 1003: {replicas: 5} ← current │
│ │
│ Without compaction: │
│ - All revisions stored forever │
│ - DB grows with every write │
│ - watch operations can read old history │
│ │
│ Quota (default 2GB) prevents unbounded growth │
│ When exceeded → ALARM → read-only mode │
└─────────────────────────────────────────────────────────────┘
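The behaviour in the diagram can be imitated with a toy model. This is only a sketch: a single key, a temp file standing in for the backend, and hypothetical helper names (`put`, `compact`, `count`). Real etcd compaction works across all keys and keeps the newest revision of each one.

```shell
#!/bin/sh
# Toy model only: one key, one "revision:value" line per write.
store=$(mktemp)
trap 'rm -f "$store" "$store.tmp"' EXIT
rev=999

put() {      # every write appends a new revision; nothing is overwritten
  rev=$((rev + 1))
  printf '%s:%s\n' "$rev" "$1" >> "$store"
}

compact() {  # drop every entry older than revision $1
  awk -F: -v min="$1" '($1 + 0) >= (min + 0)' "$store" > "$store.tmp" &&
    mv "$store.tmp" "$store"
}

count() {    # number of revisions currently stored
  wc -l < "$store" | tr -d ' '
}

put 'replicas=1'; put 'replicas=2'; put 'replicas=3'; put 'replicas=5'
echo "before compaction: $(count) revisions"
compact "$rev"
echo "after compaction: $(count) revisions, latest is $(cat "$store")"
```

Four writes leave four stored revisions; compacting to the current revision leaves only the latest one.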
Why Compaction Stops
# Common reasons compaction fails:
# 1. etcd running without auto-compaction
etcd --auto-compaction-retention=0 # Disabled!
# 2. kube-apiserver not setting compaction
# Check apiserver flags:
ps aux | grep kube-apiserver | grep etcd-compaction
# Missing: --etcd-compaction-interval
# 3. Compaction runs but defrag doesn't
# Compaction marks space as reclaimable
# Defragmentation actually frees it
# DB file stays large without defrag
# 4. High write rate exceeds compaction rate
# Cluster with 1000s of deployments/hour
# Compaction can't keep up
The Quota Math
# Check current etcd status
etcdctl endpoint status --write-out=table
# +----------------+------------------+-------+-------+----------+
# | ENDPOINT | ID | V | DB SZ | IS LEADER|
# +----------------+------------------+-------+-------+----------+
# | 127.0.0.1:2379 | 8e9e05c52164694d | 3.5.0 | 2.1GB | true |
# +----------------+------------------+-------+-------+----------+
# Compare on-disk size with the space actually in use
etcdctl endpoint status --write-out=json | jq '.[] | .Status.dbSize, .Status.dbSizeInUse'
# 2147483648 (DB size on disk: 2GB)
# 1073741824 (In use: 1GB; the rest is free pages that defrag can reclaim)
# Check alarms
etcdctl alarm list
# memberID:8e9e05c52164694d alarm:NOSPACE
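The two byte counts are easier to read as percentages. The helper below is a small sketch (the function name `usage_report` is made up for illustration) that takes the DB size, the in-use size, and the quota, all in bytes, and prints how full the quota is and how much of the file a defrag could reclaim.

```shell
#!/bin/sh
# usage_report DB_SIZE IN_USE QUOTA   (all values in bytes)
usage_report() {
  db_size=$1; in_use=$2; quota=$3
  # Share of the quota consumed by the on-disk file
  echo "quota_used_pct=$(( db_size * 100 / quota ))"
  # Share of the file that is free pages (reclaimable by defrag)
  echo "reclaimable_pct=$(( (db_size - in_use) * 100 / db_size ))"
}

# Values from the example output; the quota is assumed to be the 2GB default.
usage_report 2147483648 1073741824 2147483648
```

With the example numbers, the file fills the whole quota while half of it is reclaimable, which is the situation where defragmentation alone brings the cluster back under the limit.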
Diagnosis
Check etcd Health
# Connect to etcd (find certs in /etc/kubernetes/pki/etcd/)
export ETCDCTL_API=3
export ETCDCTL_ENDPOINTS=https://127.0.0.1:2379
export ETCDCTL_CACERT=/etc/kubernetes/pki/etcd/ca.crt
export ETCDCTL_CERT=/etc/kubernetes/pki/etcd/server.crt
export ETCDCTL_KEY=/etc/kubernetes/pki/etcd/server.key
# Check health
etcdctl endpoint health
# Check status including DB size
etcdctl endpoint status --write-out=table
# List any alarms
etcdctl alarm list
Analyze Storage Usage
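# Count keys per resource type and namespace (key count, not bytes)
# --keys-only prints a blank line after each key, so drop those first
etcdctl get / --prefix --keys-only | grep -v '^$' | \
cut -d/ -f1-4 | sort | uniq -c | sort -rn | head -20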
# Output shows what's filling your etcd:
# 15234 /registry/events/default
# 8234 /registry/pods/kube-system
# 5123 /registry/configmaps/default
# Events are often the biggest culprit!
# They expire after --event-ttl (default 1h), but a busy cluster can still pile up thousands
Check Compaction Status
# Get current revision
etcdctl endpoint status --write-out=json | jq '.[].Status.header.revision'
# 12345678
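# See what revision is compacted to
# endpoint status does not report it (raft_term is the Raft election term, not a revision)
# Read it from the metrics endpoint instead (kubeadm default shown; check --listen-metrics-urls)
curl -s http://127.0.0.1:2381/metrics | grep etcd_debugging_mvcc_compact_revision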
# Check apiserver compaction settings
kubectl -n kube-system get pod -l component=kube-apiserver -o yaml | \
grep -A5 etcd-compaction
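To see how far compaction lags behind, the current and compacted revisions can be compared directly. The sketch below assumes etcd exposes the `etcd_debugging_mvcc_current_revision` and `etcd_debugging_mvcc_compact_revision` metrics (debugging metrics, so they may change between versions). The function name `revision_lag` is hypothetical, and it reads Prometheus text format on stdin.

```shell
#!/bin/sh
# revision_lag: read Prometheus text-format metrics on stdin and print the
# gap between the current revision and the last compacted revision.
revision_lag() {
  awk '
    $1 == "etcd_debugging_mvcc_current_revision" { cur = $2 }
    $1 == "etcd_debugging_mvcc_compact_revision" { cmp = $2 }
    END {
      if (cur == "" || cmp == "") { print "metrics not found"; exit 1 }
      printf "current=%.0f compacted=%.0f lag=%.0f\n", cur, cmp, cur - cmp
    }'
}

# Sample input with made-up values, for illustration only
printf '%s\n' \
  'etcd_debugging_mvcc_compact_revision 12345000' \
  'etcd_debugging_mvcc_current_revision 12345678' | revision_lag
```

On a kubeadm control-plane node this would typically be fed from `curl -s http://127.0.0.1:2381/metrics | revision_lag`, assuming the metrics listener is on its usual address. A lag that keeps growing over several checks suggests compaction has stopped.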
The Fix
Step 1: Emergency - Clear the Alarm
# First, compact to free up logical space
# Get current revision
REVISION=$(etcdctl endpoint status --write-out=json | \
jq -r '.[].Status.header.revision')
# Compact to current revision (removes history)
etcdctl compact $REVISION
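# Defragment to free physical space
# (blocks the member while it runs; in a multi-member cluster, repeat for each member's endpoint, one at a time)
etcdctl defrag --endpoints=https://127.0.0.1:2379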
# Clear the alarm
etcdctl alarm disarm
# Verify
etcdctl alarm list
# (should be empty)
etcdctl endpoint status --write-out=table
# DB size should be smaller now
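For a multi-member cluster the same sequence can be wrapped in a function that defragments one member at a time. This is a sketch under assumptions: the `run` and `recover` helpers are hypothetical, the endpoint URLs are placeholders, and the `ETCDCTL_*` variables from the Diagnosis section are expected to be exported already. Setting `DRY_RUN=1` prints the commands instead of executing them.

```shell
#!/bin/sh
# run CMD...: execute the command, or only print it when DRY_RUN=1
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# recover REVISION ENDPOINT...: compact once, defragment each member in
# turn, then disarm the alarm. Stops at the first failing step.
recover() {
  revision=$1; shift
  run etcdctl compact "$revision" || return 1
  for endpoint in "$@"; do
    run etcdctl defrag --endpoints="$endpoint" || return 1
  done
  run etcdctl alarm disarm
}

# Dry run with a placeholder revision and endpoint: prints the three commands
DRY_RUN=1 recover 12345 https://127.0.0.1:2379
```

Running the dry run first is a cheap way to check the endpoint list before touching a cluster that is already in trouble.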
Step 2: Increase Quota (Temporary)
# If compaction alone isn't enough, increase quota
# Edit etcd static pod manifest
vim /etc/kubernetes/manifests/etcd.yaml
# Add/modify:
spec:
  containers:
  - command:
    - etcd
    - --quota-backend-bytes=4294967296 # 4GB
    # ... other flags
# etcd will restart automatically
# WARNING: This is treating symptom, not cause
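The flag takes a raw byte count, which is easy to mistype. A tiny helper (the name `quota_flag` is made up here) can generate it from a size in GiB. etcd's documentation suggests 8GiB as the practical upper bound for normal environments, so treat larger values with caution.

```shell
#!/bin/sh
# quota_flag GIB: print the etcd quota flag for a size given in GiB
quota_flag() {
  echo "--quota-backend-bytes=$(( $1 * 1024 * 1024 * 1024 ))"
}

quota_flag 4
```

For 4 this prints `--quota-backend-bytes=4294967296`, matching the value used in the manifest.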
Step 3: Enable Auto-Compaction
# etcd auto-compaction (edit etcd manifest)
spec:
  containers:
  - command:
    - etcd
    - --auto-compaction-mode=periodic
    - --auto-compaction-retention=1h # Keep 1 hour of history
# For kube-apiserver (edit apiserver manifest)
spec:
  containers:
  - command:
    - kube-apiserver
    - --etcd-compaction-interval=5m0s # Compact every 5 minutes
Step 4: Clean Up Events
# Events are often the biggest space consumer
# Delete old events
kubectl delete events --all -A
# Or set a shorter event TTL
# In apiserver:
--event-ttl=30m # Default is 1h; check what yours is set to
Step 5: Set Up Regular Defragmentation
# CronJob for regular defragmentation
apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-defrag
  namespace: kube-system
spec:
  schedule: "0 2 * * *" # Daily at 2 AM
  jobTemplate:
    spec:
      template:
        spec:
          hostNetwork: true
          containers:
          - name: etcd-defrag
            image: bitnami/etcd:3.5
            command:
            - /bin/sh
            - -c
            - |
              etcdctl defrag \
                --endpoints=https://127.0.0.1:2379 \
                --cacert=/etc/kubernetes/pki/etcd/ca.crt \
                --cert=/etc/kubernetes/pki/etcd/server.crt \
                --key=/etc/kubernetes/pki/etcd/server.key
            volumeMounts:
            - name: etcd-certs
              mountPath: /etc/kubernetes/pki/etcd
              readOnly: true
          volumes:
          - name: etcd-certs
            hostPath:
              path: /etc/kubernetes/pki/etcd
          restartPolicy: OnFailure
          nodeSelector:
            node-role.kubernetes.io/control-plane: ""
          tolerations:
          - effect: NoSchedule
            key: node-role.kubernetes.io/control-plane
Step 6: Reduce Write Volume
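# Reduce event spam from kubelets
# (--event-qps / --event-burst are kubelet settings; kube-controller-manager
# has no such flags, only the general --kube-api-qps=20 / --kube-api-burst=30 defaults)
# In the kubelet config file (KubeletConfiguration):
eventRecordQPS: 5 # Max events per second; the default depends on your Kubernetes version
eventBurst: 10 # Short-term burst allowance above eventRecordQPS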
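# Reduce leader election churn
# The leader rewrites its lease every retry period, so lengthen all three timers together
spec:
  containers:
  - command:
    - kube-controller-manager
    - --leader-elect-lease-duration=30s # Default 15s
    - --leader-elect-renew-deadline=20s # Default 10s
    - --leader-elect-retry-period=4s # Default 2s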
Monitoring
groups:
- name: etcd
  rules:
  - alert: EtcdDatabaseSizeHigh
    expr: |
      etcd_mvcc_db_total_size_in_bytes /
      etcd_server_quota_backend_bytes > 0.8
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "etcd database {{ $value | humanizePercentage }} of quota"
  - alert: EtcdDatabaseSpaceExceeded
    expr: |
      etcd_server_has_leader == 1 and
      etcd_mvcc_db_total_size_in_bytes > etcd_server_quota_backend_bytes
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "etcd quota exceeded - cluster read-only!"
  - alert: EtcdCompactionPaused
    expr: |
      increase(etcd_debugging_mvcc_db_compaction_total_duration_milliseconds_count[1h]) == 0
    for: 2h
    labels:
      severity: warning
    annotations:
      summary: "etcd compaction hasn't run in 2 hours"
  - alert: EtcdDefragNeeded
    expr: |
      (etcd_mvcc_db_total_size_in_bytes - etcd_mvcc_db_total_size_in_use_in_bytes) /
      etcd_mvcc_db_total_size_in_bytes > 0.5
    for: 1h
    labels:
      severity: warning
    annotations:
      summary: "etcd has >50% reclaimable space - defrag needed"
Checklist
## etcd Quota Alarm Recovery
### Emergency Recovery
- [ ] Check alarm status: etcdctl alarm list
- [ ] Get current revision for compaction
- [ ] Run compaction: etcdctl compact $REVISION
- [ ] Run defragmentation: etcdctl defrag
- [ ] Disarm alarm: etcdctl alarm disarm
- [ ] Verify cluster is writable
### Prevention
- [ ] Enable auto-compaction in etcd (--auto-compaction-retention)
- [ ] Enable compaction in apiserver (--etcd-compaction-interval)
- [ ] Set up scheduled defragmentation
- [ ] Delete old events regularly
- [ ] Monitor DB size vs quota
### Capacity Planning
- [ ] Size quota appropriately (start 2GB, increase as needed)
- [ ] Estimate write rate and retention needs
- [ ] Consider 3+ node etcd cluster for HA
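For the capacity-planning items, a rough time-to-quota estimate is often enough. The sketch below uses a hypothetical `hours_until_quota` helper and assumes a constant growth rate that you have measured yourself, for example from the change in `etcd_mvcc_db_total_size_in_bytes` over a day. The 10MiB/hour figure in the example is invented for illustration.

```shell
#!/bin/sh
# hours_until_quota DB_SIZE QUOTA GROWTH_PER_HOUR   (bytes, bytes, bytes/hour)
# Whole hours until the DB reaches the quota, assuming linear growth.
hours_until_quota() {
  if [ "$3" -le 0 ]; then echo "no growth"; return 0; fi
  if [ "$1" -ge "$2" ]; then echo 0; return 0; fi
  echo $(( ($2 - $1) / $3 ))
}

# 1GiB used, 2GiB quota, growing 10MiB per hour (made-up rate)
hours_until_quota 1073741824 2147483648 10485760
```

In that example the remaining 1GiB lasts about 102 hours, a little over four days. Real growth is rarely linear, so treat the result as an early-warning estimate, not a deadline.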
Conclusion
The lesson: etcd’s MVCC design keeps all history until compaction runs. Without regular compaction and defragmentation, your cluster will hit the quota and become read-only.
Key principles:
- Every write adds to DB size - history accumulates fast
- Compaction removes history - but doesn’t free disk space
- Defragmentation frees space - must run after compaction
- Events are often the biggest consumer - clean them regularly