etcd Watch Replay Storms: When Giant ConfigMaps Kill the Control Plane
etcd is quiet until watch replays turn into a storm.
“The Kubernetes apiserver is randomly slow.” The pattern was maddening. For about 50 minutes out of every hour, the cluster was perfectly responsive. Then, like clockwork, the apiserver would become sluggish: simple kubectl get pods commands took 5-30 seconds instead of 50 milliseconds. After 2-5 minutes, it would recover. We added more CPU, more memory, more apiservers. Nothing helped.
The breakthrough came when we noticed the timing. The slowdowns happened almost exactly on the hour, sometimes a few minutes past. We had a certificate rotation job that ran hourly. It updated a 500KB ConfigMap containing a bundle of CA certificates. That update triggered a cascade of events that brought the control plane to its knees.
Here’s what was happening: Kubernetes controllers use watches to stay synchronized with cluster state. A watch says “tell me about all changes since revision X.” If revision X has been compacted (deleted from etcd’s history), the watch fails with a “410 Gone” error and the controller has to relist: fetch all resources of that type from scratch. A large ConfigMap update generates a large etcd write that every ConfigMap watcher has to receive in full. Watchers that are slow to consume it fall behind while the revision counter keeps advancing and compaction keeps trimming history, which invalidates their watches, which triggers relists from 150+ controllers at roughly the same time. Each relist fetches thousands of objects. The apiserver and etcd become overwhelmed. More watches fall behind. More relists trigger. It’s a cascade.
This is one of those problems where everything looks fine from the outside. CPU utilization is moderate. Memory is available. etcd health checks pass. But the internal machinery of watches and revisions is in chaos.
Environment: Kubernetes 1.27, 3-node etcd cluster, 150+ controllers watching various resources
The Problem
Symptoms
What we observed:
Normal day:
kubectl get pods → 50ms
API latency p99 → 100ms
etcd latency p99 → 10ms
During "random slowdowns":
kubectl get pods → 5-30 seconds
API latency p99 → 10+ seconds
etcd latency p99 → 500ms+
Pattern:
- Happens roughly once per hour
- Lasts 2-5 minutes
- No obvious trigger in application layer
- Correlates with... certificate rotation?
Why Standard Monitoring Misses This
# CPU and memory look fine
kubectl top pods -n kube-system
# etcd: CPU 30%, Memory 40%
# apiserver: CPU 50%, Memory 60%
# etcd health says everything is fine
etcdctl endpoint health
# healthy: true
# The problem is hidden in watch mechanics
# and only visible in specific metrics
Root Cause
How Watches Work
Normal watch flow:
Controller: "Watch pods starting from revision 12345"
↓
apiserver: Proxies watch to etcd
↓
etcd: "Here's events since 12345:
- Pod A created (rev 12346)
- Pod B deleted (rev 12347)
- ..."
↓
Controller: Processes events one by one
What Happens With Compaction
etcd compaction scenario:
Controller: "Watch pods starting from revision 12345"
↓
etcd: "Sorry, revision 12345 was compacted!
Earliest available is 50000"
↓
apiserver: Returns error to controller
↓
Controller: "I need to RELIST everything!"
Lists ALL pods (could be thousands)
Creates new watch from current revision
Now imagine 150 controllers all hit this at once...
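The two flows above can be condensed into a toy model. The following is a minimal sketch, not etcd's or client-go's actual implementation: Store, Resume, and ErrCompacted are hypothetical names, and the store simply keeps a list of events plus a compaction watermark. It shows why a watcher whose last-seen revision lies below the watermark cannot resume and has to relist.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrCompacted stands in for the "revision was compacted" answer in the flow above.
var ErrCompacted = errors.New("requested revision has been compacted")

// Event is one change recorded at a given revision.
type Event struct {
	Rev int64
	Key string
}

// Store is a toy stand-in for a revisioned history (hypothetical type).
type Store struct {
	events    []Event // history still retained
	compacted int64   // revisions <= compacted have been dropped
	rev       int64   // current revision
}

// Put records a change and advances the revision by one.
func (s *Store) Put(key string) {
	s.rev++
	s.events = append(s.events, Event{Rev: s.rev, Key: key})
}

// Compact drops all history at or below rev.
func (s *Store) Compact(rev int64) {
	kept := s.events[:0]
	for _, e := range s.events {
		if e.Rev > rev {
			kept = append(kept, e)
		}
	}
	s.events = kept
	s.compacted = rev
}

// Since returns the events after rev, or ErrCompacted if part of that history is gone.
func (s *Store) Since(rev int64) ([]Event, error) {
	if rev < s.compacted {
		return nil, ErrCompacted
	}
	var out []Event
	for _, e := range s.events {
		if e.Rev > rev {
			out = append(out, e)
		}
	}
	return out, nil
}

// Resume tries to continue a watch from lastSeen. When the history is gone it
// reports that a full relist is needed; next is the revision to watch from afterwards.
func Resume(s *Store, lastSeen int64) (events []Event, relist bool, next int64) {
	evs, err := s.Since(lastSeen)
	if errors.Is(err, ErrCompacted) {
		return nil, true, s.rev
	}
	return evs, false, s.rev
}

func main() {
	s := &Store{}
	for i := 0; i < 10; i++ {
		s.Put(fmt.Sprintf("pod-%d", i))
	}
	s.Compact(5)

	_, relist, _ := Resume(s, 7)
	fmt.Println("watcher at rev 7 must relist:", relist)
	_, relist, _ = Resume(s, 3)
	fmt.Println("watcher at rev 3 must relist:", relist)
}
```

In this model the fast watcher (revision 7) picks up three events and carries on, while the slow one (revision 3) has lost its place and pays for a full list. Multiply the second case by 150 controllers and you have the storm.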
The Giant Object Problem
The cascade:
1. Giant ConfigMap (500KB) updated
└─▶ Generates large etcd transaction
2. Transaction increases revision rapidly
└─▶ Older revisions get compacted faster
3. Slow watcher falls behind
└─▶ Its watched revision gets compacted
└─▶ Controller forced to relist
4. Relist of thousands of objects
└─▶ Adds load to apiserver
└─▶ Other watchers fall behind
└─▶ More compactions
└─▶ More relists
└─▶ STORM!
┌─────────────────────────────────────────────────┐
│                    The Storm                    │
│                                                 │
│  ConfigMap      Large         Fast              │
│  update    →   revision   →   compaction        │
│                increase                         │
│                    ↓                            │
│  Watcher 1 behind → Relist 10000 objects        │
│  Watcher 2 behind → Relist 5000 objects         │
│  Watcher N behind → Relist ...                  │
│                    ↓                            │
│              API Server overloaded              │
│            More watchers fall behind            │
│                    Repeat...                    │
└─────────────────────────────────────────────────┘
Diagnosis
Step 1: Find the Large Objects
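# Check object sizes in etcd
# (add --endpoints/--cacert/--cert/--key as your cluster requires)
export ETCDCTL_API=3
etcdctl get "" --prefix --keys-only | while IFS= read -r key; do
  [ -z "$key" ] && continue  # --keys-only prints a blank line after each key
  size=$(etcdctl get "$key" --print-value-only 2>/dev/null | wc -c)
  echo "$size $key"
done | sort -rn | head -20
# Or via kubectl (ConfigMaps without .data count as 0 instead of breaking jq)
kubectl get configmaps -A -o json | \
  jq -r '.items[] | "\(.metadata.namespace)/\(.metadata.name) \((.data // {}) | to_entries | map(.value | utf8bytelength) | add // 0)"' | \
  sort -t' ' -k2 -rn | head -20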
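If jq is not available on the box you are debugging from, the same ranking is easy to do in Go. This is a minimal sketch that assumes the input is the JSON produced by kubectl get configmaps -A -o json; rankBySize and the struct names are made up for this example, and binaryData is measured in its base64 form, so those numbers are approximate.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// configMapList holds only the fields needed from a ConfigMapList.
type configMapList struct {
	Items []struct {
		Metadata struct {
			Namespace string `json:"namespace"`
			Name      string `json:"name"`
		} `json:"metadata"`
		Data       map[string]string `json:"data"`
		BinaryData map[string]string `json:"binaryData"` // base64 text in JSON
	} `json:"items"`
}

// sized pairs a namespace/name with its approximate payload size.
type sized struct {
	Name  string
	Bytes int
}

// rankBySize returns all ConfigMaps in raw, largest payload first.
func rankBySize(raw []byte) ([]sized, error) {
	var list configMapList
	if err := json.Unmarshal(raw, &list); err != nil {
		return nil, err
	}
	out := make([]sized, 0, len(list.Items))
	for _, it := range list.Items {
		n := 0
		for k, v := range it.Data {
			n += len(k) + len(v)
		}
		for k, v := range it.BinaryData {
			n += len(k) + len(v)
		}
		out = append(out, sized{Name: it.Metadata.Namespace + "/" + it.Metadata.Name, Bytes: n})
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Bytes > out[j].Bytes })
	return out, nil
}

func main() {
	// Inline sample; in practice feed in the kubectl output instead.
	sample := []byte(`{"items":[
		{"metadata":{"namespace":"default","name":"small"},"data":{"k":"v"}},
		{"metadata":{"namespace":"kube-system","name":"ca-bundle"},"data":{"ca-bundle.crt":"0123456789"}}
	]}`)
	ranked, err := rankBySize(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, r := range ranked {
		fmt.Printf("%8d %s\n", r.Bytes, r.Name)
	}
}
```

The sizes count keys plus values only, which is close enough to spot a 500KB outlier among kilobyte-sized neighbours.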
Step 2: Check Compaction Rate
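# etcd compaction metrics
# (kubeadm clusters serve these on http://127.0.0.1:2381/metrics;
#  port 2379 normally requires TLS client certificates)
curl -s http://localhost:2379/metrics | grep mvcc_db_compaction
# High compaction pause duration = problem
# etcd_debugging_mvcc_db_compaction_pause_duration_milliseconds_bucket
# Check current revision vs compacted revision
curl -s http://localhost:2379/metrics | grep -E 'mvcc_(current|compact)_revision'
# The gap between the two is the history a lagging watcher can still replay.
# The current revision is also available from etcdctl (RAFT INDEX is not the revision):
etcdctl endpoint status --write-out=json | jq '.[].Status.header.revision'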
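To track the distance between the current and the compacted revision over time, a few lines of Go can pull both numbers out of a metrics scrape. This is a minimal sketch: it assumes the scrape is in Prometheus text format and that your etcd version exposes the gauges etcd_debugging_mvcc_current_revision and etcd_debugging_mvcc_compact_revision. Debugging metrics are not a stable interface, so check the names against your own /metrics output first. The helper names gauge and revisionWindow are made up for this example.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// gauge returns the value of an unlabelled metric from Prometheus text output.
func gauge(metrics, name string) (float64, bool) {
	sc := bufio.NewScanner(strings.NewReader(metrics))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		// Sample lines look like "<name> <value>"; HELP/TYPE lines start with "#".
		if len(fields) == 2 && fields[0] == name {
			v, err := strconv.ParseFloat(fields[1], 64)
			return v, err == nil
		}
	}
	return 0, false
}

// revisionWindow reports how many revisions of history are still replayable.
func revisionWindow(metrics string) (int64, error) {
	cur, okCur := gauge(metrics, "etcd_debugging_mvcc_current_revision")
	cmp, okCmp := gauge(metrics, "etcd_debugging_mvcc_compact_revision")
	if !okCur || !okCmp {
		return 0, fmt.Errorf("revision gauges not found in scrape")
	}
	return int64(cur - cmp), nil
}

func main() {
	// Illustrative scrape, not real cluster data.
	scrape := `# TYPE etcd_debugging_mvcc_compact_revision gauge
etcd_debugging_mvcc_compact_revision 50000
# TYPE etcd_debugging_mvcc_current_revision gauge
etcd_debugging_mvcc_current_revision 51200
`
	window, err := revisionWindow(scrape)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("revisions still available to watchers:", window)
}
```

A window that keeps collapsing to near zero right after each ConfigMap update is a sign that lagging watchers have very little room before they are forced to relist.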
Step 3: Watch Restart Metrics
# Check how often watches are restarting
# (the insecure port 8080 no longer exists in Kubernetes 1.27,
#  so read the metrics through the authenticated API)
kubectl get --raw /metrics | grep watch
# Look for:
# apiserver_watch_events_total
# apiserver_watch_events_sizes
# apiserver_longrunning_requests (for watch counts)
# High rate of watch establishment = problem
Step 4: Find the Trigger
# Check what's updating frequently
kubectl get events -A --sort-by='.lastTimestamp' | tail -50
# Check who last wrote a large ConfigMap, and when
kubectl get configmap <suspicious-cm> -n <ns> --show-managed-fields -o json | \
  jq '.metadata.managedFields[] | {manager, operation, time}'
The Fix
Option 1: Shrink the Large Objects
# Before: Giant ConfigMap with all certs
apiVersion: v1
kind: ConfigMap
metadata:
  name: all-certificates
data:
  ca-bundle.crt: |
    -----BEGIN CERTIFICATE-----
    ... 500KB of certificates ...
    -----END CERTIFICATE-----
# After: Split into multiple smaller ConfigMaps
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: ca-cert-1
data:
  ca.crt: |
    ... 50KB ...
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: ca-cert-2
data:
  ca.crt: |
    ... 50KB ...
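Splitting by hand gets old quickly, so here is one way the rotation job could do it. This is a minimal sketch using only encoding/pem: splitBundle is a made-up helper that packs whole PEM blocks into chunks no larger than a caller-chosen limit, and sampleBundle only fabricates dummy blocks for the demo. Writing each chunk into its own ConfigMap is left out.

```go
package main

import (
	"bytes"
	"encoding/pem"
	"fmt"
)

// splitBundle groups the PEM blocks in bundle into chunks of at most maxBytes.
// A single block larger than maxBytes still gets a chunk of its own.
func splitBundle(bundle []byte, maxBytes int) [][]byte {
	var chunks [][]byte
	var cur bytes.Buffer
	flush := func() {
		if cur.Len() > 0 {
			chunks = append(chunks, append([]byte(nil), cur.Bytes()...))
			cur.Reset()
		}
	}
	rest := bundle
	for {
		var block *pem.Block
		block, rest = pem.Decode(rest)
		if block == nil {
			break // no more PEM blocks
		}
		enc := pem.EncodeToMemory(block)
		if cur.Len()+len(enc) > maxBytes {
			flush() // current chunk is full, start a new one
		}
		cur.Write(enc)
	}
	flush()
	return chunks
}

// sampleBundle builds n dummy CERTIFICATE blocks of size bytes each (demo data only).
func sampleBundle(n, size int) []byte {
	var buf bytes.Buffer
	for i := 0; i < n; i++ {
		pem.Encode(&buf, &pem.Block{Type: "CERTIFICATE", Bytes: make([]byte, size)})
	}
	return buf.Bytes()
}

func main() {
	bundle := sampleBundle(10, 1000)
	for i, chunk := range splitBundle(bundle, 5000) {
		fmt.Printf("ca-cert-%d: %d bytes\n", i+1, len(chunk))
	}
}
```

Because certificates are never cut in half, every chunk is a valid bundle on its own, and only the ConfigMaps whose chunk actually changed need an update.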
Option 2: Reduce Update Frequency
# If rotating certs every hour, consider:
# 1. Rotate less frequently (every 6 hours)
# 2. Update only when content actually changes
# Check if updates are actually changing content
kubectl get configmap cert-bundle -o jsonpath='{.data}' | md5sum
# Run again after "update" - if same hash, update is unnecessary
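The same check can live inside the rotation job so that a no-op rotation never reaches the apiserver. This is a minimal sketch with made-up helper names (dataHash, needsUpdate); it hashes the data map in sorted key order so the result does not depend on map iteration order.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// dataHash returns a deterministic SHA-256 over the keys and values of a
// ConfigMap-style data map.
func dataHash(data map[string]string) string {
	keys := make([]string, 0, len(data))
	for k := range data {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	h := sha256.New()
	for _, k := range keys {
		// Length-prefix each field so {"ab":"c"} and {"a":"bc"} hash differently.
		fmt.Fprintf(h, "%d:%s%d:%s", len(k), k, len(data[k]), data[k])
	}
	return hex.EncodeToString(h.Sum(nil))
}

// needsUpdate reports whether desired differs from what is currently stored.
func needsUpdate(current, desired map[string]string) bool {
	return dataHash(current) != dataHash(desired)
}

func main() {
	current := map[string]string{"ca-bundle.crt": "cert-a\ncert-b\n"}
	rotated := map[string]string{"ca-bundle.crt": "cert-a\ncert-b\n"}

	if needsUpdate(current, rotated) {
		fmt.Println("content changed: update the ConfigMap")
	} else {
		fmt.Println("content identical: skip the write")
	}
}
```

Storing the hash in an annotation on the ConfigMap would save fetching the full data on every run, at the cost of one more field to keep in sync.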
Option 3: Move Large Data Out of etcd
# Use external storage for large data
apiVersion: v1
kind: Secret
metadata:
  name: ca-bundle-reference
type: Opaque
stringData:
  # Store URL or reference, not the actual data
  bundle-url: "https://internal-ca-server/bundle.pem"
  bundle-hash: "sha256:abc123..."
# Application fetches from URL, validates with hash
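On the application side, "fetches from URL, validates with hash" could look roughly like this. It is a minimal sketch under a few assumptions: the reference has the form sha256:<hex digest> as in the manifest above, the 4 MiB download cap and the 10 second timeout are arbitrary choices, and fetchBundle, verifyBundle, and digestRef are made-up names.

```go
package main

import (
	"crypto/sha256"
	"crypto/subtle"
	"encoding/hex"
	"errors"
	"fmt"
	"io"
	"net/http"
	"strings"
	"time"
)

const maxBundleBytes = 4 << 20 // assumed upper bound for a bundle download

// digestRef formats the SHA-256 of body as "sha256:<hex>".
func digestRef(body []byte) string {
	sum := sha256.Sum256(body)
	return "sha256:" + hex.EncodeToString(sum[:])
}

// verifyBundle checks body against a reference of the form "sha256:<hex digest>".
func verifyBundle(body []byte, ref string) error {
	if !strings.HasPrefix(ref, "sha256:") {
		return fmt.Errorf("unsupported hash reference %q", ref)
	}
	want, err := hex.DecodeString(strings.TrimPrefix(ref, "sha256:"))
	if err != nil {
		return fmt.Errorf("malformed digest: %w", err)
	}
	got := sha256.Sum256(body)
	if subtle.ConstantTimeCompare(got[:], want) != 1 {
		return errors.New("bundle hash mismatch")
	}
	return nil
}

// fetchBundle downloads the bundle and only returns it if the hash matches.
func fetchBundle(url, ref string) ([]byte, error) {
	client := &http.Client{Timeout: 10 * time.Second}
	resp, err := client.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status %s", resp.Status)
	}
	body, err := io.ReadAll(io.LimitReader(resp.Body, maxBundleBytes))
	if err != nil {
		return nil, err
	}
	if err := verifyBundle(body, ref); err != nil {
		return nil, err
	}
	return body, nil
}

func main() {
	// Offline demo of the verification step; fetchBundle needs a reachable server.
	bundle := []byte("-----BEGIN CERTIFICATE-----\n...\n-----END CERTIFICATE-----\n")
	ref := digestRef(bundle)
	fmt.Println("intact bundle:", verifyBundle(bundle, ref))
	fmt.Println("tampered bundle:", verifyBundle(append(bundle, 'x'), ref))
}
```

With this split, a rotation only changes the two short strings in the Secret, which is a tiny write compared to pushing the whole bundle through etcd.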
Option 4: Tune etcd Compaction
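# etcd configuration
# Increase compaction retention to give watchers more time
auto-compaction-mode: "periodic"  # in "revision" mode the value below would mean "keep 8 revisions"
auto-compaction-retention: "8h"   # keep 8 hours of history (etcd's own default is "0", disabled)
# Note: kube-apiserver also compacts etcd by itself
# (--etcd-compaction-interval, default 5m), so that flag has to be raised too
# But careful: more retention = more disk space
# May need to increase quota-backend-bytes
quota-backend-bytes: 8589934592 # 8 GiB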
Option 5: Use Delta Updates
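// Needs imports: "context", "log", "k8s.io/apimachinery/pkg/types",
// and metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
// Instead of replacing the entire ConfigMap,
// use a strategic merge patch to send only the changed fields
patch := []byte(`{"data":{"new-cert.pem":"..."}}`)
_, err := clientset.CoreV1().ConfigMaps(namespace).Patch(
	context.TODO(),
	"ca-bundle",
	types.StrategicMergePatchType,
	patch,
	metav1.PatchOptions{},
)
if err != nil {
	log.Printf("patching ca-bundle failed: %v", err)
}
// Smaller request = less work per update. Note that the apiserver still
// writes the merged object to etcd and sends it to watchers in full,
// so patching complements splitting the object rather than replacing it.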
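The patch body itself does not have to be written by hand. Below is a minimal sketch that derives it from the old and new data maps, following JSON merge patch conventions where a removed key is sent as null. buildDataPatch is a made-up helper, and passing its result to the Patch call above is left to the caller.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// buildDataPatch returns a merge patch that contains only the keys of .data
// that differ between oldData and newData. Removed keys are set to null.
// It returns nil when there is nothing to change.
func buildDataPatch(oldData, newData map[string]string) ([]byte, error) {
	diff := map[string]*string{}
	for k, v := range newData {
		if old, ok := oldData[k]; !ok || old != v {
			v := v // take the address of a per-iteration copy
			diff[k] = &v
		}
	}
	for k := range oldData {
		if _, ok := newData[k]; !ok {
			diff[k] = nil // null deletes the key
		}
	}
	if len(diff) == 0 {
		return nil, nil
	}
	return json.Marshal(map[string]interface{}{"data": diff})
}

func main() {
	oldData := map[string]string{"a.pem": "1", "b.pem": "2", "c.pem": "3"}
	newData := map[string]string{"a.pem": "1", "b.pem": "9", "d.pem": "4"}

	patch, err := buildDataPatch(oldData, newData)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(string(patch))
	// prints {"data":{"b.pem":"9","c.pem":null,"d.pem":"4"}}
}
```

A nil result doubles as the "nothing changed, skip the write" signal, so the job never sends an empty patch.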
Monitoring
Key Metrics
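groups:
  - name: etcd-watch
    rules:
      - alert: EtcdHighCompactionRate
        # etcd exports this histogram under the etcd_debugging_ prefix;
        # tune the threshold to your compaction interval
        expr: |
          rate(etcd_debugging_mvcc_db_compaction_total_duration_milliseconds_count[5m]) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High etcd compaction rate"
      - alert: WatchCacheHitRate
        # Share of LIST calls answered from the watch cache instead of etcd
        expr: |
          sum(rate(apiserver_cache_list_total[5m])) /
          (sum(rate(apiserver_cache_list_total[5m])) + sum(rate(apiserver_storage_list_total[5m]))) < 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low watch cache hit rate - possible relist storms"
      - alert: LargeEtcdObject
        # etcd does not export per-object sizes; this rule assumes a custom
        # exporter that publishes etcd_object_size_bytes with a "key" label
        expr: |
          etcd_object_size_bytes > 100000
        for: 5m
        labels:
          severity: info
        annotations:
          summary: "Large object in etcd: {{ $labels.key }}"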
Dashboard Queries
# Grafana queries for etcd watch health
# Watch event rate by kind (this metric is labelled group/version/kind, not resource)
sum(rate(apiserver_watch_events_total[5m])) by (kind)
# List requests (indicator of relist storms)
sum(rate(apiserver_request_total{verb="LIST"}[5m])) by (resource)
# etcd request latency p99 by operation, as measured by the apiserver
histogram_quantile(0.99, sum(rate(etcd_request_duration_seconds_bucket[5m])) by (le, operation))
Checklist
## etcd Watch Replay Storms
### Symptoms
- [ ] Random apiserver slowdowns
- [ ] High list request rate
- [ ] Pattern correlates with config updates
- [ ] Lasts 2-5 minutes, then recovers
### Diagnosis
- [ ] Find large objects in etcd (>100KB)
- [ ] Check object update frequency
- [ ] Monitor etcd compaction rate
- [ ] Check watch restart metrics
### Fixes
- [ ] Split large ConfigMaps/Secrets
- [ ] Reduce update frequency
- [ ] Use delta updates instead of full replace
- [ ] Move large data out of etcd
- [ ] Tune compaction retention
### Prevention
- [ ] Set object size limits in admission control
- [ ] Monitor for large objects proactively
- [ ] Use external storage for large data
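For the "object size limits in admission control" item, a validating webhook is one way to do it. The following is a minimal sketch of the decision logic only. It models just the parts of the admission.k8s.io/v1 AdmissionReview JSON it needs, the type and function names are made up, and TLS setup plus the ValidatingWebhookConfiguration that would point at the handler are left out.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"io"
	"net/http"
)

// Minimal subset of the admission.k8s.io/v1 AdmissionReview wire format.
type admissionReview struct {
	APIVersion string             `json:"apiVersion"`
	Kind       string             `json:"kind"`
	Request    *admissionRequest  `json:"request,omitempty"`
	Response   *admissionResponse `json:"response,omitempty"`
}

type admissionRequest struct {
	UID    string          `json:"uid"`
	Object json.RawMessage `json:"object"` // kept raw so its size can be measured
}

type admissionResponse struct {
	UID     string         `json:"uid"`
	Allowed bool           `json:"allowed"`
	Status  *statusMessage `json:"status,omitempty"`
}

type statusMessage struct {
	Message string `json:"message"`
}

// decide rejects any object whose serialized form exceeds maxBytes.
func decide(req admissionRequest, maxBytes int) admissionResponse {
	resp := admissionResponse{UID: req.UID, Allowed: true}
	if n := len(req.Object); n > maxBytes {
		resp.Allowed = false
		resp.Status = &statusMessage{
			Message: fmt.Sprintf("object is %d bytes, limit is %d", n, maxBytes),
		}
	}
	return resp
}

// review turns a raw AdmissionReview request into a raw AdmissionReview response.
func review(body []byte, maxBytes int) ([]byte, error) {
	var in admissionReview
	if err := json.Unmarshal(body, &in); err != nil {
		return nil, err
	}
	if in.Request == nil {
		return nil, errors.New("AdmissionReview has no request")
	}
	resp := decide(*in.Request, maxBytes)
	return json.Marshal(admissionReview{APIVersion: in.APIVersion, Kind: in.Kind, Response: &resp})
}

// handler is what a webhook server would mount on its HTTPS endpoint.
func handler(maxBytes int) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		body, err := io.ReadAll(r.Body)
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		out, err := review(body, maxBytes)
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		w.Write(out)
	}
}

func main() {
	// Offline demo: a tiny ConfigMap checked against a 100KB limit.
	req := []byte(`{"apiVersion":"admission.k8s.io/v1","kind":"AdmissionReview",` +
		`"request":{"uid":"demo","object":{"kind":"ConfigMap","data":{"a":"b"}}}}`)
	out, err := review(req, 100*1024)
	fmt.Println(string(out), err)
}
```

The limit here matches the 100KB threshold from the diagnosis checklist; scoping the webhook to ConfigMaps and Secrets keeps it away from resources that are legitimately large.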
Conclusion
This failure mode reveals the hidden complexity behind Kubernetes’ “declarative” model. When you update a ConfigMap, you think you’re just changing some configuration. But under the hood, you’re generating an etcd transaction that affects revision numbers, compaction timelines, watch validity, and ultimately the behavior of every controller in the cluster. The system is deeply interconnected in ways that aren’t visible from the API layer.
The “random slowness” symptom is particularly insidious because it doesn’t point to any obvious cause. The apiserver isn’t out of memory. etcd isn’t out of disk space. Network isn’t congested. The problem is in the machinery of watches and revisions—a layer of abstraction that most operators never need to think about, until it breaks.
The fix is usually simple once you understand the problem: don’t store large, frequently-updated objects in etcd. Split large ConfigMaps. Use external storage for certificate bundles. Update only what actually changed. These are good practices anyway, but they become critical when you have a busy cluster with many controllers.
Key takeaways:
- Symptom is “random slowness” - which makes root cause analysis difficult
- Standard resources look fine - CPU/memory utilization doesn’t explain the problem
- Root cause is interaction - between object size, update frequency, and watch mechanics
- Fix requires understanding etcd internals - not just Kubernetes APIs
The key insight: etcd isn’t just a key-value store—it’s a versioned, watched database with its own internal dynamics. Large, frequently-updated objects can destabilize the entire watch machinery, affecting the responsiveness of the entire cluster. When debugging Kubernetes control plane issues, always consider what’s happening at the etcd level.