etcd Watch Replay Storms: When Giant ConfigMaps Kill the Control Plane

Tags: kubernetes, etcd, control-plane, debugging, configmap, performance

etcd is quiet until watch replays turn into a storm.

The report we kept hearing: “The Kubernetes apiserver is randomly slow.” The pattern was maddening. For about 50 minutes out of every hour, the cluster was perfectly responsive. Then, like clockwork, the apiserver would become sluggish, with simple kubectl get pods commands taking 5-30 seconds instead of 50 milliseconds. After 2-5 minutes, it would recover. We added more CPU, more memory, more apiservers. Nothing helped.

The breakthrough came when we noticed the timing. The slowdowns happened almost exactly on the hour, sometimes a few minutes past. We had a certificate rotation job that ran hourly. It updated a 500KB ConfigMap containing a bundle of CA certificates. That update triggered a cascade of events that brought the control plane to its knees.

Here’s what was happening: Kubernetes controllers use watches to stay synchronized with cluster state. A watch says “tell me about all changes since revision X.” If revision X has been compacted (deleted from etcd’s history), the controller has to relist: fetch all resources of that type from scratch.

Every write advances etcd’s revision counter by one, no matter how large the object is, so size does not speed up the counter. What size changes is the cost of each step: a 500KB update is a large etcd write and a large watch event that must be serialized and delivered to every watcher of that resource. Watchers that cannot keep up fall behind, and once the revision they are waiting on is compacted, their watch is invalidated and they relist. With 150+ controllers, many relist at the same time. Each relist fetches thousands of objects. The apiserver and etcd become overwhelmed. More watches fall behind. More relists trigger. It’s a cascade.

This is one of those problems where everything looks fine from the outside. CPU utilization is moderate. Memory is available. etcd health checks pass. But the internal machinery of watches and revisions is in chaos.

Environment: Kubernetes 1.27, 3-node etcd cluster, 150+ controllers watching various resources

The Problem

Symptoms

What we observed:

Normal day:
  kubectl get pods        → 50ms
  API latency p99         → 100ms
  etcd latency p99        → 10ms

During "random slowdowns":
  kubectl get pods        → 5-30 seconds
  API latency p99         → 10+ seconds
  etcd latency p99        → 500ms+

Pattern:
- Happens roughly once per hour
- Lasts 2-5 minutes
- No obvious trigger in application layer
- Correlates with... certificate rotation?

Why Standard Monitoring Misses This

# CPU and memory look fine
kubectl top pods -n kube-system
# etcd: CPU 30%, Memory 40%
# apiserver: CPU 50%, Memory 60%

# etcd health says everything is fine
etcdctl endpoint health
# healthy: true

# The problem is hidden in watch mechanics
# and only visible in specific metrics

Root Cause

How Watches Work

Normal watch flow:

Controller: "Watch pods starting from revision 12345"

apiserver:   Proxies watch to etcd

etcd:        "Here's events since 12345:
              - Pod A created (rev 12346)
              - Pod B deleted (rev 12347)
              - ..."

Controller:  Processes events one by one

What Happens With Compaction

etcd compaction scenario:

Controller: "Watch pods starting from revision 12345"

etcd:        "Sorry, revision 12345 was compacted!
              Earliest available is 50000"

apiserver:   Returns error to controller

Controller:  "I need to RELIST everything!"
              Lists ALL pods (could be thousands)
              Creates new watch from current revision

Now imagine 150 controllers all hit this at once...
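
To make the compaction rule concrete, here is a toy model in Go. It is a sketch of the idea only, not etcd's or the apiserver's implementation: the `store` type, `eventsSince`, and `ErrCompacted` are names invented for this illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrCompacted stands in for the "revision was compacted" error.
var ErrCompacted = errors.New("required revision has been compacted")

// event is one change in the toy history.
type event struct {
	rev int64
	key string
}

// store is a toy revisioned store (hypothetical, for illustration only).
type store struct {
	events       []event // retained history, oldest first
	rev          int64   // current revision
	compactedRev int64   // events at or below this revision are gone
}

// put records a change and advances the revision by exactly one.
func (s *store) put(key string) {
	s.rev++
	s.events = append(s.events, event{rev: s.rev, key: key})
}

// compact drops all history at or below rev.
func (s *store) compact(rev int64) {
	i := 0
	for i < len(s.events) && s.events[i].rev <= rev {
		i++
	}
	s.events = s.events[i:]
	if rev > s.compactedRev {
		s.compactedRev = rev
	}
}

// eventsSince returns every event after fromRev, or ErrCompacted when
// part of that range has already been dropped.
func (s *store) eventsSince(fromRev int64) ([]event, error) {
	if fromRev < s.compactedRev {
		return nil, ErrCompacted
	}
	var out []event
	for _, e := range s.events {
		if e.rev > fromRev {
			out = append(out, e)
		}
	}
	return out, nil
}

func main() {
	s := &store{}
	for i := 0; i < 10; i++ {
		s.put(fmt.Sprintf("pod-%d", i))
	}
	lastSeen := int64(3) // a slow watcher that stopped at revision 3
	s.compact(7)

	if _, err := s.eventsSince(lastSeen); errors.Is(err, ErrCompacted) {
		// No incremental catch-up is possible: list everything,
		// then resume watching from the current revision.
		fmt.Println("relist required; resume watch from revision", s.rev)
	}
}
```

A watcher that is at or past the compacted revision keeps receiving incremental events. One that is behind it has no option except the full list.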

The Giant Object Problem

The cascade:

1. Giant ConfigMap (500KB) updated
   └─▶ Generates large etcd transaction

2. Large write and watch event are slow to process
   └─▶ Watchers lag further behind the current revision

3. Slow watcher falls behind
   └─▶ Its watched revision gets compacted
        └─▶ Controller forced to relist

4. Relist of thousands of objects
   └─▶ Adds load to apiserver
        └─▶ Other watchers fall behind
             └─▶ More compactions
                  └─▶ More relists
                       └─▶ STORM!

┌─────────────────────────────────────────────────┐
│                 The Storm                       │
│                                                 │
│    ConfigMap    Large        Slow               │
│      update  →  watch     →  delivery           │
│                 event                           │
│         ↓                                       │
│    Watcher 1 behind → Relist 10000 objects      │
│    Watcher 2 behind → Relist 5000 objects       │
│    Watcher N behind → Relist ...                │
│         ↓                                       │
│    API Server overloaded                        │
│    More watchers fall behind                    │
│    Repeat...                                    │
└─────────────────────────────────────────────────┘
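
A rough worst-case estimate shows why this hurts. The 150 controllers come from our environment; the 5000 objects per relist and the 4 KiB average object size are assumptions picked for illustration, not measurements.

```go
package main

import "fmt"

// relistBytes estimates the payload moved if `controllers` clients each
// relist `objects` objects averaging `avgObjectBytes` bytes.
func relistBytes(controllers, objects, avgObjectBytes int64) int64 {
	return controllers * objects * avgObjectBytes
}

func main() {
	// 150 controllers: from the environment described above.
	// 5000 objects and 4096 bytes per object: illustrative assumptions.
	total := relistBytes(150, 5000, 4096)
	fmt.Printf("%d bytes (%.1f GiB)\n", total, float64(total)/(1<<30))
}
```

Under those assumptions a single synchronized relist moves about 2.9 GiB through the apiserver, and every byte of it has to be serialized before it is sent.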

Diagnosis

Step 1: Find the Large Objects

# Check object sizes in etcd
# (export the API version so the etcdctl call inside the loop sees it too)
export ETCDCTL_API=3
etcdctl get "" --prefix --keys-only | grep -v '^$' | while read -r key; do
  size=$(etcdctl get "$key" --print-value-only 2>/dev/null | wc -c)
  echo "$size $key"
done | sort -rn | head -20

# Or via kubectl (sums value lengths; tolerates ConfigMaps with no .data)
kubectl get configmaps -A -o json | \
  jq -r '.items[] | "\(.metadata.namespace)/\(.metadata.name) \((.data // {}) | to_entries | map(.value | length) | add // 0)"' | \
  sort -t' ' -k2 -rn | head -20

Step 2: Check Compaction Rate

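# etcd compaction metrics
# (adjust the URL to your setup: kubeadm serves etcd metrics on
#  http://127.0.0.1:2381/metrics, and port 2379 usually requires client certs)
curl -s http://localhost:2379/metrics | grep mvcc_db_compaction

# High compaction pause duration = problem
# etcd_debugging_mvcc_db_compaction_pause_duration_milliseconds_bucket

# Check current revision vs compacted revision
curl -s http://localhost:2379/metrics | grep -E 'mvcc_(current|compact)_revision'
# The current revision is also in the JSON status output (the table view omits it)
etcdctl endpoint status --write-out=json | jq '.[].Status.header.revision'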

Step 3: Watch Restart Metrics

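# Check how often watches are restarting
# (the insecure :8080 port is not available in Kubernetes 1.27,
#  so read the metrics through the authenticated API)
kubectl get --raw /metrics | grep watch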

# Look for:
# apiserver_watch_events_total
# apiserver_watch_events_sizes
# apiserver_longrunning_requests (for watch counts)

# High rate of watch establishment = problem
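
To turn two metric scrapes into a number you can compare, a small helper is enough. This is a sketch that does substring matching on Prometheus text format rather than real parsing; `sumSamples` is a name invented here, and the sample lines in `main` are fabricated and carry fewer labels than the real `apiserver_request_total`.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// sumSamples adds up all samples of metric `name` whose label set contains
// `labelPair` (for example `verb="LIST"`). It assumes no timestamps follow
// the sample value.
func sumSamples(text, name, labelPair string) float64 {
	var total float64
	for _, line := range strings.Split(text, "\n") {
		line = strings.TrimSpace(line)
		if !strings.HasPrefix(line, name+"{") || !strings.Contains(line, labelPair) {
			continue
		}
		i := strings.LastIndex(line, " ")
		if i < 0 {
			continue
		}
		v, err := strconv.ParseFloat(line[i+1:], 64)
		if err != nil {
			continue
		}
		total += v
	}
	return total
}

func main() {
	// Two fabricated scrapes taken some interval apart.
	before := `apiserver_request_total{resource="pods",verb="LIST"} 100
apiserver_request_total{resource="pods",verb="GET"} 900`
	after := `apiserver_request_total{resource="pods",verb="LIST"} 460
apiserver_request_total{resource="pods",verb="GET"} 950`

	delta := sumSamples(after, "apiserver_request_total", `verb="LIST"`) -
		sumSamples(before, "apiserver_request_total", `verb="LIST"`)
	fmt.Printf("LIST requests in interval: %.0f\n", delta)
}
```

A jump in the LIST delta that lines up with the hourly slowdown is a strong hint that clients are relisting rather than resuming their watches.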

Step 4: Find the Trigger

# Check what's updating frequently
kubectl get events -A --sort-by='.lastTimestamp' | tail -50

# Check who last wrote a large ConfigMap, and when
# (managedFields records the manager, operation, and timestamp of each writer)
kubectl get configmap <suspicious-cm> -n <ns> --show-managed-fields -o json | \
  jq '.metadata.managedFields[] | {manager, operation, time}'

The Fix

Option 1: Shrink the Large Objects

# Before: Giant ConfigMap with all certs
apiVersion: v1
kind: ConfigMap
metadata:
  name: all-certificates
data:
  ca-bundle.crt: |
    -----BEGIN CERTIFICATE-----
    ... 500KB of certificates ...
    -----END CERTIFICATE-----

# After: Split into multiple smaller ConfigMaps
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: ca-cert-1
data:
  ca.crt: |
    ... 50KB ...
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: ca-cert-2
data:
  ca.crt: |
    ... 50KB ...

Option 2: Reduce Update Frequency

# If rotating certs every hour, consider:
# 1. Rotate less frequently (every 6 hours)
# 2. Update only when content actually changes

# Check if updates are actually changing content
kubectl get configmap cert-bundle -o jsonpath='{.data}' | md5sum
# Run again after "update" - if same hash, update is unnecessary
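
The same check can live inside the rotation job so that it skips the write entirely when nothing changed. A minimal sketch, assuming the job holds both the current and the desired data as maps; `contentHash` and `needsUpdate` are names made up for this example.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// contentHash returns a deterministic SHA-256 over a ConfigMap-style data
// map. Keys are sorted and every key and value is length-prefixed, so map
// iteration order and key/value boundaries cannot change the result.
func contentHash(data map[string]string) string {
	keys := make([]string, 0, len(data))
	for k := range data {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	h := sha256.New()
	for _, k := range keys {
		fmt.Fprintf(h, "%d:%s%d:%s", len(k), k, len(data[k]), data[k])
	}
	return hex.EncodeToString(h.Sum(nil))
}

// needsUpdate reports whether desired differs from what is stored.
func needsUpdate(current, desired map[string]string) bool {
	return contentHash(current) != contentHash(desired)
}

func main() {
	current := map[string]string{"ca-bundle.crt": "cert-A\ncert-B\n"}
	desired := map[string]string{"ca-bundle.crt": "cert-A\ncert-B\n"}

	if needsUpdate(current, desired) {
		fmt.Println("content changed: write the ConfigMap")
	} else {
		fmt.Println("content unchanged: skip the update")
	}
}
```

Comparing the maps directly would also work. The hash is handy when you store it in an annotation and want to decide without holding the previous bundle in memory.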

Option 3: Move Large Data Out of etcd

# Use external storage for large data
apiVersion: v1
kind: Secret
metadata:
  name: ca-bundle-reference
type: Opaque
stringData:
  # Store URL or reference, not the actual data
  bundle-url: "https://internal-ca-server/bundle.pem"
  bundle-hash: "sha256:abc123..."

# Application fetches from URL, validates with hash

Option 4: Tune etcd Compaction

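# etcd configuration
# Increase compaction retention to give watchers more time
auto-compaction-mode: "periodic"   # "revision" would read the value below as a revision count
auto-compaction-retention: "8h"    # keep 8 hours of history (default "0" = auto-compaction off)

# Note: kube-apiserver also compacts etcd on its own schedule
# (--etcd-compaction-interval, default 5m), so this setting alone
# may not extend the history that Kubernetes watchers can use.

# But careful: more retention = more disk space
# May need to increase quota-backend-bytes
quota-backend-bytes: 8589934592  # 8GiB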

Option 5: Use Delta Updates

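// Instead of replacing the entire ConfigMap,
// use a strategic merge patch that carries only the changed keys.
import (
    "context"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/types"
    "k8s.io/client-go/kubernetes"
)

func patchCABundle(ctx context.Context, clientset kubernetes.Interface, namespace string) error {
    patch := []byte(`{"data":{"new-cert.pem":"..."}}`)
    _, err := clientset.CoreV1().ConfigMaps(namespace).Patch(
        ctx,
        "ca-bundle",
        types.StrategicMergePatchType,
        patch,
        metav1.PatchOptions{},
    )
    return err
}
// Smaller request and no read-modify-write conflicts. The apiserver still
// stores the full object and sends it in full to watchers, so pair this
// with smaller objects to reduce the impact on watchers.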
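
Building the patch body by hand does not scale past one key. Here is a stdlib-only sketch that derives it from the old and new data maps, using JSON merge patch conventions where a null value removes a key. `dataPatch` is a name invented for this post.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// dataPatch builds a patch of the form {"data": {...}} that turns old into
// desired: added or changed keys carry their new value, removed keys carry
// null. It returns nil when nothing changed.
func dataPatch(old, desired map[string]string) ([]byte, error) {
	diff := map[string]*string{}
	for k, v := range desired {
		if ov, ok := old[k]; !ok || ov != v {
			v := v // copy before taking the address
			diff[k] = &v
		}
	}
	for k := range old {
		if _, ok := desired[k]; !ok {
			diff[k] = nil // encodes as null, which deletes the key
		}
	}
	if len(diff) == 0 {
		return nil, nil
	}
	return json.Marshal(map[string]interface{}{"data": diff})
}

func main() {
	old := map[string]string{"root-a.pem": "A", "root-b.pem": "B"}
	desired := map[string]string{"root-a.pem": "A", "root-c.pem": "C"}

	patch, err := dataPatch(old, desired)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(string(patch))
}
```

A nil result doubles as the "nothing changed, skip the write" signal. Keep in mind that a patch shrinks the request only: the stored object and the watch event are still the whole ConfigMap.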

Monitoring

Key Metrics

groups:
  - name: etcd-watch
    rules:
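      - alert: EtcdHighCompactionRate
        # Counts compactions in the window; rate() would be per second
        # and would never reach 10. The metric carries the etcd_debugging_ prefix.
        expr: |
          increase(etcd_debugging_mvcc_db_compaction_total_duration_milliseconds_count[5m]) > 10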
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High etcd compaction rate"

      - alert: HighListRequestRate
        # The threshold is a placeholder: set it from your cluster's normal LIST rate.
        expr: |
          sum(rate(apiserver_request_total{verb="LIST"}[5m])) > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High LIST request rate - possible relist storms"

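      # etcd_object_size_bytes is not a built-in etcd or apiserver metric;
      # this rule assumes a custom exporter that publishes per-key sizes.
      - alert: LargeEtcdObject
        expr: |
          etcd_object_size_bytes > 100000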
        for: 5m
        labels:
          severity: info
        annotations:
          summary: "Large object in etcd: {{ $labels.key }}"

Dashboard Queries

# Grafana queries for etcd watch health

# Watch event rate by kind (in 1.27 this metric is labeled by group/version/kind)
sum(rate(apiserver_watch_events_total[5m])) by (kind)

# List requests (indicator of relist storms)
sum(rate(apiserver_request_total{verb="LIST"}[5m])) by (resource)

# etcd request latency p99 by operation, as measured by the apiserver
histogram_quantile(0.99, sum(rate(etcd_request_duration_seconds_bucket[5m])) by (le, operation))

Checklist

## etcd Watch Replay Storms

### Symptoms
- [ ] Random apiserver slowdowns
- [ ] High list request rate
- [ ] Pattern correlates with config updates
- [ ] Lasts 2-5 minutes, then recovers

### Diagnosis
- [ ] Find large objects in etcd (>100KB)
- [ ] Check object update frequency
- [ ] Monitor etcd compaction rate
- [ ] Check watch restart metrics

### Fixes
- [ ] Split large ConfigMaps/Secrets
- [ ] Reduce update frequency
- [ ] Use delta updates instead of full replace
- [ ] Move large data out of etcd
- [ ] Tune compaction retention

### Prevention
- [ ] Set object size limits in admission control
- [ ] Monitor for large objects proactively
- [ ] Use external storage for large data
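
For the admission-control item, the core of the check is small. The sketch below shows only the size decision, without the webhook plumbing around it; the 100 KiB limit mirrors the ">100KB" line in the checklist, and `payloadSize` and `admit` are hypothetical names.

```go
package main

import "fmt"

// maxPayloadBytes is an assumed policy limit, chosen to match the
// ">100KB" threshold used in the checklist.
const maxPayloadBytes = 100 * 1024

// payloadSize counts the bytes of all keys and values.
func payloadSize(data map[string]string) int {
	n := 0
	for k, v := range data {
		n += len(k) + len(v)
	}
	return n
}

// admit returns an error when the object exceeds the policy limit.
func admit(name string, data map[string]string) error {
	if n := payloadSize(data); n > maxPayloadBytes {
		return fmt.Errorf("configmap %q carries %d bytes, limit is %d",
			name, n, maxPayloadBytes)
	}
	return nil
}

func main() {
	small := map[string]string{"mode": "prod"}
	big := map[string]string{"ca-bundle.crt": string(make([]byte, 500*1024))}

	fmt.Println("app-config:", admit("app-config", small))
	fmt.Println("all-certificates:", admit("all-certificates", big))
}
```

Rejecting oversized objects at admission time turns a production incident into a failed deploy, which is a much cheaper place to find out.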

Conclusion

This failure mode reveals the hidden complexity behind Kubernetes’ “declarative” model. When you update a ConfigMap, you think you’re just changing some configuration. But under the hood, you’re generating an etcd transaction that affects revision numbers, compaction timelines, watch validity, and ultimately the behavior of every controller in the cluster. The system is deeply interconnected in ways that aren’t visible from the API layer.

The “random slowness” symptom is particularly insidious because it doesn’t point to any obvious cause. The apiserver isn’t out of memory. etcd isn’t out of disk space. Network isn’t congested. The problem is in the machinery of watches and revisions—a layer of abstraction that most operators never need to think about, until it breaks.

The fix is usually simple once you understand the problem: don’t store large, frequently-updated objects in etcd. Split large ConfigMaps. Use external storage for certificate bundles. Update only what actually changed. These are good practices anyway, but they become critical when you have a busy cluster with many controllers.

Key takeaways:

  1. Symptom is “random slowness” - which makes root cause analysis difficult
  2. Standard resources look fine - CPU/memory utilization doesn’t explain the problem
  3. Root cause is interaction - between object size, update frequency, and watch mechanics
  4. Fix requires understanding etcd internals - not just Kubernetes APIs

The key insight: etcd isn’t just a key-value store—it’s a versioned, watched database with its own internal dynamics. Large, frequently-updated objects can destabilize the entire watch machinery, affecting the responsiveness of the entire cluster. When debugging Kubernetes control plane issues, always consider what’s happening at the etcd level.


Cite this article

If you reference this post, please link to the original URL and credit the author.

Michal Drozd. "etcd Watch Replay Storms: When Giant ConfigMaps Kill the Control Plane". https://www.michal-drozd.com/en/blog/etcd-watch-replay-storms/ (Published December 5, 2024).