kube-proxy Micro-Outages: The xtables Lock Contention Problem

At first it looked like packet loss; it was an iptables lock.

“Deployments cause random connection drops but everything looks healthy.” We hit this one during a big migration project. The team had successfully moved to Kubernetes, traffic was flowing, and metrics looked good. But every time we deployed a new version, a small percentage of requests would fail with connection refused errors for 1-3 seconds. The pattern was maddening: random, short-lived, and impossible to correlate with any visible problem.

The natural response was to add more replicas. If some connections are dropping, having more pods should provide redundancy, right? But adding replicas made things worse. Every new pod meant another endpoint change. Every endpoint change meant kube-proxy regenerating iptables rules. And every rule regeneration meant grabbing the xtables lock, the exclusive lock (a file lock, /run/xtables.lock by default) that the iptables tools take and that blocks every other iptables operation on the node while it is held.

This is one of those problems where the symptom seems completely unrelated to the cause. Connection drops feel like a network problem. But the actual issue is a single exclusive lock being held for seconds at a time while thousands of iptables rules are rewritten. During that time, new TCP connections can’t be NAT’d properly, so they fail. Once the lock releases, everything works again.

The counter-intuitive lesson: in iptables mode, adding capacity can decrease reliability. The more endpoints you have, the more rules kube-proxy needs to maintain, the longer the lock is held, and the more connections drop.

Environment: Kubernetes 1.27, kube-proxy iptables mode, 200+ services, frequent HPA scaling

The Problem

Symptoms

The pattern that drove us crazy:

1. Deploy new version (or HPA scales)
2. 1-3 seconds of dropped connections
3. No pod restarts, no OOM, no CPU spikes
4. Metrics show nothing wrong
5. Happens randomly during "busy" times

What we saw:
- New TCP connections fail with "connection refused"
- Existing connections are fine
- Only lasts 1-3 seconds
- More replicas = more frequent occurrence

Why Standard Monitoring Misses This

# Normal metrics look fine
kubectl top nodes     # CPU: 40%, Memory: 60% - looks good
kubectl top pods      # All pods healthy

# Even kube-proxy metrics don't show the issue clearly
curl localhost:10249/metrics | grep sync
# kubeproxy_sync_proxy_rules_duration_seconds is a cumulative histogram; averages over it hide short spikes

# The lock contention happens in the node's iptables layer, below anything Kubernetes itself reports
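One way to surface spikes that an average hides is to read the histogram buckets directly. The sketch below is an illustration, not something kube-proxy ships: it reads a metrics scrape on stdin and prints the upper bound of the slowest bucket that has recorded at least one sync. It assumes the buckets appear in ascending `le` order with a single label set (as on a 1.27 node), and `slowest_sync_bucket` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the upper bound (le) of the slowest histogram
# bucket that has recorded at least one rule sync since kube-proxy started.
slowest_sync_bucket() {
  awk '
    /^kubeproxy_sync_proxy_rules_duration_seconds_bucket/ {
      # Pull the value out of the le="..." label.
      match($0, /le="[^"]+"/)
      le = substr($0, RSTART + 4, RLENGTH - 5)
      count = $NF + 0
      # Buckets are cumulative: each time the count grows, at least one
      # sync landed in this bucket, so the worst case moves up to it.
      if (count > prev) { bound = le; prev = count }
    }
    END { print bound }
  '
}
```

On a node this could be fed with `curl -s localhost:10249/metrics | slowest_sync_bucket`. A result of `4`, for example, would mean at least one sync took longer than the next-lower bucket bound and at most 4 seconds, even if the average looks harmless.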

Root Cause

How iptables Mode Works

Normal kube-proxy iptables sync:

┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│  Endpoints  │────▶│  kube-proxy │────▶│  iptables   │
│   change    │     │  generates  │     │  -restore   │
└─────────────┘     │  new rules  │     └─────────────┘
                    └─────────────┘           │
                                              ▼
                                    ┌─────────────────┐
                                    │  xtables lock   │
                                    │   (file lock)   │
                                    └─────────────────┘

                                    Blocks ALL iptables
                                    operations during
                                    restore (can be seconds)

The Amplification Problem

Endpoint churn creates a feedback loop:

1. Pod crashes (CrashLoopBackOff)
   └─▶ Endpoint removed
        └─▶ kube-proxy syncs rules (takes xtables lock)

2. HPA scales up
   └─▶ New endpoints added
        └─▶ kube-proxy syncs rules (takes xtables lock)

3. Rolling deployment
   └─▶ Old pods terminating, new pods starting
        └─▶ Multiple endpoint changes
             └─▶ Multiple rule syncs queued
                  └─▶ Lock held longer

The counter-intuitive part:
MORE REPLICAS = MORE ENDPOINT CHANGES = MORE LOCK CONTENTION

"Scale up to handle load" can make the problem worse!

Measuring the Lock Contention

# On a node, trace xtables lock acquisition
sudo perf trace -e 'lock:*' -p $(pgrep kube-proxy) 2>&1 | head -100

# Or use bpftrace to measure how long the kernel-side x_tables mutex is held
# (legacy iptables backend; confirm the probe names with: sudo bpftrace -l 'kprobe:xt_*')
sudo bpftrace -e '
  kretprobe:xt_find_table_lock { @start[tid] = nsecs; }
  kprobe:xt_table_unlock /@start[tid]/ {
    @lock_held_us = hist((nsecs - @start[tid]) / 1000);
    delete(@start[tid]);
  }
'

# More practical: time iptables-save during sync
time iptables-save > /dev/null
# If this takes >100ms during "normal" times, you have a problem
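A single `time` run is easy to mistime, so it can help to sample repeatedly and keep the worst case. The function below is a minimal sketch of that idea, assuming GNU `date` (for `%N` nanoseconds); `max_duration_ms` is a hypothetical helper name, not part of any tool mentioned here.

```bash
#!/usr/bin/env bash
# Hypothetical helper: run a command N times and print the slowest run in
# milliseconds. Usage: max_duration_ms SAMPLES COMMAND [ARGS...]
max_duration_ms() {
  local samples=$1; shift
  local max=0 start end elapsed i
  for ((i = 0; i < samples; i++)); do
    start=$(date +%s%N)            # nanoseconds since epoch (GNU date)
    "$@" > /dev/null 2>&1          # output is discarded, only time matters
    end=$(date +%s%N)
    elapsed=$(( (end - start) / 1000000 ))
    if (( elapsed > max )); then
      max=$elapsed
    fi
  done
  echo "$max"
}
```

Running something like `max_duration_ms 30 iptables-save` as root while a rollout is in progress gives the worst of 30 samples, which is closer to what a failing connection experiences than any single measurement.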

The Fix

Option 1: Switch to IPVS Mode

# kube-proxy ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-proxy
  namespace: kube-system
data:
  config.conf: |
    mode: "ipvs"
    ipvs:
      syncPeriod: "30s"
      minSyncPeriod: "5s"
      scheduler: "rr"

# IPVS mode keeps per-endpoint state in IPVS and ipset tables, so the iptables
# rule set stays small and the xtables lock is held only briefly
# It requires the ipvs kernel modules to be loaded
# Verify IPVS modules are available
lsmod | grep -E "ip_vs|nf_conntrack"

# If not, load them
modprobe ip_vs
modprobe ip_vs_rr
modprobe ip_vs_wrr
modprobe ip_vs_sh
modprobe nf_conntrack

Option 2: Tune iptables Mode

# kube-proxy ConfigMap - reduce sync frequency
apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-proxy
  namespace: kube-system
data:
  config.conf: |
    mode: "iptables"
    iptables:
      syncPeriod: "60s"       # Increase from default 30s
      minSyncPeriod: "10s"    # Increase from default 1s
      masqueradeAll: false
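To get a feel for why raising minSyncPeriod helps, here is a rough upper-bound estimate of how much of each minute the lock is busy. It is a back-of-the-envelope sketch under stated assumptions: every sync takes the 800ms average quoted later in this post, each endpoint change triggers at most one sync, and kube-proxy runs at most one sync per minSyncPeriod. `lock_busy_pct` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: worst-case percentage of a minute spent syncing.
# Usage: lock_busy_pct CHANGES_PER_MIN MIN_SYNC_PERIOD_S SYNC_DURATION_S
lock_busy_pct() {
  awk -v changes="$1" -v period="$2" -v duration="$3" 'BEGIN {
    max_syncs = 60 / period                        # cap set by minSyncPeriod
    syncs = (changes < max_syncs) ? changes : max_syncs
    printf "%.1f\n", syncs * duration / 60 * 100   # busy share in percent
  }'
}

lock_busy_pct 50 1 0.8    # minSyncPeriod 1s  -> 66.7
lock_busy_pct 50 10 0.8   # minSyncPeriod 10s -> 8.0
```

With 50 changes per minute, moving from a 1s to a 10s minimum period caps the number of syncs at 6 per minute, which is where most of the improvement in this estimate comes from. The trade-off is that endpoint changes can take up to 10 seconds to reach the node's rules.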

Option 3: Reduce Endpoint Churn

# Slow down HPA reactions
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app
spec:
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # 5 min window
      policies:
        - type: Percent
          value: 10                     # Only 10% at a time
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Percent
          value: 50                     # 50% at a time
          periodSeconds: 60
---
# Use maxSurge/maxUnavailable to limit deployment churn
apiVersion: apps/v1
kind: Deployment
spec:
  strategy:
    rollingUpdate:
      maxSurge: 1          # Only 1 new pod at a time
      maxUnavailable: 0    # Keep all old pods until new is ready

Option 4: nftables Mode (Kubernetes 1.29+)

# Requires compatible nodes. Alpha in K8s 1.29 (behind the NFTablesProxyMode feature gate), beta from 1.31
apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-proxy
  namespace: kube-system
data:
  config.conf: |
    mode: "nftables"
    nftables:
      syncPeriod: "30s"
      minSyncPeriod: "1s"

# nftables uses atomic rule replacement, much less lock contention

Monitoring

Detecting Lock Contention

# Prometheus alert for sync duration spikes
groups:
  - name: kube-proxy
    rules:
      - alert: KubeProxySyncSlow
        expr: |
          histogram_quantile(0.99,
            rate(kubeproxy_sync_proxy_rules_duration_seconds_bucket[5m])
          ) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "kube-proxy rule sync taking >1s on {{ $labels.instance }}"

      - alert: KubeProxyEndpointChurn
        expr: |
          rate(kubeproxy_sync_proxy_rules_endpoint_changes_total[5m]) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High endpoint churn on {{ $labels.instance }}"

Custom Metric for xtables Lock

#!/bin/bash
# /etc/node_exporter/scripts/xtables_lock.sh
# Run from cron or a systemd timer and redirect stdout to a .prom file in
# node_exporter's textfile collector directory

# Time how long iptables-save takes (proxy for lock contention)
start=$(date +%s.%N)
rules=$(iptables-save 2>/dev/null)
end=$(date +%s.%N)
duration=$(echo "$end - $start" | bc)

echo "# HELP xtables_lock_duration_seconds Time taken by iptables-save (proxy for xtables lock contention)"
echo "# TYPE xtables_lock_duration_seconds gauge"
echo "xtables_lock_duration_seconds $duration"

# Count rules (more rules = longer lock hold); only "-A" lines are rules,
# the rest are table headers, chain declarations, and COMMIT markers
rule_count=$(grep -c '^-A' <<< "$rules")
echo "# HELP iptables_rule_count Total iptables rules"
echo "# TYPE iptables_rule_count gauge"
echo "iptables_rule_count $rule_count"
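The textfile collector only reads files ending in `.prom`, and a scrape can land while the script is still writing. A common way to handle that is to write to a temporary file in the same directory and rename it into place. The function below is a sketch of that pattern; `write_prom` is a hypothetical helper name and the directory in the usage example is an assumption about where the collector is pointed.

```bash
#!/usr/bin/env bash
# Hypothetical helper: write stdin to TARGET atomically.
# Usage: some_metrics_script | write_prom /path/to/metrics.prom
write_prom() {
  local target=$1 tmp
  # The temp file lives next to the target so the final mv is a rename
  # on the same filesystem; its name does not end in .prom, so the
  # collector ignores it while it is being written.
  tmp=$(mktemp "${target}.XXXXXX") || return 1
  cat > "$tmp" && chmod 644 "$tmp" && mv "$tmp" "$target"
}
```

For example: `/etc/node_exporter/scripts/xtables_lock.sh | write_prom /var/lib/node_exporter/textfile/xtables_lock.prom`, run every minute from cron or a systemd timer.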

Debugging Playbook

#!/bin/bash
# kube-proxy-debug.sh

echo "=== Step 1: Check kube-proxy mode ==="
kubectl get cm kube-proxy -n kube-system -o jsonpath='{.data.config\.conf}' | grep mode

echo "=== Step 2: Count iptables rules ==="
iptables-save | grep -c '^-A'

echo "=== Step 3: Time iptables-save ==="
time iptables-save > /dev/null

echo "=== Step 4: Check endpoint count ==="
kubectl get endpoints -A --no-headers | wc -l

echo "=== Step 5: Watch for sync events ==="
kubectl logs -n kube-system -l k8s-app=kube-proxy --tail=50 | grep -i sync

echo "=== Step 6: Check for crashlooping pods (churn source) ==="
kubectl get pods -A | grep -E "(CrashLoop|Error)"

echo "=== Step 7: Check HPA activity ==="
kubectl get hpa -A
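Steps 6 and 7 show where churn might come from. To see which Services actually churn, one option is to count change events per object. The sketch below assumes input lines whose first two columns are namespace and name, which is what `kubectl get endpoints -A --watch --no-headers` should print, one line per update. `top_churners` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: count lines per namespace/name and print the
# busiest objects. Usage: ... | top_churners [HOW_MANY]
top_churners() {
  awk '{ print $1 "/" $2 }' \
    | sort \
    | uniq -c \
    | sort -rn \
    | head -n "${1:-5}"
}
```

For example: `timeout 60 kubectl get endpoints -A --watch --no-headers | top_churners 10`. The watch lists every object once before streaming updates, so a count of 1 is just the baseline; anything well above that changed during the sampling window.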

Real-World Numbers

Cluster with 200 services, iptables mode:

Before tuning:
- iptables rules: 15,000+
- iptables-save time: 800ms average, 3s spikes
- Endpoint changes: 50/min during business hours
- Connection drops: 5-10/hour

After switch to IPVS:
- iptables-save time: 50ms (only non-kube rules)
- Endpoint changes: same, but no lock contention
- Connection drops: 0

Alternative - tuned iptables:
- minSyncPeriod: 10s (was 1s)
- Controlled HPA behavior
- Connection drops: <1/hour

Checklist

## kube-proxy Lock Contention

### Symptoms
- [ ] Random 1-3s connection drops
- [ ] Correlates with deployments/scaling
- [ ] CPU/memory look normal
- [ ] More replicas made it worse

### Diagnosis
- [ ] Check kube-proxy mode (iptables?)
- [ ] Count iptables rules (>10k is concerning)
- [ ] Time iptables-save (>500ms is bad)
- [ ] Check endpoint churn rate
- [ ] Look for crashlooping pods

### Fixes
- [ ] Consider IPVS mode
- [ ] Increase minSyncPeriod
- [ ] Tune HPA stabilization windows
- [ ] Control deployment surge/unavailable
- [ ] Consider nftables mode (K8s 1.29+)

Conclusion

This is a fundamental architectural limitation of kube-proxy’s iptables mode. iptables serializes rule updates behind a single global lock, the xtables lock. When kube-proxy runs iptables-restore to update rules, that lock is held for the entire duration of the operation. With hundreds of services and thousands of endpoints, that operation can take seconds.

The xtables lock is invisible to standard Kubernetes monitoring. CPU looks fine. Memory looks fine. Network metrics show some failures, but nothing that explains the pattern. You have to understand the interaction between kube-proxy’s sync mechanism and the kernel’s locking behavior to diagnose the issue.

The good news is that solutions exist. IPVS mode uses a completely different kernel mechanism that doesn’t have the same locking behavior. nftables mode in newer Kubernetes versions uses atomic rule replacement. And even within iptables mode, reducing endpoint churn through HPA tuning and deployment strategies can mitigate the problem.

Key takeaways:

  1. iptables mode has inherent lock contention at scale - it’s not a bug, it’s architecture
  2. More endpoints = longer lock hold times = more connection drops during sync
  3. IPVS mode avoids the problem entirely by using different kernel data structures
  4. If stuck with iptables, reduce churn through HPA stabilization and gradual rollouts
  5. nftables mode (K8s 1.29+) provides atomic updates without lock contention

Cite this article

If you reference this post, please link to the original URL and credit the author.

Michal Drozd. "kube-proxy Micro-Outages: The xtables Lock Contention Problem". https://www.michal-drozd.com/en/blog/kube-proxy-xtables-lock-contention/ (Published November 4, 2024).