kube-proxy Micro-Outages: The xtables Lock Contention Problem
At first it looked like packet loss; it was an iptables lock. “Deployments cause random connection drops but everything looks healthy.” We hit this one during a big migration project. The team had successfully moved to Kubernetes, traffic was flowing, and metrics looked good. But every time we deployed a new version, a small percentage of requests would fail with connection refused errors for 1-3 seconds. The pattern was maddening: random, short-lived, and impossible to correlate with any visible problem.
The natural response was to add more replicas. If some connections are dropping, having more pods should provide redundancy, right? But adding replicas made things worse. Every new pod meant another endpoint change. Every endpoint change meant kube-proxy regenerating iptables rules. And every rule regeneration meant grabbing the xtables lock: an exclusive, node-wide lock (the file /run/xtables.lock on most distributions) that blocks all other iptables operations while held.
This is one of those problems where the symptom seems completely unrelated to the cause. Connection drops feel like a network problem. But the actual issue is a single node-wide lock being held for seconds at a time while thousands of iptables rules are rewritten. During that time, new TCP connections can’t be NAT’d properly, so they fail. Once the lock releases, everything works again.
The counter-intuitive lesson: in iptables mode, adding capacity can decrease reliability. The more endpoints you have, the more rules kube-proxy needs to maintain, the longer the lock is held, and the more connections drop.
Environment: Kubernetes 1.27, kube-proxy iptables mode, 200+ services, frequent HPA scaling
The Problem
Symptoms
The pattern that drove us crazy:
1. Deploy new version (or HPA scales)
2. 1-3 seconds of dropped connections
3. No pod restarts, no OOM, no CPU spikes
4. Metrics show nothing wrong
5. Happens randomly during "busy" times
What we saw:
- New TCP connections fail with "connection refused"
- Existing connections are fine
- Only lasts 1-3 seconds
- More replicas = more frequent occurrence
Why Standard Monitoring Misses This
# Normal metrics look fine
kubectl top nodes # CPU: 40%, Memory: 60% - looks good
kubectl top pods # All pods healthy
# Even kube-proxy metrics don't show the issue clearly
curl localhost:10249/metrics | grep sync
# kubeproxy_sync_proxy_rules_duration_seconds is a histogram; averaged on a dashboard, 1-3s spikes disappear
# The lock contention happens on the node itself, below anything Kubernetes reports
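The histogram buckets behind that metric can still be read directly, which shows spikes that an average hides. The following is a minimal sketch, not part of the original tooling: `slow_syncs` is a hypothetical helper name, and the `1.024` boundary used in the usage example is an assumption about the bucket layout, so check the `le` labels in your own `/metrics` output first.

```bash
# slow_syncs LE: read kube-proxy /metrics text on stdin and print how many
# rule syncs took longer than the bucket boundary LE (in seconds).
# LE must match an existing "le" label exactly, for example "1.024".
slow_syncs() {
  awk -v le="$1" '
    # Cumulative count of syncs that finished within LE seconds
    index($0, "kubeproxy_sync_proxy_rules_duration_seconds_bucket") == 1 &&
      index($0, "le=\"" le "\"") > 0 { within = $NF }
    # Total number of syncs observed since kube-proxy started
    index($0, "kubeproxy_sync_proxy_rules_duration_seconds_count") == 1 { total = $NF }
    END { printf "%d\n", total - within }
  '
}
```

On a node this could be run as `curl -s localhost:10249/metrics | slow_syncs 1.024`. The counters are cumulative, so compare two readings taken before and after a deployment rather than reading a single value.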
Root Cause
How iptables Mode Works
Normal kube-proxy iptables sync:
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│  Endpoints  │────▶│ kube-proxy  │────▶│  iptables   │
│   change    │     │  generates  │     │  -restore   │
└─────────────┘     │  new rules  │     └─────────────┘
                    └─────────────┘            │
                                               ▼
                                      ┌─────────────────┐
                                      │  xtables lock   │
                                      │   (exclusive)   │
                                      └─────────────────┘
                                               │
                                      Blocks ALL iptables
                                      operations during
                                      restore (can be seconds)
The Amplification Problem
Endpoint churn creates a feedback loop:
1. Pod crashes (CrashLoopBackOff)
   └─▶ Endpoint removed
       └─▶ kube-proxy syncs rules (takes xtables lock)
2. HPA scales up
   └─▶ New endpoints added
       └─▶ kube-proxy syncs rules (takes xtables lock)
3. Rolling deployment
   └─▶ Old pods terminating, new pods starting
       └─▶ Multiple endpoint changes
           └─▶ Multiple rule syncs queued
               └─▶ Lock held longer
The counter-intuitive part:
MORE REPLICAS = MORE ENDPOINT CHANGES = MORE LOCK CONTENTION
"Scale up to handle load" can make the problem worse!
Measuring the Lock Contention
# On a node, trace xtables lock acquisition
sudo perf trace -e 'lock:*' -p $(pgrep kube-proxy) 2>&1 | head -100
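# Or use bpftrace to measure how long the kernel-side table lock is held
# (legacy iptables backend only; confirm the probes exist on your kernel
#  with: sudo bpftrace -l 'kprobe:xt_*')
sudo bpftrace -e '
kprobe:xt_find_table_lock { @start[tid] = nsecs; }
kprobe:xt_table_unlock /@start[tid]/ {
  @lock_held_us = hist((nsecs - @start[tid]) / 1000);
  delete(@start[tid]);
}
'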
# More practical: time iptables-save during sync
time iptables-save > /dev/null
# If this takes >100ms during "normal" times, you have a problem
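A single `time` run can easily miss a spike, so it may help to sample repeatedly and keep the worst case. The sketch below assumes GNU `date` (for `%N`); `max_duration_ms` is a hypothetical helper name, not an existing tool.

```bash
# max_duration_ms N CMD [ARGS...]: run CMD N times and print the slowest
# wall-clock run in whole milliseconds. Requires GNU date for %N.
max_duration_ms() {
  runs=$1; shift
  max=0; i=0
  while [ "$i" -lt "$runs" ]; do
    start=$(date +%s%N)            # nanoseconds since the epoch
    "$@" > /dev/null 2>&1
    end=$(date +%s%N)
    elapsed=$(( (end - start) / 1000000 ))
    [ "$elapsed" -gt "$max" ] && max=$elapsed
    i=$((i + 1))
  done
  echo "$max"
}
```

For example, `max_duration_ms 20 iptables-save` run as root during a rollout reports the slowest of 20 samples. Compare it against the same measurement taken during a quiet period.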
The Fix
Option 1: Switch to IPVS Mode
# kube-proxy ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-proxy
  namespace: kube-system
data:
  config.conf: |
    mode: "ipvs"
    ipvs:
      syncPeriod: "30s"
      minSyncPeriod: "5s"
      scheduler: "rr"
# IPVS uses different kernel structures, no xtables lock
# But requires ipvs kernel modules loaded
# Verify IPVS modules are available
lsmod | grep -E "ip_vs|nf_conntrack"
# If not, load them
modprobe ip_vs
modprobe ip_vs_rr
modprobe ip_vs_wrr
modprobe ip_vs_sh
modprobe nf_conntrack
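Before flipping the mode on every node, it can be useful to check which of those modules are actually missing. This is a rough sketch: `missing_ipvs_modules` is a hypothetical helper, and it only looks at `lsmod` output, so modules compiled directly into the kernel will be reported as missing even though they are available.

```bash
# missing_ipvs_modules: read `lsmod` output on stdin and print each module
# from the list above that is not currently loaded, one per line.
missing_ipvs_modules() {
  loaded=$(awk 'NR > 1 { print $1 }')   # skip the lsmod header line
  for mod in ip_vs ip_vs_rr ip_vs_wrr ip_vs_sh nf_conntrack; do
    printf '%s\n' "$loaded" | grep -qx "$mod" || echo "$mod"
  done
}
```

Usage: `lsmod | missing_ipvs_modules`. Empty output means all five modules are loaded. After switching, `curl -s localhost:10249/proxyMode` on a node reports the mode kube-proxy is actually running in.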
Option 2: Tune iptables Mode
# kube-proxy ConfigMap - reduce sync frequency
apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-proxy
  namespace: kube-system
data:
  config.conf: |
    mode: "iptables"
    iptables:
      syncPeriod: "60s"       # Increase from default 30s
      minSyncPeriod: "10s"    # Increase from default 1s
      masqueradeAll: false
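The reason `minSyncPeriod` helps is that kube-proxy batches endpoint changes arriving inside that window into one sync, so the setting caps how often the lock is taken. A back-of-the-envelope sketch follows, assuming changes are spread evenly over the minute and ignoring the periodic `syncPeriod` resync; `max_syncs_per_min` is a hypothetical helper.

```bash
# max_syncs_per_min CHANGES_PER_MIN MIN_SYNC_PERIOD_S
# Upper bound on change-driven rule syncs per minute: never more than the
# number of changes, and never more than one sync per minSyncPeriod.
max_syncs_per_min() {
  changes=$1
  ceiling=$(( 60 / $2 ))
  if [ "$changes" -lt "$ceiling" ]; then
    echo "$changes"
  else
    echo "$ceiling"
  fi
}
```

With 50 endpoint changes per minute, `max_syncs_per_min 50 1` gives 50 and `max_syncs_per_min 50 10` gives 6. The trade-off is that new endpoints can take up to `minSyncPeriod` longer to start receiving traffic.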
Option 3: Reduce Endpoint Churn
# Slow down HPA reactions
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app
spec:
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300   # 5 min window
      policies:
        - type: Percent
          value: 10                     # Only 10% at a time
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Percent
          value: 50                     # 50% at a time
          periodSeconds: 60
# Use maxSurge/maxUnavailable to limit deployment churn
apiVersion: apps/v1
kind: Deployment
spec:
  strategy:
    rollingUpdate:
      maxSurge: 1          # Only 1 new pod at a time
      maxUnavailable: 0    # Keep all old pods until new is ready
Option 4: nftables Mode (Kubernetes 1.29+)
# If running K8s 1.29+ with compatible nodes
# (alpha in 1.29 and 1.30: also enable the NFTablesProxyMode feature gate)
apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-proxy
  namespace: kube-system
data:
  config.conf: |
    mode: "nftables"
    nftables:
      syncPeriod: "30s"
      minSyncPeriod: "1s"
# nftables uses atomic rule replacement, much less lock contention
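Mixed-version clusters are common in the middle of an upgrade, so a quick version gate per node can save a failed rollout. This is a sketch only: `supports_nftables_mode` is a hypothetical helper that checks the version string and nothing else. It does not verify the node kernel, the `nft` binary, or feature gates.

```bash
# supports_nftables_mode VERSION: succeed if VERSION (e.g. "v1.29.3") is
# Kubernetes 1.29 or newer. Checks the version string only.
supports_nftables_mode() {
  ver=${1#v}                  # strip the leading "v"
  major=${ver%%.*}
  rest=${ver#*.}
  minor=${rest%%.*}
  minor=${minor%%[!0-9]*}     # drop suffixes such as the "+" in "27+"
  [ "$major" -gt 1 ] || { [ "$major" -eq 1 ] && [ "$minor" -ge 29 ]; }
}
```

One way to feed it is with the kubelet version each node reports: `for v in $(kubectl get nodes -o jsonpath='{.items[*].status.nodeInfo.kubeletVersion}'); do supports_nftables_mode "$v" || echo "too old: $v"; done`.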
Monitoring
Detecting Lock Contention
# Prometheus alert for sync duration spikes
groups:
  - name: kube-proxy
    rules:
      - alert: KubeProxySyncSlow
        expr: |
          histogram_quantile(0.99,
            rate(kubeproxy_sync_proxy_rules_duration_seconds_bucket[5m])
          ) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "kube-proxy rule sync taking >1s on {{ $labels.instance }}"
      - alert: KubeProxyEndpointChurn
        expr: |
          rate(kubeproxy_sync_proxy_rules_endpoint_changes_total[5m]) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High endpoint churn on {{ $labels.instance }}"
Custom Metric for xtables Lock
#!/bin/bash
# /etc/node_exporter/scripts/xtables_lock.sh
# Run as textfile collector (redirect stdout to a .prom file in the collector directory)
# Time how long iptables-save takes (proxy for lock contention)
start=$(date +%s.%N)
iptables-save > /dev/null 2>&1
end=$(date +%s.%N)
duration=$(echo "$end - $start" | bc)
echo "# HELP xtables_lock_duration_seconds Time for iptables-save to complete (proxy for xtables lock contention)"
echo "# TYPE xtables_lock_duration_seconds gauge"
echo "xtables_lock_duration_seconds $duration"
# Count rules (more rules = longer lock hold); only "-A" lines are rules
rule_count=$(iptables-save 2>/dev/null | grep -c '^-A')
echo "# HELP iptables_rule_count Total iptables rules"
echo "# TYPE iptables_rule_count gauge"
echo "iptables_rule_count $rule_count"
Debugging Playbook
#!/bin/bash
# kube-proxy-debug.sh
echo "=== Step 1: Check kube-proxy mode ==="
kubectl get cm kube-proxy -n kube-system -o jsonpath='{.data.config\.conf}' | grep mode
echo "=== Step 2: Count iptables rules ==="
iptables-save | wc -l
echo "=== Step 3: Time iptables-save ==="
time iptables-save > /dev/null
echo "=== Step 4: Check endpoint count ==="
kubectl get endpoints -A --no-headers | wc -l
echo "=== Step 5: Watch for sync events ==="
kubectl logs -n kube-system -l k8s-app=kube-proxy --tail=50 | grep -i sync
echo "=== Step 6: Check for crashlooping pods (churn source) ==="
kubectl get pods -A | grep -E "(CrashLoop|Error)"
echo "=== Step 7: Check HPA activity ==="
kubectl get hpa -A
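A raw line count from Step 2 does not say where the rules come from. The sketch below splits them by chain family, assuming the `KUBE-SVC-*` (per-service) and `KUBE-SEP-*` (per-endpoint) chain naming that kube-proxy uses in iptables mode; `rule_breakdown` is a hypothetical helper name.

```bash
# rule_breakdown: read iptables-save output on stdin and count "-A" rules
# per chain family. Prints one "FAMILY COUNT" line per family.
rule_breakdown() {
  awk '
    $1 == "-A" {
      family = "other"
      if ($2 ~ /^KUBE-SVC-/) family = "KUBE-SVC"
      else if ($2 ~ /^KUBE-SEP-/) family = "KUBE-SEP"
      else if ($2 ~ /^KUBE-/) family = "KUBE-other"
      count[family]++
    }
    END {
      n = split("KUBE-SVC KUBE-SEP KUBE-other other", order, " ")
      for (i = 1; i <= n; i++) printf "%s %d\n", order[i], count[order[i]]
    }
  '
}
```

Usage: `iptables-save | rule_breakdown`. If the `KUBE-SEP` count dominates, the ruleset size is driven by endpoint count, which points back at replica counts and churn rather than at the number of services.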
Real-World Numbers
Cluster with 200 services, iptables mode:
Before tuning:
- iptables rules: 15,000+
- iptables-save time: 800ms average, 3s spikes
- Endpoint changes: 50/min during business hours
- Connection drops: 5-10/hour
After switch to IPVS:
- iptables-save time: 50ms (only non-kube rules)
- Endpoint changes: same, but no lock contention
- Connection drops: 0
Alternative - tuned iptables:
- minSyncPeriod: 10s (was 1s)
- Controlled HPA behavior
- Connection drops: <1/hour
Checklist
## kube-proxy Lock Contention
### Symptoms
- [ ] Random 1-3s connection drops
- [ ] Correlates with deployments/scaling
- [ ] CPU/memory look normal
- [ ] More replicas made it worse
### Diagnosis
- [ ] Check kube-proxy mode (iptables?)
- [ ] Count iptables rules (>10k is concerning)
- [ ] Time iptables-save (>500ms is bad)
- [ ] Check endpoint churn rate
- [ ] Look for crashlooping pods
### Fixes
- [ ] Consider IPVS mode
- [ ] Increase minSyncPeriod
- [ ] Tune HPA stabilization windows
- [ ] Control deployment surge/unavailable
- [ ] Consider nftables mode (K8s 1.29+)
Conclusion
This is a fundamental architectural limitation of kube-proxy’s iptables mode. iptables serializes rule updates behind a single node-wide lock, the xtables lock. When kube-proxy runs iptables-restore to update rules, it holds that lock for the entire duration of the operation. With hundreds of services and thousands of endpoints, that operation can take seconds.
The xtables lock is invisible to standard Kubernetes monitoring. CPU looks fine. Memory looks fine. Network metrics show some failures, but nothing that explains the pattern. You have to understand the interaction between kube-proxy’s sync mechanism and the kernel’s locking behavior to diagnose the issue.
The good news is that solutions exist. IPVS mode uses a completely different kernel mechanism that doesn’t have the same locking behavior. nftables mode in newer Kubernetes versions uses atomic rule replacement. And even within iptables mode, reducing endpoint churn through HPA tuning and deployment strategies can mitigate the problem.
Key takeaways:
- iptables mode has inherent lock contention at scale - it’s not a bug, it’s architecture
- More endpoints = longer lock hold times = more connection drops during sync
- IPVS mode avoids the problem entirely by using different kernel data structures
- If stuck with iptables, reduce churn through HPA stabilization and gradual rollouts
- nftables mode (K8s 1.29+) provides atomic updates with far less lock contention