Traffic Hitting Dead Pods: Conntrack's Stale NAT Mapping
This bug felt like a ghost: new pods couldn’t connect, old ones could. “Every deploy causes exactly 2 minutes of 503 errors.” The pattern was so consistent that we started joking about it. Deploy at 14:00, errors clear at 14:02. Deploy at 17:30, errors clear at 17:32. We tried everything the Kubernetes best practices suggested: preStop hooks with sleep commands, longer terminationGracePeriodSeconds, faster readiness probe failures, aggressive endpoint removal. Nothing made any difference. Exactly 2 minutes, every time.
The breakthrough came when we stopped thinking about Kubernetes and started thinking about Linux networking. Kubernetes removes endpoints and updates iptables rules correctly. But those iptables rules are only consulted for new connections. For existing connections—and for connections from persistent clients that keep trying the same source port—the Linux kernel uses conntrack, a connection tracking subsystem that remembers NAT mappings.
When a pod dies, its conntrack entries don’t die with it. They linger, remembering “connection from client:54321 should go to pod:10.1.1.100”. When the client’s next request arrives using the same source port, conntrack short-circuits the iptables lookup and sends the packet directly to the dead pod. The packet goes into the void, the connection times out, and the user sees a 503.
This is one of those problems that only appears when you have persistent connections or clients that reuse source ports. In development, with fresh connections every time, you’d never see it. In production, with connection pools and HTTP keep-alive, it’s everywhere.
Environment: Kubernetes 1.27, NodePort services, high-traffic stateless API
The Problem
The Eerie Pattern
Every single deployment:
T+0:00 New pods ready, old pods terminating
T+0:00 Endpoints updated (kube-proxy syncs)
T+0:00 503 errors start appearing
T+0:05 503 rate: 5%
T+0:30 503 rate: 3%
T+1:00 503 rate: 2%
T+2:00 503 rate: 0% (finally!)
Always exactly 2 minutes.
Same pattern every time.
No exceptions.
Why Standard Fixes Don’t Work
# We tried everything:
# Longer termination grace period - didn't help
terminationGracePeriodSeconds: 120
# preStop hook delay - didn't help
lifecycle:
  preStop:
    exec:
      command: ["sleep", "30"]
# Aggressive readiness probe - didn't help
readinessProbe:
  periodSeconds: 1
  failureThreshold: 1
# The problem isn't pod lifecycle
# It's the node's conntrack table!
Root Cause
How Conntrack Works
Normal request flow with NodePort:
Client → Node:30080 → iptables DNAT → Pod:8080
conntrack entry created:
┌─────────────────────────────────────────────────────┐
│ tcp src=client:54321 dst=node:30080 │
│ src=pod:8080 dst=client:54321 [ASSURED] │
│ timeout=432000 (5 days!) │
└─────────────────────────────────────────────────────┘
This entry remembers: "traffic from client:54321 goes to pod:8080"
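To make the two tuples concrete, here is a minimal sketch that pulls the reply-direction source (the pod the DNAT points at) out of a `conntrack -L` style line. The sample entry, its addresses, and the function name `reply_src` are illustrative assumptions, not output captured from the cluster in this post.

```shell
#!/bin/sh
# reply_src: print the reply-direction source IP of one conntrack entry.
# A conntrack -L line carries two tuples; the second src= field is the
# backend that sends the replies, i.e. the pod selected by DNAT.
reply_src() {
  printf '%s\n' "$1" | awk '{
    n = 0
    for (i = 1; i <= NF; i++)
      if ($i ~ /^src=/ && ++n == 2) { print substr($i, 5); exit }
  }'
}

# Hypothetical entry shaped like the diagram above (addresses are made up)
SAMPLE='tcp 6 431999 ESTABLISHED src=203.0.113.10 dst=192.0.2.5 sport=54321 dport=30080 src=10.1.1.100 dst=203.0.113.10 sport=8080 dport=54321 [ASSURED] mark=0 use=1'

reply_src "$SAMPLE"
```

For the sample entry this prints 10.1.1.100. The first tuple only ever shows the node address and NodePort, which is why the pod IP has to be read from the second one.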
The Problem During Deployment
Timeline:
T+0:00 Old pod IP: 10.1.1.100
New pod IP: 10.1.1.200
Kubernetes: "Endpoint 10.1.1.100 removed!"
kube-proxy: "iptables rules updated!"
But conntrack table still has:
┌─────────────────────────────────────────┐
│ tcp src=client:54321 dst=node:30080 │
│ src=10.1.1.100:8080 dst=client │ ← Points to DEAD pod!
│ timeout=still_has_time │
└─────────────────────────────────────────┘
T+0:01 Client sends packet on same connection
→ Conntrack: "I know this! Send to 10.1.1.100"
→ Packet goes to dead pod
→ Connection reset / timeout
→ 503 error!
T+2:00 Conntrack entries finally expire
New connections get new NAT to 10.1.1.200
Errors stop
Why It’s Exactly 2 Minutes
# Check conntrack timeout for established TCP
cat /proc/sys/net/netfilter/nf_conntrack_tcp_timeout_established
# 432000 (5 days - not relevant)
# The 2-minute pattern comes from:
# 1. Client-side keepalive/retry settings
# 2. HTTP client connection pool timeout
# 3. Load balancer health check intervals
# The conntrack entry itself could live for days,
# but the client eventually gives up and creates a new connection
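The gap between the kernel timeout and the observed error window is easier to see with the arithmetic spelled out. The helper below is a small sketch; `human_timeout` is a hypothetical name, and the two values passed to it are the 432000-second default and the 2-minute window described in this post.

```shell
#!/bin/sh
# human_timeout: render a timeout given in seconds as days/hours/minutes.
human_timeout() {
  secs=$1
  echo "$(( secs / 86400 ))d $(( (secs % 86400) / 3600 ))h $(( (secs % 3600) / 60 ))m"
}

human_timeout 432000   # kernel default for established TCP entries
human_timeout 120      # error window observed after each deploy

# If the sysctl is readable on this machine, show its current value too
F=/proc/sys/net/netfilter/nf_conntrack_tcp_timeout_established
if [ -r "$F" ]; then
  human_timeout "$(cat "$F")"
fi
```

The first call prints 5d 0h 0m and the second 0d 0h 2m. The entry would outlive the error window by days, so its expiry cannot be what ends the errors.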
Diagnosis
Step 1: Watch Conntrack During Deploy
# Before deploy, note a client IP
CLIENT_IP="203.0.113.10"
# Watch conntrack entries for that client during deploy
watch -n 0.5 "conntrack -L -s $CLIENT_IP 2>/dev/null | grep -E '(ESTABLISHED|TIME_WAIT)'"
# You'll see entries pointing to the old pod IP
# even after the pod is gone
Step 2: Verify Dead Pod Traffic
# tcpdump on a node during deploy
tcpdump -i any host 10.1.1.100 # old pod IP
# You'll see packets being sent to the old IP
# after the pod is gone
Step 3: Count Stale Entries
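#!/bin/bash
# count-stale-conntrack.sh
# Get current endpoint IPs
VALID_IPS=$(kubectl get endpoints my-service -o jsonpath='{.subsets[*].addresses[*].ip}')
# Count conntrack entries whose NAT target is not a current endpoint.
# Only entries replying from the pod port (8080) are checked, so unrelated
# connections on the node are not reported as stale.
conntrack -L -p tcp 2>/dev/null | grep 'sport=8080 ' | while read -r line; do
  # The first dst= is the node address; the pod IP is the second src= (reply tuple)
  BACKEND_IP=$(echo "$line" | grep -oP 'src=\K[0-9.]+' | sed -n '2p')
  if [[ ! " $VALID_IPS " =~ " $BACKEND_IP " ]]; then
    echo "STALE: $line"
  fi
done | wc -l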
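The same check can be written as a filter that reads entries on stdin, which makes it possible to try the logic on saved output without touching a node. This is a sketch under assumptions: `stale_entries` is a hypothetical helper, and the two sample entries are made-up lines in `conntrack -L` format.

```shell
#!/bin/sh
# stale_entries: read conntrack -L style lines on stdin and print those whose
# reply-direction source (the DNAT backend) is NOT in the space-separated
# list of valid endpoint IPs passed as $1.
stale_entries() {
  awk -v valid="$1" '
    BEGIN { n = split(valid, ips, " "); for (i = 1; i <= n; i++) ok[ips[i]] = 1 }
    {
      seen = 0; backend = ""
      for (i = 1; i <= NF; i++)
        if ($i ~ /^src=/ && ++seen == 2) backend = substr($i, 5)
      if (backend != "" && !(backend in ok)) print "STALE: " $0
    }'
}

# Hypothetical table: one entry for a removed pod, one for a live pod
SAMPLE_TABLE='tcp 6 431990 ESTABLISHED src=203.0.113.10 dst=192.0.2.5 sport=54321 dport=30080 src=10.1.1.100 dst=203.0.113.10 sport=8080 dport=54321 [ASSURED]
tcp 6 431995 ESTABLISHED src=203.0.113.11 dst=192.0.2.5 sport=40000 dport=30080 src=10.1.1.200 dst=203.0.113.11 sport=8080 dport=40000 [ASSURED]'

printf '%s\n' "$SAMPLE_TABLE" | stale_entries "10.1.1.200 10.1.1.201"
```

With the sample data, only the entry pointing at 10.1.1.100 is reported. On a node, the input would come from `conntrack -L -p tcp 2>/dev/null` and the valid list from the `kubectl get endpoints` call used in the script above.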
The Fix
Option 1: Flush Conntrack During Deploy
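# Add to deployment as a preStop on OLD pods
lifecycle:
  preStop:
    exec:
      command:
        - /bin/sh
        - -c
        - |
          # Wait for endpoint removal to propagate
          sleep 5
          # Get this pod's IP
          POD_IP=$(hostname -i)
          # Flush conntrack entries whose NAT target is this pod.
          # The pod IP is the reply-direction source, so match with --reply-src
          # (-d matches the original destination, which is the node address
          # for NodePort traffic).
          # This requires NET_ADMIN capability and conntrack in the image.
          # Caveat: conntrack state is per network namespace and the NodePort
          # DNAT entries live in the node's namespace, so verify that this
          # actually deletes entries in your setup; otherwise use Option 2.
          conntrack -D -p tcp --reply-src "$POD_IP" || true
# Grant NET_ADMIN capability
securityContext:
  capabilities:
    add: ["NET_ADMIN"]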
Option 2: Node-Level Conntrack Flush
# DaemonSet that watches for endpoint changes
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: conntrack-flusher
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: conntrack-flusher
  template:
    metadata:
      labels:
        app: conntrack-flusher
    spec:
      hostNetwork: true
      serviceAccountName: conntrack-flusher
      containers:
        - name: flusher
          image: alpine
          securityContext:
            privileged: true
          command:
            - /bin/sh
            - -c
            - |
              apk add --no-cache conntrack-tools curl
              while true; do
                # Watch for endpoint deletions via API
                # Flush conntrack when pods are removed
                sleep 10
              done
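The loop body in the DaemonSet is left as a placeholder. One way to fill it in, sketched here under assumptions, is to compare the endpoint list from the previous poll with the current one and flush entries for every IP that disappeared. The helper names `removed_ips` and `flush_removed` and the `CONNTRACK_CMD` variable are hypothetical; how the endpoint list is fetched from the API is deliberately left out.

```shell
#!/bin/sh
# removed_ips: print IPs present in the previous endpoint list ($1) but
# absent from the current one ($2). Both are space-separated strings.
removed_ips() {
  for ip in $1; do
    case " $2 " in
      *" $ip "*) ;;        # still an endpoint, keep its entries
      *) echo "$ip" ;;     # endpoint disappeared since the last poll
    esac
  done
}

# flush_removed: delete conntrack entries whose DNAT backend was removed.
# CONNTRACK_CMD defaults to a dry run that only prints the command;
# set CONNTRACK_CMD=conntrack on a node to actually delete entries.
flush_removed() {
  for ip in $(removed_ips "$1" "$2"); do
    ${CONNTRACK_CMD:-echo conntrack} -D -p tcp --reply-src "$ip"
  done
}

# Dry run with made-up lists: 10.1.1.100 was replaced by 10.1.1.200
flush_removed "10.1.1.100 10.1.1.101" "10.1.1.101 10.1.1.200"
```

In dry-run mode this prints the single delete command for 10.1.1.100. Matching on `--reply-src` targets only connections that were NATed to the removed pod, which is narrower than flushing the whole table.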
Option 3: Use Headless Service (Avoid NAT)
# Headless service = direct pod IPs, no NAT, no conntrack issue
apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  clusterIP: None # Headless!
  selector:
    app: my-app
  ports:
    - port: 8080
# Clients connect directly to pod IPs
# No DNAT = no stale conntrack entries
# But requires client-side load balancing
Option 4: Reduce Conntrack Timeouts
# Reduce established connection timeout (careful!)
sysctl -w net.netfilter.nf_conntrack_tcp_timeout_established=120
# Reduce FIN_WAIT timeout
sysctl -w net.netfilter.nf_conntrack_tcp_timeout_fin_wait=30
# Reduce TIME_WAIT timeout
sysctl -w net.netfilter.nf_conntrack_tcp_timeout_time_wait=30
# These affect ALL connections, not just stale ones
# May break long-lived connections
Option 5: Graceful Connection Draining
// In your application: drain connections before shutdown
import (
	"context"
	"log"
	"net/http"
	"sync/atomic"
	"time"
)

// healthStatus backs the health endpoint; false makes it report unhealthy
var healthStatus atomic.Bool

func gracefulShutdown(srv *http.Server) {
	// Signal that we're shutting down by failing the health endpoint
	healthStatus.Store(false)
	// Wait for load balancer to stop sending traffic
	time.Sleep(10 * time.Second)
	// Now gracefully shut down existing connections
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	if err := srv.Shutdown(ctx); err != nil {
		log.Printf("graceful shutdown failed: %v", err)
	}
}
Monitoring
Prometheus Metrics
groups:
  - name: conntrack
    rules:
      - alert: StaleConntrackEntries
        # node_nf_conntrack_entries is a gauge, so use deriv() rather than rate();
        # rate() never returns a negative value
        expr: |
          node_nf_conntrack_entries > 50000 and
          deriv(node_nf_conntrack_entries[5m]) < 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Possible stale conntrack entries on {{ $labels.instance }}"
      - alert: DeploymentErrors
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[1m])) by (deployment) /
          sum(rate(http_requests_total[1m])) by (deployment) > 0.01
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Elevated 5xx rate during deployment"
Deploy Monitoring Script
#!/bin/bash
# monitor-deploy.sh - Run during deployment
SERVICE="my-service"
INTERVAL=5
while true; do
  # Get current endpoint IPs
  ENDPOINTS=$(kubectl get endpoints "$SERVICE" -o jsonpath='{.subsets[*].addresses[*].ip}' | tr ' ' '\n' | sort -u)
  # Get NAT target IPs (second src= field, the reply tuple) for entries replying from the pod port
  CONNTRACK_DSTS=$(conntrack -L -p tcp 2>/dev/null | grep "sport=8080 " | awk '{n=0; for(i=1;i<=NF;i++) if ($i ~ /^src=/ && ++n==2) print substr($i,5)}' | sort -u)
  # Find stale entries
  STALE=$(comm -23 <(echo "$CONNTRACK_DSTS") <(echo "$ENDPOINTS"))
  if [ -n "$STALE" ]; then
    echo "$(date): STALE entries pointing to: $STALE"
  fi
  sleep "$INTERVAL"
done
Checklist
## Conntrack Stale NAT Mapping
### Symptoms
- [ ] Errors last about 2 minutes after deploy
- [ ] Same pattern every deployment
- [ ] Longer preStop/terminationGrace doesn't help
- [ ] Uses NodePort or LoadBalancer service
### Diagnosis
- [ ] Watch conntrack during deploy
- [ ] Compare conntrack dst IPs vs current endpoints
- [ ] tcpdump traffic to old pod IPs
### Fixes
- [ ] Flush conntrack entries for dying pods
- [ ] Use headless service (if possible)
- [ ] Implement proper connection draining
- [ ] Reduce conntrack timeouts (carefully)
### Prevention
- [ ] Add NET_ADMIN capability for conntrack flush
- [ ] Implement graceful shutdown in app
- [ ] Consider service mesh (handles this automatically)
Conclusion
This problem exposes a fundamental impedance mismatch between Kubernetes abstractions and Linux networking primitives. Kubernetes thinks in terms of pods and endpoints. Linux thinks in terms of IP addresses and connections. When a pod dies, Kubernetes does its job—removes the endpoint, updates iptables rules. But Linux conntrack doesn’t know or care about Kubernetes. It just remembers NAT mappings and uses them.
The “exactly 2 minutes” pattern comes from client behavior, not Kubernetes behavior. Most HTTP clients will retry or recreate connections after about 2 minutes of failure. When they do, they get new source ports, which don’t match any stale conntrack entries, so they get fresh NAT lookups through the updated iptables rules.
The fix requires breaking the abstraction barrier and touching kernel state directly. Flushing conntrack entries for dying pods, using headless services to avoid NAT entirely, or deploying a service mesh that handles connection draining properly—all of these require understanding what’s happening below the Kubernetes layer.
Key takeaways:
- Standard pod lifecycle fixes don’t help - the problem is at node kernel level, not pod level
- Pattern is eerily consistent - because it’s governed by client timeouts, not server behavior
- Requires kernel-level understanding - conntrack is invisible to Kubernetes monitoring
- Fix requires elevated privileges - NET_ADMIN capability or privileged containers
The fundamental lesson: when debugging Kubernetes networking issues, don’t stop at the Kubernetes layer. The actual networking happens in Linux, and sometimes you need to go down to conntrack, iptables, and ss to understand what’s really happening.
Related Articles
- kube-proxy xtables Lock Contention - Another kube-proxy issue
- Ephemeral Port Exhaustion - Related NAT problem