Kubernetes Ghost Connections: Stale Conntrack DNAT Entries
Stale DNAT entries are one of those bugs you only meet at scale. “Requests randomly fail with connection refused after pod scaling.” The cause: the Linux conntrack table keeps DNAT entries for deleted pods, and kube-proxy doesn’t clean them up immediately.
Environment: Kubernetes 1.20+, iptables mode kube-proxy, Services with multiple backends, frequent scaling events
The Problem
The Intermittent Failures
Timeline of disaster:
T+0:00  3 pods running for service-a
        Pod IPs: 10.0.1.10, 10.0.1.11, 10.0.1.12
        Conntrack entries established
T+0:30  Scale down to 2 pods
        Pod 10.0.1.12 terminated
        Endpoints updated: 10.0.1.10, 10.0.1.11
T+0:31  New request arrives from client on an existing flow
        (same 5-tuple, e.g. a pooled keep-alive connection)
        Conntrack lookup: "I know this flow! → 10.0.1.12"
        Packet sent to deleted pod IP
        Connection refused / timeout
T+2:00  Conntrack entry expires (120s is the TIME_WAIT default;
        established TCP entries default to 5 days)
        Traffic finally goes to correct pods
Why Conntrack Caches Routes
Linux Connection Tracking (conntrack):
┌─────────────────────────────────────────────────────────────┐
│ Incoming packet to Service VIP 10.96.0.100:80               │
│                                                             │
│ First packet (no conntrack entry):                          │
│   1. iptables DNAT rule selects backend: 10.0.1.12          │
│   2. Conntrack creates entry:                               │
│      src=10.0.2.50:45678 dst=10.96.0.100:80                 │
│      → DNAT to 10.0.1.12:8080                               │
│   3. Entry cached for connection lifetime + timeout         │
│                                                             │
│ Subsequent packets (conntrack hit):                         │
│   1. Lookup in conntrack table                              │
│   2. Apply cached DNAT: → 10.0.1.12:8080                    │
│   3. Skip the iptables NAT rules entirely!                  │
│                                                             │
│ Problem: Pod 10.0.1.12 deleted but conntrack entry lives    │
└─────────────────────────────────────────────────────────────┘
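The caching behaviour in the diagram can be modelled in a few lines of Go. This is a toy sketch, not how the kernel implements conntrack: the `natTable` type, its round-robin backend choice, and every function name are assumptions made for illustration (kube-proxy's iptables rules pick a backend at random).

```go
package main

import "fmt"

// flowKey is the tuple used to recognise packets of the same flow.
type flowKey struct {
	srcIP   string
	srcPort int
	dstIP   string
	dstPort int
}

// natTable is a toy model of DNAT with connection tracking.
type natTable struct {
	backends []string           // current endpoints, i.e. what the NAT rules choose from
	entries  map[flowKey]string // cached DNAT decision per flow
	next     int                // round-robin cursor (assumption, see lead-in)
}

func newNATTable(backends ...string) *natTable {
	return &natTable{backends: backends, entries: map[flowKey]string{}}
}

// translate returns the backend that a packet of flow k is sent to.
func (t *natTable) translate(k flowKey) string {
	if b, ok := t.entries[k]; ok {
		return b // conntrack hit: cached decision, rules are not consulted
	}
	if len(t.backends) == 0 {
		return ""
	}
	b := t.backends[t.next%len(t.backends)]
	t.next++
	t.entries[k] = b
	return b
}

// removeBackend updates the rules only; cached entries are left untouched.
func (t *natTable) removeBackend(addr string) {
	kept := t.backends[:0]
	for _, b := range t.backends {
		if b != addr {
			kept = append(kept, b)
		}
	}
	t.backends = kept
}

func main() {
	tbl := newNATTable("10.0.1.12:8080", "10.0.1.10:8080", "10.0.1.11:8080")
	existing := flowKey{"10.0.2.50", 45678, "10.96.0.100", 80}
	fmt.Println("first packet:  ", tbl.translate(existing))

	tbl.removeBackend("10.0.1.12:8080")
	fmt.Println("after deletion:", tbl.translate(existing)) // still the deleted pod

	fresh := flowKey{"10.0.2.50", 45679, "10.96.0.100", 80}
	fmt.Println("new flow:      ", tbl.translate(fresh)) // a live backend
}
```

Only the flow that already has an entry keeps hitting the deleted pod. A flow with a different source port misses the cache and is balanced across the remaining backends.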
Root Cause
The Conntrack Lifecycle
# View current conntrack entries
conntrack -L -d 10.96.0.100
# Output showing stale entry:
tcp 6 117 TIME_WAIT
src=10.0.2.50 dst=10.96.0.100 sport=45678 dport=80
src=10.0.1.12 dst=10.0.2.50 sport=8080 dport=45678 [ASSURED]
mark=0 use=1
# This entry will route traffic to 10.0.1.12 for another 117 seconds
# Even though the pod is gone!
When Kube-Proxy Doesn’t Help
// kube-proxy behavior on endpoint removal:
// 1. Updates iptables rules (removes backend)
// 2. Does NOT clean conntrack entries for TCP flows
//    (it does try to flush UDP entries for removed endpoints,
//    but packets can still race with that cleanup)
// Why? Performance - conntrack cleanup is expensive
// Also: Race condition - pod might just be restarting
// The problem flows:
// - Long-lived connections (gRPC streams, WebSockets)
// - UDP traffic (DNS, metrics)
// - Connection pools with keepalive
// Every packet refreshes the entry's timeout, so these flows
// can keep a conntrack entry alive indefinitely
UDP Is Especially Bad
UDP conntrack timeout: 30 seconds (shorter but still problematic)
UDP flow to CoreDNS:
T+0:00  DNS query to kube-dns service
        Conntrack: src=pod → dst=10.96.0.10 (kube-dns VIP)
        DNAT to: 10.0.1.50 (coredns pod)
T+0:10  CoreDNS pod rescheduled to 10.0.1.51
T+0:15  Next DNS query
        Conntrack hit → still routes to 10.0.1.50
        DNS timeout! Pod resolves nothing!
# UDP has no connection state to detect failure
# Client keeps sending to dead endpoint
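Why a client that keeps sending never recovers on its own can be shown with a little arithmetic: each packet seen before the entry expires resets the idle timer. The sketch below assumes a sorted list of packet times in seconds. The function name `expiresAt` is hypothetical, and the model ignores everything except the idle timeout.

```go
package main

import "fmt"

// expiresAt returns the second at which a conntrack entry finally expires,
// given the (sorted) times at which packets of the flow are seen and the
// idle timeout. A packet that arrives before expiry resets the timer.
func expiresAt(packetTimes []int, timeout int) int {
	if len(packetTimes) == 0 {
		return 0
	}
	expiry := packetTimes[0] + timeout
	for _, ts := range packetTimes[1:] {
		if ts >= expiry {
			break // entry already expired; this packet would create a fresh one
		}
		expiry = ts + timeout
	}
	return expiry
}

func main() {
	// One query, then silence: the entry is gone after the 30s UDP timeout.
	fmt.Println(expiresAt([]int{0}, 30))

	// A query every 15s keeps pushing the expiry out.
	fmt.Println(expiresAt([]int{0, 15, 30, 45}, 30))
}
```

With a query every 15 seconds and a 30 second timeout, the entry outlives the last query by another 30 seconds, so the stale mapping persists for as long as the client keeps retrying on the same 5-tuple.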
Diagnosis
Check for Stale Entries
# List all conntrack entries for a service
# (use the kube-proxy pod on the client's node: conntrack state is per node)
kubectl exec -n kube-system <kube-proxy-pod> -- \
conntrack -L -d <service-cluster-ip> 2>/dev/null
# Find entries pointing to non-existent endpoints
# Compare against current endpoints:
kubectl get endpoints <service-name> -o yaml
# Look for DNAT targets not in endpoint list
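The comparison described above can be automated. The following is a minimal sketch that assumes each conntrack entry arrives on a single line, as `conntrack -L` prints it, and that the caller supplies the current endpoint IPs (for example, copied from the `kubectl get endpoints` output). The helper names `replySource` and `staleTargets` are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// replySource returns the second src= field of a conntrack -L line.
// For a DNAT'd flow that is the backend the traffic is actually sent to.
func replySource(line string) (string, bool) {
	seen := 0
	for _, field := range strings.Fields(line) {
		if strings.HasPrefix(field, "src=") {
			seen++
			if seen == 2 {
				return strings.TrimPrefix(field, "src="), true
			}
		}
	}
	return "", false
}

// staleTargets lists DNAT targets in the conntrack output that are not
// among the current endpoints. Each target is reported once.
func staleTargets(conntrackOutput string, endpoints []string) []string {
	live := make(map[string]bool, len(endpoints))
	for _, ip := range endpoints {
		live[ip] = true
	}
	reported := map[string]bool{}
	var stale []string
	for _, line := range strings.Split(conntrackOutput, "\n") {
		ip, ok := replySource(line)
		if !ok || live[ip] || reported[ip] {
			continue
		}
		reported[ip] = true
		stale = append(stale, ip)
	}
	return stale
}

func main() {
	output := `tcp 6 117 TIME_WAIT src=10.0.2.50 dst=10.96.0.100 sport=45678 dport=80 src=10.0.1.12 dst=10.0.2.50 sport=8080 dport=45678 [ASSURED] mark=0 use=1
tcp 6 86390 ESTABLISHED src=10.0.2.50 dst=10.96.0.100 sport=45700 dport=80 src=10.0.1.10 dst=10.0.2.50 sport=8080 dport=45700 [ASSURED] mark=0 use=1`
	endpoints := []string{"10.0.1.10", "10.0.1.11"}
	fmt.Println("stale DNAT targets:", staleTargets(output, endpoints))
}
```

The second sample line is made up for the example. Only the first one mirrors the entry shown earlier in this article.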
Monitor Conntrack Table
# Conntrack table statistics
conntrack -S
# Output:
# cpu=0 found=12847 invalid=32 insert=0 insert_failed=0 drop=0
# cpu=1 found=11923 invalid=28 insert=0 insert_failed=0 drop=0
# Watch for insert_failed - table might be full
# Watch for high invalid - packets conntrack could not match to a valid flow
# (a hint worth investigating, not proof of stale entries)
# Table size and limits
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max
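The two files above are enough to compute table utilization. Here is a small sketch that reads them and prints the ratio. The helper names are hypothetical, and the 80% threshold is simply the one used by the alert rule later in this article.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseCounter converts the content of a /proc sysctl file, such as
// "12847\n", into an int.
func parseCounter(raw string) (int, error) {
	return strconv.Atoi(strings.TrimSpace(raw))
}

// utilization returns count/limit as a fraction, or 0 for a non-positive limit.
func utilization(count, limit int) float64 {
	if limit <= 0 {
		return 0
	}
	return float64(count) / float64(limit)
}

// readCounter reads and parses one counter file.
func readCounter(path string) (int, error) {
	raw, err := os.ReadFile(path)
	if err != nil {
		return 0, err
	}
	return parseCounter(string(raw))
}

func main() {
	count, err := readCounter("/proc/sys/net/netfilter/nf_conntrack_count")
	if err != nil {
		fmt.Println("cannot read conntrack count:", err)
		return
	}
	limit, err := readCounter("/proc/sys/net/netfilter/nf_conntrack_max")
	if err != nil {
		fmt.Println("cannot read conntrack max:", err)
		return
	}
	u := utilization(count, limit)
	fmt.Printf("conntrack table: %d/%d (%.1f%%)\n", count, limit, u*100)
	if u > 0.8 {
		fmt.Println("warning: table is more than 80% full")
	}
}
```

Run it on the node itself or in a host-network pod. On a machine without the conntrack module loaded the files do not exist and the program only reports the read error.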
Correlate with Pod Events
# Watch pod deletions and connection failures together
kubectl get events -w --field-selector reason=Killing &
kubectl logs -f <client-pod> | grep -i "connection refused"
# If connection refused spikes after pod terminations
# → stale conntrack is likely the cause
The Fix
Option 1: Graceful Pod Termination
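apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  terminationGracePeriodSeconds: 60
  containers:
    - name: app
      image: your-app-image
      lifecycle:
        preStop:
          exec:
            command:
              - /bin/sh
              - -c
              - |
                # The pod is removed from service endpoints when termination starts.
                # Keep serving while kube-proxy updates its rules and
                # short-lived conntrack entries expire.
                # Only then does the container receive SIGTERM.
                sleep 30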
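// In your application - drain connections before exit
package main

import (
	"context"
	"log"
	"net/http"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	// ctx is cancelled when SIGTERM arrives
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM)
	defer stop()
	server := &http.Server{Addr: ":8080"} // placeholder address
	go func() {
		if err := server.ListenAndServe(); err != http.ErrServerClosed {
			log.Fatal(err)
		}
	}()
	<-ctx.Done()
	// Stop accepting new connections; wait for existing ones, bounded by the grace period
	shutdownCtx, cancel := context.WithTimeout(context.Background(), 25*time.Second)
	defer cancel()
	log.Println("shutdown:", server.Shutdown(shutdownCtx))
}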
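The preStop sleep and the application's drain phase share one budget: the grace period starts counting when termination begins, so the sleep is paid out of `terminationGracePeriodSeconds` before SIGTERM is even delivered. The sketch below works out how long the shutdown timeout may be. The function name `drainBudget` and the 5 second safety margin are assumptions for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// drainBudget returns how long the application may spend draining after
// SIGTERM without being SIGKILLed: the grace period minus the preStop
// sleep minus a safety margin.
func drainBudget(gracePeriod, preStopSleep, margin time.Duration) (time.Duration, error) {
	left := gracePeriod - preStopSleep - margin
	if left <= 0 {
		return 0, errors.New("preStop sleep and margin consume the whole grace period")
	}
	return left, nil
}

func main() {
	// Values from the Pod spec above: 60s grace period, 30s preStop sleep.
	budget, err := drainBudget(60*time.Second, 30*time.Second, 5*time.Second)
	if err != nil {
		fmt.Println("invalid configuration:", err)
		return
	}
	fmt.Println("shutdown timeout:", budget)
}
```

If the sleep is raised without also raising the grace period, the function returns an error instead of a budget, which is the situation where in-flight requests get cut off by SIGKILL.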
Option 2: Aggressive Conntrack Cleanup
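# On pod deletion, clean up conntrack entries
# Run this on the nodes (host network namespace, NET_ADMIN), e.g. from a controller.
# A pod's own preStop hook cannot reach entries held on the client's node.
# Delete all conntrack entries for the pod IP
conntrack -D -d <pod-ip>
conntrack -D -s <pod-ip>
# Delete flows that were DNAT'd from the Service VIP to the pod
conntrack -D --orig-dst <service-cluster-ip> --dst-nat <pod-ip>
# For services specifically (port filters need a protocol)
conntrack -D -d <service-cluster-ip> -p tcp --dport <service-port>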
# DaemonSet for conntrack cleanup
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: conntrack-cleaner
spec:
  selector:
    matchLabels:
      app: conntrack-cleaner
  template:
    metadata:
      labels:
        app: conntrack-cleaner
    spec:
      hostNetwork: true
      containers:
        - name: cleaner
          image: your-cleaner-image
          securityContext:
            capabilities:
              add: ["NET_ADMIN"]
          # Watch endpoint changes and clean conntrack
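What such a cleaner has to compute is small: the set of endpoint IPs that disappeared, and one `conntrack -D` invocation per disappeared IP. The sketch below covers only that core as a dry run that prints the commands. Watching the Kubernetes API is left out, the helper names are hypothetical, and `--orig-dst` and `--dst-nat` are the conntrack filters for the original destination and the DNAT target.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// removedEndpoints returns the IPs present in previous but not in current,
// sorted for stable output.
func removedEndpoints(previous, current []string) []string {
	live := make(map[string]bool, len(current))
	for _, ip := range current {
		live[ip] = true
	}
	var removed []string
	for _, ip := range previous {
		if !live[ip] {
			removed = append(removed, ip)
		}
	}
	sort.Strings(removed)
	return removed
}

// conntrackDeleteArgs builds the arguments for deleting flows that were
// addressed to serviceIP and DNAT'd to podIP.
func conntrackDeleteArgs(serviceIP, podIP string) []string {
	return []string{"-D", "--orig-dst", serviceIP, "--dst-nat", podIP}
}

func main() {
	previous := []string{"10.0.1.10", "10.0.1.11", "10.0.1.12"}
	current := []string{"10.0.1.10", "10.0.1.11"}

	for _, podIP := range removedEndpoints(previous, current) {
		// Dry run: print the command instead of executing it, because
		// deleting entries needs NET_ADMIN in the node's network namespace.
		args := conntrackDeleteArgs("10.96.0.100", podIP)
		fmt.Println("conntrack", strings.Join(args, " "))
	}
}
```

A real cleaner would run the printed command with `os/exec` on every node, since the stale entries live on the nodes where the clients run, not on the node that hosted the deleted pod.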
Option 3: Use IPVS Mode
# kube-proxy config for IPVS mode
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: "ipvs"
ipvs:
  strictARP: true
# IPVS keeps its own connection table alongside conntrack
# kube-proxy configures it to expire connections whose backend was removed
# IPVS advantages:
# - Dedicated connection table (inspect with: ipvsadm -Lnc)
# - Automatic cleanup when backend removed
# - Better performance at scale
# Check current mode
kubectl -n kube-system get cm kube-proxy -o yaml | grep mode
Option 4: Client-Side Retry Logic
// Implement retry with backoff for transient failures
import (
	"context"
	"errors"
	"net/http"
	"syscall"
	"time"
)

func callService(ctx context.Context) error {
	backoff := []time.Duration{10 * time.Millisecond, 100 * time.Millisecond, 1 * time.Second}
	var lastErr error
	for i := 0; i <= len(backoff); i++ {
		req, err := http.NewRequestWithContext(ctx, http.MethodGet, "http://service-a/endpoint", nil)
		if err != nil {
			return err
		}
		resp, err := http.DefaultClient.Do(req)
		if err == nil {
			resp.Body.Close()
			return nil
		}
		lastErr = err
		// Retry only on connection refused (stale conntrack symptom),
		// and only while backoff steps remain
		if !isConnectionRefused(err) || i == len(backoff) {
			break
		}
		select {
		case <-time.After(backoff[i]):
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return lastErr
}

func isConnectionRefused(err error) bool {
	// errors.Is unwraps *url.Error, *net.OpError and *os.SyscallError
	return errors.Is(err, syscall.ECONNREFUSED)
}
Option 5: Reduce Conntrack Timeouts
# Tune conntrack timeouts (node-level)
# WARNING: Affects all connections on node
# TCP established (default 432000 = 5 days!)
sysctl -w net.netfilter.nf_conntrack_tcp_timeout_established=3600
# TCP time_wait (default 120)
sysctl -w net.netfilter.nf_conntrack_tcp_timeout_time_wait=30
# UDP (default 30)
sysctl -w net.netfilter.nf_conntrack_udp_timeout=10
sysctl -w net.netfilter.nf_conntrack_udp_timeout_stream=30
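# Apply via DaemonSet
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: sysctl-tuner
spec:
  selector:
    matchLabels:
      app: sysctl-tuner
  template:
    metadata:
      labels:
        app: sysctl-tuner
    spec:
      hostNetwork: true  # conntrack timeouts are per network namespace
      initContainers:
        - name: sysctl
          image: busybox
          securityContext:
            privileged: true
          command:
            - sysctl
            - -w
            - net.netfilter.nf_conntrack_tcp_timeout_time_wait=30
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9  # keeps the pod running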
Monitoring
groups:
  - name: conntrack
    rules:
      - alert: ConntrackTableNearFull
        expr: |
          node_nf_conntrack_entries /
          node_nf_conntrack_entries_limit > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Conntrack table {{ $value | humanizePercentage }} full"
      - alert: HighConntrackInsertFailed
        expr: |
          rate(node_nf_conntrack_stat_insert_failed[5m]) > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Conntrack inserts failing - table full or hash collision"
      - alert: ServiceEndpointChurn
        expr: |
          changes(kube_endpoint_address_available[5m]) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High endpoint churn - conntrack stale risk"
Checklist
## Kubernetes Conntrack Stale DNAT
### Diagnosis
- [ ] Check conntrack entries: conntrack -L -d <service-ip>
- [ ] Compare entries against current endpoints
- [ ] Look for connection refused after scale events
- [ ] Check conntrack table utilization
### Prevention
- [ ] Implement graceful pod termination (preStop hook)
- [ ] Add sleep before pod exit (conntrack expiry)
- [ ] Consider IPVS mode for better cleanup
- [ ] Implement client retry logic
### Tuning
- [ ] Review conntrack timeout values
- [ ] Monitor conntrack table size
- [ ] Alert on endpoint churn rate
Conclusion
The lesson: Kubernetes Services rely on Linux conntrack for connection routing, but conntrack outlives pod lifecycles. Stale DNAT entries cause “ghost connections” to deleted pods.
Key principles:
- Conntrack caches DNAT decisions - bypasses iptables rules
- Pod deletion doesn’t clean conntrack - entries live until timeout
- UDP is worse than TCP - no connection state to detect failures
- IPVS mode handles this better - its own connection table, expired when a backend is removed
Related Articles
- Kubernetes DNS Resolution Failures - CoreDNS and DNS issues
- Kubernetes Headless Service Stale DNS - DNS caching problems