Kubernetes Ghost Connections: Stale Conntrack DNAT Entries

Stale DNAT entries are one of those bugs you only meet at scale. The symptom: “Requests randomly fail with connection refused after pod scaling.” The cause: the Linux conntrack table keeps DNAT entries for deleted pods, and kube-proxy doesn’t always clean them up in time.

Environment: Kubernetes 1.20+, iptables mode kube-proxy, Services with multiple backends, frequent scaling events

The Problem

The Intermittent Failures

Timeline of disaster:

T+0:00   3 pods running for service-a
         Pod IPs: 10.0.1.10, 10.0.1.11, 10.0.1.12
         Conntrack entries established

T+0:30   Scale down to 2 pods
         Pod 10.0.1.12 terminated
         Endpoints updated: 10.0.1.10, 10.0.1.11

T+0:31   Next packet arrives on an existing flow (same 5-tuple)
         Conntrack lookup: "I know this flow! → 10.0.1.12"
         Packet sent to deleted pod IP
         Connection refused / timeout

T+2:00   Conntrack entry expires (120s is the TIME_WAIT default;
         established TCP entries default to 5 days)
         Traffic finally goes to correct pods

Why Conntrack Caches Routes

Linux Connection Tracking (conntrack):

┌─────────────────────────────────────────────────────────────┐
│ Incoming packet to Service VIP 10.96.0.100:80               │
│                                                             │
│ First packet (no conntrack entry):                          │
│ 1. iptables DNAT rule selects backend: 10.0.1.12            │
│ 2. Conntrack creates entry:                                 │
│    src=10.0.2.50:45678 dst=10.96.0.100:80                   │
│    → DNAT to 10.0.1.12:8080                                 │
│ 3. Entry cached for connection lifetime + timeout           │
│                                                             │
│ Subsequent packets (conntrack hit):                         │
│ 1. Lookup in conntrack table                                │
│ 2. Apply cached DNAT: → 10.0.1.12:8080                      │
│ 3. Skip the nat table: DNAT rules are not re-evaluated      │
│                                                             │
│ Problem: Pod 10.0.1.12 is deleted but the entry lives on    │
└─────────────────────────────────────────────────────────────┘
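
The caching behaviour in the diagram can be modelled in a few lines of Go. This is a toy sketch, not kernel code: the `flowKey` and `table` types and the `route` method are invented for illustration, and backend selection is simplified to "pick the last endpoint" so the run is deterministic.

```go
package main

import "fmt"

// flowKey is a simplified stand-in for the conntrack tuple (hypothetical type).
type flowKey struct {
	srcIP   string
	srcPort int
	dstIP   string
	dstPort int
}

// table maps a flow to the backend chosen when its first packet was DNATed.
type table map[flowKey]string

// route returns the backend for a packet. On a hit the cached backend is
// returned without looking at the current endpoint list; only a miss
// consults the endpoints.
func (t table) route(k flowKey, endpoints []string) (backend string, hit bool) {
	if b, ok := t[k]; ok {
		return b, true
	}
	if len(endpoints) == 0 {
		return "", false
	}
	// kube-proxy rules pick a backend at random; the last one is used here
	// only to keep the example deterministic.
	b := endpoints[len(endpoints)-1]
	t[k] = b
	return b, false
}

func main() {
	t := table{}
	flow := flowKey{"10.0.2.50", 45678, "10.96.0.100", 80}

	b, hit := t.route(flow, []string{"10.0.1.10", "10.0.1.11", "10.0.1.12"})
	fmt.Println(b, hit) // 10.0.1.12 false

	// Scale down: 10.0.1.12 is gone, but the same flow still maps to it.
	b, hit = t.route(flow, []string{"10.0.1.10", "10.0.1.11"})
	fmt.Println(b, hit) // 10.0.1.12 true

	// A new source port is a new flow, so it gets a live backend.
	flow.srcPort = 45679
	b, hit = t.route(flow, []string{"10.0.1.10", "10.0.1.11"})
	fmt.Println(b, hit) // 10.0.1.11 false
}
```

The last call shows why only flows that reuse the same 5-tuple are affected: a fresh source port misses the cache and goes through backend selection again.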

Root Cause

The Conntrack Lifecycle

# View current conntrack entries
conntrack -L -d 10.96.0.100

# Output showing stale entry:
tcp  6 117 TIME_WAIT
  src=10.0.2.50 dst=10.96.0.100 sport=45678 dport=80
  src=10.0.1.12 dst=10.0.2.50 sport=8080 dport=45678 [ASSURED]
  mark=0 use=1

# This entry will route traffic to 10.0.1.12 for another 117 seconds
# Even though the pod is gone!
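
To check many entries at once it helps to pull the DNAT target out of each line programmatically. The sketch below assumes the single-line layout that `conntrack -L` prints (protocol name, protocol number, remaining seconds, then the original tuple followed by the reply tuple); `entry` and `parseEntry` are hypothetical names, and other conntrack-tools output formats are not handled.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// entry holds the fields of interest from one line of `conntrack -L` output.
type entry struct {
	proto      string
	ttlSeconds int
	origDst    string // address the client dialled (the Service VIP)
	replySrc   string // address that answers (the DNAT target, i.e. the pod)
}

// parseEntry extracts the protocol, remaining lifetime, original destination
// and reply source from a single conntrack line.
func parseEntry(line string) (entry, error) {
	fields := strings.Fields(line)
	if len(fields) < 3 {
		return entry{}, fmt.Errorf("short line: %q", line)
	}
	ttl, err := strconv.Atoi(fields[2])
	if err != nil {
		return entry{}, fmt.Errorf("bad ttl in %q: %w", line, err)
	}
	e := entry{proto: fields[0], ttlSeconds: ttl}
	srcSeen, dstSeen := 0, 0
	for _, f := range fields[3:] {
		switch {
		case strings.HasPrefix(f, "src="):
			srcSeen++
			if srcSeen == 2 { // second src= belongs to the reply tuple
				e.replySrc = strings.TrimPrefix(f, "src=")
			}
		case strings.HasPrefix(f, "dst="):
			dstSeen++
			if dstSeen == 1 { // first dst= belongs to the original tuple
				e.origDst = strings.TrimPrefix(f, "dst=")
			}
		}
	}
	if e.origDst == "" || e.replySrc == "" {
		return entry{}, fmt.Errorf("missing tuples in %q", line)
	}
	return e, nil
}

func main() {
	line := "tcp 6 117 TIME_WAIT src=10.0.2.50 dst=10.96.0.100 sport=45678 dport=80 " +
		"src=10.0.1.12 dst=10.0.2.50 sport=8080 dport=45678 [ASSURED] mark=0 use=1"
	e, err := parseEntry(line)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("%s ttl=%ds vip=%s backend=%s\n", e.proto, e.ttlSeconds, e.origDst, e.replySrc)
	// prints: tcp ttl=117s vip=10.96.0.100 backend=10.0.1.12
}
```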

When Kube-Proxy Doesn’t Help

// kube-proxy behavior on endpoint removal:
// 1. Updates iptables rules (removes backend)
// 2. Flushes conntrack entries for removed UDP endpoints only
// 3. Does NOT clean TCP conntrack entries

// Why? Performance - conntrack cleanup is expensive
// Also: Race condition - pod might just be restarting,
// and a packet can re-create the entry between flush and rule update

// The problem flows:
// - Long-lived connections (gRPC streams, WebSockets)
// - UDP traffic (DNS, metrics)
// - Connection pools with keepalive

// Every packet refreshes the entry's timer, so these entries
// never expire on their own

UDP Is Especially Bad

UDP conntrack timeout: 30 seconds (shorter, but every packet the client sends resets the timer)

UDP flow to CoreDNS:

T+0:00   DNS query to kube-dns service
         Conntrack: src=pod → dst=10.96.0.10 (kube-dns VIP)
         DNAT to: 10.0.1.50 (coredns pod)

T+0:10   CoreDNS pod rescheduled to 10.0.1.51

T+0:15   Next DNS query
         Conntrack hit → still routes to 10.0.1.50
         DNS timeout! Pod resolves nothing!

# UDP has no connection state to detect failure
# Client keeps sending to dead endpoint
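
Because conntrack keys on the full 5-tuple, a client that opens a new UDP socket for each retry gets a new source port and therefore a new conntrack entry, which is DNATed against the current rules. The sketch below illustrates that idea only; `queryFresh`, `queryOnce`, and `startEcho` are hypothetical helpers, and the local echo server merely stands in for a DNS backend.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"time"
)

// queryOnce sends payload from a brand-new socket (new ephemeral source port)
// and waits up to timeout for one reply.
func queryOnce(addr string, payload []byte, timeout time.Duration) ([]byte, error) {
	conn, err := net.Dial("udp", addr)
	if err != nil {
		return nil, err
	}
	defer conn.Close()
	if err := conn.SetDeadline(time.Now().Add(timeout)); err != nil {
		return nil, err
	}
	if _, err := conn.Write(payload); err != nil {
		return nil, err
	}
	buf := make([]byte, 4096)
	n, err := conn.Read(buf)
	if err != nil {
		return nil, err
	}
	return buf[:n], nil
}

// queryFresh retries up to attempts times, using a fresh socket each time so
// a stale entry tied to the previous source port is not reused.
func queryFresh(addr string, payload []byte, attempts int, timeout time.Duration) ([]byte, error) {
	lastErr := errors.New("no attempts made")
	for i := 0; i < attempts; i++ {
		reply, err := queryOnce(addr, payload, timeout)
		if err == nil {
			return reply, nil
		}
		lastErr = err
	}
	return nil, lastErr
}

// startEcho runs a loopback UDP echo server used only for this demo.
func startEcho() (addr string, stop func(), err error) {
	pc, err := net.ListenPacket("udp", "127.0.0.1:0")
	if err != nil {
		return "", nil, err
	}
	go func() {
		buf := make([]byte, 4096)
		for {
			n, from, err := pc.ReadFrom(buf)
			if err != nil {
				return // listener closed
			}
			pc.WriteTo(buf[:n], from)
		}
	}()
	return pc.LocalAddr().String(), func() { pc.Close() }, nil
}

func main() {
	addr, stop, err := startEcho()
	if err != nil {
		fmt.Println("listen:", err)
		return
	}
	defer stop()
	reply, err := queryFresh(addr, []byte("ping"), 3, 500*time.Millisecond)
	fmt.Println(string(reply), err)
}
```

This does nothing for clients that must keep one long-lived UDP socket; for those, the stale entry has to be deleted or allowed to expire.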

Diagnosis

Check for Stale Entries

# List all conntrack entries for a service
kubectl exec -n kube-system <kube-proxy-pod> -- \
  conntrack -L -d <service-cluster-ip> 2>/dev/null

# Find entries pointing to non-existent endpoints
# Compare against current endpoints:
kubectl get endpoints <service-name> -o yaml

# Look for DNAT targets not in endpoint list
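
The comparison itself is a set difference. A minimal sketch, assuming the DNAT targets and the endpoint IPs have already been collected into string slices (the `staleBackends` helper is a hypothetical name):

```go
package main

import (
	"fmt"
	"sort"
)

// staleBackends returns the DNAT targets seen in conntrack that are absent
// from the current endpoint list, de-duplicated and sorted.
func staleBackends(conntrackTargets, endpoints []string) []string {
	live := make(map[string]struct{}, len(endpoints))
	for _, ip := range endpoints {
		live[ip] = struct{}{}
	}
	seen := make(map[string]struct{})
	var stale []string
	for _, ip := range conntrackTargets {
		if _, ok := live[ip]; ok {
			continue
		}
		if _, dup := seen[ip]; dup {
			continue
		}
		seen[ip] = struct{}{}
		stale = append(stale, ip)
	}
	sort.Strings(stale)
	return stale
}

func main() {
	targets := []string{"10.0.1.10", "10.0.1.12", "10.0.1.12", "10.0.1.11"}
	endpoints := []string{"10.0.1.10", "10.0.1.11"}
	fmt.Println(staleBackends(targets, endpoints)) // [10.0.1.12]
}
```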

Monitor Conntrack Table

# Conntrack table statistics
conntrack -S

# Output:
# cpu=0   found=12847 invalid=32 insert=0 insert_failed=0 drop=0
# cpu=1   found=11923 invalid=28 insert=0 insert_failed=0 drop=0

# Watch for insert_failed - table might be full
# Watch for high invalid - packets that matched no valid flow;
# a rise after scale events hints at stale entries but is not proof

# Table size and limits
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max
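
The two files can be combined into a utilization figure. A small sketch, assuming it runs on the node itself (or in a host-network pod) where those paths are readable; `readInt` and `utilization` are illustrative helpers:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

const (
	countPath = "/proc/sys/net/netfilter/nf_conntrack_count"
	limitPath = "/proc/sys/net/netfilter/nf_conntrack_max"
)

// readInt reads a single integer from a procfs file.
func readInt(path string) (int, error) {
	raw, err := os.ReadFile(path)
	if err != nil {
		return 0, err
	}
	return strconv.Atoi(strings.TrimSpace(string(raw)))
}

// utilization returns count/limit as a fraction, or 0 if limit is not positive.
func utilization(count, limit int) float64 {
	if limit <= 0 {
		return 0
	}
	return float64(count) / float64(limit)
}

func main() {
	count, err := readInt(countPath)
	if err != nil {
		fmt.Fprintln(os.Stderr, "read count:", err)
		os.Exit(1)
	}
	limit, err := readInt(limitPath)
	if err != nil {
		fmt.Fprintln(os.Stderr, "read limit:", err)
		os.Exit(1)
	}
	fmt.Printf("conntrack: %d/%d (%.1f%% full)\n", count, limit, 100*utilization(count, limit))
}
```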

Correlate with Pod Events

# Watch pod deletions and connection failures together
kubectl get events -w --field-selector reason=Killing &
kubectl logs -f <client-pod> | grep -i "connection refused"

# If connection refused spikes after pod terminations
# → stale conntrack is likely the cause
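
Once both streams are timestamped, the correlation can be made concrete by counting how many errors fall within a window after any termination. A sketch under the assumption that both sets of timestamps are available as Unix seconds; `errorsAfter` is a hypothetical helper, and the 120-second window simply mirrors the TIME_WAIT timeout mentioned earlier.

```go
package main

import "fmt"

// errorsAfter counts the error timestamps that fall within windowSec seconds
// after at least one termination timestamp. All values are Unix seconds.
func errorsAfter(terminations, errs []int64, windowSec int64) int {
	n := 0
	for _, e := range errs {
		for _, t := range terminations {
			if d := e - t; d >= 0 && d <= windowSec {
				n++
				break // count each error once
			}
		}
	}
	return n
}

func main() {
	base := int64(1700000000) // arbitrary example epoch
	terminations := []int64{base + 30}
	errs := []int64{base + 10, base + 31, base + 90, base + 200}

	hits := errorsAfter(terminations, errs, 120)
	fmt.Printf("%d of %d errors within 120s of a pod termination\n", hits, len(errs))
	// prints: 2 of 4 errors within 120s of a pod termination
}
```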

The Fix

Option 1: Graceful Pod Termination

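apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  terminationGracePeriodSeconds: 60
  containers:
  - name: app
    image: your-app-image
    lifecycle:
      preStop:
        exec:
          command:
          - /bin/sh
          - -c
          - |
            # The pod is removed from Service endpoints while this hook runs.
            # Keep serving for 30s so kube-proxy can update its rules and
            # short-lived conntrack entries can expire.
            # Only then does the container receive SIGTERM.
            sleep 30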

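// In your application - drain connections before exit
package main

import (
	"context"
	"log"
	"net/http"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	server := &http.Server{Addr: ":8080"}
	// ctx is cancelled when SIGTERM arrives
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM)
	defer stop()

	done := make(chan struct{})
	go func() {
		defer close(done)
		<-ctx.Done()
		// Stop accepting new connections and wait for in-flight requests,
		// but give up before the grace period runs out (60s minus the 30s preStop sleep)
		drainCtx, cancel := context.WithTimeout(context.Background(), 25*time.Second)
		defer cancel()
		if err := server.Shutdown(drainCtx); err != nil {
			log.Printf("shutdown: %v", err)
		}
	}()

	if err := server.ListenAndServe(); err != http.ErrServerClosed {
		log.Fatal(err)
	}
	<-done // wait for Shutdown to finish draining
}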

Option 2: Aggressive Conntrack Cleanup

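# On pod deletion, clean up conntrack entries
# Run on the node (host network namespace, NET_ADMIN), e.g. from a controller;
# a preStop hook inside the pod cannot see the node's conntrack table

# Delete entries for direct traffic to or from the pod IP
conntrack -D -d <pod-ip>
conntrack -D -s <pod-ip>

# Delete Service flows that were DNATed to the pod
# (--dport requires -p; without --dst-nat this would also drop healthy flows)
conntrack -D -p tcp --orig-dst <service-cluster-ip> --dport <service-port> --dst-nat <pod-ip>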

# DaemonSet for conntrack cleanup
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: conntrack-cleaner
spec:
  selector:
    matchLabels:
      app: conntrack-cleaner
  template:
    metadata:
      labels:
        app: conntrack-cleaner
    spec:
      hostNetwork: true
      containers:
      - name: cleaner
        image: your-cleaner-image
        securityContext:
          capabilities:
            add: ["NET_ADMIN"]
        # Watch endpoint changes and clean conntrack
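
The core of such a cleaner is small: diff the old and new endpoint sets, then build one delete command per removed backend. The sketch below only constructs the argument lists and does not execute anything. `removed` and `deleteArgs` are hypothetical helpers, and the `--orig-dst` / `--dst-nat` filters are conntrack-tools options that should be checked against the version installed on your nodes.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// removed returns the IPs present in before but missing from after, sorted.
func removed(before, after []string) []string {
	still := make(map[string]struct{}, len(after))
	for _, ip := range after {
		still[ip] = struct{}{}
	}
	var gone []string
	for _, ip := range before {
		if _, ok := still[ip]; !ok {
			gone = append(gone, ip)
		}
	}
	sort.Strings(gone)
	return gone
}

// deleteArgs builds the argv that deletes flows addressed to serviceIP and
// DNATed to podIP for the given protocol ("tcp" or "udp").
func deleteArgs(serviceIP, podIP, proto string) []string {
	return []string{"conntrack", "-D", "--orig-dst", serviceIP, "--dst-nat", podIP, "-p", proto}
}

func main() {
	before := []string{"10.0.1.10", "10.0.1.11", "10.0.1.12"}
	after := []string{"10.0.1.10", "10.0.1.11"}

	for _, ip := range removed(before, after) {
		fmt.Println(strings.Join(deleteArgs("10.96.0.100", ip, "tcp"), " "))
	}
	// prints: conntrack -D --orig-dst 10.96.0.100 --dst-nat 10.0.1.12 -p tcp
}
```

Filtering on both the Service IP and the removed pod IP keeps flows to the surviving backends untouched.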

Option 3: Use IPVS Mode

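# kube-proxy config for IPVS mode
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: "ipvs"
ipvs:
  strictARP: true   # only needed for ARP-based load balancers such as MetalLB
  # IPVS keeps its own connection table in addition to conntrack
  # Connections to a removed backend are expired

# IPVS advantages:
# - Its own connection table, tied to virtual services and their backends
# - Connections to a removed backend are expired on the next packet
#   (net.ipv4.vs.expire_nodest_conn=1, which kube-proxy sets in IPVS mode)
# - Better performance at scale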

# Check current mode
kubectl -n kube-system get cm kube-proxy -o yaml | grep mode

Option 4: Client-Side Retry Logic

// Implement retry with backoff for transient failures
package client

import (
	"context"
	"errors"
	"io"
	"net/http"
	"syscall"
	"time"
)

func callService(ctx context.Context) error {
	backoff := []time.Duration{10 * time.Millisecond, 100 * time.Millisecond, 1 * time.Second}

	var lastErr error
	for i := 0; i <= len(backoff); i++ {
		req, err := http.NewRequestWithContext(ctx, http.MethodGet, "http://service-a/endpoint", nil)
		if err != nil {
			return err
		}
		resp, err := http.DefaultClient.Do(req)
		if err == nil {
			// Drain and close the body so the connection can be reused
			io.Copy(io.Discard, resp.Body)
			resp.Body.Close()
			return nil
		}
		lastErr = err

		// Only retry connection refused (stale conntrack symptom),
		// and only while backoff steps remain
		if !isConnectionRefused(err) || i == len(backoff) {
			break
		}
		select {
		case <-time.After(backoff[i]):
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return lastErr
}

func isConnectionRefused(err error) bool {
	// errors.Is unwraps *url.Error, *net.OpError and *os.SyscallError
	return errors.Is(err, syscall.ECONNREFUSED)
}

Option 5: Reduce Conntrack Timeouts

# Tune conntrack timeouts (node-level)
# WARNING: Affects all connections on node

# TCP established (default 432000 = 5 days!)
sysctl -w net.netfilter.nf_conntrack_tcp_timeout_established=3600

# TCP time_wait (default 120)
sysctl -w net.netfilter.nf_conntrack_tcp_timeout_time_wait=30

# UDP (default 30 for unreplied flows; the stream timeout defaults to 120 on recent kernels)
sysctl -w net.netfilter.nf_conntrack_udp_timeout=10
sysctl -w net.netfilter.nf_conntrack_udp_timeout_stream=30

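# Apply via DaemonSet
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: sysctl-tuner
spec:
  selector:
    matchLabels:
      app: sysctl-tuner
  template:
    metadata:
      labels:
        app: sysctl-tuner
    spec:
      hostNetwork: true   # conntrack sysctls are per network namespace
      initContainers:
      - name: sysctl
        image: busybox
        securityContext:
          privileged: true
        command:
        - sysctl
        - -w
        - net.netfilter.nf_conntrack_tcp_timeout_time_wait=30
      containers:
      - name: pause       # keeps the pod running after the init container exits
        image: registry.k8s.io/pause:3.9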

Monitoring

groups:
  - name: conntrack
    rules:
      - alert: ConntrackTableNearFull
        expr: |
          node_nf_conntrack_entries /
          node_nf_conntrack_entries_limit > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Conntrack table {{ $value | humanizePercentage }} full"

      - alert: HighConntrackInsertFailed
        expr: |
          rate(node_nf_conntrack_stat_insert_failed[5m]) > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Conntrack inserts failing - table full or hash collision"

      - alert: ServiceEndpointChurn
        expr: |
          changes(kube_endpoint_address_available[5m]) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High endpoint churn - conntrack stale risk"

Checklist

## Kubernetes Conntrack Stale DNAT

### Diagnosis
- [ ] Check conntrack entries: conntrack -L -d <service-ip>
- [ ] Compare entries against current endpoints
- [ ] Look for connection refused after scale events
- [ ] Check conntrack table utilization

### Prevention
- [ ] Implement graceful pod termination (preStop hook)
- [ ] Add sleep before pod exit (conntrack expiry)
- [ ] Consider IPVS mode for better cleanup
- [ ] Implement client retry logic

### Tuning
- [ ] Review conntrack timeout values
- [ ] Monitor conntrack table size
- [ ] Alert on endpoint churn rate

Conclusion

The lesson: Kubernetes Services rely on Linux conntrack for connection routing, but conntrack outlives pod lifecycles. Stale DNAT entries cause “ghost connections” to deleted pods.

Key principles:

Conntrack caches DNAT decisions - bypasses iptables rules
Pod deletion doesn’t clean conntrack - entries live until timeout
UDP is worse than TCP - no connection state to detect failures
IPVS mode handles this better - connections to removed backends are expired
