Traffic Hitting Dead Pods: Conntrack's Stale NAT Mapping

Tags: kubernetes, networking, conntrack, debugging, deployment, nat

This bug felt like a ghost: requests kept landing on pods that no longer existed. “Every deploy causes exactly 2 minutes of 503 errors.” The pattern was so consistent that we started joking about it. Deploy at 14:00, errors clear at 14:02. Deploy at 17:30, errors clear at 17:32. We tried everything the Kubernetes best practices suggested: preStop hooks with sleep commands, a longer terminationGracePeriodSeconds, faster readiness probe failures, aggressive endpoint removal. Nothing made any difference. Exactly 2 minutes, every time.

The breakthrough came when we stopped thinking about Kubernetes and started thinking about Linux networking. Kubernetes removes endpoints and updates iptables rules correctly. But those iptables rules are only consulted for new connections. For existing connections—and for connections from persistent clients that keep trying the same source port—the Linux kernel uses conntrack, a connection tracking subsystem that remembers NAT mappings.

When a pod dies, its conntrack entries don’t die with it. They linger, remembering “connection from client:54321 should go to pod:10.1.1.100”. When the client’s next request arrives using the same source port, conntrack short-circuits the iptables lookup and sends the packet directly to the dead pod. The packet goes into the void, the connection times out, and the user sees a 503.
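A toy model makes the lookup order concrete. This is only a sketch: the table, the addresses, and the `route` function are made up for illustration, and the real matching happens inside the kernel on the full connection tuple.

```shell
#!/bin/sh
# Toy model of the lookup order: an existing conntrack entry wins over the
# current NAT rules. Not real kernel state, just illustrative text tables.

# Hypothetical conntrack table: "client:port node:port backend:port"
CONNTRACK_TABLE='203.0.113.10:54321 192.0.2.5:30080 10.1.1.100:8080'

# Backend the updated iptables rules would pick for a brand-new flow
CURRENT_BACKEND='10.1.1.200:8080'

# route CLIENT DEST: print the backend a packet would be sent to
route() {
  hit=$(printf '%s\n' "$CONNTRACK_TABLE" |
    awk -v c="$1" -v d="$2" '$1 == c && $2 == d { print $3; exit }')
  if [ -n "$hit" ]; then
    echo "$hit (conntrack hit, NAT rules skipped)"
  else
    echo "$CURRENT_BACKEND (new flow, NAT rules consulted)"
  fi
}

route 203.0.113.10:54321 192.0.2.5:30080   # reused source port: stale backend
route 203.0.113.10:54999 192.0.2.5:30080   # fresh source port: current backend
```

The first call is the failure case from the paragraph above: same client, same source port, so the remembered mapping to 10.1.1.100 is used even though the rules now point elsewhere.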

This is one of those problems that only appears when you have persistent connections or clients that reuse source ports. In development, with fresh connections every time, you’d never see it. In production, with connection pools and HTTP keep-alive, it’s everywhere.

Environment: Kubernetes 1.27, NodePort services, high-traffic stateless API

The Problem

The Eerie Pattern

Every single deployment:

T+0:00  New pods ready, old pods terminating
T+0:00  Endpoints updated (kube-proxy syncs)
T+0:00  503 errors start appearing
T+0:05  503 rate: 5%
T+0:30  503 rate: 3%
T+1:00  503 rate: 2%
T+2:00  503 rate: 0% (finally!)

Always exactly 2 minutes.
Same pattern every time.
No exceptions.

Why Standard Fixes Don’t Work

# We tried everything:

# Longer termination grace period - didn't help
terminationGracePeriodSeconds: 120

# preStop hook delay - didn't help
lifecycle:
  preStop:
    exec:
      command: ["sleep", "30"]

# Aggressive readiness probe - didn't help
readinessProbe:
  periodSeconds: 1
  failureThreshold: 1

# The problem isn't pod lifecycle
# It's the node's conntrack table!

Root Cause

How Conntrack Works

Normal request flow with NodePort:

Client → Node:30080 → iptables DNAT → Pod:8080

conntrack entry created:
┌─────────────────────────────────────────────────────┐
│ tcp  src=client:54321 dst=node:30080                │
│      src=pod:8080     dst=client:54321 [ASSURED]    │
│      timeout=432000 (5 days!)                       │
└─────────────────────────────────────────────────────┘

This entry remembers: "traffic from client:54321 goes to pod:8080"
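To read that mapping out of real `conntrack -L` output, a small helper is enough. This is a minimal sketch: `backend_of` is a name invented here, and the sample line uses made-up addresses in the layout conntrack normally prints for TCP entries.

```shell
#!/bin/sh
# backend_of LINE: print "client:sport -> backend:port" for one line of
# `conntrack -L` output. The first src=/sport= pair is the original
# direction (client side); the second src=/sport= pair is the reply
# direction, whose source is the pod that actually answers.
backend_of() {
  printf '%s\n' "$1" | awk '{
    n = 0; m = 0
    for (i = 1; i <= NF; i++) {
      if ($i ~ /^src=/)   { n++; src[n]   = substr($i, 5) }
      if ($i ~ /^sport=/) { m++; sport[m] = substr($i, 7) }
    }
    if (n >= 2 && m >= 2) print src[1] ":" sport[1] " -> " src[2] ":" sport[2]
  }'
}

# Sample line in the format conntrack -L prints (addresses are made up)
SAMPLE='tcp      6 431999 ESTABLISHED src=203.0.113.10 dst=192.0.2.5 sport=54321 dport=30080 src=10.1.1.100 dst=203.0.113.10 sport=8080 dport=54321 [ASSURED] mark=0 use=1'

backend_of "$SAMPLE"
# prints: 203.0.113.10:54321 -> 10.1.1.100:8080
```

Note that the pod IP never shows up as the first `dst=` on the line. That field is the address the client dialed (the node), which matters for the diagnosis scripts later on.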

The Problem During Deployment

Timeline:

T+0:00  Old pod IP: 10.1.1.100
        New pod IP: 10.1.1.200

        Kubernetes: "Endpoint 10.1.1.100 removed!"
        kube-proxy: "iptables rules updated!"

        But conntrack table still has:
        ┌─────────────────────────────────────────┐
        │ tcp  src=client:54321 dst=node:30080    │
        │      src=10.1.1.100:8080 dst=client     │  ← Points to DEAD pod!
        │      timeout=still_has_time             │
        └─────────────────────────────────────────┘

T+0:01  Client sends packet on same connection
        → Conntrack: "I know this! Send to 10.1.1.100"
        → Packet goes to dead pod
        → Connection reset / timeout
        → 503 error!

T+2:00  Clients give up on the dead connections and reconnect
        New connections (new source ports) get new NAT to 10.1.1.200
        Errors stop

Why It’s Exactly 2 Minutes

# Check conntrack timeout for established TCP
cat /proc/sys/net/netfilter/nf_conntrack_tcp_timeout_established
# 432000 (5 days - not relevant)

# The 2-minute pattern comes from:
# 1. Client-side keepalive/retry settings
# 2. HTTP client connection pool timeout
# 3. Load balancer health check intervals

# The conntrack entry itself could live for days
# But the client eventually gives up and creates new connection
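The established timeout is only one of several per-state TCP timeouts the kernel keeps, so it can help to print them all side by side when ruling conntrack expiry in or out. The sketch below assumes the usual /proc layout; the `human` helper is invented here and nothing is changed on the node.

```shell
#!/bin/sh
# human SECONDS: render a timeout as days/hours/minutes/seconds
human() {
  s=$1
  printf '%dd %dh %dm %ds\n' $((s / 86400)) $((s % 86400 / 3600)) $((s % 3600 / 60)) $((s % 60))
}

# Print every conntrack TCP timeout the node exposes. The loop is a no-op
# if the nf_conntrack module is not loaded and the files do not exist.
for f in /proc/sys/net/netfilter/nf_conntrack_tcp_timeout_*; do
  [ -r "$f" ] || continue
  printf '%-50s %s\n' "${f##*/}" "$(human "$(cat "$f")")"
done
```

For the value shown above, `human 432000` prints `5d 0h 0m 0s`.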

Diagnosis

Step 1: Watch Conntrack During Deploy

# Before deploy, note a client IP
CLIENT_IP="203.0.113.10"

# Watch conntrack entries for that client during deploy
watch -n 0.5 "conntrack -L -s $CLIENT_IP 2>/dev/null | grep -E '(ESTABLISHED|TIME_WAIT)'"

# You'll see entries pointing to the old pod IP
# even after the pod is gone

Step 2: Verify Dead Pod Traffic

# tcpdump on a node during deploy
tcpdump -i any host 10.1.1.100  # old pod IP

# You'll see packets being sent to the old IP
# after the pod is gone

Step 3: Count Stale Entries

#!/bin/bash
# count-stale-conntrack.sh

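# Get current endpoint IPs
VALID_IPS=$(kubectl get endpoints my-service -o jsonpath='{.subsets[*].addresses[*].ip}')

# Count NodePort conntrack entries whose backend is not a current endpoint.
# The backend (pod) IP is the src= of the reply tuple, i.e. the second src=
# on the line; the first dst= is the node address the client dialed.
conntrack -L -p tcp --dport 30080 2>/dev/null | while read -r line; do
  BACKEND_IP=$(echo "$line" | grep -oP 'src=\K[0-9.]+' | sed -n '2p')
  case " $VALID_IPS " in
    *" $BACKEND_IP "*) ;;              # backend is still a valid endpoint
    *) echo "STALE: $line" ;;
  esac
done | wc -l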
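When root access on the node is only available briefly, the same check can be run offline against a saved dump. This is a sketch under assumptions: `stale_by_backend` is a hypothetical helper, the dump lines are made-up samples, and the list of valid IPs is passed in by hand instead of coming from kubectl.

```shell
#!/bin/sh
# stale_by_backend "VALID_IPS": read conntrack -L output on stdin and print
# "<count> <backend-ip>" for every backend that is not in the valid list.
stale_by_backend() {
  awk -v valid="$1" '
    BEGIN { k = split(valid, v, " "); for (i = 1; i <= k; i++) ok[v[i]] = 1 }
    {
      n = 0
      for (i = 1; i <= NF; i++)
        # the second src= on the line is the reply-direction source (the pod)
        if ($i ~ /^src=/ && ++n == 2) { ip = substr($i, 5); if (!(ip in ok)) stale[ip]++ }
    }
    END { for (ip in stale) print stale[ip], ip }
  ' | sort -rn
}

# Made-up dump: two flows still mapped to the old pod, one to a live pod
DUMP='tcp 6 431999 ESTABLISHED src=203.0.113.10 dst=192.0.2.5 sport=54321 dport=30080 src=10.1.1.100 dst=203.0.113.10 sport=8080 dport=54321 [ASSURED] mark=0 use=1
tcp 6 431990 ESTABLISHED src=203.0.113.11 dst=192.0.2.5 sport=40000 dport=30080 src=10.1.1.100 dst=203.0.113.11 sport=8080 dport=40000 [ASSURED] mark=0 use=1
tcp 6 431980 ESTABLISHED src=203.0.113.12 dst=192.0.2.5 sport=40001 dport=30080 src=10.1.1.200 dst=203.0.113.12 sport=8080 dport=40001 [ASSURED] mark=0 use=1'

printf '%s\n' "$DUMP" | stale_by_backend "10.1.1.200 10.1.1.201"
# prints: 2 10.1.1.100
```

Grouping by backend IP shows at a glance which terminated pod is still attracting traffic.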

The Fix

Option 1: Flush Conntrack During Deploy

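# Add to deployment as a preStop on OLD pods
lifecycle:
  preStop:
    exec:
      command:
        - /bin/sh
        - -c
        - |
          # Wait for endpoint removal to propagate
          sleep 5
          # Get this pod's IP
          POD_IP=$(hostname -i)
          # Flush conntrack entries whose NAT target is this pod.
          # After DNAT the pod IP is the reply-direction source, so match
          # with -r (--reply-src); -d would match the node IP instead.
          # This requires NET_ADMIN and the conntrack binary in the image.
          # Caveat: conntrack tables are per network namespace, so this only
          # reaches the node's entries when run in the host namespace
          # (which is what Option 2 does).
          conntrack -D -r "$POD_IP" || true
# Grant NET_ADMIN capability
securityContext:
  capabilities:
    add: ["NET_ADMIN"]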

Option 2: Node-Level Conntrack Flush

# DaemonSet that watches for endpoint changes
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: conntrack-flusher
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: conntrack-flusher
  template:
    metadata:
      labels:
        app: conntrack-flusher
    spec:
      hostNetwork: true
      serviceAccountName: conntrack-flusher
      containers:
        - name: flusher
          image: alpine
          securityContext:
            privileged: true
          command:
            - /bin/sh
            - -c
            - |
              apk add --no-cache conntrack-tools curl
              while true; do
                # Placeholder: the watch logic is left out here.
                # Watch for endpoint deletions via the API, then delete the
                # entries for each removed pod: conntrack -D -r <pod-ip>
                sleep 10
              done
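The loop body above is left open. One way to fill it in is to compare the endpoint list from the previous poll with the current one and flush only the IPs that disappeared. The sketch below shows just that diff step as a dry run: `removed_ips` and `flush_commands` are hypothetical helpers, the IP lists are assumed to be space-separated, and fetching them from the API is not shown.

```shell
#!/bin/sh
# removed_ips "OLD" "NEW": print IPs present in OLD but missing from NEW
removed_ips() {
  for ip in $1; do
    case " $2 " in
      *" $ip "*) ;;          # still an endpoint, keep its entries
      *) echo "$ip" ;;
    esac
  done
}

# flush_commands "OLD" "NEW": print (not run) one delete per removed pod IP.
# -r / --reply-src matches the pod address in the reply tuple.
flush_commands() {
  removed_ips "$1" "$2" | while read -r ip; do
    echo "conntrack -D -r $ip"
  done
}

flush_commands "10.1.1.100 10.1.1.101" "10.1.1.101 10.1.1.200"
# prints: conntrack -D -r 10.1.1.100
```

Printing the commands instead of running them keeps the sketch safe to try anywhere; inside the DaemonSet the output could be piped to `sh` once the diff looks right.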

Option 3: Use Headless Service (Avoid NAT)

# Headless service = direct pod IPs, no NAT, no conntrack issue
apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  clusterIP: None  # Headless!
  selector:
    app: my-app
  ports:
    - port: 8080

# Clients connect directly to pod IPs
# No DNAT = no stale conntrack entries
# But requires client-side load balancing, and only helps in-cluster
# clients: a headless service cannot be exposed as a NodePort

Option 4: Reduce Conntrack Timeouts

# Reduce established connection timeout (careful!)
sysctl -w net.netfilter.nf_conntrack_tcp_timeout_established=120

# Reduce FIN_WAIT timeout
sysctl -w net.netfilter.nf_conntrack_tcp_timeout_fin_wait=30

# Reduce TIME_WAIT timeout
sysctl -w net.netfilter.nf_conntrack_tcp_timeout_time_wait=30

# These affect ALL connections, not just stale ones
# May break long-lived connections

Option 5: Graceful Connection Draining

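// In your application: drain connections before shutdown
import (
    "context"
    "log"
    "net/http"
    "sync/atomic"
    "time"
)

// healthStatus is read by the readiness handler; false means "not ready".
var healthStatus atomic.Bool

func gracefulShutdown(srv *http.Server) {
    // Fail the readiness check so the pod is removed from the endpoints
    healthStatus.Store(false)

    // Wait for load balancer to stop sending traffic
    time.Sleep(10 * time.Second)

    // Now stop accepting connections and let in-flight requests finish
    ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
    defer cancel()
    if err := srv.Shutdown(ctx); err != nil {
        log.Printf("graceful shutdown did not complete: %v", err)
    }
}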

Monitoring

Prometheus Metrics

groups:
  - name: conntrack
    rules:
      - alert: StaleConntrackEntries
        expr: |
          node_nf_conntrack_entries > 50000 AND
          deriv(node_nf_conntrack_entries[5m]) < 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Possible stale conntrack entries on {{ $labels.instance }}"

      - alert: DeploymentErrors
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[1m])) BY (deployment) /
          sum(rate(http_requests_total[1m])) BY (deployment) > 0.01
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Elevated 5xx rate during deployment"

Deploy Monitoring Script

#!/bin/bash
# monitor-deploy.sh - Run during deployment

SERVICE="my-service"
INTERVAL=5

while true; do
  # Get current endpoint IPs
  ENDPOINTS=$(kubectl get endpoints $SERVICE -o jsonpath='{.subsets[*].addresses[*].ip}' | tr ' ' '\n' | sort)

  # Get backend IPs (reply-direction src) of connections to the NodePort
  CONNTRACK_BACKENDS=$(conntrack -L -p tcp --dport 30080 2>/dev/null | awk '{n=0; for (i=1; i<=NF; i++) if ($i ~ /^src=/ && ++n == 2) print substr($i, 5)}' | sort -u)

  # Find stale entries
  STALE=$(comm -23 <(echo "$CONNTRACK_BACKENDS") <(echo "$ENDPOINTS"))

  if [ -n "$STALE" ]; then
    echo "$(date): STALE entries pointing to: $STALE"
  fi

  sleep $INTERVAL
done

Checklist

## Conntrack Stale NAT Mapping

### Symptoms
- [ ] Errors last about 2 minutes after every deploy
- [ ] Same pattern every deployment
- [ ] Longer preStop/terminationGrace doesn't help
- [ ] Uses NodePort or LoadBalancer service

### Diagnosis
- [ ] Watch conntrack during deploy
- [ ] Compare conntrack dst IPs vs current endpoints
- [ ] tcpdump traffic to old pod IPs

### Fixes
- [ ] Flush conntrack entries for dying pods
- [ ] Use headless service (if possible)
- [ ] Implement proper connection draining
- [ ] Reduce conntrack timeouts (carefully)

### Prevention
- [ ] Add NET_ADMIN capability for conntrack flush
- [ ] Implement graceful shutdown in app
- [ ] Consider service mesh (can handle connection draining for you)

Conclusion

This problem exposes a fundamental impedance mismatch between Kubernetes abstractions and Linux networking primitives. Kubernetes thinks in terms of pods and endpoints. Linux thinks in terms of IP addresses and connections. When a pod dies, Kubernetes does its job—removes the endpoint, updates iptables rules. But Linux conntrack doesn’t know or care about Kubernetes. It just remembers NAT mappings and uses them.

The “exactly 2 minutes” pattern comes from client behavior, not Kubernetes behavior. Keepalive and retry settings, connection pool timeouts, and load balancer health check intervals decide when a client abandons a dead connection, and in our case that added up to about 2 minutes. When clients reconnect, they get new source ports, which don’t match any stale conntrack entries, so they get fresh NAT lookups through the updated iptables rules.

The fix requires breaking the abstraction barrier and touching kernel state directly. Flushing conntrack entries for dying pods, using headless services to avoid NAT entirely, or deploying a service mesh that handles connection draining properly—all of these require understanding what’s happening below the Kubernetes layer.

Key takeaways:

  1. Standard pod lifecycle fixes don’t help - the problem is at node kernel level, not pod level
  2. Pattern is eerily consistent - because it’s governed by client timeouts, not server behavior
  3. Requires kernel-level understanding - conntrack is invisible to Kubernetes monitoring
  4. Fix requires elevated privileges - NET_ADMIN capability or privileged containers

The fundamental lesson: when debugging Kubernetes networking issues, don’t stop at the Kubernetes layer. The actual networking happens in Linux, and sometimes you need to go down to conntrack, iptables, and ss to understand what’s really happening.


Cite this article

If you reference this post, please link to the original URL and credit the author.

Michal Drozd. "Traffic Hitting Dead Pods: Conntrack's Stale NAT Mapping". https://www.michal-drozd.com/en/blog/conntrack-stale-nat-mapping/ (Published November 14, 2024).