The Ghost Pod: Why Your Service Still Sends Traffic to Dead Endpoints

Tags: kubernetes, networking, conntrack, debugging, kube-proxy, iptables

We kept seeing traffic sent to pods that no longer existed. The question that started it was: “Why is this one node getting connection resets to the database?” The PostgreSQL pods were healthy. The Service endpoints looked correct. kubectl describe svc showed the right backend pods. But one application node was intermittently failing with ECONNRESET while others were fine. We spent two days checking application code, pool configurations, and network policies, and everything looked perfect.

The breakthrough came when I SSH’d to the failing node and ran conntrack -L | grep 5432. There it was: a NAT mapping pointing to 10.244.3.47—an IP that belonged to a pod that had been terminated during yesterday’s rolling deployment. The connection pool had opened that connection before the deployment, and conntrack was still faithfully translating packets to a destination that no longer existed. The kernel was doing exactly what it was supposed to do; it just had stale state from a connection that outlived the pod it was talking to.

This is one of the most frustrating Kubernetes networking issues because everything looks fine. Endpoints are correct. kube-proxy rules are correct. New connections work perfectly. But existing long-lived connections—database pools, gRPC streams, WebSocket connections—can remain pinned to dead endpoints through conntrack NAT mappings that persist until the connection closes or times out.

The fundamental issue is a mismatch between Kubernetes’ view of endpoints (updated as soon as a pod starts terminating) and the kernel’s conntrack table (which preserves NAT mappings for the lifetime of the connection). When kube-proxy updates iptables rules to remove an endpoint, it does not invalidate existing conntrack entries for established TCP connections. Those connections continue to use the old NAT mapping, sending packets into the void.
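How long such a mapping can survive is bounded by the node's conntrack timeout for established TCP flows. The sketch below only formats that value for reading; `fmt_duration` is a hypothetical helper name, and the number you see will depend on your kernel defaults and kube-proxy configuration.

```shell
# Format a duration in seconds as days, hours, and minutes.
fmt_duration() {
  s=$1
  printf '%dd %dh %dm\n' "$((s / 86400))" "$((s % 86400 / 3600))" "$((s % 3600 / 60))"
}

# Usage on a node (reads the kernel's established-TCP conntrack timeout):
#   fmt_duration "$(sysctl -n net.netfilter.nf_conntrack_tcp_timeout_established)"
```

Applied to the 86393 seconds remaining in the sample conntrack entry shown in the diagnosis section, this prints `0d 23h 59m`.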

Environment: Kubernetes 1.28+, kube-proxy in iptables mode, long-lived TCP connections (connection pools, gRPC, WebSockets)

Understanding the Mechanism

How kube-proxy and conntrack Interact

Normal Service flow:

Client Pod (10.244.1.50:45678)
    |
    | SYN to ClusterIP (10.96.100.50:5432)
    v
iptables DNAT rule (kube-proxy managed)
    |
    | Translates to backend: 10.244.3.47:5432
    v
conntrack creates NAT mapping:
    original: src=10.244.1.50:45678 → dst=10.96.100.50:5432
    reply:    src=10.244.3.47:5432 → dst=10.244.1.50:45678
    |
    v
Backend Pod (10.244.3.47:5432)

This mapping persists for the lifetime of the connection.
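To make the two directions of a conntrack entry concrete, here is a minimal sketch that prints the address a client dialed next to the backend the kernel actually maps it to. `conntrack_tuples` is a hypothetical helper; it assumes the usual `conntrack -L` line layout, where the first src/dst/sport/dport group is the original direction and the second group is the reply direction.

```shell
# Read conntrack entries on stdin and print
# "<dialed ip>:<port> -> <backend ip>:<port>" for each one.
conntrack_tuples() {
  awk '{
    nsrc = 0; ndst = 0; nsport = 0; ndport = 0
    vip = ""; vport = ""; backend = ""; bport = ""
    for (i = 1; i <= NF; i++) {
      split($i, kv, "=")
      # Original direction: first dst= and first dport=
      if (kv[1] == "dst")   { ndst++;   if (ndst == 1)   vip = kv[2] }
      if (kv[1] == "dport") { ndport++; if (ndport == 1) vport = kv[2] }
      # Reply direction: second src= and second sport=
      if (kv[1] == "src")   { nsrc++;   if (nsrc == 2)   backend = kv[2] }
      if (kv[1] == "sport") { nsport++; if (nsport == 2) bport = kv[2] }
    }
    if (backend != "") print vip ":" vport " -> " backend ":" bport
  }'
}

# Usage on a node:
#   conntrack -L -p tcp --dport 5432 2>/dev/null | conntrack_tuples
```

For connections that are not NATed, both sides of the arrow show the same address, so any line where they differ is a Service translation.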

The Ghost Pod Problem

Timeline of failure:

T+0:00   Connection pool opens connection to postgres-svc
         conntrack entry created: → 10.244.3.47:5432
         Connection is ESTABLISHED, working fine

T+1:00   Rolling deployment starts
         New postgres pod: 10.244.3.48
         Old postgres pod (10.244.3.47) terminating

T+1:05   kube-proxy updates iptables rules
         Service now routes to 10.244.3.48
         NEW connections go to new pod ✓

T+1:10   Old pod fully terminated
         IP 10.244.3.47 no longer exists
         BUT: conntrack entry still maps to it!

T+1:15   Application reuses pooled connection
         Packet goes to 10.244.3.47 (via conntrack)
         No destination → ECONNRESET or timeout

Result: Some connections fail, others work
        Failure is node-local (conntrack is per-node)
        New connections always work (use updated rules)

Why Only Some Nodes?

conntrack is node-local:

Node A:
├── Pod 1 (opened conn at T+0)
│   └── conntrack: → old pod 10.244.3.47 ← STALE
├── Pod 2 (opened conn at T+2)
│   └── conntrack: → new pod 10.244.3.48 ← OK
└── New connections → 10.244.3.48 ✓

Node B:
├── Pod 3 (no existing connections)
│   └── All connections → 10.244.3.48 ✓
└── No stale conntrack entries

Only Node A's Pod 1 sees failures!
This makes it look like an application bug, not networking.

Diagnosing Ghost Pods

Check conntrack Entries

# SSH to the affected node
# Find conntrack entries for your service
conntrack -L -p tcp --dport 5432 2>/dev/null | head -20

# Example output showing a stale entry (conntrack prints each entry on one line;
# it is wrapped here for readability):
# tcp  6 86393 ESTABLISHED src=10.244.1.50 dst=10.96.100.50 sport=45678 dport=5432
#      src=10.244.3.47 dst=10.244.1.50 sport=5432 dport=45678 [ASSURED] mark=0

# The second src= (reply direction, 10.244.3.47) is the actual backend
# If that IP doesn't exist anymore → ghost pod!

# Check if the backend IP exists (search all namespaces, match the whole IP)
kubectl get pods -A -o wide | grep -wF 10.244.3.47
# (no output = ghost pod confirmed)

# Count conntrack entries by backend (the last src= on each line)
conntrack -L -p tcp 2>/dev/null | \
  sed -nE 's/.* src=([0-9.]+) dst=.*/\1/p' | \
  sort | uniq -c | sort -rn | head

Compare Endpoints vs conntrack

# Get current endpoints
kubectl get endpoints postgres-svc -o jsonpath='{.subsets[*].addresses[*].ip}'
# Output: 10.244.3.48 10.244.3.49

# Get conntrack backends for that service
# (conntrack output has no "reply" keyword; the last src= is the backend)
conntrack -L -p tcp --dport 5432 2>/dev/null | \
  sed -nE 's/.* src=([0-9.]+) dst=.*/\1/p' | sort -u
# Output (one per line): 10.244.3.47 10.244.3.48 10.244.3.49

# 10.244.3.47 is in conntrack but not in endpoints = ghost!

Identify Affected Connections

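# Find all stale conntrack entries (backends not in endpoints)
# Note: --dport 5432 also matches other Services on that port;
# add -d <ClusterIP> to narrow the listing to one Service
ENDPOINTS=$(kubectl get endpoints postgres-svc -o jsonpath='{.subsets[*].addresses[*].ip}' | tr ' ' '\n')

conntrack -L -p tcp --dport 5432 2>/dev/null | while read -r line; do
  # The last src= on the line is the reply-direction source (the backend)
  DEST=$(echo "$line" | sed -nE 's/.* src=([0-9.]+) dst=.*/\1/p')
  # -x matches whole lines, so 10.244.3.4 does not match 10.244.3.47
  if [ -n "$DEST" ] && ! echo "$ENDPOINTS" | grep -qxF "$DEST"; then
    echo "STALE: $line"
  fi
done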

Application-Side Symptoms

# In application logs, look for:
# - ECONNRESET after period of success
# - "server closed the connection unexpectedly"
# - Intermittent timeouts to healthy services
# - Errors that correlate with deployment times

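# Check if failures are node-specific
# (assumes the pods carry the label app=myapp; --prefix tags each line
# with [pod/<name>/<container>], --tail=-1 disables the 10-line default)
kubectl logs -l app=myapp --all-containers --prefix --tail=-1 | \
  grep -iE "reset|timeout" | grep -oP '^\[pod/\K[^/]+' | sort | uniq -c | \
  while read -r COUNT POD; do
    NODE=$(kubectl get pod "$POD" -o jsonpath='{.spec.nodeName}')
    echo "$NODE $POD $COUNT"
  done | sort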

The Fix

Option 1: Flush Stale conntrack Entries

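# Nuclear option: delete every conntrack entry for a destination port
# (this also hits other Services that use the same port)
conntrack -D -p tcp --dport 5432

# More surgical: delete entries whose backend is a specific dead IP
conntrack -D -p tcp --reply-src 10.244.3.47

# Automated cleanup script
#!/bin/bash
# cleanup-ghost-conntrack.sh <service> <port>

SERVICE=$1
PORT=$2

# Get the Service ClusterIP and the current valid endpoints
CLUSTER_IP=$(kubectl get svc "$SERVICE" -o jsonpath='{.spec.clusterIP}')
ENDPOINTS=$(kubectl get endpoints "$SERVICE" -o jsonpath='{.subsets[*].addresses[*].ip}' | tr ' ' '\n')

# Bail out if a lookup failed; with an empty list every entry would look stale
if [ -z "$CLUSTER_IP" ] || [ -z "$ENDPOINTS" ]; then
  echo "Could not resolve ClusterIP or endpoints for $SERVICE" >&2
  exit 1
fi

# Find and delete stale entries (the last src= on each line is the backend)
conntrack -L -p tcp -d "$CLUSTER_IP" --dport "$PORT" 2>/dev/null | \
  sed -nE 's/.* src=([0-9.]+) dst=.*/\1/p' | sort -u | while read -r DEST; do
  if ! echo "$ENDPOINTS" | grep -qxF "$DEST"; then
    echo "Deleting stale entries to $DEST"
    conntrack -D -p tcp -d "$CLUSTER_IP" --dport "$PORT" --reply-src "$DEST"
  fi
done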
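Because deleting conntrack entries is disruptive, it can help to review the plan first. The following is a minimal dry-run sketch, not a replacement for the script above: `plan_conntrack_cleanup` is a hypothetical helper that reads `conntrack -L` output on stdin and prints, without executing, one delete command per backend that is missing from the endpoint list you pass in.

```shell
# Dry run: print (do not execute) a delete command for every backend
# that is not in the live endpoint list.
#   $1 = service port, $2 = whitespace-separated list of live endpoint IPs
plan_conntrack_cleanup() {
  port=$1
  live=$2
  # With an empty list every backend would look stale, so refuse to plan.
  if [ -z "$live" ]; then
    echo "refusing to plan: empty endpoint list" >&2
    return 1
  fi
  # The last src= on each line is the reply-direction source, i.e. the backend.
  sed -nE 's/.* src=([0-9.]+) dst=.*/\1/p' | sort -u | while read -r backend; do
    # $live is left unquoted on purpose: word splitting yields one IP per line,
    # and grep -x then matches whole lines only.
    if ! printf '%s\n' $live | grep -qxF -- "$backend"; then
      printf 'conntrack -D -p tcp --dport %s --reply-src %s\n' "$port" "$backend"
    fi
  done
}

# Usage on a node:
#   conntrack -L -p tcp --dport 5432 2>/dev/null | \
#     plan_conntrack_cleanup 5432 "10.244.3.48 10.244.3.49"
```

Printing commands instead of running them keeps the destructive step explicit: review the output, then pipe it to `sh` if it looks right.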

Option 2: Proper Connection Draining

# Pod spec with proper termination handling
apiVersion: v1
kind: Pod
spec:
  terminationGracePeriodSeconds: 60
  containers:
  - name: postgres
    lifecycle:
      preStop:
        exec:
          command:
          - /bin/sh
          - -c
          - |
            # Keep serving briefly while endpoint removal propagates
            sleep 10
            # Smart shutdown: refuse new connections and wait up to 40s
            # for existing sessions to disconnect
            pg_ctl stop -m smart -w -t 40
    # Readiness gates traffic during normal operation; a terminating
    # pod is removed from Service endpoints regardless of this probe
    readinessProbe:
      exec:
        command: ["pg_isready"]
      periodSeconds: 5

Option 3: Connection Pool Configuration

// HikariCP - bound connection lifetime and validate idle connections
HikariConfig config = new HikariConfig();
config.setConnectionTimeout(5000);
config.setValidationTimeout(3000);
// Key: limit connection lifetime to bound stale connection risk
config.setMaxLifetime(300000);   // 5 minutes
// Pool-level keepalive (not TCP keepalive): pings idle connections
// every 30s; must be shorter than maxLifetime
config.setKeepaliveTime(30000);
config.setConnectionTestQuery("SELECT 1");

// Go sql.DB - Set connection lifetime
db.SetConnMaxLifetime(5 * time.Minute)
db.SetConnMaxIdleTime(1 * time.Minute)

// For gRPC - Enable keepalive with short timeout
// Note: a grpc-go server rejects pings more frequent than its
// keepalive.EnforcementPolicy allows (5 minutes by default), so the
// server must be configured to permit this interval
conn, err := grpc.Dial(
    target,
    // Assumption: plaintext in-cluster traffic; use TLS credentials otherwise
    grpc.WithTransportCredentials(insecure.NewCredentials()),
    grpc.WithKeepaliveParams(keepalive.ClientParameters{
        Time:                10 * time.Second,
        Timeout:             3 * time.Second,
        PermitWithoutStream: true,
    }),
)

Option 4: Switch to IPVS Mode

# kube-proxy config - in IPVS mode a removed backend is set to weight 0
# and deleted only after its existing connections finish
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: ipvs
ipvs:
  scheduler: lc  # least connection
  syncPeriod: 30s
  minSyncPeriod: 5s
  # IPVS can gracefully drain connections
  tcpTimeout: 900s
  tcpFinTimeout: 30s
  udpTimeout: 300s

Option 5: Use Headless Service for Stateful Connections

# Headless service - clients connect directly to pod IPs
apiVersion: v1
kind: Service
metadata:
  name: postgres-headless
spec:
  clusterIP: None  # Headless
  selector:
    app: postgres
  ports:
  - port: 5432
---
# Application connects to postgres-0.postgres-headless.namespace.svc
# No DNAT = no conntrack stale entries
# But: Must handle pod IP changes in application

Prevention

Deployment Strategy

# Rolling update with proper draining
apiVersion: apps/v1
kind: Deployment
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    spec:
      terminationGracePeriodSeconds: 60
      containers:
      - name: app
        lifecycle:
          preStop:
            exec:
              command:
              - /bin/sh
              - -c
              # Keep serving while endpoint removal propagates and client pools recycle
              - sleep 45

Connection Lifetime Budget

Design principle: Connection lifetime < Interval between deployments

If you deploy every hour:
  - Max connection lifetime: 30 minutes
  - Pool refresh interval: 15 minutes
  - Stale connections get recycled before next deploy

If you deploy multiple times per day:
  - Max connection lifetime: 5-10 minutes
  - Aggressive connection recycling
  - Accept slight overhead of reconnection
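The hourly example can be turned into a small calculation. This sketch assumes a simple rule of thumb taken from that example (half the deployment interval for max lifetime, a quarter for pool refresh); `lifetime_budget` is a hypothetical helper, and the ratios are a starting point rather than a recommendation from any pool's documentation.

```shell
# Suggest pool settings from the interval between deployments, in minutes.
# Assumption: max lifetime = interval / 2, refresh = interval / 4,
# mirroring the "deploy every hour" example above.
lifetime_budget() {
  interval_min=$1
  max_lifetime=$((interval_min / 2))
  refresh=$((interval_min / 4))
  printf 'max_lifetime_min=%d refresh_min=%d\n' "$max_lifetime" "$refresh"
}

# Usage:
#   lifetime_budget 60
```

Whatever ratio you pick, the worst case is a connection opened just before its backend is removed: it can stay in the pool for up to the full max lifetime.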

Monitoring

Prometheus Alerts

groups:
- name: conntrack-ghost-pods
  rules:
  - alert: StaleConntrackEntries
    expr: |
      # Compare conntrack entries vs known endpoints
      # This requires custom exporter
      conntrack_stale_entries > 0
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Stale conntrack entries detected"

  - alert: ConnectionResetSpike
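    expr: |
      # tcp_connection_resets_total is a placeholder name; use the
      # reset counter your exporter actually exposes
      rate(tcp_connection_resets_total[5m]) > 10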
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High rate of TCP connection resets"

  - alert: ServiceEndpointChurn
    expr: |
      changes(kube_endpoint_address_available[10m]) > 5
    for: 5m
    labels:
      severity: info
    annotations:
      summary: "High endpoint churn - check for ghost pods"

conntrack Monitoring Script

#!/bin/bash
# monitor-conntrack.sh - Run via cron or DaemonSet

SERVICES="postgres-svc redis-svc"

for SVC in $SERVICES; do
  PORT=$(kubectl get svc "$SVC" -o jsonpath='{.spec.ports[0].port}')
  CLUSTER_IP=$(kubectl get svc "$SVC" -o jsonpath='{.spec.clusterIP}')
  ENDPOINTS=$(kubectl get endpoints "$SVC" -o jsonpath='{.subsets[*].addresses[*].ip}' | tr ' ' '\n')

  STALE_COUNT=0
  while read -r line; do
    # The last src= on the line is the reply-direction source (the backend)
    DEST=$(echo "$line" | sed -nE 's/.* src=([0-9.]+) dst=.*/\1/p')
    # -x matches whole lines, so 10.244.3.4 does not match 10.244.3.47
    if [ -n "$DEST" ] && ! echo "$ENDPOINTS" | grep -qxF "$DEST"; then
      STALE_COUNT=$((STALE_COUNT + 1))
    fi
  done < <(conntrack -L -p tcp -d "$CLUSTER_IP" --dport "$PORT" 2>/dev/null)

  echo "conntrack_stale_entries{service=\"$SVC\"} $STALE_COUNT"
done
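If the script's output is scraped through the node_exporter textfile collector (an assumption; any consumer of the Prometheus text format would do), adding HELP and TYPE lines makes the metric self-describing. `emit_stale_metrics` is a hypothetical helper that takes `<service> <count>` pairs on stdin.

```shell
# Read "<service> <count>" pairs on stdin and print them in the
# Prometheus text exposition format, with HELP and TYPE metadata.
emit_stale_metrics() {
  echo '# HELP conntrack_stale_entries Conntrack entries whose backend is not a current endpoint.'
  echo '# TYPE conntrack_stale_entries gauge'
  while read -r svc count; do
    printf 'conntrack_stale_entries{service="%s"} %d\n' "$svc" "$count"
  done
}

# Usage (the output path is an assumption; write to a temp file and rename
# so a scrape never reads a half-written file):
#   printf 'postgres-svc 2\n' | emit_stale_metrics > /tmp/conntrack.prom.tmp
#   mv /tmp/conntrack.prom.tmp /var/lib/node_exporter/conntrack.prom
```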

Checklist

## Ghost Pod Prevention and Response

### Detection
- [ ] SSH to affected node and check conntrack entries
- [ ] Compare conntrack destinations vs current endpoints
- [ ] Verify failures are node-local (not cluster-wide)
- [ ] Correlate errors with recent deployments

### Immediate Fix
- [ ] Flush stale conntrack entries for affected service
- [ ] Restart affected pods to get fresh connections
- [ ] Verify new connections work correctly

### Prevention
- [ ] Set connection pool max lifetime < interval between deployments
- [ ] Configure proper preStop hooks for graceful drain
- [ ] Enable TCP keepalive on long-lived connections
- [ ] Consider IPVS mode for better connection tracking

### Monitoring
- [ ] Alert on connection reset spikes
- [ ] Monitor endpoint churn rate
- [ ] Track conntrack table growth
- [ ] Log correlation between resets and deployments

Conclusion

The ghost pod problem is a perfect example of how Kubernetes’ layered architecture can create subtle failure modes. Kubernetes updates its endpoint registry immediately when a pod terminates. kube-proxy updates iptables rules within seconds. But the kernel’s conntrack table—which tracks the NAT state for existing connections—doesn’t know or care about Kubernetes abstractions. It faithfully maintains mappings for connections that are still “established” from TCP’s perspective, even when the destination has been deleted.

The frustrating part is that debugging tools lie to you. kubectl get endpoints shows correct state. iptables -L -t nat shows correct rules. Network policies are fine. The Service is healthy. Only when you dig into conntrack -L on specific nodes do you see the stale NAT mappings causing failures.

The fundamental fix is connection lifecycle management. If you bound connection lifetime to be shorter than the interval between deployments, stale connections get recycled before they can cause problems. Most connection pools support max lifetime settings, so use them. For gRPC and WebSockets, enable keepalive with aggressive timeouts so dead connections are detected quickly.

Key principles:

  1. conntrack is per-node and persists across endpoint changes: existing connections don’t see kube-proxy updates
  2. Long-lived connections are the risk: connection pools, gRPC streams, and WebSockets can outlive pods
  3. Failures are node-local: different nodes have different conntrack state, making diagnosis confusing
  4. Set max connection lifetime < interval between deployments: recycle connections before they become stale
  5. Check conntrack first when you see intermittent ECONNRESET to Services that look healthy

The ghost pod might be haunting your cluster right now. Check conntrack -L on a node after your next deployment.


Cite this article

If you reference this post, please link to the original URL and credit the author.

Michal Drozd. "The Ghost Pod: Why Your Service Still Sends Traffic to Dead Endpoints". https://www.michal-drozd.com/en/blog/kubernetes-ghost-pod-conntrack/ (Published January 5, 2025).