The Ghost Pod: Why Your Service Still Sends Traffic to Dead Endpoints
We kept seeing traffic sent to pods that no longer existed. It started with a question: “Why is this one node getting connection resets to the database?” The PostgreSQL pods were healthy. The Service endpoints looked correct. kubectl describe svc showed the right backend pods. But one application node was intermittently failing with ECONNRESET while the others were fine. We spent two days checking application code, pool configurations, and network policies, and everything looked perfect.
The breakthrough came when I SSH’d to the failing node and ran conntrack -L | grep 5432. There it was: a NAT mapping pointing to 10.244.3.47—an IP that belonged to a pod that had been terminated during yesterday’s rolling deployment. The connection pool had opened that connection before the deployment, and conntrack was still faithfully translating packets to a destination that no longer existed. The kernel was doing exactly what it was supposed to do; it just had stale state from a connection that outlived the pod it was talking to.
This is one of the most frustrating Kubernetes networking issues because everything looks fine. Endpoints are correct. kube-proxy rules are correct. New connections work perfectly. But existing long-lived connections—database pools, gRPC streams, WebSocket connections—can remain pinned to dead endpoints through conntrack NAT mappings that persist until the connection closes or times out.
The fundamental issue is a mismatch between Kubernetes’ view of endpoints (updated as soon as a pod starts terminating) and the kernel’s conntrack table (which preserves NAT mappings for the lifetime of the connection). When kube-proxy updates iptables rules to remove an endpoint, the new rules only affect new connections: the DNAT rules are consulted for the first packet of a flow, and every later packet follows the existing conntrack entry. kube-proxy does not invalidate conntrack entries for established TCP connections, so those connections keep using the old NAT mapping and send packets into the void.
Environment: Kubernetes 1.28+, kube-proxy in iptables mode, long-lived TCP connections (connection pools, gRPC, WebSockets)
Understanding the Mechanism
How kube-proxy and conntrack Interact
Normal Service flow:
Client Pod (10.244.1.50:45678)
|
| SYN to ClusterIP (10.96.100.50:5432)
↓
iptables DNAT rule (kube-proxy managed)
|
| Translates to backend: 10.244.3.47:5432
↓
conntrack creates NAT mapping:
src=10.244.1.50:45678 → dst=10.96.100.50:5432
reply: src=10.244.3.47:5432 → dst=10.244.1.50:45678
|
↓
Backend Pod (10.244.3.47:5432)
This mapping persists for the lifetime of the connection.
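To make the mapping concrete, here is a minimal sketch that pulls the two halves of the translation out of a single conntrack -L line. It assumes the usual one-line text layout (original tuple first, reply tuple second); the function name nat_mapping is hypothetical.

```shell
# nat_mapping: print "<ClusterIP:port> -> <backend IP:port>" for one conntrack line.
# Assumes the original tuple comes first and the reply tuple second.
nat_mapping() {
  printf '%s\n' "$1" | awk '{
    s = 0; d = 0; sp = 0; dp = 0
    for (i = 1; i <= NF; i++) {
      if ($i ~ /^src=/   && ++s  == 2) backend = substr($i, 5)   # reply source = real backend
      if ($i ~ /^dst=/   && ++d  == 1) vip     = substr($i, 5)   # original destination = ClusterIP
      if ($i ~ /^sport=/ && ++sp == 2) bport   = substr($i, 7)   # reply source port
      if ($i ~ /^dport=/ && ++dp == 1) vport   = substr($i, 7)   # original destination port
    }
    if (backend != "" && vip != "") print vip ":" vport " -> " backend ":" bport
  }'
}

entry='tcp 6 86393 ESTABLISHED src=10.244.1.50 dst=10.96.100.50 sport=45678 dport=5432 src=10.244.3.47 dst=10.244.1.50 sport=5432 dport=45678 [ASSURED] mark=0'
nat_mapping "$entry"
```

The client only ever addresses the ClusterIP; the backend on the right-hand side is chosen once, when the first packet of the connection is translated.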
The Ghost Pod Problem
Timeline of failure:
T+0:00 Connection pool opens connection to postgres-svc
conntrack entry created: → 10.244.3.47:5432
Connection is ESTABLISHED, working fine
T+1:00 Rolling deployment starts
New postgres pod: 10.244.3.48
Old postgres pod (10.244.3.47) terminating
T+1:05 kube-proxy updates iptables rules
Service now routes to 10.244.3.48
NEW connections go to new pod ✓
T+1:10 Old pod fully terminated
IP 10.244.3.47 no longer exists
BUT: conntrack entry still maps to it!
T+1:15 Application reuses pooled connection
Packet goes to 10.244.3.47 (via conntrack)
No destination → ECONNRESET or timeout
Result: Some connections fail, others work
Failure is node-local (conntrack is per-node)
New connections always work (use updated rules)
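How long the stale mapping can linger depends on the entry's timeout. In conntrack -L output the third field is the number of seconds left before the entry expires. The sketch below (entry_ttl is a hypothetical helper name) turns that field into something easier to read.

```shell
# entry_ttl: print the remaining lifetime of a conntrack entry as XhYmZs.
# The third whitespace-separated field of a conntrack -L line is the timeout in seconds.
entry_ttl() {
  printf '%s\n' "$1" | awk '{
    t = $3
    printf "%dh%dm%ds\n", int(t / 3600), int((t % 3600) / 60), t % 60
  }'
}

entry_ttl 'tcp 6 86393 ESTABLISHED src=10.244.1.50 dst=10.96.100.50 sport=45678 dport=5432 src=10.244.3.47 dst=10.244.1.50 sport=5432 dport=45678 [ASSURED] mark=0'
```

An entry with that much time left will not expire on its own any time soon, so the application has to close the connection or the entry has to be deleted.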
Why Only Some Nodes?
conntrack is node-local:
Node A:
├── Pod 1 (opened conn at T+0)
│ └── conntrack: → old pod 10.244.3.47 ← STALE
├── Pod 2 (opened conn at T+2)
│ └── conntrack: → new pod 10.244.3.48 ← OK
└── New connections → 10.244.3.48 ✓
Node B:
├── Pod 3 (no existing connections)
│ └── All connections → 10.244.3.48 ✓
└── No stale conntrack entries
Only Node A's Pod 1 sees failures!
This makes it look like an application bug, not networking.
Diagnosing Ghost Pods
Check conntrack Entries
# SSH to the affected node
# Find conntrack entries for your service
conntrack -L -p tcp --dport 5432 2>/dev/null | head -20
# Example output showing a stale entry (conntrack prints each connection on one line):
# tcp 6 86393 ESTABLISHED src=10.244.1.50 dst=10.96.100.50 sport=45678 dport=5432 src=10.244.3.47 dst=10.244.1.50 sport=5432 dport=45678 [ASSURED] mark=0
# The second src= (the reply direction, here 10.244.3.47) is the actual backend
# If that IP doesn't exist anymore → ghost pod!
# Check if the backend IP exists (all namespaces, whole-word fixed-string match)
kubectl get pods -A -o wide | grep -wF 10.244.3.47
# (no output = ghost pod confirmed)
# Count conntrack entries by backend (the second src= on each line)
conntrack -L -p tcp 2>/dev/null | \
awk '{n=0; for(i=1;i<=NF;i++) if ($i ~ /^src=/ && ++n==2) print substr($i,5)}' | \
sort | uniq -c | sort -rn | head
Compare Endpoints vs conntrack
# Get current endpoints
kubectl get endpoints postgres-svc -o jsonpath='{.subsets[*].addresses[*].ip}'
# Output: 10.244.3.48 10.244.3.49
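# Get conntrack backends for that service
# (conntrack output has no "reply" label; the backend is the second src= on each line)
conntrack -L -p tcp --dport 5432 2>/dev/null | \
awk '{n=0; for(i=1;i<=NF;i++) if ($i ~ /^src=/ && ++n==2) print substr($i,5)}' | sort -u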
# Output: 10.244.3.47 10.244.3.48 10.244.3.49
# 10.244.3.47 is in conntrack but not in endpoints = ghost!
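The comparison above is a set difference: backends seen in conntrack minus current endpoints. A minimal sketch of that step on its own, with the two lists passed in as strings (ghost_backends is a hypothetical name), could look like this. It matches whole addresses, so 10.244.3.4 is never confused with 10.244.3.47.

```shell
# ghost_backends: print IPs that appear in conntrack but not in the endpoint list.
# $1 = whitespace-separated current endpoint IPs
# $2 = whitespace-separated backend IPs seen in conntrack
ghost_backends() {
  live=" $(echo $1) "          # normalise whitespace and pad so every IP is space-delimited
  for ip in $2; do
    case "$live" in
      *" $ip "*) ;;            # exact match: still a live endpoint
      *) echo "$ip" ;;         # not in endpoints: ghost
    esac
  done
}

ghost_backends '10.244.3.48 10.244.3.49' '10.244.3.47 10.244.3.48 10.244.3.49'
```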
Identify Affected Connections
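# Find all stale conntrack entries (backend IPs not in endpoints)
ENDPOINTS=$(kubectl get endpoints postgres-svc -o jsonpath='{.subsets[*].addresses[*].ip}' | tr ' ' '\n')
conntrack -L -p tcp --dport 5432 2>/dev/null | while read -r line; do
  # conntrack prints no "reply" label; the backend is the second src= field
  DEST=$(echo "$line" | grep -oE 'src=[0-9.]+' | sed -n '2s/src=//p')
  [ -n "$DEST" ] || continue
  # -x matches the whole line so 10.244.3.4 cannot match 10.244.3.47
  if ! echo "$ENDPOINTS" | grep -qxF "$DEST"; then
    echo "STALE: $line"
  fi
done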
Application-Side Symptoms
# In application logs, look for:
# - ECONNRESET after period of success
# - "server closed the connection unexpectedly"
# - Intermittent timeouts to healthy services
# - Errors that correlate with deployment times
# Check if failures are node-specific
# --prefix tags each line with [pod/<name>/<container>]; the app=myapp label is an assumption
kubectl logs -l app=myapp --all-containers --prefix --tail=-1 | grep -i "reset\|timeout" | \
while read -r line; do
  POD=$(echo "$line" | grep -oP 'pod/\K[^/]+')
  NODE=$(kubectl get pod "$POD" -o jsonpath='{.spec.nodeName}')
  echo "$NODE"
done | sort | uniq -c
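Before mapping pods to nodes, it can help to see how the errors are distributed across pods. The sketch below assumes log lines in the format kubectl logs --prefix produces, that is, starting with [pod/<name>/<container>]; errors_per_pod is a hypothetical helper and the sample lines are made up for illustration.

```shell
# errors_per_pod: read prefixed log lines on stdin, print "<count> <pod>" for each pod
# that logged a reset or timeout.
errors_per_pod() {
  grep -iE 'reset|timeout' |
    sed -n 's|^\[pod/\([^/]*\)/.*|\1|p' |   # keep only the pod name from the prefix
    sort | uniq -c | awk '{print $1, $2}'   # normalise uniq's padded output
}

printf '%s\n' \
  '[pod/myapp-a/app] read ECONNRESET' \
  '[pod/myapp-a/app] connection timeout' \
  '[pod/myapp-b/app] request ok' | errors_per_pod
```

If one or two pods account for nearly all the errors, kubectl get pods -o wide shows whether they share a node.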
The Fix
Option 1: Flush Stale conntrack Entries
# Nuclear option: flush all conntrack for a destination port
conntrack -D -p tcp --dport 5432
# More surgical: flush entries for specific dead IP
conntrack -D -p tcp --reply-src 10.244.3.47
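# Automated cleanup script
#!/bin/bash
# cleanup-ghost-conntrack.sh <service> <port>
# Run on the node as root; needs kubectl access to read the Service's endpoints
SERVICE=$1
PORT=$2
# Get current valid endpoints, one IP per line
ENDPOINTS=$(kubectl get endpoints "$SERVICE" -o jsonpath='{.subsets[*].addresses[*].ip}' | tr ' ' '\n')
# Abort if the lookup failed, otherwise every entry would look stale
[ -n "$ENDPOINTS" ] || { echo "No endpoints found for $SERVICE" >&2; exit 1; }
# Find and delete stale entries
conntrack -L -p tcp --dport "$PORT" 2>/dev/null | while read -r line; do
  # The backend is the second src= field (reply direction)
  DEST=$(echo "$line" | grep -oE 'src=[0-9.]+' | sed -n '2s/src=//p')
  [ -n "$DEST" ] || continue
  # -x matches the whole line so 10.244.3.4 cannot match 10.244.3.47
  if ! echo "$ENDPOINTS" | grep -qxF "$DEST"; then
    echo "Deleting stale entry to $DEST"
    conntrack -D -p tcp --dport "$PORT" --reply-src "$DEST"
  fi
done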
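Deleting conntrack entries on a production node is easier to trust after a dry run. The following sketch only prints the delete commands it would issue, one per stale backend, so they can be reviewed first. The function name plan_conntrack_cleanup and its argument order are assumptions for illustration.

```shell
# plan_conntrack_cleanup: read conntrack -L lines on stdin and print (not run)
# one delete command per stale backend.
# $1 = service port, $2 = whitespace-separated live endpoint IPs
plan_conntrack_cleanup() {
  port=$1
  live=" $(echo $2) "
  awk '{n = 0; for (i = 1; i <= NF; i++) if ($i ~ /^src=/ && ++n == 2) print substr($i, 5)}' |
    sort -u |
    while read -r ip; do
      case "$live" in
        *" $ip "*) ;;   # still a valid endpoint, leave it alone
        *) echo "conntrack -D -p tcp --dport $port --reply-src $ip" ;;
      esac
    done
}

# Usage on a node (endpoint IPs pasted in by hand for this example):
# conntrack -L -p tcp --dport 5432 2>/dev/null | plan_conntrack_cleanup 5432 '10.244.3.48 10.244.3.49'
```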
Option 2: Proper Connection Draining
# Pod spec with proper termination handling
apiVersion: v1
kind: Pod
spec:
  terminationGracePeriodSeconds: 60
  containers:
  - name: postgres
    lifecycle:
      preStop:
        exec:
          command:
          - /bin/sh
          - -c
          - |
            # Give endpoint removal time to propagate so no new connections arrive
            sleep 10
            # Smart shutdown: refuse new connections, wait up to 30s for clients to disconnect
            pg_ctl stop -m smart -w -t 30
    # pg_isready fails once PostgreSQL starts rejecting connections
    readinessProbe:
      exec:
        command: ["pg_isready"]
      periodSeconds: 5
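The grace period countdown includes the time spent in the preStop hook, so the hook's delay plus the shutdown timeout has to fit inside terminationGracePeriodSeconds with some room to spare. A small sanity check along these lines (grace_budget is a hypothetical helper, all values in seconds) can catch a budget that is too tight.

```shell
# grace_budget: succeed only if the preStop delay plus the shutdown timeout leaves
# headroom before the kubelet sends SIGKILL.
# $1 = terminationGracePeriodSeconds, $2 = preStop delay, $3 = shutdown timeout
grace_budget() {
  used=$(( $2 + $3 ))
  if [ "$used" -lt "$1" ]; then
    echo "ok: ${used}s used, $(( $1 - used ))s headroom"
  else
    echo "too tight: ${used}s used of ${1}s grace period"
    return 1
  fi
}

grace_budget 60 10 30
```

With a 60 second grace period, a 10 second delay and a 30 second shutdown timeout leave 20 seconds of slack; two 30 second steps would leave none.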
Option 3: Connection Pool Configuration
// HikariCP - Bound connection lifetime and validate idle connections
HikariConfig config = new HikariConfig();
config.setConnectionTimeout(5000);
config.setValidationTimeout(3000);
config.setKeepaliveTime(30000); // Test idle connections every 30s (a pool-level check, not TCP keepalive)
config.setConnectionTestQuery("SELECT 1"); // Only needed for drivers without JDBC4 isValid() support
// Key: Limit connection lifetime to bound stale connection risk
config.setMaxLifetime(300000); // 5 minutes - shorter than the typical interval between deploys
// Go sql.DB - Set connection lifetime
db.SetConnMaxLifetime(5 * time.Minute)
db.SetConnMaxIdleTime(1 * time.Minute)
// For gRPC - Enable keepalive with short timeout
// Note: servers reject pings sent more often than their keepalive enforcement policy allows
// (grpc-go servers default to a 5 minute minimum), so the server must permit a 10s interval
conn, err := grpc.Dial(
    target,
    grpc.WithKeepaliveParams(keepalive.ClientParameters{
        Time:                10 * time.Second,
        Timeout:             3 * time.Second,
        PermitWithoutStream: true,
    }),
)
Option 4: Switch to IPVS Mode
# kube-proxy config - IPVS has better connection tracking
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: ipvs
ipvs:
  scheduler: lc # least connection
  syncPeriod: 30s
  minSyncPeriod: 5s
  # IPVS can gracefully drain connections
  tcpTimeout: 900s
  tcpFinTimeout: 30s
  udpTimeout: 300s
Option 5: Use Headless Service for Stateful Connections
# Headless service - clients connect directly to pod IPs
apiVersion: v1
kind: Service
metadata:
  name: postgres-headless
spec:
  clusterIP: None # Headless
  selector:
    app: postgres
  ports:
  - port: 5432
---
# Application connects to postgres-0.postgres-headless.namespace.svc
# No DNAT = no conntrack stale entries
# But: Must handle pod IP changes in application
Prevention
Deployment Strategy
# Rolling update with proper draining
apiVersion: apps/v1
kind: Deployment
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    spec:
      terminationGracePeriodSeconds: 60
      containers:
      - name: app
        lifecycle:
          preStop:
            exec:
              command:
              - /bin/sh
              - -c
              # Keep serving while endpoint removal propagates and clients recycle pooled connections
              - sleep 45
Connection Lifetime Budget
Design principle: Max connection lifetime < interval between deployments
If you deploy every hour:
- Max connection lifetime: 30 minutes
- Pool refresh interval: 15 minutes
- Stale connections get recycled before next deploy
If you deploy multiple times per day:
- Max connection lifetime: 5-10 minutes
- Aggressive connection recycling
- Accept slight overhead of reconnection
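The hourly example works out to half the deploy interval for max lifetime and a quarter for pool refresh. Treating that ratio as a rule of thumb (an assumption generalized from the example, not a hard rule), the budget for other intervals can be sketched as follows. The function name lifetime_budget is hypothetical.

```shell
# lifetime_budget: given the interval between deployments in minutes, print
# "<max lifetime> <pool refresh>" in minutes using the half / quarter rule of thumb.
lifetime_budget() {
  interval=$1
  echo "$(( interval / 2 )) $(( interval / 4 ))"
}

lifetime_budget 60    # hourly deploys
lifetime_budget 240   # a deploy every four hours
```

For very frequent deploys the result is best treated as an upper bound; the shorter the lifetime, the more reconnection overhead the pool has to absorb.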
Monitoring
Prometheus Alerts
groups:
- name: conntrack-ghost-pods
  rules:
  - alert: StaleConntrackEntries
    expr: |
      # Compare conntrack entries vs known endpoints
      # This requires custom exporter
      conntrack_stale_entries > 0
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Stale conntrack entries detected"
  - alert: ConnectionResetSpike
    expr: |
      # Placeholder metric name: substitute the TCP reset counter your exporter provides
      rate(tcp_connection_resets_total[5m]) > 10
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High rate of TCP connection resets"
  - alert: ServiceEndpointChurn
    expr: |
      changes(kube_endpoint_address_available[10m]) > 5
    for: 5m
    labels:
      severity: info
    annotations:
      summary: "High endpoint churn - check for ghost pods"
conntrack Monitoring Script
#!/bin/bash
# monitor-conntrack.sh - Run via cron or DaemonSet
SERVICES="postgres-svc redis-svc"
for SVC in $SERVICES; do
  PORT=$(kubectl get svc "$SVC" -o jsonpath='{.spec.ports[0].port}')
  ENDPOINTS=$(kubectl get endpoints "$SVC" -o jsonpath='{.subsets[*].addresses[*].ip}' | tr ' ' '\n')
  STALE_COUNT=0
  while read -r line; do
    # The backend is the second src= field (reply direction)
    DEST=$(echo "$line" | grep -oE 'src=[0-9.]+' | sed -n '2s/src=//p')
    [ -n "$DEST" ] || continue
    # -x matches the whole line so 10.244.3.4 cannot match 10.244.3.47
    if ! echo "$ENDPOINTS" | grep -qxF "$DEST"; then
      STALE_COUNT=$((STALE_COUNT + 1))
    fi
  done < <(conntrack -L -p tcp --dport "$PORT" 2>/dev/null)
  echo "conntrack_stale_entries{service=\"$SVC\"} $STALE_COUNT"
done
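One way to get these numbers into Prometheus without writing a full exporter is the node_exporter textfile collector, which reads *.prom files from a configured directory. The sketch below assumes that setup; the function name write_stale_metrics and the output file name are hypothetical. It writes to a temporary file and renames it so the collector never reads a half-written file.

```shell
# write_stale_metrics: read "<service> <count>" lines on stdin and atomically write
# them as one gauge family into a node_exporter textfile directory ($1).
write_stale_metrics() {
  dir=$1
  tmp="$dir/conntrack_stale.prom.$$"   # does not end in .prom, so it is ignored until renamed
  {
    echo '# HELP conntrack_stale_entries conntrack entries whose backend is not a current endpoint'
    echo '# TYPE conntrack_stale_entries gauge'
    while read -r svc count; do
      echo "conntrack_stale_entries{service=\"$svc\"} $count"
    done
  } > "$tmp"
  # rename is atomic within a filesystem
  mv "$tmp" "$dir/conntrack_stale.prom"
}

# Usage with made-up counts; in practice the input would come from the monitoring loop:
# printf '%s\n' 'postgres-svc 1' 'redis-svc 0' | write_stale_metrics /var/lib/node_exporter/textfile
```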
Checklist
## Ghost Pod Prevention and Response
### Detection
- [ ] SSH to affected node and check conntrack entries
- [ ] Compare conntrack destinations vs current endpoints
- [ ] Verify failures are node-local (not cluster-wide)
- [ ] Correlate errors with recent deployments
### Immediate Fix
- [ ] Flush stale conntrack entries for affected service
- [ ] Restart affected pods to get fresh connections
- [ ] Verify new connections work correctly
### Prevention
- [ ] Set connection pool max lifetime < interval between deployments
- [ ] Configure proper preStop hooks for graceful drain
- [ ] Enable TCP keepalive on long-lived connections
- [ ] Consider IPVS mode for better connection tracking
### Monitoring
- [ ] Alert on connection reset spikes
- [ ] Monitor endpoint churn rate
- [ ] Track conntrack table growth
- [ ] Log correlation between resets and deployments
Conclusion
The ghost pod problem is a perfect example of how Kubernetes’ layered architecture can create subtle failure modes. Kubernetes updates its endpoint registry immediately when a pod terminates. kube-proxy updates iptables rules within seconds. But the kernel’s conntrack table—which tracks the NAT state for existing connections—doesn’t know or care about Kubernetes abstractions. It faithfully maintains mappings for connections that are still “established” from TCP’s perspective, even when the destination has been deleted.
The frustrating part is that debugging tools lie to you. kubectl get endpoints shows correct state. iptables -L -t nat shows correct rules. Network policies are fine. The Service is healthy. Only when you dig into conntrack -L on specific nodes do you see the stale NAT mappings causing failures.
The fundamental fix is connection lifecycle management. If you keep connection lifetime shorter than the interval between deployments, stale connections get recycled before they can cause problems. Most connection pools support max lifetime settings, so use them. For gRPC and WebSockets, enable keepalive with aggressive timeouts so dead connections are detected quickly.
Key principles:
- conntrack is per-node and persists across endpoint changes—existing connections don’t see kube-proxy updates
- Long-lived connections are the risk—connection pools, gRPC streams, WebSockets can outlive pods
- Failures are node-local—different nodes have different conntrack state, making diagnosis confusing
- Set max connection lifetime shorter than the interval between deployments, so connections are recycled before they become stale
- Check conntrack first when you see intermittent ECONNRESET to Services that look healthy
The ghost pod might be haunting your cluster right now. Check conntrack -L on a node after your next deployment.
Related Articles
- Kubernetes Conntrack Table Exhaustion - When conntrack fills up
- gRPC Keepalive Configuration - Keepalive for long-lived connections