Kubernetes Headless Service DNS: Stale Records After Pod Deletion
Headless services are great until stale DNS makes them feel haunted.
“Connection refused to pod that should exist.” The error made no sense. The pod was running; we could see it in kubectl get pods. But our client application kept getting connection refused errors. After extensive debugging, we discovered the pod we were connecting to wasn’t the pod we thought we were connecting to. Our client was still trying to reach the IP of a pod that had been deleted 30 seconds earlier.
The culprit was the combination of headless services and DNS caching. Regular Kubernetes services give you a single ClusterIP, and kube-proxy handles routing to healthy pods in real-time. But headless services (clusterIP: None) return all pod IPs directly via DNS. When a pod dies, the client’s DNS cache still contains the old IP until the cache TTL expires. During that window, you’re trying to connect to a pod that no longer exists.
What made this particularly insidious was the timing. During high-traffic periods, everything worked fine—connections were established frequently, DNS was refreshed often. But after quiet periods, cached DNS records aged. When traffic resumed, the first requests would try stale IPs and fail. It looked like an initialization problem, but it was actually a caching problem.
This is a fundamental trade-off of headless services. You get direct pod addressing (useful for stateful workloads, databases, etc.), but you lose the real-time routing that ClusterIP services provide. You’re trading convenience for control, and DNS caching is the price you pay.
Environment: Kubernetes 1.25+, headless services for StatefulSets or direct pod access, client-side DNS caching
The Problem
The Stale Connection Incident
Timeline:
T+0s   StatefulSet has 3 replicas: pod-0, pod-1, pod-2
       DNS: myapp-headless.ns.svc → [10.0.1.10, 10.0.1.11, 10.0.1.12]
T+5s   Client pod (with 30s DNS cache TTL) resolves myapp-headless and caches all three IPs
T+10s  pod-2 is deleted (scale down or rolling update)
T+11s  Endpoints controller removes 10.0.1.12 from endpoints
T+12s  CoreDNS receives updated endpoints
T+13s  CoreDNS serves the updated record set
T+15s  Client connects using its cached answer: [10.0.1.10, 10.0.1.11, 10.0.1.12]
       Client tries 10.0.1.12 → "Connection refused"
T+35s  Client DNS cache expires, gets fresh answer
       Finally works correctly
Window of broken connections: up to ~30 seconds (the client's DNS cache TTL); 25 seconds in this example
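To see why the client keeps dialing 10.0.1.12, it can help to model the client-side cache directly. The Go sketch below is a toy, not how any particular resolver is implemented: ttlCache, its injectable clock, and the fake upstream list are hypothetical names introduced for illustration, and it assumes a fixed 30s TTL that starts when the answer is cached.

```go
package main

import (
	"fmt"
	"time"
)

// entry is one cached DNS answer and the time it stops being served.
type entry struct {
	ips     []string
	expires time.Time
}

// ttlCache is a toy stand-in for a client-side DNS cache (hypothetical type).
type ttlCache struct {
	ttl     time.Duration
	now     func() time.Time           // injectable clock
	lookup  func(name string) []string // upstream resolver
	entries map[string]entry
}

// resolve serves from the cache until the entry expires, then asks upstream.
func (c *ttlCache) resolve(name string) []string {
	if e, ok := c.entries[name]; ok && c.now().Before(e.expires) {
		return e.ips // cache hit, possibly stale
	}
	ips := c.lookup(name)
	c.entries[name] = entry{ips: ips, expires: c.now().Add(c.ttl)}
	return ips
}

// demo returns how many IPs the client sees at three points in time.
func demo() (fresh, stale, refreshed int) {
	clock := time.Unix(0, 0)
	upstream := []string{"10.0.1.10", "10.0.1.11", "10.0.1.12"}
	c := &ttlCache{
		ttl:     30 * time.Second,
		now:     func() time.Time { return clock },
		lookup:  func(string) []string { return upstream },
		entries: map[string]entry{},
	}
	fresh = len(c.resolve("myapp-headless")) // answer is cached here

	upstream = upstream[:2] // pod-2 is deleted upstream
	clock = clock.Add(10 * time.Second)
	stale = len(c.resolve("myapp-headless")) // still inside the TTL

	clock = clock.Add(30 * time.Second)
	refreshed = len(c.resolve("myapp-headless")) // TTL passed, re-resolved
	return
}

func main() {
	fmt.Println(demo()) // 3 3 2
}
```

The middle value is the problem in miniature: upstream already knows about two pods, but the client still holds three addresses until its own TTL runs out.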
Why Headless Services Are Different
# Regular ClusterIP service:
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  selector:
    app: myapp
  ports:
  - port: 80
# DNS returns: single ClusterIP (10.96.x.x)
# kube-proxy handles routing to healthy pods
# Pod deletion = invisible to clients

# Headless service:
apiVersion: v1
kind: Service
metadata:
  name: myapp-headless
spec:
  clusterIP: None  # <-- Headless!
  selector:
    app: myapp
  ports:
  - port: 80
# DNS returns: ALL pod IPs directly
# Client chooses which pod to connect to
# Pod deletion = client may have stale IP cached
Root Cause
DNS Caching Layers
DNS query path and caching:
┌─────────────────────────────────────────────────────────────┐
│ Application                                                 │
│ └─► Language runtime cache (Go: none built in, Java: 30s)   │
│     └─► glibc nscd cache (if enabled)                       │
│         └─► Node-local DNS cache (if deployed)              │
│             └─► CoreDNS (cache plugin: typically 30s)       │
│                 └─► Endpoints API (source of truth)         │
└─────────────────────────────────────────────────────────────┘
Each layer can hold stale data!
Default TTLs:
- CoreDNS kubernetes plugin: 5s record TTL by default (but clients may cache longer)
- CoreDNS cache plugin: 30s (the "cache 30" line found in typical Corefiles)
- Java InetAddress: 30s (or forever with a security manager)
- Go net.Resolver: no built-in cache; every lookup goes to the system or built-in stub resolver
- glibc with nscd: configurable, often 60s+
- Node-local DNS: configurable, often 30s
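Because the layers expire independently, their delays can stack. As a rough, pessimistic bound, the sketch below simply adds the TTLs. It assumes every layer holds an answer for its own fixed TTL and ignores how old the answer already was upstream; caches that honor the remaining upstream TTL will do better than this. The layer type and the worstCaseStaleness helper are illustrative names, not part of any library.

```go
package main

import (
	"fmt"
	"time"
)

// layer is one cache on the path from the application to CoreDNS.
type layer struct {
	name string
	ttl  time.Duration
}

// worstCaseStaleness adds up the TTLs, assuming each layer caches answers
// for its own fixed TTL regardless of the age of the upstream answer.
func worstCaseStaleness(layers []layer) time.Duration {
	var total time.Duration
	for _, l := range layers {
		total += l.ttl
	}
	return total
}

func main() {
	layers := []layer{
		{name: "JVM InetAddress cache", ttl: 30 * time.Second},
		{name: "node-local DNS", ttl: 30 * time.Second},
		{name: "CoreDNS cache plugin", ttl: 30 * time.Second},
	}
	fmt.Println(worstCaseStaleness(layers)) // 1m30s
}
```

Even as an upper bound, this is a useful reminder that shortening one layer's TTL does little if another layer above it still caches for 30 seconds.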
The Race Condition
Timing of pod deletion:
T+0.000s  kubectl delete pod pod-2
T+0.050s  API server marks pod as Terminating
T+0.100s  Endpoints controller sees the change via its watch
T+0.150s  Endpoints object updated (pod-2 IP removed)
T+0.200s  CoreDNS receives Endpoints update via watch
T+0.250s  CoreDNS stops returning pod-2's IP for new lookups (answers already cached downstream keep their TTL)
BUT: Client queried at T-5.000s
     Client's cached DNS expires at T+25.000s
     For 25 more seconds, client uses stale IP!
With connection failures:
- TCP: "Connection refused", "no route to host", or a timeout, depending on how the network treats the released IP
- If IP reassigned to new pod: wrong data/wrong service!
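The 25-second figure above is just the cache expiry time minus the deletion time. A minimal sketch of that arithmetic, with a hypothetical helper name (staleWindow) and all times expressed relative to the moment of deletion:

```go
package main

import (
	"fmt"
	"time"
)

// staleWindow returns how long a client keeps using a deleted pod's IP.
// cachedAt and deletedAt are offsets on the same timeline; ttl is the
// client's cache TTL. A lookup made after the deletion is assumed fresh.
func staleWindow(cachedAt, deletedAt, ttl time.Duration) time.Duration {
	expiresAt := cachedAt + ttl
	if cachedAt >= deletedAt || expiresAt <= deletedAt {
		return 0
	}
	return expiresAt - deletedAt
}

func main() {
	// Client queried 5s before the delete and caches for 30s.
	fmt.Println(staleWindow(-5*time.Second, 0, 30*time.Second)) // 25s
}
```

The worst case is a lookup that lands just before the deletion, which leaves the client stale for almost the full TTL.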
Diagnosis
Check DNS TTL in Responses
# Query CoreDNS directly for headless service
kubectl run -it --rm debug --image=alpine --restart=Never -- \
  nslookup -debug myapp-headless.default.svc.cluster.local
# Look for TTL in response:
# myapp-headless.default.svc.cluster.local
# origin = ns.dns.cluster.local
# ttl = 30 <-- This is cached client-side!
Check Endpoints Timing
# Watch endpoints changes
kubectl get endpoints myapp-headless -w
# Compare with pod deletion timing
kubectl get pods -w
# Check endpoints update latency
kubectl get endpoints myapp-headless -o json | \
  jq '.subsets[].addresses[].ip'
Identify Cached Stale IPs
# From client pod, check what it thinks the IPs are
kubectl exec client-pod -- getent hosts myapp-headless.default.svc
# Compare with actual endpoints
kubectl get endpoints myapp-headless -o jsonpath='{.subsets[*].addresses[*].ip}'
# If different, client has stale cache
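If you collect both lists programmatically, the comparison is a set difference. The sketch below assumes you already have the two lists as strings (for example, parsed from the commands above); staleIPs is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"sort"
)

// staleIPs returns the addresses present in cached but absent from endpoints.
func staleIPs(cached, endpoints []string) []string {
	live := make(map[string]struct{}, len(endpoints))
	for _, ip := range endpoints {
		live[ip] = struct{}{}
	}
	var stale []string
	for _, ip := range cached {
		if _, ok := live[ip]; !ok {
			stale = append(stale, ip)
		}
	}
	sort.Strings(stale) // stable output for logs and diffs
	return stale
}

func main() {
	cached := []string{"10.0.1.10", "10.0.1.11", "10.0.1.12"} // what the client resolved
	endpoints := []string{"10.0.1.10", "10.0.1.11"}           // what the Endpoints object lists
	fmt.Println(staleIPs(cached, endpoints)) // [10.0.1.12]
}
```

An empty result means the client's view matches the endpoints at the moment you sampled both.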
The Fix
Option 1: Reduce DNS TTL
# CoreDNS ConfigMap - keep record and cache TTLs short
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            ttl 5    # Record TTL for cluster records (5s is also the plugin default)
            fallthrough in-addr.arpa ip6.arpa
        }
        cache 10     # Cap CoreDNS's own cache at 10s (commonly 30 in stock Corefiles)
        # ... rest of config
    }
Option 2: Client-Side TTL Configuration
// Java: shorten the JVM's DNS cache (set these at startup, before the first lookup)
java.security.Security.setProperty("networkaddress.cache.ttl", "5");
java.security.Security.setProperty("networkaddress.cache.negative.ttl", "1");

// Go: the standard resolver does not cache; if you add a caching resolver, refresh it often
import (
	"time"

	"github.com/rs/dnscache"
)

resolver := &dnscache.Resolver{
	Timeout: 5 * time.Second,
}
// Refresh cached entries periodically; true also drops entries unused since the last refresh
go func() {
	t := time.NewTicker(5 * time.Second)
	defer t.Stop()
	for range t.C {
		resolver.Refresh(true)
	}
}()
Option 3: Connection Retry with Re-resolve
// Retry failed connections with DNS re-resolution.
// Needs imports: "fmt", "net", "time".
func connectWithRetry(service string) (net.Conn, error) {
	var lastErr error
	for attempt := 0; attempt < 3; attempt++ {
		if attempt > 0 {
			// Linear backoff before each retry (1s, then 2s)
			time.Sleep(time.Duration(attempt) * time.Second)
		}
		// Go does not cache lookups itself, so this asks the resolver again on each attempt
		ips, err := net.LookupHost(service)
		if err != nil {
			lastErr = err
			continue
		}
		for _, ip := range ips {
			// JoinHostPort also formats IPv6 addresses correctly
			conn, err := net.DialTimeout("tcp", net.JoinHostPort(ip, "80"), 5*time.Second)
			if err == nil {
				return conn, nil
			}
			lastErr = err
		}
	}
	return nil, fmt.Errorf("all attempts failed: %w", lastErr)
}
Option 4: Use Regular Service with Session Affinity
# If you don't need direct pod addressing,
# use regular ClusterIP service instead
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  type: ClusterIP  # Not headless
  selector:
    app: myapp
  ports:
  - port: 80
  sessionAffinity: ClientIP  # If you need sticky sessions
Option 5: Graceful Shutdown with Delay
# Give DNS time to propagate before pod stops accepting connections
spec:
  containers:
  - name: myapp
    lifecycle:
      preStop:
        exec:
          command:
          - /bin/sh
          - -c
          - |
            # Mark the pod as shutting down so the readiness probe starts failing
            touch /tmp/shutdown
            # Wait for DNS to propagate
            sleep 10
            # Then exit
    readinessProbe:
      exec:
        command:
        - /bin/sh
        - -c
        - "test ! -f /tmp/shutdown"
      periodSeconds: 2
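The same ordering can also live inside the application: fail readiness first, wait, and only then stop serving. The sketch below shows just that sequencing under stated assumptions. The drainer type and gracefulStop method are hypothetical names, the stop callback stands in for something like http.Server.Shutdown, and a real service would trigger it on SIGTERM with a delay closer to the 10 seconds used above.

```go
package main

import (
	"fmt"
	"net/http"
	"sync/atomic"
	"time"
)

// drainer tracks whether the process has started shutting down.
type drainer struct{ shuttingDown atomic.Bool }

// readyStatus is the status code a readiness endpoint would return.
func (d *drainer) readyStatus() int {
	if d.shuttingDown.Load() {
		return http.StatusServiceUnavailable
	}
	return http.StatusOK
}

// gracefulStop fails readiness first, waits so endpoint removal and short
// DNS TTLs can propagate, and only then calls stop.
func (d *drainer) gracefulStop(delay time.Duration, stop func()) {
	d.shuttingDown.Store(true)
	time.Sleep(delay)
	stop()
}

func main() {
	d := &drainer{}
	fmt.Println(d.readyStatus()) // 200

	// Short delay for the demo only.
	d.gracefulStop(10*time.Millisecond, func() { fmt.Println("listener closed") })
	fmt.Println(d.readyStatus()) // 503
}
```

Either way, the delay has to fit inside the pod's termination grace period, or the kubelet will kill the container before the wait finishes.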
Monitoring
groups:
- name: headless-dns
  rules:
  - alert: HeadlessDNSStaleRecords
    expr: |
      rate(dns_lookup_failures_total{service_type="headless"}[5m]) > 0.1
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High DNS lookup failures for headless services"
  - alert: EndpointsUpdateDelay
    expr: |
      histogram_quantile(0.99,
        rate(endpoint_update_latency_seconds_bucket[5m])
      ) > 1
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Endpoints updates taking > 1s"
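Alerts like these depend on clients reporting why a connection failed. One possible way to label failures on the client side is sketched below; the function name and the label strings are assumptions for illustration, not an existing metric convention. It relies only on the standard library's error types, and a stale pod IP would typically show up under connection_refused or timeout rather than as a DNS lookup failure.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"os"
	"syscall"
)

// dialFailureReason maps a dial error to a coarse label for a failure counter.
func dialFailureReason(err error) string {
	var dnsErr *net.DNSError
	var netErr net.Error
	switch {
	case err == nil:
		return "ok"
	case errors.As(err, &dnsErr):
		return "dns_lookup_failed"
	case errors.Is(err, syscall.ECONNREFUSED):
		return "connection_refused" // typical when dialing a stale pod IP
	case errors.Is(err, syscall.EHOSTUNREACH):
		return "host_unreachable"
	case errors.As(err, &netErr) && netErr.Timeout():
		return "timeout"
	default:
		return "other"
	}
}

// exampleRefused builds an error with the same shape net.Dial returns
// when the remote side refuses the connection.
func exampleRefused() error {
	return &net.OpError{
		Op:  "dial",
		Net: "tcp",
		Err: os.NewSyscallError("connect", syscall.ECONNREFUSED),
	}
}

func main() {
	fmt.Println(dialFailureReason(exampleRefused())) // connection_refused
	fmt.Println(dialFailureReason(&net.DNSError{
		Err: "no such host", Name: "myapp-headless", IsNotFound: true,
	})) // dns_lookup_failed
}
```

The returned string could be used as a label value on whatever counter your client already exports.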
Checklist
## Headless Service Stale DNS
### Symptoms
- [ ] "Connection refused" to pods that were recently deleted
- [ ] Intermittent failures during rolling updates
- [ ] Issues resolve after waiting ~30 seconds
- [ ] Works with regular services, fails with headless
### Diagnosis
- [ ] Check DNS TTL in CoreDNS config
- [ ] Compare client cached IPs with actual endpoints
- [ ] Time endpoints update propagation
- [ ] Check client-side DNS caching settings
### Fixes
- [ ] Reduce CoreDNS TTL for kubernetes plugin
- [ ] Configure client DNS cache TTL
- [ ] Add connection retry with re-resolution
- [ ] Use preStop hook for graceful shutdown
- [ ] Consider regular ClusterIP if direct pod addressing not needed
Conclusion
This problem illustrates a fundamental architectural difference between headless and ClusterIP services. With a ClusterIP service, routing happens at the kernel level (via iptables or IPVS), and updates propagate in milliseconds. With a headless service, routing happens at the application level via DNS, and DNS is designed for caching. These are different tools with different trade-offs.
Headless services are the right choice when you need direct pod addressing—for StatefulSets where clients need to connect to specific replicas, for databases that need peer-to-peer discovery, for any workload where the client needs to know individual pod identities. But you have to account for DNS caching in your design.
The fix is usually some combination of shorter TTLs, client-side retry with re-resolution, and graceful shutdown delays. Each has costs: shorter TTLs mean more DNS queries, retry logic adds complexity, shutdown delays slow down rollouts. There’s no free lunch.
Key principles:
- DNS caching exists at multiple layers - client runtime, node-local DNS, CoreDNS all cache independently
- Reduce TTL for headless services - 5s instead of 30s gives faster convergence
- Implement client-side retry with re-resolution - on connection failure, re-query DNS
- Use graceful shutdown to give DNS time to propagate - preStop hook with readiness failure
- Question whether you need headless - if ClusterIP works for your use case, it avoids this entirely
Related Articles
- Kubernetes DNS Caching ndots - DNS configuration pitfalls
- Gossip Ghost Nodes IP Reuse - Similar stale IP issues