Kubernetes Headless Service DNS: Stale Records After Pod Deletion

Headless services are great until stale DNS makes them feel haunted. The error was “connection refused” from a pod that should have existed, and it made no sense. The pod was running; we could see it in kubectl get pods. But our client application kept getting connection refused errors. After extensive debugging, we discovered that the pod we were connecting to wasn’t the pod we thought we were connecting to: the client was still trying to reach the IP of a pod that had been deleted 30 seconds earlier.

The culprit was the combination of headless services and DNS caching. Regular Kubernetes services give you a single ClusterIP, and kube-proxy handles routing to healthy pods in real time. Headless services (clusterIP: None) instead return all pod IPs directly via DNS. When a pod dies, the client’s DNS cache still contains the old IP until the cache TTL expires. During that window, you’re trying to connect to a pod that no longer exists.

What made this particularly insidious was the timing. During high-traffic periods, everything worked fine—connections were established frequently, DNS was refreshed often. But after quiet periods, cached DNS records aged. When traffic resumed, the first requests would try stale IPs and fail. It looked like an initialization problem, but it was actually a caching problem.

This is a fundamental trade-off of headless services. You get direct pod addressing (useful for stateful workloads, databases, etc.), but you lose the real-time routing that ClusterIP services provide. You’re trading convenience for control, and DNS caching is the price you pay.

Environment: Kubernetes 1.25+, headless services for StatefulSets or direct pod access, client-side DNS caching

The Problem

The Stale Connection Incident

Timeline:

T+0s    StatefulSet has 3 replicas: pod-0, pod-1, pod-2
        DNS: myapp-headless.ns.svc → [10.0.1.10, 10.0.1.11, 10.0.1.12]

T+10s   pod-2 is deleted (scale down or rolling update)
T+11s   Endpoints controller removes 10.0.1.12 from endpoints
T+12s   CoreDNS receives updated endpoints
T+13s   CoreDNS updates its cache

T+15s   Client pod (30s DNS cache, last refreshed at T+5s) connects to myapp-headless
        Client's cached answer: [10.0.1.10, 10.0.1.11, 10.0.1.12]
        Client tries 10.0.1.12 → "Connection refused"

T+35s   Client DNS cache expires, gets fresh answer
        Finally works correctly

Window of broken connections: up to ~30 seconds (the client's DNS cache TTL)

Why Headless Services Are Different

# Regular ClusterIP service:
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  selector:
    app: myapp
  ports:
  - port: 80
# DNS returns: single ClusterIP (10.96.x.x)
# kube-proxy handles routing to healthy pods
# Pod deletion = invisible to clients

# Headless service:
apiVersion: v1
kind: Service
metadata:
  name: myapp-headless
spec:
  clusterIP: None  # <-- Headless!
  selector:
    app: myapp
  ports:
  - port: 80
# DNS returns: ALL pod IPs directly
# Client chooses which pod to connect to
# Pod deletion = client may have stale IP cached

Root Cause

DNS Caching Layers

DNS query path and caching:

┌─────────────────────────────────────────────────────────────┐
│ Application                                                 │
│ └─► Language DNS cache (Go: none, Java: 30s default)       │
│     └─► glibc nscd cache (if enabled)                      │
│         └─► Node-local DNS cache (if deployed)             │
│             └─► CoreDNS (cache plugin: 30s default)        │
│                 └─► Endpoints API (source of truth)        │
└─────────────────────────────────────────────────────────────┘

Each layer can hold stale data!

Default TTLs:
- CoreDNS kubernetes plugin: 5s record TTL (but clients may cache longer)
- CoreDNS cache plugin: 30s in the stock Kubernetes Corefile (cache 30)
- Java InetAddress: 30s (or forever with security manager)
- Go net.Resolver: no built-in cache, but uses system resolver
- glibc with nscd: configurable, often 60s+
- Node-local DNS: configurable, often 30s

The Race Condition

Timing of pod deletion:

T+0.000s  kubectl delete pod pod-2
T+0.050s  API server marks pod as Terminating
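T+0.100s  Endpoints controller observes the change via its watch
T+0.150s  Endpoints object updated (pod-2 IP removed)
T+0.200s  CoreDNS receives Endpoints update via watch
T+0.250s  CoreDNS kubernetes plugin serves the new record set
          (answers already held by the cache plugin live until their TTL expires)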

BUT: Client queried at T-5.000s
     Client's cached DNS expires at T+25.000s
     For 25 more seconds, client uses stale IP!

With connection failures:
- TCP: "Connection refused", "No route to host", or a timeout, depending on how the network plugin handles the freed IP
- If IP reassigned to new pod: wrong data/wrong service!

Diagnosis

Check DNS TTL in Responses

# Resolve the headless service from inside the cluster (uses the pod's configured resolver)
kubectl run -it --rm debug --image=alpine --restart=Never -- \
  nslookup -debug myapp-headless.default.svc.cluster.local

# Look for TTL in response:
# myapp-headless.default.svc.cluster.local
#     origin = ns.dns.cluster.local
#     ttl = 30  <-- This is cached client-side!

Check Endpoints Timing

# Watch endpoints changes
kubectl get endpoints myapp-headless -w

# Compare with pod deletion timing
kubectl get pods -w

# List the IPs currently in the endpoints object (run before and after the deletion)
kubectl get endpoints myapp-headless -o json | \
  jq '.subsets[]?.addresses[]?.ip'

Identify Cached Stale IPs

# From client pod, check what its resolver path returns (caches inside the application process are not visible here)
kubectl exec client-pod -- getent hosts myapp-headless.default.svc

# Compare with actual endpoints
kubectl get endpoints myapp-headless -o jsonpath='{.subsets[*].addresses[*].ip}'

# If different, client has stale cache

The Fix

Option 1: Reduce DNS TTL

# CoreDNS ConfigMap - reduce cache TTL
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        kubernetes cluster.local in-addr.arpa ip6.arpa {
           pods insecure
           ttl 5  # Record TTL for cluster records (5s is the plugin default)
           fallthrough in-addr.arpa ip6.arpa
        }
        cache 10  # Cap cached answers at 10s (the stock Corefile uses 30)
        # ... rest of config
    }

Option 2: Client-Side TTL Configuration

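// Java: shorten the JVM's DNS cache to 5s (this shortens caching, it does not disable it).
// Set these before the first lookup so the JVM picks them up.
java.security.Security.setProperty("networkaddress.cache.ttl", "5");
java.security.Security.setProperty("networkaddress.cache.negative.ttl", "1");

// Go: the standard library keeps no DNS cache of its own. If you add one
// (here github.com/rs/dnscache), refresh it on a short interval.
import (
    "time"

    "github.com/rs/dnscache"
)

func newRefreshingResolver() *dnscache.Resolver {
    resolver := &dnscache.Resolver{
        Timeout: 5 * time.Second,
    }

    // Refresh cached entries periodically
    go func() {
        t := time.NewTicker(5 * time.Second)
        defer t.Stop()
        for range t.C {
            // true also evicts entries not used since the last refresh
            resolver.Refresh(true)
        }
    }()
    return resolver
}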

Option 3: Connection Retry with Re-resolve

// Retry failed connections with DNS re-resolution
import (
    "fmt"
    "net"
    "time"
)

func connectWithRetry(service string) (net.Conn, error) {
    const maxAttempts = 3
    var lastErr error
    for attempt := 0; attempt < maxAttempts; attempt++ {
        if attempt > 0 {
            // Linear backoff before each retry: 1s, then 2s
            time.Sleep(time.Duration(attempt) * time.Second)
        }

        // Go keeps no DNS cache of its own, so this re-queries the resolver
        ips, err := net.LookupHost(service)
        if err != nil {
            lastErr = err
            continue
        }

        for _, ip := range ips {
            // JoinHostPort also formats IPv6 addresses correctly
            conn, err := net.DialTimeout("tcp", net.JoinHostPort(ip, "80"), 5*time.Second)
            if err == nil {
                return conn, nil
            }
            lastErr = err
        }
    }
    return nil, fmt.Errorf("all attempts failed: %w", lastErr)
}

Option 4: Use Regular Service with Session Affinity

# If you don't need direct pod addressing,
# use regular ClusterIP service instead
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  type: ClusterIP  # Not headless
  selector:
    app: myapp
  ports:
  - port: 80
  sessionAffinity: ClientIP  # If you need sticky sessions

Option 5: Graceful Shutdown with Delay

# Give DNS time to propagate before pod stops accepting connections
spec:
  containers:
  - name: myapp
    lifecycle:
      preStop:
        exec:
          command:
          - /bin/sh
          - -c
          - |
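            # Make the readiness probe below start failing
            touch /tmp/shutdown
            # Keep serving while endpoints and DNS caches catch up;
            # this must fit within terminationGracePeriodSeconds (default 30s)
            sleep 10
            # After the hook returns, the container receives SIGTERM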

    readinessProbe:
      exec:
        command:
        - /bin/sh
        - -c
        - "test ! -f /tmp/shutdown"
      periodSeconds: 2

Monitoring

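# Metric names below are placeholders; substitute the ones your clients and cluster actually export
groups: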
  - name: headless-dns
    rules:
      - alert: HeadlessDNSStaleRecords
        expr: |
          rate(dns_lookup_failures_total{service_type="headless"}[5m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High DNS lookup failures for headless services"

      - alert: EndpointsUpdateDelay
        expr: |
          histogram_quantile(0.99,
            rate(endpoint_update_latency_seconds_bucket[5m])
          ) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Endpoints updates taking > 1s"

Checklist

## Headless Service Stale DNS

### Symptoms
- [ ] "Connection refused" to pods that were recently deleted
- [ ] Intermittent failures during rolling updates
- [ ] Issues resolve after waiting ~30 seconds
- [ ] Works with regular services, fails with headless

### Diagnosis
- [ ] Check DNS TTL in CoreDNS config
- [ ] Compare client cached IPs with actual endpoints
- [ ] Time endpoints update propagation
- [ ] Check client-side DNS caching settings

### Fixes
- [ ] Reduce CoreDNS TTL for kubernetes plugin
- [ ] Configure client DNS cache TTL
- [ ] Add connection retry with re-resolution
- [ ] Use preStop hook for graceful shutdown
- [ ] Consider regular ClusterIP if direct pod addressing not needed

Conclusion

This problem illustrates a fundamental architectural difference between headless and ClusterIP services. With a ClusterIP service, routing happens at the kernel level (via iptables or IPVS rules programmed by kube-proxy), and endpoint changes take effect without any client-side cache to wait out. With a headless service, routing happens at the application level via DNS, and DNS is designed for caching. These are different tools with different trade-offs.

Headless services are the right choice when you need direct pod addressing—for StatefulSets where clients need to connect to specific replicas, for databases that need peer-to-peer discovery, for any workload where the client needs to know individual pod identities. But you have to account for DNS caching in your design.

The fix is usually some combination of shorter TTLs, client-side retry with re-resolution, and graceful shutdown delays. Each has costs: shorter TTLs mean more DNS queries, retry logic adds complexity, shutdown delays slow down rollouts. There’s no free lunch.

Key principles:

- DNS caching exists at multiple layers: the client runtime, node-local DNS, and CoreDNS all cache independently.
- Reduce TTLs for headless services: 5s instead of 30s gives faster convergence.
- Implement client-side retry with re-resolution: on connection failure, re-query DNS.
- Use graceful shutdown to give DNS time to propagate: a preStop hook combined with a failing readiness probe.
- Question whether you need headless: if ClusterIP works for your use case, it avoids this entirely.
