Kubernetes conntrack Table Exhaustion: The Silent Packet Killer

Conntrack exhaustion is a classic that still surprises teams at scale. “Our DNS lookups randomly fail.” “Services timeout but pods are healthy.” “Works locally but breaks in K8s.” We spent three days on this one. The symptoms made no sense—intermittent failures with no pattern, services that worked sometimes and failed others, DNS lookups that would succeed for one pod and fail for another on the same node.

The breakthrough came from checking dmesg on a node: “nf_conntrack: table full, dropping packet.” The conntrack table—Linux’s connection tracking subsystem—had filled up, and new connections were being silently dropped. There were no application-level errors because the packets never made it to the application. They were dropped in the kernel before the TCP handshake could complete.

Conntrack is one of those invisible Linux subsystems that most developers never think about. It tracks every network connection through the kernel, maintaining state for NAT, firewalls, and stateful packet filtering. In Kubernetes, every service call goes through conntrack because kube-proxy uses NAT to route traffic from ClusterIPs to pods. With thousands of pods making thousands of connections, the default conntrack table size of 128K entries fills up surprisingly fast.

What makes this especially insidious is the symptom pattern. When the table is full, new connections are dropped. But existing connections continue to work. So you see intermittent failures that seem random—some requests succeed, others time out. The failures correlate with nothing visible in application metrics. The problem is invisible at every layer except the kernel.

Tested on: EKS 1.28, GKE 1.27, Linux kernel 5.15, Calico CNI

What is conntrack?

Connection Tracking Basics

Linux connection tracking (conntrack) maps:
  Source IP:Port ←→ Destination IP:Port ←→ State

Purpose: Enable NAT, firewalls, and stateful packet filtering

Every connection through kube-proxy uses a conntrack entry:

Pod A (10.1.1.5:45678) → Service (10.96.0.10:80) → Pod B (10.1.2.3:8080)
                    ↓
           conntrack entry tracks this mapping

Why Kubernetes is Conntrack-Heavy

Single HTTP request in Kubernetes:

1. DNS lookup to CoreDNS
   Pod → kube-dns Service → CoreDNS Pod (conntrack entry #1)

2. Service call
   Pod → Service ClusterIP → Backend Pod (conntrack entry #2)

3. If backend calls database
   Backend → DB Service → DB Pod (conntrack entry #3)

4. External API call (via NAT Gateway)
   Pod → NAT → External IP (conntrack entry #4)

Result: One user request = 4+ conntrack entries
At 10,000 RPS: 40,000+ new entries per second, each staying in the table until its timeout expires
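
How many entries that churn leaves in the table depends on how long each entry lives. A rough back-of-the-envelope sketch follows; the function name and the lifetimes passed to it are illustrative assumptions, not measurements from a real cluster:

```shell
#!/bin/sh
# Rough steady-state estimate:
#   entries ~= requests/sec * entries per request * seconds each entry lives
# All inputs are illustrative assumptions.
estimate_conntrack_entries() {
  rps=$1              # user requests per second
  entries_per_req=$2  # conntrack entries created per request
  lifetime_s=$3       # seconds an entry stays in the table
  echo $(( rps * entries_per_req * lifetime_s ))
}

# If every entry were gone after one second:
estimate_conntrack_entries 10000 4 1     # 40000
# If every entry lingers for 120 seconds in TIME_WAIT:
estimate_conntrack_entries 10000 4 120   # 4800000
```

The second figure is far above a 128K table, which is why connection reuse and shorter timeouts matter as much as the raw limit.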

The Problem

Default Limits

# Check current conntrack settings
cat /proc/sys/net/netfilter/nf_conntrack_max
# Common default: 131072 (128K entries); kube-proxy computes max(131072, 32768 x CPU cores) unless configured otherwise

cat /proc/sys/net/netfilter/nf_conntrack_count
# Current entries in use

# When count reaches max:
# - New connections get dropped silently
# - DNS lookups time out (UDP)
# - TCP connections hang
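
The two values above can be turned into a quick status check. This is a minimal sketch: the function names are hypothetical, and the 75% and 90% thresholds simply mirror the alert thresholds used later in this post.

```shell
#!/bin/sh
# Print conntrack usage as a whole-number percentage.
conntrack_pct() {
  count=$1
  max=$2
  echo $(( count * 100 / max ))
}

# Classify usage: warning at 75%, critical at 90% (assumed thresholds).
conntrack_status() {
  pct=$(conntrack_pct "$1" "$2")
  if [ "$pct" -ge 90 ]; then
    echo "critical ${pct}%"
  elif [ "$pct" -ge 75 ]; then
    echo "warning ${pct}%"
  else
    echo "ok ${pct}%"
  fi
}

# Run against the live node only if the conntrack sysctls are present.
dir=/proc/sys/net/netfilter
if [ -r "$dir/nf_conntrack_count" ] && [ -r "$dir/nf_conntrack_max" ]; then
  conntrack_status "$(cat "$dir/nf_conntrack_count")" "$(cat "$dir/nf_conntrack_max")"
else
  echo "nf_conntrack sysctls not readable here" >&2
fi
```

For example, `conntrack_status 100000 131072` prints `warning 76%`.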

Symptoms

Symptom 1: DNS timeouts
  - DNS is mostly UDP; even "connectionless" traffic still creates conntrack state (kube-proxy NATs Service traffic, and NAT needs tracking)
  - High UDP QPS churns entries fast
  - DNS failures cascade to all services

Symptom 2: Random connection failures
  - Some requests succeed, some fail
  - No pattern in logs
  - Pods report healthy

Symptom 3: High latency spikes
  - Connection setup takes longer
  - Existing connections affected by GC

Kernel log (dmesg):
  nf_conntrack: table full, dropping packet
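
The dmesg line is rate-limited, so it can understate how many packets were lost. On kernels that expose /proc/net/stat/nf_conntrack, the per-CPU counters give a cumulative view. The sketch below assumes that file has a header row followed by one row of hexadecimal counters per CPU; `sum_conntrack_stat` is a hypothetical helper name, and `conntrack -S` reports similar numbers if the conntrack tool is installed.

```shell
#!/bin/sh
# Sum one named counter (e.g. "drop") across all CPU rows.
# Reads the stat file on stdin; values are hexadecimal.
# Prints 0 if the named column does not exist.
sum_conntrack_stat() {
  want=$1
  awk -v want="$want" '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == want) col = i; next }
    col     { print $col }
  ' | {
    total=0
    while read -r hex; do
      total=$(( total + 0x$hex ))
    done
    echo "$total"
  }
}

# Report the counters most relevant to a full table, if the file exists.
if [ -r /proc/net/stat/nf_conntrack ]; then
  for counter in drop early_drop insert_failed; do
    echo "$counter: $(sum_conntrack_stat "$counter" < /proc/net/stat/nf_conntrack)"
  done
fi
```

A `drop` or `early_drop` total that keeps growing between two runs suggests the table is under pressure, even if dmesg looks quiet.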

A related failure mode: kube-proxy conntrack cleanup storms (high RSS/CPU)

Conntrack exhaustion is about dropping packets when the table is full. But there is another conntrack-shaped outage that looks very different:

- kube-proxy RSS suddenly jumps into multi-GB territory
- kube-proxy CPU spikes
- It often correlates with updates to Pods/Services exposing UDP ports (CoreDNS is the usual trigger)
- Your conntrack table might not even be "full"; it is just large enough that "scan the whole table" becomes expensive

This showed up very clearly in Kubernetes issue #129982 (kube-proxy v1.32): updates to Pods/Services with UDP ports triggered a full conntrack cleanup, iterating the whole table and burning memory/CPU.

What I do in practice:

1. Confirm it’s kube-proxy, not your workload

   kubectl -n kube-system top pods -l k8s-app=kube-proxy

2. Check conntrack pressure on the node

   cat /proc/sys/net/netfilter/nf_conntrack_{count,max}

3. Upgrade to a release that includes the fixes

In that incident class, two merged fixes are particularly relevant:
- Kubernetes PR #130032: “Conntrack memory leak fix”
- Kubernetes PR #130484: “conntrack reconciler must check the dst port”
The fix was backported into the Kubernetes 1.32 patch line, and the general advice is still: stay on the latest patch release for your branch. I would not try to “tune around” this if you can upgrade/backport - the failure mode is algorithmic.

If you are stuck on a provider build that lags behind upstream, one mitigation discussed in the issue was increasing --iptables-min-sync-period to reduce how often the sync path runs. It does not fix the underlying bug, but it can reduce the blast radius while you wait for an upgrade.

If this is the problem you’re hitting, the rest of this post (limits, timeouts, NodeLocal DNSCache) still helps, because a smaller/stabler conntrack table makes everything cheaper. But the real fix is upgrading kube-proxy/Kubernetes to a version with the reconciler fixes.

Diagnosing

Check Node Status

# SSH to node or use kubectl debug
kubectl debug node/my-node -it --image=busybox

# Inside debug container:
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max

# Check for drops
dmesg | grep conntrack
# Look for: "table full, dropping packet"

# TCP connection breakdown by state
# In /proc/net/nf_conntrack the TCP state is field 6 (field 4 is the protocol number)
awk '$3 == "tcp" {print $6}' /proc/net/nf_conntrack | sort | uniq -c | sort -rn
#  45000 TIME_WAIT
#  38000 ESTABLISHED
#  12000 SYN_SENT
#   5000 FIN_WAIT

Prometheus Metrics

# Node exporter provides these:

# Current conntrack entries
node_nf_conntrack_entries

# Maximum allowed
node_nf_conntrack_entries_limit

# Usage percentage (critical metric)
node_nf_conntrack_entries / node_nf_conntrack_entries_limit

# Net growth of the table in entries/sec (the metric is a gauge, so use deriv(), not rate())
deriv(node_nf_conntrack_entries[5m])

High-Traffic Service Analysis

# List Services with the most endpoints (a proxy for fan-out, not a direct connection count)
kubectl get endpoints -A -o json | jq -r '
  .items[]
  | select(.subsets != null)
  | [.metadata.namespace, .metadata.name,
     ([.subsets[] | (.addresses // []) | length] | add)]
  | @tsv' | sort -k3,3nr | head -20

# Check for short-lived connections (HTTP without keepalive)
# These churn conntrack entries fastest

Solutions

1. Increase Conntrack Limits

# DaemonSet to tune sysctl on all nodes
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: sysctl-tuner
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: sysctl-tuner
  template:
    metadata:
      labels:
        app: sysctl-tuner
    spec:
      hostPID: true
      hostNetwork: true
      initContainers:
        - name: sysctl
          image: busybox
          securityContext:
            privileged: true
          command:
            - sh
            - -c
            - |
              sysctl -w net.netfilter.nf_conntrack_max=1048576
              sysctl -w net.netfilter.nf_conntrack_tcp_timeout_established=86400
              sysctl -w net.netfilter.nf_conntrack_tcp_timeout_time_wait=30
              sysctl -w net.core.somaxconn=65535
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9

Cloud-Specific Configuration

# EKS: Use launch template with user data
#!/bin/bash
echo "net.netfilter.nf_conntrack_max=1048576" >> /etc/sysctl.conf
echo "net.netfilter.nf_conntrack_tcp_timeout_time_wait=30" >> /etc/sysctl.conf
sysctl -p

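# GKE: Use a node system configuration file
# (check that your GKE version supports this sysctl)
# node-config.yaml:
linuxConfig:
  sysctl:
    net.netfilter.nf_conntrack_max: '1048576'

gcloud container node-pools create high-conntrack \
  --cluster=my-cluster \
  --system-config-from-file=node-config.yaml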

# AKS: Use custom node config
az aks nodepool add \
  --resource-group my-resource-group \
  --cluster-name my-cluster \
  --name highconn \
  --linux-os-config linuxOsConfig.json

# linuxOsConfig.json:
{
  "sysctls": {
    "netNetfilterNfConntrackMax": 1048576
  }
}

2. Reduce Conntrack Usage

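# Use headless services where possible
# Clients connect to pod IPs directly, so kube-proxy adds no ClusterIP NAT;
# flows may still be tracked if other netfilter rules on the node are stateful
apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  clusterIP: None  # Headless - no ClusterIP, no kube-proxy DNAT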
  selector:
    app: my-app
---
# Enable HTTP keepalive to reduce connection churn
apiVersion: v1
kind: ConfigMap
metadata:
  name: nginx-config
data:
  nginx.conf: |
    upstream backend {
      server backend:8080;
      keepalive 100;  # Reuse connections
    }

    server {
      location / {
        proxy_http_version 1.1;
        proxy_set_header Connection "";  # Enable keepalive
        proxy_pass http://backend;
      }
    }

3. Tune Timeout Values

# Default timeouts are too long for ephemeral traffic
# Reduce TIME_WAIT entries:

# Before: TIME_WAIT connections held for 120 seconds
# After: Released in 30 seconds
sysctl -w net.netfilter.nf_conntrack_tcp_timeout_time_wait=30

# Close dead connections faster
sysctl -w net.netfilter.nf_conntrack_tcp_timeout_close_wait=15

# Full timeout tuning for high-traffic:
net.netfilter.nf_conntrack_tcp_timeout_syn_sent=30
net.netfilter.nf_conntrack_tcp_timeout_syn_recv=30
net.netfilter.nf_conntrack_tcp_timeout_established=86400
net.netfilter.nf_conntrack_tcp_timeout_fin_wait=15
net.netfilter.nf_conntrack_tcp_timeout_close_wait=15
net.netfilter.nf_conntrack_tcp_timeout_last_ack=15
net.netfilter.nf_conntrack_tcp_timeout_time_wait=30
net.netfilter.nf_conntrack_tcp_timeout_close=10
net.netfilter.nf_conntrack_udp_timeout=30
net.netfilter.nf_conntrack_udp_timeout_stream=60

4. IPVS Mode (Alternative to iptables)

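# IPVS replaces long iptables rule chains with hash-table lookups
# It still relies on conntrack for NAT, so keep the table sized and monitored
# ConfigMap for kube-proxy
apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-proxy
  namespace: kube-system
data:
  config.conf: |
    apiVersion: kubeproxy.config.k8s.io/v1alpha1
    kind: KubeProxyConfiguration
    mode: "ipvs"
    ipvs:
      scheduler: "rr"
      strictARP: true

# IPVS benefits:
# - Better performance at scale (O(1) vs O(n) for iptables)
# - Built-in load balancing algorithms
# - Conntrack entries are still created per connection, so measure before expecting a smaller table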

Monitoring

Alert Rules

groups:
- name: conntrack
  rules:
  - alert: ConntrackTableNearFull
    expr: |
      node_nf_conntrack_entries / node_nf_conntrack_entries_limit > 0.75
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Conntrack table at {{ $value | humanizePercentage }}"
      description: "Node {{ $labels.instance }} conntrack usage high"

  - alert: ConntrackTableCritical
    expr: |
      node_nf_conntrack_entries / node_nf_conntrack_entries_limit > 0.9
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "Conntrack table at {{ $value | humanizePercentage }}"
      description: "Packet drops imminent on {{ $labels.instance }}"

  - alert: ConntrackEntriesGrowingFast
    expr: |
      deriv(node_nf_conntrack_entries[5m]) > 1000
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Conntrack growing at {{ $value }}/sec"
      description: "Possible connection leak or traffic spike"

Grafana Dashboard

# Panel 1: Conntrack usage per node
node_nf_conntrack_entries / node_nf_conntrack_entries_limit

# Panel 2: Entries over time
node_nf_conntrack_entries

# Panel 3: Rate of change (gauge, so deriv() rather than rate())
deriv(node_nf_conntrack_entries[1m])

# Panel 4: Headroom
node_nf_conntrack_entries_limit - node_nf_conntrack_entries
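
Headroom is easier to act on when expressed as time. The sketch below divides headroom by a growth rate to estimate how long until the table fills; the function name is hypothetical, and it assumes growth stays linear, which real traffic rarely does.

```shell
#!/bin/sh
# Estimate seconds until the conntrack table is full.
#   $1 = current entries, $2 = limit, $3 = net growth in entries/sec
# Prints "inf" when the table is not growing.
seconds_until_full() {
  entries=$1
  limit=$2
  growth_per_sec=$3
  if [ "$growth_per_sec" -le 0 ]; then
    echo "inf"
    return 0
  fi
  headroom=$(( limit - entries ))
  if [ "$headroom" -lt 0 ]; then
    headroom=0
  fi
  echo $(( headroom / growth_per_sec ))
}

seconds_until_full 100000 131072 50   # 621
seconds_until_full 100000 131072 0    # inf
```

With 31,072 entries of headroom and a net gain of 50 entries per second, the table would fill in roughly ten minutes, which is less than the `for:` window of a slow-firing alert plus human response time.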

DNS-Specific Issues

NodeLocal DNS Cache

# NodeLocal DNSCache reduces conntrack pressure
# DNS queries stay on-node, no conntrack needed

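# Install NodeLocal DNSCache
# The upstream manifest contains __PILLAR__ placeholders that must be filled in
# before applying (shown for kube-proxy in iptables mode; 169.254.20.10 and
# cluster.local are commonly used values, adjust for your cluster)
curl -sLO https://raw.githubusercontent.com/kubernetes/kubernetes/master/cluster/addons/dns/nodelocaldns/nodelocaldns.yaml
kubedns=$(kubectl get svc kube-dns -n kube-system -o jsonpath='{.spec.clusterIP}')
sed -i "s/__PILLAR__LOCAL__DNS__/169.254.20.10/g; s/__PILLAR__DNS__DOMAIN__/cluster.local/g; s/__PILLAR__DNS__SERVER__/$kubedns/g" nodelocaldns.yaml
kubectl apply -f nodelocaldns.yaml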

# Benefits:
# - DNS queries don't traverse network
# - No conntrack entries for DNS
# - Reduced load on CoreDNS

CoreDNS Conntrack Bypass

# CoreDNS can forward upstream queries over TCP to reduce UDP conntrack churn
# ConfigMap for CoreDNS (fragment: keep the kubernetes, health, and other plugins your cluster already uses)
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        forward . /etc/resolv.conf {
            force_tcp
            max_concurrent 1000
        }
        cache 30
    }

Checklist

## Conntrack Exhaustion Prevention

### Monitoring
- [ ] Alert on conntrack usage > 75%
- [ ] Dashboard showing conntrack entries per node
- [ ] Monitor connection rate (entries/sec)

### Tuning
- [ ] Increase nf_conntrack_max (1M+ for high traffic)
- [ ] Reduce TIME_WAIT timeout (30s)
- [ ] Consider IPVS mode for kube-proxy

### Architecture
- [ ] Use headless services where possible
- [ ] Enable HTTP keepalive between services
- [ ] Deploy NodeLocal DNS Cache
- [ ] Use connection pools in applications

### Investigation
- [ ] Check dmesg for "table full" messages
- [ ] Analyze conntrack state distribution
- [ ] Identify high-connection services

Conclusion

Conntrack exhaustion is a perfect example of how Kubernetes abstractions can hide infrastructure problems. Your application code is fine. Your Kubernetes manifests are fine. Your network policies are fine. But deep in the Linux kernel, a table fills up, and packets start dropping. No error messages, no logs, just silent failures that look like intermittent network issues.

The fix is straightforward once you know to look: increase the conntrack table size, reduce timeouts so entries expire faster, and consider NodeLocal DNS to reduce conntrack pressure from DNS queries. But the diagnosis is hard because the symptom—intermittent failures with no pattern—doesn’t point to conntrack. You have to know to check dmesg for “table full” messages, and most application developers don’t.

This is why conntrack monitoring should be part of your standard Kubernetes observability setup. The nf_conntrack_entries metric is cheap to collect, and an alert at 75% utilization gives you time to respond before users are affected. It’s much better to tune conntrack proactively than to debug it during an incident.

Key takeaways:

- Default 128K limit is too low for production clusters with high traffic
- DNS and short-lived connections fill the table fastest, so enable NodeLocal DNS
- Monitor nf_conntrack_entries before issues arise and alert at 75% utilization
- Increase limits and reduce timeouts for immediate relief when you hit the limit
- Consider IPVS mode for kube-proxy at scale for faster Service lookups; it still uses conntrack, so size the table either way

Check your nodes now—the drops might already be happening, silently, in the kernel.

