Kubernetes conntrack Table Exhaustion: The Silent Packet Killer
Conntrack exhaustion is a classic failure that still surprises teams at scale. “Our DNS lookups randomly fail.” “Services time out but pods are healthy.” “Works locally but breaks in K8s.” We spent three days on this one. The symptoms made no sense: intermittent failures with no pattern, services that worked sometimes and failed at other times, DNS lookups that would succeed for one pod and fail for another on the same node.
The breakthrough came from checking dmesg on a node: “nf_conntrack: table full, dropping packet.” The conntrack table—Linux’s connection tracking subsystem—had filled up, and new connections were being silently dropped. There were no application-level errors because the packets never made it to the application. They were dropped in the kernel before the TCP handshake could complete.
Conntrack is one of those invisible Linux subsystems that most developers never think about. It tracks every network connection through the kernel, maintaining state for NAT, firewalls, and stateful packet filtering. In Kubernetes, every service call goes through conntrack because kube-proxy uses NAT to route traffic from ClusterIPs to pods. With thousands of pods making thousands of connections, a table capped at 128K entries (131,072, a common limit on small nodes; the actual value depends on node size and kube-proxy settings) fills up surprisingly fast.
What makes this especially insidious is the symptom pattern. When the table is full, new connections are dropped. But existing connections continue to work. So you see intermittent failures that seem random—some requests succeed, others time out. The failures correlate with nothing visible in application metrics. The problem is invisible at every layer except the kernel.
Tested on: EKS 1.28, GKE 1.27, Linux kernel 5.15, Calico CNI
What is conntrack?
Connection Tracking Basics
Linux connection tracking (conntrack) maps:
Source IP:Port ←→ Destination IP:Port ←→ State
Purpose: Enable NAT, firewalls, and stateful packet filtering
Every connection through kube-proxy uses a conntrack entry:
Pod A (10.1.1.5:45678) → Service (10.96.0.10:80) → Pod B (10.1.2.3:8080)
↓
conntrack entry tracks this mapping
Why Kubernetes is Conntrack-Heavy
Single HTTP request in Kubernetes:
1. DNS lookup to CoreDNS
Pod → kube-dns Service → CoreDNS Pod (conntrack entry #1)
2. Service call
Pod → Service ClusterIP → Backend Pod (conntrack entry #2)
3. If backend calls database
Backend → DB Service → DB Pod (conntrack entry #3)
4. External API call (via NAT Gateway)
Pod → NAT → External IP (conntrack entry #4)
Result: One user request = 4+ conntrack entries
At 10,000 RPS: 40,000+ new entries per second, each one occupying a slot until its timeout expires
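How many entries sit in the table at once depends not only on how fast they are created but also on how long each one lives. A rough sketch using Little's law (entries in the table ≈ arrival rate × lifetime). The 4 entries per request come from the walkthrough above; the 120-second lifetime is an assumption standing in for an entry that lingers in TIME_WAIT, and the function name is hypothetical. Conntrack tables are per node, so pass the request rate that a single node handles.

```python
def steady_state_entries(requests_per_sec: float,
                         entries_per_request: float,
                         avg_entry_lifetime_sec: float) -> int:
    """Little's law: items in the system = arrival rate * time in the system."""
    new_entries_per_sec = requests_per_sec * entries_per_request
    return round(new_entries_per_sec * avg_entry_lifetime_sec)


# 10,000 RPS on one node, 4 entries per request, ~120 s per entry (assumed).
print(steady_state_entries(10_000, 4, 120))  # 4800000
```

Connection reuse (keepalive, pooling) lowers the arrival rate and shorter timeouts lower the lifetime, which is why both levers show up later in this post.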
The Problem
Default Limits
# Check current conntrack settings
cat /proc/sys/net/netfilter/nf_conntrack_max
# Commonly 131072 (128K entries) on small nodes; the value depends on node size and kube-proxy settings
cat /proc/sys/net/netfilter/nf_conntrack_count
# Current entries in use
# When count approaches max:
# - New connections get dropped silently
# - DNS lookups time out (UDP)
# - TCP connections hang
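On many clusters the number in nf_conntrack_max is not the kernel's own default but a value kube-proxy writes at startup. A minimal sketch of that sizing rule, assuming kube-proxy's documented defaults for --conntrack-max-per-core (32768) and --conntrack-min (131072). The function name is hypothetical, and your distribution may override these flags.

```python
def kube_proxy_conntrack_max(cpu_cores: int,
                             max_per_core: int = 32768,
                             conntrack_min: int = 131072) -> int:
    """Approximate the limit kube-proxy applies on a node.

    Returns 0 when max_per_core is 0, which kube-proxy treats as
    "leave the kernel's current limit alone".
    """
    if max_per_core <= 0:
        return 0
    scaled = max_per_core * cpu_cores
    # The per-core value scales with the node; conntrack_min is the floor.
    return max(scaled, conntrack_min)


for cores in (2, 4, 16):
    print(cores, kube_proxy_conntrack_max(cores))
```

Under these assumptions, every node with four cores or fewer ends up at 131072, which would explain why that number appears so often.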
Symptoms
Symptom 1: DNS timeouts
- DNS is mostly UDP; even "connectionless" traffic still creates conntrack state (because kube-proxy NATs it to a CoreDNS pod)
- High UDP QPS churns entries fast
- DNS failures cascade to all services
Symptom 2: Random connection failures
- Some requests succeed, some fail
- No pattern in logs
- Pods report healthy
Symptom 3: High latency spikes
- Connection setup takes longer
- Existing connections affected by GC
Kernel log (dmesg):
nf_conntrack: table full, dropping packet
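To check a saved kernel log for this message, a small filter is enough. The sketch below counts matching lines; the function name is hypothetical, and because the kernel rate-limits this message, the count should be read as a lower bound on dropped packets rather than an exact figure.

```python
def count_table_full(log_lines) -> int:
    """Count kernel log lines that report conntrack table-full drops."""
    return sum(
        1
        for line in log_lines
        if "nf_conntrack" in line and "table full, dropping packet" in line
    )


sample = [
    "[12345.678] nf_conntrack: table full, dropping packet",
    "[12346.000] nf_conntrack: nf_conntrack: table full, dropping packet",
    "[12347.000] eth0: link up",
]
print(count_table_full(sample))  # 2
```

The function accepts any iterable of strings, so an open file or sys.stdin (for example, dmesg output piped in) works as well as a list.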
A related failure mode: kube-proxy conntrack cleanup storms (high RSS/CPU)
Conntrack exhaustion is about dropping packets when the table is full. But there is another conntrack-shaped outage that looks very different:
- kube-proxy RSS suddenly jumps into multi-GB territory
- kube-proxy CPU spikes
- it often correlates with updates to Pods/Services exposing UDP ports (CoreDNS is the usual trigger)
- your conntrack table might not even be “full” - it’s just large enough that “scan the whole table” becomes expensive
This showed up very clearly in Kubernetes issue #129982 (kube-proxy v1.32): updates to Pods/Services with UDP ports triggered a full conntrack cleanup, iterating the whole table and burning memory/CPU.
What I do in practice:
- Confirm it’s kube-proxy, not your workload:
  kubectl -n kube-system top pods -l k8s-app=kube-proxy
- Check conntrack pressure on the node:
  cat /proc/sys/net/netfilter/nf_conntrack_{count,max}
- Upgrade to a release that includes the fixes.
In that incident class, two merged fixes are particularly relevant:
- Kubernetes PR #130032: “Conntrack memory leak fix”
- Kubernetes PR #130484: “conntrack reconciler must check the dst port”
The fix was backported into the Kubernetes 1.32 patch line, and the general advice is still: stay on the latest patch release for your branch. I would not try to “tune around” this if you can upgrade/backport - the failure mode is algorithmic.
If you are stuck on a provider build that lags behind upstream, one mitigation discussed in the issue was increasing --iptables-min-sync-period to reduce how often the sync path runs. It does not fix the underlying bug, but it can reduce the blast radius while you wait for an upgrade.
If this is the problem you’re hitting, the rest of this post (limits, timeouts, NodeLocal DNSCache) still helps, because a smaller/stabler conntrack table makes everything cheaper. But the real fix is upgrading kube-proxy/Kubernetes to a version with the reconciler fixes.
Diagnosing
Check Node Status
# SSH to node or use kubectl debug
kubectl debug node/my-node -it --image=busybox
# Inside debug container:
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max
# Check for drops
dmesg | grep conntrack
# Look for: "table full, dropping packet"
# TCP connection breakdown by state (in /proc/net/nf_conntrack the state is field 6; field 4 is the protocol number)
awk '$3 == "tcp" {print $6}' /proc/net/nf_conntrack | sort | uniq -c | sort -rn
# 45000 TIME_WAIT
# 38000 ESTABLISHED
# 12000 SYN_SENT
# 5000 FIN_WAIT
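The same breakdown can be computed in Python, which is convenient when the table dump has been copied off the node. This is a sketch that assumes the /proc/net/nf_conntrack layout in which a TCP line reads "ipv4 2 tcp 6 <ttl> <STATE> src=..."; UDP lines carry no state field and are skipped. The function name and sample lines are illustrative.

```python
from collections import Counter


def count_tcp_states(lines) -> Counter:
    """Tally TCP conntrack entries by state from /proc/net/nf_conntrack lines."""
    states = Counter()
    for line in lines:
        fields = line.split()
        # Layout: l3proto l3num l4proto l4num ttl STATE src=... dst=...
        if len(fields) >= 6 and fields[2] == "tcp" and "=" not in fields[5]:
            states[fields[5]] += 1
    return states


sample = [
    "ipv4     2 tcp      6 86390 ESTABLISHED src=10.1.1.5 dst=10.96.0.10 sport=45678 dport=80 src=10.1.2.3 dst=10.1.1.5 sport=8080 dport=45678 [ASSURED] mark=0 use=1",
    "ipv4     2 tcp      6 25 TIME_WAIT src=10.1.1.5 dst=10.96.0.10 sport=45680 dport=80 src=10.1.2.3 dst=10.1.1.5 sport=8080 dport=45680 [ASSURED] mark=0 use=1",
    "ipv4     2 tcp      6 12 TIME_WAIT src=10.1.1.6 dst=10.96.0.10 sport=45681 dport=80 src=10.1.2.3 dst=10.1.1.6 sport=8080 dport=45681 [ASSURED] mark=0 use=1",
    "ipv4     2 udp      17 20 src=10.1.1.5 dst=10.96.0.10 sport=40000 dport=53 src=10.1.2.9 dst=10.1.1.5 sport=53 dport=40000 mark=0 use=1",
]
for state, count in count_tcp_states(sample).most_common():
    print(count, state)
```

A table dominated by TIME_WAIT usually points at short-lived connections, while a large SYN_SENT share suggests clients connecting to endpoints that never answer.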
Prometheus Metrics
# Node exporter provides these:
# Current conntrack entries
node_nf_conntrack_entries
# Maximum allowed
node_nf_conntrack_entries_limit
# Usage percentage (critical metric)
node_nf_conntrack_entries / node_nf_conntrack_entries_limit
# Net growth of the table in entries/sec (the metric is a gauge, so use deriv(), not rate())
deriv(node_nf_conntrack_entries[5m])
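Without a Prometheus server at hand, the same utilization figure can be computed from a node exporter scrape saved to a file. A minimal sketch, assuming the two metrics appear as unlabelled samples in the text exposition format; the function name and sample values are made up for illustration.

```python
def conntrack_utilization(metrics_text: str) -> float:
    """Return entries / limit from node exporter text-format output."""
    wanted = ("node_nf_conntrack_entries", "node_nf_conntrack_entries_limit")
    values = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name, _, value = line.rpartition(" ")
        if name in wanted:
            values[name] = float(value)
    return values[wanted[0]] / values[wanted[1]]


sample = """\
# HELP node_nf_conntrack_entries Number of currently allocated flow entries for connection tracking.
# TYPE node_nf_conntrack_entries gauge
node_nf_conntrack_entries 98304
# TYPE node_nf_conntrack_entries_limit gauge
node_nf_conntrack_entries_limit 131072
"""
print(conntrack_utilization(sample))  # 0.75
```

A KeyError here means one of the two metrics was missing from the scrape, which is worth investigating on its own.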
High-Traffic Service Analysis
# List services by endpoint count (a rough proxy for where connections concentrate)
kubectl get endpoints -A -o json | jq -r '
  .items[]
  | select(.subsets != null)
  | [.metadata.namespace,
     .metadata.name,
     ([.subsets[].addresses // [] | length] | add)]
  | @tsv' | sort -k3 -rn
# Check for short-lived connections (HTTP without keepalive)
# These churn conntrack entries fastest
Solutions
1. Increase Conntrack Limits
# DaemonSet to tune sysctl on all nodes
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: sysctl-tuner
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: sysctl-tuner
  template:
    metadata:
      labels:
        app: sysctl-tuner
    spec:
      hostPID: true
      hostNetwork: true  # apply the sysctls to the node's network namespace, not the pod's
      initContainers:
        - name: sysctl
          image: busybox
          securityContext:
            privileged: true
          command:
            - sh
            - -c
            - |
              sysctl -w net.netfilter.nf_conntrack_max=1048576
              sysctl -w net.netfilter.nf_conntrack_tcp_timeout_established=86400
              sysctl -w net.netfilter.nf_conntrack_tcp_timeout_time_wait=30
              sysctl -w net.core.somaxconn=65535
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
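The 1048576 above is a reasonable starting point, but the limit can also be derived from observed peaks. A minimal sketch, assuming you want peak usage to sit at or below a target utilization (50% here, an arbitrary choice that leaves room under a 75% alert). Rounding up to a power of two is a tidiness convention in this sketch, not a kernel requirement, and the function name is hypothetical.

```python
import math


def suggest_conntrack_max(peak_entries: int, target_utilization: float = 0.5) -> int:
    """Pick a limit so that peak_entries stays at or below target_utilization."""
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    needed = math.ceil(peak_entries / target_utilization)
    # Round up to the next power of two that is >= needed.
    return 1 << max(needed - 1, 0).bit_length()


print(suggest_conntrack_max(400_000))        # 1048576
print(suggest_conntrack_max(90_000, 0.75))   # 131072
```

Take peak_entries from the highest nf_conntrack_count you have observed per node, not from a cluster-wide sum.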
Cloud-Specific Configuration
# EKS: Use launch template with user data
#!/bin/bash
echo "net.netfilter.nf_conntrack_max=1048576" >> /etc/sysctl.conf
echo "net.netfilter.nf_conntrack_tcp_timeout_time_wait=30" >> /etc/sysctl.conf
sysctl -p
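# GKE: pass a node system configuration file when creating the node pool
gcloud container node-pools create high-conntrack \
  --cluster=my-cluster \
  --system-config-from-file=node-system-config.yaml
# node-system-config.yaml (check the GKE docs for the sysctl keys your version supports):
linuxConfig:
  sysctl:
    net.netfilter.nf_conntrack_max: "1048576"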
# AKS: Use custom node config
az aks nodepool add \
  --resource-group my-resource-group \
  --cluster-name my-cluster \
  --name highconn \
  --linux-os-config ./linuxOsConfig.json
# linuxOsConfig.json:
{
  "sysctls": {
    "netNetfilterNfConntrackMax": 1048576
  }
}
2. Reduce Conntrack Usage
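# Use headless services where clients can connect to pod IPs directly
# No ClusterIP means no kube-proxy DNAT for this traffic
# (the node may still track the flow; verify the effect with nf_conntrack_count)
apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  clusterIP: None  # Headless: DNS returns pod IPs, so kube-proxy NAT is skipped
  selector:
    app: my-app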
---
# Enable HTTP keepalive to reduce connection churn
apiVersion: v1
kind: ConfigMap
metadata:
  name: nginx-config
data:
  nginx.conf: |
    events {}
    http {
      upstream backend {
        server backend:8080;
        keepalive 100;  # Keep up to 100 idle upstream connections for reuse
      }
      server {
        listen 80;
        location / {
          proxy_http_version 1.1;          # Upstream keepalive requires HTTP/1.1
          proxy_set_header Connection "";  # Clear "Connection: close" so connections are reused
          proxy_pass http://backend;
        }
      }
    }
3. Tune Timeout Values
# Default timeouts are too long for ephemeral traffic
# Reduce TIME_WAIT entries:
# Before: TIME_WAIT connections held for 120 seconds
# After: Released in 30 seconds
sysctl -w net.netfilter.nf_conntrack_tcp_timeout_time_wait=30
# Close dead connections faster
sysctl -w net.netfilter.nf_conntrack_tcp_timeout_close_wait=15
# Full timeout tuning for high-traffic:
net.netfilter.nf_conntrack_tcp_timeout_syn_sent=30
net.netfilter.nf_conntrack_tcp_timeout_syn_recv=30
net.netfilter.nf_conntrack_tcp_timeout_established=86400
net.netfilter.nf_conntrack_tcp_timeout_fin_wait=15
net.netfilter.nf_conntrack_tcp_timeout_close_wait=15
net.netfilter.nf_conntrack_tcp_timeout_last_ack=15
net.netfilter.nf_conntrack_tcp_timeout_time_wait=30
net.netfilter.nf_conntrack_tcp_timeout_close=10
net.netfilter.nf_conntrack_udp_timeout=30
net.netfilter.nf_conntrack_udp_timeout_stream=60
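Values set with sysctl -w are lost on reboot, so one common approach is to write them to a drop-in file under /etc/sysctl.d/. The sketch below renders such a file from a dictionary and rejects values that are not positive integers; the function name and the file name 90-conntrack.conf are hypothetical. On some systems the nf_conntrack module must already be loaded when the file is applied, otherwise the keys do not exist yet.

```python
def render_sysctl_conf(settings: dict) -> str:
    """Render sysctl settings as 'key = value' lines, sorted by key."""
    lines = []
    for key in sorted(settings):
        value = settings[key]
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
            raise ValueError(f"{key}: expected a positive integer, got {value!r}")
        lines.append(f"{key} = {value}")
    return "\n".join(lines) + "\n"


timeouts = {
    "net.netfilter.nf_conntrack_tcp_timeout_time_wait": 30,
    "net.netfilter.nf_conntrack_tcp_timeout_close_wait": 15,
    "net.netfilter.nf_conntrack_udp_timeout": 30,
}
print(render_sysctl_conf(timeouts), end="")
```

The output can be written to, say, /etc/sysctl.d/90-conntrack.conf and loaded with sysctl --system.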
4. IPVS Mode (Alternative to iptables)
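# IPVS scales service lookup better than iptables; it still relies on conntrack
# ConfigMap for kube-proxy
apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-proxy
  namespace: kube-system
data:
  config.conf: |
    apiVersion: kubeproxy.config.k8s.io/v1alpha1
    kind: KubeProxyConfiguration
    mode: "ipvs"
    ipvs:
      scheduler: "rr"
      strictARP: true
# IPVS benefits:
# - Better performance at scale (hash-table lookup vs sequential iptables rule evaluation)
# - Built-in load balancing algorithms
# - Conntrack is still used, so measure nf_conntrack_count before and after switching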
Monitoring
Alert Rules
groups:
  - name: conntrack
    rules:
      - alert: ConntrackTableNearFull
        expr: |
          node_nf_conntrack_entries / node_nf_conntrack_entries_limit > 0.75
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Conntrack table at {{ $value | humanizePercentage }}"
          description: "Node {{ $labels.instance }} conntrack usage high"
      - alert: ConntrackTableCritical
        expr: |
          node_nf_conntrack_entries / node_nf_conntrack_entries_limit > 0.9
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Conntrack table at {{ $value | humanizePercentage }}"
          description: "Packet drops imminent on {{ $labels.instance }}"
      - alert: ConntrackEntriesGrowingFast
        # node_nf_conntrack_entries is a gauge, so use deriv() rather than rate()
        expr: |
          deriv(node_nf_conntrack_entries[5m]) > 1000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Conntrack growing at {{ $value }}/sec"
          description: "Possible connection leak or traffic spike"
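Threshold alerts say how full the table is, not how long you have. A rough linear extrapolation from two samples gives an estimate of the time remaining; this is a sketch with a hypothetical function name, and it assumes growth stays constant, which real traffic rarely does. PromQL's predict_linear() offers a similar estimate inside Prometheus.

```python
from typing import Optional


def seconds_until_full(entries_then: int, entries_now: int,
                       interval_sec: float, limit: int) -> Optional[float]:
    """Estimate seconds until the table reaches its limit.

    Returns None when the table is flat or shrinking.
    """
    growth_per_sec = (entries_now - entries_then) / interval_sec
    if growth_per_sec <= 0:
        return None
    remaining = max(limit - entries_now, 0)
    return remaining / growth_per_sec


# 100,000 -> 115,000 entries over 5 minutes, with a limit of 130,000.
print(seconds_until_full(100_000, 115_000, 300, 130_000))  # 300.0
```

In this example the table grows by 50 entries per second and has 15,000 slots left, so about five minutes remain at the current pace.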
Grafana Dashboard
# Panel 1: Conntrack usage per node
node_nf_conntrack_entries / node_nf_conntrack_entries_limit
# Panel 2: Entries over time
node_nf_conntrack_entries
# Panel 3: Rate of change (gauge metric, so deriv() rather than rate())
deriv(node_nf_conntrack_entries[1m])
# Panel 4: Headroom
node_nf_conntrack_entries_limit - node_nf_conntrack_entries
DNS-Specific Issues
NodeLocal DNS Cache
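# NodeLocal DNSCache reduces conntrack pressure
# Pod-to-cache DNS queries stay on-node and skip conntrack
# Install NodeLocal DNSCache (kube-proxy in iptables mode). The manifest contains
# __PILLAR__ placeholders that must be substituted before applying.
# 169.254.20.10 and cluster.local are the values used in the Kubernetes docs; adjust for your cluster.
curl -LO https://raw.githubusercontent.com/kubernetes/kubernetes/master/cluster/addons/dns/nodelocaldns/nodelocaldns.yaml
kubedns=$(kubectl get svc kube-dns -n kube-system -o jsonpath='{.spec.clusterIP}')
sed -i "s/__PILLAR__LOCAL__DNS__/169.254.20.10/g; s/__PILLAR__DNS__DOMAIN__/cluster.local/g; s/__PILLAR__DNS__SERVER__/$kubedns/g" nodelocaldns.yaml
kubectl apply -f nodelocaldns.yaml
# Benefits:
# - Cached DNS answers are served without leaving the node
# - No conntrack entries for pod-to-cache DNS queries
# - Reduced load on CoreDNS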
CoreDNS Conntrack Bypass
# CoreDNS can forward upstream queries over TCP to reduce UDP conntrack issues
# ConfigMap for CoreDNS (fragment: keep your existing kubernetes, health, and other plugins)
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        forward . /etc/resolv.conf {
            force_tcp
            max_concurrent 1000
        }
        cache 30
    }
Checklist
## Conntrack Exhaustion Prevention
### Monitoring
- [ ] Alert on conntrack usage > 75%
- [ ] Dashboard showing conntrack entries per node
- [ ] Monitor connection rate (entries/sec)
### Tuning
- [ ] Increase nf_conntrack_max (1M+ for high traffic)
- [ ] Reduce TIME_WAIT timeout (30s)
- [ ] Consider IPVS mode for kube-proxy
### Architecture
- [ ] Use headless services where possible
- [ ] Enable HTTP keepalive between services
- [ ] Deploy NodeLocal DNS Cache
- [ ] Use connection pools in applications
### Investigation
- [ ] Check dmesg for "table full" messages
- [ ] Analyze conntrack state distribution
- [ ] Identify high-connection services
Conclusion
Conntrack exhaustion is a perfect example of how Kubernetes abstractions can hide infrastructure problems. Your application code is fine. Your Kubernetes manifests are fine. Your network policies are fine. But deep in the Linux kernel, a table fills up, and packets start dropping. No error messages, no logs, just silent failures that look like intermittent network issues.
The fix is straightforward once you know to look: increase the conntrack table size, reduce timeouts so entries expire faster, and consider NodeLocal DNS to reduce conntrack pressure from DNS queries. But the diagnosis is hard because the symptom—intermittent failures with no pattern—doesn’t point to conntrack. You have to know to check dmesg for “table full” messages, and most application developers don’t.
This is why conntrack monitoring should be part of your standard Kubernetes observability setup. The nf_conntrack_entries metric is cheap to collect, and an alert at 75% utilization gives you time to respond before users are affected. It’s much better to tune conntrack proactively than to debug it during an incident.
Key takeaways:
- Default 128K limit is too low for production clusters with high traffic
- DNS and short-lived connections fill the table fastest—enable NodeLocal DNS
- Monitor nf_conntrack_entries before issues arise, and alert at 75% utilization
- Increase limits + reduce timeouts for immediate relief when you hit the limit
- Consider IPVS mode for kube-proxy at scale, and measure its conntrack usage on your own nodes rather than assuming a reduction
Check your nodes now—the drops might already be happening, silently, in the kernel.