Cilium BPF Conntrack Map Exhaustion: Random Resets While conntrack Looks Fine
You get intermittent connection reset by peer, gRPC UNAVAILABLE, or random timeouts… but only on some nodes. You check netfilter conntrack:
- conntrack -S looks normal
- nf_conntrack_count isn’t near the limit
- kube-proxy isn’t even running (kube-proxy replacement)
And yet connections still die.
If you run Cilium, there’s a parallel universe: eBPF conntrack (CT) maps. When those maps fill up (or GC can’t keep up), you can lose new flows or destabilize existing ones—while classic conntrack metrics stay green.
Tested on: Kubernetes 1.29–1.31, Cilium 1.15–1.16, Linux 6.1–6.6, kube-proxy replacement enabled.
Incident narrative (anonymized)
A multi-tenant cluster migrated from kube-proxy to Cilium kube-proxy replacement. Shortly after, one service (HTTP/2 + gRPC) started seeing:
- periodic connection resets
- elevated tail latency
- errors concentrated on a subset of nodes
Blast radius: ~10–15% of requests (clients hashing to affected nodes) had errors.
Constraints:
- We couldn’t roll back kube-proxy replacement immediately (policy + maintenance window).
- We needed to mitigate without dropping all connections on the cluster.
Timeline
- T-0: Client error rate spikes; dashboards show no conntrack saturation.
- T+10m: Node-specific pattern emerges: errors correlate with a small set of nodes.
- T+20m: cilium monitor --type drop shows drops with conntrack-related reasons.
- T+30m: cilium bpf metrics list shows CT map insert failures increasing.
- T+45m: Confirmed: CT maps near max entries; GC not keeping up with churn.
- T+60m: Safe mitigation: increase CT map sizes + reduce connection churn.
- T+2h: Errors disappear; CT utilization stabilizes.
Mechanism: what actually happened
Cilium uses BPF maps for connection tracking
When Cilium handles service LB/NAT in eBPF, it tracks flows using BPF maps (per-protocol variants). These maps have:
- a fixed max size (unless using dynamic sizing options)
- a GC mechanism that must keep up with churn
- failure modes that show up as Cilium drops, not netfilter drops
Why it’s node-scoped
BPF maps are per node. That’s why:
- only some nodes fail (hot nodes, higher churn, noisy neighbors)
- restarting pods doesn’t help if traffic lands on the same saturated nodes
Why churn kills you
Short-lived connections (or aggressive load balancers) create many entries quickly. If GC can’t free them fast enough, inserts fail. Depending on the exact path, you’ll see:
- new connections failing
- existing ones being reset due to missing state
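A rough way to reason about churn is Little's law: the steady-state number of tracked entries is approximately the rate of new connections multiplied by how long each entry stays in the map. The sketch below is a back-of-the-envelope model, not Cilium's own accounting. The connection rate and entry lifetime are hypothetical inputs; real lifetimes depend on your Cilium timeout settings and on how connections close.

```python
def steady_state_entries(new_conns_per_sec: float, entry_lifetime_sec: float) -> float:
    """Little's law: average entries in the map = arrival rate * time spent in the map."""
    return new_conns_per_sec * entry_lifetime_sec


def headroom(map_max: int, new_conns_per_sec: float, entry_lifetime_sec: float) -> float:
    """Fraction of the map left free at steady state (negative means over capacity)."""
    return 1.0 - steady_state_entries(new_conns_per_sec, entry_lifetime_sec) / map_max


if __name__ == "__main__":
    # Hypothetical node: 5,000 new connections/s, entries lingering about 60 s.
    print(steady_state_entries(5_000, 60))          # 300000.0
    print(round(headroom(262_144, 5_000, 60), 3))   # -0.144 (over capacity)
    print(round(headroom(524_288, 5_000, 60), 3))   # 0.428
```

Under these assumed numbers, halving either the dial rate or the entry lifetime has the same effect on map pressure as doubling the map size.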
Runbook: confirm CT map exhaustion
What to check first
- Confirm the issue is node-local
If errors correlate with a subset of nodes, suspect per-node state (BPF maps, routing, kernel).
- Check Cilium drops
On an affected node (or via a privileged debug pod), run:
cilium monitor --type drop
If you see drops correlated with the error bursts, capture a short window (30–60s).
- Check BPF CT metrics
cilium bpf metrics list | head -n 50
Look for increasing counters that indicate CT insert failures / map pressure.
Exact metric names can vary by version, but “ct map full / insert fail” patterns are consistent.
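Because a single reading of a counter says little, it helps to take two snapshots a short time apart and compare them. The sketch below assumes you have already turned each snapshot into a name-to-value mapping (how you parse the CLI output is left out, since the format can differ by version). The counter names in the example are illustrative placeholders, not a guaranteed list of Cilium metric names.

```python
import re
from typing import Dict


def increasing_counters(before: Dict[str, int], after: Dict[str, int]) -> Dict[str, int]:
    """Return counters that grew between two snapshots, mapped to their delta."""
    deltas = {}
    for name, new_value in after.items():
        delta = new_value - before.get(name, 0)
        if delta > 0:
            deltas[name] = delta
    return deltas


def ct_related(deltas: Dict[str, int]) -> Dict[str, int]:
    """Keep counters whose name contains the word 'ct' or 'conntrack' (case-insensitive)."""
    wanted = {"ct", "conntrack"}
    return {
        name: delta
        for name, delta in deltas.items()
        if wanted & set(re.findall(r"[a-z0-9]+", name.lower()))
    }


if __name__ == "__main__":
    # Hypothetical snapshots taken 60 s apart on one node.
    before = {"CT: Map insertion failed": 10, "Policy denied": 5}
    after = {"CT: Map insertion failed": 42, "Policy denied": 5, "Invalid packet": 1}
    print(ct_related(increasing_counters(before, after)))
```

Matching on whole words rather than substrings avoids false hits on names that merely contain the letters "ct".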
How to confirm the hypothesis
A. Inspect CT map utilization
Depending on version, you can list CT entries:
cilium bpf ct list global | head
To approximate size:
cilium bpf ct list global | wc -l
If the count is near your configured max (or trending up without coming down), that’s a strong signal.
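To turn the raw count into a signal, compare it with the configured maximum. The following is a minimal sketch: the 70% budget matches the one suggested later in this post, while the 95% "critical" threshold is an arbitrary assumption. The line count from `wc -l` is only an approximation (the listing may include non-entry lines or entries from more than one map), so treat the ratio as a rough indicator.

```python
def ct_utilization(entry_count: int, map_max: int) -> float:
    """Fraction of the CT map in use, given an entry count and the configured max."""
    if map_max <= 0:
        raise ValueError("map_max must be positive")
    return entry_count / map_max


def classify(entry_count: int, map_max: int, budget: float = 0.70,
             critical: float = 0.95) -> str:
    """Label utilization as 'ok', 'over-budget', or 'critical' (thresholds are assumptions)."""
    used = ct_utilization(entry_count, map_max)
    if used >= critical:
        return "critical"
    if used > budget:
        return "over-budget"
    return "ok"


if __name__ == "__main__":
    # Hypothetical reading: 200,000 listed entries against a 262,144-entry limit.
    print(classify(200_000, 262_144))  # over-budget
```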
B. Dump Cilium config to find limits
cilium config view | grep -E 'bpf-ct|map'
You’re looking for keys like:
- bpf-ct-global-tcp-max
- bpf-ct-global-any-max
- dynamic map sizing ratio (if used)
C. Validate that netfilter conntrack is not the bottleneck
This is mostly to avoid chasing the wrong layer:
conntrack -S | head
sysctl net.netfilter.nf_conntrack_max
If those are healthy while Cilium CT signals are not, the culprit is likely the BPF layer.
Safe mitigations
- Increase CT map capacity (safest “infra” fix)
  Increase bpf-ct-global-tcp-max / bpf-ct-global-any-max and roll Cilium (node by node).
- Reduce connection churn (often the best fix)
  - enable keep-alives
  - use connection pooling in clients
  - remove aggressive per-request dial behavior
- Scale out the hot nodes
  Spreads churn across more CT map capacity.
- Ensure sane timeouts
  If connections linger longer than needed, CT entries live longer, increasing pressure.
Risky mitigations
- Flushing CT maps / restarting Cilium abruptly
  - This can drop every active connection on that node.
  - It’s sometimes necessary, but treat it as a controlled outage.
- Lowering timeouts blindly
  - Overly aggressive timeouts can break legitimate long-lived connections.
What we changed (concrete)
1) Increase CT map sizes via Cilium config
We updated cilium-config (or Helm values generating it) to increase CT capacity.
Diff (illustrative):
# kube-system/cilium-config
-bpf-ct-global-tcp-max: "262144"
-bpf-ct-global-any-max: "131072"
+bpf-ct-global-tcp-max: "524288"
+bpf-ct-global-any-max: "262144"
We rolled this change gradually (one node at a time) to avoid mass connection disruption.
2) Reduce churn in the application layer
We found a client that opened a fresh connection per request during retries. We changed it to reuse connections.
Example (conceptual):
- Before: new TCP connection per attempt
- After: keepalive + pooled transport, capped retries
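The before/after difference can be sketched as a retry loop that keeps the connection across attempts and redials only when the connection itself is broken. This is not the client code from the incident; `dial`, `send`, and `RetryableError` are hypothetical names standing in for whatever your transport provides.

```python
from typing import Callable, Optional, Tuple, TypeVar

Conn = TypeVar("Conn")
Result = TypeVar("Result")


class RetryableError(Exception):
    """Hypothetical: the server answered with a retryable failure; the connection is fine."""


def call_with_reuse(dial: Callable[[], Conn],
                    send: Callable[[Conn], Result],
                    conn: Optional[Conn] = None,
                    max_attempts: int = 3) -> Tuple[Result, Conn]:
    """Send a request with capped retries, dialing only when no usable connection exists.

    Returns (result, conn) so the caller can pass the same connection to the next call.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error: Optional[Exception] = None
    for _ in range(max_attempts):
        try:
            if conn is None:
                conn = dial()          # the only place a new connection is created
            return send(conn), conn
        except RetryableError as exc:
            last_error = exc           # keep the connection for the next attempt
        except ConnectionError as exc:
            last_error = exc
            conn = None                # broken connection: redial on the next attempt
    assert last_error is not None
    raise last_error
```

With this shape, three attempts against a healthy but busy server cost one CT entry instead of three.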
How to verify (measurable)
- Drop counters stop increasing during normal load:
  - cilium monitor --type drop shows no CT-related spikes
- CT map size stabilizes:
  - entry counts fluctuate but don’t monotonically climb
  - utilization stays below a chosen budget (e.g. ≤ 70%)
- Error rate drops:
  - connection resets / gRPC UNAVAILABLE back to baseline
  - tail latency improves
Prevention / guardrails
Budgets / invariants
- CT utilization budget per node (alert when close to max)
- Connection churn budget:
- cap connection creations per second per client
- treat “dial rate” as a first-class metric
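One way to enforce a dial-rate cap on the client side is a token bucket in front of whatever function opens connections. The class below is an illustrative sketch, not part of any Cilium or gRPC API; `DialBudget`, its parameters, and the chosen rates are all hypothetical.

```python
import time
from typing import Callable


class DialBudget:
    """Token bucket that limits how many new connections a client may open per second."""

    def __init__(self, rate_per_sec: float, burst: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate_per_sec
        self.capacity = float(burst)
        self.tokens = float(burst)      # start with a full bucket
        self.clock = clock
        self.last = clock()

    def allow_dial(self) -> bool:
        """Return True and spend one token if a new connection may be opened now."""
        now = self.clock()
        # Refill in proportion to elapsed time, never above the burst capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


if __name__ == "__main__":
    budget = DialBudget(rate_per_sec=2, burst=2)
    # Callers would check the budget before dialing and otherwise reuse or wait.
    print([budget.allow_dial() for _ in range(3)])
```

Counting how often `allow_dial()` returns False also gives you the "dial rate" pressure metric without extra instrumentation.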
Alerts to add
- CT insert failure counters (per node) > 0 sustained
- sudden increase in drops on a node (even before app errors)
- hot nodes: per-node error rate > baseline
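The "sustained" condition in the first alert can be expressed as: the counter grew in each of the last N scrape intervals. The helper below is a minimal sketch of that rule over a list of raw counter samples for one node. The default of three intervals is an assumption, and in practice you would write the equivalent expression in your monitoring system's own rule language.

```python
from typing import Sequence


def sustained_increase(samples: Sequence[int], min_intervals: int = 3) -> bool:
    """True if a cumulative counter grew in each of the last `min_intervals` intervals.

    `samples` are successive readings, oldest first. A counter reset (for example
    after an agent restart) shows up as a decrease and breaks the streak.
    """
    if min_intervals < 1:
        raise ValueError("min_intervals must be at least 1")
    if len(samples) < min_intervals + 1:
        return False  # not enough history to judge
    recent = samples[-(min_intervals + 1):]
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))


if __name__ == "__main__":
    # Hypothetical per-node readings of a CT insert-failure counter.
    print(sustained_increase([0, 0, 1, 5, 9]))   # True
    print(sustained_increase([0, 3, 3, 4, 6]))   # False
```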
Related reading
- Kubernetes conntrack Table Exhaustion: The Silent Packet Killer
- Traffic Hitting Dead Pods: Conntrack’s Stale NAT Mapping
- The Ghost Pod: Why Your Service Still Sends Traffic to Dead Endpoints
- kube-proxy Micro-Outages: The xtables Lock Contention Problem
- VXLAN Random Packet Drops: The Checksum Offload Trap
- Ephemeral Port Exhaustion: The Node That ‘Goes Bad’
- Packets Arrive but the App Times Out: The rp_filter Trap in Kubernetes