
Cilium BPF Conntrack Map Exhaustion: Random Resets While conntrack Looks Fine

Tags: kubernetes, cilium, ebpf, networking, troubleshooting

You get intermittent connection reset by peer, gRPC UNAVAILABLE, or random timeouts… but only on some nodes. You check netfilter conntrack:

  • conntrack -S looks normal
  • nf_conntrack_count isn’t near the limit
  • kube-proxy isn’t even running (kube-proxy replacement)

And yet connections still die.

If you run Cilium, there’s a parallel universe: eBPF conntrack (CT) maps. When those maps fill up (or GC can’t keep up), you can lose new flows or destabilize existing ones—while classic conntrack metrics stay green.

Tested on: Kubernetes 1.29–1.31, Cilium 1.15–1.16, Linux 6.1–6.6, kube-proxy replacement enabled.

Incident narrative (anonymized)

A multi-tenant cluster migrated from kube-proxy to Cilium kube-proxy replacement. Shortly after, one service (HTTP/2 + gRPC) started seeing:

  • periodic connection resets
  • elevated tail latency
  • errors concentrated on a subset of nodes

Blast radius: ~10–15% of requests (clients hashing to affected nodes) had errors.

Constraints:

  • We couldn’t roll back kube-proxy replacement immediately (policy + maintenance window).
  • We needed to mitigate without dropping all connections on the cluster.

Timeline

  • T-0: Client error rate spikes; dashboards show no conntrack saturation.
  • T+10m: Node-specific pattern emerges: errors correlate with a small set of nodes.
  • T+20m: cilium monitor --type drop shows drops with conntrack-related reasons.
  • T+30m: cilium bpf metrics list shows CT map insert failures increasing.
  • T+45m: Confirmed: CT maps near max entries; GC not keeping up with churn.
  • T+60m: Safe mitigation: increase CT map sizes + reduce connection churn.
  • T+2h: Errors disappear; CT utilization stabilizes.

Mechanism: what actually happened

Cilium uses BPF maps for connection tracking

When Cilium handles service LB/NAT in eBPF, it tracks flows using BPF maps (per-protocol variants). These maps have:

  • a fixed max size (unless using dynamic sizing options)
  • a GC mechanism that must keep up with churn
  • failure modes that show up as Cilium drops, not netfilter drops

Why it’s node-scoped

BPF maps are per node. That’s why:

  • only some nodes fail (hot nodes, higher churn, noisy neighbors)
  • restarting pods doesn’t help if traffic lands on the same saturated nodes

Why churn kills you

Short-lived connections (or aggressive load balancers) create many entries quickly. If GC can’t free them fast enough, inserts fail. Depending on the exact path, you’ll see:

  • new connections failing
  • existing ones being reset due to missing state
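The churn argument can be turned into a rough capacity estimate using Little's law: steady-state entries are approximately the rate of new connections multiplied by how long each entry lingers before GC reclaims it. The sketch below is only that arithmetic; the function names are made up for illustration, and both inputs are numbers you would have to measure on your own nodes.

```shell
# Rough estimate only: steady-state CT entries ~= arrival rate * entry lifetime.
# Both arguments are hypothetical measurements you supply (integers).
ct_steady_state() {
  rate_per_sec=$1   # new flows per second landing on this node
  linger_sec=$2     # average seconds an entry stays in the map before GC frees it
  echo $(( rate_per_sec * linger_sec ))
}

# Prints "over" when the estimate exceeds the map's max entries, otherwise "ok".
ct_fits() {
  estimate=$(ct_steady_state "$1" "$2")
  max_entries=$3
  if [ "$estimate" -gt "$max_entries" ]; then
    echo over
  else
    echo ok
  fi
}
```

For example, `ct_fits 3000 120 262144` prints `over` (3000 × 120 = 360000 entries), while the same churn against a 524288-entry map prints `ok`. The figures are illustrative, not measurements from the incident.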

Runbook: confirm CT map exhaustion

What to check first

  1. Confirm the issue is node-local

If errors correlate with a subset of nodes, suspect per-node state (BPF maps, routing, kernel).
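One way to check for a node-local pattern is to tally errors per node from whatever request log you have. The sketch below assumes a made-up input format, one line per request with `<node-name> <status>` where status is `ok` or `error`; you would need to adapt the extraction step to your own logs.

```shell
# Input on stdin (hypothetical format): "<node-name> <status>", one line per request.
# Output: "<node-name> <errors>/<total>", sorted by node name.
errors_by_node() {
  awk '{ total[$1]++; if ($2 == "error") err[$1]++ }
       END { for (n in total) printf "%s %d/%d\n", n, err[n], total[n] }' | sort
}
```

If a few nodes carry almost all of the errors while the rest sit near zero, per-node state is the likelier suspect.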

  2. Check Cilium drops

On an affected node (inside that node’s Cilium agent pod, or via a privileged debug pod; recent releases also install the in-pod CLI as cilium-dbg), run:

cilium monitor --type drop

If you see drops correlated with the error bursts, capture a short window (30–60s).
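A captured window is easier to read once the drops are grouped by reason. The helper below is a sketch that assumes each drop line contains the text `drop (<reason>)`, which is how drop events are typically printed; check it against the output of your Cilium version before relying on it. The function name is made up for illustration.

```shell
# Reads captured monitor output on stdin and prints "<count> <reason>",
# most frequent first. Lines without "drop (...)" are ignored.
drop_reasons() {
  sed -n 's/.*drop (\([^)]*\)).*/\1/p' |
    awk '{ count[$0]++ } END { for (r in count) printf "%d %s\n", count[r], r }' |
    sort -rn
}
```

Assuming `timeout` is available where you run the capture, a 60-second window could be summarized with `timeout 60 cilium monitor --type drop | drop_reasons`.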

  3. Check BPF CT metrics

cilium bpf metrics list | head -n 50

Look for increasing counters that indicate CT insert failures / map pressure.

Exact metric names can vary by version, but “ct map full / insert fail” patterns are consistent.

How to confirm the hypothesis

A. Inspect CT map utilization

Depending on version, you can list CT entries:

cilium bpf ct list global | head

To approximate size:

cilium bpf ct list global | wc -l

If the count is near your configured max (or trending up without coming down), that’s a strong signal.
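To make "near your configured max" concrete, the entry count can be turned into a utilization percentage and compared with a budget. This is a minimal sketch with made-up function names; the count from `wc -l` is only an approximation, and the max is whatever your Cilium config reports.

```shell
# Integer utilization percentage: entries * 100 / max (rounded down).
ct_util_pct() {
  entries=$1
  max=$2
  echo $(( entries * 100 / max ))
}

# Succeeds (exit 0) when utilization is above the given budget percentage.
over_budget() {
  [ "$(ct_util_pct "$1" "$2")" -gt "$3" ]
}
```

For instance, `ct_util_pct "$(cilium bpf ct list global | wc -l)" 262144` would print the approximate utilization of a 262144-entry map, and `over_budget 250000 262144 70` succeeds because 95% is above a 70% budget.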

B. Dump Cilium config to find limits

cilium config view | grep -E 'bpf-ct|map'

You’re looking for keys like:

  • bpf-ct-global-tcp-max
  • bpf-ct-global-any-max
  • bpf-map-dynamic-size-ratio (if dynamic map sizing is used)

C. Validate that netfilter conntrack is not the bottleneck

This is mostly to avoid chasing the wrong layer:

conntrack -S | head
sysctl net.netfilter.nf_conntrack_max

If those are healthy while Cilium CT signals are not, the culprit is likely the BPF layer.
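For a single number to put next to the Cilium figures, netfilter utilization can be read from the kernel's `nf_conntrack_count` and `nf_conntrack_max` files. The sketch below assumes they live under `/proc/sys/net/netfilter` (true when the nf_conntrack module is loaded); the function name and the optional directory argument are illustrative.

```shell
# Prints netfilter conntrack utilization as an integer percentage.
# Optional argument: directory holding nf_conntrack_count and nf_conntrack_max.
nf_ct_pct() {
  dir=${1:-/proc/sys/net/netfilter}
  count=$(cat "$dir/nf_conntrack_count") || return 1
  max=$(cat "$dir/nf_conntrack_max") || return 1
  echo $(( count * 100 / max ))
}
```

A low percentage here alongside rising CT insert failures in Cilium matches the pattern described above.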

Safe mitigations

  1. Increase CT map capacity (safest “infra” fix)
    Increase bpf-ct-global-tcp-max / bpf-ct-global-any-max and roll Cilium (node by node).

  2. Reduce connection churn (often the best fix)
    • enable keep-alives
    • use connection pooling in clients
    • remove aggressive per-request dial behavior

  3. Scale out the hot nodes
    Spreads churn across more CT map capacity.

  4. Ensure sane timeouts
    If connections linger longer than needed, CT entries live longer, increasing pressure.

Risky mitigations

  1. Flushing CT maps / restarting Cilium abruptly
    • This can drop every active connection on that node.
    • It’s sometimes necessary, but treat it as a controlled outage.
  2. Lowering timeouts blindly
    • Overly aggressive timeouts can break legitimate long-lived connections.

What we changed (concrete)

1) Increase CT map sizes via Cilium config

We updated cilium-config (or Helm values generating it) to increase CT capacity.

Diff (illustrative):

# kube-system/cilium-config

-bpf-ct-global-tcp-max: "262144"
-bpf-ct-global-any-max: "131072"
+bpf-ct-global-tcp-max: "524288"
+bpf-ct-global-any-max: "262144"

We rolled this change gradually (one node at a time) to avoid mass connection disruption.
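A one-node-at-a-time roll can be scripted along the following lines. This is a sketch, not the exact procedure used in the incident: it assumes the agent runs as a DaemonSet named `cilium` in `kube-system` with pods labelled `k8s-app=cilium`, and the function name plus the `DRY_RUN` and `CILIUM_NS` variables are made up for illustration.

```shell
# Reads pod names (e.g. "pod/cilium-abcde") on stdin and restarts them one by one,
# waiting for the DaemonSet to report ready before moving to the next pod.
# With DRY_RUN=1 it only prints the kubectl commands it would run.
roll_cilium_pods() {
  ns=${CILIUM_NS:-kube-system}
  while read -r pod; do
    [ -n "$pod" ] || continue
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "kubectl -n $ns delete $pod"
      echo "kubectl -n $ns rollout status daemonset/cilium --timeout=5m"
    else
      # </dev/null keeps kubectl from consuming the pod list on stdin
      kubectl -n "$ns" delete "$pod" </dev/null || return 1
      kubectl -n "$ns" rollout status daemonset/cilium --timeout=5m </dev/null || return 1
    fi
  done
}
```

A dry run over the live pod list would look like `kubectl -n kube-system get pods -l k8s-app=cilium -o name | DRY_RUN=1 roll_cilium_pods`. Stopping at the first failure keeps a bad config from spreading past one node.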

2) Reduce churn in the application layer

We found a client that opened a fresh connection per request during retries. We changed it to reuse connections.

Example (conceptual):

  • Before: new TCP connection per attempt
  • After: keepalive + pooled transport, capped retries

How to verify (measurable)

  1. Drop counters stop increasing during normal load:
    • cilium monitor --type drop shows no CT-related spikes
  2. CT map size stabilizes:
    • entry counts fluctuate but don’t monotonically climb
    • utilization stays below a chosen budget (e.g. ≤ 70%)
  3. Error rate drops:
    • connection resets / gRPC UNAVAILABLE back to baseline
    • tail latency improves
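The "fluctuate but don’t monotonically climb" check can be automated over a series of entry-count samples. The sketch below assumes you collect one count per line, oldest first (for example from periodic `cilium bpf ct list global | wc -l` runs); the function name is illustrative.

```shell
# Reads one entry-count sample per line (oldest first).
# Prints "climbing" if every sample is larger than the one before it,
# otherwise "stable". A single sample is reported as "stable".
ct_trend() {
  awk 'NR > 1 && $1 <= prev { flat = 1 }
       { prev = $1 }
       END { print ((NR > 1 && !flat) ? "climbing" : "stable") }'
}
```

A strictly increasing series over a long enough window is a signal to investigate, even if the map is not yet full.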

Prevention / guardrails

Budgets / invariants

  • CT utilization budget per node (alert when close to max)
  • Connection churn budget:
    • cap connection creations per second per client
    • treat “dial rate” as a first-class metric

Alerts to add

  • CT insert failure counters (per node) > 0 sustained
  • sudden increase in drops on a node (even before app errors)
  • hot nodes: per-node error rate > baseline

Cite this article

If you reference this post, please link to the original URL and credit the author.

Michal Drozd. "Cilium BPF Conntrack Map Exhaustion: Random Resets While conntrack Looks Fine". https://www.michal-drozd.com/en/blog/cilium-bpf-conntrack-map-exhaustion/ (Published October 29, 2025).