Cilium BPF Conntrack Map Exhaustion: Random Resets While conntrack Looks Fine

You get intermittent connection reset by peer, gRPC UNAVAILABLE, or random timeouts… but only on some nodes. You check netfilter conntrack:

conntrack -S looks normal
nf_conntrack_count isn’t near the limit
kube-proxy isn’t even running (kube-proxy replacement)

And yet connections still die.

If you run Cilium, there’s a parallel universe: eBPF conntrack (CT) maps. When those maps fill up (or GC can’t keep up), new flows can be dropped and existing ones destabilized, while classic conntrack metrics stay green.

Tested on: Kubernetes 1.29–1.31, Cilium 1.15–1.16, Linux 6.1–6.6, kube-proxy replacement enabled.

Incident narrative (anonymized)

A multi-tenant cluster migrated from kube-proxy to Cilium kube-proxy replacement. Shortly after, one service (HTTP/2 + gRPC) started seeing:

periodic connection resets
elevated tail latency
errors concentrated on a subset of nodes

Blast radius: ~10–15% of requests (clients hashing to affected nodes) had errors.

Constraints:

We couldn’t roll back kube-proxy replacement immediately (policy + maintenance window).
We needed to mitigate without dropping all connections on the cluster.

Timeline

T-0: Client error rate spikes; dashboards show no conntrack saturation.
T+10m: Node-specific pattern emerges: errors correlate with a small set of nodes.
T+20m: cilium monitor --type drop shows drops with conntrack-related reasons.
T+30m: cilium bpf metrics list shows CT map insert failures increasing.
T+45m: Confirmed: CT maps near max entries; GC not keeping up with churn.
T+60m: Safe mitigation: increase CT map sizes + reduce connection churn.
T+2h: Errors disappear; CT utilization stabilizes.

Mechanism: what actually happened

Cilium uses BPF maps for connection tracking

When Cilium handles service LB/NAT in eBPF, it tracks flows using BPF maps (per-protocol variants). These maps have:

a fixed max size (set explicitly, or derived from node memory at agent start when dynamic sizing is used)
a GC mechanism that must keep up with churn
failure modes that show up as Cilium drops, not netfilter drops

Why it’s node-scoped

BPF maps are per node. That’s why:

only some nodes fail (hot nodes, higher churn, noisy neighbors)
restarting pods doesn’t help if traffic lands on the same saturated nodes

Why churn kills you

Short-lived connections (or aggressive load balancers) create many entries quickly. If GC can’t free them fast enough, inserts fail. Depending on the exact path, you’ll see:

new connections failing
existing ones being reset due to missing state

Runbook: confirm CT map exhaustion

What to check first

Confirm the issue is node-local

If errors correlate with a subset of nodes, suspect per-node state (BPF maps, routing, kernel).

Check Cilium drops

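On an affected node, inside the Cilium agent pod (on Cilium 1.15+ the in-pod binary is named cilium-dbg; cilium may remain available as an alias depending on version), run: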

cilium monitor --type drop

If you see drops correlated with the error bursts, capture a short window (30–60s).

Check BPF CT metrics

cilium bpf metrics list | head -n 50

Look for increasing counters whose reason points at CT insert failures / map pressure (for example a drop reason such as “CT: Map insertion failed”).

Exact metric names can vary by version, but “ct map full / insert fail” patterns are consistent.

How to confirm the hypothesis

A. Inspect CT map utilization

Depending on version, you can list CT entries:

cilium bpf ct list global | head

To approximate size:

cilium bpf ct list global | wc -l

If the count is near your configured max (or trending up without coming down), that’s a strong signal.

B. Dump Cilium config to find limits

cilium config view | grep -E 'bpf-ct|map'

You’re looking for keys like:

bpf-ct-global-tcp-max
bpf-ct-global-any-max
bpf-map-dynamic-size-ratio (dynamic map sizing, if used)
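
Once you have a limit and an entry count, utilization is a simple ratio. The following is a minimal sketch, assuming the config output is one key/value pair per line separated by whitespace (verify this against your CLI version). The sample text and the entry count are hypothetical, and parse_ct_limits is a helper invented for this example.

```python
def parse_ct_limits(config_text: str) -> dict[str, int]:
    """Extract integer bpf-ct-global-* limits from 'key value' lines (assumed format)."""
    limits = {}
    for line in config_text.splitlines():
        parts = line.split()
        if (
            len(parts) == 2
            and parts[0].startswith("bpf-ct-global-")
            and parts[1].isdigit()
        ):
            limits[parts[0]] = int(parts[1])
    return limits


# Hypothetical excerpt of config output; real output contains many more keys.
sample = """\
bpf-ct-global-any-max       131072
bpf-ct-global-tcp-max       262144
bpf-map-dynamic-size-ratio  0.0025
"""

limits = parse_ct_limits(sample)

# Hypothetical entry count, e.g. taken from the `wc -l` approximation above.
entry_count = 250000
ratio = entry_count / limits["bpf-ct-global-tcp-max"]
print(f"approx TCP CT utilization: {ratio:.0%}")
```

The line count may include entries from more than one map, so treat the ratio as a rough indicator rather than an exact figure.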

C. Validate that netfilter conntrack is not the bottleneck

This is mostly to avoid chasing the wrong layer:

conntrack -S | head
sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max

If those are healthy while Cilium CT signals are not, the culprit is likely the BPF layer.

Safe mitigations

1) Increase CT map capacity (safest “infra” fix)
Increase bpf-ct-global-tcp-max / bpf-ct-global-any-max and roll Cilium (node by node).

2) Reduce connection churn (often the best fix)
- enable keep-alives
- use connection pooling in clients
- remove aggressive per-request dial behavior

3) Scale out the hot nodes
Spreads churn across more CT map capacity.

4) Ensure sane timeouts
If connections linger longer than needed, CT entries live longer, increasing pressure.

Risky mitigations

Flushing CT maps / restarting Cilium abruptly

This can drop all active connections handled by that node (the impact is node-wide, not cluster-wide).
It’s sometimes necessary, but treat it as a controlled outage.

Lowering timeouts blindly

Too aggressive timeouts can break legitimate long-lived connections.

What we changed (concrete)

1) Increase CT map sizes via Cilium config

We updated cilium-config (or Helm values generating it) to increase CT capacity.

Diff (illustrative):

# kube-system/cilium-config

-bpf-ct-global-tcp-max: "262144"
-bpf-ct-global-any-max: "131072"
+bpf-ct-global-tcp-max: "524288"
+bpf-ct-global-any-max: "262144"

We rolled this change gradually (one node at a time) to avoid mass connection disruption.

2) Reduce churn in the application layer

We found a client that opened a fresh connection per request during retries. We changed it to reuse connections.

Example (conceptual):

Before: new TCP connection per attempt
After: keepalive + pooled transport, capped retries
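
To make the before/after concrete, here is a small self-contained sketch (not the actual client from the incident) that counts how many TCP connections a local test server accepts under each pattern. On a Cilium node, each of those connections is a flow that needs its own CT entry. Everything here is standard-library Python; the class and function names are invented for the example.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class CountingServer(ThreadingHTTPServer):
    """Local test server that counts accepted TCP connections."""
    daemon_threads = True

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.accepted = 0

    def get_request(self):
        conn = super().get_request()
        self.accepted += 1  # get_request runs once per accepted connection
        return conn


class Handler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # allows keep-alive on the server side

    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example output quiet


def dial_per_request(host, port, n):
    """Before: open and close a fresh TCP connection for every request."""
    for _ in range(n):
        conn = http.client.HTTPConnection(host, port, timeout=5)
        conn.request("GET", "/")
        conn.getresponse().read()
        conn.close()


def reuse_connection(host, port, n):
    """After: send every request over one kept-alive connection."""
    conn = http.client.HTTPConnection(host, port, timeout=5)
    for _ in range(n):
        conn.request("GET", "/")
        conn.getresponse().read()
    conn.close()


def count_connections(client, n=20):
    """Run the client against a fresh local server; return connections accepted."""
    server = CountingServer(("127.0.0.1", 0), Handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        host, port = server.server_address
        client(host, port, n)
        return server.accepted
    finally:
        server.shutdown()
        server.server_close()


before = count_connections(dial_per_request)
after = count_connections(reuse_connection)
print(f"connections for 20 requests: per-request={before}, reused={after}")
```

The same request volume produces one tracked flow instead of one per request, which is why fixing the client often relieves CT pressure more than resizing maps does.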

How to verify (measurable)

Drop counters stop increasing during normal load:

cilium monitor --type drop shows no CT-related spikes

CT map size stabilizes:

entry counts fluctuate but don’t monotonically climb
utilization stays below a chosen budget (e.g. ≤ 70%)

Error rate drops:

connection resets / gRPC UNAVAILABLE back to baseline
tail latency improves
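
The "stabilizes" check above can be made mechanical. The following is a sketch under stated assumptions: you already collect periodic per-node entry counts by some means, the helper names are invented for this example, and the sample series are hypothetical. The 70% budget and the map size reuse the figures from this post.

```python
def within_budget(samples, map_max, budget=0.70):
    """True if the peak entry count stays at or below the utilization budget."""
    return max(samples) / map_max <= budget


def climbs_monotonically(samples):
    """True if every sample is strictly higher than the one before it."""
    return len(samples) >= 2 and all(b > a for a, b in zip(samples, samples[1:]))


MAP_MAX = 524288  # the post-change TCP limit from the config diff

# Hypothetical per-minute entry counts for one node.
healthy = [280000, 310000, 295000, 305000, 290000]
leaking = [300000, 340000, 390000, 450000, 510000]

for name, series in (("healthy", healthy), ("leaking", leaking)):
    print(
        name,
        "within_budget=", within_budget(series, MAP_MAX),
        "monotonic_climb=", climbs_monotonically(series),
    )
```

A strict monotonic test is deliberately simple; over longer windows a slope or moving-average comparison would be more robust to noise.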

Prevention / guardrails

Budgets / invariants

CT utilization budget per node (alert when close to max)
Connection churn budget:
- cap connection creations per second per client
- treat “dial rate” as a first-class metric

Alerts to add

CT insert failure counter rate (per node) > 0, sustained
sudden increase in drops on a node (even before app errors)
hot nodes: per-node error rate > baseline
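
For the first alert, "sustained" matters because the failure counters are cumulative: a single historical bump should not page anyone. Below is a minimal sketch of that condition over scraped counter samples. The function name, the three-interval threshold, and the sample values are assumptions for illustration; in practice the same logic would usually live in your monitoring system's rule language.

```python
def sustained_increase(counter_samples, min_consecutive=3):
    """True if a cumulative counter rose in min_consecutive back-to-back intervals."""
    streak = 0
    for prev, cur in zip(counter_samples, counter_samples[1:]):
        # A flat or decreasing sample (e.g. counter reset) breaks the streak.
        streak = streak + 1 if cur > prev else 0
        if streak >= min_consecutive:
            return True
    return False


# Hypothetical scrapes of a per-node CT insert failure counter.
quiet = [0, 0, 0, 0, 0]
one_off = [5, 6, 6, 7, 7]
ongoing = [10, 10, 12, 15, 19]

for name, series in (("quiet", quiet), ("one_off", one_off), ("ongoing", ongoing)):
    print(name, sustained_increase(series))
```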
