tcpdump Sees SYNs, but the Service Times Out: The Listen Backlog Trap
This was one of those incidents that makes you doubt your tooling. Clients were timing out. tcpdump on the node showed SYN packets arriving on the right interface. Sometimes we even saw SYN-ACK leaving. But the service logs were quiet, and the application behaved as if no one was connecting.
The root cause wasn’t a firewall, DNS, or “random network flakiness”. It was the kernel doing exactly what it’s designed to do when a process can’t accept connections fast enough: it stops admitting new connections. Depending on one sysctl, clients either time out (silent drop) or get a fast RST.
Environment: Linux nodes, Kubernetes, high connection churn, CPU throttling / busy event loop / slow accept loop
The Symptom Pattern
What makes this bug so confusing is how “reasonable” each signal looks in isolation:
- Clients: intermittent timeouts on connect (or during the first request).
- tcpdump: SYNs clearly arrive at the node / pod IP, on the expected port.
- Application: no access logs, no request logs, sometimes no accept() activity at all.
This is exactly the sort of situation where people start chasing the wrong suspects (conntrack, CNI, MTU, load balancer health checks) because “the packets are getting here”.
The Hidden Mechanism: Two Queues, Not One
When you call listen(fd, backlog), Linux ends up managing two related queues for that socket:
- SYN queue (half-open connections): connections that sent SYN, got SYN-ACK, but aren’t fully established yet.
- Accept queue (fully established, waiting for accept()): the handshake is complete, and the kernel is waiting for your process to call accept() and start reading.
If the accept queue fills up because your process can’t keep up (CPU starvation, too few worker threads, blocking in the accept loop, heavy TLS handshakes in a single-threaded runtime), Linux has to decide what to do with the next connection attempt.
In practice:
- You can see SYNs in tcpdump because capture happens before your application ever sees a socket.
- You can see SYN-ACK because the kernel can respond even if your process is stuck.
- And yet the app sees nothing because new connections never make it into the accept queue.
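The first two points can be checked on a laptop. The following is a minimal Python sketch, not code from the incident: the function name and the loopback setup are illustrative assumptions. It opens a listener that never calls accept() and shows that a client connect() can still succeed, because the kernel completes the handshake on the application's behalf as long as the accept queue has room.

```python
import socket


def connect_without_accept(backlog=5, timeout=2.0):
    """Return True if connect() succeeds although the server never calls accept()."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the kernel pick a free port
    listener.listen(backlog)
    port = listener.getsockname()[1]

    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.settimeout(timeout)
    try:
        # The kernel answers the SYN and finishes the handshake by itself;
        # the "application" above never touches the new connection.
        client.connect(("127.0.0.1", port))
        return True
    except OSError:
        return False
    finally:
        client.close()
        listener.close()


if __name__ == "__main__":
    print("connected without accept():", connect_without_accept())
```

Once the accept queue is full, the same connect() would start to stall or fail, which is the situation the rest of this post is about.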
How to Prove It (In Production)
1) Check the listening socket queue
On the node (or in the pod netns), check the listener’s queue:
ss -ltnp 'sport = :443'
For a listening socket, Recv-Q is the number of fully established connections waiting to be accepted, and Send-Q is the maximum the queue will hold (the listen() backlog, capped by net.core.somaxconn). If Recv-Q keeps hitting Send-Q during incident windows, you’re looking at accept queue pressure.
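If ss is missing from a minimal container image, a similar number can be read from /proc/net/tcp. The sketch below is a rough Python helper under one loud assumption: on LISTEN rows, the rx_queue column reflects the current accept queue depth (the value ss reports as Recv-Q). Verify that against ss on your kernel before trusting it. The function name and the sample text are illustrative, and /proc/net/tcp covers IPv4 only (IPv6 listeners live in /proc/net/tcp6).

```python
def listen_queue_depths(proc_net_tcp_text):
    """Map local port -> accept queue depth for sockets in LISTEN state."""
    depths = {}
    for line in proc_net_tcp_text.splitlines()[1:]:  # first line is the header
        fields = line.split()
        if len(fields) < 5 or fields[3] != "0A":  # 0A is the hex code for LISTEN
            continue
        port = int(fields[1].rsplit(":", 1)[1], 16)   # local_address is ADDR:PORT in hex
        rx_queue = int(fields[4].split(":")[1], 16)   # column is tx_queue:rx_queue in hex
        # Several listeners can share a port; keep the deepest queue seen.
        depths[port] = max(depths.get(port, 0), rx_queue)
    return depths


# Hand-written sample in the /proc/net/tcp layout (not captured from a real host).
SAMPLE = """\
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode
   0: 00000000:1F90 00000000:0000 0A 00000000:00000002 00:00000000 00000000     0        0 12345 1 0000000000000000 100 0 0 10 0
   1: 0100007F:1F90 0100007F:C350 01 00000000:00000000 00:00000000 00000000     0        0 12346 1 0000000000000000 20 4 30 10 -1
"""

if __name__ == "__main__":
    print(listen_queue_depths(SAMPLE))
```

On a live system you would pass the contents of /proc/net/tcp instead of SAMPLE and sample it repeatedly during the incident window.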
2) Read the kernel counters that don’t lie
Linux exposes explicit counters for this:
# TcpExt is two lines: header names, then values
grep '^TcpExt:' /proc/net/netstat
Look for:
- ListenOverflows: the accept queue overflowed (kernel couldn’t enqueue another established connection).
- ListenDrops: SYNs to listen sockets dropped.
- SyncookiesSent / SyncookiesRecv: SYN cookies activity (a sign of SYN queue pressure).
If ListenOverflows increments during the incident, that’s your smoking gun.
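Because the header and value lines are hard to line up by eye, it can help to pair them programmatically. The following Python sketch is one possible way to do that; the function name and the sample text are made up for illustration, and the sample contains only a handful of the real counters.

```python
def parse_tcpext(netstat_text):
    """Pair the TcpExt header line with its value line and return a dict."""
    rows = [line.split() for line in netstat_text.splitlines()
            if line.startswith("TcpExt:")]
    if len(rows) != 2:
        raise ValueError("expected exactly two TcpExt lines (names, then values)")
    names, values = rows
    # Drop the leading "TcpExt:" token from both lines before zipping.
    return dict(zip(names[1:], (int(v) for v in values[1:])))


# Hand-written sample; a real /proc/net/netstat has many more columns.
SAMPLE = """\
TcpExt: SyncookiesSent SyncookiesRecv ListenOverflows ListenDrops
TcpExt: 0 0 17 23
IpExt: InNoRoutes InTruncatedPkts
IpExt: 0 0
"""

if __name__ == "__main__":
    counters = parse_tcpext(SAMPLE)
    for name in ("ListenOverflows", "ListenDrops", "SyncookiesSent"):
        print(name, counters.get(name))
```

The counters are cumulative since boot, so what matters is the difference between two readings taken before and during the incident, not the absolute value.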
3) Check the sysctls that control “timeout vs reset”
sysctl net.core.somaxconn
sysctl net.ipv4.tcp_max_syn_backlog
sysctl net.ipv4.tcp_syncookies
sysctl net.ipv4.tcp_abort_on_overflow
Key detail:
- net.ipv4.tcp_abort_on_overflow=0 (default on many systems): overflow often looks like client timeouts.
- net.ipv4.tcp_abort_on_overflow=1: overflow is more likely to look like a fast RST, which is easier to detect and retry around.
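To record these values alongside other incident data, the same sysctls can be read from /proc/sys, where each dotted name maps to a file path. This is a small Python sketch under that assumption; the function name and the `root` parameter are hypothetical and exist mainly to keep the helper testable.

```python
from pathlib import Path

BACKLOG_SYSCTLS = [
    "net.core.somaxconn",
    "net.ipv4.tcp_max_syn_backlog",
    "net.ipv4.tcp_syncookies",
    "net.ipv4.tcp_abort_on_overflow",
]


def read_sysctls(names, root="/proc/sys"):
    """Return {sysctl name: value string}, or None where the file is unreadable."""
    result = {}
    for name in names:
        # net.ipv4.tcp_syncookies -> <root>/net/ipv4/tcp_syncookies
        path = Path(root, *name.split("."))
        try:
            result[name] = path.read_text().strip()
        except OSError:
            result[name] = None  # missing in this namespace or not permitted
    return result


if __name__ == "__main__":
    for name, value in read_sysctls(BACKLOG_SYSCTLS).items():
        print(f"{name} = {value}")
```

Run it inside the pod's network namespace as well as on the node, since the values you care about are the ones the listening socket actually sees.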
Repro Lab: Make tcpdump “Lie” On Purpose
You can reproduce the “packets arrive but the app doesn’t see them” failure mode on any Linux machine.
Step 1: Start a server that accepts very slowly (tiny backlog)
python3 - <<'PY'
import socket, time
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
s.bind(("0.0.0.0", 8080))
s.listen(1)  # intentionally tiny
print("listening on :8080")
while True:
    conn, addr = s.accept()
    time.sleep(5)  # simulate starvation / stuck accept loop
    conn.close()
PY
Step 2: Flood it with connections that stay open
# Requires bash /dev/tcp support
for i in $(seq 1 2000); do (exec 3<>/dev/tcp/127.0.0.1/8080; sleep 60) & done
wait
Step 3: Observe the backlog filling and overflows
ss -ltn 'sport = :8080'
grep '^TcpExt:' /proc/net/netstat
Step 4 (optional): Show how one sysctl changes the symptom
# 0 -> more “timeouts”
sysctl -w net.ipv4.tcp_abort_on_overflow=0
# 1 -> more “connection reset by peer”
sysctl -w net.ipv4.tcp_abort_on_overflow=1
Same root cause, different client-visible behavior.
Why This Happens in Kubernetes More Than You’d Expect
Kubernetes makes it easier to accidentally create “accept starvation”:
- CPU limits/throttling: the process is runnable but can’t get CPU at the right time.
- Single-threaded accept loops: one slow path blocks admitting new connections.
- TLS handshakes on the hot path: heavy work before a request is even logged.
- Burst traffic after rollouts: reconnect storms can saturate accept queues.
This is why “it worked in dev” is a common part of the story.
Fixes That Actually Work
Apply fixes in this order (because tuning sysctls won’t save a fundamentally overloaded server):
- Make accepting fast and boring: ensure the accept loop can always run; avoid doing heavy work before handing off.
- Right-size concurrency: enough workers/threads to drain the accept queue under peak connect rate.
- Increase backlog safely: set listen(backlog) to a sensible number and raise net.core.somaxconn on nodes so the value isn’t capped.
- Size the SYN queue: if you see SYN cookies, consider raising net.ipv4.tcp_max_syn_backlog.
- Prefer fast failure over mystery timeouts: consider tcp_abort_on_overflow=1 so clients get an immediate signal.
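The first two fixes can be sketched in a few lines. The Python below is one possible shape, not the service's actual code: the accept loop only accepts and hands the socket to a worker pool, so slow per-connection work never delays the next accept(). The names `handle` and `accept_loop`, the uppercase echo, and the polling timeout are all illustrative assumptions.

```python
import socket
import threading
from concurrent.futures import ThreadPoolExecutor


def handle(conn):
    """Per-connection work runs in a worker thread, off the accept path."""
    with conn:
        data = conn.recv(1024)
        conn.sendall(data.upper())  # stand-in for real request handling


def accept_loop(listener, pool, stop):
    """Accept and hand off immediately; nothing slow happens in this loop."""
    listener.settimeout(0.2)  # wake up periodically so `stop` is noticed
    while not stop.is_set():
        try:
            conn, _addr = listener.accept()
        except socket.timeout:
            continue
        pool.submit(handle, conn)


if __name__ == "__main__":
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind(("127.0.0.1", 8080))
    listener.listen(1024)  # the kernel caps this at net.core.somaxconn
    stop = threading.Event()
    with ThreadPoolExecutor(max_workers=32) as pool:
        try:
            accept_loop(listener, pool, stop)
        except KeyboardInterrupt:
            stop.set()
    listener.close()
```

The pool size and backlog here are placeholders. The point of the design is only that the time between two accept() calls stays small and roughly constant, whatever the handlers are doing.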
Monitoring Checklist
- Alert on increases in ListenOverflows (any sustained rate is a red flag).
- Track ListenDrops and SyncookiesSent during incident windows.
- Correlate overflows with CPU throttling and connection churn (reconnect spikes).
- Add a canary that measures connect latency, not just HTTP latency.
- Validate backlog settings after node image upgrades (defaults change).
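For the connect-latency canary, a minimal Python sketch could look like the following. It times only the TCP connect, with no HTTP request, and classifies the outcome. The function name, the outcome labels, and the one-second timeout are assumptions for illustration, not an existing tool.

```python
import socket
import time


def probe_connect(host, port, timeout=1.0):
    """Return (outcome, seconds) for a single TCP connect attempt."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    start = time.monotonic()
    try:
        sock.connect((host, port))
        outcome = "ok"
    except (socket.timeout, TimeoutError):
        outcome = "timeout"   # typical when SYNs are silently dropped
    except ConnectionRefusedError:
        outcome = "refused"   # nothing is listening on that port
    except OSError:
        outcome = "error"
    finally:
        sock.close()
    return outcome, time.monotonic() - start


if __name__ == "__main__":
    outcome, seconds = probe_connect("127.0.0.1", 8080)
    print(f"{outcome} in {seconds * 1000:.1f} ms")
```

Run against the repro server from the lab above, a probe like this should start reporting timeouts once the backlog is full, while an HTTP-level check would only show generic request failures.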
If this incident felt familiar, you’ll likely appreciate the same debugging lesson in a different form: tcpdump can show packets that never reach your app because the kernel drops them earlier in the pipeline (for example, with reverse path filtering).