PMTU Blackholes: When Only Large Responses Hang
This one was pure networking folklore until it hit us. “The API works for small responses but hangs for large ones.” We spent days adding timeouts and retries before realizing the actual problem had nothing to do with application code. Health checks worked. Simple queries worked. But the moment a response exceeded about 1,400 bytes, the connection would hang until timeout.
The debugging was frustrating because everything seemed to point in different directions. Application logs showed nothing. tcpdump on the sender showed packets leaving successfully. The receiving end simply never got them. It was as if packets above a certain size were vanishing into a black hole.
And that’s exactly what was happening. Path MTU Discovery (PMTUD) is the mechanism by which a router tells a sender “your packet is too big, send smaller ones.” In our environment, a security group was silently dropping the ICMP messages that carry this information. The sender never learned that it needed smaller packets, so it kept retransmitting the same oversized ones, and each retransmission was dropped in turn. No error, just silence.
This is one of those networking problems that’s invisible at the application layer. Your code is correct. Your network configuration looks correct. But deep in the stack, a firewall rule is filtering ICMP, and the entire system breaks for payloads above a certain size. It’s the kind of bug that makes you question reality.
Environment: Kubernetes 1.28, Calico overlay (VXLAN), cloud provider with default security groups
The Problem
Symptoms That Make No Sense
Pattern we observed:
✓ GET /health → 200 OK (50 bytes) - works
✓ GET /api/user/1 → 200 OK (500 bytes) - works
✗ GET /api/users → hangs forever (15KB) - fails
✗ POST /api/upload → hangs (large body) - fails
Clues:
- Same endpoint, different payload sizes
- Threshold around 1400-1500 bytes
- Cross-node traffic affected more
- Worked fine in dev environment
Why This Is Hard to Debug
# Application logs show nothing useful
# Just timeout after 30 seconds
# tcpdump on sender shows packets leaving
tcpdump -i eth0 host $DEST_IP
# 10:00:00 IP src > dst: TCP ... length 1460
# 10:00:00 IP src > dst: TCP ... length 1460
# Packets are being sent... but no response
# The problem is invisible at application layer
# Because ICMP messages that would signal the issue
# are being dropped by a firewall somewhere
Root Cause
Path MTU Discovery 101
Normal PMTU Discovery:
┌──────────┐   1500B packet    ┌──────────┐   Can't fragment    ┌──────────┐
│  Sender  │──────────────────▶│  Router  │────────────────────▶│   Drop   │
└──────────┘                   └──────────┘                     └──────────┘
                                    │
                                    │ ICMP "Fragmentation Needed"
                                    │ MTU = 1400
                                    ▼
┌──────────┐   1400B packets   ┌──────────┐
│  Sender  │──────────────────▶│  Router  │──────────────────▶ Delivered!
└──────────┘   (after ICMP)    └──────────┘
PMTU Blackhole (broken):
┌──────────┐   1500B packet    ┌──────────┐   Can't fragment    ┌──────────┐
│  Sender  │──────────────────▶│  Router  │────────────────────▶│   Drop   │
└──────────┘                   └──────────┘                     └──────────┘
                                    │
                                    │ ICMP "Fragmentation Needed"
                                    ▼
                               ┌──────────┐
                               │ Firewall │──▶ DROPPED (ICMP filtered)
                               └──────────┘
Result: Sender never learns about MTU problem
        Keeps sending 1500B packets
        Connection hangs forever
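The difference between the two diagrams can be made concrete with a toy model. This is only a sketch, not a network tool: `simulate_send` is a hypothetical function that models a sender which shrinks its packets only when the ICMP reply actually arrives, and the retry count of 3 is an arbitrary assumption.

```shell
#!/bin/bash
# simulate_send PACKET_SIZE PATH_MTU ICMP_ALLOWED(yes|no) [MAX_TRIES]
# Toy model of PMTUD: no packets are sent, it only mimics the decision logic.
simulate_send() {
    local size=$1 path_mtu=$2 icmp=$3 tries=${4:-3} attempt=1
    while [ "$attempt" -le "$tries" ]; do
        if [ "$size" -le "$path_mtu" ]; then
            echo "delivered size=$size attempt=$attempt"
            return 0
        fi
        # The router drops the oversized packet. The sender can only adapt
        # if the ICMP "Fragmentation Needed" message makes it back.
        if [ "$icmp" = "yes" ]; then
            size=$path_mtu
        fi
        attempt=$(( attempt + 1 ))
    done
    echo "timeout size=$size attempts=$tries"
    return 1
}

simulate_send 1500 1400 yes          # working PMTUD: second attempt succeeds
simulate_send 1500 1400 no || true   # blackhole: every attempt is the same size
```

With ICMP allowed, the model delivers on the second attempt at 1400 bytes. With ICMP filtered, every retry is the same 1500 bytes and the call ends in a timeout, which is the hang the application sees.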
Overlay Network Makes It Worse
Physical MTU: 1500 bytes
With VXLAN overlay:
┌─────────────────────────────────────────────────┐
│ Original IP header (20B) + TCP (20B) + Data │
└─────────────────────────────────────────────────┘
│
│ VXLAN encapsulation adds:
│ - Outer IP header: 20 bytes
│ - UDP header: 8 bytes
│ - VXLAN header: 8 bytes
│ - Inner Ethernet header: 14 bytes (the outer Ethernet header does not count against the MTU)
▼
┌─────────────────────────────────────────────────┐
│ Outer headers (50B) + Original packet (1450B) │
│ = 1500 bytes (just fits!) │
└─────────────────────────────────────────────────┘
But if original packet is slightly larger:
┌─────────────────────────────────────────────────┐
│ Outer headers (50B) + Original (1460B) = 1510B │
│ EXCEEDS MTU → needs fragmentation or ICMP │
└─────────────────────────────────────────────────┘
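The arithmetic in the diagram can be checked with a few lines of shell. This is a minimal sketch: the function names are made up for illustration, and the 50-byte overhead is the VXLAN-over-IPv4 figure used in this article (other encapsulations, or IPv6 outer headers, need a different constant).

```shell
#!/bin/bash
# Encapsulation overhead in bytes. Assumption: VXLAN over IPv4, as in this article.
VXLAN_OVERHEAD=50

# Largest inner (pod) packet that still fits in the physical MTU.
inner_mtu() {
    echo $(( $1 - VXLAN_OVERHEAD ))
}

# Size of the packet on the wire after encapsulating an inner packet.
encapsulated_size() {
    echo $(( $1 + VXLAN_OVERHEAD ))
}

# Report whether an inner packet fits once encapsulated.
# Usage: fits INNER_SIZE PHYSICAL_MTU
fits() {
    if [ "$(encapsulated_size "$1")" -le "$2" ]; then
        echo "fits"
    else
        echo "too big"
    fi
}

echo "max inner packet on a 1500-byte link: $(inner_mtu 1500)"
for inner in 1450 1460 1500; do
    echo "inner $inner -> outer $(encapsulated_size "$inner"): $(fits "$inner" 1500)"
done
```

Only the 1450-byte inner packet fits. A full 1500-byte inner packet, which is what a pod sends when its interface still advertises an MTU of 1500, becomes 1550 bytes on the wire.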
Diagnosis
Step 1: Identify the Threshold
# Find the exact size where things break
for size in 1000 1200 1400 1450 1480 1500; do
    echo -n "Size $size: "
    # -w prints the status code; the trailing \n keeps each result on its own line
    timeout 5 curl -s -o /dev/null -w "%{http_code}\n" \
        "http://$SERVICE_IP/api/generate?size=$size" || echo "TIMEOUT"
done
# Output:
# Size 1000: 200
# Size 1200: 200
# Size 1400: 200
# Size 1450: TIMEOUT <-- Threshold found!
# Size 1480: TIMEOUT
# Size 1500: TIMEOUT
Step 2: Check ICMP Filtering
# From a pod, send a full-size ping with DF set (1472 bytes of payload + 28 bytes of headers = 1500)
kubectl exec -it $POD -- ping -c 3 -s 1472 -M do $DEST_IP
# If you see:
# ping: local error: message too long, mtu=1450
# That's good - PMTU is working locally
# But if packets just disappear across nodes:
kubectl exec -it $POD -- tracepath $DEST_IP
# Look for "asymm" or "no reply" entries
Step 3: Capture the Missing ICMP
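# On the source node, capture ICMP "Fragmentation Needed" (type 3, code 4),
# since that is where the error message should arrive
tcpdump -ni any 'icmp[icmptype] == icmp-unreach and icmp[icmpcode] == 4'
# In another terminal, send a large ping with DF set from a pod on that node
kubectl exec -it $POD -- ping -c 1 -s 1472 -M do $DEST_IP
# If the ping fails and no ICMP error reaches the source node, it's being filtered somewhere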
Step 4: Check MTU Along the Path
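#!/bin/bash
# pmtu-probe.sh - Find the actual path MTU to a target
# Usage: ./pmtu-probe.sh TARGET_IP
TARGET_IP=${1:?usage: pmtu-probe.sh TARGET_IP}
MIN_SIZE=1000   # assumed to pass
MAX_SIZE=1501   # one above the largest size worth testing, so 1500 itself can be reported
while [ $((MAX_SIZE - MIN_SIZE)) -gt 1 ]; do
    MID=$(( (MAX_SIZE + MIN_SIZE) / 2 ))
    # -M do = set DF (don't fragment); -s = ICMP payload size,
    # which is the packet size minus 28 bytes of IP + ICMP headers
    if ping -c 1 -W 2 -M do -s $((MID - 28)) "$TARGET_IP" > /dev/null 2>&1; then
        MIN_SIZE=$MID
        echo "Size $MID: OK"
    else
        MAX_SIZE=$MID
        echo "Size $MID: FAIL"
    fi
done
echo "Actual MTU: $MIN_SIZE"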
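The binary search in the probe script can be exercised without a network by separating the search from the probe. In the sketch below, `find_max_ok` and `fake_probe` are hypothetical names; `fake_probe` simply pretends the path MTU is 1450 so the search logic can be checked offline.

```shell
#!/bin/bash
# find_max_ok LOW HIGH PROBE
# Binary search for the largest size PROBE accepts.
# Assumes LOW passes and HIGH fails; PROBE is a command called with one size
# argument that returns 0 when a packet of that size gets through.
find_max_ok() {
    local low=$1 high=$2 probe=$3 mid
    while [ $(( high - low )) -gt 1 ]; do
        mid=$(( (low + high) / 2 ))
        if "$probe" "$mid"; then
            low=$mid
        else
            high=$mid
        fi
    done
    echo "$low"
}

# Stand-in probe for offline testing: behaves like a path with MTU 1450.
fake_probe() {
    [ "$1" -le 1450 ]
}

find_max_ok 1000 1501 fake_probe
```

Swapping `fake_probe` for a function that wraps the `ping -M do` call gives the same behavior as the script above. Passing 1501 as the upper bound matters: the upper bound is never probed, so a bound of 1500 could report at most 1499.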
The Fix
Option 1: MSS Clamping at CNI
# For Calico - set the tunnel MTUs so TCP negotiates a smaller MSS
apiVersion: crd.projectcalico.org/v1
kind: FelixConfiguration
metadata:
  name: default
spec:
  # Pattern for the host interfaces Felix inspects when auto-detecting the MTU
  mtuIfacePattern: "^(eth|en).*"
  # Set MTU explicitly for each encapsulation
  ipipMTU: 1440
  vxlanMTU: 1450
  wireguardMTU: 1420
# A smaller interface MTU makes TCP sessions negotiate smaller segments
# No ICMP needed!
Option 2: Fix Security Group Rules
# AWS Security Group - allow ICMP type 3, code 4
# (Destination Unreachable: Fragmentation Needed); for ICMP, --port takes type-code
aws ec2 authorize-security-group-ingress \
    --group-id sg-xxx \
    --protocol icmp \
    --port 3-4 \
    --cidr 10.0.0.0/8
# GCP Firewall - allow ICMP (this rule permits all ICMP types)
gcloud compute firewall-rules create allow-icmp-pmtu \
    --direction=INGRESS \
    --priority=1000 \
    --network=default \
    --action=ALLOW \
    --rules=icmp \
    --source-ranges=10.0.0.0/8
Option 3: Set Interface MTU Explicitly
# DaemonSet to set MTU on all nodes
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: set-mtu
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: set-mtu
  template:
    metadata:
      labels:
        app: set-mtu
    spec:
      hostNetwork: true
      initContainers:
        - name: set-mtu
          image: alpine
          securityContext:
            privileged: true
          command:
            - /bin/sh
            - -c
            - |
              # Find the overlay interface (usually vxlan.calico or similar)
              for iface in vxlan.calico flannel.1 cilium_vxlan; do
                if ip link show "$iface" >/dev/null 2>&1; then
                  ip link set "$iface" mtu 1450
                  echo "Set $iface MTU to 1450"
                fi
              done
      containers:
        # Keeps the pod running after the init container has done its work
        - name: pause
          image: registry.k8s.io/pause:3.9
Option 4: TCP MSS Rewriting via iptables
# On each node, clamp MSS for all pod traffic
iptables -t mangle -A POSTROUTING \
    -p tcp --tcp-flags SYN,RST SYN \
    -o vxlan.calico \
    -j TCPMSS --clamp-mss-to-pmtu
# Or set an explicit value (1410 is the maximum for a 1450-byte MTU; 1360 leaves headroom)
iptables -t mangle -A POSTROUTING \
    -p tcp --tcp-flags SYN,RST SYN \
    -o vxlan.calico \
    -j TCPMSS --set-mss 1360
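To pick a value for `--set-mss`, subtract the IPv4 and TCP header sizes from the MTU of the interface the traffic leaves through. The helper below is a sketch with a made-up name, `mss_for_mtu`; it assumes 20-byte IPv4 and 20-byte TCP headers, and the optional safety margin is an illustrative parameter rather than anything iptables requires.

```shell
#!/bin/bash
# mss_for_mtu MTU [MARGIN]
# MSS = MTU - 20 (IPv4 header) - 20 (TCP header) - optional safety margin.
mss_for_mtu() {
    local mtu=$1 margin=${2:-0}
    echo $(( mtu - 20 - 20 - margin ))
}

echo "max MSS for a 1450-byte MTU: $(mss_for_mtu 1450)"
echo "with 50 bytes of headroom:   $(mss_for_mtu 1450 50)"
```

For a 1450-byte overlay MTU the ceiling is 1410, and 50 bytes of headroom gives the 1360 used in the rule above. Over IPv6 the fixed IP header is 40 bytes, so the constants would differ.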
Monitoring
Prometheus Rules
groups:
  - name: pmtu
    rules:
      # Alert on high TCP retransmissions (symptom of PMTU issues)
      - alert: HighTCPRetransmissions
        expr: |
          rate(node_netstat_Tcp_RetransSegs[5m]) /
          rate(node_netstat_Tcp_OutSegs[5m]) > 0.01
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High TCP retransmissions on {{ $labels.instance }}"
          description: "May indicate PMTU blackholing"
      # Alert on ICMP unreachables (good - means PMTU is working)
      # Note: node_exporter may need --collector.netstat.fields extended to export this metric
      - alert: ICMPUnreachableSpike
        expr: |
          rate(node_netstat_Icmp_InDestUnreachs[5m]) > 100
        for: 5m
        labels:
          severity: info
        annotations:
          summary: "ICMP Destination Unreachable spike on {{ $labels.instance }}"
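The same ratio can be spot-checked on a single node without Prometheus, because node_exporter reads these counters from `/proc/net/snmp`. The sketch below uses a hypothetical helper, `retrans_ratio`. It reports the cumulative ratio since boot rather than a 5-minute rate, so it is only a rough cross-check of the alert expression, and the sample counters are invented for the demonstration.

```shell
#!/bin/bash
# retrans_ratio [FILE]
# Prints RetransSegs / OutSegs from a /proc/net/snmp style file.
# The first "Tcp:" line holds column names, the second holds the values.
retrans_ratio() {
    awk '
        $1 == "Tcp:" && !have_header {
            for (i = 2; i <= NF; i++) col[$i] = i
            have_header = 1
            next
        }
        $1 == "Tcp:" {
            out = $(col["OutSegs"]); re = $(col["RetransSegs"])
            if (out > 0) printf "%.4f\n", re / out; else print "0.0000"
        }
    ' "${1:-/proc/net/snmp}"
}

# Demonstration with made-up counters so the sketch also runs off-node.
sample=$(mktemp)
cat > "$sample" <<'EOF'
Tcp: RtoAlgorithm RtoMin RtoMax MaxConn ActiveOpens PassiveOpens AttemptFails EstabResets CurrEstab InSegs OutSegs RetransSegs InErrs OutRsts InCsumErrors
Tcp: 1 200 120000 -1 100 50 2 3 10 100000 80000 1200 0 5 0
EOF
retrans_ratio "$sample"
rm -f "$sample"
```

On a node, calling `retrans_ratio` with no argument reads the live counters. A value above 0.01 corresponds to the 1% threshold in the alert.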
Quick Health Check
#!/bin/bash
# pmtu-health.sh - Run from each node
echo "=== MTU Configuration ==="
ip addr show | grep mtu
echo "=== ICMP Statistics ==="
grep Icmp /proc/net/snmp
echo "=== PMTU Cache ==="
ip route get 10.0.0.1 # Replace with known cross-node IP
echo "=== Checking for PMTU issues ==="
# High retransmits without drops = possible PMTU issue
netstat -s | grep -E "(retransmit|segments sent)"
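When the kernel has learned a smaller path MTU for a destination, `ip route get` typically shows it on a `cache` line. A small filter can pull that number out for scripting. `cached_pmtu` is a hypothetical helper, and the sample text fed to it below is illustrative output, not a capture from the environment described in this article.

```shell
#!/bin/bash
# cached_pmtu
# Reads `ip route get` output on stdin and prints the value following the
# first "mtu" keyword. Prints nothing when no mtu is present.
cached_pmtu() {
    awk '{
        for (i = 1; i < NF; i++)
            if ($i == "mtu") { print $(i + 1); exit }
    }'
}

# Illustrative input; on a node, use: ip route get 10.0.0.1 | cached_pmtu
printf '%s\n' \
    '10.0.0.1 via 10.0.1.1 dev eth0 src 10.0.1.5 uid 0' \
    '    cache expires 580sec mtu 1450' | cached_pmtu
```

Empty output after sending large packets with DF set suggests the node never received a "Fragmentation Needed" message for that destination, which is consistent with ICMP being filtered.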
Checklist
## PMTU Blackhole Diagnosis
### Symptoms
- [ ] Large responses hang, small ones work
- [ ] Threshold around 1400-1500 bytes
- [ ] Cross-node traffic worse than same-node
- [ ] Works in non-overlay environments
### Diagnosis
- [ ] Find exact size threshold
- [ ] Check if ICMP type 3 is allowed
- [ ] Verify MTU on overlay interfaces
- [ ] Check cloud security groups for ICMP
### Fixes
- [ ] Enable MSS clamping at CNI level
- [ ] Allow ICMP type 3 in security groups
- [ ] Set explicit MTU on overlay interfaces
- [ ] Add iptables MSS rewriting rules
### Verification
- [ ] Test with large payloads after fix
- [ ] Monitor TCP retransmissions
- [ ] Verify ICMP messages are flowing
Conclusion
PMTU blackholes represent a fundamental challenge of layered networking. Each layer—physical network, overlay network, application—has its own assumptions about packet sizes. When those assumptions conflict, and the feedback mechanism (ICMP) is blocked, the result is silent failure.
The insidious nature of this problem comes from its partial success. Small requests work perfectly, which gives false confidence. Health checks pass because they have small payloads. It’s only when you hit larger payloads—often in production with real data—that the problem manifests. And when it does, there’s no error message, just a timeout.
Overlay networks amplify the problem because they add encapsulation overhead. VXLAN encapsulation adds 50 bytes. If your physical MTU is 1500, the inner packet can be at most 1450 bytes. But TCP doesn’t know about the overlay: it derives its MSS from the MTU of the interface it can see. If that interface still advertises 1500, TCP sends 1460-byte segments in 1500-byte IP packets, which grow to 1550 bytes once encapsulated, and they’re dropped.
The fix is straightforward once you understand the problem. MSS clamping tells TCP to use smaller segments, avoiding the need for fragmentation or PMTU discovery. Allowing ICMP type 3 lets the feedback mechanism work as designed. But the key insight is that you need to configure this proactively. By the time you’re debugging hanging connections, you’ve already lost hours or days.
Key takeaways:
- Small requests work - gives false confidence that networking is fine
- No errors appear - just timeouts, because ICMP feedback is blocked
- ICMP filtering is invisible - you can’t see what the firewall drops
- Overlay encapsulation reduces effective MTU - problem emerges only in production
The fix is simple (clamp the MSS or allow ICMP), but diagnosis requires understanding the interaction between overlay networks, PMTU discovery, and firewall rules. When in doubt, enable MSS clamping at the CNI level: it’s cheap insurance against a frustrating class of problems.
Related Articles
- VXLAN Checksum Offload Packet Drops - Another overlay network trap
- Kubernetes DNS Caching - More networking edge cases