PMTU Blackholes: When Only Large Responses Hang

This one was pure networking folklore until it hit us. “The API works for small responses but hangs for large ones.” We spent days adding timeouts and retries before realizing the actual problem had nothing to do with application code. Health checks worked. Simple queries worked. But the moment a response exceeded about 1,400 bytes, the connection would hang until timeout.

The debugging was frustrating because everything seemed to point in different directions. Application logs showed nothing. tcpdump on the sender showed packets leaving successfully. The receiving end simply never got them. It was as if packets above a certain size were vanishing into a black hole.

And that’s exactly what was happening. Path MTU Discovery (PMTUD) is the mechanism by which routers tell senders “your packets are too big, send smaller ones”: the sender sets the Don’t Fragment bit, and a router that cannot forward the packet replies with an ICMP “Fragmentation Needed” message (type 3, code 4) carrying the next-hop MTU. But in our environment, those ICMP messages were being silently dropped by a security group. The sender never learned that it needed to use smaller packets, so it kept retransmitting the same oversized ones, and they kept getting dropped. No error, no useful feedback, just silence.

This is one of those networking problems that’s invisible at the application layer. Your code is correct. Your network configuration looks correct. But deep in the stack, a firewall rule is filtering ICMP, and the entire system breaks for payloads above a certain size. It’s the kind of bug that makes you question reality.

Environment: Kubernetes 1.28, Calico overlay (VXLAN), cloud provider with default security groups

The Problem

Symptoms That Make No Sense

Pattern we observed:

✓ GET /health           → 200 OK (50 bytes)     - works
✓ GET /api/user/1       → 200 OK (500 bytes)    - works
✗ GET /api/users        → hangs forever (15KB)  - fails
✗ POST /api/upload      → hangs (large body)    - fails

Clues:
- Same endpoint, different payload sizes
- Threshold around 1400-1500 bytes
- Cross-node traffic affected more
- Worked fine in dev environment

Why This Is Hard to Debug

# Application logs show nothing useful
# Just timeout after 30 seconds

# tcpdump on sender shows packets leaving
tcpdump -i eth0 host $DEST_IP
# 10:00:00 IP src > dst: TCP ... length 1460
# 10:00:00 IP src > dst: TCP ... length 1460
# Packets are being sent... but no response

# The problem is invisible at application layer
# Because ICMP messages that would signal the issue
# are being dropped by a firewall somewhere

Root Cause

Path MTU Discovery 101

Normal PMTU Discovery:

┌──────────┐   1500B packet    ┌──────────┐   Can't fragment    ┌──────────┐
│  Sender  │──────────────────▶│  Router  │────────────────────▶│   Drop   │
└──────────┘                   └──────────┘                     └──────────┘
                                    │
                                    │ ICMP "Fragmentation Needed"
                                    │ MTU = 1400
                                    ▼
┌──────────┐   1400B packets   ┌──────────┐
│  Sender  │──────────────────▶│  Router  │──────────────────▶ Delivered!
└──────────┘  (after ICMP)     └──────────┘


PMTU Blackhole (broken):

┌──────────┐   1500B packet    ┌──────────┐   Can't fragment    ┌──────────┐
│  Sender  │──────────────────▶│  Router  │────────────────────▶│   Drop   │
└──────────┘                   └──────────┘                     └──────────┘
                                    │
                                    │ ICMP "Fragmentation Needed"
                                    ▼
                              ┌──────────┐
                              │ Firewall │──▶ DROPPED (ICMP filtered)
                              └──────────┘

Result: Sender never learns about MTU problem
        Keeps retransmitting 1500B packets
        Connection hangs until the client times out
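
The two diagrams differ only in whether the ICMP reply makes it back to the sender. As a rough illustration, the toy function below walks a packet along a list of hop MTUs and reports what the sender would experience. The name `path_fate` and its arguments are invented for this sketch, and nothing here touches the network.

```shell
#!/bin/sh
# Toy model of PMTUD, not a network tool.
# Usage: path_fate SIZE ICMP_ALLOWED MTU...   (ICMP_ALLOWED is "yes" or "no")
path_fate() {
  size=$1
  icmp=$2
  shift 2
  for mtu in "$@"; do
    if [ "$size" -gt "$mtu" ]; then
      # This hop cannot forward the packet because Don't Fragment is set.
      if [ "$icmp" = "yes" ]; then
        echo "sender learns MTU $mtu and retries with smaller packets"
      else
        echo "blackhole: dropped at hop with MTU $mtu, sender hears nothing"
      fi
      return 0
    fi
  done
  echo "delivered"
}

path_fate 1500 yes 1500 1400 1500   # normal PMTUD
path_fate 1500 no  1500 1400 1500   # ICMP filtered
path_fate 1400 no  1500 1400 1500   # small packet, ICMP irrelevant
```

The third call is why health checks keep passing: packets that already fit never need the ICMP feedback.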

Overlay Network Makes It Worse

Physical MTU: 1500 bytes

With VXLAN overlay:
┌─────────────────────────────────────────────────┐
│ Original IP header (20B) + TCP (20B) + Data     │
└─────────────────────────────────────────────────┘
                    │
                    │ VXLAN encapsulation adds:
                    │ - Outer IP header: 20 bytes
                    │ - UDP header: 8 bytes
                    │ - VXLAN header: 8 bytes
                    │ - Inner Ethernet header: 14 bytes (the pod's frame rides inside the tunnel)
                    ▼
┌─────────────────────────────────────────────────┐
│ Outer headers (50B) + Original packet (1450B)   │
│ = 1500 bytes (just fits!)                       │
└─────────────────────────────────────────────────┘

But if original packet is slightly larger:
┌─────────────────────────────────────────────────┐
│ Outer headers (50B) + Original (1460B) = 1510B  │
│ EXCEEDS MTU → needs fragmentation or ICMP       │
└─────────────────────────────────────────────────┘
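
The arithmetic in these boxes is easy to get wrong by a few bytes, so here is a small sketch of it. The function names are made up for illustration, and the 40-byte figure assumes IPv4 and TCP headers without options.

```shell
#!/bin/sh
VXLAN_OVERHEAD=50   # the 50 bytes of VXLAN overhead described above
IP_TCP_HEADERS=40   # 20-byte IPv4 header + 20-byte TCP header, no options

# Largest inner packet that fits once encapsulated.
inner_mtu() { echo $(( $1 - VXLAN_OVERHEAD )); }
# Largest TCP segment that fits in a packet of the given MTU.
max_mss() { echo $(( $1 - IP_TCP_HEADERS )); }
# Size of an inner packet after encapsulation.
encapsulated_size() { echo $(( $1 + VXLAN_OVERHEAD )); }

echo "inner MTU on a 1500-byte network: $(inner_mtu 1500)"
echo "largest safe MSS: $(max_mss "$(inner_mtu 1500)")"
echo "1460-byte inner packet after encapsulation: $(encapsulated_size 1460)"
```

With these numbers, a pod that believes its MTU is 1500 negotiates an MSS of 1460, which is 50 bytes more than the 1410 the overlay can actually carry.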

Diagnosis

Step 1: Identify the Threshold

# Find the exact size where things break
for size in 1000 1200 1400 1450 1480 1500; do
  printf 'Size %s: ' "$size"
  # \n after the status code keeps each result on its own line
  timeout 5 curl -s -o /dev/null -w "%{http_code}\n" \
    "http://$SERVICE_IP/api/generate?size=$size" || echo "TIMEOUT"
done

# Output:
# Size 1000: 200
# Size 1200: 200
# Size 1400: 200
# Size 1450: TIMEOUT  <-- Threshold found!
# Size 1480: TIMEOUT
# Size 1500: TIMEOUT
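
When the sweep covers many sizes, it can help to summarize the results mechanically. The sketch below assumes the `Size N: CODE` line format shown in the output above; `find_threshold` is a hypothetical helper name.

```shell
#!/bin/sh
# Reads "Size N: CODE" lines on stdin and prints the largest size that
# returned 200 and the smallest size that did not (0 if none was seen).
find_threshold() {
  awk '
    { size = $2; sub(":", "", size); size += 0 }
    $3 == "200" { if (size > ok) ok = size; next }
    { if (fail == 0 || size < fail) fail = size }
    END { printf "largest OK: %d, smallest failing: %d\n", ok, fail }
  '
}

find_threshold <<'EOF'
Size 1000: 200
Size 1200: 200
Size 1400: 200
Size 1450: TIMEOUT
Size 1480: TIMEOUT
Size 1500: TIMEOUT
EOF
```

The real threshold lies somewhere between the two reported sizes; rerunning the sweep with finer steps inside that window narrows it down.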

Step 2: Check ICMP Filtering

# From a pod, send a full-size ping with Don't Fragment set
# (1472-byte payload + 8 ICMP + 20 IP = 1500-byte packet)
kubectl exec -it $POD -- ping -c 3 -s 1472 -M do $DEST_IP

# If you see:
# ping: local error: message too long, mtu=1450
# That's good - the kernel already knows the smaller MTU for this route

# But if packets just disappear across nodes:
kubectl exec -it $POD -- tracepath $DEST_IP
# Look for "asymm" or "no reply" entries, and for the hop where the reported pmtu drops
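
The `-s 1472` value is not arbitrary: it is the packet size minus the IP and ICMP headers. A minimal sketch of that calculation, with hypothetical helper names, assuming IPv4 without IP options:

```shell
#!/bin/sh
# ICMP echo payload that produces an IPv4 packet of exactly the given size
# (20-byte IP header + 8-byte ICMP header = 28 bytes).
ping_payload() { echo $(( $1 - 28 )); }

# Print (not run) the probe command for a packet size and a target.
probe_cmd() { echo "ping -c 1 -W 2 -M do -s $(ping_payload "$1") $2"; }

probe_cmd 1500 10.0.0.1
probe_cmd 1450 10.0.0.1
```

The address 10.0.0.1 is only a placeholder. Probing at 1450 as well as 1500 shows whether the overlay-sized packet gets through where the full-sized one does not.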

Step 3: Capture the Missing ICMP

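# On the source node, capture ICMP "Fragmentation Needed" (type 3, code 4)
tcpdump -ni any 'icmp[icmptype] == icmp-unreach and icmp[icmpcode] == 4'

# From a pod on that same node, send a large ping with Don't Fragment set
kubectl exec -it $POD -- ping -c 1 -s 1472 -M do $DEST_IP

# If no ICMP arrives back at the source, it's being filtered somewhere on the path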

Step 4: Check MTU Along the Path

#!/bin/bash
# pmtu-probe.sh - Find the actual path MTU with a binary search
# Usage: ./pmtu-probe.sh <target-ip>

TARGET_IP=${1:?"usage: $0 <target-ip>"}
MIN_SIZE=1000   # assumed to get through
MAX_SIZE=1501   # one above the largest size probed, so a clean path reports 1500

while [ $((MAX_SIZE - MIN_SIZE)) -gt 1 ]; do
  MID=$(( (MAX_SIZE + MIN_SIZE) / 2 ))

  # -M do = don't fragment, -s = ICMP payload size (packet size minus 28 for IP+ICMP headers)
  if ping -c 1 -W 2 -M do -s $((MID - 28)) "$TARGET_IP" > /dev/null 2>&1; then
    MIN_SIZE=$MID
    echo "Size $MID: OK"
  else
    MAX_SIZE=$MID
    echo "Size $MID: FAIL"
  fi
done

echo "Actual MTU: $MIN_SIZE"
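
The search logic can be checked without a network by swapping the ping for a stub. In the sketch below, `pmtu_search` and `fake_probe` are hypothetical names; the stub pretends the path MTU is 1450.

```shell
#!/bin/sh
# pmtu_search LOW HIGH PROBE
# LOW must be a size that passes, HIGH a size that fails (or one above the
# largest size worth testing). PROBE is a command that takes a packet size
# and returns 0 if a packet of that size gets through.
pmtu_search() {
  lo=$1
  hi=$2
  probe=$3
  while [ $((hi - lo)) -gt 1 ]; do
    mid=$(( (lo + hi) / 2 ))
    if "$probe" "$mid"; then
      lo=$mid
    else
      hi=$mid
    fi
  done
  echo "$lo"
}

# Stub standing in for ping: anything up to 1450 bytes "gets through".
fake_probe() { [ "$1" -le 1450 ]; }

pmtu_search 1000 1501 fake_probe
```

Against a real path, the probe would wrap the same `ping -M do` call the script uses. Each failing probe costs one ping timeout, which is why a binary search beats stepping through sizes one by one.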

The Fix

Option 1: MSS Clamping at CNI

# For Calico - set the overlay MTU so TCP negotiates a smaller MSS
apiVersion: crd.projectcalico.org/v1
kind: FelixConfiguration
metadata:
  name: default
spec:
  # Host interfaces Felix inspects when auto-detecting the MTU
  mtuIfacePattern: "^(eth|en).*"
  # Set MTU explicitly for each encapsulation type
  ipipMTU: 1440
  vxlanMTU: 1450
  wireguardMTU: 1420

# With a smaller MTU on the overlay path, TCP sessions negotiate smaller segments
# No ICMP needed!
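
After changing the MTU it is worth confirming what the interfaces actually report. A minimal sketch, assuming the one-line-per-interface output of `ip -o link show` from iproute2; `link_mtus` is a hypothetical helper name.

```shell
#!/bin/sh
# Reads `ip -o link show` output on stdin and prints "<interface> <mtu>".
link_mtus() {
  awk '{
    name = $2
    sub(/:$/, "", name)
    for (i = 3; i < NF; i++) if ($i == "mtu") print name, $(i + 1)
  }'
}

# Sample input so the sketch runs anywhere. On a cluster you would pipe in
# real output, e.g.: kubectl exec $POD -- ip -o link show | link_mtus
link_mtus <<'EOF'
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
3: eth0@if12: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP mode DEFAULT group default
EOF
```

If a pod still reports 1500 on its interface, the new MTU has not reached it.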

Option 2: Fix Security Group Rules

# AWS Security Group - allow ICMP type 3 (Destination Unreachable)
aws ec2 authorize-security-group-ingress \
  --group-id sg-xxx \
  --protocol icmp \
  --port 3 \
  --cidr 10.0.0.0/8

# GCP Firewall - allow ICMP
gcloud compute firewall-rules create allow-icmp-pmtu \
  --direction=INGRESS \
  --priority=1000 \
  --network=default \
  --action=ALLOW \
  --rules=icmp \
  --source-ranges=10.0.0.0/8

Option 3: Set Interface MTU Explicitly

# DaemonSet to set MTU on all nodes
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: set-mtu
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: set-mtu
  template:
    metadata:
      labels:
        app: set-mtu
    spec:
      hostNetwork: true
      initContainers:
        - name: set-mtu
          image: alpine
          securityContext:
            privileged: true
          command:
            - /bin/sh
            - -c
            - |
              # Find the overlay interface (vxlan.calico, flannel.1, or cilium_vxlan)
              for iface in vxlan.calico flannel.1 cilium_vxlan; do
                if ip link show "$iface" >/dev/null 2>&1; then
                  ip link set "$iface" mtu 1450
                  echo "Set $iface MTU to 1450"
                fi
              done
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9

Option 4: TCP MSS Rewriting via iptables

# On each node, clamp MSS for all pod traffic
iptables -t mangle -A POSTROUTING \
  -p tcp --tcp-flags SYN,RST SYN \
  -o vxlan.calico \
  -j TCPMSS --clamp-mss-to-pmtu

# Or set an explicit value (1450 MTU - 40 bytes of IP+TCP headers = 1410 at most; 1360 leaves headroom)
iptables -t mangle -A POSTROUTING \
  -p tcp --tcp-flags SYN,RST SYN \
  -o vxlan.calico \
  -j TCPMSS --set-mss 1360
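
To confirm the clamp is taking effect, look at the MSS option on SYN packets leaving the node. The sketch below parses a tcpdump line of the usual `options [mss N,...]` form. The helper names are hypothetical, and the sample line is made up for illustration.

```shell
#!/bin/sh
# Extracts the MSS value from tcpdump SYN lines read on stdin.
syn_mss() { sed -n 's/.*options \[mss \([0-9][0-9]*\)[],].*/\1/p'; }

# Succeeds if the MSS fits inside the MTU with 40 bytes of IP+TCP headers.
mss_fits() { [ "$1" -le $(( $2 - 40 )) ]; }

sample='IP 10.1.0.5.43210 > 10.1.1.7.8080: Flags [S], seq 1, win 64240, options [mss 1410,sackOK,TS val 1 ecr 0,nop,wscale 7], length 0'
mss=$(echo "$sample" | syn_mss)
if mss_fits "$mss" 1450; then
  echo "MSS $mss fits a 1450-byte overlay MTU"
else
  echo "MSS $mss is too large for a 1450-byte overlay MTU"
fi
```

On a node, real input could come from something like `tcpdump -ni vxlan.calico 'tcp[tcpflags] & tcp-syn != 0'`. An MSS of 1460 in that capture means the clamp is not being applied.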

Monitoring

Prometheus Rules

groups:
  - name: pmtu
    rules:
      # Alert on high TCP retransmissions (symptom of PMTU issues)
      - alert: HighTCPRetransmissions
        expr: |
          rate(node_netstat_Tcp_RetransSegs[5m]) /
          rate(node_netstat_Tcp_OutSegs[5m]) > 0.01
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High TCP retransmissions on {{ $labels.instance }}"
          description: "May indicate PMTU blackholing"

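      # Alert on ICMP unreachables (good - means PMTU feedback is arriving)
      # Note: node_exporter may not export this counter unless --collector.netstat.fields is widened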
      - alert: ICMPUnreachableSpike
        expr: |
          rate(node_netstat_Icmp_InDestUnreachs[5m]) > 100
        for: 5m
        labels:
          severity: info
        annotations:
          summary: "ICMP Destination Unreachable spike on {{ $labels.instance }}"

Quick Health Check

#!/bin/bash
# pmtu-health.sh - Run from each node

echo "=== MTU Configuration ==="
ip addr show | grep mtu

echo "=== ICMP Statistics ==="
cat /proc/net/snmp | grep Icmp

echo "=== PMTU Cache ==="
# A learned PMTU shows up as "mtu <N>" in the output
ip route get 10.0.0.1 # Replace with known cross-node IP

echo "=== Checking for PMTU issues ==="
# High retransmits without drops = possible PMTU issue
netstat -s | grep -E "(retransmit|segments sent)"
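
The retransmission ratio that the Prometheus rule alerts on can also be read straight from the kernel counters on a node. A minimal sketch, assuming the header-line-then-value-line layout of `/proc/net/snmp`; `tcp_retrans_ratio` is a hypothetical helper name.

```shell
#!/bin/sh
# Prints RetransSegs / OutSegs from an snmp-style file
# (defaults to /proc/net/snmp). Counters are totals since boot.
tcp_retrans_ratio() {
  awk '
    # First "Tcp:" line is the header: remember each column position.
    $1 == "Tcp:" && !seen { for (i = 2; i <= NF; i++) col[$i] = i; seen = 1; next }
    # Second "Tcp:" line holds the values.
    $1 == "Tcp:" {
      out = $(col["OutSegs"]); re = $(col["RetransSegs"])
      if (out > 0) printf "%.4f\n", re / out; else print "0.0000"
    }
  ' "${1:-/proc/net/snmp}"
}

# Run against the live counters only where the file exists (Linux).
if [ -r /proc/net/snmp ]; then
  echo "retransmit ratio since boot: $(tcp_retrans_ratio)"
fi
```

Because the counters are cumulative, the Prometheus `rate()` version is the better signal for alerting. This one is only a quick look while logged in to a node.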

Checklist

## PMTU Blackhole Diagnosis

### Symptoms
- [ ] Large responses hang, small ones work
- [ ] Threshold around 1400-1500 bytes
- [ ] Cross-node traffic worse than same-node
- [ ] Works in non-overlay environments

### Diagnosis
- [ ] Find exact size threshold
- [ ] Check if ICMP type 3 is allowed
- [ ] Verify MTU on overlay interfaces
- [ ] Check cloud security groups for ICMP

### Fixes
- [ ] Enable MSS clamping at CNI level
- [ ] Allow ICMP type 3 in security groups
- [ ] Set explicit MTU on overlay interfaces
- [ ] Add iptables MSS rewriting rules

### Verification
- [ ] Test with large payloads after fix
- [ ] Monitor TCP retransmissions
- [ ] Verify ICMP messages are flowing

Conclusion

PMTU blackholes represent a fundamental challenge of layered networking. Each layer—physical network, overlay network, application—has its own assumptions about packet sizes. When those assumptions conflict, and the feedback mechanism (ICMP) is blocked, the result is silent failure.

The insidious nature of this problem comes from its partial success. Small requests work perfectly, which gives false confidence. Health checks pass because they have small payloads. It’s only when you hit larger payloads—often in production with real data—that the problem manifests. And when it does, there’s no error message, just a timeout.

Overlay networks amplify the problem because they add encapsulation overhead. VXLAN encapsulation adds 50 bytes. If your physical MTU is 1500, the inner packet can be at most 1450 bytes. But TCP doesn’t know about the overlay: it derives its MSS from the MTU of the interface it can see. If that interface still says 1500, TCP sends 1460-byte segments in 1500-byte packets, they get encapsulated to 1550 bytes, and they’re dropped.

The fix is straightforward once you understand the problem. MSS clamping tells TCP to use smaller segments, avoiding the need for fragmentation or PMTU discovery. Allowing ICMP type 3 lets the feedback mechanism work as designed. But the key insight is that you need to configure this proactively. By the time you’re debugging hanging connections, you’ve already lost hours or days.

Key takeaways:

- Small requests work: gives false confidence that networking is fine
- No errors appear: just timeouts, because ICMP feedback is blocked
- ICMP filtering is invisible: you can’t see what the firewall drops
- Overlay encapsulation reduces effective MTU: problem emerges only in production

The fix is simple (MSS clamping or allow ICMP), but diagnosis requires understanding the interaction between overlay networks, PMTU discovery, and firewall rules. When in doubt, enable MSS clamping at the CNI level—it’s cheap insurance against a frustrating class of problems.

