VXLAN Random Packet Drops: The Checksum Offload Trap
We chased packet drops for days before finding VXLAN checksum offload. “gRPC is flaky between nodes but works fine locally.” We spent three days blaming the application before realizing that specific nodes were silently corrupting VXLAN packets due to a NIC driver bug in checksum offload.
This was one of the most frustrating debugging sessions of my career. The application team was convinced they had a bug. The network team was convinced the application was misconfigured. Both were wrong. The problem was a subtle interaction between the kernel’s VXLAN encapsulation and the NIC’s hardware checksum offload that only manifested on certain combinations of driver version and firmware.
What made this particularly insidious was the failure pattern. Not every packet failed—maybe 1-5%. Enough to cause timeouts and errors, but not enough to trigger obvious network alarms. And only cross-node traffic failed; pods on the same node communicated perfectly. This screamed “application bug” to everyone who looked at it.
Environment: Kubernetes 1.28, Calico VXLAN mode, Intel X710 NICs (specific firmware version)
The Problem
Symptoms That Look Like Application Issues
The timeline of a typical incident looks like a software problem:
Timeline of a typical incident:
08:00 Deploy new service version
08:05 Random gRPC timeouts start appearing
08:10 Only cross-node calls fail (~5% error rate)
08:15 Restart pods - temporarily helps
08:30 Errors return, different pods affected
09:00 Blame "network instability"
What made this maddening:
- Same pod-to-pod works locally - calls within a node never failed
- Only specific source nodes - not all nodes exhibited the issue
- Intermittent pattern - not every packet, roughly 1-5%
- No errors in application logs - just timeouts
The “restart temporarily helps” pattern threw us off. When you restart a pod, it might land on a different node—one that doesn’t have the buggy NIC driver. So the problem “goes away” until traffic shifts back to a problematic node.
The Hidden Corruption
The breakthrough came when we captured packets at the receiving end:
# Capture on the receiving node showed bad checksums
tcpdump -i eth0 -vvv udp port 4789 2>&1 | grep -i checksum
# Output:
# UDP, length 1234: [bad udp cksum 0x1a2b -> 0x3c4d!]
# UDP, length 567: [bad udp cksum 0x5e6f -> 0x7890!]
# But the sender thought everything was fine!
# Because hardware was supposed to calculate checksum...
Bad UDP checksums on VXLAN packets. The receiving kernel sees a corrupted packet and drops it silently—no ICMP error, no application notification, just a dropped packet. The sender times out waiting for a response that will never come.
The sender didn’t know anything was wrong because it delegated checksum calculation to hardware. The kernel said “I’m sending this packet, please calculate the UDP checksum.” The hardware said “OK” and then calculated it wrong.
Root Cause Analysis
How Checksum Offload Breaks VXLAN
Modern NICs support “checksum offload” where the hardware calculates TCP/UDP checksums instead of the CPU. This is normally a good thing—it’s faster and reduces CPU load. But VXLAN adds a complication: there are now two sets of headers.
Normal packet flow:
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Application │────▶│ Kernel │────▶│ NIC │
│ (gRPC) │ │ (VXLAN) │ │ (TX Offload)│
└─────────────┘ └─────────────┘ └─────────────┘
│ │
Outer UDP header Should calculate
checksum = 0x0000 checksum here
(placeholder) │
▼
BUG: NIC doesn't
handle VXLAN right
What happens with the bug:
1. Kernel creates VXLAN-encapsulated packet
2. Kernel sets outer UDP checksum = 0 (let NIC calculate)
3. Kernel enables tx-udp_tnl-csum-segmentation
4. NIC calculates checksum WRONG for encapsulated packets
5. Receiving kernel sees bad checksum → drops packet
6. Application sees timeout, has no idea why
The problem is that some NIC firmware versions don’t correctly handle the “tunnel checksum” case. They were designed for simple TCP/UDP traffic, and when the kernel asks them to calculate checksums for VXLAN-encapsulated traffic, they miscalculate—either using wrong offsets or ignoring parts of the packet.
This bug is specific to certain driver/firmware combinations. The exact same NIC with different firmware might work perfectly. That’s why only specific nodes exhibited the problem—they happened to have the problematic firmware version.
Identifying Affected Nodes
To find which nodes have the problem:
#!/bin/bash
# check-offload-settings.sh
echo "=== Checking TX Offload Settings ==="
for node in $(kubectl get nodes -o name | cut -d/ -f2); do
echo -e "\n--- Node: $node ---"
# Check offload settings
kubectl debug node/$node -it --image=nicolaka/netshoot -- \
ethtool -k eth0 2>/dev/null | grep -E "(tx-udp_tnl|tx-checksum)"
done
# Look for:
# tx-udp_tnl-csum-segmentation: on <-- potential problem
# tx-udp_tnl-segmentation: on
The tx-udp_tnl-csum-segmentation feature is what tells the NIC to calculate checksums for tunneled (VXLAN) traffic. If this is on and you’re seeing checksum errors, the NIC is miscalculating.
Correlating With NIC Driver Version
The bug is usually in specific driver/firmware combinations:
# Find which nodes have the problematic driver
kubectl get nodes -o json | jq -r '.items[] | .metadata.name' | while read node; do
echo "=== $node ==="
kubectl debug node/$node -it --image=busybox -- \
cat /sys/class/net/eth0/device/driver/module/version 2>/dev/null
kubectl debug node/$node -it --image=busybox -- \
ethtool -i eth0 2>/dev/null | grep -E "(driver|version|firmware)"
done
# Example problematic output:
# driver: i40e
# version: 2.14.13
# firmware-version: 8.30 0x8000af86 <-- this specific combo is buggy
Document which driver/firmware combinations are affected. This helps you target fixes and predict which new nodes might have the same problem.
The Fix
Immediate Mitigation
Disable the problematic offload immediately:
# Disable the problematic offload on affected nodes
# Run on each affected node:
ethtool -K eth0 tx-udp_tnl-csum-segmentation off
ethtool -K eth0 tx-udp_tnl-segmentation off
# Verify
ethtool -k eth0 | grep tx-udp_tnl
# tx-udp_tnl-segmentation: off
# tx-udp_tnl-csum-segmentation: off
This tells the NIC to stop handling tunnel checksums. The kernel will calculate checksums in software instead. There’s a slight CPU cost, but it’s negligible compared to the corruption problems.
The effect is immediate—packets start flowing correctly within seconds.
Persistent Fix via DaemonSet
The ethtool settings don’t persist across reboots. Deploy a DaemonSet to apply them automatically:
# nic-offload-fix.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: nic-offload-fix
namespace: kube-system
spec:
selector:
matchLabels:
app: nic-offload-fix
template:
metadata:
labels:
app: nic-offload-fix
spec:
hostNetwork: true
hostPID: true
nodeSelector:
# Only run on nodes with known-bad NIC
node.kubernetes.io/nic-type: "i40e"
initContainers:
- name: disable-offload
image: alpine
securityContext:
privileged: true
command:
- /bin/sh
- -c
- |
apk add --no-cache ethtool
# Find the primary interface
IFACE=$(ip route | grep default | awk '{print $5}')
echo "Disabling tx-udp_tnl offloads on $IFACE"
ethtool -K $IFACE tx-udp_tnl-csum-segmentation off || true
ethtool -K $IFACE tx-udp_tnl-segmentation off || true
echo "Done. Current settings:"
ethtool -k $IFACE | grep tx-udp_tnl
containers:
- name: pause
image: gcr.io/google_containers/pause:3.2
tolerations:
- operator: Exists
The nodeSelector ensures this only runs on nodes you’ve identified as problematic. Label those nodes accordingly.
Long-Term: Firmware Update
The proper fix is updating the NIC firmware to a version that handles tunnel checksums correctly:
# Check if firmware update fixes the issue
# Intel provides NVM update tool for X710
# Download from Intel support
wget https://downloadmirror.intel.com/.../700Series_NVMUpdatePackage_v8_40_Linux.tar.gz
# Apply update (requires maintenance window)
tar xzf 700Series_NVMUpdatePackage_v8_40_Linux.tar.gz
cd 700Series/Linux_x64
./nvmupdate64e -u -l -o update.log
# Reboot node
# Re-enable offloads and test
ethtool -K eth0 tx-udp_tnl-csum-segmentation on
After the firmware update, re-enable offloads and test thoroughly. If checksums are now correct, you can remove the DaemonSet workaround for that node.
Detection and Monitoring
Prometheus Metrics for Bad Checksums
Set up monitoring to catch this problem before it causes outages:
# Note: This requires node_exporter with textfile collector
# Create a script that runs periodically
# /etc/node_exporter/scripts/nic_checksum_errors.sh
#!/bin/bash
IFACE="eth0"
RX_ERRORS=$(ethtool -S $IFACE 2>/dev/null | grep rx_csum_offload_errors | awk '{print $2}')
echo "# HELP nic_rx_csum_errors NIC RX checksum errors"
echo "# TYPE nic_rx_csum_errors counter"
echo "nic_rx_csum_errors{interface=\"$IFACE\"} ${RX_ERRORS:-0}"
This exports checksum error counts from the NIC’s statistics. A rising count indicates the problem is occurring.
Alert Rules
groups:
- name: network-offload
rules:
- alert: NICChecksumErrors
expr: |
rate(nic_rx_csum_errors[5m]) > 0
for: 5m
labels:
severity: warning
annotations:
summary: "NIC {{ $labels.instance }} showing checksum errors"
description: "May indicate TX offload bug on sending nodes"
- alert: CrossNodeGRPCFailures
expr: |
(
sum by (source_node, dest_node) (
rate(grpc_client_handled_total{grpc_code!="OK"}[5m])
)
/
sum by (source_node, dest_node) (
rate(grpc_client_handled_total[5m])
)
) > 0.01
for: 5m
labels:
severity: warning
annotations:
summary: "Cross-node gRPC errors between {{ $labels.source_node }} and {{ $labels.dest_node }}"
The cross-node gRPC error alert catches the symptom even if you don’t have checksum metrics. If errors are consistently higher between specific node pairs, investigate the network path.
Debugging Playbook
When you suspect this issue, follow this systematic approach:
#!/bin/bash
# vxlan-debug.sh - Run when you suspect this issue
echo "=== Step 1: Identify failing paths ==="
# From a failing pod, trace where packets go
kubectl exec -it $POD -- traceroute -n $DEST_POD_IP
echo "=== Step 2: Capture on destination node ==="
# On the node where destination pod runs
tcpdump -i any -nn udp port 4789 -c 100 -w /tmp/vxlan.pcap
echo "=== Step 3: Analyze checksums ==="
tcpdump -r /tmp/vxlan.pcap -vvv 2>&1 | grep -c "bad.*cksum"
echo "=== Step 4: Check source node offload ==="
# On the source node
ethtool -k eth0 | grep -E "(tx-udp_tnl|segmentation)"
echo "=== Step 5: Test with offload disabled ==="
ethtool -K eth0 tx-udp_tnl-csum-segmentation off
# Retry the failing requests
# If they work, you found the issue
Step 5 is the definitive test. If disabling offload makes the errors go away, you’ve confirmed the root cause.
Checklist
## VXLAN Packet Drop Diagnosis
### Symptoms
- [ ] Cross-node traffic fails intermittently
- [ ] Local (same-node) traffic works fine
- [ ] Specific source nodes are worse
- [ ] No application-level errors, just timeouts
### Diagnosis
- [ ] Capture packets on receiving end
- [ ] Check for "bad udp cksum" in tcpdump
- [ ] Identify NIC driver/firmware on bad nodes
- [ ] Compare ethtool offload settings
### Fix
- [ ] Disable tx-udp_tnl-csum-segmentation
- [ ] Deploy DaemonSet for persistence
- [ ] Plan firmware updates
- [ ] Add monitoring for checksum errors
Conclusion
This failure mode is particularly nasty because it looks like application instability:
- Looks like application instability - timeouts, flakiness, “works locally”
- Only affects cross-node traffic - overlay encapsulation triggers the bug
- Specific to NIC/driver combos - not every node is affected
- Silent corruption - no errors until packets arrive and checksums don’t match
The fix is simple (ethtool -K to disable offload), but finding the root cause requires packet-level debugging that most teams don’t do for “intermittent gRPC issues.”
The lesson: when you see cross-node networking failures that don’t affect local traffic, always check NIC offload settings. And when you deploy new hardware or drivers, test VXLAN checksum offload explicitly before putting nodes into production.
Related Articles
- Kubernetes DNS Caching - DNS issues affecting cross-node traffic
- gRPC Load Balancing in Kubernetes - gRPC connection management
Related posts
PMTU Blackholes: When Only Large Responses Hang
Small API responses work, large ones hang forever. The cause: ICMP 'Fragmentation Needed' messages filtered by firewalls, breaking Path MTU Discovery in overlay networks.
kube-proxy Micro-Outages: The xtables Lock Contention Problem
Random 1-3 second connection drops during deployments. CPU looks fine, memory is stable. The hidden cause: iptables-restore grabbing the xtables lock while endpoints churn.
Ephemeral Port Exhaustion: The Node That 'Goes Bad'
A single Kubernetes node starts failing connections to external services while pods look healthy. The hidden cause: sidecar proxies exhausting ephemeral ports with short-lived connections.
Traffic Hitting Dead Pods: Conntrack's Stale NAT Mapping
Deploy causes 503s for exactly 2 minutes. The issue: conntrack keeps NAT mappings to old pod IPs even after Kubernetes removes endpoints, sending traffic to dead pods.
Cite this article
If you reference this post, please link to the original URL and credit the author.