Back to blog

VXLAN Random Packet Drops: The Checksum Offload Trap

|
| kubernetes, networking, vxlan, debugging, nic, overlay-networks

We chased packet drops for days before finding VXLAN checksum offload. “gRPC is flaky between nodes but works fine locally.” We spent three days blaming the application before realizing that specific nodes were silently corrupting VXLAN packets due to a NIC driver bug in checksum offload.

This was one of the most frustrating debugging sessions of my career. The application team was convinced they had a bug. The network team was convinced the application was misconfigured. Both were wrong. The problem was a subtle interaction between the kernel’s VXLAN encapsulation and the NIC’s hardware checksum offload that only manifested on certain combinations of driver version and firmware.

What made this particularly insidious was the failure pattern. Not every packet failed—maybe 1-5%. Enough to cause timeouts and errors, but not enough to trigger obvious network alarms. And only cross-node traffic failed; pods on the same node communicated perfectly. This screamed “application bug” to everyone who looked at it.

Environment: Kubernetes 1.28, Calico VXLAN mode, Intel X710 NICs (specific firmware version)

The Problem

Symptoms That Look Like Application Issues

The timeline of a typical incident looks like a software problem:

Timeline of a typical incident:

08:00  Deploy new service version
08:05  Random gRPC timeouts start appearing
08:10  Only cross-node calls fail (~5% error rate)
08:15  Restart pods - temporarily helps
08:30  Errors return, different pods affected
09:00  Blame "network instability"

What made this maddening:

  • Same pod-to-pod works locally - calls within a node never failed
  • Only specific source nodes - not all nodes exhibited the issue
  • Intermittent pattern - not every packet, roughly 1-5%
  • No errors in application logs - just timeouts

The “restart temporarily helps” pattern threw us off. When you restart a pod, it might land on a different node—one that doesn’t have the buggy NIC driver. So the problem “goes away” until traffic shifts back to a problematic node.

The Hidden Corruption

The breakthrough came when we captured packets at the receiving end:

# Capture on the receiving node showed bad checksums
tcpdump -i eth0 -vvv udp port 4789 2>&1 | grep -i checksum

# Output:
# UDP, length 1234: [bad udp cksum 0x1a2b -> 0x3c4d!]
# UDP, length 567: [bad udp cksum 0x5e6f -> 0x7890!]

# But the sender thought everything was fine!
# Because hardware was supposed to calculate checksum...

Bad UDP checksums on VXLAN packets. The receiving kernel sees a corrupted packet and drops it silently—no ICMP error, no application notification, just a dropped packet. The sender times out waiting for a response that will never come.

The sender didn’t know anything was wrong because it delegated checksum calculation to hardware. The kernel said “I’m sending this packet, please calculate the UDP checksum.” The hardware said “OK” and then calculated it wrong.

Root Cause Analysis

How Checksum Offload Breaks VXLAN

Modern NICs support “checksum offload” where the hardware calculates TCP/UDP checksums instead of the CPU. This is normally a good thing—it’s faster and reduces CPU load. But VXLAN adds a complication: there are now two sets of headers.

Normal packet flow:
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│ Application │────▶│   Kernel    │────▶│     NIC     │
│   (gRPC)    │     │  (VXLAN)    │     │ (TX Offload)│
└─────────────┘     └─────────────┘     └─────────────┘
                           │                    │
                    Outer UDP header      Should calculate
                    checksum = 0x0000     checksum here
                    (placeholder)              │

                                         BUG: NIC doesn't
                                         handle VXLAN right

What happens with the bug:
1. Kernel creates VXLAN-encapsulated packet
2. Kernel sets outer UDP checksum = 0 (let NIC calculate)
3. Kernel enables tx-udp_tnl-csum-segmentation
4. NIC calculates checksum WRONG for encapsulated packets
5. Receiving kernel sees bad checksum → drops packet
6. Application sees timeout, has no idea why

The problem is that some NIC firmware versions don’t correctly handle the “tunnel checksum” case. They were designed for simple TCP/UDP traffic, and when the kernel asks them to calculate checksums for VXLAN-encapsulated traffic, they miscalculate—either using wrong offsets or ignoring parts of the packet.

This bug is specific to certain driver/firmware combinations. The exact same NIC with different firmware might work perfectly. That’s why only specific nodes exhibited the problem—they happened to have the problematic firmware version.

Identifying Affected Nodes

To find which nodes have the problem:

#!/bin/bash
# check-offload-settings.sh

echo "=== Checking TX Offload Settings ==="

for node in $(kubectl get nodes -o name | cut -d/ -f2); do
  echo -e "\n--- Node: $node ---"

  # Check offload settings
  kubectl debug node/$node -it --image=nicolaka/netshoot -- \
    ethtool -k eth0 2>/dev/null | grep -E "(tx-udp_tnl|tx-checksum)"
done

# Look for:
# tx-udp_tnl-csum-segmentation: on  <-- potential problem
# tx-udp_tnl-segmentation: on

The tx-udp_tnl-csum-segmentation feature is what tells the NIC to calculate checksums for tunneled (VXLAN) traffic. If this is on and you’re seeing checksum errors, the NIC is miscalculating.

Correlating With NIC Driver Version

The bug is usually in specific driver/firmware combinations:

# Find which nodes have the problematic driver
kubectl get nodes -o json | jq -r '.items[] | .metadata.name' | while read node; do
  echo "=== $node ==="
  kubectl debug node/$node -it --image=busybox -- \
    cat /sys/class/net/eth0/device/driver/module/version 2>/dev/null
  kubectl debug node/$node -it --image=busybox -- \
    ethtool -i eth0 2>/dev/null | grep -E "(driver|version|firmware)"
done

# Example problematic output:
# driver: i40e
# version: 2.14.13
# firmware-version: 8.30 0x8000af86  <-- this specific combo is buggy

Document which driver/firmware combinations are affected. This helps you target fixes and predict which new nodes might have the same problem.

The Fix

Immediate Mitigation

Disable the problematic offload immediately:

# Disable the problematic offload on affected nodes
# Run on each affected node:

ethtool -K eth0 tx-udp_tnl-csum-segmentation off
ethtool -K eth0 tx-udp_tnl-segmentation off

# Verify
ethtool -k eth0 | grep tx-udp_tnl
# tx-udp_tnl-segmentation: off
# tx-udp_tnl-csum-segmentation: off

This tells the NIC to stop handling tunnel checksums. The kernel will calculate checksums in software instead. There’s a slight CPU cost, but it’s negligible compared to the corruption problems.

The effect is immediate—packets start flowing correctly within seconds.

Persistent Fix via DaemonSet

The ethtool settings don’t persist across reboots. Deploy a DaemonSet to apply them automatically:

# nic-offload-fix.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nic-offload-fix
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: nic-offload-fix
  template:
    metadata:
      labels:
        app: nic-offload-fix
    spec:
      hostNetwork: true
      hostPID: true
      nodeSelector:
        # Only run on nodes with known-bad NIC
        node.kubernetes.io/nic-type: "i40e"
      initContainers:
        - name: disable-offload
          image: alpine
          securityContext:
            privileged: true
          command:
            - /bin/sh
            - -c
            - |
              apk add --no-cache ethtool
              # Find the primary interface
              IFACE=$(ip route | grep default | awk '{print $5}')
              echo "Disabling tx-udp_tnl offloads on $IFACE"
              ethtool -K $IFACE tx-udp_tnl-csum-segmentation off || true
              ethtool -K $IFACE tx-udp_tnl-segmentation off || true
              echo "Done. Current settings:"
              ethtool -k $IFACE | grep tx-udp_tnl
      containers:
        - name: pause
          image: gcr.io/google_containers/pause:3.2
      tolerations:
        - operator: Exists

The nodeSelector ensures this only runs on nodes you’ve identified as problematic. Label those nodes accordingly.

Long-Term: Firmware Update

The proper fix is updating the NIC firmware to a version that handles tunnel checksums correctly:

# Check if firmware update fixes the issue
# Intel provides NVM update tool for X710

# Download from Intel support
wget https://downloadmirror.intel.com/.../700Series_NVMUpdatePackage_v8_40_Linux.tar.gz

# Apply update (requires maintenance window)
tar xzf 700Series_NVMUpdatePackage_v8_40_Linux.tar.gz
cd 700Series/Linux_x64
./nvmupdate64e -u -l -o update.log

# Reboot node
# Re-enable offloads and test
ethtool -K eth0 tx-udp_tnl-csum-segmentation on

After the firmware update, re-enable offloads and test thoroughly. If checksums are now correct, you can remove the DaemonSet workaround for that node.

Detection and Monitoring

Prometheus Metrics for Bad Checksums

Set up monitoring to catch this problem before it causes outages:

# Note: This requires node_exporter with textfile collector
# Create a script that runs periodically

# /etc/node_exporter/scripts/nic_checksum_errors.sh
#!/bin/bash
IFACE="eth0"
RX_ERRORS=$(ethtool -S $IFACE 2>/dev/null | grep rx_csum_offload_errors | awk '{print $2}')
echo "# HELP nic_rx_csum_errors NIC RX checksum errors"
echo "# TYPE nic_rx_csum_errors counter"
echo "nic_rx_csum_errors{interface=\"$IFACE\"} ${RX_ERRORS:-0}"

This exports checksum error counts from the NIC’s statistics. A rising count indicates the problem is occurring.

Alert Rules

groups:
  - name: network-offload
    rules:
      - alert: NICChecksumErrors
        expr: |
          rate(nic_rx_csum_errors[5m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "NIC {{ $labels.instance }} showing checksum errors"
          description: "May indicate TX offload bug on sending nodes"

      - alert: CrossNodeGRPCFailures
        expr: |
          (
            sum by (source_node, dest_node) (
              rate(grpc_client_handled_total{grpc_code!="OK"}[5m])
            )
            /
            sum by (source_node, dest_node) (
              rate(grpc_client_handled_total[5m])
            )
          ) > 0.01
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Cross-node gRPC errors between {{ $labels.source_node }} and {{ $labels.dest_node }}"

The cross-node gRPC error alert catches the symptom even if you don’t have checksum metrics. If errors are consistently higher between specific node pairs, investigate the network path.

Debugging Playbook

When you suspect this issue, follow this systematic approach:

#!/bin/bash
# vxlan-debug.sh - Run when you suspect this issue

echo "=== Step 1: Identify failing paths ==="
# From a failing pod, trace where packets go
kubectl exec -it $POD -- traceroute -n $DEST_POD_IP

echo "=== Step 2: Capture on destination node ==="
# On the node where destination pod runs
tcpdump -i any -nn udp port 4789 -c 100 -w /tmp/vxlan.pcap

echo "=== Step 3: Analyze checksums ==="
tcpdump -r /tmp/vxlan.pcap -vvv 2>&1 | grep -c "bad.*cksum"

echo "=== Step 4: Check source node offload ==="
# On the source node
ethtool -k eth0 | grep -E "(tx-udp_tnl|segmentation)"

echo "=== Step 5: Test with offload disabled ==="
ethtool -K eth0 tx-udp_tnl-csum-segmentation off
# Retry the failing requests
# If they work, you found the issue

Step 5 is the definitive test. If disabling offload makes the errors go away, you’ve confirmed the root cause.

Checklist

## VXLAN Packet Drop Diagnosis

### Symptoms
- [ ] Cross-node traffic fails intermittently
- [ ] Local (same-node) traffic works fine
- [ ] Specific source nodes are worse
- [ ] No application-level errors, just timeouts

### Diagnosis
- [ ] Capture packets on receiving end
- [ ] Check for "bad udp cksum" in tcpdump
- [ ] Identify NIC driver/firmware on bad nodes
- [ ] Compare ethtool offload settings

### Fix
- [ ] Disable tx-udp_tnl-csum-segmentation
- [ ] Deploy DaemonSet for persistence
- [ ] Plan firmware updates
- [ ] Add monitoring for checksum errors

Conclusion

This failure mode is particularly nasty because it looks like application instability:

  1. Looks like application instability - timeouts, flakiness, “works locally”
  2. Only affects cross-node traffic - overlay encapsulation triggers the bug
  3. Specific to NIC/driver combos - not every node is affected
  4. Silent corruption - no errors until packets arrive and checksums don’t match

The fix is simple (ethtool -K to disable offload), but finding the root cause requires packet-level debugging that most teams don’t do for “intermittent gRPC issues.”

The lesson: when you see cross-node networking failures that don’t affect local traffic, always check NIC offload settings. And when you deploy new hardware or drivers, test VXLAN checksum offload explicitly before putting nodes into production.


Related posts

Cite this article

If you reference this post, please link to the original URL and credit the author.

Michal Drozd. "VXLAN Random Packet Drops: The Checksum Offload Trap". https://www.michal-drozd.com/en/blog/vxlan-checksum-offload-packet-drops/ (Published October 21, 2024).