Linux ARP Cache Stale Entries: Failover Traffic Blackhole
We failed over cleanly, then traffic disappeared into a stale ARP cache. “Load balancer failover completed but traffic still going to dead node.” The monitoring showed the database primary had switched to the standby node five minutes ago. Keepalived reported the VIP was active on the new primary. The new primary was receiving connections. But half our application servers were timing out, unable to connect to the database.
The cause was something that doesn’t appear in any application log or monitoring dashboard: the Linux ARP cache. Each server had cached the MAC address of the old primary and kept sending packets to that MAC address even after the IP moved to a different machine. The packets went to a server that was either powered off or no longer owned that IP: a network blackhole where packets enter and never return.
This incident taught me that network failover isn’t just about IP addresses. It’s about the entire Layer 2 to Layer 3 mapping that every host maintains in its ARP cache. When you move an IP to a new server, you’re changing which MAC address should receive traffic for that IP. And every client that has the old mapping cached will continue sending traffic to the old destination until its cache entry expires or is forcibly updated.
What made this particularly frustrating was the inconsistency. Some application servers worked fine—they happened to have empty ARP caches or had recently refreshed them. Others failed completely. The “some work, some don’t” pattern initially led us down the wrong debugging path, suspecting application bugs or load balancer misconfiguration.
Environment: Linux servers, VRRP/keepalived/floating IPs, database failovers, load balancer migrations
The Problem
Traffic Blackhole After Failover
Database failover timeline:
T+0:00  Primary (10.0.1.10, MAC aa:bb:cc:11:22:33) is healthy
        App servers have ARP: 10.0.1.10 → aa:bb:cc:11:22:33
T+0:30  Primary fails, keepalived triggers failover
T+0:31  Secondary takes over VIP 10.0.1.10
        Secondary MAC: aa:bb:cc:44:55:66
        Secondary sends gratuitous ARP
T+0:32  Some app servers update ARP cache ✓
T+0:33  Other app servers still have old cache:
        10.0.1.10 → aa:bb:cc:11:22:33 (dead MAC!)
T+0:35  Traffic from stale-cache servers → blackhole
        Packets sent to old MAC, never reach new primary
        Connection timeouts, application errors
T+5:00  ARP cache expires, gets refreshed
        Finally works correctly
Why Gratuitous ARP Doesn’t Always Work
You might be thinking, “Doesn’t gratuitous ARP solve this?” In theory, yes. When a server takes over a VIP, it broadcasts a gratuitous ARP—an unsolicited ARP reply that says “this IP is now at my MAC address.” Every host on the network should update their cache.
In practice, gratuitous ARP is unreliable for several reasons. First, it’s a broadcast, and broadcasts don’t always reach everywhere you expect. Switches might not flood to all ports. Network firewalls might drop ARP packets. VLANs might not propagate broadcasts correctly. Second, Linux has conservative ARP cache update rules—it doesn’t blindly accept unsolicited updates, especially for entries it considers “confirmed.”
Gratuitous ARP propagation issues:
┌─────────────────────────────────────────────────────────────┐
│ Secondary sends: "10.0.1.10 is at aa:bb:cc:44:55:66"        │
│                                                             │
│ This is a broadcast, but...                                 │
│                                                             │
│ ✗ Switches may not flood to all ports                       │
│ ✗ Some hosts ignore unsolicited ARP updates                 │
│ ✗ Linux gc_stale_time prevents immediate update             │
│ ✗ Network firewalls may drop ARP broadcasts                 │
│ ✗ VLANs may not propagate gratuitous ARP correctly          │
└─────────────────────────────────────────────────────────────┘
Linux ARP cache behavior:
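- gc_stale_time: 60s default - how long an unused stale entry lingers before garbage collection
- base_reachable_time_ms: 30000 ms default - an entry stays REACHABLE for a randomized period around this value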
- arp_accept: 0 default - per the kernel documentation, a gratuitous ARP updates an existing entry but does not create a new one
Root Cause
Linux ARP Cache States
Understanding the ARP cache state machine is essential for debugging these issues. Linux doesn’t simply have “cached” and “not cached” entries. Each ARP entry goes through a lifecycle of states, and critically, even “stale” entries are still used for sending traffic.
The key insight is that STALE doesn’t mean “don’t use.” It means “use, but verify soon.” When you send a packet to a STALE entry, Linux transmits the packet using the cached MAC address and moves the entry to DELAY. If nothing confirms reachability, the entry moves on to PROBE and the kernel sends unicast ARP requests to the cached MAC. Only if those probes go unanswered does the entry move to FAILED. But during that transition, which can take 60+ seconds, traffic continues flowing to the old, possibly dead, MAC address.
This design makes sense for normal operation: you don’t want to delay every packet while waiting for ARP verification. But during failover, it means your servers stubbornly continue sending traffic to the old primary for up to two minutes.
ARP entry lifecycle:
INCOMPLETE → REACHABLE → STALE → DELAY → PROBE → FAILED
                 ↑                       │
                 └───────────────────────┘ (or refresh)
State descriptions:
- REACHABLE: Recently confirmed, used directly
- STALE: Not confirmed recently, but still used (!)
- DELAY: About to probe for confirmation
- PROBE: Actively probing (unicast ARP)
- FAILED: Unreachable
The problem: STALE entries are USED, not refreshed immediately!
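To make the "STALE is still used" point concrete, here is a minimal sketch that filters `ip neigh show` output down to the entries whose cached MAC the kernel will still transmit to. The function names are my own, and I am assuming the state is the last field on each line, which holds for plain `ip neigh show` output without `-s`.

```shell
#!/bin/sh
# Sketch: list neighbor entries whose cached MAC the kernel will still use.

# Return 0 for states in which packets are sent to the cached MAC.
# PERMANENT and NOARP are not discussed above but appear in `ip neigh` output.
neigh_state_uses_cached_mac() {
    case "$1" in
        REACHABLE|STALE|DELAY|PROBE|PERMANENT|NOARP) return 0 ;;
        *) return 1 ;;  # INCOMPLETE, FAILED, or unrecognized
    esac
}

# Read `ip neigh show` output on stdin; print "IP MAC STATE" per usable entry.
list_entries_in_use() {
    while read -r ip rest; do
        case " $rest " in
            *" lladdr "*) ;;      # entry has a cached MAC
            *) continue ;;        # e.g. FAILED or INCOMPLETE lines
        esac
        mac=${rest#*lladdr }      # drop everything up to and including "lladdr "
        mac=${mac%% *}            # keep only the MAC itself
        state=${rest##* }         # state is the last field
        if neigh_state_uses_cached_mac "$state"; then
            printf '%s %s %s\n' "$ip" "$mac" "$state"
        fi
    done
}
```

Run it as `ip neigh show dev eth0 | list_entries_in_use`. A STALE line for the VIP still shows up in the output, which is exactly the trap: the entry looks old, but traffic is still being sent to it.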
Default Timeouts
# Check ARP cache parameters
sysctl net.ipv4.neigh.default.gc_stale_time
# 60 seconds - how long before STALE is garbage collected
sysctl net.ipv4.neigh.default.base_reachable_time_ms
# 30000 ms - base REACHABLE lifetime (actual value is randomized around it)
sysctl net.ipv4.neigh.default.gc_thresh3
# 1024 - hard maximum number of cache entries
# Total time entry can be stale but used: gc_stale_time + probing
# Can be 60-120 seconds of traffic to wrong destination!
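The arithmetic behind that estimate can be sketched as follows. I am assuming "probing" means the DELAY wait (`delay_first_probe_time`) plus `ucast_solicit` unicast probes spaced `retrans_time_ms` apart. Those are real neighbor sysctls, but adding them to `gc_stale_time` is this article's rough upper bound, not a kernel guarantee, and the function names are hypothetical.

```shell
#!/bin/sh
# Rough worst-case window during which a stale entry may still be used,
# following the estimate above: gc_stale_time + DELAY wait + unicast probes.
# Arguments: gc_stale_time (s), delay_first_probe_time (s),
#            ucast_solicit (count), retrans_time_ms (ms)
stale_window_seconds() {
    echo $(( $1 + $2 + ($3 * $4) / 1000 ))
}

# Same estimate from the live sysctl values of one interface
# (pass an interface name such as eth0; defaults to "default").
current_stale_window() {
    base="net.ipv4.neigh.${1:-default}"
    stale_window_seconds \
        "$(sysctl -n "$base.gc_stale_time")" \
        "$(sysctl -n "$base.delay_first_probe_time")" \
        "$(sysctl -n "$base.ucast_solicit")" \
        "$(sysctl -n "$base.retrans_time_ms")"
}
```

With a 60 second `gc_stale_time`, a 5 second delay, and 3 probes one second apart, `stale_window_seconds 60 5 3 1000` prints 68. Halving `gc_stale_time` to 30 brings the same estimate down to 38.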
Diagnosis
Check ARP Cache State
# View ARP cache with state
ip neigh show
# Output:
# 10.0.1.10 dev eth0 lladdr aa:bb:cc:11:22:33 STALE
# ^^^^^ Problem!
# Watch ARP cache changes
ip monitor neigh
# Check specific entry
ip neigh show 10.0.1.10
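When you already know which MAC should own the VIP after failover, the check can be scripted. This is a minimal sketch with hypothetical function names. It reads `ip neigh show` output on stdin and compares the cached MAC against the one you expect, which you have to supply yourself (for example from the new primary's interface).

```shell
#!/bin/sh
# Print the cached MAC (lowercased) for one IP from `ip neigh show` output on stdin.
cached_mac_for() {
    awk -v ip="$1" '$1 == ip {
        for (i = 2; i < NF; i++)
            if ($i == "lladdr") { print tolower($(i + 1)); exit }
    }'
}

# Compare the cached MAC for a VIP with the expected one.
# Exit status: 0 = match, 1 = mismatch (stale entry), 2 = no usable entry.
check_vip_mac() {
    vip=$1
    expected=$(printf '%s' "$2" | tr 'A-F' 'a-f')
    cached=$(cached_mac_for "$vip")
    if [ -z "$cached" ]; then
        echo "no usable entry for $vip"
        return 2
    elif [ "$cached" = "$expected" ]; then
        echo "ok: $vip -> $cached"
        return 0
    else
        echo "MISMATCH: $vip -> $cached (expected $expected)"
        return 1
    fi
}
```

Usage on an app server: `ip neigh show dev eth0 | check_vip_mac 10.0.1.10 aa:bb:cc:44:55:66`. A non-zero exit status makes it easy to run across a fleet and collect only the hosts that still point at the old MAC.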
Verify Traffic Path
# Check if packets are reaching the right destination
# (-e prints the link-level header, so the destination MAC is visible)
tcpdump -i eth0 -n -e host 10.0.1.10
# You'll see traffic going to WRONG MAC:
# 12:00:01 IP app-server > 10.0.1.10: TCP...
# Frame dst: aa:bb:cc:11:22:33 (old MAC, dead server!)
# Correct after refresh:
# Frame dst: aa:bb:cc:44:55:66 (new MAC, active server)
Check Gratuitous ARP Reception
# On app server, watch for gratuitous ARP
tcpdump -i eth0 -n arp
# Should see:
# 12:00:30 ARP, Reply 10.0.1.10 is-at aa:bb:cc:44:55:66
# But if not seen, check:
# - Network path (VLANs, firewalls)
# - Switch flooding behavior
# - Kernel ignoring unsolicited ARP
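On a busy segment, plain `arp` captures are noisy. A narrower capture filter can isolate gratuitous ARP for the VIP. The sketch below builds such a filter. It relies on the Ethernet/IPv4 ARP layout, where the sender IP sits at byte offset 14 and the target IP at offset 24 of the ARP payload, and on gratuitous ARP setting both to the same address. The function name is my own.

```shell
#!/bin/sh
# Build a pcap filter matching only gratuitous ARP frames for one IP.
# arp[14:4] is the sender IP and arp[24:4] the target IP in an
# Ethernet/IPv4 ARP payload; a gratuitous ARP carries the same address in both.
garp_filter() {
    printf 'arp host %s and arp[14:4] = arp[24:4]' "$1"
}

# Usage (needs root):
#   tcpdump -i eth0 -n -e "$(garp_filter 10.0.1.10)"
```

If this capture stays silent on an app server during a failover test while the new primary is announcing, the problem is in the network path rather than in the host's ARP settings.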
The Fix
Option 1: Reduce ARP Cache Timeouts
# Reduce time entries stay STALE
sysctl -w net.ipv4.neigh.default.gc_stale_time=30
sysctl -w net.ipv4.neigh.eth0.gc_stale_time=30
# Reduce REACHABLE time
sysctl -w net.ipv4.neigh.default.base_reachable_time_ms=15000
# Make persistent in /etc/sysctl.d/arp.conf:
net.ipv4.neigh.default.gc_stale_time = 30
net.ipv4.neigh.default.base_reachable_time_ms = 15000
Option 2: Accept Gratuitous ARP Updates
# Enable accepting unsolicited ARP updates
sysctl -w net.ipv4.conf.all.arp_accept=1
sysctl -w net.ipv4.conf.eth0.arp_accept=1
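# This lets Linux create a new cache entry from a gratuitous ARP.
# Per the kernel's ip-sysctl documentation, entries that already exist
# are updated by gratuitous ARP regardless of this setting.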
# Persistent:
net.ipv4.conf.all.arp_accept = 1
Option 3: Flush ARP Cache on Failover
#!/bin/bash
# failover_arp_flush.sh - Run on client servers after failover
VIP="10.0.1.10"
# Delete specific ARP entry
ip neigh del "$VIP" dev eth0 2>/dev/null
# Or flush and let it re-learn
ip neigh flush to "$VIP" dev eth0
# Verify
ip neigh show "$VIP"
# Ansible playbook to flush ARP on failover
---
- name: Flush stale ARP entries
  hosts: app_servers
  become: yes  # ip neigh del needs root
  tasks:
    - name: Delete VIP ARP entry
      command: ip neigh del {{ vip }} dev eth0
      ignore_errors: yes
    - name: Force ARP refresh with ping
      command: ping -c 1 {{ vip }}
      ignore_errors: yes
Option 4: Send Multiple Gratuitous ARPs
# In keepalived.conf - send more gratuitous ARPs
vrrp_instance VI_1 {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 100
    # Send gratuitous ARP multiple times
    garp_master_delay 1
    garp_master_repeat 5          # Send 5 times
    garp_master_refresh 30        # Repeat every 30s
    garp_master_refresh_repeat 2
    virtual_ipaddress {
        10.0.1.10/24
    }
}
Option 5: Use arping for Active Refresh
#!/bin/bash
# active_arp_announce.sh - Run on new primary after failover
VIP="10.0.1.10"
INTERFACE="eth0"
# Send gratuitous ARP flood (flags are for iputils arping)
for i in {1..10}; do
    arping -U -c 1 -I "$INTERFACE" "$VIP"
    arping -A -c 1 -I "$INTERFACE" "$VIP"
    sleep 0.1
done
# -U: Unsolicited ARP mode, sends ARP request packets (gratuitous)
# -A: Same as -U, but sends ARP reply packets (some systems prefer this)
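One way to run the announce script automatically is keepalived's `notify` directive, which invokes a script with the transition type, the instance name, and the target state as arguments. Below is a minimal sketch of such a hook. The install path in `ANNOUNCE_CMD` and the function name are assumptions for illustration, not something from our actual setup.

```shell
#!/bin/sh
# Hypothetical keepalived notify hook.
# keepalived calls it as: <script> TYPE NAME STATE [PRIORITY]
# ANNOUNCE_CMD is an assumed install path for the announce script above.
ANNOUNCE_CMD=${ANNOUNCE_CMD:-/usr/local/bin/active_arp_announce.sh}

handle_vrrp_transition() {
    name=$2
    state=$3
    case "$state" in
        MASTER)
            echo "instance $name became MASTER, announcing VIP"
            # Left unquoted on purpose so the command may carry arguments.
            $ANNOUNCE_CMD
            ;;
        BACKUP|FAULT)
            echo "instance $name is $state, nothing to announce"
            ;;
        *)
            echo "unknown state: $state" >&2
            return 1
            ;;
    esac
}

# When keepalived runs this file as a script, dispatch on its arguments.
if [ "$#" -ge 3 ]; then
    handle_vrrp_transition "$@"
fi
```

In `keepalived.conf` this would be referenced from the `vrrp_instance` block with a line such as `notify /usr/local/bin/vrrp_notify.sh`, where the path is again an assumption. The hook only adds announcements on top of keepalived's own `garp_master_*` behavior and does not replace it.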
Monitoring
# Metric names below assume a custom exporter; adapt them to what you collect
groups:
  - name: arp-cache
    rules:
      - alert: ARPCacheStaleVIP
        expr: |
          time() - arp_entry_last_confirmed_seconds{ip=~"10.0.1.*"} > 120
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Stale ARP entry for VIP {{ $labels.ip }}"
      - alert: FailoverTrafficBlackhole
        expr: |
          rate(tcp_retransmits_total{dest=~"10.0.1.*"}[1m]) > 100
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High TCP retransmits to VIP - possible ARP issue"
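The metrics in those rules have to come from somewhere. One low-effort option is a script that writes Prometheus text format for node_exporter's textfile collector. The sketch below converts `ip neigh show` output into a per-entry gauge. The metric name `arp_entry_state` and the function name are hypothetical, and it exposes state rather than a last-confirmed timestamp, so an alert built on it would match on the `state` and `mac` labels instead.

```shell
#!/bin/sh
# Convert `ip neigh show` output (stdin) into Prometheus text format.
# arp_entry_state is a hypothetical metric name chosen for this sketch.
neigh_to_metrics() {
    awk '
        BEGIN { print "# TYPE arp_entry_state gauge" }
        NF >= 2 {
            mac = ""; dev = ""
            for (i = 2; i < NF; i++) {
                if ($i == "lladdr") mac = $(i + 1)
                if ($i == "dev")    dev = $(i + 1)
            }
            # The state is the last field of each line.
            printf "arp_entry_state{ip=\"%s\",dev=\"%s\",mac=\"%s\",state=\"%s\"} 1\n", $1, dev, mac, $NF
        }'
}

# Usage (output directory is whatever --collector.textfile.directory points at):
#   ip neigh show | neigh_to_metrics > "$DIR/arp.prom.tmp" && mv "$DIR/arp.prom.tmp" "$DIR/arp.prom"
```

Writing to a temporary file and renaming it keeps the collector from reading a half-written file.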
Checklist
## Linux ARP Cache Failover
### Before Failover
- [ ] Configure reduced gc_stale_time (30s instead of 60s)
- [ ] Enable arp_accept=1 on client servers
- [ ] Configure keepalived for multiple gratuitous ARPs
- [ ] Test failover with tcpdump monitoring
### During/After Failover
- [ ] Verify gratuitous ARP sent from new primary
- [ ] Check ARP cache state on client servers
- [ ] Flush stale entries if needed
- [ ] Monitor for connection timeouts
### If Traffic Blackhole Occurs
- [ ] Identify affected servers: ip neigh show | grep STALE
- [ ] Flush ARP cache: ip neigh flush <VIP>
- [ ] Send additional gratuitous ARPs from new primary
- [ ] Check network path for ARP propagation issues
Conclusion
This failure mode is particularly insidious because everything looks correct at the network layer—the VIP is active on the new server, the new server is responding to health checks, and monitoring shows the failover completed successfully. The problem is invisible until you start looking at Layer 2 addresses and ARP cache state on client machines.
The fundamental lesson is that the Linux ARP cache is sticky: a host keeps using its existing entry until an update actually reaches it or the entry fails verification, and by default it will not create entries from unsolicited ARP. That conservatism is good for stability (you don’t want a malicious host to hijack traffic by sending fake gratuitous ARPs), but it creates problems during legitimate failovers.
The defensive strategy is layered: reduce cache timeouts so stale entries expire faster, enable arp_accept so gratuitous ARPs are actually processed, configure your failover tools to send multiple gratuitous ARPs over time, and have automation to flush caches if needed.
Key principles:
- STALE entries are still used - traffic continues to old MAC address for up to 60-120 seconds
- gc_stale_time=30 for faster cache expiry instead of the default 60 seconds
- arp_accept=1 so hosts with no entry for the VIP also learn it from gratuitous ARP
- Send multiple gratuitous ARPs from new primary—one might not reach everyone
- Monitor TCP retransmits after failover as an indicator of ARP problems
Related Articles
- Conntrack Stale NAT Mappings - Network state issues
- Gossip Ghost Nodes IP Reuse - Stale network identity