Split-Brain From a Clock Step Backwards: Wall Time in Lease-Based Systems
The weirdest split-brain I saw started with a clock that went backwards.
“Two nodes became leader simultaneously.” The alert woke me at 2 AM. Our job scheduler was in chaos: duplicate jobs running, data being processed twice, race conditions everywhere. Both node A and node B believed they held the leadership lease. Both were processing work. Both were right, according to their local clocks.
The cause turned out to be NTP stepping node A’s clock backwards by 2 seconds. The VM’s clock had drifted during a suspend for a live migration, and when NTP caught up, it stepped the clock back to correct the drift. Node A, which had acquired a 30-second lease, now believed the lease was valid for 32 seconds of real time instead of 30. During those extra 2 seconds, node B acquired a new lease in the database. Both nodes were now leaders.
This incident taught me one of the fundamental truths of distributed systems: wall-clock time is not monotonic, and any code that treats it as monotonic is waiting to fail. Mixing System.currentTimeMillis() (wall-clock) with duration-based timeouts creates a hidden trap that only springs when time goes backwards—which happens more often than you’d expect in virtualized environments.
The fix wasn’t complicated, but it required understanding why the original code was wrong. You can’t measure elapsed time with wall-clock readings, because wall-clock can jump. You need monotonic time for durations and fencing tokens for safety.
Environment: Custom leader election using database-backed leases, NTP-synchronized nodes
The Problem
The Split-Brain Incident
Timeline (* = time according to Node A's stepped clock):
10:00:00   Node A acquires lease (expires at 10:00:30)
10:00:15   Node A: "I'm leader, lease valid until 10:00:30"
10:00:16   NTP steps Node A's clock BACK 2 seconds
10:00:14*  Node A: "Clock says 10:00:14, lease valid until 10:00:30"
           "I have 16 more seconds!" (actually only 14)
10:00:30   Lease expires in database
10:00:31   Node B acquires lease (expires at 10:01:01)
10:00:31   Node B: "I'm leader!"
10:00:29*  Node A's clock reads 10:00:29
           Node A: "I'm still leader! 1 second left!"
Result: Both nodes think they're leader!
The Vulnerable Code
// Common lease-based leader election pattern
import java.time.Duration;
import java.time.Instant;

public class LeaseManager {
    private Instant leaseExpiry;

    public boolean isLeader() {
        // BUG: Uses wall clock time
        return Instant.now().isBefore(leaseExpiry);
    }

    public void acquireLease(Duration leaseDuration) {
        // Store expiry as wall-clock time
        leaseExpiry = Instant.now().plus(leaseDuration);
        writeToDatabase(leaseExpiry);
    }
}
// If clock steps backwards:
// - leaseExpiry stays at original wall-clock value
// - Instant.now() returns earlier time
// - isLeader() returns true for "extra" time
Root Cause
Wall Clock vs Monotonic Time
This is one of those computer science fundamentals that’s easy to forget in practice. Modern operating systems provide two different time sources, and they serve very different purposes.
Wall-clock time (System.currentTimeMillis(), time.time(), Instant.now()) represents “what time is it?”—the time you’d see on a clock on the wall. This time is synchronized with external sources (NTP servers) and can jump forwards or backwards to stay accurate. It’s great for logging, scheduling, and displaying to users.
Monotonic time (System.nanoTime(), time.monotonic()) represents “how much time has passed?”—a steadily increasing counter that never goes backwards. It’s not synchronized with anything external. It’s purely for measuring durations.
The trap in lease-based systems is using wall-clock time to measure whether a lease has expired. If you acquired a lease at wall-clock 10:00:00 with a 30-second duration, you compute expiry as 10:00:30. Then you check if (now < expiry) to see if you’re still leader. This works perfectly—until wall-clock steps backwards.
Wall Clock (System.currentTimeMillis(), Instant.now()):
├── Can jump forwards (NTP sync)
├── Can jump backwards (NTP correction)
├── Affected by leap seconds and manual clock changes (not DST: these APIs return UTC-based time)
└── NOT suitable for measuring durations!
Monotonic Clock (System.nanoTime()):
├── Only moves forward
├── Unaffected by NTP steps
├── Rate may vary slightly
└── Suitable for measuring durations
The trap:
┌─────────────────────────────────────────────────┐
│ Duration timeout = start_wall_time + 30 seconds│
│ │
│ If wall clock steps back 2 seconds: │
│ Timeout appears to be 32 seconds! │
└─────────────────────────────────────────────────┘
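The trap in the box above can be reproduced without touching a real system clock. The following is a minimal sketch, not the original scheduler's code: `ClockStepDemo`, `FakeClock`, and the start date are hypothetical names and values chosen for illustration. It replays the incident timeline against a fake clock whose wall time can be stepped while its monotonic counter cannot.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical simulation; none of these names come from the original system.
final class ClockStepDemo {

    /** A fake node clock: wall time can be stepped, monotonic time cannot. */
    static final class FakeClock {
        private Instant wall;
        private long monoNanos = 0;

        FakeClock(Instant start) { this.wall = start; }

        /** Real time passes: both clocks advance together. */
        void advance(Duration d) {
            wall = wall.plus(d);
            monoNanos += d.toNanos();
        }

        /** NTP step: only the wall clock moves (negative = backwards). */
        void step(Duration d) { wall = wall.plus(d); }

        Instant wallNow() { return wall; }
        long nanoTime() { return monoNanos; }
    }

    public static void main(String[] args) {
        // Arbitrary start date; only the time of day mirrors the timeline.
        FakeClock clock = new FakeClock(Instant.parse("2024-01-01T10:00:00Z"));
        Duration lease = Duration.ofSeconds(30);

        // Acquire at 10:00:00 and record the lease both ways.
        Instant wallExpiry = clock.wallNow().plus(lease);
        long acquiredNanos = clock.nanoTime();

        clock.advance(Duration.ofSeconds(16)); // real time 10:00:16
        clock.step(Duration.ofSeconds(-2));    // wall clock now reads 10:00:14
        clock.advance(Duration.ofSeconds(15)); // real time 10:00:31, lease is over

        boolean wallSaysLeader = clock.wallNow().isBefore(wallExpiry);
        boolean monoSaysLeader = clock.nanoTime() - acquiredNanos < lease.toNanos();

        System.out.println("wall-clock check: " + wallSaysLeader); // true
        System.out.println("monotonic check:  " + monoSaysLeader); // false
    }
}
```

At real time 10:00:31 the stepped wall clock reads 10:00:29, so the wall-clock comparison still passes, while the elapsed-time comparison has seen the full 31 seconds and fails.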
How NTP Causes Time Steps
# NTP typically slews time (gradual adjustment)
# But if drift is too large, it STEPS (instant jump)
# Check NTP status:
chronyc tracking
# System time: 0.000000002 seconds slow of NTP time
# Last offset: -0.000000814 seconds # Small slew
# OR
# Last offset: -2.345 seconds # Large step!
# Force NTP to step (dangerous in production):
# chronyc makestep
# Common causes of large steps:
# - VM suspend/resume
# - Container migration
# - Network partition resolving
# - New node joining cluster
Diagnosis
Check for Time Jumps
#!/bin/bash
# Monitor time jumps with a script
PREV=$(date +%s.%N)
while true; do
    sleep 0.1
    NOW=$(date +%s.%N)
    # Wall-clock delta minus the expected 0.1 s sleep
    DIFF=$(echo "$NOW - $PREV - 0.1" | bc)
    if (( $(echo "$DIFF > 0.5 || $DIFF < -0.5" | bc -l) )); then
        echo "TIME JUMP: $DIFF seconds at $(date)"
    fi
    PREV=$NOW
done
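The same check can run inside the JVM by comparing how far the wall clock moved against how far the monotonic clock moved over the same interval. This is a sketch under assumptions: `ClockStepDetector` is a hypothetical helper, and the 500 ms threshold simply mirrors the script above.

```java
// Hypothetical helper; the class name and threshold are illustrative.
final class ClockStepDetector {
    private long lastWallMillis;
    private long lastNanos;

    ClockStepDetector(long wallMillis, long nanos) {
        this.lastWallMillis = wallMillis;
        this.lastNanos = nanos;
    }

    /**
     * Returns how much further the wall clock moved than the monotonic clock
     * since the previous sample, in milliseconds.
     * Negative means the wall clock stepped backwards.
     */
    long sample(long wallMillis, long nanos) {
        long wallDelta = wallMillis - lastWallMillis;
        long monoDelta = (nanos - lastNanos) / 1_000_000;
        lastWallMillis = wallMillis;
        lastNanos = nanos;
        return wallDelta - monoDelta;
    }

    public static void main(String[] args) throws InterruptedException {
        ClockStepDetector detector =
                new ClockStepDetector(System.currentTimeMillis(), System.nanoTime());
        while (true) {
            Thread.sleep(100);
            long skew = detector.sample(System.currentTimeMillis(), System.nanoTime());
            if (Math.abs(skew) > 500) {
                System.out.println("TIME JUMP: " + skew + " ms");
            }
        }
    }
}
```

Because the comparison uses the measured monotonic delta rather than the nominal sleep time, scheduling delays in the sampling loop do not show up as false jumps.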
Check NTP Logs
# chrony logs
journalctl -u chronyd | grep -E "(makestep|System clock)"
# Look for:
# System clock was stepped by -2.345 seconds
Detect Split-Brain
-- If using database-backed leases
SELECT node_id, lease_acquired_at, lease_expires_at
FROM leader_leases
WHERE is_active = true;
-- Should return exactly 1 row
-- Multiple rows = split-brain!
The Fix
Option 1: Use Monotonic Time for Durations
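import java.time.Duration;
import java.time.Instant;

public class SafeLeaseManager {
    private boolean hasLease = false;
    private long leaseAcquiredNanos; // Monotonic
    private long leaseDurationNanos;

    public boolean isLeader() {
        // Never report leadership before a lease has been acquired
        if (!hasLease) {
            return false;
        }
        // Uses monotonic time - not affected by wall-clock steps
        long elapsed = System.nanoTime() - leaseAcquiredNanos;
        return elapsed < leaseDurationNanos;
    }

    public void acquireLease(Duration leaseDuration) {
        // Start the local timer before the database write, so the local
        // view of the lease is never longer than the stored one
        leaseAcquiredNanos = System.nanoTime();
        leaseDurationNanos = leaseDuration.toNanos();
        // Still write wall-clock expiry for external visibility
        writeToDatabase(Instant.now().plus(leaseDuration));
        hasLease = true;
    }
}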
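One possible refinement of Option 1 is to stop acting as leader slightly before the lease runs out, leaving room for scheduling pauses and clock-rate differences between nodes. The sketch below is an assumption-laden variant, not the fix that was deployed: `MarginLease`, `onAcquired`, and the margin parameter are hypothetical, and the nanosecond clock is injected so the logic can be tested without waiting.

```java
import java.time.Duration;
import java.util.function.LongSupplier;

// Sketch only: MarginLease and its parameters are hypothetical names.
final class MarginLease {
    private final LongSupplier nanoClock; // System::nanoTime in production
    private final long marginNanos;       // give up leadership this much early
    private boolean held = false;
    private long acquiredNanos;
    private long durationNanos;

    MarginLease(LongSupplier nanoClock, Duration margin) {
        this.nanoClock = nanoClock;
        this.marginNanos = margin.toNanos();
    }

    /** startedNanos should be read BEFORE the database round trip began. */
    synchronized void onAcquired(long startedNanos, Duration leaseDuration) {
        this.acquiredNanos = startedNanos;
        this.durationNanos = leaseDuration.toNanos();
        this.held = true;
    }

    /** True only while the lease is held and the safety margin is not yet reached. */
    synchronized boolean isLeader() {
        if (!held) {
            return false;
        }
        long elapsed = nanoClock.getAsLong() - acquiredNanos;
        return elapsed < durationNanos - marginNanos;
    }
}
```

With a 30-second lease and a 2-second margin, `isLeader()` turns false after 28 seconds of elapsed monotonic time. In production the clock argument would be `System::nanoTime`.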
Option 2: Use Fencing Tokens
// Fencing token: monotonically increasing number
// Even if two nodes think they're leader,
// only the one with higher token can write
public class FencedLeaseManager {
    private long fencingToken;

    public boolean acquireLease() {
        // Atomically increment and read fencing token
        Long newToken = database.incrementAndGet("lease_fencing_token");
        if (newToken != null) {
            this.fencingToken = newToken;
            return true;
        }
        return false;
    }

    public void writeWithFence(String key, Object value) {
        // Database rejects writes with lower fencing token
        database.conditionalWrite(key, value, this.fencingToken);
    }
}
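Fencing only works if the storage side enforces it. As a rough illustration of what a `conditionalWrite` like the one above might do, here is an in-memory stand-in; `FencedStore` is a hypothetical class, and a real database would implement the same comparison in a conditional update or transaction.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory stand-in for the storage side of fencing.
final class FencedStore {
    private final Map<String, String> data = new HashMap<>();
    private long highestTokenSeen = 0;

    /** Accepts the write only if the token is at least as new as any seen so far. */
    synchronized boolean conditionalWrite(String key, String value, long fencingToken) {
        if (fencingToken < highestTokenSeen) {
            return false; // stale leader: reject
        }
        highestTokenSeen = fencingToken;
        data.put(key, value);
        return true;
    }

    synchronized String read(String key) {
        return data.get(key);
    }
}
```

If node A holds token 33 and node B later acquires token 34, any write from node A that arrives after node B's first write is rejected, even while node A still believes its lease is valid.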
Option 3: Server-Side Lease Validation
// Don't trust client-side lease checks
// Always validate lease at the coordination point
// Client requests work as leader
// Server validates lease EVERY TIME
public class LeaseCoordinator {
    public Result executeAsLeader(String nodeId, Work work) {
        // Fetch current lease from database (source of truth)
        Lease currentLease = database.getCurrentLease();
        // No lease row at all means nobody is leader
        if (currentLease == null || !currentLease.heldBy(nodeId)) {
            throw new NotLeaderException();
        }
        if (currentLease.isExpired()) { // Server time!
            throw new LeaseExpiredException();
        }
        return work.execute();
    }
}
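The point of server-side validation is that every decision uses one clock, the server's. A minimal in-memory model of that idea follows; `LeaseTable` and its methods are hypothetical, and in a real database the acquire step would be a single conditional update evaluated with the database's own time.

```java
// Hypothetical in-memory model of the lease row. The caller passes the
// server's current time so that only one clock is ever consulted.
final class LeaseTable {
    private String holder = null;
    private long expiresAtMillis = 0;

    /** Grants the lease if it is free, expired, or already held by nodeId. */
    synchronized boolean tryAcquire(String nodeId, long serverNowMillis, long durationMillis) {
        boolean free = holder == null || serverNowMillis >= expiresAtMillis;
        if (free || holder.equals(nodeId)) {
            holder = nodeId;
            expiresAtMillis = serverNowMillis + durationMillis;
            return true;
        }
        return false;
    }

    /** The check a coordinator would run before each leader-only action. */
    synchronized boolean isHeldBy(String nodeId, long serverNowMillis) {
        return nodeId.equals(holder) && serverNowMillis < expiresAtMillis;
    }
}
```

A node whose local clock has stepped backwards gets the same answer as every other node, because its own clock never enters the comparison.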
Option 4: Shorten Lease + Heartbeat
// Instead of 30-second lease with one check
// Use 5-second lease with continuous renewal
import java.time.Duration;

public class HeartbeatLeaseManager {
    private static final Duration LEASE_DURATION = Duration.ofSeconds(5);
    private static final Duration HEARTBEAT_INTERVAL = Duration.ofSeconds(1);

    public void maintainLeadership() {
        while (shouldBeLeader) {
            boolean renewed = renewLease(LEASE_DURATION);
            if (!renewed) {
                stepDown();
                return;
            }
            try {
                Thread.sleep(HEARTBEAT_INTERVAL.toMillis());
            } catch (InterruptedException e) {
                // Interrupted: restore the flag and give up leadership
                Thread.currentThread().interrupt();
                stepDown();
                return;
            }
        }
    }
    // If clock steps back:
    // - Lease expires in database within 5 seconds
    // - Other node can acquire within 5 seconds
    // - Much smaller window for split-brain
}
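The step-down path is the part of a heartbeat loop that matters most and is exercised least. One way to make it testable, sketched here with hypothetical names (`HeartbeatLoop`, an injected `renew` function), is to pass the renewal call in as a parameter so a test can make it fail on demand without a database.

```java
import java.util.function.BooleanSupplier;

// Hypothetical testable variant of a renewal loop.
final class HeartbeatLoop {
    private volatile boolean leader = true;
    private int renewals = 0;

    /** Renews until renewal fails or the thread is interrupted, then steps down. */
    void run(BooleanSupplier renew, long intervalMillis) {
        try {
            while (leader && renew.getAsBoolean()) {
                renewals++;
                Thread.sleep(intervalMillis);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
        } finally {
            leader = false; // step down on every exit path
        }
    }

    boolean isLeader() { return leader; }

    int renewals() { return renewals; }
}
```

Putting the step-down in `finally` means a failed renewal, an interrupt, and an unexpected exception from the renewal call all end with the node no longer considering itself leader.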
Monitoring
groups:
  - name: time-sync
    rules:
      - alert: NTPClockStep
        expr: |
          abs(node_ntp_offset_seconds) > 0.5
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Large NTP offset on {{ $labels.instance }}"
      - alert: MultipleLeaders
        expr: |
          count(leader_election_is_leader == 1) > 1
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "Split-brain detected - multiple leaders!"
Checklist
## Clock Step Split-Brain
### Symptoms
- [ ] Two nodes claiming leadership simultaneously
- [ ] Data inconsistency after NTP sync
- [ ] VM resume causing issues
- [ ] Lease-based systems misbehaving
### Diagnosis
- [ ] Check NTP logs for time steps
- [ ] Monitor for multiple leaders
- [ ] Review lease check code for wall-clock usage
- [ ] Check for VM suspend/resume events
### Fixes
- [ ] Use monotonic time for duration checks
- [ ] Implement fencing tokens
- [ ] Server-side lease validation
- [ ] Shorten lease duration + heartbeat
- [ ] Alert on NTP clock steps
Conclusion
This failure mode is particularly insidious because it manifests only when the clock steps backwards, which is rare but not as rare as you’d think. VMs suspend and resume. Containers get live-migrated. NTP catches up after network partitions. Each of these can cause time to jump backwards.
The fundamental lesson is that “use leases for simplicity” is incomplete advice. Leases are simple conceptually, but implementing them correctly requires understanding the wall-clock vs monotonic-time distinction. Using Instant.now() for lease expiry checks is natural and intuitive—and wrong.
The defense-in-depth approach combines multiple techniques: monotonic time for local lease checks (prevents the clock-step problem), fencing tokens for write operations (prevents stale-leader writes even if split-brain occurs), shorter lease durations with frequent renewal (reduces the window of vulnerability), and monitoring for NTP clock adjustments (alerts you when conditions are ripe for problems).
Key principles:
- Never use wall-clock for duration measurement - use monotonic time (System.nanoTime())
- Fencing tokens prevent stale-leader writes - even if split-brain occurs, only the holder of the newest token can write
- Shorter leases = smaller split-brain window - 5 seconds is better than 30 seconds
- Monitor for clock adjustments - alert on NTP steps > 500ms
- Server-side validation - don’t trust client-side lease checks for critical operations
Related Articles
- Distributed System Testing - Testing for time-related failures
- Idempotency Keys Replica Lag - Another consistency trap