The Cert Isn't Expired, Your Node Is: Time Drift Breaking TLS and JWT in Kubernetes
Time drift turns TLS and JWT into a weird, intermittent horror story.
“Certificate expired? But we just renewed it last week!”
The alert dashboard was lighting up with TLS handshake failures across multiple services. Some pods were failing to connect to external APIs. Others were rejecting JWTs as expired. But when I checked the certificates, they were valid for months. When I checked the JWTs, they hadn’t expired. The problem was gone by the time I started investigating.
Then it happened again two days later. Same pattern: a burst of failures, then everything fine. This time I was faster. I SSH’d to one of the affected nodes immediately and ran chronyc tracking. There it was: chrony had just corrected a 15-second drift with a step adjustment. Until that correction, the node’s clock had been wrong enough to matter for anything close to a validity boundary: notBefore checks on freshly issued certificates landed in “not yet valid” territory, and exp claims on tokens near the end of their lifetime landed in “expired” territory.
The frustrating part is that by the time you investigate, the evidence is gone. NTP fixes the clock, services recover, and all you have are mysterious logs about expired certificates that aren’t actually expired. You have to know to look at NTP/chrony logs and time sync metrics to see that the clock itself was the problem.
This is one of those distributed systems bugs that comes from treating time as reliable when it fundamentally isn’t. Wall-clock time can jump forwards or backwards. Your certificates, tokens, and database replication all trust the clock to be accurate, and when it isn’t, they fail in ways that seem impossible.
Environment: Kubernetes 1.28+, nodes with NTP/chrony, services using TLS and JWT authentication
Understanding the Problem
When Time Validation Fails
TLS Certificate validation:
┌────────────────────────────────────────────────────────────┐
│ Certificate validity                                       │
│  notBefore           now()               notAfter          │
│  │                   │                   │                 │
│  ▼                   ▼                   ▼                 │
│  2024-01-01          2024-06-15          2025-01-01        │
│  │                   │                   │                 │
│  └────── valid ──────┴────── valid ──────┘                 │
└────────────────────────────────────────────────────────────┘
Clock is 15 seconds BEHIND (real time is 2024-01-01 00:00:00):
┌────────────────────────────────────────────────────────────┐
│  now()                   notBefore                         │
│  │                       │                                 │
│  ▼                       ▼                                 │
│  2023-12-31 23:59:45     2024-01-01 00:00:00               │
│  │                                                         │
│  └── NOT YET VALID!                                        │
│      "certificate is not yet valid"                        │
└────────────────────────────────────────────────────────────┘
Clock is 15 seconds AHEAD (real time is 2025-01-01 00:00:00):
┌────────────────────────────────────────────────────────────┐
│  notAfter                now()                             │
│  │                       │                                 │
│  ▼                       ▼                                 │
│  2025-01-01 00:00:00     2025-01-01 00:00:15               │
│                          │                                 │
│                          └── EXPIRED!                      │
│                              "certificate has expired"     │
└────────────────────────────────────────────────────────────┘
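To make the margin concrete, here is a minimal Python sketch of the same window check. It is not any TLS library's actual verification code; the function names and the "certificate issued 5 seconds ago" scenario are illustrative assumptions. It compares a correct clock with one that lags by 15 seconds.

```python
from datetime import datetime, timedelta, timezone

def is_valid(not_before, not_after, now):
    """The validity-period check: now must fall inside the window."""
    return not_before <= now <= not_after

def validity_margin(not_before, not_after, now):
    """Seconds since notBefore and seconds until notAfter.
    A negative value means that side of the check fails at `now`."""
    return ((now - not_before).total_seconds(),
            (not_after - now).total_seconds())

not_before = datetime(2024, 1, 1, tzinfo=timezone.utc)
not_after = datetime(2025, 1, 1, tzinfo=timezone.utc)

# Assumed scenario: the certificate was issued 5 seconds ago,
# but this node's clock is 15 seconds behind real time.
real_now = not_before + timedelta(seconds=5)
skewed_now = real_now - timedelta(seconds=15)

print(is_valid(not_before, not_after, real_now))              # True
print(is_valid(not_before, not_after, skewed_now))            # False
print(validity_margin(not_before, not_after, skewed_now)[0])  # -10.0
```

The takeaway is that a small skew only bites near the edges of the window, which is why freshly rotated certificates and short-lived tokens are the first things to break.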
JWT Token Validation Fails Similarly
JWT claims:
{
"iat": 1718452800, // issued at: June 15, 2024 12:00:00 UTC
"exp": 1718456400, // expires: June 15, 2024 13:00:00 UTC
"nbf": 1718452800 // not before: June 15, 2024 12:00:00 UTC
}
Normal validation (now = 12:30:00):
iat (12:00:00) < now (12:30:00) < exp (13:00:00) ✓
Clock drift +35 minutes (now thinks it's 13:05:00):
now (13:05:00) > exp (13:00:00)
→ "Token expired" (but it's not!)
Clock drift -35 minutes (now thinks it's 11:55:00):
now (11:55:00) < nbf (12:00:00)
→ "Token not yet valid" (but it should be!)
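The same comparison can be sketched in a few lines of Python. This is an illustrative check, not the logic of any particular JWT library; check_time_claims and its leeway parameter are assumed names. It shows that a validator lagging by only 15 seconds rejects a token issued 5 seconds ago, and that 30 seconds of leeway absorbs the skew.

```python
def check_time_claims(claims, now, leeway=0):
    """Return None if the token is usable at `now`, else a reason string.
    `now`, `leeway`, and the claims are all in Unix seconds."""
    if "exp" in claims and now >= claims["exp"] + leeway:
        return "token expired"
    if "nbf" in claims and now < claims["nbf"] - leeway:
        return "token not yet valid"
    return None

# Token valid from 12:00:00 to 13:00:00 UTC on June 15, 2024.
claims = {"iat": 1718452800, "nbf": 1718452800, "exp": 1718456400}

real_now = claims["nbf"] + 5   # five seconds after issuance
lagging_now = real_now - 15    # validator clock is 15 seconds behind

print(check_time_claims(claims, real_now))                # None
print(check_time_claims(claims, lagging_now))             # token not yet valid
print(check_time_claims(claims, lagging_now, leeway=30))  # None
```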
Why This Happens in Kubernetes
Common clock drift scenarios:
1. VM live migration / pause-resume
┌────────────────────────────────────────────┐
│ Node VM paused for migration │
│ Real time passes: 30 seconds │
│ VM resumes │
│ Node clock is now 30 seconds in the past │
│ NTP eventually corrects with a step │
└────────────────────────────────────────────┘
2. Cloud provider maintenance
┌────────────────────────────────────────────┐
│ Hypervisor doing maintenance │
│ Guest VM clock stops for N seconds │
│ Clock suddenly behind when VM resumes │
└────────────────────────────────────────────┘
3. NTP step correction
┌────────────────────────────────────────────┐
│ Node clock drifting slowly over days │
│ NTP notices 15-second drift │
│ NTP steps clock forward/backward │
│ Suddenly: time jumps! │
└────────────────────────────────────────────┘
4. Bad NTP configuration
┌────────────────────────────────────────────┐
│ Wrong NTP servers configured │
│ NTP server itself has wrong time │
│ All nodes "sync" to wrong time │
└────────────────────────────────────────────┘
5. Clocksource issues
┌────────────────────────────────────────────┐
│ TSC (Time Stamp Counter) not stable │
│ Kernel switches clocksource │
│ Time may jump during transition │
└────────────────────────────────────────────┘
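Most of these scenarios end the same way: the wall clock moves by an amount that does not match elapsed time. One way to notice that from inside a process is to compare wall-clock progress with monotonic progress between two checks. The sketch below assumes the monotonic clock is never stepped (on Linux, CLOCK_MONOTONIC is only slewed by NTP). ClockJumpDetector is a hypothetical helper, not part of any library, and the 0.5-second threshold is an arbitrary choice.

```python
import time

class ClockJumpDetector:
    """Compare wall-clock progress with monotonic progress between checks."""

    def __init__(self, wall=time.time, mono=time.monotonic, threshold=0.5):
        self._wall, self._mono, self._threshold = wall, mono, threshold
        self._last_wall, self._last_mono = wall(), mono()

    def check(self):
        """Return the wall-clock jump in seconds since the last check.
        Positive means the wall clock leapt forward, negative backward;
        differences smaller than the threshold are reported as 0.0."""
        wall, mono = self._wall(), self._mono()
        jump = (wall - self._last_wall) - (mono - self._last_mono)
        self._last_wall, self._last_mono = wall, mono
        return jump if abs(jump) >= self._threshold else 0.0

if __name__ == "__main__":
    detector = ClockJumpDetector()
    time.sleep(0.1)
    print(detector.check())  # 0.0 unless the clock was stepped meanwhile
```

A background thread calling check() every few seconds and logging non-zero results leaves a trace inside the application's own logs, which helps when the node-level evidence has already scrolled away.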
Diagnosing Time Issues
Check Current Time Sync Status
# On the affected node
# Using chrony (most common on modern systems)
chronyc tracking
# Reference ID : D8EF2300 (time1.google.com)
# Stratum : 3
# Ref time (UTC) : Sat Jun 15 12:30:00 2024
# System time : 0.000015 seconds fast of NTP time
# Last offset : +0.000012 seconds ← small is good
# RMS offset : 0.000025 seconds
# Frequency : 1.234 ppm slow
# Root delay : 0.001234 seconds
# Root dispersion : 0.000123 seconds
# Update interval : 1024.0 seconds
# Leap status : Normal
# Check sources
chronyc sources -v
# ^* time.google.com 2 10 377 64 +0.012ms +/- 20ms
# Using timedatectl
timedatectl status
# Look for: System clock synchronized: yes
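If you want to feed this into a script rather than eyeball it, the "System time" line is the one to parse. The following is a minimal sketch that assumes the chronyc tracking output format shown above; parse_system_time_offset is a hypothetical helper name.

```python
import re

def parse_system_time_offset(tracking_output):
    """Return the signed offset in seconds from `chronyc tracking` output:
    positive when the system clock is fast of NTP time, negative when slow.
    Returns None if no "System time" line is found."""
    match = re.search(
        r"^System time\s*:\s*([0-9.]+) seconds (fast|slow) of NTP time",
        tracking_output,
        re.MULTILINE,
    )
    if match is None:
        return None
    value = float(match.group(1))
    return value if match.group(2) == "fast" else -value

sample = """\
Reference ID    : D8EF2300 (time1.google.com)
Stratum         : 3
System time     : 0.000015 seconds fast of NTP time
Last offset     : +0.000012 seconds
"""
print(parse_system_time_offset(sample))  # 1.5e-05
```

In practice the input would come from something like subprocess.run(["chronyc", "tracking"], capture_output=True, text=True).stdout on the node.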
Check for Recent Time Steps
# Chrony logs - look for steps
journalctl -u chronyd --since "2 hours ago" | grep -i "step\|jump\|offset"
# Jun 15 12:00:00 node1 chronyd[1234]: System clock was stepped by 15.234567 seconds
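# Alternative: last offset from the live tracking output
# (history is in /var/log/chrony/tracking.log when "log tracking" is enabled)
chronyc tracking | grep -i "Last offset"
# If large (>100ms), a recent step or large correction likely happened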
# systemd-timesyncd (if using)
journalctl -u systemd-timesyncd --since "2 hours ago" | grep -i "step\|adjust"
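For anything beyond a quick grep, it helps to turn the step messages into data. This sketch assumes the journal line format shown in the sample above ("System clock was stepped by N seconds" with a syslog-style prefix); find_steps and the regular expression are illustrative, and other log formats would need a different pattern.

```python
import re

# Matches lines such as:
# Jun 15 12:00:00 node1 chronyd[1234]: System clock was stepped by 15.234567 seconds
STEP_RE = re.compile(
    r"^(\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) \S+ chronyd\[\d+\]: "
    r"System clock was stepped by (-?\d+(?:\.\d+)?) seconds"
)

def find_steps(journal_text):
    """Return (timestamp string, step size in seconds) for every step line."""
    steps = []
    for line in journal_text.splitlines():
        match = STEP_RE.match(line)
        if match:
            steps.append((match.group(1), float(match.group(2))))
    return steps

sample = """\
Jun 15 11:59:58 node1 chronyd[1234]: Selected source time.google.com
Jun 15 12:00:00 node1 chronyd[1234]: System clock was stepped by 15.234567 seconds
Jun 16 03:10:42 node1 chronyd[1234]: System clock was stepped by -2.500000 seconds
"""
print(find_steps(sample))
# [('Jun 15 12:00:00', 15.234567), ('Jun 16 03:10:42', -2.5)]
```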
Check Kernel Clock Events
# Kernel messages about time
dmesg | grep -iE "clocksource|timekeep|tsc|time.*jump"
# Look for:
# - "Clocksource tsc unstable"
# - "timekeeping: Nonstop clocksource"
# - "PM: suspend exit" (VM pause/resume)
# Check current clocksource
cat /sys/devices/system/clocksource/clocksource0/current_clocksource
# tsc, hpet, or acpi_pm
Prometheus Metrics (if available)
# Current time offset from NTP
node_timex_offset_seconds
# Time sync status (1 = synced)
node_timex_sync_status
# Frequency error
node_timex_frequency_adjustment_ratio
# Maximum error estimate
node_timex_maxerror_seconds
# Alert on large offset
abs(node_timex_offset_seconds) > 0.5
Application-Level Detection
# From application logs, look for time-related errors
kubectl logs deploy/api-server | grep -iE "expired|not.yet.valid|too skewed|clock"
# Common error patterns:
# "x509: certificate has expired or is not yet valid"
# "token expired"
# "token not yet valid"
# "signature is not yet valid"
# "request time too skewed"
Reproduction Lab
Using libfaketime (Safe, Process-Local)
# docker-compose.yml - Safe time drift simulation
version: '3.8'
services:
  tls-server:
    image: nginx:alpine
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
      - ./certs:/etc/nginx/certs:ro
    ports:
      - "8443:8443"
  client-normal:
    image: alpine:3.18
    command: sh -c "apk add curl ca-certificates && sleep infinity"
    depends_on:
      - tls-server
  client-future:
    image: debian:bookworm
    environment:
      # libfaketime will make this container think time is 2 days in the future
      LD_PRELOAD: /usr/lib/x86_64-linux-gnu/faketime/libfaketime.so.1
      FAKETIME: "+2d"
    command: sh -c "apt-get update && apt-get install -y curl libfaketime ca-certificates && sleep infinity"
    depends_on:
      - tls-server
  client-past:
    image: debian:bookworm
    environment:
      LD_PRELOAD: /usr/lib/x86_64-linux-gnu/faketime/libfaketime.so.1
      FAKETIME: "-2d"
    command: sh -c "apt-get update && apt-get install -y curl libfaketime ca-certificates && sleep infinity"
    depends_on:
      - tls-server
Generate Short-Lived Test Certificate
# Create certificate valid for only 1 day
mkdir -p certs
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
-keyout certs/key.pem -out certs/cert.pem \
-subj "/CN=localhost" \
-addext "subjectAltName=DNS:localhost,DNS:tls-server"
nginx.conf
events {}
http {
server {
listen 8443 ssl;
ssl_certificate /etc/nginx/certs/cert.pem;
ssl_certificate_key /etc/nginx/certs/key.pem;
location / {
return 200 "OK\n";
}
}
}
Run the Demo
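docker compose up -d
sleep 10  # the Debian clients may need longer to finish installing packages
# Copy the self-signed cert into each client so curl can trust it explicitly;
# otherwise curl fails on the untrusted issuer before it reports a time problem
for c in client-normal client-future client-past; do
  docker compose cp certs/cert.pem "$c":/tmp/cert.pem
done
# Normal client - works
docker compose exec client-normal curl -v --cacert /tmp/cert.pem https://tls-server:8443/
# Should show: HTTP/1.1 200 OK
# Client in the future (+2 days) - certificate "expired"
docker compose exec client-future curl -v --cacert /tmp/cert.pem https://tls-server:8443/
# Error: certificate has expired
# Client in the past (-2 days) - certificate "not yet valid"
docker compose exec client-past curl -v --cacert /tmp/cert.pem https://tls-server:8443/
# Error: certificate is not yet valid
# With -k (skip verification) it works, so only the validity-period check is failing
docker compose exec client-future curl -vk https://tls-server:8443/
# HTTP/1.1 200 OK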
The Fix
1. Configure NTP to Slew Instead of Step
# /etc/chrony/chrony.conf
# Use reliable NTP servers
server time.google.com iburst
server time.cloudflare.com iburst
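# Step only if the offset exceeds 1 second, and only during the first 3 clock updates after start
makestep 1.0 3
# After that, chrony only slews (gradual adjustment)
# Cap the slew rate at 500 ppm (0.5 ms of correction per second);
# a large offset then takes hours to remove, so alert on it
maxslewrate 500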
# Enable RTC sync
rtcsync
# Log significant events
log tracking measurements statistics
logdir /var/log/chrony
# Apply changes
systemctl restart chronyd
# Verify
chronyc tracking
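Slewing trades a sudden jump for a long period of being slightly wrong, and the maxslewrate value decides how long. A back-of-the-envelope sketch follows; slew_duration_seconds is an illustrative helper, the result is a lower bound that ignores the clock's own frequency error, and 83333.333 ppm is used as chrony's documented default maxslewrate.

```python
def slew_duration_seconds(offset_seconds, max_slew_ppm):
    """Minimum time needed to slew away `offset_seconds` when the clock rate
    may be adjusted by at most `max_slew_ppm` parts per million."""
    return abs(offset_seconds) / (max_slew_ppm * 1e-6)

# A 15-second offset with the 500 ppm cap from the config above:
print(round(slew_duration_seconds(15, 500) / 3600, 1))  # 8.3 (hours)

# The same offset at 83333.333 ppm:
print(round(slew_duration_seconds(15, 83333.333)))      # 180 (seconds)
```

With a 500 ppm cap, a node that resumes 15 seconds behind stays measurably behind for most of a working day, so the JWT leeway and the offset alert described below are what cover that window.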
2. Monitor Time Sync
# Prometheus alerting rules
groups:
  - name: time-sync
    rules:
      - alert: NodeTimeOffsetHigh
        expr: |
          abs(node_timex_offset_seconds) > 0.1
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Node {{ $labels.instance }} time offset is {{ $value }}s"
          description: "Time drift detected - may cause TLS/JWT failures"
      - alert: NodeTimeNotSynced
        expr: |
          node_timex_sync_status != 1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.instance }} time not synchronized"
          description: "NTP sync lost - certificate and token validation will fail"
      - alert: NodeTimeStepDetected
        # delta() is the only term, so {{ $value }} is the size of the change
        expr: |
          abs(delta(node_timex_offset_seconds[5m])) > 0.5
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Time step detected on {{ $labels.instance }}"
          description: "Offset changed by ~{{ $value }}s in 5m - check for a clock step"
3. Add JWT Clock Skew Tolerance
// Go - Add clock skew tolerance to JWT validation
import (
"github.com/golang-jwt/jwt/v5"
"time"
)
func validateToken(tokenString string) (*jwt.Token, error) {
return jwt.Parse(tokenString, keyFunc,
// Allow 30 seconds of clock skew
jwt.WithLeeway(30*time.Second),
)
}
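// Java - JJWT clock skew (0.11.x and earlier API)
import io.jsonwebtoken.Jwts;
Jwts.parser()
.setAllowedClockSkewSeconds(30) // 30 second tolerance
.setSigningKey(key)
.parseClaimsJws(token);
// JJWT 0.12+ equivalent:
// Jwts.parser().clockSkewSeconds(30).verifyWith(key).build().parseSignedClaims(token);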
# Python - PyJWT clock skew
import jwt
from datetime import timedelta
jwt.decode(
token,
key,
algorithms=["RS256"],
leeway=timedelta(seconds=30) # 30 second tolerance
)
4. Kubernetes-Level: Ensure Node Time Sync
# DaemonSet to monitor time sync on every node
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: time-sync-monitor
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: time-sync-monitor
  template:
    metadata:
      labels:
        app: time-sync-monitor
    spec:
      hostNetwork: true
      hostPID: true
      containers:
        - name: monitor
          image: alpine:3.18
          command:
            - sh
            - -c
            - |
              apk add --no-cache chrony
              while true; do
                # Field 4 of the "System time" line is the offset magnitude in seconds
                OFFSET=$(chronyc tracking | awk '/^System time/ {print $4}')
                if [ -z "$OFFSET" ]; then
                  echo "WARNING: could not read offset from chronyc"
                elif awk -v o="$OFFSET" 'BEGIN { exit !(o > 0.5) }'; then
                  echo "WARNING: Time offset is ${OFFSET}s"
                fi
                sleep 60
              done
          securityContext:
            privileged: true
          volumeMounts:
            - name: chrony-sock
              mountPath: /var/run/chrony
      volumes:
        - name: chrony-sock
          hostPath:
            path: /var/run/chrony
      tolerations:
        - operator: Exists
5. Use Monotonic Time for Durations
// Wrong: Unix timestamps (like times that were serialized or stripped with
// Round(0)) carry no monotonic reading, so this is pure wall-clock math
startUnix := time.Now().Unix()
// ... do work ...
elapsedSec := time.Now().Unix() - startUnix // Can be negative if clock stepped!
// Right: time.Now() stores both wall and monotonic readings (Go 1.9+)
startTime := time.Now()
// ... do work ...
elapsed := time.Since(startTime) // Same as time.Now().Sub(startTime); uses the monotonic reading
// The same applies to timeouts:
start := time.Now()
// ... work ...
if time.Since(start) > timeout { // Monotonic comparison
    // timed out
}
// Java - Use nanoTime for durations, not currentTimeMillis
// Wrong:
long start = System.currentTimeMillis();
// ... work ...
long elapsed = System.currentTimeMillis() - start; // Can go negative!
// Right:
long start = System.nanoTime();
// ... work ...
long elapsed = System.nanoTime() - start; // Always increases
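The same rule applies in Python: time.time() follows the wall clock, while time.monotonic() is not affected by system clock updates. Here is a small sketch of a timeout helper built on the monotonic clock; Deadline is a hypothetical class, and the injectable clock parameter exists only to make it testable.

```python
import time

class Deadline:
    """A timeout that is immune to wall-clock steps."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self._clock = clock
        self._expires_at = clock() + timeout_seconds

    def remaining(self):
        """Seconds left before the deadline, never negative."""
        return max(0.0, self._expires_at - self._clock())

    def expired(self):
        return self._clock() >= self._expires_at

if __name__ == "__main__":
    deadline = Deadline(0.2)
    while not deadline.expired():
        time.sleep(0.05)  # ... do a slice of work ...
    print("timed out after about 0.2 s, regardless of NTP corrections")
```

Wall-clock timestamps are still the right choice for anything that leaves the process, such as log lines or token claims. The monotonic clock is only for measuring elapsed time locally.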
Monitoring Dashboard
Key Metrics to Track
# Time offset (should be near 0)
node_timex_offset_seconds{instance=~".*"}
# Sync status (should be 1)
node_timex_sync_status{instance=~".*"}
# Change in offset over 5m (sudden changes = steps); the metric is a gauge, so use delta(), not rate()
delta(node_timex_offset_seconds[5m])
# Maximum error (uncertainty bound)
node_timex_maxerror_seconds{instance=~".*"}
# Rule out real expiry: seconds until each ingress-nginx certificate expires
min(nginx_ingress_controller_ssl_expire_time_seconds - time()) by (host)
Log Correlation Query
# Find TLS/JWT errors that correlate with time events
# Step 1: Find time step events
journalctl -u chronyd --since "24 hours ago" | grep "step" | awk '{print $1" "$2" "$3}'
# Step 2: Check application logs around those times
# For each timestamp from above:
kubectl logs deploy/api-server --since-time="2024-06-15T12:00:00Z" \
| grep -E "expired|not.yet.valid" | head -20
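Doing this by hand for more than a couple of events gets tedious. Once both sets of timestamps are parsed, the matching itself is simple. The sketch below assumes you already have step times and error times as datetime objects in the same timezone; errors_near_steps and the 60-second window are illustrative choices.

```python
from datetime import datetime, timedelta

def errors_near_steps(step_times, error_times, window=timedelta(seconds=60)):
    """Group error timestamps by the clock step they fall close to."""
    return {
        step: [err for err in error_times if abs(err - step) <= window]
        for step in step_times
    }

steps = [datetime(2024, 6, 15, 12, 0, 0)]
errors = [
    datetime(2024, 6, 15, 11, 59, 40),  # 20 s before the step
    datetime(2024, 6, 15, 12, 0, 5),    # 5 s after the step
    datetime(2024, 6, 15, 14, 30, 0),   # unrelated
]

for step, nearby in errors_near_steps(steps, errors).items():
    print(step.isoformat(), len(nearby))  # 2024-06-15T12:00:00 2
```

Errors that cluster just before a step are the interesting ones: they happened while the clock was still wrong, which is exactly the window that is invisible once chrony has corrected it.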
Checklist
## Time Drift Debugging Checklist
### Immediate Detection
- [ ] Check current offset: `chronyc tracking`
- [ ] Check for recent steps: `journalctl -u chronyd | grep step`
- [ ] Check kernel time events: `dmesg | grep -i clock`
- [ ] Check NTP sync status: `timedatectl status`
### Root Cause
- [ ] VM pause/resume events (live migration)
- [ ] NTP misconfiguration
- [ ] Wrong NTP servers
- [ ] Clocksource instability
### Fix
- [ ] Configure chrony to slew after boot: `makestep 1.0 3`
- [ ] Add JWT clock skew tolerance (30s recommended)
- [ ] Monitor time offset with alerting
- [ ] Use monotonic time for duration calculations
### Prevention
- [ ] Alert on |offset| > 100ms
- [ ] Alert on sync status != 1
- [ ] Log time step events
- [ ] Review VM live migration policies
Conclusion
Time drift is one of those problems that feels like it shouldn’t happen in 2025. We have NTP. We have atomic clocks backing cloud providers’ time servers. Yet somehow, your Kubernetes node’s clock can still be wrong enough to break TLS and JWT validation.
The core issue is that NTP is designed to fix drift, not prevent it. When your VM is paused for migration, or when the hypervisor has a moment of instability, your node’s clock falls behind. NTP notices and corrects it—often with a “step” adjustment that jumps the clock forward. For that brief window before correction, and during the step itself, time-based validation fails.
What makes this particularly frustrating is that the evidence disappears. By the time you SSH to the node and check, NTP has already fixed the clock. The certificates are valid. The JWTs haven’t expired. You’re left with mysterious logs about expired things that aren’t actually expired.
The fix is threefold: configure NTP to slew instead of step (after initial boot sync), add clock skew tolerance to your JWT validation, and monitor time offset as a first-class infrastructure metric. When you see TLS or JWT errors that don’t match certificate/token validity, check chronyc tracking and journalctl -u chronyd before assuming the certificate is actually bad.
Key principles:
- Wall-clock time is not reliable—it can jump forwards or backwards
- NTP fixes drift but can cause steps—a sudden 15-second jump breaks everything briefly
- TLS and JWT validation trust the clock—wrong clock = false “expired” errors
- Evidence disappears after NTP fixes it—check chrony logs, not just current time
- Add clock skew tolerance—30 seconds of leeway handles most drift scenarios
The next time you see “certificate expired” for a certificate that isn’t expired, check your node’s time sync history first.
Related Articles
- Split-Brain from Clock Step Backwards - Wall time in lease-based systems
- gRPC Keepalive Transport Closing - Another “works then fails” debugging story