The Cert Isn't Expired, Your Node Is: Time Drift Breaking TLS and JWT in Kubernetes

Time drift turns TLS and JWT into a weird, intermittent horror story.

“Certificate expired? But we just renewed it last week!” The alert dashboard was lighting up with TLS handshake failures across multiple services. Some pods were failing to connect to external APIs. Others were rejecting JWTs as expired. But when I checked the certificates, they were valid for months. When I checked the JWTs, they hadn’t expired. The problem was gone by the time I started investigating.

Then it happened again two days later. Same pattern: a burst of failures, then everything fine. This time I was faster. I SSH’d to one of the affected nodes immediately and ran chronyc tracking. There it was: NTP had just corrected a 15-second drift with a step adjustment. Until that correction, the node’s clock had been off by 15 seconds. Depending on the direction of the offset, that is enough to push notBefore checks on freshly issued certificates into “not yet valid” territory, or exp checks on tokens close to expiry into “expired” territory.

The frustrating part is that by the time you investigate, the evidence is gone. NTP fixes the clock, services recover, and all you have are mysterious logs about expired certificates that aren’t actually expired. You have to know to look at NTP/chrony logs and time sync metrics to see that the clock itself was the problem.

This is one of those distributed systems bugs that comes from treating time as reliable when it fundamentally isn’t. Wall-clock time can jump forwards or backwards. Your certificates, tokens, and database replication all trust the clock to be accurate, and when it isn’t, they fail in ways that seem impossible.

Environment: Kubernetes 1.28+, nodes with NTP/chrony, services using TLS and JWT authentication

Understanding the Problem

When Time Validation Fails

TLS Certificate validation:
┌─────────────────────────────────────────────────────────────┐
│                    Certificate validity                      │
│  notBefore           now()            notAfter              │
│      │                 │                  │                 │
│      ▼                 ▼                  ▼                 │
│  2024-01-01     2024-06-15          2025-01-01             │
│      │                 │                  │                 │
│      └─────── valid ───┴────── valid ─────┘                 │
└─────────────────────────────────────────────────────────────┘

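Clock runs behind, so now() falls before notBefore (15 seconds is enough for a freshly issued certificate; dates exaggerated for clarity):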
┌─────────────────────────────────────────────────────────────┐
│  notBefore           now()                                  │
│      │                 │ (clock thinks it's Dec 31, 2023)   │
│      ▼                 ▼                                    │
│  2024-01-01     2023-12-31                                  │
│      │                 │                                    │
│      └── NOT YET VALID!                                     │
│           "certificate is not valid yet"                    │
└─────────────────────────────────────────────────────────────┘

Clock runs ahead, so now() falls after notAfter (15 seconds is enough for a certificate about to expire; dates exaggerated for clarity):
┌─────────────────────────────────────────────────────────────┐
│                                  now()            notAfter  │
│                                    │                  │     │
│                                    ▼                  ▼     │
│                             2025-01-02          2025-01-01  │
│                                    │                  │     │
│                                    └── EXPIRED!             │
│                                    "certificate has expired"│
└─────────────────────────────────────────────────────────────┘
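The window check in these diagrams can be written out in a few lines. This is a simplified sketch in Python, not how any particular TLS library implements validation; `check_validity` and the sample dates are hypothetical and chosen only to show why a small offset matters near the edges of the validity window.

```python
from datetime import datetime, timedelta, timezone

def check_validity(now, not_before, not_after):
    """Classify a certificate the way a time-based validity check would."""
    if now < not_before:
        return "not yet valid"
    if now > not_after:
        return "expired"
    return "valid"

# Hypothetical certificate issued 10 seconds ago, valid for one year.
not_before = datetime(2024, 6, 15, 12, 0, 0, tzinfo=timezone.utc)
not_after = not_before + timedelta(days=365)
true_now = not_before + timedelta(seconds=10)

# A node whose clock runs 15 seconds behind sees a certificate "from the future".
skewed_now = true_now - timedelta(seconds=15)

print(check_validity(true_now, not_before, not_after))    # valid
print(check_validity(skewed_now, not_before, not_after))  # not yet valid
```

A 15-second offset is harmless in the middle of a year-long validity period. It only bites when now() is within 15 seconds of notBefore or notAfter, which is why freshly rotated certificates are the usual victims.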

JWT Token Validation Fails Similarly

JWT claims:
{
  "iat": 1718452800,  // issued at: June 15, 2024 12:00:00 UTC
  "exp": 1718456400,  // expires: June 15, 2024 13:00:00 UTC
  "nbf": 1718452800   // not before: June 15, 2024 12:00:00 UTC
}

Normal validation (now = 12:30:00):
  iat (12:00:00) < now (12:30:00) < exp (13:00:00) ✓

Clock drift +35 minutes (now thinks it's 13:05:00):
  now (13:05:00) > exp (13:00:00)
  → "Token expired" (but it's not!)

Clock drift -35 minutes (now thinks it's 11:55:00):
  now (11:55:00) < nbf (12:00:00)
  → "Token not yet valid" (but it should be!)
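The three cases above can be reproduced with a strict claim check. The sketch below is a hand-rolled illustration, not the code of any JWT library; `validate_claims` is a hypothetical name, and the timestamps are the Unix times for 12:00:00 and 13:00:00 UTC on June 15, 2024.

```python
def validate_claims(claims, now):
    """Return the error a strict validator would report, or "ok"."""
    if "nbf" in claims and now < claims["nbf"]:
        return "token not yet valid"
    if "exp" in claims and now >= claims["exp"]:
        return "token expired"
    return "ok"

claims = {
    "iat": 1718452800,  # 12:00:00 UTC
    "nbf": 1718452800,  # 12:00:00 UTC
    "exp": 1718456400,  # 13:00:00 UTC
}
true_now = 1718454600   # 12:30:00 UTC

print(validate_claims(claims, true_now))            # ok
print(validate_claims(claims, true_now + 35 * 60))  # token expired
print(validate_claims(claims, true_now - 35 * 60))  # token not yet valid
```

The token itself never changes between the three calls. Only the validator's idea of "now" does.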

Why This Happens in Kubernetes

Common clock drift scenarios:

1. VM live migration / pause-resume
   ┌────────────────────────────────────────────┐
   │ Node VM paused for migration               │
   │ Real time passes: 30 seconds               │
   │ VM resumes                                 │
   │ Node clock is now 30 seconds in the past   │
   │ NTP eventually corrects with a step        │
   └────────────────────────────────────────────┘

2. Cloud provider maintenance
   ┌────────────────────────────────────────────┐
   │ Hypervisor doing maintenance               │
   │ Guest VM clock stops for N seconds         │
   │ Clock suddenly behind when VM resumes      │
   └────────────────────────────────────────────┘

3. NTP step correction
   ┌────────────────────────────────────────────┐
   │ Node clock drifting slowly over days       │
   │ NTP notices 15-second drift                │
   │ NTP steps clock forward/backward           │
   │ Suddenly: time jumps!                      │
   └────────────────────────────────────────────┘

4. Bad NTP configuration
   ┌────────────────────────────────────────────┐
   │ Wrong NTP servers configured               │
   │ NTP server itself has wrong time           │
   │ All nodes "sync" to wrong time             │
   └────────────────────────────────────────────┘

5. Clocksource issues
   ┌────────────────────────────────────────────┐
   │ TSC (Time Stamp Counter) not stable        │
   │ Kernel switches clocksource                │
   │ Time may jump during transition            │
   └────────────────────────────────────────────┘

Diagnosing Time Issues

Check Current Time Sync Status

# On the affected node
# Using chrony (most common on modern systems)
chronyc tracking
# Reference ID    : D8EF2300 (time.google.com)
# Stratum         : 3
# Ref time (UTC)  : Sat Jun 15 12:30:00 2024
# System time     : 0.000015 seconds fast of NTP time
# Last offset     : +0.000012 seconds  ← small is good
# RMS offset      : 0.000025 seconds
# Frequency       : 1.234 ppm slow
# Root delay      : 0.001234 seconds
# Root dispersion : 0.000123 seconds
# Update interval : 1024.0 seconds
# Leap status     : Normal

# Check sources
chronyc sources -v
# ^* time.google.com   2   10   377    64   +0.012ms   +/-  20ms

# Using timedatectl
timedatectl status
# Look for: System clock synchronized: yes
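If you want to feed the offset into a script or an exporter, the "System time" line can be parsed directly. This is a minimal sketch that assumes the line format shown in the sample output above; `parse_system_time_offset` is a hypothetical helper, not part of chrony.

```python
import re

def parse_system_time_offset(tracking_output):
    """Return the signed offset in seconds from `chronyc tracking` output.

    Positive means the system clock is ahead of NTP time ("fast"),
    negative means it is behind ("slow"). Returns None if the line is missing.
    """
    match = re.search(
        r"^System time\s*:\s*([0-9.]+) seconds (fast|slow) of NTP time",
        tracking_output,
        re.MULTILINE,
    )
    if match is None:
        return None
    value = float(match.group(1))
    return value if match.group(2) == "fast" else -value

sample = """\
Stratum         : 3
System time     : 0.000015 seconds fast of NTP time
Last offset     : +0.000012 seconds
"""
print(parse_system_time_offset(sample))  # 1.5e-05
```

Note that chronyc reports the magnitude plus a direction word rather than a signed number, so a parser that only reads the number loses the sign.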

Check for Recent Time Steps

# Chrony logs - look for steps
journalctl -u chronyd --since "2 hours ago" | grep -i "step\|jump\|offset"
# Jun 15 12:00:00 node1 chronyd[1234]: System clock was stepped by 15.234567 seconds

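# Alternative: look at the last measured offset
chronyc tracking | grep -i "Last offset"
# A large value (>100ms) suggests a recent large correction.
# For history, read /var/log/chrony/tracking.log (written when "log tracking" is enabled)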

# systemd-timesyncd (if using)
journalctl -u systemd-timesyncd --since "2 hours ago" | grep -i "step\|adjust"
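When there are many nodes to check, the same grep can be turned into a small filter that extracts the step size. The sketch below assumes the chronyd message format shown above ("System clock was stepped by N seconds"); `find_steps` and the placeholder log line are hypothetical.

```python
import re

STEP_RE = re.compile(r"System clock was stepped by (-?[0-9.]+) seconds")

def find_steps(log_lines, min_seconds=1.0):
    """Return (line, step_seconds) for each logged step of at least min_seconds."""
    steps = []
    for line in log_lines:
        match = STEP_RE.search(line)
        if match and abs(float(match.group(1))) >= min_seconds:
            steps.append((line.strip(), float(match.group(1))))
    return steps

log = [
    "Jun 15 11:59:58 node1 some-other-service[42]: unrelated placeholder message",
    "Jun 15 12:00:00 node1 chronyd[1234]: System clock was stepped by 15.234567 seconds",
]
for line, seconds in find_steps(log):
    print(f"{seconds:+.3f}s  {line}")
```

The `min_seconds` threshold is there to skip the small steps chrony may make right after boot, which are usually not the ones that break validation.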

Check Kernel Clock Events

# Kernel messages about time
dmesg | grep -iE "clocksource|timekeep|tsc|suspend|time.*jump"
# Look for:
# - "Clocksource tsc unstable"
# - "timekeeping: Nonstop clocksource"
# - "PM: suspend exit" (VM pause/resume)

# Check current clocksource
cat /sys/devices/system/clocksource/clocksource0/current_clocksource
# e.g. tsc, hpet, acpi_pm, or kvm-clock on KVM guests

Prometheus Metrics (if available)

# Current time offset from NTP
node_timex_offset_seconds

# Time sync status (1 = synced)
node_timex_sync_status

# Frequency error
node_timex_frequency_adjustment_ratio

# Maximum error estimate
node_timex_maxerror_seconds

# Alert on large offset
abs(node_timex_offset_seconds) > 0.5

Application-Level Detection

# From application logs, look for time-related errors
kubectl logs deploy/api-server | grep -iE "expired|not.yet.valid|time|clock"

# Common error patterns:
# "x509: certificate has expired or is not yet valid"
# "token expired"
# "token not yet valid"
# "signature is not yet valid"
# "request time too skewed"
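An application can also notice a step on its own by comparing the wall clock with the monotonic clock. The sketch below is one possible approach, not something from the original incident; `clock_step` and `watch` are hypothetical names, and the 0.5-second threshold is an arbitrary assumption.

```python
import time

def clock_step(prev_wall, prev_mono, wall, mono):
    """Return how far the wall clock moved relative to the monotonic clock.

    Near zero means both clocks advanced together; a large positive or
    negative value means the wall clock was stepped in between.
    """
    return (wall - prev_wall) - (mono - prev_mono)

def watch(interval=1.0, threshold=0.5):
    """Poll both clocks forever and report wall-clock jumps above threshold."""
    prev_wall, prev_mono = time.time(), time.monotonic()
    while True:
        time.sleep(interval)
        wall, mono = time.time(), time.monotonic()
        step = clock_step(prev_wall, prev_mono, wall, mono)
        if abs(step) > threshold:
            print(f"wall clock stepped by {step:+.3f}s")
        prev_wall, prev_mono = wall, mono

# Simulated readings: 1 s of monotonic time passed, but the wall clock moved 16 s.
print(clock_step(1000.0, 50.0, 1016.0, 51.0))  # 15.0
```

Logging this next to the "expired" and "not yet valid" errors leaves evidence in the application's own logs, even after NTP has corrected the clock.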

Reproduction Lab

Using libfaketime (Safe, Process-Local)

# docker-compose.yml - Safe time drift simulation
version: '3.8'

services:
  tls-server:
    image: nginx:alpine
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
      - ./certs:/etc/nginx/certs:ro
    ports:
      - "8443:8443"

  client-normal:
    image: alpine:3.18
    command: sh -c "apk add curl ca-certificates && sleep infinity"
    depends_on:
      - tls-server

  client-future:
    image: debian:bookworm
    environment:
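      # libfaketime will make this container think time is 2 days in the future.
      # The path below is for amd64 Debian; on arm64 it is under /usr/lib/aarch64-linux-gnu/.
      # Until apt-get has installed the library, ld.so prints harmless "cannot be preloaded" warnings.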
      LD_PRELOAD: /usr/lib/x86_64-linux-gnu/faketime/libfaketime.so.1
      FAKETIME: "+2d"
    command: sh -c "apt-get update && apt-get install -y curl libfaketime ca-certificates && sleep infinity"
    depends_on:
      - tls-server

  client-past:
    image: debian:bookworm
    environment:
      LD_PRELOAD: /usr/lib/x86_64-linux-gnu/faketime/libfaketime.so.1
      FAKETIME: "-2d"
    command: sh -c "apt-get update && apt-get install -y curl libfaketime ca-certificates && sleep infinity"
    depends_on:
      - tls-server

Generate Short-Lived Test Certificate

# Create certificate valid for only 1 day
mkdir -p certs
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
  -keyout certs/key.pem -out certs/cert.pem \
  -subj "/CN=localhost" \
  -addext "subjectAltName=DNS:localhost,DNS:tls-server"

nginx.conf

events {}
http {
  server {
    listen 8443 ssl;
    ssl_certificate /etc/nginx/certs/cert.pem;
    ssl_certificate_key /etc/nginx/certs/key.pem;
    location / {
      return 200 "OK\n";
    }
  }
}

Run the Demo

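docker compose up -d
# Wait for the clients to finish installing packages (apt-get can take a minute)
sleep 60

# Copy the self-signed certificate into each client so curl can trust it.
# Without this, curl typically rejects the certificate as self-signed
# before it ever reports a validity-period error.
for c in client-normal client-future client-past; do
  docker compose cp certs/cert.pem "$c":/tmp/cert.pem
done

# Normal client - works
docker compose exec client-normal curl -v --cacert /tmp/cert.pem https://tls-server:8443/
# Should show: HTTP/1.1 200 OK

# Client in the future (+2 days) - certificate "expired"
docker compose exec client-future curl -v --cacert /tmp/cert.pem https://tls-server:8443/
# Error: certificate has expired

# Client in the past (-2 days) - certificate "not yet valid"
docker compose exec client-past curl -v --cacert /tmp/cert.pem https://tls-server:8443/
# Error: certificate is not yet valid

# With -k (skip verification) the same client connects, so the server itself is fine
docker compose exec client-future curl -vk https://tls-server:8443/
# HTTP/1.1 200 OK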

The Fix

1. Configure NTP to Slew Instead of Step

# /etc/chrony/chrony.conf

# Use reliable NTP servers
server time.google.com iburst
server time.cloudflare.com iburst

# Only allow step at boot (first 3 updates), then slew
makestep 1.0 3

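# After boot, corrections are slewed (gradual adjustment).
# maxslewrate caps the slew rate in ppm. At 500 ppm a 15-second offset takes
# about 8 hours to remove; chrony's default is 83333.333 ppm.
maxslewrate 500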

# Enable RTC sync
rtcsync

# Log significant events
log tracking measurements statistics
logdir /var/log/chrony

# Apply changes
systemctl restart chronyd

# Verify
chronyc tracking
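Slewing trades a sudden jump for a longer period with a wrong clock, so it helps to estimate how long a correction will take. The sketch below is back-of-the-envelope arithmetic, not chrony's actual algorithm; `slew_duration_seconds` and `would_step` are hypothetical helpers, and `would_step` is a simplified model of what `makestep 1.0 3` means.

```python
def slew_duration_seconds(offset_seconds, max_slew_ppm):
    """Minimum time needed to remove an offset by slewing at max_slew_ppm."""
    return abs(offset_seconds) / (max_slew_ppm * 1e-6)

def would_step(offset_seconds, updates_so_far, threshold=1.0, limit=3):
    """Simplified model of `makestep threshold limit`: step only during the
    first `limit` clock updates, and only if the offset exceeds `threshold`."""
    return updates_so_far < limit and abs(offset_seconds) > threshold

# 15-second offset with maxslewrate 500 (ppm)
print(round(slew_duration_seconds(15, 500) / 3600, 2))  # 8.33 (hours)

print(would_step(15, updates_so_far=0))   # True: still within the boot window
print(would_step(15, updates_so_far=10))  # False: slewed instead
```

With this configuration a 15-second offset that appears after boot is never stepped, but the node's clock stays measurably wrong for hours. That is one reason the leeway and monitoring steps below still matter.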

2. Monitor Time Sync

# Prometheus alerting rules
groups:
- name: time-sync
  rules:
  - alert: NodeTimeOffsetHigh
    expr: |
      abs(node_timex_offset_seconds) > 0.1
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Node {{ $labels.instance }} time offset is {{ $value }}s"
      description: "Time drift detected - may cause TLS/JWT failures"

  - alert: NodeTimeNotSynced
    expr: |
      node_timex_sync_status != 1
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Node {{ $labels.instance }} time not synchronized"
      description: "NTP sync lost - certificate and token validation will fail"

  - alert: NodeTimeStepDetected
    expr: |
      abs(delta(node_timex_offset_seconds[5m])) > 0.5
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Time step detected on {{ $labels.instance }}"
      description: "Clock was stepped by ~{{ $value }}s - check for issues"

3. Add JWT Clock Skew Tolerance

// Go - Add clock skew tolerance to JWT validation
import (
    "github.com/golang-jwt/jwt/v5"
    "time"
)

func validateToken(tokenString string) (*jwt.Token, error) {
    return jwt.Parse(tokenString, keyFunc,
        // Allow 30 seconds of clock skew
        jwt.WithLeeway(30*time.Second),
    )
}

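// Java - JJWT clock skew (API of JJWT 0.11.x and earlier; in 0.12+ the equivalent is
// Jwts.parser().clockSkewSeconds(30).verifyWith(key).build().parseSignedClaims(token))
import io.jsonwebtoken.Jwts;

Jwts.parser()
    .setAllowedClockSkewSeconds(30)  // 30 second tolerance
    .setSigningKey(key)
    .parseClaimsJws(token);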

# Python - PyJWT clock skew
import jwt
from datetime import timedelta

jwt.decode(
    token,
    key,
    algorithms=["RS256"],
    leeway=timedelta(seconds=30)  # 30 second tolerance
)
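What the leeway option does is simple arithmetic on the two boundaries. The sketch below is a hand-rolled illustration of that arithmetic, not the internals of any of the libraries above; `validate_with_leeway` is a hypothetical name, and the timestamps are 12:00:00 and 13:00:00 UTC on June 15, 2024.

```python
def validate_with_leeway(claims, now, leeway=0):
    """Check nbf/exp, tolerating `leeway` seconds of clock disagreement."""
    if "nbf" in claims and now < claims["nbf"] - leeway:
        return "token not yet valid"
    if "exp" in claims and now >= claims["exp"] + leeway:
        return "token expired"
    return "ok"

claims = {"nbf": 1718452800, "exp": 1718456400}
issued_at = 1718452800

# Validator clock is 15 s behind the issuer, token used right after minting.
print(validate_with_leeway(claims, issued_at - 15))             # token not yet valid
print(validate_with_leeway(claims, issued_at - 15, leeway=30))  # ok

# Leeway does not rescue a badly wrong clock (35 minutes behind).
print(validate_with_leeway(claims, issued_at - 35 * 60, leeway=30))  # token not yet valid
```

Thirty seconds covers the 15-second drift from the incident, but it also extends every token's effective lifetime by 30 seconds, so keep it as small as your observed drift allows.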

4. Kubernetes-Level: Ensure Node Time Sync

# DaemonSet to monitor/enforce time sync
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: time-sync-monitor
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: time-sync-monitor
  template:
    metadata:
      labels:
        app: time-sync-monitor
    spec:
      hostNetwork: true
      hostPID: true
      containers:
      - name: monitor
        image: alpine:3.18
        command:
        - sh
        - -c
        - |
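          apk add --no-cache chrony
          while true; do
            # "System time : 0.000015 seconds fast of NTP time" -> field 4 is the magnitude
            OFFSET=$(chronyc tracking | awk '/^System time/ {print $4}')
            # Compare in awk: no dependency on bc, and an empty reading counts as 0
            if awk -v o="${OFFSET:-0}" 'BEGIN { exit !(o > 0.5) }'; then
              echo "WARNING: Time offset is ${OFFSET}s"
            fi
            sleep 60
          done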
        securityContext:
          privileged: true
        volumeMounts:
        - name: chrony-sock
          mountPath: /var/run/chrony
      volumes:
      - name: chrony-sock
        hostPath:
          path: /var/run/chrony
      tolerations:
      - operator: Exists

5. Use Monotonic Time for Durations

// Wrong: subtracting Unix timestamps uses the wall clock only
wallStart := time.Now().Unix()
// ... do work ...
elapsedSec := time.Now().Unix() - wallStart  // Can be negative if the clock stepped!

// Right: time.Now() stores both a wall and a monotonic reading,
// and Sub/Since use the monotonic one when both values have it
startTime := time.Now()
// ... do work ...
elapsed := time.Since(startTime)  // Same as time.Now().Sub(startTime)

// The monotonic reading is dropped by Round(0) and by serialization,
// so keep the original time.Time value for timeouts:
start := time.Now()
// ... work ...
if time.Since(start) > timeout {  // Monotonic comparison
    // timed out
}

// Java - Use nanoTime for durations, not currentTimeMillis
// Wrong:
long start = System.currentTimeMillis();
// ... work ...
long elapsed = System.currentTimeMillis() - start;  // Can go negative!

// Right:
long start = System.nanoTime();
// ... work ...
long elapsed = System.nanoTime() - start;  // Always increases
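The same rule applies in Python: `time.time()` follows the wall clock, while `time.monotonic()` is not affected by steps. A minimal sketch, with `timed` as a hypothetical helper:

```python
import time

def timed(fn):
    """Run fn and return (result, elapsed_seconds) using the monotonic clock."""
    start = time.monotonic()
    result = fn()
    return result, time.monotonic() - start

result, elapsed = timed(lambda: sum(range(1000)))
print(result)        # 499500
print(elapsed >= 0)  # True, even if NTP steps the wall clock during the call
```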

Monitoring Dashboard

Key Metrics to Track

# Time offset (should be near 0)
node_timex_offset_seconds{instance=~".*"}

# Sync status (should be 1)
node_timex_sync_status{instance=~".*"}

# Change in offset over 5m (sudden changes = steps); the metric is a gauge, so use delta(), not rate()
delta(node_timex_offset_seconds[5m])

# Maximum error (uncertainty bound)
node_timex_maxerror_seconds{instance=~".*"}

# Rule out genuine expiry: seconds until each ingress certificate expires
nginx_ingress_controller_ssl_expire_time_seconds - time()

Log Correlation Query

# Find TLS/JWT errors that correlate with time events
# Step 1: Find time step events
journalctl -u chronyd --since "24 hours ago" | grep "step" | awk '{print $1" "$2" "$3}'

# Step 2: Check application logs around those times
# For each timestamp from above:
kubectl logs deploy/api-server --since-time="2024-06-15T12:00:00Z" \
  | grep -E "expired|not.yet.valid" | head -20
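Once both lists of timestamps are extracted, matching them up is a small windowed join. The sketch below assumes the timestamps have already been parsed into `datetime` objects; `errors_near_steps`, the 60-second window, and the sample times are all hypothetical.

```python
from datetime import datetime, timedelta

def errors_near_steps(step_times, error_times, window=timedelta(seconds=60)):
    """Map each step time to the errors logged within `window` of it."""
    return {
        step: [e for e in error_times if abs(e - step) <= window]
        for step in step_times
    }

steps = [datetime(2024, 6, 15, 12, 0, 0)]
errors = [
    datetime(2024, 6, 15, 11, 59, 50),  # 10 s before the step
    datetime(2024, 6, 15, 12, 0, 20),   # 20 s after the step
    datetime(2024, 6, 15, 14, 30, 0),   # unrelated
]
matches = errors_near_steps(steps, errors)
print(len(matches[steps[0]]))  # 2
```

Errors that cluster around a step, and are absent the rest of the day, point at the clock rather than at the certificates or tokens.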

Checklist

## Time Drift Debugging Checklist

### Immediate Detection
- [ ] Check current offset: `chronyc tracking`
- [ ] Check for recent steps: `journalctl -u chronyd | grep step`
- [ ] Check kernel time events: `dmesg | grep -i clock`
- [ ] Check NTP sync status: `timedatectl status`

### Root Cause
- [ ] VM pause/resume events (live migration)
- [ ] NTP misconfiguration
- [ ] Wrong NTP servers
- [ ] Clocksource instability

### Fix
- [ ] Configure chrony to slew after boot: `makestep 1.0 3`
- [ ] Add JWT clock skew tolerance (30s recommended)
- [ ] Monitor time offset with alerting
- [ ] Use monotonic time for duration calculations

### Prevention
- [ ] Alert on |offset| > 100ms
- [ ] Alert on sync status != 1
- [ ] Log time step events
- [ ] Review VM live migration policies

Conclusion

Time drift is one of those problems that feels like it shouldn’t happen in 2025. We have NTP. We have atomic clocks backing cloud providers’ time servers. Yet somehow, your Kubernetes node’s clock can still be wrong enough to break TLS and JWT validation.

The core issue is that NTP is designed to fix drift, not prevent it. When your VM is paused for migration, or when the hypervisor has a moment of instability, your node’s clock falls behind. NTP notices and corrects it—often with a “step” adjustment that jumps the clock forward. For that brief window before correction, and during the step itself, time-based validation fails.

What makes this particularly frustrating is that the evidence disappears. By the time you SSH to the node and check, NTP has already fixed the clock. The certificates are valid. The JWTs haven’t expired. You’re left with mysterious logs about expired things that aren’t actually expired.

The fix is threefold: configure NTP to slew instead of step (after initial boot sync), add clock skew tolerance to your JWT validation, and monitor time offset as a first-class infrastructure metric. When you see TLS or JWT errors that don’t match certificate/token validity, check chronyc tracking and journalctl -u chronyd before assuming the certificate is actually bad.

Key principles:

- Wall-clock time is not reliable: it can jump forwards or backwards
- NTP fixes drift but can cause steps: a sudden 15-second jump breaks everything briefly
- TLS and JWT validation trust the clock: wrong clock = false “expired” errors
- Evidence disappears after NTP fixes it: check chrony logs, not just current time
- Add clock skew tolerance: 30 seconds of leeway handles most drift scenarios

The next time you see “certificate expired” for a certificate that isn’t expired, check your node’s time sync history first.

