Redis AOF fsync Latency Spikes: When Durability Becomes Your p99
This incident looks like “Redis is slow sometimes”, but it becomes painfully repeatable once you learn the pattern:
- p99 latency spikes in bursts (often every second or around persistence activity)
- CPU isn’t pegged
- network is fine
- retries hide it until your tail SLO collapses
If you run Redis with AOF enabled, there are two classic latency killers:
- fsync and disk contention (durability work competing with serving requests)
- AOF rewrite fork CoW spikes (memory pressure and paging during BGREWRITEAOF)
Tested on: Redis 7.2–7.4, Linux 6.1–6.6, Kubernetes with PV-backed storage (fast SSD and slower network disks).
Incident narrative (anonymized)
We enabled AOF on a Redis instance to reduce data loss during node failures. Immediately:
- p99 latency went from a few ms to 50–300ms spikes
- during rewrite windows, RSS jumped and the pod got close to OOM
- the application started retrying, amplifying traffic
Constraint: we couldn’t disable AOF. It was a product requirement. We needed to make persistence predictable and keep rewrite from becoming a memory cliff.
Timeline
- T-0: p99 spikes appear shortly after enabling AOF.
- T+10m: INFO persistence shows aof_delayed_fsync growing during spikes.
- T+20m: disk latency correlates with the spikes.
- T+30m: BGREWRITEAOF correlates with RSS growth (CoW behavior).
- T+45m: mitigation: reduce fsync contention during rewrite and add memory headroom.
- T+2h: p99 stabilizes; rewrite no longer threatens OOM.
- T+1d: guardrails: alerts on delayed fsync, fork time, and rewrite frequency.
Mechanism: where the latency actually comes from
AOF makes durability a real IO workload
With AOF, Redis appends write commands to the AOF file and fsyncs based on policy (appendfsync). Even when fsync is done asynchronously, disk contention still matters:
- the filesystem and block device are shared resources
- background writes and fsync flushes can compete with reads and writes
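- on slow disks, a background fsync that is still in flight can stall the main thread’s next AOF write, so every client sees the spike
A good production signal is aof_delayed_fsync (“we wanted to fsync, but we were late”): it increments when Redis had to go ahead with an AOF write while an earlier background fsync was still running, which is exactly the case where the main thread can block.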
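Because aof_delayed_fsync is a counter, what matters is how much it grows during a spike. A minimal sketch of that comparison, assuming you capture the raw text of redis-cli INFO persistence twice a few seconds apart. The helper names (parse_info, delayed_fsync_delta) and the sample snapshots are illustrative, not part of Redis.

```python
def parse_info(text: str) -> dict[str, str]:
    """Parse INFO output ("key:value" lines, "# Section" headers) into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(":")
        if sep:
            fields[key] = value
    return fields


def delayed_fsync_delta(before: str, after: str) -> int:
    """Return how much aof_delayed_fsync grew between two INFO snapshots."""
    a = int(parse_info(before).get("aof_delayed_fsync", 0))
    b = int(parse_info(after).get("aof_delayed_fsync", 0))
    # The counter resets on restart or CONFIG RESETSTAT; treat a drop as a reset.
    return b - a if b >= a else b


# Made-up snapshots, for illustration only.
snap_1 = "# Persistence\r\naof_enabled:1\r\naof_delayed_fsync:12\r\n"
snap_2 = "# Persistence\r\naof_enabled:1\r\naof_delayed_fsync:19\r\n"
print(delayed_fsync_delta(snap_1, snap_2))  # 7
```

A non-zero delta that lines up with a latency spike is the correlation you are looking for; a flat counter points away from fsync contention.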
BGREWRITEAOF triggers fork and CoW
Rewrite creates a compacted AOF. Redis forks:
- child writes a new AOF
- parent continues serving traffic
- copy-on-write means every page the parent modifies while the child runs gets duplicated, so memory usage grows with the write rate during the rewrite
If memory headroom is tight, rewrite can push you into:
- reclaim stalls
- major faults
- OOMKilled on the pod
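To make the headroom question concrete, here is a back-of-the-envelope sketch. It assumes the extra memory during a rewrite is roughly the number of distinct pages the parent dirties while the child runs, capped at the parent's RSS (each page is copied at most once). The function name and the example numbers are hypothetical; measure your own workload rather than trusting these.

```python
def cow_peak_bytes(rss_bytes: int, dirtied_pages: int, page_size: int = 4096) -> int:
    """Estimate peak memory during a fork-based rewrite.

    Worst case every page is copied once, so the CoW overhead is capped at RSS.
    """
    cow_overhead = min(dirtied_pages * page_size, rss_bytes)
    return rss_bytes + cow_overhead


GIB = 1024 ** 3
# Hypothetical: 2 GiB RSS, 200,000 distinct 4 KiB pages dirtied during the rewrite.
peak = cow_peak_bytes(2 * GIB, 200_000)
print(f"{peak / GIB:.2f} GiB")  # 2.76 GiB
```

After a real rewrite, aof_last_cow_size in INFO persistence reports the copy-on-write size Redis observed, which is a better input than a guess.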
Runbook: prove AOF is the cause
1) Check persistence state and delayed fsync
redis-cli INFO | rg -n "aof_enabled|aof_delayed_fsync|aof_rewrite_in_progress|aof_current_size|aof_base_size|latest_fork_usec"
redis-cli CONFIG GET appendfsync
What I look for:
- aof_enabled:1
- appendfsync policy (a config value, so it comes from CONFIG GET rather than INFO)
- aof_delayed_fsync increasing during spikes
- aof_rewrite_in_progress:1 during memory spikes
- latest_fork_usec (fork time budget; reported in the stats section of INFO, not the persistence section)
2) Correlate with disk latency
On the node, iostat is the fastest truth:
iostat -x 1 10
If await (split into r_await and w_await in newer sysstat versions) jumps during spikes, you’re paying for storage latency in your p99.
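If you want to capture this rather than eyeball it, the check can be scripted. The sketch below assumes you saved iostat -x output as text and looks up the await column by header name, since column names and order differ between sysstat versions. The function name, the 10 ms threshold, and the trimmed-down sample are all hypothetical.

```python
def slow_devices(iostat_text: str, threshold_ms: float = 10.0) -> list[tuple[str, float]]:
    """Return (device, write-await ms) pairs above threshold from `iostat -x` text."""
    hits = []
    header = None
    for line in iostat_text.splitlines():
        cols = line.split()
        if not cols:
            header = None  # a blank line ends the current device table
            continue
        if cols[0].rstrip(":") == "Device":
            header = cols
            continue
        if header is None or len(cols) != len(header):
            continue  # skips the avg-cpu block and any malformed rows
        row = dict(zip(header[1:], cols[1:]))
        # Newer sysstat reports w_await; older versions report a combined await.
        raw = row.get("w_await", row.get("await"))
        if raw is None:
            continue
        value = float(raw)
        if value > threshold_ms:
            hits.append((cols[0], value))
    return hits


# Simplified, made-up sample with only a few of the real columns.
sample = """\
Device   r/s   w/s  r_await  w_await  %util
sda      5.0  120.0    0.40    42.10  97.00
sdb      1.0    2.0    0.30     0.80   1.00
"""
print(slow_devices(sample))  # [('sda', 42.1)]
```

Write await is used here because AOF appends and fsyncs are write-side work; pick the threshold from your own baseline.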
3) Check memory headroom and OOM risk
In Kubernetes:
kubectl -n <ns> top pod <redis-pod>
kubectl -n <ns> describe pod <redis-pod> | rg -n "Limits|Requests|OOMKilled|Restart"
If BGREWRITEAOF pushes RSS close to limit, you need more headroom or a different persistence strategy.
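A rough way to turn "close to limit" into a number: compare the estimated rewrite peak (RSS plus copy-on-write overhead) against the pod's memory limit. This sketch handles only the common Kubernetes quantity suffixes, and the function names and figures are hypothetical.

```python
_SUFFIXES = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,  # binary suffixes
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,     # decimal suffixes
}


def quantity_to_bytes(quantity: str) -> int:
    """Convert a Kubernetes memory quantity such as "4Gi" to bytes (common suffixes only)."""
    for suffix, factor in _SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * factor)
    return int(quantity)  # a plain number is already in bytes


def rewrite_headroom(rss_bytes: int, cow_bytes: int, limit: str) -> float:
    """Fraction of the memory limit left over at the estimated rewrite peak."""
    limit_bytes = quantity_to_bytes(limit)
    return (limit_bytes - (rss_bytes + cow_bytes)) / limit_bytes


GIB = 2**30
# Hypothetical: 2.5 GiB RSS, 1 GiB of CoW overhead, 4Gi limit.
headroom = rewrite_headroom(int(2.5 * GIB), 1 * GIB, "4Gi")
print(f"{headroom:.1%}")  # 12.5%
```

A negative result means the estimated peak already exceeds the limit, which is the OOMKilled scenario described above.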
Safe mitigations (practical order)
1) Use a sane fsync policy
appendfsync everysec is the Redis default and the usual production choice (tradeoff: up to about one second of data loss on crash).
2) Reduce fsync contention during rewrite
This is a very common knob:
no-appendfsync-on-rewrite yes
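It reduces fsync pressure during BGREWRITEAOF windows. The tradeoff: while a rewrite is running, durability is effectively that of appendfsync no, so a crash in that window can lose more than one second of writes (the stock redis.conf cites up to 30 seconds with default Linux settings).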
3) Tune rewrite thresholds so rewrites are bounded
auto-aof-rewrite-percentage 100
auto-aof-rewrite-min-size 512mb
The exact numbers are workload-dependent. The contract is: rewrite frequency is bounded and explainable.
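To reason about how often a rewrite will fire, it helps to model the trigger. The sketch below approximates how the two settings combine: the AOF must exceed the minimum size, and it must have grown by at least the configured percentage over its size after the last rewrite (aof_base_size). Redis checks other conditions too (for example, that no child process is already running), which are not modeled here. The function name and sample sizes are hypothetical.

```python
def should_auto_rewrite(current_size: int, base_size: int,
                        percentage: int = 100,
                        min_size: int = 512 * 1024**2) -> bool:
    """Approximate the automatic AOF rewrite trigger condition."""
    if percentage == 0 or current_size <= min_size:
        return False  # percentage 0 disables automatic rewrites
    base = base_size or 1  # avoid division by zero when no base size is recorded
    growth = current_size * 100 // base - 100
    return growth >= percentage


MB = 1024**2
print(should_auto_rewrite(900 * MB, 500 * MB))   # False: only 80% growth
print(should_auto_rewrite(1000 * MB, 500 * MB))  # True: 100% growth, above min size
print(should_auto_rewrite(300 * MB, 100 * MB))   # False: below the 512mb minimum
```

With percentage 100, each rewrite fires when the file has doubled since the previous one, so the interval between rewrites is set by your write volume and by how much the rewrite compacts.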
4) Add memory headroom for fork CoW
If you can’t afford CoW headroom, you can’t afford rewrite. In Kubernetes, this often means:
- higher memory limit
- Guaranteed QoS (requests equal limits, for both CPU and memory) for critical Redis
- dedicated node pool for predictable memory
5) Put AOF on predictable storage
AOF on slow storage is a p99 tax you pay forever. If you need durability, pay for stable latency.
Risky mitigations
- appendfsync no: durability changes and data loss can be large on crash.
- repeatedly killing BGREWRITEAOF: you can create a rewrite backlog and bigger future rewrites.
- moving knobs without measuring aof_delayed_fsync and fork time: you will fly blind.
What we changed (concrete)
1) Make persistence predictable (config)
Representative redis.conf:
appendonly yes
appendfsync everysec
no-appendfsync-on-rewrite yes
auto-aof-rewrite-percentage 100
auto-aof-rewrite-min-size 512mb
2) Budget memory headroom explicitly
We raised memory and made it Guaranteed for the critical instance. Only the memory settings are shown; Guaranteed QoS also requires CPU requests to equal CPU limits.
resources:
  requests:
    memory: "4Gi"
  limits:
    memory: "4Gi"
3) Use storage with predictable latency
We moved the PV to a storage class with stable latency (or local SSD where appropriate).
4) Add alerts that catch it early
We alert on:
- aof_delayed_fsync increasing over baseline
- fork time (latest_fork_usec) exceeding a budget
- rewrites happening too often or running too long
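As a sketch of how these checks might be evaluated, the function below compares two samples of INFO fields that have already been converted to integers. The field names are real INFO fields, but the function name, the budgets (100 ms fork time, 120 s rewrite duration), and the sample values are assumptions for illustration. Rewrite frequency is not modeled here.

```python
def persistence_alerts(prev: dict, curr: dict, *,
                       fork_budget_usec: int = 100_000,
                       max_rewrite_sec: int = 120) -> list[str]:
    """Return the names of alert conditions that fire between two INFO samples."""
    alerts = []
    # aof_delayed_fsync is cumulative, so alert on growth rather than on the value.
    if curr["aof_delayed_fsync"] > prev["aof_delayed_fsync"]:
        alerts.append("aof_delayed_fsync increased")
    if curr["latest_fork_usec"] > fork_budget_usec:
        alerts.append("fork time over budget")
    if (curr["aof_rewrite_in_progress"] == 1
            and curr["aof_current_rewrite_time_sec"] > max_rewrite_sec):
        alerts.append("rewrite running too long")
    return alerts


# Made-up samples for illustration.
prev = {"aof_delayed_fsync": 3, "latest_fork_usec": 40_000,
        "aof_rewrite_in_progress": 0, "aof_current_rewrite_time_sec": -1}
curr = {"aof_delayed_fsync": 5, "latest_fork_usec": 250_000,
        "aof_rewrite_in_progress": 0, "aof_current_rewrite_time_sec": -1}
print(persistence_alerts(prev, curr))
# ['aof_delayed_fsync increased', 'fork time over budget']
```

In practice the same comparisons would usually live in your metrics system as rate and threshold rules; the point is that each alert maps to one field and one explicit budget.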
How to verify
- p99 stabilizes (spikes reduce in frequency and magnitude).
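- aof_delayed_fsync stays flat. It is a cumulative counter, so compare samples over time; it only reads near zero after a restart or CONFIG RESETSTAT.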
redis-cli INFO persistence | rg -n "aof_delayed_fsync"
- BGREWRITEAOF no longer pushes RSS near the limit.