Redis AOF fsync Latency Spikes: When Durability Becomes Your p99

This incident looks like “Redis is slow sometimes”, but it becomes painfully repeatable once you learn the pattern:

p99 latency spikes in bursts (often every second or around persistence activity)
CPU isn’t pegged
network is fine
retries hide it until your tail SLO collapses

If you run Redis with AOF enabled, there are two classic latency killers:

fsync and disk contention (durability work competing with serving requests)
AOF rewrite fork CoW spikes (memory pressure and paging during BGREWRITEAOF)

Tested on: Redis 7.2–7.4, Linux 6.1–6.6, Kubernetes with PV-backed storage (fast SSD and slower network disks).

Incident narrative (anonymized)

We enabled AOF on a Redis instance to reduce data loss during node failures. Immediately:

p99 latency went from a few ms to 50–300ms spikes
during rewrite windows, RSS jumped and the pod got close to OOM
the application started retrying, amplifying traffic

Constraint: we couldn’t disable AOF. It was a product requirement. We needed to make persistence predictable and keep rewrite from becoming a memory cliff.

Timeline

T-0: p99 spikes appear shortly after enabling AOF.
T+10m: INFO persistence shows aof_delayed_fsync growing during spikes.
T+20m: disk latency correlates with the spikes.
T+30m: BGREWRITEAOF correlates with RSS growth (CoW behavior).
T+45m: mitigation: reduce fsync contention during rewrite and add memory headroom.
T+2h: p99 stabilizes; rewrite no longer threatens OOM.
T+1d: guardrails: alerts on delayed fsync, fork time, and rewrite frequency.

Mechanism: where the latency actually comes from

AOF makes durability a real IO workload

With AOF, Redis appends every write command to the AOF file and fsyncs according to the appendfsync policy. Even with everysec, where the fsync runs on a background thread rather than in the main event loop, disk contention still matters:

the filesystem and block device are shared resources
background writes and fsync flushes can compete with reads and writes
under slow disks, the whole process experiences latency spikes

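A good production signal is aof_delayed_fsync: “we wanted to fsync, but we were late”. It is a cumulative counter that increments when a background fsync is still pending and Redis has to write to the AOF anyway, which can block the main thread. Watch how fast it grows, not its absolute value.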
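Because the counter is cumulative, the useful number is the difference between two samples. A minimal sketch of that calculation follows, assuming you already have the raw text of two INFO snapshots; the function names and the sample strings are hypothetical, and no Redis client library is used.

```python
def parse_info(text: str) -> dict[str, str]:
    """Parse the 'key:value' lines of Redis INFO output into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and section headers such as '# Persistence'.
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(":")
        if sep:
            fields[key] = value
    return fields


def delayed_fsync_delta(before: str, after: str) -> int:
    """Return how many delayed fsyncs occurred between two INFO snapshots."""
    first = int(parse_info(before).get("aof_delayed_fsync", 0))
    second = int(parse_info(after).get("aof_delayed_fsync", 0))
    # The counter can reset (for example on restart); clamp at zero so a
    # reset does not show up as a negative delta.
    return max(second - first, 0)


# Hypothetical snapshots taken a few seconds apart.
SAMPLE_BEFORE = "# Persistence\r\naof_enabled:1\r\naof_delayed_fsync:12\r\n"
SAMPLE_AFTER = "# Persistence\r\naof_enabled:1\r\naof_delayed_fsync:15\r\n"

print(delayed_fsync_delta(SAMPLE_BEFORE, SAMPLE_AFTER))  # prints 3
```

A non-zero delta that lines up with a p99 spike is the correlation you are looking for.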

BGREWRITEAOF triggers fork and CoW

Rewrite creates a compacted AOF. Redis forks:

child writes a new AOF
parent continues serving traffic
copy-on-write means memory usage grows with every page the parent modifies while the child is alive

If memory headroom is tight, rewrite can push you into:

reclaim stalls
major faults
OOMKilled on the pod

Runbook: prove AOF is the cause

1) Check persistence state and delayed fsync

redis-cli INFO | rg -n "aof_enabled|aof_delayed_fsync|aof_rewrite_in_progress|aof_current_size|aof_base_size|latest_fork_usec"
redis-cli CONFIG GET appendfsync

What I look for:

aof_enabled:1
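appendfsync policy (a config value, so read it with CONFIG GET appendfsync; INFO does not report it)
aof_delayed_fsync increasing during spikes
aof_rewrite_in_progress:1 during memory spikes
latest_fork_usec (fork time budget; reported in the stats section of INFO, not the persistence section)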

2) Correlate with disk latency

On the node, iostat is the fastest truth:

iostat -x 1 10

If write latency (w_await in current sysstat releases, await in older ones) jumps during spikes, you’re paying for storage latency in your p99.
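If iostat is not installed on the node, the same write-latency figure can be derived from two readings of /proc/diskstats. The sketch below assumes the standard Linux field layout (writes completed and milliseconds spent writing); the function name and the sample lines are hypothetical.

```python
def write_await_ms(before: str, after: str) -> float:
    """Average milliseconds per completed write between two
    /proc/diskstats lines for the same device."""
    b = before.split()
    a = after.split()
    if b[2] != a[2]:
        raise ValueError("snapshots are for different devices")
    # Field layout: index 2 = device name, index 7 = writes completed,
    # index 10 = milliseconds spent writing.
    writes = int(a[7]) - int(b[7])
    write_ms = int(a[10]) - int(b[10])
    # No completed writes in the interval means there is nothing to average.
    return write_ms / writes if writes > 0 else 0.0


# Hypothetical lines captured one second apart for one device.
LINE_T0 = "259 0 nvme0n1 1000 0 8000 500 2000 0 16000 4000 0 3000 4500"
LINE_T1 = "259 0 nvme0n1 1000 0 8000 500 2100 0 16800 6500 0 3500 7000"

print(write_await_ms(LINE_T0, LINE_T1))  # prints 25.0
```

In this made-up sample, 100 writes took 2500 ms in total, so the average write waited 25 ms. That is the kind of number that shows up directly in Redis tail latency.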

3) Check memory headroom and OOM risk

In Kubernetes:

kubectl -n <ns> top pod <redis-pod>
kubectl -n <ns> describe pod <redis-pod> | rg -n "Limits|Requests|OOMKilled|Restart"

If BGREWRITEAOF pushes RSS close to limit, you need more headroom or a different persistence strategy.
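A rough way to turn "close to limit" into a number is to assume some fraction of the parent's pages is rewritten, and therefore copied, while the child is alive. This is a back-of-the-envelope sketch: the dirty fraction is an assumption you have to estimate for your own write rate, and the model ignores the child's own buffers and page cache.

```python
GIB = 1024 ** 3


def rewrite_headroom_bytes(rss_bytes: int, limit_bytes: int,
                           dirty_fraction: float) -> int:
    """Bytes left under the memory limit if `dirty_fraction` of the
    parent's RSS is duplicated by copy-on-write during a rewrite.
    A negative result means the rewrite would not fit."""
    if not 0.0 <= dirty_fraction <= 1.0:
        raise ValueError("dirty_fraction must be between 0 and 1")
    cow_bytes = int(rss_bytes * dirty_fraction)
    return limit_bytes - (rss_bytes + cow_bytes)


# Hypothetical pod: 2.5 GiB RSS under a 4 GiB limit.
rss = 5 * GIB // 2
limit = 4 * GIB

print(rewrite_headroom_bytes(rss, limit, 0.5))  # prints 268435456
print(rewrite_headroom_bytes(rss, limit, 1.0))  # prints -1073741824
```

With half the pages dirtied, this hypothetical pod keeps only 256 MiB of slack. In the worst case, where every page is touched, it overshoots the limit by 1 GiB.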

Safe mitigations (practical order)

1) Use a sane fsync policy

Most production setups use appendfsync everysec (tradeoff: up to about one second of data loss on crash).

2) Reduce fsync contention during rewrite

This is a very common knob:

no-appendfsync-on-rewrite yes

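It reduces fsync pressure during BGREWRITEAOF windows. The tradeoff: while a rewrite is running, durability is effectively that of appendfsync no, so a crash in that window can lose more than one second of writes (the Redis documentation cites up to about 30 seconds with default Linux settings). Make sure that is acceptable before calling it safe.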

3) Tune rewrite thresholds so rewrites are bounded

auto-aof-rewrite-percentage 100
auto-aof-rewrite-min-size 512mb

The exact numbers are workload-dependent. The contract is: rewrite frequency is bounded and explainable.
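To make "explainable" concrete, here is a simplified model of how the two settings combine. It is a sketch of the documented behavior, not Redis source code: a rewrite is considered once the AOF exceeds the minimum size and has grown by at least the configured percentage over its size after the last rewrite. The function and parameter names are hypothetical.

```python
MB = 1024 * 1024


def should_auto_rewrite(current_size: int, base_size: int,
                        percentage: int, min_size: int) -> bool:
    """Simplified model of the automatic AOF rewrite trigger."""
    if percentage == 0:
        # A percentage of 0 disables automatic rewrites.
        return False
    if current_size <= min_size:
        # Small files are never rewritten automatically.
        return False
    base = base_size if base_size > 0 else 1
    growth_percent = current_size * 100 // base - 100
    return growth_percent >= percentage


# With percentage 100 and min-size 512mb, a 512 MB base file triggers a
# rewrite once it has doubled to 1024 MB.
print(should_auto_rewrite(1024 * MB, 512 * MB, 100, 512 * MB))  # prints True
print(should_auto_rewrite(600 * MB, 512 * MB, 100, 512 * MB))   # prints False
```

Under this model, the time between rewrites is roughly how long your write rate takes to double the file. That is the number to sanity-check against your rewrite duration.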

4) Add memory headroom for fork CoW

If you can’t afford CoW headroom, you can’t afford rewrite. In Kubernetes, this often means:

higher memory limit
Guaranteed QoS (requests equal limits) for critical Redis
dedicated node pool for predictable memory

5) Put AOF on predictable storage

AOF on slow storage is a p99 tax you pay forever. If you need durability, pay for stable latency.

Risky mitigations

appendfsync no: durability changes and data loss can be large on crash.
repeatedly killing BGREWRITEAOF: you can create a rewrite backlog and bigger future rewrites.
moving knobs without measuring aof_delayed_fsync and fork time: you will fly blind.

What we changed (concrete)

1) Make persistence predictable (config)

Representative redis.conf:

appendonly yes
appendfsync everysec
no-appendfsync-on-rewrite yes
auto-aof-rewrite-percentage 100
auto-aof-rewrite-min-size 512mb

2) Budget memory headroom explicitly

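We raised memory and set requests equal to limits for the critical instance. Note that Guaranteed QoS also requires CPU requests to equal CPU limits for every container in the pod; only the memory portion is shown here: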

resources:
  requests:
    memory: "4Gi"
  limits:
    memory: "4Gi"

3) Use storage with predictable latency

We moved the PV to a storage class with stable latency (or local SSD where appropriate).

4) Add alerts that catch it early

We alert on:

aof_delayed_fsync increasing over baseline
fork time (latest_fork_usec) exceeding a budget
rewrites happening too often or running too long
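The first two alerts and the rewrite-frequency part of the third can be sketched as one comparison between two metric samples. This is an illustration only: the thresholds are placeholders, the Sample type and evaluate function are hypothetical, and it assumes your Redis version exposes a rewrite counter (aof_rewrites) in INFO persistence.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    taken_at: float          # seconds since some fixed reference
    aof_delayed_fsync: int   # cumulative counter
    latest_fork_usec: int    # duration of the most recent fork
    aof_rewrites: int        # cumulative rewrite count (assumed available)


def evaluate(prev: Sample, cur: Sample,
             max_delayed_per_min: float = 1.0,
             fork_budget_usec: int = 50_000,
             max_rewrites_per_hour: float = 4.0) -> list[str]:
    """Return the names of alerts that fire between two samples."""
    elapsed = cur.taken_at - prev.taken_at
    if elapsed <= 0:
        raise ValueError("samples must be in chronological order")
    alerts = []
    # Cumulative counters are turned into rates before comparison.
    delayed = cur.aof_delayed_fsync - prev.aof_delayed_fsync
    if delayed * 60 / elapsed > max_delayed_per_min:
        alerts.append("delayed_fsync")
    if cur.latest_fork_usec > fork_budget_usec:
        alerts.append("fork_time")
    rewrites = cur.aof_rewrites - prev.aof_rewrites
    if rewrites * 3600 / elapsed > max_rewrites_per_hour:
        alerts.append("rewrite_frequency")
    return alerts


# Hypothetical samples five minutes apart.
earlier = Sample(taken_at=0, aof_delayed_fsync=10,
                 latest_fork_usec=20_000, aof_rewrites=3)
later = Sample(taken_at=300, aof_delayed_fsync=40,
               latest_fork_usec=120_000, aof_rewrites=4)

print(evaluate(earlier, later))
# prints ['delayed_fsync', 'fork_time', 'rewrite_frequency']
```

In practice these rules would live in your monitoring system rather than in a script. The point is that the two counter-based alerts compare rates, not absolute values.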

How to verify

p99 stabilizes (spikes reduce in frequency and magnitude).
aof_delayed_fsync stops increasing. It is a cumulative counter, so compare successive readings instead of expecting zero.

redis-cli INFO persistence | rg -n "aof_delayed_fsync"

BGREWRITEAOF no longer pushes RSS near the limit.
