Prometheus remote_write backpressure: when monitoring fills the disk (and still loses data)

Remote write looks like “just another output”. In practice it’s a pipeline with sharp failure modes:

the remote endpoint becomes slow or offline,
remote_write starts falling behind,
backlog grows (in memory and in the WAL),
and after a while you get two problems at once:
1. disk pressure / Prometheus restarts
2. data that can’t be fully caught up (gaps in remote storage)

This post is an operational runbook: symptoms → measurements → safe actions → prevention.

Tested on: Prometheus 2.45+ (TSDB + remote_write), common remote storage backends (Mimir/Thanos/“anything that accepts /api/v1/write”). Metric names vary by version.

How remote_write works (only what you need during an incident)

remote_write reads samples from the WAL (write-ahead log). That’s great: short remote outages can be buffered on disk.

But there are hard edges:

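The buffer is not infinite: the WAL is truncated periodically, so if the endpoint stays down “long enough” (with default settings, expect roughly two hours), unsent samples are dropped and can’t be replayed.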
When the endpoint comes back, you may get a catch-up burst that can saturate CPU/network (and sometimes overload the remote endpoint again).

Symptoms (what you’ll see)

In graphs

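prometheus_remote_storage_samples_pending keeps growing and never returns to its baseline
prometheus_remote_storage_samples_retried_total grows steadily (constant retrying; older releases call it retried_samples_total)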
“highest sent timestamp” lags far behind wall clock time
the disk used by Prometheus grows at a scary slope

In logs

context deadline exceeded, HTTP 500 or 429 responses, TLS errors
retry/backoff messages (depends on version and log level)

Minimum signals to collect

Metric names have changed historically. Treat the concrete names below as “typical” and verify what your version exposes.

1) How far behind remote_write is (lag)

Typical metric:

prometheus_remote_storage_queue_highest_sent_timestamp_seconds (most recent successfully sent timestamp)

Lag query:

time() - prometheus_remote_storage_queue_highest_sent_timestamp_seconds

If it keeps increasing, remote_write is not keeping up.
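
If you want the lag query above to page someone, one possible shape is an alerting rule like the sketch below. The group name, alert name, 5-minute threshold, `for` duration and severity label are all illustrative assumptions; only the metric name comes from this post, and you should verify that your version exposes it (and its `url` label).

```yaml
# Sketch of a rule file; names and thresholds are assumptions, not tested values.
groups:
  - name: remote-write-lag              # hypothetical group name
    rules:
      - alert: RemoteWriteBehind        # hypothetical alert name
        # Seconds between wall-clock time and the newest sample this queue has sent.
        expr: time() - prometheus_remote_storage_queue_highest_sent_timestamp_seconds > 300
        for: 10m                        # ignore short blips
        labels:
          severity: warning
        annotations:
          summary: "remote_write to {{ $labels.url }} is more than 5 minutes behind"
```

One caveat with `time()` as the reference: if Prometheus itself stops ingesting, the lag grows even though remote_write is not at fault. Comparing against prometheus_remote_storage_highest_timestamp_in_seconds (the newest sample Prometheus has ingested) avoids that, if your version exposes it.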

2) How much backlog is pending

Typical metric:

prometheus_remote_storage_samples_pending
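
A rough way to alert on the pending backlog is sketched below. It assumes that pending samples are bounded by shards × per-shard capacity, and that your version exposes prometheus_remote_storage_shards and prometheus_remote_storage_shard_capacity; treat both as assumptions to verify. Rule names and thresholds are hypothetical.

```yaml
# Sketch of a rule file; names, thresholds and the "fill ratio" idea are assumptions.
groups:
  - name: remote-write-backlog                 # hypothetical group name
    rules:
      - alert: RemoteWritePendingGrowing       # hypothetical alert name
        # deriv() fits a per-second slope to the gauge over the window;
        # a positive slope held for 15m means the queue fills faster than it drains.
        expr: deriv(prometheus_remote_storage_samples_pending[15m]) > 0
        for: 15m
        labels:
          severity: warning
      - alert: RemoteWriteQueueNearlyFull      # hypothetical alert name
        # Pending samples as a fraction of the total in-memory buffer.
        expr: |
          prometheus_remote_storage_samples_pending
            /
          (prometheus_remote_storage_shards * prometheus_remote_storage_shard_capacity)
            > 0.9
        for: 10m
        labels:
          severity: warning
```

The second rule exists because a queue that is already full stops growing, so a slope-only alert can go quiet at the worst moment.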

3) Disk trend (time-to-disk-full)

If you have node exporter, track:

node_filesystem_avail_bytes
the slope of available bytes over time

If not, take a manual snapshot (host/pod):

du -sh /prometheus
du -sh /prometheus/wal 2>/dev/null || true
df -h /prometheus
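
For the node exporter case, the “slope of available bytes” can be expressed with predict_linear. The sketch below assumes the Prometheus data directory is its own filesystem mounted at /prometheus as seen by node exporter; in containers the mountpoint label is often different, so adjust the selector. Rule names, the 1h window and the 4h horizon are illustrative.

```yaml
# Sketch of a rule file; selector, window and horizon are assumptions.
groups:
  - name: prometheus-disk-trend            # hypothetical group name
    rules:
      - alert: PrometheusDiskFullSoon      # hypothetical alert name
        # Linear extrapolation of the last hour of free space, 4 hours ahead.
        # A negative prediction means "at this slope, the disk is full within 4h".
        expr: predict_linear(node_filesystem_avail_bytes{mountpoint="/prometheus"}[1h], 4 * 3600) < 0
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Prometheus data disk on {{ $labels.instance }} is predicted to fill within 4 hours"
```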

Incident playbook: what to do when remote_write is failing

Step 0: confirm it’s the remote endpoint (not local networking)

From the Prometheus host/pod:

curl -v https://REMOTE-ENDPOINT/api/v1/write

This may not return “200 OK” (some endpoints don’t accept GET), but it’s useful for DNS/TLS/connectivity and latency.

Step 1: decide if you’re still in a “safe window”

If lag is growing and the endpoint is down, ask:

How long has remote_write been effectively behind?
Is the endpoint likely to be back before you hit disk full or irreversible backlog loss?

Step 2: choose the trade-off (survive vs “no gaps”)

When the endpoint is down, you only have a few options:

Option A — maximize catch-up chance (endpoint returns soon)

keep remote_write enabled,
protect Prometheus from OOM/disk full,
prepare for a catch-up burst when the endpoint returns.

Option B — keep Prometheus alive even if remote storage gets gaps

temporarily disable remote_write or aggressively reduce the volume you send,
goal: Prometheus survives and local alerting continues,
accept gaps in remote storage.

This choice must be explicit. “Let’s just wait” is often the worst strategy.

Step 3: tune `queue_config` (endpoint is up, but slow)

Template (don’t treat numbers as universal; these are levers):

remote_write:
  - url: https://REMOTE-ENDPOINT/api/v1/write
    queue_config:
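      # Throughput knobs:
      max_samples_per_send: 2000   # samples per request
      capacity: 10000              # samples buffered per shard before WAL reads block
      min_shards: 1
      max_shards: 50               # upper bound on parallel senders

      # Batching / retry knobs:
      batch_send_deadline: 5s      # max wait before a partial batch is sent
      min_backoff: 30ms            # initial retry delay, doubled on each retry
      max_backoff: 5s              # cap on the retry delay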

Practical rules:

capacity: the per-shard sample buffer. If it is too low, shards fill up, reading from the WAL blocks, and the backlog drains slowly.
- good starting point: make capacity 3 to 10 times max_samples_per_send
max_shards: more shards = more parallelism = more throughput… but also:
- more memory,
- more load on the remote endpoint (risk of re-breaking it on recovery)
min_backoff/max_backoff: bound the delay between retries of a failed batch (the delay doubles up to max_backoff); higher values help avoid a “retry storm” after recovery
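
Applied to an endpoint that is up but fragile, the rules above might translate into something like this. It is a sketch with made-up numbers, not a recommendation: the point is the direction of each change relative to the template, and the right values depend on your sample rate and on what the remote endpoint can absorb.

```yaml
# Sketch only: values are illustrative assumptions.
remote_write:
  - url: https://REMOTE-ENDPOINT/api/v1/write
    queue_config:
      max_samples_per_send: 2000
      capacity: 10000        # 5 x max_samples_per_send
      min_shards: 1
      max_shards: 10         # lowered to limit the catch-up burst
      min_backoff: 1s        # slower first retry than the template
      max_backoff: 30s       # let retries spread out further
```

The trade-off is explicit: fewer shards and longer backoff protect the endpoint, but catch-up takes longer, so lag and disk usage stay elevated for more time.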

Step 4: after recovery, watch catch-up (not just “green”)

After the remote storage is back, you want to see:

samples_pending trending down,
“highest sent timestamp” catching up to current time,
shard count not pinned at max_shards (staying pinned means sustained saturation).

Catch-up can take a while and it’s expensive. Watch CPU/network saturation.
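
Shard saturation can be watched with a rule along these lines. It assumes your version exposes prometheus_remote_storage_shards_desired and prometheus_remote_storage_shards_max; the rule name and durations are hypothetical.

```yaml
# Sketch of a rule file; names and durations are assumptions.
groups:
  - name: remote-write-shards                  # hypothetical group name
    rules:
      - alert: RemoteWriteShardsSaturated      # hypothetical alert name
        # Prometheus computes how many shards it would like to run;
        # wanting more than max_shards means the queue cannot scale any further.
        expr: prometheus_remote_storage_shards_desired > prometheus_remote_storage_shards_max
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "remote_write to {{ $labels.url }} wants more shards than max_shards allows"
```

During catch-up this firing for a while is expected; it becomes a problem when it is still firing long after the endpoint has recovered.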

Prevention: treat remote_write as a contract with a budget

Define a remote_write budget:

Max lag (e.g. “remote_write must not be behind more than X minutes”)
Disk budget for Prometheus data (GB you can tolerate during an outage)
Memory budget (how much extra RAM remote_write is allowed to use)
Fallback strategy:
- when to disable remote_write so Prometheus survives,
- what to drop first (high-cardinality, noisy, low-value)

And most importantly: alerts that fire long before disk full.

What I’d do in prod

My default strategy:

have a dashboard for remote_write lag + pending + shard saturation
have alerts: “remote_write behind”, “pending growing”, “disk trend”
decide in advance whether priority is:
- “no gaps in remote storage”, or
- “local alerting must stay alive”
prepare safe switches:
- temporarily reduce volume (drop low-value metrics),
- conservative max_shards so recovery doesn’t take the endpoint down again

FAQ

Does increasing `max_shards` always help?

No. It can just move the bottleneck:

overload the remote endpoint,
or blow up Prometheus RAM usage.

Is it safe to delete the WAL?

That’s “data loss by design”. If you do it, you’re explicitly discarding the remote_write backlog, along with any local samples that have not yet been compacted into a block. Don’t do it without explicit incident approval.

How can I reduce volume without disabling remote_write?

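Drop low-value metrics with write_relabel_configs on the remote_write entry (they stay in the local TSDB and only disappear from remote storage). If you don’t need them locally either, drop them at scrape time with metric_relabel_configs or reduce scrape scope. Start with whatever has the worst cardinality.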
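
A minimal sketch of dropping metrics from remote_write only is shown below. The metric name patterns are hypothetical placeholders; replace them with whatever your cardinality analysis flags as low-value.

```yaml
# Sketch only: the regex is a placeholder, not a recommendation of what to drop.
remote_write:
  - url: https://REMOTE-ENDPOINT/api/v1/write
    write_relabel_configs:
      # Applied to samples just before they are sent to this endpoint.
      # Matching series are not sent, but remain queryable locally.
      - source_labels: [__name__]
        regex: "debug_.*|example_noisy_metric_.*"   # hypothetical low-value metrics
        action: drop
```

Keeping this block prepared (commented out or behind a config flag in your deployment tooling) turns “reduce volume” into a switch you can flip during an incident instead of something you design under pressure.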

Related reading

/en/blog/prometheus-cardinality-explosion/ (when volume/churn comes from cardinality)
/en/blog/cardinality-contracts-prometheus-label-budgets/ (label budgets as a contract)
/en/blog/dash-contracts-grafana-alerts-ci/ (dashboards/alerts as a contract)
