Prometheus remote_write backpressure: when monitoring fills the disk (and still loses data)
Remote write looks like “just another output”. In practice it’s a pipeline with sharp failure modes:
- the remote endpoint becomes slow or offline,
- remote_write starts falling behind,
- backlog grows (in memory and in the WAL),
- and after a while you get two problems at once:
  - disk pressure / Prometheus restarts
  - data that can’t be fully caught up (gaps in remote storage)
This post is an operational runbook: symptoms → measurements → safe actions → prevention.
Tested on: Prometheus 2.45+ (TSDB + remote_write), common remote storage backends (Mimir/Thanos/“anything that accepts /api/v1/write”). Metric names vary by version.
How remote_write works (only what you need during an incident)
remote_write reads samples from the WAL (write-ahead log). That’s great: short remote outages can be buffered on disk.
But there are hard edges:
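- If the endpoint is down “long enough”, your buffer is not infinite: the WAL is truncated on its own schedule whether or not remote_write has caught up, so samples that age out before being sent are gone.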
- When the endpoint comes back, you may get a catch-up burst that can saturate CPU/network (and sometimes overload the remote endpoint again).
Symptoms (what you’ll see)
In graphs
- samples_pending keeps growing and never returns to baseline
- retried_samples_total grows steadily (constant retrying)
- “highest sent timestamp” lags far behind wall clock time
- the disk used by Prometheus grows at a scary slope
In logs
- context deadline exceeded, 500, 429, TLS errors
- retry/backoff messages (depends on version and log level)
Minimum signals to collect
Metric names have changed historically. Treat the concrete names below as “typical” and verify what your version exposes.
1) How far behind remote_write is (lag)
Typical metric:
prometheus_remote_storage_queue_highest_sent_timestamp_seconds (most recent successfully sent timestamp)
Lag query:
time() - prometheus_remote_storage_queue_highest_sent_timestamp_seconds
If it keeps increasing, remote_write is not keeping up.
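The lag query can be turned into an alert with a rule along these lines. This is a sketch, not a recommended threshold: the group name, alert name, 5-minute limit, `for` duration and severity label are placeholder assumptions, and the `remote_name` label should be verified against what your version exposes.

```yaml
groups:
  - name: remote_write_lag            # hypothetical group name
    rules:
      - alert: RemoteWriteBehind      # hypothetical alert name
        # Seconds between "now" and the newest sample the queue has sent.
        expr: |
          time() - prometheus_remote_storage_queue_highest_sent_timestamp_seconds > 300
        for: 10m                      # assumption: tolerate short blips
        labels:
          severity: warning
        annotations:
          summary: "remote_write {{ $labels.remote_name }} is {{ $value | humanizeDuration }} behind"
```

If the metric reads 0 before the first successful send on your version, this rule will also fire after a restart until something gets through, which is usually what you want during an outage.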
2) How much backlog is pending
Typical metric:
prometheus_remote_storage_samples_pending
prometheus_remote_storage_samples_pending
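Because the absolute number of pending samples depends on your volume, the trend is often more useful than a fixed threshold. A minimal sketch of a "pending is growing" rule, assuming the metric name above exists on your version; the window, the `for` duration and the names are placeholders:

```yaml
groups:
  - name: remote_write_pending            # hypothetical group name
    rules:
      - alert: RemoteWritePendingGrowing  # hypothetical alert name
        # deriv() fits a per-second slope to the gauge over the window;
        # a slope that stays positive means the backlog is not draining.
        expr: deriv(prometheus_remote_storage_samples_pending[15m]) > 0
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "remote_write backlog has been growing for 30 minutes"
```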
3) Disk trend (time-to-disk-full)
If you have node exporter, track:
- node_filesystem_avail_bytes
- the slope of available bytes over time
If not, take a manual snapshot (host/pod):
du -sh /prometheus
du -sh /prometheus/wal 2>/dev/null || true
df -h /prometheus
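With node exporter in place, the "time-to-disk-full" idea can be expressed as a trend alert. The sketch below assumes the Prometheus data volume shows up with `mountpoint="/prometheus"`; in Kubernetes the mountpoint seen by node exporter is usually different, so treat the selector, the 1h lookback and the 4h horizon as assumptions to adapt.

```yaml
groups:
  - name: prometheus_disk_trend         # hypothetical group name
    rules:
      - alert: PrometheusDiskFullIn4h   # hypothetical alert name
        # Extrapolate the last hour of free space 4 hours ahead;
        # a negative prediction means the volume would be full by then.
        expr: |
          predict_linear(node_filesystem_avail_bytes{mountpoint="/prometheus"}[1h], 4 * 3600) < 0
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "Prometheus data volume is on track to fill within 4 hours"
```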
Incident playbook: what to do when remote_write is failing
Step 0: confirm it’s the remote endpoint (not local networking)
From the Prometheus host/pod:
curl -v https://REMOTE-ENDPOINT/api/v1/write
This may not return “200 OK” (some endpoints don’t accept GET), but it’s useful for DNS/TLS/connectivity and latency.
Step 1: decide if you’re still in a “safe window”
If lag is growing and the endpoint is down, ask:
- How long has remote_write been effectively behind?
- Is the endpoint likely to be back before you hit disk full or irreversible backlog loss?
Step 2: choose the trade-off (survive vs “no gaps”)
When the endpoint is down, you only have a few options:
Option A — maximize catch-up chance (endpoint returns soon)
- keep remote_write enabled,
- protect Prometheus from OOM/disk full,
- prepare for a catch-up burst when the endpoint returns.
Option B — keep Prometheus alive even if remote storage gets gaps
- temporarily disable remote_write or aggressively reduce the volume you send,
- goal: Prometheus survives and local alerting continues,
- accept gaps in remote storage.
This choice must be explicit. “Let’s just wait” is often the worst strategy.
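For Option B, "aggressively reduce the volume you send" can be done with an allowlist in `write_relabel_configs`, which filters only what goes to the remote endpoint and leaves local storage and local alerting untouched. A sketch, where the metric-name patterns are purely hypothetical examples of "what must still reach remote storage":

```yaml
remote_write:
  - url: https://REMOTE-ENDPOINT/api/v1/write
    write_relabel_configs:
      # Keep only series whose metric name matches the allowlist;
      # everything else is dropped before it is queued for sending.
      # The regex is fully anchored, as in all Prometheus relabeling.
      - source_labels: [__name__]
        regex: "up|slo:.*|job:.*"   # hypothetical allowlist
        action: keep
```

Keeping this block prepared (commented out or behind a config flag) makes the switch a reload rather than an edit under pressure.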
Step 3: tune queue_config (endpoint is up, but slow)
Template (don’t treat numbers as universal; these are levers):
remote_write:
  - url: https://REMOTE-ENDPOINT/api/v1/write
    queue_config:
      # Backpressure / throughput knobs:
      max_samples_per_send: 2000
      capacity: 10000
      min_shards: 1
      max_shards: 50
      # Max time a sample waits in a partial batch before it is sent anyway:
      batch_send_deadline: 5s
      # Retry/backoff knobs:
      min_backoff: 30ms
      max_backoff: 5s
Practical rules:
- capacity (samples buffered per shard): if too low, the shard buffers fill, reading from the WAL blocks, and the backlog stops draining.
  - good starting point: make capacity a multiple of max_samples_per_send (the Prometheus remote write tuning guide suggests roughly 3 to 10 times)
- max_shards: more shards = more parallelism = more throughput… but also:
  - more memory,
  - more load on the remote endpoint (risk of re-breaking it on recovery)
- min_backoff/max_backoff: helps avoid a “retry storm” after recovery
Step 4: after recovery, watch catch-up (not just “green”)
After the remote storage is back, you want to see:
- samples_pending trending down,
- “highest sent timestamp” catching up to current time,
- shard count not pinned at max (sustained saturation).
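Shard saturation can be watched with a rule that compares how many shards the queue wants with how many it is allowed. This sketch assumes the `prometheus_remote_storage_shards_desired` and `prometheus_remote_storage_shards_max` gauges are exposed on your version; the names and the `for` duration are placeholders.

```yaml
groups:
  - name: remote_write_shards             # hypothetical group name
    rules:
      - alert: RemoteWriteShardsSaturated # hypothetical alert name
        # The queue calculates it needs more shards than max_shards allows,
        # so it cannot scale out any further to catch up.
        expr: |
          prometheus_remote_storage_shards_desired
            > prometheus_remote_storage_shards_max
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "remote_write {{ $labels.remote_name }} wants more shards than max_shards"
```

If this fires during catch-up, raising max_shards is only safe when the remote endpoint has headroom; otherwise the slower recovery is the intended behavior.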
Catch-up can take a while and it’s expensive. Watch CPU/network saturation.
Prevention: treat remote_write as a contract with a budget
Define a remote_write budget:
- Max lag (e.g. “remote_write must not be behind more than X minutes”)
- Disk budget for Prometheus data (GB you can tolerate during an outage)
- Memory budget (how much extra RAM remote_write is allowed to use)
- Fallback strategy:
  - when to disable remote_write so Prometheus survives,
  - what to drop first (high-cardinality, noisy, low-value)
And most importantly: alerts that fire long before disk full.
What I’d do in prod
My default strategy:
- have a dashboard for remote_write lag + pending + shard saturation
- have alerts: “remote_write behind”, “pending growing”, “disk trend”
- decide in advance whether priority is:
  - “no gaps in remote storage”, or
  - “local alerting must stay alive”
- prepare safe switches:
  - temporarily reduce volume (drop low-value metrics),
  - conservative max_shards so recovery doesn’t take the endpoint down again
FAQ
Does increasing max_shards always help?
No. It can just move the bottleneck:
- overload the remote endpoint,
- or blow up Prometheus RAM usage.
Is it safe to delete the WAL?
That’s “data loss by design”. If you do it, you’re explicitly discarding backlog. Don’t do it without explicit incident approval.
How can I reduce volume without disabling remote_write?
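Drop low-value metrics via relabeling: write_relabel_configs under remote_write filters only what is sent to the remote endpoint, while metric_relabel_configs drops series at scrape time, so they disappear from local storage too. Reducing scrape scope is the third option. Start with whatever has the worst cardinality.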
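A denylist is the gentler variant: send everything except a few known offenders. In the sketch below the metric-name patterns are hypothetical; replace them with whatever your cardinality analysis flags.

```yaml
remote_write:
  - url: https://REMOTE-ENDPOINT/api/v1/write
    write_relabel_configs:
      # Drop matching series from the remote stream only;
      # they are still scraped and stored locally.
      - source_labels: [__name__]
        regex: "debug_.*|myapp_request_trace_.*"   # hypothetical offenders
        action: drop
```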
Related reading
- /en/blog/prometheus-cardinality-explosion/ (when volume/churn comes from cardinality)
- /en/blog/cardinality-contracts-prometheus-label-budgets/ (label budgets as a contract)
- /en/blog/dash-contracts-grafana-alerts-ci/ (dashboards/alerts as a contract)