OpenTelemetry Collector Backpressure: Fixing Drops with memory_limiter and Queues

The failure mode looks like “tracing is flaky”:

Developers swear they emitted spans.
The backend (Tempo/Jaeger/OTLP endpoint) looks healthy.
Yet traces go missing exactly when traffic is highest.
Collector pods restart (OOMKilled) or silently drop data.

I’ve had this happen at the worst possible moment: during an incident where we needed traces most.

The core idea is simple: telemetry must be best-effort, and the collector must be configured to shed load intentionally instead of dying under backpressure.

Tested on: OpenTelemetry Collector (contrib) 0.9x–0.10x, Kubernetes 1.29–1.31, OTLP over gRPC, exporters with retry + queues.

Incident narrative (anonymized)

We run a node-local collector as a DaemonSet (OTLP receiver), exporting to a central gateway and then to our trace backend. During a traffic spike:

exporter latency increased (backend ingest was slower)
the collector started buffering more
memory climbed steadily until OOMKilled
after restart, it repeated
we lost traces for the exact interval we were investigating

Blast radius: observability degraded for the whole cluster; we still had metrics/logs, but trace-driven debugging was blind.

Constraint: We could not “just scale the backend” in the moment. We needed the collector to remain stable under partial backend slowness.

Timeline

T-0: on-call wants traces; trace coverage suddenly drops.
T+10m: collector pods show memory climb; some get OOMKilled.
T+20m: logs contain “dropping data” / “queue full” messages.
T+30m: we confirm exporter backpressure (slow sends, retries).
T+45m: mitigation: enable memory_limiter, tune queues, reduce batch size.
T+90m: memory stabilizes, drop rate becomes bounded and visible, not catastrophic.

Mechanism: why collectors die (or drop invisibly)

Backpressure is normal — you must decide how to lose data

A collector pipeline is roughly:

receivers → processors → exporters

When exporters slow down:

queues fill
retries kick in
memory usage grows (buffered spans/logs/metrics)
eventually you either:
- OOMKill the collector (worst: you lose everything still buffered, and the cycle repeats after restart)
- or drop data (better: controlled loss)
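
The trade-off above can be made concrete with a toy model. This is a minimal sketch, not collector code: `simulate` and its tick-based rates are hypothetical, and a real collector buffers batches rather than single items.

```python
def simulate(arrival_per_tick, drain_per_tick, ticks, capacity=None):
    """Toy model of a buffer sitting between a receiver and a slow exporter.

    capacity=None models an unbounded buffer (memory grows until OOM);
    a number models a bounded queue that drops whatever does not fit.
    Returns (peak_buffered, dropped, sent).
    """
    buffered = peak = dropped = sent = 0
    for _ in range(ticks):
        # Admit arrivals up to the remaining room; count the rest as dropped.
        room = arrival_per_tick if capacity is None else max(0, capacity - buffered)
        admitted = min(arrival_per_tick, room)
        dropped += arrival_per_tick - admitted
        buffered += admitted
        peak = max(peak, buffered)
        # The exporter drains at its (slowed) rate.
        out = min(buffered, drain_per_tick)
        buffered -= out
        sent += out
    return peak, dropped, sent


# Hypothetical numbers: 100 items arrive per tick, the exporter manages 60.
print(simulate(100, 60, ticks=10))                # unbounded: peak keeps growing
print(simulate(100, 60, ticks=10, capacity=200))  # bounded: peak capped, drops counted
```

Both runs send the same amount, because the exporter is the bottleneck either way. The only difference is whether the excess shows up as unbounded memory or as a counted drop.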

Tail sampling makes memory pressure worse

Tail sampling must hold the spans of a trace in memory until it makes its sampling decision. If you combine tail sampling with exporter backpressure and no memory limiter, memory cliffs are very easy to trigger.
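
A back-of-the-envelope estimate shows why. This sketch assumes spans are held for a fixed wait window (the tail sampling processor exposes this as `decision_wait`) and that average in-memory span size is known. The workload numbers are hypothetical, and real overhead is likely higher.

```python
def tail_sampling_buffer_bytes(spans_per_s, decision_wait_s, avg_span_bytes):
    """Rough lower bound on memory held while traces await a decision."""
    return spans_per_s * decision_wait_s * avg_span_bytes


# Hypothetical workload: 20,000 spans/s, 30 s wait, ~1 KB per span in memory.
held = tail_sampling_buffer_bytes(20_000, 30, 1_000)
print(f"{held / 1e6:.0f} MB held before any exporter queueing")
```

That buffer exists even when the backend is healthy. Exporter queues and retries stack on top of it.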

“No errors” doesn’t mean “no drops”

Many teams only monitor backend ingest. But drops can happen:

in the collector queue
in processors (memory_limiter refusing data once its limit is reached)
in exporters when retry budget is exhausted

If you don’t graph collector telemetry about itself, you’re blind.

Runbook: prove collector backpressure and drops

What to check first

Look at collector pod restarts and memory

kubectl -n observability get pods -o wide
kubectl -n observability describe pod <otelcol-pod> | grep -nE "OOMKilled|Restart"
kubectl -n observability top pod <otelcol-pod>

Read collector logs around the drop window

kubectl -n observability logs <otelcol-pod> --since=30m | tail -n 200

I’m looking for phrases like:

“dropping data”
“queue is full”
“exporting failed”
“retrying”

Scrape /metrics and discover the actual metric names

Collector metric names change between releases, so I don’t assume them; I grep what’s there.

# port-forward blocks the terminal; run it in a second terminal (or background it)
kubectl -n observability port-forward pod/<otelcol-pod> 8888:8888
curl -s http://127.0.0.1:8888/metrics | grep -E "^otelcol_" | head -n 50
curl -s http://127.0.0.1:8888/metrics | grep -E "exporter|queue|dropped|refused|failed" | head -n 50
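
If you would rather get a de-duplicated list of names than raw lines, the same filter can be sketched in Python. The helper `metric_names` is hypothetical, and the sample text is illustrative only; exact names and suffixes differ between collector builds.

```python
import re


def metric_names(exposition_text,
                 keywords=("exporter", "queue", "dropped", "refused", "failed")):
    """Return sorted unique otelcol_* metric names that contain any keyword."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        # The metric name ends at '{' (labels) or at whitespace (value).
        name = re.split(r"[{\s]", line, maxsplit=1)[0]
        if name.startswith("otelcol_") and any(k in name for k in keywords):
            names.add(name)
    return sorted(names)


# Illustrative sample; paste the body returned by the curl command instead.
SAMPLE = """\
# TYPE otelcol_exporter_queue_size gauge
otelcol_exporter_queue_size{exporter="otlp"} 9876
otelcol_exporter_queue_capacity{exporter="otlp"} 10000
otelcol_receiver_accepted_spans{receiver="otlp",transport="grpc"} 1.2e+06
otelcol_process_memory_rss 5.1e+08
"""

for name in metric_names(SAMPLE):
    print(name)
```

The output is a short list you can copy straight into dashboard queries.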

How to confirm the hypothesis

You want to see a coherent story:

exporter send latency increases
queue utilization rises towards capacity
accepted spans stay high but “sent” falls behind
“failed” or “dropped” increases
memory climbs (if no limiter) or drop increases (if limiter is active)

With Prometheus, I usually graph:

accepted vs sent (by pipeline/exporter)
queue size / capacity
failed / dropped counters
collector RSS

(Use the exact metric names present in your build; the /metrics grep above gives you the ground truth.)
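
Without Prometheus, two manual scrapes a known interval apart give the same signal. This is a minimal sketch: the dictionary keys are hypothetical labels for whatever accepted, sent and queue metrics your build exposes, and the numbers are made up.

```python
def backlog_report(first, second, interval_s):
    """Compare two scrapes of counters/gauges taken interval_s seconds apart."""
    accepted_rate = (second["accepted"] - first["accepted"]) / interval_s
    sent_rate = (second["sent"] - first["sent"]) / interval_s
    return {
        "accepted_per_s": accepted_rate,
        "sent_per_s": sent_rate,
        # A positive gap means data is accumulating or being lost.
        "gap_per_s": accepted_rate - sent_rate,
        "queue_utilization": second["queue_size"] / second["queue_capacity"],
    }


# Hypothetical scrapes taken 60 seconds apart.
t0 = {"accepted": 1_000_000, "sent": 990_000}
t1 = {"accepted": 1_600_000, "sent": 1_290_000,
      "queue_size": 9_000, "queue_capacity": 10_000}
print(backlog_report(t0, t1, interval_s=60))
```

A sustained positive gap together with utilization near 1.0 is the backpressure signature described above.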

Safe mitigations

Enable memory_limiter and put it early in the pipeline. This converts “collector dies” into “collector drops under pressure”.
Enable sending queues on exporters, so you can absorb short spikes without retry storms.
Tune batch sizes. Huge batches can create latency spikes and memory growth; batches that are too small increase overhead. I prefer modest batches with predictable memory.
Reduce load temporarily:

reduce sampling ratio for low-value telemetry
disable high-cardinality attributes
pause debug log-to-span bridges

Risky mitigations

“Increase queue size until it works”
- you’re just increasing blast radius when the backend is slow for longer than the queue horizon
“Turn off retries”
- you might drop even more data (sometimes needed, but be deliberate)
“Add file_storage everywhere”
- can move your incident from RAM to disk (and then to evictions)
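
The “queue horizon” is simple arithmetic and worth computing before touching queue_size. This sketch assumes the queue is sized in batches, which is how the exporter helper documents queue_size by default (check your version). The ingest rate and span size are hypothetical.

```python
def queue_horizon_seconds(queue_size_batches, spans_per_batch,
                          ingest_spans_per_s, drain_spans_per_s=0.0):
    """Seconds until a queue fills, given how fast it fills versus drains."""
    net = ingest_spans_per_s - drain_spans_per_s
    if net <= 0:
        return float("inf")  # the exporter keeps up; the queue never fills
    return queue_size_batches * spans_per_batch / net


def queue_memory_bytes(queue_size_batches, spans_per_batch, avg_span_bytes):
    """Worst-case memory held by a completely full queue."""
    return queue_size_batches * spans_per_batch * avg_span_bytes


# Hypothetical: 2048-span batches, 20,000 spans/s, exporter fully stalled.
for size in (1_000, 10_000):
    horizon = queue_horizon_seconds(size, 2_048, 20_000)
    memory = queue_memory_bytes(size, 2_048, 1_000)
    print(f"queue_size={size}: {horizon:.0f} s horizon, {memory / 1e9:.1f} GB if full")
```

Under these assumptions a tenfold larger queue buys tenfold more time, but only if the pod has tenfold more memory to fill. Otherwise the memory limit becomes the real bound long before the queue does.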

What we changed (concrete config)

Before (typical “works in staging” config):

no memory limiter
queues small or disabled
batch defaults

After: memory limiter + batch tuning + exporter queue.

Example snippet:

processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 75
    spike_limit_percentage: 15
  batch:
    timeout: 1s
    send_batch_size: 2048
    send_batch_max_size: 4096

exporters:
  otlp:
    endpoint: otel-gateway.observability.svc.cluster.local:4317
    tls:
      insecure: false
    sending_queue:
      enabled: true
      num_consumers: 4
      queue_size: 10000
    retry_on_failure:
      enabled: true
      initial_interval: 200ms
      max_interval: 5s
      max_elapsed_time: 30s

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp]
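
To get a feel for what the retry_on_failure values above imply, the backoff schedule can be approximated. This is a simplified sketch, not the exporter's actual algorithm: it assumes a growth multiplier of 1.5 and ignores the random jitter the real backoff applies.

```python
def retry_schedule(initial_s, max_interval_s, max_elapsed_s, multiplier=1.5):
    """Approximate waits between retries under capped exponential backoff."""
    waits, elapsed, interval = [], 0.0, initial_s
    while elapsed + interval <= max_elapsed_s:
        waits.append(interval)
        elapsed += interval
        # Grow the wait, but never beyond the configured cap.
        interval = min(interval * multiplier, max_interval_s)
    return waits


# Values from the config: 200ms initial, 5s cap, 30s retry budget.
waits = retry_schedule(0.2, 5, 30)
print(len(waits), "retries over", round(sum(waits), 2), "seconds")
```

The point of the estimate is the budget: a batch that still fails after roughly 30 seconds is given up on, so a backend outage longer than that produces drops even with the queue enabled.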

Important ordering detail:

memory_limiter must be before batch (and before tail sampling if you use it), otherwise it won’t protect you from buildup.
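
The percentages in the memory_limiter block translate into two thresholds. As I understand the processor's documentation, the hard limit is limit_percentage of total memory and the soft limit is the hard limit minus the spike allowance. Treat this sketch as an approximation; the 2000 MiB container limit is a hypothetical figure.

```python
def memory_limiter_thresholds(total_mib, limit_percentage, spike_limit_percentage):
    """Derive hard and soft limits (in MiB) from percentage settings."""
    hard = total_mib * limit_percentage // 100
    spike = total_mib * spike_limit_percentage // 100
    # Refusal of new data starts at the soft limit, below the hard limit.
    return {"hard_limit_mib": hard, "soft_limit_mib": hard - spike}


# Config values above (75 / 15) applied to a hypothetical 2000 MiB limit.
print(memory_limiter_thresholds(2000, 75, 15))
```

With these numbers, refusals begin at around 1200 MiB rather than 1500 MiB, which is the figure to compare against collector RSS on a dashboard.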

How to verify (measurable)

Collector memory stabilizes

RSS flat under steady peak load
no OOMKills

Drops become bounded and visible

you may still drop under extreme backend slowness, but it’s a controlled rate, not “everything disappeared”.

Queues behave like shock absorbers

queue usage rises during short spikes, then drains back
it does not sit pegged at capacity for long periods

Backend recovers without a retry storm

exporter retries stop once backend latency returns
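
The shock-absorber check can be turned into a number. This minimal sketch takes a series of queue utilization samples (queue size divided by capacity) and reports the longest run spent near capacity. The 0.95 threshold and the sample values are hypothetical.

```python
def longest_pegged_run(utilization_samples, threshold=0.95):
    """Length of the longest consecutive run of samples at or above threshold."""
    longest = current = 0
    for u in utilization_samples:
        current = current + 1 if u >= threshold else 0
        longest = max(longest, current)
    return longest


healthy = [0.10, 0.60, 0.97, 0.50, 0.20]        # spike, then drains back
pegged = [0.50, 0.96, 1.00, 1.00, 0.99, 0.30]   # sits at capacity
print(longest_pegged_run(healthy), longest_pegged_run(pegged))
```

Multiply the run length by your scrape interval to get the time spent pegged, and alert when it exceeds what you consider a short spike.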

Prevention / guardrails

Telemetry loss budget
- acceptable drop rate (e.g. ≤ 0.1% steady state; higher allowed during incidents)
Collector SLO
- restart-free under peak
- bounded memory under backpressure
Sampling policy as a contract
- tail sampling requires explicit sizing and memory budget
Dashboards and alerts
- dropped/refused counters
- queue utilization
- exporter latency and retry rate
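
A loss budget is only useful if it is computed the same way every time. One possible definition, sketched below, counts refused plus failed-to-send items against everything offered to the collector. The function names and counts are hypothetical, and other definitions are equally defensible.

```python
def drop_rate(accepted, refused, send_failed):
    """Fraction of offered telemetry that did not reach the backend."""
    offered = accepted + refused
    if offered == 0:
        return 0.0
    return (refused + send_failed) / offered


def within_budget(accepted, refused, send_failed, budget=0.001):
    """True if the observed drop rate stays within the loss budget."""
    return drop_rate(accepted, refused, send_failed) <= budget


# Hypothetical counts over one evaluation window, against a 0.1% budget.
print(within_budget(accepted=999_500, refused=500, send_failed=300))
print(within_budget(accepted=990_000, refused=10_000, send_failed=0))
```

Evaluating this per window (for example per hour) keeps the steady-state budget and the incident allowance separate.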
