OpenTelemetry Collector Backpressure: Fixing Drops with memory_limiter and Queues

The failure mode looks like “tracing is flaky”:

  • Developers swear they emitted spans.
  • The backend (Tempo/Jaeger/OTLP endpoint) looks healthy.
  • Yet traces go missing exactly when traffic is highest.
  • Collector pods restart (OOMKilled) or silently drop data.

I’ve had this happen at the worst possible moment: during an incident where we needed traces most.

The core idea is simple: telemetry must be best-effort, and the collector must be configured to shed load intentionally instead of dying under backpressure.

Tested on: OpenTelemetry Collector (contrib) 0.9x–0.10x, Kubernetes 1.29–1.31, OTLP over gRPC, exporters with retry + queues.

Incident narrative (anonymized)

We run a node-local collector as a DaemonSet (OTLP receiver), exporting to a central gateway and then to our trace backend. During a traffic spike:

  • exporter latency increased (backend ingest was slower)
  • the collector started buffering more
  • memory climbed steadily until OOMKilled
  • after restart, it repeated
  • we lost traces for the exact interval we were investigating

Blast radius: observability degraded for the whole cluster; we still had metrics/logs, but trace-driven debugging was blind.

Constraint: We could not “just scale the backend” in the moment. We needed the collector to remain stable under partial backend slowness.

Timeline

  • T-0: on-call wants traces; trace coverage suddenly drops.
  • T+10m: collector pods show memory climb; some get OOMKilled.
  • T+20m: logs contain “dropping data” / “queue full” messages.
  • T+30m: we confirm exporter backpressure (slow sends, retries).
  • T+45m: mitigation: enable memory_limiter, tune queues, reduce batch size.
  • T+90m: memory stabilizes, drop rate becomes bounded and visible, not catastrophic.

Mechanism: why collectors die (or drop invisibly)

Backpressure is normal — you must decide how to lose data

A collector pipeline is roughly:

receivers → processors → exporters

When exporters slow down:

  • queues fill
  • retries kick in
  • memory usage grows (buffered spans/logs/metrics)
  • eventually you either:
    • OOMKill the collector (worst: you lose more)
    • or drop data (better: controlled loss)
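
To make the trade-off above concrete, here is a minimal sketch of a bounded queue sitting in front of a slow exporter. It is a toy model, not the collector's implementation: `simulate_queue`, the tick-based rates, and the capacities are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class SimResult:
    sent: int        # items the exporter managed to send
    dropped: int     # items rejected because the queue was full
    peak_queue: int  # highest queue depth observed


def simulate_queue(ingest_per_tick, export_per_tick, capacity):
    """Toy model: a bounded FIFO queue in front of an exporter.

    ingest_per_tick[i] is how many items arrive in tick i,
    export_per_tick[i] is how many the exporter can send in tick i.
    """
    if len(ingest_per_tick) != len(export_per_tick):
        raise ValueError("both series must have the same length")
    queued = sent = dropped = peak = 0
    for arriving, can_send in zip(ingest_per_tick, export_per_tick):
        accepted = min(arriving, capacity - queued)  # enqueue what fits
        dropped += arriving - accepted               # shed the rest
        queued += accepted
        peak = max(peak, queued)
        drained = min(queued, can_send)              # exporter drains the queue
        queued -= drained
        sent += drained
    return SimResult(sent=sent, dropped=dropped, peak_queue=peak)


# Steady ingest, exporter slows down for four ticks in the middle.
ingest = [100] * 10
export = [100] * 3 + [20] * 4 + [100] * 3

bounded = simulate_queue(ingest, export, capacity=200)
unbounded = simulate_queue(ingest, export, capacity=10**9)
print("bounded:  ", bounded)
print("unbounded:", unbounded)
```

The bounded run drops a fixed amount and its queue depth never exceeds the capacity. The unbounded run drops nothing, but its peak depth keeps growing for as long as the exporter is slow, which is the memory climb that ends in an OOMKill.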

Tail sampling makes memory pressure worse

Tail sampling has to hold every span of a trace in memory until its decision window (decision_wait) elapses. If you combine tail sampling with exporter backpressure and no memory limiter, memory cliffs are very easy to trigger.
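
A rough back-of-the-envelope estimate shows why. The sketch below assumes spans arrive at a steady rate and that each span costs a fixed number of bytes in memory; `avg_span_bytes` is a guess you would have to measure, and the in-memory cost is usually higher than the wire size.

```python
def tail_sampling_buffer_bytes(spans_per_second, decision_wait_seconds, avg_span_bytes):
    """Approximate bytes held while traces wait for a sampling decision.

    Assumption: steady ingest and a fixed per-span memory cost.
    """
    return spans_per_second * decision_wait_seconds * avg_span_bytes


def to_mib(num_bytes):
    return num_bytes / (1024 * 1024)


# Illustrative numbers only: 20k spans/s, 30s decision window, ~1000 bytes per span.
held = tail_sampling_buffer_bytes(20_000, 30, 1_000)
print(f"{to_mib(held):.0f} MiB held before any exporter queue is involved")
```

With these assumed numbers the processor holds roughly 570 MiB on its own, before any exporter queue or retry buffer is counted.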

“No errors” doesn’t mean “no drops”

Many teams only monitor backend ingest. But drops can happen:

  • in the collector queue
  • in processors (memory limiter shedding)
  • in exporters when retry budget is exhausted

If you don’t graph collector telemetry about itself, you’re blind.

Runbook: prove collector backpressure and drops

What to check first

  1. Look at collector pod restarts and memory
kubectl -n observability get pods -o wide
kubectl -n observability describe pod <otelcol-pod> | grep -nE "OOMKilled|Restart"
kubectl -n observability top pod <otelcol-pod>
  2. Read collector logs around the drop window
kubectl -n observability logs <otelcol-pod> --since=30m | tail -n 200

I’m looking for phrases like:

  • “dropping data”
  • “queue is full”
  • “exporting failed”
  • “retrying”
  3. Scrape /metrics and discover the actual metric names
Collector metrics evolve; I don’t assume names, I grep what’s there.
# port-forward blocks, so run the curl commands from a second terminal
kubectl -n observability port-forward pod/<otelcol-pod> 8888:8888
curl -s http://127.0.0.1:8888/metrics | grep -E "^otelcol_" | head -n 50
curl -s http://127.0.0.1:8888/metrics | grep -E "exporter|queue|dropped|refused|failed" | head -n 50
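
If you would rather do the same discovery from a script, a minimal sketch follows. It only extracts metric names from Prometheus text-format output that you have already fetched (for example with the curl above). The sample text and the metric names in it are illustrative and may not match what your build exposes.

```python
import re

# A Prometheus metric name at the start of a sample line.
METRIC_NAME = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)")


def find_metric_names(metrics_text, keywords, prefix="otelcol_"):
    """Return sorted, de-duplicated metric names that start with `prefix`
    and contain at least one of `keywords`."""
    names = set()
    for line in metrics_text.splitlines():
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = METRIC_NAME.match(line)
        if not match:
            continue
        name = match.group(1)
        if name.startswith(prefix) and any(k in name for k in keywords):
            names.add(name)
    return sorted(names)


# Illustrative scrape output; real names depend on the collector version.
sample = """\
# HELP otelcol_exporter_queue_size Current size of the queue
# TYPE otelcol_exporter_queue_size gauge
otelcol_exporter_queue_size{exporter="otlp"} 812
otelcol_exporter_sent_spans{exporter="otlp"} 120034
otelcol_process_uptime 3600
go_goroutines 42
"""

for name in find_metric_names(sample, ["exporter", "queue", "dropped", "refused", "failed"]):
    print(name)
```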

How to confirm the hypothesis

You want to see a coherent story:

  • exporter send latency increases
  • queue utilization rises towards capacity
  • accepted spans stay high but “sent” falls behind
  • “failed” or “dropped” increases
  • memory climbs (if no limiter) or drop increases (if limiter is active)

If you have Prometheus, I usually graph:

  • accepted vs sent (by pipeline/exporter)
  • queue size / capacity
  • failed / dropped counters
  • collector RSS

(Use the exact metric names present in your build; the /metrics grep above gives you the ground truth.)
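
The "accepted vs sent" comparison boils down to a difference of two counter rates. As a minimal sketch, assuming you have two scrapes of cumulative counters taken a known interval apart (the dictionary keys here are placeholders, not real metric names):

```python
def backlog_growth_per_second(before, after, interval_seconds):
    """Rate at which accepted items outpace sent items between two scrapes.

    `before` and `after` are dicts holding cumulative "accepted" and "sent"
    counters. A positive result means the exporter is falling behind.
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    accepted_delta = after["accepted"] - before["accepted"]
    sent_delta = after["sent"] - before["sent"]
    if accepted_delta < 0 or sent_delta < 0:
        # Counters went backwards: the collector probably restarted.
        raise ValueError("counter reset between scrapes")
    return (accepted_delta - sent_delta) / interval_seconds


# Two scrapes 60 seconds apart (made-up values).
growth = backlog_growth_per_second(
    {"accepted": 1_000, "sent": 1_000},
    {"accepted": 7_000, "sent": 4_000},
    interval_seconds=60,
)
print(f"backlog grows by {growth:.0f} items/s")
```

A value that stays positive for longer than the queue can absorb is the signal that drops or memory growth are coming.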

Safe mitigations

  1. Enable memory_limiter and put it early in the pipeline. This converts “collector dies” into “collector drops under pressure”.

  2. Enable sending queues on exporters, so you can absorb short spikes without retry storms.

  3. Tune batch sizes. Huge batches can create latency spikes and memory growth; batches that are too small increase overhead. I prefer modest batches with predictable memory.

  4. Reduce load temporarily:

  • reduce sampling ratio for low-value telemetry
  • disable high-cardinality attributes
  • pause debug log-to-span bridges

Risky mitigations

  • “Increase queue size until it works”
    • you’re just increasing blast radius when the backend is slow for longer than the queue horizon
  • “Turn off retries”
    • you might drop even more data (sometimes needed, but be deliberate)
  • “Add file_storage everywhere”
    • can move your incident from RAM to disk (and then to evictions)
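
The "queue horizon" mentioned above can be estimated with simple arithmetic. This sketch assumes the exporter is fully stalled, that queue_size counts batches (as the exporter helper documents for the versions tested here), and that every batch is full. The ingest rate and per-span size are made-up inputs.

```python
def queue_horizon_seconds(queue_size_batches, spans_per_batch, ingest_spans_per_second):
    """Seconds of ingest a full queue can absorb while the exporter sends nothing."""
    return queue_size_batches * spans_per_batch / ingest_spans_per_second


def queue_worst_case_bytes(queue_size_batches, spans_per_batch, avg_span_bytes):
    """Approximate memory held when the queue is completely full."""
    return queue_size_batches * spans_per_batch * avg_span_bytes


# queue_size and batch size as in the config later in this post;
# 50k spans/s and 500 bytes/span are assumptions for illustration.
horizon = queue_horizon_seconds(10_000, 2_048, 50_000)
worst_case = queue_worst_case_bytes(10_000, 2_048, 500)
print(f"horizon: {horizon:.1f}s, worst-case queue memory: {worst_case / 1e9:.2f} GB")
```

Under these assumptions the queue buys about 410 seconds of backend slowness, at a worst-case cost of roughly 10 GB of buffered spans. Doubling queue_size doubles both numbers, which is why a bigger queue without a memory limit only postpones the failure.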

What we changed (concrete config)

Before (typical “works in staging” config):

  • no memory limiter
  • queues small or disabled
  • batch defaults

After: memory limiter + batch tuning + exporter queue.

Example snippet:

processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 75
    spike_limit_percentage: 15
  batch:
    timeout: 1s
    send_batch_size: 2048
    send_batch_max_size: 4096

exporters:
  otlp:
    endpoint: otel-gateway.observability.svc.cluster.local:4317
    tls:
      insecure: false
    sending_queue:
      enabled: true
      num_consumers: 4
      queue_size: 10000
    retry_on_failure:
      enabled: true
      initial_interval: 200ms
      max_interval: 5s
      max_elapsed_time: 30s

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp]
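
To see what the percentages in this config mean in absolute terms, here is a small sketch. It assumes the documented memory_limiter behaviour: the hard limit is limit_percentage of total memory, and the soft limit is the hard limit minus the spike allowance. The 2048 MiB container limit is a hypothetical value.

```python
def memory_limiter_thresholds_mib(total_memory_mib, limit_percentage, spike_limit_percentage):
    """Translate memory_limiter percentages into absolute MiB thresholds."""
    if not 0 < spike_limit_percentage < limit_percentage <= 100:
        raise ValueError("expected 0 < spike_limit_percentage < limit_percentage <= 100")
    hard = total_memory_mib * limit_percentage / 100
    spike = total_memory_mib * spike_limit_percentage / 100
    return {"hard_limit_mib": hard, "soft_limit_mib": hard - spike}


# Hypothetical pod with a 2048 MiB memory limit and the percentages used above.
thresholds = memory_limiter_thresholds_mib(2048, 75, 15)
print(thresholds)
```

For a 2048 MiB container this gives a hard limit of 1536 MiB and a soft limit of about 1229 MiB. The processor starts refusing data once usage crosses the soft limit, so that is the number to compare against your steady-state RSS.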

Important ordering detail:

  • memory_limiter must be before batch (and before tail sampling if you use it), otherwise it won’t protect you from buildup.
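
This ordering rule is easy to check mechanically. The sketch below works on an already-parsed `service.pipelines` mapping (a plain dict, however you loaded it) and flags any pipeline whose first processor is not a memory_limiter. The function name and the sample pipelines are hypothetical.

```python
def pipelines_missing_early_limiter(pipelines):
    """Return names of pipelines whose first processor is not memory_limiter.

    Component IDs may carry a name suffix such as "memory_limiter/traces",
    so only the type before the slash is compared.
    """
    problems = []
    for name, spec in pipelines.items():
        processors = spec.get("processors") or []
        if not processors or processors[0].split("/")[0] != "memory_limiter":
            problems.append(name)
    return sorted(problems)


# Sample input: only the traces pipeline follows the rule.
pipelines = {
    "traces": {"processors": ["memory_limiter", "batch"]},
    "logs": {"processors": ["batch", "memory_limiter"]},
    "metrics": {"processors": []},
}
print(pipelines_missing_early_limiter(pipelines))
```

A check like this fits naturally into CI for the collector config, so a reordered pipeline is caught before it reaches production.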

How to verify (measurable)

  1. Collector memory stabilizes
  • RSS flat under steady peak load
  • no OOMKills
  2. Drops become bounded and visible
  • you may still drop under extreme backend slowness, but it’s a controlled rate, not “everything disappeared”.
  3. Queues behave like shock absorbers
  • queue usage rises during short spikes, then drains back
  • it does not sit pegged at capacity for long periods
  4. Backend recovers without a retry storm
  • exporter retries stop once backend latency returns
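
"Pegged at capacity for long periods" can be turned into a number. A minimal sketch, assuming you have exported a series of queue utilization samples (queue size divided by capacity) at a fixed scrape interval; the 0.9 threshold and the sample values are arbitrary:

```python
def longest_pegged_run(utilization_samples, threshold=0.9):
    """Length of the longest run of consecutive samples at or above `threshold`."""
    longest = current = 0
    for utilization in utilization_samples:
        if utilization >= threshold:
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest


# Made-up utilization series, one sample per 15s scrape.
samples = [0.1, 0.5, 0.95, 0.97, 0.4, 0.92, 0.93, 0.99, 0.2]
scrape_interval_seconds = 15
run = longest_pegged_run(samples)
print(f"queue pegged for up to {run * scrape_interval_seconds}s")
```

A healthy queue produces short runs that end on their own. A run that lasts for minutes means the queue has stopped absorbing spikes and is only delaying drops.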

Prevention / guardrails

  • Telemetry loss budget
    • acceptable drop rate (e.g. ≤ 0.1% steady state; higher allowed during incidents)
  • Collector SLO
    • restart-free under peak
    • bounded memory under backpressure
  • Sampling policy as a contract
    • tail sampling requires explicit sizing and memory budget
  • Dashboards and alerts
    • dropped/refused counters
    • queue utilization
    • exporter latency and retry rate
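
The loss budget can be checked directly from counter deltas over a window. A minimal sketch, assuming you already have the number of items lost and the number received in that window; how you derive those from your collector's metrics depends on the names your build exposes.

```python
def loss_ratio(lost, received):
    """Fraction of received items that were lost; 0.0 when nothing was received."""
    if received == 0:
        return 0.0
    return lost / received


def within_loss_budget(lost, received, budget=0.001):
    """True when the loss ratio is at or below the budget (default 0.1%)."""
    return loss_ratio(lost, received) <= budget


# Made-up window: 100k items received.
print(within_loss_budget(lost=50, received=100_000))
print(within_loss_budget(lost=500, received=100_000))
```

The default budget of 0.001 matches the 0.1% steady-state example above. During an incident you would pass a larger, explicitly agreed value instead of silencing the alert.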

Cite this article

If you reference this post, please link to the original URL and credit the author.

Michal Drozd. "OpenTelemetry Collector Backpressure: Fixing Drops with memory_limiter and Queues". https://www.michal-drozd.com/en/blog/otel-collector-backpressure-memory-limiter/ (Published December 4, 2025).