OpenTelemetry Collector Backpressure: Fixing Drops with memory_limiter and Queues
The failure mode looks like “tracing is flaky”:
- Developers swear they emitted spans.
- The backend (Tempo/Jaeger/OTLP endpoint) looks healthy.
- Yet traces go missing exactly when traffic is highest.
- Collector pods restart (OOMKilled) or silently drop data.
I’ve had this happen at the worst possible moment: during an incident where we needed traces most.
The core idea is simple: telemetry must be best-effort, and the collector must be configured to shed load intentionally instead of dying under backpressure.
Tested on: OpenTelemetry Collector (contrib) 0.9x–0.10x, Kubernetes 1.29–1.31, OTLP over gRPC, exporters with retry + queues.
Incident narrative (anonymized)
We run a node-local collector as a DaemonSet (OTLP receiver), exporting to a central gateway and then to our trace backend. During a traffic spike:
- exporter latency increased (backend ingest was slower)
- the collector started buffering more
- memory climbed steadily until OOMKilled
- after restart, it repeated
- we lost traces for the exact interval we were investigating
Blast radius: observability degraded for the whole cluster; we still had metrics/logs, but trace-driven debugging was blind.
Constraint: We could not “just scale the backend” in the moment. We needed the collector to remain stable under partial backend slowness.
Timeline
- T-0: on-call wants traces; trace coverage suddenly drops.
- T+10m: collector pods show memory climb; some get OOMKilled.
- T+20m: logs contain “dropping data” / “queue full” messages.
- T+30m: we confirm exporter backpressure (slow sends, retries).
- T+45m: mitigation: enable memory_limiter, tune queues, reduce batch size.
- T+90m: memory stabilizes, drop rate becomes bounded and visible, not catastrophic.
Mechanism: why collectors die (or drop invisibly)
Backpressure is normal — you must decide how to lose data
A collector pipeline is roughly:
receivers → processors → exporters
When exporters slow down:
- queues fill
- retries kick in
- memory usage grows (buffered spans/logs/metrics)
- eventually you either:
  - OOMKill the collector (worst: you lose more)
  - or drop data (better: controlled loss)
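To make the trade-off concrete, here is a minimal Python sketch of a buffer that is fed faster than it drains. It is a toy model, not collector code: the function name, the rates, and the one-second tick are assumptions chosen purely for illustration.

```python
def simulate(arrival_per_s, drain_per_s, capacity, seconds):
    """Toy model of a buffer under backpressure.

    capacity=None models an unbounded buffer (memory just grows);
    an integer capacity models a bounded queue that drops the overflow.
    Returns (final_depth, total_dropped).
    """
    depth = 0
    dropped = 0
    for _ in range(seconds):
        depth += arrival_per_s              # receivers keep accepting data
        depth -= min(depth, drain_per_s)    # slow exporter drains what it can
        if capacity is not None and depth > capacity:
            dropped += depth - capacity     # controlled, countable loss
            depth = capacity
    return depth, dropped


unbounded = simulate(1000, 600, None, 60)
bounded = simulate(1000, 600, 5000, 60)
```

With these made-up rates, the unbounded buffer holds 24,000 items after one minute and is still growing, while the bounded one stays at 5,000 items and reports 19,000 drops. The second outcome is the one you can alert on.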
Tail sampling makes memory pressure worse
Tail sampling needs to hold data until it decides. If you combine tail sampling with exporter backpressure and no memory limiter, memory cliffs are very easy to trigger.
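A rough way to see the size of that held data is to multiply it out. The sketch below assumes the hold time is the tail sampling processor's decision_wait setting and that every trace is buffered for the full window; the function name and the example traffic numbers are hypothetical, and real heap usage will be higher than the raw payload estimate.

```python
def tail_sampling_buffer_bytes(traces_per_s, decision_wait_s,
                               spans_per_trace, bytes_per_span):
    """Approximate payload held in memory while traces wait for a decision."""
    return int(traces_per_s * decision_wait_s * spans_per_trace * bytes_per_span)


# Hypothetical load: 2000 traces/s, 10 s decision window,
# 20 spans per trace, about 1 KiB per span.
held = tail_sampling_buffer_bytes(2000, 10, 20, 1024)
held_mib = held / (1024 * 1024)
```

If the exporter also slows down, this buffer sits on top of whatever the sending queue is holding, which is why the combination is so easy to tip over.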
“No errors” doesn’t mean “no drops”
Many teams only monitor backend ingest. But drops can happen:
- in the collector queue
- in processors (memory limiter shedding)
- in exporters when retry budget is exhausted
If you don’t graph collector telemetry about itself, you’re blind.
Runbook: prove collector backpressure and drops
What to check first
- Look at collector pod restarts and memory
kubectl -n observability get pods -o wide
kubectl -n observability describe pod <otelcol-pod> | grep -nE "OOMKilled|Restart"
kubectl -n observability top pod <otelcol-pod>
- Read collector logs around the drop window
kubectl -n observability logs <otelcol-pod> --since=30m | tail -n 200
I’m looking for phrases like:
- “dropping data”
- “queue is full”
- “exporting failed”
- “retrying”
- Scrape /metrics and discover the actual metric names
Collector metrics evolve; I don’t assume names, I grep what’s there.
# port-forward blocks: run it in one terminal and the curl commands in another
kubectl -n observability port-forward pod/<otelcol-pod> 8888:8888
curl -s http://127.0.0.1:8888/metrics | grep -E "^otelcol_" | head -n 50
curl -s http://127.0.0.1:8888/metrics | grep -E "exporter|queue|dropped|refused|failed" | head -n 50
How to confirm the hypothesis
You want to see a coherent story:
- exporter send latency increases
- queue utilization rises towards capacity
- accepted spans stay high but “sent” falls behind
- “failed” or “dropped” increases
- memory climbs (if no limiter) or drop increases (if limiter is active)
If you have Prometheus, I usually graph:
- accepted vs sent (by pipeline/exporter)
- queue size / capacity
- failed / dropped counters
- collector RSS
(Use the exact metric names present in your build; the /metrics grep above gives you the ground truth.)
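If you don't have Prometheus handy, the same ratio can be computed straight from the scraped text. This is a minimal sketch: the two metric names are assumptions based on what I'd expect the exporter queue to expose, the sample payload is invented, and the parser only handles simple exposition lines.

```python
import re

# name, optional {labels}, then the sample value
SAMPLE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(\{.*\})?\s+(\S+)')


def sum_metric(text, name):
    """Sum every sample of `name` across all label sets in a /metrics payload."""
    total = 0.0
    for line in text.splitlines():
        if line.startswith("#"):        # skip HELP/TYPE comments
            continue
        match = SAMPLE_RE.match(line)
        if match and match.group(1) == name:
            total += float(match.group(3))
    return total


def queue_utilization(text,
                      size_name="otelcol_exporter_queue_size",          # assumed name
                      capacity_name="otelcol_exporter_queue_capacity"):  # assumed name
    """Aggregate queue size / capacity, or None if no capacity is reported."""
    capacity = sum_metric(text, capacity_name)
    if capacity == 0:
        return None
    return sum_metric(text, size_name) / capacity


# Invented sample payload, shaped like Prometheus text exposition.
SAMPLE = """\
# TYPE otelcol_exporter_queue_size gauge
otelcol_exporter_queue_size{exporter="otlp"} 7500
# TYPE otelcol_exporter_queue_capacity gauge
otelcol_exporter_queue_capacity{exporter="otlp"} 10000
"""
utilization = queue_utilization(SAMPLE)
```

Pass the real names from your own grep as size_name and capacity_name if they differ. A value that stays near 1.0 means the queue is pegged and drops are likely.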
Safe mitigations
- Enable memory_limiter and put it early in the pipeline
  This converts “collector dies” into “collector drops under pressure”.
- Enable sending queues on exporters
  So you can absorb short spikes without retry storms.
- Tune batch sizes
  Huge batches can create latency spikes and memory growth; too small can increase overhead. I prefer modest batches with predictable memory.
- Reduce load temporarily
  - reduce sampling ratio for low-value telemetry
  - disable high-cardinality attributes
  - pause debug log-to-span bridges
Risky mitigations
- “Increase queue size until it works”
  - you’re just increasing blast radius when the backend is slow for longer than the queue horizon
- “Turn off retries”
  - you might drop even more data (sometimes needed, but be deliberate)
- “Add file_storage everywhere”
  - can move your incident from RAM to disk (and then to evictions)
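The "queue horizon" is easy to estimate, and doing so shows why a bigger queue is not free. The sketch below assumes queue_size counts batches (which is how I read the exporter queue settings for the versions tested; check your release) and uses a hypothetical arrival rate and span size. The queue and batch values are the ones from the config in this post.

```python
def queue_horizon_seconds(queue_size_batches, spans_per_batch,
                          arrival_spans_per_s, drain_spans_per_s=0.0):
    """Seconds until a queue fills while arrivals outpace the exporter."""
    net_inflow = arrival_spans_per_s - drain_spans_per_s
    if net_inflow <= 0:
        return float("inf")             # exporter keeps up: queue never fills
    return queue_size_batches * spans_per_batch / net_inflow


def queue_memory_bytes(queue_size_batches, spans_per_batch, bytes_per_span):
    """Payload bytes a completely full queue would hold."""
    return queue_size_batches * spans_per_batch * bytes_per_span


# queue_size: 10000 and send_batch_size: 2048 from the config in this post;
# 5000 spans/s and 1 KiB per span are assumptions.
horizon = queue_horizon_seconds(10000, 2048, 5000)
full_queue_bytes = queue_memory_bytes(10000, 2048, 1024)
```

Under these assumptions the queue would take over an hour to fill during a total outage, but a full queue would hold roughly 20 GB of payload. In practice the memory limiter, or the OOM killer, decides long before the queue size does.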
What we changed (concrete config)
Before (typical “works in staging” config):
- no memory limiter
- queues small or disabled
- batch defaults
After: memory limiter + batch tuning + exporter queue.
Example snippet:
processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 75
    spike_limit_percentage: 15
  batch:
    timeout: 1s
    send_batch_size: 2048
    send_batch_max_size: 4096
exporters:
  otlp:
    endpoint: otel-gateway.observability.svc.cluster.local:4317
    tls:
      insecure: false
    sending_queue:
      enabled: true
      num_consumers: 4
      queue_size: 10000
    retry_on_failure:
      enabled: true
      initial_interval: 200ms
      max_interval: 5s
      max_elapsed_time: 30s
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp]
Important ordering detail:
memory_limiter must be before batch (and before tail sampling if you use it), otherwise it won’t protect you from buildup.
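It also helps to translate the two percentages into absolute numbers for your pod size. As I understand the processor's documentation, limit_percentage sets the hard limit and the soft limit is the hard limit minus the spike limit. Above the soft limit the processor starts refusing data, and above the hard limit it also forces garbage collection. The helper name and the 2000 MiB pod size below are hypothetical.

```python
def memory_limiter_thresholds(total_mib, limit_percentage, spike_limit_percentage):
    """Return (soft_limit_mib, hard_limit_mib) for percentage-based settings."""
    if not 0 < spike_limit_percentage < limit_percentage <= 100:
        raise ValueError("expected 0 < spike_limit_percentage < limit_percentage <= 100")
    hard = total_mib * limit_percentage // 100
    soft = total_mib * (limit_percentage - spike_limit_percentage) // 100
    return soft, hard


# limit_percentage: 75 and spike_limit_percentage: 15 from the config above,
# applied to an assumed 2000 MiB container memory limit.
soft_mib, hard_mib = memory_limiter_thresholds(2000, 75, 15)
```

With those inputs, shedding starts around 1200 MiB and the hard limit sits at 1500 MiB, which leaves about a quarter of the container for everything the limiter does not account for.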
How to verify (measurable)
- Collector memory stabilizes
  - RSS flat under steady peak load
  - no OOMKills
- Drops become bounded and visible
  - you may still drop under extreme backend slowness, but it’s a controlled rate, not “everything disappeared”.
- Queues behave like shock absorbers
  - queue usage rises during short spikes, then drains back
  - it does not sit pegged at capacity for long periods
- Backend recovers without a retry storm
  - exporter retries stop once backend latency returns
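The retry_on_failure values in the config bound how long a single failing batch keeps retrying, which is part of why retries die down after recovery. Here is a simplified sketch of that schedule. It assumes an exponential backoff multiplier of 1.5 and ignores the random jitter the collector applies, so treat the exact counts as approximate.

```python
def backoff_schedule(initial_s, max_interval_s, max_elapsed_s, multiplier=1.5):
    """Waits between retries until the elapsed-time budget is used up.

    multiplier=1.5 is an assumption; jitter is deliberately left out.
    """
    schedule = []
    interval = initial_s
    elapsed = 0.0
    while elapsed + interval <= max_elapsed_s:
        schedule.append(interval)
        elapsed += interval
        interval = min(interval * multiplier, max_interval_s)  # cap the wait
    return schedule


# initial_interval: 200ms, max_interval: 5s, max_elapsed_time: 30s
waits = backoff_schedule(0.2, 5.0, 30.0)
retries_before_giving_up = len(waits)
```

Under these assumptions a batch gets about a dozen retries before it is dropped, so a backend stall longer than 30s turns into counted failures rather than an ever-growing retry backlog.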
Prevention / guardrails
- Telemetry loss budget
  - acceptable drop rate (e.g. ≤ 0.1% steady state; higher allowed during incidents)
- Collector SLO
  - restart-free under peak
  - bounded memory under backpressure
- Sampling policy as a contract
  - tail sampling requires explicit sizing and memory budget
- Dashboards and alerts
  - dropped/refused counters
  - queue utilization
  - exporter latency and retry rate
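A loss budget only works if something checks it. The sketch below turns two scrapes of an "accepted" counter and a "dropped" counter into a drop rate and compares it with the 0.1% steady-state budget. The function names are hypothetical, and which collector counters you feed in depends on the metric names your build exposes.

```python
def counter_increase(previous, current):
    """Increase of a monotonic counter, tolerating a reset after a restart."""
    return current - previous if current >= previous else current


def drop_rate(accepted_prev, accepted_now, dropped_prev, dropped_now):
    """Dropped items as a fraction of accepted items between two scrapes."""
    accepted = counter_increase(accepted_prev, accepted_now)
    dropped = counter_increase(dropped_prev, dropped_now)
    if accepted == 0:
        return 0.0
    return dropped / accepted


def within_budget(rate, budget=0.001):
    """True if the observed rate is inside the loss budget (0.1% by default)."""
    return rate <= budget


# Made-up scrape values: 1,000,000 accepted and 500 dropped in the window.
rate = drop_rate(0, 1_000_000, 0, 500)
ok = within_budget(rate)
```

The reset handling matters here: a collector that was OOMKilled restarts its counters from zero, and that is exactly the window in which you care about the drop rate.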
Related reading
- Tail-Based Sampling in OpenTelemetry: Sizing, Memory Crashes and Cost Model
- Span Contracts: Trace-Driven API Contract Testing with OpenTelemetry
- Prometheus remote_write backpressure: when monitoring fills the disk (and still loses data)
- Prometheus Cardinality Explosion: Detection, Prevention, and Recovery
- Dash Contracts in Go: CI Compiler for Grafana Dashboards and Prometheus Alerts
- Structured Logging Performance: When Your Logger Becomes the Bottleneck
- Adaptive Concurrency Limits: Stop Guessing Thread Pool Sizes