Tail-Based Sampling in OpenTelemetry: Sizing, Memory Crashes, and a Cost Model
I only learned tail sampling properly after a very real telemetry bill. "Tail sampling is great" is what I heard at a conference. A month later our OTel Collector was being OOM-killed every 30 minutes, because we had set decision_wait too high for the memory we had given it.
The documentation explains HOW to enable tail sampling. It does not tell you how much memory you need, which num_traces value is safe, or when sampling starts dropping your most important traces.
Tested on: OTel Collector 0.96+, Kubernetes, Jaeger backend. Running in production on systems handling 10k spans/s.
Why Tail Sampling
Head Sampling (traditional)
Request → Sample decision → Trace
↓
80% dropped (random)
20% kept
Problem: you drop 80% of traces BEFORE you know whether they are interesting.
Tail Sampling
Request → Collect ALL spans → Wait for completion → Decision
↓
Keep: errors, slow, interesting
Drop: fast, successful
Benefit: you keep 100% of error traces and 100% of slow traces, and only sample the healthy ones.
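The decision step can be sketched in a few lines of Python. This is an illustration of the idea only, not the Collector's implementation: the `Span` type, the `keep_trace` function, and the use of the longest span as a stand-in for trace latency are all assumptions made for this example.

```python
from dataclasses import dataclass
import zlib


@dataclass
class Span:
    trace_id: str
    duration_ms: float
    is_error: bool


def keep_trace(spans, latency_threshold_ms=500.0, sample_percent=10.0):
    """Decide on a buffered trace: keep errors and slow traces, sample the rest."""
    if not spans:
        return False
    # Policy 1: any span with an error status keeps the whole trace.
    if any(s.is_error for s in spans):
        return True
    # Policy 2: slow traces (simplified here to "longest span over threshold").
    if max(s.duration_ms for s in spans) >= latency_threshold_ms:
        return True
    # Policy 3: deterministic bucket in [0, 100) derived from the trace ID,
    # so the same trace always gets the same answer.
    bucket = zlib.crc32(spans[0].trace_id.encode()) % 10000 / 100.0
    return bucket < sample_percent


healthy = [Span("trace-a", 12.0, False), Span("trace-a", 30.0, False)]
failed = [Span("trace-b", 12.0, True)]
print(keep_trace(failed), keep_trace(healthy, sample_percent=100.0))
```

The key difference from head sampling is that `keep_trace` only runs once the spans have been collected, which is exactly why the buffer described below needs memory.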
Basic Configuration
# otel-collector-config.yaml
processors:
  tail_sampling:
    decision_wait: 10s
    num_traces: 100000
    expected_new_traces_per_sec: 1000
    policies:
      # Always keep errors
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      # Always keep slow traces
      - name: slow-traces
        type: latency
        latency:
          threshold_ms: 500
      # Sample healthy traces
      - name: probabilistic-sample
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
Sizing: Key Parameters
decision_wait
How long the processor waits, counted from the first span of a trace, for the remaining spans to arrive before it makes a decision.
decision_wait: 10s # Wait at most 10s for the trace to complete
Trade-off:
- Too short → you decide before all spans have arrived (distributed traces)
- Too long → high memory consumption
Recommendation: max_latency_P99 + a 2-3s buffer
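A minimal sketch of that recommendation, assuming you can export a list of observed end-to-end trace durations in seconds from your backend. The function name and the nearest-rank percentile method are choices made for this example, not something the Collector provides.

```python
import math


def recommended_decision_wait(trace_durations_s, buffer_s=3.0):
    """Return P99 of observed trace durations plus a buffer, rounded up to whole seconds."""
    if not trace_durations_s:
        raise ValueError("need at least one duration sample")
    ordered = sorted(trace_durations_s)
    # Nearest-rank P99 using integer math: ceil(0.99 * n).
    rank = -(-99 * len(ordered) // 100)
    p99 = ordered[rank - 1]
    return math.ceil(p99 + buffer_s)


# 100 synthetic samples from 0.1s to 10.0s.
samples = [i / 10 for i in range(1, 101)]
print(f"decision_wait: {recommended_decision_wait(samples)}s")
```

Rounding up keeps the value conservative; remember that every extra second multiplies directly into the memory needed for the buffer.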
num_traces
The maximum number of traces held in memory at once.
num_traces: 100000
Calculation:
num_traces = expected_new_traces_per_sec × decision_wait × safety_factor
Example:
1000 traces/s × 10s × 2 = 20,000 traces
What if you exceed it? The Collector starts dropping the OLDEST traces before a decision has been made on them (including error traces!).
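The sizing rule and the overflow behaviour can be illustrated with a toy buffer. This is a simplified model written for this article, not the processor's real data structure: `TraceBuffer`, `size_num_traces`, and the `dropped_too_early` counter are hypothetical names.

```python
import math
from collections import OrderedDict


def size_num_traces(new_traces_per_sec, decision_wait_s, safety_factor=2.0):
    """num_traces = expected_new_traces_per_sec x decision_wait x safety_factor."""
    return math.ceil(new_traces_per_sec * decision_wait_s * safety_factor)


class TraceBuffer:
    """Toy model of a bounded trace buffer that evicts the oldest trace when full."""

    def __init__(self, num_traces):
        self.num_traces = num_traces
        self.traces = OrderedDict()  # trace_id -> spans, in arrival order
        self.dropped_too_early = 0

    def add_span(self, trace_id, span):
        if trace_id not in self.traces:
            if len(self.traces) >= self.num_traces:
                # Buffer full: the oldest trace is evicted before any
                # policy has looked at it, even if it contains an error.
                self.traces.popitem(last=False)
                self.dropped_too_early += 1
            self.traces[trace_id] = []
        self.traces[trace_id].append(span)


print(size_num_traces(1000, 10))

buf = TraceBuffer(num_traces=2)
buf.add_span("t1", {"status": "ERROR"})
buf.add_span("t2", {"status": "OK"})
buf.add_span("t3", {"status": "OK"})  # evicts t1, the error trace
print(list(buf.traces), buf.dropped_too_early)
```

The point of the toy model is the last three lines: eviction is by age, not by importance, so an undersized buffer silently loses exactly the traces you wanted to keep.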
Memory Estimate
memory_per_trace ≈ 10-50KB (depends on the number of spans)
total_memory = num_traces × memory_per_trace
Example:
100,000 traces × 20KB = 2GB RAM for the tail sampling buffer
Memory Sizing Formula
Required Memory (GB) =
(traces_per_second × decision_wait_seconds × avg_spans_per_trace × bytes_per_span × safety_factor)
/ 1_000_000_000
Where:
- bytes_per_span ≈ 500-2000 (depends on attributes)
- safety_factor = 1.5-2x
Example
Input:
- 1000 traces/s
- decision_wait: 15s
- 10 spans/trace
- 1KB/span
Calculation:
1000 × 15 × 10 × 1000 = 150,000,000 bytes = 150MB
With a 2x safety factor: 300MB for the sampling buffer
+ base collector overhead: ~200MB
= Minimum 500MB, I recommend 1GB
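The worked example above can be reproduced with a small helper, which makes it easy to plug in your own traffic numbers. The function name and the 200MB base overhead constant are taken from or assumed for this article; measure your own Collector's baseline before relying on it.

```python
def tail_sampling_memory_bytes(traces_per_sec, decision_wait_s,
                               spans_per_trace, bytes_per_span,
                               safety_factor=2.0):
    """Estimate the memory held by the tail sampling buffer, in bytes."""
    return (traces_per_sec * decision_wait_s * spans_per_trace
            * bytes_per_span * safety_factor)


BASE_COLLECTOR_OVERHEAD_BYTES = 200_000_000  # rough figure used in this article

buffer_bytes = tail_sampling_memory_bytes(1000, 15, 10, 1000)
total_bytes = buffer_bytes + BASE_COLLECTOR_OVERHEAD_BYTES
print(f"buffer: {buffer_bytes / 1e6:.0f} MB, total: {total_bytes / 1e6:.0f} MB")
```

Doubling decision_wait or the span count per trace doubles the result, which is why a "harmless" bump from 15s to 60s can push a Collector over its limit.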
Production Deployment
Kubernetes Resources
# collector-deployment.yaml (fragment: selector and pod labels omitted)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-collector
spec:
  template:
    spec:
      containers:
        - name: collector
          image: otel/opentelemetry-collector-contrib:0.96.0
          resources:
            requests:
              memory: 1Gi
              cpu: 500m
            limits:
              memory: 2Gi
              cpu: 1000m
          env:
            - name: GOMEMLIMIT
              value: "1800MiB" # just under 90% of the 2Gi limit
Memory Limiter Processor
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 1800
    spike_limit_mib: 400
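To see where these settings actually kick in, it helps to compute the thresholds. As far as I understand the memory_limiter, the soft limit is limit_mib minus spike_limit_mib (above it the processor starts refusing data) and limit_mib is the hard limit (above it a garbage collection is forced). The helper below is a sketch built on that assumption, with a hypothetical function name.

```python
def memory_limiter_thresholds(limit_mib, spike_limit_mib, container_limit_mib):
    """Return the soft/hard thresholds and the headroom left before the container limit."""
    if spike_limit_mib >= limit_mib:
        raise ValueError("spike_limit_mib must be smaller than limit_mib")
    return {
        "soft_limit_mib": limit_mib - spike_limit_mib,  # data is refused above this
        "hard_limit_mib": limit_mib,                    # GC is forced above this
        "headroom_mib": container_limit_mib - limit_mib,
    }


# Values from the configuration above: 2Gi container limit = 2048 MiB.
settings = memory_limiter_thresholds(1800, 400, 2048)
print(settings)
```

With these numbers the Collector starts pushing back at about 1400 MiB, so the tail sampling buffer plus base overhead should fit comfortably below that figure.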
Complete Pipeline
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 1800
    spike_limit_mib: 400
  batch:
    timeout: 1s
    send_batch_size: 1024
  tail_sampling:
    decision_wait: 10s
    num_traces: 50000
    expected_new_traces_per_sec: 500
    policies:
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: slow
        type: latency
        latency:
          threshold_ms: 500
      - name: sample-rest
        type: probabilistic
        probabilistic:
          sampling_percentage: 5
exporters:
  otlp:
    endpoint: jaeger:4317
    tls:
      insecure: true
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling, batch]
      exporters: [otlp]
Cost Model
Traces Kept vs Dropped
Input: 1,000,000 spans/day
Without sampling:
- Storage: 1M × 1KB = 1GB/day
- 30 days retention: 30GB
- Cost @ $0.10 per GB stored per day: 30GB × $0.10 = $3/day = $90/month
With tail sampling (all errors + all slow + 10% of the rest):
- Errors (1%): 10,000 spans
- Slow (5%): 50,000 spans
- Sampled (10% of rest): 94,000 spans
- Total: ~154,000 spans (15.4%)
- Storage: 154,000 spans × 1KB = 154MB/day
- 30 days: 4.6GB
- Cost: $0.46/day = $14/month
Savings: 84%
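The same arithmetic as a small calculator, so the rates can be swapped for your own. It assumes, as the figures above imply, a price of $0.10 per GB stored per day, 30-day retention at steady state, and that spans are kept in the same proportion as traces. The function names are made up for this sketch.

```python
def kept_spans(spans_per_day, error_rate=0.01, slow_rate=0.05, sample_rate=0.10):
    """Spans kept per day: all errors, all slow, plus a sample of the rest."""
    errors = spans_per_day * error_rate
    slow = spans_per_day * slow_rate
    rest = spans_per_day - errors - slow
    return errors + slow + rest * sample_rate


def monthly_storage_cost(spans_per_day, bytes_per_span=1000, retention_days=30,
                         usd_per_gb_day=0.10, days_per_month=30):
    """Steady-state monthly cost of keeping `retention_days` of spans in storage."""
    stored_gb = spans_per_day * bytes_per_span * retention_days / 1e9
    return stored_gb * usd_per_gb_day * days_per_month


full = monthly_storage_cost(1_000_000)
sampled = monthly_storage_cost(kept_spans(1_000_000))
savings = 1 - sampled / full
print(f"full: ${full:.2f}, sampled: ${sampled:.2f}, savings: {savings:.0%}")
```

The error and slow rates dominate the result: if 20% of your traffic is slow, keeping all of it leaves far less to save.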
Break-even Analysis
Tail Sampling Costs:
- Collector resources: ~$50/month (1 replica with 2GB RAM)
- Complexity: Engineering time
Savings:
- Storage: $76/month
- Query performance: Faster (less data)
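ROI: on storage alone, $76/month of savings per 1M spans/day covers the $50/month Collector at roughly 660k spans/day; below that, tail sampling only pays off if faster queries matter to you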
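A quick check of the break-even point on storage savings alone, assuming savings scale linearly with span volume and using the $50/month Collector cost and $76/month savings per 1M spans/day from above. Query-performance gains and engineering time are not priced in, so treat the result as a rough guide; the function name is hypothetical.

```python
def break_even_spans_per_day(collector_cost_usd_month=50.0,
                             savings_usd_month_per_million=76.0):
    """Daily span volume at which storage savings equal the Collector's cost."""
    return collector_cost_usd_month / savings_usd_month_per_million * 1_000_000


volume = break_even_spans_per_day()
print(f"break-even at about {volume:,.0f} spans/day")
```

If you need more than one Collector replica, multiply the cost accordingly; the break-even volume moves up by the same factor.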
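Monitoring Tail Sampling
Prometheus Metrics
# Optional prometheus exporter for pipeline metrics; its port must differ
# from the Collector's own telemetry port below
exporters:
  prometheus:
    endpoint: 0.0.0.0:8889
# The Collector's own metrics, including the tail_sampling processor metrics
service:
  telemetry:
    metrics:
      address: 0.0.0.0:8888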
Key Metrics
# Sampling decisions
rate(otelcol_processor_tail_sampling_sampling_decision_latency_count[5m])
# Traces dropped because of the num_traces limit
# (metric names vary between Collector versions; check /metrics on your build,
# where this may appear as ..._sampling_trace_dropped_too_early)
rate(otelcol_processor_tail_sampling_sampling_traces_dropped[5m])
# Memory usage
process_resident_memory_bytes{job="otel-collector"}
# Queue depth (if the exporter sending queue is enabled)
otelcol_exporter_queue_size
Alerts
groups:
  - name: otel-collector
    rules:
      - alert: TailSamplingDropping
        expr: rate(otelcol_processor_tail_sampling_sampling_traces_dropped[5m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Tail sampling is dropping traces"
          description: "Increase num_traces or reduce decision_wait"
      - alert: CollectorHighMemory
        expr: process_resident_memory_bytes{job="otel-collector"} / 1e9 > 1.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "OTel Collector memory > 1.5GB"
Common Pitfalls
1. decision_wait Too Long
# BAD: 60s wait = huge RAM consumption
decision_wait: 60s
# GOOD: 10-15s for most use cases
decision_wait: 10s
2. num_traces Too Small
# BAD: drops traces during a spike
num_traces: 1000
# GOOD: 2x expected load
num_traces: 50000
3. Composite Policy vs Multiple Policies
# BAD (when you need AND logic or a rate cap): top-level policies are
# evaluated independently and a trace is kept if ANY of them matches
policies:
  - name: errors
    type: status_code
    status_code:
      status_codes: [ERROR]
  - name: slow
    type: latency
    latency:
      threshold_ms: 500
# GOOD: composite for AND/OR logic and a span rate limit
policies:
  - name: composite-policy
    type: composite
    composite:
      max_total_spans_per_second: 1000
      # Every name here must match a sub-policy defined below
      policy_order: [errors, slow-errors]
      composite_sub_policy:
        - name: errors
          type: status_code
          status_code:
            status_codes: [ERROR]
        # Slow AND inside the 50% probabilistic sample
        - name: slow-errors
          type: and
          and:
            and_sub_policy:
              - name: slow
                type: latency
                latency:
                  threshold_ms: 500
              - name: probabilistic
                type: probabilistic
                probabilistic:
                  sampling_percentage: 50
Conclusion
Tail sampling is a powerful tool, but it needs correct sizing. Key points:
- decision_wait = P99 latency + buffer (typically 10-15s)
- num_traces = traces/s × decision_wait × 2
- Memory = num_traces × 20KB (rough estimate; 10-50KB per trace depending on span count)
- Monitoring = watch dropped traces and memory
- Cost = storage savings cover the Collector from roughly 660k spans/day with the numbers above
FAQ
What if I have very long traces (minutes)?
Increase decision_wait, but be ready for higher memory usage. Alternative: split the trace into smaller segments.
Can I scale horizontally?
Not directly. Tail sampling needs all spans of a trace on a single Collector. Put a load-balancing layer with trace ID affinity in front, for example the loadbalancing exporter routing by trace ID.
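Trace ID affinity boils down to a deterministic mapping from trace ID to Collector instance. The sketch below uses a plain hash-modulo scheme for clarity; it is an illustration with hypothetical names and hostnames, and real load balancers typically use consistent hashing so that adding or removing an instance reshuffles fewer traces.

```python
import hashlib


def pick_collector(trace_id, collectors):
    """Map a trace ID to one collector so all spans of the trace land together."""
    if not collectors:
        raise ValueError("need at least one collector")
    digest = hashlib.sha256(trace_id.encode()).digest()
    index = int.from_bytes(digest[:8], "big") % len(collectors)
    return collectors[index]


# Hypothetical second-tier collectors that run the tail_sampling processor.
backends = ["sampler-0:4317", "sampler-1:4317", "sampler-2:4317"]
for tid in ["4bf92f3577b34da6", "00f067aa0ba902b7"]:
    print(tid, "->", pick_collector(tid, backends))
```

Every span carrying the same trace ID resolves to the same backend, no matter which first-tier instance received it, which is the property tail sampling depends on.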
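What if the Collector crashes?
You lose the in-flight traces: everything still waiting in the tail sampling buffer lives only in memory. A persistent queue (file storage) helps recover data that was already sampled and waiting in the exporter queue.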
Related Articles
- K8s Connection Storm - Monitoring pod rollouts
- CI/CD for a monorepo - Integrating OTel into the pipeline