Prometheus Cardinality Explosion: Detection, Prevention, and Recovery
Cardinality explosions don’t happen in tests; they happen on your bill. Friday afternoon, quick deploy, and Grafana looked quiet. Not in a good way. Prometheus memory crossed 60GB and still OOMed. We’d already doubled the box once. The TSDB status endpoint told the story in seconds: someone had added user_id to http_requests_total.
On paper it sounded harmless: “I just want latency per user.” In Prometheus, every unique label combination is a new time series. Ten million users meant ten million series. One innocent-looking change turned monitoring into the incident.
What makes cardinality explosions particularly dangerous is how quickly they compound. If you have 5 HTTP methods, 50 status codes, and 100 endpoints, that’s 25,000 series—manageable. Add 10 million users as a label dimension, and you get 250 billion potential series. Even if only a fraction materialize, you’re still looking at millions of series, each consuming memory and disk.
The tragedy is that high-cardinality labels are useless in Prometheus anyway. You can’t meaningfully visualize 10 million user-specific time series. What you actually want—debugging a specific user’s requests—is better served by logs or traces. Prometheus is for aggregate metrics with bounded cardinality. Using it for high-cardinality data doesn’t just break Prometheus; it also doesn’t solve the problem you’re trying to solve.
Tested on: Prometheus 2.47, 50-node Kubernetes cluster, 2M active time series
Understanding Cardinality
What Creates Time Series
Metric cardinality (upper bound) = product of the number of distinct values of each label
Example:
http_requests_total{
method="GET", # 5 values (GET, POST, PUT, DELETE, PATCH)
status="200", # 50 values (200, 201, 400, 401, 404, 500...)
endpoint="/api/v1" # 100 values (endpoints)
}
Cardinality: 5 × 50 × 100 = 25,000 time series
Add user_id label with 1M users:
Cardinality: 5 × 50 × 100 × 1,000,000 = 25,000,000,000 time series
└─ Prometheus dies
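The multiplication above is a worst case: only label combinations that are actually observed become series. A minimal Go sketch of the arithmetic, where `seriesUpperBound` is a hypothetical helper (not part of any Prometheus library) and the counts are the ones from the example:

```go
package main

import "fmt"

// seriesUpperBound multiplies the number of distinct values per label.
// It is a worst-case bound: only combinations that are actually observed
// become time series.
func seriesUpperBound(valuesPerLabel map[string]uint64) uint64 {
	total := uint64(1)
	for _, n := range valuesPerLabel {
		total *= n
	}
	return total
}

func main() {
	labels := map[string]uint64{"method": 5, "status": 50, "endpoint": 100}
	fmt.Println(seriesUpperBound(labels)) // 25000

	labels["user_id"] = 1_000_000
	fmt.Println(seriesUpperBound(labels)) // 25000000000
}
```

Running a check like this against a proposed label set during code review is cheap, and it makes the cost of a new label visible before it ships.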
Memory Impact
Prometheus memory usage:
Per active time series:
- ~3KB RAM for recent samples (last 2 hours)
- ~1.5KB for TSDB head chunks
Real-world example:
Before: 500,000 time series × 3KB = 1.5GB
After adding user_id: 50,000,000 × 3KB = 150GB
That's a single bad label causing 100x memory increase
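For capacity planning, the same back-of-the-envelope estimate can be scripted. This sketch assumes the rough ~3KB-per-series figure quoted above; it is not a measured constant, and `estimateHeadGB` is a hypothetical helper name:

```go
package main

import "fmt"

// bytesPerSeries is the article's rough ~3KB-per-active-series figure,
// not a measured constant; real usage depends on label sizes and churn.
const bytesPerSeries = 3_000

// estimateHeadGB returns an approximate memory footprint in GB (10^9 bytes).
func estimateHeadGB(activeSeries uint64) float64 {
	return float64(activeSeries*bytesPerSeries) / 1e9
}

func main() {
	fmt.Printf("before: %.1f GB\n", estimateHeadGB(500_000))    // before: 1.5 GB
	fmt.Printf("after:  %.1f GB\n", estimateHeadGB(50_000_000)) // after:  150.0 GB
}
```

Treat the result as an order-of-magnitude signal. If the estimate is anywhere near the size of the box, the label should not ship.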
Detection
TSDB Status Endpoint
# Check current cardinality
curl -s localhost:9090/api/v1/status/tsdb | jq .data
# Output (trimmed; the "#" notes are annotations, not part of the JSON):
{
  "seriesCountByMetricName": [
    {"name": "http_requests_total", "value": 25000000},  # RED FLAG
    {"name": "process_cpu_seconds_total", "value": 500},
    ...
  ],
  "labelValueCountByLabelName": [
    {"name": "user_id", "value": 10000000},  # RED FLAG
    {"name": "instance", "value": 50},
    {"name": "method", "value": 5},
    ...
  ],
  "seriesCountByLabelValuePair": [
    {"name": "job=api-server", "value": 25000000},
    ...
  ]
}
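If you would rather check this from a script than by eye, the response is easy to parse. A minimal sketch in Go: the struct covers only the two fields used here (the API nests them under `data`), and the thresholds and the `flagHighCardinality` name are assumptions for illustration, not recommendations from the Prometheus project:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// tsdbStat mirrors one {"name": ..., "value": ...} entry in the response.
type tsdbStat struct {
	Name  string `json:"name"`
	Value uint64 `json:"value"`
}

// tsdbStatus covers only the fields used here; the API wraps them in "data".
type tsdbStatus struct {
	Data struct {
		SeriesCountByMetricName    []tsdbStat `json:"seriesCountByMetricName"`
		LabelValueCountByLabelName []tsdbStat `json:"labelValueCountByLabelName"`
	} `json:"data"`
}

// overThreshold returns the entries whose value exceeds limit.
func overThreshold(stats []tsdbStat, limit uint64) []tsdbStat {
	var out []tsdbStat
	for _, s := range stats {
		if s.Value > limit {
			out = append(out, s)
		}
	}
	return out
}

// flagHighCardinality parses a /api/v1/status/tsdb body and returns the
// metrics and labels that exceed the given limits.
func flagHighCardinality(body []byte, seriesLimit, labelLimit uint64) ([]tsdbStat, []tsdbStat, error) {
	var st tsdbStatus
	if err := json.Unmarshal(body, &st); err != nil {
		return nil, nil, err
	}
	return overThreshold(st.Data.SeriesCountByMetricName, seriesLimit),
		overThreshold(st.Data.LabelValueCountByLabelName, labelLimit), nil
}

func main() {
	// Sample body shaped like the output above (values are illustrative).
	body := []byte(`{"status":"success","data":{
	  "seriesCountByMetricName":[{"name":"http_requests_total","value":25000000},{"name":"process_cpu_seconds_total","value":500}],
	  "labelValueCountByLabelName":[{"name":"user_id","value":10000000},{"name":"method","value":5}]}}`)
	metrics, labels, err := flagHighCardinality(body, 100_000, 1_000)
	if err != nil {
		panic(err)
	}
	fmt.Println("metrics over limit:", metrics)
	fmt.Println("labels over limit:", labels)
}
```

In a real check, the body would come from an HTTP GET against the endpoint instead of a literal. Keeping the parsing separate from the fetch makes the logic easy to test.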
PromQL Queries
# Total active time series
prometheus_tsdb_head_series
# Time series created per second (spike detection)
rate(prometheus_tsdb_head_series_created_total[5m])
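# Resident memory of the Prometheus process
process_resident_memory_bytes
# On-disk size of the head's memory-mapped chunks (a rough proxy, not RSS)
prometheus_tsdb_head_chunks_storage_size_bytes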
# Cardinality by metric name
topk(10, count by (__name__) ({__name__=~".+"}))
# Number of distinct values of a label (here: user_id)
count(count by (user_id) ({user_id=~".+"}))
Proactive Monitoring
# prometheus-alerts.yaml
groups:
  - name: cardinality
    rules:
      - alert: HighCardinalityMetric
        # Fires once per offending metric name
        expr: |
          count by (__name__) ({__name__=~".+"}) > 100000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Metric {{ $labels.__name__ }} has >100k series"
      - alert: TimeSeriesExplosion
        expr: |
          rate(prometheus_tsdb_head_series_created_total[5m]) > 1000
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Creating {{ $value }}/sec new time series"
      - alert: TotalSeriesHigh
        expr: |
          prometheus_tsdb_head_series > 1000000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Total time series exceeds 1M"
Prevention
Relabel Config to Drop High-Cardinality Labels
# prometheus.yml
scrape_configs:
  - job_name: 'api-servers'
    static_configs:
      - targets: ['api:8080']
    metric_relabel_configs:
      # Option A: drop every series that carries a user_id label
      - source_labels: [user_id]
        regex: .+
        action: drop
      # Option B (use instead of A): drop only the label, keep the metric.
      # Series that differed only by user_id then collide, so also fix
      # the instrumentation.
      - regex: user_id
        action: labeldrop
      # Drop metrics matching pattern
      - source_labels: [__name__]
        regex: "expensive_metric_.*"
        action: drop
      # Bucket high-cardinality labels to reduce cardinality.
      # Relabeling has no substring syntax, so capture the prefix in the regex.
      - source_labels: [request_id]
        regex: "(..).*"
        target_label: request_id_bucket
        replacement: "bucket_${1}"  # First 2 hex chars = 256 buckets
        action: replace
      - regex: request_id
        action: labeldrop
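Relabel replacements follow Go's regexp template syntax, which understands `$1` and `${1}` but has no shell-style substring form. To see what a bucketing rule produces before deploying it, you can approximate the `replace` action locally. This is a sketch of the matching behaviour (fully anchored regex, template expansion), not Prometheus's own code, and `relabelReplace` is a hypothetical helper:

```go
package main

import (
	"fmt"
	"regexp"
)

// relabelReplace approximates the "replace" action: the regex is fully
// anchored, and the replacement uses Go's regexp template syntax ($1, ${1}).
// It returns the new value and whether the regex matched.
func relabelReplace(pattern, replacement, value string) (string, bool) {
	re := regexp.MustCompile("^(?:" + pattern + ")$")
	idx := re.FindStringSubmatchIndex(value)
	if idx == nil {
		return "", false // no match: the target label is left untouched
	}
	return string(re.ExpandString(nil, replacement, value, idx)), true
}

func main() {
	// Capture the first two characters of the ID in the regex itself.
	out, ok := relabelReplace(`(..).*`, "bucket_${1}", "a3f9c2e1")
	fmt.Println(out, ok) // bucket_a3 true

	// A shell-style substring such as ${1:0:2} is not template syntax,
	// so it does not yield the two-character prefix.
	out, ok = relabelReplace(`(.+)`, "bucket_${1:0:2}", "a3f9c2e1")
	fmt.Println(out, ok)
}
```

The 256-bucket figure assumes hexadecimal IDs. For other ID formats, the `hashmod` relabel action gives an explicit bucket count.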
Recording Rules for Aggregation
# Recording rules run at rule-evaluation time on data that is already
# ingested. They make aggregate queries cheap, but they do not remove the
# raw series; pair them with metric_relabel_configs to cut ingestion.
groups:
  - name: aggregations
    rules:
      # Aggregate per-user metrics to per-endpoint
      - record: http_requests:by_endpoint:rate5m
        expr: |
          sum by (endpoint, method, status) (
            rate(http_requests_total[5m])
          )
      # Keep only top N label values
      - record: http_requests:top_endpoints:rate5m
        expr: |
          topk(100,
            sum by (endpoint) (rate(http_requests_total[5m]))
          )
Application-Level Prevention
import "github.com/prometheus/client_golang/prometheus"

// Bad: user_id is unbounded, so every user adds new series
var httpRequests = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "http_requests_total",
		Help: "Total HTTP requests.",
	},
	[]string{"method", "status", "endpoint", "user_id"}, // BAD!
)

// Good: only bounded labels (replaces the declaration above)
var httpRequests = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "http_requests_total",
		Help: "Total HTTP requests.",
	},
	[]string{"method", "status", "endpoint"}, // Bounded cardinality
)

// For per-user detail, use logs or traces. Latency histograms stay
// aggregate too: no user_id label.
var requestDuration = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "http_request_duration_seconds",
		Help:    "HTTP request latency in seconds.",
		Buckets: prometheus.DefBuckets,
	},
	[]string{"method", "endpoint"}, // No user_id!
)
Label Value Bounding
import (
	"regexp"
	"strings"
)

// Compiled once at package init instead of on every call.
var endpointPatterns = []struct {
	regex       *regexp.Regexp
	replacement string
}{
	{regexp.MustCompile(`/users/[^/]+`), "/users/:id"},
	{regexp.MustCompile(`/orders/[^/]+`), "/orders/:id"},
	{regexp.MustCompile(`/\d+`), "/:id"},
}

// normalizeEndpoint bounds endpoint cardinality by replacing IDs with
// placeholders:
//
//	/users/12345    → /users/:id
//	/orders/abc-def → /orders/:id
func normalizeEndpoint(path string) string {
	result := path
	for _, p := range endpointPatterns {
		result = p.regex.ReplaceAllString(result, p.replacement)
	}
	// Crude catch-all: collapse unusually deep paths into one value.
	// An allowlist of known routes gives a tighter bound.
	if strings.Count(result, "/") > 5 {
		return "/other"
	}
	return result
}
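Pattern rewriting reduces cardinality but does not cap it: a path shape nobody anticipated still creates a new value. Where the set of legal values is known up front, an allowlist gives a hard bound. A minimal sketch, with `boundLabel` and `allowedMethods` as hypothetical names:

```go
package main

import "fmt"

// allowedMethods is a hypothetical allowlist; anything else maps to "other".
var allowedMethods = map[string]struct{}{
	"GET": {}, "POST": {}, "PUT": {}, "DELETE": {}, "PATCH": {},
}

// boundLabel returns value if it is in allowed, otherwise fallback,
// so the label can never take more than len(allowed)+1 values.
func boundLabel(value string, allowed map[string]struct{}, fallback string) string {
	if _, ok := allowed[value]; ok {
		return value
	}
	return fallback
}

func main() {
	fmt.Println(boundLabel("GET", allowedMethods, "other"))      // GET
	fmt.Println(boundLabel("PROPFIND", allowedMethods, "other")) // other
}
```

The same helper works for route templates if the router can report which registered route matched, which is usually a better label source than the raw path.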
Recovery
Emergency Procedures
# 1. Identify the culprit
curl -s localhost:9090/api/v1/status/tsdb | jq '.data.seriesCountByMetricName[:10]'
# 2. Add drop rule immediately
# Edit prometheus.yml, add to metric_relabel_configs:
# - source_labels: [__name__]
# regex: "bad_metric_name"
# action: drop
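# 3. Reload Prometheus config (no restart needed; requires --web.enable-lifecycle,
#    otherwise send SIGHUP to the process)
curl -X POST localhost:9090/-/reload
# 4. Wait for the head to shed the dropped series
#    The documented TSDB admin API offers snapshot, delete_series and
#    clean_tombstones; it has no call to force head compaction. Series that
#    stop receiving samples leave the head at a later head compaction
#    (on the order of hours), so watch prometheus_tsdb_head_series.
# 5. If still OOMing, delete the bad metric series (requires --web.enable-admin-api)
# WARNING: This is destructive!
curl -X POST -g 'localhost:9090/api/v1/admin/tsdb/delete_series?match[]=bad_metric_name'
# 6. Clean tombstones
curl -X POST localhost:9090/api/v1/admin/tsdb/clean_tombstones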
Preventing Future Incidents
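# prometheus.yml
global:
  scrape_interval: 15s
  # Limit samples per scrape; a scrape over the limit fails as a whole
  sample_limit: 50000  # Per target
  # Limit labels per sample
  label_limit: 30
  label_name_length_limit: 200
  label_value_length_limit: 2000
scrape_configs:
  - job_name: 'api'
    sample_limit: 10000  # Override per job
    metric_relabel_configs:
      # Drop all metrics with suspicious labels.
      # source_labels values are joined with ";", so four empty labels give
      # ";;;" and a plain ".+" would drop every series. Match only when some
      # character is not a separator.
      - source_labels: [user_id, customer_id, request_id, session_id]
        regex: ".*[^;].*"
        action: drop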
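Rules with several source_labels are easy to get wrong, because Prometheus joins the label values with ";" (the default separator) before matching, and a missing label counts as an empty string. The sketch below approximates that evaluation so a regex can be checked against sample series first. `matchesDrop` is a hypothetical helper, not Prometheus code:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// matchesDrop approximates how a drop rule is evaluated: the values of
// sourceLabels are joined with ";" (missing labels count as empty) and
// matched against the fully anchored regex.
func matchesDrop(series map[string]string, sourceLabels []string, pattern string) bool {
	vals := make([]string, len(sourceLabels))
	for i, name := range sourceLabels {
		vals[i] = series[name]
	}
	joined := strings.Join(vals, ";")
	return regexp.MustCompile("^(?:" + pattern + ")$").MatchString(joined)
}

func main() {
	src := []string{"user_id", "customer_id", "request_id", "session_id"}
	clean := map[string]string{"method": "GET"}
	tagged := map[string]string{"method": "GET", "user_id": "42"}

	fmt.Println(matchesDrop(clean, src, `.+`))        // true: ";;;" matches, so a clean series is dropped
	fmt.Println(matchesDrop(clean, src, `.*[^;].*`))  // false: clean series is kept
	fmt.Println(matchesDrop(tagged, src, `.*[^;].*`)) // true: series with user_id is dropped
}
```

Writing one rule per label avoids the separator question entirely, at the cost of a longer config.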
Monitoring Dashboard
Grafana Panels
# Panel 1: Total Time Series
prometheus_tsdb_head_series
# Panel 2: Time Series Growth Rate
rate(prometheus_tsdb_head_series_created_total[5m])
# Panel 3: Memory Usage (GiB, process resident memory)
process_resident_memory_bytes / 1024 / 1024 / 1024
# Panel 4: Top 10 Metrics by Cardinality
topk(10, count by (__name__) ({__name__=~".+"}))
# Panel 5: Net Series Growth (series created minus removed, per second)
rate(prometheus_tsdb_head_series_created_total[5m])
  - rate(prometheus_tsdb_head_series_removed_total[5m])
# Panel 6: Scrape Duration (can indicate cardinality issues)
scrape_duration_seconds
Cardinality Budget
# Set cardinality budgets per team/service
# Implement via recording rules + alerts
groups:
  - name: cardinality_budgets
    rules:
      # Track cardinality per job
      - record: job:prometheus_series:count
        expr: count by (job) ({__name__=~".+"})
      # Alert when job exceeds budget
      - alert: CardinalityBudgetExceeded
        expr: |
          job:prometheus_series:count{job="api-server"} > 50000
        labels:
          severity: warning
        annotations:
          summary: "Job {{ $labels.job }} exceeds 50k series budget"
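The rule above hard-codes one job and one number. With several teams, it can help to keep budgets in a table and compare them against the per-job counts, for example in a CI or cron check. A minimal sketch, with hypothetical job names, counts, and budgets:

```go
package main

import (
	"fmt"
	"sort"
)

// overBudget returns the jobs whose series count exceeds their budget,
// sorted by name. Jobs without a budget entry fall back to defaultBudget.
func overBudget(counts, budgets map[string]int, defaultBudget int) []string {
	var out []string
	for job, n := range counts {
		limit, ok := budgets[job]
		if !ok {
			limit = defaultBudget
		}
		if n > limit {
			out = append(out, job)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	// Hypothetical numbers, e.g. read from job:prometheus_series:count.
	counts := map[string]int{"api-server": 72_000, "node-exporter": 18_000}
	budgets := map[string]int{"api-server": 50_000}
	fmt.Println(overBudget(counts, budgets, 25_000)) // [api-server]
}
```

The default budget matters more than it looks: new jobs get a limit from day one, so nobody has to remember to add one.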
Best Practices
Label Guidelines
## Safe Labels (bounded cardinality)
✅ method: GET, POST, PUT, DELETE, PATCH (5 values)
✅ status_code: 200, 201, 400, 401, 403, 404, 500, 502, 503 (~20 values)
✅ service_name: bounded by number of services (~100)
✅ environment: dev, staging, prod (3 values)
✅ region: us-east-1, us-west-2, eu-west-1 (~10 values)
## Dangerous Labels (unbounded cardinality)
❌ user_id: millions of users
❌ request_id: infinite
❌ email: millions
❌ ip_address: potentially millions
❌ trace_id: infinite
❌ timestamp: infinite
❌ url_path (raw): unbounded (needs normalization)
## Rule of Thumb
Per-label cardinality should be < 1,000 values
Per-metric cardinality should be < 10,000 series
Architecture for High-Cardinality Data
Need per-user metrics? Don't use Prometheus labels.
Alternative approaches:
1. Logs + Log aggregation
User activity → Structured logs → Loki/Elasticsearch
Query: sum by (endpoint) (rate({job="api"} |= "user_id=123" [5m]))
2. Event streaming
User events → Kafka → ClickHouse/TimescaleDB
Query: SELECT count(*) FROM events WHERE user_id = 123
3. Exemplars (Prometheus 2.26+)
Attach trace_id to histogram buckets
Low cardinality metrics + high cardinality exemplars
4. Remote write to specialized TSDB
High-cardinality → VictoriaMetrics / M3DB / Thanos
Better cardinality handling
Checklist
## Prometheus Cardinality Management
### Detection
- [ ] Monitor prometheus_tsdb_head_series
- [ ] Alert on series creation rate > 1000/sec
- [ ] Check /api/v1/status/tsdb regularly
- [ ] Dashboard showing top metrics by cardinality
### Prevention
- [ ] Relabel configs to drop dangerous labels
- [ ] sample_limit per scrape target
- [ ] Application-level label bounding
- [ ] Code review for new metrics
### Recovery Plan
- [ ] Document emergency drop procedures
- [ ] Know how to delete_series
- [ ] Test config reload process
- [ ] Runbook for cardinality incidents
### Best Practices
- [ ] Label cardinality < 1000 values
- [ ] No unbounded labels (user_id, request_id)
- [ ] Use logs for high-cardinality data
- [ ] Recording rules for aggregation
Conclusion
Cardinality explosion is the number one way to kill Prometheus. Unlike CPU or memory pressure that builds gradually, cardinality explosion can take you from healthy to OOMing within hours of deploying a single bad metric. The failure mode is also catastrophic: when Prometheus OOMs, you lose not just the bad metric but all your monitoring.
The root cause is almost always a misunderstanding of what Prometheus is for. Prometheus tracks aggregate metrics with bounded cardinality—things like “how many requests per endpoint” or “what’s the 99th percentile latency by service.” It’s not designed for per-user, per-request, or per-session data. Those use cases belong in logs (for debugging individual events) or traces (for request flows).
Prevention is far easier than recovery. Add relabel configs to drop dangerous labels before they’re ingested. Set sample_limit per scrape target to cap damage from any single target. Review new metrics in code review with cardinality in mind. Monitor prometheus_tsdb_head_series and alert when it grows unexpectedly.
The key insight is that cardinality is multiplicative. Each label dimension multiplies with every other. A metric with labels that each have 10 values creates 10^n series where n is the number of labels. Five labels with 10 values each = 100,000 series. Add one label with 1 million values, and you have 100 billion potential series.
Key principles:
- One bad label can create millions of series—label cardinality multiplies across all dimensions
- Monitor prometheus_tsdb_head_series constantly: it's your early warning system
- Use metric_relabel_configs to drop dangerous labels before ingestion, not after
- Bound all label values at the application level—normalize URLs, hash IDs, limit cardinality
- Use logs for high-cardinality data—Prometheus is for aggregates, not individual events
Check your TSDB status now. The explosion might already be happening, and every hour makes recovery harder.