Cardinality Contracts: Prometheus Labels as an API with Budgets
I have broken Prometheus with a ‘harmless’ label more than once. If you run Prometheus (or OTel metrics feeding Prometheus/remote_write), you know this incident pattern:
- “nothing changed”
- “we only added one label”
- and then:
  - Prometheus RAM climbs
  - remote_write costs spike
  - queries slow down
  - dashboards and alerts fall apart
The root cause is almost always the same: cardinality explosion. A single label creates too many unique time series, and your monitoring system becomes the incident.
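To make the multiplication concrete, here is a minimal sketch. The label counts are illustrative assumptions, not measurements, and `worst_case_series` is a hypothetical helper: the upper bound on series for one metric is the product of the distinct values of each label, so one unbounded label multiplies everything else.

```python
from math import prod

def worst_case_series(label_cardinalities: dict[str, int]) -> int:
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(label_cardinalities.values())

# Illustrative numbers only (assumptions, not measurements).
before = {"method": 5, "status": 10, "route": 50}
after = {**before, "user_id": 10_000}  # the "harmless" extra label

print(worst_case_series(before))  # 2500
print(worst_case_series(after))   # 25000000
```

Real series counts are usually below this bound because not every label combination occurs, but the bound shows how quickly a single label can dominate.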
This post introduces a practical concept that turns hand-wavy advice into a guardrail:
Cardinality Contracts = explicit budgets for metric cardinality + automatic verification in CI + a runtime firewall when CI is not enough.
This is not a theoretical best practice. It is an operational contract that makes labels behave like an API.
Tested on: Prometheus 2.47, OTel metrics via remote_write, Kubernetes workloads with 500k+ active series.
Why “just don’t add user_id” is not enough
Everyone knows user_id should not be a label. It still happens because:
- a router stops setting route and you export the raw path
- a debug label slips in and never leaves
- a feature flag introduces a new dimension
- a middleware change starts emitting a new value set
The underlying problem is that labels are an API: dashboards, alerts, and bills all depend on them. Treating them like an API means you need:
- explicit spec
- compatibility rules
- breaking-change detection
That is exactly what Cardinality Contracts provide.
What the contract covers
Two things, both measurable:
- Series count per metric (e.g., http_server_requests_total must stay under 5k series)
- Unique values per label (e.g., route max 250, status max 25)
If someone accidentally exposes raw paths like /users/123, the contract fails immediately.
Artifacts you keep in Git
- cardinality.budgets.yml - readable budgets
- tools/cardinality_guard.py - enforcement script
- CI job that scrapes /metrics and fails on budget violations
- (optional) cardinality.baseline.json + diff output in PRs
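The optional baseline diff could look like the sketch below. It assumes cardinality.baseline.json has the same shape as the report the guard script writes (a top-level "metrics" mapping with a "series" count per metric); the function name `diff_series`, the output format, and the sample data are hypothetical.

```python
import json

def diff_series(baseline: dict, current: dict) -> list[str]:
    """Return one line per metric whose series count changed versus the baseline."""
    base = baseline.get("metrics", {})
    cur = current.get("metrics", {})
    lines = []
    for metric in sorted(set(base) | set(cur)):
        old = base.get(metric, {}).get("series", 0)
        new = cur.get(metric, {}).get("series", 0)
        if new != old:
            lines.append(f"{metric}: {old} -> {new} ({new - old:+d})")
    return lines

# Inline sample data (made up) standing in for the two JSON files.
baseline = json.loads('{"metrics": {"http_server_requests_total": {"series": 120}}}')
current = json.loads(
    '{"metrics": {"http_server_requests_total": {"series": 180},'
    ' "db_query_duration_seconds": {"series": 12}}}'
)

for line in diff_series(baseline, current):
    print(line)
```

Posting these lines as a PR comment makes growth visible to reviewers even when it is still within budget.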
1) Define budgets in YAML
Start with the top 5 metrics that drive your bill or memory usage. Budgets are guardrails, not the truth. Start higher and tighten later.
# cardinality.budgets.yml
budgets:
  http_server_requests_total:
    max_series: 5000
    labels:
      method: 10
      status: 25
      route: 250
  http_server_request_duration_seconds_bucket:
    max_series: 20000
    labels:
      le: 50
      route: 250
      status: 25
  db_query_duration_seconds:
    max_series: 2000
    labels:
      operation: 20
      table: 200
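One way to pick the first metrics and a starting budget is to count series per metric name in a scraped exposition file and add headroom. The sketch below assumes one sample line per series (true within a single scrape); `series_per_metric`, `suggest_budgets`, the `headroom` factor of 2.0, and the sample text are all hypothetical choices, not part of the guard script.

```python
import re
from collections import Counter

# Metric name at the start of a sample line in the Prometheus text format.
NAME = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)")

def series_per_metric(exposition: str) -> Counter:
    """Count sample lines per metric name; comments and blank lines are skipped."""
    counts = Counter()
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        m = NAME.match(line)
        if m:
            counts[m.group(1)] += 1
    return counts

def suggest_budgets(counts: Counter, top: int = 5, headroom: float = 2.0) -> dict:
    """Propose max_series = observed * headroom for the `top` largest metrics."""
    return {name: int(n * headroom) for name, n in counts.most_common(top)}

sample = """\
# TYPE http_server_requests_total counter
http_server_requests_total{method="GET",route="/users/{id}",status="200"} 42
http_server_requests_total{method="POST",route="/users",status="201"} 7
db_query_duration_seconds{operation="select",table="users"} 0.12
"""
print(suggest_budgets(series_per_metric(sample)))
```

A scrape taken from a production instance is likely a better input than a CI smoke run, since the smoke run only exercises a few code paths.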
2) CI smoke: scrape /metrics and count cardinality
Workflow:
- start app + dependencies (docker compose / kind / testcontainers)
- run a short smoke test
- scrape /metrics
- run the guard script
Minimal Python guard (low dependency)
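# tools/cardinality_guard.py
import json
import re
import sys
from collections import defaultdict

try:
    import yaml  # pip install pyyaml
except ImportError:
    print("Missing dependency: pyyaml (pip install pyyaml)", file=sys.stderr)
    sys.exit(2)

# Sample line: metric name, optional {labels}, whitespace, then a value.
# The value is matched loosely (\S) so that NaN, +Inf and -Inf samples still count as series.
METRIC_LINE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{.*\})?\s+\S')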
def parse_labels(label_blob: str):
    if not label_blob:
        return {}
    s = label_blob.strip()[1:-1].strip()
    if not s:
        return {}
    labels = {}
    parts = []
    cur = []
    in_q = False
    esc = False
    for ch in s:
        if esc:
            cur.append(ch)
            esc = False
        elif ch == '\\':
            cur.append(ch)
            esc = True
        elif ch == '"':
            cur.append(ch)
            in_q = not in_q
        elif ch == ',' and not in_q:
            parts.append(''.join(cur).strip())
            cur = []
        else:
            cur.append(ch)
    if cur:
        parts.append(''.join(cur).strip())
    for p in parts:
        if not p:
            continue
        k, v = p.split("=", 1)
        k = k.strip()
        v = v.strip()
        if v.startswith('"') and v.endswith('"'):
            v = v[1:-1]
        labels[k] = v
    return labels
def load_budgets(path: str):
    with open(path, "r", encoding="utf-8") as f:
        doc = yaml.safe_load(f)
    # An empty file parses to None; treat it (and a null `budgets:` key) as "no budgets".
    return (doc or {}).get("budgets") or {}
def main():
    if len(sys.argv) != 4:
        print("Usage: python tools/cardinality_guard.py <budgets.yml> <metrics.txt> <report.json>", file=sys.stderr)
        sys.exit(2)
    budgets_path, metrics_path, report_path = sys.argv[1], sys.argv[2], sys.argv[3]
    budgets = load_budgets(budgets_path)
    series = defaultdict(set)
    label_values = defaultdict(lambda: defaultdict(set))
    with open(metrics_path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            m = METRIC_LINE.match(line)
            if not m:
                continue
            metric = m.group(1)
            label_blob = m.group(2)
            labels = parse_labels(label_blob) if label_blob else {}
            fp = "|".join([f"{k}={labels[k]}" for k in sorted(labels.keys())])
            series[metric].add(fp)
            for k, v in labels.items():
                label_values[metric][k].add(v)
    report = {"metrics": {}, "violations": []}
    for metric, labelsets in series.items():
        metric_series = len(labelsets)
        rep = {
            "series": metric_series,
            "labels": {k: len(vs) for k, vs in label_values[metric].items()}
        }
        report["metrics"][metric] = rep
        if metric in budgets:
            b = budgets[metric]
            max_series = b.get("max_series")
            if max_series is not None and metric_series > int(max_series):
                report["violations"].append({
                    "metric": metric,
                    "type": "max_series",
                    "observed": metric_series,
                    "budget": int(max_series)
                })
            label_budgets = b.get("labels", {})
            for label, max_vals in label_budgets.items():
                observed = rep["labels"].get(label, 0)
                if observed > int(max_vals):
                    report["violations"].append({
                        "metric": metric,
                        "type": "label_values",
                        "label": label,
                        "observed": observed,
                        "budget": int(max_vals)
                    })
    with open(report_path, "w", encoding="utf-8") as f:
        json.dump(report, f, indent=2, sort_keys=True)
    if report["violations"]:
        print("CARDINALITY CONTRACT FAILED:")
        for v in report["violations"]:
            if v["type"] == "max_series":
                print(f"- {v['metric']}: series {v['observed']} > budget {v['budget']}")
            else:
                print(f"- {v['metric']}[{v['label']}]: values {v['observed']} > budget {v['budget']}")
        sys.exit(1)
    print("Cardinality contract OK.")
    sys.exit(0)

if __name__ == "__main__":
    main()
3) Minimal GitHub Actions job
# .github/workflows/cardinality.yml
name: Cardinality contracts
on:
  pull_request:
jobs:
  cardinality:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Start stack
        run: docker compose up -d --build
      - name: Run smoke tests
        run: |
          curl -fsS http://localhost:8080/healthz
      - name: Scrape metrics
        run: |
          curl -fsS http://localhost:8080/metrics > metrics.txt
      - name: Install deps
        run: |
          python -m pip install --upgrade pip
          pip install pyyaml
      - name: Enforce contracts
        run: |
          python tools/cardinality_guard.py cardinality.budgets.yml metrics.txt cardinality.report.json
      - name: Upload report
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: cardinality-report
          path: cardinality.report.json
4) Runtime Cardinality Firewall (when CI is not enough)
CI catches regressions, but some labels only appear in prod (tenants, feature flags, edge paths). A runtime guardrail prevents a monitoring outage when CI misses a path.
Pattern: if a label exceeds budget, bucket the value into __other__ and increment an overflow counter.
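class BudgetedLabel {
  private readonly max: number;
  private readonly seen = new Set<string>();
  // Number of observations bucketed into "__other__"; export this as the overflow counter.
  public overflowCount = 0;

  constructor(max: number) {
    this.max = max;
  }

  // Returns the value unchanged while under budget, "__other__" once the budget is spent.
  normalize(v: string): string {
    if (this.seen.has(v)) return v;
    if (this.seen.size < this.max) {
      this.seen.add(v);
      return v;
    }
    this.overflowCount += 1;
    return "__other__";
  }
}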
This is a circuit breaker for telemetry. It keeps Prometheus alive and tells you who tried to break the contract.
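If the runtime firewall should enforce the same numbers as CI, one option is to build it from the budgets mapping itself. The Python sketch below is an assumption about how that wiring could look, not the author's implementation: `LabelFirewall` is a hypothetical class, the `budgets` argument mirrors the `budgets:` section of cardinality.budgets.yml, and the in-process `overflow` tally stands in for a real exported counter.

```python
import threading
from collections import Counter, defaultdict

OTHER = "__other__"

class LabelFirewall:
    """Cap distinct label values per (metric, label) using the budgets mapping."""

    def __init__(self, budgets: dict):
        # budgets has the shape of the `budgets:` section in cardinality.budgets.yml
        self._limits = {
            (metric, label): int(limit)
            for metric, spec in budgets.items()
            for label, limit in spec.get("labels", {}).items()
        }
        self._seen = defaultdict(set)
        self._lock = threading.Lock()
        # (metric, label) -> number of observations bucketed into OTHER
        self.overflow = Counter()

    def normalize(self, metric: str, label: str, value: str) -> str:
        key = (metric, label)
        limit = self._limits.get(key)
        if limit is None:
            return value  # no budget declared: pass through unchanged
        with self._lock:
            seen = self._seen[key]
            if value in seen:
                return value
            if len(seen) < limit:
                seen.add(value)
                return value
            self.overflow[key] += 1
            return OTHER

fw = LabelFirewall({"http_server_requests_total": {"labels": {"route": 2}}})
for route in ["/a", "/b", "/users/123", "/a"]:
    print(fw.normalize("http_server_requests_total", "route", route))
```

Keying the overflow tally by (metric, label) keeps it bounded while still pointing at the offending dimension. Admission is first come, first served, so which values survive depends on arrival order and resets on process restart.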
Breaking vs non-breaking rules
Breaking:
- series count exceeds budget
- unique values for a labeled dimension exceed budget
Non-breaking:
- cardinality goes down
- a label is removed entirely (if you do not depend on it in dashboards)
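A removed label passes the budget check silently, so it can help to surface removals for a human to compare against dashboards. This is a minimal sketch assuming two reports in the shape the guard script writes (per-metric "labels" mappings); `removed_labels` and the sample data are hypothetical.

```python
def removed_labels(baseline: dict, current: dict) -> dict[str, list[str]]:
    """Map metric name -> labels present in the baseline report but absent now."""
    removed = {}
    for metric, base in baseline.get("metrics", {}).items():
        cur = current.get("metrics", {}).get(metric)
        if cur is None:
            continue  # metric not seen in this run; not treated as a label removal
        gone = sorted(set(base.get("labels", {})) - set(cur.get("labels", {})))
        if gone:
            removed[metric] = gone
    return removed

# Made-up reports for illustration.
baseline = {"metrics": {"http_server_requests_total": {
    "series": 10, "labels": {"method": 2, "route": 5, "status": 3}}}}
current = {"metrics": {"http_server_requests_total": {
    "series": 6, "labels": {"method": 2, "route": 3}}}}

print(removed_labels(baseline, current))
```

Printing this as a warning rather than failing the build keeps removals non-breaking while still prompting a dashboard check.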
Anti-patterns people still ship
- Raw path as label (/users/123)
- PII in labels (email, user_id, request_id)
- “Temporary” debug label (it will live forever)
Use route templates or handler names instead.
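The matched route from your framework's router is the best source for the label. Where that is not available, a fallback is to collapse ID-like path segments before using the path as a label value. The sketch below is one such fallback under stated assumptions: `route_template`, the `:id` placeholder, and the choice to treat only all-digit and canonical UUID segments as IDs are hypothetical.

```python
import re

# Path segments that look like IDs: all digits, or a canonical UUID.
_ID_SEGMENT = re.compile(
    r"^(\d+|[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12})$"
)

def route_template(path: str) -> str:
    """Collapse ID-like path segments so /users/123 and /users/456 share one label value."""
    path = path.split("?", 1)[0]  # drop the query string
    segments = [":id" if _ID_SEGMENT.match(s) else s for s in path.split("/")]
    return "/".join(segments)

print(route_template("/users/123"))               # /users/:id
print(route_template("/users/123/orders/9?x=1"))  # /users/:id/orders/:id
print(route_template("/healthz"))                 # /healthz
```

Slugs, hashes, and other free-form segments are not caught by this pattern, so a label budget on route is still needed as a backstop.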
FAQ
“Budgets will annoy us.” If set reasonably, budgets only trigger when you are about to ship a monitoring incident.
“We are multi-tenant and want tenant label.” Then treat it like a budgeted API. If it does not fit, put it in logs or traces instead of metrics.
“Histograms are huge anyway.” Yes. That is why budgets for *_bucket are separate and focus on the big multipliers: route, status, le.
Production checklist
- Define budgets for top 5 metrics
- Enforce in CI (fail only on over-budget)
- Add runtime firewall for high-risk labels
- Track overflow counter and page on spikes
Related articles
- Prometheus Cardinality Explosion: Detection, Prevention, and Recovery
- Span Contracts: Trace-Driven API Contract Testing
Conclusion
Cardinality Contracts are simple:
- define budgets
- enforce in CI
- add a runtime safety net
But the impact is large: fewer incidents, lower telemetry costs, and a monitoring stack that stays healthy even when someone makes a bad label choice.