Cardinality Contracts: Turn Prometheus Labels into an API with a Budget
I have taken Prometheus down more than once with a single “innocent”-looking label. If you use Prometheus (or OTel metrics that flow into Prometheus/remote_write), you know this incident pattern:
- “nothing changed”
- “we only added one label”
- and then:
  - Prometheus eats RAM
  - the remote_write bill spikes
  - queries slow down
  - graphs and alerts fall apart
The cause is almost always the same: cardinality explosion. A single label creates too many unique time series, and monitoring turns into an incident.
This article introduces a practical concept that turns advice into a guardrail:
Cardinality Contracts = explicit cardinality budgets + automated verification in CI + a runtime firewall for when CI is not enough.
This is neither theory nor marketing. It is an operational contract that turns labels into an API.
Tested on: Prometheus 2.47, OTel metrics via remote_write, Kubernetes workloads with 500k+ active series.
Why “don't put user_id in there” is not enough
Everyone knows user_id should not be a label. It happens anyway, because:
- the router stops setting route and you end up exporting the raw path
- a debug label slips into production
- a feature flag adds a new dimension
- a middleware changes behavior and starts producing a new set of values
The core problem is that labels are an API. If you treat them as an API, you need:
- an explicit specification
- compatibility guarantees
- detection of breaking changes
That is exactly what Cardinality Contracts address.
What the contract covers
Two things, both measurable:
- Number of time series per metric (e.g. http_server_requests_total must stay under 5k series)
- Number of unique values per label (e.g. route max 250, status max 25)
If someone accidentally exports raw paths such as /users/123, the contract fails as soon as the number of distinct values crosses the budget.
Artifacts in Git
- cardinality.budgets.yml: human-readable budgets
- tools/cardinality_guard.py: the guard script
- a CI job that scrapes /metrics and fails when a budget is exceeded
- (optional) cardinality.baseline.json + a diff report in the PR
1) Define budgets in YAML
Start with the 5 metrics that grow the fastest. Budgets are guardrails, not absolute truth. Start high and tighten gradually.
# cardinality.budgets.yml
budgets:
  http_server_requests_total:
    max_series: 5000
    labels:
      method: 10
      status: 25
      route: 250
  http_server_request_duration_seconds_bucket:
    max_series: 20000
    labels:
      le: 50
      route: 250
      status: 25
  db_query_duration_seconds:
    max_series: 2000
    labels:
      operation: 20
      table: 200
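One way to "start high" is to derive the first budgets from what the service exports today and add headroom. The sketch below assumes a report shaped like the one tools/cardinality_guard.py writes in the next step ({"metrics": {name: {"series": n, "labels": {label: count}}}}). The function name suggest_budgets, the 2x headroom factor, and the floor of 10 are illustrative choices, not part of any existing tool.

```python
import math


def suggest_budgets(report: dict, headroom: float = 2.0, floor: int = 10) -> dict:
    """Derive starter budgets from observed cardinalities.

    Each budget is the observed count times `headroom`, rounded up,
    and never lower than `floor` so tiny metrics still get some slack.
    """
    budgets = {}
    for metric, stats in report.get("metrics", {}).items():
        budgets[metric] = {
            "max_series": max(floor, math.ceil(stats["series"] * headroom)),
            "labels": {
                label: max(floor, math.ceil(count * headroom))
                for label, count in stats.get("labels", {}).items()
            },
        }
    return {"budgets": budgets}


# Hypothetical observation from one smoke run.
observed = {
    "metrics": {
        "http_server_requests_total": {
            "series": 1200,
            "labels": {"method": 4, "status": 9, "route": 80},
        }
    }
}
starter = suggest_budgets(observed)
print(starter["budgets"]["http_server_requests_total"])
# {'max_series': 2400, 'labels': {'method': 10, 'status': 18, 'route': 160}}
```

The result is a starting point to review by hand before committing it; tightening then means lowering the numbers once a few weeks of reports show how much headroom is really used.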
2) CI smoke: scrape /metrics and count cardinality
Workflow:
- start the app and its dependencies (docker compose / kind / testcontainers)
- run a short smoke test
- scrape /metrics
- run the guard script
Minimal Python guard (low dependency)
# tools/cardinality_guard.py
import re
import sys
import json
from collections import defaultdict

try:
    import yaml  # pip install pyyaml
except ImportError:
    print("Missing dependency: pyyaml (pip install pyyaml)", file=sys.stderr)
    sys.exit(2)

# Sample line: metric name, optional {labels}, then a value.
# The value may also be NaN or +/-Inf, which the exposition format allows.
METRIC_LINE = re.compile(
    r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{.*\})?\s+(?:[-+]?\d|[-+]?Inf|NaN)'
)


def parse_labels(label_blob: str):
    """Parse '{k="v",...}' into a dict, honoring quotes and escapes."""
    if not label_blob:
        return {}
    s = label_blob.strip()[1:-1].strip()
    if not s:
        return {}
    labels = {}
    parts = []
    cur = []
    in_q = False
    esc = False
    # Split on commas that are outside of quoted label values.
    for ch in s:
        if esc:
            cur.append(ch)
            esc = False
        elif ch == '\\':
            cur.append(ch)
            esc = True
        elif ch == '"':
            cur.append(ch)
            in_q = not in_q
        elif ch == ',' and not in_q:
            parts.append(''.join(cur).strip())
            cur = []
        else:
            cur.append(ch)
    if cur:
        parts.append(''.join(cur).strip())
    for p in parts:
        if not p:
            continue
        k, v = p.split("=", 1)
        k = k.strip()
        v = v.strip()
        if v.startswith('"') and v.endswith('"'):
            v = v[1:-1]
        labels[k] = v
    return labels


def load_budgets(path: str):
    with open(path, "r", encoding="utf-8") as f:
        doc = yaml.safe_load(f)
    # An empty file or an empty `budgets:` key yields no budgets.
    return (doc or {}).get("budgets") or {}


def main():
    if len(sys.argv) != 4:
        print("Usage: python tools/cardinality_guard.py <budgets.yml> <metrics.txt> <report.json>", file=sys.stderr)
        sys.exit(2)
    budgets_path, metrics_path, report_path = sys.argv[1], sys.argv[2], sys.argv[3]
    budgets = load_budgets(budgets_path)

    series = defaultdict(set)  # metric -> set of label-set fingerprints
    label_values = defaultdict(lambda: defaultdict(set))  # metric -> label -> values
    with open(metrics_path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            m = METRIC_LINE.match(line)
            if not m:
                continue
            metric = m.group(1)
            label_blob = m.group(2)
            labels = parse_labels(label_blob) if label_blob else {}
            fp = "|".join([f"{k}={labels[k]}" for k in sorted(labels.keys())])
            series[metric].add(fp)
            for k, v in labels.items():
                label_values[metric][k].add(v)

    report = {"metrics": {}, "violations": []}
    for metric, labelsets in series.items():
        metric_series = len(labelsets)
        rep = {
            "series": metric_series,
            "labels": {k: len(vs) for k, vs in label_values[metric].items()}
        }
        report["metrics"][metric] = rep
        if metric in budgets:
            b = budgets[metric]
            max_series = b.get("max_series")
            if max_series is not None and metric_series > int(max_series):
                report["violations"].append({
                    "metric": metric,
                    "type": "max_series",
                    "observed": metric_series,
                    "budget": int(max_series)
                })
            label_budgets = b.get("labels", {})
            for label, max_vals in label_budgets.items():
                observed = rep["labels"].get(label, 0)
                if observed > int(max_vals):
                    report["violations"].append({
                        "metric": metric,
                        "type": "label_values",
                        "label": label,
                        "observed": observed,
                        "budget": int(max_vals)
                    })

    with open(report_path, "w", encoding="utf-8") as f:
        json.dump(report, f, indent=2, sort_keys=True)

    if report["violations"]:
        print("CARDINALITY CONTRACT FAILED:")
        for v in report["violations"]:
            if v["type"] == "max_series":
                print(f"- {v['metric']}: series {v['observed']} > budget {v['budget']}")
            else:
                print(f"- {v['metric']}[{v['label']}]: values {v['observed']} > budget {v['budget']}")
        sys.exit(1)
    print("Cardinality contract OK.")
    sys.exit(0)


if __name__ == "__main__":
    main()
3) Minimal GitHub Actions job
# .github/workflows/cardinality.yml
name: Cardinality contracts
on:
  pull_request:
jobs:
  cardinality:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Start stack
        run: docker compose up -d --build
      - name: Run smoke tests
        run: |
          curl -fsS http://localhost:8080/healthz
      - name: Scrape metrics
        run: |
          curl -fsS http://localhost:8080/metrics > metrics.txt
      - name: Install deps
        run: |
          python -m pip install --upgrade pip
          pip install pyyaml
      - name: Enforce contracts
        run: |
          python tools/cardinality_guard.py cardinality.budgets.yml metrics.txt cardinality.report.json
      - name: Upload report
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: cardinality-report
          path: cardinality.report.json
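If the app needs a few seconds after docker compose up before it answers, a single curl can fail for reasons unrelated to cardinality. A minimal sketch of a retrying scrape using only the standard library follows. The helper name scrape_with_retry and its defaults are assumptions made for illustration; the opener parameter exists only so the function can be exercised without a network.

```python
import time
import urllib.error
import urllib.request


def scrape_with_retry(url, attempts=10, delay=2.0, opener=urllib.request.urlopen):
    """Fetch `url`, retrying while the app is still starting up.

    Returns the response body as text, or raises RuntimeError
    (chained to the last underlying error) once attempts run out.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            with opener(url, timeout=5) as resp:
                return resp.read().decode("utf-8")
        except (urllib.error.URLError, ConnectionError, TimeoutError) as exc:
            last_error = exc
            if attempt < attempts:
                time.sleep(delay)  # give the container time to come up
    raise RuntimeError(f"{url} not ready after {attempts} attempts") from last_error
```

Calling scrape_with_retry("http://localhost:8080/metrics") and writing the returned text to metrics.txt would stand in for the curl scrape step.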
4) Runtime Cardinality Firewall (when CI is not enough)
CI catches regressions, but some label values only appear in production (tenants, feature flags, edge-case paths). A runtime guardrail saves you when CI misses something.
Pattern: once a label exceeds its budget, bucket any new value into __other__ and increment an overflow counter.
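class BudgetedLabel {
  private max: number;
  private seen: Map<string, true>;
  // Number of observations bucketed into "__other__"; export it as the overflow counter.
  public overflow = 0;

  constructor(max: number) {
    this.max = max;
    this.seen = new Map();
  }

  normalize(v: string): string {
    // Known values pass through unchanged.
    if (this.seen.has(v)) return v;
    // New values are admitted until the budget is used up.
    if (this.seen.size < this.max) {
      this.seen.set(v, true);
      return v;
    }
    // Over budget: collapse into one bucket and count the overflow.
    this.overflow += 1;
    return "__other__";
  }
}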
It is a circuit breaker for telemetry. Prometheus survives, and you can see who exceeded the contract.
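The same pattern can be applied to a whole label set at once, with one overflow count per label name so it is visible which label broke its budget. This is a sketch under assumptions: the class name LabelFirewall and the in-process Counter are illustrative, and a real service would export the counts through its metrics client rather than keep them in memory.

```python
import threading
from collections import Counter


class LabelFirewall:
    """Apply per-label value budgets to a whole label set."""

    OTHER = "__other__"

    def __init__(self, budgets: dict):
        self._budgets = dict(budgets)              # label name -> max distinct values
        self._seen = {name: set() for name in budgets}
        self._lock = threading.Lock()              # handlers may run concurrently
        self.overflow = Counter()                  # label name -> bucketed observations

    def apply(self, labels: dict) -> dict:
        out = {}
        with self._lock:
            for name, value in labels.items():
                seen = self._seen.get(name)
                if seen is None or value in seen:
                    out[name] = value              # unbudgeted label or known value
                elif len(seen) < self._budgets[name]:
                    seen.add(value)                # still within budget: admit it
                    out[name] = value
                else:
                    self.overflow[name] += 1       # over budget: bucket and count
                    out[name] = self.OTHER
        return out


fw = LabelFirewall({"tenant": 2})
for t in ["acme", "globex", "initech", "acme", "umbrella"]:
    print(fw.apply({"tenant": t, "method": "GET"}))
print(dict(fw.overflow))  # {'tenant': 2}
```

Values admitted first keep their identity for the lifetime of the process, so which tenants land in __other__ depends on arrival order. That is a trade-off of this simple design, not a property of the pattern itself.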
Breaking vs non-breaking rules
Breaking:
- the series count exceeds the budget
- the number of unique values of a label exceeds the budget
Non-breaking:
- cardinality drops
- a label is removed (as long as no dashboard depends on it)
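These rules could drive the optional PR diff report. The sketch below assumes cardinality.baseline.json has the same shape as the guard's report and, for brevity, only looks at series counts. The function name diff_reports and the extra "growth" category (an increase that is still within budget) are additions made for illustration.

```python
def diff_reports(baseline: dict, current: dict, budgets: dict) -> list:
    """Compare two guard reports and classify each changed metric.

    "breaking"     : current series count exceeds the metric's max_series
    "growth"       : count went up but stays within budget (informational)
    "non-breaking" : count went down
    """
    rows = []
    base_metrics = baseline.get("metrics", {})
    for metric, stats in sorted(current.get("metrics", {}).items()):
        before = base_metrics.get(metric, {}).get("series", 0)
        after = stats["series"]
        if after == before:
            continue  # unchanged metrics are left out of the report
        limit = budgets.get(metric, {}).get("max_series")
        if limit is not None and after > limit:
            kind = "breaking"
        elif after > before:
            kind = "growth"
        else:
            kind = "non-breaking"
        rows.append({"metric": metric, "before": before, "after": after, "kind": kind})
    return rows


# Hypothetical numbers for two metrics from the budgets file.
baseline = {"metrics": {"http_server_requests_total": {"series": 900},
                        "db_query_duration_seconds": {"series": 300}}}
current = {"metrics": {"http_server_requests_total": {"series": 6200},
                       "db_query_duration_seconds": {"series": 250}}}
budgets = {"http_server_requests_total": {"max_series": 5000},
           "db_query_duration_seconds": {"max_series": 2000}}

for row in diff_reports(baseline, current, budgets):
    print(row["metric"], row["before"], "->", row["after"], row["kind"])
# db_query_duration_seconds 300 -> 250 non-breaking
# http_server_requests_total 900 -> 6200 breaking
```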
Anti-patterns people keep shipping
- Raw path as a label (/users/123)
- PII in labels (email, user_id, request_id)
- A “temporary” debug label (it lives forever)
Use the route template or the handler name instead.
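When the router cannot provide its matched template, a fallback normalizer can at least collapse ID-like path segments. This is a heuristic sketch: the function name route_template and the :id / :uuid placeholders are assumptions, and the router's own template is preferable whenever it is available.

```python
import re

_UUID = re.compile(r"^[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}$")
_NUMERIC = re.compile(r"^\d+$")


def route_template(path: str) -> str:
    """Collapse ID-like path segments so the label stays low-cardinality."""
    path = path.split("?", 1)[0]  # the query string never belongs in a label
    segments = []
    for seg in path.split("/"):
        if _NUMERIC.match(seg):
            segments.append(":id")
        elif _UUID.match(seg):
            segments.append(":uuid")
        else:
            segments.append(seg)
    return "/".join(segments)


print(route_template("/users/123"))                # /users/:id
print(route_template("/users/123/orders/9?x=1"))   # /users/:id/orders/:id
```

Free-text segments such as slugs or usernames still pass through unchanged, so a label budget on route remains necessary behind this.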
FAQ
“Budgets will annoy us.” If they are set sensibly, you only deal with them when you would otherwise be handling an incident a few days later.
“We are multi-tenant and want a tenant label.” Then either give it a sensible budget or move it to logs/traces. If it does not fit the budget, it does not belong in Prometheus.
“Histograms are large anyway.” Yes. That is why separate budgets for *_bucket make sense, with a focus on the main multipliers: route, status, le.
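The multiplier effect is easy to check against the budgets file. Taking the label budgets for http_server_request_duration_seconds_bucket from section 1, the sketch below computes the worst case in which every label combination occurs. The helper name worst_case_series is illustrative, and real traffic rarely hits the full product.

```python
import math


def worst_case_series(label_budgets: dict) -> int:
    """Upper bound on series if every label combination were to occur."""
    return math.prod(label_budgets.values())


bucket_labels = {"le": 50, "route": 250, "status": 25}
max_series = 20000

upper_bound = worst_case_series(bucket_labels)
print(upper_bound)               # 312500
print(upper_bound > max_series)  # True
```

Because the product of the per-label budgets is far above max_series, the per-metric series budget is the limit that actually binds for histograms; the per-label budgets mainly catch a single label running away.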
Production checklist
- Define budgets for your top 5 metrics
- In CI, fail only when a budget is exceeded
- Add a runtime firewall for risky labels
- Watch the overflow counter and alert on spikes
Related articles
- Prometheus Cardinality Explosion: Detection, Prevention and Recovery
- Span Contracts: Trace-driven API contract testing
Conclusion
Cardinality Contracts are simple:
- define budgets
- verify them in CI
- add a runtime safety net
But the impact is large: fewer incidents, lower telemetry costs, and monitoring that stays healthy even after a bad label decision.