Dash Contracts in Go: a CI compiler for Grafana dashboards and Prometheus alerts
I got tired of No data panels after merges that looked clean in CI. Dashboards and alerts are clients of your metrics API. When you rename a metric, a label, or a label value, you can break:
- Grafana panels (they silently show “No data”)
- alert rules (they stop firing)
- recorded metrics, whose meaning changes without you noticing
This article turns that into a contract and verifies it in CI without needing a Prometheus server.
Dash Contracts = the set of PromQL selectors from dashboards and rules, verified against a scraped /metrics endpoint.
The key check is simple: for every selector there must be at least one series that satisfies all of its label matchers. If there is none, the dashboard or alert is either broken already or will break after the merge.
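To make the check concrete, here is a minimal sketch that uses only the standard library. The Matcher type and the Satisfiable function are hypothetical stand-ins for illustration, and only the = and =~ operators are modelled. A label missing from a series is treated as an empty string and regex matchers are fully anchored, which is how PromQL matchers behave.

```go
package main

import (
	"fmt"
	"regexp"
)

// Matcher is a simplified, hypothetical stand-in for a PromQL label matcher.
// Only "=" (Regex == false) and "=~" (Regex == true) are modelled.
type Matcher struct {
	Name, Value string
	Regex       bool
}

// Matches reports whether a label value satisfies the matcher.
// Regex matchers are fully anchored, as in PromQL.
func (m Matcher) Matches(v string) bool {
	if !m.Regex {
		return v == m.Value
	}
	re, err := regexp.Compile("^(?:" + m.Value + ")$")
	return err == nil && re.MatchString(v)
}

// Satisfiable reports whether at least one series matches every matcher.
// A label that is missing from a series reads as the empty string.
func Satisfiable(sel []Matcher, series []map[string]string) bool {
	for _, s := range series {
		ok := true
		for _, m := range sel {
			if !m.Matches(s[m.Name]) {
				ok = false
				break
			}
		}
		if ok {
			return true
		}
	}
	return false
}

func main() {
	// The exporter now exposes status_code instead of status.
	series := []map[string]string{
		{"__name__": "http_requests_total", "status_code": "500"},
	}
	before := []Matcher{
		{Name: "__name__", Value: "http_requests_total"},
		{Name: "status", Value: "5..", Regex: true},
	}
	after := []Matcher{
		{Name: "__name__", Value: "http_requests_total"},
		{Name: "status_code", Value: "5..", Regex: true},
	}
	fmt.Println(Satisfiable(before, series)) // false
	fmt.Println(Satisfiable(after, series))  // true
}
```

The first selector still parses and runs without error, which is exactly why the breakage is silent: it simply returns nothing.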
Tested on: Prometheus 2.47, Grafana 10, OTel metrics via remote_write, Kubernetes workloads with 500k+ active series.
Why it matters
Advice like “don't use user_id as a label” is correct, but it is not enough. In practice, subtler changes are what break you:
- the router stops setting route and you export the raw path
- a debug label stays in production and produces a cardinality explosion
- a label is renamed (status to status_code) and dashboards go dark
Metrics are an API. Dash Contracts turn them into explicit compatibility tests.
How it works
- Walk the Grafana dashboard JSON files and the Prometheus rules YAML files
- Extract the PromQL expressions
- Normalize Grafana macros so that the PromQL parser can read the expression
- Extract every vector selector from each expression
- Scrape /metrics and parse the series labels
- Verify that every selector is satisfiable by at least one series
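One thing the normalization step has to deal with is dashboard template variables such as $job inside label values. They parse fine as PromQL, but a matcher like job="$job" can never match a real series. The sketch below shows one possible way to handle that, offered as an assumption rather than as part of the tool described here: rewrite any matcher whose quoted value contains a variable into a match-anything regex. RelaxTemplateMatchers is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"regexp"
)

// tmplMatcher finds a label matcher whose quoted value contains a Grafana
// template variable in one of the forms $var, ${var}, ${var:format} or [[var]].
var tmplMatcher = regexp.MustCompile(`(=~|!~|!=|=)\s*"[^"]*(\$\{?\w+|\[\[\w+\]\])[^"]*"`)

// RelaxTemplateMatchers rewrites such matchers to =~".*", so only the metric
// name and the literal matchers take part in the satisfiability check.
// Caveat: negative matchers (!=, !~) are relaxed too, which changes their
// meaning; that is acceptable for a "does any series exist" check only.
func RelaxTemplateMatchers(expr string) string {
	return tmplMatcher.ReplaceAllString(expr, `=~".*"`)
}

func main() {
	in := `sum(rate(http_requests_total{job="$job", route=~"${route:regex}", status=~"5.."}[5m]))`
	fmt.Println(RelaxTemplateMatchers(in))
	// prints: sum(rate(http_requests_total{job=~".*", route=~".*", status=~"5.."}[5m]))
}
```

Variables used outside quoted label values, for example a custom interval in a range selector or a variable in the metric name, are not covered by this sketch and would still need their own handling.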
No Prometheus. Just compilation for observability.
Go implementation: dashcontract
The CLI lives in tools/dashcontract. It is a single binary that you can run in CI.
go.mod
module example.com/dashcontract
go 1.22
require (
github.com/prometheus/common v0.67.4
github.com/prometheus/prometheus v0.308.1
gopkg.in/yaml.v3 v3.0.1
)
Prometheus modules use a special tagging scheme: repository release v3.y.z maps to Go module version v0.3y.z, so the v0.308.1 pinned above corresponds to Prometheus v3.8.1. It is a good idea to pin the version in CI.
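If you generate or check the pin in a script, the mapping can be written down as a one-liner. This is a small sketch; ModuleVersion is a hypothetical helper, and the two-digit zero padding of the minor version is an assumption inferred from the v0.308.1 tag used in go.mod.

```go
package main

import "fmt"

// ModuleVersion maps a Prometheus v3.<minor>.<patch> release to the matching
// Go module tag. Assumption: the minor version is zero-padded to two digits,
// as the v0.308.1 tag in go.mod suggests.
func ModuleVersion(minor, patch int) string {
	return fmt.Sprintf("v0.3%02d.%d", minor, patch)
}

func main() {
	fmt.Println(ModuleVersion(8, 1)) // v0.308.1
}
```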
main.go
package main
import (
"bytes"
"encoding/json"
"errors"
"flag"
"fmt"
"io"
"net/http"
"os"
"path/filepath"
"sort"
"strings"
"time"
"github.com/prometheus/common/expfmt"
"github.com/prometheus/common/model"
"github.com/prometheus/prometheus/model/labels"
"github.com/prometheus/prometheus/promql/parser"
"gopkg.in/yaml.v3"
)
type QuerySource struct {
Kind string // grafana|promrule
File string
JSONPath string // grafana
Group string // promrule
Rule string // promrule (alert or record name)
}
type Query struct {
Expr string
NormExpr string
Source QuerySource
}
type Series struct {
Labels map[string]string // includes __name__
}
type MetricStats struct {
SeriesCount int
LabelKeys map[string]struct{}
Values map[string]map[string]struct{} // label -> distinct values (limited)
}
type MetricsIndex struct {
ByMetric map[string][]Series
All []Series
Stats map[string]*MetricStats
Names []string
}
type Violation struct {
Reason string
Expr string
Selector []string
MetricHint string
Source QuerySource
Suggestions []string
}
func main() {
var (
dashDir = flag.String("dashboards", "", "Directory with Grafana dashboards JSON (optional)")
rulesDir = flag.String("rules", "", "Directory with Prometheus rules YAML (optional)")
metrics = flag.String("metrics-url", "", "URL to scrape /metrics from (required)")
timeout = flag.Duration("timeout", 5*time.Second, "HTTP timeout for scraping metrics")
reportOut = flag.String("report", "dashcontract_report.json", "Path to JSON report output")
)
flag.Parse()
if strings.TrimSpace(*metrics) == "" {
fatal("missing -metrics-url")
}
if strings.TrimSpace(*dashDir) == "" && strings.TrimSpace(*rulesDir) == "" {
fatal("provide at least one of: -dashboards, -rules")
}
var queries []Query
if *dashDir != "" {
q, err := ExtractGrafanaQueries(*dashDir)
if err != nil {
fatal(err.Error())
}
queries = append(queries, q...)
}
if *rulesDir != "" {
q, err := ExtractPromRuleQueries(*rulesDir)
if err != nil {
fatal(err.Error())
}
queries = append(queries, q...)
}
idx, err := ScrapeAndIndexMetrics(*metrics, *timeout)
if err != nil {
fatal(fmt.Sprintf("scrape metrics failed: %v", err))
}
violations := VerifyQueries(queries, idx)
fmt.Printf("DashContract: %d queries, %d metrics, %d series\n", len(queries), len(idx.Names), len(idx.All))
if len(violations) == 0 {
fmt.Println("OK: all selectors are satisfiable against scraped /metrics")
} else {
fmt.Printf("FAIL: %d selector(s) not satisfiable\n", len(violations))
for i, v := range violations {
fmt.Printf("\n[%d] %s\n", i+1, v.Reason)
fmt.Printf(" expr: %s\n", v.Expr)
fmt.Printf(" selector: %s\n", strings.Join(v.Selector, ", "))
fmt.Printf(" source: %s (%s)\n", v.Source.File, v.Source.Kind)
if v.Source.JSONPath != "" {
fmt.Printf(" jsonPath: %s\n", v.Source.JSONPath)
}
if v.Source.Group != "" || v.Source.Rule != "" {
fmt.Printf(" rule: group=%q name=%q\n", v.Source.Group, v.Source.Rule)
}
for _, s := range v.Suggestions {
fmt.Printf(" hint: %s\n", s)
}
}
}
if err := WriteJSONReport(*reportOut, violations); err != nil {
fmt.Fprintf(os.Stderr, "report write failed: %v\n", err)
}
if len(violations) > 0 {
os.Exit(1)
}
}
func fatal(msg string) {
fmt.Fprintln(os.Stderr, "error:", msg)
os.Exit(2)
}
func WriteJSONReport(path string, violations []Violation) error {
type out struct {
Violations []Violation `json:"violations"`
}
b, err := json.MarshalIndent(out{Violations: violations}, "", " ")
if err != nil {
return err
}
return os.WriteFile(path, b, 0o644)
}
func ExtractGrafanaQueries(root string) ([]Query, error) {
var out []Query
err := filepath.WalkDir(root, func(path string, d os.DirEntry, err error) error {
if err != nil {
return err
}
if d.IsDir() {
return nil
}
if !strings.HasSuffix(strings.ToLower(path), ".json") {
return nil
}
b, err := os.ReadFile(path)
if err != nil {
return err
}
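var doc any
if err := json.Unmarshal(b, &doc); err != nil {
// Skip files that are not valid JSON, but say so: a silently skipped
// dashboard would otherwise pass the gate unchecked.
fmt.Fprintf(os.Stderr, "warning: skipping %s: invalid JSON: %v\n", path, err)
return nil
}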
out = append(out, ExtractExprFieldsFromJSON(doc, path)...)
return nil
})
return out, err
}
func ExtractExprFieldsFromJSON(v any, file string) []Query {
var out []Query
var walk func(node any, jp string)
walk = func(node any, jp string) {
switch t := node.(type) {
case map[string]any:
for k, vv := range t {
p := jp
if p == "" {
p = "$"
}
p2 := p + "." + k
if k == "expr" {
if s, ok := vv.(string); ok {
s = strings.TrimSpace(s)
if s != "" {
out = append(out, Query{
Expr: s,
NormExpr: NormalizeGrafanaPromQL(s),
Source: QuerySource{
Kind: "grafana",
File: file,
JSONPath: p2,
},
})
}
}
}
walk(vv, p2)
}
case []any:
for i, vv := range t {
walk(vv, fmt.Sprintf("%s[%d]", jp, i))
}
default:
}
}
walk(v, "$")
return out
}
func NormalizeGrafanaPromQL(expr string) string {
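// strings.Replacer compares the old strings in argument order at each
// position, so longer macro names must come before their prefixes
// ($__interval_ms before $__interval). Otherwise "$__interval_ms" would
// become "1m_ms" and "$__range_s" would become "5m_s".
r := strings.NewReplacer(
"${__rate_interval}", "5m",
"$__rate_interval", "5m",
"${__interval_ms}", "60000",
"$__interval_ms", "60000",
"${__interval}", "1m",
"$__interval", "1m",
"${__range_ms}", "300000",
"$__range_ms", "300000",
"${__range_s}", "300",
"$__range_s", "300",
"${__range}", "5m",
"$__range", "5m",
)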
return r.Replace(expr)
}
type promRuleFile struct {
Groups []struct {
Name string `yaml:"name"`
Rules []struct {
Alert string `yaml:"alert"`
Record string `yaml:"record"`
Expr string `yaml:"expr"`
} `yaml:"rules"`
} `yaml:"groups"`
}
func ExtractPromRuleQueries(root string) ([]Query, error) {
var out []Query
err := filepath.WalkDir(root, func(path string, d os.DirEntry, err error) error {
if err != nil {
return err
}
if d.IsDir() {
return nil
}
low := strings.ToLower(path)
if !(strings.HasSuffix(low, ".yml") || strings.HasSuffix(low, ".yaml")) {
return nil
}
b, err := os.ReadFile(path)
if err != nil {
return err
}
dec := yaml.NewDecoder(bytes.NewReader(b))
for {
var doc promRuleFile
err := dec.Decode(&doc)
if errors.Is(err, io.EOF) {
break
}
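if err != nil {
// Stop at the first undecodable document, but report it so that a
// malformed rules file does not pass the gate unnoticed.
fmt.Fprintf(os.Stderr, "warning: skipping rest of %s: %v\n", path, err)
break
}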
for _, g := range doc.Groups {
for _, r := range g.Rules {
expr := strings.TrimSpace(r.Expr)
if expr == "" {
continue
}
name := r.Alert
if name == "" {
name = r.Record
}
out = append(out, Query{
Expr: expr,
NormExpr: expr,
Source: QuerySource{
Kind: "promrule",
File: path,
Group: g.Name,
Rule: name,
},
})
}
}
}
return nil
})
return out, err
}
func ScrapeAndIndexMetrics(url string, timeout time.Duration) (*MetricsIndex, error) {
client := &http.Client{Timeout: timeout}
resp, err := client.Get(url) // #nosec G107 - user-provided URL, intended for internal CI
if err != nil {
return nil, err
}
defer resp.Body.Close()
if resp.StatusCode < 200 || resp.StatusCode >= 300 {
return nil, fmt.Errorf("http %d", resp.StatusCode)
}
p := expfmt.NewTextParser(model.LegacyValidation)
fams, err := p.TextToMetricFamilies(resp.Body)
if err != nil {
return nil, err
}
idx := &MetricsIndex{
ByMetric: make(map[string][]Series, len(fams)),
Stats: make(map[string]*MetricStats, len(fams)),
}
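// addSeries indexes one series under the given sample name. extra holds
// optional key/value pairs (le, quantile) added on top of the base labels.
addSeries := func(sample string, base map[string]string, extra ...string) {
lm := make(map[string]string, len(base)+2)
lm[labels.MetricName] = sample
for k, v := range base {
lm[k] = v
}
for i := 0; i+1 < len(extra); i += 2 {
lm[extra[i]] = extra[i+1]
}
s := Series{Labels: lm}
idx.ByMetric[sample] = append(idx.ByMetric[sample], s)
idx.All = append(idx.All, s)
st := idx.Stats[sample]
if st == nil {
st = &MetricStats{
LabelKeys: map[string]struct{}{},
Values: map[string]map[string]struct{}{},
}
idx.Stats[sample] = st
}
st.SeriesCount++
for k, v := range lm {
if k == labels.MetricName {
continue
}
st.LabelKeys[k] = struct{}{}
if st.Values[k] == nil {
st.Values[k] = map[string]struct{}{}
}
// Keep at most 10 sample values per label for hints.
if len(st.Values[k]) < 10 {
st.Values[k][v] = struct{}{}
}
}
}
for name, mf := range fams {
for _, m := range mf.Metric {
base := make(map[string]string, len(m.Label))
for _, lp := range m.Label {
base[lp.GetName()] = lp.GetValue()
}
// The text parser folds histogram and summary samples into a single
// family, while PromQL selectors name the individual series. Index those
// series names so that selectors such as foo_bucket{le="0.5"} resolve.
// le and quantile values are re-rendered from the parsed floats, so their
// formatting may differ from the raw exposition text.
switch {
case m.GetHistogram() != nil:
addSeries(name+"_sum", base)
addSeries(name+"_count", base)
for _, b := range m.GetHistogram().GetBucket() {
addSeries(name+"_bucket", base, "le", fmt.Sprint(b.GetUpperBound()))
}
case m.GetSummary() != nil:
addSeries(name+"_sum", base)
addSeries(name+"_count", base)
for _, q := range m.GetSummary().GetQuantile() {
addSeries(name, base, "quantile", fmt.Sprint(q.GetQuantile()))
}
default:
addSeries(name, base)
}
}
}
for name := range idx.ByMetric {
idx.Names = append(idx.Names, name)
}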
sort.Strings(idx.Names)
return idx, nil
}
func VerifyQueries(queries []Query, idx *MetricsIndex) []Violation {
var violations []Violation
for _, q := range queries {
ast, err := parser.ParseExpr(q.NormExpr)
if err != nil {
violations = append(violations, Violation{
Reason: "promql_parse_error (often Grafana macro or templating; extend NormalizeGrafanaPromQL)",
Expr: q.Expr,
Source: q.Source,
Selector: []string{
err.Error(),
},
Suggestions: []string{
"Check for Grafana macros like $__... and normalize them before parsing.",
},
})
continue
}
selectorSets := parser.ExtractSelectors(ast)
if len(selectorSets) == 0 {
continue
}
for _, mset := range selectorSets {
if selectorSatisfied(mset, idx) {
continue
}
violations = append(violations, buildViolation(q, mset, idx))
}
}
return violations
}
func selectorSatisfied(matchers []*labels.Matcher, idx *MetricsIndex) bool {
candidates := candidateSeries(matchers, idx)
for _, s := range candidates {
if seriesMatchesAll(matchers, s) {
return true
}
}
return false
}
func seriesMatchesAll(matchers []*labels.Matcher, s Series) bool {
for _, m := range matchers {
v := s.Labels[m.Name]
if !m.Matches(v) {
return false
}
}
return true
}
func candidateSeries(matchers []*labels.Matcher, idx *MetricsIndex) []Series {
var nm *labels.Matcher
for _, m := range matchers {
if m.Name == labels.MetricName {
nm = m
break
}
}
if nm == nil {
return idx.All
}
if nm.Type == labels.MatchEqual {
return idx.ByMetric[nm.Value]
}
var out []Series
for metricName, ss := range idx.ByMetric {
if nm.Matches(metricName) {
out = append(out, ss...)
}
}
return out
}
func buildViolation(q Query, matchers []*labels.Matcher, idx *MetricsIndex) Violation {
v := Violation{
Reason: "selector_not_satisfiable_against_scraped_metrics",
Expr: q.Expr,
Source: q.Source,
}
for _, m := range matchers {
v.Selector = append(v.Selector, m.String())
if m.Name == labels.MetricName {
v.MetricHint = m.String()
}
}
metricName := extractExactMetricName(matchers)
if metricName != "" {
if _, ok := idx.ByMetric[metricName]; !ok {
v.Reason = "missing_metric_family_in_/metrics"
v.Suggestions = append(v.Suggestions,
fmt.Sprintf("exporter does not expose metric %q", metricName),
fmt.Sprintf("top metrics: %s", strings.Join(sampleStrings(idx.Names, 10), ", ")),
)
return v
}
st := idx.Stats[metricName]
if st != nil {
keys := make([]string, 0, len(st.LabelKeys))
for k := range st.LabelKeys {
keys = append(keys, k)
}
sort.Strings(keys)
v.Suggestions = append(v.Suggestions,
fmt.Sprintf("metric exists (%d series) but no series matches all matchers", st.SeriesCount),
fmt.Sprintf("available labels for %q: %s", metricName, strings.Join(keys, ", ")),
)
for _, m := range matchers {
if m.Name == labels.MetricName {
continue
}
if vals, ok := st.Values[m.Name]; ok && len(vals) > 0 {
v.Suggestions = append(v.Suggestions,
fmt.Sprintf("observed values for %q (sample): %s", m.Name, strings.Join(mapKeysSorted(vals), ", ")),
)
} else {
v.Suggestions = append(v.Suggestions,
fmt.Sprintf("label %q does not exist on this metric (rename or relabeling)", m.Name),
)
}
}
}
} else {
v.Suggestions = append(v.Suggestions,
"selector has no exact metric name; verification is broader and may be noisy",
fmt.Sprintf("top metrics: %s", strings.Join(sampleStrings(idx.Names, 10), ", ")),
)
}
return v
}
func extractExactMetricName(matchers []*labels.Matcher) string {
for _, m := range matchers {
if m.Name == labels.MetricName && m.Type == labels.MatchEqual {
return m.Value
}
}
return ""
}
func mapKeysSorted(m map[string]struct{}) []string {
out := make([]string, 0, len(m))
for k := range m {
out = append(out, k)
}
sort.Strings(out)
return out
}
func sampleStrings(xs []string, n int) []string {
if len(xs) <= n {
return xs
}
return xs[:n]
}
Local usage
go run ./tools/dashcontract \
-dashboards ./grafana/dashboards \
-rules ./prometheus/rules \
-metrics-url http://localhost:8080/metrics \
-report dashcontract_report.json
On failure you get the concrete expression, the selector, and hints about what is missing.
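The JSON report can also be post-processed, for example to see which files carry the most violations. The sketch below assumes the report layout produced by WriteJSONReport: the Violation struct has no JSON tags, so the keys are the Go field names. CountByFile is a hypothetical helper, and the sample input is hand-written for illustration, not captured from a real run.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// report mirrors only the parts of dashcontract_report.json used here.
type report struct {
	Violations []struct {
		Reason string
		Source struct {
			File string
		}
	} `json:"violations"`
}

// CountByFile returns the number of violations per source file.
func CountByFile(data []byte) (map[string]int, error) {
	var r report
	if err := json.Unmarshal(data, &r); err != nil {
		return nil, err
	}
	counts := make(map[string]int)
	for _, v := range r.Violations {
		counts[v.Source.File]++
	}
	return counts, nil
}

func main() {
	// Hand-written sample input, not output from a real run.
	sample := []byte(`{"violations":[
	  {"Reason":"missing_metric_family_in_/metrics","Source":{"Kind":"grafana","File":"grafana/dashboards/api.json"}},
	  {"Reason":"selector_not_satisfiable_against_scraped_metrics","Source":{"Kind":"promrule","File":"prometheus/rules/api.yaml"}},
	  {"Reason":"selector_not_satisfiable_against_scraped_metrics","Source":{"Kind":"grafana","File":"grafana/dashboards/api.json"}}
	]}`)
	counts, err := CountByFile(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	files := make([]string, 0, len(counts))
	for f := range counts {
		files = append(files, f)
	}
	sort.Strings(files)
	for _, f := range files {
		fmt.Printf("%d %s\n", counts[f], f)
	}
}
```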
Minimal CI gate
name: dashcontract
on: [pull_request]
jobs:
  dashcontract:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-go@v5
        with:
          go-version: "1.22"
      - name: Start app (example)
        run: |
          echo "TODO: start app and expose /metrics"
      - name: DashContract verify
        run: |
          go run ./tools/dashcontract \
            -dashboards ./grafana/dashboards \
            -rules ./prometheus/rules \
            -metrics-url http://localhost:8080/metrics \
            -report dashcontract_report.json
      - name: Upload report
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: dashcontract_report
          path: dashcontract_report.json
Notes on accuracy
- __name__ is a special label that holds the metric name
- /metrics is parsed with expfmt.NewTextParser and TextToMetricFamilies
- selector matchers are labels.Matcher values and use Matches
Extensions
- Verify group by (...) labels against the available labels
- A deprecation policy for metrics in a YAML manifest
- SARIF output for inline PR annotations
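The first extension could look roughly like this. It is a sketch under stated assumptions: MissingGroupLabels is a hypothetical helper that takes the grouping labels as a plain slice. In the real tool they would presumably come from the Grouping field of parser.AggregateExpr while walking the AST.

```go
package main

import (
	"fmt"
	"sort"
)

// MissingGroupLabels returns the grouping labels that no scraped series of
// the metric carries. Grouping by such a label does not fail in PromQL; it
// just collapses everything into one group, which is easy to miss.
func MissingGroupLabels(grouping []string, series []map[string]string) []string {
	seen := make(map[string]bool)
	for _, s := range series {
		for k := range s {
			seen[k] = true
		}
	}
	var missing []string
	for _, g := range grouping {
		if !seen[g] {
			missing = append(missing, g)
		}
	}
	sort.Strings(missing)
	return missing
}

func main() {
	series := []map[string]string{
		{"__name__": "http_requests_total", "method": "GET", "route": "/users", "status_code": "200"},
	}
	// sum by (route, status) (rate(http_requests_total[5m]))
	fmt.Println(MissingGroupLabels([]string{"route", "status"}, series)) // [status]
}
```

Labels created inside the expression itself, for example by label_replace, are not visible to this check, so it can report false positives for such queries.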
Related articles
- Prometheus Cardinality Explosion: detection, prevention, and recovery
- Cardinality Contracts: Prometheus labels as an API with a budget
Conclusion
Dash Contracts make observability dependencies explicit:
- extract the queries
- verify the selectors
- fail CI before the dashboards go dark
It is a simple check, but it catches a class of problems that most teams only discover in production.