Grafana dashboards

Three platform dashboards live in the operations repo at clusters/prod/monitoring/dashboards/ and are auto-mounted into Grafana by the kube-prometheus-stack sidecar.

| File | UID | Covers |
| --- | --- | --- |
| `verifluence-api.json` | `vf-api` | HTTP server traffic, domain bargauges, Node runtime |
| `verifluence-worker.json` | `vf-worker` | Poll-job lifecycle, external HTTP (incl. 401 detection), Node runtime |
| `verifluence-webhooks.json` | `vf-webhooks` | Inbound Kick webhook receiver + async-handler outcomes |

Open them at https://grafana.verifluence.io → search "Verifluence".

How dashboards reach Grafana

operations/                                    Grafana sidecar              Grafana UI
clusters/prod/monitoring/dashboards/             watches ConfigMaps           reloads
  verifluence-api.json        ─┐                 with label                   the
  verifluence-worker.json     ─┤   kustomize     grafana_dashboard: "1"       dashboard
  verifluence-webhooks.json   ─┘   →  ConfigMap  in any namespace             on change
  kustomization.yaml             one per file

The kustomization.yaml in that directory generates one ConfigMap per JSON file with the grafana_dashboard: "1" label. The Grafana sidecar container watches across all namespaces (searchNamespace: ALL) and mounts each ConfigMap as a dashboard.

A failure mode worth knowing about: if the kustomization has the label in the wrong place (it's generatorOptions.labels, not on the configMapGenerator entries), the ConfigMaps get created without the label and the sidecar silently ignores them. We hit and fixed this in April — see commit 2fe2ac1.
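
To make the label placement concrete, here is a minimal sketch that renders a kustomization of the shape described above. The real file is written by hand; the `render_kustomization` helper, the `monitoring` namespace, and the `disableNameSuffixHash` setting are assumptions for illustration and are not taken from the operations repo.

```python
from pathlib import Path


def render_kustomization(dashboard_files, namespace="monitoring"):
    """Return kustomization.yaml text with one ConfigMap per dashboard file.

    The namespace default and disableNameSuffixHash are assumptions.
    """
    lines = [
        "apiVersion: kustomize.config.k8s.io/v1beta1",
        "kind: Kustomization",
        f"namespace: {namespace}",
        "generatorOptions:",
        "  disableNameSuffixHash: true",
        "  labels:",
        # Set once here, so every generated ConfigMap carries the label.
        '    grafana_dashboard: "1"',
        "configMapGenerator:",
    ]
    for filename in sorted(dashboard_files):
        lines += [
            f"  - name: {Path(filename).stem}",
            "    files:",
            f"      - {filename}",
        ]
    return "\n".join(lines) + "\n"


print(render_kustomization([
    "verifluence-api.json",
    "verifluence-worker.json",
    "verifluence-webhooks.json",
]))
```

The label appears exactly once, under `generatorOptions.labels`, and the `configMapGenerator` entries carry only a name and a file.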

What each dashboard answers at a glance

`vf-api`

Are requests being served? Top row of stat tiles: req/sec, 5xx rate %, in-flight requests, p99 latency.
Which routes are slow? Per-route p99 latency time series (top 10 by request rate). Routes are labelled with Hono route patterns, not raw URLs, so label cardinality stays bounded.
Domain shape: bargauges for offers / negotiations / deals / submissions / campaigns by status. The API process refreshes them every 30 s via a setInterval that runs SELECT status, COUNT(*) … GROUP BY status per table.
Process health: event-loop lag (red threshold at 100 ms), heap + RSS, CPU, GC pause p95.
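
One subtlety of the GROUP BY refresh is that a status with zero rows does not appear in the result, so a gauge for it would keep its previous value unless it is reset. The sketch below shows the idea in Python with an in-memory SQLite table, purely for illustration. The real refresher runs inside the Node API process, and `KNOWN_STATUSES`, the table layout, and `count_by_status` are hypothetical.

```python
import sqlite3

# Hypothetical status vocabulary; the real statuses are not listed in this page.
KNOWN_STATUSES = ("draft", "active", "closed")


def count_by_status(conn, table, known_statuses=KNOWN_STATUSES):
    """Return {status: count}, with explicit zeros for statuses that have no rows."""
    if not table.isidentifier():
        raise ValueError(f"unexpected table name: {table!r}")
    counts = {status: 0 for status in known_statuses}
    rows = conn.execute(f"SELECT status, COUNT(*) FROM {table} GROUP BY status")
    for status, n in rows:
        counts[status] = n
    return counts


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE offers (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany(
    "INSERT INTO offers (status) VALUES (?)",
    [("draft",), ("active",), ("active",)],
)
print(count_by_status(conn, "offers"))
# prints {'draft': 1, 'active': 2, 'closed': 0}
```

Each refresh would then set every gauge from this dictionary, zeros included, so a status that empties out drops to 0 on the bargauge.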

`vf-worker`

The headline panel is "Seconds since last successful run (per job)" — a table with green ≤ 300 s, yellow 300–1800 s, red > 1800 s. PromQL:

```promql
time() - max by(job) (vf_worker_poll_last_success_timestamp_seconds{service="worker"})
```
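
As a rough illustration of what the panel computes, the sketch below mirrors the query (latest success timestamp per job, subtracted from the current time) and the colour thresholds stated above. The function names and the job names `poll-a` and `poll-b` are hypothetical. Grafana applies the thresholds itself, so this only restates the logic.

```python
def seconds_since_last_success(now, samples):
    """Mirror of time() - max by(job)(...): samples are (job, unix_timestamp) pairs."""
    latest = {}
    for job, timestamp in samples:
        latest[job] = max(timestamp, latest.get(job, timestamp))
    return {job: now - ts for job, ts in latest.items()}


def staleness_colour(seconds):
    """Green up to 300 s, yellow up to 1800 s, red beyond that."""
    if seconds <= 300:
        return "green"
    if seconds <= 1800:
        return "yellow"
    return "red"


samples = [("poll-a", 1000), ("poll-a", 1200), ("poll-b", 100)]
for job, age in seconds_since_last_success(1300, samples).items():
    print(job, age, staleness_colour(age))
# prints:
# poll-a 100 green
# poll-b 1200 yellow
```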

Plus:

Per-job error rate, p95 duration, runs/min
401 / network / 429 stats with thresholds (the Kick auth-health row)
External HTTP rate by service+endpoint+status — the full audit trail of every Kick / X / OAuth call
Worker process health (same Node defaults as the API)

`vf-webhooks`

Stat row: OK/sec, invalid_signature/sec, stale_timestamp/sec, duplicate/sec. The invalid_signature and stale_timestamp tiles are the security signals; duplicate is the informational retry counter.
Receive rate by event_type + result mix (which failure modes are actually happening)
Sync ack latency p50/p95/p99 with the Kick 1 s retry threshold marked as a red line
Async-handler error rate %, p95 duration, per-event_type breakdowns, total runs/min by event type
A source template variable so the dashboard generalises if we add more webhook providers later (Twitch, YouTube, etc.)

Adding a new dashboard

1. Build it in the Grafana UI, using the ${datasource} template variable for Prometheus so it isn't pinned to a specific datasource UID.
2. Export it from the share dialog's "Export" tab: check "Export for sharing externally" so the JSON includes the __inputs / __requires blocks, then "Save to file". The simpler internal JSON is fine if the dashboard will only ever be provisioned.
3. Drop the file at operations/clusters/prod/monitoring/dashboards/<name>.json.
4. Add a configMapGenerator entry for it in the directory's kustomization.yaml. The grafana_dashboard: "1" label is set once at the file level via generatorOptions.labels, so the entry itself needs no label.
5. Commit and push. Flux reconciles, the sidecar mounts the new ConfigMap, and Grafana reloads.
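
Before committing, it can help to confirm the exported JSON really uses a datasource variable throughout. The following is a minimal sketch of such a check, assuming the usual dashboard JSON shape in which a `datasource` field is either a string or an object with a `uid`. The `pinned_datasources` helper is hypothetical and not part of the repo; built-in references such as `-- Grafana --` are skipped.

```python
import re

# Matches ${datasource}, $datasource, ${DS_PROMETHEUS} and similar variable references.
VARIABLE_REF = re.compile(r"^\$\{?\w+\}?$")


def pinned_datasources(node, path="$"):
    """Yield (json_path, value) for datasource references that are not variables."""
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}"
            if key == "datasource":
                ref = value.get("uid") if isinstance(value, dict) else value
                if (
                    isinstance(ref, str)
                    and not ref.startswith("-- ")  # built-ins like "-- Grafana --"
                    and not VARIABLE_REF.match(ref)
                ):
                    yield child, ref
            else:
                yield from pinned_datasources(value, child)
    elif isinstance(node, list):
        for i, item in enumerate(node):
            yield from pinned_datasources(item, f"{path}[{i}]")


dashboard = {
    "panels": [
        {"title": "ok", "datasource": {"type": "prometheus", "uid": "${datasource}"}},
        {"title": "pinned", "datasource": {"type": "prometheus", "uid": "abc123"}},
    ],
    "annotations": {
        "list": [{"datasource": {"type": "grafana", "uid": "-- Grafana --"}}]
    },
}
print(list(pinned_datasources(dashboard)))
# prints [('$.panels[1].datasource', 'abc123')]
```

In practice the dictionary would come from `json.load` on the exported file, and an empty result means no panel or target is tied to a specific datasource UID.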

Iterating on an existing dashboard

You cannot safely edit a provisioned dashboard in the Grafana UI — the sidecar will overwrite your changes on the next reconcile. The working pattern is:

1. Open the dashboard in Grafana, click the gear → "Save as", and create a draft copy in your personal folder.
2. Iterate on the draft until you're happy.
3. Export the draft, replace the file in operations, and commit.
4. Delete the draft.
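
A draft created with "Save as" typically gets its own UID and title, so the exported JSON needs the provisioned dashboard's identity restored before it replaces the file. Here is a minimal sketch of that step. The `prepare_for_commit` helper is hypothetical, and setting `id` to null is a common provisioning convention rather than something this page prescribes.

```python
import copy


def prepare_for_commit(draft, provisioned):
    """Return the draft's content carrying the provisioned dashboard's identity."""
    result = copy.deepcopy(draft)
    result["uid"] = provisioned["uid"]      # keep deep links stable
    result["title"] = provisioned["title"]  # drop any "Copy" suffix from the draft
    result["id"] = None                     # numeric id is local to one Grafana instance
    return result


provisioned = {"id": 12, "uid": "vf-worker", "title": "Verifluence Worker", "panels": []}
draft = {
    "id": 87,
    "uid": "draft-uid",
    "title": "Verifluence Worker Copy",
    "panels": [{"title": "new panel"}],
}
merged = prepare_for_commit(draft, provisioned)
print(merged["uid"], merged["title"], len(merged["panels"]))
# prints vf-worker Verifluence Worker 1
```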

The "Make editable" UI option is misleading in our setup — the file-backed ConfigMap is the source of truth.

Naming + labels conventions

UID stays stable (vf-api, vf-worker, vf-webhooks) so deep links never break. If you fork a dashboard, give it a new UID.
Tags include verifluence plus a domain tag (api, worker, webhooks). Grafana's tag-based search makes the family easy to find.
Refresh default 30 s — matches the Prometheus scrape interval.
Time range default "now-1h" for traffic dashboards, "now-3h" for slower-moving worker/webhook activity.
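
These conventions are easy to check mechanically. The sketch below lints a dashboard's JSON against them, assuming the standard top-level `uid`, `tags`, `refresh`, and `time.from` fields. The `convention_problems` helper is hypothetical, and treating `vf-` as a required UID prefix is an inference from the three existing UIDs, not a stated rule.

```python
DOMAIN_TAGS = {"api", "worker", "webhooks"}  # extend when a new domain is added
ALLOWED_RANGES = {"now-1h", "now-3h"}


def convention_problems(dashboard):
    """Return a list of human-readable convention violations (empty if none)."""
    problems = []
    if not str(dashboard.get("uid", "")).startswith("vf-"):
        problems.append("uid should start with 'vf-' (assumed prefix)")
    tags = set(dashboard.get("tags", []))
    if "verifluence" not in tags:
        problems.append("missing 'verifluence' tag")
    if not tags & DOMAIN_TAGS:
        problems.append("missing a domain tag")
    if dashboard.get("refresh") != "30s":
        problems.append("refresh should be '30s'")
    if dashboard.get("time", {}).get("from") not in ALLOWED_RANGES:
        problems.append("time range should start at now-1h or now-3h")
    return problems


good = {
    "uid": "vf-api",
    "tags": ["verifluence", "api"],
    "refresh": "30s",
    "time": {"from": "now-1h", "to": "now"},
}
print(convention_problems(good))
# prints []
```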

Source

JSON: operations/clusters/prod/monitoring/dashboards/
Sidecar config: clusters/prod/monitoring/releases/kube-prometheus-stack.yaml under grafana.sidecar.dashboards
