Grafana dashboards
Three platform dashboards live in the operations repo at clusters/prod/monitoring/dashboards/ and are auto-mounted into Grafana by the kube-prometheus-stack sidecar.
| File | UID | Covers |
|---|---|---|
| verifluence-api.json | vf-api | HTTP server traffic, domain bargauges, Node runtime |
| verifluence-worker.json | vf-worker | Poll-job lifecycle, external HTTP (incl. 401 detection), Node runtime |
| verifluence-webhooks.json | vf-webhooks | Inbound Kick webhook receiver + async-handler outcomes |
Open them at https://grafana.verifluence.io → search "Verifluence".
How dashboards reach Grafana

    operations/clusters/prod/monitoring/dashboards/
      verifluence-api.json      ─┐
      verifluence-worker.json   ─┤ kustomize → ConfigMap, one per file
      verifluence-webhooks.json ─┘
      kustomization.yaml
        ↓
    Grafana sidecar: watches ConfigMaps with label grafana_dashboard: "1" in any namespace
        ↓
    Grafana UI: reloads the dashboard on change

The `kustomization.yaml` in that directory generates one ConfigMap per JSON file with the `grafana_dashboard: "1"` label. The Grafana sidecar container watches across all namespaces (`searchNamespace: ALL`) and mounts each ConfigMap as a dashboard.
A failure mode worth knowing about: if the kustomization has the label in the wrong place (it's generatorOptions.labels, not on the configMapGenerator entries), the ConfigMaps get created without the label and the sidecar silently ignores them. We hit and fixed this in April — see commit 2fe2ac1.
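Because the sidecar ignores unlabelled ConfigMaps without any error, a small pre-commit check can catch the mistake early. The following is a minimal sketch, not the repo's actual tooling: the `Kustomization` interface and `checkDashboardLabel` are hypothetical names, and the input is assumed to be a `kustomization.yaml` that has already been parsed into a plain object.

```typescript
// Hypothetical shape of a parsed kustomization.yaml; only the fields used here.
interface Kustomization {
  generatorOptions?: { labels?: Record<string, string> };
  configMapGenerator?: Array<{ name: string; files?: string[] }>;
}

// Returns a list of problems; an empty list means the sidecar label is where
// this document says it belongs (generatorOptions.labels).
function checkDashboardLabel(k: Kustomization): string[] {
  const problems: string[] = [];

  const label = k.generatorOptions?.labels?.["grafana_dashboard"];
  if (label !== "1") {
    problems.push('generatorOptions.labels.grafana_dashboard must be "1"');
  }

  for (const entry of k.configMapGenerator ?? []) {
    // A labels key placed directly on a generator entry is the misplacement
    // described above, so flag it explicitly.
    if ("labels" in entry) {
      problems.push('configMapGenerator entry "' + entry.name + '" has its own labels key');
    }
  }
  return problems;
}
```

An empty result means the label is in the expected place; anything else is worth fixing before Flux reconciles the change.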
What each dashboard answers at a glance
vf-api
- Are requests being served? Top: req/sec, 5xx rate %, in-flight, p99 latency stat tiles
- Which routes are slow? Per-route p99 latency time series (top 10 by request rate). Hono route patterns, not URLs — bounded cardinality.
- Domain shape: bargauges for offers / negotiations / deals / submissions / campaigns by status. Refreshed every 30 s by the API process via a `setInterval` that runs `SELECT status, COUNT(*) … GROUP BY status` per table.
- Process health: event-loop lag (with red threshold at 100 ms), heap+RSS, CPU, GC pause p95.
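The domain-gauge refresh described above could look roughly like the sketch below. It is an illustration under assumptions, not the API's real code: `QueryStatusCounts`, `GaugeStore`, `applyStatusCounts`, `refreshAll` and `startDomainGauges` are hypothetical names, the table names are taken from the domain list, and a plain `Map` stands in for whatever metrics client the API actually uses.

```typescript
// One row of the per-table `SELECT status, COUNT(*) … GROUP BY status` query.
// The `count` field name is an assumption.
interface StatusCount { status: string; count: number }

// Hypothetical query function; the real API would wrap its database client here.
type QueryStatusCounts = (table: string) => Promise<StatusCount[]>;

// In-memory stand-in for a labelled gauge, keyed as "table:status".
type GaugeStore = Map<string, number>;

const TABLES = ["offers", "negotiations", "deals", "submissions", "campaigns"];

// Replace every value for one table so a status that dropped to zero rows
// does not keep reporting its old count.
function applyStatusCounts(gauges: GaugeStore, table: string, rows: StatusCount[]): void {
  for (const key of Array.from(gauges.keys())) {
    if (key.startsWith(table + ":")) gauges.delete(key);
  }
  for (const row of rows) gauges.set(table + ":" + row.status, row.count);
}

async function refreshAll(gauges: GaugeStore, query: QueryStatusCounts): Promise<void> {
  for (const table of TABLES) {
    try {
      applyStatusCounts(gauges, table, await query(table));
    } catch (err) {
      // Keep the previous values for this table; one failed query should not
      // stop the remaining tables from refreshing.
      console.error("status count refresh failed for " + table, err);
    }
  }
}

// Starts the 30 s refresh loop and returns a function that stops it.
function startDomainGauges(gauges: GaugeStore, query: QueryStatusCounts, intervalMs = 30000): () => void {
  const timer = setInterval(() => { void refreshAll(gauges, query); }, intervalMs);
  return () => clearInterval(timer);
}
```

Clearing a table's old keys before writing the new counts is a design choice in this sketch: without it, a status with no remaining rows would show a stale bar on the dashboard.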
vf-worker
The headline panel is "Seconds since last successful run (per job)" — a table with green ≤ 300 s, yellow 300–1800 s, red > 1800 s. PromQL:

    time() - max by(job) (vf_worker_poll_last_success_timestamp_seconds{service="worker"})

Plus:
- Per-job error rate, p95 duration, runs/min
- 401 / network / 429 stats with thresholds (the Kick auth-health row)
- External HTTP rate by service+endpoint+status — the full audit trail of every Kick / X / OAuth call
- Worker process health (same Node defaults as the API)
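To make the headline panel's logic concrete, the sketch below mirrors the query and the colour thresholds in plain TypeScript. It is illustrative only: `secondsSinceLastSuccess` and `classifyStaleness` are hypothetical helpers, and Grafana performs this evaluation itself from the PromQL and the panel thresholds.

```typescript
type Staleness = "green" | "yellow" | "red";

// Equivalent of `time() - max by(job) (...)`: keep the newest success
// timestamp per job, then compute its age in seconds.
function secondsSinceLastSuccess(
  nowSeconds: number,
  samples: Array<{ job: string; timestampSeconds: number }>
): Map<string, number> {
  const newest = new Map<string, number>();
  for (const s of samples) {
    const prev = newest.get(s.job);
    if (prev === undefined || s.timestampSeconds > prev) {
      newest.set(s.job, s.timestampSeconds);
    }
  }
  const ages = new Map<string, number>();
  newest.forEach((ts, job) => ages.set(job, nowSeconds - ts));
  return ages;
}

// Panel thresholds: green up to 300 s, yellow up to 1800 s, red beyond that.
function classifyStaleness(ageSeconds: number): Staleness {
  if (ageSeconds <= 300) return "green";
  if (ageSeconds <= 1800) return "yellow";
  return "red";
}
```

A job whose last success was 1000 s ago would therefore show yellow, and one that has not succeeded for an hour would show red.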
vf-webhooks
- Stat row: OK/sec, invalid_signature/sec, stale_timestamp/sec, duplicate/sec — the security-signal trio plus the informational retry counter
- Receive rate by event_type + result mix (which fail-modes are actually happening)
- Sync ack latency p50/p95/p99 with the Kick 1 s retry threshold marked as a redline
- Async-handler error rate %, p95 duration, per-event_type breakdowns, total runs/min by event type
- A `source` template variable so the dashboard generalises if we add more webhook providers later (Twitch, YouTube, etc.)
Adding a new dashboard
- Build it in the Grafana UI with the `${datasource}` template variable for Prometheus (so it isn't pinned to a specific datasource UID).
- Export via the share-link "Export" tab → "Save to file" → check "Export for sharing externally" so it includes `__inputs` / `__requires` blocks. Or use the simpler internal JSON if you'll always provision it.
- Drop the file at `operations/clusters/prod/monitoring/dashboards/<name>.json`.
- Add a `configMapGenerator` entry for it in the directory's `kustomization.yaml`. The `grafana_dashboard: "1"` label is set once at the file level via `generatorOptions.labels`.
- Commit + push: Flux reconciles, the sidecar mounts the new ConfigMap, Grafana reloads.
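Before committing an exported file, it can help to confirm that no panel is still pinned to a concrete datasource UID. The sketch below is one possible check, assuming the dashboard JSON has already been parsed. `DashboardJson` and `findPinnedDatasources` are hypothetical names. Only top-level panels are inspected, so panels nested inside collapsed rows would need an extra pass.

```typescript
// Only the parts of the Grafana dashboard JSON that this check reads.
interface DashboardJson {
  templating?: { list?: Array<{ name?: string; type?: string }> };
  panels?: Array<{ title?: string; datasource?: unknown }>;
}

// Both spellings of the variable reference are treated as acceptable.
const ALLOWED_REFS = ["${datasource}", "$datasource"];

function findPinnedDatasources(d: DashboardJson): string[] {
  const problems: string[] = [];

  const hasVariable = (d.templating?.list ?? []).some(
    (v) => v.type === "datasource" && v.name === "datasource"
  );
  if (!hasVariable) problems.push('no template variable named "datasource"');

  for (const panel of d.panels ?? []) {
    const ds = panel.datasource;
    // The field can be a plain string or an object carrying a uid,
    // depending on the Grafana version that exported the dashboard.
    let ref: unknown;
    if (typeof ds === "string") ref = ds;
    else if (typeof ds === "object" && ds !== null) ref = (ds as { uid?: unknown }).uid;

    if (ref !== undefined && (typeof ref !== "string" || ALLOWED_REFS.indexOf(ref) === -1)) {
      problems.push('panel "' + (panel.title ?? "untitled") + '" is pinned to ' + String(ref));
    }
  }
  return problems;
}
```

Panels that omit the `datasource` field are left alone here, since they inherit the dashboard default.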
Iterating on an existing dashboard
You cannot safely edit a provisioned dashboard in the Grafana UI — the sidecar will overwrite your changes on the next reconcile. The working pattern is:
- Open the dashboard in Grafana, click the gear → "Save as" → create a draft copy in your personal namespace
- Iterate on the draft until you're happy
- Export the draft, replace the file in `operations`, commit
- Delete the draft
The "Make editable" UI option is misleading in our setup — the file-backed ConfigMap is the source of truth.
Naming + labels conventions
- UID stays stable (`vf-api`, `vf-worker`, `vf-webhooks`) so deep links never break. If you fork a dashboard, give it a new UID.
- Tags include `verifluence` plus a domain tag (`api`, `worker`, `webhooks`). Grafana's tag-based search makes the family easy to find.
- Refresh default 30 s, matching the Prometheus scrape interval.
- Time range default "now-1h" for traffic dashboards, "now-3h" for slower-moving worker/webhook activity.
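These conventions are simple enough to lint mechanically. The following sketch assumes a parsed dashboard JSON and checks only the top-level `uid`, `tags`, `refresh` and `time.from` fields. `DashboardMeta` and `checkConventions` are hypothetical names, and the requirement of at least one tag besides `verifluence` is this sketch's reading of "plus a domain tag".

```typescript
// Top-level dashboard fields relevant to the conventions above.
interface DashboardMeta {
  uid?: string;
  tags?: string[];
  refresh?: string;
  time?: { from?: string; to?: string };
}

const ALLOWED_FROM = ["now-1h", "now-3h"];

function checkConventions(d: DashboardMeta): string[] {
  const problems: string[] = [];
  const tags = d.tags ?? [];

  // A missing uid lets Grafana assign one, which would break stable deep links.
  if (!d.uid) problems.push("uid is missing");

  if (tags.indexOf("verifluence") === -1) problems.push('tag "verifluence" is missing');
  if (tags.filter((t) => t !== "verifluence").length === 0) {
    problems.push("no domain tag (for example api, worker, webhooks)");
  }

  if (d.refresh !== "30s") problems.push('refresh should be "30s"');

  const from = d.time?.from;
  if (from === undefined || ALLOWED_FROM.indexOf(from) === -1) {
    problems.push('time.from should be "now-1h" or "now-3h"');
  }
  return problems;
}
```

The check does not verify that a forked dashboard received a fresh UID, since that requires comparing against the other files in the directory.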
Source
- JSON: `operations/clusters/prod/monitoring/dashboards/`
- Sidecar config: `clusters/prod/monitoring/releases/kube-prometheus-stack.yaml` under `grafana.sidecar.dashboards`