
Grafana dashboards

Three platform dashboards live in the operations repo at clusters/prod/monitoring/dashboards/ and are auto-mounted into Grafana by the kube-prometheus-stack sidecar.

File                        UID          Covers
verifluence-api.json        vf-api       HTTP server traffic, domain bargauges, Node runtime
verifluence-worker.json     vf-worker    Poll-job lifecycle, external HTTP (incl. 401 detection), Node runtime
verifluence-webhooks.json   vf-webhooks  Inbound Kick webhook receiver + async-handler outcomes

Open them at https://grafana.verifluence.io and search for "Verifluence".

How dashboards reach Grafana

operations/                                    Grafana sidecar              Grafana UI
clusters/prod/monitoring/dashboards/             watches ConfigMaps           reloads
  verifluence-api.json        ─┐                 with label                   the
  verifluence-worker.json     ─┤   kustomize     grafana_dashboard: "1"       dashboard
  verifluence-webhooks.json   ─┘   →  ConfigMap  in any namespace             on change
  kustomization.yaml             one per file

The kustomization.yaml in that directory generates one ConfigMap per JSON file with the grafana_dashboard: "1" label. The Grafana sidecar container watches across all namespaces (searchNamespace: ALL) and mounts each ConfigMap as a dashboard.

A failure mode worth knowing about: the label belongs under generatorOptions.labels, not on the individual configMapGenerator entries. If it sits in the wrong place, the ConfigMaps are created without the label and the sidecar silently ignores them. We hit and fixed this in April (see commit 2fe2ac1).
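The label placement can be checked mechanically before a commit. The following is a minimal sketch, assuming the kustomization.yaml has already been parsed into a plain object by whatever YAML parser you use. The function name checkDashboardLabel and the ConfigMap names in the sample are hypothetical; only the field names generatorOptions, labels, configMapGenerator, and files come from kustomize.

```typescript
// The parts of a parsed kustomization.yaml that matter for dashboard discovery.
interface Kustomization {
  generatorOptions?: { labels?: Record<string, unknown> };
  configMapGenerator?: Array<{ name: string; files?: string[] }>;
}

// Returns a list of problems; an empty list means the sidecar label is in place.
function checkDashboardLabel(k: Kustomization): string[] {
  const problems: string[] = [];
  const value = k.generatorOptions?.labels?.["grafana_dashboard"];
  if (value === undefined) {
    problems.push("generatorOptions.labels.grafana_dashboard is missing");
  } else if (typeof value !== "string") {
    // An unquoted 1 in YAML parses as a number; Kubernetes label values must be strings.
    problems.push('grafana_dashboard must be the quoted string "1"');
  } else if (value !== "1") {
    problems.push(`grafana_dashboard is "${value}", expected "1"`);
  }
  if ((k.configMapGenerator ?? []).length === 0) {
    problems.push("configMapGenerator has no entries");
  }
  return problems;
}

// Sample mirroring the layout described above (ConfigMap names are illustrative).
const goodKustomization: Kustomization = {
  generatorOptions: { labels: { grafana_dashboard: "1" } },
  configMapGenerator: [
    { name: "verifluence-api", files: ["verifluence-api.json"] },
    { name: "verifluence-worker", files: ["verifluence-worker.json"] },
    { name: "verifluence-webhooks", files: ["verifluence-webhooks.json"] },
  ],
};
```

Running a check like this in CI would turn the silent failure into a loud one, since the sidecar itself gives no signal when the label is absent.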

What each dashboard answers at a glance

vf-api

  • Are requests being served? The top row of stat tiles shows req/sec, 5xx rate %, in-flight requests, and p99 latency.
  • Which routes are slow? Per-route p99 latency time series (top 10 by request rate). Labels are Hono route patterns, not URLs, so cardinality stays bounded.
  • Domain shape: bar gauges for offers / negotiations / deals / submissions / campaigns by status. The API process refreshes them every 30 s via a setInterval that runs SELECT status, COUNT(*) … GROUP BY status per table.
  • Process health: event-loop lag (red threshold at 100 ms), heap + RSS, CPU, GC pause p95.
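The 30 s domain-gauge refresh described above could look roughly like the sketch below. This is not the API's actual code: the LabeledGauge interface is a stand-in modelled on the set(labels, value) and reset() methods of a prom-client Gauge, the query function is injected so no database driver is assumed, and the table names are taken from the bullet rather than from the schema.

```typescript
// One row per status, as a GROUP BY status query would return it.
interface StatusCount { status: string; count: number }

// Minimal gauge surface (assumption: shaped like prom-client's Gauge).
interface LabeledGauge {
  reset(): void;
  set(labels: Record<string, string>, value: number): void;
}

// Hypothetical query hook: runs the per-table status count and returns the rows.
type CountQuery = (table: string) => Promise<StatusCount[]>;

const DOMAIN_TABLES = ["offers", "negotiations", "deals", "submissions", "campaigns"];

async function refreshDomainGauges(gauge: LabeledGauge, query: CountQuery): Promise<void> {
  // Fetch everything first so the gauge is only cleared once fresh data is in hand.
  const results = await Promise.all(
    DOMAIN_TABLES.map(async (table) => ({ table, rows: await query(table) })),
  );
  // GROUP BY omits statuses with zero rows; resetting drops their stale samples.
  gauge.reset();
  for (const { table, rows } of results) {
    for (const row of rows) {
      gauge.set({ table, status: row.status }, row.count);
    }
  }
}

// Runs one refresh immediately, then one every intervalMs (30 s by default).
function startDomainGaugeRefresh(
  gauge: LabeledGauge,
  query: CountQuery,
  intervalMs: number = 30_000,
): ReturnType<typeof setInterval> {
  const tick = (): void => {
    refreshDomainGauges(gauge, query).catch((err) => {
      console.error("domain gauge refresh failed", err);
    });
  };
  tick();
  return setInterval(tick, intervalMs);
}
```

Resetting before setting is a design choice worth noting: without it, a status whose count drops to zero would keep reporting its last non-zero value, because the query no longer returns a row for it.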

vf-worker

The headline panel is "Seconds since last successful run (per job)" — a table with green ≤ 300 s, yellow 300–1800 s, red > 1800 s. PromQL:

```promql
time() - max by(job) (vf_worker_poll_last_success_timestamp_seconds{service="worker"})
```
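To make the colour bands concrete, here is a small sketch of the same arithmetic and thresholds outside Grafana. The function names are hypothetical, and the boundaries follow the panel description above: 300 s still counts as green and 1800 s still counts as yellow.

```typescript
type Staleness = "green" | "yellow" | "red";

// Mirrors time() - <last success timestamp>: both sides are in Unix seconds.
function secondsSinceLastSuccess(
  lastSuccessEpochSeconds: number,
  nowMs: number = Date.now(),
): number {
  return nowMs / 1000 - lastSuccessEpochSeconds;
}

// Applies the panel thresholds: green <= 300 s, yellow up to 1800 s, red beyond.
function classifyStaleness(secondsSinceSuccess: number): Staleness {
  if (secondsSinceSuccess > 1800) return "red";
  if (secondsSinceSuccess > 300) return "yellow";
  return "green";
}
```

A job that last succeeded ten minutes ago yields 600 s and lands in the yellow band.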

Plus:

  • Per-job error rate, p95 duration, runs/min
  • 401 / network / 429 stats with thresholds (the Kick auth-health row)
  • External HTTP rate by service+endpoint+status — the full audit trail of every Kick / X / OAuth call
  • Worker process health (same Node defaults as the API)

vf-webhooks

  • Stat row: OK/sec, invalid_signature/sec, stale_timestamp/sec, duplicate/sec — the security-signal trio plus the informational retry counter
  • Receive rate by event_type + result mix (which fail-modes are actually happening)
  • Sync ack latency p50/p95/p99 with the Kick 1 s retry threshold marked as a redline
  • Async-handler error rate %, p95 duration, per-event_type breakdowns, total runs/min by event type
  • A source template variable so the dashboard generalises if we add more webhook providers later (Twitch, YouTube, etc.)
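For readers unfamiliar with the result values in the stat row, the sketch below shows one way a receiver might assign them. It is illustrative only and not the actual webhook handler: the InboundWebhook shape, the five-minute tolerance, the lowercase "ok" value, the in-memory Set, and the order of the checks are all assumptions, and signature verification is treated as already done elsewhere.

```typescript
type WebhookResult = "ok" | "invalid_signature" | "stale_timestamp" | "duplicate";

// Hypothetical shape of an inbound delivery after parsing.
interface InboundWebhook {
  messageId: string;
  timestampMs: number;
  signatureValid: boolean; // outcome of the provider's signature check
}

// Assumed tolerance; the real receiver may use a different window.
const MAX_AGE_MS = 5 * 60 * 1000;

function classifyWebhook(
  hook: InboundWebhook,
  seenIds: Set<string>,
  nowMs: number,
): WebhookResult {
  // Security checks come first so unauthenticated input never touches dedup state.
  if (!hook.signatureValid) return "invalid_signature";
  if (Math.abs(nowMs - hook.timestampMs) > MAX_AGE_MS) return "stale_timestamp";
  // A repeated message id is an ordinary provider retry, not a security signal.
  if (seenIds.has(hook.messageId)) return "duplicate";
  seenIds.add(hook.messageId);
  return "ok";
}
```

Each returned value would feed a counter labelled by result, which is what the per-second stat tiles would then rate over.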

Adding a new dashboard

  1. Build it in the Grafana UI with the ${datasource} template variable for Prometheus (so it isn't pinned to a specific datasource UID).
  2. Export it from the share dialog's "Export" tab: check "Export for sharing externally" so the JSON includes the __inputs / __requires blocks, then click "Save to file". If the dashboard will only ever be provisioned, the simpler internal JSON is fine.
  3. Drop the file at operations/clusters/prod/monitoring/dashboards/<name>.json.
  4. Add a configMapGenerator entry for it in the directory's kustomization.yaml. The grafana_dashboard: "1" label is set once at the file level via generatorOptions.labels.
  5. Commit + push — Flux reconciles, the sidecar mounts the new ConfigMap, Grafana reloads.
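Step 4 is the easiest one to forget, and a forgotten entry fails quietly because the file simply never becomes a ConfigMap. A minimal sketch of a consistency check follows, assuming you have the directory's JSON file names and the parsed configMapGenerator list in hand. The function name unreferencedDashboards and the GeneratorEntry type are hypothetical.

```typescript
// One configMapGenerator entry from the parsed kustomization.yaml.
interface GeneratorEntry { name: string; files?: string[] }

// Returns the dashboard files that no configMapGenerator entry references.
function unreferencedDashboards(jsonFiles: string[], generators: GeneratorEntry[]): string[] {
  const referenced = new Set<string>();
  for (const generator of generators) {
    for (const file of generator.files ?? []) {
      // kustomize accepts "key=path" as well as a bare "path".
      const path = file.indexOf("=") >= 0 ? file.slice(file.indexOf("=") + 1) : file;
      referenced.add(path);
    }
  }
  return jsonFiles.filter((file) => !referenced.has(file)).sort();
}
```

An empty result means every dashboard file in the directory will be turned into a ConfigMap.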

Iterating on an existing dashboard

You cannot safely edit a provisioned dashboard in the Grafana UI — the sidecar will overwrite your changes on the next reconcile. The working pattern is:

  1. Open the dashboard in Grafana, click the gear → "Save as" → create a draft copy in your personal namespace
  2. Iterate on the draft until you're happy
  3. Export the draft, replace the file in operations, commit
  4. Delete the draft

The "Make editable" UI option is misleading in our setup — the file-backed ConfigMap is the source of truth.
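One detail in step 3 is easy to miss: a copy created with "Save as" carries its own UID and title, so the exported draft needs the original identity restored before it replaces the file. The sketch below shows that transformation under stated assumptions. The promoteDraft helper is hypothetical, and nulling the numeric id follows the common practice for file-provisioned dashboards rather than anything stated in this document.

```typescript
// Only the identity fields are typed; everything else passes through untouched.
interface ExportedDashboard {
  id?: number | null;
  uid?: string;
  title?: string;
  [key: string]: unknown;
}

// Returns a copy of the draft carrying the provisioned dashboard's identity.
function promoteDraft(
  draft: ExportedDashboard,
  target: { uid: string; title: string },
): ExportedDashboard {
  return {
    ...draft,
    id: null, // the numeric id is instance-specific, so it is cleared here
    uid: target.uid, // keeps existing deep links working
    title: target.title,
  };
}
```

For example, promoteDraft(exportedDraft, { uid: "vf-api", title: "Verifluence API" }) would produce the JSON to write over verifluence-api.json, where the title shown is only a placeholder.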

Naming + labels conventions

  • UID stays stable (vf-api, vf-worker, vf-webhooks) so deep links never break. If you fork a dashboard, give it a new UID.
  • Tags include verifluence plus a domain tag (api, worker, webhooks). Grafana's tag-based search makes the family easy to find.
  • Refresh default 30 s — matches the Prometheus scrape interval.
  • Time range default "now-1h" for traffic dashboards, "now-3h" for slower-moving worker/webhook activity.
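These conventions are simple enough to lint. The following sketch checks an exported dashboard against them, assuming the standard top-level fields of Grafana's dashboard JSON (uid, tags, refresh, time.from). The checkConventions name is hypothetical, and the check does not try to decide which of the two time ranges a given dashboard should use.

```typescript
// The top-level dashboard JSON fields that the conventions refer to.
interface DashboardJson {
  uid?: string;
  tags?: string[];
  refresh?: string | boolean;
  time?: { from?: string; to?: string };
}

// Returns a list of convention violations; an empty list means the file conforms.
function checkConventions(dashboard: DashboardJson): string[] {
  const problems: string[] = [];
  if (!dashboard.uid) {
    problems.push("uid is missing");
  }
  const tags = dashboard.tags ?? [];
  if (tags.indexOf("verifluence") === -1) {
    problems.push('tags must include "verifluence"');
  }
  if (tags.filter((tag) => tag !== "verifluence").length === 0) {
    problems.push("tags need a domain tag such as api, worker, or webhooks");
  }
  if (dashboard.refresh !== "30s") {
    problems.push('refresh should default to "30s"');
  }
  const from = dashboard.time?.from;
  if (from !== "now-1h" && from !== "now-3h") {
    problems.push('time.from should be "now-1h" or "now-3h"');
  }
  return problems;
}
```

A conforming file such as { uid: "vf-api", tags: ["verifluence", "api"], refresh: "30s", time: { from: "now-1h", to: "now" } } returns an empty list.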
