Monitoring
The Verifluence platform exposes Prometheus-format metrics from its two backend processes. A single kube-prometheus-stack deployment in the monitoring namespace scrapes both, feeds Grafana dashboards, evaluates PrometheusRule CRDs into alerts, and routes alerts to a Telegram channel via Alertmanager.
End-to-end flow
    api / worker pods        Prometheus            Alertmanager         Telegram
    emit /metrics     ───►   scrapes via     ───►  routes by      ───►  channel
    (prom-client)            ServiceMonitors       severity label
                                  │
                                  ▼
                             Grafana dashboards
                             + Alerting UI
                             (read-only window
                             on Alertmanager)

Three things to keep in mind:
- Metrics are emitted on a side port (`:9091` for api, `:9092` for worker), never on the application port. This decouples them from auth / CORS / Sentry middleware. See Why a separate `/metrics` port below.
- Scraping uses Service + ServiceMonitor, not PodMonitor. Each pod has a headless metrics-only Service; the operator's relabel rules pick those up cleanly without needing `containerPort` declarations the unichart Helm chart doesn't expose.
- Alerts flow through Alertmanager, not Grafana. Grafana's Alerting UI is wired to Alertmanager as a read-only datasource so you can see the route tree, contact points, silences, and active alerts without leaving Grafana.
Topology
    api pod
    ├── :3000  application traffic (public via CF Tunnel + private via envoy-private)
    └── :9091  /metrics (cluster-internal, never exposed via Gateway)
    worker pod
    ├── :3002  /_liveness, /_readiness
    └── :9092  /metrics (cluster-internal, never exposed via Gateway)

Each metric carries a `service={api|worker}` label so a single Prometheus scrape config covers both. The ServiceMonitors for the metrics-only Services set `honorLabels: true` so the labels emitted by prom-client (e.g. `job=viewer-poll` on poll metrics) survive Prometheus's scrape-target relabelling instead of being renamed to `exported_*`.
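The effect of `honorLabels` on a clashing label can be illustrated with a small model. This is a sketch of the general Prometheus behaviour, not code from the operator or from this repository; `mergeLabels` and the target label values are hypothetical.

```typescript
type Labels = Record<string, string>;

/**
 * Models how a clash between labels exposed by the scraped process and
 * labels attached to the scrape target is resolved.
 * mergeLabels is a hypothetical helper written for illustration only.
 */
function mergeLabels(scraped: Labels, target: Labels, honorLabels: boolean): Labels {
  const merged: Labels = { ...target };
  for (const name of Object.keys(scraped)) {
    const value = scraped[name];
    if (!(name in target) || honorLabels) {
      // No clash, or honorLabels is true: the scraped value is kept as-is.
      merged[name] = value;
    } else {
      // honorLabels is false: the target label wins and the scraped one is renamed.
      merged[`exported_${name}`] = value;
    }
  }
  return merged;
}

// Assumed values: the target's job and namespace labels are made up for this example.
const scraped: Labels = { job: "viewer-poll", service: "worker" };
const target: Labels = { job: "worker-metrics", namespace: "prod" };

const honored = mergeLabels(scraped, target, true);
const renamed = mergeLabels(scraped, target, false);

console.log(honored.job);          // viewer-poll
console.log(renamed.job);          // worker-metrics
console.log(renamed.exported_job); // viewer-poll
```

With `honorLabels` off, any dashboard query filtering on `job="viewer-poll"` would have to use `exported_job` instead, which is the renaming the paragraph above describes.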
Pages in this section
| Page | What it covers |
|---|---|
| API metrics | HTTP server stats, domain gauges (offers/deals/etc by status), Node runtime defaults |
| Worker metrics | Poll-job lifecycle (last-success heartbeat, error rate, duration), external HTTP (incl. 401 detection), Node runtime |
| Webhook metrics | Inbound Kick webhook receiver — signature/replay failures, async-handler outcomes |
| Dashboards | Grafana dashboards vf-api / vf-worker / vf-webhooks — what they show, where the JSON lives, how to add a new one |
| Alerting | PrometheusRule CRDs, Alertmanager routes + receivers, Telegram delivery, Grafana Alerting UI, how to add new rules and silences |
Naming convention
    vf_<scope>_<name>_<unit>

Examples: `vf_http_request_duration_seconds`, `vf_worker_poll_runs_total`, `vf_external_http_total`. Counters end in `_total`, durations in `_seconds`. No metric carries a user / streamer / offer ID; those would explode cardinality and are kept in logs (Loki) instead.
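The convention is mechanical enough to check in a unit test or lint step. The sketch below is one possible check, not something that exists in the repository: `checkMetricName`, `findIdLabels`, and the forbidden label names are assumptions made for illustration.

```typescript
type MetricKind = "counter" | "gauge" | "duration";

// vf_<scope>_<name>_<unit>: lowercase segments joined by underscores,
// with at least three segments after the vf prefix.
const NAME_PATTERN = /^vf(_[a-z][a-z0-9]*){3,}$/;

/** Returns a list of convention violations; an empty list means the name passes. */
function checkMetricName(name: string, kind: MetricKind): string[] {
  const problems: string[] = [];
  if (!NAME_PATTERN.test(name)) {
    problems.push("must match vf_<scope>_<name>_<unit>");
  }
  if (kind === "counter" && !/_total$/.test(name)) {
    problems.push("counters must end in _total");
  }
  if (kind === "duration" && !/_seconds$/.test(name)) {
    problems.push("durations must end in _seconds");
  }
  return problems;
}

// Assumed label names: the text only says user / streamer / offer IDs stay out of metrics.
const FORBIDDEN_LABELS = ["user_id", "streamer_id", "offer_id"];

/** Returns the label names that would carry per-entity IDs. */
function findIdLabels(labelNames: string[]): string[] {
  return labelNames.filter((label) => FORBIDDEN_LABELS.indexOf(label) !== -1);
}

console.log(checkMetricName("vf_http_request_duration_seconds", "duration")); // []
console.log(checkMetricName("http_requests", "counter").length);              // 2
console.log(findIdLabels(["status", "offer_id"]));                            // [ 'offer_id' ]
```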
Why a separate /metrics port
- Decouples metrics from auth / CORS / route matching. The main `:3000` listener has secure-headers, CORS, session auth, Sentry middleware. A side port has none of that, so nothing can break the Prometheus text format with an unrelated middleware change.
- Public exposure becomes an explicit, loud change. No Service in `operations/` exposes `:9091` / `:9092`. Adding it requires editing the Service's port list and a Gateway listener: two file changes that scream "you're exposing internals" in a PR.
- Standard convention. ServiceMonitor configs, Helm charts, and the kube-prometheus-stack examples all assume a separate named `metrics` port. Predictability is worth a port number.
Additional smaller wins: no self-feedback loop in the HTTP histogram from Prometheus's own scrapes; metrics scrape stays available even if :3000's request queue backs up; mirrors the existing :3002 probe-port pattern in worker.ts.
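A minimal sketch of the side-port idea, using only Node's built-in `http` module. The names `respond`, `startMetricsServer`, and `renderDemo` are hypothetical, and the hard-coded sample body stands in for the prom-client registry output. In recent prom-client versions `register.metrics()` returns a promise, so a real handler would await it rather than call a synchronous renderer.

```typescript
import { createServer } from "node:http";

interface MetricsResponse {
  status: number;
  contentType: string;
  body: string;
}

// Content type of the Prometheus text exposition format.
const PROM_CONTENT_TYPE = "text/plain; version=0.0.4; charset=utf-8";

/** Pure routing decision for the side port: only GET /metrics is served. */
function respond(method: string, url: string, render: () => string): MetricsResponse {
  if (method === "GET" && url.split("?")[0] === "/metrics") {
    return { status: 200, contentType: PROM_CONTENT_TYPE, body: render() };
  }
  return { status: 404, contentType: "text/plain; charset=utf-8", body: "not found\n" };
}

/** Starts a bare listener with no auth, CORS, or Sentry middleware in front of it. */
function startMetricsServer(port: number, render: () => string) {
  const server = createServer((req, res) => {
    const out = respond(req.method ?? "GET", req.url ?? "/", render);
    res.writeHead(out.status, { "Content-Type": out.contentType });
    res.end(out.body);
  });
  return server.listen(port);
}

// Hypothetical renderer; the sample value 3 is made up.
const renderDemo = (): string =>
  "# TYPE vf_worker_poll_runs_total counter\nvf_worker_poll_runs_total 3\n";

// Usage (not run here): startMetricsServer(9091, renderDemo) in the api process,
// startMetricsServer(9092, renderDemo) in the worker process.
```

Keeping the routing decision in a pure function means the response shape can be tested without opening a socket, and the listener itself stays a few lines with nothing an application-side middleware change could touch.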
Source
The implementation lives in a single module shared by both processes:
| File | Role |
|---|---|
| `api/src/metrics.ts` | Registry, HTTP middleware, poll wrapper, external-HTTP recorder, domain gauge tick, side-port server |
| `api/src/server.ts` | Wires HTTP middleware + starts metrics server (`:9091`) + starts the 30 s domain gauge tick |
| `api/src/worker.ts` | Wraps every cron callback with `instrumentPoll(name, fn)`, starts metrics server (`:9092`) |
| `operations/clusters/prod/stage/monitoring/{api,worker}-metrics.yaml` | Headless Service + ServiceMonitor pair per pod |
| `operations/clusters/prod/monitoring/dashboards/verifluence-*.json` | Grafana dashboard JSON, mounted via the dashboards-kustomization sidecar |
| `operations/clusters/prod/stage/monitoring/poller-staleness-rules.yaml` | The custom worker-poller-staleness PrometheusRule |
| `operations/clusters/prod/monitoring/releases/kube-prometheus-stack.yaml` | Alertmanager config (routes + Telegram receivers) and Grafana Alerting wiring |
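The `instrumentPoll(name, fn)` wrapper named in the table could look roughly like the sketch below. Only the call shape comes from the table; the body, the `PollStats` record, the `recordRun` helper, and the injected clock are assumptions, and in-memory counters stand in for the prom-client metrics the real module would update.

```typescript
interface PollStats {
  runs: number;
  errors: number;
  lastDurationSeconds: number;
  lastSuccessUnixSeconds: number | null;
}

// In-memory stand-in for the metrics registry, keyed by job name.
const statsByJob: Record<string, PollStats> = {};

/** Records one finished run; kept separate so it can be exercised without timers. */
function recordRun(name: string, ok: boolean, startedMs: number, endedMs: number): PollStats {
  const stats: PollStats = statsByJob[name] ?? {
    runs: 0,
    errors: 0,
    lastDurationSeconds: 0,
    lastSuccessUnixSeconds: null,
  };
  stats.runs += 1;
  stats.lastDurationSeconds = (endedMs - startedMs) / 1000;
  if (ok) {
    stats.lastSuccessUnixSeconds = endedMs / 1000; // heartbeat for staleness alerts
  } else {
    stats.errors += 1;
  }
  statsByJob[name] = stats;
  return stats;
}

/** Wraps a cron callback so every run updates the stats for its job name. */
function instrumentPoll(
  name: string,
  fn: () => Promise<void>,
  now: () => number = Date.now,
): () => Promise<void> {
  return async () => {
    const started = now();
    try {
      await fn();
    } catch (err) {
      recordRun(name, false, started, now());
      throw err; // rethrowing is a choice of this sketch, so the scheduler still sees the failure
    }
    recordRun(name, true, started, now());
  };
}

// Usage: the job name matches the job=viewer-poll label mentioned earlier;
// the callback body is a placeholder.
const viewerPoll = instrumentPoll("viewer-poll", async () => {
  /* poll the upstream API here */
});
```

A last-success timestamp of this kind is what a staleness rule would compare against the current time, so a job that keeps failing, or stops being scheduled at all, eventually trips an alert even though no new error is recorded.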