
Monitoring

The Verifluence platform exposes Prometheus-format metrics from its two backend processes. A single kube-prometheus-stack deployment in the monitoring namespace scrapes both, feeds Grafana dashboards, evaluates PrometheusRule CRDs into alerts, and routes alerts to a Telegram channel via Alertmanager.

End-to-end flow

api / worker pods           Prometheus              Alertmanager           Telegram
  emit /metrics      ───►   scrapes via    ───►    routes by      ───►   channel
  (prom-client)             ServiceMonitors        severity label
                                │
                                ▼
                            Grafana dashboards
                            + Alerting UI
                            (read-only window
                             on Alertmanager)

Three things to keep in mind:

  1. Metrics are emitted on a side port (:9091 for api, :9092 for worker), never on the application port. This decouples them from the auth / CORS / Sentry middleware. See Why a separate /metrics port below.
  2. Scraping uses Service + ServiceMonitor, not PodMonitor. Each pod has a headless metrics-only Service; the operator discovers those cleanly, without needing the containerPort declarations that the unichart Helm chart doesn't expose.
  3. Alerts flow through Alertmanager, not Grafana. Grafana's Alerting UI is wired to Alertmanager as a read-only datasource so you can see the route tree, contact points, silences, and active alerts without leaving Grafana.

Topology

api pod
├── :3000  application traffic    (public via CF Tunnel + private via envoy-private)
└── :9091  /metrics               (cluster-internal — never exposed via Gateway)

worker pod
├── :3002  /_liveness, /_readiness
└── :9092  /metrics               (cluster-internal — never exposed via Gateway)

Each metric carries a service={api|worker} label so a single Prometheus scrape config covers both. The ServiceMonitors for the metrics-only Services set honorLabels: true, so the labels emitted by prom-client (e.g. job=viewer-poll on poll metrics) win over Prometheus's own target labels instead of being renamed to exported_*.
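The effect of honorLabels can be illustrated with a simplified model of how Prometheus resolves a clash between a label in the scraped sample and a label attached to the scrape target. This is a sketch of the documented behaviour, not Prometheus code; the mergeLabels function and the target label values (worker-metrics, stage) are hypothetical.

```typescript
type Labels = Record<string, string>;

// Simplified model: merge labels from a scraped sample with the labels
// Prometheus attaches to the scrape target.
function mergeLabels(scraped: Labels, target: Labels, honorLabels: boolean): Labels {
  const merged: Labels = { ...target };
  for (const [name, value] of Object.entries(scraped)) {
    if (!(name in target) || honorLabels) {
      // No conflict, or scraped labels are honored: the scraped value wins.
      merged[name] = value;
    } else {
      // Conflict with honorLabels off: the target label wins and the
      // scraped value is kept under an exported_ prefix.
      merged[`exported_${name}`] = value;
    }
  }
  return merged;
}

// Hypothetical sample and target labels, for illustration only.
const scraped: Labels = { job: "viewer-poll", service: "worker" };
const target: Labels = { job: "worker-metrics", namespace: "stage" };

const honored = mergeLabels(scraped, target, true);
const renamed = mergeLabels(scraped, target, false);
console.log(honored);
console.log(renamed);
```

With honorLabels on, job stays viewer-poll; with it off, the same value would only be reachable as exported_job, and any query or alert that filters on job would miss it.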

Pages in this section

| Page | What it covers |
| --- | --- |
| API metrics | HTTP server stats, domain gauges (offers/deals/etc by status), Node runtime defaults |
| Worker metrics | Poll-job lifecycle (last-success heartbeat, error rate, duration), external HTTP (incl. 401 detection), Node runtime |
| Webhook metrics | Inbound Kick webhook receiver: signature/replay failures, async-handler outcomes |
| Dashboards | Grafana dashboards vf-api / vf-worker / vf-webhooks: what they show, where the JSON lives, how to add a new one |
| Alerting | PrometheusRule CRDs, Alertmanager routes + receivers, Telegram delivery, Grafana Alerting UI, how to add new rules and silences |

Naming convention

vf_<scope>_<name>_<unit>

Examples: vf_http_request_duration_seconds, vf_worker_poll_runs_total, vf_external_http_total. Counters end in _total, durations in _seconds. No metric carries a user / streamer / offer ID — those would explode cardinality and are kept in logs (Loki) instead.
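The convention is mechanical enough to check in a unit test or lint step. The sketch below is one possible checker, not part of the codebase: checkMetric, the regular expression, and the label deny-list are all assumptions made for illustration.

```typescript
// Label names that would identify a single user, streamer, or offer.
// Illustrative deny-list (an assumption, not the project's actual list).
const HIGH_CARDINALITY_LABELS = new Set(["user_id", "streamer_id", "offer_id"]);

// vf_<scope>_<name...>: lowercase snake_case, vf_ prefix, and at least
// a scope plus one further segment.
const METRIC_NAME = /^vf_[a-z][a-z0-9]*(_[a-z0-9]+)+$/;

type MetricKind = "counter" | "gauge" | "histogram";

// Returns a list of human-readable problems; an empty list means the
// metric follows the convention as modelled here.
function checkMetric(name: string, kind: MetricKind, labels: string[]): string[] {
  const problems: string[] = [];
  if (!METRIC_NAME.test(name)) {
    problems.push(`${name}: expected vf_<scope>_<name>_<unit>`);
  }
  if (kind === "counter" && !name.endsWith("_total")) {
    problems.push(`${name}: counters end in _total`);
  }
  for (const label of labels) {
    if (HIGH_CARDINALITY_LABELS.has(label)) {
      problems.push(`${name}: label ${label} belongs in logs, not metrics`);
    }
  }
  return problems;
}

console.log(checkMetric("vf_http_request_duration_seconds", "histogram", ["service", "route"]));
console.log(checkMetric("vf_worker_poll_runs", "counter", ["service"]));
```

The checker cannot tell which histograms measure time, so the _seconds rule for durations is left to review.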

Why a separate /metrics port

  1. Decouples metrics from auth / CORS / route matching. The main :3000 listener has secure-headers, CORS, session auth, and Sentry middleware. A side port has none of that, so an unrelated middleware change cannot break the Prometheus text format.
  2. Public exposure becomes an explicit, loud change. No Service in operations/ exposes :9091 / :9092. Adding it requires editing the Service's port list and a Gateway listener — two file changes that scream "you're exposing internals" in a PR.
  3. Standard convention. ServiceMonitor configs, Helm charts, and the kube-prometheus-stack examples all assume a separate named metrics port. Predictability is worth a port number.

Additional smaller wins: Prometheus's own scrapes do not feed back into the HTTP histogram; the metrics scrape stays available even if :3000's request queue backs up; and the layout mirrors the existing :3002 probe-port pattern in worker.ts.
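A side-port server of this kind can be very small. The following is a minimal sketch using only Node's built-in http module; it is not the code in api/src/metrics.ts, and the names isMetricsRequest, startMetricsServer, and render are hypothetical. The render callback stands in for whatever produces the text exposition (with prom-client this would typically be the registry's metrics() method).

```typescript
import { createServer, Server } from "node:http";

// Content type of the Prometheus text exposition format.
const CONTENT_TYPE = "text/plain; version=0.0.4; charset=utf-8";

// Only GET /metrics is served on the side port; the query string is ignored.
function isMetricsRequest(method: string | undefined, url: string | undefined): boolean {
  const path = (url ?? "").split("?")[0];
  return method === "GET" && path === "/metrics";
}

// Start a bare HTTP listener with no application middleware attached.
// `render` is an assumed callback that resolves to the metrics text.
function startMetricsServer(port: number, render: () => Promise<string>): Server {
  const server = createServer((req, res) => {
    if (!isMetricsRequest(req.method, req.url)) {
      res.writeHead(404).end();
      return;
    }
    render().then(
      (body) => res.writeHead(200, { "Content-Type": CONTENT_TYPE }).end(body),
      () => res.writeHead(500).end(),
    );
  });
  return server.listen(port);
}
```

Under these assumptions the api process would call startMetricsServer(9091, render) and the worker startMetricsServer(9092, render). Because the listener has no routes other than /metrics, nothing from the application can be reached through it.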

Source

The implementation lives in a single module shared by both processes:

| File | Role |
| --- | --- |
| api/src/metrics.ts | Registry, HTTP middleware, poll wrapper, external-HTTP recorder, domain gauge tick, side-port server |
| api/src/server.ts | Wires HTTP middleware + starts metrics server (:9091) + starts the 30 s domain gauge tick |
| api/src/worker.ts | Wraps every cron callback with instrumentPoll(name, fn), starts metrics server (:9092) |
| operations/clusters/prod/stage/monitoring/{api,worker}-metrics.yaml | Headless Service + ServiceMonitor pair per pod |
| operations/clusters/prod/monitoring/dashboards/verifluence-*.json | Grafana dashboard JSON, mounted via the dashboards-kustomization sidecar |
| operations/clusters/prod/stage/monitoring/poller-staleness-rules.yaml | The custom worker-poller-staleness PrometheusRule |
| operations/clusters/prod/monitoring/releases/kube-prometheus-stack.yaml | Alertmanager config (routes + Telegram receivers) and Grafana Alerting wiring |
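To make the role of the instrumentPoll(name, fn) wrapper in api/src/worker.ts concrete, here is a rough sketch of what such a wrapper could look like. Only the instrumentPoll(name, fn) call shape comes from the source. The in-memory PollStats record stands in for the real prom-client metrics, its field names are hypothetical, and swallowing errors is an assumed design choice, not a documented one.

```typescript
// In-memory stand-in for the real metrics (hypothetical field names).
interface PollStats {
  runs: number;
  errors: number;
  lastSuccessMs: number | null;
  lastDurationMs: number | null;
}

const pollStats = new Map<string, PollStats>();

// Wrap a cron callback so every run is counted and timed, and every
// successful run refreshes a last-success heartbeat.
function instrumentPoll(name: string, fn: () => Promise<void>): () => Promise<void> {
  const stats: PollStats = { runs: 0, errors: 0, lastSuccessMs: null, lastDurationMs: null };
  pollStats.set(name, stats);
  return async () => {
    const started = Date.now();
    stats.runs += 1;
    try {
      await fn();
      stats.lastSuccessMs = Date.now();
    } catch (err) {
      stats.errors += 1;
      // Assumption: a failed poll is logged and swallowed so that one bad
      // run does not stop the scheduler.
      console.error(`poll ${name} failed`, err);
    } finally {
      stats.lastDurationMs = Date.now() - started;
    }
  };
}

// Usage: hand the wrapped callback to the scheduler instead of the raw one.
const pollViewers = instrumentPoll("viewer-poll", async () => {
  // ... fetch and store viewer counts ...
});
```

A last-success timestamp of this kind is what a staleness rule can alert on: if it stops advancing, the poller is stuck even when no errors are being counted.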
