Monitoring

The Verifluence platform exposes Prometheus-format metrics from its two backend processes. A single kube-prometheus-stack deployment in the monitoring namespace scrapes both, feeds Grafana dashboards, evaluates PrometheusRule CRDs into alerts, and routes alerts to a Telegram channel via Alertmanager.

End-to-end flow

api / worker pods           Prometheus              Alertmanager           Telegram
  emit /metrics      ───►   scrapes via    ───►    routes by      ───►   channel
  (prom-client)             ServiceMonitors        severity label
                                  │
                                  ▼
                            Grafana dashboards
                            + Alerting UI
                            (read-only window
                             on Alertmanager)

Three things to keep in mind:

Metrics are emitted on a side port (:9091 for api, :9092 for worker), never on the application port. This decouples them from the auth / CORS / Sentry middleware. See Why a separate /metrics port below.
Scraping uses Service + ServiceMonitor, not PodMonitor. Each pod has a headless metrics-only Service, which the operator's relabel rules pick up cleanly without the containerPort declarations that the unichart Helm chart doesn't expose.
Alerts flow through Alertmanager, not Grafana. Grafana's Alerting UI is wired to Alertmanager as a read-only datasource so you can see the route tree, contact points, silences, and active alerts without leaving Grafana.

Topology

api pod
├── :3000  application traffic    (public via CF Tunnel + private via envoy-private)
└── :9091  /metrics               (cluster-internal — never exposed via Gateway)

worker pod
├── :3002  /_liveness, /_readiness
└── :9092  /metrics               (cluster-internal — never exposed via Gateway)

Each metric carries a service={api|worker} label so a single Prometheus scrape config covers both. The ServiceMonitors for the metrics-only Services set honorLabels: true so that labels emitted by prom-client (e.g. job=viewer-poll on poll metrics) survive Prometheus's scrape-target relabelling instead of being renamed to exported_*.
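
To make the honorLabels behaviour concrete, here is a minimal sketch of how a label conflict is resolved at scrape time. It is a simplified model of Prometheus's documented honor_labels semantics, not code from this repository; the function name mergeLabelsSketch and the example label values are assumptions made for illustration.

```typescript
type Labels = Record<string, string>;

// Simplified model of scrape-time label conflict handling.
// scraped: labels exposed by the process itself (e.g. via prom-client).
// target:  labels Prometheus attaches to the scrape target (job, instance, ...).
function mergeLabelsSketch(scraped: Labels, target: Labels, honorLabels: boolean): Labels {
  const merged: Labels = {};
  for (const key of Object.keys(scraped)) {
    merged[key] = scraped[key];
  }
  for (const key of Object.keys(target)) {
    if (key in scraped) {
      if (honorLabels) {
        // Conflict with honorLabels on: the scraped value wins, the target label is dropped.
        continue;
      }
      // Conflict with honorLabels off: the scraped value is moved to exported_<label>.
      merged["exported_" + key] = scraped[key];
    }
    merged[key] = target[key];
  }
  return merged;
}

// Hypothetical label values, for illustration only.
const scrapedLabels: Labels = { job: "viewer-poll", service: "worker" };
const targetLabels: Labels = { job: "worker-metrics", instance: "10.0.0.7:9092" };

const honored = mergeLabelsSketch(scrapedLabels, targetLabels, true);
const renamed = mergeLabelsSketch(scrapedLabels, targetLabels, false);
```

With honorLabels on, the job label emitted by the process is what ends up in the stored series, so queries can select poll metrics by job name. With it off, the same value would only be reachable as exported_job.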

Pages in this section

Page	What it covers
API metrics	HTTP server stats, domain gauges (offers/deals/etc by status), Node runtime defaults
Worker metrics	Poll-job lifecycle (last-success heartbeat, error rate, duration), external HTTP (incl. 401 detection), Node runtime
Webhook metrics	Inbound Kick webhook receiver — signature/replay failures, async-handler outcomes
Dashboards	Grafana dashboards `vf-api` / `vf-worker` / `vf-webhooks` — what they show, where the JSON lives, how to add a new one
Alerting	PrometheusRule CRDs, Alertmanager routes + receivers, Telegram delivery, Grafana Alerting UI, how to add new rules and silences

Naming convention

vf_<scope>_<name>_<unit>

Examples: vf_http_request_duration_seconds, vf_worker_poll_runs_total, vf_external_http_total. Counters end in _total, durations in _seconds. No metric carries a user / streamer / offer ID — those would explode cardinality and are kept in logs (Loki) instead.
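
The convention is mechanical enough to check in code. The following is a minimal sketch of such a check, for example in a unit test or lint step. It is not part of the platform: the function checkMetricNameSketch, the MetricKind type, and the forbidden label names (user_id, streamer_id, offer_id) are hypothetical stand-ins chosen for illustration.

```typescript
type MetricKind = "counter" | "gauge" | "duration";

// Hypothetical label names standing in for the per-entity IDs the convention forbids.
const FORBIDDEN_LABELS_SKETCH = ["user_id", "streamer_id", "offer_id"];

// Returns a list of human-readable problems; an empty list means the metric conforms.
function checkMetricNameSketch(name: string, kind: MetricKind, labels: string[] = []): string[] {
  const problems: string[] = [];

  // vf_ prefix followed by at least three lower snake_case segments (scope, name, unit).
  if (!/^vf(_[a-z][a-z0-9]*){3,}$/.test(name)) {
    problems.push("name must look like vf_<scope>_<name>_<unit>");
  }
  if (kind === "counter" && !/_total$/.test(name)) {
    problems.push("counters must end in _total");
  }
  if (kind === "duration" && !/_seconds$/.test(name)) {
    problems.push("durations must end in _seconds");
  }
  for (const label of labels) {
    if (FORBIDDEN_LABELS_SKETCH.indexOf(label) !== -1) {
      problems.push("label " + label + " is high-cardinality and belongs in logs");
    }
  }
  return problems;
}

const okDuration = checkMetricNameSketch("vf_http_request_duration_seconds", "duration");
const okCounter = checkMetricNameSketch("vf_worker_poll_runs_total", "counter");
const missingSuffix = checkMetricNameSketch("vf_worker_poll_runs", "counter");
const missingPrefix = checkMetricNameSketch("http_requests_total", "counter");
const badLabel = checkMetricNameSketch("vf_external_http_total", "counter", ["streamer_id"]);
```

The three example names from the text pass the name check, while a counter without the _total suffix, a name without the vf_ prefix, or a per-entity ID label each produce one reported problem.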

Why a separate `/metrics` port

Decouples metrics from auth / CORS / route matching. The main :3000 listener has secure-headers, CORS, session auth, and Sentry middleware. The side port has none of that, so an unrelated middleware change cannot break the Prometheus text format.
Public exposure becomes an explicit, loud change. No Service in operations/ exposes :9091 / :9092. Adding it requires editing the Service's port list and a Gateway listener — two file changes that scream "you're exposing internals" in a PR.
Standard convention. ServiceMonitor configs, Helm charts, and the kube-prometheus-stack examples all assume a separate named metrics port. Predictability is worth a port number.

Additional smaller wins: no self-feedback loop in the HTTP histogram from Prometheus's own scrapes; metrics scrape stays available even if :3000's request queue backs up; mirrors the existing :3002 probe-port pattern in worker.ts.

Source

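The implementation lives in a single module shared by both processes, `api/src/metrics.ts`. The other files wire it into each process and deploy the scrape, dashboard, and alerting configuration: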

File	Role
`api/src/metrics.ts`	Registry, HTTP middleware, poll wrapper, external-HTTP recorder, domain gauge tick, side-port server
`api/src/server.ts`	Wires HTTP middleware + starts metrics server (`:9091`) + starts the 30 s domain gauge tick
`api/src/worker.ts`	Wraps every cron callback with `instrumentPoll(name, fn)`, starts metrics server (`:9092`)
`operations/clusters/prod/stage/monitoring/{api,worker}-metrics.yaml`	Headless Service + ServiceMonitor pair per pod
`operations/clusters/prod/monitoring/dashboards/verifluence-*.json`	Grafana dashboard JSON, mounted via the dashboards-kustomization sidecar
`operations/clusters/prod/stage/monitoring/poller-staleness-rules.yaml`	The custom worker-poller-staleness PrometheusRule
`operations/clusters/prod/monitoring/releases/kube-prometheus-stack.yaml`	Alertmanager config (routes + Telegram receivers) and Grafana Alerting wiring
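
The table above mentions that `api/src/worker.ts` wraps every cron callback with `instrumentPoll(name, fn)`. The real implementation is not shown on this page, so the following is only a dependency-free sketch of the wrapper pattern, assuming the wrapper tracks run count, error count, last-success time, and last duration. The name instrumentPollSketch, the PollStats shape, and the in-memory stats object are hypothetical; the actual module records into prom-client metrics instead.

```typescript
// Hypothetical in-memory stand-in for the counters and gauges a real wrapper would update.
interface PollStats {
  runs: number;
  errors: number;
  lastSuccessMs: number | null;
  lastDurationMs: number | null;
}

// Wraps a poll callback so every run is counted, timed, and marked on success.
// The clock is injectable so the sketch can be exercised deterministically.
function instrumentPollSketch(
  name: string,
  fn: () => Promise<void>,
  stats: Record<string, PollStats>,
  now: () => number = Date.now
): () => Promise<void> {
  const entry: PollStats =
    stats[name] || { runs: 0, errors: 0, lastSuccessMs: null, lastDurationMs: null };
  stats[name] = entry;

  return async () => {
    const startedAt = now();
    entry.runs += 1;
    try {
      await fn();
      // Heartbeat: only advanced when the callback finishes without throwing.
      entry.lastSuccessMs = now();
    } catch (err) {
      entry.errors += 1;
      throw err; // Assumption: the caller decides how to handle a failed poll.
    } finally {
      entry.lastDurationMs = now() - startedAt;
    }
  };
}
```

A last-success timestamp of this kind is what makes a staleness alert possible: a poller that stops running, or keeps failing, stops advancing the heartbeat, and a rule can fire once the heartbeat is older than a chosen threshold.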
