API metrics

Exposed by the api process at :9091/metrics (cluster-internal). All metrics carry service="api".

HTTP server

Recorded by a Hono middleware mounted before the auth gate, so 401s and CORS-rejected requests show up in the histogram too.

| Metric | Type | Labels |
| --- | --- | --- |
| vf_http_requests_total | counter | method, route, status_code |
| vf_http_request_duration_seconds | histogram | method, route, status_code |
| vf_http_requests_in_flight | gauge | |
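As a rough illustration of why mounting the recorder before the auth gate matters, here is a dependency-free sketch. The `withMetrics` wrapper, the `Handler` type, and the in-memory `inFlight` / `observed` recorders are hypothetical stand-ins, not the Hono middleware or prom-client objects in metrics.ts, and the `method` label is left out for brevity.

```typescript
// Hypothetical in-memory recorders standing in for the real metrics.
const inFlight = { value: 0 };
const observed: { route: string; status: number; seconds: number }[] = [];

// Hypothetical handler shape: resolves to the response status and matched route.
type Handler = (path: string) => Promise<{ status: number; route: string }>;

// Wrap a downstream handler so every request is recorded, including ones
// rejected further down the chain (for example a 401 from the auth gate).
function withMetrics(next: Handler): Handler {
  return async (path) => {
    const start = Date.now();
    inFlight.value += 1;
    let status = 500; // assumption: a thrown error is recorded as a 500
    let route = "unmatched";
    try {
      const res = await next(path);
      status = res.status;
      route = res.route;
      return res;
    } finally {
      // Runs on success and on failure, so the gauge cannot leak upward.
      inFlight.value -= 1;
      observed.push({ route, status, seconds: (Date.now() - start) / 1000 });
    }
  };
}
```

Because the wrapper sits outside the gate, a request that the gate answers with 401 still produces one observation and releases its in-flight slot.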

The route label is the Hono matched pattern (e.g. /api/offers/:id), not the literal URL — keeps cardinality bounded across the platform's ~50 routes. Unmatched requests group under route="unmatched".
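To make the cardinality point concrete, here is a minimal sketch of mapping literal paths to a bounded label. The `patterns` list and the `matchPattern` / `routeLabel` helpers are hypothetical and only imitate what the router resolves internally; they are not taken from metrics.ts.

```typescript
// Hypothetical route table; the real API registers its patterns with Hono.
const patterns: string[] = ["/api/offers", "/api/offers/:id", "/api/deals/:id"];

// True when `path` fits `pattern`, treating ":name" segments as wildcards.
function matchPattern(pattern: string, path: string): boolean {
  const want = pattern.split("/");
  const got = path.split("/");
  if (want.length !== got.length) return false;
  return want.every((seg, i) =>
    seg.startsWith(":") ? got[i].length > 0 : seg === got[i],
  );
}

// Value for the `route` label: the matched pattern, else "unmatched".
function routeLabel(path: string): string {
  return patterns.find((p) => matchPattern(p, path)) ?? "unmatched";
}
```

With this shape, `/api/offers/42` and `/api/offers/43` both yield `/api/offers/:id`, and a probe such as `/wp-login.php` yields `unmatched`, so the number of label values is capped by the number of registered patterns plus one.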

Histogram buckets: 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10 seconds.
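The following sketch shows how observations land in cumulative buckets with these bounds. `observeAll` is a hypothetical helper written for illustration; in the real module the histogram does this bookkeeping itself.

```typescript
// Upper bounds in seconds, copied from the list above; +Inf is implicit.
const BUCKETS = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10];

// Returns cumulative counts: counts[i] is the number of observations
// <= BUCKETS[i], and the final slot is the le="+Inf" bucket.
function observeAll(durations: number[]): number[] {
  const counts = new Array<number>(BUCKETS.length + 1).fill(0);
  for (const d of durations) {
    BUCKETS.forEach((le, i) => {
      if (d <= le) counts[i] += 1;
    });
    counts[BUCKETS.length] += 1; // every observation counts toward +Inf
  }
  return counts;
}

// A 3 ms, a 200 ms and a 12 s request.
const counts = observeAll([0.003, 0.2, 12]);
```

The 12 s request only appears in the +Inf bucket, which is why latencies beyond 10 s cannot be resolved any further with this bucket layout.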

Useful queries

promql
# request rate by route, top 10
topk(10, sum by(route) (rate(vf_http_requests_total[5m])))

# 5xx error rate
sum(rate(vf_http_requests_total{status_code=~"5.."}[5m]))
  /
sum(rate(vf_http_requests_total[5m]))

# p99 latency by route
histogram_quantile(0.99,
  sum by(le, route) (rate(vf_http_request_duration_seconds_bucket[5m]))
)

# in-flight (saturation signal)
vf_http_requests_in_flight
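
For intuition about what the p99 query returns, here is a simplified sketch of quantile estimation from cumulative buckets by linear interpolation. `estimateQuantile` and the `Bucket` type are hypothetical, and the sketch skips edge cases that Prometheus handles, so treat it as an approximation of the idea rather than the engine's exact behaviour.

```typescript
// One cumulative bucket: `le` is the upper bound, `count` the observations <= le.
type Bucket = { le: number; count: number };

// Estimate quantile q (0 < q <= 1) from buckets sorted by `le`.
// Assumes at least one finite bucket followed by a final le = Infinity bucket.
function estimateQuantile(q: number, buckets: Bucket[]): number {
  const total = buckets[buckets.length - 1].count;
  if (total === 0) return NaN;
  const rank = q * total;
  const i = buckets.findIndex((b) => b.count >= rank);
  // Rank falls in the +Inf bucket: report the highest finite bound instead.
  if (i === buckets.length - 1) return buckets[i - 1].le;
  const lower = i === 0 ? 0 : buckets[i - 1].le;
  const below = i === 0 ? 0 : buckets[i - 1].count;
  const inBucket = buckets[i].count - below;
  // Assume observations are spread evenly inside the bucket.
  return lower + (buckets[i].le - lower) * ((rank - below) / inBucket);
}
```

Two consequences follow from this shape: the estimate is only as fine-grained as the bucket that contains the rank, and once the rank falls past the last finite bound the result is pinned to that bound.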

Domain gauges (30 s tick)

The API process owns a 30 s background tick that runs cheap SELECT status, COUNT(*) … GROUP BY status queries against five tables and updates these gauges. Each table costs one round-trip. A failed query is logged and the tick moves on to the next table, so a transient DB hiccup on one table doesn't stop the remaining gauges from refreshing.

| Metric | Statuses |
| --- | --- |
| vf_offers_by_status{status} | pending, accepted, declined, withdrawn |
| vf_negotiations_by_status{status} | in_progress, agreed, cancelled |
| vf_deals_by_status{status} | active, completed, refunded, cancelled |
| vf_session_submissions_by_status{status} | pending_review, confirmed, rejected |
| vf_campaigns_by_status{status} | prepared, funded, in_progress, completed, … |
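
The per-table isolation in the tick can be sketched roughly as below. Everything here is hypothetical: `runDomainTick`, `startTick`, the `CountQuery` type, and the nested `Map` that stands in for the labelled gauges. Replacing a table's map wholesale on success is also an assumption about how vanished statuses are cleared.

```typescript
// Rows shaped like the result of SELECT status, COUNT(*) ... GROUP BY status.
type StatusRow = { status: string; count: number };
// Hypothetical per-table query; the real one would hit the database.
type CountQuery = () => Promise<StatusRow[]>;
// Stand-in for the gauges: table name -> (status -> count).
type Gauges = Map<string, Map<string, number>>;

// One tick: refresh each table independently, logging and skipping failures.
async function runDomainTick(
  queries: Record<string, CountQuery>,
  gauges: Gauges,
  logWarn: (msg: string) => void,
): Promise<void> {
  for (const [table, query] of Object.entries(queries)) {
    try {
      const rows = await query(); // one round-trip per table
      // Replace the whole map so statuses with no rows disappear.
      gauges.set(table, new Map(rows.map((r) => [r.status, r.count])));
    } catch (err) {
      // Keep the previous values for this table and carry on with the rest.
      logWarn(`domain tick: ${table} failed: ${String(err)}`);
    }
  }
}

// Schedule a tick every 30 s; the returned handle can be passed to clearInterval.
function startTick(run: () => Promise<void>, everyMs = 30000) {
  return setInterval(() => void run(), everyMs);
}
```

Catching inside the loop rather than around it is the design choice that matters: a failure on one table leaves only that table's gauge at its previous value.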

Useful queries

promql
# pending offers right now
sum(vf_offers_by_status{status="pending"})

# peak number of active deals over the last 24 h
max_over_time(vf_deals_by_status{status="active"}[24h])

# submissions waiting for operator review
sum(vf_session_submissions_by_status{status="pending_review"})

Node runtime (free, via prom-client defaults)

Emitted by prom-client.collectDefaultMetrics(). The most useful single signal for "is the API healthy" is event loop lag.

| Metric | What it tells you |
| --- | --- |
| nodejs_eventloop_lag_seconds | how late the scheduler is firing; sustained > 100 ms means the API is starved |
| process_cpu_seconds_total | CPU usage over time |
| process_resident_memory_bytes | RSS; alert as it approaches the pod memory limit |
| nodejs_heap_size_total_bytes, nodejs_heap_size_used_bytes | heap pressure |
| nodejs_active_handles_total, nodejs_active_requests_total | leak indicator |
| nodejs_gc_duration_seconds | GC pause distribution |

Useful queries

promql
# eventloop lag (alert > 100 ms for 5 min)
nodejs_eventloop_lag_seconds{service="api"}

# resident memory in MiB (compare against the pod memory limit for headroom)
process_resident_memory_bytes{service="api"} / 1024 / 1024
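
The "for 5 min" condition on event loop lag means the threshold has to be breached continuously, not just once. In practice that would normally be expressed as a Prometheus alerting rule with a `for:` clause; the sketch below only illustrates the semantics. `lagSustained` and the `Sample` type are hypothetical.

```typescript
// One scrape of nodejs_eventloop_lag_seconds: timestamp and value, both in seconds.
type Sample = { t: number; lagSeconds: number };

// True when the lag has stayed above the threshold, without interruption,
// for at least `holdSeconds` up to the latest sample.
// Assumes `samples` is sorted by time.
function lagSustained(
  samples: Sample[],
  thresholdSeconds = 0.1,
  holdSeconds = 300,
): boolean {
  let breachStart: number | undefined;
  let last: number | undefined;
  for (const s of samples) {
    if (s.lagSeconds > thresholdSeconds) {
      if (breachStart === undefined) breachStart = s.t; // a new breach begins
    } else {
      breachStart = undefined; // any healthy sample resets the clock
    }
    last = s.t;
  }
  return (
    breachStart !== undefined &&
    last !== undefined &&
    last - breachStart >= holdSeconds
  );
}
```

A single healthy scrape in the middle of the window resets the timer, which keeps short GC pauses or one-off spikes from paging anyone.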

Implementation

  • Module: api/src/metrics.ts
  • Wired in: api/src/server.ts (middleware + startMetricsServer(9091) + startDomainTick(env))

Adding a new metric

  1. Define the Counter / Histogram / Gauge at module scope in metrics.ts, registered against registry.
  2. Use a vf_<scope>_… name. Counter ends in _total. Duration ends in _seconds.
  3. Keep label cardinality bounded — never add a user-id / offer-id label.
  4. Increment / observe at the call site. Try to put it next to existing log.warn / log.info lines so signal stays co-located with logs.
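
The naming and cardinality rules above could be checked mechanically, for example in a unit test. This is a sketch only: `checkMetric`, the `MetricKind` type, and the exact spellings in `UNBOUNDED_LABELS` are hypothetical, and detecting durations by the word "duration" in the name is a simplifying assumption.

```typescript
type MetricKind = "counter" | "histogram" | "gauge";

// Labels ruled out because their value sets are unbounded.
// The exact spellings are illustrative assumptions.
const UNBOUNDED_LABELS = ["user_id", "offer_id"];

// Returns a list of convention violations; an empty list means the metric passes.
function checkMetric(
  name: string,
  kind: MetricKind,
  labels: string[] = [],
): string[] {
  const problems: string[] = [];
  if (!/^vf_[a-z0-9]+_[a-z0-9_]+$/.test(name)) {
    problems.push("name must look like vf_<scope>_...");
  }
  if (kind === "counter" && !name.endsWith("_total")) {
    problems.push("counter must end in _total");
  }
  if (name.includes("duration") && !name.endsWith("_seconds")) {
    problems.push("duration must end in _seconds");
  }
  for (const label of labels) {
    if (UNBOUNDED_LABELS.includes(label)) {
      problems.push(`label ${label} is unbounded`);
    }
  }
  return problems;
}
```

For example, `checkMetric("vf_http_requests_total", "counter", ["method", "route", "status_code"])` returns an empty list, while a histogram named `vf_payout_duration_ms` with a `user_id` label is flagged twice.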

Verifluence Documentation